@robypag/langchain-splitter
v0.2.4
A small wrapper module to simplify file and buffer tokenization using Langchain.
Tokenizer Utility
This is a small utility I built to support other AI-related projects. It doesn't do much, and I did not want it to: it does exactly what I need.
If it can help you, or you feel it is worth an upgrade, feel free to fork it. Pull requests are warmly welcome.
Why
While working with Retrieval-Augmented Generation (RAG), you usually need to process a file in order to generate embeddings for it. The common technique is to split the file into chunks, then generate embeddings for each chunk.
It is a repetitive and tedious task, and instead of copying and pasting the same function over and over again, I decided to build a small library.
What
Tokenizer exposes two main functions:
- tokenizeFile
- tokenizeFromStringOrBuffer
They both do the same thing, but start from a different point: as the names imply, you provide a file path to tokenizeFile, whereas you provide a string or a buffer to tokenizeFromStringOrBuffer.
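A minimal sketch of how the two entry points might be used. Note that the signatures, option names, and return shape below are assumptions for illustration (the stand-in implementation here is a toy, not the real module, which delegates to Langchain):

```typescript
// Stand-in types and implementation -- illustration only.
// Assumed return shape: an array of chunks with a pageContent field,
// mirroring Langchain's Document objects.
type Chunk = { pageContent: string };

// Hypothetical signature: split a string or binary buffer into chunks.
// The real module delegates splitting to Langchain; this toy version
// just slices the decoded text into fixed-size pieces.
function tokenizeFromStringOrBuffer(
  content: string | Uint8Array,
  chunkSize = 1000
): Chunk[] {
  const text =
    typeof content === "string" ? content : new TextDecoder().decode(content);
  const chunks: Chunk[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push({ pageContent: text.slice(i, i + chunkSize) });
  }
  return chunks;
}

// Usage: a string input is treated as text and split directly.
const docs = tokenizeFromStringOrBuffer("some long document text", 8);
console.log(docs.map((d) => d.pageContent));
```

tokenizeFile would be called the same way, with a file path as its first argument instead of in-memory content.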
Supported Files
It currently supports files that can include text: pdf, doc and docx, as well as text-based files like txt, csv, etc.
It applies a heuristic approach to determine which kind of file or buffer it has been provided with:
- tokenizeFile first uses the mime-types module to determine the file type. If this fails (mainly because the provided file has a mismatching extension, or no extension at all), it uses the file-type module to look at the file content and determine its type.
- tokenizeFromStringOrBuffer assumes that if the provided content is a string, the resulting file is a text-based one. If the provided content is a buffer, it uses file-type as above to look at the buffer content, determine which kind of file it is, and generate a temporary file with the returned extension. Since file-type does not support text files, it returns undefined when the buffer contains a string or text-only content: in that case the function generates a txt temporary file. After generating the temporary file, it calls tokenizeFile with the temp path.
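The detection order described above can be sketched as follows. This is a simplified illustration: the extension map and magic-byte checks are assumptions standing in for the mime-types and file-type packages the module actually uses:

```typescript
// Simplified sketch of the detection heuristic -- illustration only.
// Step 1: try the file extension (what mime-types does).
// Step 2: fall back to sniffing magic bytes (what file-type does).
// Step 3: if neither matches, assume a plain-text file.
const EXTENSION_MAP: Record<string, string> = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  txt: "text/plain",
  csv: "text/csv",
};

function detectType(filePath: string, firstBytes: Uint8Array): string {
  // Step 1: extension lookup.
  const ext = filePath.includes(".")
    ? filePath.split(".").pop()!.toLowerCase()
    : "";
  if (ext && EXTENSION_MAP[ext]) return EXTENSION_MAP[ext];

  // Step 2: content sniffing. "%PDF" marks a PDF file; "PK" marks a
  // ZIP container, which is what docx files are under the hood.
  const header = String.fromCharCode(...firstBytes.slice(0, 4));
  if (header.startsWith("%PDF")) return "application/pdf";
  if (header.startsWith("PK")) return "application/zip";

  // Step 3: file-type returns undefined for plain text, so the module
  // falls back to treating the content as a txt file.
  return "text/plain";
}
```

The two-stage fallback means a file with a wrong or missing extension is still classified correctly, at the cost of reading its first bytes.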
Langchain Parameters
In all cases, this library uses Langchain's RecursiveCharacterTextSplitter to process the given text.
You can check its signature here.
This library currently exposes only two of its parameters:
- chunkSize: the size of each text chunk in characters. Defaults to 1000.
- chunkOverlap: the number of characters that can overlap between two adjacent chunks. Defaults to 200.
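To illustrate what the two parameters control, here is a simplified sliding-window splitter. Langchain's RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and newlines; this sketch only shows the size/overlap arithmetic:

```typescript
// Simplified illustration of chunkSize / chunkOverlap semantics.
// Each chunk holds up to chunkSize characters, and consecutive chunks
// share chunkOverlap characters of context.
function splitWithOverlap(
  text: string,
  chunkSize = 1000,
  chunkOverlap = 200
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  // Each new chunk starts (chunkSize - chunkOverlap) characters after
  // the previous one, so adjacent chunks overlap by chunkOverlap.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk, which tends to improve embedding quality at the boundaries.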
Who
Me, myself and I.
License
See LICENSE
