@silyze/kb-scanner-text
Text implementation of `DocumentScanner<T>` for `@silyze/kb`, using token-based chunking compatible with OpenAI's `tiktoken`.
Features
- Splits raw text or `Uint8Array` input into token-based chunks.
- Configurable token stride and overlap.
- Supports multiple OpenAI models via `tiktoken`.
- Fully async via `AsyncReadStream` and `AsyncTransform` utilities.
Installation
```bash
npm install @silyze/kb-scanner-text
```

Usage

```ts
import TextScanner from "@silyze/kb-scanner-text";

const scanner = new TextScanner({
  model: "text-embedding-3-small", // optional
  tokensPerPage: 512, // optional
  overlap: 0.5, // optional: 50% overlap
});

async function run() {
  const input = "The quick brown fox jumps over the lazy dog.";
  const chunks = await scanner.scan(input).transform().toArray();
  console.log(chunks);
}

run().then();
```

Configuration
`TextScanner` accepts the following optional configuration:

```ts
type TextScannerConfig = {
  encoding?: string; // default: "utf-8"
  tokensPerPage?: number; // default: 512
  model?: TiktokenModel; // default: "text-embedding-3-small"
  overlap?: number; // default: 0.5 (50%)
};
```

- `encoding`: Text encoding for `Uint8Array` input.
- `tokensPerPage`: Number of tokens per chunk.
- `overlap`: Overlap between chunks; accepts either a float (a ratio) or an integer (an absolute token count), as shown in the example below.
- `model`: Model name passed to `tiktoken`'s `encoding_for_model()`.
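For illustration, here is how the two overlap forms might be configured. The specific numbers are arbitrary examples, and the ratio form is assumed to be resolved against `tokensPerPage`:

```ts
import TextScanner from "@silyze/kb-scanner-text";

// Ratio overlap: assuming 25% of each 256-token page is shared with the previous chunk.
const ratioOverlap = new TextScanner({ tokensPerPage: 256, overlap: 0.25 });

// Absolute overlap: 64 tokens of each 256-token page are shared with the previous chunk.
const absoluteOverlap = new TextScanner({ tokensPerPage: 256, overlap: 64 });
```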
How it works
- Accepts a string or `Uint8Array` input.
- Cleans up and tokenizes the text using `tiktoken`.
- Chunks the token list using sliding windows, with optional overlap (see the sketch after this list).
- Decodes each chunk and yields it as a string via an `AsyncReadStream`.
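To make the sliding-window step concrete, here is a minimal standalone sketch of the idea, not the package's actual implementation; it assumes a float overlap is a ratio of `tokensPerPage` and an integer overlap is an absolute token count:

```ts
// Illustrative sketch only: sliding-window chunking over an array of token IDs.
function chunkTokens(
  tokens: number[],
  tokensPerPage: number,
  overlap: number // float => ratio of tokensPerPage, integer => absolute tokens
): number[][] {
  const overlapTokens = Number.isInteger(overlap)
    ? overlap
    : Math.floor(tokensPerPage * overlap);
  // Each window starts where the previous one ended, minus the overlap.
  const stride = Math.max(1, tokensPerPage - overlapTokens);

  const chunks: number[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + tokensPerPage));
    if (start + tokensPerPage >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```

In the scanner itself, each token window is then decoded back to text before being yielded as a string.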
This is designed to work as a plugin for the @silyze/kb knowledge base system, where documents need to be scanned and embedded for vector search.
Example Output
Given a basic string:
```ts
await scanner.scan("Hello world! This is a test.").transform().toArray();
```

You might get:

```ts
["Hello world! This is a test."];
```

Longer input will be chunked according to `tokensPerPage` and `overlap`.
