# @silyze/kb-scanner-md

v1.0.0
Markdown implementation of `DocumentScanner<T>` for `@silyze/kb`, using `marked` to convert Markdown to HTML, then `HtmlScanner` to extract visible text and chunk it for AI embedding.
## Features

- Converts Markdown to HTML using `marked`.
- Extracts visible text from the HTML using `HtmlScanner` (jsdom + `innerText`).
- Splits text into token-based chunks via `TextScanner`, compatible with OpenAI's `tiktoken`.
- Fully async via `AsyncReadStream`.
## Installation

```sh
npm install @silyze/kb-scanner-md
```

## Usage
```ts
import MarkdownScanner from "@silyze/kb-scanner-md";

const scanner = new MarkdownScanner();

const md = `
# Hello World

This is a **Markdown** document.
`;

async function run() {
  const chunks = await scanner.scan(md).transform().toArray();
  console.log(chunks);
}

run().catch(console.error);
```

Output:

```ts
["Hello World\nThis is a Markdown document."];
```

## Configuration
`MarkdownScanner` accepts the same configuration options as `TextScanner`:

```ts
type MarkdownScannerConfig = TextScannerConfig;
```

Options:

- `tokensPerPage` – tokens per chunk (default: `512`)
- `overlap` – overlap ratio or count (default: `0.5`)
- `model` – tokenizer model name (default: `"text-embedding-3-small"`)
- `encoding` – text encoding (default: `"utf-8"`)
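For instance, a scanner tuned for smaller chunks with less overlap might be configured like this. This is a sketch: the option names come from the table above, but the assumption that the config object is passed to the `MarkdownScanner` constructor is mine — check the package's type definitions.

```typescript
import MarkdownScanner from "@silyze/kb-scanner-md";

// Illustrative values; option names follow the Configuration section above.
// Assumption: the config is accepted by the constructor.
const scanner = new MarkdownScanner({
  tokensPerPage: 256, // smaller chunks than the 512-token default
  overlap: 0.25,      // 25% overlap between consecutive chunks
  model: "text-embedding-3-small",
  encoding: "utf-8",
});
```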
## How It Works

- Parses Markdown into HTML via `marked`.
- Feeds the HTML into `HtmlScanner`.
- Extracts visible text with `innerText`.
- Passes the text into `TextScanner` for token-based chunking.
- Returns an `AsyncReadStream<string>` of chunks.
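The pipeline above can be sketched in a simplified, dependency-free form. Every function here is a hypothetical stand-in: the real package uses `marked` for a full Markdown parse, jsdom's `innerText` for text extraction, and `tiktoken`-based token counts rather than the word counts used below.

```typescript
// Simplified sketch of the Markdown -> HTML -> text -> chunks pipeline.
// These helpers are illustrative stand-ins, not the package's internals.

function markdownToHtml(md: string): string {
  // Handles only "#" headings and **bold**, for illustration.
  return md
    .split("\n")
    .map((line) =>
      line.startsWith("# ")
        ? `<h1>${line.slice(2)}</h1>`
        : `<p>${line.replace(/\*\*(.+?)\*\*/g, "<b>$1</b>")}</p>`
    )
    .join("\n");
}

function extractVisibleText(html: string): string {
  // Approximates jsdom's innerText by stripping tags.
  return html.replace(/<[^>]+>/g, "").replace(/\n+/g, "\n").trim();
}

function chunkText(text: string, wordsPerChunk: number): string[] {
  // Word-based chunking; the real TextScanner counts tiktoken tokens.
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}

const text = extractVisibleText(markdownToHtml("# Hello\n**Bold** text."));
const chunks = chunkText(text, 512); // ["Hello Bold text."] — one chunk for a short doc
```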
## Example

For:

```md
# Hello

**Bold** and _italic_ text.
```

the output might be:

```ts
["Hello\nBold and italic text."];
```

Longer documents are automatically chunked according to your token configuration.
