@silyze/kb-scanner-textract
v1.0.0
textract wrapper implementation of DocumentScanner<T> for @silyze/kb
@silyze/kb-scanner-textract
A textract-powered implementation of DocumentScanner<T> for @silyze/kb that extracts text from binary files in a wide range of document formats and splits it into token-based chunks.
Features
- Extracts raw text from binary formats like PDFs, DOC/DOCX, PPTX, images (OCR), and more using textract.
- Uses TextScanner for token-based chunking compatible with OpenAI's tiktoken.
- Handles Buffer inputs with MIME type detection.
- Fully async via AsyncReadStream and AsyncTransform.
Installation
npm install @silyze/kb-scanner-textract
Some file types require system dependencies; see System Requirements.
Usage
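import TextractScanner from "@silyze/kb-scanner-textract";
import fs from "fs/promises";

// The MIME type tells textract which extractor to use.
const scanner = new TextractScanner("application/pdf");

async function run() {
  const buffer = await fs.readFile("./hello-world.pdf");
  // scan() extracts the text; transform().toArray() collects the chunks.
  const chunks = await scanner.scan(buffer).transform().toArray();
  console.log(chunks);
}

// Report extraction failures instead of leaving the rejection unhandled.
run().catch(console.error);
Configuration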
TextractScanner accepts configuration options from both TextScanner and textract.
type TextractScannerConfig = TextScannerConfig & textract.Config;
Common options:
- tokensPerPage: tokens per chunk (default: 512)
- overlap: overlap between chunks (default: 0.5)
- model: tokenizer model (default: "text-embedding-3-small")
- preserveLineBreaks: preserve line breaks in extracted text
- exec.maxBuffer: increase buffer size for large files
- tesseract.lang: OCR language for image-based extraction (e.g., "deu")
- pdftotextOptions: PDF-specific settings (e.g., for password-protected files)
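To make tokensPerPage and overlap concrete, here is a minimal sketch of overlapping windows over an already-tokenized input. It assumes that overlap is the fraction of each chunk shared with the previous one, which this README does not state explicitly. chunkTokens is a hypothetical helper written for illustration, not part of TextScanner, and the real chunker may behave differently.

```typescript
// Hypothetical helper, not the TextScanner implementation.
// Assumption: `overlap` is the fraction of a chunk shared with the previous chunk.
function chunkTokens<T>(
  tokens: readonly T[],
  tokensPerPage: number,
  overlap: number
): T[][] {
  if (!Number.isInteger(tokensPerPage) || tokensPerPage <= 0) {
    throw new RangeError("tokensPerPage must be a positive integer");
  }
  if (overlap < 0 || overlap >= 1) {
    throw new RangeError("overlap must be in the range [0, 1)");
  }

  // Distance between the starts of two consecutive chunks (at least one token).
  const stride = Math.max(1, Math.round(tokensPerPage * (1 - overlap)));

  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + tokensPerPage));
    // Stop once a chunk has reached the end of the input.
    if (start + tokensPerPage >= tokens.length) break;
  }
  return chunks;
}
```

Under this reading, the defaults (512 tokens per chunk, overlap 0.5) would start a new chunk every 256 tokens, so each token in the middle of a document appears in two chunks.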
Supported MIME Types
textract supports text extraction from a wide variety of MIME types, including:
- application/pdf
- application/msword (.doc)
- application/vnd.openxmlformats-officedocument.wordprocessingml.document (.docx)
- application/vnd.ms-excel, .xls, .xlsx, .csv, .ods
- application/vnd.openxmlformats-officedocument.presentationml.presentation (.pptx)
- application/rtf, .odt, .xml, .epub, .html, .md
- image/png, image/jpeg, image/gif (OCR via tesseract)
- text/*, application/javascript, and more
You must provide the MIME type for correct extraction behavior.
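Because the MIME type has to be supplied by the caller, one option is to derive it from the file name before constructing the scanner. The sketch below is a hypothetical helper (mimeFromPath and its lookup table are not part of this package) and covers only some of the formats listed above; a production setup might prefer a dedicated MIME library or content sniffing.

```typescript
// Hypothetical lookup table: lowercase file extension -> MIME type.
// Only a subset of the formats named in this README is included.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["doc", "application/msword"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
  ["rtf", "application/rtf"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["gif", "image/gif"],
]);

// Returns the MIME type for a path, or undefined when the extension is unknown.
function mimeFromPath(path: string): string | undefined {
  // Strip directories, accepting both "/" and "\" separators.
  const lastSeparator = Math.max(path.lastIndexOf("/"), path.lastIndexOf("\\"));
  const fileName = path.slice(lastSeparator + 1);

  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".gitignore"

  return MIME_BY_EXTENSION.get(fileName.slice(dot + 1).toLowerCase());
}
```

The result could then be passed to the TextractScanner constructor shown in Usage, with an explicit error for files whose type cannot be determined.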
How it Works
- Accepts a Buffer and MIME type as input.
- Uses textract.fromBufferWithMime() to extract text.
- Passes the text into TextScanner for token-based chunking.
- Returns the chunks as an AsyncReadStream<string>.
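The flow above can be sketched as an async generator that extracts first and then yields chunks lazily. Everything in this block is a stand-in chosen for illustration: the Extract and Chunk types, scanSketch, collect, and the toy stages are hypothetical and do not reproduce the package's real classes. Uint8Array is used so the sketch has no Node-specific types; a Node Buffer is a Uint8Array and can be passed as-is.

```typescript
// Hypothetical stand-ins for the two stages; not the package's real types.
type Extract = (buffer: Uint8Array, mimeType: string) => Promise<string>;
type Chunk = (text: string) => Iterable<string>;

// Extract text from the buffer, then yield chunks one at a time.
async function* scanSketch(
  buffer: Uint8Array,
  mimeType: string,
  extract: Extract,
  chunk: Chunk
): AsyncGenerator<string> {
  const text = await extract(buffer, mimeType);
  for (const piece of chunk(text)) {
    yield piece;
  }
}

// Collect an async stream into an array, similar in spirit to toArray() in Usage.
async function collect<T>(stream: AsyncIterable<T>): Promise<T[]> {
  const items: T[] = [];
  for await (const item of stream) {
    items.push(item);
  }
  return items;
}

// Toy stages for demonstration only: treat bytes as ASCII, split on sentence ends.
const fakeExtract: Extract = async (buffer) =>
  Array.from(buffer, (byte) => String.fromCharCode(byte)).join("");
const sentenceChunks: Chunk = (text) =>
  text.split(/(?<=[.!?])\s+/).filter((sentence) => sentence.length > 0);
```

In the real package the extraction stage would be textract and the chunking stage would be TextScanner; injecting both as functions here simply keeps the sketch runnable without system dependencies.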
Example Output
["Hello world! This is a test PDF."];
Longer documents will be split into multiple token-based chunks depending on your configuration.
System Requirements
Some file types require external system tools for extraction. textract will still work without them, but support for those types will be disabled.
| Format | Required Tool | Install (Ubuntu/Debian) |
| ------ | --------------- | --------------------------------------------------------- |
| .pdf | pdftotext | sudo apt install poppler-utils |
| .doc | antiword | sudo apt install antiword |
| .rtf | unrtf | sudo apt install unrtf |
| Images | tesseract | sudo apt install tesseract-ocr |
| .dxf | drawingtotext | Install manually |
To support most formats, you can run:
sudo apt install poppler-utils antiword unrtf tesseract-ocr
Notes
- On macOS, textutil is used for .doc and .rtf by default (pre-installed).
- Image OCR quality depends heavily on DPI and clarity of text content.
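If only a few formats are needed, the System Requirements table can be turned into a small lookup that lists just the packages to install on Ubuntu/Debian. This is a sketch: requirementFor and aptInstallCommand are hypothetical helpers, and the pairing of table rows with MIME types (.doc with application/msword, .rtf with application/rtf, images with image/*) is an assumption based on the Supported MIME Types list.

```typescript
// Tool and Ubuntu/Debian package, copied from the System Requirements table.
interface Requirement {
  tool: string;
  aptPackage: string;
}

// Assumed mapping from MIME type to the table row that covers it.
function requirementFor(mimeType: string): Requirement | undefined {
  if (mimeType === "application/pdf") return { tool: "pdftotext", aptPackage: "poppler-utils" };
  if (mimeType === "application/msword") return { tool: "antiword", aptPackage: "antiword" };
  if (mimeType === "application/rtf") return { tool: "unrtf", aptPackage: "unrtf" };
  if (mimeType.startsWith("image/")) return { tool: "tesseract", aptPackage: "tesseract-ocr" };
  return undefined; // no external tool listed in the table
}

// Builds one install command covering every MIME type the caller plans to scan.
function aptInstallCommand(mimeTypes: readonly string[]): string | undefined {
  const packages = new Set<string>();
  for (const mimeType of mimeTypes) {
    const requirement = requirementFor(mimeType);
    if (requirement) packages.add(requirement.aptPackage);
  }
  return packages.size > 0
    ? `sudo apt install ${Array.from(packages).join(" ")}`
    : undefined;
}
```

The .dxf row is left out because its tool has to be installed manually.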
