@silyze/kb-scanner-textract

v1.0.0

Published

6 months ago

textract wrapper implementation of DocumentScanner<T> for @silyze/kb

simeonmarkoski

mojsoski

textract-powered binary file implementation of DocumentScanner<T> for @silyze/kb, enabling seamless extraction and chunking of text from a wide range of document formats.

Features

Extracts raw text from binary formats like PDFs, DOC/DOCX, PPTX, images (OCR), and more using textract.
Uses TextScanner for token-based chunking compatible with OpenAI's tiktoken.
Handles Buffer inputs with MIME type detection.
Fully async via AsyncReadStream and AsyncTransform.

Installation

npm install @silyze/kb-scanner-textract

Some file types require system dependencies — see System Requirements.

Usage

import TextractScanner from "@silyze/kb-scanner-textract";
import fs from "fs/promises";

const scanner = new TextractScanner("application/pdf");

async function run() {
  const buffer = await fs.readFile("./hello-world.pdf");
  const chunks = await scanner.scan(buffer).transform().toArray();

  console.log(chunks);
}

run().catch(console.error);

Configuration

TextractScanner accepts configuration options from both TextScanner and textract.

type TextractScannerConfig = TextScannerConfig & textract.Config;

Common options:

tokensPerPage – tokens per chunk (default: 512)
overlap – overlap between chunks (default: 0.5)
model – tokenizer model (default: "text-embedding-3-small")
preserveLineBreaks – preserve line breaks in extracted text
exec.maxBuffer – increase buffer size for large files
tesseract.lang – OCR language for image-based extraction (e.g. "deu")
pdftotextOptions – PDF-specific settings (e.g., for password-protected files)
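
As an illustration, the options above might be combined as in the sketch below. The local ExampleConfig type is a hypothetical stand-in for TextractScannerConfig so that the snippet compiles without the package installed. The option names come from the list above; the values are arbitrary, and how the object is handed to TextractScanner (for example as a second constructor argument) is an assumption to check against the package's typings.

```typescript
// Hypothetical stand-in for TextractScannerConfig, limited to the options listed above.
interface ExampleConfig {
  tokensPerPage?: number;
  overlap?: number;
  model?: string;
  preserveLineBreaks?: boolean;
  exec?: { maxBuffer?: number };
  tesseract?: { lang?: string };
  pdftotextOptions?: { userPassword?: string; ownerPassword?: string };
}

// Illustrative values only; none of these are recommendations from the package.
const exampleConfig: ExampleConfig = {
  tokensPerPage: 256, // smaller chunks than the documented default of 512
  overlap: 0.25,
  model: "text-embedding-3-small",
  preserveLineBreaks: true,
  exec: { maxBuffer: 10 * 1024 * 1024 }, // 10 MiB for large files
  tesseract: { lang: "deu" }, // OCR language for image input
};

console.log(exampleConfig);
```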

Supported MIME Types

textract supports text extraction from a wide variety of MIME types, including:

application/pdf
application/msword (.doc)
application/vnd.openxmlformats-officedocument.wordprocessingml.document (.docx)
application/vnd.ms-excel, .xls, .xlsx, .csv, .ods
application/vnd.openxmlformats-officedocument.presentationml.presentation (.pptx)
application/rtf, .odt, .xml, .epub, .html, .md
image/png, image/jpeg, image/gif (OCR via tesseract)
text/*, application/javascript, and more

You must provide the MIME type for correct extraction behavior.
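
Because the MIME type has to be supplied by the caller, a small lookup from file extension to MIME type can be handy. The helper below is a hypothetical sketch and not part of the package; its table covers only some of the types listed above.

```typescript
// Hypothetical helper: maps a file extension to one of the MIME types listed above.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".pdf": "application/pdf",
  ".doc": "application/msword",
  ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
  ".xls": "application/vnd.ms-excel",
  ".rtf": "application/rtf",
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".gif": "image/gif",
};

// Returns the MIME type for a file name, or undefined if the extension is unknown.
function mimeTypeForFile(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  return MIME_BY_EXTENSION[fileName.slice(dot).toLowerCase()];
}
```

The result can then be passed to the TextractScanner constructor in place of a hard-coded string such as "application/pdf".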

How it Works

Accepts a Buffer and MIME type as input.
Uses textract.fromBufferWithMime() to extract text.
Passes the text into TextScanner for token-based chunking.
Returns the chunks as an AsyncReadStream<string>.
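
The steps above can be sketched without the package itself. In the snippet below, extractTextStub stands in for textract.fromBufferWithMime() and only decodes plain text, and chunkWords stands in for TextScanner by grouping whitespace-separated words instead of tokenizer tokens. All three function names are hypothetical; the real implementation may differ.

```typescript
// Stand-in for textract.fromBufferWithMime(); it only decodes UTF-8 text input.
async function extractTextStub(mimeType: string, buffer: Uint8Array): Promise<string> {
  if (!mimeType.startsWith("text/")) {
    throw new Error(`stub cannot extract ${mimeType}`);
  }
  return new TextDecoder().decode(buffer);
}

// Stand-in for TextScanner: yields fixed-size groups of whitespace-separated words.
async function* chunkWords(text: string, wordsPerChunk: number): AsyncGenerator<string> {
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    yield words.slice(i, i + wordsPerChunk).join(" ");
  }
}

// Buffer + MIME type in, extracted text chunked, chunks collected into an array.
async function scanSketch(
  mimeType: string,
  buffer: Uint8Array,
  wordsPerChunk = 4
): Promise<string[]> {
  const text = await extractTextStub(mimeType, buffer);
  const chunks: string[] = [];
  for await (const chunk of chunkWords(text, wordsPerChunk)) {
    chunks.push(chunk);
  }
  return chunks;
}
```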

Example Output

["Hello world! This is a test PDF."];

Longer documents will be split into multiple token-based chunks depending on your configuration.
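
To give a feel for how chunk size and overlap interact, here is a minimal sliding-window sketch. It assumes that overlap is the fraction of each chunk shared with the next one; TextScanner may define it differently. chunkWithOverlap is a hypothetical helper that works on any array of tokens.

```typescript
// Hypothetical sliding-window chunker; `overlap` is assumed to be a fraction in [0, 1).
function chunkWithOverlap<T>(tokens: T[], tokensPerChunk: number, overlap: number): T[][] {
  if (tokensPerChunk <= 0 || overlap < 0 || overlap >= 1) {
    throw new RangeError("tokensPerChunk must be > 0 and overlap must be in [0, 1)");
  }
  // Advance by the non-overlapping part of each chunk, but by at least one token.
  const step = Math.max(1, Math.floor(tokensPerChunk * (1 - overlap)));
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + tokensPerChunk));
    if (start + tokensPerChunk >= tokens.length) break; // this chunk reached the end
  }
  return chunks;
}
```

With 10 tokens, 4 tokens per chunk, and an overlap of 0.5, this yields four chunks, each sharing two tokens with its neighbour.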

System Requirements

Some file types require external system tools for extraction. textract will still work without them, but support for those types will be disabled.

| Format | Required Tool | Install (Ubuntu/Debian)        |
| ------ | ------------- | ------------------------------ |
| .pdf   | pdftotext     | sudo apt install poppler-utils |
| .doc   | antiword      | sudo apt install antiword      |
| .rtf   | unrtf         | sudo apt install unrtf         |
| Images | tesseract     | sudo apt install tesseract-ocr |
| .dxf   | drawingtotext | Install manually               |

To support most formats, you can run:

sudo apt install poppler-utils antiword unrtf tesseract-ocr
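
To see which of these tools are on the PATH before scanning, a quick check along the following lines may help. This is a sketch, not part of the package: isToolAvailable is a hypothetical helper that only tests whether the command can be started, and the "-v" argument is arbitrary because the exit code is ignored.

```typescript
import { spawnSync } from "node:child_process";

// Returns false only when the command cannot be found (ENOENT); exit codes are ignored.
function isToolAvailable(command: string): boolean {
  const result = spawnSync(command, ["-v"], { stdio: "ignore", timeout: 5000 });
  const code = (result.error as { code?: string } | undefined)?.code;
  return code !== "ENOENT";
}

// Command names taken from the table above.
const requiredTools = ["pdftotext", "antiword", "unrtf", "tesseract"];
const missingTools = requiredTools.filter((tool) => !isToolAvailable(tool));

console.log(
  missingTools.length === 0 ? "All tools found" : `Missing: ${missingTools.join(", ")}`
);
```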

Notes

On macOS, textutil is used for .doc and .rtf by default (pre-installed).
Image OCR quality depends heavily on DPI and clarity of text content.
