@kognitivedev/documents
v0.2.11
Document parsing and OCR pipeline for Kognitive.
Installation
bun add @kognitivedev/documents ai
Quick Start
import { createOpenAI } from "@ai-sdk/openai";
import {
  AISDKVisionOCRProvider,
  DocumentProcessor,
  LibreOfficeConverter,
  toRagDocuments,
} from "@kognitivedev/documents";
import {
  AISDKEmbeddingProvider,
  DocumentPipeline,
  InMemoryVectorStore,
  RecursiveTextChunker,
} from "@kognitivedev/rag";

const openai = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });

const processor = new DocumentProcessor({
  ocrProvider: new AISDKVisionOCRProvider({
    model: openai("gpt-4o-mini"),
    providerName: "openai",
    modelName: "gpt-4o-mini",
  }),
  officeConverter: new LibreOfficeConverter(),
});

const processed = await processor.process({
  path: "./contract.pdf",
});

const ragDocs = toRagDocuments(processed, { mode: "page" });

const pipeline = new DocumentPipeline({
  chunker: new RecursiveTextChunker({ chunkSize: 1000, overlap: 200 }),
  embedder: new AISDKEmbeddingProvider({
    model: openai.embedding("text-embedding-3-small"),
  }),
  vectorStore: new InMemoryVectorStore(),
});

await pipeline.ingest(ragDocs);

Supported Inputs
- pdf: native text extraction with OCR fallback on sparse pages
- docx: native extraction via Mammoth
- txt, md, html, json: direct UTF-8 parsing
- png, jpg, jpeg, webp, tiff: OCR
- doc, odt, ppt, pptx, xls, xlsx: optional LibreOffice conversion to PDF, then PDF processing
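The routing described in the list above can be pictured with a small standalone sketch. It is not the package's implementation: `Route`, `routeByExtension`, and `needsOcr` are hypothetical names, and the 50-character threshold for a "sparse" page is an assumption, since the README does not say how sparseness is measured.

```typescript
// Hypothetical sketch: none of these names are exports of @kognitivedev/documents.
type Route =
  | "pdf-native-with-ocr-fallback"
  | "docx-native"
  | "utf8-text"
  | "image-ocr"
  | "office-to-pdf"
  | "unsupported";

// Extension groups copied from the Supported Inputs list.
const GROUPS: Array<[Route, string[]]> = [
  ["pdf-native-with-ocr-fallback", ["pdf"]],
  ["docx-native", ["docx"]],
  ["utf8-text", ["txt", "md", "html", "json"]],
  ["image-ocr", ["png", "jpg", "jpeg", "webp", "tiff"]],
  ["office-to-pdf", ["doc", "odt", "ppt", "pptx", "xls", "xlsx"]],
];

// Flatten the groups into a lookup table keyed by lowercase extension.
const ROUTES = new Map<string, Route>();
for (const [route, extensions] of GROUPS) {
  for (const ext of extensions) {
    ROUTES.set(ext, route);
  }
}

// Pick a processing route from the file extension alone.
function routeByExtension(path: string): Route {
  const fileName = path.split(/[\\/]/).pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return "unsupported"; // no extension, or a dotfile such as ".env"
  const ext = fileName.slice(dot + 1).toLowerCase();
  return ROUTES.get(ext) ?? "unsupported";
}

// Assumed heuristic: a page is "sparse" when native extraction yields
// fewer than minChars non-whitespace characters.
function needsOcr(pageText: string, minChars = 50): boolean {
  return pageText.replace(/\s+/g, "").length < minChars;
}
```

Under these assumptions, `routeByExtension("./contract.pdf")` selects the PDF route, and `needsOcr` would then be applied per page to decide whether that page falls back to OCR.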
Public API
- DocumentProcessor
- ProcessorFactory
- OCRProvider
- AISDKVisionOCRProvider
- LibreOfficeConverter
- toRagDocuments
