llm-extract
v1.1.1
Modular SDK for structured text extraction from documents using LLMs
LLM Extract
Extract structured data from documents using LLMs. Inspired by LangExtract.
Key Features:
- Multi-worker parallel processing for large documents
- Document processing with PDF text extraction and OCR fallback
- Structured data extraction using LLMs with few-shot learning
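As a rough illustration of the first feature, the sketch below splits a long text into chunks and processes them with a bounded number of concurrent workers. It is not the library's implementation: `chunkText`, `mapWithConcurrency`, and their parameters are hypothetical names chosen for this example, and the worker is any async function you supply (for instance, one that calls an LLM on a single chunk).

```typescript
// Hypothetical helpers; not part of llm-extract's API.

// Split text into pieces of at most `maxChars` characters,
// preferring to break at a space so words stay intact.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    const piece = text.slice(start, end).trim();
    if (piece) chunks.push(piece);
    start = end;
  }
  return chunks;
}

// Run `worker` over all items with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

A caller might chunk a document with `chunkText(text, 4000)` and pass the chunks to `mapWithConcurrency(chunks, 4, extractOne)`, where `extractOne` is a placeholder for whatever per-chunk extraction call is used. Keeping the limit small helps stay within provider rate limits.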
Installation
npm install llm-extract
Supported Providers
| Provider | Status |
|----------|--------|
| OpenAI | ✅ Available |
| Azure OpenAI | ✅ Available |
| Anthropic Claude | 🔄 Coming Soon |
| Google Gemini | 🔄 Coming Soon |
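Since only the OpenAI and Azure OpenAI providers are available, one way to choose between them without hard-coding keys is to build the provider options from environment variables. The sketch below is an assumption-laden example, not part of the package: the environment variable names (`LLM_PROVIDER`, `OPENAI_API_KEY`, and so on) and the `providerConfigFromEnv` helper are hypothetical, while the option names (`apiKey`, `model`, `endpoint`, `deploymentName`) mirror the constructor options shown under Usage.

```typescript
// Hypothetical helper; the env var names are illustrative assumptions.
type OpenAIConfig = { kind: "openai"; apiKey: string; model: string };
type AzureConfig = {
  kind: "azure";
  apiKey: string;
  endpoint: string;
  deploymentName: string;
  model: string;
};
type ProviderConfig = OpenAIConfig | AzureConfig;
type Env = Record<string, string | undefined>;

// Read a variable that must be present, failing early with a clear message.
function required(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing environment variable ${name}`);
  return value;
}

function providerConfigFromEnv(env: Env): ProviderConfig {
  const provider = env["LLM_PROVIDER"] ?? "openai";
  switch (provider) {
    case "openai":
      return {
        kind: "openai",
        apiKey: required(env, "OPENAI_API_KEY"),
        model: env["OPENAI_MODEL"] ?? "gpt-4o",
      };
    case "azure":
      return {
        kind: "azure",
        apiKey: required(env, "AZURE_OPENAI_API_KEY"),
        endpoint: required(env, "AZURE_OPENAI_ENDPOINT"),
        deploymentName: required(env, "AZURE_OPENAI_DEPLOYMENT"),
        model: env["AZURE_OPENAI_MODEL"] ?? "gpt-4",
      };
    default:
      // Providers listed as "Coming Soon" end up here.
      throw new Error(`Provider "${provider}" is not available`);
  }
}
```

In a Node.js program this could be called as `providerConfigFromEnv(process.env)`, with the resulting fields passed to the matching provider constructor.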
Usage
import { LanguageExtractor, OpenAIProvider } from 'llm-extract';
// Setup with OpenAI
const provider = new OpenAIProvider({
  apiKey: "sk-your-openai-api-key",
  model: "gpt-4o"
});
// Or with Azure OpenAI
// const provider = new AzureOpenAIProvider({
//   apiKey: "your-api-key",
//   endpoint: "https://your-endpoint.openai.azure.com/",
//   deploymentName: "gpt-4",
//   model: "gpt-4"
// });
const extractor = new LanguageExtractor();
extractor.setLLMProvider(provider);
// Extract from text
const result = await extractor.extract({
  textOrDocuments: "Contract with John Doe dated 2024-01-15",
  promptDescription: "Extract names and dates",
  temperature: 0.1
});
console.log(result.extractions);
// [{ extraction_class: "name", extraction_text: "John Doe" }, ...]
Document Processing
import { PDFOCRProcessor, ImageOCRProcessor } from 'llm-extract';
// Step 1: Process document to extract text
const pdfProcessor = new PDFOCRProcessor();
// pdfBuffer is assumed to hold the raw PDF bytes (for example, a Buffer read from disk)
const parsedDoc = await pdfProcessor.parseDocument(pdfBuffer, {
  fallbackToBasic: true,
  config: {
    tesseract: { language: 'eng' },
    pdf2pic: { density: 200 }
  }
});
// Step 2: Extract structured data from text
const result = await extractor.extract({
  textOrDocuments: parsedDoc.extractedText,
  promptDescription: "Extract invoice details"
});
Examples with Training Data
const result = await extractor.extract({
  textOrDocuments: "Agreement with ABC Corp on 2024-01-15",
  promptDescription: "Extract companies and dates",
  examples: [
    {
      text: "Contract with XYZ Ltd dated 2023-12-01",
      extractions: [
        { extraction_class: "company", extraction_text: "XYZ Ltd" },
        { extraction_class: "date", extraction_text: "2023-12-01" }
      ]
    }
  ]
});
License
MIT
