langextract
v1.2.0
Published
A TypeScript library for extracting structured and grounded information from text using LLMs
Maintainers
Readme
LangExtract TypeScript
A TypeScript translation of the original Python LangExtract library by Google LLC. This library provides structured information extraction from text using Large Language Models (LLMs) with full TypeScript support, comprehensive visualization tools, and a powerful CLI interface.
Original Repository: google/langextract
Features
- Structured Information Extraction: Extract entities, relationships, and structured data from text
- Multiple LLM Support: Works with Google Gemini, OpenAI, Ollama, and other LLM providers
- Schema Generation: Automatically generates JSON schemas from examples for better extraction
- Text Alignment: Aligns extracted information with original text positions
- Interactive Visualization: Built-in HTML visualization with animations and controls
- Command Line Interface: CLI tool for easy visualization generation
- Batch Processing: Process multiple documents efficiently
- TypeScript Support: Full TypeScript types and interfaces
- Flexible Output Formats: Support for JSON and YAML output formats
- Error Handling: Robust error handling and validation
Installation
# Install from npm
npm install langextract
# Or install from source
git clone https://github.com/kmbro/langextract.git
cd langextract/typescript
npm install
npm run buildQuick Start
Basic Extraction
import { extract, ExampleData } from "langextract";
// Define examples to guide the extraction
const examples: ExampleData[] = [
{
text: "John Smith is 30 years old and works at Google.",
extractions: [
{
extractionClass: "person",
extractionText: "John Smith",
attributes: {
age: "30",
employer: "Google",
},
},
],
},
];
// Extract information from text using Gemini
async function extractPersonInfo() {
const result = await extract("Alice Johnson is 25 and works at Microsoft.", {
promptDescription: "Extract person information including name, age, and employer",
examples: examples,
modelType: "gemini",
apiKey: "your-gemini-api-key",
modelId: "gemini-2.5-flash",
});
console.log(result.extractions);
// Output: [
// {
// extractionClass: "person",
// extractionText: "Alice Johnson",
// attributes: {
// age: "25",
// employer: "Microsoft"
// },
// charInterval: { startPos: 0, endPos: 13 },
// alignmentStatus: "match_exact"
// }
// ]
}
// Extract information from text using OpenAI
async function extractPersonInfoWithOpenAI() {
const result = await extract("Alice Johnson is 25 and works at Microsoft.", {
promptDescription: "Extract person information including name, age, and employer",
examples: examples,
modelType: "openai",
apiKey: "your-openai-api-key",
modelId: "gpt-4o-mini",
temperature: 0.1,
});
console.log(result.extractions);
}
### Quick Visualization
```typescript
import { visualize, saveVisualizationPage } from "langextract";
// Generate and save visualization
saveVisualizationPage(result, "./extraction-viz.html", {
animationSpeed: 1.0,
showLegend: true,
gifOptimized: true
});API Reference
Main Functions
extract(textOrDocuments, options)
The main function for extracting structured information from text.
Parameters:
textOrDocuments:string | Document | Document[]- Text or document(s) to processoptions: Extraction options object
Returns: Promise<AnnotatedDocument | AnnotatedDocument[]>
Options:
promptDescription:string- Instructions for what to extractexamples:ExampleData[]- Training examples to guide extractionmodelId:string- LLM model ID (default: "gemini-2.5-flash")modelType:"gemini" | "openai" | "ollama"- LLM provider type (default: "gemini")apiKey:string- API key for the LLM serviceformatType:FormatType- Output format (JSON or YAML)maxCharBuffer:number- Maximum characters per chunk (default: 1000)temperature:number- Sampling temperature (default: 0.5)fenceOutput:boolean- Whether to expect fenced output (default: false)useSchemaConstraints:boolean- Use schema constraints (default: true)batchLength:number- Documents per batch (default: 10)maxWorkers:number- Maximum parallel workers (default: 10)additionalContext:string- Additional context for extractiondebug:boolean- Enable debug mode (default: true)modelUrl:string- Custom model URL (for Ollama and Gemini)baseURL:string- Custom base URL (for OpenAI)extractionPasses:number- Number of extraction passes (default: 1)maxTokens:number- Maximum tokens in the response (default: 2048)
Core Types
ExampleData
interface ExampleData {
text: string;
extractions: Extraction[];
}CharInterval
interface CharInterval {
startPos?: number;
endPos?: number;
}Extraction
interface Extraction {
extractionClass: string;
extractionText: string;
charInterval?: CharInterval;
alignmentStatus?: AlignmentStatus;
extractionIndex?: number;
groupIndex?: number;
description?: string;
attributes?: Record<string, string | string[]>;
tokenInterval?: TokenInterval;
}Document
interface Document {
text: string;
documentId?: string;
additionalContext?: string;
tokenizedText?: TokenizedText;
}AnnotatedDocument
interface AnnotatedDocument {
documentId?: string;
extractions?: Extraction[];
text?: string;
tokenizedText?: TokenizedText;
}Visualization Functions
visualize(dataSource, options)
Generate interactive HTML visualization from extractions.
Parameters:
dataSource:AnnotatedDocument | string- Document or file pathoptions:VisualizationOptions- Visualization configuration
Returns: string - HTML content
saveVisualizationPage(dataSource, outputPath, options)
Save complete HTML page with visualization.
Parameters:
dataSource:AnnotatedDocument | string- Document or file pathoutputPath:string- Output file pathoptions:VisualizationOptions- Visualization configuration
Advanced Usage
Model Configuration
Google Gemini
import { GeminiLanguageModel } from "langextract";
const model = new GeminiLanguageModel({
modelId: "gemini-2.5-flash",
apiKey: "your-api-key",
temperature: 0.3,
});OpenAI
import { OpenAILanguageModel } from "langextract";
const model = new OpenAILanguageModel({
model: "gpt-4o-mini", // or "gpt-4", "gpt-3.5-turbo", etc.
apiKey: "your-openai-api-key",
temperature: 0.3,
baseURL: "https://api.openai.com/v1", // Optional: for custom endpoints
});Ollama (Local Models)
import { OllamaLanguageModel } from "langextract";
const model = new OllamaLanguageModel({
model: "llama2:latest",
modelUrl: "http://localhost:11434",
temperature: 0.7,
});Response Control
Limiting Response Length with maxTokens
You can control the maximum number of tokens in the model's response using the maxTokens option:
// Limit Gemini response to 100 tokens
const result = await extract("Extract person information from this text.", {
examples: examples,
apiKey: "your-api-key",
maxTokens: 100, // Short, concise responses
});
// Limit OpenAI response to 200 tokens
const result = await extract("Extract person information from this text.", {
examples: examples,
modelType: "openai",
apiKey: "your-openai-api-key",
maxTokens: 200, // Moderate response length
});
// Limit Ollama response to 150 tokens
const result = await extract("Extract person information from this text.", {
examples: examples,
modelType: "ollama",
modelUrl: "http://localhost:11434",
maxTokens: 150, // Local model with token limit
});Custom Model URLs
You can override the default API endpoints for custom deployments:
// Use custom Gemini endpoint (useful for self-hosted instances)
const result = await extract("Extract person information from this text.", {
examples: examples,
apiKey: "your-api-key",
modelType: "gemini",
modelUrl: "https://your-custom-gemini-endpoint.com", // Custom URL
maxTokens: 500,
});
// Use custom OpenAI endpoint
const result = await extract("Extract person information from this text.", {
examples: examples,
modelType: "openai",
apiKey: "your-openai-api-key",
baseURL: "https://your-custom-openai-endpoint.com", // Custom base URL
maxTokens: 300,
});
// Use custom Ollama endpoint
const result = await extract("Extract person information from this text.", {
examples: examples,
modelType: "ollama",
modelUrl: "http://your-custom-ollama-server:11434", // Custom Ollama server
maxTokens: 200,
});Prompt Engineering
Custom Prompt Templates
import { PromptTemplateStructured, QAPromptGeneratorImpl } from "langextract";
const template: PromptTemplateStructured = {
description: "Extract medical entities from clinical text",
examples: [
{
text: "Patient has diabetes and hypertension",
extractions: [
{
extractionClass: "condition",
extractionText: "diabetes",
},
{
extractionClass: "condition",
extractionText: "hypertension",
},
],
},
],
};
const generator = new QAPromptGeneratorImpl(template);
const prompt = generator.render("Patient shows signs of asthma");Output Processing
Custom Resolvers
import { Resolver, FormatType } from "langextract";
const resolver = new Resolver({
fenceOutput: true,
formatType: FormatType.YAML,
extractionAttributesSuffix: "_attrs",
});Schema Enforcement
OpenAI models support JSON schema enforcement through function calling. When you provide a schema, the model will be forced to return responses that conform to the specified structure:
import { OpenAILanguageModel, GeminiSchemaImpl } from "langextract";
// Create a custom schema
const bookSchema = new GeminiSchemaImpl({
type: "object",
properties: {
title: { type: "string" },
author: { type: "string" },
publication_year: { type: "number" },
genre: { type: "string" },
},
required: ["title", "author"],
});
const model = new OpenAILanguageModel({
model: "gpt-4o-mini",
apiKey: "your-openai-api-key",
openAISchema: bookSchema, // This enforces the schema
formatType: FormatType.JSON,
temperature: 0.0,
});Performance Optimization
Batch Processing
import { Document } from "langextract";
const documents: Document[] = [
{ text: "First document text", documentId: "doc1" },
{ text: "Second document text", documentId: "doc2" },
];
const results = await extract(documents, {
examples: examples,
apiKey: "your-api-key",
batchLength: 5,
});Examples
Use Cases
Medical Entity Extraction
const medicalExamples: ExampleData[] = [
{
text: "The patient has diabetes mellitus type 2 and hypertension.",
extractions: [
{
extractionClass: "condition",
extractionText: "diabetes mellitus type 2",
attributes: {
severity: "moderate",
type: "type 2",
},
},
{
extractionClass: "condition",
extractionText: "hypertension",
attributes: {
severity: "mild",
},
},
],
},
];
const result = await extract("Patient diagnosed with asthma and obesity.", {
promptDescription: "Extract medical conditions and their attributes",
examples: medicalExamples,
apiKey: "your-api-key",
});Named Entity Recognition
const nerExamples: ExampleData[] = [
{
text: "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
extractions: [
{
extractionClass: "organization",
extractionText: "Apple Inc.",
attributes: {
type: "company",
},
},
{
extractionClass: "person",
extractionText: "Steve Jobs",
attributes: {
role: "founder",
},
},
{
extractionClass: "location",
extractionText: "Cupertino, California",
attributes: {
type: "city",
},
},
],
},
];Visualization
LangExtract provides powerful visualization capabilities to help you understand and analyze your extractions. The visualization creates interactive HTML that highlights extracted entities with animations and controls.
Features
- Interactive Controls: Play/pause, next/previous, and progress slider
- Color-coded Highlights: Each extraction class gets a unique color
- Attribute Display: Shows extraction attributes in a side panel
- Smooth Animations: Automatic highlighting with configurable speed
- GIF Optimization: Special styling for video capture and screenshots
- Responsive Design: Works on different screen sizes
- File Support: Load from JSONL files or AnnotatedDocument objects
Basic Visualization
import { visualize, saveVisualizationPage } from "langextract";
// Create a visualization from an AnnotatedDocument
const html = visualize(result, {
animationSpeed: 1.0, // Seconds between extractions
showLegend: true, // Show color legend
gifOptimized: true, // Optimize for video capture
});
// Save as a complete HTML page
saveVisualizationPage(result, "./extraction-visualization.html", {
animationSpeed: 1.5,
showLegend: true,
gifOptimized: false,
});Visualization Options
interface VisualizationOptions {
animationSpeed?: number; // Animation speed in seconds (default: 1.0)
showLegend?: boolean; // Show color legend (default: true)
gifOptimized?: boolean; // Optimize for GIFs (default: true)
contextChars?: number; // Context characters around extractions (default: 150)
}Loading from Files
// Visualize extractions from a JSONL file
const html = visualize("./extractions.jsonl", {
animationSpeed: 0.8,
showLegend: true,
});Command Line Interface
LangExtract provides a CLI tool for easy visualization generation:
# Basic usage
npx ts-node bin/visualize.ts input.jsonl output.html
# With custom options
npx ts-node bin/visualize.ts input.jsonl output.html --speed 1.5 --gif-optimized
# Hide legend
npx ts-node bin/visualize.ts input.jsonl output.html --no-legend
# Using npm script
npm run visualize -- input.jsonl output.html --speed 0.8CLI Options:
--speed <number>: Animation speed in seconds (default: 1.0)--no-legend: Hide the color legend--gif-optimized: Optimize styling for GIF/video capture--context <number>: Context characters around extractions (default: 150)--help: Show help message
Examples
# Create a fast animation for GIF capture
npx ts-node bin/visualize.ts extractions.jsonl demo.html --speed 0.5 --gif-optimized
# Create a presentation-friendly version
npx ts-node bin/visualize.ts extractions.jsonl presentation.html --speed 2.0 --no-legend
# Process multiple files
for file in *.jsonl; do
npx ts-node bin/visualize.ts "$file" "${file%.jsonl}.html"
doneError Handling
LangExtract provides comprehensive error handling for various scenarios:
try {
const result = await extract(text, {
examples: examples,
apiKey: "your-api-key",
});
} catch (error) {
if (error instanceof Error) {
console.error("Extraction failed:", error.message);
}
}Common Error Types
- Missing API Key: Ensure your API key is provided via parameter or environment variable
- Invalid Examples: Examples array must contain valid ExampleData objects
- Model Errors: Check model ID and API key for the specified provider
- File Not Found: Verify file paths for JSONL input files
- Invalid Character Positions: Ensure charInterval positions are within text bounds
Configuration
Environment Variables
Set your API key as an environment variable:
export LANGEXTRACT_API_KEY="your-api-key"TypeScript Configuration
Add to your tsconfig.json:
{
"compilerOptions": {
"esModuleInterop": true,
"allowSyntheticDefaultImports": true,
"target": "ES2020",
"module": "commonjs",
"strict": true,
"declaration": true,
"outDir": "./dist"
}
}Development Setup
# Clone and setup
git clone https://github.com/kmbro/langextract.git
cd langextract/typescript
# Install dependencies
npm install
# Build the project
npm run build
# Run tests
npm test
# Run specific integration tests (requires API key)
OPENAI_API_KEY=your-api-key npm test -- medical-extraction.test.ts
# Run visualization CLI
npm run visualize -- sample-extractions.jsonl output.htmlContributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
Apache 2.0 License - see LICENSE file for details.
Support
For issues and questions, please open an issue on the GitHub repository.
