@divinci-ai/langextract-ts v0.2.0
# LangExtract-TS
TypeScript port of Google's LangExtract — structured information extraction from text using LLMs with precise character-level source grounding.
Based on LangExtract v1.1.1 by Google. Ported to TypeScript with Gemini and Cloudflare Workers AI support.
## Features
- Source grounding — every extraction maps back to exact character positions in the original text
- Sentence-aware chunking — three-strategy chunker that respects sentence boundaries
- Two-phase alignment — exact token matching + fuzzy fallback for robust source mapping
- Universal runtime — runs on Node.js 18+, Cloudflare Workers, Deno, and Bun
- Minimal dependencies — only `zod` required; provider SDKs are optional
- Interactive visualization — self-contained HTML with playback controls
- Provider plugins — built-in Gemini + Cloudflare, extensible for custom providers
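Source grounding means each extraction's `charInterval` indexes directly into the original input, so the grounded span can be recovered with a plain `slice`. A minimal sketch (the object shape follows the library's documented output; the literal values here are illustrative):

```typescript
// Recover the grounded span from the original text using the
// charInterval an extraction carries.
const text = "The patient takes Aspirin 81mg daily for heart health.";
const extraction = {
  extractionClass: "medication",
  text: "Aspirin",
  charInterval: { startPos: 18, endPos: 25 },
};

// Slice the original text at the reported character positions.
const span = text.slice(
  extraction.charInterval.startPos,
  extraction.charInterval.endPos,
);
// span === extraction.text, i.e. "Aspirin"
```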
## Installation

```bash
npm install langextract-ts
# or
pnpm add langextract-ts
```

For Gemini support (optional):

```bash
npm install @google/genai
```

## Quick Start
```typescript
import { extract } from "langextract-ts";

const result = await extract(
  "The patient takes Aspirin 81mg daily for heart health.",
  {
    promptDescription: "Extract all medications with their dosage and frequency.",
    examples: [{
      text: "She takes Lisinopril 10mg once daily.",
      extractions: [{
        extractionClass: "medication",
        text: "Lisinopril",
        attributes: { dosage: "10mg", frequency: "once daily" },
      }],
    }],
    modelId: "gemini-2.0-flash",
    apiKey: process.env.GOOGLE_API_KEY,
  },
);

// result.extractions[0]:
// {
//   extractionClass: "medication",
//   text: "Aspirin",
//   charInterval: { startPos: 18, endPos: 25 },
//   alignmentStatus: "exact",
//   attributes: { dosage: "81mg", frequency: "daily" },
// }
```

### With Cloudflare Workers AI
```typescript
import { extract } from "langextract-ts";

const result = await extract(
  "Romeo professes his love for Juliet in the famous balcony scene.",
  {
    promptDescription: "Extract all characters mentioned.",
    examples: [{
      text: "Hamlet speaks to Horatio.",
      extractions: [{
        extractionClass: "character",
        text: "Hamlet",
      }],
    }],
    modelId: "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
    apiKey: process.env.CF_API_TOKEN,
    accountId: process.env.CF_ACCOUNT_ID,
  },
);
```

## API
### `extract(input, options)`

Main entry point. Accepts strings, URLs, or `Document[]`.
| Option | Default | Description |
|---|---|---|
| `promptDescription` | required | Task instructions for the LLM |
| `examples` | required | Few-shot examples |
| `modelId` | `"gemini-2.0-flash"` | Model identifier |
| `apiKey` | env var | Provider API key |
| `maxCharBuffer` | `1000` | Max characters per chunk |
| `batchLength` | `10` | Chunks per inference batch |
| `maxWorkers` | `10` | Concurrent requests |
| `extractionPasses` | `1` | Number of extraction passes |
| `contextWindowChars` | `0` | Cross-chunk context window (characters) |
| `formatType` | `"json"` | Output format (`"json"` or `"yaml"`) |
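For illustration, an options object tuned for long documents might combine the knobs above. The values below are examples, not the library's defaults:

```typescript
// Illustrative tuning: smaller chunks for tighter grounding, capped
// concurrency, and a second extraction pass for recall.
const options = {
  promptDescription: "Extract all medications with their dosage and frequency.",
  examples: [],             // few-shot examples, as in Quick Start
  modelId: "gemini-2.0-flash",
  maxCharBuffer: 500,       // max characters per chunk
  batchLength: 20,          // chunks per inference batch
  maxWorkers: 4,            // cap on concurrent requests
  extractionPasses: 2,      // an extra pass can improve recall
  contextWindowChars: 200,  // carry context across chunk boundaries
  formatType: "json",
};
```

Passing such an object as the second argument to `extract` follows the Quick Start pattern; which knobs matter most depends on document length and provider rate limits.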
## Chunking

```typescript
import { chunkDocument, createDocument } from "langextract-ts";

const doc = createDocument("Your long text here...");
for (const chunk of chunkDocument(doc, { maxCharBuffer: 500 })) {
  console.log(chunk.text, chunk.charInterval);
}
```

## Tokenization
```typescript
import { RegexTokenizer, UnicodeTokenizer } from "langextract-ts";

const tokenizer = new RegexTokenizer();
const { tokens } = tokenizer.tokenize("Hello world!");
// tokens: [{ text: "Hello", tokenType: "word", charInterval: { startPos: 0, endPos: 5 } }, ...]

// For CJK/international text:
const unicode = new UnicodeTokenizer();
const { tokens: cjkTokens } = unicode.tokenize("Hello 世界");
```

## Visualization
```typescript
import { visualize } from "langextract-ts";

const html = visualize(annotatedDocument, {
  title: "Medication Extraction",
  animationSpeed: 1500,
});
// Save `html` to a file and open it in a browser
```

## Custom Providers
```typescript
import { BaseLanguageModel, registerProvider } from "langextract-ts";

class MyProvider extends BaseLanguageModel {
  async *infer(prompts) {
    for (const prompt of prompts) {
      const response = await myApi.call(prompt);
      yield [{ output: response, score: 1.0 }];
    }
  }
}

registerProvider([/^my-model/], () => MyProvider, 20);
```

## Architecture
```
Input Text/URL
  -> Tokenization (RegexTokenizer or UnicodeTokenizer)
  -> Sentence-aware Chunking (3 strategies)
  -> Few-shot Prompt Construction
  -> Batched LLM Inference (concurrent with Semaphore)
  -> JSON Parsing + Extraction
  -> Two-phase Alignment (exact + fuzzy via SequenceMatcher)
  -> AnnotatedDocument with CharInterval positions
```

## Runtime Compatibility
| Runtime | Supported | Notes |
|---|---|---|
| Node.js 18+ | Yes | Full support |
| Cloudflare Workers | Yes | Web APIs only |
| Deno | Yes | V8-based |
| Bun | Yes | JavaScriptCore |
## License
Apache-2.0
This project is a derivative work of Google's LangExtract, originally licensed under Apache-2.0.
