@divinci-ai/langextract-ts v0.2.0
# LangExtract-TS
TypeScript port of Google's LangExtract — structured information extraction from text using LLMs with precise character-level source grounding.
Based on LangExtract v1.1.1 by Google. Ported to TypeScript with Gemini and Cloudflare Workers AI support.
## Features
- Source grounding — every extraction maps back to exact character positions in the original text
- Sentence-aware chunking — three-strategy chunker that respects sentence boundaries
- Two-phase alignment — exact token matching + fuzzy fallback for robust source mapping
- Universal runtime — runs on Node.js 18+, Cloudflare Workers, Deno, and Bun
- Minimal dependencies — only `zod` required; provider SDKs are optional
- Interactive visualization — self-contained HTML with playback controls
- Provider plugins — built-in Gemini + Cloudflare, extensible for custom providers
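Source grounding means each extraction's `charInterval` indexes directly into the original input, so the grounded span can be recovered with a plain `slice`. A minimal sketch (the object shape follows the library's documented output; the literal values here are illustrative):

```typescript
// Recover the grounded span from the original text using the
// charInterval an extraction carries.
const text = "The patient takes Aspirin 81mg daily for heart health.";
const extraction = {
  extractionClass: "medication",
  text: "Aspirin",
  charInterval: { startPos: 18, endPos: 25 },
};

// Slice the original text at the reported character positions.
const span = text.slice(
  extraction.charInterval.startPos,
  extraction.charInterval.endPos,
);
// span === extraction.text, i.e. "Aspirin"
```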
## Installation

```bash
npm install langextract-ts
# or
pnpm add langextract-ts
```

For Gemini support (optional):

```bash
npm install @google/genai
```

## Quick Start
```typescript
import { extract } from "langextract-ts";

const result = await extract(
  "The patient takes Aspirin 81mg daily for heart health.",
  {
    promptDescription: "Extract all medications with their dosage and frequency.",
    examples: [{
      text: "She takes Lisinopril 10mg once daily.",
      extractions: [{
        extractionClass: "medication",
        text: "Lisinopril",
        attributes: { dosage: "10mg", frequency: "once daily" },
      }],
    }],
    modelId: "gemini-2.0-flash",
    apiKey: process.env.GOOGLE_API_KEY,
  },
);

// result.extractions[0]:
// {
//   extractionClass: "medication",
//   text: "Aspirin",
//   charInterval: { startPos: 18, endPos: 25 },
//   alignmentStatus: "exact",
//   attributes: { dosage: "81mg", frequency: "daily" },
// }
```

### With Cloudflare Workers AI
```typescript
import { extract } from "langextract-ts";

const result = await extract(
  "Romeo professes his love for Juliet in the famous balcony scene.",
  {
    promptDescription: "Extract all characters mentioned.",
    examples: [{
      text: "Hamlet speaks to Horatio.",
      extractions: [{
        extractionClass: "character",
        text: "Hamlet",
      }],
    }],
    modelId: "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
    apiKey: process.env.CF_API_TOKEN,
    accountId: process.env.CF_ACCOUNT_ID,
  },
);
```

## API
### `extract(input, options)`

Main entry point. Accepts strings, URLs, or `Document[]`.
| Option | Default | Description |
|---|---|---|
| `promptDescription` | required | Task instructions for the LLM |
| `examples` | required | Few-shot examples |
| `modelId` | `"gemini-2.0-flash"` | Model identifier |
| `apiKey` | env var | Provider API key |
| `maxCharBuffer` | `1000` | Max characters per chunk |
| `batchLength` | `10` | Chunks per inference batch |
| `maxWorkers` | `10` | Concurrent requests |
| `extractionPasses` | `1` | Number of extraction passes |
| `contextWindowChars` | `0` | Cross-chunk context window (characters) |
| `formatType` | `"json"` | Output format (`"json"` or `"yaml"`) |
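For illustration, an options object tuned for long documents might combine the knobs above. The values below are examples, not the library's defaults:

```typescript
// Illustrative tuning: smaller chunks for tighter grounding, capped
// concurrency, and a second extraction pass for recall.
const options = {
  promptDescription: "Extract all medications with their dosage and frequency.",
  examples: [],             // few-shot examples, as in Quick Start
  modelId: "gemini-2.0-flash",
  maxCharBuffer: 500,       // max characters per chunk
  batchLength: 20,          // chunks per inference batch
  maxWorkers: 4,            // cap on concurrent requests
  extractionPasses: 2,      // an extra pass can improve recall
  contextWindowChars: 200,  // carry context across chunk boundaries
  formatType: "json",
};
```

Passing such an object as the second argument to `extract` follows the Quick Start pattern; which knobs matter most depends on document length and provider rate limits.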
## Chunking

```typescript
import { chunkDocument, createDocument } from "langextract-ts";

const doc = createDocument("Your long text here...");
for (const chunk of chunkDocument(doc, { maxCharBuffer: 500 })) {
  console.log(chunk.text, chunk.charInterval);
}
```

## Tokenization
```typescript
import { RegexTokenizer, UnicodeTokenizer } from "langextract-ts";

const tokenizer = new RegexTokenizer();
const { tokens } = tokenizer.tokenize("Hello world!");
// tokens: [{ text: "Hello", tokenType: "word", charInterval: { startPos: 0, endPos: 5 } }, ...]

// For CJK/international text:
const unicode = new UnicodeTokenizer();
const { tokens: cjkTokens } = unicode.tokenize("Hello 世界");
```

## Visualization
```typescript
import { visualize } from "langextract-ts";

const html = visualize(annotatedDocument, {
  title: "Medication Extraction",
  animationSpeed: 1500,
});
// Save `html` to a file and open it in a browser
```

## Custom Providers
```typescript
import { BaseLanguageModel, registerProvider } from "langextract-ts";

class MyProvider extends BaseLanguageModel {
  async *infer(prompts) {
    for (const prompt of prompts) {
      const response = await myApi.call(prompt);
      yield [{ output: response, score: 1.0 }];
    }
  }
}

registerProvider([/^my-model/], () => MyProvider, 20);
```

## Architecture
```
Input Text/URL
  -> Tokenization (RegexTokenizer or UnicodeTokenizer)
  -> Sentence-aware Chunking (3 strategies)
  -> Few-shot Prompt Construction
  -> Batched LLM Inference (concurrent with Semaphore)
  -> JSON Parsing + Extraction
  -> Two-phase Alignment (exact + fuzzy via SequenceMatcher)
  -> AnnotatedDocument with CharInterval positions
```

## Runtime Compatibility
| Runtime | Supported | Notes |
|---|---|---|
| Node.js 18+ | Yes | Full support |
| Cloudflare Workers | Yes | Web APIs only |
| Deno | Yes | V8-based |
| Bun | Yes | JavaScriptCore |
## License
Apache-2.0
This project is a derivative work of Google's LangExtract, originally licensed under Apache-2.0.
