booknlp-ts v1.1.0
TypeScript port of the entity, event, and supersense extraction pipelines of BookNLP (spaCy preprocessing excluded)
BookNLP TypeScript Library
Browser-compatible TypeScript implementation of BookNLP for client-side NLP inference on long documents. This library provides complete entity recognition, supersense tagging, and event detection using pre-converted ONNX models running entirely in the browser via WebAssembly.
Quick Start
Installation
```bash
npm install booknlp-ts
```

Usage
```typescript
import { BookNLP, SpaCyContext } from 'booknlp-ts';

const config = {
  // Optional: uses Hugging Face model by default
  // modelPath: 'Terraa/entities_google_bert_uncased_L-4_H-256_A-4-v1.0-ONNX',
  pipeline: ['entity', 'supersense', 'event'],
  // Optional: specify execution providers (default: ['wasm'])
  executionProviders: ['wasm'], // or ['webgl'], ['webgpu']
};

const booknlp = new BookNLP();
await booknlp.initialize(config);

// SpaCyContext must be provided (from spaCy preprocessing)
const spaCyContext: SpaCyContext = {
  tokens: [
    {
      text: 'Harry',
      startByte: 0,
      endByte: 5,
      pos: 'PROPN',
      finePos: 'NNP',
      lemma: 'Harry',
      deprel: 'nsubj',
      dephead: 1,
      morph: {},
      likeNum: false,
      isStop: false,
      sentenceId: 0,
      withinSentenceId: 0,
    },
    // ... more tokens
  ],
  sentences: [{ start: 0, end: 10 }],
};

const result = await booknlp.process(spaCyContext);
console.log('Entities:', result.entities);
console.log('Supersense:', result.supersense);
console.log('Events:', result.tokens.filter(t => t.event));
```

Browser Deployment
Bundled Resources
The library automatically bundles resource files (entity tagset, supersense tagset, WordNet) from the source repository. No external network requests needed for resources.
WASM Configuration
For custom WASM paths (advanced usage):
```typescript
const config = {
  pipeline: ['entity'],
  wasmPaths: {
    'ort-wasm.wasm': '/custom/path/to/ort-wasm.wasm',
    'ort-wasm-simd.wasm': '/custom/path/to/ort-wasm-simd.wasm',
  },
};
```

External Resource URLs
If you prefer to host resources externally:
```typescript
const config = {
  pipeline: ['entity', 'supersense'],
};
```

Required Input: SpaCyContext
The TypeScript implementation requires pre-processed input from spaCy:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Your text here")

spacy_context = {
    "tokens": [
        {
            "text": token.text,
            "startByte": token.idx,
            "endByte": token.idx + len(token.text),
            "pos": token.pos_,
            "finePos": token.tag_,
            "lemma": token.lemma_,
            "deprel": token.dep_,
            "dephead": token.head.i,
            "morph": {k: v for k, v in token.morph.to_dict().items()},
            "likeNum": token.like_num,
            "isStop": token.is_stop,
            "sentenceId": token.sent.start,
            "withinSentenceId": token.i - token.sent.start
        }
        for token in doc
    ],
    "sentences": [
        {"start": sent.start, "end": sent.end}
        for sent in doc.sents
    ]
}
```

Validation Results
The validation suite compares Python and TypeScript outputs on identical input. Expected results:
| Metric | Status | Notes |
| ----------------- | ---------------- | --------------------------- |
| Token count | ✅ Exact match | All tokens preserved |
| Token text | ✅ Exact match | Character-level accuracy |
| Entity spans | ✅ Match expected | Boundaries align |
| Entity categories | ✅ Match expected | PER, LOC, FAC, etc. |
| Supersense spans | ✅ Match expected | Annotation boundaries align |
| Event markers | ✅ Match expected | Token-level event flags |
Running Validation
```bash
cd validation
./run_validation.sh
```

This automatically:
- Checks dependencies
- Builds TypeScript
- Runs Python BookNLP (generates baseline)
- Runs TypeScript BookNLP (with same input)
- Compares outputs (reports mismatches)
Known Limitations
- No bundled model: the ONNX model is fetched at runtime (from Hugging Face by default) or must be hosted at a URL you provide
- Requires SpaCy preprocessing: Cannot process raw text directly
- CPU-focused: CUDA support exists but not extensively tested
- No quote/coreference handling: Future enhancement
Future Enhancements
- [ ] Direct text input (integrate spaCy in TypeScript)
- [ ] WebAssembly optimization for browser use
- [ ] Quote and coreference chain extraction
- [ ] Big model variant support
- [ ] Streaming/incremental processing for very long documents
- [ ] GPU optimization and multi-GPU support
Comparison: Python vs TypeScript
| Feature | Python | TypeScript (Browser) | Notes |
| ------------------- | ------------ | -------------------- | --------------------------- |
| Entity recognition | ✅ | ✅ | Equivalent |
| Supersense tagging | ✅ | ✅ | Equivalent |
| Event detection | ✅ | ✅ | Equivalent |
| ONNX inference | ✅ | ✅ | Equivalent |
| SpaCy preprocessing | ✅ Integrated | ⚠️ External | Requires external spaCy |
| Raw text input | ✅ | ❌ | Requires SpaCy context |
| Model conversion | ✅ Native | ⚠️ Via Python | ONNX export |
| Deployment | Server-side | Browser/Client-side | No server required |
| GPU acceleration | CUDA | WebGL/WebGPU | Different acceleration tech |
Key Features
Type-Safe Interfaces
All data structures have complete TypeScript type definitions with validation:
```typescript
interface SpaCyToken {
  text: string;
  startByte: number;
  endByte: number;
  pos: string;      // Coarse POS (NOUN, VERB, etc.)
  finePos: string;  // Fine POS (NN, VBD, etc.)
  lemma: string;
  deprel: string;   // Dependency relation
  dephead: number;  // Head token index
  morph: Record<string, string>;
  likeNum: boolean;
  isStop: boolean;
  sentenceId: number;
  withinSentenceId: number;
}

interface EntityAnnotation {
  startToken: number;
  endToken: number;
  cat: string;   // PER, LOC, FAC, GPE, ORG, VEH
  text: string;
  prop: string;  // PROP, NOM, PRON
}

type SupersenseAnnotation = [number, number, string, string];
// [startToken, endToken, category, text]
```

ONNX Inference
Complete integration with ONNX Runtime:
```typescript
// Automatic tensor creation with correct shapes and types
const predictions = await controller.predict(
  inputIds,       // int64[batch, seq_len]
  attentionMask,  // int64[batch, seq_len]
  transforms,     // float32[batch, seq_len, seq_len]
  matrix1,        // float32[batch, seq_len, seq_len]
  matrix2,        // float32[batch, seq_len, seq_len]
  wn,             // int64[batch, seq_len]
  seqLengths,     // int64[batch]
  doEntity,
  doSupersense,
  doEvent
);
```

The ONNX model outputs final predictions (already CRF-decoded), not logits or emissions, so no Viterbi decoding is needed in TypeScript.
Entity Recognition
Hierarchical 3-layer entity detection with proper BIO tag fixing:
```typescript
// Automatically handles:
// - Invalid BIO sequences (I-PER without B-PER)
// - Entity type classification (PROP_PER → PER)
// - Hierarchical merging (3 LSTM layers)
// - Overlapping entity resolution
const entities = await tagger.tag(tokens, spaCyTokens, true, false, false);
// Returns: EntityAnnotation[] with startToken, endToken, cat, text, prop
```

Supersense Tagging
WordNet-based semantic annotation:
```typescript
// Uses WordNet first sense mappings
// Categories: noun.person, noun.location, verb.communication, etc.
const supersense = await tagger.tag(tokens, spaCyTokens, false, true, false);
// Returns: [startToken, endToken, category, text][]
```

Event Detection
Token-level event markers:
```typescript
const events = await tagger.tag(tokens, spaCyTokens, false, false, true);
// Returns: Set<tokenId> of tokens that are events
```

Architecture
ONNX Model Architecture
Data Flow
```text
SpaCy Context (input)
        ↓
Token Conversion
        ↓
Entity Tagger
  ├─→ Tokenization (with [CAP] tokens)
  ├─→ Transform Matrix Creation
  ├─→ WordNet Sense Lookup
  ├─→ ONNX Inference (predictions)
  ├─→ Postprocessing (BIO fixing)
  └─→ Entity Extraction
        ↓
BookNLP Result (output)
```

Configuration
BookNLPConfig
```typescript
interface BookNLPConfig {
  modelPath?: string; // Optional: Hugging Face repo ID or URL
  // Default: 'Terraa/entities_google_bert_uncased_L-4_H-256_A-4-v1.0-ONNX'
  pipeline: string[]; // ['entity', 'supersense', 'event']
  verbose?: boolean;  // Logging verbosity
  executionProviders?: ExecutionProvider[]; // ['wasm', 'webgl', 'webgpu']
  wasmPaths?: string | Record<string, string>; // Custom WASM paths
}
```

Execution Providers
Choose the best backend for your deployment:
- `wasm` (default): universal compatibility, works in all browsers
- `webgl`: GPU acceleration via WebGL (faster inference)
- `webgpu`: next-generation GPU acceleration (Chrome/Edge 113+, best performance)
```typescript
const config = {
  pipeline: ['entity'],
  executionProviders: ['webgpu', 'wasm'], // Try WebGPU, fall back to WASM
};
```

Model Loading Options
Option 1: Use Hugging Face (Automatic Download, Default)
```typescript
const config = {
  pipeline: ['entity', 'supersense'],
};
// Automatically downloads from Hugging Face and caches in the browser
```

Option 2: Specify Hugging Face Repository
```typescript
const config = {
  modelPath: 'Terraa/entities_google_bert_uncased_L-4_H-256_A-4-v1.0-ONNX',
  pipeline: ['entity', 'supersense'],
};
```

Option 3: Use Custom URL
```typescript
const config = {
  modelPath: 'https://your-cdn.com/model.onnx',
  pipeline: ['entity', 'supersense'],
};
```

Input Requirements
SpaCy Preprocessing
The TypeScript implementation requires complete linguistic annotations from spaCy. Here's how to generate the required input in Python:
```python
import spacy
import json

nlp = spacy.load("en_core_web_sm")
text = "Harry Potter walked through the castle."
doc = nlp(text)

spacy_context = {
    "tokens": [
        {
            "text": token.text,
            "startByte": token.idx,
            "endByte": token.idx + len(token.text),
            "pos": token.pos_,
            "finePos": token.tag_,
            "lemma": token.lemma_,
            "deprel": token.dep_,
            "dephead": token.head.i,
            "morph": {str(k): str(v) for k, v in token.morph.to_dict().items()},
            "likeNum": token.like_num,
            "isStop": token.is_stop,
            "sentenceId": token.sent.start,
            "withinSentenceId": token.i - token.sent.start
        }
        for token in doc
    ],
    "sentences": [
        {"start": sent.start, "end": sent.end}
        for sent in doc.sents
    ]
}

# Save to JSON for TypeScript
with open('spacy_context.json', 'w') as f:
    json.dump(spacy_context, f)
```

Then in TypeScript:
```typescript
import * as fs from 'fs';

const spaCyContext = JSON.parse(
  fs.readFileSync('spacy_context.json', 'utf-8')
);
const result = await booknlp.process(spaCyContext);
```

Required Fields
All SpaCyToken fields are required:
- `text`: token text
- `startByte`, `endByte`: character offsets
- `pos`, `finePos`: POS tags
- `lemma`: lemmatized form
- `deprel`, `dephead`: dependency parse
- `morph`: morphological features (can be an empty object)
- `likeNum`, `isStop`: token properties
- `sentenceId`, `withinSentenceId`: sentence information
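Because every field is required, a small runtime guard can catch malformed input before it reaches inference. The `missingTokenFields` helper below is a hypothetical sketch, not part of the booknlp-ts API:

```typescript
// Hypothetical helper (not part of booknlp-ts): checks that an object
// carries every required SpaCyToken field with the expected type.
type FieldKind = 'string' | 'number' | 'boolean' | 'object';

const REQUIRED_FIELDS: Record<string, FieldKind> = {
  text: 'string',
  startByte: 'number',
  endByte: 'number',
  pos: 'string',
  finePos: 'string',
  lemma: 'string',
  deprel: 'string',
  dephead: 'number',
  morph: 'object', // may be an empty object
  likeNum: 'boolean',
  isStop: 'boolean',
  sentenceId: 'number',
  withinSentenceId: 'number',
};

function missingTokenFields(token: Record<string, unknown>): string[] {
  return Object.entries(REQUIRED_FIELDS)
    .filter(([key, kind]) => typeof token[key] !== kind)
    .map(([key]) => key);
}

// Example: a token missing `lemma`, with `dephead` given as a string
const bad = { text: 'Harry', startByte: 0, endByte: 5, pos: 'PROPN',
  finePos: 'NNP', deprel: 'nsubj', dephead: '1', morph: {},
  likeNum: false, isStop: false, sentenceId: 0, withinSentenceId: 0 };
console.log(missingTokenFields(bad)); // → [ 'lemma', 'dephead' ]
```

Running this over `spaCyContext.tokens` before calling `process` turns a confusing inference failure into an immediate, named error.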
Output Format
BookNLPResult
```typescript
interface BookNLPResult {
  tokens: Token[];                    // Annotated tokens
  sents: any[];                       // Sentence info (future)
  nounChunks: any[];                  // Noun chunks (future)
  entities: EntityAnnotation[];       // Detected entities
  supersense: SupersenseAnnotation[]; // Supersense annotations
  timing: Record<string, number>;     // Performance metrics
}
```

Entity Format
```typescript
{
  startToken: 0,
  endToken: 2,
  cat: "PER",   // Entity category
  text: "Harry Potter",
  prop: "PROP"  // PROP, NOM, or PRON
}
```

Categories: `PER`, `LOC`, `FAC`, `GPE`, `ORG`, `VEH`
Supersense Format
```typescript
[0, 2, "noun.person", "Harry Potter"]
// [startToken, endToken, category, text]
```

Categories include:
- Nouns: `noun.person`, `noun.location`, `noun.artifact`, `noun.cognition`, etc.
- Verbs: `verb.communication`, `verb.motion`, `verb.cognition`, etc.
Batch Processing
The TypeScript implementation uses sentence-level batch processing (matching Python behavior):
```typescript
// Automatically batches sentences for efficient processing
// - Groups tokens into sentence batches (max 500 tokens per batch)
// - Processes up to 32 batches in parallel
// - Reconstructs results with correct token offsets
const result = await booknlp.process(spaCyContext);
// All batching is handled internally
```

Why Batch Processing?
- Memory efficiency: Processes long documents in manageable chunks
- Performance: Parallel processing of multiple sentences
- ONNX compatibility: Matches Python implementation's batching strategy
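The batching strategy described above (whole sentences packed into batches under a token budget) can be approximated as follows. The 500-token cap comes from the text; the function itself is an illustrative sketch, not the library's internal code:

```typescript
interface Sentence { start: number; end: number; } // token indices, end exclusive

// Greedily pack whole sentences into batches of at most maxTokens tokens.
// A sentence longer than maxTokens still gets a batch of its own.
function packSentences(sentences: Sentence[], maxTokens = 500): Sentence[][] {
  const batches: Sentence[][] = [];
  let current: Sentence[] = [];
  let used = 0;
  for (const s of sentences) {
    const len = s.end - s.start;
    if (current.length > 0 && used + len > maxTokens) {
      batches.push(current); // close the current batch before it overflows
      current = [];
      used = 0;
    }
    current.push(s);
    used += len;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const batches = packSentences(
  [{ start: 0, end: 300 }, { start: 300, end: 550 }, { start: 550, end: 700 }]
);
console.log(batches.length); // → 2 (300 tokens alone, then 250 + 150)
```

Each batch can then be run through inference independently, which keeps peak memory bounded on long documents.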
Dependencies
Runtime Dependencies
```jsonc
{
  "@huggingface/transformers": "^2.6.0", // BERT tokenization
  "onnxruntime-web": "^1.16.0"           // ONNX inference (browser)
}
```

Development Dependencies
```jsonc
{
  "@types/node": "^20.0.0",
  "typescript": "^5.0.0",
  "vite": "^5.0.0",            // Build tool
  "vite-plugin-dts": "^3.0.0", // TypeScript declarations
  "eslint": "^8.0.0"
}
```

Build and Development
Development Mode
```bash
npm run dev
# Watches for changes and rebuilds automatically
```

Production Build
```bash
npm run build
# Compiles TypeScript and bundles with Vite
# Output: dist/booknlp.js (ES module), dist/booknlp.umd.cjs (UMD)
```

Using in Your Project
ES Modules (Recommended)
```typescript
import { BookNLP } from 'booknlp-ts';
```

UMD (Script tag)
```html
<script src="node_modules/booknlp-ts/dist/booknlp.umd.cjs"></script>
<script>
  const booknlp = new window.BookNLP.BookNLP();
</script>
```

License
Same as BookNLP (Python version).
References
- BookNLP Python: https://github.com/dbamman/book-nlp
- ONNX Runtime: https://onnxruntime.ai/
- Transformers.js: https://xenova.github.io/transformers.js/
- BERT: https://arxiv.org/abs/1810.04805
- CRF: https://en.wikipedia.org/wiki/Conditional_random_field
