papagan
v0.1.7
Published
Fast language detection for Node.js powered by Rust
Maintainers
Readme
papagan
Fast language detection for Node.js, powered by Rust (via napi-rs).
10 languages bundled, weighted per-word output, TypeScript types included.
Install
bun add papagan
# or
pnpm add papagan
# or
yarn add papagan
# or
npm install papaganPrebuilt binaries ship for Linux (x64, arm64 — glibc & musl), macOS (x64, arm64), and Windows (x64). Node.js 18+.
Quick start
JavaScript
const { Detector } = require('papagan')
const detector = new Detector()
// Document-level detection
const output = detector.detect('Die Katze sitzt auf der Matte')
const [lang, confidence] = output.top()
console.log(`${lang}: ${confidence.toFixed(3)}`)
// de: 0.996
// Full distribution
for (const [lang, score] of output.distribution()) {
console.log(` ${lang}: ${score.toFixed(3)}`)
}TypeScript
import { Detector, Lang, type LangCode } from 'papagan'
const detector = new Detector()
const [lang, score]: [LangCode, number] = detector.detect('Hello world').top()
if (lang === Lang.En) {
console.log(`English with ${score.toFixed(2)} confidence`)
}Per-word detail
const detailed = detector.detectDetailed('The cat is black. Die Katze ist schwarz.')
for (const word of detailed.words) {
const [topLang, topScore] = word.scores.reduce((a, b) => (a[1] > b[1] ? a : b))
console.log(` ${word.token.padEnd(10)} [${word.source}] ${topLang} (${topScore.toFixed(2)})`)
}
const [topLang, confidence] = detailed.aggregate.top()Batch detection
For multi-document workloads, detectBatch fans out across cores via rayon — ~3.5× faster than a for loop over detect() on 1000 Leipzig paragraphs (130 ms → 36 ms) on an 8-core M-series. Short titles see ~4–5× because dict-hit scoring amortizes rayon setup better.
const docs = ['The cat sat', 'Die Katze sitzt', 'Le chat est assis', 'El gato está sentado']
const results = detector.detectBatch(docs) // Output[]
const detailed = detector.detectDetailedBatch(docs) // Detailed[]
for (const o of results) console.log(o.top())Blocks the V8 thread for the duration. For request handlers or anywhere tail-latency on other work matters, prefer the async variant — it runs on libuv's thread pool so the event loop stays free:
const results = await detector.detectBatchAsync(docs) // Promise<Output[]>
const detailed = await detector.detectDetailedBatchAsync(docs) // Promise<Detailed[]>Throughput is essentially identical to sync; the async path pays a small per-call fixed cost but keeps the event loop responsive. Measured on a 1000-paragraph batch:
| | Wall time | Max event-loop stall |
|---|---:|---:|
| detectBatch (sync) | 34 ms | 35 ms (fully blocks event loop) |
| detectBatchAsync | 36 ms | 13 ms |
Run node examples/event-loop-latency.js in this repo to reproduce.
Batches of fewer than 4 inputs fall back through the per-call path, so there's no small-batch regression.
Restrict to specific languages
const detector = new Detector({ only: ['en', 'de'] })
// or via builder:
const detector = Detector.builder().only(['en', 'de']).build()Configuration
const detector = new Detector({
only: ['en', 'de', 'fr'], // restrict to a subset
unknownThreshold: 0.25, // below this => Lang.Unknown
parallelThreshold: 32, // parallelize per-word work at 32+ tokens (default)
// set parallelThreshold to a very large number to opt out of rayon entirely
})Both camelCase and snake_case are supported on builders and options (unknownThreshold or unknown_threshold, detectDetailed or detect_detailed, etc.) for ergonomic match to your codebase style.
Supported languages
| Code | Language | Code | Language |
|---|---|---|---|
| de | German | it | Italian |
| en | English | nl | Dutch |
| es | Spanish | pl | Polish |
| fr | French | pt | Portuguese |
| ru | Russian | tr | Turkish |
All 10 languages bundled — no build-time configuration.
Benchmarks
Measured on Darwin arm64, 2026-04-22. Open fixtures: Tatoeba sentences (CC-BY 2.0 FR) and Leipzig news paragraphs (CC-BY 4.0). ns/tok is the per-token rate. Full cross-binding matrix (including Python competitor comparison) in the repository README.
| Tokens | Bytes | Loop (ms) | Loop (ns/tok) | Batch sync (ms) | Batch (ns/tok) | Batch async (ms) | |---:|---:|---:|---:|---:|---:|---:| | 35k | 222 KB | 49.90 | 1 438 | 30.06 | 866 | 27.17 | | 87k | 620 KB | 91.58 | 1 057 | 30.65 | 354 | 28.21 |
detectBatch fans out across cores via rayon (2–4× over a for detect() loop). detectBatchAsync runs on libuv's thread pool — same wall time as sync but the V8 event loop stays responsive (max stall drops from ~35 ms → ~11 ms on a 1000-paragraph batch; see examples/event-loop-latency.js).
Accuracy
99.42 % on Tatoeba (5,000 sentences) and 99.86 % on FLORES-200 devtest (10,120 sentences) across the 10 supported languages. Full per-language precision/recall table in the repository README.
License
Dual-licensed under MIT or Apache-2.0, at your option.
Related
- Rust crate — the core library
- Python package — Python bindings
- GitHub — source, issues, development
