garu-ko
v0.9.3
Published
Ultra-lightweight Korean morphological analyzer for the web (1MB model, WASM, F1 93.7% NIKL MP)
Maintainers
Readme
garu-ko
Browser-native Korean morphological analyzer. No server required.
- 1MB model bundled in npm package (no CDN needed)
- 332KB WASM engine (152KB gzipped) -- runs in any modern browser
- F1 93.29% on 9k human-verified gold testset (ep_norm), F1 93.7% on NIKL MP
- ~1ms inference per sentence
- Offline-ready -- works without network
- Live Demo -- try it in your browser
Live demos
Try it in the browser — every page below runs the analyzer 100% client-side via WASM.
- Morphological analyzer — interactive POS tagging with the full Sejong tagset
- Use cases — copy-paste recipes for search indexing, keyword extraction, chatbot preprocessing, and offline PWAs
- Orama integration — side-by-side comparison of Orama's default tokenizer vs.
garu-orama-tokenizer - MiniSearch integration — same comparison for MiniSearch
Quick Start
npm install garu-koimport { Garu } from 'garu-ko';
const garu = await Garu.load();
// Morphological analysis
const result = garu.analyze('배가 아파서 약을 먹었다');
console.log(result.tokens);
// [
// { text: '배', pos: 'NNG', start: 0, end: 2 },
// { text: '가', pos: 'JKS', start: 0, end: 2 },
// { text: '아프', pos: 'VA', start: 3, end: 6 },
// { text: '어서', pos: 'EC', start: 3, end: 6 },
// { text: '약', pos: 'NNG', start: 7, end: 9 },
// { text: '을', pos: 'JKO', start: 7, end: 9 },
// { text: '먹', pos: 'VV', start: 10, end: 13 },
// { text: '었', pos: 'EP', start: 10, end: 13 },
// { text: '다', pos: 'EF', start: 10, end: 13 },
// ]
// Simple tokenization
const tokens = garu.tokenize('나는 학교에 간다');
// ['나', '는', '학교', '에', '간다']
garu.destroy(); // free WASM memoryCustom Model
// Load from custom URL
const garu = await Garu.load({ modelUrl: '/models/custom.gmdl' });
// Load from ArrayBuffer
const res = await fetch('/models/custom.gmdl');
const garu = await Garu.load({ modelData: await res.arrayBuffer() });API
Garu.load(options?): Promise<Garu>
Initialize WASM and load model. Uses bundled model by default.
| Option | Type | Description |
|---|---|---|
| modelData | ArrayBuffer | Provide model bytes directly |
| modelUrl | string | Fetch model from URL |
garu.analyze(text, options?): AnalyzeResult
Returns morphological tokens with POS tags (Sejong tagset).
interface Token {
text: string; // surface form
pos: POS; // POS tag
start: number; // eojeol start offset
end: number; // eojeol end offset
}Set options.topN > 1 to get N-best results as an array. Note: topN > 1 is not yet fully supported and may return fewer results.
garu.nouns(text, options?): string[]
Extract nouns (NNG, NNP) from text. Set options.includeSL to also include foreign tokens (SL) like "AI", "BM25".
garu.nouns('인공지능 기술이 발전했다');
// ["인공", "지능", "기술", "발전"]
garu.nouns('AI 기술이 발전했다', { includeSL: true });
// ["AI", "기술", "발전"]garu.tokenize(text): string[]
Returns surface-form strings only. Lightweight alternative to analyze().
garu.destroy(): void
Free WASM memory. Instance is unusable after this call.
Integrations
Drop-in tokenizers for popular JS search libraries:
garu-orama-tokenizer— Korean tokenization for Oramagaru-minisearch-tokenizer— Korean tokenization for MiniSearch
Both solve the same problem: default tokenizers don't handle Korean particles or verb inflections, so "먹다" never matches "먹었다" and "학교" misses "학교에". These adapters run morphological analysis so the inflections fall off before indexing.
import { create, insert, search } from '@orama/orama'
import { createTokenizer } from 'garu-orama-tokenizer'
const db = await create({
schema: { title: 'string' },
components: { tokenizer: await createTokenizer() }
})
await insert(db, { title: '학교에서 점심을 먹었다' })
await search(db, { term: '먹다' }) // ← matchesAcknowledgments
The morphological analysis model is trained on the NIKL Morpheme-Tagged Corpus (v1.1) provided by the National Institute of Korean Language (국립국어원). The model contains only derived frequency statistics, not original text.
License
MIT
