tokenize-is
v0.2.0
TypeScript tokenizer for Icelandic text. A port of Miðeind's Tokenizer.
Installation
npm install tokenize-is
# or
pnpm add tokenize-is
Usage
import { tokenize, splitIntoSentences } from "tokenize-is";
// Tokenize text
const tokens = tokenize("Kl. 14:30 komu 100 gestir.");
for (const token of tokens) {
  if (token.kind === "word") {
    console.log(token.text);
  } else if (token.kind === "number") {
    console.log(token.value); // parsed number
  }
}
// Split into sentences
const sentences = splitIntoSentences("Fyrst. Síðan.");
// → ["Fyrst.", "Síðan."]
Token Types
All tokens have a kind discriminator for TypeScript narrowing:
| Kind | Description | Parsed Fields |
| -------------- | ----------------------------------- | -------------------------- |
| word | Words | text |
| number | Numbers (Icelandic/English formats) | value |
| ordinal | Ordinal numbers (1., XVII.) | value |
| time | Time (14:30, kl. tvö) | hour, minute, second |
| date | ISO dates | year, month, day |
| dateabs | Absolute dates (17. júní 1944) | year, month, day |
| daterel | Relative dates (3. janúar) | month, day |
| year | Four-digit years | value |
| amount | Currency amounts (100 kr.) | value, currency |
| currency | Currency codes/symbols | iso |
| measurement | Values with units (5km, 220V) | value, unit |
| percent | Percentages | value |
| url | URLs | text |
| domain | Domain names | text |
| email | Email addresses | text |
| hashtag | Hashtags (#iceland) | text |
| username | @mentions | username |
| numwletter | Number+letter (14b, 33C) | value, letter |
| telno | Phone numbers | cc, number |
| molecule | Chemical formulas (H2O) | text |
| ssn | Icelandic kennitala | value |
| serialnumber | Serial numbers | text |
| timestamp | Date+time combined | year..second |
| punctuation | Punctuation | normalized, position |
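The `kind` discriminator lets the TypeScript compiler narrow a token to the fields listed in the table. The sketch below does not import the library: `SketchToken` is a hypothetical, hand-written subset modelled on the `word`, `number`, `time`, and `amount` rows, and the real exported types may differ (for example in field types or optionality).

```typescript
// Hypothetical subset of the token union, modelled on the table above.
// These are NOT the library's exported types; field types are assumed.
type SketchToken =
  | { kind: "word"; text: string }
  | { kind: "number"; value: number }
  | { kind: "time"; hour: number; minute: number; second: number }
  | { kind: "amount"; value: number; currency: string };

// Zero-pad a clock component to two digits.
function pad2(n: number): string {
  return n < 10 ? "0" + String(n) : String(n);
}

// Narrow on `kind`; inside each case only that variant's fields are visible.
function describeToken(token: SketchToken): string {
  switch (token.kind) {
    case "word":
      return `word(${token.text})`;
    case "number":
      return `number(${token.value})`;
    case "time":
      return `time(${pad2(token.hour)}:${pad2(token.minute)}:${pad2(token.second)})`;
    case "amount":
      return `amount(${token.value} ${token.currency})`;
    default: {
      // Exhaustiveness check: this fails to compile if a variant is unhandled.
      const unreachable: never = token;
      return unreachable;
    }
  }
}

// Hand-built tokens standing in for tokenizer output.
const sample: SketchToken[] = [
  { kind: "time", hour: 14, minute: 30, second: 0 },
  { kind: "word", text: "komu" },
  { kind: "number", value: 100 },
];
const described: string[] = sample.map(describeToken);
```

The `never` assignment in the `default` branch is a common way to make the compiler flag any kind that a `switch` forgets to handle.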
Options
tokenize(text, {
  replaceCompositeGlyphs: true, // Normalize Unicode (a + ́ → á)
  includeSentenceMarkers: false, // Add s_begin/s_end tokens
  includeOffsets: false, // Add span.start/end character offsets
});
Token Offsets
When includeOffsets: true, each token includes a span with character positions:
const tokens = tokenize("Halló heimur", { includeOffsets: true });
// tokens[0] = { kind: "word", text: "Halló", span: { start: 0, end: 5 } }
// tokens[1] = { kind: "word", text: "heimur", span: { start: 6, end: 12 } }
// Extract original text from spans
const text = "Halló heimur";
text.slice(tokens[0].span.start, tokens[0].span.end); // "Halló"
Port Fidelity
This is a TypeScript port of Miðeind's Tokenizer (MIT licensed).
Supported
- All 30 token types from the original
- Sentence boundary detection with abbreviation awareness
- Unicode normalization (composite glyphs)
- Icelandic number formats (1.234,56)
- Spelled-out time expressions (hálftvö → 1:30)
- ~100 Icelandic abbreviations
- 70+ SI units, 18+ currencies
- Kennitala (SSN) validation with checksum
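Two of the items above, the Icelandic number format and the kennitala checksum, can be illustrated in isolation. This is a minimal sketch and not the library's implementation: `parseIcelandicNumber` and `hasValidKennitalaChecksum` are hypothetical names, and the checksum function verifies only the check digit (the ninth digit, using the weights 3, 2, 7, 6, 5, 4, 3, 2), not whether the birth date is plausible.

```typescript
// Parse a number in Icelandic notation: "." groups thousands and "," is the
// decimal separator. Returns null when the input does not match.
function parseIcelandicNumber(input: string): number | null {
  // Either properly grouped digits (1.234.567) or plain digits (1234),
  // optionally followed by a decimal part.
  if (!/^(\d{1,3}(\.\d{3})+|\d+)(,\d+)?$/.test(input)) return null;
  return Number(input.replace(/\./g, "").replace(",", "."));
}

// Verify the kennitala check digit. Accepts "DDMMYY-NNNN" or ten bare digits.
function hasValidKennitalaChecksum(input: string): boolean {
  const digits = input.replace("-", "");
  if (!/^\d{10}$/.test(digits)) return false;

  // Weighted sum over the first eight digits.
  const weights = [3, 2, 7, 6, 5, 4, 3, 2];
  let sum = 0;
  weights.forEach((weight, i) => {
    sum += weight * Number(digits.charAt(i));
  });

  // The check digit is 11 - (sum mod 11), with 11 mapped to 0.
  // A result of 10 can never equal a single digit, so such numbers fail.
  const expected = (11 - (sum % 11)) % 11;
  return expected === Number(digits.charAt(8));
}

const parsed = parseIcelandicNumber("1.234,56");
const checksumOk = hasValidKennitalaChecksum("120174-3399");
```

Here `120174-3399` is an example number whose digits satisfy the checksum: the weighted sum is 79, and 11 - (79 mod 11) = 9, which matches the ninth digit.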
Not Yet Implemented
- detokenize() - reconstruct text from tokens
- correct_spaces() - fix spacing between tokens
- paragraphs() / mark_paragraphs() - paragraph handling
- HTML entity unescaping (&aacute; → á)
- Full abbreviation list (300+ in original vs ~100 here)
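Until `detokenize()` is ported, the `span` offsets described under Token Offsets offer one way to recover original text, spacing included, by slicing the input string. The sketch below is an illustration under assumptions, not part of the library: `SpannedToken` and `sourceRange` are hypothetical names, and the tokens are written by hand in the shape shown in the offsets example rather than produced by `tokenize`.

```typescript
// Hypothetical minimal shape: only the span fields from the offsets example.
interface SpannedToken {
  kind: string;
  span: { start: number; end: number };
}

// Return the slice of `text` covered by a run of tokens, including whatever
// whitespace originally separated them. Assumes tokens are in source order.
function sourceRange(text: string, tokens: SpannedToken[]): string {
  const first = tokens[0];
  const last = tokens[tokens.length - 1];
  if (first === undefined || last === undefined) return "";
  return text.slice(first.span.start, last.span.end);
}

// Hand-written spans for a string with a double space after "komu".
const original = "komu  100 gestir";
const spanned: SpannedToken[] = [
  { kind: "word", span: { start: 0, end: 4 } },
  { kind: "number", span: { start: 6, end: 9 } },
  { kind: "word", span: { start: 10, end: 16 } },
];

// Recovers the first two tokens with their original spacing intact.
const firstTwo = sourceRange(original, spanned.slice(0, 2));
```

This only works when the original input string is still available; rebuilding text from tokens alone is what `detokenize()` and `correct_spaces()` do in the original Tokenizer.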
Design Differences
- ESM-only (no CommonJS)
- Returns arrays instead of generators
- Discriminated unions instead of numeric token codes
- Zero runtime dependencies
Development
pnpm install
pnpm test # Run tests
pnpm build # Build with tsdown
pnpm check # Lint + format + typecheck
License
MIT - same as the original Tokenizer.
