tokenize-is

v0.2.0

Published

10 days ago

TypeScript tokenizer for Icelandic text

jokullsolberg

icelandic nlp sentence-splitting text-processing tokenizer

tokenize-is

TypeScript tokenizer for Icelandic text. A port of Miðeind's Tokenizer.

Installation

npm install tokenize-is
# or
pnpm add tokenize-is

Usage

import { tokenize, splitIntoSentences } from "tokenize-is";

// Tokenize text
const tokens = tokenize("Kl. 14:30 komu 100 gestir.");
for (const token of tokens) {
  if (token.kind === "word") {
    console.log(token.text);
  } else if (token.kind === "number") {
    console.log(token.value); // parsed number
  }
}

// Split into sentences
const sentences = splitIntoSentences("Fyrst. Síðan.");
// → ["Fyrst.", "Síðan."]

Token Types

All tokens have a kind discriminator for TypeScript narrowing:

| Kind | Description | Parsed Fields |
| --- | --- | --- |
| word | Words | text |
| number | Numbers (Icelandic/English formats) | value |
| ordinal | Ordinal numbers (1., XVII.) | value |
| time | Time (14:30, kl. tvö) | hour, minute, second |
| date | ISO dates | year, month, day |
| dateabs | Absolute dates (17. júní 1944) | year, month, day |
| daterel | Relative dates (3. janúar) | month, day |
| year | Four-digit years | value |
| amount | Currency amounts (100 kr.) | value, currency |
| currency | Currency codes/symbols | iso |
| measurement | Values with units (5km, 220V) | value, unit |
| percent | Percentages | value |
| url | URLs | text |
| domain | Domain names | text |
| email | Email addresses | text |
| hashtag | Hashtags (#iceland) | text |
| username | @mentions | username |
| numwletter | Number+letter (14b, 33C) | value, letter |
| telno | Phone numbers | cc, number |
| molecule | Chemical formulas (H2O) | text |
| ssn | Icelandic kennitala | value |
| serialnumber | Serial numbers | text |
| timestamp | Date+time combined | year..second |
| punctuation | Punctuation | normalized, position |
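The kind discriminator lets a switch statement narrow each token to its own fields. The sketch below is self-contained and does not import the package: the SketchToken types are hypothetical stand-ins covering four of the kinds listed above, and the field types (numbers for hour, minute, second and value, a string for currency) are assumptions rather than the library's published typings.

```typescript
// Hypothetical local types mirroring a few token kinds; the real package
// exports its own token type, which may differ in detail.
type WordToken = { kind: "word"; text: string };
type NumberToken = { kind: "number"; value: number };
type TimeToken = { kind: "time"; hour: number; minute: number; second: number };
type AmountToken = { kind: "amount"; value: number; currency: string };
type SketchToken = WordToken | NumberToken | TimeToken | AmountToken;

function describeToken(token: SketchToken): string {
  switch (token.kind) {
    case "word":
      return `word(${token.text})`; // narrowed: only text is available
    case "number":
      return `number(${token.value})`;
    case "time": {
      const mm = token.minute < 10 ? `0${token.minute}` : String(token.minute);
      return `time(${token.hour}:${mm})`;
    }
    case "amount":
      return `amount(${token.value} ${token.currency})`;
    default: {
      // Exhaustiveness check: compilation fails if a kind is left unhandled.
      const unreachable: never = token;
      return unreachable;
    }
  }
}

const sample: SketchToken[] = [
  { kind: "time", hour: 14, minute: 30, second: 0 },
  { kind: "word", text: "komu" },
  { kind: "number", value: 100 },
];
const described = sample.map(describeToken);
```

The never assignment in the default branch is a common way to have the compiler flag any kind that is added to the union later but not handled in the switch.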

Options

tokenize(text, {
  replaceCompositeGlyphs: true, // Normalize Unicode (a + ́ → á)
  includeSentenceMarkers: false, // Add s_begin/s_end tokens
  includeOffsets: false, // Add span.start/end character offsets
});
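To illustrate what composite-glyph replacement means, the following sketch composes a base letter and a combining accent with the standard String.prototype.normalize method. Whether tokenize-is uses NFC normalization internally or its own mapping table is an assumption here; the snippet only shows the effect described in the option comment.

```typescript
// "o" followed by U+0301 COMBINING ACUTE ACCENT: two code units, one glyph.
const decomposed = "Hallo\u0301";

// NFC composes the pair into the single precomposed character U+00F3 (ó).
const composed = decomposed.normalize("NFC");

const lengthBefore = decomposed.length; // 6 UTF-16 code units
const lengthAfter = composed.length; // 5 UTF-16 code units
const matchesPrecomposed = composed === "Hall\u00F3";
```

Because composition changes string length, offsets computed on normalized text may not line up with the raw input if the raw input contained decomposed sequences.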

Token Offsets

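When includeOffsets is true, each token carries a span whose start and end are character offsets into the input string (end is exclusive, so the pair can be passed straight to slice):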

const tokens = tokenize("Halló heimur", { includeOffsets: true });
// tokens[0] = { kind: "word", text: "Halló", span: { start: 0, end: 5 } }
// tokens[1] = { kind: "word", text: "heimur", span: { start: 6, end: 12 } }

// Extract original text from spans
const text = "Halló heimur";
text.slice(tokens[0].span.start, tokens[0].span.end); // "Halló"
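Spans make it possible to annotate the original text without re-tokenizing. The sketch below wraps each token's source range in brackets while leaving the text between tokens untouched. It uses hand-written tokens shaped like the output shown above instead of calling the library; SpannedToken and bracketTokens are hypothetical names, and the code assumes tokens arrive in source order with non-overlapping spans.

```typescript
type Span = { start: number; end: number };
type SpannedToken = { kind: string; text: string; span: Span };

// Wrap every token's source range in brackets, copying the gaps verbatim.
function bracketTokens(input: string, tokens: SpannedToken[]): string {
  let out = "";
  let cursor = 0;
  for (const { span } of tokens) {
    out += input.slice(cursor, span.start); // text between tokens (whitespace)
    out += `[${input.slice(span.start, span.end)}]`;
    cursor = span.end;
  }
  return out + input.slice(cursor); // trailing text after the last token
}

const sourceText = "Halló heimur";
const spanned: SpannedToken[] = [
  { kind: "word", text: "Halló", span: { start: 0, end: 5 } },
  { kind: "word", text: "heimur", span: { start: 6, end: 12 } },
];
const bracketed = bracketTokens(sourceText, spanned);
// bracketed === "[Halló] [heimur]"
```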

Port Fidelity

This is a TypeScript port of Miðeind's Tokenizer (MIT licensed).

Supported

All 30 token types from the original
Sentence boundary detection with abbreviation awareness
Unicode normalization (composite glyphs)
Icelandic number formats (1.234,56)
Spelled-out time expressions (hálftvö → 1:30)
~100 Icelandic abbreviations
70+ SI units, 18+ currencies
Kennitala (SSN) validation with checksum
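As an illustration of the kennitala checksum mentioned in the list above, here is a minimal sketch of the commonly documented modulus-11 scheme: the first eight digits are weighted 3, 2, 7, 6, 5, 4, 3, 2, and the ninth digit is the check digit. The function name hasValidKennitalaChecksum is hypothetical, and the package's own validation may apply further rules (for example on the date part or the century digit) that are not shown here.

```typescript
const KENNITALA_WEIGHTS = [3, 2, 7, 6, 5, 4, 3, 2];

// Hypothetical helper: checks only the modulus-11 check digit.
function hasValidKennitalaChecksum(input: string): boolean {
  const digits = input.replace(/-/g, ""); // accept "DDMMYY-NNCM" or 10 bare digits
  if (!/^\d{10}$/.test(digits)) return false;

  // Weighted sum over the first eight digits.
  const sum = KENNITALA_WEIGHTS.reduce(
    (acc, weight, i) => acc + weight * Number(digits.charAt(i)),
    0,
  );

  const remainder = sum % 11;
  const expected = remainder === 0 ? 0 : 11 - remainder;
  if (expected === 10) return false; // no single digit can satisfy the check

  return expected === Number(digits.charAt(8)); // ninth digit is the check digit
}

const okChecksum = hasValidKennitalaChecksum("120174-3399");
const badChecksum = hasValidKennitalaChecksum("120174-3389");
```

For the first sample the weighted sum is 79, 79 mod 11 is 2, and 11 - 2 gives the check digit 9, so okChecksum is true; the second sample has 8 in the check position, so badChecksum is false.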

Not Yet Implemented

detokenize() - reconstruct text from tokens
correct_spaces() - fix spacing between tokens
paragraphs() / mark_paragraphs() - paragraph handling
HTML entity unescaping (&aacute; → á)
Full abbreviation list (300+ in original vs ~100 here)
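Since HTML entity unescaping is not yet implemented, one possible workaround is to decode entities before passing text to the tokenizer. The helper below is a hypothetical sketch, not part of tokenize-is: it handles decimal and hexadecimal numeric entities plus a small hand-picked set of named ones, and leaves anything it does not recognize untouched.

```typescript
// Hypothetical pre-processing helper; the entity table is deliberately tiny.
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'],
  ["aacute", "á"], ["eth", "ð"], ["thorn", "þ"], ["aelig", "æ"], ["ouml", "ö"],
]);

function unescapeBasicEntities(text: string): string {
  return text.replace(
    /&(#x[0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g,
    (match: string, body: string): string => {
      if (body.startsWith("#")) {
        // Numeric entity: hexadecimal (&#xE1;) or decimal (&#225;).
        const isHex = body.startsWith("#x");
        const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Leave out-of-range values as-is instead of throwing.
        return code <= 0x10ffff ? String.fromCodePoint(code) : match;
      }
      // Named entity: fall back to the original text when unknown.
      return NAMED_ENTITIES.get(body) ?? match;
    },
  );
}

const unescaped = unescapeBasicEntities("&aacute; &#225; &#xE1; &unknown;");
// unescaped === "á á á &unknown;"
```

A complete solution would need the full HTML named-entity table; the short map here only keeps the example readable.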

Design Differences

ESM-only (no CommonJS)
Returns arrays instead of generators
Discriminated unions instead of numeric token codes
Zero runtime dependencies

Development

pnpm install
pnpm test        # Run tests
pnpm build       # Build with tsdown
pnpm check       # Lint + format + typecheck

License

MIT - same as the original Tokenizer.
