tokenize-is

v0.2.0

TypeScript tokenizer for Icelandic text. A port of Miðeind's Tokenizer.

Installation

npm install tokenize-is
# or
pnpm add tokenize-is

Usage

import { tokenize, splitIntoSentences } from "tokenize-is";

// Tokenize text
const tokens = tokenize("Kl. 14:30 komu 100 gestir.");
for (const token of tokens) {
  if (token.kind === "word") {
    console.log(token.text);
  } else if (token.kind === "number") {
    console.log(token.value); // parsed number
  }
}

// Split into sentences
const sentences = splitIntoSentences("Fyrst. Síðan.");
// → ["Fyrst.", "Síðan."]

Token Types

All tokens have a kind discriminator for TypeScript narrowing:

| Kind         | Description                         | Parsed Fields        |
| ------------ | ----------------------------------- | -------------------- |
| word         | Words                               | text                 |
| number       | Numbers (Icelandic/English formats) | value                |
| ordinal      | Ordinal numbers (1., XVII.)         | value                |
| time         | Time (14:30, kl. tvö)               | hour, minute, second |
| date         | ISO dates                           | year, month, day     |
| dateabs      | Absolute dates (17. júní 1944)      | year, month, day     |
| daterel      | Relative dates (3. janúar)          | month, day           |
| year         | Four-digit years                    | value                |
| amount       | Currency amounts (100 kr.)          | value, currency      |
| currency     | Currency codes/symbols              | iso                  |
| measurement  | Values with units (5km, 220V)       | value, unit          |
| percent      | Percentages                         | value                |
| url          | URLs                                | text                 |
| domain       | Domain names                        | text                 |
| email        | Email addresses                     | text                 |
| hashtag      | Hashtags (#iceland)                 | text                 |
| username     | @mentions                           | username             |
| numwletter   | Number+letter (14b, 33C)            | value, letter        |
| telno        | Phone numbers                       | cc, number           |
| molecule     | Chemical formulas (H2O)             | text                 |
| ssn          | Icelandic kennitala                 | value                |
| serialnumber | Serial numbers                      | text                 |
| timestamp    | Date+time combined                  | year..second         |
| punctuation  | Punctuation                         | normalized, position |
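
As a rough illustration of how the kind discriminator enables narrowing, here is a minimal sketch. The SketchToken types are hypothetical local stand-ins that mirror four rows of the table; the types the package actually exports may be named and shaped differently. The sample array is hand-written and is not output from tokenize().

```typescript
// Hypothetical local types mirroring a few rows of the table above.
// The package's real exported types may differ.
type WordToken = { kind: "word"; text: string };
type NumberToken = { kind: "number"; value: number };
type TimeToken = { kind: "time"; hour: number; minute: number; second: number };
type AmountToken = { kind: "amount"; value: number; currency: string };
type SketchToken = WordToken | NumberToken | TimeToken | AmountToken;

function describeToken(token: SketchToken): string {
  switch (token.kind) {
    case "word":
      return `word:${token.text}`; // narrowed to WordToken
    case "number":
      return `number:${token.value}`; // narrowed to NumberToken
    case "time": {
      const minute = (token.minute < 10 ? "0" : "") + token.minute;
      return `time:${token.hour}:${minute}`;
    }
    case "amount":
      return `amount:${token.value} ${token.currency}`;
    default: {
      // Exhaustiveness check: stops compiling if a kind is left unhandled.
      const unreachable: never = token;
      return unreachable;
    }
  }
}

// Hand-written sample tokens (an assumption, not real library output).
const sample: SketchToken[] = [
  { kind: "time", hour: 14, minute: 30, second: 0 },
  { kind: "word", text: "komu" },
  { kind: "number", value: 100 },
  { kind: "word", text: "gestir" },
];
const described = sample.map(describeToken);
console.log(described.join(" | "));
// time:14:30 | word:komu | number:100 | word:gestir
```

The never assignment in the default branch is a common TypeScript idiom: if a new kind is added to the union, the compiler flags every switch that forgot to handle it.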

Options

tokenize(text, {
  replaceCompositeGlyphs: true, // Normalize Unicode (a + ́ → á)
  includeSentenceMarkers: false, // Add s_begin/s_end tokens
  includeOffsets: false, // Add span.start/end character offsets
});
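
To make the composite-glyph comment concrete, the sketch below uses the standard String.prototype.normalize("NFC") call to combine a base letter with a combining accent. This is an assumption chosen for illustration: the library may handle composite glyphs with its own replacement table rather than NFC, so treat it as a picture of the effect, not of the implementation.

```typescript
// Assumption: NFC normalization stands in for the library's composite-glyph handling.
const decomposed = "a\u0301"; // "a" followed by U+0301 COMBINING ACUTE ACCENT
const composed = decomposed.normalize("NFC"); // single code point U+00E1 ("á")

console.log(decomposed.length, composed.length); // 2 1
console.log(composed === "\u00e1"); // true
```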

Token Offsets

When includeOffsets: true, each token includes a span with character positions:

const tokens = tokenize("Halló heimur", { includeOffsets: true });
// tokens[0] = { kind: "word", text: "Halló", span: { start: 0, end: 5 } }
// tokens[1] = { kind: "word", text: "heimur", span: { start: 6, end: 12 } }

// Extract original text from spans
const text = "Halló heimur";
text.slice(tokens[0].span.start, tokens[0].span.end); // "Halló"
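
Spans are also handy for annotating the original string without losing the text between tokens. The following is a minimal sketch, not part of the package: bracketTokens and the SpannedToken type are hypothetical names, the tokens are hand-written to match the example above, and it assumes spans are sorted, non-overlapping, and usable directly with String.prototype.slice.

```typescript
// Hypothetical minimal shape: only the fields shown in the offsets example above.
type Span = { start: number; end: number };
type SpannedToken = { kind: string; span: Span };

// Wrap each token's source slice in [brackets], keeping the gaps between tokens intact.
// Assumes tokens are sorted by start offset and do not overlap.
function bracketTokens(text: string, tokens: SpannedToken[]): string {
  let out = "";
  let cursor = 0;
  for (const { span } of tokens) {
    out += text.slice(cursor, span.start); // untouched gap (whitespace etc.)
    out += `[${text.slice(span.start, span.end)}]`;
    cursor = span.end;
  }
  return out + text.slice(cursor); // trailing text after the last token
}

const source = "Halló heimur";
// Hand-written spans copied from the example above (not live library output).
const spanned: SpannedToken[] = [
  { kind: "word", span: { start: 0, end: 5 } },
  { kind: "word", span: { start: 6, end: 12 } },
];
const bracketed = bracketTokens(source, spanned);
console.log(bracketed); // [Halló] [heimur]
```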

Port Fidelity

This is a TypeScript port of Miðeind's Tokenizer (MIT licensed).

Supported

  • All 30 token types from the original
  • Sentence boundary detection with abbreviation awareness
  • Unicode normalization (composite glyphs)
  • Icelandic number formats (1.234,56)
  • Spelled-out time expressions (hálftvö → 1:30)
  • ~100 Icelandic abbreviations
  • 70+ SI units, 18+ currencies
  • Kennitala (SSN) validation with checksum
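
Two of the items above lend themselves to a short sketch: the Icelandic number format and the kennitala check digit. The functions below are hypothetical helpers written for illustration and are not taken from the package. The checksum follows the commonly documented rule, where the first eight digits are weighted 3, 2, 7, 6, 5, 4, 3, 2 and the ninth digit is the check digit; the library's own validation may cover more cases (for example the century digit).

```typescript
// Illustrative helpers only: not the library's code.

// "." groups thousands and "," marks decimals, so "1.234,56" means 1234.56.
function parseIcelandicNumber(raw: string): number {
  return Number(raw.replace(/\./g, "").replace(",", "."));
}

// Check-digit test for a kennitala written as DDMMYY-NNCX or DDMMYYNNCX,
// where C (the ninth digit) is the check digit.
function hasValidKennitalaChecksum(raw: string): boolean {
  const digits = raw.replace("-", "");
  if (!/^\d{10}$/.test(digits)) return false;
  const weights = [3, 2, 7, 6, 5, 4, 3, 2];
  const sum = weights.reduce((acc, w, i) => acc + w * Number(digits[i]), 0);
  const remainder = sum % 11;
  const expected = remainder === 0 ? 0 : 11 - remainder;
  // An expected value of 10 can never equal a single digit, so it is rejected.
  return expected === Number(digits[8]);
}

console.log(parseIcelandicNumber("1.234,56")); // 1234.56
console.log(hasValidKennitalaChecksum("120174-3399")); // true
console.log(hasValidKennitalaChecksum("120174-3389")); // false
```

For the sample value, the weighted sum is 3 + 4 + 0 + 6 + 35 + 16 + 9 + 6 = 79, and 79 mod 11 = 2, so the expected check digit is 11 - 2 = 9, which matches the ninth digit.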

Not Yet Implemented

  • detokenize() - reconstruct text from tokens
  • correct_spaces() - fix spacing between tokens
  • paragraphs() / mark_paragraphs() - paragraph handling
  • HTML entity unescaping (&aacute; → á)
  • Full abbreviation list (300+ in original vs ~100 here)

Design Differences

  • ESM-only (no CommonJS)
  • Returns arrays instead of generators
  • Discriminated unions instead of numeric token codes
  • Zero runtime dependencies

Development

pnpm install
pnpm test        # Run tests
pnpm build       # Build with tsdown
pnpm check       # Lint + format + typecheck

License

MIT - same as the original Tokenizer.