
codpoint

This lib exposes a set of transform streams that consume raw buffer chunks and decode them as UTF8, UTF16, WTF8, or WTF16. However, they decode to buffers of codepoints (in other words UTF32, or I suppose WTF32), not to strings.

I found myself needing to do this repeatedly and realized it was worth spinning off into a standalone lib.

why

Naturally one can do some of this pretty easily with built-in decoding:

import fs from 'fs';
import { Writable } from 'stream';

fs.createReadStream(filename, 'utf8').pipe(new Writable({
  write: (chunk, enc, done) => {
    const cps = Uint32Array.from(
      String(chunk),
      char => char.codePointAt(0)
    );

    /*... congrats u got em ...*/
    done(); // signal that the chunk has been handled
  }
}));

Though nice and simple, this isn’t a particularly efficient way to get at the codepoints, and in my experience the reason for needing codepoints in the first place is usually that something is performance-sensitive. The native decoder decodes the utf8 to codepoints, but it then converts those into a string, and then you need to convert the string back to codepoints. So the main purpose of this lib is to cut out those pointless intermediate steps.

There are a few other distinctions:

  • the native decoder will output the \uFFFD replacement character in place of ill-formed encoding sequences, but because this lib is meant for internal processing rather than user-facing text handling, these streams instead throw errors for ill-formed sequences, with certain exceptions permitted when using the WTF* encodings (see the sketch after this list)
  • the WTF8 decoder permits sequences that would decode to UTF16 surrogate code units and passes these along as if they were valid unicode scalar values
  • the WTF16 decoder permits unpaired surrogate code units to pass through as if they were valid unicode scalar values
  • handling of the BOM is configurable for UTF8, and detecting the endianness of UTF16 from the BOM is supported
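To make the first two points concrete, here’s a minimal sketch contrasting the strict and lax decoders on an encoded surrogate. This is illustrative, not from the lib’s docs; it assumes decode errors surface as ordinary stream 'error' events and that output is UTF32le, as described under usage below.

import { Writable } from 'stream';
import { UTF8ToCPs, WTF8ToCPs } from 'codpoint';

// 0xED 0xA0 0x80 encodes the lone surrogate U+D800, which is ill-formed UTF8.
const surrogate = Buffer.from([0xed, 0xa0, 0x80]);

const strict = new UTF8ToCPs();
strict.on('error', err => console.error('utf8 rejects it:', err.message));
strict.pipe(new Writable({ write: (chunk, enc, done) => done() }));
strict.end(surrogate);

const lax = new WTF8ToCPs();
lax.pipe(new Writable({
  write: (chunk, enc, done) => {
    // prints d800: the surrogate passed through as if it were a scalar value
    console.log('wtf8 passes it through:', chunk.readUInt32LE(0).toString(16));
    done();
  }
}));
lax.end(surrogate);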

There is a little naive benchmark in the test dir; results looked like this for me on node 8:

native decode utf8 to CPs: 100 iterations over all unicode scalars averaged 99.78330918ms
codpoint decode utf8 to CPs: 100 iterations over all unicode scalars averaged 43.749709050000014ms
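For reference, a hypothetical sketch of a comparable measurement (this is not the repo’s benchmark; it just times the two paths described above, using the usage shown below):

import { performance } from 'perf_hooks';
import { Writable } from 'stream';
import { UTF8ToCPs } from 'codpoint';

// build a sample input: every unicode scalar value, UTF8-encoded
let str = '';
for (let cp = 0; cp <= 0x10ffff; cp += 1) {
  if (cp < 0xd800 || cp > 0xdfff) str += String.fromCodePoint(cp);
}
const input = Buffer.from(str, 'utf8');

// native path: decode to a string, then convert the string to codepoints
let t = performance.now();
const cps = Uint32Array.from(input.toString('utf8'), char => char.codePointAt(0));
console.log(`native: ${(performance.now() - t).toFixed(2)}ms`);

// codpoint path: decode straight to codepoint buffers
const dec = new UTF8ToCPs();
t = performance.now();
dec.pipe(new Writable({ write: (chunk, enc, done) => done() }))
   .on('finish', () => console.log(`codpoint: ${(performance.now() - t).toFixed(2)}ms`));
dec.end(input);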

usage

import fs from 'fs';
import { UTF16ToCPs, UTF8ToCPs, WTF16ToCPs, WTF8ToCPs } from 'codpoint';

fs.createReadStream(fileName).pipe(new UTF8ToCPs()).pipe(/* my consumer */);

The consumer will receive buffers of codepoints (effectively, this is UTF32le, unless using a WTF* decoder). You could read them from the node buffer interface:

for (let i = 0; i < buf.length; i += 4) {
  const cp = buf.readUInt32LE(i);
  /* do stuff */
}

Or you could read them from a regular typed array view:

// note: the view’s length is in 32-bit elements, hence buf.length / 4
for (const cp of new Uint32Array(buf.buffer, buf.byteOffset, buf.length / 4)) {
  /* do stuff */
}

You could also use DataView, etc. THE POSSIBILITIES R ENDLESS
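For instance, a DataView over the same chunk (standard APIs, nothing codpoint-specific):

const view = new DataView(buf.buffer, buf.byteOffset, buf.length);
for (let i = 0; i < buf.length; i += 4) {
  const cp = view.getUint32(i, true); // true = little-endian
  /* do stuff */
}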

options

The constructors each accept an options object.

UTF8ToCPs and WTF8ToCPs

  • options.discardBOM: default true. When true, an initial BOM is not piped through as a codepoint

UTF16ToCPs and WTF16ToCPs

  • options.endianness: default 'bom'. The possible values are 'bom', 'le' and 'be', which effectively say to decode *TF16, *TF16LE and *TF16BE respectively.

Note that discardBOM is not an option here since the semantics differ from UTF8, where the BOM isn’t really a byte order mark so much as a sentinel value. In UTF16, an initial BOM is not optional and is not part of the text. UTF16LE and UTF16BE are defined as having no BOM; an initial U+FEFF there would unambiguously be a ZWNBSP.
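For example, using the options documented above:

import { UTF8ToCPs, UTF16ToCPs } from 'codpoint';

// keep an initial BOM and pass it through as a codepoint
const utf8 = new UTF8ToCPs({ discardBOM: false });

// decode as UTF16BE, rather than sniffing endianness from a BOM
const utf16be = new UTF16ToCPs({ endianness: 'be' });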

errors

Various rather specific error constructors like InvalidUTF8ContinuationError are also exported. They’ll tell you what went wrong, but line/column is not tracked.
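A minimal sketch of checking for a specific error type; this assumes, as is normal for node streams, that decode errors surface as 'error' events ('data.bin' is a placeholder path):

import fs from 'fs';
import { Writable } from 'stream';
import { UTF8ToCPs, InvalidUTF8ContinuationError } from 'codpoint';

fs.createReadStream('data.bin')
  .pipe(new UTF8ToCPs())
  .on('error', err => {
    if (err instanceof InvalidUTF8ContinuationError) {
      console.error('bad continuation byte:', err.message);
    } else {
      console.error('decode failed:', err);
    }
  })
  .pipe(new Writable({ write: (chunk, enc, done) => done() }));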