sudachi-wasm333
v1.0.3
Modern WebAssembly distribution of sudachi.rs
Updated WebAssembly distribution of sudachi.rs.
This distribution supports both the browser and Node.js.
Why?
Because many Japanese tokenization projects use kuromoji.js or Kuroshiro + kuroshiro-analyzer-kuromoji, which internally use the kuromoji dictionary. That is not a bad thing in itself but, as you may have noticed, all of them are considerably outdated.
Fortunately, Sudachi is a modern Japanese morphological analyzer, and its dictionary, SudachiDict, is updated often.
So we can use sudachi-wasm and forget about outdated dicts. Right?
Well... not exactly. The original sudachi-wasm embedded the whole sudachi dictionary in its package code. That implies:
- Slower performance.
- Heavier file size.
- No way to use dictionary files other than the one the package was compiled with.
This library fixes all of that by using dynamic dictionary loading, allowing you to use the latest Sudachi Dictionary even if for some reason I forget to update this package.
Right now you have to manually download the Sudachi Dictionary you want to use, but I plan to add dynamic downloading too, so the package will automatically download the latest dictionary available.
Features
- Updated structure of the original sudachi-wasm to resemble the actual structure of sudachi.rs.
- SudachiStateless and SudachiStateful classes implementation.
- Slightly improved library docstrings and types.
- Added dynamic dict loading, so a custom dict path/url can be provided.
- Improved file size because of dynamic dict loading.
- Structure kinda inspired by Kuroshiro's initialization.
- Improved performance.
Usage
Custom Sudachi Dictionary
sudachi-wasm333 ships with a default dictionary (the small one). If you want to use a specific version instead, you can download it from SudachiDict and provide its path/URL through the class initializer.
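Since a custom dictionary can come from either a local path or a URL, it may help to see what that distinction looks like in plain JavaScript. The helper below is only a sketch: `loadDictionaryBytes` is a hypothetical name and is not part of sudachi-wasm333, and how the package itself consumes the path/URL is not shown here (check the class initializer's types for that).

```javascript
// Hypothetical helper, NOT part of sudachi-wasm333: read dictionary bytes
// from either an http(s) URL or a local file path.
async function loadDictionaryBytes(source) {
  if (/^https?:\/\//i.test(source)) {
    // Remote dictionary: download it with fetch (browser or Node.js 18+).
    const response = await fetch(source);
    if (!response.ok) {
      throw new Error(`Failed to fetch dictionary: HTTP ${response.status}`);
    }
    return new Uint8Array(await response.arrayBuffer());
  }
  // Local dictionary: import fs lazily so this file still loads in a browser.
  const { readFile } = await import("node:fs/promises");
  return new Uint8Array(await readFile(source));
}
```

For example, `await loadDictionaryBytes("./system_small.dic")` would read a local file, where the file name is just an illustrative path.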
Browser
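<script type="module">
  // Please replace this with the path to your self-hosted script.
  import { SudachiStateless, TokenizeMode } from "/v1.0.3.js";

  // Register a service worker (when supported) so the large script can be cached.
  if ("serviceWorker" in navigator) {
    await navigator.serviceWorker.register("serviceWorker.js");
  }

  // Element that displays the result; named `output` to avoid shadowing the global console.
  const output = document.querySelector("#console");

  const sudachi = new SudachiStateless();
  await sudachi.initialize_browser();
  // tokenize_stringified returns a JSON string; parse and re-serialize it to pretty-print.
  output.innerText = JSON.stringify(
    JSON.parse(sudachi.tokenize_stringified("今日は良い天気なり。", TokenizeMode.C)),
    null,
    2
  );
</script>

⚠ Script is too large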
Even gzipped, the script file is larger than 50 MB 🐘. Please use the following mechanisms to deliver it:
- gzip encoding for compressing
- Service Worker for caching
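The browser example registers a `serviceWorker.js` but its contents are not shown. A minimal cache-first sketch could look like the following; the cache name, the `.js` filter, and the `cacheFirst` helper are all assumptions made for illustration, not the file shipped with this package.

```javascript
// Hypothetical cache name; bump it when you publish a new script version.
const CACHE_NAME = "sudachi-wasm-cache-v1";

// Cache-first lookup: return the cached response when present; otherwise
// fetch from the network, store a copy, and return the fresh response.
async function cacheFirst(request, cache, fetchFn) {
  const cached = await cache.match(request);
  if (cached) return cached;
  const response = await fetchFn(request);
  // Only cache successful responses so a failed download is retried later.
  if (response.ok) await cache.put(request, response.clone());
  return response;
}

// Wire the helper up only when this file actually runs as a service worker.
if (typeof ServiceWorkerGlobalScope !== "undefined" && self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener("fetch", (event) => {
    const { request } = event;
    // Assumption: only GET requests for the large .js bundle are cached this way.
    if (request.method !== "GET" || !request.url.endsWith(".js")) return;
    event.respondWith(
      caches.open(CACHE_NAME).then((cache) => cacheFirst(request, cache, (req) => fetch(req)))
    );
  });
}
```

Keeping the lookup in a plain function that receives the cache and the fetch function makes it easy to exercise outside a browser. Gzip itself is usually configured on the server or CDN rather than in the service worker.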
Node.js
npm i sudachi-wasm333

Then,

import { promises } from "fs";
// Import from the installed package name.
import { SudachiStateless, TokenizeMode } from "sudachi-wasm333";

const sudachi = new SudachiStateless();
// Hand the library Node's promise-based readFile so it can load its files from disk.
await sudachi.initialize_node(promises.readFile);
console.log(sudachi.tokenize_raw("今日は良い天気なり。", TokenizeMode.C));

Development requirements
Just (Optional)
All build and test related commands can be found in the justfile.
just help
just dev
just build test-all

Build
cd sudachi
wasm-pack build --dev --target web && zx ./wasm-pack-inline.mjs

Test
Browser
npx http-server

Then, open the local server in your browser.
I actually prefer Live Server. - Benjas333
Node.js
cd sudachi
node test/node.mjs

cd sudachi
node test/node_stateful.mjs

cd sudachi
node test/special_chars.mjs

TODO
Minor
- Add public link (like the original sudachi-wasm: https://sudachi-wasm.s3.amazonaws.com/v0.1.4.js).
- Add SudachiStateful examples.
- Improve documentation.
- Edit README.ja.md.
- Add demo (like the original sudachi-wasm: https://sudachi-wasm.s3.amazonaws.com/index.html).
Major
- Add dict loading from the .zip to reduce library size.
- Add default dict being dynamically downloaded from SudachiDict.
- Add dynamic dict type downloading: "small", "core", "full".
