@leonsilicon/cc-cedict

v0.0.0

Published 24 days ago

CC-CEDICT (Chinese-English dictionary) parsed as JSON, with the Ohm grammar used to produce it.


leonsilicon

cc-cedict cedict chinese chinese-english dictionary mdbg ohm pinyin

CC-CEDICT — the community-maintained Chinese ↔ English dictionary — parsed into JSON, plus the Ohm grammar that produced it.

About 125,000 entries, each with Traditional and Simplified headwords, pinyin, and definitions.
Ships as a single JSON file you can import directly via JSON modules (with { type: "json" }).
The grammar is exported separately so you can re-parse the source file yourself, for example against a newer dump or a fork.

Installation

npm install @leonsilicon/cc-cedict
# or: pnpm add @leonsilicon/cc-cedict / yarn add @leonsilicon/cc-cedict / bun add @leonsilicon/cc-cedict

Requires Node.js 22.12+ (or any runtime that supports the with { type: "json" } import attribute).

Usage

Read the dictionary

import cedict from "@leonsilicon/cc-cedict";

console.log(cedict.entries.length); // 124_933
console.log(cedict.metadata.version); // "1"
console.log(cedict.entries[0]);
// {
//   traditional: "110",
//   simplified:  "110",
//   pinyin:      "yao1 yao1 ling2",
//   definitions: ["the emergency number for law enforcement ..."],
// }
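
Because entries is a plain array, finding a headword means scanning the whole list. One possible approach, sketched here and not something the package provides, is to build a Map once and look up by either headword form. The names Entry, buildIndex, and sampleEntries are hypothetical; in real use you would pass cedict.entries instead of the stand-in data.

```typescript
// Local copy of the entry shape so this sketch is self-contained.
interface Entry {
  traditional: string;
  simplified: string;
  pinyin: string;
  definitions: string[];
}

// Hypothetical helper: index entries by both Traditional and Simplified forms.
// A headword can have several readings, so each key maps to a list.
function buildIndex(entries: Entry[]): Map<string, Entry[]> {
  const index = new Map<string, Entry[]>();
  const add = (key: string, entry: Entry): void => {
    const bucket = index.get(key);
    if (bucket) bucket.push(entry);
    else index.set(key, [entry]);
  };
  for (const entry of entries) {
    add(entry.traditional, entry);
    // Skip the second insert when both forms are identical.
    if (entry.simplified !== entry.traditional) add(entry.simplified, entry);
  }
  return index;
}

// Stand-in data (assumption); replace with `cedict.entries`.
const sampleEntries: Entry[] = [
  { traditional: "你好", simplified: "你好", pinyin: "ni3 hao3", definitions: ["hello", "hi"] },
  { traditional: "亞當", simplified: "亚当", pinyin: "Ya4 dang1", definitions: ["Adam"] },
];

const headwordIndex = buildIndex(sampleEntries);
console.log(headwordIndex.get("亚当")?.[0].definitions);
```

Building the index costs one pass over the array; after that each lookup is a single Map access.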

The default export has this shape:

interface CedictDocument {
  metadata: Record<string, string>; // header `#! key=value` lines
  comments: string[]; // header `# ...` lines (license, version notes, etc.)
  entries: CedictEntry[];
}

interface CedictEntry {
  traditional: string;
  simplified: string;
  pinyin: string; // space-separated syllables; `u:` denotes ü
  definitions: string[]; // one per `/.../` segment in the source
}
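
The pinyin field uses numbered tones and u: for ü, which is convenient for searching but not for display. The following is a minimal sketch of converting it to tone-marked pinyin; markSyllable and toMarkedPinyin are hypothetical helpers, not part of the package. It assumes the usual placement rule (mark a or e if present, else the o of ou, else the last vowel) and treats tone 5 as neutral. Syllables with no vowel, such as r5 or m2, are returned without a mark.

```typescript
// Combining diacritics for tones 1 to 4: macron, acute, caron, grave.
const TONE_MARKS = ["\u0304", "\u0301", "\u030C", "\u0300"];

// Hypothetical helper: convert one numbered syllable, e.g. "hao3" -> "hǎo".
function markSyllable(syllable: string): string {
  const match = /^([A-Za-z:]+)([1-5])$/.exec(syllable);
  if (!match) return syllable; // punctuation such as "·" passes through

  const base = match[1].replace(/u:/g, "ü").replace(/U:/g, "Ü");
  const tone = Number(match[2]);
  if (tone === 5) return base; // neutral tone carries no mark

  const lower = base.toLowerCase();
  let pos = lower.search(/[ae]/); // "a" or "e" always takes the mark
  if (pos === -1) pos = lower.indexOf("ou"); // in "ou" the "o" takes it
  if (pos === -1) {
    // Otherwise the last vowel takes it.
    for (let i = lower.length - 1; i >= 0; i--) {
      if ("iouü".includes(lower[i])) {
        pos = i;
        break;
      }
    }
  }
  if (pos === -1) return base; // no vowel found; tone is dropped

  const marked = base.slice(0, pos + 1) + TONE_MARKS[tone - 1] + base.slice(pos + 1);
  return marked.normalize("NFC"); // compose letter + diacritic into one character
}

// Hypothetical helper: convert a whole space-separated pinyin field.
function toMarkedPinyin(pinyin: string): string {
  return pinyin.split(" ").map(markSyllable).join(" ");
}

console.log(toMarkedPinyin("ni3 hao3"));
console.log(toMarkedPinyin("nu:3"));
```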

Use the Ohm grammar

import grammar from "@leonsilicon/cc-cedict/grammar";
import * as ohm from "ohm-js";

const g = ohm.grammar(grammar);
const match = g.match(`你好 你好 [ni3 hao3] /hello/hi/\n`);
if (match.succeeded()) {
  // ... walk the parse tree with your own semantics
}

@leonsilicon/cc-cedict/grammar is a default-exported string containing the raw Ohm source. The grammar defines these rules (see grammar.ohm for the full source):

| Rule | Matches |
| -------------- | ---------------------------------------- |
| file | the whole file (a sequence of lines) |
| metadataLine | #! key=value header lines |
| commentLine | # anything header lines |
| entryLine | <trad> <simp> [pinyin] /def1/def2/.../ |
| blankLine | a bare newline |

Source format

Each entry in CC-CEDICT looks like:

亞當·斯密 亚当·斯密 [Ya4 dang1 · Si1 mi4] /Adam Smith (1723-1790), Scottish philosopher .../

The headword section is space-delimited (Traditional, then Simplified).
Pinyin appears inside [...]; u: represents ü.
Definitions follow inside /.../, slash-separated.
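
To make the three parts concrete, here is a rough sketch that splits one entry line with a regular expression. This is a stand-in technique for illustration only, not the package's Ohm grammar, and ParsedLine, ENTRY_PATTERN, and parseEntryLine are hypothetical names. It assumes the headwords contain no spaces and that the line ends with a closing slash.

```typescript
interface ParsedLine {
  traditional: string;
  simplified: string;
  pinyin: string;
  definitions: string[];
}

// <trad> <simp> [pinyin] /def1/def2/.../ with optional trailing whitespace.
const ENTRY_PATTERN = /^(\S+) (\S+) \[([^\]]*)\] \/(.*)\/\s*$/;

// Hypothetical helper: returns null for header lines and anything malformed.
function parseEntryLine(line: string): ParsedLine | null {
  if (line.startsWith("#")) return null; // metadata or comment line
  const match = ENTRY_PATTERN.exec(line);
  if (!match) return null;
  const [, traditional, simplified, pinyin, body] = match;
  // The captured body excludes the outer slashes, so a plain split suffices.
  return { traditional, simplified, pinyin, definitions: body.split("/") };
}

console.log(parseEntryLine("你好 你好 [ni3 hao3] /hello/hi/\n"));
```

A regex like this is fine for spot checks on single lines; for a full re-parse, the exported grammar is the better fit because it also covers the header and blank lines.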

Regenerating the JSON

The published JSON is built from data/cedict_1_0_ts_utf-8_mdbg.txt (downloaded from MDBG). To rebuild against the latest dump:

bun run download   # fetch the latest cedict_1_0_ts_utf-8_mdbg.txt.gz
bun run parse      # parse it with the Ohm grammar → cedict_1_0_ts_utf-8_mdbg.json

The same two steps run automatically before npm publish via the prepublishOnly hook.

License & attribution

The parsing code in this package is MIT-licensed. The dictionary data itself is licensed by MDBG under Creative Commons Attribution-ShareAlike 4.0 International. If you redistribute the JSON or any derivative, you must keep the same license and credit CC-CEDICT.
