@edwinho/kotoba-core
v0.2.5
Framework-neutral language-learning data models, language profiles, and translation draft utilities.
Framework-neutral data contracts and utilities for Kotoba learning entries, translation drafts, language profiles, cache keys, and sanitizers.
This package is intentionally pure. It does not import Expo, React Native, Gemini provider code, CLI code, app storage, service policy, or runtime transport helpers.
@edwinho/kotoba-core is the shared data-contract package used by Kotoba
providers, CLI tools, and app integrations. It does not translate text by itself
and it does not call Gemini or hosted services.
Install
Inside this monorepo, depend on the package by version. Bun workspaces link the local package when the workspace version satisfies the range:
{
  "dependencies": {
    "@edwinho/kotoba-core": "^0.2.0"
  }
}
For external consumers:
bun add @edwinho/kotoba-core
Concepts
Version 0.2.0 includes optional Japanese study-token metadata, the
StudyTokenMetadata contract, and the generateJapaneseFormTable() helper for
rendering deterministic verb and adjective form tables from trusted metadata.
TranslationDraft is the normalized shape returned by translation providers
before a phrase is saved. It carries source and target text, optional reading
support, enrichment data, study tokens, freshness metadata, and capability
flags that describe which enrichments are present. It keeps the same field
shape while supporting future-language drafts through a generic language
parameter. App consumers can keep using the default Japanese/Chinese draft
type, while core/provider tests can infer TranslationDraft<"ko"> from
normalizeTranslationDraft({ targetLanguage: "ko", ... }).
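The inference described above can be illustrated with a small self-contained sketch. The types and the normalizeDraftSketch function below are hypothetical stand-ins, not the package's definitions. They only show how a generic language parameter with a default can leave existing Japanese/Chinese callers unchanged while a "ko" literal narrows the result type.

```typescript
// Hypothetical stand-ins for illustration; the real contracts live in @edwinho/kotoba-core.
type SketchLanguage = "ja" | "zh" | "ko";

// The default type argument keeps existing Japanese/Chinese callers unchanged.
interface DraftSketch<L extends SketchLanguage = "ja" | "zh"> {
  targetLanguage: L;
  sourceText: string;
  targetText: string;
}

// The language literal in the input is inferred as L and carried to the result.
function normalizeDraftSketch<L extends SketchLanguage>(input: {
  targetLanguage: L;
  sourceText: string;
  targetText: string;
}): DraftSketch<L> {
  return {
    targetLanguage: input.targetLanguage,
    sourceText: input.sourceText.trim(),
    targetText: input.targetText.trim(),
  };
}

// Inferred as DraftSketch<"ko"> without an explicit type argument.
const koreanDraftSketch = normalizeDraftSketch({
  targetLanguage: "ko",
  sourceText: " I'm going now ",
  targetText: "저 지금 가요.",
});

// This annotation compiles only because inference kept the "ko" literal type.
const typedKoreanDraft: DraftSketch<"ko"> = koreanDraftSketch;
console.log(typedKoreanDraft.targetLanguage, typedKoreanDraft.sourceText);
```

Writing DraftSketch with no type argument still means the Japanese/Chinese draft, which is the property the paragraph above relies on for app consumers.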
LearningEntry and LearningEntryDraft are the saved-entry contracts used by
library and note surfaces.
Language helpers such as detectDirection, resolveLanguageProfile,
resolveActiveLanguageContext, and Chinese variant helpers keep language,
script, reading-system, and cache-scope decisions centralized.
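As a rough illustration of why direction detection takes the learning language as an argument, here is a minimal sketch based on Unicode block ranges. It is not the package's detectDirection. The function names, the returned labels, and the script rules are all assumptions made for this example.

```typescript
// Hypothetical sketch; the real detectDirection may use different rules and return values.
type DetectLanguageSketch = "ja" | "zh" | "ko";
type DirectionSketch = "from-learning-language" | "to-learning-language";

// True when any character falls in a script block associated with the learning language.
function usesLearningScriptSketch(text: string, lang: DetectLanguageSketch): boolean {
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isHan = code >= 0x4e00 && code <= 0x9fff; // CJK Unified Ideographs
    const isKana = code >= 0x3040 && code <= 0x30ff; // Hiragana and Katakana
    const isHangul = code >= 0xac00 && code <= 0xd7a3; // Hangul Syllables
    if (lang === "ja" && (isHan || isKana)) return true;
    if (lang === "zh" && isHan) return true;
    if (lang === "ko" && isHangul) return true;
  }
  return false;
}

// Han characters alone cannot separate Japanese from Chinese, so the caller supplies the language.
function detectDirectionSketch(text: string, lang: DetectLanguageSketch): DirectionSketch {
  return usesLearningScriptSketch(text, lang) ? "from-learning-language" : "to-learning-language";
}

console.log(detectDirectionSketch("我肚餓", "zh")); // from-learning-language
console.log(detectDirectionSketch("I'm hungry", "ja")); // to-learning-language
```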
Language support is represented through capability profiles. LearningLanguage
tracks product surfaces that only expose Japanese and Chinese today, while
SupportedLearningLanguage also includes package-level languages such as
Korean. Each LanguageCapabilityProfile declares language, script,
reading-system, locale, register, romanization, and variant support.
Japanese and Chinese profiles preserve the existing locale, reading,
romanization, and variant behavior. Korean is included with Hangul/Revised
Romanization, ko-KR speech locale metadata, and register support.
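The split between product-facing and package-level languages can be sketched as two union types plus a profile record. The type and field names below are assumptions for illustration, apart from defaultTTSLocale, which appears in the examples later in this README. The Korean values follow the description above.

```typescript
// Hypothetical shapes for illustration; the real LanguageCapabilityProfile may differ.
type ProductLanguageSketch = "ja" | "zh"; // languages exposed in product surfaces today
type PackageLanguageSketch = ProductLanguageSketch | "ko"; // adds package-level languages

interface CapabilityProfileSketch {
  language: PackageLanguageSketch;
  scripts: string[];
  readingSystems: string[];
  defaultTTSLocale: string;
  supportsRegister: boolean;
  supportsRomanization: boolean;
  variants: string[];
}

// Korean as described above: Hangul, Revised Romanization, ko-KR, register support.
const koreanProfileSketch: CapabilityProfileSketch = {
  language: "ko",
  scripts: ["hangul"],
  readingSystems: ["revised_romanization"],
  defaultTTSLocale: "ko-KR",
  supportsRegister: true,
  supportsRomanization: true,
  variants: [],
};

// Type guard: product code can narrow a package-level language before showing UI for it.
function isProductLanguageSketch(lang: PackageLanguageSketch): lang is ProductLanguageSketch {
  return lang === "ja" || lang === "zh";
}

console.log(isProductLanguageSketch(koreanProfileSketch.language), koreanProfileSketch.defaultTTSLocale);
```

Keeping the narrower union for product surfaces means a package-level language can be tested in core without automatically appearing in settings or navigation.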
Validators and normalizers accept provider-like or imported data defensively:
malformed sections are dropped and reported through droppedSections; valid
sections are normalized into the public contracts. For example,
sanitizeStudyTokens rejects out-of-range tokens and tokens whose surface text
does not match the target text. sanitizeEnrichmentData drops malformed nested
sections such as invalid grammar breakdowns, examples, contrasts, and cached
note details. normalizeTranslationDraft derives completeness and capability
metadata after reading support, Chinese metadata, and study tokens are
normalized.
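The drop-and-report behavior for study tokens can be approximated with the sketch below. It is not the package's sanitizeStudyTokens. The start/end offset fields and the exact droppedSections path format are assumptions chosen to make the range and surface checks concrete.

```typescript
// Hypothetical token shape; the start/end offsets are an assumption for this sketch.
interface RangeTokenSketch {
  surface: string;
  start: number; // inclusive offset into targetText
  end: number; // exclusive offset into targetText
}

function sanitizeTokensSketch(
  targetText: string,
  candidates: unknown
): { tokens: RangeTokenSketch[]; droppedSections: string[] } {
  const tokens: RangeTokenSketch[] = [];
  const droppedSections: string[] = [];
  if (!Array.isArray(candidates)) {
    // The whole section is malformed, so report it once and keep nothing.
    return { tokens, droppedSections: ["studyTokens"] };
  }
  candidates.forEach((candidate: unknown, index: number) => {
    const path = `studyTokens[${index}]`;
    if (typeof candidate !== "object" || candidate === null) {
      droppedSections.push(path);
      return;
    }
    const { surface, start, end } = candidate as Record<string, unknown>;
    if (typeof surface !== "string" || typeof start !== "number" || typeof end !== "number") {
      droppedSections.push(path);
      return;
    }
    const inRange =
      Number.isInteger(start) && Number.isInteger(end) && start >= 0 && start < end && end <= targetText.length;
    // Reject tokens that point outside the text or whose surface disagrees with it.
    if (!inRange || targetText.slice(start, end) !== surface) {
      droppedSections.push(path);
      return;
    }
    tokens.push({ surface, start, end });
  });
  return { tokens, droppedSections };
}

const sanitizedSketch = sanitizeTokensSketch("お腹が空きました。", [
  { surface: "お腹", start: 0, end: 2 }, // kept
  { surface: "空き", start: 0, end: 2 }, // surface does not match the text at 0..2
  { surface: "。", start: 8, end: 20 }, // end is past the text length
]);
console.log(sanitizedSketch.tokens.length, sanitizedSketch.droppedSections);
```

Returning the dropped paths next to the surviving tokens lets a caller log provider quality problems without failing the whole draft.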
Japanese study tokens can optionally carry trusted morphology metadata under
StudyToken.metadata. Phase 1 supports Japanese verb and adjective metadata
only:
import type { StudyTokenMetadata } from "@edwinho/kotoba-core";
const metadata: StudyTokenMetadata = {
  language: "ja",
  category: "morphology",
  kind: "verb",
  surface: "飲んだ",
  lemma: "飲む",
  verbClass: "godan-mu",
  observedForm: "past",
  confidence: "high",
};
surface is the study token that owns the metadata. Core may add
observedSurface when deterministic repair can prove that a split sequence is
one observed morphology phrase, for example a token 高く with
observedSurface: "高くないです".
sanitizeStudyTokens preserves valid metadata and drops only malformed
metadata while keeping the token. Dropped metadata is reported through a
metadata-specific path such as studyTokens[0].metadata.
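A minimal sketch of that rule, assuming a simplified metadata shape, might look like the following. The validator and the field subset are hypothetical. Only the behavior (keep the token, drop bad metadata, report a metadata-specific path) comes from the description above.

```typescript
// Hypothetical, reduced metadata shape; the real StudyTokenMetadata has more fields.
interface MetadataSketch {
  language: "ja";
  kind: "verb" | "adjective";
  lemma: string;
  confidence: "high" | "medium" | "low";
}

interface TokenWithMetadataSketch {
  surface: string;
  metadata?: MetadataSketch;
}

function isMetadataSketch(value: unknown): value is MetadataSketch {
  if (typeof value !== "object" || value === null) return false;
  const { language, kind, lemma, confidence } = value as Record<string, unknown>;
  return (
    language === "ja" &&
    (kind === "verb" || kind === "adjective") &&
    typeof lemma === "string" &&
    lemma.length > 0 &&
    (confidence === "high" || confidence === "medium" || confidence === "low")
  );
}

function keepTokensDropBadMetadataSketch(rawTokens: { surface: string; metadata?: unknown }[]) {
  const droppedSections: string[] = [];
  const tokens = rawTokens.map((raw, index): TokenWithMetadataSketch => {
    if (raw.metadata === undefined) return { surface: raw.surface };
    if (isMetadataSketch(raw.metadata)) return { surface: raw.surface, metadata: raw.metadata };
    // The token itself is fine, so only the metadata path is reported.
    droppedSections.push(`studyTokens[${index}].metadata`);
    return { surface: raw.surface };
  });
  return { tokens, droppedSections };
}

const metadataResultSketch = keepTokensDropBadMetadataSketch([
  { surface: "水", metadata: { language: "ja", kind: "noun", lemma: "水", confidence: "high" } },
  { surface: "飲んだ", metadata: { language: "ja", kind: "verb", lemma: "飲む", confidence: "high" } },
]);
console.log(metadataResultSketch.tokens.length, metadataResultSketch.droppedSections);
```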
The sanitizer also performs bounded Japanese morphology repair for common
provider tokenization failures. For example, split polite verb sequences such as
食べ + ました, and split adjective sequences such as 静か + でした,
高く + ない + です, and 静か + では + なかった, can be normalized
into observed surfaces like 食べました, 静かでした, 高くないです, and
静かではなかった so form tables can mark the observed cell while preserving the
original tappable token surface. These repairs require contiguous target-text
spans and trusted verb/adjective metadata or clear adjective part-of-speech
notes; core does not guess arbitrary morphology from unrelated tokens.
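The contiguity requirement can be sketched as follows. This is a simplified illustration, not the package's repair logic. The offset fields, the trusted flag, and the small allow-list of continuation sequences are assumptions standing in for the trusted-metadata checks described above.

```typescript
// Hypothetical token shape: offsets and the trusted flag are assumptions for this sketch.
interface SpanTokenSketch {
  surface: string;
  start: number;
  end: number;
  trusted?: boolean; // stands in for "has trusted verb/adjective metadata"
  observedSurface?: string;
}

// Bounded repair: only these continuation sequences are considered (illustrative, not exhaustive).
const allowedTailsSketch: string[][] = [["ました"], ["でした"], ["ない", "です"], ["では", "なかった"]];

function repairObservedSurfaceSketch(targetText: string, tokens: SpanTokenSketch[]): SpanTokenSketch[] {
  return tokens.map((token, index) => {
    if (!token.trusted) return token;
    for (const tail of allowedTailsSketch) {
      const following = tokens.slice(index + 1, index + 1 + tail.length);
      const surfacesMatch =
        following.length === tail.length && following.every((next, i) => next.surface === tail[i]);
      // Contiguous means each following token starts exactly where the previous one ended.
      let cursor = token.end;
      const contiguous = following.every((next) => {
        const touches = next.start === cursor;
        cursor = next.end;
        return touches;
      });
      if (surfacesMatch && contiguous) {
        // The original surface stays tappable; only observedSurface is added.
        return { ...token, observedSurface: targetText.slice(token.start, cursor) };
      }
    }
    return token;
  });
}

const repairedSketch = repairObservedSurfaceSketch("高くないです", [
  { surface: "高く", start: 0, end: 2, trusted: true },
  { surface: "ない", start: 2, end: 4 },
  { surface: "です", start: 4, end: 6 },
]);
console.log(repairedSketch[0]?.surface, repairedSketch[0]?.observedSurface); // 高く 高くないです
```

If any gap separates the tokens in the target text, the sketch leaves the token untouched, which mirrors the rule that unrelated tokens are never merged.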
Package Boundary
The public package set is intentionally split:
- @edwinho/kotoba-core owns framework-neutral language profiles, draft contracts, validators, normalizers, cache/version helpers, and learning-entry utilities.
- @edwinho/kotoba-gemini owns Gemini prompt/schema/provider logic and requires a caller-provided Gemini API key.
- @edwinho/kotoba-cli is a terminal consumer of the public packages. It sends the user's input text to Gemini through @edwinho/kotoba-gemini using the user's Gemini API key.
- App integrations own runtime policy, persistence, and product-specific behavior around these contracts.
Examples
Normalize a cloud translation result:
import { normalizeTranslationDraft, sanitizeEnrichmentData } from "@edwinho/kotoba-core";
// providerPayload: the raw provider response, defined elsewhere by the caller.
const { enrichment, droppedSections } = sanitizeEnrichmentData(providerPayload.enrichment);
const draft = normalizeTranslationDraft(
  {
    targetLanguage: "ja",
    sourceLanguage: "en",
    sourceText: "I'm hungry",
    targetText: "お腹が空きました。",
    readingSegments: [{ text: "お腹", reading: "おなか" }],
    romanization: "onaka ga sukimashita",
    translationText: "I'm hungry.",
    register: "polite",
    enrichment,
    studyTokens: providerPayload.studyTokens,
  },
  {
    source: "cloud",
    canRegenerateWithCloud: true,
  }
);
console.log(draft.completeness, draft.capabilities, droppedSections);
Resolve language behavior:
import {
  detectDirection,
  resolveActiveLanguageContext,
  resolveLanguageProfile,
  resolveTTSLocale,
} from "@edwinho/kotoba-core";
const context = resolveActiveLanguageContext({
  learningLanguage: "zh",
  chineseVariant: "cantonese-traditional",
});
const profile = resolveLanguageProfile(context.learningLanguage);
const inputMode = detectDirection("我肚餓", context.learningLanguage);
const ttsLocale = resolveTTSLocale(
  context.learningLanguage,
  context.chineseDisplayScript ?? undefined,
  context.chineseVariant ?? undefined
);
console.log(profile.defaultScript, inputMode, ttsLocale);
Generate Japanese form tables from trusted metadata:
import { generateJapaneseFormTable } from "@edwinho/kotoba-core";
const table = generateJapaneseFormTable({
  language: "ja",
  category: "morphology",
  kind: "verb",
  surface: "飲んだ",
  lemma: "飲む",
  verbClass: "godan-mu",
  observedForm: "past",
  confidence: "high",
});
console.log(table?.coreRows);
// [
//   {
//     key: "non-past",
//     label: "Non-past",
//     plain: { value: "飲む" },
//     polite: { value: "飲みます" },
//   },
//   {
//     key: "past",
//     label: "Past",
//     plain: { value: "飲んだ", observed: true, note: "Seen here" },
//     polite: { value: "飲みました" },
//   },
//   ...
// ]
console.log(table?.otherRows);
// [
//   { label: "Te-form", value: "飲んで" },
//   { label: "Potential", value: "飲める" },
// ]
By default, only high-confidence metadata generates a table. Medium-confidence metadata can be enabled explicitly:
generateJapaneseFormTable(metadata, { minConfidence: "medium" });
Form tables are deterministic once metadata is accepted. They should be treated as grammar support generated from validated metadata, not as a standalone Japanese dictionary. Consumers should prefer high-confidence metadata for learner-facing surfaces and should hide or soften tables when metadata is missing, low-confidence, or unsupported.
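The confidence threshold can be pictured as an ordered comparison. The sketch below is an assumption about how such a gate could work, not the package's implementation. The names are hypothetical, and the "high" default mirrors the default described above.

```typescript
// Hypothetical confidence gate; the ranking values are arbitrary but ordered.
type ConfidenceSketch = "low" | "medium" | "high";

const confidenceRankSketch: Record<ConfidenceSketch, number> = { low: 0, medium: 1, high: 2 };

// Defaults to "high" so that weaker metadata needs an explicit opt-in.
function meetsMinConfidenceSketch(actual: ConfidenceSketch, minConfidence: ConfidenceSketch = "high"): boolean {
  return confidenceRankSketch[actual] >= confidenceRankSketch[minConfidence];
}

console.log(meetsMinConfidenceSketch("medium")); // false
console.log(meetsMinConfidenceSketch("medium", "medium")); // true
```

A learner-facing surface could use the same comparison to decide whether to show a table as-is or soften it with a caveat.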
Resolve Korean as a future-language fixture:
import { normalizeTranslationDraft, resolveLanguageProfile } from "@edwinho/kotoba-core";
const profile = resolveLanguageProfile("ko");
const draft = normalizeTranslationDraft(
  {
    targetLanguage: "ko",
    sourceLanguage: "en",
    sourceText: "I'm going now",
    targetText: "저 지금 가요.",
    readingSystem: "revised_romanization",
    readingSegments: [
      { text: "저", reading: "jeo" },
      { text: "지금", reading: "jigeum" },
      { text: "가요", reading: "gayo" },
    ],
    romanization: "jeo jigeum gayo",
    translationText: "I'm going now.",
    register: "polite",
  },
  {
    source: "cloud",
    canRegenerateWithCloud: false,
  }
);
console.log(profile.defaultTTSLocale, draft.readingSystem, draft.register);
New-Language Checklist
Add future languages through bounded metadata and tests rather than changing the
TranslationDraft field shape:
- Add profile metadata in src/languages/languageProfiles.ts, including scripts, reading systems, locale metadata, register support, and variant support.
- Add script metadata to ScriptCode and reading metadata to ReadingSystem or a language-specific reading-system alias.
- Add detection fixtures for detectDirection when the target script can be detected locally.
- Add provider prompt guidance and response-schema additions in the provider package for language-specific reading, romanization, register, and enrichment expectations.
- Add normalization and sanitizer fixtures that prove language-specific metadata survives and unsupported fields are dropped.
- Add public API tests for profile resolution and any new runtime exports.
- Add product work separately if the language should appear in settings, navigation, add phrase, library, TTS/STT, or persistence flows.
Verification
Run package-level checks from the repository root:
bun run --cwd packages/core typecheck
bun run --cwd packages/core test
bun run --cwd packages/core build
From the repository root, bun run build, bun run typecheck, and bun run test run all package checks.
