text-kind
v1.0.0
Detect and classify text kinds: western, CJK, data/code, mixed, or unknown
text-kind
A TypeScript/JavaScript library to detect and classify text kinds: western, CJK, data/code, mixed, or unknown.
Installation
npm install text-kind
Usage
import { detectTextKind } from 'text-kind';
// Detect English text
const result1 = detectTextKind('Hello world, this is some English text.');
console.log(result1.kind); // 'western'
console.log(result1.confidence); // 0.834
// Detect CJK text
const result2 = detectTextKind('これは日本語のテキストです。');
console.log(result2.kind); // 'cjk'
// Detect JSON
const result3 = detectTextKind('{"name": "John", "age": 30}');
console.log(result3.kind); // 'data_code'
console.log(result3.details.jsonLikely); // true
// Detect CSV
const result4 = detectTextKind(`name,age,city
John,25,NYC
Jane,30,LA`);
console.log(result4.kind); // 'data_code'
console.log(result4.details.csvLikely); // true
API
detectTextKind(text: string, sampleSize?: number): DetectionResult
Analyzes the provided text and returns a classification result.
Parameters:
- text: The text to analyze
- sampleSize: Optional. Maximum number of characters to analyze (default: 20000)
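The effect of sampleSize can be pictured with a small sketch. It assumes the library analyzes only the leading portion of the input; this README does not say how the sample is chosen, so `takeSample` below is a hypothetical helper and not part of text-kind.

```typescript
// Hypothetical helper, not part of text-kind: keeps at most `sampleSize`
// UTF-16 code units from the start of the input. Head truncation is an
// assumption; the library's real sampling strategy is not documented here.
function takeSample(text: string, sampleSize: number = 20000): string {
  if (sampleSize <= 0) return '';
  if (text.length <= sampleSize) return text;

  let end = sampleSize;
  // If the cut would separate a surrogate pair (e.g. an emoji or a rare
  // CJK ideograph), back up one unit so no half character is left behind.
  const last = text.charCodeAt(end - 1);
  if (last >= 0xd800 && last <= 0xdbff) end -= 1;

  return text.slice(0, end);
}
```

A caller wanting faster results on very large inputs would pass a smaller sampleSize, trading some accuracy for less text scanned.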
Returns: A DetectionResult object containing:
interface DetectionResult {
  kind: 'western' | 'cjk' | 'data_code' | 'mixed' | 'unknown';
  confidence: number; // 0 to 1
  scores: {
    western: number;
    cjk: number;
    data_code: number;
  };
  reasons: string[]; // Human-readable detection signals
  details: DetectionDetails; // Detailed character counts and flags
}
Text Kinds
- western: ASCII/Latin-based languages (English, Spanish, French, etc.)
- cjk: Chinese, Japanese, Korean text
- data_code: JSON, CSV/TSV, SQL, or source code
- mixed: Text that has significant presence of multiple kinds
- unknown: Text that doesn't clearly fit into other categories
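One way to act on the five kinds listed above is an exhaustive switch. The sketch below is illustrative only: `TextKind` is a local copy of the union from DetectionResult, and `chooseTokenizer` and its strategy names are hypothetical, not part of text-kind.

```typescript
// Local copy of the `kind` union so the sketch stands alone.
type TextKind = 'western' | 'cjk' | 'data_code' | 'mixed' | 'unknown';

// Hypothetical helper: maps each kind to an illustrative processing strategy.
function chooseTokenizer(kind: TextKind): string {
  switch (kind) {
    case 'western':
      return 'whitespace'; // split on spaces and punctuation
    case 'cjk':
      return 'character'; // no spaces between words, go per character
    case 'data_code':
      return 'none'; // leave structured data untouched
    case 'mixed':
    case 'unknown':
      return 'fallback'; // no single strategy clearly applies
    default: {
      // Compile-time exhaustiveness check: fails to type-check
      // if a new kind is added to the union without a case here.
      const unreachable: never = kind;
      return unreachable;
    }
  }
}
```

With the real library, the argument would be the `kind` field of the object returned by detectTextKind.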
Features
- Unicode-aware: Uses Unicode property escapes when available, falls back to character ranges
- Multiple format detection: JSON, CSV/TSV, SQL, and general source code patterns
- Confidence scoring: Provides confidence levels and detailed reasoning
- Efficient: Processes large texts by sampling (configurable sample size)
- Detailed analysis: Returns character counts, script distributions, and detection flags
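The Unicode-aware fallback mentioned in the feature list can be sketched as follows. This is a minimal illustration of the general technique, not text-kind's actual implementation: `buildCjkPattern` and `countCjk` are hypothetical names, and the fallback ranges cover only a few common blocks.

```typescript
// Hypothetical helper: builds a matcher for CJK characters, preferring
// Unicode property escapes and falling back to explicit ranges.
function buildCjkPattern(): RegExp {
  try {
    // The RegExp constructor throws a SyntaxError on engines that do not
    // support \p{...} property escapes, which triggers the fallback.
    return new RegExp(
      '[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}]',
      'gu'
    );
  } catch (_err) {
    // Fallback ranges: Hiragana, Katakana, CJK Unified Ideographs,
    // Hangul Syllables. Rarer blocks and extensions are not covered.
    return new RegExp('[\\u3040-\\u309f\\u30a0-\\u30ff\\u4e00-\\u9fff\\uac00-\\ud7af]', 'g');
  }
}

// Counts the CJK characters in `text` using whichever pattern is available.
function countCjk(text: string): number {
  const matches = text.match(buildCjkPattern());
  return matches ? matches.length : 0;
}

console.log(countCjk('Hello 世界!')); // 2
```

Building the pattern from a string rather than a regex literal keeps the unsupported syntax from failing at parse time, so the try/catch can handle it at runtime.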
Examples
Mixed Content
const mixed = detectTextKind('Hello 世界! {"status": "ok"}');
console.log(mixed.kind); // 'mixed'
console.log(mixed.reasons); // ['CJK characters: 2', 'ASCII letters: 5', ...]
Code Detection
const code = detectTextKind(`
function hello() {
  console.log("Hello world");
  return true;
}
`);
console.log(code.kind); // 'data_code'
console.log(code.details.codeTokenHits); // 4
License
MIT
