@flexorch/audit
v0.6.0
Published
Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)
Maintainers
Readme
@flexorch/audit
Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
Why
Before feeding documents into an LLM pipeline you need to answer three questions:
- Does this text contain personal data? Sending PII to a language model is a compliance risk.
- Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
- How bad is the noise? Garbled encodings and symbol clutter degrade model output silently.
Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. @flexorch/audit answers all three with one call — using only regex and Node.js built-ins. No model weights, no network calls, no external packages.
Features
- Quality grade — A/B/C/D composite score: is this text LLM-ready at a glance?
- Noise ratio — line-level symbol clutter detection (
noise_ratio); values above 0.20 indicate likely extraction artifacts - PII detection — 30+ types across 8 countries (TR/DE/FR/IT/NL/ES/UK/US) + universal types; all regex-based with checksum validation
- Batch audit —
auditBatch()aggregates duplicate ratio and PII counts across an entire dataset in one call - Masking — four strategies: redact, replace (synthetic), token, hash
- Zero runtime dependencies — pure Node.js built-ins, Node 18+
- TypeScript-first — full type definitions, no
@types/package needed
Install
npm install @flexorch/auditQuick start
import { audit, mask } from "@flexorch/audit"
import { readFileSync } from "fs"
const text = readFileSync("contract.txt", "utf8") // extract from PDF/DOCX first
const result = audit(text) // "und" by default — all detectors active
// const result = audit(text, { locale: "tr" }) // restrict to TR-only detectors
result.quality_grade // "B"
result.quality_score // 0.73 (0.0–1.0 composite)
result.noise_ratio // 0.04 (fraction of blank/garbage lines; >0.20 = low quality)
result.detected_language // "und" (locale you passed in; caller controls language)
result.pii_summary // [{ type: "email", count: 2 }, { type: "national_id_tr", count: 1 }]
result.pii // [{ type: "email", value: "[email protected]", start: 8, end: 23 }]
result.quality // { completeness: 1.0, avg_length: 342, duplicate_ratio: null }
result.noise // { garbage_ratio: 0.0, encoding_ok: true }
const clean = mask(text, result.pii, { strategy: "redact" })
// "Contact: [REDACTED_EMAIL]"Batch audit
import { auditBatch } from "@flexorch/audit"
const texts = dataset.map((r) => r.text)
const batch = auditBatch(texts) // locale: "und" by default
batch.duplicate_ratio // 0.12 — fraction of exact-duplicate records
batch.avg_quality_score // 0.78
batch.pii_summary // [{ type: "email", count: 47 }, ...]
batch.results // AuditResult[], one per textCountry coverage
| locale | Detectors activated |
|----------|---------------------|
| "und" (default) | All locales combined — use when document language is unknown |
| "all" | Alias for "und" |
| "tr" | TCKN · VKN · phone_tr · name · IBAN_TR · company_name_tr · MERSIS · postal_code_tr · province_tr |
| "de" | Steueridentifikationsnummer · Sozialversicherungsnummer |
| "fr" | SIREN · SIRET · INSEE/NIR |
| "it" | Codice Fiscale · Partita IVA |
| "nl" | BSN · KvK |
| "es" | DNI/NIE · CIF |
| "uk" | NI number · UTR |
| "us" | SSN · EIN · ITIN |
| "eu" | E.164 phone · IBAN (EU+GB+CH+NO) · company name |
Universal detectors (always active regardless of locale): email · iban · credit_card · ip · ip_v6
Language detection:
@flexorch/auditis zero-dependency — no language detection library is included. Pass the correctlocaleyourself, or use"und"(default) to activate all detectors.
PII types
Universal
| Type | Description |
|------|-------------|
| email | RFC-5321 email address |
| iban | ISO 13616 IBAN — mod-97 validated; suppressed when iban_tr or iban_intl fires on same span |
| credit_card | 16-digit groups, Luhn-validated |
| ip | IPv4 address |
| ip_v6 | IPv6 — full, compressed ::, loopback forms |
Turkey (locale="tr")
| Type | Description |
|------|-------------|
| national_id_tr | TCKN — 11-digit, modular arithmetic checksum |
| tax_id_tr | VKN — 10-digit, Luhn-variant checksum |
| phone_tr | Turkish mobile: +90/0 prefix + 10 digits |
| name | Label-prefixed name: Adı:, Full Name:, Customer Name:, etc. |
| iban_tr | Turkish IBAN (TR + 24 chars), mod-97 validated |
| company_name_tr | Company with TR legal suffix: A.Ş. · Ltd.Şti. · Koll.Şti. · Koop. · T.A.Ş. |
| mersis_no | MERSIS — 16-digit company registry number |
| postal_code_tr | Turkish postal code (province plate 01–81) |
| province_tr | All 81 Turkish provinces |
Germany (locale="de")
| Type | Description |
|------|-------------|
| tax_id_de | Steueridentifikationsnummer — 11 digits, ISO 7064 MOD 11,2 checksum |
| social_id_de | Sozialversicherungsnummer — area + DOB + letter + serial |
France (locale="fr")
| Type | Description |
|------|-------------|
| siret_fr | SIRET — 14 digits, label-prefix gated |
| company_id_fr | SIREN — 9 digits, label-prefix gated |
| social_id_fr | INSEE/NIR — 15 digits, starts with 1 or 2 |
Italy (locale="it")
| Type | Description |
|------|-------------|
| national_id_it | Codice Fiscale — 16 chars alphanumeric, uppercase normalized |
| tax_id_it | Partita IVA — 11 digits, Agenzia delle Entrate checksum |
Netherlands (locale="nl")
| Type | Description |
|------|-------------|
| national_id_nl | BSN — 9 digits, 11-check (weighted sum mod 11) |
| company_id_nl | KvK — 8 digits, label-prefix gated |
Spain (locale="es")
| Type | Description |
|------|-------------|
| national_id_es | DNI (8 digits + letter, mod-23) and NIE (X/Y/Z prefix, same check) |
| tax_id_es | CIF — letter prefix + 7 digits + control character |
United Kingdom (locale="uk")
| Type | Description |
|------|-------------|
| social_id_uk | NI number — 2 letters + 6 digits + A/B/C/D; HMRC forbidden prefixes excluded |
| tax_id_uk | UTR — 10 digits, label-prefix gated |
United States (locale="us")
| Type | Description |
|------|-------------|
| ssn | SSN — ###-##-####, invalid prefixes (000/666/9xx) excluded |
| tax_id_us | EIN — XX-XXXXXXX, IRS invalid area prefixes excluded |
| national_id_us | ITIN — 9XX-7X/8X/9X-XXXX middle group validated |
EU / International (locale="eu")
| Type | Description |
|------|-------------|
| phone_intl | E.164 international phone — 7–15 digits, TR (+90) excluded |
| iban_intl | IBAN for EU+GB+CH+NO — ISO 13616 country+length table + mod-97 |
| company_name_intl | Company with international suffix: GmbH · LLC · S.r.l. · B.V. · SAS · Inc. · Ltd. etc. |
Noise detection
noise_ratio measures the fraction of lines that are blank or contain symbol clutter:
const result = audit("clean line\n@@@garbage\n\nclean")
result.noise_ratio // 0.5 (2 noisy lines out of 4)A line is "noisy" when it is blank (after trim) or contains 3+ consecutive characters from @ # ! ~ * =.
| noise_ratio | Signal |
|---------------|--------|
| < 0.05 | Clean — likely well-extracted text |
| 0.05–0.20 | Acceptable — minor formatting artifacts |
| > 0.20 | Low quality — likely OCR noise or extraction failure |
Masking strategies
const clean = mask(text, result.pii) // redact (default)
const clean = mask(text, result.pii, { strategy: "token" })
const clean = mask(text, result.pii, { strategy: "hash" })
const clean = mask(text, result.pii, { strategy: "replace" })| Strategy | Example output |
|----------|----------------|
| redact (default) | [REDACTED_EMAIL] |
| replace | [email protected] (static synthetic) |
| token | <PII_EMAIL_1> (unique per type per call) |
| hash | [3d4f9a1b2c8e7f0a] (SHA-256 first 16 hex chars) |
TypeScript
Full type definitions — no @types/ package needed:
import {
audit, auditBatch, mask,
type AuditResult, type BatchAuditResult,
type PiiFinding, type AuditOptions,
} from "@flexorch/audit"AuditResult includes:
interface AuditResult {
quality_grade: "A" | "B" | "C" | "D"
quality_score: number
noise_ratio: number
detected_language: string
pii_summary: { type: string; count: number }[]
pii: { type: string; value: string; start: number; end: number }[]
quality: { completeness: number; avg_length: number; duplicate_ratio: number | null }
noise: { garbage_ratio: number; encoding_ok: boolean }
}Quality grade
quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:
| Grade | Score | Signal | |-------|-------|--------| | A | ≥ 0.85 | Ready for LLM training or RAG | | B | ≥ 0.65 | Usable with minor cleanup | | C | ≥ 0.40 | Review before use | | D | < 0.40 | Not suitable — empty, too short, or high noise |
Score formula: completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2)lengthScore = Math.min(charCount / 500, 1.0) · noiseScore = Math.max(0, 1 − garbageRatio × 10)
Limitations
- No automatic language detection —
@flexorch/audithas zero dependencies. Passlocaleexplicitly, or use the default"und"to activate all detectors. - Free-standing name detection (without a label prefix) requires NLP/NER — not included.
replacemasking uses static synthetic values; locale-aware realistic synthesis is not implemented.- The library audits plain text. PDF/DOCX parsing, e-invoice extraction, and pipeline orchestration are out of scope.
Also available for Python
pip install flexorch-auditContributing
See CONTRIBUTING.md.
License
MIT
