@didrod2539/datalint

v0.1.0

Published

25 days ago

Lint CSV/TSV data files locally for quality issues: ragged rows, type drift, missing values, duplicates, mixed date formats, numeric outliers, and optional schema violations. Column profiling, JSON/Markdown reports, no dependencies on a data lib, no API k


didrod2539

csv tsv data-quality data-validation data-linter csv-validator data-profiling data-cleaning etl csv-lint tabular-data data-audit schema-validation cli

📊 datalint

Lint your CSVs before they break your pipeline — locally, no Python, no API key.

A deterministic CLI that profiles every column of a CSV/TSV file and lints it for data-quality problems — ragged rows, type drift, missing values, duplicates, mixed date formats, numeric outliers, and optional schema violations — with a quality score, A–F grade and JSON/Markdown reports.

One-line summary

datalint reads your CSV/TSV files, infers each column's type, profiles the data, and reports every quality issue that would trip up an import or analysis — 100% locally, no API key, no server, and no dependency on a data library (the CSV parser is hand-rolled).

Why this project exists

CSV is the universal data format, and it's almost always messy. A file that "looks fine" in a spreadsheet hides:

Ragged rows — an unescaped comma silently shifts every column after it.
Type drift — a number column with a stray N/A, —, or 1.2.3.
Mixed date formats — 2024-01-05 next to 01/06/2024 (which is which?).
Missing values, duplicates, stray whitespace, inconsistent casing (US vs us), and outliers that are really data-entry errors.

Eyeballing this doesn't scale, and feeding a 50k-row file to an LLM gets you a confident-but-wrong summary. You want a deterministic, repeatable audit you can run on every export and gate in CI. That's datalint.

Key features

🧱 Dependency-free CSV/TSV parser — RFC 4180 quotes, embedded newlines, escaped quotes, CRLF/LF, plus automatic delimiter detection.
🔎 Column profiling — inferred type, empty rate, distinct count, min/max/mean, and top values for every column.
🚦 12 built-in checks — ragged rows, duplicate/empty headers, empty columns/rows, missing values, type drift, whitespace, mixed date formats, inconsistent casing, duplicate rows, and numeric outliers (Tukey/IQR).
📐 Optional schema — required, type, enum, min/max, regex pattern, unique, not-null constraints per column.
📊 Quality score + A–F grade, per file and overall.
📄 JSON & Markdown export, colored console output, CI gate exit codes.
⚙️ Config file, custom delimiter, headerless mode, per-rule severities.
🔒 Runs entirely offline. Nothing is uploaded.

Install

# run without installing
npx @didrod2539/datalint scan data.csv

# or install
npm install -g @didrod2539/datalint    # global CLI (provides `datalint`)
npm install -D @didrod2539/datalint    # project dev-dependency (for CI)

Node ≥ 18. ESM + CJS + TypeScript types.

Quick start

datalint scan data.csv

data.csv  42/100 (F)  12 rows × 8 cols · comma
  • id integer · 11 distinct
  • email email · 12 distinct
  • country string · 5 distinct
  • signup_date date · 11 distinct
  • amount decimal · 10 distinct
  • note string · 3 distinct 75% empty
  ✗ 1 row(s) have a different column count than the header (8)
  ✗ Duplicate header "email" (columns 3 and 4)
  ⚠ Column "note" is 75.0% empty (9/12)
  ⚠ Column "amount" looks decimal but 1 value(s) don't match
  ⚠ Column "signup_date" mixes 2 date formats
  ⚠ 1 duplicate row(s)
  ℹ Column "country" has 1 value(s) that differ only by case

Overall  42/100 (F)  1 file(s), 12 row(s), 2 error(s), 4 warning(s), 1 info

CLI usage

datalint scan [...targets]    # analyze CSV/TSV files or directories
datalint report <input.json>  # re-render a saved JSON report as Markdown
datalint init                 # scaffold datalint.config.json (with a schema)
datalint --help
datalint --version

scan options:

| Option | Description |
| --- | --- |
| --config <file> | Path to a config file (otherwise auto-detected) |
| --delimiter <char> | , \t ; \| or auto (default) |
| --no-header | Treat the first row as data (synthesize column names) |
| --json <file> | Write a JSON report |
| --md <file> | Write a Markdown report |
| --min-score <n> | Exit non-zero if the overall score < n (CI gate) |
| --quiet | Hide info-level issues in the console |

Point scan at a directory and it recursively finds every *.csv, *.tsv, and *.txt file.
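To give a feel for what the auto delimiter setting has to decide, here is a minimal sketch of one common heuristic. It is not datalint's actual detection code: the names CANDIDATES, countOutsideQuotes and detectDelimiter are hypothetical, and the scoring rule is an assumption chosen for illustration. It also samples by splitting on line breaks, so it ignores newlines embedded in quoted fields.

```typescript
// Hypothetical helper names; an illustration, not datalint's implementation.
const CANDIDATES: string[] = [",", "\t", ";", "|"];

// Count how often `delimiter` appears outside double-quoted regions of a line.
function countOutsideQuotes(line: string, delimiter: string): number {
  let count = 0;
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (ch === '"') inQuotes = !inQuotes;
    else if (!inQuotes && ch === delimiter) count++;
  }
  return count;
}

// Score each candidate by (sampled lines that agree with the first line's
// count) x (that count). The highest score wins; ties keep the earlier
// candidate, and "," is the fallback when nothing matches.
function detectDelimiter(text: string, sampleLines = 10): string {
  const lines = text
    .split(/\r?\n/)
    .filter((l) => l.length > 0)
    .slice(0, sampleLines);
  let best = ",";
  let bestScore = 0;
  for (let c = 0; c < CANDIDATES.length; c++) {
    const counts = lines.map((l) => countOutsideQuotes(l, CANDIDATES[c]));
    const first = counts.length > 0 ? counts[0] : 0;
    if (first === 0) continue; // candidate absent from the first line
    const agreeing = counts.filter((n) => n === first).length;
    const score = agreeing * first;
    if (score > bestScore) {
      best = CANDIDATES[c];
      bestScore = score;
    }
  }
  return best;
}
```

When a heuristic like this guesses wrong on an unusual file, passing --delimiter explicitly sidesteps detection altogether.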

Example result

Full reports for the bundled sample files are in examples/sample-report.md and examples/sample-report.json.

📸 Screenshot / demo GIF placeholder: ./docs/screenshot.png — record the terminal running npx @didrod2539/datalint scan examples/messy.csv.

Configuration

Create datalint.config.json (or run datalint init):

{
  "delimiter": "auto",
  "hasHeader": true,
  "maxEmptyRate": 0.1,
  "enumThreshold": 20,
  "outlierIqrFactor": 1.5,
  "minScore": 80,
  "disableRules": [],
  "ruleSeverity": { "inconsistent-case": "warning" },
  "schema": [
    { "name": "id", "type": "integer", "required": true, "unique": true },
    { "name": "email", "type": "email", "notNull": true },
    { "name": "amount", "type": "decimal", "min": 0, "max": 100000 },
    { "name": "country", "enum": ["US", "CA", "UK"] }
  ]
}

| Field | Meaning |
| --- | --- |
| delimiter | "auto" or a literal delimiter |
| hasHeader | Whether row 1 is a header |
| maxEmptyRate | Warn columns above this empty rate (0–1) |
| enumThreshold | Max distinct values for casing checks to apply |
| outlierIqrFactor | Tukey IQR multiplier (1.5 default; 0 disables outliers) |
| minScore | CI gate threshold (overridable with --min-score) |
| disableRules | Rule ids to turn off |
| ruleSeverity | Override severity per rule id |
| schema | Optional per-column constraints |
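To make outlierIqrFactor concrete, the sketch below shows the standard Tukey fence calculation that the multiplier feeds into. The function names quantile and tukeyOutliers are hypothetical, and the quantile method (linear interpolation) and the minimum of four values are assumptions; datalint's own rule may compute quartiles differently.

```typescript
// Linearly interpolated quantile of an ascending-sorted array (q in [0, 1]).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Values outside [Q1 - factor * IQR, Q3 + factor * IQR] count as outliers.
// A factor of 0 (or less) disables the check, mirroring the config table.
// Requiring at least 4 values is an assumption made for this sketch.
function tukeyOutliers(values: number[], factor = 1.5): number[] {
  if (factor <= 0 || values.length < 4) return [];
  const sorted = values.slice().sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const low = q1 - factor * iqr;
  const high = q3 + factor * iqr;
  return values.filter((v) => v < low || v > high);
}
```

Raising the factor widens the fences, so fewer values are flagged; that is the knob to reach for when legitimately skewed data triggers the outliers rule.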

Rule ids: ragged-rows, duplicate-headers, empty-column, empty-row, missing-values, type-drift, whitespace, mixed-date-formats, inconsistent-case, duplicate-rows, outliers, and schema-*.
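As a rough illustration of how per-column schema constraints like the ones in the config above can be evaluated, here is a minimal sketch. ColumnRule and checkColumn are hypothetical names, the messages are invented, and the semantics are assumptions: required is read as "the column must exist" and notNull as "no empty cells". The type constraint is left out to keep the example short.

```typescript
// Hypothetical shape mirroring the schema entries in the example config.
interface ColumnRule {
  name: string;
  required?: boolean;
  notNull?: boolean;
  unique?: boolean;
  min?: number;
  max?: number;
  enum?: string[];
  pattern?: string;
}

// Returns one human-readable message per violation (data rows are 1-based).
function checkColumn(rule: ColumnRule, header: string[], rows: string[][]): string[] {
  const problems: string[] = [];
  const col = header.indexOf(rule.name);
  if (col === -1) {
    if (rule.required) problems.push("missing required column " + rule.name);
    return problems;
  }
  const seen: Record<string, boolean> = Object.create(null);
  const regex = rule.pattern !== undefined ? new RegExp(rule.pattern) : null;
  for (let r = 0; r < rows.length; r++) {
    const cell = (rows[r][col] || "").trim();
    const where = rule.name + " row " + (r + 1);
    if (cell === "") {
      if (rule.notNull) problems.push(where + ": empty value");
      continue; // remaining constraints only apply to non-empty cells
    }
    if (rule.unique) {
      if (seen[cell]) problems.push(where + ": duplicate value " + cell);
      seen[cell] = true;
    }
    if (rule.enum !== undefined && rule.enum.indexOf(cell) === -1) {
      problems.push(where + ": value not in enum");
    }
    if (regex !== null && !regex.test(cell)) {
      problems.push(where + ": pattern mismatch");
    }
    const n = Number(cell);
    if (!isNaN(n)) {
      if (rule.min !== undefined && n < rule.min) problems.push(where + ": below min");
      if (rule.max !== undefined && n > rule.max) problems.push(where + ": above max");
    }
  }
  return problems;
}
```

In this sketch an enum of ["US", "CA", "UK"] rejects a cell containing us, which is the same kind of mismatch the inconsistent-case rule reports at a lower severity when no schema is present.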

Real-world use cases

Gate a data pipeline in CI. Add datalint scan ./exports --min-score 85 to your workflow. A nightly export that arrives with shifted columns or a broken date format fails the build instead of corrupting downstream tables.
Vet a file before import. Before loading a vendor/marketing CSV into your warehouse, run datalint scan leads.csv --md audit.md and fix what it finds.
Profile an unfamiliar dataset. Run datalint scan dataset.csv to instantly see each column's type, null rate, distinct count and ranges — a fast EDA pass without spinning up a notebook.

Programmatic API

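import { readFile, writeFile } from "node:fs/promises";
import { analyze, buildReport, toMarkdown } from "@didrod2539/datalint";

// analyze() takes the raw file text plus a source label, so read the file first.
const content = await readFile("data.csv", "utf8");
const ds = analyze({ source: "data.csv", content });
console.log(ds.score, ds.grade, ds.profiles, ds.issues);

// Bundle one or more analyzed datasets into a report and render it as Markdown.
const report = buildReport([ds], { version: "0.1.0" });
await writeFile("report.md", toMarkdown(report));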

Roadmap

Excel (.xlsx) and Parquet input.
Cross-file referential checks (foreign keys across CSVs).
A --fix mode to auto-trim whitespace and normalize obvious issues.
An HTML report with charts.
A GitHub Action that comments data-quality results on PRs.
Streaming mode for very large files.

FAQ

Does it send my data anywhere? No. datalint runs entirely on your machine — no API key, no telemetry, no uploads, no network calls.

Do I need to define a schema? No. datalint is useful with zero config — it infers column types and catches drift, duplicates, missing values, etc. A schema is optional for stricter checks.

How does it parse CSV? With a small, hand-rolled RFC 4180 parser (no external CSV library) that handles quoted fields, embedded delimiters/newlines, escaped quotes and CRLF/LF — so behavior is fully predictable. Delimiter is auto-detected or set via config.
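For readers curious what a hand-rolled parser of this kind involves, the following is a minimal sketch of an RFC 4180-style state machine. It is written for illustration and is not the code shipped in datalint; parseCsv and raggedRowIndices are hypothetical names.

```typescript
// Parse delimited text into rows of fields. Handles quoted fields, doubled
// quotes ("") as an escaped quote, embedded delimiters and newlines inside
// quotes, and both CRLF and LF row endings.
function parseCsv(text: string, delimiter = ","): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;
  let i = 0;
  while (i < text.length) {
    const ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"' && text.charAt(i + 1) === '"') {
        field += '"'; // escaped quote inside a quoted field
        i += 2;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
        i++;
      } else {
        field += ch; // literal character, including delimiters and newlines
        i++;
      }
      continue;
    }
    if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      row.push(field);
      field = "";
    } else if (ch === "\r" || ch === "\n") {
      if (ch === "\r" && text.charAt(i + 1) === "\n") i++; // treat CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else {
      field += ch;
    }
    i++;
  }
  // Flush the last row when the input does not end with a line break.
  if (field.length > 0 || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Indices (header = 0) of rows whose field count differs from the header's.
function raggedRowIndices(rows: string[][]): number[] {
  const width = rows.length > 0 ? rows[0].length : 0;
  const bad: number[] = [];
  for (let r = 1; r < rows.length; r++) {
    if (rows[r].length !== width) bad.push(r);
  }
  return bad;
}
```

Because the whole parse is a single pass over the characters with one boolean of state, the same input always produces the same rows, which is what makes a ragged-rows check on top of it repeatable.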

How are dates / types detected? By deterministic pattern matching (src/infer.ts). Type inference is conservative; ambiguous cells fall back to string. The date check recognizes common ISO and slash/dot formats and flags a column that mixes more than one.
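The idea of conservative, pattern-based inference can be sketched as follows. This is an assumption-laden illustration rather than the contents of src/infer.ts: the regular expressions, the three date formats, the type names beyond those shown in the sample output, and the helpers inferCell, dateFormatOf and inferColumn are all hypothetical.

```typescript
type CellType = "empty" | "integer" | "decimal" | "date" | "email" | "string";

// A small, illustrative set of date shapes; a real tool would recognize more.
const DATE_FORMATS: { label: string; pattern: RegExp }[] = [
  { label: "YYYY-MM-DD", pattern: /^\d{4}-\d{2}-\d{2}$/ },
  { label: "NN/NN/YYYY", pattern: /^\d{1,2}\/\d{1,2}\/\d{4}$/ },
  { label: "NN.NN.YYYY", pattern: /^\d{1,2}\.\d{1,2}\.\d{4}$/ },
];

function dateFormatOf(cell: string): string | null {
  for (let i = 0; i < DATE_FORMATS.length; i++) {
    if (DATE_FORMATS[i].pattern.test(cell)) return DATE_FORMATS[i].label;
  }
  return null;
}

// Anything that does not match a strict pattern falls back to "string".
function inferCell(raw: string): CellType {
  const cell = raw.trim();
  if (cell === "") return "empty";
  if (/^[+-]?\d+$/.test(cell)) return "integer";
  if (/^[+-]?\d+\.\d+$/.test(cell)) return "decimal";
  if (dateFormatOf(cell) !== null) return "date";
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(cell)) return "email";
  return "string";
}

interface ColumnGuess {
  type: CellType;
  mismatches: number; // non-empty cells that do not fit the dominant type
  dateFormats: string[]; // distinct date shapes seen in the column
}

function inferColumn(cells: string[]): ColumnGuess {
  const counts: Record<string, number> = {};
  const dateFormats: string[] = [];
  let nonEmpty = 0;
  for (let i = 0; i < cells.length; i++) {
    const t = inferCell(cells[i]);
    if (t === "empty") continue;
    nonEmpty++;
    counts[t] = (counts[t] || 0) + 1;
    const fmt = dateFormatOf(cells[i].trim());
    if (fmt !== null && dateFormats.indexOf(fmt) === -1) dateFormats.push(fmt);
  }
  // The dominant type is the most frequent one; ties keep the earlier entry.
  const order: CellType[] = ["integer", "decimal", "date", "email", "string"];
  let type: CellType = "string";
  let best = 0;
  for (let i = 0; i < order.length; i++) {
    const n = counts[order[i]] || 0;
    if (n > best) {
      best = n;
      type = order[i];
    }
  }
  // Integers are accepted inside a decimal column (an assumption of this sketch).
  const matching = type === "decimal" ? best + (counts["integer"] || 0) : best;
  return { type, mismatches: nonEmpty - matching, dateFormats };
}
```

In this sketch a column is a type-drift candidate when mismatches is greater than zero and a mixed-date candidate when dateFormats holds more than one entry. A value such as 1.2.3 matches none of the numeric or date patterns, so it is treated as a string.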

Is the quality score official? No — it's a transparent metric: each issue costs a base penalty plus an amount scaled by how much of the data it affects, weighted by severity (src/score.ts). Use it to track and gate quality.
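The shape of such a metric can be sketched in a few lines. Every number below is an assumption made up for illustration (the severity weights, the base and scaled penalties, and the grade cutoffs); the real values live in src/score.ts and will differ. Issue, qualityScore and letterGrade are hypothetical names.

```typescript
type Severity = "error" | "warning" | "info";

interface Issue {
  severity: Severity;
  affectedRatio: number; // share of rows or cells affected, from 0 to 1
}

// Illustrative constants only; not taken from datalint's source.
const SEVERITY_WEIGHT: Record<Severity, number> = { error: 1, warning: 0.5, info: 0.1 };
const BASE_PENALTY = 5;
const SCALED_PENALTY = 20;

// Each issue costs weight * (base + scaled * affected share); the score is
// 100 minus the total, rounded and clamped at 0.
function qualityScore(issues: Issue[]): number {
  let penalty = 0;
  for (let i = 0; i < issues.length; i++) {
    const ratio = Math.min(1, Math.max(0, issues[i].affectedRatio));
    penalty += SEVERITY_WEIGHT[issues[i].severity] * (BASE_PENALTY + SCALED_PENALTY * ratio);
  }
  return Math.max(0, Math.round(100 - penalty));
}

// Conventional 90/80/70/60 cutoffs, assumed here for the A-F grade.
function letterGrade(score: number): string {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}
```

With these made-up constants, a single error touching every row costs 25 points, while an info-level issue touching every row costs only 2.5, which is the sense in which severity and coverage both feed the score.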

My valid data is being flagged — how do I silence it? Use disableRules, ruleSeverity, maxEmptyRate, or outlierIqrFactor in the config. Every heuristic is tunable.

Contributing

Contributions welcome! Each check is a small, self-contained rule in src/rules/. See CONTRIBUTING.md and the Code of Conduct.

git clone https://github.com/didrod205/datalint.git
cd datalint
npm install
npm test
npm run build
node dist/cli.js scan examples/messy.csv

License

MIT.

💖 Sponsor

datalint is free, MIT-licensed, and built in spare time. If it caught a bad export before it hit production, please consider supporting it:

⭐ Star this repo — free, and it helps others find it.
🍋 Sponsor via Lemon Squeezy — one-time or recurring.

Where your support goes: Excel/Parquet input, cross-file referential checks, a --fix autoclean mode, an HTML report, a PR-commenting GitHub Action, and fast issue responses.