@certifieddata/pii-scan

v0.1.1

Local PII risk scanner for datasets — identifies columns likely containing personal data before synthetic generation

Downloads

175

Local PII risk scanner for datasets. Scans CSV and JSON files for likely Personally Identifiable Information patterns using regex heuristics.

Runs entirely locally. No data leaves your machine. No network calls. No telemetry.


DISCLAIMER: This tool is a diagnostic aid, not a compliance control. It does NOT guarantee detection of all PII types. False positives and negatives are possible. Do not rely on this tool as a substitute for proper data governance or legal review.


Quick Start

npx @certifieddata/pii-scan ./customers.csv
npx @certifieddata/pii-scan ./users.json

No installation required. Works with Node.js 18+.


Install

npm install @certifieddata/pii-scan
# or
pnpm add @certifieddata/pii-scan

CLI Usage

pii-scan <file> [options]

Arguments:
  <file>        CSV or JSON file to scan

Options:
  --json        Output results as JSON (machine-readable)
  --no-color    Disable color output (auto-detected in CI)
  -h, --help    Show help

Exit codes:
  0   No PII patterns detected
  1   PII patterns found (LOW or MEDIUM risk)
  2   HIGH risk PII found
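
The exit codes above can be mirrored in a script that wraps the scanner. This is a minimal sketch, assuming the exit code depends only on the highest risk level seen. `exitCodeFor` is a hypothetical helper, not part of the package, and using `null` for "no findings" is an assumption, since the README does not say how a clean result is represented.

```typescript
type RiskLevel = "HIGH" | "MEDIUM" | "LOW";

// Hypothetical helper: maps the highest risk seen to the documented exit code.
// `null` stands for "no findings" (an assumption made for this sketch).
function exitCodeFor(overallRisk: RiskLevel | null): 0 | 1 | 2 {
  if (overallRisk === null) return 0;   // no PII patterns detected
  if (overallRisk === "HIGH") return 2; // HIGH risk PII found
  return 1;                             // LOW or MEDIUM risk
}

console.log(exitCodeFor("MEDIUM")); // 1
```

Because both 1 and 2 are non-zero, a plain `&&` chain treats any finding as a failure; a wrapper that only wants to fail on HIGH risk would need to check for exit code 2 explicitly.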

Examples

# Scan a CSV
npx @certifieddata/pii-scan ./customers.csv

# Scan JSON, get JSON output
npx @certifieddata/pii-scan ./records.json --json

# Use in CI (exits non-zero if PII found)
npx @certifieddata/pii-scan ./test-data.csv && echo "Clean"

Example Output

pii-scan — local PII risk scanner
────────────────────────────────────────────────────────────
  File   : /path/to/customers.csv
  Rows   : 1,000
  Columns: 12
────────────────────────────────────────────────────────────

  Findings

  [HIGH] email
         Email Address (column name)
         Email Address (43 matches in content)  e.g. jo**@ex*****.com, ma**@gm***.com

  [HIGH] phone
         Phone column (column name)
         US Phone Number (38 matches in content)  e.g. 55*********00, 61*********87

  [MEDIUM] address
         Address column (column name)

────────────────────────────────────────────────────────────
  Overall risk : [HIGH]
  Findings     : 3 (2 HIGH  1 MEDIUM)

  3 potential PII finding(s) across 2 column(s). 2 HIGH risk.
  Do not use this dataset in lower environments without synthetic replacement.

  Next step: Generate a certified synthetic replacement at
  https://certifieddata.io

Library API

import { scanContent, scanColumns } from "@certifieddata/pii-scan";

// Scan file content
const result = scanContent(fileContents, "customers.csv");
console.log(result.overallRisk);   // "HIGH" | "MEDIUM" | "LOW"
console.log(result.findings);      // ColumnFinding[]
console.log(result.summary);       // Human-readable summary

// Scan pre-parsed columns
const columns = {
  email: ["alice@example.com", "bob@example.com"],
  age:   ["28", "34"],
};
const result2 = scanColumns(columns, "dataset.json");

Types

type RiskLevel = "HIGH" | "MEDIUM" | "LOW";

interface ColumnFinding {
  column: string;
  patternName: string;
  risk: RiskLevel;
  matchCount: number;
  sampleValues: string[];  // redacted (e.g. "al**@ex*****.com")
  source: "content" | "column_name";
}

interface ScanResult {
  file: string;
  rowsScanned: number;
  columnsScanned: number;
  findings: ColumnFinding[];
  overallRisk: RiskLevel;
  summary: string;
}
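
Since a column can produce more than one finding (one from its name, one from its content), callers may want a per-column view. The sketch below redeclares the relevant types so it is self-contained. `highestRiskByColumn` is a hypothetical helper and the sample findings are made up for illustration.

```typescript
type RiskLevel = "HIGH" | "MEDIUM" | "LOW";

interface ColumnFinding {
  column: string;
  patternName: string;
  risk: RiskLevel;
  matchCount: number;
  sampleValues: string[];
  source: "content" | "column_name";
}

// Ordering used to compare risk levels (higher number = more severe).
const RISK_ORDER: Record<RiskLevel, number> = { LOW: 1, MEDIUM: 2, HIGH: 3 };

// Hypothetical helper: highest risk per column across all of its findings.
function highestRiskByColumn(findings: ColumnFinding[]): Map<string, RiskLevel> {
  const byColumn = new Map<string, RiskLevel>();
  for (const f of findings) {
    const current = byColumn.get(f.column);
    if (current === undefined || RISK_ORDER[f.risk] > RISK_ORDER[current]) {
      byColumn.set(f.column, f.risk);
    }
  }
  return byColumn;
}

// Hand-written sample data shaped like the documented types (values are made up).
const sample: ColumnFinding[] = [
  { column: "zip", patternName: "US ZIP Code", risk: "LOW", matchCount: 12, sampleValues: [], source: "content" },
  { column: "email", patternName: "Email Address", risk: "HIGH", matchCount: 43, sampleValues: [], source: "content" },
  { column: "email", patternName: "Email Address", risk: "MEDIUM", matchCount: 0, sampleValues: [], source: "column_name" },
];

const perColumn = highestRiskByColumn(sample);
console.log(perColumn.get("email"), perColumn.get("zip")); // HIGH LOW
```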

What It Detects

Content Patterns (HIGH risk)

  • Email addresses
  • US Social Security Numbers (SSN)
  • Credit / debit card numbers (Visa, MC, Amex, Discover)
  • US phone numbers
  • Passport / government ID numbers (letter + digits)
  • US bank routing numbers (ABA 9-digit)

Content Patterns (MEDIUM risk)

  • IPv4 addresses
  • Date of birth formats
  • US street addresses

Content Patterns (LOW risk)

  • US ZIP codes

Column Name Heuristics

Flags columns whose names suggest PII (e.g. email, ssn, dob, phone, first_name, patient_id) regardless of content. This is useful when values are already masked.
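
A name-based check of this kind can be sketched as follows. The name list only contains the examples quoted above, the exact-match rule and the normalization step are assumptions, and both function names are hypothetical; the package's real list and matching logic live in its source.

```typescript
// Illustrative list only: these names come from the examples above,
// not from the package's actual pattern file.
const SUGGESTIVE_NAMES = new Set(["email", "ssn", "dob", "phone", "first_name", "patient_id"]);

// Normalize so "First Name", "first-name" and "firstName" all become "first_name".
function normalizeColumnName(name: string): string {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1_$2") // split camelCase boundaries
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")            // spaces, dashes, dots -> underscore
    .replace(/^_+|_+$/g, "");               // trim leading/trailing underscores
}

// Exact match against the list after normalization (an assumption of this sketch).
function looksLikePiiColumn(name: string): boolean {
  return SUGGESTIVE_NAMES.has(normalizeColumnName(name));
}

console.log(looksLikePiiColumn("firstName")); // true
console.log(looksLikePiiColumn("age"));       // false
```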


What It Does NOT Detect

  • Names embedded in free text (no NLP)
  • Non-US national ID formats
  • Device fingerprints or behavioral identifiers
  • PII hidden in binary formats (images, PDFs, audio)
  • Encoded or encrypted PII
  • De-anonymization risk from quasi-identifiers

This tool catches obvious, structured PII patterns. It is not a substitute for a full data classification system or legal review.


Supported Formats

| Format | Notes |
|--------|-------|
| CSV | Header row required; handles basic quoting |
| JSON | Array of objects; also handles { data: [...] } and { rows: [...] } wrappers |
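
To make the JSON shapes concrete, here is a minimal sketch of unwrapping the three accepted layouts and pivoting rows into the column-oriented object that `scanColumns` takes in the Library API example. `unwrapRows` and `toColumns` are hypothetical names, and the handling of missing or null values is an assumption rather than the package's documented behaviour.

```typescript
type Row = Record<string, unknown>;

// Accepts a bare array, or an object whose `data` or `rows` property is an array.
function unwrapRows(parsed: unknown): Row[] {
  if (Array.isArray(parsed)) return parsed as Row[];
  if (parsed !== null && typeof parsed === "object") {
    const obj = parsed as Record<string, unknown>;
    if (Array.isArray(obj.data)) return obj.data as Row[];
    if (Array.isArray(obj.rows)) return obj.rows as Row[];
  }
  throw new Error("Expected an array of objects or a { data } / { rows } wrapper");
}

// Pivot rows into { columnName: string[] }. Null and undefined become "".
// Rows that omit a key simply contribute no value to that column.
function toColumns(rows: Row[]): Record<string, string[]> {
  const columns: Record<string, string[]> = {};
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      const value = row[key];
      const cell = value === null || value === undefined ? "" : String(value);
      const existing = columns[key];
      if (existing) existing.push(cell);
      else columns[key] = [cell];
    }
  }
  return columns;
}

const wrapped = JSON.parse('{"data":[{"email":"alice@example.com","age":28}]}');
console.log(toColumns(unwrapRows(wrapped)));
```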


Use in CI

Add to GitHub Actions to block PRs that add PII to test fixtures:

- name: Scan test data for PII
  run: npx @certifieddata/pii-scan ./tests/fixtures/customers.csv
  # Exits 2 on HIGH risk, 1 on LOW or MEDIUM findings, 0 if clean

How detection works

scanContent() takes the file contents and a file name, auto-detects the format (CSV or JSON), and passes the parsed columns to scanColumns(). For each column:

  1. Column name check — the column name is matched against a list of PII-suggestive names (e.g. email, ssn, dob, patient_id). This fires even when values are already masked.
  2. Content scan — up to 200 rows of content are regex-tested against each pattern.
  3. Redaction — matching values in output show only the first 2 and last 2 characters; the middle is masked.
  4. Risk aggregation — overall risk is the maximum risk level of any single finding.
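
Steps 2 to 4 can be sketched in a few lines. This is an illustration of the described behaviour under stated assumptions, not the package's code: the function names are hypothetical, the treatment of very short values in `redact` is a guess, and the generic mask here does not reproduce the email-shaped samples shown in the example output.

```typescript
type RiskLevel = "HIGH" | "MEDIUM" | "LOW";

const MAX_ROWS = 200; // sampling cap described in step 2
const RISK_ORDER: Record<RiskLevel, number> = { LOW: 1, MEDIUM: 2, HIGH: 3 };

// Step 2: count values among the first MAX_ROWS that match a pattern.
// The pattern must not use the `g` flag, because `test` is stateful with it.
function countMatches(values: string[], pattern: RegExp): number {
  return values.slice(0, MAX_ROWS).filter((v) => pattern.test(v)).length;
}

// Step 3: keep the first 2 and last 2 characters and mask the middle.
// Values of 4 characters or fewer are fully masked (an assumption).
function redact(value: string): string {
  if (value.length <= 4) return "*".repeat(value.length);
  return value.slice(0, 2) + "*".repeat(value.length - 4) + value.slice(-2);
}

// Step 4: overall risk is the maximum over all findings; null if there are none.
function maxRisk(risks: RiskLevel[]): RiskLevel | null {
  let best: RiskLevel | null = null;
  for (const r of risks) {
    if (best === null || RISK_ORDER[r] > RISK_ORDER[best]) best = r;
  }
  return best;
}

console.log(redact("5551234500"));               // 55******00
console.log(maxRisk(["LOW", "HIGH", "MEDIUM"])); // HIGH
```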

Limitations and false positives

  • US-centric: phone, SSN, ZIP, routing number, and address patterns target US formats only.
  • Regex-based: no NLP or ML. Patterns like addresses and ZIPs have high false-positive rates in non-address columns.
  • Detection only: this tool does not modify, redact, or rewrite your data.
  • Sampling: only the first 200 rows per column are scanned for performance.
  • 9-digit sequences: the routing number pattern (\b\d{9}\b) will match any 9-digit number — treat those findings as advisory in non-financial datasets.
  • Not a compliance tool: findings are advisory only. They do not constitute a legal assessment under GDPR, HIPAA, CCPA, or any other regulation.
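
The 9-digit caveat is easy to reproduce. The sketch below uses the pattern quoted above and, as one possible way to triage such findings, the standard ABA routing-number checksum. `passesAbaChecksum` is a hypothetical post-filter for illustration; nothing in this README says pii-scan applies it.

```typescript
// Pattern quoted in the limitation above.
const NINE_DIGITS = /\b\d{9}\b/;

// ABA checksum: 3(d1+d4+d7) + 7(d2+d5+d8) + (d3+d6+d9) must be divisible by 10.
// Hypothetical post-filter, not part of the package.
function passesAbaChecksum(candidate: string): boolean {
  if (!/^\d{9}$/.test(candidate)) return false;
  let sum = 0;
  for (let i = 0; i < 9; i++) {
    const digit = candidate.charCodeAt(i) - 48; // "0" has char code 48
    const weight = i % 3 === 0 ? 3 : i % 3 === 1 ? 7 : 1;
    sum += weight * digit;
  }
  return sum % 10 === 0;
}

console.log(NINE_DIGITS.test("order 123456789 shipped")); // true: matched, yet not a routing number
console.log(passesAbaChecksum("123456789"));              // false
console.log(passesAbaChecksum("021000021"));              // true
```

A checksum pass only narrows the candidates: roughly one in ten arbitrary 9-digit numbers still satisfies it, so such findings remain advisory.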

Privacy and security model

@certifieddata/pii-scan has zero runtime dependencies and makes no network calls.

  • Does not import http, https, fetch, or any telemetry SDK
  • Does not phone home, log usage, or report findings anywhere
  • Reads only the file you explicitly pass as an argument
  • Produces output only to stdout/stderr

You can verify this by reading the source directly: src/patterns.ts, src/scanner.ts, src/cli.ts


Development

git clone https://github.com/certifieddata/certifieddata-public
cd certifieddata-public/packages/pii-scan
pnpm install
pnpm build
pnpm test
pnpm lint
pnpm typecheck

Node.js 18+ required.


Replacing PII with Certified Synthetic Data

When this tool flags real PII in your dataset, the next step is to replace it with a certified synthetic equivalent — structurally identical, statistically representative, and cryptographically attested to contain no real personal data.

CertifiedData.io — Generate certified synthetic datasets with Ed25519-signed certificates, independently verifiable by any auditor.


License

MIT — see LICENSE

Part of the certifieddata-public open-source toolkit.