piivacy
v0.1.0
Published
PIIvacy — PII scrubber for LLM calls. Three substitution modes (token, realistic, pass-through), pluggable patterns, BYO-LLM helpers, optional in-browser/local-model name redaction. Two functions, no proxy, no server.
Maintainers
Readme
PIIvacy
A serverless PII scrubber for LLM calls. Two functions. Three substitution modes. No proxy. No server. Zero runtime dependencies.
The npm package is
piivacy. The brand is PIIvacy.
npm install piivacyimport { scrub, restore, createSession } from 'piivacy';
const session = createSession();
const { text } = await scrub('Email me at [email protected]', session);
// "Email me at [[EMAIL_1]]"
// ...send `text` to your LLM, get a response that uses the tokens...
const restored = restore(llmResponse, session);
// "Sure, I will email [email protected] today"That's the whole loop. The package never makes an HTTP call. Everything happens inside your process.
What it actually catches (and what it doesn't)
Regex catches most common PII — emails, phones, SSNs, credit cards, API keys, addresses, etc. — across 35+ patterns in 6 categories. It will miss things a human or LLM would catch (names, codenames, oblique references, free-form sensitive context). For those, the package ships an opt-in second-pass LLM check that you run with any chat model you already have. Combined coverage gets you very close to "no PII"; the goal is an asymptote, not a guarantee.
If you need cryptographic guarantees, hire a security team. If you need "stop accidentally pasting the customer's SSN into ChatGPT", this is for you.
Quick decision tree
| You want… | Use mode |
|---|---|
| Maximum privacy, restore-bulletproof | token (default) |
| Natural-sounding text for the LLM, restore via fakes | realistic |
| The LLM legitimately needs the value (e.g. an address for local search) | pass-through |
You can mix and match per category, per label, or via built-in presets.
The full API
Core
const { text, session } = await scrub(input, existingSession?, opts?)Async. Walks default + custom patterns in priority order; replaces matches according to the mode resolver. Returns the scrubbed text and the session (so you can keep using it across turns).
const restored = restore(llmResponse, session)Synchronous. Three-pass longest-match-first restore: tokens → fakes → reference forms ("Marcus", "Marcus's", "Mr. Chen"). LLM-invented or truncated tokens pass through unchanged.
Sessions
const session = createSession({ ttlMs?: 1_800_000, nameAdapter?: NameSubstitutionAdapter })
isExpired(session) // boolean
registerSecret(session, value, label = 'CUSTOM') // pre-flag a literal value
listRedactions(session) // [{ kind, identifier, label, value, count, firstSeenAt, lastSeenAt }]Sessions are JSON-serializable plain objects. Sliding TTL — every operation refreshes expiresAt. Persistence (Redis, file, etc.) is the caller's responsibility.
Pluggable patterns
registerPattern({ label, regex, category, priority?, validate?, fake?, referenceForms?, description? })
unregisterPattern(label)
listPatterns() // [{ label, category, priority, description, hasRealistic }]Modes & presets
import { presets } from 'piivacy';
await scrub(text, session, presets.maximumRedaction); // token everything
await scrub(text, session, presets.naturalConversation); // contact + location realistic
await scrub(text, session, presets.localSearch); // location pass-through
await scrub(text, session, presets.testFriendly); // realistic where possible, token for dangerBYO-LLM helpers
Second-pass detection (catch what regex missed):
buildPiiCheckPrompt(scrubbedText, opts?) // → { system, user }
parsePiiCheckResponse(rawText) // → { issues, parseError? }
applyPiiCheckIssues(session, issues, { minConfidence }) // → count appliedDynamic mode picking (let an LLM decide what the assistant needs preserved):
buildScrubIntentPrompt(text, opts?) // → { system, user }
parseScrubIntentResponse(rawText) // → { decisions, reason }
applyScrubIntent(decisions, baseOpts?) // → scrub optsMode comparison matrix
| Aspect | token | realistic | pass-through |
|---|---|---|---|
| Restore reliability | bulletproof | longest-match (good) | n/a (value never changed) |
| LLM fluency | poor ([[EMAIL_1]]) | excellent ([email protected]) | excellent |
| Privacy | maximum | high | none for that label |
| Default? | yes | opt-in | opt-in |
Safety override: secrets, financial, and identifiers categories can NEVER be pass-through. Even if you (or an LLM intent classifier) explicitly request it, the package downgrades to token. This is hard-coded.
Configuration recipes
Per-category modes
await scrub(text, session, {
defaultMode: 'token',
modes: {
contact: 'realistic',
location: 'pass-through'
}
});Per-label overrides (highest specificity)
await scrub(text, session, {
defaultMode: 'token',
modes: { contact: 'realistic' },
labels: {
EMAIL: 'realistic', // explicit wins
DOB: 'pass-through', // we want to discuss age
ZIP_US: 'token' // override the location-category default
}
});passThrough shorthand
await scrub(text, session, {
passThrough: ['CITY', 'STATE', 'ZIP_US'] // sugar for labels: { CITY: 'pass-through', ... }
});exclude — skip a pattern entirely
await scrub(text, session, { exclude: ['IPV6'] }); // never even check for IPv6Resolution order (most-specific first)
exclude→ pattern doesn't runpassThrough→pass-throughlabels[label]→ that modemodes[category]→ that modedefaultMode→ that mode (default'token')- Safety override: secrets/financial/identifiers →
tokenif any of the above saidpass-through
Multi-turn example
import { scrub, restore, createSession } from 'piivacy';
const session = createSession();
// Turn 1
let { text } = await scrub('Email [email protected] please', session);
const reply1 = await yourLLM(text);
console.log(restore(reply1, session));
// Turn 2 — same session, same tokens for the same values
({ text } = await scrub('Actually email [email protected] and [email protected]', session));
const reply2 = await yourLLM(text);
console.log(restore(reply2, session));
console.log(listRedactions(session));
// [{ kind: 'token', identifier: '[[EMAIL_1]]', value: '[email protected]', count: 2, ... },
// { kind: 'token', identifier: '[[EMAIL_2]]', value: '[email protected]', count: 1, ... }]OpenAI / Anthropic / fetch examples
Token mode (the safe default)
import OpenAI from 'openai';
import { scrub, restore, createSession } from 'piivacy';
const openai = new OpenAI();
const session = createSession();
const { text } = await scrub(userInput, session);
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: text }]
});
const restored = restore(completion.choices[0].message.content, session);Realistic mode (better LLM fluency)
import { scrub, restore, createSession, presets } from 'piivacy';
const session = createSession();
const { text } = await scrub(userInput, session, presets.naturalConversation);
// `text` reads naturally — "[email protected]" instead of "[[EMAIL_1]]"Anthropic (any LLM works)
import Anthropic from '@anthropic-ai/sdk';
const claude = new Anthropic();
const { text } = await scrub(userInput, session);
const response = await claude.messages.create({
model: 'claude-3-5-sonnet-latest',
max_tokens: 1024,
messages: [{ role: 'user', content: text }]
});
const restored = restore(response.content[0].text, session);Pattern catalog
secrets (token-only — never realistic)
| Label | Catches |
|---|---|
| OPENAI_KEY | sk-... and sk-proj-... |
| ANTHROPIC_KEY | sk-ant-... |
| GITHUB_TOKEN | ghp_, gho_, ghu_, ghs_, ghr_, github_pat_... |
| AWS_ACCESS_KEY | AKIA..., ASIA... |
| AWS_SECRET_KEY | 40-char value preceded by aws_secret_access_key= etc. |
| GOOGLE_API_KEY | AIza... |
| STRIPE_KEY | sk_live_, pk_live_, sk_test_, pk_test_ |
| SLACK_TOKEN | xoxa-, xoxb-, xoxp-, xoxr-, xoxs- |
| JWT | three base64url segments |
| URL_CREDENTIALS | https://user:pass@host/... |
contact (realistic-friendly)
| Label | Realistic fake |
|---|---|
| EMAIL | redacted{N}@example.com (RFC 6761 reserved) |
| PHONE_US | (555) 010-{NNNN} (FCC fictional range) |
| PHONE_E164 | +44 555 010 {NNNN} |
financial
| Label | Notes |
|---|---|
| IBAN | mod-97 validated |
| CC | Luhn validated; realistic = Visa test card 4111-1111-1111-XXXX |
| BTC | legacy, P2SH, bech32 |
| ETH | 0x + 40 hex |
identifiers (token-only)
SSN, SSN_BARE, CA_SIN (Luhn), PASSPORT_US, VIN (check-digit validated), MAC.
location (realistic OR pass-through friendly)
ADDRESS_US, ZIP_US, POSTCODE_UK, POSTCODE_CA, LATLONG.
network
IPV4 (octet validated), IPV6, DOB (month/day validated, realistic = 01/01/1970).
Names — special case
Names are not in the regex catalog (too noisy). They enter the system three ways:
- The opt-in LLM second-pass check (recommended)
- Manual
registerSecret(session, 'Jane Doe', 'NAME') - A custom regex pattern you register yourself
In realistic mode, NAME redaction prefers (in order): your session's name adapter → the static data/names.json table (267k names from US SSA + US Census 2010, bucketed by gender/era/ethnicity) → a phonetic-shift fallback.
Adding custom patterns
import { registerPattern } from 'piivacy';
registerPattern({
label: 'INTERNAL_TICKET',
regex: /\bTICKET-\d{6}\b/g, // /g flag REQUIRED
category: 'custom',
priority: 25, // slots between defaults
validate: (v) => Number(v.slice(7)) > 100000,
fake: (_value, { counter }) => `TICKET-${(counter + 1000).toString().padStart(6, '0')}`,
description: 'Internal Jira-style ticket'
});Or scope a pattern to a single scrub call without registering globally:
await scrub(text, session, {
extraPatterns: [{ label: 'EPHEMERAL', regex: /\bEPH-\d+\b/g, category: 'custom' }]
});LLM second-pass loop (catch what regex missed)
The package never calls an LLM itself. You wire it up using whatever chat client you already have:
import OpenAI from 'openai';
import {
scrub,
buildPiiCheckPrompt,
parsePiiCheckResponse,
applyPiiCheckIssues
} from 'piivacy';
const openai = new OpenAI();
const session = createSession();
// Pass 1
let { text } = await scrub(userInput, session);
// Pass 2: ask a cheap model "did regex miss anything?"
const { system, user } = buildPiiCheckPrompt(text);
const checkRes = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: system },
{ role: 'user', content: user }
]
});
const { issues } = parsePiiCheckResponse(checkRes.choices[0].message.content);
applyPiiCheckIssues(session, issues, { minConfidence: 0.7 });
// Pass 3: re-scrub with the new literals registered
({ text } = await scrub(userInput, session));
// Now `text` is much cleaner. Send it to your real LLM.The second-pass model can be cheap (gpt-4o-mini, claude-haiku, an Ollama 7B) — it just needs to follow JSON output instructions.
Dynamic mode picking with llm-intent
Sometimes the LLM legitimately needs certain PII to answer well — addresses for local search, dates for scheduling, etc. The intent helpers ask any LLM to make that decision per category:
import {
buildScrubIntentPrompt,
parseScrubIntentResponse,
applyScrubIntent
} from 'piivacy';
const userInput = 'Find Indian restaurants near 234 Main St, Brooklyn, NY 11211';
const intent = buildScrubIntentPrompt(userInput);
const intentRes = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: intent.system },
{ role: 'user', content: intent.user }
]
});
const { decisions } = parseScrubIntentResponse(intentRes.choices[0].message.content);
// decisions = { location: 'preserve', contact: 'redact', secrets: 'redact', ... }
const opts = applyScrubIntent(decisions, { defaultMode: 'token' });
const { text } = await scrub(userInput, session, opts);
// → "Find Indian restaurants near 234 Main St, Brooklyn, NY 11211" (location preserved)
// But if user added "my SSN is 123-45-6789", that would still get tokenized.The package hard-codes a safety override: even if the LLM hallucinates secrets: 'preserve', we downgrade it to redact. Same for financial and identifiers.
Name redaction with adapters (browser / local / cloud)
realistic mode for NAMES uses (in order):
- The session's configured adapter, if any
- The static
data/names.jsontable (267k names, ships with the package, lazy-loaded) - Phonetic shift fallback for unknown names
Adapters live behind subpath imports — they're not loaded unless you import them.
WebLLM (in-browser, zero network egress)
npm install @mlc-ai/web-llmimport { CreateMLCEngine } from '@mlc-ai/web-llm';
import { WebLLMAdapter } from 'piivacy/adapters/webllm';
import { createSession, scrub } from 'piivacy';
const engine = await CreateMLCEngine('Phi-3.5-mini-instruct-q4f16_1-MLC');
const session = createSession({ nameAdapter: new WebLLMAdapter({ engine }) });Ollama (local model)
import { OllamaAdapter } from 'piivacy/adapters/ollama';
const session = createSession({
nameAdapter: new OllamaAdapter({ model: 'phi3:mini' }) // assumes ollama is running locally
});OpenRouter (cloud, any small model)
import { OpenRouterAdapter } from 'piivacy/adapters/openrouter';
const session = createSession({
nameAdapter: new OpenRouterAdapter({
apiKey: process.env.OPENROUTER_API_KEY,
model: 'meta-llama/llama-3.1-8b-instruct'
})
});All three adapters mix the real names with decoy names before sending the prompt, so even a logging provider can't easily tell which input was the redaction target.
Inspecting redactions
import { listRedactions } from 'piivacy';
await scrub('Email [email protected] phone (415) 555-0142 SSN 123-45-6789', session);
console.log(listRedactions(session));
// [
// { kind: 'token', identifier: '[[EMAIL_1]]', label: 'EMAIL', value: '[email protected]', count: 1, ... },
// { kind: 'token', identifier: '[[PHONE_US_1]]', label: 'PHONE_US', value: '(415) 555-0142', count: 1, ... },
// { kind: 'token', identifier: '[[SSN_1]]', label: 'SSN', value: '123-45-6789', count: 1, ... }
// ]Useful for: audit trails, downstream feature calls (e.g. enriching the LLM call with structured PII metadata after scrubbing the raw text), debugging which patterns triggered.
What it catches / What it misses
Catches reliably: common emails, US/E.164 phones, US/UK/Canadian addresses & postcodes, SSN, credit cards (Luhn), IBAN (mod-97), VIN (check-digit), JWTs, API keys for 8 major providers, IP addresses, dates of birth, MAC addresses, BTC/ETH wallets, lat/long pairs.
Misses (regex can't catch these reliably; use the LLM second-pass):
- Names ("Jane Doe", "Dr. Smith")
- Companies, projects, codenames
- Free-form sensitive context ("my divorce", "the layoff")
- Internationalized email (unicode local-parts)
- Non-Latin-script identifiers
- PII embedded in unstructured prose with no formatting cues
The package is defense in depth, not a magic shield. Think of regex as the cheap fast filter, the LLM check as the expensive accurate filter, and your security review as the human in the loop.
Sessions, sliding TTL
Sessions store the bidirectional maps between real values and tokens/fakes. Default TTL is 30 minutes; every scrub or getOrAssignToken call refreshes the expiry. After expiry, calls throw unless you pass { allowExpired: true }.
Sessions are plain JSON objects. To persist across processes, just JSON.stringify(session) and rehydrate. The package provides no built-in persistence layer — that's deliberately the caller's call.
What's in the box vs. opt-in
| Always loaded | Lazy-loaded | Sub-import |
|---|---|---|
| Core (scrub, restore, registry, modes, sessions, BYO-LLM helpers) | data/names.json (~23 MB, only on first realistic NAME use) | piivacy/adapters/openrouter |
| 30+ default patterns | | piivacy/adapters/ollama |
| | | piivacy/adapters/webllm |
Token-mode-only callers pay zero overhead beyond core. Realistic-mode callers without NAMEs pay zero extra. Adapter users pay only their adapter's tiny wrapper.
Disclaimer
This package reduces the surface area for accidentally pasting PII into an LLM prompt. It does not:
- Guarantee zero leakage (regex misses things; LLMs may infer)
- Replace your security review process
- Replace data-handling agreements with your LLM provider
- Provide cryptographic guarantees of any kind
- Constitute legal compliance with GDPR / CCPA / HIPAA / etc.
If you're handling regulated data, consult a security professional. If you're trying to stop yourself from accidentally pasting your customer's SSN into Claude, this is the right tool.
Honest limits
There is no perfect PII detector. Names are ambiguous. "John" is a name, a noun ("john" as toilet), a verb ("to john"). Companies are names. Aliases exist. Free text contains thousands of leakage modes regex can't cover. We get most of them. We don't claim to get all of them.
PIIvacy is defense in depth, not a magic shield.
License
MIT.
Contributing
PRs welcome. Particularly valuable:
- Patterns for non-US/UK locales (Indian Aadhaar, EU VAT, Australian TFN, Brazilian CPF, etc.)
- Better validators
- Languages other than English for the LLM-check prompts
- A name table for non-Latin scripts (Hangul, CJK, Arabic, Devanagari)
Run npm test before submitting.
