datasink
v0.2.0
Data hygiene for music PR — scrub, rinse, soak, steep your contact lists
___ (_)__ / /__
(_-</ / _ \/ '_/
/___/_/_//_/_/\_\
Data hygiene for music PR. Scrub, rinse, and soak your contact lists.
Demo uses fictional contacts for illustration.
Quick Start
npx datasink scrub contacts.csv # validate emails
npx datasink rinse contacts.csv # deduplicate
npx datasink wash contacts.csv # full pipeline
Or install globally:
npm install -g datasink
sink scrub contacts.csv
Commands
| Command | Description |
| --------------------- | --------------------------------------------- |
| sink | Interactive menu (no args) |
| sink wash <file> | Full pipeline: scrub + rinse + soak + steep |
| sink scrub <file> | Validate & clean emails |
| sink rinse <file> | Deduplicate contacts |
| sink soak <file> | Enrich contacts with AI |
| sink steep <file> | Discover channels via outlet site scraping |
| sink spot <email> | Spot-check a single email (with SMTP) |
| sink inspect <file> | Data quality score |
| sink drain <file> | Convert between formats |
| sink tui <file> | Full TUI dashboard |
Why sink?
- Built for music PR. Knows BBC Radio 1 from Radio X, catches bbc.com → bbc.co.uk typos, flags role-based emails like press@. Not a generic email validator -- it understands your industry.
- Zero config. Point it at a CSV and go. Flexible header matching means it works with whatever your spreadsheet exports. No mapping files, no setup wizard.
- Four phases, one metaphor. Scrub cleans. Rinse deduplicates. Soak enriches. Steep extracts channels from outlet sites. Run them individually or all at once with wash. Like doing the washing up, but for data.
Phases
Scrub
Validates and cleans email addresses:
- RFC 5322 format validation
- UK domain typo correction (bbc.com → bbc.co.uk, gmial.com → gmail.com)
- Disposable domain detection
- MX record verification
- Role-based email flagging (press@, info@)
- Catch-all domain detection
- Optional SMTP verification (--smtp)
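The typo-correction and role-flagging checks above can be sketched roughly as follows. This is an illustration only, not datasink's actual implementation: the two typo entries and the two role prefixes come from the examples in this README, while the names `TYPO_MAP`, `ROLE_PREFIXES`, `scrubEmail`, and the simplified format pattern (much looser than RFC 5322) are assumptions.

```typescript
// Hypothetical sketch of two scrub checks; not datasink's real code.

// Domain typos taken from the README examples.
const TYPO_MAP = new Map<string, string>([
  ['bbc.com', 'bbc.co.uk'],
  ['gmial.com', 'gmail.com'],
]);

// Role-based local parts named in the README; a real list would be longer.
const ROLE_PREFIXES = new Set<string>(['press', 'info']);

interface ScrubResult {
  email: string;      // normalised (and possibly corrected) address
  valid: boolean;     // passed the simplified format check
  corrected: boolean; // domain was rewritten via TYPO_MAP
  roleBased: boolean; // local part looks like a shared mailbox
}

// Deliberately loose pattern: one "@", no whitespace, a dot in the domain.
const SIMPLE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function scrubEmail(input: string): ScrubResult {
  const email = input.trim().toLowerCase();
  if (!SIMPLE_EMAIL.test(email)) {
    return { email, valid: false, corrected: false, roleBased: false };
  }
  const at = email.lastIndexOf('@');
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  const fixedDomain = TYPO_MAP.get(domain) ?? domain;
  return {
    email: `${local}@${fixedDomain}`,
    valid: true,
    corrected: fixedDomain !== domain,
    roleBased: ROLE_PREFIXES.has(local),
  };
}
```

In this sketch, `scrubEmail(' Press@Gmial.com ')` returns the address `press@gmail.com` with both `corrected` and `roleBased` set. Disposable-domain, MX, catch-all, and SMTP checks need network access or reference lists and are left out.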
Rinse
Deduplicates and resolves identities:
- Exact email -- case-insensitive dedup, keeps the richer record
- Fuzzy name -- Jaro-Winkler similarity within same domain (threshold: 0.92)
- Cross-field -- matches by phone or website across different emails
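The fuzzy-name strategy above can be sketched like this. The 0.92 threshold and the same-domain restriction come from this README; everything else is an assumption, including the textbook Jaro-Winkler variant used here (prefix scale 0.1, prefix capped at 4 characters) and the names `jaro`, `jaroWinkler`, and `isFuzzyDuplicate`. datasink's own implementation may differ.

```typescript
// Hypothetical sketch of fuzzy-name matching; not datasink's real code.

function jaro(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  // Characters count as matching only within this distance of each other.
  const window = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array<boolean>(a.length).fill(false);
  const bMatched = new Array<boolean>(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(b.length - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Walk both strings' matched characters in order and count mismatches.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - transpositions) / matches) / 3;
}

function jaroWinkler(a: string, b: string, prefixScale = 0.1): number {
  const base = jaro(a, b);
  const maxPrefix = Math.min(4, a.length, b.length);
  let prefix = 0;
  while (prefix < maxPrefix && a[prefix] === b[prefix]) prefix++;
  // Boost the score for strings that share a common prefix.
  return base + prefix * prefixScale * (1 - base);
}

interface Contact {
  name: string;
  email: string;
}

const FUZZY_THRESHOLD = 0.92; // value stated in this README

function domainOf(email: string): string {
  return email.slice(email.lastIndexOf('@') + 1).toLowerCase();
}

// Two contacts are fuzzy duplicates when they share a domain and their names are close.
function isFuzzyDuplicate(x: Contact, y: Contact, threshold = FUZZY_THRESHOLD): boolean {
  if (domainOf(x.email) !== domainOf(y.email)) return false;
  const score = jaroWinkler(x.name.trim().toLowerCase(), y.name.trim().toLowerCase());
  return score >= threshold;
}
```

Restricting the comparison to contacts on the same domain keeps the pairwise cost down and avoids merging two different people who happen to share a name at different outlets.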
Soak
Enriches contacts with AI:
- Platform type detection (radio, press, playlist, blog, podcast)
- Genre identification
- Geographic scope
- Submission guidelines
- Pitch tips
Supports Anthropic (Claude Haiku) and OpenAI (GPT-4o-mini).
Steep
Discovers contact channels by scraping the outlet's public website:
- Outlet socials (Instagram, Twitter, LinkedIn, Facebook)
- Submission portal URL + submission email
- Submission format (mp3, link, form, mixed)
- Recent presenters / journalists named on the site
- Per-contact attribution: when a contact's name appears on the team / presenter page, their personal handles are extracted
- Pitch hooks: specific, observable hooks pulled from the scraped text ("submissions form requires ISRC", "no submissions email -- portal only")
Outlet scrapes are cached for 30 days. One scrape powers every contact at that outlet.
Requires FIRECRAWL_API_KEY and an LLM provider key. The phase is silently skipped if either credential is missing.
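The per-outlet caching described above might look something like the sketch below. Only the 30-day lifetime and the "one scrape per outlet" behaviour come from this README. The in-memory `OutletCache` class, the choice of the outlet's domain as cache key, and the `getOrScrape` helper are assumptions made for illustration; the real tool presumably persists its cache between runs.

```typescript
// Hypothetical sketch of a per-outlet TTL cache; not datasink's real code.

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

interface CacheEntry<T> {
  value: T;
  storedAt: number; // epoch milliseconds
}

class OutletCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private ttlMs: number;

  constructor(ttlMs: number = THIRTY_DAYS_MS) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached value, or undefined if absent or older than the TTL.
  get(domain: string, now: number = Date.now()): T | undefined {
    const key = domain.toLowerCase();
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (now - hit.storedAt >= this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(domain: string, value: T, now: number = Date.now()): void {
    this.entries.set(domain.toLowerCase(), { value, storedAt: now });
  }
}

// Scrape an outlet at most once per TTL window; every contact at that outlet reuses the result.
async function getOrScrape<T>(
  cache: OutletCache<T>,
  domain: string,
  scrape: (domain: string) => Promise<T>,
): Promise<T> {
  const cached = cache.get(domain);
  if (cached !== undefined) return cached;
  const fresh = await scrape(domain);
  cache.set(domain, fresh);
  return fresh;
}
```

Passing `now` explicitly to `get` and `set` makes expiry easy to test without waiting 30 days.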
Global Flags
-o, --output <path> Output file path
--format <csv|json|jsonl> Output format (default: csv)
--config <path> Config file path
--dry-run Preview without writing files
--verbose Detailed output
-q, --quiet Suppress all output except errors
--json JSON stdout (for piping)
--no-colour Disable colours
--smtp Enable SMTP verification (scrub phase)
--provider <name> Enrichment provider (anthropic|openai)
Exit Codes
| Code | Meaning |
| ---- | --------------------------------------------------------- |
| 0 | Success |
| 1 | File error (not found, permission denied, is a directory) |
| 2 | Parse error (invalid CSV, no usable data) |
| 3 | Config error (invalid config file) |
| 4 | Pipeline error (enrichment failure, unexpected crash) |
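If you call sink from a script, the table above can be turned into readable log lines. The sketch below is a small assumed helper (`describeExit` is not part of datasink); the meanings are copied from the table, and how you obtain the exit status, for example from your process runner of choice, is up to you.

```typescript
// Hypothetical helper for scripts that wrap the sink CLI; not part of datasink.

// Meanings copied from the exit-code table in this README.
const EXIT_MEANINGS = new Map<number, string>([
  [0, 'Success'],
  [1, 'File error (not found, permission denied, is a directory)'],
  [2, 'Parse error (invalid CSV, no usable data)'],
  [3, 'Config error (invalid config file)'],
  [4, 'Pipeline error (enrichment failure, unexpected crash)'],
]);

// Turn an exit status into a single log line.
function describeExit(code: number): string {
  const meaning = EXIT_MEANINGS.get(code);
  return meaning === undefined
    ? `sink exited with undocumented code ${code}`
    : `sink exited with code ${code}: ${meaning}`;
}
```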
Provider Setup
Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
sink soak contacts.csv --provider anthropic
OpenAI
export OPENAI_API_KEY=sk-...
sink soak contacts.csv --provider openai
Input Format
Accepts CSV files with flexible column names:
| Field | Accepted Headers |
| ------- | -------------------------------------------- |
| Name | name, contact, full name, person |
| Email | email, e mail, email address |
| Outlet | outlet, publication, media, company, station |
| Role | role, title, position, job title |
| Phone | phone, telephone, mobile |
| Website | website, url, web |
| Notes | notes, comments, description |
| Tags | tags, categories, labels |
First/last name columns are automatically joined. Unmapped columns are preserved in extras.
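A rough sketch of how this kind of flexible header matching can work is shown below. The alias lists are copied from the table above; the normalisation rule (lower-casing, and treating spaces, underscores, and hyphens alike), the literal "first name" / "last name" header spellings, and the names `normaliseHeader` and `mapRow` are assumptions rather than datasink's documented behaviour.

```typescript
// Hypothetical sketch of flexible header matching; not datasink's real code.

// Accepted headers copied from the table in this README.
const HEADER_ALIASES: Record<string, string[]> = {
  name: ['name', 'contact', 'full name', 'person'],
  email: ['email', 'e mail', 'email address'],
  outlet: ['outlet', 'publication', 'media', 'company', 'station'],
  role: ['role', 'title', 'position', 'job title'],
  phone: ['phone', 'telephone', 'mobile'],
  website: ['website', 'url', 'web'],
  notes: ['notes', 'comments', 'description'],
  tags: ['tags', 'categories', 'labels'],
};

// Reverse lookup: alias -> canonical field name.
const LOOKUP = new Map<string, string>();
for (const [field, aliases] of Object.entries(HEADER_ALIASES)) {
  for (const alias of aliases) LOOKUP.set(alias, field);
}

// Lower-case and collapse separators so "E-Mail" and "e_mail" both become "e mail".
function normaliseHeader(header: string): string {
  return header.trim().toLowerCase().replace(/[\s_\-]+/g, ' ');
}

interface MappedRow {
  fields: Record<string, string>; // recognised columns, keyed by canonical field
  extras: Record<string, string>; // unmapped columns, keyed by original header
}

function mapRow(row: Record<string, string>): MappedRow {
  const fields: Record<string, string> = {};
  const extras: Record<string, string> = {};
  let first = '';
  let last = '';
  for (const [header, value] of Object.entries(row)) {
    const key = normaliseHeader(header);
    const field = LOOKUP.get(key);
    if (field !== undefined) fields[field] = value;
    else if (key === 'first name') first = value;
    else if (key === 'last name') last = value;
    else extras[header] = value;
  }
  // Join first/last name only when no full-name column was present.
  if (!fields.name && (first || last)) fields.name = `${first} ${last}`.trim();
  return { fields, extras };
}
```

With this sketch, a row with the headers `First Name`, `Last_Name`, `E-Mail`, and `Station` maps to the name, email, and outlet fields, and any other column lands in `extras` under its original header.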
Configuration
Create a sink.config.ts in your project root:
export default {
scrub: {
smtp: false,
mxCacheTTL: 1800,
smtpTimeout: 10,
typoMap: './data/custom-typos.json',
},
rinse: {
fuzzyThreshold: 0.92,
strategies: ['exact-email', 'fuzzy-name', 'cross-field'],
},
soak: {
provider: 'anthropic',
anthropic: {
model: 'claude-haiku-4-5-20251001',
apiKey: process.env.ANTHROPIC_API_KEY,
},
},
output: {
format: 'csv',
locale: 'en-GB',
},
}
Programmatic API
import { runPipeline, loadConfig } from 'datasink'
const config = await loadConfig()
const records = [
{
id: '1',
raw: { name: 'Sarah Jones', email: '[email protected]', outlet: 'BBC Radio 1' },
phases: [],
timestamp: new Date().toISOString(),
},
]
const { records: processed, stats } = await runPipeline(records, {
phases: ['scrub', 'rinse'],
config,
})
console.log(stats)
Contributing
See CONTRIBUTING.md for dev setup, code style, and PR guidelines.
Changelog
See CHANGELOG.md for release history.
Part of Total Audio
Tools I build for music PR, by Chris Schofield. Part of Total Audio Promo.
| Project | Description |
|---------|-------------|
| TAP | Campaign management for music PR agencies |
| totalaud.io | Release planning for emerging artists |
| SpotCheck | Spotify playlist validation |
| Newsjack | Music industry newsjacking |
| Podflow | Podcast intelligence for music PR |
| Sink | Contact data hygiene CLI |
Questions? Reach me on X/@chrisschouk or [email protected].
Licence
MIT
