resume-extract
v0.3.0
Extract structured data from resume text, PDF, and DOCX using ONNX NER model

Fast, local resume extraction using a fine-tuned DistilBERT NER model. Extracts structured data from resume text, PDF, or DOCX via local document parsing + ONNX inference.
Installation
Binary (recommended):
curl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | bash
resume-extract --help

The installer downloads the latest GitHub Release asset into ~/.local/bin. Override INSTALL_DIR, REPO, or VERSION if needed:

INSTALL_DIR=/usr/local/bin VERSION=v0.1.0 curl -fsSL https://raw.githubusercontent.com/somus/resume-extract/main/scripts/install-release.sh | bash

As library:

bun install

Build from source:
bun run build:bin
./dist/resume-extract --input ./resume.pdf --ats

Notes:
- parseResume() is the text-only fast path. parseResumePdf() and parseResumeDocx() use @kreuzberg/node for local document text extraction.
- parseResumePdf(..., { ocr: true }) enables OCR for scanned PDFs (defaults to Tesseract). Supports tesseract, easyocr, and paddleocr backends via { ocr: { backend: "easyocr" } }. OCR is much slower than text parsing.
- On first run, the CLI automatically downloads the required oksomu/resume-ner model files into a local cache if they are missing and shows download progress. Pass --model to use a custom directory or --no-download to require a pre-populated model directory.
- Library consumers should manage model directories explicitly.
Features
- Structured extraction: name, email, phone, location, companies, titles, education, skills
- Document input support: parse raw text, PDF, or DOCX
- ATS scoring: completeness score with actionable issues list
- Seniority inference: from job titles + years of experience
- Country detection: from location + phone prefix
- Experience years: computed from employment dates
- Section-aware chunking: splits long resumes at paragraph boundaries for >512 token texts
- Section detection: rule-based gap-filling for skills, certifications, and languages the model misses
- 100% local: runs offline via ONNX, no API calls
- Fast text parsing: ~15ms per resume after model load
- Optional document parsing: PDF via Kreuzberg, including OCR when enabled; DOCX via Kreuzberg
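As an illustration of how the experience-years feature can work, here is a minimal sketch that sums employment date ranges while merging overlaps so concurrent jobs are not double-counted. The helper names, date format, and logic are assumptions for illustration, not resume-extract's actual implementation.

```typescript
// Illustrative sketch: total years of experience from employment ranges.
// parseMonth/totalYears are hypothetical helpers, not part of resume-extract.
type EmploymentRange = { start: string; end: string }; // "YYYY-MM" or "present"

function parseMonth(s: string, now = new Date(2025, 0)): number {
  if (s === "present") return now.getFullYear() * 12 + now.getMonth();
  const [y, m] = s.split("-").map(Number);
  return y * 12 + (m - 1);
}

// Merge overlapping ranges, then sum covered months and round to years.
function totalYears(ranges: EmploymentRange[]): number {
  const spans = ranges
    .map((r) => [parseMonth(r.start), parseMonth(r.end)] as [number, number])
    .sort((a, b) => a[0] - b[0]);
  let months = 0;
  let curStart = -1;
  let curEnd = -1;
  for (const [s, e] of spans) {
    if (s > curEnd) {
      if (curEnd >= 0) months += curEnd - curStart;
      curStart = s;
      curEnd = e;
    } else {
      curEnd = Math.max(curEnd, e);
    }
  }
  if (curEnd >= 0) months += curEnd - curStart;
  return Math.round(months / 12);
}
```

Merging ranges first matters: two simultaneous part-time roles spanning the same years should count once, not twice.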
Model
Uses oksomu/resume-ner — a DistilBERT model fine-tuned for resume NER and exported to ONNX for local structured extraction.
Latest model metrics (from model card, noise-augmented, 25 epochs, entity-level exact-match via seqeval):
- entity F1: 97.77%
- structured micro F1: 97.88%
- clean resume F1: 99.18%
- noisy resume F1: 69.24% (OCR/scraped text)
- quantized ONNX size: 63MB
Entity types:
- NAME, EMAIL, PHONE, LOCATION, COMPANY, TITLE, DATE, DEGREE, INSTITUTION, FIELD, SKILL, CERT, LANGUAGE
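Token-classification models like this one typically emit BIO-style labels (B-SKILL, I-SKILL, O, ...) that must be merged into entity spans. A generic decoder sketch follows; it illustrates the standard technique, not resume-extract's actual post-processing.

```typescript
// Generic BIO-tag decoder: merges per-token labels into (type, text) spans.
// Illustrative only; resume-extract's real decoder may differ.
type LabeledToken = { text: string; label: string }; // e.g. "B-SKILL", "I-SKILL", "O"

function decodeEntities(tokens: LabeledToken[]): { type: string; text: string }[] {
  const entities: { type: string; text: string }[] = [];
  for (const { text, label } of tokens) {
    if (label.startsWith("B-")) {
      // B- starts a new entity of the given type.
      entities.push({ type: label.slice(2), text });
    } else if (
      label.startsWith("I-") &&
      entities.length > 0 &&
      entities[entities.length - 1].type === label.slice(2)
    ) {
      // I- continues the preceding entity of the same type.
      entities[entities.length - 1].text += " " + text;
    }
    // "O" tokens and stray I- tags are ignored.
  }
  return entities;
}
```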
Model directory should include:
- resume_config.json — pre-processing, post-processing, and inference rules
- companies.json — company gazetteer for post-processing
- city_country_map.json — 317 cities for country inference
- tokenizer/config files
- onnx/model_quantized.onnx or onnx/model.onnx
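A quick way to sanity-check a model directory against this list is to diff a file listing against the required set. The helper below is illustrative (pair it with fs.readdirSync in real use); the file names come from the list above.

```typescript
// Illustrative check: report which required model files are absent from a
// directory listing. Pure function so it is easy to test.
const REQUIRED = ["resume_config.json", "companies.json", "city_country_map.json"];
const ONNX_CANDIDATES = ["onnx/model_quantized.onnx", "onnx/model.onnx"];

function missingModelFiles(present: string[]): string[] {
  const have = new Set(present);
  const missing = REQUIRED.filter((f) => !have.has(f));
  // Either the quantized or the full-precision ONNX file is acceptable.
  if (!ONNX_CANDIDATES.some((f) => have.has(f))) {
    missing.push("onnx/model_quantized.onnx (or onnx/model.onnx)");
  }
  return missing;
}
```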
Usage
import {
computeATSScore,
parseResume,
parseResumeDocx,
parseResumePdf,
} from "resume-extract";
const result = await parseResume(resumeText, "/path/to/model");
const fromPdf = await parseResumePdf("/path/to/resume.pdf", "/path/to/model");
const fromScannedPdf = await parseResumePdf(pdfBytes, "/path/to/model", { ocr: true });
const fromDocx = await parseResumeDocx("/path/to/resume.docx", "/path/to/model");
// result.personal: { name, email, phone, location }
// result.experience: [{ title, company, start_date, end_date }]
// result.education: [{ degree, field, institution }]
// result.skills: ["Python", "AWS", ...]
// result.seniority: "Senior"
// result.country: "India"
// result.experience_years: 10
const ats = computeATSScore(result);
// ats.score: 87
// ats.issues: [{ severity: "medium", message: "..." }]

CLI
Run directly with Bun:
bun run cli ./resume.pdf --ats
bun run cli --text "Jane Doe..."
bun run cli ./resume.pdf --view json --output result.json
cat ./resume.txt | bun run cli
# Batch mode
bun run cli batch ./resumes/*.pdf --ats
bun run cli batch --input-dir ./resumes --glob '**/*' --output batch.jsonl
bun run cli batch --input-dir ./resumes --output batch.csv --output-format csv
bun run cli batch --input-dir ./resumes --fail-fast
# Explicit model setup and diagnostics
bun run cli setup-model
bun run cli doctor --ocr
bun run cli doctor --fix
bun run cli doctor --json

Common flags:
- --model <path>: model directory
- --model-repo <repo>: alternate Hugging Face repo for first-run download
- --model-revision <rev>: alternate model revision for first-run download
- --no-download: disable automatic model download
- --input <path>: input file path
- --text <text>: inline text input
- --format <auto|text|pdf|docx>: override format detection
- --ocr: enable PDF OCR (defaults to Tesseract)
- --ocr-backend <backend>: OCR backend: tesseract, easyocr, or paddleocr
- --ats: include ATS scoring in output
- --view <json|pretty>: render machine JSON or human-friendly terminal output
- --output <path>: write structured output to a file
- --compact: emit minified JSON
Batch-only flags:
- batch [inputs...]: process many resumes at once
- --input-dir <path>: scan a directory for resumes
- --glob <pattern>: file selection pattern for directory scanning
- --concurrency <n>: parallel batch workers, defaults to 4
- --fail-fast: stop batch processing on the first extraction error
- --output-format <json|jsonl|csv>: structured batch output format
Extra commands:
- setup-model: download the configured model into the local cache or custom --model path
- update-model: pull the latest model from Hugging Face, re-downloading all files
- doctor: inspect model readiness, file integrity, writable cache paths, runtime platform, and optional OCR availability
- doctor --fix: download/repair the configured model, then report status
- doctor --json: emit machine-readable diagnostics
The CLI checks for model updates once per day. If a newer model is available on Hugging Face, a warning is shown on stderr. Run update-model to pull the latest.
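A once-per-day throttle like the one described above is commonly implemented by caching the last check time; a generic sketch (the storage location and record shape are assumptions, not resume-extract's actual cache layout):

```typescript
// Illustrative once-per-day throttle: skip the update check if fewer than
// 24 hours have passed since the last one. How resume-extract persists the
// timestamp is not specified here; this shows only the decision logic.
const DAY_MS = 24 * 60 * 60 * 1000;

function shouldCheckForUpdate(lastCheckMs: number | null, nowMs: number): boolean {
  // null means no recorded check yet, so check immediately.
  return lastCheckMs === null || nowMs - lastCheckMs >= DAY_MS;
}
```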
Output behavior:
- Single resume commands default to pretty view on a TTY and json otherwise.
- Batch commands default to pretty summaries on a TTY and structured JSON otherwise.
- Use --view json when piping to other tools.
- Use --output with batch plus --output-format jsonl for machine-friendly bulk processing.
- Use --output-format csv when you want spreadsheet-friendly flat output with summary fields plus numbered experience and education columns.
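Downstream tooling can consume a batch JSONL file line by line; a sketch assuming one JSON object per non-empty line (the record fields are whatever the CLI emits, not shown here):

```typescript
// Illustrative JSONL reader: one JSON object per non-empty line.
// Inspect your batch output for the actual record shape resume-extract emits.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```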
Limitations
- English resumes only
- Max 512 tokens per chunk (section-aware chunking splits at paragraph boundaries for longer resumes)
- Image-based/scanned PDFs require OCR before text extraction
- Two-column PDF layouts may flatten during text extraction
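The 512-token limit above is worked around by paragraph-boundary chunking. The following simplified sketch uses word counts as a stand-in for tokenizer tokens; the real chunker is token-based and section-aware, so treat this only as an illustration of the greedy-packing idea.

```typescript
// Simplified chunker: greedily pack paragraphs into chunks whose word count
// stays under a limit. Words approximate tokenizer tokens for illustration.
function chunkByParagraph(text: string, maxWords = 512): string[] {
  const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim());
  const chunks: string[] = [];
  let current: string[] = [];
  let words = 0;
  for (const p of paragraphs) {
    const w = p.trim().split(/\s+/).length;
    // Start a new chunk when this paragraph would overflow the current one.
    if (words + w > maxWords && current.length > 0) {
      chunks.push(current.join("\n\n"));
      current = [];
      words = 0;
    }
    current.push(p.trim());
    words += w;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Splitting at paragraph boundaries (rather than mid-sentence) keeps entity mentions intact, which is why chunking rarely hurts extraction quality.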
Development
bun run test # Run tests
bun run check # Biome lint + format check
bun run typecheck # TypeScript type check
bun run format # Auto-format

License
MIT
