boa-statement-parser
A production-ready Node.js library and CLI for parsing Bank of America bank statement PDFs into clean, normalized, categorized JSON with full JSON Schema validation.
Features
- Multi-format support: Parses both checking and credit card statements
- Multi-statement PDFs: Extracts multiple statements from combined PDF files
- Batch directory processing: Process entire directories of PDFs with --inputDir
- Smart deduplication: Statement-level and transaction-level dedup with completeness scoring
- Automatic detection: Identifies account type from statement content
- Transaction categorization: 70+ priority-ordered rules with confidence tiers
- ML-based categorization: TensorFlow.js with Universal Sentence Encoder for intelligent categorization
- Channel detection: Identifies CHECKCARD, ATM, Zelle, Online Banking transfers, etc.
- Bank reference extraction: Captures confirmation numbers, trace numbers, ATM IDs
- Merchant extraction: Extracts merchant name, city, state, and online flag
- Recurring transaction detection: Identifies subscriptions, regular bills, and recurring payments
- JSON Schema validation: AJV-based validation against Draft 2020-12 schema
- Wrapped line handling: Merges multi-line descriptions from PDF extraction
- Strict validation: Optional schema validation with Zod
- Idempotent output: Same input always produces same output
- TypeScript-first: Full type safety with strict mode
Installation
# Install globally from npm
npm install -g boa-statement-parser
# Or install locally in your project
npm install boa-statement-parser
Quick Start
Initialize Your Project
After installing, run the init command to set up required files:
# Initialize with .env file and pre-trained ML model
parse-boa init
# Skip ML model (if you don't need ML categorization)
parse-boa init --no-model
# Overwrite existing files
parse-boa init --force
This creates:
- .env - Configuration file with sensible defaults
- models/categorizer/ - Pre-trained ML model for transaction categorization
- statements/ - Directory to place your PDF files
CLI Usage
# Parse a single statement PDF
pnpm parse-boa ./statement.pdf
# Save output to a file
pnpm parse-boa ./statement.pdf --out result.json
# Enable verbose mode for debugging
pnpm parse-boa ./statement.pdf --verbose
# Enable strict validation
pnpm parse-boa ./statement.pdf --strict
# Compact JSON output (no pretty-printing)
pnpm parse-boa ./statement.pdf --no-pretty
Batch Directory Processing
Process multiple PDF files from a directory:
# Process all PDFs in a directory
pnpm parse-boa --inputDir "C:\Users\...\Statements" --out result.json
# With verbose output showing progress
pnpm parse-boa --inputDir ./statements --out result.json --verbose
# With strict validation
pnpm parse-boa --inputDir ./statements --out result.json --strict --verbose
Batch processing features:
- Scans directory for *.pdf files (case-insensitive); the filtering is sketched below
- Skips temporary files (~$...) and zero-byte files
- Processes files in deterministic order (sorted by filename)
- Deduplicates statements across all PDFs
- Produces single consolidated output matching single-PDF schema
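A minimal sketch of that scanning behavior, using only Node's fs and path modules (findPdfs is a hypothetical name; the library's actual directory-scanner may differ):
import { readdirSync, statSync } from 'node:fs';
import { join } from 'node:path';
// Case-insensitive *.pdf match, skip temp (~$...) and zero-byte files,
// and sort by filename for a deterministic processing order.
function findPdfs(dir: string): string[] {
  return readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.pdf'))
    .filter((name) => !name.startsWith('~$'))
    .map((name) => join(dir, name))
    .filter((file) => statSync(file).size > 0)
    .sort();
}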
Programmatic Usage
import { parseStatementFile } from 'boa-statement-parser';
const result = await parseStatementFile('./statement.pdf', {
strict: true,
verbose: false,
});
console.log(result.statement.transactions);
Advanced Usage
import { extractPDF, parseBoaStatement, detectAccountType } from 'boa-statement-parser';
// Extract PDF content
const pdf = await extractPDF('./statement.pdf');
// Detect account type
const accountType = detectAccountType(pdf);
console.log(`Detected: ${accountType}`); // 'checking' | 'credit' | 'unknown'
// Parse with options
const result = parseBoaStatement(pdf, {
strict: true,
verbose: true,
});
// Access parsed data
console.log(result.statement.account);
console.log(result.statement.summary);
console.log(result.statement.transactions);
console.log(result.statement.metadata.warnings);
Output JSON Schema
The output conforms to JSON Schema Draft 2020-12. See JSON_SCHEMA.json for the full schema.
{
"schemaVersion": "1.0.0",
"source": {
"fileName": "statement.pdf",
"fileType": "pdf",
"pageCount": 4
},
"statements": [{
"statementId": "sha256-hash-32-chars",
"account": {
"institution": "Bank of America",
"accountType": "checking" | "credit",
"accountNumberMasked": "****1234",
"currency": "USD",
"statementPeriod": { "start": "YYYY-MM-DD", "end": "YYYY-MM-DD" }
},
"summary": {
"beginningBalance": number,
"endingBalance": number,
"totalCredits": number,
"totalDebits": number,
"transactionCount": number
},
"transactions": [{
"transactionId": "sha256-hash-32-chars",
"date": "YYYY-MM-DD",
"postedDate": "YYYY-MM-DD" | null,
"amount": number,
"direction": "debit" | "credit",
"description": "cleaned description",
"descriptionRaw": "original line from PDF",
"merchant": {
"name": "STARBUCKS",
"normalizedName": "Starbucks",
"city": "SEATTLE",
"state": "WA",
"online": false,
"network": "VISA" | null
},
"bankReference": {
"cardTransactionTraceNumber": "24801975260482319110911",
"confirmationNumber": "1234567890",
"zelleConfirmation": "T0ZDL3WND",
"atmId": "000009733",
"checkNumber": "1234"
},
"channel": {
"type": "CHECKCARD" | "ATM_DEPOSIT" | "ATM_WITHDRAWAL" | "ZELLE" |
"ONLINE_BANKING_TRANSFER" | "CHECK" | "FEE" | "OTHER",
"subtype": "transfer_from_sav" | null
},
"categorization": {
"category": "Food & Dining",
"subcategory": "Restaurants",
"confidence": 0.85,
"ruleId": "food-restaurant",
"rationale": "Matched rule: food-restaurant"
},
"raw": {
"page": 1,
"section": "deposits" | "atm_debit" | "checks" | "service_fees" | null,
"originalText": "01/05 STARBUCKS STORE 12345 5.75"
},
"flags": {
"isTransfer": true,
"isSubscription": false
}
}],
"provenance": {
"extractedFromText": true,
"pageStart": 1,
"pageEnd": 4
}
}],
"metadata": {
"parser": { "name": "boa-statement-parser", "version": "1.0.0" },
"parsedAt": "2024-01-15T10:30:00.000Z",
"warnings": []
}
}
Channel Types
The parser detects and classifies transaction channels:
| Channel Type | Description | Example |
|--------------|-------------|---------|
| CHECKCARD | Debit card purchases | CHECKCARD 0105 STARBUCKS... |
| ATM_DEPOSIT | ATM cash/check deposits | BKOFAMERICA ATM #000009733 DEPOSIT |
| ATM_WITHDRAWAL | ATM cash withdrawals | BKOFAMERICA ATM WITHDRWL |
| ONLINE_BANKING_TRANSFER | Online transfers | Online Banking transfer from SAV Confirmation#... |
| ZELLE | Zelle payments | Zelle payment from JOHN DOE Conf#... |
| CHECK | Check payments | Check number extracted |
| FEE | Bank fees | Monthly Maintenance Fee |
| FINANCIAL_CENTER_DEPOSIT | Branch deposits | FINANCIAL CENTER DEPOSIT |
| OTHER | Unclassified | Fallback |
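For illustration, a simplified first-match-wins classifier in the spirit of the table above (the patterns here are assumptions, not the library's actual rules):
// Ordered pattern list: first match wins; OTHER is the fallback.
const channelPatterns: Array<[string, RegExp]> = [
  ['CHECKCARD', /\bCHECKCARD\b/i],
  ['ATM_DEPOSIT', /\bATM\b.*\bDEPOSIT\b/i],
  ['ATM_WITHDRAWAL', /\bATM\b.*\bWITHDRWL?\b/i],
  ['ONLINE_BANKING_TRANSFER', /Online Banking transfer/i],
  ['ZELLE', /\bZelle\b/i],
  ['FINANCIAL_CENTER_DEPOSIT', /FINANCIAL CENTER DEPOSIT/i],
  ['FEE', /\bFEE\b/i],
];
function detectChannelType(description: string): string {
  for (const [type, pattern] of channelPatterns) {
    if (pattern.test(description)) return type;
  }
  return 'OTHER';
}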
Confidence Tiers
Categorization uses three confidence levels:
| Tier | Confidence | Description |
|------|------------|-------------|
| HIGH | 0.95 | Exact merchant match (Netflix, Uber, etc.) |
| MEDIUM | 0.75-0.85 | Keyword match with context |
| LOW | 0.50 | Uncategorized (no rule matched) |
Transaction Categories
The parser includes 70+ priority-ordered categorization rules covering:
| Category | Subcategories |
|----------|---------------|
| Income | Salary, Interest, Dividends, Refund |
| Housing | Rent, Mortgage, HOA, Property Tax |
| Utilities | Electric, Gas, Water, Internet, Phone |
| Transportation | Rideshare, Gas, Parking, Tolls, Insurance |
| Food & Dining | Groceries, Restaurants, Food Delivery, Alcohol |
| Shopping | Online, General Merchandise, Electronics, Clothing |
| Entertainment | Streaming, Movies, Events, Fitness, Gaming |
| Health | Pharmacy, Medical, Dental, Vision, Insurance |
| Financial | ATM, Deposit, Check, Credit Card Payment, Investment, Loan Payment |
| Transfer | Zelle, Venmo, Internal, Wire, ACH |
| Fees | Bank |
| Travel | Flights, Lodging, Car Rental |
| Education | Tuition, Learning |
| Personal Care | Grooming, Beauty |
| Insurance | Life, Renters |
| Taxes | Tax Payment, Tax Preparation |
| Charity | Donation |
| Pets | Pet Care |
| Childcare | Daycare |
Uncategorized transactions receive a confidence score of 0.5.
ML-Based Categorization
The parser includes an optional machine learning-based categorizer using TensorFlow.js and Universal Sentence Encoder for intelligent transaction categorization.
Architecture
- Text Embeddings: Universal Sentence Encoder generates 512-dimensional embeddings from transaction descriptions
- Neural Network: Multi-output classifier predicts both category and subcategory
- Hybrid Approach: Combines rule-based and ML categorization for best results
Usage
import { HybridCategorizer, generateTrainingData } from 'boa-statement-parser';
// Initialize hybrid categorizer
const categorizer = new HybridCategorizer();
await categorizer.initialize();
// Train with synthetic data (or your own labeled transactions)
const trainingData = generateTrainingData(5000);
await categorizer.trainML(trainingData, { epochs: 50 });
// Categorize with hybrid approach
const result = await categorizer.categorizeAsync('STARBUCKS COFFEE SEATTLE WA', 'CHECKCARD');
console.log(result.category); // 'Food & Dining'
console.log(result.subcategory); // 'Restaurants'
console.log(result.source); // 'rule' | 'ml' | 'hybrid'
// Clean up
categorizer.dispose();
Hybrid Strategy
- Rule-first: Fast, deterministic rule-based categorization runs first
- High confidence bypass: If rule confidence ≥ 0.9, use rule result directly
- ML validation: For medium confidence (0.75-0.9), ML validates/overrides
- ML fallback: For uncategorized transactions, ML provides predictions
- Confidence combination: When rule and ML agree, confidences are combined (decision flow sketched below)
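A sketch of that decision flow; categorizeByRule, predictWithML, and the combination formula are hypothetical stand-ins for the library's internals:
interface CategoryResult { category: string; confidence: number; source: string; }
// Hypothetical internals: a rule engine and an ML predictor.
declare function categorizeByRule(description: string): CategoryResult | null;
declare function predictWithML(description: string): Promise<CategoryResult>;

async function hybridCategorize(description: string): Promise<CategoryResult> {
  const rule = categorizeByRule(description);          // 1. rule-first
  if (rule && rule.confidence >= 0.9) return rule;     // 2. high-confidence bypass
  const ml = await predictWithML(description);         // 3./4. ML validates or fills in
  if (!rule) return { ...ml, source: 'ml' };           // uncategorized -> ML fallback
  if (rule.category === ml.category) {                 // 5. agreement: combine confidences
    const confidence = Math.min(1, rule.confidence + (1 - rule.confidence) * ml.confidence);
    return { ...rule, confidence, source: 'hybrid' };
  }
  return ml.confidence > rule.confidence ? ml : rule;  // disagreement: higher confidence wins
}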
Training Data Generation
The generateTrainingData() function creates synthetic training examples from:
- 100+ merchant templates across all categories
- Data augmentation (prefixes, cities, store numbers)
- Existing rule-based patterns
import { generateTrainingData, generateFromParsedTransactions } from 'boa-statement-parser';
// Generate synthetic training data
const syntheticData = generateTrainingData(5000);
// Or use your own labeled transactions
const customData = generateFromParsedTransactions([
{ description: 'MY LOCAL COFFEE SHOP', category: 'Food & Dining', subcategory: 'Restaurants' },
// ... more examples
]);
Model Persistence
// Save trained model
await categorizer.saveMLModel('./models/categorizer');
// Load pre-trained model
const newCategorizer = new HybridCategorizer();
await newCategorizer.loadMLModel('./models/categorizer');
Performance Notes
- First prediction is slower due to model warm-up
- Batch predictions (predictBatch) are more efficient for multiple transactions
- Consider installing @tensorflow/tfjs-node for faster CPU inference
CLI Usage for ML Training
# Train ML model using synthetic data only
pnpm parse-boa --train-ml --model-out ./models/categorizer
# Train ML model from your parsed statements (recommended)
pnpm parse-boa --train-ml --inputDir ./statements --model-out ./models/categorizer
# Train with more epochs for better accuracy
pnpm parse-boa --train-ml --inputDir ./statements --model-out ./models/categorizer --epochs 100 --verbose
The training process:
- Parses all PDFs in the input directory
- Extracts categorized transactions as training examples
- Augments with synthetic data for better coverage
- Trains the neural network
- Saves the model to the specified path
Recurring Transaction Detection
The parser can detect recurring transactions such as subscriptions, regular bills, and periodic payments.
CLI Usage
# Detect recurring transactions
pnpm parse-boa --inputDir ./statements --detect-recurring --out result.json
# With verbose output showing detection stats
pnpm parse-boa --inputDir ./statements --detect-recurring --verbose --out result.json
Environment Variable
# Enable via .env file
BOA_DETECT_RECURRING=true
Detected Frequencies
| Frequency | Interval Range |
|-----------|----------------|
| weekly | 5-9 days |
| bi-weekly | 12-16 days |
| monthly | 26-35 days |
| quarterly | 85-100 days |
| semi-annual | 170-200 days |
| annual | 350-380 days |
| irregular | High variance or outside ranges |
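The interval ranges above translate directly into code; a sketch (the real detector also weighs interval variance before settling on irregular):
type Frequency = 'weekly' | 'bi-weekly' | 'monthly' | 'quarterly' | 'semi-annual' | 'annual' | 'irregular';
// Map an average interval (in days) onto the ranges from the table above.
function classifyFrequency(avgIntervalDays: number): Frequency {
  if (avgIntervalDays >= 5 && avgIntervalDays <= 9) return 'weekly';
  if (avgIntervalDays >= 12 && avgIntervalDays <= 16) return 'bi-weekly';
  if (avgIntervalDays >= 26 && avgIntervalDays <= 35) return 'monthly';
  if (avgIntervalDays >= 85 && avgIntervalDays <= 100) return 'quarterly';
  if (avgIntervalDays >= 170 && avgIntervalDays <= 200) return 'semi-annual';
  if (avgIntervalDays >= 350 && avgIntervalDays <= 380) return 'annual';
  return 'irregular';
}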
Output Structure
When --detect-recurring is enabled, a recurring object is added to the JSON output:
{
"recurring": {
"patterns": [
{
"patternId": "rec_debit_abc123",
"merchantKey": "netflix",
"merchantName": "NETFLIX.COM",
"frequency": "monthly",
"averageIntervalDays": 30.5,
"intervalStdDev": 1.2,
"averageAmount": 15.99,
"amountVariance": 0,
"isFixedAmount": true,
"category": "Entertainment",
"subcategory": "Streaming",
"direction": "debit",
"occurrenceCount": 6,
"firstSeen": "2025-01-15",
"lastSeen": "2025-06-15",
"expectedNext": "2025-07-15",
"confidence": 0.95,
"isSubscription": true,
"transactionIds": ["tx_abc123...", "tx_def456..."]
}
],
"summary": {
"totalPatterns": 5,
"totalRecurringTransactions": 30,
"recurringPercentage": 15.5,
"estimatedMonthlyRecurring": 125.50,
"estimatedAnnualRecurring": 1506.00,
"byFrequency": {
"weekly": 0,
"bi-weekly": 1,
"monthly": 3,
"quarterly": 1,
"semi-annual": 0,
"annual": 0,
"irregular": 0
},
"subscriptionCount": 3
}
}
}
Subscription Detection
The detector identifies likely subscriptions based on:
- Known services: Netflix, Spotify, Amazon Prime, Disney+, HBO, Hulu, Apple, Google, Microsoft, Adobe, etc.
- Keywords: subscription, membership, premium, monthly, annual
- Amount stability: Fixed or near-fixed amounts with low variance
Confidence Scoring
Pattern confidence is calculated from:
- Occurrence count: More occurrences = higher confidence
- Interval regularity: Lower standard deviation = higher confidence
- Amount consistency: Fixed amounts boost confidence
- Frequency type: Regular frequencies (weekly, monthly) score higher than irregular (one possible weighting is sketched below)
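One way those factors could combine, purely as an illustration; the actual weights are internal to the library and almost certainly differ:
// Hypothetical scoring: each factor contributes a bounded share of the total.
function patternConfidence(p: {
  occurrenceCount: number;
  intervalStdDev: number;
  averageIntervalDays: number;
  isFixedAmount: boolean;
  isRegularFrequency: boolean;
}): number {
  let score = Math.min(0.5, p.occurrenceCount * 0.1);  // more occurrences, more confidence
  const cv = p.intervalStdDev / p.averageIntervalDays; // coefficient of variation
  score += Math.max(0, 0.3 * (1 - cv));                // regular intervals score higher
  if (p.isFixedAmount) score += 0.1;                   // fixed amounts boost confidence
  if (p.isRegularFrequency) score += 0.1;              // weekly/monthly beats irregular
  return Math.min(1, score);
}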
Programmatic Usage
import { detectRecurring, detectRecurringFromStatements } from 'boa-statement-parser';
// From raw transactions
const result = detectRecurring(transactions, {
minOccurrences: 3, // Minimum occurrences to detect pattern (default: 2)
maxIntervalVariance: 0.4, // Maximum coefficient of variation (default: 0.4)
});
// From parsed statements
const result = detectRecurringFromStatements(statements);
console.log(result.summary.estimatedMonthlyRecurring);
console.log(result.patterns.filter(p => p.isSubscription));
CLI Options
| Option | Description |
|--------|-------------|
| -d, --inputDir <dir> | Directory containing PDF files to batch process |
| -o, --out <file> | Output file path (default: stdout) |
| -f, --format <format> | Output format: json or ofx (default: json) |
| --split-accounts | Split OFX into separate files per account |
| -v, --verbose | Enable verbose output with debug info |
| -s, --strict | Enable strict validation mode |
| --pretty | Pretty-print JSON output (default: true) |
| --no-pretty | Disable pretty-printing |
| --single | Parse as single statement (legacy mode) |
| --schema-version <v1\|v2> | Output schema version (default: v1) |
| --train-ml | Train ML categorizer from parsed transactions |
| --ml | Use ML-based categorization (hybrid mode) |
| --model <path> | Path to ML model directory (for loading) |
| --model-out <path> | Output path for trained ML model |
| --epochs <number> | Number of training epochs (default: 50) |
| --detect-recurring | Detect recurring transactions and include in output |
| --version | Show version number |
| --help | Show help |
Init Command
| Option | Description |
|--------|-------------|
| init | Initialize project with .env file and ML model |
| init --force | Overwrite existing files |
| init --no-model | Skip copying ML model files |
Output Schema Versioning
The parser supports multiple output schema versions for backwards compatibility and new features.
Selecting Schema Version
Schema version is resolved with the following precedence (highest to lowest):
- CLI flag: --schema-version v1 or --schema-version v2
- Environment variable: FINAL_RESULT_SCHEMA_VERSION=v2
- Default: v1 (resolution sketched below)
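That precedence is a three-step fallback; a minimal sketch, assuming dotenv has already populated process.env:
type SchemaVersion = 'v1' | 'v2';
// Highest precedence first: CLI flag, then environment variable, then default.
function resolveSchemaVersion(cliFlag?: string): SchemaVersion {
  const candidate = cliFlag ?? process.env.FINAL_RESULT_SCHEMA_VERSION ?? 'v1';
  return candidate === 'v2' ? 'v2' : 'v1';
}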
Schema v1 (Default)
The original output format with a flat array of statements:
{
"schemaVersion": "v1",
"statements": [
{
"account": { "institution": "Bank of America", "accountType": "checking", ... },
"summary": { "startingBalance": 100, "endingBalance": 200, ... },
"transactions": [...],
"metadata": { "parserVersion": "1.0.0", "parsedAt": "...", "warnings": [] }
}
],
"totalStatements": 1,
"totalTransactions": 5
}
Schema v2 (BOFA Rollup)
A new rollup format that groups transactions by account with aggregated totals, analytics, and integrity checks:
{
"schemaVersion": "v2",
"startingBalance": 100,
"endingBalance": 500,
"totalStatements": 3,
"totalTransactions": 50,
"analytics": {
"quarterlyCashFlow": [...],
"incomeVsExpenses": { "totalIncome": 5000, "totalExpenses": 4500, "netIncome": 500, ... },
"lenderSummary": { "averageMonthlyIncome": 2500, "incomeStabilityScore": 85, ... },
"taxPreparation": { "taxYear": 2025, "totalTaxableIncome": 30000, ... }
},
"integrity": {
"overallValid": true,
"statementsChecked": 3,
"statementsWithIssues": 0,
"statementResults": [...],
"summary": { "totalDiscrepancies": 0, "totalDelta": 0, "warnings": [] }
},
"accounts": [
{
"account": { "institution": "Bank of America", "accountType": "checking", ... },
"summary": { "startingBalance": 100, "endingBalance": 300, ... },
"transactions": [
{
"date": "2025-01-15",
"description": "PAYROLL DIRECT DEP",
"amount": 2500,
"category": "Income",
"subcategory": "Salary",
"confidence": 0.95,
"statementId": "CHECKING-1234-20250101-20250131",
"periodLabel": "2025-01 BOA Checking",
...
}
],
"totalStatements": 2,
"totalTransactions": 30
}
]
}
Key Differences: v1 vs v2
| Feature | v1 | v2 |
|---------|----|----|
| Structure | Flat statements array | Grouped by account |
| Root balances | Not included | Rolled up across accounts |
| Per-account totals | Not included | totalStatements, totalTransactions |
| Analytics | Not included | Quarterly cash flow, income vs expenses, lender summary, tax prep |
| Integrity checks | Not included | Per-statement balance validation with discrepancy reporting |
| Transaction traceability | Not included | statementId, periodLabel on each transaction |
| Metadata | Per-statement | Not included (simpler) |
| Use case | Individual statement processing | Multi-account rollup/reporting/analytics |
Migration from v1 to v2
If you're consuming v1 output and want to migrate to v2:
- Update your code to handle the accounts array instead of statements
- Access transactions via accounts[n].transactions instead of statements[n].transactions
- Use root-level startingBalance/endingBalance for overall totals
- Note that v2 groups multiple statements for the same account into one account block
Version Lifecycle
Future schema versions will follow this pattern:
- New versions are added as v3, v4, etc.
- Existing versions remain available indefinitely
- Default version changes only in major releases
- Deprecation warnings will be added before removal
Confidence Semantics
The confidence field in transactions represents parsing/OCR confidence, not financial correctness:
| Value | Meaning |
|-------|---------|
| 0.95 (HIGH) | Exact pattern match with high certainty |
| 0.85 (MEDIUM_HIGH) | Strong keyword match |
| 0.75 (MEDIUM) | Partial or weaker pattern match |
| 0.50 (LOW) | Uncategorized or uncertain extraction |
Important distinctions:
- Confidence ≠ Financial Accuracy: A transaction with 0.95 confidence means the parser is confident about the extraction, not that the amount is financially verified
- Low confidence flags: Values < 0.75 indicate the parser had difficulty extracting or categorizing the transaction
- OCR artifacts: Low confidence may indicate OCR issues in the source PDF
- Review recommendation: Transactions with confidence < 0.75 should be manually reviewed (see the filter example below)
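Acting on that recommendation is a one-line filter over the parsed output, using the result object from the programmatic example earlier:
// Collect the transactions the parser was least sure about.
const needsReview = result.statement.transactions.filter(
  (tx) => tx.categorization.confidence < 0.75,
);
console.log(`${needsReview.length} transactions need manual review`);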
Example Commands
# Use v1 (default)
pnpm parse-boa ./statement.pdf
# Explicitly use v1
pnpm parse-boa ./statement.pdf --schema-version v1
# Use v2 rollup format
pnpm parse-boa ./statement.pdf --schema-version v2
# Use environment variable
FINAL_RESULT_SCHEMA_VERSION=v2 pnpm parse-boa ./statement.pdf
Environment Variables
The parser automatically loads environment variables from a .env file using dotenv. This eliminates the need to set environment variables manually in your shell.
Quick Setup
# Copy the example file
cp .env.example .env
# Edit as needed
nano .env  # or use your preferred editor
The .env file is automatically loaded when running the CLI. No additional configuration required.
Application Settings
| Variable | Values | Default | Description |
|----------|--------|---------|-------------|
| FINAL_RESULT_SCHEMA_VERSION | v1, v2 | v1 | Controls the output JSON schema format |
FINAL_RESULT_SCHEMA_VERSION
Determines which output schema version to use when generating JSON output. This is the primary application-specific environment variable.
- v1: Original flat format with a statements array. Best for individual statement processing.
- v2: Rollup format grouped by account with analytics, integrity checks, and aggregated totals. Best for multi-account reporting and financial analysis.
Resolution precedence (highest to lowest):
- CLI flag: --schema-version v2
- .env file: FINAL_RESULT_SCHEMA_VERSION=v2
- Default: v1
# Simply run the CLI - .env is loaded automatically
pnpm parse-boa ./statement.pdf
# CLI flags still override .env values
pnpm parse-boa ./statement.pdf --schema-version v1
Input/Output Settings
| Variable | CLI Equivalent | Default | Description |
|----------|----------------|---------|-------------|
| BOA_INPUT_DIR | --inputDir | (none) | Directory containing PDF files to process |
| BOA_OUTPUT_FILE | --out | stdout | Output file path |
| BOA_FORMAT | --format | json | Output format: json, ofx, csv |
| BOA_SPLIT_ACCOUNTS | --split-accounts | false | Split output into separate files per account |
Example:
# .env
BOA_INPUT_DIR=C:\Users\YourName\Documents\Statements
BOA_OUTPUT_FILE=result.json
BOA_FORMAT=json
Now you can simply run:
pnpm parse-boa
Parsing Options
| Variable | CLI Equivalent | Default | Description |
|----------|----------------|---------|-------------|
| BOA_VERBOSE | --verbose | false | Enable verbose output with debug info |
| BOA_STRICT | --strict | false | Enable strict validation mode |
| BOA_PRETTY | --pretty | true | Pretty-print JSON output |
| BOA_SINGLE | --single | false | Parse as single statement (legacy mode) |
| BOA_DETECT_RECURRING | --detect-recurring | false | Detect recurring transactions |
ML Categorization Settings
| Variable | CLI Equivalent | Default | Description |
|----------|----------------|---------|-------------|
| BOA_ML | --ml | false | Use ML-based categorization (hybrid mode) |
| BOA_MODEL_PATH | --model | (none) | Path to ML model directory for loading |
| BOA_MODEL_OUT | --model-out | (none) | Output path for trained ML model |
| BOA_TRAIN_ML | --train-ml | false | Train ML categorizer from parsed transactions |
| BOA_EPOCHS | --epochs | 50 | Number of training epochs |
Example ML configuration:
# .env
BOA_ML=true
BOA_MODEL_PATH=./models/categorizer
Node.js Runtime
| Variable | Values | Default | Description |
|----------|--------|---------|-------------|
| NODE_ENV | development, production, test | development | Node.js environment mode |
NODE_ENV
Standard Node.js environment variable that affects runtime behavior:
- development: Enables verbose error messages, development-only features
- production: Optimizes for performance, minimizes logging
- test: Used during test execution (set automatically by Vitest)
# Production mode
NODE_ENV=production pnpm parse-boa ./statement.pdf
TensorFlow.js Settings
These variables control the ML categorizer's TensorFlow.js backend behavior.
| Variable | Values | Default | Description |
|----------|--------|---------|-------------|
| TF_FORCE_BACKEND | cpu, webgl, wasm | cpu | Force a specific TensorFlow.js backend |
| TF_CPP_MIN_LOG_LEVEL | 0, 1, 2, 3 | 0 | TensorFlow C++ logging level |
| TF_ENABLE_ONEDNN_OPTS | 0, 1 | 1 | Enable/disable oneDNN optimizations |
TF_FORCE_BACKEND
Forces TensorFlow.js to use a specific computation backend:
- cpu: Pure JavaScript CPU backend. Most compatible, works everywhere.
- webgl: GPU-accelerated via WebGL. Faster for large models but requires a GPU.
- wasm: WebAssembly backend. Good balance of speed and compatibility.
# Force CPU backend (recommended for Node.js)
TF_FORCE_BACKEND=cpu pnpm parse-boa --ml ./statement.pdf
TF_CPP_MIN_LOG_LEVEL
Controls TensorFlow's C++ logging verbosity (when using native bindings):
- 0: All logs (DEBUG, INFO, WARNING, ERROR)
- 1: INFO and above
- 2: WARNING and above (suppresses most logs)
- 3: ERROR only
# Suppress TensorFlow info/warning logs
TF_CPP_MIN_LOG_LEVEL=2 pnpm parse-boa --train-ml --inputDir ./statements
TF_ENABLE_ONEDNN_OPTS
Controls Intel oneDNN (formerly MKL-DNN) optimizations:
- 1: Enable oneDNN optimizations (faster on Intel CPUs)
- 0: Disable oneDNN (useful if experiencing compatibility issues)
# Disable oneDNN if seeing warnings
TF_ENABLE_ONEDNN_OPTS=0 pnpm parse-boa --ml ./statement.pdf
Debugging
| Variable | Values | Default | Description |
|----------|--------|---------|-------------|
| DEBUG | Pattern string | (none) | Enable debug output for specific modules |
| NO_COLOR | 1 | (none) | Disable colors in console output |
| FORCE_COLOR | 1 | (none) | Force colors in console output |
DEBUG
Enables debug logging for specific modules using the debug package pattern:
# Enable all boa-parser debug logs
DEBUG=boa-parser:* pnpm parse-boa ./statement.pdf
# Enable specific module debugging
DEBUG=boa-parser:extractor pnpm parse-boa ./statement.pdf
# Multiple patterns
DEBUG=boa-parser:parser,boa-parser:categorizer pnpm parse-boa ./statement.pdf
NO_COLOR
Disables ANSI color codes in console output. Useful for logging to files or CI environments:
# Disable colors
NO_COLOR=1 pnpm parse-boa ./statement.pdf > output.log
FORCE_COLOR
Forces ANSI color codes even when output is not a TTY. Overrides NO_COLOR:
# Force colors in piped output
FORCE_COLOR=1 pnpm parse-boa ./statement.pdf | tee output.log
Example .env File
# .env - Complete example with all CLI options configured
# Input/Output
BOA_INPUT_DIR=C:\Users\YourName\Documents\Statements
BOA_OUTPUT_FILE=result.json
BOA_FORMAT=json
# Parsing
FINAL_RESULT_SCHEMA_VERSION=v2
BOA_VERBOSE=false
BOA_STRICT=false
# ML Categorization
BOA_ML=true
BOA_MODEL_PATH=./models/categorizer
# Runtime
NODE_ENV=production
TF_CPP_MIN_LOG_LEVEL=2
TF_ENABLE_ONEDNN_OPTS=0
With this configuration, you can run:
pnpm parse-boa
Instead of:
pnpm parse-boa --inputDir "C:\Users\YourName\Documents\Statements" --ml --model ./models/categorizer --schema-version v2 --out result.json
CI/CD Environment
In CI/CD pipelines (GitHub Actions, etc.), you can set environment variables in your workflow:
# .github/workflows/ci.yml
jobs:
test:
runs-on: ubuntu-latest
env:
NODE_ENV: test
FINAL_RESULT_SCHEMA_VERSION: v2
TF_CPP_MIN_LOG_LEVEL: 2
steps:
- uses: actions/checkout@v4
- run: pnpm test
CSV Export
The parser supports exporting to CSV format for spreadsheet import (Excel, Google Sheets, etc.).
CLI Usage
# Export single PDF to CSV
pnpm parse-boa ./statement.pdf --format csv --out statement.csv
# Export directory of PDFs to CSV
pnpm parse-boa --inputDir ./statements --format csv --out combined.csv
# Split into separate files per account (boa_checking_3529.csv, boa_savings_4971.csv)
pnpm parse-boa --inputDir ./statements --format csv --split-accounts --out ./output/
CSV Columns
| Column | Description |
|--------|-------------|
| Date | Transaction date (ISO format) |
| Posted Date | Posted date if available |
| Description | Cleaned transaction description |
| Merchant | Extracted merchant name |
| Amount | Signed amount (negative for debits) |
| Direction | credit or debit |
| Type | Transaction type (Purchase, Deposit, Transfer, etc.) |
| Account Type | checking, credit, savings |
| Account Number | Masked account number |
| Category | Assigned category |
| Subcategory | Assigned subcategory |
| Confidence | Categorization confidence score |
Programmatic Usage
import { toFinalResultV2, exportCsv, exportCsvByAccount } from 'boa-statement-parser';
// Convert to v2 format
const v2Result = toFinalResultV2(canonicalOutput);
// Export to CSV
const csvText = exportCsv(v2Result);
// Or export per account
const splitResults = exportCsvByAccount(v2Result);
for (const { filename, content } of splitResults) {
fs.writeFileSync(filename, content);
}
OFX Export
The parser supports exporting to OFX (Open Financial Exchange) format for import into accounting software like Quicken, GnuCash, or Dolibarr.
CLI Usage
# Export single PDF to OFX
pnpm parse-boa ./statement.pdf --format ofx --out statement.ofx
# Export directory of PDFs to OFX
pnpm parse-boa --inputDir ./statements --format ofx --out combined.ofx
# Split into separate files per account (boa_checking_3529.ofx, boa_savings_4971.ofx)
pnpm parse-boa --inputDir ./statements --format ofx --split-accounts --out ./output/
# With verbose output
pnpm parse-boa ./statement.pdf --format ofx --out statement.ofx --verbose
OFX Transaction Types
The exporter automatically detects specific OFX transaction types from descriptions:
| Type | Detected From |
|------|---------------|
| DEP | Deposit, Direct Dep, Payroll |
| POS | Checkcard, Purchase, Debit Card |
| ATM | ATM, Cash Withdrawal |
| XFER | Zelle, Transfer, Wire |
| CHECK | Check #1234 (also extracts CHECKNUM) |
| FEE | Fee, Service Charge, Overdraft |
| PAYMENT | Payment, Bill Pay, ACH |
| INT | Interest |
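As a sketch, the detection in the table above can be read as ordered keyword checks over the uppercased description (these patterns are illustrative, not the exporter's exact rules):
// First match wins; OTHER is an assumed fallback for unmatched descriptions.
function detectOfxType(description: string): string {
  const d = description.toUpperCase();
  if (/\bDEPOSIT\b|DIRECT DEP|PAYROLL/.test(d)) return 'DEP';
  if (/CHECKCARD|PURCHASE|DEBIT CARD/.test(d)) return 'POS';
  if (/\bATM\b|CASH WITHDRAWAL/.test(d)) return 'ATM';
  if (/ZELLE|TRANSFER|WIRE/.test(d)) return 'XFER';
  if (/\bCHECK\b\s*#?\d+/.test(d)) return 'CHECK';
  if (/\bFEE\b|SERVICE CHARGE|OVERDRAFT/.test(d)) return 'FEE';
  if (/PAYMENT|BILL PAY|\bACH\b/.test(d)) return 'PAYMENT';
  if (/\bINTEREST\b/.test(d)) return 'INT';
  return 'OTHER';
}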
Programmatic Usage
import { toFinalResultV2, exportOfx } from 'boa-statement-parser';
// Convert to v2 format
const v2Result = toFinalResultV2(canonicalOutput);
// Export to OFX
const ofxText = exportOfx(v2Result);
// Or export a single account
import { exportAccountOfx } from 'boa-statement-parser';
const singleAccountOfx = exportAccountOfx(v2Result.accounts[0]);
OFX Features
- FITID: Uses the deterministic transactionId as the OFX FITID for reliable duplicate detection (these conventions are sketched below)
- Signed amounts: Credits are positive, debits are negative
- Date formatting: Converts ISO dates to OFX YYYYMMDD format
- Multiple accounts: Generates one <STMTTRNRS> block per account
- Ledger balance: Includes ending balance with date
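A hand-rolled sketch of how one transaction might serialize under those conventions (the exporter's exact output will differ):
// Build a single <STMTTRN> element: YYYYMMDD date, signed amount, FITID from the transactionId.
function stmtTrn(tx: {
  transactionId: string;
  date: string; // ISO YYYY-MM-DD
  amount: number;
  direction: 'debit' | 'credit';
  description: string;
}): string {
  const signed = tx.direction === 'debit' ? -Math.abs(tx.amount) : Math.abs(tx.amount);
  const ofxDate = tx.date.replace(/-/g, ''); // 2025-01-15 -> 20250115
  return [
    '<STMTTRN>',
    `<DTPOSTED>${ofxDate}`,
    `<TRNAMT>${signed.toFixed(2)}`,
    `<FITID>${tx.transactionId}`,
    `<NAME>${tx.description}`,
    '</STMTTRN>',
  ].join('\n');
}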
Schema Versioning Notes
- v1 stays canonical raw output: The v1 schema represents the raw parsed output from individual statements
- v2 is rollup + integrity: The v2 schema groups by account and adds analytics/integrity checks
- schemaVersion remains v1/v2: The schemaVersion field is a const and should not be changed
- schemaRevision is optional: Use the new optional schemaRevision field for minor version tracking without breaking validation
Deduplication
When processing multiple PDFs (especially when combined PDFs overlap with individual statement PDFs), the parser performs intelligent deduplication:
Statement-Level Deduplication
Identity Key: {accountNumberMasked}|{periodStart}|{periodEnd}
When duplicate statements are found, the parser keeps the "best" one using these tie-break rules:
- Completeness score - Higher wins (based on transaction count, valid totals, fewer warnings)
- Standalone over combined - Prefers individual PDFs over combined PDFs (fewer merge artifacts)
- Lexicographic filename - Deterministic final tie-breaker
Transaction-Level Deduplication
Identity Key: {date}|{amount}|{direction}|{normalized_description}
Transactions are deduped within each statement, keeping the one with the higher confidence score.
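Both identity keys are cheap to derive from parsed fields; a sketch of the key construction described above (the normalization step is an assumption):
// Statement identity: same account + same statement period = same statement.
function statementKey(s: { accountNumberMasked: string; periodStart: string; periodEnd: string }): string {
  return `${s.accountNumberMasked}|${s.periodStart}|${s.periodEnd}`;
}
// Transaction identity: date + amount + direction + normalized description.
function transactionKey(tx: { date: string; amount: number; direction: string; description: string }): string {
  const normalized = tx.description.toUpperCase().replace(/\s+/g, ' ').trim();
  return `${tx.date}|${tx.amount}|${tx.direction}|${normalized}`;
}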
Example Output Summary
=== Batch Processing Summary ===
Total PDFs found: 11
PDFs succeeded: 11
PDFs failed: 0
Statements before dedup: 11
Statements kept: 11
Statements deduped: 0
Transactions merged: 424
Transactions deduped: 1
================================
Development
Build
pnpm build
Test
pnpm test # Run all tests
pnpm test:watch # Watch mode
pnpm test:coverage  # With coverage report
Lint
pnpm lint # Check for issues
pnpm lint:fix # Auto-fix issues
pnpm format    # Format with Prettier
Project Structure
/src
/cli # Command-line interface
/batch # Batch processing orchestration
batch-processor.ts # Multi-PDF processing with dedup
/parsers # Bank-specific parsers
/boa # Bank of America parsers
checking-parser.ts # Checking account parsing
credit-parser.ts # Credit card parsing
channel-extractor.ts # Channel type & bank reference extraction
merchant-extractor.ts # Merchant info extraction
line-merger.ts # Wrapped line handling
transaction-normalizer.ts # Full transaction normalization
/extractors # PDF extraction utilities
/normalizers # Data transformation utilities
/categorization # Transaction categorization
categories.ts # Legacy category rules
categorizer.ts # Legacy categorizer
categorizer-v2.ts # Priority-based categorizer with confidence tiers
ml-categorizer.ts # TensorFlow.js ML-based categorizer
hybrid-categorizer.ts # Combined rule + ML categorization
training-data-generator.ts # Synthetic training data generation
/schemas # Zod schemas and types
/types # TypeScript output types (aligned with JSON Schema)
/validation # AJV JSON Schema validation
/utils # Shared utilities
directory-scanner.ts # PDF file discovery and filtering
statement-merger.ts # Statement/transaction deduplication
/tests # Test files (367+ tests)
/.windsurf # Agent documentation
Extending to Other Banks
The architecture supports adding parsers for other banks:
- Create a new directory: src/parsers/<bank>/
- Implement detection patterns for the bank's format
- Create account-type-specific parsers
- Add categorization rules for bank-specific descriptions
- Register the parser in the main index
Example structure for a new bank:
/src/parsers/chase/
index.ts # Main parser and detection
checking-parser.ts # Checking account logic
credit-parser.ts # Credit card logic
types.ts # Internal types
Library Choices
| Library | Purpose | Rationale |
|---------|---------|-----------|
| pdfjs-dist | Layout-aware PDF extraction | Positional text extraction for reliable table parsing |
| pdf-parse | Fallback PDF extraction | Lightweight, no native deps, good for simple text |
| zod | Schema validation | Runtime validation, TypeScript inference, composable |
| ajv | JSON Schema validation | Draft 2020-12 support, fast, comprehensive |
| commander | CLI parsing | Industry standard, auto-help, type-safe |
| vitest | Testing | Fast, ESM-native, Jest-compatible API |
Parsing Engine Upgrade
Layout-Aware Extraction (pdfjs-dist)
The parser now includes a layout-aware extraction engine using pdfjs-dist that extracts text with positional coordinates (x, y, width, height). This enables reliable row/column reconstruction for table parsing.
Key advantages:
- Positional data: Each text item includes x/y coordinates for accurate row grouping
- Column detection: Infers column boundaries from header rows or X-coordinate clustering
- Wrapped line handling: Merges multi-line descriptions that span multiple PDF text items
- Resilience: More robust to minor formatting changes across statement versions
When to use each extractor:
- pdfjs-dist (default): Best for structured table data, transaction parsing
- pdf-parse (fallback): Simpler extraction when layout isn't critical
Row/Column Reconstruction
The layout engine provides utilities for:
import { extractTextItems } from 'boa-statement-parser';
import { groupByRows, mergeWrappedDescriptions } from 'boa-statement-parser/layout';
// Extract with positions
const { items } = await extractTextItems('./statement.pdf');
// Group into rows (yTolerance default: 3.0)
const rows = groupByRows(items, 3.0);
// Merge wrapped descriptions
const merged = mergeWrappedDescriptions(rows);
Balance Reconciliation
Quality checks validate that parsed data is consistent:
import { validateReconciliation } from 'boa-statement-parser/validation';
const result = validateReconciliation(
startingBalance,
endingBalance,
totalCredits,
totalDebits,
{ tolerance: 0.01 }
);
if (!result.passed) {
console.warn(`Balance mismatch: ${result.difference}`);
}
Future Extensions
- MuPDF adapter: Optional high-fidelity extraction (not yet implemented)
- OCR pathway: For scanned/image-based PDFs (extension point documented)
Known Limitations
- Scanned/image-based PDFs are not supported (OCR required)
- Some complex table layouts may not parse correctly
- Statement formats may vary; parser tuned for recent formats
- International/non-USD statements not supported
- Password-protected PDFs will fail with an error (captured in parseErrors)
Troubleshooting
"Could not determine account type"
The parser couldn't identify whether the statement is checking or credit. This usually means:
- The PDF text extraction failed
- The statement format is significantly different from expected
Try running with --verbose to see extracted text.
"No transactions found"
The transaction section patterns didn't match. Possible causes:
- Statement has no transactions
- Format differs from expected patterns
- PDF text extraction issues
Balance Mismatch Warning
In strict mode, the parser verifies that:
starting_balance + credits - debits = ending_balance
A mismatch may indicate:
- Missing transactions
- Parsing errors
- Fees/interest not captured
License
MIT
Contributing
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run pnpm lint && pnpm test
- Submit a pull request
