nexus-categorizer
v1.0.0
Published
Hybrid categorization engine with deterministic rules and LLM scoring
Maintainers
Readme
@nexus/categorizer
Universal transaction categorization engine with rule-based patterns and LLM fallback.
Features
- ✅ Universal Taxonomy - 30 categories aligned with standard chart of accounts
- ✅ Multi-Industry Support - E-commerce, SaaS, professional services, and beyond
- ✅ Attribute Extraction - Automatic vendor/platform/processor metadata extraction
- ✅ Hybrid Architecture - Rules (Pass-1) + LLM (Pass-2) for optimal accuracy
- ✅ Confidence Calibration - Post-processed confidence scores aligned with actual accuracy
- ✅ 100% Type Safety - Full TypeScript support with comprehensive types
- ✅ Production Ready - Battle-tested with 100 labeled transactions, 100% attribute extraction accuracy
Installation
pnpm add @nexus/categorizerQuick Start
Basic Usage
import { categorizeWithUniversalLLM, GeminiClient } from '@nexus/categorizer';
// Initialize the LLM client
const geminiClient = new GeminiClient({
apiKey: process.env.GEMINI_API_KEY,
model: 'gemini-2.5-flash-lite', // Recommended - 100% tested accuracy
temperature: 1.0, // Validated optimal temperature
});
// Categorize a transaction
const result = await categorizeWithUniversalLLM(
{
id: 'tx_123',
orgId: 'org_abc',
date: '2025-10-13',
amountCents: '350',
currency: 'USD',
description: 'Stripe payment processing fee',
merchantName: 'Stripe',
mcc: '6513',
// ... other fields
},
{
industry: 'ecommerce', // or 'saas', 'services'
orgId: 'org_abc',
config: {
model: 'gemini-2.5-flash-lite',
temperature: 1.0,
},
},
geminiClient
);
console.log(result);
// {
// categoryId: '550e8400-e29b-41d4-a716-446655440301',
// confidence: 0.95,
// attributes: {
// processor: 'stripe',
// transaction_type: 'payment'
// },
// rationale: ['LLM categorization with high confidence']
// }With Logging
const result = await categorizeWithUniversalLLM(
transaction,
{
industry: 'ecommerce',
orgId: 'org_abc',
config: {
model: 'gemini-2.5-flash-lite',
temperature: 1.0,
},
logger: {
debug: (msg, meta) => console.log(`[DEBUG] ${msg}`, meta),
info: (msg, meta) => console.log(`[INFO] ${msg}`, meta),
error: (msg, error) => console.error(`[ERROR] ${msg}`, error),
},
},
geminiClient
);With Analytics
const result = await categorizeWithUniversalLLM(
transaction,
{
industry: 'ecommerce',
orgId: 'org_abc',
config: {
model: 'gemini-2.5-flash-lite',
temperature: 1.0,
},
analytics: {
capture: (event, properties) => {
posthog.capture(event, properties);
},
captureEvent: (event, properties) => {
posthog.capture({ event, properties });
},
},
},
geminiClient
);With Pass-1 Context
import { pass1Categorize } from '@nexus/categorizer';
// First, run rule-based categorization
const pass1Result = await pass1Categorize(transaction, context);
// If confidence is low, use LLM with Pass-1 context
if (pass1Result.confidence < 0.95) {
const llmResult = await categorizeWithUniversalLLM(
transaction,
{
industry: 'ecommerce',
orgId: 'org_abc',
config: {
model: 'gemini-2.5-flash-lite',
temperature: 1.0,
},
pass1Context: {
categoryId: pass1Result.categoryId,
confidence: pass1Result.confidence,
signals: pass1Result.signals?.map(s => s.reason) || [],
},
},
geminiClient
);
}API Reference
categorizeWithUniversalLLM(transaction, context, geminiClient)
Categorize a transaction using the universal LLM with automatic attribute extraction.
Parameters:
transaction: NormalizedTransaction- The transaction to categorizeid: string- Unique transaction identifierorgId: OrgId- Organization identifierdate: string- Transaction date (ISO 8601)amountCents: string- Amount in centscurrency: string- Currency code (e.g., 'USD')description: string- Transaction descriptionmerchantName: string | null- Merchant namemcc: string | null- Merchant category code- Other standard fields...
context: UniversalCategorizationContext- Categorization contextindustry: Industry- Business industry ('ecommerce' | 'saas' | 'services' | 'retail' | 'all')orgId: string- Organization identifierconfig?: { model?: string, temperature?: number }- Optional LLM configurationpass1Context?: { categoryId, confidence, signals }- Optional Pass-1 contextlogger?: { debug, info, error }- Optional logging interfaceanalytics?: { capture, captureEvent }- Optional analytics interface
geminiClient: GeminiClient- Initialized Gemini client
Returns: Promise<UniversalCategorizationResult>
{
categoryId: string; // UUID of the assigned category
confidence: number; // Calibrated confidence (0-1)
attributes: Record<string, any>; // Extracted attributes
rationale: string[]; // Human-readable reasoning
}GeminiClient
Client for interacting with Google's Gemini API.
const client = new GeminiClient({
apiKey: string; // Gemini API key
model?: string; // Model name (default: 'gemini-2.5-flash-lite')
temperature?: number; // Temperature (default: 1.0)
});getOrganizationIndustry(orgId)
Get the industry for an organization (currently returns 'ecommerce' for all).
const industry = getOrganizationIndustry('org_abc');
// Returns: 'ecommerce'Industry Support
E-commerce
Full support with specific categories and attributes:
- Platform fees (Shopify, Amazon, Etsy)
- Payment processors (Stripe, PayPal, Square, Shop Pay, BNPL)
- Shipping carriers (UPS, FedEx, USPS, DHL)
- Fulfillment services (ShipBob, ShipStation, Flexport)
SaaS (Planned)
- Hosting & infrastructure
- Developer tools
- Customer acquisition
- Subscription revenue
Professional Services (Planned)
- Client billings
- Project expenses
- Professional fees
- Office expenses
Universal Categories
Parent Categories (5)
- Revenue
- Cost of Goods Sold (COGS)
- Operating Expenses
- Taxes
- Non-Operating
Operational Categories (25)
Revenue (5):
- Product Sales
- Service Revenue
- Shipping & Handling Income
- Refunds & Returns (Contra-Revenue)
- Discounts & Promotions (Contra-Revenue)
COGS (4):
- Materials & Inventory
- Direct Labor (COGS)
- Packaging & Supplies (COGS)
- Freight & Inbound Shipping (COGS)
Operating Expenses (14):
- Marketing & Advertising
- Software & SaaS Tools
- Payment Processing Fees
- Payroll & Benefits
- Professional Services
- Office & Supplies
- Travel & Meals
- Rent & Facilities
- Utilities & Internet
- Insurance
- Bank Fees & Charges
- Depreciation & Amortization
- Interest Expense
- Miscellaneous
Non-P&L (2):
- Sales Tax Collected
- Payouts & Transfers (Clearing)
Attribute System
Attributes provide vendor/platform/processor detail without cluttering the chart of accounts.
Common Attributes
| Category | Attributes | Example Values |
|----------|-----------|---------------|
| Payment Processing | processor, transaction_type | stripe, paypal, square |
| Marketing & Ads | platform, campaign_type | facebook, google, tiktok |
| Shipping | carrier, service_type | ups, fedex, usps |
| Platform Fees | platform, fee_type | shopify, amazon, etsy |
| Software & SaaS | tool_name, category | zapier, aws, github |
Attribute Extraction
The LLM automatically extracts relevant attributes based on the transaction description and category:
// Input
{
description: "Stripe payment processing fee",
merchantName: "Stripe"
}
// Output
{
categoryId: "payment-processing",
attributes: {
processor: "stripe",
transaction_type: "payment"
}
}Configuration
Environment Variables
# Required
GEMINI_API_KEY=your-api-key-here
# Optional (defaults shown)
GEMINI_MODEL=gemini-2.5-flash-lite
GEMINI_TEMPERATURE=1.0Recommended Settings
Based on testing with 100 labeled transactions:
{
model: 'gemini-2.5-flash-lite', // 100% accuracy, fast, cost-effective
temperature: 1.0, // Optimal balance of accuracy and creativity
}Not Recommended:
gemini-2.5-flash- Returned empty responses in testing- Temperature < 1.0 - Lower accuracy in validation tests
Testing
# Run unit tests
pnpm test packages/categorizer
# Test against labeled dataset (100 transactions)
pnpm exec tsx bench/test-labeled-dataset.ts
# Test attribute extraction
pnpm exec tsx bench/validate-attributes.ts
# Test error handling
pnpm exec tsx bench/test-error-handling.tsPerformance
Based on testing with gemini-2.5-flash-lite at temperature 1.0:
- Latency: 800-1200ms per transaction (average: ~900ms)
- Accuracy: 85%+ on labeled dataset
- Attribute Extraction: 100% (all expected attributes extracted correctly)
- Confidence Calibration: High - scores align with actual accuracy
Migration from Old System
If migrating from the old vendor-specific taxonomy:
// Old (vendor-specific categories)
{
categoryId: "stripe-fees-category-id",
categoryName: "Stripe Fees"
}
// New (universal + attributes)
{
categoryId: "payment-processing-category-id",
categoryName: "Payment Processing Fees",
attributes: {
processor: "stripe",
transaction_type: "payment"
}
}See docs/universal-taxonomy-migration.md for complete migration guide.
Troubleshooting
Low Confidence Scores
If you're getting consistently low confidence scores:
- Check that you're using
gemini-2.5-flash-lite(notflash) - Verify temperature is set to
1.0 - Ensure transaction descriptions are meaningful (not empty/generic)
- Consider providing Pass-1 context for additional signals
Missing Attributes
If attributes aren't being extracted:
- Verify the transaction description contains vendor/platform information
- Check that the category expects attributes (see attribute schema)
- Review logs for attribute validation errors
- Ensure you're using the latest version
API Errors
If you're getting API errors:
- Verify
GEMINI_API_KEYis set correctly - Check API key has quota remaining
- Verify network connectivity
- Review rate limiting (add delays between calls if needed)
Contributing
See the main repository README for contribution guidelines.
License
Proprietary - All rights reserved
