OCSV - Odin CSV Parser
A high-performance, RFC 4180 compliant CSV parser written in Odin with Bun FFI support.
Features
- ⚡ High Performance - Fast CSV parsing with SIMD optimizations
- 🦺 Memory Safe - Zero memory leaks, comprehensive testing
- ✅ RFC 4180 Compliant - Full CSV specification support
- 🌍 UTF-8 Support - Correct handling of international characters
- 🔧 Flexible Configuration - Custom delimiters, quotes, comments
- 📦 Bun Native - Direct FFI integration with Bun runtime
- 🛡️ Error Handling - Detailed error messages with line/column info
- 🎯 Schema Validation - Type checking, constraints, type conversion
- 🌊 Streaming API - Memory-efficient chunk-based processing
- 🔄 Transform System - Built-in transforms and pipelines
- 🔌 Plugin System - Extensible architecture for custom functionality
Why Odin + Bun?
Key Advantages:
- ✅ Simple build system (no node-gyp, no Python)
- ✅ Better memory safety (explicit memory management + defer)
- ✅ Better error handling (enums + multiple returns)
- ✅ No C++ wrapper needed (Bun FFI is direct; see the sketch below)
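To illustrate how direct the FFI path is, Bun can load the compiled shared library with dlopen from bun:ffi and call exported procedures with no glue layer. This sketch is illustrative only: the symbol names and signatures are assumptions, not OCSV's actual exports (see src/ffi_bindings.odin for the real surface).
import { dlopen, FFIType, suffix } from "bun:ffi";
// `suffix` resolves to "dylib" on macOS, "so" on Linux, "dll" on Windows.
// The exported symbols below are hypothetical placeholders, not the real
// ocsv FFI exports.
const lib = dlopen(`./libocsv.${suffix}`, {
  parser_create: { args: [], returns: FFIType.ptr },
  parser_destroy: { args: [FFIType.ptr], returns: FFIType.void },
});
const parser = lib.symbols.parser_create();
// ... call other exports here ...
lib.symbols.parser_destroy(parser);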
Quick Start
npm Installation (Recommended)
Install OCSV as an npm package for easy integration with your Bun projects:
# Using Bun
bun add ocsv
# Using npm
npm install ocsv
Then use it in your project:
import { parseCSV } from 'ocsv';
// Parse CSV string
const result = parseCSV('name,age\nJohn,30\nJane,25', { hasHeader: true });
console.log(result.headers); // ['name', 'age']
console.log(result.rows); // [['John', '30'], ['Jane', '25']]
// Parse CSV file
import { parseCSVFile } from 'ocsv';
const data = await parseCSVFile('./data.csv', { hasHeader: true });
console.log(`Parsed ${data.rowCount} rows`);
Manual Installation (Development)
For building from source or contributing:
git clone https://github.com/dvrd/ocsv.git
cd ocsv
Build
Current Support: macOS ARM64 (cross-platform support in progress)
# Using Task (recommended)
task build # Build release library
task build-dev # Build debug library
task test # Run all tests
task info # Show platform info
# Manual build
odin build src -build-mode:shared -out:libocsv.dylib -o:speed
Basic Usage (Odin)
package main

import "core:fmt"
import ocsv "src"

main :: proc() {
    // Create parser
    parser := ocsv.parser_create()
    defer ocsv.parser_destroy(parser)

    // Parse CSV data
    csv_data := "name,age,city\nAlice,30,NYC\nBob,25,SF\n"
    ok := ocsv.parse_csv(parser, csv_data)
    if ok {
        // Access parsed data
        fmt.printfln("Parsed %d rows", len(parser.all_rows))
        for row in parser.all_rows {
            for field in row {
                fmt.printf("%s ", field)
            }
            fmt.printf("\n")
        }
    }
}
Bun API Examples
Basic Parsing
import { parseCSV } from 'ocsv';
// Parse CSV with headers
const result = parseCSV('name,age,city\nAlice,30,NYC\nBob,25,SF', {
hasHeader: true
});
console.log(result.headers); // ['name', 'age', 'city']
console.log(result.rows); // [['Alice', '30', 'NYC'], ['Bob', '25', 'SF']]
console.log(result.rowCount); // 2
Parse from File
import { parseCSVFile } from 'ocsv';
// Parse CSV file with headers
const data = await parseCSVFile('./sales.csv', {
hasHeader: true,
delimiter: ',',
});
console.log(`Parsed ${data.rowCount} rows`);
console.log(`Columns: ${data.headers.join(', ')}`);
// Process rows
for (const row of data.rows) {
console.log(row);
}
Custom Configuration
import { parseCSV } from 'ocsv';
// Parse TSV (tab-separated)
const tsvData = parseCSV('col1\tcol2\nrow1\tdata', {
delimiter: '\t',
hasHeader: true,
});
// Parse with semicolon delimiter (European CSV)
const europeanData = parseCSV('name;age;city\nJohn;30;Paris', {
delimiter: ';',
hasHeader: true,
});
// Relaxed mode (allows some RFC violations)
const relaxedData = parseCSV('messy,csv,"data', {
relaxed: true,
});
Manual Parser Management
For more control, use the Parser class directly:
import { Parser } from 'ocsv';
const parser = new Parser();
try {
const result = parser.parse('a,b,c\n1,2,3');
console.log(result.rows);
} finally {
parser.destroy(); // Important: free memory
}
Performance Modes
OCSV offers two access modes to optimize for different use cases:
Mode Comparison
| Feature | Eager Mode (default) | Lazy Mode |
|---------|---------------------|-----------|
| Performance | ~8 MB/s throughput | ≥180 MB/s (22x faster) |
| Memory Usage | High (all data in JS) | Low (<200 MB for 10M rows) |
| Parse Time (10M rows) | ~150s | <7s (21x faster) |
| Access Pattern | Random access, arrays | Random access, on-demand |
| Memory Management | Automatic (GC) | Manual (destroy() required) |
| Best For | Small files, full iteration | Large files, selective access |
| TypeScript Support | Full | Full (discriminated unions) |
Eager Mode (Default)
Best for: Small to medium files (<100k rows), full dataset iteration, simple workflows
All rows are materialized into JavaScript arrays immediately. Easy to use, no cleanup required.
import { parseCSV } from 'ocsv';
// Default: eager mode
const result = parseCSV(data, { hasHeader: true });
console.log(result.headers); // ['name', 'age', 'city']
console.log(result.rows); // [['Alice', '30', 'NYC'], ...]
console.log(result.rowCount); // 2
// Arrays: standard JavaScript operations
result.rows.forEach(row => console.log(row));
result.rows.map(row => row[0]);
result.rows.filter(row => row[1] > '25');
Pros:
- ✅ Simple API - standard JavaScript arrays
- ✅ No manual cleanup required
- ✅ Familiar array methods (map, filter, slice)
- ✅ Safe for GC-managed memory
Cons:
- ❌ Slower for large files (7.5x overhead)
- ❌ High memory usage (all rows in JS heap)
- ❌ Parse time proportional to data crossing FFI boundary
Lazy Mode (High Performance)
Best for: Large files (>1M rows), selective access, memory-constrained environments
Rows stay in native Odin memory and are accessed on-demand. Achieves near-FFI performance with minimal memory footprint.
import { parseCSV } from 'ocsv';
// Lazy mode: high performance
const result = parseCSV(data, {
mode: 'lazy',
hasHeader: true
});
try {
console.log(result.headers); // ['name', 'age', 'city']
console.log(result.rowCount); // 10000000
// On-demand row access
const row = result.getRow(5000000);
console.log(row.get(0)); // 'Alice'
console.log(row.get(1)); // '30'
// Iterate fields
for (const field of row) {
console.log(field);
}
// Materialize row to array (when needed)
const arr = row.toArray(); // ['Alice', '30', 'NYC']
// Efficient slicing (generator)
for (const row of result.slice(1000, 2000)) {
console.log(row.get(0));
}
// Full iteration (if needed)
for (const row of result) {
console.log(row.get(0));
}
} finally {
// CRITICAL: Must cleanup native memory
result.destroy();
}
Pros:
- ✅ 22x faster parse time than eager mode
- ✅ Low memory footprint (<200 MB for 10M rows)
- ✅ LRU cache (1000 hot rows) for repeated access
- ✅ Generator-based slicing (memory efficient)
- ✅ Random access to any row (O(1) after cache)
Cons:
- ❌ Manual cleanup required (destroy() must be called; a safe wrapper pattern is sketched below)
- ❌ Not standard arrays (use .get(i) or .toArray())
- ❌ Use-after-destroy throws errors
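One way to make that first con harder to get wrong is to wrap lazy parsing in a helper that guarantees cleanup. A minimal sketch, assuming the lazy API described above; withLazyCSV and the LazyLike interface are ours, not part of the package:
import { parseCSV } from 'ocsv';
// Hypothetical convenience wrapper (not part of ocsv): guarantees destroy()
// even if the callback throws. The structural type mirrors the lazy API
// described in this README.
interface LazyLike {
  getRow(index: number): { get(i: number): string; toArray(): string[] };
  rowCount: number;
  destroy(): void;
}
function withLazyCSV<T>(data: string, fn: (result: LazyLike) => T): T {
  const result = parseCSV(data, { mode: 'lazy' }) as unknown as LazyLike;
  try {
    return fn(result);
  } finally {
    result.destroy(); // always runs, success or failure
  }
}
// Usage: the lazy result never escapes the callback, so it cannot leak.
const first = withLazyCSV('a,b\n1,2', (r) => r.getRow(0).toArray());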
When to Use Each Mode
Start
 └─ Is file size > 100 MB or > 1M rows?
     ├─ No  → Use Eager Mode (simple, safe)
     └─ Yes → Do you need to access all rows?
         ├─ No  → Use Lazy Mode (fast, low memory)
         └─ Yes → Memory constrained?
             ├─ Yes → Use Lazy Mode (streaming)
             └─ No  → Try Eager Mode first (measure, switch if slow)
Use Lazy Mode when:
- File size > 100 MB or > 1M rows
- You need selective row access (not full iteration)
- Memory is constrained (< 1 GB available)
- You're building streaming/ETL pipelines
- You need maximum parsing performance
Use Eager Mode when:
- File size < 100 MB or < 1M rows
- You need full dataset iteration
- You prefer simpler API (standard arrays)
- Memory cleanup must be automatic (GC)
- You're prototyping or writing quick scripts
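The decision tree above reduces to a small heuristic. A sketch using the thresholds from this section; pickMode is our name for illustration, not an ocsv export:
// Hypothetical helper encoding the rules of thumb above (thresholds taken
// from this README; the function itself is not part of ocsv).
function pickMode(opts: {
  fileSizeBytes: number;
  estimatedRows: number;
  needAllRows: boolean;
  memoryConstrained: boolean;
}): 'eager' | 'lazy' {
  const big = opts.fileSizeBytes > 100 * 1024 * 1024 || opts.estimatedRows > 1_000_000;
  if (!big) return 'eager';             // small file: simple and safe
  if (!opts.needAllRows) return 'lazy'; // selective access: keep rows native
  return opts.memoryConstrained ? 'lazy' : 'eager'; // else measure eager first
}
// e.g. pickMode({ fileSizeBytes: 1.2e9, estimatedRows: 10_000_000,
//                 needAllRows: false, memoryConstrained: true }) === 'lazy'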
Performance Benchmarks
Test Setup: 10M rows, 4 columns, 1.2 GB CSV file
| Mode | Parse Time | Throughput | Memory Usage |
|------|-----------|------------|--------------|
| FFI Direct | 6.2s | 193 MB/s | 50 MB (baseline) |
| Lazy Mode | 6.8s | 176 MB/s | <200 MB |
| Eager Mode | 151.7s | 7.9 MB/s | ~8 GB |
Key Metrics:
- Lazy mode is 22x faster than eager mode
- Lazy mode uses 40x less memory than eager mode
- Lazy mode is only 9% slower than raw FFI (acceptable overhead)
Advanced: High-Performance FFI Mode
For advanced users who need maximum FFI throughput, OCSV offers an optimized packed buffer mode that achieves 61.25 MB/s (56% of native Odin performance).
Performance Comparison (100K rows, 13.80 MB file):
| Mode | Throughput | ns/row | vs Native |
|------|-----------|--------|-----------|
| Native Odin | 109.28 MB/s | 915 | 100% |
| Packed Buffer | 61.25 MB/s | 2,253 | 56% |
| Bulk JSON | 40.68 MB/s | 2,878 | 37% |
| Field-by-Field | 29.58 MB/s | 3,957 | 27% |
Optimizations:
- ⚡ 61.25 MB/s average throughput
- 🚀 Batched TextDecoder with reduced decoder overhead
- 💾 Pre-allocated arrays to reduce GC pressure
- 📊 SIMD-friendly memory access patterns
- 🔄 Adaptive processing for different row sizes
- 📦 Binary packed format with length-prefixed strings
- ✨ Single FFI call instead of multiple round-trips
Usage:
import { parseCSVPacked } from 'ocsv/bindings/simple';
// Optimized packed buffer mode (highest FFI performance)
const rows = parseCSVPacked(csvData);
// Returns string[][] with minimal FFI overhead
When to use Packed Buffer:
- Need maximum FFI throughput (>40 MB/s)
- Willing to trade API simplicity for performance
- Working with medium-large files through Bun FFI
- Want to minimize cross-language boundary overhead
Note: The 44% overhead compared to native Odin is inherent to the FFI serialization boundary. This is the practical limit for JavaScript-based FFI approaches.
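Figures like these can be reproduced with a simple timing harness. A minimal sketch, assuming a local benchmark.csv; throughput is input bytes divided by wall-clock seconds, matching the MB/s figures quoted above:
import { parseCSVPacked } from 'ocsv/bindings/simple';
// Minimal timing harness (benchmark.csv is a placeholder path).
const csvData = await Bun.file('./benchmark.csv').text();
const bytes = Buffer.byteLength(csvData);
const start = Bun.nanoseconds();
const rows = parseCSVPacked(csvData);
const seconds = (Bun.nanoseconds() - start) / 1e9;
console.log(`${rows.length} rows in ${seconds.toFixed(2)}s`);
console.log(`${(bytes / 1e6 / seconds).toFixed(2)} MB/s`);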
Memory Management
Eager Mode
// Automatic cleanup via garbage collector
const result = parseCSV(data);
// ... use result.rows ...
// Memory freed automatically when result goes out of scope
Lazy Mode
// Manual cleanup required
const result = parseCSV(data, { mode: 'lazy' });
try {
// ... use result ...
} finally {
// CRITICAL: Always call destroy()
result.destroy();
}
Common Pitfalls:
❌ Forgetting to destroy:
const result = parseCSV(data, { mode: 'lazy' });
console.log(result.getRow(0));
// Memory leak! Parser not cleaned up
❌ Use after destroy:
const result = parseCSV(data, { mode: 'lazy' });
result.destroy();
result.getRow(0); // Error: LazyResult has been destroyed
✅ Correct pattern:
const result = parseCSV(data, { mode: 'lazy' });
try {
const row = result.getRow(0);
console.log(row.toArray());
} finally {
result.destroy();
}
TypeScript Support
OCSV provides discriminated union types for type-safe mode selection:
import { parseCSV } from 'ocsv';
// Type: ParseResult (array-based)
const eager = parseCSV(data);
console.log(eager.rows[0]); // Type: string[]
// Type: LazyResult (on-demand)
const lazy = parseCSV(data, { mode: 'lazy' });
console.log(lazy.getRow(0)); // Type: LazyRow
// Compiler error: mode mismatch
const wrong = parseCSV(data, { mode: 'lazy' });
console.log(wrong.rows); // Error: Property 'rows' does not exist
Configuration
// Create parser with custom configuration
parser := ocsv.parser_create()
defer ocsv.parser_destroy(parser)
// TSV (Tab-Separated Values)
parser.config.delimiter = '\t'
// European CSV (semicolon)
parser.config.delimiter = ';'
// Comments (skip lines starting with #)
parser.config.comment = '#'
// Relaxed mode (handle malformed CSV)
parser.config.relaxed = true
// Custom quote character
parser.config.quote = '\''
RFC 4180 Compliance
OCSV fully implements RFC 4180 with support for:
- ✅ Quoted fields with embedded delimiters ("field, with, commas")
- ✅ Nested quotes ("field with ""quotes""" → field with "quotes")
- ✅ Multiline fields (newlines inside quotes)
- ✅ CRLF and LF line endings (Windows/Unix)
- ✅ Empty fields (consecutive delimiters: a,,c)
- ✅ Trailing delimiters (a,b, → 3 fields, last is empty)
- ✅ Leading delimiters (,a,b → 3 fields, first is empty)
- ✅ Comments (extension: lines starting with #)
- ✅ Unicode/UTF-8 (CJK characters, emojis, etc.)
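For instance, embedded delimiters, escaped quotes, and multiline fields should each come back as a single logical row. A small check using the Bun API, with expected results shown as comments:
import { parseCSV } from 'ocsv';
// Embedded comma and doubled quotes inside one quoted field
const quoted = parseCSV('"a, b","say ""hi"""', { hasHeader: false });
console.log(quoted.rows[0]); // expect: ['a, b', 'say "hi"']
// A newline inside quotes stays within a single field of a single row
const multiline = parseCSV('"line1\nline2",x', { hasHeader: false });
console.log(multiline.rowCount); // expect: 1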
Example:
# Sales data for Q1 2024
product,price,description,quantity
"Widget A",19.99,"A great widget, now with more features!",100
"Gadget B",29.99,"Essential gadget
Multi-line description",50Testing
~201 tests, 100% pass rate, 0 memory leaks
# Run all tests (standard)
odin test tests
# Run with memory tracking
odin test tests -debug
Test Suites
The project includes comprehensive test coverage across multiple suites:
- Basic functionality and core parsing operations
- RFC 4180 edge cases and compliance
- Integration tests for end-to-end workflows
- Schema validation and type checking
- Transform system and pipelines
- Plugin system functionality
- Streaming API with chunk boundaries
- Large file handling
- Performance regression monitoring
- Error handling and recovery strategies
- Property-based fuzzing tests
- Parallel processing capabilities
- SIMD optimization verification
Project Structure
ocsv/
├── src/
│ ├── ocsv.odin # Main module
│ ├── parser.odin # RFC 4180 state machine parser
│ ├── parser_simd.odin # SIMD-optimized parser
│ ├── parser_error.odin # Error-aware parser
│ ├── streaming.odin # Streaming API
│ ├── parallel.odin # Parallel processing
│ ├── transform.odin # Transform system
│ ├── plugin.odin # Plugin architecture
│ ├── simd.odin # SIMD search functions
│ ├── error.odin # Error handling system
│ ├── schema.odin # Schema validation & type system
│ ├── config.odin # Configuration types
│ └── ffi_bindings.odin # Bun FFI exports
├── tests/ # Comprehensive test suite
├── plugins/ # Example plugins
├── bindings/ # Bun/TypeScript bindings
├── benchmarks/ # Performance benchmarks
├── examples/ # Usage examples
└── README.md # This file
Requirements
- Odin: Latest version (tested with Odin dev-2025-01)
- Bun: v1.0+ (for FFI integration, optional)
- Platform: macOS ARM64 (cross-platform support in development)
- Task: v3+ (optional, for automated builds)
Release Process
This project uses automated releases via semantic-release. Releases are triggered automatically when changes are pushed to the main branch.
Commit Message Format
All commits must follow Conventional Commits:
<type>(<scope>): <subject>
<body>
<footer>
Examples:
git commit -m "feat: add streaming parser API"
git commit -m "fix: handle empty fields correctly"
git commit -m "docs: update installation instructions"
git commit -m "feat!: remove deprecated parseFile method
BREAKING CHANGE: parseFile has been removed, use parseCSVFile instead"
Commit Types:
- feat: New feature (triggers minor version bump)
- fix: Bug fix (triggers patch version bump)
- perf: Performance improvement (triggers patch version bump)
- docs: Documentation changes (no release)
- chore: Maintenance tasks (no release)
- refactor: Code refactoring (no release)
- test: Test changes (no release)
- ci: CI/CD changes (no release)
Version Bumps
- Patch (1.1.0 → 1.1.1): fix:, perf:
- Minor (1.1.0 → 1.2.0): feat:
- Major (1.1.0 → 2.0.0): any commit with BREAKING CHANGE: in the footer or ! after the type
Release Workflow
- Developer pushes commits to the main branch
- CI runs tests and builds
- semantic-release analyzes commits
- If releasable changes found:
- Determines new version number
- Updates CHANGELOG.md
- Updates package.json
- Creates git tag
- Publishes to npm with provenance
- Creates GitHub release with prebuilt binaries
Manual Release (Emergency Only):
npm run release:dry # Test what would be released
git push origin main # Trigger automated release
Contributing
Contributions are welcome! Please read CONTRIBUTING.md for detailed guidelines on commit messages and pull request process.
Development Workflow:
- Fork the repository
- Create a feature branch
- Make changes with tests (odin test tests)
- Ensure zero memory leaks
- Submit a pull request
License
MIT License - see LICENSE for details.
Acknowledgments
- Odin Language: https://odin-lang.org/
- Bun Runtime: https://bun.sh/
- RFC 4180: https://www.rfc-editor.org/rfc/rfc4180
Related Projects
- d3-dsv - Pure JavaScript CSV/DSV parser
- papaparse - Popular JavaScript CSV parser
- xsv - Rust CLI tool for CSV processing
- csv-parser - Node.js streaming CSV parser
Contact
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ using Odin + Bun
Version: 1.3.0
Last Updated: 2025-11-09
