OCSV - Odin CSV Parser
A high-performance, RFC 4180 compliant CSV parser written in Odin with Bun FFI support.
Features
- ⚡ High Performance - Fast CSV parsing with SIMD optimizations
- 🦺 Memory Safe - Zero memory leaks, comprehensive testing
- ✅ RFC 4180 Compliant - Full CSV specification support
- 🌍 UTF-8 Support - Correct handling of international characters
- 🔧 Flexible Configuration - Custom delimiters, quotes, comments
- 📦 Bun Native - Direct FFI integration with Bun runtime
- 🛡️ Error Handling - Detailed error messages with line/column info
- 🎯 Schema Validation - Type checking, constraints, type conversion
- 🌊 Streaming API - Memory-efficient chunk-based processing
- 🔄 Transform System - Built-in transforms and pipelines
- 🔌 Plugin System - Extensible architecture for custom functionality
Why Odin + Bun?
Key Advantages:
- ✅ Simple build system (no node-gyp, no Python)
- ✅ Better memory safety (explicit memory management + defer)
- ✅ Better error handling (enums + multiple returns)
- ✅ No C++ wrapper needed (Bun FFI is direct; see the sketch below)
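To illustrate how direct the FFI path is, Bun can load the compiled shared library with dlopen from bun:ffi and call exported procedures with no glue layer. This sketch is illustrative only: the symbol names and signatures are assumptions, not OCSV's actual exports (see src/ffi_bindings.odin for the real surface).
import { dlopen, FFIType, suffix } from "bun:ffi";
// `suffix` resolves to "dylib" on macOS, "so" on Linux, "dll" on Windows.
// The exported symbols below are hypothetical placeholders, not the real
// ocsv FFI exports.
const lib = dlopen(`./libocsv.${suffix}`, {
  parser_create: { args: [], returns: FFIType.ptr },
  parser_destroy: { args: [FFIType.ptr], returns: FFIType.void },
});
const parser = lib.symbols.parser_create();
// ... call other exports here ...
lib.symbols.parser_destroy(parser);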
Quick Start
npm Installation (Recommended)
Install OCSV as an npm package for easy integration with your Bun projects:
# Using Bun
bun add ocsv
# Using npm
npm install ocsv
Then use it in your project:
import { parseCSV } from 'ocsv';
// Parse CSV string
const result = parseCSV('name,age\nJohn,30\nJane,25', { hasHeader: true });
console.log(result.headers); // ['name', 'age']
console.log(result.rows); // [['John', '30'], ['Jane', '25']]
// Parse CSV file
import { parseCSVFile } from 'ocsv';
const data = await parseCSVFile('./data.csv', { hasHeader: true });
console.log(`Parsed ${data.rowCount} rows`);
Manual Installation (Development)
For building from source or contributing:
git clone https://github.com/dvrd/ocsv.git
cd ocsv
Build
Current Support: macOS ARM64 (cross-platform support in progress)
# Using Task (recommended)
task build # Build release library
task build-dev # Build debug library
task test # Run all tests
task info # Show platform info
# Manual build
odin build src -build-mode:shared -out:libocsv.dylib -o:speed
Basic Usage (Odin)
package main

import "core:fmt"
import ocsv "src"

main :: proc() {
    // Create parser
    parser := ocsv.parser_create()
    defer ocsv.parser_destroy(parser)

    // Parse CSV data
    csv_data := "name,age,city\nAlice,30,NYC\nBob,25,SF\n"
    ok := ocsv.parse_csv(parser, csv_data)
    if ok {
        // Access parsed data
        fmt.printfln("Parsed %d rows", len(parser.all_rows))
        for row in parser.all_rows {
            for field in row {
                fmt.printf("%s ", field)
            }
            fmt.printf("\n")
        }
    }
}
Bun API Examples
Basic Parsing
import { parseCSV } from 'ocsv';
// Parse CSV with headers
const result = parseCSV('name,age,city\nAlice,30,NYC\nBob,25,SF', {
hasHeader: true
});
console.log(result.headers); // ['name', 'age', 'city']
console.log(result.rows); // [['Alice', '30', 'NYC'], ['Bob', '25', 'SF']]
console.log(result.rowCount); // 2
Parse from File
import { parseCSVFile } from 'ocsv';
// Parse CSV file with headers
const data = await parseCSVFile('./sales.csv', {
hasHeader: true,
delimiter: ',',
});
console.log(`Parsed ${data.rowCount} rows`);
console.log(`Columns: ${data.headers.join(', ')}`);
// Process rows
for (const row of data.rows) {
console.log(row);
}
Custom Configuration
import { parseCSV } from 'ocsv';
// Parse TSV (tab-separated)
const tsvData = parseCSV('col1\tcol2\nrow1\tdata', {
delimiter: '\t',
hasHeader: true,
});
// Parse with semicolon delimiter (European CSV)
const europeanData = parseCSV('name;age;city\nJohn;30;Paris', {
delimiter: ';',
hasHeader: true,
});
// Relaxed mode (allows some RFC violations)
const relaxedData = parseCSV('messy,csv,"data', {
relaxed: true,
});
Manual Parser Management
For more control, use the Parser class directly:
import { Parser } from 'ocsv';
const parser = new Parser();
try {
const result = parser.parse('a,b,c\n1,2,3');
console.log(result.rows);
} finally {
parser.destroy(); // Important: free memory
}
Performance Modes
OCSV offers two access modes to optimize for different use cases:
Mode Comparison
| Feature | Eager Mode (default) | Lazy Mode |
|---------|---------------------|-----------|
| Performance | ~8 MB/s throughput | ≥180 MB/s (22x faster) |
| Memory Usage | High (all data in JS) | Low (<200 MB for 10M rows) |
| Parse Time (10M rows) | ~150s | <7s (21x faster) |
| Access Pattern | Random access, arrays | Random access, on-demand |
| Memory Management | Automatic (GC) | Manual (destroy() required) |
| Best For | Small files, full iteration | Large files, selective access |
| TypeScript Support | Full | Full (discriminated unions) |
Eager Mode (Default)
Best for: Small to medium files (<100k rows), full dataset iteration, simple workflows
All rows are materialized into JavaScript arrays immediately. Easy to use, no cleanup required.
import { parseCSV } from 'ocsv';
// Default: eager mode
const result = parseCSV(data, { hasHeader: true });
console.log(result.headers); // ['name', 'age', 'city']
console.log(result.rows); // [['Alice', '30', 'NYC'], ...]
console.log(result.rowCount); // 2
// Arrays: standard JavaScript operations
result.rows.forEach(row => console.log(row));
result.rows.map(row => row[0]);
result.rows.filter(row => row[1] > '25');
Pros:
- ✅ Simple API - standard JavaScript arrays
- ✅ No manual cleanup required
- ✅ Familiar array methods (map, filter, slice)
- ✅ Safe for GC-managed memory
Cons:
- ❌ Slower for large files (7.5x overhead)
- ❌ High memory usage (all rows in JS heap)
- ❌ Parse time proportional to data crossing FFI boundary
Lazy Mode (High Performance)
Best for: Large files (>1M rows), selective access, memory-constrained environments
Rows stay in native Odin memory and are accessed on-demand. Achieves near-FFI performance with minimal memory footprint.
import { parseCSV } from 'ocsv';
// Lazy mode: high performance
const result = parseCSV(data, {
mode: 'lazy',
hasHeader: true
});
try {
console.log(result.headers); // ['name', 'age', 'city']
console.log(result.rowCount); // 10000000
// On-demand row access
const row = result.getRow(5000000);
console.log(row.get(0)); // 'Alice'
console.log(row.get(1)); // '30'
// Iterate fields
for (const field of row) {
console.log(field);
}
// Materialize row to array (when needed)
const arr = row.toArray(); // ['Alice', '30', 'NYC']
// Efficient slicing (generator)
for (const row of result.slice(1000, 2000)) {
console.log(row.get(0));
}
// Full iteration (if needed)
for (const row of result) {
console.log(row.get(0));
}
} finally {
// CRITICAL: Must cleanup native memory
result.destroy();
}
Pros:
- ✅ 22x faster parse time than eager mode
- ✅ Low memory footprint (<200 MB for 10M rows)
- ✅ LRU cache (1000 hot rows) for repeated access
- ✅ Generator-based slicing (memory efficient)
- ✅ Random access to any row (O(1) after cache)
Cons:
- ❌ Manual cleanup required (destroy() must be called; a safe wrapper pattern is sketched below)
- ❌ Not standard arrays (use .get(i) or .toArray())
- ❌ Use-after-destroy throws errors
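One way to make that first con harder to get wrong is to wrap lazy parsing in a helper that guarantees cleanup. A minimal sketch, assuming the lazy API described above; withLazyCSV and the LazyLike interface are ours, not part of the package:
import { parseCSV } from 'ocsv';
// Hypothetical convenience wrapper (not part of ocsv): guarantees destroy()
// even if the callback throws. The structural type mirrors the lazy API
// described in this README.
interface LazyLike {
  getRow(index: number): { get(i: number): string; toArray(): string[] };
  rowCount: number;
  destroy(): void;
}
function withLazyCSV<T>(data: string, fn: (result: LazyLike) => T): T {
  const result = parseCSV(data, { mode: 'lazy' }) as unknown as LazyLike;
  try {
    return fn(result);
  } finally {
    result.destroy(); // always runs, success or failure
  }
}
// Usage: the lazy result never escapes the callback, so it cannot leak.
const first = withLazyCSV('a,b\n1,2', (r) => r.getRow(0).toArray());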
When to Use Each Mode
Start
 └─ Is file size > 100 MB or > 1M rows?
     ├─ No  → Use Eager Mode (simple, safe)
     └─ Yes → Do you need to access all rows?
         ├─ No  → Use Lazy Mode (fast, low memory)
         └─ Yes → Memory constrained?
             ├─ Yes → Use Lazy Mode (streaming)
             └─ No  → Try Eager Mode first (measure, switch if slow)
Use Lazy Mode when:
- File size > 100 MB or > 1M rows
- You need selective row access (not full iteration)
- Memory is constrained (< 1 GB available)
- You're building streaming/ETL pipelines
- You need maximum parsing performance
Use Eager Mode when:
- File size < 100 MB or < 1M rows
- You need full dataset iteration
- You prefer simpler API (standard arrays)
- Memory cleanup must be automatic (GC)
- You're prototyping or writing quick scripts
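The decision tree above reduces to a small heuristic. A sketch using the thresholds from this section; pickMode is our name for illustration, not an ocsv export:
// Hypothetical helper encoding the rules of thumb above (thresholds taken
// from this README; the function itself is not part of ocsv).
function pickMode(opts: {
  fileSizeBytes: number;
  estimatedRows: number;
  needAllRows: boolean;
  memoryConstrained: boolean;
}): 'eager' | 'lazy' {
  const big = opts.fileSizeBytes > 100 * 1024 * 1024 || opts.estimatedRows > 1_000_000;
  if (!big) return 'eager';             // small file: simple and safe
  if (!opts.needAllRows) return 'lazy'; // selective access: keep rows native
  return opts.memoryConstrained ? 'lazy' : 'eager'; // else measure eager first
}
// e.g. pickMode({ fileSizeBytes: 1.2e9, estimatedRows: 10_000_000,
//                 needAllRows: false, memoryConstrained: true }) === 'lazy'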
Performance Benchmarks
Test Setup: 10M rows, 4 columns, 1.2 GB CSV file
| Mode | Parse Time | Throughput | Memory Usage |
|------|-----------|------------|--------------|
| FFI Direct | 6.2s | 193 MB/s | 50 MB (baseline) |
| Lazy Mode | 6.8s | 176 MB/s | <200 MB |
| Eager Mode | 151.7s | 7.9 MB/s | ~8 GB |
Key Metrics:
- Lazy mode is 22x faster than eager mode
- Lazy mode uses 40x less memory than eager mode
- Lazy mode is only 9% slower than raw FFI (acceptable overhead)
Advanced: High-Performance FFI Mode
For advanced users who need maximum FFI throughput, OCSV offers an optimized packed buffer mode that achieves 61.25 MB/s (56% of native Odin performance).
Performance Comparison (100K rows, 13.80 MB file):
| Mode | Throughput | ns/row | vs Native |
|------|-----------|--------|-----------|
| Native Odin | 109.28 MB/s | 915 | 100% |
| Packed Buffer | 61.25 MB/s | 2,253 | 56% |
| Bulk JSON | 40.68 MB/s | 2,878 | 37% |
| Field-by-Field | 29.58 MB/s | 3,957 | 27% |
Optimizations:
- ⚡ 61.25 MB/s average throughput
- 🚀 Batched TextDecoder with reduced decoder overhead
- 💾 Pre-allocated arrays to reduce GC pressure
- 📊 SIMD-friendly memory access patterns
- 🔄 Adaptive processing for different row sizes
- 📦 Binary packed format with length-prefixed strings
- ✨ Single FFI call instead of multiple round-trips
Usage:
import { parseCSVPacked } from 'ocsv/bindings/simple';
// Optimized packed buffer mode (highest FFI performance)
const rows = parseCSVPacked(csvData);
// Returns string[][] with minimal FFI overhead
When to use Packed Buffer:
- Need maximum FFI throughput (>40 MB/s)
- Willing to trade API simplicity for performance
- Working with medium-large files through Bun FFI
- Want to minimize cross-language boundary overhead
Note: The 44% overhead compared to native Odin is inherent to the FFI serialization boundary. This is the practical limit for JavaScript-based FFI approaches.
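Figures like these can be reproduced with a simple timing harness. A minimal sketch, assuming a local benchmark.csv; throughput is input bytes divided by wall-clock seconds, matching the MB/s figures quoted above:
import { parseCSVPacked } from 'ocsv/bindings/simple';
// Minimal timing harness (benchmark.csv is a placeholder path).
const csvData = await Bun.file('./benchmark.csv').text();
const bytes = Buffer.byteLength(csvData);
const start = Bun.nanoseconds();
const rows = parseCSVPacked(csvData);
const seconds = (Bun.nanoseconds() - start) / 1e9;
console.log(`${rows.length} rows in ${seconds.toFixed(2)}s`);
console.log(`${(bytes / 1e6 / seconds).toFixed(2)} MB/s`);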
Memory Management
Eager Mode
// Automatic cleanup via garbage collector
const result = parseCSV(data);
// ... use result.rows ...
// Memory freed automatically when result goes out of scope
Lazy Mode
// Manual cleanup required
const result = parseCSV(data, { mode: 'lazy' });
try {
// ... use result ...
} finally {
// CRITICAL: Always call destroy()
result.destroy();
}
Common Pitfalls:
❌ Forgetting to destroy:
const result = parseCSV(data, { mode: 'lazy' });
console.log(result.getRow(0));
// Memory leak! Parser not cleaned up
❌ Use after destroy:
const result = parseCSV(data, { mode: 'lazy' });
result.destroy();
result.getRow(0); // Error: LazyResult has been destroyed
✅ Correct pattern:
const result = parseCSV(data, { mode: 'lazy' });
try {
const row = result.getRow(0);
console.log(row.toArray());
} finally {
result.destroy();
}
TypeScript Support
OCSV provides discriminated union types for type-safe mode selection:
import { parseCSV } from 'ocsv';
// Type: ParseResult (array-based)
const eager = parseCSV(data);
console.log(eager.rows[0]); // Type: string[]
// Type: LazyResult (on-demand)
const lazy = parseCSV(data, { mode: 'lazy' });
console.log(lazy.getRow(0)); // Type: LazyRow
// Compiler error: mode mismatch
const wrong = parseCSV(data, { mode: 'lazy' });
console.log(wrong.rows); // Error: Property 'rows' does not exist
Configuration
// Create parser with custom configuration
parser := ocsv.parser_create()
defer ocsv.parser_destroy(parser)
// TSV (Tab-Separated Values)
parser.config.delimiter = '\t'
// European CSV (semicolon)
parser.config.delimiter = ';'
// Comments (skip lines starting with #)
parser.config.comment = '#'
// Relaxed mode (handle malformed CSV)
parser.config.relaxed = true
// Custom quote character
parser.config.quote = '\''
RFC 4180 Compliance
OCSV fully implements RFC 4180 with support for:
- ✅ Quoted fields with embedded delimiters ("field, with, commas")
- ✅ Nested quotes ("field with ""quotes""" → field with "quotes")
- ✅ Multiline fields (newlines inside quotes)
- ✅ CRLF and LF line endings (Windows/Unix)
- ✅ Empty fields (consecutive delimiters: a,,c)
- ✅ Trailing delimiters (a,b, → 3 fields, last is empty)
- ✅ Leading delimiters (,a,b → 3 fields, first is empty)
- ✅ Comments (extension: lines starting with #)
- ✅ Unicode/UTF-8 (CJK characters, emojis, etc.)
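For instance, embedded delimiters, escaped quotes, and multiline fields should each come back as a single logical row. A small check using the Bun API, with expected results shown as comments:
import { parseCSV } from 'ocsv';
// Embedded comma and doubled quotes inside one quoted field
const quoted = parseCSV('"a, b","say ""hi"""', { hasHeader: false });
console.log(quoted.rows[0]); // expect: ['a, b', 'say "hi"']
// A newline inside quotes stays within a single field of a single row
const multiline = parseCSV('"line1\nline2",x', { hasHeader: false });
console.log(multiline.rowCount); // expect: 1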
Example:
# Sales data for Q1 2024
product,price,description,quantity
"Widget A",19.99,"A great widget, now with more features!",100
"Gadget B",29.99,"Essential gadget
Multi-line description",50Testing
~201 tests, 100% pass rate, 0 memory leaks
# Run all tests (standard)
odin test tests
# Run with memory tracking
odin test tests -debug
Test Suites
The project includes comprehensive test coverage across multiple suites:
- Basic functionality and core parsing operations
- RFC 4180 edge cases and compliance
- Integration tests for end-to-end workflows
- Schema validation and type checking
- Transform system and pipelines
- Plugin system functionality
- Streaming API with chunk boundaries
- Large file handling
- Performance regression monitoring
- Error handling and recovery strategies
- Property-based fuzzing tests
- Parallel processing capabilities
- SIMD optimization verification
Project Structure
ocsv/
├── src/
│ ├── ocsv.odin # Main module
│ ├── parser.odin # RFC 4180 state machine parser
│ ├── parser_simd.odin # SIMD-optimized parser
│ ├── parser_error.odin # Error-aware parser
│ ├── streaming.odin # Streaming API
│ ├── parallel.odin # Parallel processing
│ ├── transform.odin # Transform system
│ ├── plugin.odin # Plugin architecture
│ ├── simd.odin # SIMD search functions
│ ├── error.odin # Error handling system
│ ├── schema.odin # Schema validation & type system
│ ├── config.odin # Configuration types
│ └── ffi_bindings.odin # Bun FFI exports
├── tests/ # Comprehensive test suite
├── plugins/ # Example plugins
├── bindings/ # Bun/TypeScript bindings
├── benchmarks/ # Performance benchmarks
├── examples/ # Usage examples
└── README.md # This file
Requirements
- Odin: Latest version (tested with Odin dev-2025-01)
- Bun: v1.0+ (for FFI integration, optional)
- Platform: macOS ARM64 (cross-platform support in development)
- Task: v3+ (optional, for automated builds)
Release Process
This project uses automated releases via semantic-release. Releases are triggered automatically when changes are pushed to the main branch.
Commit Message Format
All commits must follow Conventional Commits:
<type>(<scope>): <subject>
<body>
<footer>
Examples:
git commit -m "feat: add streaming parser API"
git commit -m "fix: handle empty fields correctly"
git commit -m "docs: update installation instructions"
git commit -m "feat!: remove deprecated parseFile method
BREAKING CHANGE: parseFile has been removed, use parseCSVFile instead"
Commit Types:
- feat: New feature (triggers minor version bump)
- fix: Bug fix (triggers patch version bump)
- perf: Performance improvement (triggers patch version bump)
- docs: Documentation changes (no release)
- chore: Maintenance tasks (no release)
- refactor: Code refactoring (no release)
- test: Test changes (no release)
- ci: CI/CD changes (no release)
Version Bumps
- Patch (1.1.0 → 1.1.1): fix:, perf:
- Minor (1.1.0 → 1.2.0): feat:
- Major (1.1.0 → 2.0.0): any commit with BREAKING CHANGE: in the footer or ! after the type
Release Workflow
- Developer pushes commits to the main branch
- CI runs tests and builds
- semantic-release analyzes commits
- If releasable changes found:
- Determines new version number
- Updates CHANGELOG.md
- Updates package.json
- Creates git tag
- Publishes to npm with provenance
- Creates GitHub release with prebuilt binaries
Manual Release (Emergency Only):
npm run release:dry # Test what would be released
git push origin main # Trigger automated release
Contributing
Contributions are welcome! Please read CONTRIBUTING.md for detailed guidelines on commit messages and pull request process.
Development Workflow:
- Fork the repository
- Create a feature branch
- Make changes with tests (odin test tests)
- Ensure zero memory leaks
- Submit a pull request
License
MIT License - see LICENSE for details.
Acknowledgments
- Odin Language: https://odin-lang.org/
- Bun Runtime: https://bun.sh/
- RFC 4180: https://www.rfc-editor.org/rfc/rfc4180
Related Projects
- d3-dsv - Pure JavaScript CSV/DSV parser
- papaparse - Popular JavaScript CSV parser
- xsv - Rust CLI tool for CSV processing
- csv-parser - Node.js streaming CSV parser
Contact
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ using Odin + Bun
Version: 1.3.0
Last Updated: 2025-11-09
