BigJsonDB
Efficiently query massive compressed JSONL files (hundreds or thousands of gigabytes) with a MongoDB-like API.
BigJsonDB uses streaming decompression and optional indexing to handle enormous .jsonl.gz files without loading them entirely into memory. Perfect for big data processing, log analysis, and large dataset queries.
Features
- 🚀 Streaming Architecture - Process files of any size without memory constraints
- 📊 Index Support - Create indexes on fields for lightning-fast lookups
- 🔍 Rich Query API - MongoDB-like queries with operators (`$eq`, `$gt`, `$in`, `$regex`, etc.)
- 📄 Pagination & Sorting - Skip, limit, and sort results efficiently
- 💾 Memory Efficient - Handles terabyte-scale files on commodity hardware
- 📦 Zero Dependencies - Uses only Node.js built-in modules
- 🎯 TypeScript Support - Full type definitions included
Installation
```bash
npm install bigjsondb
```

Quick Start
```js
import { BigJsonDB } from 'bigjsondb';
// Open a compressed JSONL file
const db = new BigJsonDB('data.jsonl.gz');
// Simple query
const users = await db.find({ age: { $gte: 18 } });
// Query with pagination
const results = await db.find(
{ status: 'active', country: 'US' },
{ skip: 100, limit: 20 }
);
// Count matching documents
const count = await db.count({ role: 'admin' });
```

API Reference
Constructor
```ts
new BigJsonDB(filePath: string, config?: BigJsonDBConfig)
```

Parameters:
- `filePath` - Path to the `.jsonl.gz` file
- `config` (optional):
  - `autoIndex?: boolean` - Auto-index on first access (default: `false`)
  - `maxCacheSize?: number` - Max memory for caching in bytes (default: 100MB)
  - `chunkSize?: number` - Chunk size for streaming in bytes (default: 64KB)
Example:
```js
const db = new BigJsonDB('logs.jsonl.gz', {
maxCacheSize: 500 * 1024 * 1024, // 500MB
chunkSize: 128 * 1024 // 128KB chunks
});
```

Query Methods
find(query, options)
Find documents matching a query.
```ts
async find(query?: Query, options?: QueryOptions): Promise<any[]>
```

Parameters:
- `query` - Query conditions (MongoDB-like syntax)
- `options`:
  - `skip?: number` - Number of documents to skip
  - `limit?: number` - Maximum number of documents to return
  - `sort?: { [field: string]: 'asc' | 'desc' }` - Sort specification
  - `projection?: { [field: string]: 0 | 1 }` - Fields to include/exclude
Examples:
```js
// Simple equality
await db.find({ name: 'John' });
// Comparison operators
await db.find({
age: { $gte: 21, $lt: 65 },
status: { $ne: 'banned' }
});
// Array operators
await db.find({
role: { $in: ['admin', 'moderator'] },
tags: { $nin: ['spam', 'deleted'] }
});
// Regular expressions
await db.find({
email: { $regex: '@gmail\\.com$' }
});
// Nested fields
await db.find({
'address.city': 'New York',
'profile.verified': true
});
// With options
await db.find(
{ category: 'electronics' },
{
skip: 20,
limit: 10,
sort: { price: 'desc' },
projection: { name: 1, price: 1 }
}
);
```

findOne(query, options)
Find a single document.
```ts
async findOne(query?: Query, options?: QueryOptions): Promise<any | null>
```

Example:
```js
const user = await db.findOne({ email: 'user@example.com' });
```

count(query)
Count documents matching a query.
```ts
async count(query?: Query): Promise<number>
```

Example:
```js
const activeUsers = await db.count({ status: 'active' });
```

distinct(field, query)
Get distinct values for a field.
```ts
async distinct(field: string, query?: Query): Promise<any[]>
```

Example:
```js
const countries = await db.distinct('country');
const activeDepartments = await db.distinct('department', { status: 'active' });
```

stream(query, callback)
Stream documents for custom processing (memory-efficient for large result sets).
```ts
async stream(query: Query, callback: (doc: any) => void | Promise<void>): Promise<void>
```

Example:
```js
let sum = 0;
await db.stream(
{ category: 'sales' },
(doc) => { sum += doc.amount; }
);
console.log('Total sales:', sum);
```

Index Methods
createIndex(field)
Create an index on a field for faster lookups. This scans the entire file once to build the index.
```ts
async createIndex(field: string): Promise<void>
```

Example:
```js
// Create indexes on frequently queried fields
await db.createIndex('userId');
await db.createIndex('timestamp');
await db.createIndex('status');
// Now queries on these fields will be much faster
const user = await db.findOne({ userId: '12345' }); // Uses index!
```

listIndexes()
List all indexed fields.
```ts
listIndexes(): string[]
```

dropIndex(field)
Remove an index.
```ts
dropIndex(field: string): void
```

getStats()
Get database statistics.
```ts
getStats(): DbStats
```

Example:
```js
const stats = db.getStats();
console.log('Total records:', stats.totalRecords);
console.log('Indexes:', stats.indexes);
```

Query Operators
BigJsonDB supports the following MongoDB-like operators:
| Operator | Description | Example |
|----------|-------------|---------|
| $eq | Equal to | { age: { $eq: 25 } } |
| $ne | Not equal to | { status: { $ne: 'deleted' } } |
| $gt | Greater than | { price: { $gt: 100 } } |
| $gte | Greater than or equal | { age: { $gte: 18 } } |
| $lt | Less than | { score: { $lt: 50 } } |
| $lte | Less than or equal | { count: { $lte: 10 } } |
| $in | Value in array | { role: { $in: ['admin', 'user'] } } |
| $nin | Value not in array | { status: { $nin: ['banned'] } } |
| $regex | Regular expression | { name: { $regex: '^John' } } |
| $exists | Field exists | { phone: { $exists: true } } |
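Operators can be combined freely within a single query object. For instance (the field names below are purely illustrative):

```js
// Users who have a phone number on file, were created this year,
// and are not on a free or trial plan
const contactable = await db.find({
  phone: { $exists: true },
  created: { $gte: '2024-01-01' },
  plan: { $nin: ['free', 'trial'] }
});
```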
Performance Tips
1. Use Indexes for Frequent Queries
If you frequently query by specific fields, create indexes:
```js
// One-time index creation
await db.createIndex('userId');
await db.createIndex('timestamp');
// Subsequent queries will be much faster
await db.find({ userId: '12345' }); // Lightning fast!
```

2. Use Projections to Reduce Data Transfer
Only retrieve the fields you need:
```js
// Instead of this:
const users = await db.find({ role: 'admin' });
// Do this:
const users = await db.find(
{ role: 'admin' },
{ projection: { name: 1, email: 1 } }
);
```

3. Use Streaming for Large Result Sets
When processing many results, use streaming to avoid memory issues:
```js
// Instead of loading everything into memory:
const allLogs = await db.find({ level: 'error' }); // Could be millions!
// Stream and process incrementally:
await db.stream({ level: 'error' }, (log) => {
processLog(log);
});
```

4. Limit Results When Possible
Always use limit if you don't need all results:
```js
// Get just the top 10
const topProducts = await db.find(
{ category: 'electronics' },
{ limit: 10, sort: { sales: 'desc' } }
);
```

5. Combine Multiple Operators
Make queries more specific to reduce scanning:
```js
// More specific = faster
await db.find({
status: 'active',
created: { $gte: '2024-01-01' },
country: { $in: ['US', 'CA', 'GB'] }
});
```

Use Cases
Log Analysis
```js
const db = new BigJsonDB('server-logs.jsonl.gz');
// Find all 500 errors in the last hour
const errors = await db.find({
status: 500,
timestamp: { $gte: Date.now() - 3600000 }
});
// List the distinct endpoints that received requests
const endpoints = await db.distinct('endpoint');
```

Data Analytics
```js
const db = new BigJsonDB('transactions.jsonl.gz');
// Calculate total revenue by category
const categories = await db.distinct('category');
for (const category of categories) {
let total = 0;
await db.stream(
{ category },
(tx) => { total += tx.amount; }
);
console.log(`${category}: $${total}`);
}
```

User Data Processing
```js
const db = new BigJsonDB('users.jsonl.gz');
// Index for fast lookups
await db.createIndex('email');
await db.createIndex('userId');
// Find user by email
const user = await db.findOne({ email: 'user@example.com' });
// Get active users in a region
const activeUsers = await db.find({
status: 'active',
'location.country': 'US',
lastLogin: { $gte: '2024-01-01' }
});
```

File Format
BigJsonDB works with gzip-compressed JSONL (JSON Lines) files. Each line should be a valid JSON object:

```
{"id": 1, "name": "Alice", "age": 30, "city": "NYC"}
{"id": 2, "name": "Bob", "age": 25, "city": "LA"}
{"id": 3, "name": "Charlie", "age": 35, "city": "Chicago"}
```

Compress with gzip:
```bash
gzip data.jsonl
```

This creates data.jsonl.gz ready for BigJsonDB.
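Alternatively, you can produce a compatible file straight from Node.js. Below is a minimal sketch using only built-in modules; the `records` array stands in for your own data:

```js
import { createWriteStream } from 'fs';
import { createGzip } from 'zlib';

const records = [
  { id: 1, name: 'Alice', age: 30, city: 'NYC' },
  { id: 2, name: 'Bob', age: 25, city: 'LA' }
];

// Write one JSON object per line and pipe the stream through gzip
const gzip = createGzip();
gzip.pipe(createWriteStream('data.jsonl.gz'));
for (const record of records) {
  gzip.write(JSON.stringify(record) + '\n');
}
gzip.end();
```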
Limitations
- Write Operations - Currently read-only. To apply writes, decompress, modify, and re-compress (see the sketch after this list).
- In-Memory Sorting - Sorting loads results into memory. Use indexes and limits for large sorts.
- Index Storage - Indexes are stored in memory. For very large datasets, index memory usage should be monitored.
- Compressed Seeking - Random access in compressed files requires full decompression up to the target point.
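For the write limitation, one workable pattern is to stream the file through gunzip, rewrite lines as needed, and gzip the result into a new file. Here is a minimal sketch using only Node.js built-ins; the filenames and the "drop deleted records" edit are purely illustrative:

```js
import { createReadStream, createWriteStream } from 'fs';
import { createGunzip, createGzip } from 'zlib';
import { createInterface } from 'readline';

// Read the compressed file line by line
const lines = createInterface({
  input: createReadStream('data.jsonl.gz').pipe(createGunzip()),
  crlfDelay: Infinity
});

// Re-compress the modified records into a new file
const out = createGzip();
out.pipe(createWriteStream('data.updated.jsonl.gz'));

for await (const line of lines) {
  if (!line.trim()) continue;
  const doc = JSON.parse(line);
  if (doc.status === 'deleted') continue; // illustrative edit: drop deleted records
  out.write(JSON.stringify(doc) + '\n');
}
out.end();
```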
TypeScript Types
Full TypeScript definitions are included:
```ts
interface Query {
[field: string]: any | {
$eq?: any;
$ne?: any;
$gt?: any;
$gte?: any;
$lt?: any;
$lte?: any;
$in?: any[];
$nin?: any[];
$regex?: string | RegExp;
$exists?: boolean;
};
}
interface QueryOptions {
skip?: number;
limit?: number;
sort?: { [field: string]: 'asc' | 'desc' };
projection?: { [field: string]: 0 | 1 };
}
```

Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Author
ale
Made with ❤️ for big data processing
