BigJsonDB
Efficiently query massive compressed JSONL files (hundreds or thousands of gigabytes) with a MongoDB-like API.
BigJsonDB uses streaming decompression and optional indexing to handle enormous .jsonl.gz files without loading them entirely into memory. Perfect for big data processing, log analysis, and large dataset queries.
Features
- 🚀 Streaming Architecture - Process files of any size without memory constraints
- 📊 Index Support - Create indexes on fields for lightning-fast lookups
- 🔍 Rich Query API - MongoDB-like queries with operators (`$eq`, `$gt`, `$in`, `$regex`, etc.)
- 📄 Pagination & Sorting - Skip, limit, and sort results efficiently
- 💾 Memory Efficient - Handles terabyte-scale files on commodity hardware
- 📦 Zero Dependencies - Uses only Node.js built-in modules
- 🎯 TypeScript Support - Full type definitions included
Installation
```bash
npm install bigjsondb
```

Quick Start
```js
import { BigJsonDB } from 'bigjsondb';
// Open a compressed JSONL file
const db = new BigJsonDB('data.jsonl.gz');
// Simple query
const users = await db.find({ age: { $gte: 18 } });
// Query with pagination
const results = await db.find(
{ status: 'active', country: 'US' },
{ skip: 100, limit: 20 }
);
// Count matching documents
const count = await db.count({ role: 'admin' });
```

API Reference
Constructor
```ts
new BigJsonDB(filePath: string, config?: BigJsonDBConfig)
```

Parameters:
- `filePath` - Path to the `.jsonl.gz` file
- `config` (optional):
  - `autoIndex?: boolean` - Auto-index on first access (default: `false`)
  - `maxCacheSize?: number` - Max memory for caching in bytes (default: 100MB)
  - `chunkSize?: number` - Chunk size for streaming in bytes (default: 64KB)
Example:
```js
const db = new BigJsonDB('logs.jsonl.gz', {
maxCacheSize: 500 * 1024 * 1024, // 500MB
chunkSize: 128 * 1024 // 128KB chunks
});
```

Query Methods
find(query, options)
Find documents matching a query.
```ts
async find(query?: Query, options?: QueryOptions): Promise<any[]>
```

Parameters:
- `query` - Query conditions (MongoDB-like syntax)
- `options`:
  - `skip?: number` - Number of documents to skip
  - `limit?: number` - Maximum number of documents to return
  - `sort?: { [field: string]: 'asc' | 'desc' }` - Sort specification
  - `projection?: { [field: string]: 0 | 1 }` - Fields to include/exclude
Examples:
```js
// Simple equality
await db.find({ name: 'John' });
// Comparison operators
await db.find({
age: { $gte: 21, $lt: 65 },
status: { $ne: 'banned' }
});
// Array operators
await db.find({
role: { $in: ['admin', 'moderator'] },
tags: { $nin: ['spam', 'deleted'] }
});
// Regular expressions
await db.find({
email: { $regex: '@gmail\\.com$' }
});
// Nested fields
await db.find({
'address.city': 'New York',
'profile.verified': true
});
// With options
await db.find(
{ category: 'electronics' },
{
skip: 20,
limit: 10,
sort: { price: 'desc' },
projection: { name: 1, price: 1 }
}
);
```

findOne(query, options)
Find a single document.
```ts
async findOne(query?: Query, options?: QueryOptions): Promise<any | null>
```

Example:
```js
const user = await db.findOne({ email: 'user@example.com' });
```

count(query)
Count documents matching a query.
```ts
async count(query?: Query): Promise<number>
```

Example:
```js
const activeUsers = await db.count({ status: 'active' });
```

distinct(field, query)
Get distinct values for a field.
```ts
async distinct(field: string, query?: Query): Promise<any[]>
```

Example:
```js
const countries = await db.distinct('country');
const activeDepartments = await db.distinct('department', { status: 'active' });
```

stream(query, callback)
Stream documents for custom processing (memory-efficient for large result sets).
```ts
async stream(query: Query, callback: (doc: any) => void | Promise<void>): Promise<void>
```

Example:
```js
let sum = 0;
await db.stream(
{ category: 'sales' },
(doc) => { sum += doc.amount; }
);
console.log('Total sales:', sum);
```

Index Methods
createIndex(field)
Create an index on a field for faster lookups. This scans the entire file once to build the index.
```ts
async createIndex(field: string): Promise<void>
```

Example:
```js
// Create indexes on frequently queried fields
await db.createIndex('userId');
await db.createIndex('timestamp');
await db.createIndex('status');
// Now queries on these fields will be much faster
const user = await db.findOne({ userId: '12345' }); // Uses index!
```

listIndexes()
List all indexed fields.
```ts
listIndexes(): string[]
```

dropIndex(field)
Remove an index.
```ts
dropIndex(field: string): void
```

getStats()
Get database statistics.
```ts
getStats(): DbStats
```

Example:
```js
const stats = db.getStats();
console.log('Total records:', stats.totalRecords);
console.log('Indexes:', stats.indexes);
```

Query Operators
BigJsonDB supports the following MongoDB-like operators:
| Operator | Description | Example |
|----------|-------------|---------|
| $eq | Equal to | { age: { $eq: 25 } } |
| $ne | Not equal to | { status: { $ne: 'deleted' } } |
| $gt | Greater than | { price: { $gt: 100 } } |
| $gte | Greater than or equal | { age: { $gte: 18 } } |
| $lt | Less than | { score: { $lt: 50 } } |
| $lte | Less than or equal | { count: { $lte: 10 } } |
| $in | Value in array | { role: { $in: ['admin', 'user'] } } |
| $nin | Value not in array | { status: { $nin: ['banned'] } } |
| $regex | Regular expression | { name: { $regex: '^John' } } |
| $exists | Field exists | { phone: { $exists: true } } |
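Operators can be combined freely within a single query object. For instance (the field names below are purely illustrative):

```js
// Users who have a phone number on file, were created this year,
// and are not on a free or trial plan
const contactable = await db.find({
  phone: { $exists: true },
  created: { $gte: '2024-01-01' },
  plan: { $nin: ['free', 'trial'] }
});
```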
Performance Tips
1. Use Indexes for Frequent Queries
If you frequently query by specific fields, create indexes:
```js
// One-time index creation
await db.createIndex('userId');
await db.createIndex('timestamp');
// Subsequent queries will be much faster
await db.find({ userId: '12345' }); // Lightning fast!
```

2. Use Projections to Reduce Data Transfer
Only retrieve the fields you need:
```js
// Instead of this:
const users = await db.find({ role: 'admin' });
// Do this:
const users = await db.find(
{ role: 'admin' },
{ projection: { name: 1, email: 1 } }
);
```

3. Use Streaming for Large Result Sets
When processing many results, use streaming to avoid memory issues:
```js
// Instead of loading everything into memory:
const allLogs = await db.find({ level: 'error' }); // Could be millions!
// Stream and process incrementally:
await db.stream({ level: 'error' }, (log) => {
processLog(log);
});
```

4. Limit Results When Possible
Always use limit if you don't need all results:
```js
// Get just the top 10
const topProducts = await db.find(
{ category: 'electronics' },
{ limit: 10, sort: { sales: 'desc' } }
);
```

5. Combine Multiple Operators
Make queries more specific to reduce scanning:
```js
// More specific = faster
await db.find({
status: 'active',
created: { $gte: '2024-01-01' },
country: { $in: ['US', 'CA', 'GB'] }
});
```

Use Cases
Log Analysis
```js
const db = new BigJsonDB('server-logs.jsonl.gz');
// Find all 500 errors in the last hour
const errors = await db.find({
status: 500,
timestamp: { $gte: Date.now() - 3600000 }
});
// List the distinct endpoints that received requests
const endpoints = await db.distinct('endpoint');
```

Data Analytics
```js
const db = new BigJsonDB('transactions.jsonl.gz');
// Calculate total revenue by category
const categories = await db.distinct('category');
for (const category of categories) {
let total = 0;
await db.stream(
{ category },
(tx) => { total += tx.amount; }
);
console.log(`${category}: $${total}`);
}
```

User Data Processing
```js
const db = new BigJsonDB('users.jsonl.gz');
// Index for fast lookups
await db.createIndex('email');
await db.createIndex('userId');
// Find user by email
const user = await db.findOne({ email: 'user@example.com' });
// Get active users in a region
const activeUsers = await db.find({
status: 'active',
'location.country': 'US',
lastLogin: { $gte: '2024-01-01' }
});
```

File Format
BigJsonDB works with gzip-compressed JSONL (JSON Lines) files. Each line should be a valid JSON object:

```
{"id": 1, "name": "Alice", "age": 30, "city": "NYC"}
{"id": 2, "name": "Bob", "age": 25, "city": "LA"}
{"id": 3, "name": "Charlie", "age": 35, "city": "Chicago"}
```

Compress with gzip:
```bash
gzip data.jsonl
```

This creates data.jsonl.gz ready for BigJsonDB.
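Alternatively, you can produce a compatible file straight from Node.js. Below is a minimal sketch using only built-in modules; the `records` array stands in for your own data:

```js
import { createWriteStream } from 'fs';
import { createGzip } from 'zlib';

const records = [
  { id: 1, name: 'Alice', age: 30, city: 'NYC' },
  { id: 2, name: 'Bob', age: 25, city: 'LA' }
];

// Write one JSON object per line and pipe the stream through gzip
const gzip = createGzip();
gzip.pipe(createWriteStream('data.jsonl.gz'));
for (const record of records) {
  gzip.write(JSON.stringify(record) + '\n');
}
gzip.end();
```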
Limitations
- Write Operations - Currently read-only. To apply writes, decompress, modify, and re-compress (see the sketch after this list).
- In-Memory Sorting - Sorting loads results into memory. Use indexes and limits for large sorts.
- Index Storage - Indexes are stored in memory. For very large datasets, index memory usage should be monitored.
- Compressed Seeking - Random access in compressed files requires full decompression up to the target point.
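For the write limitation, one workable pattern is to stream the file through gunzip, rewrite lines as needed, and gzip the result into a new file. Here is a minimal sketch using only Node.js built-ins; the filenames and the "drop deleted records" edit are purely illustrative:

```js
import { createReadStream, createWriteStream } from 'fs';
import { createGunzip, createGzip } from 'zlib';
import { createInterface } from 'readline';

// Read the compressed file line by line
const lines = createInterface({
  input: createReadStream('data.jsonl.gz').pipe(createGunzip()),
  crlfDelay: Infinity
});

// Re-compress the modified records into a new file
const out = createGzip();
out.pipe(createWriteStream('data.updated.jsonl.gz'));

for await (const line of lines) {
  if (!line.trim()) continue;
  const doc = JSON.parse(line);
  if (doc.status === 'deleted') continue; // illustrative edit: drop deleted records
  out.write(JSON.stringify(doc) + '\n');
}
out.end();
```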
TypeScript Types
Full TypeScript definitions are included:
```ts
interface Query {
[field: string]: any | {
$eq?: any;
$ne?: any;
$gt?: any;
$gte?: any;
$lt?: any;
$lte?: any;
$in?: any[];
$nin?: any[];
$regex?: string | RegExp;
$exists?: boolean;
};
}
interface QueryOptions {
skip?: number;
limit?: number;
sort?: { [field: string]: 'asc' | 'desc' };
projection?: { [field: string]: 0 | 1 };
}
```

Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Author
ale
Made with ❤️ for big data processing
