huge-csv-sorter

v1.0.2

sorts huge CSV files efficiently

Downloads

298

Summary

This library can sort huge CSV files efficiently.

Once your CSV files are properly sorted on a primary key, they can also be efficiently compared to produce a diff file, using my other lib https://github.com/livetocode/tabular-data-differ

Keywords

  • csv
  • huge
  • large
  • big
  • sort
  • order
  • fast
  • sqlite

Why another lib?

Most CSV sorting libraries read the whole file into memory to sort and filter it, which is not possible when the file is huge!

This library acts as a thin wrapper around the SQLite library and delegates all the work to the DB which is made for this exact scenario.

Features

  • consumes very little memory
  • can sort huge files that wouldn't fit in memory
  • very fast since it relies on SQLite which is a highly optimized C library

Prerequisites

The "sqlite3" command must be installed on your system.

For a Mac: brew install sqlite

Don't forget to install the proper package if you're running your app in a container. For example, using the Node Alpine distro: RUN apk add sqlite
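
Since a missing CLI only shows up when the sort runs, it can help to check for it up front. The following is a minimal sketch, assuming a Node.js runtime; the helper name isCliAvailable is hypothetical and is not part of this library.

```typescript
import { spawnSync } from 'node:child_process';

// Hypothetical helper: returns true when the command can be started
// and exits with code 0.
function isCliAvailable(command: string, args: string[] = ['--version']): boolean {
    const result = spawnSync(command, args, { stdio: 'ignore' });
    // result.error is set when the process could not be spawned (e.g. ENOENT).
    return result.error === undefined && result.status === 0;
}

if (!isCliAvailable('sqlite3')) {
    console.warn('sqlite3 was not found on the PATH; install it before calling sort().');
}
```

If the CLI lives somewhere else, the cli option described in SQLiteOptions lets you point the library at it.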

Note that we couldn't use the sqlite npm package since it wouldn't let us execute meta commands such as ".import" which we rely on for importing the CSV. (see https://sqlite.org/cli.html#csv_import)
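
To make the dependency on ".import" more concrete, here is a rough sketch of the kind of script that could be fed to the sqlite3 CLI. The function name buildSortScript, the table name data and the exact sequence of commands are assumptions made for illustration, not the library's actual implementation. Only the dot-commands themselves (.mode, .import, .headers, .once) come from the SQLite CLI.

```typescript
// Hypothetical helper: builds a sqlite3 CLI script that imports a CSV file
// and writes the sorted rows to another CSV file.
// Filenames containing single quotes are not handled in this sketch.
function buildSortScript(source: string, destination: string, orderBy: string[]): string {
    // Double-quote the column names so that names with spaces stay valid SQL.
    const columns = orderBy.map((c) => `"${c.replace(/"/g, '""')}"`).join(', ');
    return [
        '.mode csv',                 // read and write CSV
        `.import '${source}' data`,  // the first row names the columns of a new table
        '.headers on',               // emit a header row in the output
        `.once '${destination}'`,    // send the next query result to this file
        `SELECT * FROM data ORDER BY ${columns};`,
    ].join('\n');
}

console.log(buildSortScript('huge.csv', 'huge.sorted.csv', ['id']));
```

Use the logger option shown later in this document to see the commands the library really sends.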

Usage

Install

npm i huge-csv-sorter

Examples

Sort a file with one primary column

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: ['id'],
});

Sort a file with two primary columns

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: ['code', 'version'],
});

Sort a file with two primary columns, with one of them in descending order

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: [
        'code', 
        {
            name: 'version',
            sortDirection: 'DESC',
        }
    ],
});

Sort a file with a subset of the original columns

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    select: [
        'id',
        'name',
        'price'
    ],
    orderBy: ['id'],
});

Sort a file with typed columns and order by a number column

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    schema: [
        { 
            name: 'id',
            type: 'number',            
        },
        'name',
        {
            name: 'price',
            type: 'number',
        }
    ],
    select: ['id', 'name', 'price'],
    orderBy: ['id'],
});
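
Declaring the type matters because the default column type is string, so an untyped id column is presumably ordered as text. The sketch below does not call the library; it only illustrates, in plain TypeScript, how text ordering and numeric ordering differ for the same values.

```typescript
// Values as they appear in a CSV column: text until a type is declared.
const ids = ['10', '9', '2', '100'];

// Text ordering compares character by character, so '10' sorts before '2'.
const asText = [...ids].sort();

// Numeric ordering converts the values to numbers first.
const asNumbers = ids.map(Number).sort((a, b) => a - b);

console.log(asText);    // [ '10', '100', '2', '9' ]
console.log(asNumbers); // [ 2, 9, 10, 100 ]
```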

Sort a file with a custom delimiter such as tab for TSV files

import { sort } from 'huge-csv-sorter';

await sort({
    source: {
        filename: 'huge.tsv',
        delimiter: '\t',
    },
    destination: {
        filename: 'huge.sorted.tsv',
        delimiter: '\t',
    },
    orderBy: ['id'],
});

Sort a file and filter the output rows on a text column

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: ['id'],
    where: `CATEGORY in ('Cat1', 'Cat2', 'Cat3')`,
});

Sort a file and filter the output rows on a number column

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    schema: [
        { 
            name: 'id',
            type: 'number',
        },
        'name',
    ],
    orderBy: ['id'],
    where: `id < 1000`,
});

Sort a file and filter the output rows on a column that must be quoted

Be careful if the names of the columns you're filtering on contain special characters: in this case, you must double-quote them or SQLite will fail to identify the columns.

Note that the where clause should be pure valid SQL and no validation/conversion is done by this library.

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: ['The ID'],
    where: `"The ID" < 1000`,
});
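
If the column names come from user input or configuration, a small helper can do the quoting for you. This is a minimal sketch; quoteIdentifier is a hypothetical name and is not exported by this library. It follows the standard SQL rule of doubling any double quote inside the identifier.

```typescript
// Hypothetical helper: wraps a column name in double quotes for use in SQL,
// doubling any embedded double quote.
function quoteIdentifier(name: string): string {
    return `"${name.replace(/"/g, '""')}"`;
}

const where = `${quoteIdentifier('The ID')} < 1000`;
console.log(where); // "The ID" < 1000
```

The resulting string can be passed as the where option. Values compared against the column still need their own escaping, since the library does not validate the clause.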

Sort a file and paginate

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: ['id'],
    offset: 1000,
    limit: 100,
});
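
When you think in pages rather than raw offsets, the two options can be derived from a page number. The helper below is a hypothetical sketch (pageWindow is not part of this library) and assumes 1-based page numbers.

```typescript
// Hypothetical helper: converts a 1-based page number into an offset/limit pair.
function pageWindow(page: number, pageSize: number): { offset: number; limit: number } {
    if (!Number.isInteger(page) || page < 1) {
        throw new RangeError('page must be an integer >= 1');
    }
    if (!Number.isInteger(pageSize) || pageSize < 1) {
        throw new RangeError('pageSize must be an integer >= 1');
    }
    return { offset: (page - 1) * pageSize, limit: pageSize };
}

// Page 11 with 100 rows per page gives the same window as the example above.
console.log(pageWindow(11, 100)); // { offset: 1000, limit: 100 }
```

The result can be spread into the options, for example sort({ ..., ...pageWindow(11, 100) }).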

Sort a file with custom sqlite settings

If you want to keep the SQLite database for further inspection after the import, you can override the sqlite options. You can also change the filename of the SQLite database, which otherwise uses the destination filename with the csv extension replaced by sqlite.

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    sqlite: {
        filename: '/tmp/huge.sqlite',
        keepDB: true, // do not delete db after sort
    },
    orderBy: ['id'],
});

Log all commands

If you want to understand how the schema, the import and the query are implemented in SQLite, you can provide your logger function:

import { sort } from 'huge-csv-sorter';

await sort({
    source: 'huge.csv',
    destination: 'huge.sorted.csv',
    orderBy: ['id'],
    logger: console.log,
});

Order 2 CSV files and diff them on the console

Note that you must also install the diff lib with npm i tabular-data-differ.

import { diff } from 'tabular-data-differ';
import { sort } from 'huge-csv-sorter';

await sort({
    source: './tests/a.csv',
    destination: './tests/a.sorted.csv',
    orderBy: ['id'],
});

await sort({
    source: './tests/b.csv',
    destination: './tests/b.sorted.csv',
    orderBy: ['id'],
});

const stats = await diff({
    oldSource: './tests/a.sorted.csv',
    newSource: './tests/b.sorted.csv',
    keys: ['id'],
}).to('console');
console.log(stats);

Documentation

FileOptions

Name      | Required | Default value | Description
----------|----------|---------------|------------
filename  | yes      |               | a filename
delimiter | no       | ,             | the optional delimiter of the columns

SchemaColumn

Name | Required | Default value | Description
-----|----------|---------------|------------
name | yes      |               | the name of the column.
type | no       | string        | the type of the column: either a string or a number.

SortedColumn

Name          | Required | Default value | Description
--------------|----------|---------------|------------
name          | yes      |               | the name of the column.
sortDirection | no       | ASC           | the sort direction of the data.

SQLiteOptions

Name     | Required | Default value | Description
---------|----------|---------------|------------
filename | yes      |               | the filename of the SQLite temporary database.
keepDB   | no       | false         | specifies whether to keep the database after the operation or if it should be deleted.
cli      | no       | sqlite3       | the SQLite command line tool.

SortOptions

Name        | Required | Default value | Description
------------|----------|---------------|------------
source      | yes      |               | either a filename or a FileOptions object
destination | yes      |               | either a filename or a FileOptions object
schema      | no       |               | an optional list of columns annotated with their type (string or number). Note that if it is specified, it must match all columns of the source file, in the same order of appearance, otherwise the SQLite import will be aborted.
select      | no       |               | a selection of columns to keep from the source CSV. It will keep all columns when not specified.
orderBy     | yes      |               | a list of columns for ordering the records.
where       | no       |               | the conditions for filtering the records.
offset      | no       | 0             | the offset from which to start selecting the records
limit       | no       |               | the maximum number of records to select. It will keep all records when not specified.
sqlite      | no       |               | options for customizing SQLite.
logger      | no       |               | a function for logging commands sent to SQLite
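
Since source and destination accept either a filename or a FileOptions object, code that wraps this library may want a single normalized shape. The sketch below reconstructs the FileOptions type from the table above rather than copying the library's own typings, and normalizeFile is a hypothetical helper, not part of the package.

```typescript
// Reconstructed from the FileOptions table; the real typings may differ.
type FileOptions = { filename: string; delimiter?: string };

// Hypothetical helper: accepts either form and applies the documented
// default delimiter (a comma).
function normalizeFile(file: string | FileOptions): Required<FileOptions> {
    const options: FileOptions = typeof file === 'string' ? { filename: file } : file;
    return { filename: options.filename, delimiter: options.delimiter ?? ',' };
}

console.log(normalizeFile('huge.csv'));
console.log(normalizeFile({ filename: 'huge.tsv', delimiter: '\t' }));
```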

sort

The sort function takes a single parameter of type SortOptions.

There are only 3 required options:

  • source
  • destination
  • orderBy

Development

Install

git clone git@github.com:livetocode/huge-csv-sorter.git
cd huge-csv-sorter
npm i

Tests

Tests are implemented with Jest and can be run with: npm t

You can also look at the coverage with: npm run show-coverage