hypgrep

v0.4.0

Published

16 days ago

Compact full-text grep search index for Parquet files


platypii

parquet index search full-text-search hyparquet serverless

hypgrep

Build a compact n-gram search index for a Parquet file using hyparquet and hyparquet-writer. Queries are case-insensitive substring matches — grep semantics over a precomputed index.

Part of HypStack, an open-source stack for AI observability.

Why?

Enable efficient grep-style search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
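As a rough illustration of the range-request idea, the sketch below fetches a single byte range from a remote file. The helper names `rangeHeader` and `fetchRange` are hypothetical and are not part of hypgrep. In practice hypgrep and hyparquet issue these requests for you, and the byte offsets come from the index and the Parquet metadata.

```javascript
// Hypothetical helpers for illustration; not part of the hypgrep API.

// Build an HTTP Range header value for an inclusive byte range.
function rangeHeader(start, end) {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end < start) {
    throw new RangeError(`invalid byte range ${start}-${end}`)
  }
  return `bytes=${start}-${end}`
}

// Fetch only the requested bytes of a remote file.
// Assumes the server supports range requests and replies 206 Partial Content.
async function fetchRange(url, start, end) {
  const res = await fetch(url, { headers: { Range: rangeHeader(start, end) } })
  if (res.status !== 206) throw new Error(`expected 206 Partial Content, got ${res.status}`)
  return res.arrayBuffer()
}
```

Only the requested bytes cross the network, which is why a small index lookup followed by a few targeted reads can stand in for a search server.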

Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.

Benchmarks

The benchmark is a full-text search over 3,199,860 real LLM conversations (WildChat-4.8M), with the JSON conversation stored verbatim (14.7 GB of Parquet). It runs 109 queries across 15 shapes (tokens, phrases, JSON structure, code, Unicode, and regex) against the same data on every engine and measures time-to-first-100 rows from one client. Two things separate the engines: whether they can answer a query at all, and what they cost.

Can it answer the query? Only hypgrep and Athena handle all 109; Quickwit, a tokenized engine, cannot do literal-punctuation substrings or regex and fails nearly half.

| Engine | Substring | Regex | Backref | Answered |
|---|:-:|:-:|:-:|---:|
| hypgrep | ✓ | ✓ | ✓ | 109/109 |
| Athena | ✓ | ✓ | ✓ | 109/109 |
| DuckDB | ✓ | ✓ | ✗ | 108/109 |
| pg_trgm | ✓ | ✓ | ⏱ | 106/109 |
| Elasticsearch | ✓ | ✗ | ✗ | 97/109 |
| Quickwit | ✗ | ✗ | ✗ | 59/109 |

Speed, footprint, and cost. The always-on engines answer fast but bill a box 24/7; the client and serverless engines cost per query, where an index is the difference between pennies and dollars.

| Engine | Warm latency | Index | Fixed / mo | Per query | Server |
|---|---:|---:|---:|---:|---|
| hypgrep | 596 ms | 1.7 GB | $0.38 | $0.003 | none |
| Elasticsearch | 91 ms | 63 GB | $374 | ~$0 | r5.2xlarge 24/7 |
| Quickwit | 381 ms | 43 GB | $63 | ~$0 | t3.large 24/7 |
| pg_trgm | 449 ms | 28 GB | $187 | ~$0 | r5.xlarge 24/7 |
| Athena | 5.1 s | none | $0.34 | $0.02 | serverless |
| DuckDB | 30 s | none | $0.34 | $0.40 | none |

Warm latency is median time-to-first-100 rows over the queries every engine can answer. The always-on engines win raw latency by keeping a hot index in RAM, which is what the monthly bill pays for. hypgrep trades that for zero idle cost, a smaller footprint, and no infrastructure.
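To make the cost trade-off concrete, here is a back-of-envelope break-even calculation using the fixed and per-query figures from the table. The function names are illustrative, and the model assumes an always-on engine's per-query cost is exactly zero (the table lists it as ~$0), so treat the result as an estimate only.

```javascript
// Monthly cost model: fixed cost plus a per-query charge.
function monthlyCost(fixed, perQuery, queries) {
  return fixed + perQuery * queries
}

// Query volume at which a pay-per-query engine costs as much as an always-on one.
// Assumes the always-on engine's per-query cost is exactly 0.
function breakEvenQueries(alwaysOnFixed, fixed, perQuery) {
  return Math.round((alwaysOnFixed - fixed) / perQuery)
}

// hypgrep ($0.38 fixed + $0.003 per query) vs. Elasticsearch ($374 per month)
const queries = breakEvenQueries(374, 0.38, 0.003)
console.log(queries) // 124540
```

Under these assumptions, hypgrep stays cheaper than the Elasticsearch setup until roughly 124,500 queries per month.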

CLI usage

Build an index:

npx hypgrep dataset.parquet [dataset.index.parquet]

Grep against the indexed file:

npx hypgrep search dataset.parquet 'serverless'          # literal substring
npx hypgrep search dataset.parquet '/eigen.+value/i'      # regex
npx hypgrep search dataset.parquet 'rhythm' --limit 5     # first N matches
npx hypgrep search dataset.parquet 'rhythm' -c            # count only
npx hypgrep search dataset.parquet 'rhythm' -i            # case-insensitive literal

To install as a system-wide CLI tool:

npm install -g hypgrep
hypgrep search dataset.parquet 'pattern'

Find rows in a parquet file in JavaScript

Use parquetFind to find rows containing the query as a substring while preserving natural row order (like Ctrl+F):

import { parquetFind } from 'hypgrep'

for await (const row of parquetFind({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // { title: '...', text: '...' }
}

The query matches as a contiguous substring (grep semantics): 'speed of light' matches rows containing that exact phrase, not rows where the words merely co-occur. Queries shorter than the indexed n-gram length (default 5) fall back to a full scan but still return correct results.
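The toy example below illustrates the general n-gram technique behind this behavior: prune candidates with an n-gram lookup, verify them with a real substring check, and fall back to a scan when the query is shorter than the n-gram length. It is a simplified in-memory sketch with hypothetical function names; hypgrep's actual index format and lookup logic are not shown here and will differ.

```javascript
// Toy illustration only; hypgrep's real index format and lookup differ.
const N = 5 // n-gram length, matching the documented default

// All distinct lowercase n-grams of a string.
function ngrams(text, n = N) {
  const lower = text.toLowerCase()
  const grams = new Set()
  for (let i = 0; i + n <= lower.length; i++) grams.add(lower.slice(i, i + n))
  return grams
}

// Map each n-gram to the set of row ids that contain it.
function buildIndex(rows) {
  const index = new Map()
  rows.forEach((text, id) => {
    for (const gram of ngrams(text)) {
      if (!index.has(gram)) index.set(gram, new Set())
      index.get(gram).add(id)
    }
  })
  return index
}

// Case-insensitive substring search: prune with the index, then verify.
function find(rows, index, query) {
  const needle = query.toLowerCase()
  let candidates
  if (needle.length < N) {
    // Query too short to contain an n-gram: fall back to scanning every row.
    candidates = rows.map((_, id) => id)
  } else {
    // A matching row must contain every n-gram of the query.
    for (const gram of ngrams(needle)) {
      const ids = index.get(gram) ?? new Set()
      candidates = candidates ? candidates.filter(id => ids.has(id)) : [...ids]
    }
  }
  // The index can yield false positives, so confirm each candidate.
  return candidates.filter(id => rows[id].toLowerCase().includes(needle))
}

const rows = ['The speed of light', 'Light travels at great speed', 'Serverless search']
const index = buildIndex(rows)
console.log(find(rows, index, 'speed of light')) // [ 0 ]
console.log(find(rows, index, 'spee')) // [ 0, 1 ]
```

The second row contains both "speed" and "light" but is pruned because it lacks n-grams such as "peed " that span the phrase, which is how an n-gram index preserves contiguous-substring semantics.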

Regex queries

Pass a RegExp directly — mandatory literals are extracted from the pattern for index pruning, and regex.test runs against each row:

for await (const row of parquetFind({
  query: /eigen\w*value/i,
  url: '...',
})) {
  console.log(row)
}

If the regex has no extractable literal (e.g. /./, /foo|bar/), the index can't prune and HypGrep does a full scan. The substring/regex filter still applies — results are correct, just unaccelerated.

If you want full control over the row predicate (e.g. a custom JS function), pass rowFilter. The string query is still used for index pruning while the callback decides which rows to keep:

for await (const row of parquetFind({
  query: 'eigen',
  rowFilter: row => myCustomCheck(row),
  url: '...',
})) {
  console.log(row)
}

Ranked search

Use parquetSearch for Google-style ranked search: whitespace-separated words are ANDed (every word must appear), and results are ranked by total occurrence count:

import { parquetSearch } from 'hypgrep'

for await (const row of parquetSearch({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // most matches first
}
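The ranking rule can be illustrated with a small in-memory sketch: split the query on whitespace, keep only rows that contain every word, and sort by total occurrence count. The function names are hypothetical, case-insensitive matching is an assumption made for this example, and none of this reflects how parquetSearch is implemented internally.

```javascript
// Sketch of the ranking rule described above, not hypgrep's implementation.
// Case-insensitive matching is an assumption made for this example.

// Count non-overlapping occurrences of needle in text.
function countOccurrences(text, needle) {
  let count = 0
  for (let i = text.indexOf(needle); i !== -1; i = text.indexOf(needle, i + needle.length)) count++
  return count
}

function rankedSearch(rows, query) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean)
  if (words.length === 0) return []
  return rows
    .map(row => {
      const text = row.toLowerCase()
      const counts = words.map(word => countOccurrences(text, word))
      return { row, counts, score: counts.reduce((a, b) => a + b, 0) }
    })
    .filter(({ counts }) => counts.every(c => c > 0)) // AND: every word must appear
    .sort((a, b) => b.score - a.score) // most occurrences first
    .map(({ row }) => row)
}

const rows = [
  'serverless search',
  'search, search and more serverless search',
  'search without servers',
]
console.log(rankedSearch(rows, 'serverless search'))
```

Here the second row ranks first with four total occurrences, the first row follows with two, and the third is dropped because it never mentions "serverless".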

Create an index in JavaScript

import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { createIndex } from 'hypgrep'

// Generate dataset.index.parquet from dataset.parquet
const sourceFile = await asyncBufferFromFile('dataset.parquet')
const indexFile = fileWriter('dataset.index.parquet')
await createIndex({ sourceFile, indexFile })

Local parquet files

To search against local parquet files, provide an asyncBufferFactory that loads the file from the local filesystem:

import { asyncBufferFromFile } from 'hyparquet'
import { parquetFind } from 'hypgrep'

// Loads parquet file from local filesystem
function asyncBufferFactory({ url }) {
  return asyncBufferFromFile(url)
}

for await (const row of parquetFind({
  query: 'serverless',
  url: 'dataset.parquet',
  asyncBufferFactory,
})) {
  console.log(row)
}
