stratum-duckdb

v0.1.6

Published

2 months ago

Run analytical SQL queries against Parquet, CSV and JSON files on any static website. Part of the Stratum toolkit.

Downloads

0High
0Medium
0Low

takayasaito

duckdb parquet wasm static-site quarto analytical-sql stratum

stratum-duckdb

Run full analytical SQL against Parquet, CSV, and JSON files on any static website — no server required. Powered by DuckDB WASM.

Part of the Stratum toolkit for publishing research data on static websites. See STRATUM.md for a comparison of stratum-duckdb and stratum-sqlite and guidance on choosing the right library for your dataset.

How it works

DuckDB WASM runs the full DuckDB analytical engine in the browser using WebAssembly. stratum-duckdb wraps the DuckDB WASM API with a simple open() / query() interface and adds file registration helpers for both local and remote files.

For Parquet files on remote servers, DuckDB uses HTTP range requests to fetch only the row groups and columns each query needs — even for files hundreds of megabytes in size.

Browser                                     Remote server (GitHub / Zenodo / S3)
  │                                               │
  │  SELECT country, AVG(life_exp) …              │
  │  FROM 'https://…/indicators.parquet'          │
  │  WHERE year = 2022                            │
  │                                               │
  ├── GET …indicators.parquet  Range: 0-7         │  ← Parquet footer (metadata)
  │   ← 404 bytes              ────────────────►  │
  │                                               │
  ├── GET …indicators.parquet  Range: 512-8192    │  ← Only year+country columns
  │   ← 7680 bytes             ────────────────►  │
  │                                               │
  │  Result: [{country:"NOR", avg:83.2}, …]       │

Quick start

For the demo site (local preview)

git clone https://github.com/stratum-toolkit/stratum-duckdb
cd stratum-duckdb
node build.mjs   # build the library
bash setup.sh    # download DuckDB binary files into docs/libs/duckdb/
python3 -m http.server 8000 --directory docs
# open http://localhost:8000

setup.sh downloads both the mvp and eh DuckDB WASM bundles from jsDelivr and copies the built library into docs/libs/duckdb/. These binary files are in .gitignore — the CI pipeline downloads them fresh on every deploy.

Option A — Self-hosted (for your own project)

Step 1: Download the DuckDB binary files into your project. Both mvp and eh bundles are needed — selectBundle() picks the right one at runtime based on browser capabilities:

mkdir -p libs/duckdb
BASE="https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist"
curl -sSfL "$BASE/duckdb-browser-mvp.worker.js" -o libs/duckdb/duckdb-browser-mvp.worker.js
curl -sSfL "$BASE/duckdb-mvp.wasm"              -o libs/duckdb/duckdb-mvp.wasm
curl -sSfL "$BASE/duckdb-browser-eh.worker.js"  -o libs/duckdb/duckdb-browser-eh.worker.js
curl -sSfL "$BASE/duckdb-eh.wasm"               -o libs/duckdb/duckdb-eh.wasm

Why only binary files? The DuckDB main JS module is an ES module that cannot be served as a plain <script> tag. It is loaded automatically from jsDelivr CDN via dynamic import() at runtime (~100 KB compressed, CDN-cached). Only the binary companion files need to be self-hosted.

Step 2: Download stratum-duckdb.esm.js from the Releases page and place it alongside the DuckDB files:

libs/
└── duckdb/
    ├── duckdb-browser-mvp.worker.js
    ├── duckdb-mvp.wasm
    ├── duckdb-browser-eh.worker.js
    ├── duckdb-eh.wasm
    └── stratum-duckdb.esm.js

Step 3: Use in your HTML:

<!-- All pages need type="module" for dynamic import() to work -->
<script type="module">
  const StratumDuckDB =
    (await import('libs/duckdb/stratum-duckdb.esm.js')).default;

  const db = await StratumDuckDB.open({
    duckdbPath: 'libs/duckdb/',   // folder with worker.js and .wasm files
  });

  // Register local files (served from the same site)
  await db.registerFile('countries',  'data/countries.parquet');
  await db.registerFile('indicators', 'data/indicators.parquet');

  // Full analytical SQL
  const rows = await db.query(`
    SELECT
      c.name,
      i.life_expectancy,
      RANK() OVER (ORDER BY i.life_expectancy DESC) AS rank
    FROM countries c
    JOIN indicators i ON c.iso3 = i.country_iso
    WHERE i.year = 2022
    ORDER BY rank
  `);
  console.log(rows);
</script>

Option B — npm + bundler

npm install stratum-duckdb

import StratumDuckDB from 'stratum-duckdb';

const db = await StratumDuckDB.open({ duckdbPath: '/libs/duckdb/' });
await db.registerFile('countries', 'data/countries.parquet');
const rows = await db.query('SELECT * FROM countries');

API

`StratumDuckDB.open(options)` → `Promise<Database>`

Initialises DuckDB WASM and returns a Database.

| Option | Type | Default | Description | |--------|------|---------|-------------| | duckdbPath | string | jsDelivr CDN | Path to folder containing the DuckDB .worker.js and .wasm binary files. | | siteRoot | string | location.href directory | Base URL for resolving relative paths in registerFile(). |

`db.query(sql)` → `Promise<Array<Object>>`

Run any SQL and return rows as plain objects.

await db.query(`
  SELECT subregion, ROUND(AVG(gdp_usd_bn),1) AS avg_gdp
  FROM countries
  GROUP BY subregion
  ORDER BY avg_gdp DESC
`)
// → [{ subregion: "Northern America", avg_gdp: 9200.3 }, …]

`db.registerFile(alias, filePath)` → `Promise<void>`

Fetch a local file (served from the same site) and register it as a named view. Supported formats: .parquet, .csv, .json.

await db.registerFile('my_data', 'data/my_data.parquet')
await db.query('SELECT * FROM my_data LIMIT 10')

`db.registerRemote(alias, url)` → `Promise<void>`

Register a remote HTTPS file as a named view. DuckDB fetches only what each query needs via HTTP range requests — the file is never fully downloaded.

await db.registerRemote('indicators',
  'https://github.com/org/repo/releases/download/v1/indicators.parquet')
await db.query('SELECT * FROM indicators WHERE year = 2022')

`db.views()` → `Promise<string[]>`

List all registered view names.

`db.describe(name)` → `Promise<Array<Object>>`

Return column types for a view or table (equivalent to DESCRIBE).

`db.count(name)` → `Promise<number>`

Return total row count.

`db.close()` → `Promise<void>`

Close the connection and free resources.

Quarto / ObservableJS integration

`_duckdb-init.html`

Add this file to your Quarto project root and include it via _quarto.yml. It loads stratum-duckdb via import() and exposes a ready Database as a Promise that OJS cells can safely await.

<script type="module">
(async function () {
  // Detect page depth from Quarto's injected site_libs/ script path.
  // Works for pages at any subdirectory depth without hardcoding.
  const siteLibScript = document.querySelector('script[src*="site_libs/"]');
  const root = siteLibScript
    ? siteLibScript.getAttribute('src').replace(/site_libs\/.*$/, '')
    : '';

  window._duckdbBase = root + 'libs/duckdb/';
  window._dataRoot   = root + 'data/';

  // Expose the initialised Database as a Promise. OJS cells start evaluating
  // before scripts finish loading, so a plain global is unreliable.
  window._stratumDuckDB = (async () => {
    const StratumDuckDB =
      (await import(window._duckdbBase + 'stratum-duckdb.esm.js')).default;
    return StratumDuckDB.open({ duckdbPath: window._duckdbBase });
  })();
}());
</script>

`_quarto.yml`

format:
  html:
    include-in-header: _duckdb-init.html
project:
  resources:
    - libs/duckdb/   # must contain worker.js, .wasm, and stratum-duckdb.esm.js
    - data/

Any `.qmd` page

// Await the Database Promise before registering files and querying.
// Other OJS cells that reference db receive the resolved value automatically.
db = {
  const database = await window._stratumDuckDB;
  await database.registerFile('indicators', 'indicators.csv');
  return database;
}

rows = db.query("SELECT * FROM indicators WHERE year = 2022 ORDER BY life_expectancy DESC")
Inputs.table(await rows)

Creating Parquet files for your research data

Python (pyarrow)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("my_data.csv")
pq.write_table(pa.Table.from_pandas(df), "docs/data/my_data.parquet",
               compression="snappy")

R (arrow)

library(arrow)
df <- read.csv("my_data.csv")
write_parquet(df, "docs/data/my_data.parquet")

From an existing SQLite database

import sqlite3, pandas as pd, pyarrow.parquet as pq, pyarrow as pa

con = sqlite3.connect("my.sqlite")
for table in ["countries", "indicators"]:
    df = pd.read_sql(f"SELECT * FROM {table}", con)
    pq.write_table(pa.Table.from_pandas(df),
                   f"docs/data/{table}.parquet", compression="snappy")

Hosting your data

| Host | Free | Parquet range requests | |------|------|----------------------| | GitHub Pages (same origin) | ✓ | ✓ | | GitHub Releases | ✓ | ✓ | | Zenodo | ✓ | ✓ | | Hugging Face Datasets | ✓ | ✓ | | AWS Open Data / S3 | varies | ✓ (configure CORS) | | Cloudflare R2 | free tier | ✓ (configure CORS) |

Developing this library

git clone https://github.com/stratum-toolkit/stratum-duckdb
cd stratum-duckdb

# Build dist bundles
node build.mjs

# Create demo data (CSV + Parquet)
python3 scripts/create_demo_data.py
# or
Rscript scripts/create_demo_data.R

# Download DuckDB binary files and copy the built library (one command)
bash setup.sh

# Serve demo site
python3 -m http.server 8000 --directory docs
# open http://localhost:8000

setup.sh downloads both mvp and eh DuckDB WASM bundles from jsDelivr into docs/libs/duckdb/ and copies the built library there. These binary files are in .gitignore — the CI pipeline downloads them fresh on every deploy.

To force a re-download of all binary files: bash setup.sh --force

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

stratum-duckdb

How it works

Quick start

For the demo site (local preview)

Option A — Self-hosted (for your own project)

Option B — npm + bundler

API

StratumDuckDB.open(options) → Promise<Database>

db.query(sql) → Promise<Array<Object>>

db.registerFile(alias, filePath) → Promise<void>

db.registerRemote(alias, url) → Promise<void>

db.views() → Promise<string[]>

db.describe(name) → Promise<Array<Object>>

db.count(name) → Promise<number>

db.close() → Promise<void>

Quarto / ObservableJS integration

_duckdb-init.html

_quarto.yml

Any .qmd page

Creating Parquet files for your research data

Python (pyarrow)

R (arrow)

From an existing SQLite database

Hosting your data

Developing this library

License

`StratumDuckDB.open(options)` → `Promise<Database>`

`db.query(sql)` → `Promise<Array<Object>>`

`db.registerFile(alias, filePath)` → `Promise<void>`

`db.registerRemote(alias, url)` → `Promise<void>`

`db.views()` → `Promise<string[]>`

`db.describe(name)` → `Promise<Array<Object>>`

`db.count(name)` → `Promise<number>`

`db.close()` → `Promise<void>`

`_duckdb-init.html`

`_quarto.yml`

Any `.qmd` page