delta-unity-duckdb
v1.0.7
Published
Read Databricks Delta tables from Azure storage through Unity Catalog using DuckDB
Maintainers
Readme
Delta Scanner
A standalone Delta Table Scanner using DuckDB for querying Delta tables through Unity Catalog from Node.js (designed with Azure Databricks in mind).
Features
- Query Delta tables using DuckDB through Unity Catalog without a Databricks cluster
- Support for Unity Catalog tables with automatic credential handling
- ABFSS path support with SAS token authentication
- Use as a CLI tool or import as a module in your Node.js applications
- Custom SQL queries with table placeholder support
Requirements
- Node.js 14 or higher
- Internet access to Databricks Unity Catalog API (for UC tables)
- No Azure credentials needed for UC tables (temporary SAS is requested automatically)
- For direct ABFSS access only: an Azure Storage account name and a valid SAS token
Installation
Global Installation
npm install -g delta-unity-duckdbLocal Installation
npm install delta-unity-duckdbConfiguration
Delta Scanner can be configured using environment variables or CLI arguments:
Environment Variables
Create a .env file in your project root with the minimal variables for Unity Catalog tables (recommended):
DATABRICKS_WORKSPACE_URL=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=your-token-hereDirect ABFSS access (no Unity Catalog):
AZURE_STORAGE_ACCOUNT_NAME=your-storage-account-name
AZURE_STORAGE_SAS_TOKEN=your-sas-tokenNotes for ABFSS:
- These are only required if you pass an
abfss://table path directly and are NOT using Unity Catalog. - Provide the SAS token string itself (usually starts with
sv=...), without the leading?.
CLI Arguments
CLI arguments will override environment variables:
delta-unity-duckdb --table="catalog.schema.table_name" --databricks_host="https://your-workspace.cloud.databricks.com" --databricks_token="your-token"ABFSS example (requires AZURE_STORAGE_* envs):
export AZURE_STORAGE_ACCOUNT_NAME=your-account
export AZURE_STORAGE_SAS_TOKEN='sv=...&ss=...&srt=...&sp=rl&se=...&st=...&spr=https&sig=...'
delta-unity-duckdb --table="abfss://[email protected]/path/to/delta-table" --limit=5 --format=tableQuickstart
- Create a
.envfile with the minimal required variables for Unity Catalog tables:
DATABRICKS_WORKSPACE_URL=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=your-databricks-personal-access-tokenTip: export your .env into the shell (mac/Linux):
set -a; source .env; set +a- Run a query against a Unity Catalog table:
delta-unity-duckdb --table="$TABLE" --limit=5 --format=tableExample output (table format):
Table row count: 12345
Query Results:
Columns: [ 'id', 'name', 'value' ]
┌─────────┬─────┬───────────┬───────┐
│ (index) │ id │ name │ value │
├─────────┼─────┼───────────┼───────┤
│ 0 │ 1 │ 'alpha' │ 3.14 │
│ 1 │ 2 │ 'beta' │ 2.71 │
│ 2 │ 3 │ 'gamma' │ 1.62 │
│ 3 │ 4 │ 'delta' │ 0.58 │
│ 4 │ 5 │ 'epsilon' │ 4.20 │
└─────────┴─────┴───────────┴───────┘- Inspect schema (optional):
delta-unity-duckdb --table="catalog.schema.table_name" --schemaUsage
As a CLI Tool
After installing globally, you can use the delta-unity-duckdb command directly:
# Basic usage with .env file configuration (uses TABLE variable from .env)
delta-unity-duckdb --table="$TABLE" --limit=10
# Using explicit table name with CLI arguments for Databricks credentials
delta-unity-duckdb --table="catalog.schema.table_name" --databricks_host="https://your-workspace.cloud.databricks.com" --databricks_token="your-token" --limit=5
# Custom SQL query with explicit table name (IMPORTANT: use actual table name in --table parameter)
delta-unity-duckdb --table="catalog.schema.table_name" --query="SELECT column1, COUNT(*) as count FROM \$TABLE GROUP BY column1 ORDER BY count DESC LIMIT 5"
# Display results as a table instead of JSON
delta-unity-duckdb --table="catalog.schema.table_name" --format=table
# IMPORTANT: Table Parameter vs $TABLE Placeholder
# --table="actual.table.name" <- Specifies which table to query
# \$TABLE in --query <- Placeholder that gets replaced with delta_scan() functionHow it works
Delta Scanner:
- Loads DuckDB with the
httpfs,delta, andazureextensions. - Looks up your Unity Catalog table and requests temporary credentials from the Databricks Unity Catalog API.
- Creates a temporary DuckDB secret with the retrieved SAS token to securely access Azure Storage.
- Resolves the Delta table location and runs your query via DuckDB (e.g.,
delta_scan(...)). - Closes connections when finished.
As a Module
const { DeltaScanner } = require('delta-unity-duckdb');
async function queryDeltaTable() {
const scanner = new DeltaScanner();
try {
// Get table statistics
const stats = await scanner.getTableStats('catalog.schema.table_name');
console.log(`Table row count: ${stats.count}`);
// Get schema information
const schemaInfo = await scanner.getTableSchema('catalog.schema.table_name');
console.log('Table Schema:', schemaInfo.schema);
// Execute a simple query
const rows = await scanner.query('catalog.schema.table_name', null, { limit: 5 });
console.log('Query Results:', rows);
// Execute a custom query
const customResults = await scanner.query(
'catalog.schema.table_name',
'SELECT column1, COUNT(*) as count FROM $TABLE GROUP BY column1 ORDER BY count DESC LIMIT 5'
);
console.log('Custom Query Results:', customResults);
} catch (error) {
console.error('Error:', error.message);
} finally {
await scanner.close();
}
}
queryDeltaTable();Next.js / Webpack Integration
Webpack Bundling Issues
When using delta-unity-duckdb in Next.js applications, you must exclude it from client-side bundling since it contains Node.js-specific dependencies that cannot run in browsers.
Add to your next.config.js:
/** @type {import('next').NextConfig} */
const nextConfig = {
// Exclude server-only packages from client bundles
serverExternalPackages: [
'duckdb',
'@duckdb/node-api',
'@mapbox/node-pre-gyp',
'node-gyp',
'delta-unity-duckdb'
],
// Mark API routes as server-only
experimental: {
serverComponentsExternalPackages: ['delta-unity-duckdb']
}
}
module.exports = nextConfigIn your API routes, add the server runtime directive:
// Mark this as server-only to prevent client bundling
export const runtime = 'nodejs';
import { DeltaScanner } from 'delta-unity-duckdb'; // This will work in API routesSingleton Pattern for Concurrency
When using multiple DeltaScanner instances simultaneously (e.g., in a web server with concurrent requests), you may encounter DuckDB extension initialization conflicts and query deadlocks. Use a singleton pattern with query queuing to prevent these issues:
// lib/shared-scanner.js
const { DeltaScanner } = require('delta-unity-duckdb');
class DeltaScannerSingleton {
constructor() {
if (DeltaScannerSingleton.instance) {
return DeltaScannerSingleton.instance;
}
this.scanner = null;
this.isInitialized = false;
this.initPromise = null;
this.queryQueue = [];
this.isProcessingQueue = false;
DeltaScannerSingleton.instance = this;
}
static getInstance() {
if (!DeltaScannerSingleton.instance) {
DeltaScannerSingleton.instance = new DeltaScannerSingleton();
}
return DeltaScannerSingleton.instance;
}
async initialize() {
if (this.isInitialized) return;
if (this.initPromise) {
return this.initPromise;
}
this.initPromise = this._initialize();
return this.initPromise;
}
async _initialize() {
try {
this.scanner = new DeltaScanner();
await this.scanner.initialize();
this.isInitialized = true;
console.log('[DELTA SINGLETON] Initialized successfully');
} catch (error) {
console.error('[DELTA SINGLETON] Initialization failed:', error);
this.initPromise = null;
throw error;
}
}
async query(tableName, query) {
await this.initialize();
if (!this.scanner) {
throw new Error('DeltaScanner not initialized');
}
// Queue the query to prevent concurrent access issues
return new Promise((resolve, reject) => {
this.queryQueue.push(async () => {
try {
const result = await this.scanner.query(tableName, query);
resolve(result);
} catch (error) {
reject(error);
}
});
// Use setImmediate to avoid blocking the event loop
setImmediate(() => this.processQueue());
});
}
async processQueue() {
if (this.isProcessingQueue || this.queryQueue.length === 0) {
return;
}
this.isProcessingQueue = true;
try {
while (this.queryQueue.length > 0) {
const queryFn = this.queryQueue.shift();
if (queryFn) {
await queryFn();
}
}
} finally {
this.isProcessingQueue = false;
}
}
async close() {
if (this.scanner) {
await this.scanner.close();
this.scanner = null;
this.isInitialized = false;
this.initPromise = null;
}
}
}
// Export singleton instance
const sharedScanner = DeltaScannerSingleton.getInstance();
// Cleanup on process exit
process.on('exit', async () => {
await sharedScanner.close();
});
process.on('SIGINT', async () => {
await sharedScanner.close();
process.exit(0);
});
module.exports = { sharedScanner };Usage in API routes:
// app/api/your-endpoint/route.js
export const runtime = 'nodejs';
import { sharedScanner } from '../../../lib/shared-scanner';
export async function GET() {
try {
const data = await sharedScanner.query(
'catalog.schema.table_name',
'SELECT * FROM $TABLE LIMIT 10'
);
return Response.json({ success: true, data });
} catch (error) {
return Response.json({ error: error.message }, { status: 500 });
}
// Note: Don't call scanner.close() - let singleton manage lifecycle
}Key Benefits of Singleton Pattern with Query Queue
- Prevents DuckDB Extension Conflicts: Only one scanner initializes extensions
- Improves Performance: Reuses connection across requests
- Handles Concurrency: Query queue prevents DuckDB connection deadlocks
- Sequential Processing: Queries execute one at a time to avoid conflicts
- Automatic Cleanup: Properly closes connections on process exit
Important Notes
- Query Queuing: The singleton processes queries sequentially to prevent DuckDB concurrency issues
- Multiple Queries: Combine multiple COUNT/aggregate queries into single SQL statements using CASE expressions for better performance
- Event Loop: Uses
setImmediate()to avoid blocking the Node.js event loop during queue processing
API Reference
DeltaScanner Class
Constructor
const scanner = new DeltaScanner(options);Options:
memory: Boolean, whether to use in-memory database (default: true)extensions: Array of extensions to load (default: ['httpfs', 'delta', 'azure'])
Methods
async initialize(): Initialize DuckDB connection and load required extensionsasync query(tablePath, sql = null, options = {}): Execute a SQL query on a Delta tableasync getTableStats(tablePath): Get table statistics (row count)async getTableSchema(tablePath): Get table schema informationasync close(): Close the database connection
License
MIT
