miniml

v1.0.16

Published

9 months ago

A minimal, embeddable semantic data modeling language for generating SQL queries from YAML model definitions. Inspired by LookML.

0High
0Medium
0Low

dtempx

modeling-language data-modeling-language data-modeling data-model data semantic-model semantic-layer sql metrics dimensions measures lookml mcp model context protocol

MiniML

A minimal, embeddable semantic data modeling language for generating SQL queries from YAML model definitions, inspired by LookML.

Generates SQL queries driven by a semantic model and selected dimensions and measures. Capable of generating optimal SQL from minimal queries to very complex queries with multiple joins based on the specified configuration.

Also generates metadata for each model suitable for use by AI to execute queries.

Designed to be a stand-alone and light-weight library, embeddable in any server or app without requiring additional data platforming. Supports BigQuery and Snowflake dialects.

Quick Start

1. Define a Model

Create a YAML model file (sales.yaml) that describes your data structure:

description: Sales fact table capturing transactional-level purchase data across products, stores, and customers.

from: acme.sales

join:
  customer_join: JOIN acme.customers USING (customer_id)
  product_join: JOIN acme.products USING (product_id)
  store_join: JOIN acme.stores USING (store_id)

always_join:
  - product_join  # Always include product data for consistency

date_field: date                    # Primary date field for filtering
default_date_range: last 7 days     # Default time window when no dates specified
include_today: false                # Exclude today's partial data

dimensions:
  date:
    - The calendar date on which the sale occurred.
    - DATE(sale_date)
  sale_id: Unique identifier for each sale transaction.
  customer_id: Unique identifier for the customer making the purchase.
  customer_name:
    - Display name or label associated with the customer.
    - customer_name
    - customer_join
  product_id: Unique identifier for the product sold.
  product_name:
    - Display name or label associated with the product.
    - product_name
    - product_join
  store_id: Unique identifier for the store where the sale took place.
  store_name:
    - Display name or label of the store location.
    - store_name
    - store_join

measures:
  total_amount:
    - Total sale amount generated by the transaction.
    - SUM(total_amount)
  price_avg:
    - Average unit price across all sales.
    - AVG(unit_price)
  count:
    - The count of records for the row.
    - COUNT(*)

2. Produce a description of the model

npx miniml sales.yaml

OUTPUT

## DIMENSIONS
- `date` The calendar date on which the sale occurred.
- `sale_id` Unique identifier for each sale transaction.
- `customer_id` Unique identifier for the customer making the purchase.
- `customer_name` Display name or label associated with the customer.
- `product_id` Unique identifier for the product sold.
- `product_name` Display name or label associated with the product.
- `store_id` Unique identifier for the store where the sale took place.
- `store_name` Display name or label of the store location.

## MEASURES
- `total_amount` Total sale amount generated by the transaction.
- `price_avg` Average unit price across all sales.
- `count` The count of records for the row.

3. Generate a SQL query from the model

npx miniml sales.yaml --dimensions=date --measures=total_amount

OUTPUT

SELECT
  DATE(sale_date) AS date,
  SUM(total_amount) AS total_amount
FROM acme.sales
JOIN acme.products USING (product_id)
WHERE DATE(sale_date) >= CURRENT_TIMESTAMP - INTERVAL 7 DAY
  AND DATE(sale_date) < DATE_TRUNC(CURRENT_TIMESTAMP, DAY)
GROUP BY ALL

Add more options:

npx miniml sales.yaml \
  --dimensions=date,customer_name,product_name \
  --measures=total_amount,price_avg,count \
  --date-from=2025-01-01 \
  --date-to=2025-01-31 \
  --order-by=-date \
  --limit=10

OUTPUT

SELECT
  DATE(sale_date) AS date,
  customer_name AS customer_name,
  product_name AS product_name,
  SUM(total_amount) AS total_amount,
  AVG(unit_price) AS price_avg,
  COUNT(*) AS count
FROM acme.sales
JOIN acme.customers USING (customer_id)
JOIN acme.products USING (product_id)
WHERE DATE(sale_date) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY ALL
ORDER BY date DESC
LIMIT 10

Installation

To use MiniML in your project:

npm install miniml

Programmatic Usage

import { loadModel, renderQuery } from 'miniml';

// Load the model
const model = await loadModel('./sales.yaml');

// Generate a SQL query
const sql = renderQuery(model, {
    dimensions: ['date', 'customer_name'],
    measures: ['total_amount', 'count'],
    date_from: '2025-01-01',
    date_to: '2025-01-31',
    order_by: ['-date'],
    limit: 10
});

console.log(sql);

OUTPUT

SELECT
  DATE(sale_date) AS date,
  customer_name AS customer_name,
  SUM(total_amount) AS total_amount,
  COUNT(*) AS count
FROM acme.sales
JOIN acme.customers USING (customer_id)
WHERE DATE(sale_date) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY ALL
ORDER BY date DESC
LIMIT 10

Model Definition

Basic Structure

A MiniML model is a YAML file with the following structure:

description: Human-readable description of the model
dialect: bigquery | snowflake
from: schema.table_name

join:
  join_name: JOIN clause definition

always_join:
  - join_name  # Joins to always include

where: base_filter_conditions

date_field: primary_date_column        # Auto-detected if not specified
default_date_range: last 30 days       # Default time window
include_today: false                   # Include/exclude current day

dimensions:
  dimension_name: definition

measures:
  measure_name: definition

tags:
  - public  # Tags that can be used for purposes such as access control

info: Detailed usage instructions for the model

Dimensions

Dimensions are fields used for grouping data. They can be defined in several formats:

Simple string (field name only):

dimensions:
  customer_id: Unique customer identifier

Array format (description, SQL, join):

dimensions:
  customer_name:
    - Display name of the customer
    - customer_name
    - customer_join

Object format:

dimensions:
  date:
    description: The transaction date
    sql: DATE(created_at)
    join: date_dimension_join

Measures

Measures are aggregated fields. They default to SUM() if no SQL is provided:

Simple string (defaults to SUM):

measures:
  revenue: Total revenue amount  # becomes SUM(revenue) AS revenue

Custom aggregation:

measures:
  avg_price:
    - Average unit price
    - AVG(unit_price)

Joins

Define reusable join clauses. Joins are rendered in the order they are defined under the join section, regardless of the order they are referenced by dimensions or measures.

join:
  customer_join: LEFT JOIN customers USING (customer_id)
  product_join: LEFT JOIN products USING (product_id)
  store_join: LEFT JOIN stores USING (store_id)

always_join:
  - product_join  # Always include this join, regardless of field usage

dimensions:
  customer_name:
    - Customer display name
    - customer_name
    - customer_join  # References the join above
  store_name:
    - Store location name  
    - store_name
    - store_join

Generated SQL join order:

-- Joins appear in YAML definition order: customer_join, product_join, store_join
-- Even if referenced in reverse order by dimensions/measures
FROM base_table
JOIN customers USING (customer_id)      -- First in YAML
JOIN products USING (product_id)        -- Second in YAML  
JOIN stores USING (store_id)            -- Third in YAML

This predictable ordering ensures consistent SQL generation and helps with query optimization and dependency management between joins.

Always Join

The always_join property specifies joins that should always be included in generated SQL queries, regardless of whether any dimensions or measures explicitly reference them. This is useful for joins that provide essential context, filtering, or data integrity constraints.

join:
  customer_join: LEFT JOIN customers USING (customer_id)
  product_join: LEFT JOIN products USING (product_id)
  store_join: LEFT JOIN stores USING (store_id)

always_join:
  - product_join  # Always included for data integrity
  - store_join    # Always included for consistent filtering

dimensions:
  date: Transaction date
  customer_name:
    - Customer name from customer table
    - customer_name
    - customer_join  # Only included when customer_name is used

Alternatively, specify all to always include all joins.

always_join: all

Example Query:

// This query only uses dimensions that don't require joins
const sql = renderQuery(model, {
    dimensions: ['date'],
    measures: ['count']
});

Generated SQL:

SELECT
  DATE(sale_date) AS date,
  COUNT(*) AS count
FROM acme.sales
JOIN products USING (product_id)  -- Always included
JOIN stores USING (store_id)      -- Always included
GROUP BY ALL

Use Cases:

Data Quality: Ensure referential integrity by always joining lookup tables
Security: Always include joins for row-level security filters
Business Logic: Maintain consistent business rules across all queries
Performance: Pre-join commonly used dimensions to optimize query plans

Best Practices:

Use sparingly - only for joins that are truly always needed
Consider the performance impact of additional joins on large datasets
Document why specific joins are marked as always_join in your model

Date Field Configuration

MiniML provides sophisticated date handling capabilities through three key properties: date_field, default_date_range, and include_today.

Date Field (`date_field`)

The date_field property specifies the primary date dimension used for automatic date filtering. If not explicitly defined, MiniML automatically detects it using these rules (in order):

Exact matches: "date", "timestamp"
Pattern matches: Fields ending with "date", "time", "_at", "_on", "_until", starting with "date", or containing "datetime"

# Explicit date field
date_field: order_date

dimensions:
  order_date: The date when the order was placed
  created_at: Record creation timestamp
  
# Auto-detected: MiniML will choose "order_date" (exact match)

# Auto-detection example
dimensions:
  transaction_date: Purchase date      # Will be auto-detected
  customer_id: Customer identifier
  
# No explicit date_field needed - "transaction_date" will be chosen

Default Date Range (`default_date_range`)

Automatically applies date filtering when no explicit date_from/date_to parameters are provided. Supports human-readable relative date expressions:

default_date_range: last 30 days
# Other examples:
# default_date_range: last 7 days
# default_date_range: last 12 months  
# default_date_range: last 1 year
# default_date_range: last 24 hours

Supported time units: hour/hours, day/days, week/weeks, month/months, year/years

Generated SQL Examples:

BigQuery:

WHERE order_date >= CURRENT_TIMESTAMP - INTERVAL 30 DAY

Snowflake:

WHERE order_date >= CURRENT_TIMESTAMP - INTERVAL '30 DAY'

Include Today (`include_today`)

Controls whether the current day is included in default date range filtering. Defaults to true if not specified.

default_date_range: last 7 days
include_today: false  # Exclude today from the range (default is true)

With include_today: true (default):

-- BigQuery
WHERE order_date >= CURRENT_TIMESTAMP - INTERVAL 7 DAY

-- Includes all of today's data

With include_today: false:

-- BigQuery  
WHERE order_date >= CURRENT_TIMESTAMP - INTERVAL 7 DAY 
  AND order_date < DATE_TRUNC(CURRENT_TIMESTAMP, DAY)
  
-- Excludes today's data (up to midnight)

Complete Date Configuration Example

description: E-commerce orders with 30-day default window
dialect: bigquery
from: ecommerce.orders
date_field: order_date           # Primary date field for filtering
default_date_range: last 30 days # Auto-apply 30-day lookback  
include_today: false             # Exclude today's partial data

dimensions:
  order_date:
    - Order placement date
    - DATE(order_date)
  customer_id: Customer identifier

measures:
  revenue: 
    - Total revenue
    - SUM(amount)

Query without explicit dates:

const sql = renderQuery(model, {
    dimensions: ['order_date'], 
    measures: ['revenue']
});

Generated SQL:

SELECT
  DATE(order_date) AS order_date,
  SUM(amount) AS revenue
FROM ecommerce.orders  
WHERE order_date >= CURRENT_TIMESTAMP - INTERVAL 30 DAY
  AND order_date < DATE_TRUNC(CURRENT_TIMESTAMP, DAY)
GROUP BY ALL

Query with explicit dates (overrides defaults):

const sql = renderQuery(model, {
    dimensions: ['order_date'],
    measures: ['revenue'], 
    date_from: '2024-01-01',
    date_to: '2024-01-31'
});
// Uses explicit dates instead of default_date_range

Date Granularity

When using the primary date_field as a dimension, you can apply date truncation:

const sql = renderQuery(model, {
    dimensions: ['order_date'],
    measures: ['revenue'],
    date_granularity: 'week'  // Group by week
});

Generated SQL:

-- BigQuery
SELECT
  DATE_TRUNC(DATE(order_date), WEEK) AS order_date,
  SUM(amount) AS revenue
FROM ecommerce.orders
WHERE order_date >= CURRENT_TIMESTAMP - INTERVAL 30 DAY
GROUP BY ALL

Supported granularities: MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR

SQL Dialect Detection

MiniML can automatically detect the SQL dialect from the file path:

Files containing "bigquery" use BigQuery syntax
Files containing "snowflake" use Snowflake syntax
Otherwise, specify explicitly with the dialect field

Examples:

models/bigquery/sales.yaml     # Automatically uses BigQuery dialect
models/snowflake/users.yaml    # Automatically uses Snowflake dialect  
models/sales.yaml              # Requires explicit dialect: bigquery

This means you can omit the dialect field when your file paths include the database type.

Query Generation

Query Options

The renderQuery function accepts these options:

interface MinimlQueryOptions {
    dimensions?: string[];       // Fields to group by
    measures?: string[];         // Aggregated fields to calculate
    date_from?: string | null;   // Start date filter (YYYY-MM-DD)
    date_to?: string | null;     // End date filter (YYYY-MM-DD)
    where?: string;              // Additional WHERE conditions
    having?: string;             // HAVING clause for aggregate filters
    order_by?: string[];         // Sort fields (prefix with - for DESC)
    limit?: number;              // Limit result count
    distinct?: boolean;          // Add DISTINCT to SELECT
    date_granularity?: string;   // Date truncation (day, week, month, year)
}

Date Handling

MiniML provides powerful date filtering and processing capabilities. For comprehensive documentation on date configuration, see the Date Field Configuration section.

Basic date filtering:

// Uses model's default_date_range if defined
renderQuery(model, { measures: ['count'] });


// Override with specific date range
renderQuery(model, {
    measures: ['count'],
    date_from: '2024-01-01',
    date_to: '2024-01-31'
});

// Specify a null for date_from to bypass the default_date_range
renderQuery(model, { measures: ['count'], date_from: null });

Date granularity:

// Group by week using the primary date_field
renderQuery(model, {
    dimensions: ['date'],
    measures: ['count'],
    date_granularity: 'week'
});
// Generates: DATE_TRUNC('WEEK', sale_date) AS date

Filtering

WHERE clause (for dimensions):

renderQuery(model, {
    dimensions: ['customer_id'],
    measures: ['total_amount'],
    where: "customer_type = 'premium'"
});

HAVING clause (for measures):

renderQuery(model, {
    dimensions: ['customer_id'],
    measures: ['total_amount'],
    having: "total_amount > 1000"
});

Ordering

renderQuery(model, {
    dimensions: ['customer_id'],
    measures: ['total_amount'],
    order_by: ['customer_id', '-total_amount']  // ASC, then DESC
});

API Reference

Core Functions

`loadModel(file: string): Promise<MinimlModel>`

Asynchronously loads and processes a YAML model file.

`loadModelSync(file: string): MinimlModel`

Synchronously loads and processes a YAML model file.

`renderQuery(model: MinimlModel, options: MinimlQueryOptions): string`

Generates a SQL query from a model and query options.

Interfaces

`MinimlModel`

interface MinimlModel {
    description: string;
    dialect: string;
    from: string;
    join: Record<string, string>;
    always_join?: string[];           // Joins to always include
    where: string;
    date_field?: string;              // Primary date field (auto-detected)
    default_date_range?: string;      // Default time window (e.g., "last 30 days")
    include_today?: boolean;          // Include current day in default range
    dimensions: Record<string, MinimlDef>;
    measures: Record<string, MinimlDef>;
    info: string;  // Auto-generated documentation
}

`MinimlDef`

interface MinimlDef {
    key: string;
    description: string;
    sql?: string;
    join?: string;
}

Examples

E-commerce Analytics

# ecommerce.yaml
description: E-commerce sales and customer analytics

dialect: bigquery
from: ecommerce.orders
date_field: order_date

join:
  customer_join: LEFT JOIN ecommerce.customers USING (customer_id)
  product_join: LEFT JOIN ecommerce.products USING (product_id)

where: status = 'completed'

dimensions:
  date: Order date
  customer_id: Customer identifier
  customer_email:
    - Customer email address
    - email
    - customer_join
  product_category:
    - Product category
    - category
    - product_join

measures:
  revenue: Total revenue
  order_count: 
    - Number of orders
    - COUNT(DISTINCT order_id)
  avg_order_value:
    - Average order value
    - AVG(total_amount)

Usage:

const model = await loadModel('./ecommerce.yaml');

// Revenue by category, last 7 days
const sql = renderQuery(model, {
    dimensions: ['product_category'],
    measures: ['revenue', 'order_count'],
    date_from: '2024-01-20',
    date_to: '2024-01-27',
    order_by: ['-revenue'],
    limit: 10
});

Complex Joins and Calculations

# advanced.yaml
description: Advanced analytics with multiple joins

dialect: snowflake
from: fact.sales
date_field: sale_date

join:
  customer_join: LEFT JOIN dim.customers c ON c.id = customer_id
  product_join: LEFT JOIN dim.products p ON p.id = product_id
  region_join: LEFT JOIN dim.regions r ON r.code = c.region_code

dimensions:
  customer_segment:
    - Customer segment classification
    - c.segment
    - customer_join
  product_brand:
    - Product brand name
    - p.brand
    - product_join
  region_name:
    - Geographic region
    - r.name
    - region_join

measures:
  revenue_per_customer:
    - Average revenue per customer
    - SUM(amount) / COUNT(DISTINCT customer_id)
  profit_margin:
    - Profit margin percentage
    - (SUM(amount) - SUM(cost)) / SUM(amount) * 100

Supported SQL Dialects

BigQuery

Supports GROUP BY ALL syntax
Uses DATE_TRUNC for date granularity
Compatible with BigQuery standard SQL functions

Snowflake

Supports Snowflake-specific functions
Uses DATE_TRUNC for date granularity
Compatible with Snowflake SQL syntax

More dialects coming soon!

Advanced Features

Template Variables in Info

Models support Jinja-style templating in the info section:

info: |
  ## Available Dimensions
  {%- for dimension in dimensions %}
  - `{{ dimension.key }}`: {{ dimension.description }}
  {%- endfor %}

  ## Available Measures  
  {%- for measure in measures %}
  - `{{ measure.key }}`: {{ measure.description }}
  {%- endfor %}

Filter Reference Expansion

MiniML automatically expands dimension and measure references in filter clauses:

// If you have a measure: conversion_rate: "clicks / impressions"
// And use it in a HAVING clause:
renderQuery(model, {
    having: "conversion_rate > 0.05"
});
// MiniML expands this to: HAVING (clicks / impressions) > 0.05

Validation

MiniML validates query parameters against model definitions and provides helpful error messages:

// This will throw an error with specific invalid keys
renderQuery(model, {
    dimensions: ['invalid_dimension'],
    measures: ['invalid_measure']
});
// Error: The following keys are invalid:
// - dimensions: invalid_dimension  
// - measures: invalid_measure

Security

Because MiniML allows flexible where and having clauses to be specified in native SQL format, comprehensive SQL validation is implemented to guard against SQL injection attacks for user-provided query parameters:

AST-based validation: All user-provided SQL expressions are parsed and validated using node-sql-parser
Allowlist approach: Only safe SQL constructs are permitted (comparisons, logical operators, basic functions)
Model-aware validation: Column references are validated against your model schema
Dialect-specific rules: Validation adapts to BigQuery/Snowflake syntax differences
Date input validation: Date parameters (date_from, date_to) are strictly validated using regex patterns

Security Model

Trusted Inputs: YAML model files are considered trusted input and are not validated for SQL injection. Model files should be:

Created and maintained by trusted developers
Stored securely with appropriate access controls
Reviewed before deployment to production environments
Treated as code artifacts subject to version control and security policies

Untrusted Inputs: All query parameters provided at runtime (where, having, date_from, date_to, etc.) are considered untrusted and undergo strict validation.

Safe Expression Examples

WHERE clauses:

renderQuery(model, {
    where: "account_name = 'Acme Corp'"
});

renderQuery(model, {
    where: "date >= '2024-01-01' AND category_name LIKE 'Electronics%'"
});

renderQuery(model, {
    where: "revenue > 1000 OR quantity IS NOT NULL"
});

HAVING clauses:

renderQuery(model, {
    having: "total_amount > 1000 AND order_count >= 5"
});

Blocked Constructs

For security, the following are automatically blocked:

Subqueries: SELECT, EXISTS, ANY, ALL
DDL/DML statements: DROP, ALTER, CREATE, INSERT, UPDATE, DELETE
System functions: Potentially dangerous database functions
Comments: SQL comments that could hide malicious code
Unauthorized columns: References to columns not defined in your model

Validation Errors

When unsafe expressions are detected, MiniML provides helpful error messages:

// Error: "Subqueries are not allowed in WHERE clauses. Use simple comparisons instead."
renderQuery(model, {
    where: "customer_id IN (SELECT id FROM users WHERE active = 1)"
});

// Error: "Column 'user_password' not found. Available dimensions: account_name, date, category_name"
renderQuery(model, {
    where: "user_password = 'secret'"
});

Building and Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests (requires Mocha setup)
npx mocha test/**/*.test.js

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

MIT License - see LICENSE file for details.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

MiniML

Table of Contents

Quick Start

1. Define a Model

2. Produce a description of the model

3. Generate a SQL query from the model

Installation

Programmatic Usage

Model Definition

Basic Structure

Dimensions

Measures

Joins

Always Join

Date Field Configuration

Date Field (date_field)

Default Date Range (default_date_range)

Include Today (include_today)

Complete Date Configuration Example

Date Granularity

SQL Dialect Detection

Query Generation

Query Options

Date Handling

Filtering

Ordering

API Reference

Core Functions

loadModel(file: string): Promise<MinimlModel>

loadModelSync(file: string): MinimlModel

renderQuery(model: MinimlModel, options: MinimlQueryOptions): string

Interfaces

MinimlModel

MinimlDef

Examples

E-commerce Analytics

Complex Joins and Calculations

Supported SQL Dialects

BigQuery

Snowflake

Advanced Features

Template Variables in Info

Filter Reference Expansion

Validation

Security

Security Model

Safe Expression Examples

Blocked Constructs

Validation Errors

Building and Development

Contributing

License

Date Field (`date_field`)

Default Date Range (`default_date_range`)

Include Today (`include_today`)

`loadModel(file: string): Promise<MinimlModel>`

`loadModelSync(file: string): MinimlModel`

`renderQuery(model: MinimlModel, options: MinimlQueryOptions): string`

`MinimlModel`

`MinimlDef`