lakecode v0.1.12
lakecode
Databricks-native AI CLI agent. Talk to your lakehouse — query data, debug jobs, manage permissions, deploy assets — all from your terminal.
Built on the Claude Agent SDK, lakecode wraps every Databricks operation in a safety-classified MCP tool layer with confirmation gates, policy enforcement, and full audit trails.
Quick Start
```shell
# Install
npm install -g lakecode

# Authenticate
lakecode auth login --host https://your-workspace.cloud.databricks.com

# Start chatting
lakecode chat
```

```
❯ show me the top 10 tables by size in the analytics schema
❯ why did job 12345 fail last night?
❯ /prove main.analytics.daily_revenue
❯ /cost top --days 7
```

Features
- Natural language SQL — ask questions, get results with fully-qualified Unity Catalog names
- 18 deterministic workflows — `/debug`, `/prove`, `/audit`, `/cost`, `/uc`, and more
- Safety-first — every tool call classified as READ_ONLY / WRITE_REMOTE / DESTRUCTIVE with confirmation gates
- Context-aware — workspace profiler injects catalog metadata, warehouse info, and function signatures into every turn
- Knowledge injection — 25 Databricks skills loaded on demand via keyword matching from the AI Dev Kit
- Session management — `--continue` / `--resume <id>` to pick up where you left off
- Mission Control — full-screen TUI dashboard for ops monitoring
- Policy engine — YAML-based rules for compliance enforcement
- Evidence packs — every workflow run produces a timestamped audit trail in `~/.lakecode/runs/`
Installation
```shell
npm install -g lakecode
```

Requirements: Node.js >= 18, Databricks CLI installed and on PATH.
Authentication
lakecode uses the Databricks CLI's authentication under the hood. Set up once:
```shell
# OAuth browser flow (recommended)
lakecode auth login --host https://your-workspace.cloud.databricks.com

# Or use an existing profile from ~/.databrickscfg
lakecode chat --profile STAGING

# Check auth status
lakecode auth status
```

Supports OAuth (U2M), PAT tokens, and Azure/GCP service principal auth — anything the Databricks CLI supports.
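Under the hood, the Databricks CLI stores profiles as INI-style sections in `~/.databrickscfg`. As a rough illustration of what profile resolution involves (lakecode delegates this to the Databricks CLI itself; the parser below is a simplified sketch, not lakecode code):

```typescript
// Minimal parser for the ~/.databrickscfg format: [PROFILE] sections
// followed by "key = value" lines. Illustrative only.
function parseDatabricksCfg(text: string): Record<string, Record<string, string>> {
  const profiles: Record<string, Record<string, string>> = {};
  let current = "";
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (!line || line.startsWith("#") || line.startsWith(";")) continue;
    const section = line.match(/^\[(.+)\]$/);
    if (section) {
      current = section[1];
      profiles[current] = {};
      continue;
    }
    const eq = line.indexOf("=");
    if (eq > 0 && current) {
      profiles[current][line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
    }
  }
  return profiles;
}

const cfg = parseDatabricksCfg(
  "[STAGING]\nhost = https://staging.cloud.databricks.com\ntoken = dapi123\n",
);
console.log(cfg.STAGING.host); // https://staging.cloud.databricks.com
```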
CLI Commands
lakecode chat (default)
Interactive REPL session with the AI agent.
```shell
lakecode chat [options]
```

| Flag | Description |
|------|-------------|
| --profile <name> | Databricks config profile |
| --target <name> | Bundle target (dev/staging/prod) |
| --continue | Resume most recent session |
| --resume <id> | Resume a specific session by ID |
| -p, --prompt <text> | Send initial prompt on startup |
| --approve <level> | Auto-approve level: read, write, or destructive |
| --compliance | Enable compliance mode (policy deny overrides --approve) |
| --dry-run | Show commands without executing |
| --verbose | Show raw CLI commands and LLM traffic |
lakecode run <prompt>
Single-shot execution — run a prompt non-interactively and exit.
```shell
lakecode run "list all tables in main.analytics" --output json
lakecode run workflow debug_job --input params.json --output md
```

| Flag | Description |
|------|-------------|
| --output <format> | Output format: text, json, stream-json, or md (default: text) |
| --approve <level> | Auto-approve level (default: read) |
| --session-id <id> | Session ID for multi-turn continuity |
| --profile, --verbose | Same as chat |
lakecode mc
Mission Control — full-screen ops dashboard with job monitoring, alerts, and watch subscriptions.
```shell
lakecode mc --profile PROD
```

lakecode auth
Manage Databricks authentication.
```shell
lakecode auth status     # Show current auth
lakecode auth profiles   # List all profiles
lakecode auth login      # OAuth browser flow
lakecode auth logout     # Clear cached tokens
```

lakecode config
Manage lakecode configuration.
```shell
lakecode config init   # Interactive setup wizard
lakecode config show   # Print resolved config
```

Slash Commands
In chat mode, type / to see autocomplete suggestions.
Workspace & Navigation
| Command | Description |
|---------|-------------|
| /help | Show available commands |
| /clear | Clear conversation history |
| /context | Show current context window usage |
| /exit | Exit the session |
Databricks Operations
| Command | Description |
|---------|-------------|
| /debug job <id> | Multi-step job failure diagnosis with root cause analysis |
| /prove <table> | Data quality analysis — row counts, nulls, distributions, anomalies |
| /audit jobs | Comprehensive job audit with risk assessment across all jobs |
| /cost top | Top spend analysis by SKU, identity, and job |
| /cost spike | Cost anomaly detection — find unexpected spending |
Unity Catalog Governance
| Command | Description |
|---------|-------------|
| /uc explain-access <principal> <object> | Privilege graph + effective access explanation |
| /uc diff-grants <object> --to <desired.yml> | Diff current vs desired grants |
| /uc apply-grants --plan <planId> | Execute a reviewed grant plan |
| /uc export-grants <object> | Export current grants as canonical YAML |
Asset Bundle Management
| Command | Description |
|---------|-------------|
| /capture job <id> | Extract job config into a Databricks Asset Bundle |
| /capture pipeline <id> | Extract pipeline config into a bundle |
| /drift detect | Compare bundle definition vs live workspace state |
| /deploy | Deploy bundle with preflight checks and verification |
Monitoring
| Command | Description |
|---------|-------------|
| /watch | Start watching a job/query/table on an interval |
| /watch list | List active watch subscriptions |
| /watch stop | Stop a watch subscription |
| /runs prune | Clean up old evidence pack runs |
MCP Tools
lakecode exposes 7 built-in tools via the Model Context Protocol:
| Tool | Description | Safety |
|------|-------------|--------|
| list_catalogs | List Unity Catalog catalogs | READ_ONLY |
| list_schemas | List schemas in a catalog | READ_ONLY |
| list_tables | List tables in a schema | READ_ONLY |
| describe_table | Column types, properties, storage info | READ_ONLY |
| sql_execute | Run any SQL statement | Dynamic |
| batch_sql | Execute multiple SQL statements or files | Dynamic |
| databricks_cli | Run any Databricks CLI command or REST API call | Dynamic |
Dynamic tools are classified per-invocation based on the SQL statement or CLI command being run.
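The per-invocation classification can be pictured as a keyword match on the head of the statement. A minimal sketch of the idea (not lakecode's actual classifier — its real rules also cover CLI and REST patterns, and the function name here is illustrative):

```typescript
type SafetyLevel = "READ_ONLY" | "WRITE_REMOTE" | "DESTRUCTIVE";

// Hypothetical classifier: inspect the leading keyword of a SQL statement.
function classifySql(sql: string): SafetyLevel {
  const head = sql.trim().split(/\s+/)[0]?.toUpperCase() ?? "";
  if (["SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH"].includes(head)) {
    return "READ_ONLY";
  }
  if (["DROP", "DELETE", "TRUNCATE"].includes(head)) {
    return "DESTRUCTIVE";
  }
  // CREATE, INSERT, GRANT, ALTER, UPDATE, ... fall through to a gated write.
  return "WRITE_REMOTE";
}

console.log(classifySql("SELECT * FROM main.analytics.orders")); // READ_ONLY
console.log(classifySql("DROP TABLE main.analytics.tmp"));       // DESTRUCTIVE
```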
External MCP Server
lakecode can optionally connect to the official databricks-mcp-server for additional tools (dashboards, pipelines, clusters, etc.):
```yaml
# ~/.lakecode/config.yml
external_mcp:
  enabled: true
  command: uvx
  args: ["databricks-mcp-server@latest"]
```

Safety Model
Every tool invocation is classified into one of three levels:
| Level | Examples | Behavior |
|-------|----------|----------|
| READ_ONLY | SELECT, SHOW, DESCRIBE, list, get | Auto-approved by default |
| WRITE_REMOTE | CREATE TABLE, INSERT, GRANT, jobs create | Requires confirmation |
| DESTRUCTIVE | DROP TABLE, DELETE, jobs delete, clusters delete | Requires explicit confirmation |
Confirmation Gates
```
Databricks ⚠ WRITE_REMOTE — CREATE TABLE main.staging.dim_users ...
Allow? (y/n/always)
```

Override with `--approve`:

- `--approve read` — auto-approve READ_ONLY (default)
- `--approve write` — auto-approve READ_ONLY + WRITE_REMOTE
- `--approve destructive` — auto-approve everything (use with caution)
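The gate reduces to an ordering over the three safety levels. A hedged sketch of the decision, assuming a simple rank comparison (names are illustrative, not lakecode internals):

```typescript
type SafetyLevel = "READ_ONLY" | "WRITE_REMOTE" | "DESTRUCTIVE";
type ApproveLevel = "read" | "write" | "destructive";

// Rank each level; --approve auto-approves everything at or below its rank.
const rank: Record<SafetyLevel, number> = {
  READ_ONLY: 0,
  WRITE_REMOTE: 1,
  DESTRUCTIVE: 2,
};
const threshold: Record<ApproveLevel, number> = {
  read: 0,
  write: 1,
  destructive: 2,
};

// true => run without prompting; false => show the Allow? (y/n/always) gate
function autoApproved(level: SafetyLevel, approve: ApproveLevel): boolean {
  return rank[level] <= threshold[approve];
}

console.log(autoApproved("WRITE_REMOTE", "read"));  // false: gate shown
console.log(autoApproved("WRITE_REMOTE", "write")); // true: runs directly
```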
Policy Engine
Define organization-wide rules in ~/.lakecode/policy.yml:
```yaml
rules:
  - name: no-production-drops
    match:
      tool: sql_execute
      statement: "DROP.*production\\."
    action: deny
    message: "Dropping production tables is not allowed"

  - name: require-where-on-delete
    match:
      tool: sql_execute
      statement: "^DELETE FROM(?!.*WHERE)"
    action: deny
    message: "DELETE without WHERE clause is not allowed"
```

Enable compliance mode to make policy denials non-overridable:

```shell
lakecode chat --compliance
```

Workflows
Workflows are deterministic multi-step pipelines that combine API calls, SQL queries, and LLM analysis. Each produces a timestamped evidence pack in ~/.lakecode/runs/.
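Conceptually, a workflow is an ordered list of named steps whose outputs are appended to an evidence record as they run. A minimal sketch under that assumption (the real engine also handles LLM steps, inputs, and retries; the shapes and names here are invented for illustration):

```typescript
type Step = { name: string; run: (ctx: Record<string, unknown>) => unknown };

// Run steps in order, recording each result with a timestamp — a stand-in
// for the evidence pack written under ~/.lakecode/runs/.
function runWorkflow(steps: Step[], ctx: Record<string, unknown>) {
  const evidence: { step: string; at: string; result: unknown }[] = [];
  for (const step of steps) {
    const result = step.run(ctx);
    evidence.push({ step: step.name, at: new Date().toISOString(), result });
  }
  return evidence;
}

const pack = runWorkflow(
  [
    { name: "fetch_config", run: (ctx) => ({ job_id: ctx.job_id }) },
    { name: "diagnose", run: () => "summary" },
  ],
  { job_id: "12345" },
);
console.log(pack.map((e) => e.step).join(" → ")); // fetch_config → diagnose
```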
| ID | Name | Steps |
|----|------|-------|
| debug_job | Debug Job | Fetch config → get runs → get output → get logs → LLM diagnosis |
| prove_table | Prove Table | Describe → sample → stats → nulls → duplicates → LLM assessment |
| audit_jobs | Audit Jobs | List jobs → get runs → check schedules → LLM risk analysis |
| cost_top | Cost Top | Query billing → group by SKU/identity/job → LLM insights |
| cost_spike | Cost Spike | Query billing history → detect anomalies → LLM explanation |
| genai_cost_agent | GenAI Cost Agent | Multi-turn tool-use loop over 10 GenAI cost functions |
| uc_explain_access | UC Explain Access | Build privilege graph → compute effective access → LLM summary |
| uc_diff_grants | UC Diff Grants | Fetch current → load desired → compute diff → generate plan |
| uc_apply_grants | UC Apply Grants | Load plan → snapshot before → execute → snapshot after |
| capture_job | Capture Job | Fetch config → generate bundle YAML → write files |
| capture_pipeline | Capture Pipeline | Fetch pipeline → generate bundle YAML → write files |
| drift_detect | Drift Detect | Read bundle → fetch live → diff → report |
| bundle_deploy | Bundle Deploy | Preflight checks → deploy → verify |
| job_status | Job Status | Fetch job config → get recent runs |
| job_logs | Run Logs | Fetch run output → LLM diagnosis |
| run_job | Run Job | Trigger run → poll for completion |
| deploy_file | Deploy File | Validate → import to workspace |
| query_history | Query History | Resolve run → fetch SQL history |
Running workflows programmatically
```shell
# Via slash command
/debug job 12345

# Via CLI
lakecode run workflow debug_job --input '{"job_id": "12345"}' --output json
```

Configuration
lakecode reads config from (in order of precedence):
- CLI flags
- `.lakecode/config.yml` (project-level)
- `~/.lakecode/config.yml` (global)
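The precedence order amounts to a merge where later sources override earlier ones. A sketch assuming a plain shallow merge (the real loader validates against a Zod schema; `resolveConfig` is an illustrative name):

```typescript
type Config = Record<string, string>;

// Merge global file < project file < CLI flags; later entries win.
function resolveConfig(globalCfg: Config, projectCfg: Config, flags: Config): Config {
  return { ...globalCfg, ...projectCfg, ...flags };
}

const resolved = resolveConfig(
  { profile: "DEFAULT", target: "dev" }, // ~/.lakecode/config.yml
  { target: "staging" },                 // .lakecode/config.yml
  { profile: "PROD" },                   // --profile PROD
);
console.log(resolved); // { profile: 'PROD', target: 'staging' }
```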
Full config reference
```yaml
# ~/.lakecode/config.yml
databricks:
  profile: DEFAULT                # Databricks CLI profile
  target: dev                     # Bundle target
  warehouse_id: abc123            # Default SQL warehouse
  default_catalog: main           # Default catalog
  default_schema: default         # Default schema

agent:
  max_turns: 50                   # Max agent loop iterations
  max_tokens_per_response: 32768  # Max tokens per LLM response
  temperature: 0                  # LLM temperature
  context_window_budget: 100000   # Budget for auto-compaction
  system_prompt_extra: ""         # Additional system prompt text

safety:
  auto_approve: []                # Tool patterns to auto-approve
  require_confirm: []             # Tool patterns requiring confirmation
  blocked: []                     # Tool patterns blocked from execution

external_mcp:
  enabled: true                   # Enable databricks-mcp-server
  command: uvx
  args: ["databricks-mcp-server@latest"]

sessions:
  dir: ~/.lakecode/sessions       # Session storage
  retention_days: 30

runs:
  dir: ~/.lakecode/runs           # Evidence pack storage
  retention_days: 30

policy:
  global_path: ~/.lakecode/policy.yml
  compliance: false               # Enable compliance mode

watch:
  default_interval_sec: 60
  max_subscriptions: 20
  subscriptions_path: ~/.lakecode/watch.yml
```

User Conventions
Add SQL and coding conventions that the agent will follow:
```shell
# Global conventions
echo "Always use UPPERCASE SQL keywords" > ~/.lakecode/conventions.md

# Project-level conventions
echo "Prefer CTEs over subqueries" > .lakecode/conventions.md
```

Architecture
```
src/
├── bin/        # CLI entry point
├── cli/
│   ├── commands/   # chat, run, mc, auth, config
│   └── ui/         # Ink (React) terminal components
├── config/     # Zod schema, config loader
├── context/    # Workspace profiler, skill router, conventions
├── mcp/        # MCP server, tool safety classification
├── prompts/    # System prompt construction
├── tools/      # Tool definitions and registry
├── uc/         # Unity Catalog governance (grants, diff, plans)
└── workflows/  # Workflow engine, 18 registered workflows
```

Knowledge Injection Pipeline
On each conversation turn, lakecode builds context:
- Workspace profiler — cached metadata (~200 tokens): catalogs, schemas, table counts, warehouses, functions
- Skill router — keyword-matches the user's message against 25 skills, loads top 1-2 per turn
- Skills library — 25 skills from the official Databricks AI Dev Kit
- User conventions — merged from global + project conventions files
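The skill router step can be pictured as keyword scoring over the user's message, keeping only the top matches. A hedged sketch of that idea (skill names and keywords here are invented; lakecode's real matching may differ):

```typescript
type Skill = { name: string; keywords: string[] };

// Score each skill by how many of its keywords appear in the message,
// then load the top-k scorers (lakecode loads 1-2 per turn).
function routeSkills(message: string, skills: Skill[], k = 2): string[] {
  const text = message.toLowerCase();
  return skills
    .map((s) => ({ s, score: s.keywords.filter((kw) => text.includes(kw)).length }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.s.name);
}

const skills: Skill[] = [
  { name: "jobs-debugging", keywords: ["job", "run", "fail"] },
  { name: "unity-catalog", keywords: ["grant", "catalog", "schema"] },
  { name: "cost-analysis", keywords: ["cost", "spend", "billing"] },
];
console.log(routeSkills("why did job 12345 fail last night?", skills));
// [ 'jobs-debugging' ]
```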
Development
```shell
# Clone
git clone https://github.com/lakeside-analytics/lakecode.git
cd lakecode

# Install dependencies
npm install

# Run in development mode
npm run dev

# Run tests (674 tests across 41 files)
npm test

# Build
npm run build

# Type check
npx tsc --noEmit
```

Test Suite
41 test files, 674 tests
Key test areas:
- System prompt regression (48 tests) — behavioral contracts + snapshot
- MCP tool safety classification (110 tests) — SQL, CLI, REST patterns
- Workflow step execution (29 tests) — mocked CLI, real execute() logic
- Config schema validation, policy engine, watch system, UC governance
- Markdown rendering edge cases, session management, input sanitization