# @tanstack/ai-code-mode
Code Mode for TanStack AI — let LLMs write and execute TypeScript in secure sandboxes with typed tool access.
## Overview
Code Mode gives your AI agent an `execute_typescript` tool. Instead of one tool call per action, the LLM writes a small TypeScript program that orchestrates multiple tool calls with loops, conditionals, `Promise.all`, and data transformations — all running in an isolated sandbox.
## Installation
```sh
pnpm add @tanstack/ai-code-mode
```

You also need an isolate driver:
```sh
# Node.js (fastest, uses V8 isolates via isolated-vm)
pnpm add @tanstack/ai-isolate-node

# QuickJS WASM (browser-compatible, no native deps)
pnpm add @tanstack/ai-isolate-quickjs

# Cloudflare Workers (edge execution)
pnpm add @tanstack/ai-isolate-cloudflare
```

## Quick Start
```ts
import { chat, toolDefinition } from '@tanstack/ai'
import { createCodeMode } from '@tanstack/ai-code-mode'
import { createNodeIsolateDriver } from '@tanstack/ai-isolate-node'
import { z } from 'zod'

// Define tools that the LLM can call from inside the sandbox
const weatherTool = toolDefinition({
  name: 'fetchWeather',
  description: 'Get weather for a city',
  inputSchema: z.object({ location: z.string() }),
  outputSchema: z.object({ temperature: z.number(), condition: z.string() }),
}).server(async ({ location }) => {
  // Your implementation
  return { temperature: 72, condition: 'sunny' }
})

// Create the execute_typescript tool and system prompt
const { tool, systemPrompt } = createCodeMode({
  driver: createNodeIsolateDriver(),
  tools: [weatherTool],
})

const result = await chat({
  adapter: yourAdapter,
  model: 'gpt-4o',
  systemPrompts: ['You are a helpful assistant.', systemPrompt],
  tools: [tool],
  messages: [
    { role: 'user', content: 'Compare weather in Tokyo, Paris, and NYC' },
  ],
})
```

The LLM will generate code like:
```ts
const cities = ['Tokyo', 'Paris', 'NYC']
// Carry the city name alongside each result, since the tool's output
// schema only contains temperature and condition
const results = await Promise.all(
  cities.map(async (city) => ({
    city,
    ...(await external_fetchWeather({ location: city })),
  })),
)
const warmest = results.reduce((prev, curr) =>
  curr.temperature > prev.temperature ? curr : prev,
)
return { warmestCity: warmest.city, temperature: warmest.temperature }
```

## API Reference
### createCodeMode(config)
Creates both the `execute_typescript` tool and its matching system prompt. This is the recommended entry point.
Config:

- `driver` — An `IsolateDriver` (Node, QuickJS, or Cloudflare)
- `tools` — Array of `ServerTool` or `ToolDefinition` instances. Exposed as `external_*` functions in the sandbox
- `timeout` — Execution timeout in ms (default: `30000`)
- `memoryLimit` — Memory limit in MB (default: `128`; supported by the Node and QuickJS drivers)
- `getSkillBindings` — Optional async function returning dynamic bindings
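For example, a fuller configuration might look like this. This is a sketch: the option names come from the list above, but the return shape of `getSkillBindings` is an assumption based on the `ToolBinding` records described under Advanced below.

```ts
import { createCodeMode } from '@tanstack/ai-code-mode'
import { createNodeIsolateDriver } from '@tanstack/ai-isolate-node'

const { tool, systemPrompt } = createCodeMode({
  driver: createNodeIsolateDriver(),
  tools: [weatherTool], // exposed as external_fetchWeather in the sandbox
  timeout: 10_000, // abort runs after 10s (default: 30000)
  memoryLimit: 64, // MB; enforced by the Node and QuickJS drivers
  // Assumed shape: resolve extra bindings per request (e.g. per-user skills)
  getSkillBindings: async () => ({}),
})
```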
### createCodeModeTool(config) / createCodeModeSystemPrompt(config)
Lower-level functions if you need only the tool or only the prompt. `createCodeMode` calls both internally.
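A minimal sketch of using the halves separately, assuming both functions accept the same config shape as `createCodeMode` (which calls them internally):

```ts
import {
  createCodeModeSystemPrompt,
  createCodeModeTool,
} from '@tanstack/ai-code-mode'
import { createNodeIsolateDriver } from '@tanstack/ai-isolate-node'

const config = { driver: createNodeIsolateDriver(), tools: [weatherTool] }

// Build only the execute_typescript tool...
const tool = createCodeModeTool(config)
// ...or only its matching system prompt
const systemPrompt = createCodeModeSystemPrompt(config)
```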
## Advanced
These utilities are used internally and exported for custom pipelines:
- `stripTypeScript(code)` — Strips TypeScript syntax using esbuild.
- `toolsToBindings(tools, prefix?)` — Converts tools to `ToolBinding` records for sandbox injection.
- `generateTypeStubs(bindings, options?)` — Generates TypeScript type declarations from tool bindings.
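A sketch of how these could compose in a custom pipeline. The exact signatures are not documented here; in particular, passing `'external_'` as the prefix and `stripTypeScript` being async are assumptions.

```ts
import {
  generateTypeStubs,
  stripTypeScript,
  toolsToBindings,
} from '@tanstack/ai-code-mode'

// Convert tools into sandbox-injectable bindings (assumed external_* prefix)
const bindings = toolsToBindings([weatherTool], 'external_')

// Generate TypeScript declarations to show the LLM for typed tool access
const stubs = generateTypeStubs(bindings)

// Strip type annotations from LLM-written code before executing it
const js = await stripTypeScript('const n: number = 1')
```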
## Driver Selection Guide
| Driver                            | Best For                                      | Native Deps         | Browser | Memory Limit |
| --------------------------------- | --------------------------------------------- | ------------------- | ------- | ------------ |
| `@tanstack/ai-isolate-node`       | Server-side Node.js apps                      | Yes (`isolated-vm`) | No      | Yes          |
| `@tanstack/ai-isolate-quickjs`    | Browser, edge, or no-native-dep environments  | No (WASM)           | Yes     | Yes          |
| `@tanstack/ai-isolate-cloudflare` | Cloudflare Workers deployments                | No                  | N/A     | N/A          |
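Switching drivers only changes how you construct the `driver` passed to `createCodeMode`. The QuickJS factory name below is hypothetical (only `createNodeIsolateDriver` appears in this README); check the driver package for its exact export.

```ts
// Server-side Node.js: fastest, V8 isolates via isolated-vm (native dep)
import { createNodeIsolateDriver } from '@tanstack/ai-isolate-node'

const driver = createNodeIsolateDriver()

// Browser/edge or no-native-dep environments: QuickJS compiled to WASM
// (hypothetical export name; see @tanstack/ai-isolate-quickjs)
// import { createQuickjsIsolateDriver } from '@tanstack/ai-isolate-quickjs'
// const driver = createQuickjsIsolateDriver()
```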
## Custom Events
Code Mode emits custom events during execution that you can observe via the TanStack AI event system:
| Event                          | Description                                                |
| ------------------------------ | ---------------------------------------------------------- |
| `code_mode:execution_started`  | Emitted when code execution begins                         |
| `code_mode:console`            | Emitted for each `console.log`/`error`/`warn`/`info` call  |
| `code_mode:external_call`      | Emitted before each `external_*` function call             |
| `code_mode:external_result`    | Emitted after a successful `external_*` call               |
| `code_mode:external_error`     | Emitted when an `external_*` call fails                    |
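How you subscribe depends on the TanStack AI event system in your app; the `onEvent` option below is hypothetical and only illustrates consuming the event names above.

```ts
const result = await chat({
  adapter: yourAdapter,
  model: 'gpt-4o',
  tools: [tool],
  messages,
  // Hypothetical subscription point; wire up your TanStack AI event hook here
  onEvent: (event: { type: string; [key: string]: unknown }) => {
    if (event.type === 'code_mode:console') {
      console.log('[sandbox]', event) // mirror sandbox console output
    }
    if (event.type === 'code_mode:external_error') {
      console.error('[sandbox] tool call failed', event)
    }
  },
})
```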
## Models eval (development)
The benchmark lives in a separate workspace package so `@tanstack/ai-code-mode` does not depend on `@tanstack/ai-isolate-node` (avoids an Nx build cycle). See `models-eval/package.json` (`@tanstack/ai-code-mode-models-eval`).
- `packages/typescript/ai-code-mode/models-eval/pull-models.sh` — pull recommended Ollama models
- `pnpm --filter @tanstack/ai-code-mode-models-eval eval:capture` — run models and capture raw outputs/telemetry only (no judge LLM call)
- `pnpm --filter @tanstack/ai-code-mode-models-eval eval:judge` — judge the latest captured session from logs (no model rerun)
- `pnpm --filter @tanstack/ai-code-mode-models-eval eval` — single-pass run+judge (legacy convenience mode)
- `pnpm --filter @tanstack/ai-code-mode-models-eval eval -- --ollama-only` — only Ollama models from `eval-config.ts`
- `pnpm --filter @tanstack/ai-code-mode-models-eval eval -- --ollama-only --models qwen3-coder` — one or more model ids (comma-separated)
Judge-phase flags:
- `--judge-latest` — judge the latest captured session
- `--rejudge` — re-run judging even if logs already contain judge fields
The default list omits some small Ollama models that rarely complete code-mode successfully (see comments in `eval-config.ts`). You can still benchmark them with `--models granite4:3b` etc. if they are pulled locally.
### Model comparison metrics
The models eval now tracks seven decision-oriented metrics plus an overall rating:
- `accuracy` (1-10): numerical/factual correctness vs the gold report
- `comprehensiveness` (1-10): whether the response covers everything requested by the user query
- `typescriptQuality` (1-10): quality/readability/type-safety of the generated TypeScript
- `codeModeEfficiency` (1-10): how efficiently the model uses code-mode/tooling to reach the answer
- `speedTier` (1-5): relative wall-clock speed against peers in the same category (`local` or `cloud`)
- `tokenEfficiencyTier` (1-5): relative token efficiency against peers in the same category
- `stabilityTier` (1-5): success consistency over the latest 5 logged runs for that model
- `stars` (1-3): weighted rollup score across all metrics
Raw run telemetry also includes compile/runtime failures, redundant schema checks, total tool calls, TTFT (time to first token), token totals, stability sample size/rate, and per-model logs.
### Methodology
Canonical output is written to `packages/typescript/ai-code-mode/models-eval/results.json` after each capture or judge run.
- Benchmark: single code-mode benchmark prompt over the in-memory `customers`/`products`/`purchases` dataset
- Primary quality scores (judge): `accuracy`, `comprehensiveness`, `typescriptQuality`, `codeModeEfficiency`
- Computed comparative scores: `speedTier`, `tokenEfficiencyTier`, `stabilityTier`
- Stability definition: a run is "stable" if it has no top-level run error, produces a non-empty candidate report, and has at least one successful `execute_typescript` call
- Star rollup weights (see the sketch after this list):
  - accuracy: 25%
  - comprehensiveness: 15%
  - typescriptQuality: 15%
  - codeModeEfficiency (with compile/runtime failure penalty): 10%
  - speedTier: 10%
  - tokenEfficiencyTier: 10%
  - stabilityTier: 15%
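As a concrete illustration of the rollup arithmetic: the weights come from the list above, but the 0-1 normalization and the star-bucket thresholds are assumptions; the real logic (including the failure penalty) lives in the eval code.

```ts
// Illustrative only: assumes 1-10 scores and 1-5 tiers are normalized to 0-1
// before applying the documented weights; actual thresholds may differ.
function rollupStars(m: {
  accuracy: number // 1-10
  comprehensiveness: number // 1-10
  typescriptQuality: number // 1-10
  codeModeEfficiency: number // 1-10, already penalized for compile/runtime failures
  speedTier: number // 1-5
  tokenEfficiencyTier: number // 1-5
  stabilityTier: number // 1-5
}): 1 | 2 | 3 {
  const s10 = (x: number) => x / 10
  const s5 = (x: number) => x / 5
  const weighted =
    0.25 * s10(m.accuracy) +
    0.15 * s10(m.comprehensiveness) +
    0.15 * s10(m.typescriptQuality) +
    0.1 * s10(m.codeModeEfficiency) +
    0.1 * s5(m.speedTier) +
    0.1 * s5(m.tokenEfficiencyTier) +
    0.15 * s5(m.stabilityTier)
  // Assumed bucketing of the 0-1 weighted score into 1-3 stars
  return weighted >= 0.8 ? 3 : weighted >= 0.6 ? 2 : 1
}
```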
### Model comparison table
The table below is transcribed from the canonical `models-eval/results.json` (session 2026-03-26T15:38:44.006Z).
| Provider | Model | Category | Stars | Accuracy | Comprehensiveness | TypeScript | Code-Mode | Speed Tier | Token Tier | Stability Tier |
| --------- | ----------------------------- | -------- | ----- | -------- | ----------------- | ---------- | --------- | ---------- | ---------- | -------------- |
| Ollama | gpt-oss:20b | local | ★★★ | 10 | 8 | 5 | 5 | 5 | 5 | 5 |
| Ollama | nemotron-cascade-2 | local | ★★☆ | 3 | 5 | 6 | 5 | 1 | 5 | 5 |
| Anthropic | claude-haiku-4-5 | cloud | ★★★ | 10 | 10 | 6 | 7 | 3 | 2 | 5 |
| OpenAI | gpt-4o-mini | cloud | ★★★ | 10 | 8 | 7 | 9 | 3 | 1 | 5 |
| Gemini | gemini-2.5-flash | cloud | ★★★ | 10 | 8 | 7 | 10 | 4 | 2 | 5 |
| xAI | grok-4-1-fast-non-reasoning | cloud | ★★★ | 10 | 8 | 6 | 10 | 4 | 5 | 5 |
| Groq | llama-3.3-70b-versatile | cloud | ★★★ | 10 | 7 | 6 | 9 | 5 | 3 | 4 |
| Groq | qwen/qwen3-32b | cloud | ★★☆ | 10 | 8 | 5 | 4 | 1 | 2 | 5 |
Suggested interpretation:

- Local-first: favor `stars >= 2` with a high speed tier.
- Cloud-first quality: favor high `accuracy` + `typescriptQuality`, then compare stars.
- Cost-sensitive: prioritize `tokenEfficiencyTier` and `speedTier` together.
## License
MIT
