@sudhanshu1402/keel
v1.0.0
Durable execution for TypeScript AI agents. Local-first, crash-proof, zero-cloud.
keel
Durable execution for TypeScript. Zero build step, zero lock-in, and a dashboard that runs on your laptop. Scales from a single JSON file to a SQLite-backed multi-worker pool. No cloud, no account, no cost.

What is this?
keel keeps a program running correctly even when it crashes partway through.
Say your code runs three things in order: charge a card, reserve stock, ship the order. If the process dies after the charge but before the ship, restarting it the naive way charges the card a second time. The same trap hits AI agents: crash mid-run and on restart you re-call the model and pay for the tokens all over again.
keel removes that risk. You write the work as named steps. When a step completes, keel saves its result; on restart it hands back the saved result instead of running the step again, so the run continues from where it failed. A run that died at step 7 picks up at step 7, with steps 1 through 6 served from their saved results rather than re-run.
keel is at-least-once, not exactly-once: if a crash lands in the narrow window after a step's side effect ran but before its result was saved, that one step re-runs on restart. That is true of every durable engine. keel gives each step an idempotency key (and you can supply your own) so you can make a repeat safe; see Delivery guarantee. What it buys you is that steps that already finished are never redone.
That idea is called durable execution. The project's demo shows it: run 1 charges the card and crashes, run 2 resumes and ships without re-charging.
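To make the save-and-replay idea concrete, here is a minimal synchronous sketch. It is not keel's implementation: the `step` helper, the `StepLog` map, and the simulated crash are all hypothetical, and a plain `Map` stands in for the durable store.

```typescript
// Hypothetical illustration of step memoization; not keel's actual code.
type StepLog = Map<string, unknown>;

// Run `fn` only if nothing is saved under `stepName`; otherwise replay the saved result.
function step<T>(log: StepLog, stepName: string, fn: () => T): T {
  if (log.has(stepName)) return log.get(stepName) as T;
  const result = fn();
  log.set(stepName, result); // a real engine would persist this before moving on
  return result;
}

let charges = 0;
let crashOnce = true;

function orderRun(log: StepLog): string {
  const chargeId = step(log, "charge", () => `ch_${++charges}`);
  step(log, "reserve", () => {
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated crash");
    }
    return "reserved";
  });
  return step(log, "ship", () => `shipped with ${chargeId}`);
}

const stepLog: StepLog = new Map(); // stands in for the durable store
let firstAttemptCrashed = false;
try {
  orderRun(stepLog); // run 1: charges, then dies in "reserve"
} catch {
  firstAttemptCrashed = true;
}
const outcome = orderRun(stepLog); // run 2: "charge" is replayed, not re-run
```

The second call finishes the order while the charge function has run only once, which is the behaviour the section describes.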
Why keel
Temporal, Inngest, DBOS, and Vercel's Workflow SDK all do durable execution. They also pull in a server, a hosted control plane, a database, or a build-time compiler transform. keel is the same core idea in a few hundred lines of dependency-free TypeScript, with the trade made in the opposite direction:
- No build step. Vercel's Workflow SDK is built on "use workflow"/"use step" directives that only work inside their bundler; you cannot run those files in plain Node. keel is ordinary ctx.step() method calls. It runs anywhere Node runs, with no transform in the path.
- Local observability. npx keel dashboard serves a zero-config run dashboard from your own store. Every run, every step, token counts, errors, Resume and Send-signal buttons, on http://127.0.0.1. The hosted engines put this in the cloud.
- Zero lock-in. One tier, MIT, no account, no usage pricing. Your run history is a JSON file or a SQLite database you own.
- Trivial testing. In-memory store plus a mock provider plus an injectable clock means a durable workflow test needs zero infrastructure and runs deterministically in milliseconds. Crash-and-resume is a unit test.
- Zero runtime dependencies in the core. package.json has no dependencies field at all.
A full, honest side-by-side is in docs/COMPARISON.md, and what keel deliberately does not do is in docs/LIMITATIONS.md.
| | keel | Vercel Workflow | Temporal |
|---|---|---|---|
| Build step required | no | yes | no |
| Runs in plain Node | yes | no | yes |
| Local dashboard, zero config | yes | cloud only | run a server |
| Account to start | no | yes | no |
| Core runtime deps | zero | bundler + runtime | server + client |
| Store | memory / JSON / SQLite | managed | DB cluster |
Install
npm install @sudhanshu1402/keel
Quick start
import { Keel, defineWorkflow } from '@sudhanshu1402/keel';
const order = defineWorkflow<{ id: string }, { shipped: boolean }>(
'order',
async (ctx, input) => {
const charge = await ctx.step('charge', () => chargeCard(input.id));
await ctx.step('reserve', () => reserveInventory(input.id));
return ctx.step('ship', () => ship(input.id, charge));
},
);
const keel = new Keel(); // in-memory store by default
const result = await keel.run(order, { id: 'A-1001' });
console.log(result.status); // "completed"
Each ctx.step(name, fn) runs fn once, persists the result under name, and on any later execution of the same run returns the stored result without calling fn again. That is the entire durability contract.
Crash recovery
Use the file store to survive a process restart, then resume by run id:
import { Keel, FileStore, defineWorkflow } from '@sudhanshu1402/keel';
const keel = new Keel({ store: new FileStore('keel-data/orders.json') });
const result = await keel.run(order, { id: 'A-1001' });
// ...process crashes after the charge step...
// In a new process, with the workflow registered:
keel.register(order);
const resumed = await keel.resume(result.runId);
// charge step is replayed from disk; the card is not charged twice
A runnable version is in examples/order-workflow.ts.
Durable sleep and human-in-the-loop signals
ctx.waitForSignal(name) pauses a run until an outside event arrives. The run is
suspended in the store the whole time it waits, so the process can exit and come
back, then resume when the signal is delivered.
import { Keel, FileStore, Supervisor, defineWorkflow } from '@sudhanshu1402/keel';
const approval = defineWorkflow<{ docId: string }, string>(
'approval',
async (ctx, input) => {
await ctx.step('submit', () => submitForReview(input.docId));
const decision = await ctx.waitForSignal<string>('decision'); // pauses here
return ctx.step('finalize', () => finalize(input.docId, decision));
},
);
const store = new FileStore('keel-data/approvals.json');
const keel = new Keel({ store });
const supervisor = new Supervisor(keel, store); // wakes signal-ready runs
supervisor.start();
const { runId } = await keel.run(approval, { docId: 'D-7' });
// ...later, when a human clicks Approve...
await keel.sendSignal(runId, 'decision', 'approved'); // resumes the run
ctx.sleep(name, ms) is a delay. By default (durableTimers off) it simply
blocks the current process, which is right for single-shot scripts and tests. To
make a sleep survive a restart, construct the engine with
new Keel({ store, durableTimers: true }): the run then suspends durably and a
running Supervisor wakes it once the timer is due, even across a process exit.
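One way to picture a sleep that survives a restart: store an absolute wake-up time instead of a running countdown, and let a poller pick up whatever is due. The sketch below assumes that design; `TimerTable`, `suspend`, `takeDue`, and the JSON round-trip are hypothetical names for illustration, not keel's Supervisor or store API.

```typescript
// Hypothetical sketch of durable timers; not keel's Supervisor implementation.
interface SleepingRun {
  runId: string;
  timerName: string;
  wakeAt: number; // absolute epoch ms, so a restart does not reset the wait
}

class TimerTable {
  private sleeping: SleepingRun[] = [];

  suspend(runId: string, timerName: string, now: number, ms: number): void {
    this.sleeping.push({ runId, timerName, wakeAt: now + ms });
  }

  // What a supervisor poll might do: hand back the runs that are due and forget them.
  takeDue(now: number): SleepingRun[] {
    const due = this.sleeping.filter((run) => run.wakeAt <= now);
    this.sleeping = this.sleeping.filter((run) => run.wakeAt > now);
    return due;
  }

  serialize(): string {
    return JSON.stringify(this.sleeping);
  }

  static restore(json: string): TimerTable {
    const restored = new TimerTable();
    restored.sleeping = JSON.parse(json) as SleepingRun[];
    return restored;
  }
}

// Process 1 suspends the run at t=1000 for 60 seconds, then exits.
const beforeExit = new TimerTable();
beforeExit.suspend("run-1", "cooldown", 1_000, 60_000);
const saved = beforeExit.serialize();

// Process 2 starts later and polls against the same saved state.
const afterRestart = TimerTable.restore(saved);
const tooEarly = afterRestart.takeDue(30_000);
const onTime = afterRestart.takeDue(61_000);
const afterwards = afterRestart.takeDue(70_000);
```

Because only `wakeAt` is persisted, the second process needs no knowledge of how long the first one had already waited.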
Reordering or renaming steps between runs is caught by a divergence guard that
throws DivergenceError.
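A divergence guard can be as simple as comparing each requested step name against what was recorded at the same position. This is an assumed mechanism, sketched with hypothetical names (`makeReplayChecker`, `SketchDivergenceError`); keel's own guard and its DivergenceError may work differently.

```typescript
// Hypothetical sketch of a replay-divergence check; not keel's implementation.
class SketchDivergenceError extends Error {}

function makeReplayChecker(recordedSteps: readonly string[]): (stepName: string) => void {
  let position = 0;
  return (stepName: string): void => {
    if (position < recordedSteps.length && recordedSteps[position] !== stepName) {
      throw new SketchDivergenceError(
        `position ${position}: recorded "${recordedSteps[position]}", code requested "${stepName}"`,
      );
    }
    position += 1; // past the recorded history, any name is new work
  };
}

// The stored run recorded "submit" then "finalize".
const check = makeReplayChecker(["submit", "finalize"]);
check("submit"); // matches the recorded history

let diverged = false;
try {
  check("archive"); // the code was edited: "finalize" was renamed
} catch (err) {
  diverged = err instanceof SketchDivergenceError;
}
```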
The dashboard
Point the CLI at your store and open the dashboard. No config, no account.
npx keel dashboard --store keel-data/orders.json
# keel dashboard on http://127.0.0.1:4500
It is a zero-dependency node:http server that reads the same store your app
writes to: a run list, per-run step timelines with status, token counts and
errors, plus Resume and Send-signal actions. You can also embed it:
import { startDashboard } from '@sudhanshu1402/keel';
const { port } = await startDashboard({ store, port: 4500 });
CLI
keel runs # list runs in the store
keel inspect <runId> # show a run and its steps
keel resume <runId> # requeue a failed or paused run for a Worker
keel cancel <runId> # cancel a run
keel signal <runId> <name> [json] # deliver a signal
keel dashboard [--port <n>] # serve the dashboard
# any command takes --store <file.json> or --db <file.sqlite>
LLM steps
ctx.llm is a step backed by a provider, with token capture. The call is memoized like any step, so a resume never repays for a completion already received.
import { Keel, OllamaProvider, defineWorkflow } from '@sudhanshu1402/keel';
const research = defineWorkflow<{ topic: string }, string>(
'research',
async (ctx, input) => {
const plan = await ctx.llm('plan', {
prompt: `Three research questions about ${input.topic}`,
});
const summary = await ctx.llm('summary', {
prompt: `Write a briefing on ${input.topic}:\n${plan.text}`,
});
return summary.text;
},
);
const keel = new Keel({ provider: new OllamaProvider() });
const result = await keel.run(research, { topic: 'durable execution' });
Ollama runs locally and free (ollama serve && ollama pull llama3.2). No Ollama?
The example falls back to a mock provider: npx tsx examples/agent-research.ts "durable execution" --mock.
Example OpenAI and Anthropic providers live in examples/providers.
Durable agents
defineAgent wraps a multi-turn tool-calling loop where every model turn is a
ctx.llm and every tool call is a ctx.step. The whole loop is memoized, so an
agent that crashes mid-tool-call resumes without re-calling the model for turns
it already took.
import { defineAgent } from '@sudhanshu1402/keel';
const agent = defineAgent('researcher', {
tools: [
{ name: 'search', description: 'web search', run: async ({ q }) => search(q) },
{ name: 'fetch', description: 'read a url', run: async ({ url }) => fetch(url) },
],
maxTurns: 8,
});
const result = await keel.run(agent, { prompt: 'Summarize durable execution' });
if (result.status === 'completed' && result.output?.stopReason === 'final') {
console.log(result.output.answer);
}
The model replies with one JSON object per turn, either {"tool": "...", "args":
{...}} or {"final": "..."}. The loop always terminates, and the result carries
a stopReason of 'final', 'max_turns', or 'budget' (set maxTurns,
maxTokens, or maxCostUsd). Only 'final' means the model actually produced
an answer; a malformed reply is fed back as an error observation, never silently
treated as the final answer. A crash-and-resume demo is in
examples/durable-agent.ts.
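The reply protocol above can be handled with a small parser that never mistakes a malformed reply for a final answer. The following is a sketch under that reading of the protocol; `parseAgentReply` and the `AgentReply` type are hypothetical and are not exported by keel.

```typescript
// Hypothetical parser for the one-JSON-object-per-turn protocol; not keel's code.
type AgentReply =
  | { kind: "tool"; tool: string; args: Record<string, unknown> }
  | { kind: "final"; answer: string }
  | { kind: "error"; observation: string };

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function parseAgentReply(raw: string): AgentReply {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { kind: "error", observation: "reply was not valid JSON" };
  }
  if (!isPlainObject(value)) {
    return { kind: "error", observation: "reply must be a single JSON object" };
  }
  const final = value["final"];
  if (typeof final === "string") return { kind: "final", answer: final };

  const tool = value["tool"];
  const args = value["args"];
  if (typeof tool === "string") {
    if (isPlainObject(args)) return { kind: "tool", tool, args };
    return { kind: "error", observation: "tool call is missing an args object" };
  }
  return { kind: "error", observation: "expected a tool call or a final answer" };
}

const toolTurn = parseAgentReply('{"tool": "search", "args": {"q": "durable execution"}}');
const finalTurn = parseAgentReply('{"final": "Durable execution replays saved steps."}');
const brokenTurn = parseAgentReply("Sure! Here is the answer:");
```

In a loop, an `error` result would be fed back to the model as an observation and count as a turn, so the loop still ends at `maxTurns`.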
Scaling: SQLite store and multi-worker
FileStore rewrites its JSON file on every commit, which is simple
and diffable but does not scale to long runs or high volume (see
docs/BENCHMARKS.md). For volume, switch to the SQLite store
(built on Node's built-in node:sqlite, still zero-dependency; needs Node 22.5+
run with --experimental-sqlite, or Node 24+ where it is on by default):
import { SqliteStore } from '@sudhanshu1402/keel/sqlite';
const store = new SqliteStore('keel-data/keel.sqlite');
const keel = new Keel({ store });
SqliteStore is a ConcurrentStore: it supports lease-based claiming and
compare-and-swap updates, so multiple Workers can drain one store without ever
double-executing a step.
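Lease-based claiming with compare-and-swap can be sketched in memory. This is an illustration of the general technique, not SqliteStore's code: `LeaseTable`, `tryClaim`, and the `version` counter are hypothetical, and a real store would do the version check inside the database.

```typescript
// Hypothetical in-memory sketch of lease claiming with compare-and-swap.
interface RunRecord {
  id: string;
  version: number; // bumped on every successful write
  leaseOwner: string | null;
  leaseExpiresAt: number; // epoch ms; 0 means never leased
}

class LeaseTable {
  private runs = new Map<string, RunRecord>();

  add(id: string): void {
    this.runs.set(id, { id, version: 0, leaseOwner: null, leaseExpiresAt: 0 });
  }

  // Return a copy so callers hold a snapshot, as they would after a database read.
  read(id: string): RunRecord | undefined {
    const record = this.runs.get(id);
    return record ? { ...record } : undefined;
  }

  // Succeeds only if the lease is free or expired AND nobody wrote since the snapshot.
  tryClaim(snapshot: RunRecord, worker: string, now: number, leaseMs: number): boolean {
    const leaseHeld = snapshot.leaseOwner !== null && snapshot.leaseExpiresAt > now;
    if (leaseHeld) return false;
    const current = this.runs.get(snapshot.id);
    if (!current || current.version !== snapshot.version) return false; // lost the race
    this.runs.set(snapshot.id, {
      id: snapshot.id,
      version: current.version + 1,
      leaseOwner: worker,
      leaseExpiresAt: now + leaseMs,
    });
    return true;
  }
}

const leases = new LeaseTable();
leases.add("run-1");

// Two workers read the same unclaimed record, then both try to claim it.
const seenByA = leases.read("run-1")!;
const seenByB = leases.read("run-1")!;
const aWon = leases.tryClaim(seenByA, "worker-a", 1_000, 5_000);
const bWon = leases.tryClaim(seenByB, "worker-b", 1_000, 5_000); // stale snapshot

// Worker B retries with fresh reads: once while the lease is live, once after expiry.
const bTooEarly = leases.tryClaim(leases.read("run-1")!, "worker-b", 3_000, 5_000);
const bAfterExpiry = leases.tryClaim(leases.read("run-1")!, "worker-b", 7_000, 5_000);
```

The expiry path is what lets another worker take over a run whose owner crashed while holding the lease.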
import { Worker } from '@sudhanshu1402/keel';
const w1 = new Worker(keel, store, { concurrency: 4 });
const w2 = new Worker(keel, store, { concurrency: 4 });
w1.start();
w2.start();
Testing with zero infrastructure
createTestKeel builds a fully in-memory, deterministic engine: a MemoryStore,
a MockProvider with a counted call log, instant sleeps, stable run ids, and a
controllable clock. No database, no broker, no real time, no API keys.
import { createTestKeel } from '@sudhanshu1402/keel';
const t = createTestKeel({ respond: () => 'ok' });
const r = await t.keel.run(myWorkflow, input);
expect(t.provider.calls).toBe(1); // model called exactly once
t.advance(60_000); // drive durable timers without waiting
Delivery guarantee
keel is at-least-once, like every durable engine. The contract:
- A step that has reached completed is never re-run; its saved result replays. A run that has reached completed or cancelled is terminal and a stray resume is a no-op.
- A step whose side effect ran but whose process died before the result was persisted re-runs on resume. A failed step also re-runs when you resume the run.
- So: make side effects idempotent. Every step gets a stable idempotencyKey (default "<runId>:<stepName>", or pass your own via ctx.step(name, fn, { idempotencyKey })); the step function receives it as helpers.idempotencyKey alongside helpers.attempt and an AbortSignal. Forward it to APIs that support idempotency keys (Stripe, etc.) and a retried charge collapses to one.
- For values that must be stable across replays, use ctx.now(), ctx.random(), and ctx.uuid() instead of Date.now() / Math.random() / randomUUID() in workflow code: each is recorded as a step and replays the same value.
What keel does not promise is exactly-once side effects with no work on your part. No durable engine can; the honest version is at-least-once plus idempotency keys.
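Here is a small sketch of why forwarding the idempotency key makes a re-run safe. `FakePayments` is a hypothetical stand-in for an API that deduplicates on the key; only the default key format `<runId>:<stepName>` comes from the contract above.

```typescript
// Hypothetical payment API that deduplicates on an idempotency key.
class FakePayments {
  private chargesByKey = new Map<string, string>();
  chargeCount = 0;
  totalCents = 0;

  charge(amountCents: number, idempotencyKey: string): string {
    const existing = this.chargesByKey.get(idempotencyKey);
    if (existing !== undefined) return existing; // repeat request: return the first charge
    this.chargeCount += 1;
    this.totalCents += amountCents;
    const chargeId = `ch_${this.chargeCount}`;
    this.chargesByKey.set(idempotencyKey, chargeId);
    return chargeId;
  }
}

// Default key format described above: "<runId>:<stepName>".
const defaultKey = (runId: string, stepName: string): string => `${runId}:${stepName}`;

const payments = new FakePayments();
const chargeKey = defaultKey("run-42", "charge");

// First attempt: the charge goes through, then the process dies before the result is saved.
const firstCharge = payments.charge(1_999, chargeKey);
// On resume the step re-runs with the same key, so the API collapses it to one charge.
const retriedCharge = payments.charge(1_999, chargeKey);
```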
Hardening
- ctx.step(name, fn, { timeoutMs }) fails a step that hangs past the deadline. The attempt's AbortSignal is aborted so cooperative work can stop. Timeouts are not retried by default (a timed-out side effect may still be running).
- keel.cancel(runId) cancels a run cooperatively at the next durable boundary, throwing CancelledError. A cancel that lands during the final step wins over completion.
- ctx.all(name, [fn, ...]) runs sibling steps concurrently with deterministic replay (each child reserves its call position before any runs).
- new Keel({ onEvent }) emits a step:start / step:complete / step:fail event per step so you can wire OpenTelemetry spans or metrics; thrown callback errors are swallowed so observability never breaks a run.
- defineWorkflow(name, handler, { version }) pegs a run to the code version that started it; resuming under a changed version throws WorkflowVersionError.
- Steps retry with exponential backoff; override per step with { retry: { maxAttempts, baseMs, factor, jitter } }.
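For intuition about the retry options in the last bullet, here is one common way such parameters turn into delays. The formula is an assumption, not keel's documented behaviour: it treats `jitter` as a fraction (0 to 1) of random spread around the exponential delay, and `backoffDelayMs` and `retrySchedule` are hypothetical helpers.

```typescript
// Hypothetical backoff calculation; keel's actual formula and jitter semantics may differ.
interface RetryOptions {
  maxAttempts: number;
  baseMs: number;
  factor: number;
  jitter: number; // assumed here: fraction of the delay to randomize, 0 = none
}

// Delay to wait after the given failed attempt (1-based).
function backoffDelayMs(
  attempt: number,
  opts: RetryOptions,
  random: () => number = Math.random,
): number {
  const exponential = opts.baseMs * Math.pow(opts.factor, attempt - 1);
  const spread = exponential * opts.jitter;
  // Uniform in [exponential - spread, exponential + spread].
  return Math.round(exponential - spread + 2 * spread * random());
}

// One delay between each pair of consecutive attempts.
function retrySchedule(opts: RetryOptions, random: () => number = Math.random): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < opts.maxAttempts; attempt++) {
    delays.push(backoffDelayMs(attempt, opts, random));
  }
  return delays;
}

const withoutJitter = retrySchedule({ maxAttempts: 4, baseMs: 100, factor: 2, jitter: 0 });
// withoutJitter is [100, 200, 400]: three waits between four attempts
```

Injecting `random` keeps the calculation deterministic in tests, in the same spirit as the injectable clock described elsewhere in this README.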
Custom providers and stores
Implement one method to use any LLM:
import type { Provider } from '@sudhanshu1402/keel';
const openai: Provider = {
async complete({ prompt, model }) {
const res = await callOpenAI(prompt, model);
return { text: res.text, tokensIn: res.usage.prompt, tokensOut: res.usage.completion };
},
};
Implement the Store interface (or ConcurrentStore for multi-worker) to back runs with Redis, Postgres, or anything else. The engine only depends on the interface.
API
- new Keel({ store?, provider?, sleepFn?, now?, idFactory?, durableTimers?, onEvent? }) - engine. Defaults to MemoryStore, no provider, durableTimers: false.
- keel.run(def, input) - start a workflow, returns { runId, status, output? } (output is set on a completed run).
- keel.resume(runId) - resume a run after a crash. Register the workflow first if it is a fresh process. A completed or cancelled run resumes to a no-op.
- keel.sendSignal(runId, name, value) - deliver a signal; resumes the run if it was paused on it.
- keel.cancel(runId) - cancel a run cooperatively.
- defineWorkflow(name, handler, opts?) - declare a workflow. opts.version pegs the run version.
- ctx.step(name, fn, opts?) - durable step. opts.retry, opts.timeoutMs, opts.idempotencyKey. fn receives { attempt, signal, idempotencyKey }.
- ctx.all(name, [fn, ...], opts?) - run sibling steps concurrently, deterministic on replay.
- ctx.now() / ctx.random() / ctx.uuid() - replay-stable clock, RNG, and ids (each recorded as a step).
- ctx.sleep(name, ms) - delay; durable suspend when durableTimers is on, in-process block otherwise.
- ctx.waitForSignal(name) - pause until a signal arrives.
- ctx.llm(name, { prompt, model? }) - durable LLM step, captures token counts.
- defineAgent(name, opts) - durable multi-turn tool-calling agent workflow. opts.maxTurns, opts.maxTokens, opts.maxCostUsd, opts.historyWindow; per-tool validateArgs.
- new Supervisor(keel, store, { pollMs? }) - wakes sleeping and signal-ready runs.
- new Worker(keel, store, { concurrency?, leaseMs?, pollMs? }) - multi-worker executor over a ConcurrentStore.
- createTestKeel(opts?) - in-memory deterministic engine for tests.
- startDashboard({ store, port? }) / runCli(argv) - dashboard and CLI.
- Stores: MemoryStore, FileStore, SqliteStore (via @sudhanshu1402/keel/sqlite).
- Providers: OllamaProvider, MockProvider.
Docs
- docs/COMPARISON.md - keel vs Vercel Workflow, Temporal, Inngest.
- docs/LIMITATIONS.md - what keel deliberately does not do.
- docs/BENCHMARKS.md - throughput and recovery numbers, with npm run bench.
Status
v1.0. The durable core, durable timers and signals, replay-divergence and version guards, a SQLite store with multi-worker claiming, a local dashboard and CLI, durable agents, and zero-infra testing helpers are all built and tested (Node 20 and 22). keel targets a long-lived Node process; edge and serverless runtimes are out of scope by design (see docs/LIMITATIONS.md).
License
MIT
