agent.libx.js

v0.94.37

Published

2 days ago

Edge-native AI agent runtime — drives a virtual filesystem via any LLM (ai.libx.js). Same bytes run in node, browser, or edge.

elya_livshitz

ai agent llm virtual-filesystem edge runtime tools mcp claude sandbox

agent.libx.js

A coding agent that matches Claude Code on correctness — then beats it on cost, tokens, and tool-efficiency, and runs where Claude Code can't (sandbox, browser/edge, database).

By default it's a full-strength terminal coding agent: real disk, real shell, and the same Read/Edit/Grep/permissions/streaming-DX surface you'd expect from Claude Code. The difference is that its two host couplings are swappable seams:

LLM → any model via ai.libx.js (AIClient.chat, OpenAI-style tools/streaming).
Filesystem → a pluggable IFilesystem (real disk, in-memory, IndexedDB, a database, hybrid mounts) from wcli's headless core.

So the same agent loop also runs sandboxed (in-memory VFS, real disk untouched), on the edge / browser (no Node, no /bin/sh), or hybrid (mount real dirs + a database + remote storage side by side, with transactional overlays for checkpoint/rollback).
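
The filesystem seam can be pictured with a minimal sketch. `SketchFilesystem` and `SketchMemFilesystem` below are hypothetical stand-ins, not the real `IFilesystem` or `MemFilesystem`; only the method names `createDir`, `writeFile`, and `readFile` are taken from the library example further down, and the methods are synchronous here purely for brevity (the library example awaits them).

```typescript
// Hypothetical, simplified stand-in for the pluggable filesystem seam.
// Not the real IFilesystem from wcli; these signatures are assumptions.
interface SketchFilesystem {
  createDir(path: string): void;
  writeFile(path: string, content: string): void;
  readFile(path: string): string;
}

// In-memory backend: nothing here touches the real disk.
class SketchMemFilesystem implements SketchFilesystem {
  private dirs = new Set<string>(['/']);
  private files = new Map<string, string>();

  createDir(path: string): void {
    this.dirs.add(path);
  }

  writeFile(path: string, content: string): void {
    // Parent of "/src/x.ts" is "/src"; parent of "/x.ts" is "/".
    const parent = path.slice(0, path.lastIndexOf('/')) || '/';
    if (!this.dirs.has(parent)) throw new Error(`no such directory: ${parent}`);
    this.files.set(path, content);
  }

  readFile(path: string): string {
    const content = this.files.get(path);
    if (content === undefined) throw new Error(`no such file: ${path}`);
    return content;
  }
}

// Agent-side code depends only on the interface, so the backend is swappable.
function fixAdd(fs: SketchFilesystem, path: string): string {
  const patched = fs.readFile(path).replace('a - b', 'a + b');
  fs.writeFile(path, patched);
  return patched;
}
```

Because `fixAdd` only sees the interface, a disk-backed or database-backed class with the same three methods could be passed in without touching it, which is the property the seam is meant to provide.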

Claude Code is the floor; running isolated, on the edge, or hybrid is the ceiling.

How it stacks up vs Claude Code

Correctness is at parity; efficiency, cost, and reach are where it leads. Hard 7-task coding suite, Sonnet, denoised (each task run ×3 so no single lucky run decides a result; reproduce with SUITE=hard bun compare/run.ts):

| | agent.libx.js | Claude Code | Difference |
|---|---|---|---|
| Correctness | 7/7 | 7/7 | parity |
| Tool-calls | 16 | 28 | −43% |
| Tokens | 69k | 171k | 2.5× fewer |
| Wall-time | ~100s | 133s | ~25% faster |

Cost (9-task hard suite, USD-metered, vs CC-on-Opus): $0.49 single-tier Sonnet (5.4× cheaper) · $0.82 three-tier voice/duplex (3.3× cheaper) vs CC-Opus $2.67 — at quality parity (16/18 vs 17/18 passes).
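
The headline ratios follow directly from the raw figures quoted above. A quick check using only those numbers (the helper names `ratio` and `reduction` are illustrative, not part of the package):

```typescript
// Recompute the quoted ratios from the raw figures in the table and cost line.
const ratio = (baseline: number, ours: number): string =>
  (baseline / ours).toFixed(1) + '×';

const reduction = (baseline: number, ours: number): string =>
  Math.round((1 - ours / baseline) * 100) + '%';

const headline = {
  toolCallReduction: reduction(28, 16),   // tool-calls, 28 vs 16
  tokenRatio: ratio(171, 69),             // tokens in thousands, 171k vs 69k
  wallTimeReduction: reduction(133, 100), // seconds, 133 vs ~100
  costVsSonnet: ratio(2.67, 0.49),        // USD, CC-Opus vs single-tier Sonnet
  costVsDuplex: ratio(2.67, 0.82),        // USD, CC-Opus vs three-tier duplex
};
// headline → { toolCallReduction: '43%', tokenRatio: '2.5×',
//              wallTimeReduction: '25%', costVsSonnet: '5.4×', costVsDuplex: '3.3×' }
```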

Plus things Claude Code simply doesn't do:

Runs where CC can't — the same agent loop runs on real disk, an in-memory sandbox, the browser/edge (no Node, no /bin/sh), or a database-backed workspace. Swap the filesystem, not the agent.
Keyless web search, built in — WebSearch works in any deployment with no API key (DuckDuckGo; auto-upgrades to Tavily if you set one). CC's search is Anthropic-server-bound.
Context-safe by default — a 1 MB Grep/Read/MCP result is auto-paginated and can't blow the window; buried detail is recovered via a cheap context-isolated Ask peek — ~5.3× cheaper and more accurate than re-fetching, in a head-to-head.
It improves its own efficiency — an autonomous evolution loop cut its own tool-use ~50% (32 → 15 on the core suite, denoised), self-discovered, not hand-tuned — the same lever behind the efficiency lead above.
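
The "context-safe by default" behavior can be sketched roughly as follows. This is not the runtime's code: `paginateToolResult` and `PagedResult` are hypothetical names, the marker text is invented for illustration, and size is counted in characters rather than bytes to keep the sketch dependency-free. The 60,000 default mirrors the `maxToolResultBytes` default mentioned in the tools section.

```typescript
// Hypothetical sketch of oversized-result pagination; not the library's code.
interface PagedResult {
  text: string;       // what the model would see
  truncated: boolean; // whether anything was cropped
  totalPages: number;
}

// Assumption: size is measured in characters here for simplicity, while the
// real ceiling is expressed in bytes.
function paginateToolResult(output: string, maxChars = 60000, page = 1): PagedResult {
  const totalPages = Math.max(1, Math.ceil(output.length / maxChars));
  if (totalPages === 1) return { text: output, truncated: false, totalPages };

  // Clamp the requested page into the valid range, then cut that window out.
  const clamped = Math.min(Math.max(page, 1), totalPages);
  const slice = output.slice((clamped - 1) * maxChars, clamped * maxChars);
  const marker = `\n[page ${clamped}/${totalPages}: refine the query or read further]`;
  return { text: slice + marker, truncated: true, totalPages };
}
```

Small results pass through untouched, so the common case pays nothing; only an oversized result is cropped, and the marker tells the model how to get at the rest.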

Honest scope: the win is efficiency / cost / reach, not a claim of smarter reasoning — correctness is parity. All figures are denoised and reproducible (see Eval & compare); full boards in mind/09-outperform.md.

Quickstart

Point it at your project — no clone needed (requires Bun):

export ANTHROPIC_API_KEY=…                              # or OPENAI_API_KEY / GOOGLE_API_KEY / GROQ_API_KEY
bunx agent.libx.js "find and fix the failing test"      # run once in the current directory
bunx agent.libx.js                                      # …or open the interactive REPL

Want a permanent command? bun add -g agent.libx.js, then just agentx (and agentx --duplex for voice). The agent has full real-disk + shell access by default (like Claude Code); add --sandbox to work on an in-memory copy instead. See The agentx CLI for flags, sessions, and slash commands.

Use it as a library

import { AIClient } from 'ai.libx.js';
import { Agent, MemFilesystem } from 'agent.libx.js';

const fs = new MemFilesystem();              // or NodeDiskFilesystem(dir) — interchangeable
await fs.createDir('/src');
await fs.writeFile('/src/x.ts', 'export const add = (a,b) => a - b;\n');  // deliberately buggy: subtracts instead of adding

const ai = new AIClient({ apiKeys: { anthropic: process.env.ANTHROPIC_API_KEY } });
const res = await new Agent({ ai, fs, model: 'anthropic/claude-sonnet-4-6' })
  .run('Fix the add bug in /src/x.ts');

console.log(res.finishReason, await fs.readFile('/src/x.ts'));

Tools the agent gets

Shell (CLI disk mode) — a real /bin/sh: run any installed binary (git, bun, node, curl, scripts, …).
bash (library / sandbox mode) — ls/cat/grep/find/head/tail/echo/mkdir/rm/mv/wc, pipes, redirects, chaining — over the VFS (wcli's sandboxed JS interpreter).
Read — 1-indexed numbered lines, offset/limit.
Edit — exact unique-substring replace, with a read-before-edit staleness guard.
Grep/Glob/Write/MultiEdit — structured, typed results straight from the VFS (no bash parsing). The selectable tool set the self-evolution loop mutates over.
TodoWrite — a planning scratchpad; Task — spawn a depth-limited child agent over the VFS (subagents: true); SlashCommand — reusable prompt templates from <dir>/*.md (commandsDir); plus a real MCP client (src/mcp.client.ts, node-only — stdio/HTTP JSON-RPC handshake + discovery) that feeds the edge-safe MCP adapter (mcpToolsToAgentTools), so any MCP server's tools become agent tools.
WebFetch/WebSearch — fetch a URL as readable text, or search the web. Keyless by default (WebSearch uses DuckDuckGo; auto-upgrades to Tavily when TAVILY_API_KEY is set) and auto-enabled in the CLI. Factory-built with an injectable fetch, so they stay edge-portable and testable. (In the library they're opt-in by name: tools: [...,'WebSearch'].)
Oversized-output pagination — any tool result over a byte ceiling (maxToolResultBytes, default 60k) is cropped to page 1 with a marker (refine the query / read further), so one big Grep/Read/MCP/web result can't blow the context window. In the CLI (on by default; --no-scratch to disable) the full output instead spills losslessly to a scratch file and the model recovers specifics via Grep/Read or Ask — a cheap, context-isolated peek that returns just the answer (the raw blob never re-enters context).
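
To make the Edit semantics above concrete, here is a minimal sketch of an exact unique-substring replace with a read-before-edit staleness guard. `EditSession`, `EditResult`, and the error strings are hypothetical; the actual tool's API and messages are not shown in this README, so treat this as one plausible reading of the description.

```typescript
// Hypothetical sketch of an exact unique-substring edit with a staleness guard.
type EditResult = { ok: true; content: string } | { ok: false; error: string };

class EditSession {
  // Backing store, standing in for the VFS.
  private files: Map<string, string>;
  // Content as it looked when each path was last read through this session.
  private lastRead = new Map<string, string>();

  constructor(files: Map<string, string>) {
    this.files = files;
  }

  read(path: string): string {
    const content = this.files.get(path);
    if (content === undefined) throw new Error(`no such file: ${path}`);
    this.lastRead.set(path, content);
    return content;
  }

  edit(path: string, oldText: string, newText: string): EditResult {
    const current = this.files.get(path);
    if (current === undefined) return { ok: false, error: 'file not found' };

    // Staleness guard: the file must have been read, and must not have changed since.
    const seen = this.lastRead.get(path);
    if (seen === undefined) return { ok: false, error: 'read the file before editing' };
    if (seen !== current) return { ok: false, error: 'file changed since last read' };

    // Exact match that must occur exactly once.
    const first = current.indexOf(oldText);
    if (first === -1) return { ok: false, error: 'old text not found' };
    if (current.indexOf(oldText, first + 1) !== -1) {
      return { ok: false, error: 'old text is not unique' };
    }

    const content = current.slice(0, first) + newText + current.slice(first + oldText.length);
    this.files.set(path, content);
    this.lastRead.set(path, content);
    return { ok: true, content };
  }
}
```

Rejecting ambiguous or stale edits, instead of guessing, is what lets a model retry with more surrounding context rather than silently patching the wrong spot.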

Agentic subsystems

Beyond file tools, the runtime ships the higher-altitude pieces too — each an AgentOptions/loop extension over the two seams (see mind/06):

Skills + memory — VFS-backed (skillsDir/memoryDir); persistence is just the backend choice.
Subagents (subagents; typed agents via agentsDir — <dir>/<name>.md defines a persona + model + scoped tools, selected with the Task agentType), hooks (hooks: preToolUse/postToolUse/onStop — block or audit any tool call), slash-commands (commandsDir), TodoWrite, MCP (mcpToolsToAgentTools).
Streaming (stream: true → text_delta via HostBridge.notify) and context compaction (compaction: { maxMessages } → edge-safe summarize-and-boundary). Defaults preserve the original non-stream, drop-oldest behavior.
Multi-turn + project context — Agent.send() continues a conversation across turns (vs run(), which starts fresh); project instructions (instructionFiles: AGENTS.md/CLAUDE.md at the FS root) inject into the system prompt.
DuplexAgent (src/duplex.ts) — voice-optimized three-tier engine (reflex/act/think): a fast reflex agent streams instant replies and self-selects escalation — Act for standard tool work (Sonnet-class), Think for deep reasoning (Opus-class, configurable, default on). Results are pushed back and re-voiced by the reflex (turn mutex, coalesced completions, TaskStatus/CancelTask). See mind/10.
Scheduler (src/scheduler.ts + cli/osScheduler.ts) — one-off ({at}), interval ({everyMs}), cron ({cron}) via ScheduleTask/ScheduleList/ScheduleCancel/Wakeup. In-session jobs fire while the session is alive (persisted, re-armed on --resume); far one-offs (or backend:'os') register with the OS scheduler (launchd / crontab / at) and survive quitting — the fired job headless-resumes the session (agentx -p … --resume <id> --yes). The PushNotification tool (osascript / notify-send) alerts the user out-of-band; Read on a .pdf returns extracted text (poppler's pdftotext, disk mode). RemoteTrigger invokes another agentx session on this machine: a session open in a live terminal receives the prompt as an injected turn (per-session unix socket, same-user only); otherwise it's resumed headless and the final answer comes back. See mind/12.
Budget kill-switches — always-on per-run guards (maxTokens/timeoutMs/maxRepeats/maxToolCalls/signal → finishReason budget/timeout/loop/max_tool_calls/aborted) protect the API spend against runaway loops. The enforceable billing cap is server-side in the web key-proxy: a VFS-backed budget config (/.agent/budget.json, USD-metered, hot-reloaded, $100/wk default) a browser client can't bypass. See web/ and mind/06.
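
The per-run budget guards can be illustrated with a small sketch. The option names and `finishReason` values are the ones listed above; everything else (the `RunState` shape, the check order, and reading `maxRepeats` as "consecutive identical tool calls") is an assumption made for illustration, not the runtime's actual logic.

```typescript
// Hypothetical sketch of per-run budget guards. Option and stop-reason names
// follow the README; the logic itself is an assumption.
interface BudgetOptions {
  maxTokens?: number;
  timeoutMs?: number;
  maxRepeats?: number;
  maxToolCalls?: number;
}

type StopReason = 'budget' | 'timeout' | 'loop' | 'max_tool_calls' | 'aborted';

interface RunState {
  tokensUsed: number;
  elapsedMs: number;
  toolCalls: string[]; // one signature per call, e.g. 'Grep:{"pattern":"foo"}'
  aborted: boolean;    // stands in for an AbortSignal
}

// Count how many times the most recent call repeats back-to-back.
function trailingRepeats(calls: string[]): number {
  if (calls.length === 0) return 0;
  const last = calls[calls.length - 1];
  let n = 0;
  for (let i = calls.length - 1; i >= 0 && calls[i] === last; i--) n++;
  return n;
}

// Returns a stop reason when a guard trips, or null to keep looping.
function checkBudget(state: RunState, opts: BudgetOptions): StopReason | null {
  if (state.aborted) return 'aborted';
  if (opts.timeoutMs !== undefined && state.elapsedMs >= opts.timeoutMs) return 'timeout';
  if (opts.maxTokens !== undefined && state.tokensUsed >= opts.maxTokens) return 'budget';
  if (opts.maxToolCalls !== undefined && state.toolCalls.length >= opts.maxToolCalls) {
    return 'max_tool_calls';
  }
  if (opts.maxRepeats !== undefined && trailingRepeats(state.toolCalls) > opts.maxRepeats) {
    return 'loop';
  }
  return null;
}
```

A loop would call `checkBudget` once per step and end the run with the returned value as its `finishReason`, so a runaway agent stops at a known point instead of spending without limit.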

The `agentx` CLI

A dependency-light readline REPL (plus headless -p mode) over the runtime:

agentx                      # interactive REPL in the current dir
agentx "fix the bug in x"   # run once and exit
agentx -c "keep going"      # continue the most recent session
agentx --resume <id> "…"    # resume a specific session

Filesystem + Shell — by default the CLI has full real-filesystem access like Claude Code (root / is the machine root, the launch dir is the working dir, absolute host paths and above-cwd reach both work) with a real /bin/sh (Shell tool) so the agent can run git, bun, node, curl, and any installed binary. Secrets (.env, .ssh, keys, .git) stay hidden by the jail; env secrets are scrubbed from the child shell. --sandbox instead operates over an in-memory copy of the working dir with a VFS-only bash — the real disk is never touched. --boddb <dir> runs over a persistent database workspace (a bod-db store at <dir> — meta.db tree + files/ bytes) that survives across runs while the real disk stays untouched; DB-native by default, or add --seed to hydrate it from cwd on the first run. --no-shell forces the VFS bash in disk mode. --harden OS-sandboxes the real shell (macOS sandbox-exec / Linux bwrap): writes confined to cwd+tmp, outbound network blocked (--harden-net keeps network); commands fail closed when no wrapper exists. (/sandbox shows the active mode.)
Sessions — every conversation persists to ./.agent/sessions/<id>.json, flushed at every tool step (a crash, hang, or Ctrl-C mid-turn loses at most the in-flight step, never the transcript); --continue/--resume (and /sessions, /resume) pick it back up, with memory across turns — a REPL turn sees the previous one. A global symlink index at ~/.agent/sessions/ enables cross-project lookup: --resume 090715-myproject resolves from any directory, and /sessions all lists every project's sessions in one picker.
Diffs — every Edit/Write/MultiEdit renders a colorized +/- diff (TTY-gated; plain when piped).
Slash commands — /help /tools /model /compact /copy /diff /memory /clear /sessions /resume /commands /init; /compact <focus> preserves matching lines from the folded span; /copy [code] puts the last reply (or its last code block) on the OS clipboard; /diff shows everything the session changed (oldest checkpoint → now); /memory opens the memory index in $EDITOR; user-defined ./.agent/commands/<name>.md are invokable directly as /<name> (the same registry the model's SlashCommand tool uses). Skills/commands created mid-session are picked up automatically each turn (delivered as a cache-friendly <system-reminder> delta, like Claude Code) and the Skill/SlashCommand tools rescan on a name miss; /reload forces a full catalog + system-prompt rebuild.
Live chrome — the thinking spinner shows elapsed seconds + esc to interrupt; the terminal tab title tracks the session topic; a bell rings when a long (>10s) turn finishes in a backgrounded tab; the footer warns at 80%/90% context pressure and auto-trims announce themselves.
/transcript [n] — the full session transcript including complete tool-result bodies (the past-turn equivalent of Ctrl-O live verbose), paged through less; /doctor — one-shot environment sanity check (keys, model pricing, config, session-store writability, memory, MCP mounts).
Syntax-highlighted code fences — ```ts (and js/py/sh/go/rust/…) blocks render with keywords bold, strings green, numbers cyan, comments dim; unknown languages keep the plain cyan body. TodoWrite plans pin a compact ☑ 2/5 · current step line into the idle footer.
/agents — list subagent types from ./.agent/agents (description, model, tool scope); /agents new <name> scaffolds a frontmatter'd definition for the Task tool's agentType. !<partial> + menu completes from past ! shell commands. @server:uri mentions inline an MCP resource body into the prompt. Transient network drops mid-step retry automatically (2 attempts, backoff) instead of failing the turn.
Project instructions — ./AGENTS.md (or CLAUDE.md) auto-loads into every run; /init scaffolds one.
Any provider — set ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY / GROQ_API_KEY; choose with -m provider/model.
@-file mentions & headless JSON — reference files inline in a prompt with @path (e.g. explain @src/Agent.ts; ~/ expands to the home directory; quote paths with spaces as @"…" — drag-dropped files, e.g. macOS screenshots, quote themselves automatically); script with -p --output-format json to get one machine-readable result object on stdout (activity stays on stderr).
Tab-completion — Tab completes /<command> names and @<path> file/dir references (descends subdirs, dotfiles hidden unless typed) straight from the working tree.
Duplex mode — agentx --duplex runs the full standard REPL (slash commands, sessions, postures, rewind, MCP) with the three-tier engine driving turns: a fast voice model (--voice-model, default groq/openai/gpt-oss-120b) answers every line instantly and delegates real work to background workers built with the same wiring as a normal run (fs mode, permissions, MCP); worker activity shows as dim chrome and results are re-voiced when ready. Switch any tier live with /model (opens a reflex/act/think picker), or the /voice-model · /think-model shortcuts. /tasks lists background tasks, inspects a task's live output tail, and cancels a running one from a picker (Esc mid-turn cancels the foreground turn; Esc again at the idle prompt cancels running workers).
MCP servers — declare mcpServers: { name: { command, args } | { url } } in config and they're auto-mounted at startup (in parallel, with an optional mountTimeoutMs deadline so one slow/dead server never blocks the rest): the client does the JSON-RPC handshake (stdio or HTTP) + tools/list, and the discovered tools appear as mcp__<name>__<tool> in /tools (inspect with /mcp). A bad server is logged and skipped, never blocking the agent. For large tool sets, deferred mode (makeMcpToolSearch / mountMcpDeferred) exposes just two bounded tools (ToolSearch + McpCall) instead of N defs — dodging the provider tool-cap and improving selection accuracy; the CLI applies this automatically past 12 mounted tools (a 42-tool server was costing ~80k tok/turn in schema alone), and permission rules written against the real mcp__<name>__<tool> names still match through McpCall. mountMcpCatalog goes further: a cached, hash-keyed catalog + lazy connect means a turn that uses no MCP tool opens zero connections, and one that uses a tool connects exactly that server — latency scales with tools-used, not servers-configured. A down server is negative-cached (failureCooldownMs) so it never re-floors a later turn at the deadline. For zero turn-path latency even on a cold process, call warmMcpCatalog at boot + on a timer (off-turn discovery) and mount with { discover: 'cache-only' } — the turn then never synchronously connects: it serves the warmed catalog and discovers any miss in the background.
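
The MCP naming and deferred-mode rules above can be sketched as plain functions. The config shape (`{ command, args } | { url }`), the `mcp__<name>__<tool>` naming, the `ToolSearch`/`McpCall` pair, and the 12-tool threshold come from the description; the function names `transportOf`, `mcpToolName`, and `exposedToolNames` are hypothetical and do not correspond to any export of the package.

```typescript
// Hypothetical sketch of MCP config handling. Shapes follow the README's
// `mcpServers: { name: { command, args } | { url } }`; the helpers are assumed.
type McpServerConfig = { command: string; args?: string[] } | { url: string };

interface DiscoveredTool {
  server: string;
  tool: string;
}

// stdio when a command is given, HTTP when a url is given.
function transportOf(cfg: McpServerConfig): 'stdio' | 'http' {
  return 'command' in cfg ? 'stdio' : 'http';
}

// Discovered tools surface under the mcp__<name>__<tool> naming scheme.
function mcpToolName(t: DiscoveredTool): string {
  return `mcp__${t.server}__${t.tool}`;
}

// Past the threshold, expose two bounded tools instead of N definitions.
function exposedToolNames(tools: DiscoveredTool[], deferAbove = 12): string[] {
  if (tools.length > deferAbove) return ['ToolSearch', 'McpCall'];
  return tools.map(mcpToolName);
}
```

The point of the threshold is that schema cost grows with the number of tool definitions sent each turn, while the two-tool deferred surface stays constant no matter how many servers are mounted.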

🧬 It improves itself

The agent is a coding agent that operates over a swappable filesystem — so it can be pointed at its own repo and evolve its own configuration. evolve/ is an autonomous loop:

champion → propose patch → jailed + sandboxed eval → per-task no-regression gate → ledger → repeat

An LLM is the mutation operator; a behavioral fitness function (run the produced code) is natural selection. Correctness is a hard gate, the rule files are hash-pinned (the agent can't edit what judges it), and every candidate runs under two containment boundaries — a JailedFilesystem (secret denylist, symlink-escape defense) and a sandboxed grader (scrubbed env, nonce-authenticated result, default-on sandbox-exec). Those guardrails were hardened against a 22-agent adversarial red-team (14 findings fixed) before the loop was allowed to run.
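
The promotion step of that loop, a hard correctness gate followed by an efficiency comparison over averaged runs, might look roughly like this. All names and data shapes here (`TaskRun`, `Scorecard`, `shouldPromote`) are assumptions for illustration; the real gate lives in evolve/ and is not reproduced in this README.

```typescript
// Hypothetical sketch of the promotion gate; names and shapes are assumptions.
interface TaskRun {
  passed: boolean;
  toolCalls: number;
}

// One entry per task, each holding the repeated runs for that task.
type Scorecard = Record<string, TaskRun[]>;

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Sum of per-task mean tool-calls; averaging keeps one lucky run from promoting.
function meanToolCalls(card: Scorecard): number {
  return Object.keys(card).reduce(
    (sum, task) => sum + mean(card[task].map(r => r.toolCalls)),
    0,
  );
}

// Per-task no-regression: wherever the champion passes every run, so must the candidate.
function noRegression(champion: Scorecard, candidate: Scorecard): boolean {
  return Object.keys(champion).every(task => {
    const championPasses = champion[task].every(r => r.passed);
    const runs = candidate[task];
    const candidatePasses = runs !== undefined && runs.length > 0 && runs.every(r => r.passed);
    return !championPasses || candidatePasses;
  });
}

// Correctness is a hard gate; efficiency only decides among correct candidates.
function shouldPromote(champion: Scorecard, candidate: Scorecard): boolean {
  return noRegression(champion, candidate) && meanToolCalls(candidate) < meanToolCalls(champion);
}
```

Checking regression per task rather than on an aggregate score means a candidate cannot trade a newly broken task for savings elsewhere.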

Result (Sonnet 4.6): the loop autonomously drove the baseline from 32 → 15 tool-calls (53% fewer) while the 5/5 pass rate held. That is parity with Claude Code (head-to-head: 15 vs 15 tool-calls, 1.8× faster, 2.8× fewer tokens), closing the efficiency gap that had previously only been described. This is the denoised figure (each candidate averaged over 3 runs so no lucky run promotes); a single un-averaged run reached 14. It generalizes to held-out tasks (24 → 12, no overfit), and it arrived at the human-authored parity plan on its own: use structured Grep/MultiEdit, stop over-exploring.

GENERATIONS=8 bun evolve/loop.ts     # evolve → evolve/champion.json + ledger.jsonl
bun evolve/report.ts                 # instant replay of the arc (no tokens)
EVOLVED=1 bun compare/run.ts         # evolved champion vs Claude Code
bun evolve/generalize.ts             # baseline vs champion on UNSEEN tasks

Full design + threat model + results: mind/08-self-evolve.md.

Status

v1 (done): loop + hybrid tools + Mem/Disk backends + deterministic FakeAIClient tests + real-model run. 5/5 pass@1 on the behavioral eval (Sonnet 4.6); the head-to-head started at correctness parity with Claude Code but ~2× the tool calls (≈28 vs 15) — a gap the self-evolution loop has now closed autonomously: it drove its own baseline from 32 → 15 tool-calls (denoised over 3 runs) and ties Claude Code in a fresh head-to-head (15 vs 15). 820+ tests green.

See mind/ for the full vision, architecture, decision journal, roadmap, eval + head-to-head results, the parity plan, and the self-evolution design.

Develop & evaluate

Hacking on the runtime itself (from a clone):

bun install                # links wcli (file:), ai.libx.js + libx.js (bun link)
bun test                   # 820+ unit/integration tests (offline via FakeAIClient, no key)
ANTHROPIC_API_KEY=… bun examples/run-sonnet.ts   # drive a real model end-to-end

Eval & head-to-head (real model):

bun eval/run.ts            # behavioral scorecard (our agent over MemFilesystem)
bun compare/seed-tasks.ts  # materialize task specs into .tmp/tasks/
bun compare/run.ts         # head-to-head vs Claude Code (needs the `claude` CLI)
