@dutchmanlabs/evalstudio-cli

v0.3.0

Published

2 months ago

Local-first CLI for Dutchman Labs Eval Studio

rtsarsour

riyadsarsour

evals agents cli openai testing

Eval Studio CLI

Local-first CLI for detecting AI agents in a codebase, generating eval suites, running them locally, and syncing results back to Dutchman Labs when the hosted backend is available.

Install and Run

Zero-install:

npx evalstudio-cli login
npx evalstudio-cli init
npx evalstudio-cli detect
npx evalstudio-cli generate
npx evalstudio-cli run

Create an API key at https://dutchmanlabs.com/dashboard/settings. If you are not signed in, Dutchman Labs routes you through signup and returns you to the API key page.

Global install:

npm install -g evalstudio-cli
evalstudio-cli login
evalstudio-cli init
evalstudio-cli detect
evalstudio-cli generate
evalstudio-cli run

From the monorepo during development:

npm run build:cli
node packages/cli/dist/index.js --help
node packages/cli/dist/index.js login

login is still the best first step for hosted generation and dashboard sync, but the CLI now stays useful without it:

init can create a local project config and sync it later
detect always runs locally and only uploads when credentials are available
generate creates up to 3 local sample evals when no API key is saved yet
generate falls back to a full local synthetic suite when the backend is unavailable but you do have a key
run always executes locally and only uploads when a hosted suite and valid credentials are available

Commands

evalstudio-cli login
evalstudio-cli init
evalstudio-cli detect
evalstudio-cli scan (alias)
evalstudio-cli generate
evalstudio-cli run
evalstudio-cli sandbox run
evalstudio-cli sandbox doctor
evalstudio-cli sandbox latest
evalstudio-cli status
evalstudio-cli export

Detection

detect scans the local repo and recognizes patterns such as:

OpenAI
Anthropic / Claude
Vertex AI / Gemini
Azure AI
LangChain
LangGraph
LlamaIndex
Next.js, FastAPI, and Express handlers
Plain JavaScript, TypeScript, or Python agent files with callable entrypoints, tool usage, messages arrays, or system prompts

When you already know the framework, bias detection toward it with --framework:

evalstudio-cli detect --framework langchain

If detection finds more than one candidate, Eval Studio prints a ranked list and lets you choose one. If your local .evalstudio/scan-results.json file is malformed, the CLI warns and falls back to automatic detection instead of crashing.

Generate

generate prefers the hosted backend.

If you are logged in and the backend is reachable, Eval Studio generates the full hosted suite.
If you are not logged in yet, Eval Studio creates up to 3 local sample evals and points you to sign up for a free account.
If you are logged in but the backend is temporarily unavailable, Eval Studio falls back to a full local synthetic suite and still writes .evalstudio/latest-suite.json.

evalstudio-cli generate
evalstudio-cli generate --count 12

When hosted generation succeeds, the CLI prints your remaining daily generation quota. When generation falls back locally, the CLI tells you whether you are seeing a 3-eval sample because you are not logged in yet, or a full local fallback because the backend is temporarily unavailable.
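
The three cases above reduce to a small decision table. The sketch below restates it in Python as an illustration only: it is not the CLI's implementation, and the names GenerationMode and choose_generation_mode are hypothetical.

```python
from enum import Enum


class GenerationMode(Enum):
    HOSTED = "hosted"              # full suite from the hosted backend
    LOCAL_SAMPLE = "local-sample"  # up to 3 local sample evals
    LOCAL_FULL = "local-full"      # full local synthetic suite


def choose_generation_mode(has_api_key: bool, backend_reachable: bool) -> GenerationMode:
    """Mirror the README's description of how generate picks a path.

    Hypothetical helper: the real CLI may check these conditions
    differently or in a different order.
    """
    if not has_api_key:
        # Not logged in yet: sample evals only, regardless of backend state.
        return GenerationMode.LOCAL_SAMPLE
    if backend_reachable:
        return GenerationMode.HOSTED
    # Logged in, but the backend is temporarily unavailable.
    return GenerationMode.LOCAL_FULL
```

Only the logged-in, backend-reachable case produces a hosted suite; both other cases write a suite locally.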

Run

run has a single default path: it calls the detected local function entrypoint directly.

Python candidates default to module:function entrypoints such as agent:run
JavaScript and TypeScript candidates default to path#exportName entrypoints such as src/agent.ts#run
HTTP is only used when you explicitly pass --url
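
To make the two entrypoint notations concrete, here is a minimal Python sketch that splits a spec string into its parts. It only illustrates the syntax described above. The Entrypoint type and parse_entrypoint function are hypothetical, and the CLI's own parsing rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entrypoint:
    kind: str    # "python" or "js_or_ts" (labels chosen for this sketch)
    target: str  # dotted module for Python, file path for JS/TS
    symbol: str  # function name or export name


def parse_entrypoint(spec: str) -> Entrypoint:
    """Split 'module:function' or 'path#exportName' into its parts."""
    # Check '#' first so a file path that happens to contain ':' is
    # still treated as a JS/TS entrypoint.
    if "#" in spec:
        path, _, export = spec.partition("#")
        if path and export:
            return Entrypoint("js_or_ts", path, export)
    elif ":" in spec:
        module, _, func = spec.rpartition(":")
        if module and func:
            return Entrypoint("python", module, func)
    raise ValueError(f"unrecognized entrypoint: {spec!r}")
```

With this sketch, "agent:run" yields a Python target named agent with symbol run, and "src/agent.ts#run" yields a JS/TS target with the same symbol.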

Examples:

evalstudio-cli run
evalstudio-cli run --entrypoint src/agent.ts#run
evalstudio-cli run --entrypoint app.agents.refund_agent:run_agent
evalstudio-cli run --url http://127.0.0.1:3000/api/chat
evalstudio-cli run --payload '{"input":"{{prompt}}"}' --url http://127.0.0.1:3000/api/chat
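
The --payload example uses a {{prompt}} placeholder inside a JSON template. How the CLI substitutes it is not specified here. The sketch below shows one plausible approach, assuming the template is parsed as JSON first and the placeholder is then replaced inside string values, so prompts containing quotes cannot break the JSON. The render_payload function is hypothetical.

```python
import json


def render_payload(template: str, prompt: str):
    """Fill every '{{prompt}}' placeholder in a JSON template.

    Assumption for this sketch: substitution happens on parsed string
    values, not on the raw template text.
    """
    def fill(node):
        if isinstance(node, str):
            return node.replace("{{prompt}}", prompt)
        if isinstance(node, list):
            return [fill(item) for item in node]
        if isinstance(node, dict):
            return {key: fill(value) for key, value in node.items()}
        return node  # numbers, booleans, and null pass through unchanged

    return fill(json.loads(template))
```

Calling json.dumps on the result gives a request body that could be sent to the URL passed with --url.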

If a hosted run cannot be created or synced, or the CLI is operating without an API key, Eval Studio still saves .evalstudio/latest-run.json locally so you can inspect or export the results.

Browser Sandbox Runs

Use sandbox run for agents that act in a browser. It loads trajectory JSON, creates an isolated browser context per trajectory, replays the steps, scores each step against the expected URL, text, selectors, and tool calls, and writes trace and replay artifacts.

evalstudio-cli sandbox run \
  --eval-set ./evals/browser-trajectories.json \
  --backend local \
  --url http://127.0.0.1:3000 \
  --parallel 2 \
  --timeout 300 \
  --export json

Check local setup before a run:

evalstudio-cli sandbox doctor --eval-set ./evals/browser-trajectories.json

Print the latest sandbox summary and artifact paths:

evalstudio-cli sandbox latest

A trajectory file can be a top-level array, an object of the form { "trajectories": [...] }, or an Eval Studio suite of the form { "evals": [...] }.

{
  "trajectories": [
    {
      "id": "checkout-flow",
      "name": "Checkout under $50",
      "start_url": "http://127.0.0.1:3000",
      "steps": [
        {
          "step": 1,
          "input": { "user_message": "Buy the blue widget under $50" },
          "expected_tool_calls": ["search_products", "add_to_cart"],
          "expected_dom_state": {
            "url_pattern": "/cart",
            "element_text": "Proceed to checkout"
          }
        }
      ],
      "metadata": { "domain": "ecommerce", "risk_level": "high" }
    }
  ]
}
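
Before handing a file to sandbox run, it can help to check which of the three accepted shapes it uses. The following Python sketch normalizes all three to a plain list and counts steps per trajectory. It is illustrative only: extract_trajectories and summarize are hypothetical helpers, and entries in an "evals" suite may not carry the same fields as the trajectory shown above.

```python
def extract_trajectories(doc):
    """Return the list of trajectories from any of the three file shapes."""
    if isinstance(doc, list):
        return doc  # top-level array
    if isinstance(doc, dict):
        for key in ("trajectories", "evals"):
            value = doc.get(key)
            if isinstance(value, list):
                return value
    raise ValueError("expected a list, or an object with 'trajectories' or 'evals'")


def summarize(trajectories):
    """Return (id, step count) pairs; missing fields become None or 0."""
    return [
        (item.get("id"), len(item.get("steps", [])))
        for item in trajectories
        if isinstance(item, dict)
    ]
```

To use it on a file, load the file with json.load and pass the result to extract_trajectories.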

Artifacts are written under .evalstudio/sandbox-runs/<run-id>/:

summary.json
trace.ndjson
replay.html
screenshots/
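
These artifacts are plain files, so a short script can inspect them. The sketch below finds the most recently modified run directory and reads trace.ndjson one JSON object per line. It assumes nothing about the fields inside each trace event. Both helper names are hypothetical, and picking the latest run by modification time is a choice made for this sketch, not necessarily what sandbox latest does.

```python
import json
from pathlib import Path


def latest_run_dir(root=".evalstudio/sandbox-runs"):
    """Return the most recently modified run directory, or None."""
    base = Path(root)
    if not base.is_dir():
        return None
    runs = [entry for entry in base.iterdir() if entry.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda entry: entry.stat().st_mtime)


def read_ndjson(path):
    """Parse a newline-delimited JSON file into a list of objects."""
    events = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                events.append(json.loads(line))
    return events
```

For example, read_ndjson(latest_run_dir() / "trace.ndjson") returns the trace events of the latest run, provided at least one run exists.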

If you are logged in, initialized, and have selected a hosted candidate with detect, the sandbox summary, trace, and replay HTML also sync to the dashboard as browser sandbox artifacts. Screenshot files stay local in the current MVP.

Local mode auto-detects common Chrome, Chromium, Edge, and Brave installs. If your browser is in a custom location, set PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/path/to/chrome.

Local Files

Per-project state lives under .evalstudio/:

.evalstudio/config.json
.evalstudio/scan-results.json
.evalstudio/latest-suite.json
.evalstudio/latest-run.json
.evalstudio/exports/
.evalstudio/sandbox-runs/

Global auth lives in ~/.evalstudio/config.json.

Anonymous CLI telemetry is enabled by default to help us understand command usage and funnel dropoff. It does not block CLI execution, and you can opt out with:

evalstudio-cli --no-telemetry

or:

EVALSTUDIO_NO_TELEMETRY=1 evalstudio-cli detect
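
When the CLI is launched from a script or CI job, the same environment variable can be set programmatically. This is a minimal Python sketch. The no_telemetry_env helper is hypothetical; only the EVALSTUDIO_NO_TELEMETRY variable comes from this README.

```python
import os


def no_telemetry_env(base=None):
    """Copy an environment mapping and add the telemetry opt-out variable."""
    env = dict(os.environ if base is None else base)
    env["EVALSTUDIO_NO_TELEMETRY"] = "1"
    return env


# Example use (not executed here):
#   import subprocess
#   subprocess.run(["evalstudio-cli", "detect"], env=no_telemetry_env(), check=True)
```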

Status

status is the quickest way to see what Eval Studio knows about the current repo.

evalstudio-cli status

It shows:

current project ID
selected candidate
latest suite ID and run ID when cached locally
hosted usage and reset time when you are logged in
local-only state when you are not logged in yet

Manual Scan Cache Schema

Power users can pre-populate .evalstudio/scan-results.json. The minimum supported shape is:

{
  "projectId": "proj_123",
  "scannedAt": "2026-04-04T00:00:00.000Z",
  "candidates": [
    {
      "path": "src/agent.ts",
      "exportName": "run",
      "language": "typescript",
      "framework_guess": "openai",
      "tool_names": ["lookup_order"],
      "prompt_snippets": ["You are a support assistant."],
      "confidence": 0.7
    }
  ]
}

Unknown fields are ignored. Invalid candidates are skipped with a warning. If the whole file cannot be used, Eval Studio falls back to automatic detection.
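
If you write scan-results.json by hand, a quick pre-check can catch mistakes before the CLI sees the file. The Python sketch below keeps candidates that have every field shown in the example above and collects a warning for each one that does not. Treating all seven fields as required is an assumption drawn from that example. The CLI's actual validation rules are not documented here, and filter_candidates is a hypothetical helper.

```python
# Field names come from the README example; types are inferred from it.
EXPECTED_FIELDS = {
    "path": str,
    "exportName": str,
    "language": str,
    "framework_guess": str,
    "tool_names": list,
    "prompt_snippets": list,
    "confidence": (int, float),
}


def filter_candidates(doc):
    """Return (valid_candidates, warnings) for a parsed scan cache."""
    candidates = doc.get("candidates") if isinstance(doc, dict) else None
    if not isinstance(candidates, list):
        raise ValueError("scan cache has no 'candidates' list")

    valid, warnings = [], []
    for index, candidate in enumerate(candidates):
        if not isinstance(candidate, dict):
            warnings.append(f"candidate {index}: not an object")
            continue
        bad = [
            name
            for name, expected in EXPECTED_FIELDS.items()
            if not isinstance(candidate.get(name), expected)
            or isinstance(candidate.get(name), bool)
        ]
        if bad:
            warnings.append(f"candidate {index}: missing or invalid {', '.join(bad)}")
        else:
            valid.append(candidate)  # unknown extra fields are left alone
    return valid, warnings
```

A ValueError here corresponds to the case where the whole file cannot be used. A non-empty warnings list corresponds to individual candidates being skipped.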

Help

evalstudio-cli --help
evalstudio-cli help
npx evalstudio-cli --help
evalstudio-cli run --help
