@root-signals/scorable-cli
v0.14.0
CLI for Scorable
The scorable CLI is a command-line tool for interacting with the Scorable API. It lets you manage and execute Judges and Evaluators, view execution logs, and run prompt testing experiments directly from the terminal.
Requires Node.js 20 or higher.
Installation
curl -sSL https://scorable.ai/cli/install.sh | sh
Or install directly with npm:
npm install -g @root-signals/scorable-cli
Or run without installing via npx:
npx @root-signals/scorable-cli judge list
Authentication
Option 1 — Free demo key (no registration required):
scorable auth demo-key
Creates a temporary key and saves it to ~/.scorable/settings.json.
Option 2 — Permanent key from scorable.ai/register:
# Interactively
scorable auth set-key
# From argument
scorable auth set-key sk-your-api-key
Option 3 — Environment variable (takes precedence over saved key):
export SCORABLE_API_KEY="sk-your-api-key"
The key lookup order is: SCORABLE_API_KEY env var → api_key in ~/.scorable/settings.json → temporary_api_key in ~/.scorable/settings.json.
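The lookup order above can be sketched as a small shell function that reports which source would supply the key. This is only an illustration: api_key_source is a hypothetical helper, not part of the scorable CLI, and it assumes an empty value is treated the same as an unset one.

```shell
# Hypothetical helper illustrating the documented lookup order.
# Usage: api_key_source <SCORABLE_API_KEY value> <saved api_key> <saved temporary_api_key>
# Prints the name of the source that wins; returns 1 when no key is available.
# Assumption: an empty value counts as "not set".
api_key_source() {
  if [ -n "$1" ]; then
    echo "SCORABLE_API_KEY"
  elif [ -n "$2" ]; then
    echo "api_key"
  elif [ -n "$3" ]; then
    echo "temporary_api_key"
  else
    return 1
  fi
}
```

For example, `api_key_source "" "sk-saved" "sk-temp"` prints `api_key`, because the environment variable is empty and the saved permanent key outranks the temporary one.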
Projects
A project is a workspace inside your organization that groups related judges, evaluators, datasets, and execution logs. Every resource belongs to a project; omitting the project at create/execute time files things under the org's default project.
Manage projects
scorable project list # list all projects (default marked)
scorable project get <project_id> # show a single project
scorable project create --name "Production" # create a project
scorable project create --name "Production" --is-default
scorable project update <project_id> --name "Renamed"
scorable project update <project_id> --is-default # promote to default
scorable project set-default <project_id> # convenience for `update --is-default`
scorable project delete <project_id>
--project-id on every project-aware command
Every command that creates, executes, lists, or filters a project-scoped resource accepts --project-id <uuid>:
# Filter list endpoints
scorable judge list --project-id <project_id>
scorable evaluator list --project-id <project_id>
scorable execution-log list --project-id <project_id>
# Route an execution log to a project
scorable judge execute <judge_id> --project-id <project_id>
scorable evaluator execute <evaluator_id> --project-id <project_id>
# Pin a resource to a project at creation
scorable judge create --name X --intent Y --project-id <project_id>
scorable evaluator import-yaml --file evaluator.yaml --project-id <project_id>
# Move a resource between projects
scorable judge update <judge_id> --project-id <other_project_id>
# OpenAI-compat (translated to X-Project-Id header)
scorable judge exec-openai <judge_id> --project-id <project_id>
Setting a default project for your shell
To avoid passing --project-id on every command, set an env var or persist a per-machine default:
# Environment variable (great for CI)
export SCORABLE_PROJECT_ID=<project_id>
# Persistent default written to ~/.scorable/settings.json
scorable auth set-project <project_id>
scorable auth show-project # see what's resolved and from where
scorable auth unset-project # remove the saved default
Resolution order: --project-id flag → SCORABLE_PROJECT_ID env var → project_id in ~/.scorable/settings.json → omitted (backend resolves to org default). Pass --project-id "" to explicitly opt out of an inherited default for a single invocation.
scorable auth logout clears the entire auth section of ~/.scorable/settings.json, including the saved project_id.
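As a rough illustration of the project resolution order described above, the function below mirrors the four steps, including the explicit opt-out with an empty flag value. resolve_project_id is a hypothetical helper, not a CLI command, and treating an empty environment variable as unset is an assumption.

```shell
# Hypothetical helper illustrating the documented resolution order.
# Usage: resolve_project_id <flag_given: yes|no> <flag_value> <env_value> <saved_value>
# Prints the project ID that would be sent, or an empty line when the
# project is omitted and the backend falls back to the org default.
resolve_project_id() {
  if [ "$1" = "yes" ]; then
    # An explicit flag always wins, including --project-id "" (opt out).
    printf '%s\n' "$2"
  elif [ -n "$3" ]; then
    # SCORABLE_PROJECT_ID (assumption: empty means unset).
    printf '%s\n' "$3"
  elif [ -n "$4" ]; then
    # project_id saved in ~/.scorable/settings.json.
    printf '%s\n' "$4"
  else
    printf '\n'
  fi
}
```

Calling `resolve_project_id yes "" p-env p-saved` prints an empty line: the empty flag value suppresses both inherited defaults for that one invocation.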
Scorable Skills for AI Coding Agents
Install Scorable skills into your project so your AI coding agent (Claude Code, Cursor, etc.) can integrate evaluators automatically:
scorable skills-add
Once installed, open your coding agent in your AI-powered project and use the prompt:
"Integrate scorable evaluators"
Judge Management
List judges
scorable judge list
Options: --page-size, --cursor, --search, --name, --ordering
Get a judge
scorable judge get <judge_id>
Create a judge
scorable judge create --name "My Judge" --intent "Evaluate response quality."
Options: --name (required), --intent (required), --stage, --evaluator-references (JSON string, e.g. '[{"id": "eval-id"}]')
Update a judge
scorable judge update <judge_id> --name "Updated Name"
Options: --name, --stage, --evaluator-references (use "[]" to clear)
Delete a judge
scorable judge delete <judge_id>
Prompts for confirmation. Use --yes to skip.
Duplicate a judge
scorable judge duplicate <judge_id>
Generate a judge
AI-powered judge generation from a plain-language description of what you want to evaluate.
scorable judge generate --intent "I am building a customer support chatbot. Evaluate that responses are helpful and follow our refund policy."
Attach a policy document so the generated evaluators can check compliance against it:
# Upload and generate in one step
scorable judge generate --intent "Evaluate responses against the attached policy." --file ./policy.pdf
# Or reuse an already-uploaded file
scorable judge generate --intent "Evaluate responses against the attached policy." --file-id <file_uuid>
Options: --intent (required), --file (path to PDF/PNG/JPG — uploads and attaches), --file-id (UUID of an already-uploaded file), --visibility (private/public, default private), --name, --stage, --extra-contexts (JSON object, e.g. '{"Domain":"hotel","Tone":"formal"}'), --reasoning-effort (off/low/medium/high), --judge-id (regenerate an existing judge), --overwrite, --context-aware
File Management
Upload a file
Upload a PDF or image for use as context in judge generation or evaluator execution.
scorable file upload ./policy.pdf
Returns a file UUID that can be passed to judge generate --file-id or evaluator execute --file-ids.
Supported formats: PDF, PNG, JPG, JPEG, WEBP, SVG.
Judge Execution
Execute by ID
scorable judge execute <judge_id> --request "What is the capital of France?" --response "Paris"
Options: --request, --response, --turns (JSON array of conversation turns), --contexts (JSON list), --expected-output, --tag (repeatable), --user-id, --session-id, --system-prompt
Pipe a response via stdin:
echo "Paris" | scorable judge execute <judge_id> --request "What is the capital of France?"
cat response.txt | scorable judge execute <judge_id>
For multi-turn conversations, pass the full history as a JSON array:
scorable judge execute <judge_id> --turns '[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello!"}]'
Execute by name
scorable judge execute-by-name "My Judge" --request "What is the capital of France?" --response "Paris"
Accepts the same options as execute. Stdin piping and --turns work the same way.
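Because the response can be read from stdin, scoring a batch of saved responses against one judge can be sketched as a simple loop. judge_files is a hypothetical helper name; the only CLI features it relies on are judge execute, --request, and stdin input as shown above. The format of the command's output is not assumed here.

```shell
# Hypothetical helper, not part of the scorable CLI.
# Usage: judge_files <judge_id> <request> <response_file>...
judge_files() {
  judge_id=$1
  request=$2
  shift 2
  for response_file in "$@"; do
    # Label each result with the file it came from.
    printf '== %s\n' "$response_file"
    # The response text is read from stdin, as in the piping examples above.
    scorable judge execute "$judge_id" --request "$request" < "$response_file"
  done
}
```

A call might look like `judge_files <judge_id> "What is the capital of France?" responses/*.txt`, where the responses directory is a placeholder for wherever your saved outputs live.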
Evaluator Management
List evaluators
scorable evaluator list
Options: --page-size, --cursor, --search, --name, --ordering
Get an evaluator
scorable evaluator get <evaluator_id>
Create an evaluator
scorable evaluator create \
--name "My Evaluator" \
--scoring-criteria "Does the {{ response }} directly answer the user's question?" \
--intent "Evaluate response relevance"
Options: --name (required), --scoring-criteria (required — must contain {{ request }} and/or {{ response }}), --intent or --objective-id (one required), --system-message, --models (JSON array), --overwrite, --objective-version-id
Update an evaluator
scorable evaluator update <evaluator_id> --name "Updated Name"
Options: --name, --scoring-criteria, --system-message, --models (JSON array), --objective-id, --objective-version-id
Delete an evaluator
scorable evaluator delete <evaluator_id>
Prompts for confirmation. Use --yes to skip.
Duplicate an evaluator
scorable evaluator duplicate <evaluator_id>
Evaluator Execution
Execute by ID
scorable evaluator execute <evaluator_id> --request "What is 2+2?" --response "4"
Options: --request, --response, --turns (JSON array of conversation turns), --contexts (JSON list), --expected-output, --tag (repeatable), --user-id, --session-id, --system-prompt, --variables (JSON object of extra template variables)
Stdin piping and --turns work the same way as for judge execution.
For evaluators with custom template placeholders beyond {{request}}/{{response}}:
scorable evaluator execute <evaluator_id> --request "Hello" --variables '{"lang":"EN","topic":"science"}'
Execute by name
scorable evaluator execute-by-name "My Evaluator" --request "What is 2+2?" --response "4"
Accepts the same options as execute, including --variables.
Custom Model Management
Bring your own LLM (BYO-LLM) — register a custom or self-hosted model, then reference it from evaluators and judges.
List models
scorable model list
Shows ID, name, provider, and visibility. Options: --page-size, --cursor, --ordering.
Get a model
scorable model get <model_id>
Create a model
# SaaS provider (key inline)
scorable model create --name my-gpt --model gpt-5.5 --key sk-...
# Self-hosted / custom endpoint
scorable model create \
--name azure/gpt-5.5 \
--model azure/gpt-5.5 \
--url https://my-azure-openai.openai.azure.com \
--key sk-...
# Read the key from stdin (keeps it out of shell history)
echo "$MY_PROVIDER_KEY" | scorable model create --name my-gpt --model gpt-5.5 --key -
Options: --name (required), --model, --url (for self-hosted endpoints), --key (provider API key; - reads from stdin), --max-token-count, --max-output-token-count.
Update a model
scorable model update <model_id> --max-output-token-count 4096
update is a PATCH — only fields you pass are sent. All create flags are accepted as optional updates, including --key - for stdin.
Delete a model
scorable model delete <model_id>
Prompts for confirmation. Use --yes to skip.
Execution Logs
List execution logs
scorable execution-log list
Options: --page-size, --cursor, --search, --evaluator-id, --judge-id, --model, --tags, --score-min, --score-max, --created-at-after, --created-at-before, --owner-email
Get an execution log
scorable execution-log get <log_id>
OTEL Trace Evaluation Filters
When traces arrive at Scorable's OTLP endpoint, evaluation filters automatically run an evaluator or judge against each matching trace. Results land back on the same trace as a child span carrying the OpenTelemetry GenAI evaluation attributes (gen_ai.evaluation.name, gen_ai.evaluation.score.value, gen_ai.evaluation.explanation).
Create a filter
scorable otel-filter create \
--name "default-truthfulness" \
--evaluator-id <evaluator-uuid>
Required: --name and exactly one of --evaluator-id or --judge-id. A judge target emits one eval span per inner evaluator.
Options: --filter-criteria (JSON of conditions), --sampling-rate (0.0–1.0, default 1.0), --delay-seconds (default 10, allows late spans to land before evaluation), --inactive
Match traces from a specific service, run a 5-second-delayed evaluation:
scorable otel-filter create \
--name "agent-truthfulness" \
--evaluator-id <evaluator-uuid> \
--filter-criteria '{"conditions":[{"column":"resource","type":"string","key":"service.name","operator":"=","value":"my_agent"}]}' \
--delay-seconds 5
Multi-evaluator judge target:
scorable otel-filter create --name "quality-judge" --judge-id <judge-uuid>
List filters
scorable otel-filter list
Delete a filter
scorable otel-filter delete <filter_id>
Custom extractor rules (when input/output isn't in gen_ai.*)
For the common case the flags above are everything you need: matching traces are evaluated and their gen_ai.input.messages / gen_ai.output.messages are fed to the evaluator. If your traces don't follow that shape — Claude Code, OpenInference, custom instrumentations — you tell the evaluator where input/output live by attaching extractor_rules to the filter. Filters without extractor_rules keep the default behavior.
Rules are carried in a YAML manifest and applied with -f:
scorable otel-filter create -f filter.yaml
scorable otel-filter update <id> -f filter.yaml
scorable otel-filter validate -f filter.yaml # dry-run; exit 2 on schema error
CLI flags override values from the file when both are provided.
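Since validate is a dry run that exits with code 2 on a schema error, it can gate a CI step before a manifest is applied. The wrapper below is a minimal sketch: check_manifest is a hypothetical name, and the meaning of any non-zero exit code other than 2 is not documented above, so those are reported generically.

```shell
# Hypothetical wrapper around the documented validate subcommand.
# Usage: check_manifest <path-to-manifest>
check_manifest() {
  scorable otel-filter validate -f "$1"
  rc=$?
  case "$rc" in
    0) echo "ok: $1" ;;
    # Exit code 2 is the documented schema-error signal.
    2) echo "schema error: $1" >&2; return 2 ;;
    # Other codes are undocumented here; pass them through unchanged.
    *) echo "validate exited with code $rc: $1" >&2; return "$rc" ;;
  esac
}
```

In a pipeline, `check_manifest filter.yaml && scorable otel-filter create -f filter.yaml` would apply the manifest only when validation passes.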
Manifest shape
name: <string>
evaluator_id: <uuid> # exactly one of evaluator_id / judge_id
judge_id: <uuid>
sampling_rate: <0.0-1.0> # optional, default 1.0
delay_seconds: <int> # optional, default 10
is_active: <bool> # optional, default true
filter_criteria: # which traces to evaluate (same shape as --filter-criteria)
  conditions:
    - column: <string> # span_name, has_error, kind, status, attribute, resource, …
      operator: <string> # =, !=, contains, starts with, any of, none of, …
      value: <string|number|bool>
      key: <string> # required for "attribute" / "resource" sentinel columns
extractor_rules: # optional; how to extract input/output from matching spans
  - emit: <text|request_response|tool_pair|genai_messages>
    match: # optional; same shape as filter_criteria — empty matches every span
      conditions: [...]
    # … emit-specific fields, see below …
extractor_rules reference
Each rule has an emit kind, an optional match filter (same shape as filter_criteria.conditions), and emit-specific fields. Spans are walked in timestamp order; per span, the first rule whose match passes wins.
A rule set must be able to produce both user-side and agent-side content (a request_response or genai_messages rule alone qualifies; a single text role: user rule does not). Validation rejects sets that can't.
text — emit one MessageTurn per matching span.
- emit: text
  match: # optional
    conditions:
      - { column: span_name, operator: "=", value: claude_code.interaction }
  role: user # user | assistant
  locator: # where the value lives
    kind: span_attr # span_attr | event_attr | resource_attr
    key: user_prompt # attribute name
    event_name: <string> # required when kind: event_attr
    value_path: $.foo # optional JSONPath into the located value
  tool_name: <string> # only valid when role: assistant
request_response — emit a user turn + an assistant turn from one span. Common when an agent stores prompt and response as flat attributes (OpenInference's input.value / output.value).
- emit: request_response
  match: { ... } # optional
  input_locator: { kind: span_attr, key: input.value } # → user turn
  output_locator: { kind: span_attr, key: output.value } # → assistant turn
tool_pair — emit one assistant turn per matching span carrying a tool call (input + output combined into the turn content). Built for tool-execution spans like Claude Code's claude_code.tool event.
- emit: tool_pair
  match: { ... } # optional
  input_locator: { kind: event_attr, event_name: tool.output, key: bash_command }
  output_locator: { kind: event_attr, event_name: tool.output, key: output }
  tool_name: Bash # static; OR
  tool_name_locator: { kind: span_attr, key: tool_name } # dynamic
genai_messages — same parsing as the default extractor, but on attribute keys you choose. Use this if your service emits the standard gen_ai JSON-array shape under non-default attribute names.
- emit: genai_messages
  match: { ... } # optional
  input_locator: { kind: span_attr, key: my_framework.input.json }
  output_locator: { kind: span_attr, key: my_framework.output.json }
Locator details
| Field | Meaning |
| ------------ | ------------------------------------------------------------------------------------------------- |
| kind | span_attr (top-level), event_attr (inside a named event), or resource_attr (resource-level) |
| key | Attribute name to read |
| event_name | Only for event_attr — the OTel event whose attributes to read (e.g. tool.output) |
| value_path | Optional JSONPath into the located value; used when the attribute is itself JSON |
Reference manifests
examples/otel-filters/:
- openinference-agent.yaml — single-span agent with input.value / output.value (request_response).
- claude-code.yaml — Claude Code's interaction span + tool spans (text + tool_pair).
- genai-explicit.yaml — gen_ai messages under custom attribute keys.
OTEL Trace Querying
List traces
scorable otel-trace list
Options: --since / --start-time / --end-time (time window), --page-size, --cursor, --output table|json|csv (default table), --filter (repeatable raw expression), plus convenience shortcuts below.
Convenience flags — cover the common case, AND-combined with each other and with --filter:
| Flag | Effect |
| ----------------------- | -------------------------------------- |
| --service-name <name> | match resource.service.name = <name> |
| --has-error | only traces where some span errored |
| --root-name <substr> | substring match on the root span name |
| --span-name <substr> | substring match on any span's name |
| --agent-name <name> | match gen_ai.agent.name |
| --model <name> | match gen_ai.request.model |
| --tool <name> | match gen_ai.tool.name |
Time-window flags (mutually exclusive group):
scorable otel-trace list --since 1h # last hour
scorable otel-trace list --since 7d
scorable otel-trace list --start-time 2026-04-30T00:00:00Z --end-time 2026-05-01T00:00:00Z
Common one-liners:
# All traces from a specific agent in the last 24h, exported as CSV
scorable otel-trace list --since 24h --service-name my_agent --output csv > traces.csv
# Errored traces this week
scorable otel-trace list --since 7d --has-error
# Drill into traces that hit a specific tool
scorable otel-trace list --tool fetch_customer_data
For anything the shortcuts don't cover, use --filter directly. Format: column;type;key;operator;value, repeatable, AND-combined. If your instrumentation follows the OpenTelemetry GenAI semantic conventions — pydantic-ai, OpenLLMetry, Logfire, OpenAI/Anthropic SDKs with otel all do — every documented attribute is filterable without extra setup.
# Expensive runs — over 5k input tokens
scorable otel-trace list --since 24h \
--filter 'gen_ai.usage.input_tokens;number;gen_ai.usage.input_tokens;>;5000'
# Multi-turn conversation drill-down
scorable otel-trace list \
--filter 'gen_ai.conversation.id;string;gen_ai.conversation.id;=;conv_5j66'
# Filter on Scorable's own evaluation result spans
scorable otel-trace list \
--filter 'gen_ai.evaluation.name;string;gen_ai.evaluation.name;=;Truthfulness'
See scorable otel-trace list --help for the full column / operator / type reference.
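The five-field --filter format is easy to mistype by hand, so a script may prefer to assemble it from parts. The two functions below are hypothetical helpers, not CLI features; they only join fields with semicolons and make no attempt to escape a value that itself contains a semicolon, since no escaping rule is documented above.

```shell
# Hypothetical helpers for building --filter expressions.
# Usage: trace_filter <column> <type> <key> <operator> <value>
trace_filter() {
  printf '%s;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Usage: attr_filter <attribute> <type> <operator> <value>
# Repeats the attribute name as both column and key, as the examples above do.
attr_filter() {
  trace_filter "$1" "$2" "$1" "$3" "$4"
}

# Example (requires the scorable CLI, so it is left commented out):
# scorable otel-trace list --since 24h \
#   --filter "$(attr_filter gen_ai.usage.input_tokens number '>' 5000)"
```

Quote the operator when it is a shell metacharacter such as `>`, otherwise the shell treats it as a redirection.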
Inspect spans for a trace
scorable otel-trace spans <trace_id>
Options: --output table|json|csv. The JSON form returns the full span payload — attributes, events, status, kind, resource_attributes — which is what you typically want for debugging or piping to jq.
scorable otel-trace spans <trace_id> --output json | jq '.[0].span.attributes'
scorable otel-trace spans <trace_id> --output csv > spans.csv
Prompt Testing
Initialize a config file and run experiments:
scorable pt init
scorable pt run
Use a custom config path:
scorable pt run --config path/to/prompt-tests.yaml
The prompt-test command is an alias for pt.
Config file format
prompts:
  - "Extract info from: {{text}}"
inputs:
  - vars:
      text: "John Doe, [email protected]"
# Or use a dataset instead of inline inputs:
# dataset_id: "<uuid>"
models:
  - gpt-5.4
  - gemini-3-flash
evaluators:
  - name: Precision
  - name: Confidentiality
# Optional: enforce structured output
# response_schema:
#   type: object
#   properties:
#     name: { type: string }
Results are displayed in a table and a browser link is printed for the full comparison view.
Development
npm install
npm run build # compile TypeScript
npm test # run tests
npm run typecheck # type-check without emitting
npm run lint # lint with oxlint
npm run fmt # format with oxfmt