@ykstormsorg/quickdraw

v1.0.1

Published

12 days ago

Benchmark LLM streaming. TTFT, TPS, cost ceiling, nightly dashboard.

ykstormsorg

Quickdraw

Benchmark LLM streaming. TTFT, TPS, $/1K tokens. Across providers, on your prompts, with a hard cost ceiling.

Live dashboard with fresh numbers every morning: quickdraw.lakshyaraj.dev

How this started

LLM provider marketing benchmarks don't match your traffic. Quickdraw measures TTFT, TPS, guardrail overhead, and dollar cost across providers, on your prompts, with a hard cost ceiling so you never accidentally burn $42 on a stuck loop.

What it measures

| Metric | What it means | Why care |
|---|---|---|
| TTFT p50/p95 | Time to first streamed token | UX feel: perceived latency |
| TPS | Tokens per second once streaming starts | Throughput for long answers |
| Chunk shape | Mean tokens per SSE event | Smoothness on mobile vs desktop |
| Cold latency | First call after 60s idle | Cold-start tax for serverless |
| Guardrail overhead | Extra latency with a guard in the loop | Cost of safety |
| $/1K out tokens | Real dollar cost, ledgered per call | Your actual spend, not marketing rate |
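
As a rough illustration of how TTFT, TPS, and chunk shape could be derived from one streamed response, here is a minimal TypeScript sketch. The `ChunkEvent` shape and the `computeStreamMetrics` function are hypothetical names for this example, not Quickdraw's actual API. The TPS definition used here (tokens after the first event, divided by the time since the first event) is one reasonable reading of "once streaming starts", not a confirmed one.

```typescript
// Hypothetical record of one SSE event; not Quickdraw's real data model.
interface ChunkEvent {
  atMs: number; // arrival time in milliseconds
  tokens: number; // tokens carried by this event
}

interface StreamMetrics {
  ttftMs: number; // time to first streamed token
  tps: number; // tokens per second once streaming has started
  meanChunkTokens: number; // mean tokens per SSE event
}

function computeStreamMetrics(requestSentAtMs: number, events: ChunkEvent[]): StreamMetrics {
  // Ignore keep-alive or empty events that carry no tokens.
  const withTokens = events.filter((e) => e.tokens > 0);
  const first = withTokens[0];
  const last = withTokens[withTokens.length - 1];
  if (first === undefined || last === undefined) {
    throw new Error("no tokens were streamed");
  }
  const totalTokens = withTokens.reduce((sum, e) => sum + e.tokens, 0);
  const streamSeconds = (last.atMs - first.atMs) / 1000;
  const tokensAfterFirst = totalTokens - first.tokens;
  return {
    ttftMs: first.atMs - requestSentAtMs,
    // Assumption: a single-event response reports 0 TPS rather than dividing by zero.
    tps: streamSeconds > 0 ? tokensAfterFirst / streamSeconds : 0,
    meanChunkTokens: totalTokens / withTokens.length,
  };
}

const exampleMetrics = computeStreamMetrics(0, [
  { atMs: 300, tokens: 8 },
  { atMs: 400, tokens: 8 },
  { atMs: 500, tokens: 8 },
  { atMs: 600, tokens: 8 },
]);
console.log(exampleMetrics);
```

Keeping TTFT separate from TPS matters because a provider can be slow to start but fast once streaming, which a single total-latency number would hide.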

When to use Quickdraw and when not to

| You want this | Use |
|---|---|
| TTFT/TPS measured on your prompts, from your region, on any provider | Quickdraw |
| Eval correctness of LLM outputs (golden dataset, judge) | Goldset |
| Production observability of every LLM call | Helicone, LangFuse |
| Model quality benchmarks (MMLU, HELM, MT-Bench) | Standard harnesses |
| LLM router / gateway with auto-failover | LiteLLM, Portkey |

Quickdraw is for "should I switch providers?" decisions. It's not a router or a load balancer or an output validator. Pair it with Goldset when you want both speed AND correctness in one PR check.

60-second quickstart

npm install -g @ykstormsorg/quickdraw
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...

quickdraw bench --providers openai,anthropic --runs 5 --prompt-file ./prompts/your-prompt.md

Output:

Provider     Model              TTFT p50   TTFT p95   TPS    Chunk size   Cost/1K out
openai       gpt-4o-mini        289 ms     412 ms     78.2   8 tok        $0.0006
anthropic    claude-haiku-4-5   541 ms     720 ms     91.4   16 tok       $0.0010
openai       gpt-4o             520 ms     714 ms     54.7   12 tok       $0.0150

Results land in ./results/run-<timestamp>.json, and the cost ledger appends to ./api_calls.jsonl. The run halts before exceeding --cost-cap (default $2).
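
Because the ledger is JSON Lines, it is easy to post-process. The sketch below sums spend per provider from ledger text. The field names `provider` and `costUsd` are assumptions made for this example; check the rows in your own api_calls.jsonl for the real schema before relying on them.

```typescript
// Hypothetical ledger row; the real api_calls.jsonl schema may differ.
interface LedgerRow {
  provider: string;
  costUsd: number;
}

function totalCostByProvider(jsonl: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines such as a trailing newline
    const row = JSON.parse(trimmed) as LedgerRow;
    totals.set(row.provider, (totals.get(row.provider) ?? 0) + row.costUsd);
  }
  return totals;
}

// Inline sample data standing in for the contents of ./api_calls.jsonl.
const sampleLedger = [
  '{"provider":"openai","costUsd":0.5}',
  '{"provider":"anthropic","costUsd":0.25}',
  '{"provider":"openai","costUsd":0.25}',
  "",
].join("\n");

console.log(totalCostByProvider(sampleLedger));
```

In a real script you would read the file contents (for example with Node's `fs.readFileSync("./api_calls.jsonl", "utf8")`) and pass the string to the same function.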

The cost ceiling — why this matters

I once accidentally looped a benchmark in a Bash for-loop and woke up to a $42 OpenAI charge: eight hours of gpt-4o calls. My fault for not thinking through the loop, but an easy mistake to make.

Quickdraw enforces a hard cap by checking the budget before each call and halting if the next call would breach it. Set --cost-cap=2.00 and the runner halts at, say, $1.97 with partial results written. It is the kind of feature you wish had existed the first time you needed it.

Pair with a provider-side cap (OpenAI Usage limits / Anthropic spend limits per key) as belt-and-suspenders.
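
The check-before-call idea can be sketched in a few lines of TypeScript. This is an illustration of the technique under simplifying assumptions, not Quickdraw's implementation: the class and function names are made up for the example, and each call's estimated cost is treated as equal to its actual cost.

```typescript
// Illustrative budget guard; names and behavior are assumptions, not Quickdraw's code.
class CostTracker {
  private readonly capUsd: number;
  private spentUsd = 0;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // True if a call with this estimated cost would keep total spend within the cap.
  canAfford(estimatedUsd: number): boolean {
    return this.spentUsd + estimatedUsd <= this.capUsd;
  }

  record(actualUsd: number): void {
    this.spentUsd += actualUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}

interface CappedRunResult {
  completedCalls: number;
  spentUsd: number;
  halted: boolean;
}

function runWithCap(capUsd: number, callCostsUsd: number[]): CappedRunResult {
  const tracker = new CostTracker(capUsd);
  let completedCalls = 0;
  for (const cost of callCostsUsd) {
    // Check BEFORE the call, so the cap is never exceeded.
    if (!tracker.canAfford(cost)) {
      return { completedCalls, spentUsd: tracker.spent, halted: true };
    }
    tracker.record(cost); // a real runner would make the API call here
    completedCalls += 1;
  }
  return { completedCalls, spentUsd: tracker.spent, halted: false };
}

// Three calls at $0.75 each against a $2.00 cap: the third call is refused.
const cappedRun = runWithCap(2.0, [0.75, 0.75, 0.75]);
console.log(cappedRun); // { completedCalls: 2, spentUsd: 1.5, halted: true }
```

Checking before the call rather than after is what makes the cap hard: the run stops under the limit instead of discovering the overshoot once the money is spent.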

Run nightly, publish a public dashboard

# .github/workflows/nightly-bench.yml
- uses: ykstorm/quickdraw-bench-action@v1
  with:
    providers: openai,anthropic   # Bedrock support is planned for v1.2
    runs: 10
    prompt-file: ./bench/standard-prompt.md
    cost-cap: 5.00
    publish-to: gh-pages

GitHub Pages serves the static dashboard at your-site.github.io/quickdraw-results/. That's how quickdraw.lakshyaraj.dev works — a single Action that runs at 01:00 UTC, writes results to gh-pages, no backend, no SaaS.

What I'd build differently next time

Sample variance is bigger than I thought. Run count should default to 20, not 5. The TTFT distribution is fat-tailed — 5 runs can lie. v1.1 will bump the default.
Don't measure from one region. Multi-region runs (US-East vs Mumbai vs Singapore) tell you what your actual users see. v1.1 adds region rotation.
The cost estimator is wrong on tool-use responses. Function-calling responses also bill the tool-call argument tokens, which the current estimator misses. A fix is planned for v1.0.1.

If you're using Quickdraw to make a switch-provider decision today, pass --runs 20 for each provider and compare the p95, not the p50.
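
To see why 5 runs can lie, consider how a percentile is computed on a small sample. The sketch below uses the nearest-rank method, which is an assumption for illustration; Quickdraw's actual percentile method is not specified here, and the TTFT values are invented.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}

// Made-up TTFT samples in ms: four typical runs and one slow outlier.
const fiveRuns = [289, 301, 295, 310, 1200];
// Twenty runs: nineteen at 300 ms and the same single outlier.
const twentyRuns = [...Array.from({ length: 19 }, () => 300), 1200];

console.log(percentile(fiveRuns, 50)); // 301
console.log(percentile(fiveRuns, 95)); // 1200
console.log(percentile(twentyRuns, 95)); // 300
```

With five samples, the nearest-rank p95 is simply the slowest run, so one unlucky request defines the number. With twenty samples, a single outlier no longer decides the p95 on its own.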

How it compares

| | Quickdraw | OpenAI Eval | LangSmith | LiteLLM | rolling-your-own |
|---|---|---|---|---|---|
| TTFT measured separately from total latency | ✅ | ❌ | ✅ | partial | depends |
| Provider-agnostic | ✅ | ❌ | partial | ✅ | n/a |
| Hard cost ceiling | ✅ | ❌ | ❌ | ❌ | up to you |
| Nightly public dashboard from GH Action | ✅ | ❌ | paid tier | ❌ | up to you |
| Single-binary CLI | ✅ | ❌ | ❌ | ✅ | n/a |
| License | Apache 2.0 | proprietary | proprietary | MIT | n/a |

If you only need cost tracking, LangSmith does it (paid). If you need TTFT specifically and a CLI you can ship to CI, Quickdraw is what's open and free.

Architecture

graph LR
    CLI[quickdraw bench] --> Cost[CostTracker<br/>budget check]
    Cost --> Stop{over cap?}
    Stop -->|no| Provider[Provider adapter]
    Stop -->|yes| Halt[Halt + partial results]
    Provider --> Stream[Open stream]
    Stream --> First[Mark TTFT on first token]
    First --> Count[Count tokens until done]
    Count --> Log[Append JSONL row]
    Log --> Cost

Components map: docs/architecture.md.
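
The "open stream, mark TTFT on first token, count until done" portion of the flow might look roughly like the sketch below. It assumes the provider adapter exposes the response as an `AsyncIterable<string>` of text chunks, and it takes the clock as a parameter so timing is testable. `measureStream`, `Clock`, and `fakeProviderStream` are hypothetical names for this example, not part of Quickdraw.

```typescript
// A clock returns the current time in milliseconds; injected so tests are deterministic.
type Clock = () => number;

interface StreamTiming {
  ttftMs: number | null; // null if the stream produced no non-empty chunk
  chunks: number;
  totalMs: number;
}

async function measureStream(stream: AsyncIterable<string>, now: Clock): Promise<StreamTiming> {
  const startedAt = now();
  let firstAt: number | null = null;
  let chunks = 0;
  for await (const chunk of stream) {
    if (chunk.length === 0) continue; // ignore empty keep-alive chunks
    if (firstAt === null) firstAt = now(); // mark TTFT on the first real chunk
    chunks += 1;
  }
  const endedAt = now();
  return {
    ttftMs: firstAt === null ? null : firstAt - startedAt,
    chunks,
    totalMs: endedAt - startedAt,
  };
}

// Stand-in for a provider adapter: yields three text chunks.
async function* fakeProviderStream(): AsyncGenerator<string> {
  yield "Hel";
  yield "lo";
  yield "!";
}

// In real use the clock would be something like () => performance.now().
measureStream(fakeProviderStream(), () => Date.now()).then((timing) => {
  console.log(timing);
});
```

Injecting the clock is a design choice for this sketch: it lets a test feed fixed timestamps and assert exact TTFT values without depending on wall-clock time.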

Roadmap

[x] v1.0 — OpenAI + Anthropic providers, CLI + library, cost ceiling, JSON Lines ledger, GH Pages dashboard
[ ] v1.0.1 — fix tool-use cost estimator
[ ] v1.1 — default runs 20 instead of 5, multi-region rotation
[ ] v1.2 — Bedrock + Vertex + Mistral providers, Ollama local adapter
[ ] v1.3 — continuous mode (long-running daemon, time-series chart)

Tests + CI

npm test
DRY_RUN=true npm run bench:dry

CI runs lint → typecheck → tests → docker build → dry-run bench. Nightly Action runs a real bench against allowlisted providers and publishes to gh-pages.

Limits

Not a model quality benchmark. Use HELM, MMLU, or MT-Bench for that.
Not a regression blocker. Pair it with Goldset for that.
Streaming-format adapters exist only for OpenAI SSE and Anthropic at v1.0. Bedrock and Vertex land in v1.2.

License

Apache License 2.0 — see LICENSE.

Provenance

Built after a real provider-switch where marketing latency claims didn't match traffic measurements. Quickdraw answers the question in 4 minutes so you don't have to trust marketing copy.

Author

Lakshyaraj Singh Rao: Full-Stack Engineer · AI Systems · Backend · DevOps
Mumbai, India

lakshyaraj.dev · @ykstorm · LinkedIn