# BetterCall
An LLM tool-call reliability layer in a one-line wrapper. Eight full BFCL remote runs completed. Best result: 73.4% → 83.8%.
```ts
const tools = betterTools([searchTool, calculatorTool]);
```

No model means validate + block only. Add `repairModel` when you want automatic repair.
## Install

```bash
npm install @botbotgo/better-call
```

## LangGraph Quick Start
```ts
import { betterTools } from "@botbotgo/better-call";

// Validate + block: stop bad calls before execution.
const tools = betterTools([searchTool, calculatorTool]);
```

Pass a repair model to let BetterCall fix rejected calls automatically. This can be the same chat model your agent uses, or a separate cheaper or stronger model dedicated to repair:
```ts
// Validate + repair: validate, repair with a model, validate again, then execute.
const tools = betterTools([searchTool, calculatorTool], {
  repairModel: model,
});
```
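The wrapped array is a drop-in replacement wherever your agent expects LangGraph-style tools. A minimal sketch, assuming a prebuilt LangGraph ReAct agent (the `createReactAgent` wiring below is our assumption about your setup, not part of BetterCall):

```ts
// Illustrative only: BetterCall itself just needs tools shaped like { name, invoke(input) }.
import { createReactAgent } from "@langchain/langgraph/prebuilt";

const agent = createReactAgent({ llm: model, tools });
const result = await agent.invoke({
  messages: [{ role: "user", content: "What is 23 * 19?" }],
});
```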
## Run The BFCL Benchmark

```bash
npm run bench:bfcl
```

This prints the BFCL v4 targeted weak-category table used above. It is a BetterCall wrapper benchmark, not an official leaderboard submission.
For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit `f7cf735` or `pip install bfcl-eval==2025.12.17`.
To run the real-model benchmark against your own Ollama endpoint, install the BFCL package and point BetterCall at the package data:
```bash
python3 -m venv /tmp/better-call-bfcl-venv
/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile

OLLAMA_BASE_URL=http://127.0.0.1:11434 \
BENCH_MODELS=qwen3.5:0.8b \
BENCH_CATEGORIES=all \
BENCH_CASES_PER_CATEGORY=0 \
npm run bench:bfcl:real
```

For long categories, resume or shard a category with `BENCH_CASE_OFFSET` and `BENCH_CASES_PER_CATEGORY`; the benchmark still uses real model calls and counts request errors and timeouts as incorrect.
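For example, a sharded resume of one long category might look like this (the category, offset, and shard size are illustrative values, not recommendations):

```bash
# Resume live_multiple at case 500 and run a 250-case shard.
OLLAMA_BASE_URL=http://127.0.0.1:11434 \
BENCH_MODELS=qwen3.5:0.8b \
BENCH_CATEGORIES=live_multiple \
BENCH_CASE_OFFSET=500 \
BENCH_CASES_PER_CATEGORY=250 \
npm run bench:bfcl:real
```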
For a model/category matrix run, set `OLLAMA_BASE_URL` and run:

```bash
OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all
```

The runner makes real `/api/chat` tool-call requests. Benchmark JSON redacts the endpoint by default; set `BENCH_SHOW_ENDPOINT=1` only for private debugging.
## What It Catches

| Category | Failure | Example |
| --- | --- | --- |
| Tool selection | Unknown tool | `stock_price` instead of `stock_quote` |
| Tool selection | Irrelevant call | Model calls a tool when no tool should be used |
| Arguments | Missing required arg | Required `ticker` is missing |
| Arguments | Wrong arg name | `symbol` instead of `ticker` |
| Arguments | Wrong type | `"3"` where an integer is required |
| Schema | Invalid enum | `NASDAQ` where only `US`, `HK`, `CN` are allowed |
| Schema | Extra arg | `currency` when `additionalProperties: false` |
| Policy | Semantic validator rejection | Domain-specific validator rejects unsafe args |
In validate + block mode, BetterCall rejects these calls before execution. With `repairModel`, it asks the model to fix rejected calls, validates again, and only then executes.
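To make the schema-level checks concrete, here is a hypothetical argument schema (ours, for illustration; how schemas attach to tools follows your LangGraph tool definitions) that would trigger the type, enum, and extra-argument blocks above:

```ts
// Hypothetical stock_quote argument schema; illustrates what the validator checks.
const stockQuoteSchema = {
  type: "object",
  properties: {
    ticker: { type: "string" },                            // required; omitting it -> blocked
    market: { type: "string", enum: ["US", "HK", "CN"] },  // "NASDAQ" -> invalid enum, blocked
    limit: { type: "integer" },                            // "3" (a string) -> wrong type, blocked
  },
  required: ["ticker"],
  additionalProperties: false,                             // extra currency arg -> blocked
};
```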
## API

### betterTools
Wrap a LangGraph-style tools array.
```ts
// Validate + block.
const tools = betterTools([searchTool, calculatorTool]);

// Validate + repair.
const toolsWithRepair = betterTools([searchTool, calculatorTool], { repairModel: model });
```

`options` is optional. Each tool must expose `name` and `invoke(input)`. BetterCall preserves each tool's shape and wraps `invoke`.
Without `repairModel` or `repair`, BetterCall validates and blocks unsafe calls instead of fixing them. `repairModel` only needs an `invoke(input)` method, such as a LangChain chat model; when it is provided, BetterCall supplies the repair prompt and JSON parser.
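BetterCall therefore depends on just two minimal shapes. A sketch of the implied contracts (these interface names are ours, not exported by the package):

```ts
// Illustrative contracts inferred from the docs; not package exports.
interface MinimalTool {
  name: string;
  invoke(input: unknown): Promise<unknown>;
}

interface MinimalRepairModel {
  // e.g. a LangChain chat model; BetterCall supplies the prompt and JSON parsing.
  invoke(input: unknown): Promise<unknown>;
}
```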
Modes:
| Mode | Behavior |
| --- | --- |
| no model | Validate and block only |
| repairModel | Validate, repair rejected calls, then validate again |
| custom repair | Use your own repair function |
| review | Ask for a full self-check even when calls pass schema |
Default recommendation: start with validate + block, then add `repairModel` for small models or unreliable tool callers. `review` is more expensive and model-dependent. For full control, the custom repair mode lets you supply your own function, sketched below.
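The custom repair hook's exact signature is not documented in this README; the sketch below is a hedged guess, assuming the `repair` option mentioned above receives the rejected call and returns a corrected one. Verify against the package's types before relying on it:

```ts
// Hedged sketch: the repair callback shape is an assumption, not documented API.
const tools = betterTools([searchTool, calculatorTool], {
  repair: async (badCall: { name: string; args: Record<string, unknown> }) => {
    // Example fix: rename a commonly confused argument (symbol -> ticker).
    const { symbol, ...rest } = badCall.args;
    return symbol === undefined
      ? badCall
      : { ...badCall, args: { ...rest, ticker: symbol } };
  },
});
```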
`irrelevance` is one validation failure type: the model called a tool when no tool should be called. BetterCall also checks tool names, argument names, JSON schema conformance (types and enums), and runs semantic validators.
## Benchmark
Measured with real Ollama /api/chat calls over all supported BFCL v4 single-turn tool-call categories. Request errors and timeouts count as incorrect. This is not an official BFCL leaderboard score; it measures BetterCall as a runtime reliability layer.
Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
Performance after wrapping the same model outputs with BetterCall:
```
granite4.1:3b
  Raw        73.4% | #############################...........
  BetterCall 83.8% | ##################################......

qwen2.5:7b-instruct
  Raw        72.2% | #############################...........
  BetterCall 78.2% | ###############################.........

qwen3:0.6b
  Raw        55.5% | ######################..................
  BetterCall 63.6% | #########################...............

qwen3.5:0.8b
  Raw        54.6% | ######################..................
  BetterCall 56.9% | #######################.................

qwen3.5:2b
  Raw        53.9% | ######################..................
  BetterCall 54.9% | ######################..................

lfm2.5-thinking:latest
  Raw        50.8% | ####################....................
  BetterCall 54.8% | ######################..................

qwen3.5:4b
  Raw        43.6% | #################.......................
  BetterCall 43.4% | #################.......................

gemma4:e2b
  Raw        24.3% | ##########..............................
  BetterCall 24.7% | ##########..............................
```

| Rank | Model | Completed cases | Raw model | BetterCall | Lift | Request errors |
| ---: | --- | ---: | ---: | ---: | ---: | ---: |
| 1 | granite4.1:3b | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
| 2 | qwen2.5:7b-instruct | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
| 3 | qwen3:0.6b | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
| 4 | qwen3.5:0.8b | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
| 5 | qwen3.5:2b | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
| 6 | lfm2.5-thinking:latest | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
| 7 | qwen3.5:4b | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
| 8 | gemma4:e2b | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
Latest completed model category detail: `qwen3.5:4b`.
| Category | Cases | Raw | BetterCall repair | Lift | Request errors |
| --- | ---: | ---: | ---: | ---: | ---: |
| simple_python | 400 | 81.3% | 81.3% | +0.0pp | 54 |
| simple_java | 100 | 56.0% | 56.0% | +0.0pp | 32 |
| simple_javascript | 50 | 48.0% | 48.0% | +0.0pp | 18 |
| multiple | 200 | 83.5% | 83.5% | +0.0pp | 20 |
| parallel | 200 | 70.0% | 70.0% | +0.0pp | 45 |
| parallel_multiple | 200 | 47.0% | 47.0% | +0.0pp | 96 |
| irrelevance | 240 | 68.8% | 68.8% | +0.0pp | 75 |
| live_simple | 258 | 66.7% | 66.3% | -0.4pp | 45 |
| live_multiple | 1,053 | 41.6% | 41.0% | -0.6pp | 538 |
| live_parallel | 16 | 0.0% | 0.0% | +0.0pp | 16 |
| live_parallel_multiple | 24 | 0.0% | 0.0% | +0.0pp | 24 |
| live_irrelevance | 884 | 0.0% | 0.0% | +0.0pp | 884 |
This qwen3.5:4b run hit sustained remote request failures in the live categories; those failures are counted as incorrect by the benchmark.
Historical targeted wrapper benchmark:
| Model | Raw | BetterCall repair | Accuracy lift |
| --- | ---: | ---: | ---: |
| gemma4:e2b | 81.3% | 91.3% | +10.0pp |
| qwen3.5:2b | 75.3% | 84.0% | +8.7pp |
| qwen3.5:9b | 84.0% | 90.0% | +6.0pp |
| qwen3.5:4b | 82.0% | 87.3% | +5.3pp |
| granite4.1:3b | 66.0% | 69.3% | +3.3pp |
Strongest result: BFCL irrelevance, where the model should not call any tool.
| Model | Raw irrelevance | BetterCall repair |
| --- | ---: | ---: |
| qwen3.5:2b | 74% | 100% |
| qwen3.5:4b | 84% | 100% |
| qwen3.5:9b | 84% | 100% |
| granite4.1:3b | 92% | 100% |
| gemma4:e2b | 70% | 100% |
## Why It Exists

Small models are useful because they are cheap and fast. They also make tool mistakes:

- call a tool that does not exist
- use the wrong parameter names
- pass the wrong types
- call tools when no tool is relevant
- produce a call that looks valid but is unsafe to execute

BetterCall reduces those failures before they reach production tools.
