# BetterCall
An LLM tool-call reliability layer in a one-line wrapper. Eight full BFCL remote runs completed. Best result: 73.4% → 83.8%.
```ts
const tools = betterTools([searchTool, calculatorTool]);
```

No model means validate + block only. Add `repairModel` when you want automatic repair.
## Install

```bash
npm install @botbotgo/better-call
```

## LangGraph Quick Start
```ts
import { betterTools } from "@botbotgo/better-call";

// Validate + block: stop bad calls before execution.
const tools = betterTools([searchTool, calculatorTool]);
```

Pass a repair model to let BetterCall fix rejected calls automatically. This can be the same chat model your agent uses, or a separate cheaper or stronger model dedicated to repair:
```ts
// Validate + repair: validate, repair with a model, validate again, then execute.
const tools = betterTools([searchTool, calculatorTool], {
  repairModel: model,
});
```
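The wrapped array is a drop-in replacement wherever your agent expects LangGraph-style tools. A minimal sketch, assuming a prebuilt LangGraph ReAct agent (the `createReactAgent` wiring below is our assumption about your setup, not part of BetterCall):

```ts
// Illustrative only: BetterCall itself just needs tools shaped like { name, invoke(input) }.
import { createReactAgent } from "@langchain/langgraph/prebuilt";

const agent = createReactAgent({ llm: model, tools });
const result = await agent.invoke({
  messages: [{ role: "user", content: "What is 23 * 19?" }],
});
```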
## Run The BFCL Benchmark

```bash
npm run bench:bfcl
```

This prints the BFCL v4 targeted weak-category table used above. It is a BetterCall wrapper benchmark, not an official leaderboard submission.
For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit `f7cf735` or `pip install bfcl-eval==2025.12.17`.
To run the real-model benchmark against your own Ollama endpoint, install the BFCL package and point BetterCall at the package data:
```bash
python3 -m venv /tmp/better-call-bfcl-venv
/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile

OLLAMA_BASE_URL=http://127.0.0.1:11434 \
BENCH_MODELS=qwen3.5:0.8b \
BENCH_CATEGORIES=all \
BENCH_CASES_PER_CATEGORY=0 \
npm run bench:bfcl:real
```

For long categories, resume or shard a category with `BENCH_CASE_OFFSET` and `BENCH_CASES_PER_CATEGORY`; the benchmark still uses real model calls and counts request errors and timeouts as incorrect.
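For example, a sharded resume of one long category might look like this (the category, offset, and shard size are illustrative values, not recommendations):

```bash
# Resume live_multiple at case 500 and run a 250-case shard.
OLLAMA_BASE_URL=http://127.0.0.1:11434 \
BENCH_MODELS=qwen3.5:0.8b \
BENCH_CATEGORIES=live_multiple \
BENCH_CASE_OFFSET=500 \
BENCH_CASES_PER_CATEGORY=250 \
npm run bench:bfcl:real
```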
For a model/category matrix run, set `OLLAMA_BASE_URL` and run:

```bash
OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all
```

The runner makes real `/api/chat` tool-call requests. Benchmark JSON redacts the endpoint by default; set `BENCH_SHOW_ENDPOINT=1` only for private debugging.
## What It Catches

| Category | Failure | Example |
| --- | --- | --- |
| Tool selection | Unknown tool | `stock_price` instead of `stock_quote` |
| Tool selection | Irrelevant call | Model calls a tool when no tool should be used |
| Arguments | Missing required arg | Required `ticker` is missing |
| Arguments | Wrong arg name | `symbol` instead of `ticker` |
| Arguments | Wrong type | `"3"` where an integer is required |
| Schema | Invalid enum | `NASDAQ` where only `US`, `HK`, `CN` are allowed |
| Schema | Extra arg | `currency` when `additionalProperties: false` |
| Policy | Semantic validator rejection | Domain-specific validator rejects unsafe args |
In validate + block mode, BetterCall rejects these calls before execution. With `repairModel`, it asks the model to fix rejected calls, validates again, and only then executes.
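To make the schema-level checks concrete, here is a hypothetical argument schema (ours, for illustration; how schemas attach to tools follows your LangGraph tool definitions) that would trigger the type, enum, and extra-argument blocks above:

```ts
// Hypothetical stock_quote argument schema; illustrates what the validator checks.
const stockQuoteSchema = {
  type: "object",
  properties: {
    ticker: { type: "string" },                            // required; omitting it -> blocked
    market: { type: "string", enum: ["US", "HK", "CN"] },  // "NASDAQ" -> invalid enum, blocked
    limit: { type: "integer" },                            // "3" (a string) -> wrong type, blocked
  },
  required: ["ticker"],
  additionalProperties: false,                             // extra currency arg -> blocked
};
```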
## API

### betterTools
Wrap a LangGraph-style tools array.
```ts
// Validate + block.
const tools = betterTools([searchTool, calculatorTool]);

// Validate + repair.
const toolsWithRepair = betterTools([searchTool, calculatorTool], { repairModel: model });
```

`options` is optional. Each tool must expose `name` and `invoke(input)`. BetterCall preserves each tool's shape and wraps `invoke`.
Without `repairModel` or `repair`, BetterCall validates and blocks unsafe calls instead of fixing them. `repairModel` only needs an `invoke(input)` method, such as a LangChain chat model; when it is provided, BetterCall supplies the repair prompt and JSON parser.
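BetterCall therefore depends on just two minimal shapes. A sketch of the implied contracts (these interface names are ours, not exported by the package):

```ts
// Illustrative contracts inferred from the docs; not package exports.
interface MinimalTool {
  name: string;
  invoke(input: unknown): Promise<unknown>;
}

interface MinimalRepairModel {
  // e.g. a LangChain chat model; BetterCall supplies the prompt and JSON parsing.
  invoke(input: unknown): Promise<unknown>;
}
```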
Modes:
| Mode | Behavior |
| --- | --- |
| no model | Validate and block only |
| repairModel | Validate, repair rejected calls, then validate again |
| custom repair | Use your own repair function |
| review | Ask for a full self-check even when calls pass schema |
Default recommendation: start with validate + block, then add `repairModel` for small models or unreliable tool callers. `review` is more expensive and model-dependent. For full control, the custom repair mode lets you supply your own function, sketched below.
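The custom repair hook's exact signature is not documented in this README; the sketch below is a hedged guess, assuming the `repair` option mentioned above receives the rejected call and returns a corrected one. Verify against the package's types before relying on it:

```ts
// Hedged sketch: the repair callback shape is an assumption, not documented API.
const tools = betterTools([searchTool, calculatorTool], {
  repair: async (badCall: { name: string; args: Record<string, unknown> }) => {
    // Example fix: rename a commonly confused argument (symbol -> ticker).
    const { symbol, ...rest } = badCall.args;
    return symbol === undefined
      ? badCall
      : { ...badCall, args: { ...rest, ticker: symbol } };
  },
});
```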
`irrelevance` is one validation failure type: the model called a tool when no tool should be called. BetterCall also checks tool names, argument names, JSON schema conformance (types and enums), and runs semantic validators.
## Benchmark
Measured with real Ollama /api/chat calls over all supported BFCL v4 single-turn tool-call categories. Request errors and timeouts count as incorrect. This is not an official BFCL leaderboard score; it measures BetterCall as a runtime reliability layer.
Latest completed remote run artifact: `benchmarks/bfcl-real-remote-completed-summary.json`.
Performance after wrapping the same model outputs with BetterCall:
```
granite4.1:3b
  Raw        73.4% | #############################...........
  BetterCall 83.8% | ##################################......

qwen2.5:7b-instruct
  Raw        72.2% | #############################...........
  BetterCall 78.2% | ###############################.........

qwen3:0.6b
  Raw        55.5% | ######################..................
  BetterCall 63.6% | #########################...............

qwen3.5:0.8b
  Raw        54.6% | ######################..................
  BetterCall 56.9% | #######################.................

qwen3.5:2b
  Raw        53.9% | ######################..................
  BetterCall 54.9% | ######################..................

lfm2.5-thinking:latest
  Raw        50.8% | ####################....................
  BetterCall 54.8% | ######################..................

qwen3.5:4b
  Raw        43.6% | #################.......................
  BetterCall 43.4% | #################.......................

gemma4:e2b
  Raw        24.3% | ##########..............................
  BetterCall 24.7% | ##########..............................
```

| Rank | Model | Completed cases | Raw model | BetterCall | Lift | Request errors |
| ---: | --- | ---: | ---: | ---: | ---: | ---: |
| 1 | granite4.1:3b | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
| 2 | qwen2.5:7b-instruct | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
| 3 | qwen3:0.6b | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
| 4 | qwen3.5:0.8b | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
| 5 | qwen3.5:2b | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
| 6 | lfm2.5-thinking:latest | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
| 7 | qwen3.5:4b | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
| 8 | gemma4:e2b | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |
Latest completed model category detail: `qwen3.5:4b`.
| Category | Cases | Raw | BetterCall repair | Lift | Request errors |
| --- | ---: | ---: | ---: | ---: | ---: |
| simple_python | 400 | 81.3% | 81.3% | +0.0pp | 54 |
| simple_java | 100 | 56.0% | 56.0% | +0.0pp | 32 |
| simple_javascript | 50 | 48.0% | 48.0% | +0.0pp | 18 |
| multiple | 200 | 83.5% | 83.5% | +0.0pp | 20 |
| parallel | 200 | 70.0% | 70.0% | +0.0pp | 45 |
| parallel_multiple | 200 | 47.0% | 47.0% | +0.0pp | 96 |
| irrelevance | 240 | 68.8% | 68.8% | +0.0pp | 75 |
| live_simple | 258 | 66.7% | 66.3% | -0.4pp | 45 |
| live_multiple | 1,053 | 41.6% | 41.0% | -0.6pp | 538 |
| live_parallel | 16 | 0.0% | 0.0% | +0.0pp | 16 |
| live_parallel_multiple | 24 | 0.0% | 0.0% | +0.0pp | 24 |
| live_irrelevance | 884 | 0.0% | 0.0% | +0.0pp | 884 |
This qwen3.5:4b run hit sustained remote request failures in the live categories; those failures are counted as incorrect by the benchmark.
Historical targeted wrapper benchmark:
| Model | Raw | BetterCall repair | Accuracy lift |
| --- | ---: | ---: | ---: |
| gemma4:e2b | 81.3% | 91.3% | +10.0pp |
| qwen3.5:2b | 75.3% | 84.0% | +8.7pp |
| qwen3.5:9b | 84.0% | 90.0% | +6.0pp |
| qwen3.5:4b | 82.0% | 87.3% | +5.3pp |
| granite4.1:3b | 66.0% | 69.3% | +3.3pp |
Strongest result: BFCL irrelevance, where the model should not call any tool.
| Model | Raw irrelevance | BetterCall repair |
| --- | ---: | ---: |
| qwen3.5:2b | 74% | 100% |
| qwen3.5:4b | 84% | 100% |
| qwen3.5:9b | 84% | 100% |
| granite4.1:3b | 92% | 100% |
| gemma4:e2b | 70% | 100% |
## Why It Exists

Small models are useful because they are cheap and fast. They also make tool mistakes:

- call a tool that does not exist
- use the wrong parameter names
- pass the wrong types
- call tools when no tool is relevant
- produce a call that looks valid but is unsafe to execute

BetterCall reduces those failures before they reach production tools.
