
@botbotgo/better-call v0.1.14

LLM tool-call reliability layer.

Downloads: 1,467

BetterCall

A one-line wrapper that validates, blocks, and optionally repairs LLM tool calls. Eight full BFCL remote runs completed; best result: 73.4% raw → 83.8% with BetterCall.

const tools = betterTools([searchTool, calculatorTool]);

Without a model, betterTools validates and blocks bad calls only. Add repairModel when you want automatic repair.

Install

npm install @botbotgo/better-call

LangGraph Quick Start

import { betterTools } from "@botbotgo/better-call";

// Validate + block: stop bad calls before execution.
const tools = betterTools([searchTool, calculatorTool]);

Pass a repair model to let BetterCall fix rejected calls automatically. This can be the same chat model your agent uses, or a separate cheaper/stronger model dedicated to repair:

// Validate + repair: validate, repair with a model, validate again, then execute.
const tools = betterTools([searchTool, calculatorTool], {
  repairModel: model,
});
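
The wrapped array drops into LangGraph wherever a plain tools array would go. A minimal end-to-end sketch, assuming LangGraph's prebuilt ReAct agent and a ChatOllama chat model (both from LangChain packages, not part of BetterCall); searchTool and calculatorTool are your own tools:

import { ChatOllama } from "@langchain/ollama";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
import { betterTools } from "@botbotgo/better-call";

// Any LangChain chat model works; the same model doubles as the repair model here.
const model = new ChatOllama({ model: "qwen2.5:7b-instruct" });
const tools = betterTools([searchTool, calculatorTool], { repairModel: model });

// The wrapped tools behave like the originals, so the agent needs no changes.
const agent = createReactAgent({ llm: model, tools });
const result = await agent.invoke({
  messages: [{ role: "user", content: "What is 23 * 19?" }],
});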

Run The BFCL Benchmark

npm run bench:bfcl

This prints the BFCL v4 targeted weak-category table shown in the Benchmark section below. It is a BetterCall wrapper benchmark, not an official leaderboard submission.

For the full official BFCL v4 reproduction, use Berkeley's published checkpoint/package: commit f7cf735 or pip install bfcl-eval==2025.12.17.

To run the real-model benchmark against your own Ollama endpoint, install the BFCL package and point BetterCall at the package data:

python3 -m venv /tmp/better-call-bfcl-venv
/tmp/better-call-bfcl-venv/bin/pip install "bfcl-eval==2025.12.17" soundfile

OLLAMA_BASE_URL=http://127.0.0.1:11434 \
BENCH_MODELS=qwen3.5:0.8b \
BENCH_CATEGORIES=all \
BENCH_CASES_PER_CATEGORY=0 \
npm run bench:bfcl:real

For long categories, resume or shard a category with BENCH_CASE_OFFSET and BENCH_CASES_PER_CATEGORY; the benchmark still uses real model calls and counts request errors/timeouts as incorrect.
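
For example, to rerun just live_multiple in a 200-case shard starting at case 400 (the offset and shard size here are illustrative):

OLLAMA_BASE_URL=http://127.0.0.1:11434 \
BENCH_MODELS=qwen3.5:0.8b \
BENCH_CATEGORIES=live_multiple \
BENCH_CASE_OFFSET=400 \
BENCH_CASES_PER_CATEGORY=200 \
npm run bench:bfcl:real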

For a model/category matrix run, set OLLAMA_BASE_URL and run:

OLLAMA_BASE_URL=http://127.0.0.1:11434 npm run bench:bfcl:remote-small-all

The runner makes real /api/chat tool-call requests. Benchmark JSON redacts the endpoint by default; set BENCH_SHOW_ENDPOINT=1 only for private debugging.

What It Catches

| Category | Failure | Example |
| --- | --- | --- |
| Tool selection | Unknown tool | stock_price instead of stock_quote |
| Tool selection | Irrelevant call | Model calls a tool when no tool should be used |
| Arguments | Missing required arg | Missing required ticker |
| Arguments | Wrong arg name | symbol instead of ticker |
| Arguments | Wrong type | "3" where an integer is required |
| Schema | Invalid enum | NASDAQ where only US, HK, CN are allowed |
| Schema | Extra arg | currency when additionalProperties: false |
| Policy | Semantic validator rejection | Domain-specific validator rejects unsafe args |

In validate + block mode, BetterCall rejects these calls before execution. With repairModel, it asks the model to fix rejected calls, validates again, and only then executes.
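
Conceptually, the wrapped invoke follows the flow below. This is a simplified sketch of the behavior just described, not BetterCall's actual source; validate, repairPrompt, and parseRepairedArgs are hypothetical helpers:

// Sketch of the validate -> repair -> validate -> execute flow (illustrative only).
async function guardedInvoke(tool, input, options) {
  let issues = validate(tool, input); // tool name, arg names, types, enums, policy
  if (issues.length > 0 && options.repairModel) {
    // BetterCall supplies the repair prompt and JSON parser.
    const reply = await options.repairModel.invoke(repairPrompt(tool, input, issues));
    input = parseRepairedArgs(reply);
    issues = validate(tool, input); // validate again after repair
  }
  if (issues.length > 0) {
    throw new Error(`Blocked unsafe tool call: ${issues.join("; ")}`);
  }
  return tool.invoke(input); // only validated calls reach the real tool
}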

API

betterTools

Wrap a LangGraph-style tools array.

// Validate + block.
const tools = betterTools([searchTool, calculatorTool]);

// Validate + repair.
const toolsWithRepair = betterTools([searchTool, calculatorTool], { repairModel: model });

options is optional. Each tool must expose name and invoke(input). BetterCall preserves each tool's shape and wraps invoke.

Without repairModel or repair, BetterCall validates and blocks unsafe calls instead of fixing them. repairModel only needs an invoke(input) method, such as a LangChain chat model. If it is provided, BetterCall supplies the repair prompt and JSON parser.
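
Both contracts are small. A sketch of objects that satisfy them; any field beyond name and invoke is illustrative, not required by the package:

// Minimal LangGraph-style tool: name plus invoke(input) is all BetterCall requires.
const searchTool = {
  name: "search",
  invoke: async ({ query }) => `results for: ${query}`,
};

// Minimal repairModel: anything exposing invoke(input), e.g. a LangChain chat model.
const repairModel = {
  invoke: async (prompt) => '{ "query": "repaired arguments" }',
};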

Modes:

| Mode | Behavior |
| --- | --- |
| no model | Validate and block only |
| repairModel | Validate, repair rejected calls, then validate again |
| custom repair | Use your own repair function |
| review | Ask for a full self-check even when calls pass schema |

Default recommendation: start with validate + block, then add repairModel for small models or unreliable tool callers. review is more expensive and model-dependent.
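
Custom repair uses the repair option mentioned above. Its exact callback signature is not documented here, so this sketch assumes it receives the rejected call and the validation issues and returns fixed arguments; check the package types before relying on it:

// Assumed repair callback shape: (call, issues) => fixed arguments. Illustrative only.
const tools = betterTools([searchTool, calculatorTool], {
  repair: async (call, issues) => {
    // Example: map a known-bad argument name before re-validation.
    if (issues.some((issue) => String(issue).includes("ticker"))) {
      const { symbol, ...rest } = call.args;
      return { ...rest, ticker: symbol };
    }
    return call.args;
  },
});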

irrelevance is one validation failure type: the model called a tool when no tool should be called. BetterCall also validates tool names, argument names, JSON schema, types, enums, and semantic validators.

Benchmark

Measured with real Ollama /api/chat calls over all supported BFCL v4 single-turn tool-call categories. Request errors and timeouts count as incorrect. This is not an official BFCL leaderboard score; it measures BetterCall as a runtime reliability layer.

Latest completed remote run artifact: benchmarks/bfcl-real-remote-completed-summary.json.

Performance after wrapping the same model outputs with BetterCall:

granite4.1:3b
  Raw         73.4% | #############################...........
  BetterCall  83.8% | ##################################......
qwen2.5:7b-instruct
  Raw         72.2% | #############################...........
  BetterCall  78.2% | ###############################.........
qwen3:0.6b
  Raw         55.5% | ######################..................
  BetterCall  63.6% | #########################...............
qwen3.5:0.8b
  Raw         54.6% | ######################..................
  BetterCall  56.9% | #######################.................
qwen3.5:2b
  Raw         53.9% | ######################..................
  BetterCall  54.9% | ######################..................
lfm2.5-thinking:latest
  Raw         50.8% | ####################....................
  BetterCall  54.8% | ######################..................
qwen3.5:4b
  Raw         43.6% | #################.......................
  BetterCall  43.4% | #################.......................
gemma4:e2b
  Raw         24.3% | ##########..............................
  BetterCall  24.7% | ##########..............................

| Rank | Model | Completed cases | Raw model | BetterCall | Lift | Request errors |
| ---: | --- | ---: | ---: | ---: | ---: | ---: |
| 1 | granite4.1:3b | 3,625 | 73.4% | 83.8% | +10.4pp | 25 |
| 2 | qwen2.5:7b-instruct | 3,625 | 72.2% | 78.2% | +5.9pp | 80 |
| 3 | qwen3:0.6b | 3,625 | 55.5% | 63.6% | +8.2pp | 217 |
| 4 | qwen3.5:0.8b | 3,625 | 54.6% | 56.9% | +2.3pp | 901 |
| 5 | qwen3.5:2b | 3,625 | 53.9% | 54.9% | +1.0pp | 1,308 |
| 6 | lfm2.5-thinking:latest | 3,625 | 50.8% | 54.8% | +4.0pp | 1,142 |
| 7 | qwen3.5:4b | 3,625 | 43.6% | 43.4% | -0.2pp | 1,847 |
| 8 | gemma4:e2b | 3,625 | 24.3% | 24.7% | +0.4pp | 2,641 |

Category-level detail for the latest completed model run, qwen3.5:4b:

| Category | Cases | Raw | BetterCall repair | Lift | Request errors |
| --- | ---: | ---: | ---: | ---: | ---: |
| simple_python | 400 | 81.3% | 81.3% | +0.0pp | 54 |
| simple_java | 100 | 56.0% | 56.0% | +0.0pp | 32 |
| simple_javascript | 50 | 48.0% | 48.0% | +0.0pp | 18 |
| multiple | 200 | 83.5% | 83.5% | +0.0pp | 20 |
| parallel | 200 | 70.0% | 70.0% | +0.0pp | 45 |
| parallel_multiple | 200 | 47.0% | 47.0% | +0.0pp | 96 |
| irrelevance | 240 | 68.8% | 68.8% | +0.0pp | 75 |
| live_simple | 258 | 66.7% | 66.3% | -0.4pp | 45 |
| live_multiple | 1,053 | 41.6% | 41.0% | -0.6pp | 538 |
| live_parallel | 16 | 0.0% | 0.0% | +0.0pp | 16 |
| live_parallel_multiple | 24 | 0.0% | 0.0% | +0.0pp | 24 |
| live_irrelevance | 884 | 0.0% | 0.0% | +0.0pp | 884 |

This qwen3.5:4b run hit sustained remote request failures in the live categories; those failures are counted as incorrect by the benchmark.

Historical targeted wrapper benchmark:

| Model | Raw | BetterCall repair | Accuracy lift |
| --- | ---: | ---: | ---: |
| gemma4:e2b | 81.3% | 91.3% | +10.0pp |
| qwen3.5:2b | 75.3% | 84.0% | +8.7pp |
| qwen3.5:9b | 84.0% | 90.0% | +6.0pp |
| qwen3.5:4b | 82.0% | 87.3% | +5.3pp |
| granite4.1:3b | 66.0% | 69.3% | +3.3pp |

Strongest result: BFCL irrelevance, where the model should not call any tool.

| Model | Raw irrelevance | BetterCall repair |
| --- | ---: | ---: |
| qwen3.5:2b | 74% | 100% |
| qwen3.5:4b | 84% | 100% |
| qwen3.5:9b | 84% | 100% |
| granite4.1:3b | 92% | 100% |
| gemma4:e2b | 70% | 100% |

Why It Exists

Small models are useful because they are cheap and fast. They also make tool mistakes:

  • call a tool that does not exist
  • use the wrong parameter names
  • pass the wrong types
  • call tools when no tool is relevant
  • produce a call that looks valid but is unsafe to execute

BetterCall reduces those failures before they reach production tools.

License

Apache-2.0. See LICENSE and NOTICE.