mahout-bench

v1.0.1

CLI benchmark for measuring and mitigating sycophancy in LLMs. Supports multi-provider execution, configurable judges, and long-running evaluation campaigns.

A CLI benchmark for measuring and mitigating sycophancy in large language models.

Mahout Bench moves from "is this model sycophantic?" to "what can I change to reduce that behavior?" It evaluates LLMs across configurable system prompts, inference hyperparameters, judges, providers, and sampling margins, so the same model can be compared under different mitigation strategies before you decide whether the model itself needs replacing.

The benchmark uses a margin-of-error sampling model to reduce generation and judge calls while preserving controlled comparisons. It supports full and reduced-call runs, automated CLI execution, an interactive TUI, resumable artifacts, multi-provider backends, and judge aferition workflows that compare candidate judges against GPT-4o reference labels.

Aferition (from Latin aferetio): the process of evaluating a judge's accuracy by measuring its agreement with a reference standard. In Mahout Bench, it refers to workflows that assess whether a candidate judge can reliably replace GPT-4o as the labeling authority.
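
As a rough illustration of what "agreement with a reference standard" can mean, the sketch below computes raw agreement and Cohen's kappa between a candidate judge's labels and reference labels. The README does not say which agreement metric Mahout Bench reports, so both metrics, the function names, and the label values here are assumptions for illustration only.

```typescript
// Hypothetical sketch: not the package's actual aferition code.
type Label = string;

interface Agreement {
  n: number;            // number of compared items
  rawAgreement: number; // fraction of items where both labels match
  cohensKappa: number;  // agreement corrected for chance
}

function countLabels(labels: Label[]): Map<Label, number> {
  const counts = new Map<Label, number>();
  labels.forEach((l) => counts.set(l, (counts.get(l) ?? 0) + 1));
  return counts;
}

function judgeAgreement(candidate: Label[], reference: Label[]): Agreement {
  if (candidate.length === 0 || candidate.length !== reference.length) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = candidate.length;
  const matches = candidate.filter((c, i) => c === reference[i]).length;
  const observed = matches / n;

  // Expected agreement if both judges labeled independently
  // with their own marginal label frequencies.
  const candCounts = countLabels(candidate);
  const refCounts = countLabels(reference);
  let expected = 0;
  candCounts.forEach((c, label) => {
    expected += (c / n) * ((refCounts.get(label) ?? 0) / n);
  });

  // If both judges always emit the same single label, kappa is undefined;
  // this sketch reports 1 in that degenerate case.
  const kappa = expected === 1 ? 1 : (observed - expected) / (1 - expected);
  return { n, rawAgreement: observed, cohensKappa: kappa };
}
```

Raw agreement is easy to read but can look high when one label dominates; kappa discounts the agreement expected by chance, which is why a sketch like this reports both.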

Install

npm install -g mahout-bench
mahout-bench bootstrap
mahout-bench setup
mahout-bench run --dry-smoke
mahout-bench status

By default, mutable data lives in ./.mahout-bench. Set MAHOUT_BENCH_HOME to use a shared or absolute data root:

export MAHOUT_BENCH_HOME="$HOME/.mahout-bench"
mahout-bench setup
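
The data-root rule above can be sketched as a small resolver. This is a minimal illustration, not the package's actual code: the helper name `resolveDataRoot` is hypothetical, and treating an empty variable as unset and resolving a relative `MAHOUT_BENCH_HOME` against the working directory are both assumptions.

```typescript
import * as path from "node:path";

// Hypothetical helper: the real resolution logic inside mahout-bench may differ.
function resolveDataRoot(
  env: Record<string, string | undefined>,
  cwd: string,
): string {
  const override = env["MAHOUT_BENCH_HOME"];
  // Assumption: an empty or whitespace-only value counts as "not set".
  if (override !== undefined && override.trim() !== "") {
    // path.resolve returns an absolute override unchanged and
    // resolves a relative one against cwd.
    return path.resolve(cwd, override);
  }
  // Default documented in the README: ./.mahout-bench under the working directory.
  return path.resolve(cwd, ".mahout-bench");
}
```

In a script this might be called as `resolveDataRoot(process.env, process.cwd())`.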

For non-interactive diagnostics:

mahout-bench --no-bootstrap run --self-test
mahout-bench bootstrap --help
mahout-bench --no-bootstrap status

Designed for long runs

The judge aferition workflow is designed for operational use: it can test alternative judges, record agreement against reference labels, compare margin configurations (full, 10pp, 8pp, 5pp, and others), and reuse aferition datasets so judge selection does not require repeating the most expensive calls. This makes Mahout Bench useful both for benchmark replication and for mitigation experiments over prompts, parameters, and model/provider choices.
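
To give a feel for how a margin setting such as 10pp or 5pp could translate into fewer calls, the sketch below applies the textbook sample-size formula for a proportion with a finite-population correction. The README does not state which formula Mahout Bench uses, so the formula, the 95% confidence level (z = 1.96), the worst-case proportion p = 0.5, and the function name are all assumptions.

```typescript
// Hypothetical sketch: textbook formula, NOT confirmed as Mahout Bench's sampling model.
function sampleSizeForMargin(
  populationSize: number,
  marginPp: number | "full",
  z = 1.96, // assumption: 95% confidence
  p = 0.5,  // assumption: worst-case proportion
): number {
  // "full" means every item is evaluated.
  if (marginPp === "full") return populationSize;

  const e = marginPp / 100; // percentage points to a fraction
  const n0 = (z * z * p * (1 - p)) / (e * e); // infinite-population size
  const n = n0 / (1 + (n0 - 1) / populationSize); // finite-population correction
  return Math.min(populationSize, Math.ceil(n));
}
```

Under these assumptions, a pool of 1000 items needs 88 samples at 10pp and 278 at 5pp, which shows why looser margins cut generation and judge calls sharply.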

Commands

mahout-bench bootstrap
mahout-bench setup
mahout-bench run --self-test
mahout-bench run --dry-smoke
mahout-bench run --validate-config
mahout-bench status
mahout-bench status --run "1"
mahout-bench status --run "1" --json
mahout-bench tui

Bench resume supports --resume-mode fast|check. The TUI asks for fast resume or checked resume before selecting the run. Fast resume reconstructs completed generation from existing responses*.jsonl files and continues without replaying every checkpoint hit; checked resume also writes resume_check_report.json in the run directory.
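
The idea behind fast resume, rebuilding the set of finished items from existing JSONL rows, can be sketched as follows. The row fields `item_id` and `response` are hypothetical, since the real `responses*.jsonl` schema is not documented here, and skipping unparseable lines is an assumption about how a truncated final line might be handled.

```typescript
// Hypothetical row shape: the real responses*.jsonl schema may differ.
interface ResponseRow {
  item_id: string;
  response: string;
}

// Collect the IDs of rows that parsed cleanly and carry a response.
function completedIds(jsonlText: string): Set<string> {
  const done = new Set<string>();
  jsonlText.split("\n").forEach((raw) => {
    const line = raw.trim();
    if (line === "") return;
    try {
      const row = JSON.parse(line) as Partial<ResponseRow>;
      if (typeof row.item_id === "string" && typeof row.response === "string") {
        done.add(row.item_id);
      }
    } catch {
      // Assumption: an interrupted run can leave a truncated last line;
      // such a row is treated as not completed and will be regenerated.
    }
  });
  return done;
}

// Items that still need a generation call after resuming.
function pendingIds(allIds: string[], jsonlText: string): string[] {
  const done = completedIds(jsonlText);
  return allIds.filter((id) => !done.has(id));
}
```

A checked resume would presumably add verification on top of this and record the outcome in `resume_check_report.json`; that part is not sketched here.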

setup downloads mahout-bench-data-v0.0.5.zip and its manifest from the vcanonici/mahout-bench GitHub Release, verifies SHA256 and size, extracts into the data root, and checks required dataset paths. Mahout Bench is the distributor/source of the setup bundle; ELEPHANT remains the upstream research/data origin and citation.
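
The SHA256 and size check described above can be sketched with Node's built-in `crypto` module. The manifest field names `sha256` and `size` are hypothetical, as the actual manifest format is not shown in this README.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape: field names are assumptions.
interface BundleManifest {
  sha256: string; // expected hex digest of the archive
  size: number;   // expected archive size in bytes
}

// Throws if the downloaded bytes do not match the manifest.
function verifyBundle(data: Uint8Array, manifest: BundleManifest): void {
  // Check size first: it is cheap and catches truncated downloads.
  if (data.byteLength !== manifest.size) {
    throw new Error(
      `size mismatch: expected ${manifest.size}, got ${data.byteLength}`,
    );
  }
  const actual = createHash("sha256").update(data).digest("hex");
  if (actual !== manifest.sha256.toLowerCase()) {
    throw new Error(`sha256 mismatch: expected ${manifest.sha256}, got ${actual}`);
  }
}
```

Verifying before extraction means a corrupted or tampered archive never reaches the data root.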

bootstrap is the recommended first command after npm install. It prepares the data root, asks for optional OpenRouter and MiniMax API keys, requires at least one local LM Studio or Ollama-compatible backend, lets you add any number of additional local-network or remote backends, and writes user-owned config under MAHOUT_BENCH_HOME or ./.mahout-bench. It never writes real secrets into the npm package. First interactive runs also offer bootstrap when no bootstrap marker exists, and the TUI includes bootstrap/configure providers.

Profiles in config/profiles are intentionally small: they define profile identity, generation hyperparameters, and system_prompt. The TUI chooses model, provider, pools, precision, judge, and output root at runtime. Use config/profile.example.toml as the copyable template for new profiles. Profiles that omit [datasets] use the default benchmark contract in datasets/full_results for OEQ.csv, AITA-YTA.csv, SS.csv, AITA-NTA-OG.csv, and AITA-NTA-FLIP.csv.

New generation traces include system_prompt_sha256 and system_prompt_chars, so runs can audit which system prompt was active without duplicating full prompt text in every raw_generation.jsonl row.
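
A fingerprint like this could be derived as in the sketch below. The two field names come from the README; the helper name, hashing the UTF-8 bytes of the prompt, and counting characters as UTF-16 code units (JavaScript's `length`) are assumptions.

```typescript
import { createHash } from "node:crypto";

interface PromptFingerprint {
  system_prompt_sha256: string;
  system_prompt_chars: number;
}

// Hypothetical helper: shows how a trace row could identify a prompt
// without storing its full text.
function promptFingerprint(systemPrompt: string): PromptFingerprint {
  return {
    // Assumption: the digest is taken over the UTF-8 encoding of the prompt.
    system_prompt_sha256: createHash("sha256")
      .update(systemPrompt, "utf8")
      .digest("hex"),
    // Assumption: "chars" means UTF-16 code units, i.e. String.prototype.length.
    system_prompt_chars: systemPrompt.length,
  };
}
```

Two runs that used the same system prompt produce the same digest, so an audit can group rows by `system_prompt_sha256` and look up the full text once.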

Status and ETA

mahout-bench status
mahout-bench status --run "1"
mahout-bench status --run "1" --json
mahout-bench status --output-root "$MAHOUT_BENCH_HOME/outputs/name_of_run"

The default output is intentionally verbose: it prints a human explanation plus a stable JSON block between ---BEGIN MAHOUT_STATUS_JSON--- and ---END MAHOUT_STATUS_JSON---. Use --json for automation.
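
When only the verbose output is available, for example in a captured log, the JSON block can be pulled out by its markers. The sketch below is a minimal illustration; the function name is hypothetical, and the shape of the JSON payload is left as `unknown` because it is not documented here.

```typescript
// Markers as documented in the README.
const BEGIN = "---BEGIN MAHOUT_STATUS_JSON---";
const END = "---END MAHOUT_STATUS_JSON---";

// Hypothetical helper: extracts and parses the JSON between the markers.
function extractStatusJson(output: string): unknown {
  const start = output.indexOf(BEGIN);
  const end = output.indexOf(END, start + BEGIN.length);
  if (start === -1 || end === -1) {
    throw new Error("status JSON markers not found");
  }
  // JSON.parse tolerates the surrounding newlines and whitespace.
  return JSON.parse(output.slice(start + BEGIN.length, end));
}
```

For new automation, `--json` remains the simpler route, since it avoids marker parsing entirely.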

Full human/agent documentation is available in docs/status.md.

Public package boundary

The npm package intentionally excludes:

  • AgentDATA/
  • private root AGENTS.md
  • .env / .ENV
  • real secrets
  • run outputs
  • dataset archives and extracted datasets
  • private DSI/Ollama tunnel configuration

Provider configuration is public and generic: LM Studio local endpoints, OpenRouter, and MiniMax. Put real API keys outside the package and point config at your own secret files or environment-managed copies.

LM Studio is treated as a passive provider. Mahout Bench does not load, unload, start, or tune LM Studio models; keep models loaded and configured in LM Studio GUI or your own operational tooling. The runner only sends HTTP inference requests with the selected model, context window, and generation hyperparameters.

The npm package includes a public AGENTS.md for AI coding agents. It documents the package architecture, data-root contract, validation commands, style rules, and packaging boundary without carrying private repository instructions.

How to Cite Mahout Bench

Citation metadata is available in CITATION.cff.

@software{canonici_mahout_bench_2026,
  author = {Vinicius Garcia Canonici and Luis Miguel da Rocha de Matos and Ana Paula de Carvalho Soares},
  title = {Mahout Bench: From Measuring to Mitigating Sycophancy in Large Language Models},
  version = {1.0.1},
  year = {2026},
  url = {https://github.com/vcanonici/mahout-bench},
  note = {Public TypeScript runner for measuring and mitigating sycophancy in large language models}
}

Authors:

  • Vinicius Garcia Canonici, ORCID 0009-0006-8269-9004, Departamento de Sistemas de Informacao (DSI), Universidade do Minho; CIPsi, Escola de Psicologia, Universidade do Minho
  • Luis Miguel da Rocha de Matos, Departamento de Sistemas de Informacao (DSI), Universidade do Minho
  • Ana Paula de Carvalho Soares, Departamento de Psicologia Basica, Escola de Psicologia, Universidade do Minho

Philosophy

Mahout Bench is released under MIT to maximize reuse, modification, forking, benchmarking, and integration into research and production workflows.

See docs/philosophy.md for the full statement.

Data Citation

Mahout Bench uses and adapts data and procedures from ELEPHANT / Social Sycophancy:

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. "ELEPHANT: Measuring and understanding social sycophancy in LLMs" / "Social Sycophancy: A Broader Understanding of LLM Sycophancy."

Links:

  • https://arxiv.org/abs/2505.13995
  • https://openreview.net/forum?id=igbRHKEiAs
  • https://github.com/myracheng/elephant

The upstream myracheng/elephant repository declares CC0-1.0 for its released material. Mahout Bench code is MIT licensed; the separate setup data bundle records Mahout Bench distribution metadata plus upstream attribution and license metadata in its manifest.