node-llm-test

v0.18.5

Generate tests to evaluate the intelligence of large language models.

  • See the web app at: https://marketeer.snowdon.dev/tools/llmtest-online/.
  • Don't want to keep running tests? Sign up for the periodic newsletter containing results from the leading agents, and stay informed without spinning up infra. Just email: [email protected].
  • The latest information is available on the web app.
  • Issue reports welcome. Just let me know.

| Make    | Model | Level  | Score       |
| ------- | ----- | ------ | ----------- |
| ChatGPT | 5.1   | Warden | 3           |
| Gemini  | 3     | Expert | 5           |
| Kimi    | k2    | Warden | 3 (warning) |
| Grok    | 4     | Warden | 3 (warning) |
| Sonnet  | 4     | Unable | -1          |

A test to evaluate the various intelligence modes of large language models. A linear-time algorithm for generating infinitely many test instances.

Traditional evaluation methods often rely on assessing a model’s ability to answer previously known questions—tasks that can frequently be addressed using older information-retrieval approaches. Contemporary benchmarks therefore emphasize agentic workflows, interactive environments, and long-horizon tasks intended to measure an AI system’s capacity for planning, adaptation, and multi-step reasoning.

However, current agentic evaluations exhibit several limitations. Test instructions are typically conveyed through inefficient or verbose formats, resulting in unnecessarily high token usage. In addition, such benchmarks have been shown to permit unintended shortcuts: models may consult strategy guides, analyze repository histories, or otherwise circumvent the intended reasoning process.

The test proposed here offers a more efficient and controlled alternative.

It provides an agent-like experience while using a minimal and token-efficient prompt format. The evaluation consists of a concise logical puzzle defined within a domain that is natural for computational agents. The task may involve multiple steps and may allow limited tool usage, but its structure is intentionally designed so that success requires authentic reasoning rather than memorization or external lookup.

This approach yields a benchmark that can generate unlimited test instances and is lightweight, cost-effective, resistant to exploitation, and better aligned with the objective of evaluating true reasoning capabilities.

Installation

Install using any package manager with the NPM registry.

npm install node-llm-test

Install globally as a command.

npm install --global node-llm-test

Commercial use of this code package requires permission. Please contact me at [email protected] if you intend to use it for such purposes. The web app, however, is freely available at your convenience. To learn more, watch this video from the Oxford AI Chair (not me): https://www.youtube.com/watch?v=7-UzV9AZKeU.

Quick Usage

  • Create a Codespace on this repository.
  • Then simply run the command llmtest in the terminal.

Or:

import { Puzzle, Feature } from "node-llm-test";
const puzzle = Puzzle.New();
const result = puzzle.run(console.log);
const answer = puzzle.answer(result, "someanswer");

Or, from a terminal:

npx llmtest --interactive

The puzzle

Quick fox lookup test. A test that is solved by the simple act of looking things up.

LLMs cannot solve this simple puzzle effectively because they rely on statistical patterns, which this test disrupts by introducing a constructed language with minimal statistical grounding. If an LLM could truly reason, the task would be trivial. While identifying the missing word is difficult for a human, especially someone like me who doesn’t typically engage with grammatical puzzles, it should, in theory, be simple for an LLM. After all, it knows what a pangram is, it knows the common variations of pangrams, and there are many resources that can be looked up.

For example, given the sequence “the quick brown [..]”, an LLM will almost always guess the correct next word based purely on probability. However, to pass this test, the LLM must go beyond prediction. It must first reason through the encoded sentence, translate it into natural English, complete the sentence, and then convert it back to the encoded form to select the correct missing word.
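The decode, complete, re-encode procedure described above can be sketched in a few lines. This is a minimal illustration, not the package's own code: the helper name solve is hypothetical, the data is copied from the Level 0 example further down in this README, and it assumes the left column of the table holds the symbol and the right column the real word.

```javascript
// Mapping table from the Level 0 example: [symbol, real word].
const level0Table = [
  ["vex", "waltz"],
  ["quick", "fjords"],
  ["fjords", "Big"],
  ["Big", "nymph"],
  ["waltz", "vex"],
  ["nymph", "quick"],
];

// Hypothetical helper: decode, complete against a known pangram, re-encode.
function solve(table, symbolised, pangrams) {
  const toReal = new Map(table);
  const toSymbol = new Map(table.map(([symbol, real]) => [real, symbol]));
  const gap = symbolised.indexOf("[...]");

  // Step 1: translate every known symbol into natural English.
  const decoded = symbolised.map((s, i) => (i === gap ? null : toReal.get(s)));

  // Step 2: find a pangram that agrees with every decoded word.
  for (const pangram of pangrams) {
    const words = pangram.split(" ");
    if (words.length !== decoded.length) continue;
    if (decoded.every((w, i) => i === gap || w === words[i])) {
      // Step 3: convert the missing real word back to its symbol.
      return toSymbol.get(words[gap]);
    }
  }
  return undefined; // no known pangram fits
}

const level0Sentence = "fjords quick waltz nymph vex [...]".split(" ");
console.log(solve(level0Table, level0Sentence, ["Big fjords vex quick waltz nymph"]));
```

Each step is a plain lookup. The point of the test is that the model has to carry out all three rather than predict the next word.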

The test also reduces to a simple deterministic answer that is very easy to compute and check. Unlike other puzzles that rely on probabilistic outcomes, external datasets, or auxiliary GPT/LLM models, results here are unambiguous: answers are either correct or wrong. And answers may never have been correct before. However, there is a possibility that the model provides an unexpected answer that still completes a pangram; it may find a novel solution. To ensure that novel solutions exist, if you want them, you must design the input to the puzzle carefully. Testing a novel solution with a generalized deterministic oracle is inherently difficult: it is magic, or it may require an LLM prone to false positives and negatives. To maintain linear-time deterministic test assertions for novel solutions, pre-compute all possible outcomes based on the input words. Alternatively, possible novel solutions can be manually verified by humans, or simply considered incorrect. Importantly, providing a novel solution does not necessarily mean providing the correct solution: it may not correspond to the most probable variation of the chosen pangram.
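One way to picture the pre-computation idea is a checker that is built once from the expected answer plus any pre-computed novel answers, and then classifies each response with a constant-time lookup. This is only a sketch under assumptions: buildOracle and the sample values are hypothetical, and the three labels loosely mirror the exact / possible / incorrect outcomes shown in the Programmatic section.

```javascript
// Hypothetical helper: precompute the accepted answers once, check in O(1).
function buildOracle(expected, novelSolutions = []) {
  const novel = new Set(novelSolutions);
  return function check(answer) {
    if (answer === expected) return "exact";
    if (novel.has(answer)) return "possible"; // flag for human review
    return "wrong";
  };
}

// Illustrative values only, not output from the package.
const checkAnswer = buildOracle("Big", ["glib"]);
console.log(checkAnswer("Big"));   // exact
console.log(checkAnswer("glib"));  // possible
console.log(checkAnswer("nymph")); // wrong
```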

Simple puzzles (with few features/levels) may measure the number of tokens output, or the time taken to complete a correct test, while complicated puzzles may test the total reasoning capability. This test forces the model to think, which seems to be achievable to some extent by producing output tokens that move the task forward towards some end result. However, this is a double-edged sword, as the cost per answer is high if the reasoning is not concise. See a YouTube video by @t3dotgg for more information: I was wrong about GPT-5. With reasoning and tool calls enabled, a hard test could be --level 1183. To see custom level configurations, see the web app.

Why select this puzzle instead of a simple, well-worn exercise such as: Let 𝑘=4, (𝑛^2+3𝑛+5) mod r, r∈𝑍, 𝑛=𝑘 and 𝑛∈𝑍? Although such tasks are quick to generate, they are also very familiar. They reduce to a fixed sequence of steps that can be reused on variants of the problem, or skipped entirely by handing the expression to a calculator. In the end, it’s just concise Haskell. An LLM can classify the input as a Mathematica-like computable expression, directly yielding the result. In practice, parsers can construct an abstract syntax tree (AST) for the expression, and most of the semantic annotations required for execution are either recoverable from the structure or explicit in the grammar. The AST implicitly encodes type information: for example, no mathematical parser interprets the + operator as string concatenation, so the digits on either side of the infix operator are necessarily of type number. As a result, this kind of test can be trivially solved by a non-reasoning system, or reduced by a reasoning model to just three steps: identify the parameter, retrieve the appropriate script, and invoke it with the parameter to obtain the result. For small values of 𝑛 or r, the computation could even be precomputed and stored in a lookup table. Alternatively, the script itself could be cached at the token level, by respecting known parameters. In practice, a finite-state automaton combined with a calculator (ACU) is sufficient to solve this test.
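To make the point concrete, here is the exercise above as a one-line function, plus the lookup table mentioned for small values. It is a sketch that assumes r is a positive integer; the names f and lookup, and the 0..10 range, are chosen purely for illustration.

```javascript
// The "well-worn exercise": (n^2 + 3n + 5) mod r, assuming r > 0.
const f = (n, r) => (n * n + 3 * n + 5) % r;

const k = 4;
console.log(f(k, 7)); // 33 mod 7 = 5

// Precompute every result for small n and r (range picked arbitrarily).
const lookup = new Map();
for (let n = 0; n <= 10; n++) {
  for (let r = 1; r <= 10; r++) {
    lookup.set(`${n},${r}`, f(n, r));
  }
}
console.log(lookup.get("4,7")); // 5, with no arithmetic at answer time
```

No reasoning is involved at any step, which is why this kind of exercise tells you little about a model.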

There are numerous situations in which an answer may reasonably be considered correct, depending on the evaluation framework. For instance, consider a scenario where a structured output schema is not specified. If a language model produces 'every' but the correct answer is every, should this be judged incorrect? Strictly speaking, it does not match exactly, and the test has been designed to treat it as incorrect; within the constraints of a schema, however, the response would be functionally equivalent and thus acceptable. In the context of code generation and precise logical tasks, it constitutes a failure to adhere to the given instructions. This also illustrates a broader issue related to negative feedback attention. Providing feedback such as “no” or “that answer is incorrect” can destabilize the model’s reasoning process. In such cases, the system may be led to reinterpret a correct response as incorrect, a phenomenon sometimes described as gaslighting the model. Public demonstrations of this effect, often titled along the lines of “Gaslighting GPT-5 into believing 2+2=5”, illustrate how easily a model can be misled by poor feedback. The implication is significant: in many instances, evaluators must supply the correct answer explicitly in order to reliably guide the model toward producing the correct response in future interactions.
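The difference between the strict reading and a schema-tolerant reading can be shown with a small comparison helper. This is a hypothetical sketch, not how the package scores answers: judge is an invented name, and the lenient rule (strip one pair of matching surrounding quotes) is just one possible relaxation.

```javascript
// Hypothetical helper comparing a strict match with a quote-tolerant match.
function judge(answer, expected) {
  const strict = answer === expected;
  // Lenient: remove one pair of matching surrounding quotes, if present.
  const unquoted = answer.replace(/^(['"])(.*)\1$/, "$2");
  return { strict, lenient: unquoted === expected };
}

console.log(judge("'every'", "every")); // { strict: false, lenient: true }
console.log(judge("every", "every"));   // { strict: true, lenient: true }
```

Which of the two results counts is a decision for the evaluator. As described above, the test itself is designed around the strict one.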

Implementation notes

While the software is in development (v0.*), when incrementing a minor version I may intentionally break the interface to aid the development. If you would like an LTS or stable release, please get in contact.

It’s worth noting that many of the mechanics operate on a 50% chance of activation. This means that even at the highest level, there is a real possibility that the test output will be simple—some tests remain trivial to solve regardless of difficulty. To obtain meaningful results, a large number of tests must be run, as the majority are expected to fail. For example, when the mapping order is left-to-right rather than right-to-left, the LLM tends to produce more correct answers.
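As a rough back-of-the-envelope check on why many runs are needed, the sketch below estimates how often a test ends up with no optional mechanic active. It assumes that each of k mechanics activates independently with probability 0.5. That is a simplification for illustration, not a description of how the package combines its features, and both function names are hypothetical.

```javascript
// Assumption: k optional mechanics, each activating independently with p = 0.5.
const probNoneActive = (k, p = 0.5) => (1 - p) ** k;

// Expected number of "trivial" tests in a batch under the same assumption.
const expectedTrivial = (runs, k, p = 0.5) => runs * probNoneActive(k, p);

console.log(probNoneActive(3));        // 0.125
console.log(expectedTrivial(1000, 3)); // 125
```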

I’ve also introduced randomness into the puzzle generation process. This ensures that even if an LLM has access to solved examples, any newly generated test will differ significantly in its wording and structure, making memorization ineffective.

Importantly, while each test is deterministic given a fixed wordlist, seed, level, and code version—meaning the same inputs will always produce the same encoded and decoded sentences—this determinism can break if the code version changes or the underlying wordlist is updated. To preserve reproducibility, I can provide options for using a static wordlist and locking the process to a specific code tag. Let me know if you'd like support for that.
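The dependence on both seed and wordlist is easy to see with any seeded generator. The sketch below uses mulberry32, a small public-domain PRNG, together with a Fisher-Yates shuffle. Neither is claimed to be what node-llm-test uses internally; they only illustrate that the same seed and the same wordlist give the same output, while changing either one changes it.

```javascript
// mulberry32: a small seeded PRNG, used here purely for illustration.
// It is NOT the generator node-llm-test uses internally.
function mulberry32(a) {
  return function () {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator.
function seededShuffle(words, seed) {
  const rand = mulberry32(seed);
  const out = [...words];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const sampleWords = ["Big", "fjords", "vex", "quick", "waltz", "nymph"];
// Same seed and same wordlist: identical order on every run.
console.log(seededShuffle(sampleWords, 12345).join(" "));
console.log(seededShuffle(sampleWords, 12345).join(" "));
```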

If the test code and the pangram list are known, then based on the position of the missing word, the chance of guessing the correct input is 1/pangrams.length. A testee should therefore achieve at least this level of accuracy; otherwise, its reasoning is performing worse than random guessing. The default pangrams list length is 9.
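The baseline is simple to compute and compare against. A minimal sketch, with hypothetical helper names, using the default pangram list length of 9:

```javascript
// Chance of guessing correctly when the pangram list is known.
const guessBaseline = (pangramCount) => 1 / pangramCount;

// True when observed accuracy is strictly better than random guessing.
const beatsBaseline = (correct, total, pangramCount) =>
  correct / total > guessBaseline(pangramCount);

console.log(guessBaseline(9).toFixed(3)); // "0.111"
console.log(beatsBaseline(1, 20, 9));     // false: 5% is below ~11.1%
console.log(beatsBaseline(5, 20, 9));     // true: 25% is above ~11.1%
```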

In addition, tests are often simple once native tool calls are excluded. They require only these steps: two lookups, complete the word, find the token, pick the side to choose the token from, and that's your answer. Additional features may require running a function (tool call) over the input, or will only be activated with a 50% chance.

If you need to read all the words of the active pangram you may make ~60 lookup actions. This number does not scale with the number of input words, but with the size of the chosen pangram.

Why was this created

I noticed that all AIs seemed to fail when using the tools that I use. I wondered if it was because of the lack of public information to train them on. This test proves it. This was prior to ChatGPT 5.

Usage

Prerequisites

  • Install node and NPM (tested on >22)
  • You will also need a system wordlist file if not using the --wordlist-file option. For example, on Debian, first install one with: sudo apt install wbritish

CLI Reference

To run the CLI:

npx llmtest

For example:

npx llmtest --seed 12345
npx llmtest --number 10 --write
npx llmtest --number 10 --write ~/Documents/test1
npx llmtest --excludeinfo --no-answer

Or run in interactive mode:

npx llmtest --interactive

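Generate a sequence of results in bash (assumes llmtest is installed globally):

# -I{} already runs one command per input line, so -n1 is not needed
seq -f "%.0f" 1000000 1000010 | \
  xargs \
    -P2 \
    -I{} \
    llmtest \
      --write "test-{}.txt" \
      --seed {} \
      --no-answer \
    > /dev/null 2>&1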

| Argument                     | Description                                                                          |
| ---------------------------- | ------------------------------------------------------------------------------------ |
| --number <number>            | The number of words in the wordlist (default: 0)                                     |
| --write [filepath]           | Write to a temporary file or the target path                                         |
| --level <integer>            | Features enabled (0=none, 16777215=all, default: 0)                                  |
| --seed <integer>             | A seed to preserve reproducibility                                                   |
| --no-print                   | Do not print the output for the LLM                                                  |
| -i, --interactive            | Run in interactive mode                                                              |
| --wordlist-file <filepath>   | Load wordlist from a file                                                            |
| --answer <string>            | Provide answer via arguments, implies --no-answer                                    |
| --no-answer                  | Do not wait for an answer on stdin                                                   |
| --verbose                    | Print more debug information                                                         |
| --excludeinfo                | Exclude any extra information that should not be shown to the LLM from being printed |

Configuration

Persistent defaults can be stored in a config file at ~/.config/node-llm-test/config.json.

# Set a default wordlist file
npx llmtest config set wordlist-file ~/my-words.json

# Set a default word count
npx llmtest config set number 50

# Set a default level or output path
npx llmtest config set level 14
npx llmtest config set write ~/Documents/tests

# Set boolean defaults
npx llmtest config set no-print true
npx llmtest config set verbose true
npx llmtest config set excludeinfo true

# List all config values
npx llmtest config list

# Get a specific value
npx llmtest config get wordlist-file

# Remove a config value
npx llmtest config unset number

# Show config file path
npx llmtest config path

Config values serve as fallback defaults. CLI flags take precedence when provided.

Generating a wordlist

Use the wordsgen command to generate a wordlist file for reproducible tests:

# Generate 300 random words (default)
npx llmtest wordsgen > my-words.json

# Generate a specific number of words
npx llmtest wordsgen --number 1000 > my-words.json

Workflow: wordsgen + config

Create a wordlist once, then configure it as the default:

# 1. Generate a wordlist file
npx llmtest wordsgen --number 500 > ~/llm-words.json

# 2. Set it as the default wordlist
npx llmtest config set wordlist-file ~/llm-words.json

# 3. Run tests -- the wordlist is loaded automatically
npx llmtest --seed 42 --no-answer

This ensures all your tests use the same wordlist without passing --wordlist-file every time.

Programmatic

import { Puzzle, Feature, getRandomWords } from "node-llm-test";

const level = Feature.CHAOS_WORDS | Feature.EXTRA_WORDS;
async function run() {
  const seed = Math.floor(Math.random() * (2 ** 31 - 1));
  const wordList = await getRandomWords(600, seed);
  const puzzle = Puzzle.New(seed, level, [
    /*someWordList*/
  ]);

  const result = puzzle.result();
  puzzle.print(result, console.log);

  const llmsAnswer = await (async function getLlmAnswer() {
    return "somewronganswer";
  })();

  const answer = puzzle.answer(result, llmsAnswer);

  if (answer.dontKnow) {
    // this will never be true if EXCLUDE_DONT_KNOW level is enabled
    // it signifies that the model has responded with an unsure answer.
    // the score, weight or usage is up to the implementer.
    return;
  } else if (answer.possible) {
    // requires human review or extra checks
    return;
  } else if (answer.exact) {
    // the correct answer was provided
    return;
  } else {
    // the answer was not correct
    return;
  }
}

async function runInCycle() {
  const puzzle2 = Puzzle.New();
  puzzle2.getSeed();
  puzzle2.result();

  const secondseed = puzzle2.getSeed();
  const result3 = puzzle2.result();
  puzzle2.print(result3, console.log);

  // later on..
  // we can then re create result3 from the seed
  const result3Puzzle = Puzzle.New(secondseed);
  const result3Close = result3Puzzle.result();
  result3Puzzle.print(result3Close, console.log);
}

Builder Pattern

Use the builder pattern to configure puzzles with a custom random source:

import { Puzzle, PuzzleBuilder, RandomSource, Feature } from "node-llm-test";

// using Puzzle.Builder (static property - same class as PuzzleBuilder)
const puzzle1 = new Puzzle.Builder().withSeed(12345).withLevel(2).build();

// using PuzzleBuilder class directly (identical)
const puzzle2 = new PuzzleBuilder().withSeed(12345).withLevel(2).build();

// create puzzle with custom IRandom implementation
const customRandom = RandomSource.New(RandomSource.TYPES.small, 12345);
const puzzle3 = new PuzzleBuilder()
  .withRandom(customRandom)
  .withLevel(4)
  .build();

// chain builder methods
const puzzle4 = new PuzzleBuilder()
  .withSeed(12345)
  .withLevel(Feature.CHAOS_WORDS | Feature.EXTRA_WORDS)
  .withInputWords(["customword"])
  .withPangrams(["complete pangram here"])
  .withMaxCycleDepth(3)
  .build();

Available builder methods:

  • .withSeed(seed: number | null) - Seed for reproducibility
  • .withLevel(level: number) - Difficulty level (bitflags)
  • .withInputWords(words: string[]) - Custom word list
  • .withPangrams(pangrams: string[]) - Custom pangrams
  • .withMaxCycleDepth(depth: number) - Max recursion depth (1-10, default: 2)
  • .withRandom(random: IRandom) - Custom random source
  • .withAssociativityLevelFactory(factory: IAssociativityLevelFactory) - Custom character mutation factory

Using Custom Associativity Level Factory:

You can supply a custom IAssociativityLevelFactory to control how character mutations are applied to the puzzle output. The factory must implement create(), getRotNShift(), and getActiveTypes():

import { PuzzleBuilder } from "node-llm-test";

const customFactory = {
  getActiveTypes() {
    // Return the non-"NONE" mutation types this factory supports
    return ["MILD", "MODERATE", "SEVERE"];
  },
  getRotNShift() {
    return this.rotNShift;
    // return null
  },
  create(type, random) {
    // Return an IAssociativityLevel implementation
    return {
      id: type,
      mutate(input) {
        // Your custom mutation logic based on type
        return input;
      },
    };
  },
};

const puzzle = new PuzzleBuilder()
  .withSeed(12345)
  .withLevel(34) // will need a level that uses this feature
  .withAssociativityLevelFactory(customFactory)
  .build();

Test Levels

I recommend trying the game out at a low level and word count, at least once. Some information is omitted here. See the web app link at the top of this file.

The commands given may not be reproducible unless you happen to be on the same version.

Reference

| Flag name | Value (decimal) | What it does | Example behavior | Difficulty |
| :----------------------------- | --------------: | :----------- | :--------------- | ---------- |
| CHAOS_WORDS | 1 | Increases output word count (roughly words_in_domain * 2) to make text longer/more chaotic. | If domain has 50 words, output may include ~50 extra words scattered through the text. | Easy |
| MULTIZE_TOKENS | 2 | Duplicates tokens used to increase token density/length. | A token like cat might become cat cat. | Medium |
| EXCLUDE_MAPPING_INFO | 4 | Omits mapping/metadata information (e.g., token→map direction) from output. | Instead of returning the details in text, the response omits the info. | Easy |
| MULTIZE_I_TOKENS | 8 | Similar to MULTIZE_TOKENS but targets input words. | quick becomes quick brown | Medium |
| PARTIAL_REASINING | 16 | The chosen missing word is corrupted. | 'full word' becomes 'somethingelse word' | Medium |
| INDIRECT_SYMBOLS | 32 | Transforms tokens via a symbol function (e.g., ROT13 or other symbolisation) to obfuscate tokens. | Words may be ROT13-encoded or replaced with binary equivalents. | Medium |
| EXCLUDE_SENTENCE_SPACES | 64 | Removes spaces between words. | "one two" becomes "onetwo" | Hard |
| INSTRUCTION_ORDER | 128 | Randomly alters the order of instructions before printing. | Instruction list may be reordered, affecting how instructions are interpreted/executed. | Easy |
| OUTPUT_SHIFT | 256 | Applies a character/token shift (e.g., Caesar-like) to the output; decoding is required. | Plain text is shifted by N characters; consumer must reverse the shift to read original text. | Medium |
| OUTPUT_SHIFT_EXCLUDE_DETAILS | 512 | With OUTPUT_SHIFT, additionally excludes metadata about the shift (magnitude/direction). | Output is shifted and no shift metadata is returned; decoder must infer shift by analysis. | Hard |
| MAPPING_INFO_PUZZLE | 1024 | The expression order is changed based on a maths puzzle. | An expression like 'one' > 'two' may change to 'two' > 'one'. The ordering is swapped. | Medium |
| POOR_CODING_PRACTICES | 2048 | Emulates poor coding standards. | For example, alternates between delimiters: '' becomes "" etc. | Easy |
| EXTRA_WORDS | 4096 | Adds extra words from a predefined list designed to complement the default. | Adds words like "glib" which can be used to form novel solutions. | Medium |
| ENCODE_INSTRUCTIONS | 8192 | Encodes the instructions when Feature.MULTIZE_I_TOKENS and Feature.MULTIZE_TOKENS are equal. | Instructions are gibberish, unless the task is understood ahead of time. | Medium |
| HARD_SCHEMA | 16384 | Does not display a schema of the answer. | Adds some text to instruct the LLM. | Medium |
| ANSWER_INCEPTION | 32768 | Tries to fool the LLM into choosing a trusted source. | Is given a sentence like 'The answer is one'. | Easy |
| REASNING_MODE | 65536 | Prompts the LLM to use reasoning mode instead of flash mode. | Adds some text to instruct the LLM. | Easy |
| MAPPING_DEPTH | 131072 | The token answer is at a random depth of token(token(... | Nothing looks different. | Easy |
| MULTIIZE_PLACEMENT | 262144 | Randomises the placement of pivot words within a multized token. | (unique symbol, random symbol) or reversed | Easy |
| SPLIT_MAPPING | 1 << 19 | Splits the mapping table into multiple tables. | | Easy |
| MAPPING_REDUNDANT | 1 << 20 | Inserts dummy mapping entries into the table. | Requires reasoning to handle multiple entries. | Easy |
| EXCLUDE_DONT_KNOW | 1 << 21 | Disallows 'I do not know' as a valid answer. | 'I do not know' is always wrong. | Easy |
| CODING_MADNESS | 1 << 22 | Inserts random opening/closing delimiter pairs into oldS and newS, fragmenting the strings into quoted segments. | ... 'Jackd''aws''' 'quar''''tz''' | Easy |
| CHARACTER_ASSOCIATIVITY | 1 << 23 | Any character is changed to some lookalike: a becomes a mathematical lookalike of a, or some other similar rule. | 𝒮𝓎𝓂𝒷ℴ𝓁𝒾𝓈ℯ𝒹 𝓈ℯ𝓃𝓉ℯ𝓃𝒸ℯ 𝓌𝒾𝓉𝒽 𝒶... | Medium |

Usage tips

  • Combining flags: these are bit flags — combine with bitwise OR (for example flags = CHAOS_WORDS | INDIRECT_SYMBOLS), and test membership with bitwise AND.
  • Behavioral intent: many flags control how tokens are transformed, obfuscated, or how reasoning is revealed. Treat them as test modes to probe model robustness (e.g., obfuscation, partial reasoning, reordered instructions).
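The bitwise OR / AND tips above can be tried without installing anything. The sketch below keeps a local copy of a few flag values from the reference table so that it runs standalone; in real code you would presumably use the exported Feature constants, as in the Programmatic section. FLAGS, hasFlag and decodeLevel are names made up for this example.

```javascript
// Local copy of some flag values from the reference table (not the full list).
const FLAGS = {
  CHAOS_WORDS: 1,
  MULTIZE_TOKENS: 2,
  EXCLUDE_MAPPING_INFO: 4,
  MULTIZE_I_TOKENS: 8,
  PARTIAL_REASINING: 16,
  INDIRECT_SYMBOLS: 32,
  EXCLUDE_SENTENCE_SPACES: 64,
  INSTRUCTION_ORDER: 128,
  MAPPING_INFO_PUZZLE: 1024,
};

// Combine flags with bitwise OR.
const combined = FLAGS.CHAOS_WORDS | FLAGS.INDIRECT_SYMBOLS; // 33

// Test membership with bitwise AND.
const hasFlag = (level, flag) => (level & flag) !== 0;

// Hypothetical helper: names of the known flags that are set in a level.
const decodeLevel = (level) =>
  Object.keys(FLAGS).filter((name) => hasFlag(level, FLAGS[name]));

console.log(combined);
console.log(decodeLevel(14));   // MULTIZE_TOKENS, EXCLUDE_MAPPING_INFO, MULTIZE_I_TOKENS
console.log(decodeLevel(1183)); // the seven flags behind --level 1183
```

Decoding a level this way is a quick check on what a command such as --level 14 actually turns on.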

Level 0

# zero extra words, zero extra reasoning steps
npx llmtest --level 0
...
Table of mappings:
'vex' 'waltz' {}
'quick' 'fjords' {}
'fjords' 'Big' {}
'Big' 'nymph' {}
'waltz' 'vex' {}
'nymph' 'quick' {}
...

Symbolised sentence with missing word:
fjords quick waltz nymph vex [...]

ChatGPT says:

Missing word (symbolised form): Big
Input sentence (symbolised form): fjords quick waltz nymph vex Big

FYI:

The correct answer is:
fjords quick waltz nymph vex Big
The real sentence is:
Big fjords vex quick waltz nymph

This test has a straightforward probabilistic solution, and the LLM arrives at the correct result without chain of thought. Given the sentence, the missing word is the one not present in the sequence. From the lookup table, we observe that the word “Big” appears in two mappings: 'Big' → 'nymph' and 'fjords' → 'Big'. Either would be a likely candidate for the missing word if additional words from the domain were present.

At this level, the order of the expression is explicitly provided in the instructions. This makes it a very easy Level 0 test. The task can become identifying the one word from the domain that is not present in the sentence. This structure allows for a simple elimination approach to deduce the missing word.
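The elimination approach can be written down directly. This sketch uses the Level 0 table and sentence shown above; missingByElimination is a hypothetical name, and the shortcut only works when the domain contains no extra words.

```javascript
// Symbols from the left column of the Level 0 mapping table.
const domainSymbols = ["vex", "quick", "fjords", "Big", "waltz", "nymph"];

// Hypothetical helper: whatever symbol never appears in the sentence.
function missingByElimination(domain, symbolised) {
  const present = new Set(symbolised);
  return domain.filter((symbol) => !present.has(symbol));
}

const shown = "fjords quick waltz nymph vex [...]".split(" ");
console.log(missingByElimination(domainSymbols, shown)); // [ 'Big' ]
```

With features such as CHAOS_WORDS or EXTRA_WORDS enabled, the filter would return several candidates and the shortcut stops working.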

It’s important to note that, despite the simplicity of the structure, some reasoning is still required to arrive at the correct answer because of the way the test is presented: the solver must infer the ordering and relationships from the text. This subtle requirement distinguishes it from a purely mechanical task and introduces a minimal layer of logical deduction.

Level 14

npx llmtest --level 14 --seed 12345

You have been given a sequence of encoded symbols that contains a missing part.
The '~' operator defines a mapping between two character sequences enclosed in quotes.
Each mapping entry in the table is separated by a newline character.
The marketeer dot snowdon dot dev llmtest online.

~ 'Big vex' 'quick'
~ 'waltz' 'quick Big'
~ 'quick' 'Big'
~ 'waltz nymph' 'fjords vex'
~ 'quick Big' 'nymph'
~ 'fjords vex' 'waltz nymph'
~ 'nymph' 'waltz'
~ 'Big' 'vex'
~ 'vex' 'Big vex'
~ 'fjords' 'fjords'

Take into account the given symbolised sequence of words and other contextual information.
Complete the following tasks:
- Determine the absent word
- Present only the symbol or symbols that map to find the real word or words
- Show the answer as concisely as possible
- Do not ask any questions
- The answer must be a strict match with case and character sensitively
- Think carefully and respond only when confident

Symbolised sentence with a missing part or parts:
[...] waltz nymph Big fjords vex

FYI:

+-------+-------------+-------+-------------+
| Big   | fjords vex  | quick | waltz nymph |
+-------+-------------+-------+-------------+
| [...] | waltz nymph | Big   | fjords vex  |
| vex   | waltz nymph | Big   | fjords vex  |
+-------+-------------+-------+-------------+