llama-cpp-client

Typed Node.js client for llama.cpp's OpenAI-compatible HTTP API.

Features:

  • Automatic retries with exponential backoff
  • reasoning_content recovery (when the model reasons but forgets to respond)
  • Real token counting via /v1/messages/count_tokens
  • Context window management with two-phase compression (estimation + real count)
  • AbortSignal support throughout

Installation

npm install llama-cpp-client

Or as a local path dependency:

"llama-cpp-client": "file:../LlamaCppClient"

LlamaCppClient

The low-level client. Handles HTTP, retries, and reasoning recovery.

import { LlamaCppClient } from 'llama-cpp-client';

const client = new LlamaCppClient({
  baseUrl: 'http://localhost:8080',
  model: '',           // optional — leave empty for the loaded model
  maxRetries: 8,       // optional, default 8
  timeoutMs: 600000,   // optional, default 10 minutes
});

callLlm

Calls /v1/chat/completions. Retries on failure with exponential backoff (2s base, 30s max). If the model returns reasoning_content but no content or tool calls, automatically pushes a recovery user message and retries.

import type { Message } from 'llama-cpp-client';
import type { ChatCompletionTool } from 'openai/resources/chat/completions';

const history: Message[] = [
  { role: 'user', content: 'What is 2 + 2?' },
];

const tools: ChatCompletionTool[] = []; // pass [] for text-only calls

const result = await client.callLlm('You are a helpful assistant.', history, tools);
console.log(result.content);          // "4"
console.log(result.usage);            // { prompt_tokens, completion_tokens, total_tokens }
console.log(result.reasoning_content); // reasoning trace if the model produced one
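
For reference, the stated schedule (2s base, 30s cap) corresponds to a delay like this (a hypothetical sketch, not an exported helper):

// Hypothetical sketch of the documented backoff: 2s base, doubling
// per attempt, capped at 30s. Not part of the package's public API.
const backoffMs = (attempt: number): number =>
  Math.min(30000, 2000 * 2 ** attempt); // attempt 0 → 2s, attempt 4+ → capped at 30s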

With tool calls:

const tools: ChatCompletionTool[] = [
  {
    type: 'function',
    function: {
      name: 'get_page_state',
      description: 'Returns the current page HTML',
      parameters: { type: 'object', properties: {}, required: [] },
    },
  },
];

const result = await client.callLlm('You are a browser agent.', history, tools);
if (result.tool_calls?.length) {
  for (const tc of result.tool_calls) {
    console.log(tc.function.name, tc.function.arguments);
  }
}

With an AbortSignal:

const controller = new AbortController();
setTimeout(() => controller.abort(), 5000);

const result = await client.callLlm('You are a helper.', history, [], controller.signal);
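
To combine a user-driven cancel with a hard timeout, the standard AbortSignal combinators (Node 20.3+ for AbortSignal.any) also work here:

// Plain Node APIs, nothing package-specific: aborts on whichever
// fires first, a user-triggered cancel or a 5s timeout.
const userCancel = new AbortController();
const signal = AbortSignal.any([userCancel.signal, AbortSignal.timeout(5000)]);

const result = await client.callLlm('You are a helper.', history, [], signal);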

countTokens

Calls /v1/messages/count_tokens to get the real token count for a context. Useful for pre-flight checks before sending a large context.

const tokens = await client.countTokens('You are a helper.', history, tools);
console.log(`Context is ${tokens} tokens`);
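
A plausible pre-flight guard (the 150000 budget is illustrative, not a library default):

const BUDGET = 150000; // illustrative; tune to your model's context window
const tokens = await client.countTokens(systemPrompt, history, tools);
if (tokens > BUDGET) {
  // Too large: trim or compress first (see LLMContextManager below).
}
const result = await client.callLlm(systemPrompt, history, tools);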

isContextOverflow

Returns true if an error thrown by callLlm indicates the context window was exceeded.

try {
  await client.callLlm(systemPrompt, history, tools);
} catch (e) {
  if (client.isContextOverflow(e)) {
    // trim history and retry
  }
}

LLMContextManager

Manages a message history array with context compression built in. Inject a LlamaCppClient instance so it can count tokens for trimming decisions.

import { LLMContextManager, LlamaCppClient } from 'llama-cpp-client';

const client = new LlamaCppClient({ baseUrl: 'http://localhost:8080' });
const applicationId = 'my-app'; // illustrative identifier for this context
const ctx = new LLMContextManager(applicationId, client, (phase, msg) => {
  console.log(`[${phase}] ${msg}`);
});

ctx.setSystemPrompt('You are a browser agent.');
ctx.setTools(tools);

Building up history

ctx.addMessage('user', 'Fill out the form on the page.');

const result = await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
ctx.markAllSent(); // marks all current messages as sent — required for safe trimming

if (result.tool_calls?.length) {
  ctx.addMessage('assistant', result.content ?? '', { tool_calls: result.tool_calls });
  ctx.addMessage('tool', 'Tool result: {"html":"..."}', { tool_call_id: result.tool_calls[0].id });
}
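
Putting these together, one possible agent loop (executeTool is a hypothetical dispatcher of your own, not part of the package):

// Sketch of an agent loop over the documented methods; executeTool
// is assumed to run the named tool and return a string result.
while (true) {
  const result = await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
  ctx.markAllSent();

  if (!result.tool_calls?.length) {
    console.log(result.content); // final answer, loop ends
    break;
  }

  ctx.addMessage('assistant', result.content ?? '', { tool_calls: result.tool_calls });
  for (const tc of result.tool_calls) {
    const output = await executeTool(tc.function.name, tc.function.arguments); // hypothetical
    ctx.addMessage('tool', output, { tool_call_id: tc.id });
  }
}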

Context compression

Remove oldest messages until under a token budget:

// Trims from the front of history — only removes messages already sent to the LLM.
// Uses real token counts via countTokens (two-phase: estimation bulk-removes, real count verifies).
await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(150000);

Deduplicate page state HTML (keep only the latest):

// Stubs HTML on all but the most recent get_page_state tool result.
ctx.ensureOnlyOnePageStateToolCallResultHasHtmlContent();

Deduplicate screenshots (keep only the latest):

// Clears image data from all but the most recent screenshot tool result.
ctx.ensureOnlyOneScreenshotToolCallHasContent();

Trim HTML on the latest page state result:

// Shrinks the HTML field on the latest get_page_state result by ~8000 tokens.
// Used for context overflow recovery without dropping entire messages.
ctx.trimLatestGetPageStateHtmlInHistory();
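
Combining these with isContextOverflow gives one possible recovery policy (the ordering below is a suggestion, not prescribed by the library):

// One possible policy: stub stale HTML first, then trim the latest
// page state, then drop old messages, then retry the call once.
try {
  await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
} catch (e) {
  if (!client.isContextOverflow(e)) throw e;
  ctx.ensureOnlyOnePageStateToolCallResultHasHtmlContent();
  ctx.trimLatestGetPageStateHtmlInHistory();
  await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(150000);
  await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
}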

Token estimation (sync, no HTTP)

const estimated = ctx.estimateTokens(); // rough estimate, no network call
const breakdown = ctx.getTokenBreakdown(); // per-message breakdown for debugging
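
One way to use the estimate is as a cheap gate in front of the real count (thresholds are illustrative):

// Illustrative two-phase check: only hit the HTTP endpoint when the
// local estimate suggests the context is near the budget.
const BUDGET = 150000; // example value
if (ctx.estimateTokens() > BUDGET * 0.8) {
  const real = await client.countTokens(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
  if (real > BUDGET) {
    await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(BUDGET);
  }
}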

Types

type Message = {
  role: string;
  content: string;
  tool_calls?: ChatCompletionMessageToolCall[];
  tool_call_id?: string;
  llmToolContent?: string | ChatCompletionContentPart[]; // overrides content for tool messages sent to the LLM
};
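
For example, a tool message might keep a short placeholder in content while llmToolContent carries the full payload sent to the model (illustrative values):

const toolMsg: Message = {
  role: 'tool',
  tool_call_id: 'call_abc123', // hypothetical id from a prior tool_calls entry
  content: 'page state omitted here',
  llmToolContent: '{"html":"<html>...</html>"}', // this is what the LLM sees
};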

type LlmCallResult = {
  content: string | null;
  reasoning_content?: string | null;
  tool_calls?: ChatCompletionMessageToolCall[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
};

type LlamaCppClientConfig = {
  baseUrl: string;    // e.g. "http://localhost:8080"
  model?: string;
  maxRetries?: number;
  apiKey?: string;
  timeoutMs?: number;
};