# llama-cpp-client

Typed Node.js client for llama.cpp's OpenAI-compatible HTTP API.
Features:
- Automatic retries with exponential backoff
- `reasoning_content` recovery (when the model reasons but forgets to respond)
- Real token counting via `/v1/messages/count_tokens`
- Context window management with two-phase compression (estimation + real count)
- `AbortSignal` support throughout
## Installation

```bash
npm install llama-cpp-client
```

Or as a local path dependency:

```json
"llama-cpp-client": "file:../LlamaCppClient"
```

## LlamaCppClient
The low-level client. Handles HTTP, retries, and reasoning recovery.
```ts
import { LlamaCppClient } from 'llama-cpp-client';

const client = new LlamaCppClient({
  baseUrl: 'http://localhost:8080',
  model: '', // optional — leave empty for the loaded model
  maxRetries: 8, // optional, default 8
  timeoutMs: 600000, // optional, default 10 minutes
});
```

### callLlm
Calls `/v1/chat/completions`. Retries on failure with exponential backoff (2s base, 30s max).
If the model returns `reasoning_content` but no content or tool calls, it automatically pushes a
recovery user message and retries.
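The retry timing is internal to the client; as a rough sketch, the documented parameters (2s base, 30s cap, 8 retries by default) imply a schedule like this (the function name is hypothetical, not part of the package API):

```ts
// Hypothetical illustration of the documented retry schedule (not the package's actual code).
function backoffDelayMs(attempt: number): number {
  return Math.min(2000 * 2 ** attempt, 30000); // 2s base, doubling, capped at 30s
}
// attempts 0..7 -> 2s, 4s, 8s, 16s, 30s, 30s, 30s, 30s
```

Basic usage: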
```ts
import type { Message } from 'llama-cpp-client';
import type { ChatCompletionTool } from 'openai/resources/chat/completions';
const history: Message[] = [
  { role: 'user', content: 'What is 2 + 2?' },
];
const tools: ChatCompletionTool[] = []; // pass [] for text-only calls
const result = await client.callLlm('You are a helpful assistant.', history, tools);
console.log(result.content); // "4"
console.log(result.usage); // { prompt_tokens, completion_tokens, total_tokens }
console.log(result.reasoning_content); // reasoning trace if the model produced one
```

With tool calls:
```ts
const tools: ChatCompletionTool[] = [
  {
    type: 'function',
    function: {
      name: 'get_page_state',
      description: 'Returns the current page HTML',
      parameters: { type: 'object', properties: {}, required: [] },
    },
  },
];
const result = await client.callLlm('You are a browser agent.', history, tools);
if (result.tool_calls?.length) {
  for (const tc of result.tool_calls) {
    console.log(tc.function.name, tc.function.arguments);
  }
}
```

With an `AbortSignal`:
```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 5000);
const result = await client.callLlm('You are a helper.', history, [], controller.signal);
```

### countTokens
Calls `/v1/messages/count_tokens` to get the real token count for a context.
Useful for pre-flight checks before sending a large context.
```ts
const tokens = await client.countTokens('You are a helper.', history, tools);
console.log(`Context is ${tokens} tokens`);
```
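A typical pre-flight check before a large call might look like this (the token budget is an arbitrary example, not a package default):

```ts
const TOKEN_BUDGET = 32000; // hypothetical budget for the target model
const tokens = await client.countTokens(systemPrompt, history, tools);
if (tokens > TOKEN_BUDGET) {
  // compress or trim history before calling the model
}
const result = await client.callLlm(systemPrompt, history, tools);
```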
### isContextOverflow

Returns `true` if an error thrown by `callLlm` indicates the context window was exceeded.
```ts
try {
  await client.callLlm(systemPrompt, history, tools);
} catch (e) {
  if (client.isContextOverflow(e)) {
    // trim history and retry
  }
}
```

## LLMContextManager
Manages a message history array with context compression built in.
Inject a `LlamaCppClient` instance so it can count tokens for trimming decisions.
```ts
import { LLMContextManager, LlamaCppClient } from 'llama-cpp-client';

const client = new LlamaCppClient({ baseUrl: 'http://localhost:8080' });
const applicationId = 'my-app'; // example id for this manager instance
const ctx = new LLMContextManager(applicationId, client, (phase, msg) => {
  console.log(`[${phase}] ${msg}`);
});

ctx.setSystemPrompt('You are a browser agent.');
ctx.setTools(tools);
```

### Building up history
```ts
ctx.addMessage('user', 'Fill out the form on the page.');

const result = await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
ctx.markAllSent(); // marks all current messages as sent — required for safe trimming

if (result.tool_calls?.length) {
  ctx.addMessage('assistant', result.content ?? '', { tool_calls: result.tool_calls });
  ctx.addMessage('tool', 'Tool result: {"html":"..."}', { tool_call_id: result.tool_calls[0].id });
}
```

### Context compression
Remove oldest messages until under a token budget:
```ts
// Trims from the front of history — only removes messages already sent to the LLM.
// Uses real token counts via countTokens (two-phase: estimation bulk-removes, real count verifies).
await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(150000);
```

Deduplicate page state HTML (keep only the latest):
```ts
// Stubs HTML on all but the most recent get_page_state tool result.
ctx.ensureOnlyOnePageStateToolCallResultHasHtmlContent();
```

Deduplicate screenshots (keep only the latest):
```ts
// Clears image data from all but the most recent screenshot tool result.
ctx.ensureOnlyOneScreenshotToolCallHasContent();
```

Trim HTML on the latest page state result:
```ts
// Shrinks the HTML field on the latest get_page_state result by ~8000 tokens.
// Used for context overflow recovery without dropping entire messages.
ctx.trimLatestGetPageStateHtmlInHistory();
```
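Tying these together: a hedged sketch of context-overflow recovery that combines the low-level `isContextOverflow` check with the trimming methods above (the loop shape, attempt count, and token budget are illustrative, not package behavior):

```ts
// Illustrative recovery loop: on overflow, shrink the context and retry.
for (let attempt = 0; attempt < 3; attempt++) {
  try {
    const result = await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
    ctx.markAllSent();
    // ...use result
    break;
  } catch (e) {
    if (!client.isContextOverflow(e)) throw e;
    // Shrink bulky HTML first, then drop oldest sent messages if still too large.
    ctx.trimLatestGetPageStateHtmlInHistory();
    await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(150000);
  }
}
```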
### Token estimation (sync, no HTTP)

```ts
const estimated = ctx.estimateTokens(); // rough estimate, no network call
const breakdown = ctx.getTokenBreakdown(); // per-message breakdown for debugging
```

## Types
```ts
type Message = {
  role: string;
  content: string;
  tool_calls?: ChatCompletionMessageToolCall[];
  tool_call_id?: string;
  llmToolContent?: string | ChatCompletionContentPart[]; // overrides content for tool messages sent to the LLM
};

type LlmCallResult = {
  content: string | null;
  reasoning_content?: string | null;
  tool_calls?: ChatCompletionMessageToolCall[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
};

type LlamaCppClientConfig = {
  baseUrl: string; // e.g. "http://localhost:8080"
  model?: string;
  maxRetries?: number;
  apiKey?: string;
  timeoutMs?: number;
};
```
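As an illustration of `llmToolContent` (based on the field's comment above; the values are made up), a tool message can keep a short human-readable `content` for local history while `llmToolContent` carries the full payload actually sent to the model:

```ts
import type { Message } from 'llama-cpp-client';

const toolMsg: Message = {
  role: 'tool',
  tool_call_id: 'call_123', // id of the assistant tool call being answered
  content: 'page state captured', // short summary kept in local history
  llmToolContent: '<html>...full page HTML sent to the LLM...</html>',
};
```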