# llama-cpp-client

Typed Node.js client for llama.cpp's OpenAI-compatible HTTP API.
Features:
- Automatic retries with exponential backoff
- `reasoning_content` recovery (when the model reasons but forgets to respond)
- Real token counting via `/v1/messages/count_tokens`
- Context window management with two-phase compression (estimation + real count)
- `AbortSignal` support throughout
## Installation

```bash
npm install llama-cpp-client
```

Or as a local path dependency:

```json
"llama-cpp-client": "file:../LlamaCppClient"
```

## LlamaCppClient
The low-level client. Handles HTTP, retries, and reasoning recovery.
```ts
import { LlamaCppClient } from 'llama-cpp-client';

const client = new LlamaCppClient({
  baseUrl: 'http://localhost:8080',
  model: '', // optional — leave empty for the loaded model
  maxRetries: 8, // optional, default 8
  timeoutMs: 600000, // optional, default 10 minutes
});
```

### callLlm
Calls `/v1/chat/completions`. Retries on failure with exponential backoff (2s base, 30s max).
If the model returns `reasoning_content` but no content or tool calls, it automatically pushes a
recovery user message and retries.
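The retry timing is internal to the client; as a rough sketch, the documented parameters (2s base, 30s cap, 8 retries by default) imply a schedule like this (the function name is hypothetical, not part of the package API):

```ts
// Hypothetical illustration of the documented retry schedule (not the package's actual code).
function backoffDelayMs(attempt: number): number {
  return Math.min(2000 * 2 ** attempt, 30000); // 2s base, doubling, capped at 30s
}
// attempts 0..7 -> 2s, 4s, 8s, 16s, 30s, 30s, 30s, 30s
```

Basic usage: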
```ts
import type { Message } from 'llama-cpp-client';
import type { ChatCompletionTool } from 'openai/resources/chat/completions';
const history: Message[] = [
  { role: 'user', content: 'What is 2 + 2?' },
];
const tools: ChatCompletionTool[] = []; // pass [] for text-only calls
const result = await client.callLlm('You are a helpful assistant.', history, tools);
console.log(result.content); // "4"
console.log(result.usage); // { prompt_tokens, completion_tokens, total_tokens }
console.log(result.reasoning_content); // reasoning trace if the model produced one
```

With tool calls:
```ts
const tools: ChatCompletionTool[] = [
  {
    type: 'function',
    function: {
      name: 'get_page_state',
      description: 'Returns the current page HTML',
      parameters: { type: 'object', properties: {}, required: [] },
    },
  },
];
const result = await client.callLlm('You are a browser agent.', history, tools);
if (result.tool_calls?.length) {
  for (const tc of result.tool_calls) {
    console.log(tc.function.name, tc.function.arguments);
  }
}
```

With an `AbortSignal`:
```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 5000);
const result = await client.callLlm('You are a helper.', history, [], controller.signal);
```

### countTokens
Calls `/v1/messages/count_tokens` to get the real token count for a context.
Useful for pre-flight checks before sending a large context.
```ts
const tokens = await client.countTokens('You are a helper.', history, tools);
console.log(`Context is ${tokens} tokens`);
```
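A typical pre-flight check before a large call might look like this (the token budget is an arbitrary example, not a package default):

```ts
const TOKEN_BUDGET = 32000; // hypothetical budget for the target model
const tokens = await client.countTokens(systemPrompt, history, tools);
if (tokens > TOKEN_BUDGET) {
  // compress or trim history before calling the model
}
const result = await client.callLlm(systemPrompt, history, tools);
```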
### isContextOverflow

Returns `true` if an error thrown by `callLlm` indicates the context window was exceeded.
```ts
try {
  await client.callLlm(systemPrompt, history, tools);
} catch (e) {
  if (client.isContextOverflow(e)) {
    // trim history and retry
  }
}
```

## LLMContextManager
Manages a message history array with context compression built in.
Inject a `LlamaCppClient` instance so it can count tokens for trimming decisions.
```ts
import { LLMContextManager, LlamaCppClient } from 'llama-cpp-client';

const client = new LlamaCppClient({ baseUrl: 'http://localhost:8080' });
const applicationId = 'my-app'; // example id for this manager instance
const ctx = new LLMContextManager(applicationId, client, (phase, msg) => {
  console.log(`[${phase}] ${msg}`);
});

ctx.setSystemPrompt('You are a browser agent.');
ctx.setTools(tools);
```

### Building up history
```ts
ctx.addMessage('user', 'Fill out the form on the page.');

const result = await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
ctx.markAllSent(); // marks all current messages as sent — required for safe trimming

if (result.tool_calls?.length) {
  ctx.addMessage('assistant', result.content ?? '', { tool_calls: result.tool_calls });
  ctx.addMessage('tool', 'Tool result: {"html":"..."}', { tool_call_id: result.tool_calls[0].id });
}
```

### Context compression
Remove oldest messages until under a token budget:
```ts
// Trims from the front of history — only removes messages already sent to the LLM.
// Uses real token counts via countTokens (two-phase: estimation bulk-removes, real count verifies).
await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(150000);
```

Deduplicate page state HTML (keep only the latest):
```ts
// Stubs HTML on all but the most recent get_page_state tool result.
ctx.ensureOnlyOnePageStateToolCallResultHasHtmlContent();
```

Deduplicate screenshots (keep only the latest):
```ts
// Clears image data from all but the most recent screenshot tool result.
ctx.ensureOnlyOneScreenshotToolCallHasContent();
```

Trim HTML on the latest page state result:
```ts
// Shrinks the HTML field on the latest get_page_state result by ~8000 tokens.
// Used for context overflow recovery without dropping entire messages.
ctx.trimLatestGetPageStateHtmlInHistory();
```
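Tying these together: a hedged sketch of context-overflow recovery that combines the low-level `isContextOverflow` check with the trimming methods above (the loop shape, attempt count, and token budget are illustrative, not package behavior):

```ts
// Illustrative recovery loop: on overflow, shrink the context and retry.
for (let attempt = 0; attempt < 3; attempt++) {
  try {
    const result = await client.callLlm(ctx.getSystemPrompt(), ctx.getMessages(), ctx.getTools());
    ctx.markAllSent();
    // ...use result
    break;
  } catch (e) {
    if (!client.isContextOverflow(e)) throw e;
    // Shrink bulky HTML first, then drop oldest sent messages if still too large.
    ctx.trimLatestGetPageStateHtmlInHistory();
    await ctx.removeOlderMessagesFromHistoryUntilContextIsLessThanNTokens(150000);
  }
}
```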
### Token estimation (sync, no HTTP)

```ts
const estimated = ctx.estimateTokens(); // rough estimate, no network call
const breakdown = ctx.getTokenBreakdown(); // per-message breakdown for debugging
```

## Types
```ts
type Message = {
  role: string;
  content: string;
  tool_calls?: ChatCompletionMessageToolCall[];
  tool_call_id?: string;
  llmToolContent?: string | ChatCompletionContentPart[]; // overrides content for tool messages sent to the LLM
};

type LlmCallResult = {
  content: string | null;
  reasoning_content?: string | null;
  tool_calls?: ChatCompletionMessageToolCall[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
};

type LlamaCppClientConfig = {
  baseUrl: string; // e.g. "http://localhost:8080"
  model?: string;
  maxRetries?: number;
  apiKey?: string;
  timeoutMs?: number;
};
```
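As an illustration of `llmToolContent` (based on the field's comment above; the values are made up), a tool message can keep a short human-readable `content` for local history while `llmToolContent` carries the full payload actually sent to the model:

```ts
import type { Message } from 'llama-cpp-client';

const toolMsg: Message = {
  role: 'tool',
  tool_call_id: 'call_123', // id of the assistant tool call being answered
  content: 'page state captured', // short summary kept in local history
  llmToolContent: '<html>...full page HTML sent to the LLM...</html>',
};
```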