@lucas-bortoli/fluent-llama

v0.2.13

Client library for interacting with llama.cpp servers

fluent-llama

This package is currently in Alpha status. It is not yet suitable for production use. Breaking changes may occur without notice.

fluent-llama is a type-safe, fluent API client for interacting with llama-server (llama.cpp inference server). It provides a modern, expressive interface for chat completions, tool calling, vision tasks, and agent loops.

Features

  • Fluent Configuration 🧠: Builder pattern for Sampling and Toolset configurations.
  • Agent Loops 🤖: The act() method handles the multi-turn reasoning and tool execution cycle automatically.
  • Error Handling 🛡️: Fallible operations report failures by throwing JavaScript Error instances. Handle them explicitly with standard try/catch patterns.
  • Vision Support 📷: Native handling of image attachments via Base64.
  • Reasoning 🔍: Supports reasoningContent (Chain of Thought) streams.
  • Text Infilling 📃: Native support for fill-in-the-middle text completion tasks.
  • Advanced Sampling ⚙️: Fine-grained control over temperature, top-k, top-p, mirostat, DRY, XTC, and more.
  • Streaming 🔄: Full SSE (Server-Sent Events) support for real-time token streaming.
  • Router Mode 🛰️: Dynamic model loading/unloading with automatic model discovery.
  • Embeddings 📊: Generate text embeddings for semantic search, clustering, and similarity tasks.
  • Tokenization 🔤: Convert text to token IDs and back, with optional piece-level metadata.

Prerequisites

  • Node.js (v20+ recommended)
  • llama-server: This client connects to the HTTP API exposed by llama-server, using both its native endpoints and its OpenAI-compatible layer. Make sure a llama-server instance is running and reachable before creating a client.

Installation

npm install @lucas-bortoli/fluent-llama

Quick Start

1. Basic Chat Completion

import { Client, RandomSeed, Sampling } from "@lucas-bortoli/fluent-llama";

async function main() {
  try {
    // Connect to your local llama-server
    const client = await Client.from("http://localhost:8080");
    console.log("Client initialized with models:", [...client.modelStatuses.keys()]);

    const llm = await client.createTextModel("Qwen3.6-35B-A3B");

    const result = await llm.respond({
      instructions: "You are a helpful assistant.",
      history: [{ role: "user", content: "Hello, who are you?", attachments: [] }],
      sampling: new Sampling().setSeed(RandomSeed).build(),
    });

    console.log(result.response.content);
  } catch (error) {
    console.error("Error:", error);
  }
}

main();

2. Tool Calling (Agent Mode)

Use the act() method to run autonomous agent loops where the model decides when to use tools.

import { Client, tool, Toolset, Sampling, RandomSeed } from "@lucas-bortoli/fluent-llama";
import * as v from "valibot";

// Define a tool using Valibot for schema validation
const weatherTool = tool({
  name: "get_weather",
  description: "Gets weather data for a location.",
  parameters: { location: v.string() },
  exec: async ({ location }) => {
    return { temp: 20, condition: "Sunny" };
  },
});

try {
  const client = await Client.from("http://localhost:8080");
  const llm = await client.createTextModel("Qwen3.6-35B-A3B");

  // Run the agent loop
  const history = await llm.act({
    instructions: "You are a helpful assistant. Use tools to answer.",
    history: [{ role: "user", content: "What's the weather in Tokyo?", attachments: [] }],
    sampling: new Sampling()
      .setSamplerTemperature(0.7)
      .setSamplerTopK(80)
      .setSamplerMinP(0.02)
      .setSeed(RandomSeed)
      .build(),
    toolset: new Toolset([weatherTool]).build(),
  });

  // The returned history contains the newly generated messages, including tool results
  console.log(history);
} catch (error) {
  console.error("Agent error:", error);
}
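
To make the cycle that act() automates more concrete, here is a minimal sketch of a generic agent loop. It does not show fluent-llama's internals: runAgentLoop, ModelTurn, fakeModel, demoTools, and maxTurns are hypothetical names invented for this illustration, and the model is replaced by a stub so the sketch runs without a server.

```typescript
// Conceptual sketch only: every type and function here is hypothetical
// and is NOT part of fluent-llama's API.
type ToolCall = { name: string; args: Record<string, unknown> };
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };
type ModelTurn = { content: string; toolCalls: ToolCall[] };
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

async function runAgentLoop(
  model: (history: Message[]) => Promise<ModelTurn>,
  tools: Record<string, ToolFn>,
  history: Message[],
  maxTurns = 5,
): Promise<Message[]> {
  const generated: Message[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    // The model always sees the original history plus everything generated so far.
    const reply = await model([...history, ...generated]);
    generated.push({ role: "assistant", content: reply.content });
    // No tool calls means the model has produced its final answer.
    if (reply.toolCalls.length === 0) break;
    for (const call of reply.toolCalls) {
      const fn = tools[call.name];
      const result = fn ? await fn(call.args) : { error: `Unknown tool: ${call.name}` };
      generated.push({ role: "tool", name: call.name, content: JSON.stringify(result) });
    }
  }
  return generated;
}

// Stub model: requests the weather once, then answers using the tool result.
const fakeModel = async (history: Message[]): Promise<ModelTurn> => {
  const toolResult = history.find((m) => m.role === "tool");
  return toolResult
    ? { content: `Tool said: ${toolResult.content}`, toolCalls: [] }
    : { content: "", toolCalls: [{ name: "get_weather", args: { location: "Tokyo" } }] };
};

const demoTools: Record<string, ToolFn> = {
  get_weather: async () => ({ temp: 20, condition: "Sunny" }),
};
```

The maxTurns cap in this sketch is a design choice for the illustration: it keeps a misbehaving model from requesting tools forever. Whether act() applies a similar limit is not stated here.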

3. Vision Support

You can send images by attaching binary content to user messages.

import fs from "node:fs/promises";
import path from "node:path";
import { Client, Sampling } from "@lucas-bortoli/fluent-llama";

async function main() {
  try {
    const client = await Client.from("http://localhost:8080");
    const llm = await client.createTextModel("Qwen3.6-35B-A3B");

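    // Note: __dirname is only defined in CommonJS builds. In ES modules,
    // use import.meta.dirname (Node 20.11+) instead.
    const imageData = await fs.readFile(path.join(__dirname, "image.jpg"));
    // A Buffer can be a view into a larger ArrayBuffer, so copy out exactly its bytes.
    const imageBytes = imageData.buffer.slice(
      imageData.byteOffset,
      imageData.byteOffset + imageData.byteLength,
    );
    const response = await llm.respond({
      instructions: "Describe this image.",
      history: [
        {
          role: "user",
          content: "What is in this picture?",
          attachments: [{ mimeType: "image/jpeg", content: imageBytes }],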
        },
      ],
      sampling: new Sampling().build(),
    });

    console.log(response.response.content);
  } catch (error) {
    console.error("Vision error:", error);
  }
}

main();
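
The feature list mentions that image attachments are handled via Base64. As a rough illustration of what that kind of encoding involves, the sketch below turns raw bytes into a Base64 data URL using Node's Buffer. The toDataUrl helper and demoBuffer are hypothetical, and the exact wire format fluent-llama sends to the server is an assumption, not something confirmed here.

```typescript
// Hypothetical helper, NOT part of fluent-llama: encodes raw bytes as a
// Base64 data URL, a common way to embed images in JSON request bodies.
function toDataUrl(mimeType: string, content: ArrayBuffer): string {
  const base64 = Buffer.from(content).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Three sample bytes standing in for real image data.
const demoBuffer = new ArrayBuffer(3);
new Uint8Array(demoBuffer).set([1, 2, 3]);

console.log(toDataUrl("image/png", demoBuffer)); // prints "data:image/png;base64,AQID"
```

Base64 grows the payload by roughly a third, so large images are worth downscaling before they are attached.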

4. Model Loading and Unloading (Router Mode)

With llama-server's router mode, you can dynamically load and unload models without restarting the server.

import { Client } from "@lucas-bortoli/fluent-llama";

async function main() {
  try {
    const client = await Client.from("http://localhost:8080");

    // Check available models
    console.log("Available models:", [...client.modelStatuses.keys()]);

    // Load a model
    await client.load("Qwen3.6-35B-A3B");
    console.log("Model loaded successfully");

    // Use the model
    const llm = await client.createTextModel("Qwen3.6-35B-A3B");
    const isLoaded = await client.isModelLoaded("Qwen3.6-35B-A3B");
    console.log("Model loaded status:", isLoaded);

    // Unload the model when done
    await client.unload("Qwen3.6-35B-A3B");
    console.log("Model unloaded successfully");
  } catch (error) {
    console.error("Error:", error);
  }
}

main();

5. Text Infilling

The predict() method supports text infilling by using the native /infill endpoint. Provide a prefix, suffix, and the main prompt to generate completions for partial text blocks.

import { Client, Sampling, RandomSeed } from "@lucas-bortoli/fluent-llama";

async function main() {
  try {
    const client = await Client.from("http://localhost:8080");
    const llm = await client.createTextModel("Qwen3.6-35B-A3B");

    const result = await llm.predict({
      input: {
        prefix: "def sum(a, b):\n",
        suffix: "\n\nprint(sum(5, 8))",
        prompt: "Write this function.",
      },
      sampling: new Sampling().setSamplerTemperature(0.6).setSeed(RandomSeed).build(),
    });

    console.log("Infilling completion:", result.content);
  } catch (error) {
    console.error("Infilling error:", error);
  }
}

main();

6. Embeddings

Generate text embeddings for semantic search, clustering, and similarity tasks.

import { Client } from "@lucas-bortoli/fluent-llama";

async function main() {
  try {
    const client = await Client.from("http://localhost:8080");

    // Load an embedding model
    await client.load("all-MiniLM-L6-v2");

    // Create embedding model instance
    const embeddingModel = await client.createEmbeddingModel("all-MiniLM-L6-v2");

    // Generate embedding for single text (returns number[])
    const singleEmbedding = await embeddingModel.embed("Hello, world!");
    console.log("Single embedding dimension:", singleEmbedding.length);
    console.log("First 5 values:", singleEmbedding.slice(0, 5));

    // Generate embeddings for multiple texts (returns number[][])
    const multipleEmbeddings = await embeddingModel.embed([
      "Hello, world!",
      "How are you?",
      "Good morning!",
    ]);

    console.log("Generated", multipleEmbeddings.length, "embeddings");
    console.log("Each embedding has", multipleEmbeddings[0].length, "dimensions");
  } catch (error) {
    console.error("Embedding error:", error);
  }
}

main();
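
Once you have embeddings as plain number[] arrays, similarity tasks usually come down to comparing vectors. The sketch below shows one common approach, cosine similarity, plus a small ranking helper. Both cosineSimilarity and rankBySimilarity are hypothetical helpers written for this example and are not exported by fluent-llama.

```typescript
// Hypothetical helpers, NOT part of fluent-llama.

// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as unrelated to everything instead of dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns document indices ordered from most to least similar to the query.
function rankBySimilarity(query: number[], docs: number[][]): number[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.index);
}

// Toy 2-dimensional vectors standing in for real embeddings.
console.log(rankBySimilarity([1, 0], [[0, 1], [1, 0], [1, 1]])); // prints [ 1, 2, 0 ]
```

In practice you would pass the result of embed() for the query and the number[][] result for the candidate texts.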

7. Tokenization

Convert text to token IDs and back using the tokenize() and detokenize() methods on TextModel.

import { Client } from "@lucas-bortoli/fluent-llama";

async function main() {
  try {
    const client = await Client.from("http://localhost:8080");
    await client.load("Qwen3.6-35B-A3B");

    const llm = await client.createTextModel("Qwen3.6-35B-A3B");

    // Tokenize text into token IDs (returns number[])
    const tokens = await llm.tokenize({ text: "Hello, world!" });
    console.log("Token IDs:", tokens);

    // Detokenize token IDs back into text (returns string)
    const text = await llm.detokenize(tokens);
    console.log("Detokenized:", text);

    // Tokenize with piece metadata (returns ApiTokenizePiece[])
    const pieces = await llm.tokenize({ text: "Hello, world!", withPieces: true });
    for (const piece of pieces) {
      console.log(`Token ${piece.id}: "${String(piece.piece)}"`);
    }

    // Optional: include special tokens (BOS, EOS)
    const withSpecial = await llm.tokenize({
      text: "Hello, world!",
      addSpecial: true,
    });
    console.log("Token IDs with special tokens:", withSpecial);
  } catch (error) {
    console.error("Tokenization error:", error);
  }
}

main();

Error Handling

This library reports failures by throwing standard JavaScript Error instances. Any fallible operation can throw, so wrap calls in try/catch blocks and handle errors explicitly.

Available Error Classes

  • ApiRequestError - API request failures (includes httpStatusCode and responseBody)
  • InvalidParameterError - Invalid parameters provided
  • EmptyMessageArrayError - Empty message history
  • AbortedRequestError - Request was cancelled
  • UnexpectedServerBehaviorError - Server returned unexpected response
  • ModelLoadError - Model load failures
  • ModelUnloadError - Model unload failures
  • InvalidModelError - Invalid model ID

Example: Handling Different Error Types

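// Assumes llm and Sampling are set up as in the Quick Start, and that the
// error classes listed above are imported from "@lucas-bortoli/fluent-llama".
try {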
  const response = await llm.respond({
    instructions: "You are a helpful assistant.",
    history: [
      /* ... */
    ],
    sampling: new Sampling().build(),
  });
  console.log(response.response.content);
} catch (error) {
  if (error instanceof InvalidParameterError) {
    console.error("Invalid parameters:", error.message);
  } else if (error instanceof ApiRequestError) {
    console.error("API request failed:", {
      status: error.httpStatusCode,
      responseBody: error.responseBody,
    });
  } else if (error instanceof ModelLoadError) {
    console.error("Model load failed:", error.message, error.inner);
  } else {
    console.error("Unexpected error:", error);
  }
}
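
Because failures surface as thrown errors, a retry policy can be layered on top with a small wrapper. The sketch below is one possible shape, not something the library provides: withRetry, retryOn503, and FakeApiError are hypothetical names, FakeApiError only stands in for an error carrying an HTTP status so the example runs without a server, and treating 503 as retryable is an assumption.

```typescript
// Hypothetical stand-in for an error that carries an HTTP status code.
// With the real library you would inspect ApiRequestError instead.
class FakeApiError extends Error {
  readonly httpStatusCode: number;
  constructor(httpStatusCode: number) {
    super(`HTTP ${httpStatusCode}`);
    this.httpStatusCode = httpStatusCode;
  }
}

// Hypothetical helper, NOT part of fluent-llama: retries an async operation
// while isRetryable(error) is true, up to maxAttempts attempts in total.
async function withRetry<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 10,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (!isRetryable(error) || attempt === maxAttempts) break;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Example policy (an assumption): only retry "service unavailable" responses.
const retryOn503 = (error: unknown): boolean =>
  error instanceof FakeApiError && error.httpStatusCode === 503;
```

A call such as withRetry(() => llm.respond({ ... }), retryOn503) would then retry only the failures the policy accepts and rethrow everything else unchanged.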

Configuration Reference

Sampling

The Sampling class allows you to configure generation parameters fluently.

const config = new Sampling()
  .setSamplerTemperature(0.7)
  .setSamplerTopP(0.95)
  .setSamplerTopK(40)
  .setSeed(RandomSeed) // or setSeed(42) for deterministic results
  .setSamplerPresencePenalty(1.0)
  .setGrammar({ type: "Json", schema: { ... } }) // For structured outputs
  .build();
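
If the fluent style is unfamiliar, the sketch below shows the general pattern in miniature: each setter validates its input and returns this so calls can be chained, and build() hands back an immutable snapshot. MiniSampling and MiniSamplingConfig are hypothetical and only illustrate the pattern. They are not how the Sampling class is actually implemented.

```typescript
// Illustration of the fluent builder pattern. MiniSampling is hypothetical
// and is NOT the library's Sampling class.
type MiniSamplingConfig = { temperature?: number; topK?: number; seed?: number };

class MiniSampling {
  private config: MiniSamplingConfig = {};

  setTemperature(value: number): this {
    if (value < 0) throw new RangeError("temperature must be >= 0");
    this.config.temperature = value;
    return this; // returning this is what makes chaining possible
  }

  setTopK(value: number): this {
    if (!Number.isInteger(value) || value < 0) {
      throw new RangeError("topK must be a non-negative integer");
    }
    this.config.topK = value;
    return this;
  }

  setSeed(value: number): this {
    this.config.seed = value;
    return this;
  }

  // Returns a frozen copy so later setter calls cannot mutate a built config.
  build(): Readonly<MiniSamplingConfig> {
    return Object.freeze({ ...this.config });
  }
}

const miniConfig = new MiniSampling().setTemperature(0.7).setTopK(40).setSeed(42).build();
console.log(miniConfig); // prints { temperature: 0.7, topK: 40, seed: 42 }
```

Validating inside the setters means a bad value fails at the line that set it, not later when the request is sent.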

Toolset

The Toolset class manages available functions for the LLM.

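// weatherTool is defined in the Quick Start; webSearchTool is assumed to be defined the same way.
const tools = new Toolset([weatherTool, webSearchTool])
  .setWhitelist(["get_weather"]) // Only allow these tools (assuming the whitelist matches each tool's name)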
  .setBatchMode("Parallel") // Run tools concurrently
  .setInvocationRequirement("AsNeeded") // Or "RequireOne"
  .build();

Compatibility

This package is built specifically for the API exposed by llama-server (llama.cpp). Some endpoints go through llama-server's OpenAI-compatible layer, but the package also relies on llama-server's native endpoints for performance and feature support. Do not use this package with other LLM servers.

Disclaimer

This software is in Alpha.

  • Stability is not guaranteed.
  • API endpoints or types may change.
  • Do not use in production environments until a stable version is released.

License

MIT License. See LICENSE for details.