bare-llama.cpp

Native llama.cpp bindings for Bare.

Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.

Requirements

  • CMake 3.25+
  • C/C++ compiler (clang, gcc, or MSVC)
  • Node.js (for npm/cmake-bare)
  • Bare runtime

Building

Clone with submodules:

git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp

Or if already cloned:

git submodule update --init --recursive

Install dependencies and build:

npm install

Or manually:

bare-make generate
bare-make build
bare-make install

This creates prebuilds/<platform>-<arch>/bare-llama.bare.

Build Options

For a debug build:

bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build

To disable GPU acceleration:

bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build

Usage

const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')

// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
  nGpuLayers: 99  // Offload layers to GPU (0 = CPU only)
})

// Create context
const ctx = new LlamaContext(model, {
  contextSize: 2048,  // Max context length
  batchSize: 512      // Batch size for prompt processing
})

// Create sampler
const sampler = new LlamaSampler(model, {
  temp: 0.7,    // Temperature (0 = greedy)
  topK: 40,     // Top-K sampling
  topP: 0.95    // Top-P (nucleus) sampling
})

// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)

// Cleanup
sampler.free()
ctx.free()
model.free()

Embeddings

const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 1  // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})

const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1)  // Float32Array

// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)

ctx.free()
model.free()
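
To compare the two vectors, plain cosine similarity over the returned Float32Arrays is enough. This small sketch continues the snippet above and uses no additional API:

function cosineSimilarity (a, b) {
  // Dot product and norms over two equal-length Float32Arrays
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

console.log(cosineSimilarity(embedding, embedding2))  // closer to 1 = more similar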

Reranking

Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. BGE reranker) with poolingType: 4 (rank).

Important: You must call ctx.clearMemory() before each scoring to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.

const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 4  // rank pooling (required for rerankers)
})

function rerank (query, document) {
  ctx.clearMemory()  // critical: clear KV cache before each pair
  const tokens = model.tokenize(query + '\n' + document, true)
  ctx.decode(tokens)
  return ctx.getEmbeddings(0)[0]  // single float score
}

const query = 'What is machine learning?'
const docs = [
  'Machine learning is a branch of AI that learns from data.',
  'The recipe calls for two cups of flour and one egg.'
]

const scored = docs
  .map((doc, i) => ({ i, score: rerank(query, doc) }))
  .sort((a, b) => b.score - a.score)

for (const { i, score } of scored) {
  console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}

ctx.free()
model.free()

Constrained Generation

const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })

// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
  type: 'object',
  properties: { name: { type: 'string' }, age: { type: 'integer' } },
  required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })

// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })
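
A constrained sampler is passed to generate() the same way as an unconstrained one. A brief sketch continuing the block above (the prompt text is arbitrary):

const output = generate(model, ctx, sampler, 'Describe a person named Alice, aged 30, as JSON: ', 128)
console.log(output)  // output conforms to the JSON schema

sampler2.free()
sampler.free()
ctx.free()
model.free()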

Examples

| Example | Description |
|---------|-------------|
| examples/text-generation.js | High-level generate() API |
| examples/token-by-token.js | Manual tokenize/sample/decode loop |
| examples/cosine-similarity.js | Embeddings + semantic similarity |
| examples/json-constrained-output.js | JSON schema constrained generation |
| examples/lark-constrained-output.js | Lark grammar constrained generation |
| examples/tool-use-agent.js | Multi-turn agentic tool calling |

Run examples with:

bare examples/text-generation.js -- /path/to/model.gguf

Testing

Tests use brittle and skip gracefully when models aren't available.

npm test

Model-dependent tests require Ollama models installed locally:

ollama pull llama3.2:1b        # generation tests
ollama pull nomic-embed-text   # embedding tests
ollama pull qllama/bge-reranker-v2-m3  # reranking tests
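
The skip pattern looks roughly like the sketch below, assuming brittle's skip option and a model file checked via bare-fs (the path and the exact check are hypothetical, not the suite's actual code):

const test = require('brittle')
const fs = require('bare-fs')

const MODEL = '/path/to/llama3.2-1b.gguf'  // hypothetical: in practice resolved from ~/.ollama/models

test('text generation', { skip: !fs.existsSync(MODEL) }, (t) => {
  const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')

  const model = new LlamaModel(MODEL)
  const ctx = new LlamaContext(model, { contextSize: 512 })
  const sampler = new LlamaSampler(model, { temp: 0 })

  const out = generate(model, ctx, sampler, 'Hello', 8)
  t.ok(out.length > 0, 'produced some text')

  sampler.free()
  ctx.free()
  model.free()
})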

Benchmarks

npm run bench

Results are saved to bench/results/ as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.
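
A comparison script can read the history back with plain JSON parsing. A minimal sketch, assuming a JSONL file under bench/results/ (the filename and field names below are illustrative, not the actual schema):

const fs = require('bare-fs')

// Each line is one benchmark run serialized as JSON
const lines = fs.readFileSync('./bench/results/history.jsonl').toString().trim().split('\n')

for (const line of lines) {
  const run = JSON.parse(line)
  // Field names are placeholders for the recorded metadata and metrics
  console.log(run.date, run.llamaCppVersion, run.platform, run.tokensPerSecond)
}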

API Reference

LlamaModel

new LlamaModel(path, options?)

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| nGpuLayers | number | 0 | Number of layers to offload to GPU |

Properties:

  • name - Model name from metadata
  • embeddingDimension - Embedding vector size
  • trainingContextSize - Training context length

Methods:

  • tokenize(text, addBos?) - Convert text to tokens (Int32Array)
  • detokenize(tokens) - Convert tokens back to text
  • isEogToken(token) - Check if token is end-of-generation
  • getMeta(key) - Get model metadata by key
  • free() - Release model resources
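
A short sketch of the token and metadata helpers (the model path and metadata key are placeholders):

const { LlamaModel } = require('bare-llama')

const model = new LlamaModel('./model.gguf')

// Round-trip text through the tokenizer
const tokens = model.tokenize('Hello world', true)  // addBos = true
console.log(tokens.length, model.detokenize(tokens))

// Inspect properties and GGUF metadata
console.log(model.name, model.embeddingDimension, model.trainingContextSize)
console.log(model.getMeta('general.architecture'))  // key is an assumption

model.free()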

LlamaContext

new LlamaContext(model, options?)

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| contextSize | number | 512 | Maximum context length |
| batchSize | number | 512 | Batch size for processing |
| embeddings | boolean | false | Enable embedding mode |
| poolingType | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |

Properties:

  • contextSize - Actual context size

Methods:

  • decode(tokens) - Process tokens through the model
  • getEmbeddings(idx) - Get embedding vector (Float32Array)
  • clearMemory() - Clear context for reuse (faster than creating new context)
  • free() - Release context resources
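
Note that the contextSize property reflects the size the context was actually created with, which may differ from the requested value. A minimal sketch (the model path is a placeholder):

const { LlamaModel, LlamaContext } = require('bare-llama')

const model = new LlamaModel('./model.gguf')
const ctx = new LlamaContext(model, { contextSize: 4096 })

console.log(ctx.contextSize)  // actual size used by the context

ctx.free()
model.free()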

LlamaSampler

new LlamaSampler(model, options?)

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| temp | number | 0 | Temperature (0 = greedy sampling) |
| topK | number | 40 | Top-K sampling parameter |
| topP | number | 0.95 | Top-P (nucleus) sampling parameter |
| json | string | - | JSON schema constraint (requires llguidance) |
| lark | string | - | Lark grammar constraint (requires llguidance) |

Methods:

  • sample(ctx, idx) - Sample next token (-1 for last position)
  • accept(token) - Accept token into sampler state
  • free() - Release sampler resources
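
Together with LlamaContext.decode(), these methods support a manual generation loop in the spirit of examples/token-by-token.js. A rough sketch (the model path is a placeholder, and passing single tokens back through detokenize/decode as one-element Int32Arrays is an assumption):

const { LlamaModel, LlamaContext, LlamaSampler } = require('bare-llama')

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })
const sampler = new LlamaSampler(model, { temp: 0.7, topK: 40, topP: 0.95 })

// Feed the prompt once, then sample one token at a time
ctx.decode(model.tokenize('Once upon a time', true))

let text = ''
for (let i = 0; i < 64; i++) {
  const token = sampler.sample(ctx, -1)   // sample at the last position
  if (model.isEogToken(token)) break      // stop at end-of-generation
  sampler.accept(token)                   // update sampler state
  text += model.detokenize(new Int32Array([token]))
  ctx.decode(new Int32Array([token]))     // feed the token back for the next step
}
console.log(text)

sampler.free()
ctx.free()
model.free()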

generate()

generate(model, ctx, sampler, prompt, maxTokens?)

Convenience function for simple text generation. Returns the generated text (not including the prompt).

Utility Functions

  • setQuiet(quiet?) - Suppress llama.cpp output
  • setLogLevel(level) - Set log level (0=off, 1=errors, 2=all)
  • readGgufMeta(path, key) - Read GGUF metadata without loading the model
  • getModelName(path) - Get model name from GGUF file
  • systemInfo() - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)
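
For example (the model path is a placeholder and the metadata key follows common GGUF conventions):

const { setLogLevel, readGgufMeta, getModelName, systemInfo } = require('bare-llama')

setLogLevel(1)  // errors only

// Inspect a GGUF file without loading the weights
console.log(getModelName('./model.gguf'))
console.log(readGgufMeta('./model.gguf', 'general.architecture'))  // key is an assumption

// Report compiled-in hardware features (AVX, NEON, Metal, CUDA, ...)
console.log(systemInfo())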

Project Structure

index.js              Main module
binding.cpp           C++ native bindings
lib/
  ollama-models.js    Ollama model discovery
  ollama.js           GGUF metadata + Jinja chat templates
test/                 Brittle test suite
bench/                Benchmark system
examples/             Usage examples
tools/
  ollama-hyperdrive.js  P2P model distribution (standalone CLI)

Models

This addon works with GGUF format models. You can use models from Ollama (auto-detected from ~/.ollama/models) or download GGUF files directly from Hugging Face.

Platform Support

| Platform | Architecture | GPU Support |
|----------|--------------|-------------|
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |

Constrained generation (llguidance)

JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. This is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance — constrained generation is unavailable on those platforms.

Note: Lark grammar constraints are currently not working correctly — llguidance does not appear to enforce token constraints as expected (e.g. allowing "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.

License

MIT