bare-llama.cpp
Native llama.cpp bindings for Bare.
Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.
Requirements
- CMake 3.25+
- C/C++ compiler (clang, gcc, or MSVC)
- Node.js (for npm/cmake-bare)
- Bare runtime
Building
Clone with submodules:
git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp
Or if already cloned:
git submodule update --init --recursive
Install dependencies and build:
npm install
Or manually:
bare-make generate
bare-make build
bare-make install
This creates prebuilds/<platform>-<arch>/bare-llama.bare.
Build Options
For a debug build:
bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build
To disable GPU acceleration:
bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build
Usage
const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')
// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
nGpuLayers: 99 // Offload layers to GPU (0 = CPU only)
})
// Create context
const ctx = new LlamaContext(model, {
contextSize: 2048, // Max context length
batchSize: 512 // Batch size for prompt processing
})
// Create sampler
const sampler = new LlamaSampler(model, {
temp: 0.7, // Temperature (0 = greedy)
topK: 40, // Top-K sampling
topP: 0.95 // Top-P (nucleus) sampling
})
// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)
// Cleanup
sampler.free()
ctx.free()
model.free()
Embeddings
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
contextSize: 512,
embeddings: true,
poolingType: 1 // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})
const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1) // Float32Array
// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)
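// The two vectors can be compared with cosine similarity (a minimal sketch;
// see examples/cosine-similarity.js for the full example shipped with the addon)
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
console.log('similarity:', cosineSimilarity(embedding, embedding2))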
ctx.free()
model.free()
Reranking
Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. BGE reranker) with poolingType: 4 (rank).
Important: You must call ctx.clearMemory() before each scoring to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
contextSize: 512,
embeddings: true,
poolingType: 4 // rank pooling (required for rerankers)
})
function rerank (query, document) {
ctx.clearMemory() // critical: clear KV cache before each pair
const tokens = model.tokenize(query + '\n' + document, true)
ctx.decode(tokens)
return ctx.getEmbeddings(0)[0] // single float score
}
const query = 'What is machine learning?'
const docs = [
'Machine learning is a branch of AI that learns from data.',
'The recipe calls for two cups of flour and one egg.'
]
const scored = docs
.map((doc, i) => ({ i, score: rerank(query, doc) }))
.sort((a, b) => b.score - a.score)
for (const { i, score } of scored) {
console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}
ctx.free()
model.free()
Constrained Generation
const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })
// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
type: 'object',
properties: { name: { type: 'string' }, age: { type: 'integer' } },
required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })
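// Generating with the constrained sampler (illustrative prompt and token limit;
// the output is forced to conform to the schema)
const person = generate(model, ctx, sampler, 'Describe a person as JSON: ', 128)
console.log(person) // e.g. {"name":"Ada","age":36}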
// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })
Examples
| Example | Description |
|---------|-------------|
| examples/text-generation.js | High-level generate() API |
| examples/token-by-token.js | Manual tokenize/sample/decode loop |
| examples/cosine-similarity.js | Embeddings + semantic similarity |
| examples/json-constrained-output.js | JSON schema constrained generation |
| examples/lark-constrained-output.js | Lark grammar constrained generation |
| examples/tool-use-agent.js | Agentic tool calling with multi-turn |
Run examples with:
bare examples/text-generation.js -- /path/to/model.gguf
Testing
Tests use brittle and skip gracefully when models aren't available.
npm test
Model-dependent tests require Ollama models installed locally:
ollama pull llama3.2:1b # generation tests
ollama pull nomic-embed-text # embedding tests
ollama pull qllama/bge-reranker-v2-m3 # reranking tests
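For reference, a model-gated test looks roughly like the following (a sketch, not the project's actual suite; the model path and skip pattern are illustrative):

const test = require('brittle')
const { LlamaModel } = require('bare-llama')

const MODEL = './model.gguf' // illustrative path; real tests resolve Ollama models

test('tokenize round-trip', (t) => {
  let model
  try {
    model = new LlamaModel(MODEL)
  } catch {
    t.pass('model not available, skipping')
    return
  }
  const tokens = model.tokenize('hello world', true)
  t.ok(tokens.length > 0)
  t.is(typeof model.detokenize(tokens), 'string')
  model.free()
})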
Benchmarks
npm run bench
Results are saved to bench/results/ as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.
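To inspect the history programmatically, something like this works (a sketch; the file name is hypothetical and the record fields depend on the benchmark):

const fs = require('bare-fs')
const text = fs.readFileSync('./bench/results/history.jsonl').toString() // hypothetical file name
for (const line of text.trim().split('\n')) {
  const run = JSON.parse(line) // one benchmark run with its metadata
  console.log(run)
}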
API Reference
LlamaModel
new LlamaModel(path, options?)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| nGpuLayers | number | 0 | Number of layers to offload to GPU |
Properties:
- name - Model name from metadata
- embeddingDimension - Embedding vector size
- trainingContextSize - Training context length
Methods:
- tokenize(text, addBos?) - Convert text to tokens (Int32Array)
- detokenize(tokens) - Convert tokens back to text
- isEogToken(token) - Check if token is end-of-generation
- getMeta(key) - Get model metadata by key
- free() - Release model resources
LlamaContext
new LlamaContext(model, options?)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| contextSize | number | 512 | Maximum context length |
| batchSize | number | 512 | Batch size for processing |
| embeddings | boolean | false | Enable embedding mode |
| poolingType | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |
Properties:
- contextSize - Actual context size
Methods:
- decode(tokens) - Process tokens through the model
- getEmbeddings(idx) - Get embedding vector (Float32Array)
- clearMemory() - Clear context for reuse (faster than creating a new context)
- free() - Release context resources
LlamaSampler
new LlamaSampler(model, options?)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| temp | number | 0 | Temperature (0 = greedy sampling) |
| topK | number | 40 | Top-K sampling parameter |
| topP | number | 0.95 | Top-P (nucleus) sampling parameter |
| json | string | - | JSON schema constraint (requires llguidance) |
| lark | string | - | Lark grammar constraint (requires llguidance) |
Methods:
- sample(ctx, idx) - Sample next token (-1 for last position)
- accept(token) - Accept token into sampler state
- free() - Release sampler resources
generate()
generate(model, ctx, sampler, prompt, maxTokens?)
Convenience function for simple text generation. Returns the generated text (not including the prompt).
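For finer control, the same result can be produced with a manual loop over the documented primitives (a rough sketch assuming decode() appends to the current context; see examples/token-by-token.js for the maintained version):

// model, ctx and sampler created as in the Usage section above
const prompt = 'The meaning of life is'
const maxTokens = 128
const promptTokens = model.tokenize(prompt, true)
ctx.decode(promptTokens)
let output = ''
for (let i = 0; i < maxTokens; i++) {
  const token = sampler.sample(ctx, -1) // sample at the last position
  if (model.isEogToken(token)) break // stop at end-of-generation
  sampler.accept(token) // update sampler state
  output += model.detokenize(new Int32Array([token]))
  ctx.decode(new Int32Array([token])) // feed the sampled token back in
}
console.log(output)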
Utility Functions
- setQuiet(quiet?) - Suppress llama.cpp output
- setLogLevel(level) - Set log level (0=off, 1=errors, 2=all)
- readGgufMeta(path, key) - Read GGUF metadata without loading the model
- getModelName(path) - Get model name from GGUF file
- systemInfo() - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)
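A quick sketch of the utilities in use (the metadata key shown is a standard GGUF key and only an example):

const { setLogLevel, readGgufMeta, getModelName, systemInfo } = require('bare-llama')
setLogLevel(1) // errors only
console.log(getModelName('./model.gguf'))
console.log(readGgufMeta('./model.gguf', 'general.architecture')) // example key
console.log(systemInfo())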
Project Structure
index.js Main module
binding.cpp C++ native bindings
lib/
ollama-models.js Ollama model discovery
ollama.js GGUF metadata + Jinja chat templates
test/ Brittle test suite
bench/ Benchmark system
examples/ Usage examples
tools/
ollama-hyperdrive.js P2P model distribution (standalone CLI)
Models
This addon works with GGUF format models. You can use models from Ollama (auto-detected from ~/.ollama/models) or download GGUF files directly from Hugging Face.
Platform Support
| Platform | Architecture | GPU Support |
|----------|--------------|-------------|
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |
Constrained generation (llguidance)
JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. This is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance — constrained generation is unavailable on those platforms.
Note: Lark grammar constraints are currently not working correctly — llguidance does not appear to enforce token constraints as expected (e.g. allowing "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.
License
MIT
