bare-llama.cpp
Native llama.cpp bindings for Bare.
Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.
Requirements
- CMake 3.25+
- C/C++ compiler (clang, gcc, or MSVC)
- Node.js (for npm/cmake-bare)
- Bare runtime
Building
Clone with submodules:
git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp
Or if already cloned:
git submodule update --init --recursive
Install dependencies and build:
npm install
Or manually:
bare-make generate
bare-make build
bare-make install
This creates prebuilds/<platform>-<arch>/bare-llama.bare.
Build Options
For a debug build:
bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build
To disable GPU acceleration:
bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build
Usage
const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')
// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
nGpuLayers: 99 // Offload layers to GPU (0 = CPU only)
})
// Create context
const ctx = new LlamaContext(model, {
contextSize: 2048, // Max context length
batchSize: 512 // Batch size for prompt processing
})
// Create sampler
const sampler = new LlamaSampler(model, {
temp: 0.7, // Temperature (0 = greedy)
topK: 40, // Top-K sampling
topP: 0.95 // Top-P (nucleus) sampling
})
// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)
// Cleanup
sampler.free()
ctx.free()
model.free()
Embeddings
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
contextSize: 512,
embeddings: true,
poolingType: 1 // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})
const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1) // Float32Array
// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)
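// The two vectors can be compared with cosine similarity (a minimal sketch;
// see examples/cosine-similarity.js for the full example shipped with the addon)
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
console.log('similarity:', cosineSimilarity(embedding, embedding2))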
ctx.free()
model.free()
Reranking
Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. BGE reranker) with poolingType: 4 (rank).
Important: You must call ctx.clearMemory() before each scoring to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
contextSize: 512,
embeddings: true,
poolingType: 4 // rank pooling (required for rerankers)
})
function rerank (query, document) {
ctx.clearMemory() // critical: clear KV cache before each pair
const tokens = model.tokenize(query + '\n' + document, true)
ctx.decode(tokens)
return ctx.getEmbeddings(0)[0] // single float score
}
const query = 'What is machine learning?'
const docs = [
'Machine learning is a branch of AI that learns from data.',
'The recipe calls for two cups of flour and one egg.'
]
const scored = docs
.map((doc, i) => ({ i, score: rerank(query, doc) }))
.sort((a, b) => b.score - a.score)
for (const { i, score } of scored) {
console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}
ctx.free()
model.free()
Constrained Generation
const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')
setQuiet(true)
const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })
// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
type: 'object',
properties: { name: { type: 'string' }, age: { type: 'integer' } },
required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })
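// Generating with the constrained sampler (illustrative prompt and token limit;
// the output is forced to conform to the schema)
const person = generate(model, ctx, sampler, 'Describe a person as JSON: ', 128)
console.log(person) // e.g. {"name":"Ada","age":36}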
// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })
Examples
| Example | Description |
|---------|-------------|
| examples/text-generation.js | High-level generate() API |
| examples/token-by-token.js | Manual tokenize/sample/decode loop |
| examples/cosine-similarity.js | Embeddings + semantic similarity |
| examples/json-constrained-output.js | JSON schema constrained generation |
| examples/lark-constrained-output.js | Lark grammar constrained generation |
| examples/tool-use-agent.js | Agentic tool calling with multi-turn |
Run examples with:
bare examples/text-generation.js -- /path/to/model.gguf
Testing
Tests use brittle and skip gracefully when models aren't available.
npm test
Model-dependent tests require Ollama models installed locally:
ollama pull llama3.2:1b # generation tests
ollama pull nomic-embed-text # embedding tests
ollama pull qllama/bge-reranker-v2-m3 # reranking tests
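For reference, a model-gated test looks roughly like the following (a sketch, not the project's actual suite; the model path and skip pattern are illustrative):

const test = require('brittle')
const { LlamaModel } = require('bare-llama')

const MODEL = './model.gguf' // illustrative path; real tests resolve Ollama models

test('tokenize round-trip', (t) => {
  let model
  try {
    model = new LlamaModel(MODEL)
  } catch {
    t.pass('model not available, skipping')
    return
  }
  const tokens = model.tokenize('hello world', true)
  t.ok(tokens.length > 0)
  t.is(typeof model.detokenize(tokens), 'string')
  model.free()
})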
Benchmarks
npm run bench
Results are saved to bench/results/ as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.
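To inspect the history programmatically, something like this works (a sketch; the file name is hypothetical and the record fields depend on the benchmark):

const fs = require('bare-fs')
const text = fs.readFileSync('./bench/results/history.jsonl').toString() // hypothetical file name
for (const line of text.trim().split('\n')) {
  const run = JSON.parse(line) // one benchmark run with its metadata
  console.log(run)
}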
API Reference
LlamaModel
new LlamaModel(path, options?)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| nGpuLayers | number | 0 | Number of layers to offload to GPU |
Properties:
- name - Model name from metadata
- embeddingDimension - Embedding vector size
- trainingContextSize - Training context length
Methods:
- tokenize(text, addBos?) - Convert text to tokens (Int32Array)
- detokenize(tokens) - Convert tokens back to text
- isEogToken(token) - Check if token is end-of-generation
- getMeta(key) - Get model metadata by key
- free() - Release model resources
LlamaContext
new LlamaContext(model, options?)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| contextSize | number | 512 | Maximum context length |
| batchSize | number | 512 | Batch size for processing |
| embeddings | boolean | false | Enable embedding mode |
| poolingType | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |
Properties:
- contextSize - Actual context size
Methods:
- decode(tokens) - Process tokens through the model
- getEmbeddings(idx) - Get embedding vector (Float32Array)
- clearMemory() - Clear context for reuse (faster than creating a new context)
- free() - Release context resources
LlamaSampler
new LlamaSampler(model, options?)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| temp | number | 0 | Temperature (0 = greedy sampling) |
| topK | number | 40 | Top-K sampling parameter |
| topP | number | 0.95 | Top-P (nucleus) sampling parameter |
| json | string | - | JSON schema constraint (requires llguidance) |
| lark | string | - | Lark grammar constraint (requires llguidance) |
Methods:
- sample(ctx, idx) - Sample next token (-1 for last position)
- accept(token) - Accept token into sampler state
- free() - Release sampler resources
generate()
generate(model, ctx, sampler, prompt, maxTokens?)
Convenience function for simple text generation. Returns the generated text (not including the prompt).
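For finer control, the same result can be produced with a manual loop over the documented primitives (a rough sketch assuming decode() appends to the current context; see examples/token-by-token.js for the maintained version):

// model, ctx and sampler created as in the Usage section above
const prompt = 'The meaning of life is'
const maxTokens = 128
const promptTokens = model.tokenize(prompt, true)
ctx.decode(promptTokens)
let output = ''
for (let i = 0; i < maxTokens; i++) {
  const token = sampler.sample(ctx, -1) // sample at the last position
  if (model.isEogToken(token)) break // stop at end-of-generation
  sampler.accept(token) // update sampler state
  output += model.detokenize(new Int32Array([token]))
  ctx.decode(new Int32Array([token])) // feed the sampled token back in
}
console.log(output)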
Utility Functions
- setQuiet(quiet?) - Suppress llama.cpp output
- setLogLevel(level) - Set log level (0=off, 1=errors, 2=all)
- readGgufMeta(path, key) - Read GGUF metadata without loading the model
- getModelName(path) - Get model name from GGUF file
- systemInfo() - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)
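A quick sketch of the utilities in use (the metadata key shown is a standard GGUF key and only an example):

const { setLogLevel, readGgufMeta, getModelName, systemInfo } = require('bare-llama')
setLogLevel(1) // errors only
console.log(getModelName('./model.gguf'))
console.log(readGgufMeta('./model.gguf', 'general.architecture')) // example key
console.log(systemInfo())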
Project Structure
index.js Main module
binding.cpp C++ native bindings
lib/
ollama-models.js Ollama model discovery
ollama.js GGUF metadata + Jinja chat templates
test/ Brittle test suite
bench/ Benchmark system
examples/ Usage examples
tools/
ollama-hyperdrive.js P2P model distribution (standalone CLI)
Models
This addon works with GGUF format models. You can use models from Ollama (auto-detected from ~/.ollama/models) or download GGUF files directly from Hugging Face.
Platform Support
| Platform | Architecture | GPU Support |
|----------|--------------|-------------|
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |
Constrained generation (llguidance)
JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. This is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance — constrained generation is unavailable on those platforms.
Note: Lark grammar constraints are currently not working correctly — llguidance does not appear to enforce token constraints as expected (e.g. allowing "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.
License
MIT
