@runanywhere/llamacpp
LlamaCPP backend for the RunAnywhere React Native SDK. Provides on-device LLM text generation with GGUF models powered by llama.cpp.
Overview
@runanywhere/llamacpp provides the LlamaCPP backend for on-device Large Language Model (LLM) inference. It enables:
- Text Generation — Generate text responses from prompts
- Streaming — Real-time token-by-token output
- GGUF Support — Run any GGUF-format model (Llama, Mistral, Qwen, SmolLM, etc.)
- Metal GPU Acceleration — 3-5x faster inference on Apple Silicon (iOS)
- CPU Inference — Works on all devices without GPU requirements
- Memory Efficient — Quantized models (Q4, Q6, Q8) for reduced memory usage
Requirements
- @runanywhere/core (peer dependency)
- React Native 0.71+
- iOS 15.1+ / Android API 24+
Installation
npm install @runanywhere/core @runanywhere/llamacpp
# or
yarn add @runanywhere/core @runanywhere/llamacpp
iOS Setup
cd ios && pod install && cd ..
Android Setup
No additional setup required. Native libraries are downloaded automatically.
Quick Start
import { RunAnywhere, SDKEnvironment, ModelCategory } from '@runanywhere/core';
import { LlamaCPP } from '@runanywhere/llamacpp';
// 1. Initialize SDK
await RunAnywhere.initialize({
environment: SDKEnvironment.Development,
});
// 2. Register LlamaCPP backend
LlamaCPP.register();
// 3. Add a model
await LlamaCPP.addModel({
id: 'smollm2-360m-q8_0',
name: 'SmolLM2 360M Q8_0',
url: 'https://huggingface.co/prithivMLmods/SmolLM2-360M-GGUF/resolve/main/SmolLM2-360M.Q8_0.gguf',
memoryRequirement: 500_000_000,
});
// 4. Download model
await RunAnywhere.downloadModel('smollm2-360m-q8_0', (progress) => {
console.log(`Downloading: ${(progress.progress * 100).toFixed(1)}%`);
});
// 5. Load model
const modelInfo = await RunAnywhere.getModelInfo('smollm2-360m-q8_0');
await RunAnywhere.loadModel(modelInfo.localPath);
// 6. Generate text
const response = await RunAnywhere.chat('What is the capital of France?');
console.log(response);
API Reference
LlamaCPP Module
import { LlamaCPP } from '@runanywhere/llamacpp';
LlamaCPP.register()
Register the LlamaCPP backend with the SDK. Must be called before using LLM features.
LlamaCPP.register(): void
Example:
await RunAnywhere.initialize({ ... });
LlamaCPP.register(); // Now LLM features are available
LlamaCPP.addModel(options)
Add a GGUF model to the model registry.
await LlamaCPP.addModel(options: LlamaCPPModelOptions): Promise<ModelInfo>
Parameters:
interface LlamaCPPModelOptions {
/**
* Unique model ID.
* If not provided, generated from the URL filename.
*/
id?: string;
/** Display name for the model */
name: string;
/** Download URL for the model (GGUF format) */
url: string;
/**
* Model category.
* Default: ModelCategory.Language
*/
modality?: ModelCategory;
/**
* Memory requirement in bytes.
* Used for device capability checks.
*/
memoryRequirement?: number;
/**
* Whether model supports reasoning/thinking tokens.
* If true, thinking content is extracted from responses.
*/
supportsThinking?: boolean;
}
Returns: Promise<ModelInfo> — The registered model info
Example:
// Basic model
await LlamaCPP.addModel({
id: 'smollm2-360m-q8_0',
name: 'SmolLM2 360M Q8_0',
url: 'https://huggingface.co/prithivMLmods/SmolLM2-360M-GGUF/resolve/main/SmolLM2-360M.Q8_0.gguf',
memoryRequirement: 500_000_000,
});
// Larger model
await LlamaCPP.addModel({
id: 'llama-2-7b-chat-q4_k_m',
name: 'Llama 2 7B Chat Q4_K_M',
url: 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf',
memoryRequirement: 4_000_000_000,
});
// Model with thinking support (e.g., DeepSeek-R1)
await LlamaCPP.addModel({
id: 'deepseek-r1-distill-qwen-1.5b',
name: 'DeepSeek R1 Distill Qwen 1.5B',
url: 'https://huggingface.co/.../deepseek-r1-distill-qwen-1.5b-q8_0.gguf',
memoryRequirement: 2_000_000_000,
supportsThinking: true,
});
Module Properties
LlamaCPP.moduleId // 'llamacpp'
LlamaCPP.moduleName // 'LlamaCPP'
LlamaCPP.inferenceFramework // LLMFramework.LlamaCpp
LlamaCPP.capabilities // ['llm']
LlamaCPP.defaultPriority // 100
Text Generation
Once a model is registered and loaded, use the RunAnywhere API for generation:
Simple Chat
const response = await RunAnywhere.chat('Hello!');
console.log(response);
Generation with Options
const result = await RunAnywhere.generate(
'Explain machine learning in simple terms',
{
maxTokens: 256,
temperature: 0.7,
topP: 0.95,
systemPrompt: 'You are a helpful teacher.',
stopSequences: ['\n\n'],
}
);
console.log('Response:', result.text);
console.log('Tokens:', result.tokensUsed);
console.log('Speed:', result.performanceMetrics.tokensPerSecond, 'tok/s');
console.log('TTFT:', result.performanceMetrics.timeToFirstTokenMs, 'ms');
Streaming Generation
const streamResult = await RunAnywhere.generateStream(
'Write a story about a robot',
{ maxTokens: 500 }
);
// Display tokens as they're generated
for await (const token of streamResult.stream) {
process.stdout.write(token);
}
// Get final metrics
const result = await streamResult.result;
console.log('\nSpeed:', result.performanceMetrics.tokensPerSecond, 'tok/s');
Model Management
// Load model
await RunAnywhere.loadModel('/path/to/model.gguf');
// Check if loaded
const isLoaded = await RunAnywhere.isModelLoaded();
// Unload to free memory
await RunAnywhere.unloadModel();
// Cancel ongoing generation
await RunAnywhere.cancelGeneration();
Supported Models
Any GGUF-format model works with this backend. Recommended models:
Small Models (< 1GB RAM)
| Model | Size | Memory | Description |
|-------|------|--------|-------------|
| SmolLM2 360M Q8_0 | ~400MB | 500MB | Fast, lightweight |
| Qwen 2.5 0.5B Q6_K | ~500MB | 600MB | Multilingual |
| LFM2 350M Q4_K_M | ~200MB | 250MB | Ultra-compact |
Medium Models (1-3GB RAM)
| Model | Size | Memory | Description |
|-------|------|--------|-------------|
| Phi-3 Mini Q4_K_M | ~2GB | 2.5GB | Microsoft |
| Gemma 2B Q4_K_M | ~1.5GB | 2GB | Google |
| TinyLlama 1.1B Q4_K_M | ~700MB | 1GB | Fast chat |
Large Models (4GB+ RAM)
| Model | Size | Memory | Description |
|-------|------|--------|-------------|
| Llama 2 7B Q4_K_M | ~4GB | 5GB | Meta |
| Mistral 7B Q4_K_M | ~4GB | 5GB | Mistral AI |
| Llama 3.2 3B Q4_K_M | ~2GB | 3GB | Meta latest |
Performance Tips
Device Recommendations
- Apple Silicon (M1/M2/M3, A14+): Metal GPU acceleration provides 3-5x speedup
- Modern Android: 6GB+ RAM recommended for 7B models
- Older devices: Use smaller models (360M-1B); a device-based selection sketch follows this list
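One way to act on these recommendations is to key the model choice off total device RAM. The sketch below assumes the third-party react-native-device-info package (not part of this SDK), and the RAM threshold is illustrative:
import DeviceInfo from 'react-native-device-info'; // assumed third-party dependency
// Pick a registered model ID based on total device RAM (threshold is illustrative).
async function pickModelId(): Promise<string> {
  const totalRamBytes = await DeviceInfo.getTotalMemory();
  if (totalRamBytes >= 6_000_000_000) {
    return 'llama-2-7b-chat-q4_k_m'; // 7B-class model for high-memory devices
  }
  return 'smollm2-360m-q8_0'; // lightweight default for everything else
}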
Optimization Strategies
- Use quantized models — Q4_K_M offers best quality/size ratio
- Limit maxTokens — Shorter responses = faster generation
- Unload when idle — Free memory for other apps
- Pre-download models — Better UX during onboarding (the last two strategies are sketched after this list)
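A minimal sketch of the last two strategies, using only the RunAnywhere APIs shown above (the AppState handling and the model ID are illustrative):
import { AppState } from 'react-native';
import { RunAnywhere } from '@runanywhere/core';
// Unload when idle: free the model whenever the app moves to the background.
AppState.addEventListener('change', async (state) => {
  if (state === 'background' && (await RunAnywhere.isModelLoaded())) {
    await RunAnywhere.unloadModel();
  }
});
// Pre-download during onboarding so the first generation doesn't block on a download.
// The model must already be registered via LlamaCPP.addModel().
async function prefetchModel(): Promise<void> {
  await RunAnywhere.downloadModel('smollm2-360m-q8_0', (progress) => {
    console.log(`Prefetching: ${(progress.progress * 100).toFixed(1)}%`);
  });
}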
Expected Performance
| Device | Model | Speed |
|--------|-------|-------|
| iPhone 15 Pro | SmolLM2 360M Q8 | 50-80 tok/s |
| iPhone 15 Pro | Llama 3.2 3B Q4 | 15-25 tok/s |
| MacBook M2 | Llama 2 7B Q4 | 20-40 tok/s |
| Pixel 8 | SmolLM2 360M Q8 | 30-50 tok/s |
Native Integration
iOS
This package uses RABackendLLAMACPP.xcframework, which includes:
- llama.cpp compiled for iOS (arm64)
- Metal GPU acceleration
- Optimized NEON SIMD
The framework is automatically downloaded during pod install.
Android
Native library librunanywhere_llamacpp.so includes:
- llama.cpp compiled for Android (arm64-v8a, armeabi-v7a)
- OpenMP threading support
- Optimized for ARM NEON
Libraries are automatically downloaded during Gradle build.
Package Structure
packages/llamacpp/
├── src/
│ ├── index.ts # Package exports
│ ├── LlamaCPP.ts # Module API (register, addModel)
│ ├── LlamaCppProvider.ts # Service provider
│ ├── native/
│ │ └── NativeRunAnywhereLlama.ts
│ └── specs/
│ └── RunAnywhereLlama.nitro.ts
├── cpp/
│ ├── HybridRunAnywhereLlama.cpp
│ ├── HybridRunAnywhereLlama.hpp
│ └── bridges/
├── ios/
│ ├── RunAnywhereLlama.podspec
│ └── Frameworks/
│ └── RABackendLLAMACPP.xcframework
├── android/
│ ├── build.gradle
│ └── src/main/jniLibs/
│ └── arm64-v8a/
│ └── librunanywhere_llamacpp.so
└── nitrogen/
    └── generated/
Troubleshooting
Model fails to load
Symptoms: modelLoadFailed error
Solutions:
- Check that the file exists at the path (see the load sketch after this list)
- Verify GGUF format (not GGML, SafeTensors, etc.)
- Ensure sufficient memory (check memoryRequirement)
- Try a smaller model
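A defensive load path covering the first check might look like this; a sketch built on the getModelInfo and loadModel calls from Quick Start, with illustrative error handling:
import { RunAnywhere } from '@runanywhere/core';
// Verify the model file was downloaded before loading, and surface load failures.
async function safeLoad(modelId: string): Promise<boolean> {
  const info = await RunAnywhere.getModelInfo(modelId);
  if (!info?.localPath) {
    console.warn(`Model ${modelId} has not been downloaded yet`);
    return false;
  }
  try {
    await RunAnywhere.loadModel(info.localPath);
    return true;
  } catch (error) {
    console.error('modelLoadFailed:', error);
    return false;
  }
}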
Slow generation
Symptoms: < 5 tokens/second (see the benchmark sketch after the solutions list)
Solutions:
- Use a smaller model (360M instead of 7B)
- Check device isn't thermal throttling
- Close other apps to free memory
- On iOS, ensure Metal is enabled
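To confirm throughput on the current device, a quick check using the metrics shown under Generation with Options can help (the prompt and token budget are arbitrary):
// Quick throughput check: generate a short response and read the metrics.
const result = await RunAnywhere.generate('Count from 1 to 20.', { maxTokens: 64 });
console.log('Speed:', result.performanceMetrics.tokensPerSecond, 'tok/s');
console.log('TTFT:', result.performanceMetrics.timeToFirstTokenMs, 'ms');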
Out of memory
Symptoms: App crash during inference
Solutions:
- Unload the current model before loading a new one (see the sketch after this list)
- Use a smaller/more quantized model
- Reduce context length (fewer tokens)
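For the first step, switching models without ever holding two in memory can be sketched as follows (the model path would come from getModelInfo, as in Quick Start):
// Free the current model before loading the next one.
async function switchModel(modelPath: string): Promise<void> {
  if (await RunAnywhere.isModelLoaded()) {
    await RunAnywhere.unloadModel();
  }
  await RunAnywhere.loadModel(modelPath);
}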
See Also
- Main SDK README — Full SDK documentation
- API Reference — Complete API docs
- @runanywhere/core — Core SDK
- @runanywhere/onnx — STT/TTS backend
- llama.cpp — Underlying engine
License
MIT License
