expo-fast-llm

v0.1.0

Published

18 days ago

resatiate

expo-litert-lm

Run Gemma and other LLMs fully on-device in your Expo / React Native app — no API keys, no cloud, no per-token costs, complete user privacy.

Built on Google's LiteRT-LM runtime with automatic backend selection (CPU / GPU / NPU), streaming, multimodal support, cancellation, and a typed React hook.

Status: Android only. iOS support is on the roadmap.

✨ Highlights

Dead-simple chat API: one line to ask a question.
Streaming: AsyncIterable<string> — no manual DeviceEventEmitter wiring.
React hook: useLiteRT() gives you reactive engine state.
Smart backend selection: automatic fallback from GPU → NPU → CPU, with device-aware preferences for known-buggy hardware (Pixel Tensor).
Multimodal: pass an image alongside your prompt for vision models.
Persistence: save the user's model choice so the app reloads instantly on next launch.
Cancellable: stop a long generation at any time.
Fully typed: TypeScript everywhere.

🚀 Installation

npx expo install expo-litert-lm

Add the plugin to your app.json / app.config.js:

{
  "expo": {
    "plugins": ["expo-litert-lm"]
  }
}

Then prebuild and run a dev client (this module ships native Android code):

npx expo prebuild --clean
npx expo run:android

📦 Getting a model

LiteRT-LM expects a .task or .tflite file. The most common choice is a Gemma model:

Download from LiteRT Community on Hugging Face (e.g. gemma-3n-E2B-it-int4.task).
Place it on the device — typically via expo-file-system by downloading to FileSystem.documentDirectory.
Pass the resulting absolute path to LiteRTLM.initialize({ modelPath }).

The plugin handles file:// URI prefixes automatically, so a URI produced by expo-file-system (which starts with file://) can be passed as-is.
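The download step above could be sketched roughly as follows. This is not part of the plugin: ensureModel and FileSystemLike are hypothetical names, and the interface only mirrors the three expo-file-system members the sketch relies on (documentDirectory, getInfoAsync, downloadAsync), so you can pass the real module in an app or a fake in tests. The URL is whatever location you host or download the model from.

```typescript
// Minimal structural view of the expo-file-system members used below.
interface FileSystemLike {
  documentDirectory: string | null;
  getInfoAsync(uri: string): Promise<{ exists: boolean }>;
  downloadAsync(url: string, fileUri: string): Promise<{ uri: string; status: number }>;
}

// Returns a local URI for the model, downloading it only when it is missing.
async function ensureModel(
  fs: FileSystemLike,
  url: string, // assumption: a direct download link to the model file
  fileName: string,
): Promise<string> {
  if (!fs.documentDirectory) {
    throw new Error('documentDirectory is unavailable on this platform');
  }
  const target = fs.documentDirectory + fileName;

  const info = await fs.getInfoAsync(target);
  if (info.exists) return target; // already present from an earlier launch

  const result = await fs.downloadAsync(url, target);
  if (result.status !== 200) {
    throw new Error(`Model download failed with HTTP ${result.status}`);
  }
  return result.uri;
}
```

In an app you would call something like ensureModel(FileSystem, url, 'gemma-3n-E2B-it-int4.task') and hand the result to initialize({ modelPath }). A production version would probably also want to clean up partial files after a failed download.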

💻 Quick start — the simplest possible example

import { useEffect, useState } from 'react';
import { Button, Text, View } from 'react-native';
import { useLiteRT } from 'expo-litert-lm';

export default function App() {
  const { isReady, isInitializing, initialize, generate } = useLiteRT();
  const [reply, setReply] = useState('');

  useEffect(() => {
    initialize({ modelPath: '/sdcard/Download/gemma.task' });
  }, []);

  return (
    <View style={{ padding: 24, marginTop: 60 }}>
      <Text>Status: {isReady ? '✅ ready' : isInitializing ? '⏳ loading' : '⌛'}</Text>
      <Button
        title="Ask"
        disabled={!isReady}
        onPress={async () => setReply(await generate('Why is the sky blue?'))}
      />
      <Text>{reply}</Text>
    </View>
  );
}

🌊 Streaming

generateStream and chatStream return an AsyncIterable<string>. Just use for await:

const { generateStream } = useLiteRT();

const askStreaming = async () => {
  setReply('');
  for await (const chunk of generateStream('Write a haiku about the moon')) {
    setReply((r) => r + chunk);
  }
};

Need to stop mid-generation? Either break from the loop or call .cancel() on the returned iterable:

const stream = generateStream('Long answer please');
setTimeout(() => stream.cancel(), 1000);
for await (const chunk of stream) { /* ... */ }
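If you want a hard cap on output length, one option is to combine both approaches. The helper below is a sketch, not a library function: collectUpTo and CancellableStream are hypothetical names, and the only thing assumed about the stream is the shape described above (an async iterable of strings with a cancel() method).

```typescript
// Shape described above: text chunks plus a cancel() method.
type CancellableStream = AsyncIterable<string> & { cancel(): unknown };

// Collects chunks until `maxChars` is reached, then cancels and stops reading.
async function collectUpTo(stream: CancellableStream, maxChars: number): Promise<string> {
  let text = '';
  for await (const chunk of stream) {
    text += chunk;
    if (text.length >= maxChars) {
      stream.cancel(); // ask the engine to stop generating
      break;           // and stop consuming chunks
    }
  }
  // The last chunk may overshoot the budget, so trim to the limit.
  return text.slice(0, maxChars);
}
```

Usage would look like `const preview = await collectUpTo(generateStream('Long answer please'), 280);`.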

💬 Multi-turn chat

const { chat } = useLiteRT();

const reply = await chat(
  [
    { role: 'user', content: 'My name is Alice.' },
    { role: 'model', content: 'Nice to meet you, Alice!' },
    { role: 'user', content: 'What did I just tell you my name was?' },
  ],
  { systemPrompt: 'You are concise.' },
);

The last message must have role user. Prior turns are folded into the system prompt as plain text, so you get history-aware answers without burning tokens regenerating the assistant's prior responses.
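To picture what "folding" means, here is an illustrative sketch. It is not the library's code, and the text layout it produces (the "User:" / "Model:" labels and blank-line separator) is an assumption made for the example; foldHistory and ChatMessage are hypothetical names.

```typescript
type Role = 'user' | 'model';
interface ChatMessage { role: Role; content: string }

// Splits a history into a system prompt (including prior turns as plain text)
// and the final user prompt that actually gets sent to the model.
function foldHistory(
  messages: ChatMessage[],
  systemPrompt = '',
): { system: string; prompt: string } {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'user') {
    throw new Error('The last message must have role "user"');
  }

  // Every turn except the last becomes one line of transcript.
  const transcript = messages
    .slice(0, -1)
    .map((m) => `${m.role === 'user' ? 'User' : 'Model'}: ${m.content}`)
    .join('\n');

  const system = [systemPrompt, transcript].filter((s) => s.length > 0).join('\n\n');
  return { system, prompt: last.content };
}
```

With the three-message example above, prompt would be the final question and system would hold the instruction followed by the two earlier turns.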

🧵 Persistent conversations (recommended for chat UIs)

chat() is stateless: every call spins up a fresh native conversation and folds the prior turns into the system prompt. That works for short interactions, but for a real chat UI you want the engine's native chat template tracking state across turns. Use createConversation to get a long-lived handle:

const conversation = await LiteRTLM.createConversation({
  systemPrompt: 'You are a helpful, concise assistant.',
});

const a = await conversation.sendMessage('My name is Alice.');
const b = await conversation.sendMessage('What did I just tell you?');

let joke = '';
for await (const chunk of conversation.sendMessageStream('Tell me a joke')) {
  joke += chunk; // accumulate chunks; process.stdout is not available in React Native
}

await conversation.close(); // or just let GC release it

The native Conversation is wired through an Expo SharedObject: it stays in memory across calls and is released when JS drops the reference (or when you explicitly close()). Multimodal works the same way:

await conversation.sendMessage('What is in this image?', {
  imageFilePath: '/sdcard/photo.jpg',
});
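If you would rather not depend on garbage collection, a small wrapper can guarantee that close() runs even when a turn throws. This is a sketch under assumptions: withConversation, ConversationLike, and ConversationFactory are hypothetical names, and the interfaces cover only the members used here (sendMessage is assumed to resolve to the reply text).

```typescript
// Only the members this sketch needs; the real Conversation has more.
interface ConversationLike {
  sendMessage(text: string): Promise<string>;
  close(): Promise<unknown>;
}
interface ConversationFactory {
  createConversation(opts?: { systemPrompt?: string }): Promise<ConversationLike>;
}

// Opens a conversation, runs `run`, and always closes the conversation afterwards.
async function withConversation<T>(
  engine: ConversationFactory,
  systemPrompt: string,
  run: (conversation: ConversationLike) => Promise<T>,
): Promise<T> {
  const conversation = await engine.createConversation({ systemPrompt });
  try {
    return await run(conversation);
  } finally {
    await conversation.close(); // runs on success and on error
  }
}
```

For a chat screen that lives for a long time you would more likely keep the handle in a ref and close it on unmount; the wrapper suits short scripted exchanges.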

🖼 Vision (multimodal)

const reply = await generate('What is in this image?', {
  imageFilePath: '/sdcard/Download/photo.jpg',
});

Make sure the model you loaded is multimodal (e.g. Gemma 3n). Configure the image pipeline at init time:

await LiteRTLM.initialize({
  modelPath: '/sdcard/gemma-3n.task',
  visionBackend: 'GPU',
  maxNumImages: 1,
});

🧠 Reactive engine status

useLiteRT() exposes a live status object. Native events drive re-renders; no polling.

const { status } = useLiteRT();
// status.state: 'NotConfigured' | 'Initializing' | 'Ready' | 'Closing' | 'Error'
// status.actualBackend: which backend actually loaded (CPU/GPU/NPU)
// status.usedFallback: true if we fell back from the requested backend
// status.errorMessage: populated when state === 'Error'

Or outside React:

const sub = LiteRTLM.onStatusChange((s) => console.log(s.state));
// later:
sub.remove();
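Outside React, the subscription can be turned into a promise that settles once the engine is usable. The helper below is a sketch, not a library export: waitForReady, StatusLike, and StatusSource are hypothetical names, and the interfaces cover only getStatus and onStatusChange as listed in the API reference.

```typescript
type EngineState = 'NotConfigured' | 'Initializing' | 'Ready' | 'Closing' | 'Error';
interface StatusLike { state: EngineState; errorMessage?: string }
interface StatusSource {
  getStatus(): Promise<StatusLike>;
  onStatusChange(fn: (s: StatusLike) => void): { remove(): void };
}

// Resolves when the engine reports 'Ready', rejects when it reports 'Error'.
function waitForReady(engine: StatusSource): Promise<void> {
  let sub: { remove(): void } | undefined;

  const ready = new Promise<void>((resolve, reject) => {
    const check = (s: StatusLike) => {
      if (s.state === 'Ready') resolve();
      else if (s.state === 'Error') reject(new Error(s.errorMessage ?? 'Engine error'));
    };
    // Subscribe first so no transition is missed, then read the current state.
    sub = engine.onStatusChange(check);
    engine.getStatus().then(check, reject);
  });

  // Drop the subscription whichever way the promise settles.
  return ready.finally(() => sub?.remove());
}
```

Calling `await waitForReady(LiteRTLM)` before the first chat() is one way to avoid "Engine is not ready" errors in non-React code.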

💾 Persisting the user's model choice

import { LiteRTLM } from 'expo-litert-lm';

// Save once after the user picks a model
await LiteRTLM.saveConfig({
  modelPath: '/sdcard/gemma.task',
  displayName: 'Gemma 2B int4',
  backend: 'GPU',
  temperature: 0.7,
});

// On next launch
const status = await LiteRTLM.initializeFromStoredConfig().catch(() => null);
if (!status) {
  // Prompt the user to choose a model
}

📚 Full API reference

Lifecycle

| Method | Description |
|---|---|
| initialize(config: ModelConfig): Promise<EngineStatus> | Loads a model. Auto-falls-back across backends. |
| initializeFromStoredConfig(): Promise<EngineStatus> | Loads from previously saved config. |
| getStatus(): Promise<EngineStatus> | One-shot status read. |
| getAvailableBackends(): Promise<Backend[]> | Returns the device-appropriate probe order. |
| onStatusChange(fn): StatusSubscription | Subscribe to status changes. |
| close(): Promise<boolean> | Release native resources. Idempotent. |
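
To make the fallback behaviour of initialize concrete, here is a conceptual model in TypeScript. It is not the plugin's implementation (that lives in the native Android layer): initWithFallback and tryLoad are hypothetical names, and the candidate ordering (requested backend first, then the rest of the probe order) is an assumption.

```typescript
type Backend = 'CPU' | 'GPU' | 'NPU';
interface FallbackResult { actualBackend: Backend; usedFallback: boolean }

// Tries each candidate backend in turn and reports which one loaded.
// `tryLoad` stands in for the native engine initialisation.
async function initWithFallback(
  requested: Backend,
  probeOrder: Backend[],
  tryLoad: (backend: Backend) => Promise<void>,
  strictBackend = false,
): Promise<FallbackResult> {
  // With strictBackend, only the requested backend is attempted.
  const candidates = strictBackend
    ? [requested]
    : [requested, ...probeOrder.filter((b) => b !== requested)];

  const errors: string[] = [];
  for (const backend of candidates) {
    try {
      await tryLoad(backend);
      return { actualBackend: backend, usedFallback: backend !== requested };
    } catch (e) {
      errors.push(`${backend}: ${e instanceof Error ? e.message : String(e)}`);
    }
  }
  throw new Error(`All backend candidates failed (${errors.join('; ')})`);
}
```

The returned fields mirror actualBackend and usedFallback on EngineStatus, which is what your UI would read after a real initialize() call.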

Inference

| Method | Description |
|---|---|
| generate(prompt, opts?) | Single-turn. Returns full response. |
| generateStream(prompt, opts?) | Single-turn. Returns AsyncIterable<string> with .cancel(). |
| chat(messages, opts?) | Multi-turn. Returns full response. |
| chatStream(messages, opts?) | Multi-turn streaming. |
| createConversation(opts?) | Persistent multi-turn conversation backed by a native SharedObject. Returns a Conversation with sendMessage / sendMessageStream / close. |
| cancel(): Promise<boolean> | Cancels the current in-flight inference. |
| benchmark(prompt?): Promise<BenchmarkResult> | Latency / approx-tps measurement. |

Persistence

| Method | Description |
|---|---|
| saveConfig(config) | Persist model config to SharedPreferences. |
| loadConfig() | Returns the persisted config, or null. |
| clearConfig() | Erase persisted config. |

Types

See src/types.ts:

type Backend = 'CPU' | 'GPU' | 'NPU';

interface ModelConfig {
  modelPath: string;        // required
  displayName?: string;
  backend?: Backend;        // default 'GPU'
  visionBackend?: Backend;
  audioBackend?: Backend;
  maxNumImages?: number;    // default 1
  maxTokens?: number;       // default 1024
  topK?: number;            // default 40
  topP?: number;            // default 0.95
  temperature?: number;     // default 0.7
  strictBackend?: boolean;  // default false; if true, no fallback
}

interface EngineStatus {
  state: 'NotConfigured' | 'Initializing' | 'Ready' | 'Closing' | 'Error';
  modelConfig?: ModelConfig;
  errorMessage?: string;
  initDurationMs?: number;
  actualBackend?: Backend;
  usedFallback: boolean;
}
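
When displaying or logging the effective settings, it can help to resolve the documented defaults on the JS side. The sketch below is not a library function: withDefaults and ResolvedConfig are hypothetical names, the interface is repeated only so the block compiles on its own, and the default values are the ones listed in the comments above.

```typescript
type Backend = 'CPU' | 'GPU' | 'NPU';

// Repeated from the types above so this sketch is self-contained.
interface ModelConfig {
  modelPath: string;
  displayName?: string;
  backend?: Backend;
  visionBackend?: Backend;
  audioBackend?: Backend;
  maxNumImages?: number;
  maxTokens?: number;
  topK?: number;
  topP?: number;
  temperature?: number;
  strictBackend?: boolean;
}

type DefaultedKey =
  | 'backend' | 'maxNumImages' | 'maxTokens' | 'topK' | 'topP' | 'temperature' | 'strictBackend';
type ResolvedConfig = ModelConfig & Required<Pick<ModelConfig, DefaultedKey>>;

// Fills in the documented defaults for every option the caller left unset.
function withDefaults(config: ModelConfig): ResolvedConfig {
  if (config.modelPath.trim() === '') {
    throw new Error('modelPath is required');
  }
  return {
    ...config,
    backend: config.backend ?? 'GPU',
    maxNumImages: config.maxNumImages ?? 1,
    maxTokens: config.maxTokens ?? 1024,
    topK: config.topK ?? 40,
    topP: config.topP ?? 0.95,
    temperature: config.temperature ?? 0.7,
    strictBackend: config.strictBackend ?? false,
  };
}
```

Using `??` rather than object spread alone means an explicitly undefined option still receives its default.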

🛠 Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| MODEL_NOT_FOUND | Wrong path / missing file | Verify with FileSystem.getInfoAsync(path) |
| "All backend candidates failed" | Model file corrupted or incompatible | Re-download the model |
| OOM / crash on first inference | Model larger than device RAM permits | Use a smaller int4-quantized model |
| GPU crashes on Pixel 8 / 9 | Known OpenCL / Tensor SoC issues | Plugin auto-falls back to CPU; no action needed |
| App freezes during streaming | Long generation; user wants to stop | Call stream.cancel() or LiteRTLM.cancel() |
| Engine is not ready on chat() | Called before initialize() resolved | Await initialization or check status.state === 'Ready' |
| APK too large | Model bundled in assets | Download at runtime instead of bundling |

Building

npm install
npm run build      # compile TS
npm test           # run jest
npm run lint

🏗 Architecture

The Android side is layered for testability and clarity:

┌──────────────────────────────────────────────────────────────┐
│  LiteRtLMModule  (React Native bridge / thin shim)           │
├──────────────────────────────────────────────────────────────┤
│  LiteRtLlmProvider           LiteRtConversationContext       │
│  (stateless chat / stream)   (SharedObject, multi-turn)      │
├──────────────────────────────────────────────────────────────┤
│  LiteRtEngineManager (lifecycle, fallback, mutex, init cache)│
├──────────────────────────────────────────────────────────────┤
│  LiteRtConfigStore   (SharedPreferences)                     │
├──────────────────────────────────────────────────────────────┤
│  Google LiteRT-LM SDK                                        │
└──────────────────────────────────────────────────────────────┘

All public Kotlin classes carry KDoc — open them in Android Studio for inline reference.

📝 License

MIT