expo-fast-llm
v0.1.0
expo-litert-lm
Run Gemma and other LLMs fully on-device in your Expo / React Native app — no API keys, no cloud, no per-token costs, complete user privacy.
Built on Google's LiteRT-LM runtime with automatic backend selection (CPU / GPU / NPU), streaming, multimodal support, cancellation, and a typed React hook.
Status: Android only. iOS support is on the roadmap.
✨ Highlights
- Dead-simple chat API: one line to ask a question.
- Streaming: AsyncIterable<string>, no manual DeviceEventEmitter wiring.
- React hook: useLiteRT() gives you reactive engine state.
- Smart backend selection: automatic fallback from GPU → NPU → CPU, with device-aware preferences for known-buggy hardware (Pixel Tensor).
- Multimodal: pass an image alongside your prompt for vision models.
- Persistence: save the user's model choice so the app reloads instantly on next launch.
- Cancellable: stop a long generation at any time.
- Fully typed: TypeScript everywhere.
🚀 Installation
npx expo install expo-litert-lm
Add the plugin to your app.json / app.config.js:
{
  "expo": {
    "plugins": ["expo-litert-lm"]
  }
}
Then prebuild and run a dev client (this module ships native Android code):
npx expo prebuild --clean
npx expo run:android
📦 Getting a model
LiteRT-LM expects a .task or .tflite file. The most common choice is a Gemma model:
- Download from LiteRT Community on Hugging Face (e.g. gemma-3n-E2B-it-int4.task).
- Place it on the device, typically via expo-file-system by downloading to FileSystem.documentDirectory.
- Pass the resulting absolute path to LiteRTLM.initialize({ modelPath }).
The plugin handles file:// URI prefixes automatically.
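The download step above could be sketched as a small helper that fetches the model only when it is not already on the device. This is a minimal sketch, not part of the library: ensureModelFile and FileSystemLike are hypothetical names, and the interface is declared locally so that it only mirrors the documentDirectory, getInfoAsync and downloadAsync members of expo-file-system.

```typescript
// Minimal shape of the file-system calls used below. It mirrors
// documentDirectory / getInfoAsync / downloadAsync from expo-file-system,
// but is declared locally (an assumption) so the sketch is self-contained.
interface FileSystemLike {
  documentDirectory: string | null;
  getInfoAsync(uri: string): Promise<{ exists: boolean }>;
  downloadAsync(url: string, fileUri: string): Promise<{ status: number }>;
}

// Hypothetical helper: returns the local path of the model file,
// downloading it only if it is not present yet.
async function ensureModelFile(
  fs: FileSystemLike,
  url: string,
  fileName: string,
): Promise<string> {
  if (!fs.documentDirectory) {
    throw new Error('documentDirectory is not available on this platform');
  }
  // documentDirectory already ends with a trailing slash.
  const target = fs.documentDirectory + fileName;

  const info = await fs.getInfoAsync(target);
  if (info.exists) return target; // downloaded on a previous launch

  const result = await fs.downloadAsync(url, target);
  if (result.status < 200 || result.status >= 300) {
    throw new Error(`Model download failed with HTTP ${result.status}`);
  }
  return target;
}
```

In an app, the returned path would be passed on as modelPath to LiteRTLM.initialize({ modelPath }); the download URL is whatever model link you picked in the first step.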
💻 Quick start — the simplest possible example
import { useEffect, useState } from 'react';
import { Button, Text, View } from 'react-native';
import { useLiteRT } from 'expo-litert-lm';

export default function App() {
  const { isReady, isInitializing, initialize, generate } = useLiteRT();
  const [reply, setReply] = useState('');

  useEffect(() => {
    initialize({ modelPath: '/sdcard/Download/gemma.task' });
  }, []);

  return (
    <View style={{ padding: 24, marginTop: 60 }}>
      <Text>Status: {isReady ? '✅ ready' : isInitializing ? '⏳ loading' : '⌛'}</Text>
      <Button
        title="Ask"
        disabled={!isReady}
        onPress={async () => setReply(await generate('Why is the sky blue?'))}
      />
      <Text>{reply}</Text>
    </View>
  );
}
🌊 Streaming
generateStream and chatStream return an AsyncIterable<string>. Just use for await:
const { generateStream } = useLiteRT();
const askStreaming = async () => {
  setReply('');
  for await (const chunk of generateStream('Write a haiku about the moon')) {
    setReply((r) => r + chunk);
  }
};
Need to stop mid-generation? Either break from the loop or call .cancel() on the returned iterable:
const stream = generateStream('Long answer please');
setTimeout(() => stream.cancel(), 1000);
for await (const chunk of stream) { /* ... */ }
💬 Multi-turn chat
const { chat } = useLiteRT();
const reply = await chat(
  [
    { role: 'user', content: 'My name is Alice.' },
    { role: 'model', content: 'Nice to meet you, Alice!' },
    { role: 'user', content: 'What did I just tell you my name was?' },
  ],
  { systemPrompt: 'You are concise.' },
);
The last message must have role user. Prior turns are folded into the system prompt as plain text, so you get history-aware answers without burning tokens regenerating the assistant's prior responses.
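To make the folding idea concrete, here is a rough sketch of how earlier turns might be flattened into a system prompt. It is only an illustration: foldHistory is a hypothetical function, and the "User:" / "Model:" transcript layout is an assumption, since the format the library actually uses is not described here.

```typescript
type Role = 'user' | 'model';

interface ChatMessage {
  role: Role;
  content: string;
}

// Hypothetical illustration of folding earlier turns into the system prompt.
// The transcript layout ("User: ..." / "Model: ...") is an assumption.
function foldHistory(
  messages: ChatMessage[],
  systemPrompt = '',
): { systemPrompt: string; prompt: string } {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'user') {
    throw new Error('The last message must have role "user"');
  }

  // Every turn except the last becomes one plain-text line.
  const transcript = messages
    .slice(0, -1)
    .map((m) => `${m.role === 'user' ? 'User' : 'Model'}: ${m.content}`)
    .join('\n');

  // Skip empty parts so no stray blank lines end up in the prompt.
  const parts = [systemPrompt, transcript].filter((p) => p.length > 0);
  return { systemPrompt: parts.join('\n\n'), prompt: last.content };
}
```

Only the final user message is sent as the prompt; everything before it travels as context, which is why the last message has to come from the user.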
🧵 Persistent conversations (recommended for chat UIs)
chat() is stateless: every call spins up a fresh native conversation and folds the prior turns into the system prompt. That works for short interactions, but for a real chat UI you want the engine's native chat template tracking state across turns. Use createConversation to get a long-lived handle:
const conversation = await LiteRTLM.createConversation({
  systemPrompt: 'You are a helpful, concise assistant.',
});
const a = await conversation.sendMessage('My name is Alice.');
const b = await conversation.sendMessage('What did I just tell you?');
for await (const chunk of conversation.sendMessageStream('Tell me a joke')) {
  console.log(chunk); // process.stdout is not available in React Native
}
await conversation.close(); // or just let GC release it
The native Conversation is wired through an Expo SharedObject: it stays in memory across calls and is released when JS drops the reference (or when you explicitly close()). Multimodal works the same way:
await conversation.sendMessage('What is in this image?', {
  imageFilePath: '/sdcard/photo.jpg',
});
🖼 Vision (multimodal)
const reply = await generate('What is in this image?', {
  imageFilePath: '/sdcard/Download/photo.jpg',
});
Make sure the model you loaded is multimodal (e.g. Gemma 3n). Configure the image pipeline at init time:
await LiteRTLM.initialize({
  modelPath: '/sdcard/gemma-3n.task',
  visionBackend: 'GPU',
  maxNumImages: 1,
});
🧠 Reactive engine status
useLiteRT() exposes a live status object. Native events drive re-renders; no polling.
const { status } = useLiteRT();
// status.state: 'NotConfigured' | 'Initializing' | 'Ready' | 'Closing' | 'Error'
// status.actualBackend: which backend actually loaded (CPU/GPU/NPU)
// status.usedFallback: true if we fell back from the requested backend
// status.errorMessage: populated when state === 'Error'
Or outside React:
const sub = LiteRTLM.onStatusChange((s) => console.log(s.state));
// later:
sub.remove();
💾 Persisting the user's model choice
import { LiteRTLM } from 'expo-litert-lm';

// Save once after the user picks a model
await LiteRTLM.saveConfig({
  modelPath: '/sdcard/gemma.task',
  displayName: 'Gemma 2B int4',
  backend: 'GPU',
  temperature: 0.7,
});

// On next launch
const status = await LiteRTLM.initializeFromStoredConfig().catch(() => null);
if (!status) {
  // Prompt the user to choose a model
}
📚 Full API reference
Lifecycle
| Method | Description |
|---|---|
| initialize(config: ModelConfig): Promise<EngineStatus> | Loads a model. Auto-falls-back across backends. |
| initializeFromStoredConfig(): Promise<EngineStatus> | Loads from previously saved config. |
| getStatus(): Promise<EngineStatus> | One-shot status read. |
| getAvailableBackends(): Promise<Backend[]> | Returns the device-appropriate probe order. |
| onStatusChange(fn): StatusSubscription | Subscribe to status changes. |
| close(): Promise<boolean> | Release native resources. Idempotent. |
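Outside React, getStatus() and onStatusChange() can be combined to wait until the engine is usable. The following is a sketch under stated assumptions: waitForReady, EngineLike and StatusLike are hypothetical names declared locally, modelled on the method signatures in the table above rather than imported from the package.

```typescript
type EngineState = 'NotConfigured' | 'Initializing' | 'Ready' | 'Closing' | 'Error';

interface StatusLike {
  state: EngineState;
  errorMessage?: string;
}

// Local stand-in for the two lifecycle methods used here (an assumption).
interface EngineLike {
  getStatus(): Promise<StatusLike>;
  onStatusChange(fn: (s: StatusLike) => void): { remove(): void };
}

// Resolves once the engine reports 'Ready'; rejects on 'Error' or timeout.
function waitForReady(engine: EngineLike, timeoutMs = 30000): Promise<StatusLike> {
  return new Promise<StatusLike>((resolve, reject) => {
    let settled = false;
    let sub: { remove(): void } | undefined;
    let timer: ReturnType<typeof setTimeout> | undefined;

    // Runs the given action once and always cleans up timer and listener.
    const settle = (action: () => void): void => {
      if (settled) return;
      settled = true;
      if (timer !== undefined) clearTimeout(timer);
      sub?.remove();
      action();
    };

    const inspect = (s: StatusLike): void => {
      if (s.state === 'Ready') {
        settle(() => resolve(s));
      } else if (s.state === 'Error') {
        settle(() => reject(new Error(s.errorMessage ?? 'Engine reported an error')));
      }
    };

    timer = setTimeout(
      () => settle(() => reject(new Error('Timed out waiting for the engine'))),
      timeoutMs,
    );
    sub = engine.onStatusChange(inspect);
    if (settled) sub.remove(); // listener fired synchronously while subscribing
    // Also read the current status in case 'Ready' was reached before subscribing.
    engine.getStatus().then(inspect, (err) => settle(() => reject(err)));
  });
}
```

Subscribing first and reading the status second avoids missing a transition that happens between the two calls.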
Inference
| Method | Description |
|---|---|
| generate(prompt, opts?) | Single-turn. Returns full response. |
| generateStream(prompt, opts?) | Single-turn. Returns AsyncIterable<string> with .cancel(). |
| chat(messages, opts?) | Multi-turn. Returns full response. |
| chatStream(messages, opts?) | Multi-turn streaming. |
| createConversation(opts?) | Persistent multi-turn conversation backed by a native SharedObject. Returns a Conversation with sendMessage / sendMessageStream / close. |
| cancel(): Promise<boolean> | Cancels the current in-flight inference. |
| benchmark(prompt?): Promise<BenchmarkResult> | Latency / approx-tps measurement. |
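As an illustration of the streaming and cancellation entries above, a stream can be drained into a single string while enforcing a time budget. This is a sketch, not library code: collectWithBudget and CancellableStream are hypothetical names, and it assumes that calling cancel() makes the iteration end normally rather than throw.

```typescript
// Local stand-in for the value returned by generateStream / chatStream:
// an AsyncIterable<string> that also exposes cancel() (declared here as
// an assumption so the sketch is self-contained).
interface CancellableStream extends AsyncIterable<string> {
  cancel(): unknown;
}

// Collects chunks until the stream ends or the time budget runs out.
async function collectWithBudget(
  stream: CancellableStream,
  budgetMs: number,
): Promise<{ text: string; timedOut: boolean }> {
  let text = '';
  let timedOut = false;

  const timer = setTimeout(() => {
    timedOut = true;
    stream.cancel(); // assumed to end the for-await loop below
  }, budgetMs);

  try {
    for await (const chunk of stream) {
      text += chunk;
    }
  } finally {
    clearTimeout(timer); // never leave the timer running
  }
  return { text, timedOut };
}
```

A call such as collectWithBudget(generateStream('Long answer please'), 5000) would then return whatever text arrived within five seconds.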
Persistence
| Method | Description |
|---|---|
| saveConfig(config) | Persist model config to SharedPreferences. |
| loadConfig() | Returns the persisted config, or null. |
| clearConfig() | Erase persisted config. |
Types
See src/types.ts:
type Backend = 'CPU' | 'GPU' | 'NPU';

interface ModelConfig {
  modelPath: string;        // required
  displayName?: string;
  backend?: Backend;        // default 'GPU'
  visionBackend?: Backend;
  audioBackend?: Backend;
  maxNumImages?: number;    // default 1
  maxTokens?: number;       // default 1024
  topK?: number;            // default 40
  topP?: number;            // default 0.95
  temperature?: number;     // default 0.7
  strictBackend?: boolean;  // default false; if true, no fallback
}

interface EngineStatus {
  state: 'NotConfigured' | 'Initializing' | 'Ready' | 'Closing' | 'Error';
  modelConfig?: ModelConfig;
  errorMessage?: string;
  initDurationMs?: number;
  actualBackend?: Backend;
  usedFallback: boolean;
}
🛠 Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| MODEL_NOT_FOUND | Wrong path / missing file | Verify with FileSystem.getInfoAsync(path) |
| "All backend candidates failed" | Model file corrupted or incompatible | Re-download the model |
| OOM / crash on first inference | Model larger than device RAM permits | Use a smaller int4-quantized model |
| GPU crashes on Pixel 8 / 9 | Known OpenCL / Tensor SoC issues | Plugin auto-falls back to CPU; no action needed |
| App freezes during streaming | Long generation; user wants to stop | Call stream.cancel() or LiteRTLM.cancel() |
| Engine is not ready on chat() | Called before initialize() resolved | Await initialization or check status.state === 'Ready' |
| APK too large | Model bundled in assets | Download at runtime instead of bundling |
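Some rows of the table above can be turned into user-facing hints in code. The sketch below is illustrative only: troubleshootingHint is a hypothetical helper, and it assumes that the symptom strings from the table appear in the thrown error's code or message, which this README does not confirm.

```typescript
// Hints taken from the troubleshooting table. The match strings are
// assumptions about what shows up in an error's code or message.
const HINTS: ReadonlyArray<{ match: string; hint: string }> = [
  { match: 'MODEL_NOT_FOUND', hint: 'Verify the path with FileSystem.getInfoAsync(path).' },
  {
    match: 'All backend candidates failed',
    hint: 'Re-download the model; the file may be corrupted or incompatible.',
  },
  { match: 'Engine is not ready', hint: "Await initialize() or check status.state === 'Ready'." },
];

// Returns a hint for a known symptom, or undefined for anything else.
function troubleshootingHint(error: unknown): string | undefined {
  const e = error as { code?: unknown; message?: unknown } | null;
  // Search both the code and the message; fall back to the raw value.
  const text = `${String(e?.code ?? '')} ${String(e?.message ?? error)}`;
  return HINTS.find((h) => text.includes(h.match))?.hint;
}
```

A catch block around initialize() or chat() could pass the caught value to this helper and show the hint next to the raw error.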
Building
npm install
npm run build   # compile TS
npm test        # run jest
npm run lint
🏗 Architecture
The Android side is layered for testability and clarity:
┌──────────────────────────────────────────────────────────────┐
│ LiteRtLMModule (React Native bridge / thin shim) │
├──────────────────────────────────────────────────────────────┤
│ LiteRtLlmProvider LiteRtConversationContext │
│ (stateless chat / stream) (SharedObject, multi-turn) │
├──────────────────────────────────────────────────────────────┤
│ LiteRtEngineManager (lifecycle, fallback, mutex, init cache)│
├──────────────────────────────────────────────────────────────┤
│ LiteRtConfigStore (SharedPreferences) │
├──────────────────────────────────────────────────────────────┤
│ Google LiteRT-LM SDK │
└──────────────────────────────────────────────────────────────┘
All public Kotlin classes carry KDoc; open them in Android Studio for inline reference.
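The fallback responsibility of the engine manager layer can be pictured with a short conceptual model. It is written in TypeScript for consistency with the rest of this README and is not the Kotlin implementation: loadWithFallback and its tryLoad callback are hypothetical, and only the strictBackend, actualBackend and usedFallback names come from the types documented above.

```typescript
type Backend = 'CPU' | 'GPU' | 'NPU';

interface LoadResult {
  actualBackend: Backend;
  usedFallback: boolean;
}

// Conceptual model of backend fallback: try the requested backend first,
// then the remaining backends in probe order, unless strict mode is on.
async function loadWithFallback(
  requested: Backend,
  probeOrder: Backend[],
  strictBackend: boolean,
  tryLoad: (backend: Backend) => Promise<void>, // hypothetical loader callback
): Promise<LoadResult> {
  const candidates = strictBackend
    ? [requested]
    : [requested, ...probeOrder.filter((b) => b !== requested)];

  const failures: string[] = [];
  for (const backend of candidates) {
    try {
      await tryLoad(backend);
      return { actualBackend: backend, usedFallback: backend !== requested };
    } catch (err) {
      // Remember why this backend failed and move on to the next one.
      failures.push(`${backend}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All backend candidates failed (${failures.join('; ')})`);
}
```

With strictBackend set to true the candidate list has a single entry, so a failure surfaces immediately instead of silently switching backends.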
📝 License
MIT
