# @mikesaintsg/inference
Zero-dependency, adapter-based LLM inference library for browser and Node.js applications.
## Features
- ✅ Session-Based Conversations — Maintain conversation history with automatic context management
- ✅ Ephemeral Generation — Stateless single-shot completions for one-off tasks
- ✅ Unified Streaming — Async iteration over tokens with abort control and events
- ✅ Token Batching — Coalesce tokens for smoother UI updates with boundary detection
- ✅ Abort Coordination — Coordinated cancellation across multiple operations
- ✅ Timeout Monitoring — Detect token stalls and total-timeout overruns
- ✅ Token Counting — Estimate tokens for context window management
- ✅ Context Integration — Generate from BuiltContext (contextbuilder)
- ✅ Model Orchestrator — Progressive model loading with tier-based generation
- ✅ Intent Detection — Classify user input into search, question, action, or navigation
- ✅ Circuit Breaker — Prevent cascading failures with circuit breaker pattern
- ✅ Retry Logic — Configurable retry with exponential backoff for transient failures
- ✅ Rate Limiting — Concurrency control with acquire/release semantics
- ✅ Telemetry Support — Observability with spans, metrics, and logging
- ✅ Zero dependencies — Built on native fetch, EventSource, and browser/Node APIs
- ✅ TypeScript first — Full type safety with generics and strict types
## Installation

```bash
npm install @mikesaintsg/inference
```

## Quick Start

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createOpenAIProviderAdapter } from '@mikesaintsg/adapters'

// Create engine with provider adapter (required first parameter)
const engine = createEngine(
  createOpenAIProviderAdapter({
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
  }),
)

// Create a conversation session
const session = engine.createSession({
  system: 'You are a helpful assistant.',
})

// Add a message and generate a response
session.addMessage('user', 'Hello!')
const result = await session.generate()
console.log(result.text)

// Stream responses
const handle = session.stream()
for await (const token of handle) {
  process.stdout.write(token)
}
```

## Documentation
📚 Full API Guide — Comprehensive documentation with examples
### Key Sections
- Package Purpose & Philosophy — Design principles and boundaries
- Core Concepts — Engine, Session, Message, StreamHandle hierarchy
- Architecture Walkthrough — Request lifecycle and layers
- Provider Adapters — Custom adapter implementation
- Streaming & Events — Token queue and UX patterns
- Error Model — Error taxonomy and recovery
- API Reference — Complete API documentation
## API Overview
### Factory Functions
| Function | Description |
|---------------------------------------|------------------------------------|
| createEngine(provider, options?) | Create an inference engine |
| createTokenBatcher(options?) | Create a token batcher for UI |
| createTokenCounter(options?) | Create a token counter |
| createAbortScope() | Create an abort scope |
| createTimeoutMonitor(options?) | Create a timeout monitor |
| createModelOrchestrator(options) | Create a model orchestrator |
| createIntentDetector(options) | Create an intent detector |
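
The coordination factories have no dedicated examples below, so here is a minimal sketch of how `createAbortScope`, `createTimeoutMonitor`, and `createIntentDetector` might be wired together. The option, method, and callback names used here (`stallTimeoutMs`, `totalTimeoutMs`, `onTimeout`, `detect`, the `signal` generation option) are assumptions inferred from the feature list, not confirmed API; consult the API Reference for the actual signatures.

```typescript
import {
  createAbortScope,
  createIntentDetector,
  createTimeoutMonitor,
} from '@mikesaintsg/inference'

// One scope can cancel several in-flight operations together.
const scope = createAbortScope()

// Assumed option names: a per-token stall limit and a whole-request limit.
const monitor = createTimeoutMonitor({
  stallTimeoutMs: 5_000,
  totalTimeoutMs: 60_000,
})

// Assumed callback name: cancel everything in the scope when a timeout fires.
monitor.onTimeout(() => scope.abort())

// Assumed option name: thread the scope's signal through a generation call
// (`engine` as created in the Quick Start).
const result = await engine.generate(
  [{ id: '1', role: 'user', content: 'Hello!', createdAt: Date.now() }],
  { signal: scope.signal },
)

// Assumed method name: classify free-form input before routing it.
const detector = createIntentDetector({ /* options per the API Reference */ })
const intent = detector.detect('open the settings page') // search | question | action | navigation
```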
### EngineInterface
| Method | Description |
|------------------------------------------|----------------------------------|
| createSession(options?) | Create a conversation session |
| generate(messages, options?) | Ephemeral generation (stateless) |
| stream(messages, options?) | Ephemeral streaming (stateless) |
| generateFromContext(context, options?) | Generate from BuiltContext |
| streamFromContext(context, options?) | Stream from BuiltContext |
| countTokens(text, model) | Count tokens in text |
| countMessages(messages, model) | Count tokens in messages |
| fitsInContext(content, model, max?) | Check if content fits |
| getContextWindowSize(model) | Get model context size |
| abort(requestId) | Abort a request by ID |
| getDeduplicationStats() | Get request deduplication stats |
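
As a quick illustration of the stateless half of this interface, here is a short sketch reusing the `engine` from the Quick Start. The message shape matches the later examples; whether the token-counting methods are synchronous or async is an assumption, so they are awaited defensively here.

```typescript
// Ephemeral, stateless generation: no session and no stored history
const result = await engine.generate([
  { id: '1', role: 'user', content: 'Summarize this repo in one line.', createdAt: Date.now() },
])
console.log(result.text)

// Token accounting for context-window management
const tokens = await engine.countTokens('Hello, world!', 'gpt-4o')
const windowSize = await engine.getContextWindowSize('gpt-4o')
console.log(`${tokens} tokens of ${windowSize}`)

// Check whether content fits before sending it
if (await engine.fitsInContext('Some long document...', 'gpt-4o')) {
  // safe to send without truncation
}
```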
### SessionInterface
| Method | Description |
|---------------------------------------|---------------------------------|
| getId() | Get session ID |
| getSystem() | Get system prompt |
| getHistory() | Get message history |
| addMessage(role, content) | Add message to history |
| addToolResult(callId, name, result) | Add tool result message |
| removeMessage(id) | Remove message from history |
| clear() | Clear all messages |
| truncateHistory(count) | Keep last N messages |
| generate(options?) | Generate response with context |
| stream(options?) | Stream response with context |
| getTokenBudgetState() | Get current token budget state |
| fitsInBudget(content) | Check if content fits in budget |
| onMessageAdded(callback) | Subscribe to message additions |
| onMessageRemoved(callback) | Subscribe to message removals |
| onTokenBudgetChange(callback) | Subscribe to budget changes |
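
Complementing the conversation example below, a short sketch of the history and budget helpers, again reusing the `session` from the Quick Start. The shape of the budget-change payload and the unsubscribe return value are assumptions, not confirmed API.

```typescript
// Inspect and prune the conversation history
console.log(session.getId(), session.getHistory().length)
session.truncateHistory(10) // keep only the 10 most recent messages

// Check a candidate message against the remaining token budget
if (!session.fitsInBudget('A very long pasted document...')) {
  session.truncateHistory(4)
}

// React to budget pressure (payload shape is an assumption)
const unsubscribe = session.onTokenBudgetChange((state) => {
  console.log('budget state:', state)
})
unsubscribe() // assumed: subscriptions return an unsubscribe function
```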
## Examples
### Session-Based Conversation

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createOpenAIProviderAdapter } from '@mikesaintsg/adapters'

const engine = createEngine(
  createOpenAIProviderAdapter({ apiKey }),
)

const session = engine.createSession({
  system: 'You are a helpful coding assistant.',
  tokenBudget: {
    model: 'gpt-4o',
    warningThreshold: 0.7,
    criticalThreshold: 0.9,
  },
})

// Conversation history is maintained automatically
session.addMessage('user', 'What is TypeScript?')
const result1 = await session.generate()

session.addMessage('user', 'Show me an example')
const result2 = await session.generate() // Has full context
```

### Streaming with Abort

```typescript
const handle = session.stream()

// Async iteration
for await (const token of handle) {
  process.stdout.write(token)
}

// Or use event subscriptions
handle.onToken((token) => updateUI(token))
handle.onComplete((result) => finalizeUI(result))
handle.onError((error) => showError(error))

// Abort at any time
handle.abort()
```

### Error Handling

```typescript
import { createEngine, InferenceError, isRateLimitError } from '@mikesaintsg/inference'

// Small helper for the backoff delay below
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

try {
  const result = await session.generate()
} catch (error) {
  if (isRateLimitError(error)) {
    console.error('Rate limit exceeded, retrying...')
    await sleep(60_000)
  } else if (error instanceof InferenceError) {
    console.error(`[${error.code}]: ${error.message}`)
  }
}
```

### Circuit Breaker Integration

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createCircuitBreaker } from '@mikesaintsg/adapters'

const circuitBreaker = createCircuitBreaker({
  failureThreshold: 5,
  resetTimeoutMs: 30_000,
})

const engine = createEngine(provider, {
  circuitBreaker,
})

// Requests are blocked when the circuit is open
circuitBreaker.onStateChange((state) => {
  console.log('Circuit state:', state)
})
```

### Retry Logic

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createRetryAdapter } from '@mikesaintsg/adapters'

const retry = createRetryAdapter({
  maxAttempts: 3,
  initialDelayMs: 1000,
  maxDelayMs: 30_000,
  backoffMultiplier: 2,
})

const engine = createEngine(provider, {
  retry,
})

// Failed requests are automatically retried with exponential backoff
const result = await engine.generate([
  { id: '1', role: 'user', content: 'Hello!', createdAt: Date.now() },
])
```

### Rate Limiting

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createRateLimitAdapter } from '@mikesaintsg/adapters'

const rateLimit = createRateLimitAdapter({
  requestsPerMinute: 60,
  maxConcurrent: 10,
})

const engine = createEngine(provider, {
  rateLimit,
})

// Requests automatically wait for a slot before starting
// Slots are released after the request completes (success or error)
const handle = engine.stream([
  { id: '1', role: 'user', content: 'Hello!', createdAt: Date.now() },
])
await handle.result()
```

### Telemetry Integration

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createTelemetryAdapter } from '@mikesaintsg/adapters'

const telemetry = createTelemetryAdapter({
  serviceName: 'my-app',
})

const engine = createEngine(provider, {
  telemetry,
})

// Spans are created for each generation request
// Metrics are recorded for latency
```

### Token Batching for UI

```typescript
import { createTokenBatcher } from '@mikesaintsg/inference'

const batcher = createTokenBatcher({
  batchSize: 5,
  flushIntervalMs: 50,
  flushOnBoundary: 'sentence',
})

batcher.onBatch((batch) => {
  appendToUI(batch.text)
})

for await (const token of handle) {
  batcher.push(token)
}
batcher.end()
```

### Generate from BuiltContext

```typescript
import { createEngine } from '@mikesaintsg/inference'
import { createContextBuilder } from '@mikesaintsg/contextbuilder'

// Build context using contextbuilder
const builder = createContextBuilder(tokenAdapter, { budget })
builder.addSection('system', 'You are helpful.')
builder.addRetrieval(searchResults)

const context = builder.build()

// Generate from built context
const result = await engine.generateFromContext(context)
```

## Ecosystem Integration
| Package | Integration |
|--------------------------------|--------------------------------------------------------|
| @mikesaintsg/core | Shared types (Message, GenerationResult, BuiltContext) |
| @mikesaintsg/adapters | Provider adapters (OpenAI, Anthropic, etc.) |
| @mikesaintsg/vectorstore | Vector storage with embedding adapters |
| @mikesaintsg/contextbuilder | Advanced context assembly (BuiltContext) |
| @mikesaintsg/contextprotocol | Tool registry and routing |
See Integration with Ecosystem for details.
## Browser Support
| Browser | Minimum Version |
|---------|-----------------|
| Chrome  | 90+             |
| Firefox | 90+             |
| Safari  | 15+             |
| Edge    | 90+             |
## License
MIT © mikesaintsg
