@pico-brief/speech-services
v1.1.6
Unified speech-to-text and text-to-speech library wrapping AssemblyAI, Azure, Cartesia, Deepgram, ElevenLabs, Google, Inworld, OpenAI, PlayHT, Rev.ai, and Speechmatics
Unified speech-to-text and text-to-speech library wrapping 11 provider APIs behind consistent interfaces. Zero external dependencies — Node.js 18+ built-ins only.
Providers
| Provider | STT | TTS | Language Detection |
|----------|-----|-----|--------------------|
| AssemblyAI | Yes | - | Yes |
| Azure | Yes (fast + batch) | Yes | Yes |
| Cartesia | - | Yes | - |
| Deepgram | Yes | Yes | Yes |
| ElevenLabs | Yes | Yes | Yes |
| Google | Yes | Yes | - |
| Inworld | Yes | Yes | - |
| OpenAI | Yes | Yes | Yes |
| PlayHT | - | Yes | - |
| Rev.ai | Yes | - | - |
| Speechmatics | Yes | - | Yes |
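If you dispatch to providers dynamically, it can help to mirror the support matrix above in application code and guard calls before making them. The `CAPABILITIES` map and `supports` helper below are hypothetical application-side code, not exports of the library:

```ts
// Hypothetical capability map mirroring the provider table above.
type Capability = "stt" | "tts" | "languageDetection";

const CAPABILITIES: Record<string, Capability[]> = {
  assemblyai: ["stt", "languageDetection"],
  azure: ["stt", "tts", "languageDetection"],
  cartesia: ["tts"],
  deepgram: ["stt", "tts", "languageDetection"],
  elevenlabs: ["stt", "tts", "languageDetection"],
  google: ["stt", "tts"],
  inworld: ["stt", "tts"],
  openai: ["stt", "tts", "languageDetection"],
  playht: ["tts"],
  revai: ["stt"],
  speechmatics: ["stt", "languageDetection"],
};

// Check support before dispatching a request to a provider.
function supports(provider: string, capability: Capability): boolean {
  return CAPABILITIES[provider]?.includes(capability) ?? false;
}
```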
Installation
```shell
npm install @pico-brief/speech-services
```

Quick Start
Client API (recommended)
```ts
import { createSpeechClient } from "@pico-brief/speech-services";

const client = createSpeechClient({
  openai: { apiKey: "sk-..." },
  azure: { subscriptionKey: "...", region: "eastus" },
});

// Transcribe
const transcript = await client.transcribe({
  provider: "openai",
  audio: audioBuffer,
  languages: ["en"],
});
console.log(transcript.text);
console.log(transcript.words); // word-level timestamps

// Synthesize (voice auto-selected by language + gender)
const speech = await client.synthesize({
  provider: "azure",
  text: "Hello, world!",
  languages: ["en-US"],
  gender: "female",
});
// speech.audio is a Buffer, speech.voice is the resolved voice ID

// Detect languages
const languages = await client.detectLocales({
  provider: "openai",
  audio: audioBuffer,
});
// Map { "en" => 1 }
```

Standalone Functions (tree-shakeable)
```ts
import { transcribe, synthesize } from "@pico-brief/speech-services";

const result = await transcribe(
  { openai: { apiKey: "sk-..." } },
  { provider: "openai", audio: audioBuffer },
);
```

Direct Provider Import
```ts
import { transcribe } from "@pico-brief/speech-services/providers/deepgram";

const result = await transcribe(
  { apiKey: "your-deepgram-key" },
  audioBuffer,
  ["en"],
  { model: "nova-2", smartFormat: true },
);
```

Audio Input
All transcription functions accept `Buffer | string` for audio:
- Buffer: raw audio bytes (MP3, WAV, etc.)
- String (URL): `https://...` or `gs://...`; each provider handles URLs natively where possible
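A minimal sketch of that branching, assuming only the two accepted shapes described above; `describeAudioInput` is a hypothetical illustration, not part of the library:

```ts
// Hypothetical sketch of the Buffer-vs-URL handling described above.
function describeAudioInput(audio: Buffer | string): "url" | "bytes" {
  if (typeof audio === "string") {
    // URL forms like https://... or gs://... can be forwarded to
    // providers that accept remote audio natively.
    if (/^(https?|gs):\/\//.test(audio)) return "url";
    throw new Error(`Unsupported audio string: ${audio}`);
  }
  // Raw bytes (MP3, WAV, etc.) are sent in the request body.
  return "bytes";
}
```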
Voice Resolution
When synthesizing, the voice parameter is optional. If omitted, a voice is auto-selected based on languages and gender:
```ts
// Explicit voice
await client.synthesize({ provider: "azure", text: "Hi", voice: "en-US-JennyNeural" });

// Auto-select: female English voice
await client.synthesize({ provider: "azure", text: "Hi", languages: ["en-US"], gender: "female" });

// Voice by name (fuzzy matched)
await client.synthesize({ provider: "azure", text: "Hi", voice: "Jenny" });
```

Resolution tiers: exact ID → exact name → locale extraction → base language fallback. Gender is a preference, not a hard filter.
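The cascade can be sketched roughly as follows. This is an illustrative reconstruction of the behavior described above, not the library's internal code, and it simplifies fuzzy name matching to a case-insensitive comparison:

```ts
interface Voice {
  id: string;       // e.g. "en-US-JennyNeural"
  name: string;     // e.g. "Jenny"
  locale: string;   // e.g. "en-US"
  gender?: "male" | "female";
}

// Illustrative cascade: exact ID → exact name → locale → base language.
// Gender, when given, ranks candidates but never excludes them.
function resolveVoice(
  voices: Voice[],
  query?: string,
  languages: string[] = [],
  gender?: "male" | "female",
): Voice | undefined {
  if (query) {
    const byId = voices.find((v) => v.id === query);
    if (byId) return byId;
    const byName = voices.find(
      (v) => v.name.toLowerCase() === query.toLowerCase(),
    );
    if (byName) return byName;
  }
  const prefer = (pool: Voice[]) =>
    pool.find((v) => !gender || v.gender === gender) ?? pool[0];
  for (const lang of languages) {
    const exact = voices.filter((v) => v.locale === lang);
    if (exact.length) return prefer(exact);
    const base = lang.split("-")[0];
    const byBase = voices.filter((v) => v.locale.split("-")[0] === base);
    if (byBase.length) return prefer(byBase);
  }
  return voices[0];
}
```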
Provider Options
Each provider has specific options accessible via providerOptions:
```ts
await client.transcribe({
  provider: "assemblyai",
  audio: audioBuffer,
  providerOptions: {
    speechModel: "universal",
    pollInterval: 3000,
    timeout: 300000,
  },
});

await client.synthesize({
  provider: "elevenlabs",
  text: "Hello",
  voice: "Rachel",
  providerOptions: {
    modelId: "eleven_multilingual_v2",
    stability: 0.5,
    similarityBoost: 0.75,
  },
});
```

Language Detection
```ts
// With ffmpeg (recommended for long audio — samples clips from different positions)
const languages = await client.detectLocales({
  provider: "azure",
  audio: audioBuffer,
  ffmpegPath: "/usr/local/bin/ffmpeg",
});

// Without ffmpeg (truncates to first ~30s)
const languages = await client.detectLocales({
  provider: "azure",
  audio: audioBuffer,
  maxBytes: 500_000,
});
```

Transcript Snippets
`groupWordsToSnippets` groups word-level timestamps from a transcription result into readable, time-bounded snippets — useful for subtitles, chunked display, or downstream processing.
```ts
import { groupWordsToSnippets } from "@pico-brief/speech-services";

const result = await client.transcribe({
  provider: "openai",
  audio: audioBuffer,
});
const snippets = groupWordsToSnippets(result.words);
// [
//   { text: "Hello how are you", time: 0.0, duration: 1.2 },
//   { text: "I'm doing well thanks", time: 2.1, duration: 0.9 },
//   ...
// ]
```

A new snippet boundary is created when:
- The gap between consecutive words exceeds `gap` seconds (default: 0.4s, a natural pause boundary)
- The current snippet already spans more than `existingDuration` seconds (default: 10s, preventing excessively long snippets)
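The rules above can be sketched as a simple accumulator. This is a simplified illustration of the grouping logic, assuming words shaped like `{ text, time, duration }`; the library's actual word shape and edge-case handling may differ:

```ts
interface Word {
  text: string;
  time: number;     // start time in seconds
  duration: number; // length in seconds
}
type Snippet = Word;

// Start a new snippet when the inter-word pause exceeds `gap`, or when the
// snippet under construction already spans `existingDuration` seconds.
function groupWords(
  words: Word[],
  { gap = 0.4, existingDuration = 10 }: { gap?: number; existingDuration?: number } = {},
): Snippet[] {
  const snippets: Snippet[] = [];
  let current: Word[] = [];

  const flush = () => {
    if (current.length === 0) return;
    const first = current[0];
    const last = current[current.length - 1];
    snippets.push({
      text: current.map((w) => w.text).join(" "),
      time: first.time,
      duration: last.time + last.duration - first.time,
    });
    current = [];
  };

  for (const word of words) {
    if (current.length > 0) {
      const prev = current[current.length - 1];
      const pause = word.time - (prev.time + prev.duration);
      const span = prev.time + prev.duration - current[0].time;
      if (pause > gap || span > existingDuration) flush();
    }
    current.push(word);
  }
  flush();
  return snippets;
}
```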
Both thresholds are configurable:
```ts
const snippets = groupWordsToSnippets(result.words, {
  gap: 0.6,             // more tolerant of pauses
  existingDuration: 5,  // shorter snippets
});
```

Error Handling
All errors are thrown as `SpeechServiceError` with structured fields:
```ts
import { SpeechServiceError } from "@pico-brief/speech-services";

try {
  await client.transcribe({ provider: "openai", audio: buffer });
} catch (err) {
  if (err instanceof SpeechServiceError) {
    console.log(err.code);       // "API_ERROR", "TIMEOUT", "NOT_CONFIGURED", etc.
    console.log(err.provider);   // "openai"
    console.log(err.statusCode); // 401
  }
}
```

License
MIT
