@pico-brief/speech-services
v1.1.6
Unified speech-to-text and text-to-speech library wrapping AssemblyAI, Azure, Cartesia, Deepgram, ElevenLabs, Google, Inworld, OpenAI, PlayHT, Rev.ai, and Speechmatics
Unified speech-to-text and text-to-speech library wrapping 11 provider APIs behind consistent interfaces. Zero external dependencies — Node.js 18+ built-ins only.
Providers
| Provider | STT | TTS | Language Detection |
|----------|-----|-----|--------------------|
| AssemblyAI | Yes | - | Yes |
| Azure | Yes (fast + batch) | Yes | Yes |
| Cartesia | - | Yes | - |
| Deepgram | Yes | Yes | Yes |
| ElevenLabs | Yes | Yes | Yes |
| Google | Yes | Yes | - |
| Inworld | Yes | Yes | - |
| OpenAI | Yes | Yes | Yes |
| PlayHT | - | Yes | - |
| Rev.ai | Yes | - | - |
| Speechmatics | Yes | - | Yes |
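If you dispatch to providers dynamically, it can help to mirror the support matrix above in application code and guard calls before making them. The `CAPABILITIES` map and `supports` helper below are hypothetical application-side code, not exports of the library:

```ts
// Hypothetical capability map mirroring the provider table above.
type Capability = "stt" | "tts" | "languageDetection";

const CAPABILITIES: Record<string, Capability[]> = {
  assemblyai: ["stt", "languageDetection"],
  azure: ["stt", "tts", "languageDetection"],
  cartesia: ["tts"],
  deepgram: ["stt", "tts", "languageDetection"],
  elevenlabs: ["stt", "tts", "languageDetection"],
  google: ["stt", "tts"],
  inworld: ["stt", "tts"],
  openai: ["stt", "tts", "languageDetection"],
  playht: ["tts"],
  revai: ["stt"],
  speechmatics: ["stt", "languageDetection"],
};

// Check support before dispatching a request to a provider.
function supports(provider: string, capability: Capability): boolean {
  return CAPABILITIES[provider]?.includes(capability) ?? false;
}
```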
Installation
```shell
npm install @pico-brief/speech-services
```

Quick Start
Client API (recommended)
```ts
import { createSpeechClient } from "@pico-brief/speech-services";

const client = createSpeechClient({
  openai: { apiKey: "sk-..." },
  azure: { subscriptionKey: "...", region: "eastus" },
});

// Transcribe
const transcript = await client.transcribe({
  provider: "openai",
  audio: audioBuffer,
  languages: ["en"],
});
console.log(transcript.text);
console.log(transcript.words); // word-level timestamps

// Synthesize (voice auto-selected by language + gender)
const speech = await client.synthesize({
  provider: "azure",
  text: "Hello, world!",
  languages: ["en-US"],
  gender: "female",
});
// speech.audio is a Buffer, speech.voice is the resolved voice ID

// Detect languages
const languages = await client.detectLocales({
  provider: "openai",
  audio: audioBuffer,
});
// Map { "en" => 1 }
```

Standalone Functions (tree-shakeable)
```ts
import { transcribe, synthesize } from "@pico-brief/speech-services";

const result = await transcribe(
  { openai: { apiKey: "sk-..." } },
  { provider: "openai", audio: audioBuffer },
);
```

Direct Provider Import
```ts
import { transcribe } from "@pico-brief/speech-services/providers/deepgram";

const result = await transcribe(
  { apiKey: "your-deepgram-key" },
  audioBuffer,
  ["en"],
  { model: "nova-2", smartFormat: true },
);
```

Audio Input
All transcription functions accept `Buffer | string` for audio:
- Buffer: raw audio bytes (MP3, WAV, etc.)
- String (URL): `https://...` or `gs://...`; each provider handles URLs natively where possible
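A minimal sketch of that branching, assuming only the two accepted shapes described above; `describeAudioInput` is a hypothetical illustration, not part of the library:

```ts
// Hypothetical sketch of the Buffer-vs-URL handling described above.
function describeAudioInput(audio: Buffer | string): "url" | "bytes" {
  if (typeof audio === "string") {
    // URL forms like https://... or gs://... can be forwarded to
    // providers that accept remote audio natively.
    if (/^(https?|gs):\/\//.test(audio)) return "url";
    throw new Error(`Unsupported audio string: ${audio}`);
  }
  // Raw bytes (MP3, WAV, etc.) are sent in the request body.
  return "bytes";
}
```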
Voice Resolution
When synthesizing, the voice parameter is optional. If omitted, a voice is auto-selected based on languages and gender:
```ts
// Explicit voice
await client.synthesize({ provider: "azure", text: "Hi", voice: "en-US-JennyNeural" });

// Auto-select: female English voice
await client.synthesize({ provider: "azure", text: "Hi", languages: ["en-US"], gender: "female" });

// Voice by name (fuzzy matched)
await client.synthesize({ provider: "azure", text: "Hi", voice: "Jenny" });
```

Resolution tiers: exact ID → exact name → locale extraction → base language fallback. Gender is a preference, not a hard filter.
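The cascade can be sketched roughly as follows. This is an illustrative reconstruction of the behavior described above, not the library's internal code, and it simplifies fuzzy name matching to a case-insensitive comparison:

```ts
interface Voice {
  id: string;       // e.g. "en-US-JennyNeural"
  name: string;     // e.g. "Jenny"
  locale: string;   // e.g. "en-US"
  gender?: "male" | "female";
}

// Illustrative cascade: exact ID → exact name → locale → base language.
// Gender, when given, ranks candidates but never excludes them.
function resolveVoice(
  voices: Voice[],
  query?: string,
  languages: string[] = [],
  gender?: "male" | "female",
): Voice | undefined {
  if (query) {
    const byId = voices.find((v) => v.id === query);
    if (byId) return byId;
    const byName = voices.find(
      (v) => v.name.toLowerCase() === query.toLowerCase(),
    );
    if (byName) return byName;
  }
  const prefer = (pool: Voice[]) =>
    pool.find((v) => !gender || v.gender === gender) ?? pool[0];
  for (const lang of languages) {
    const exact = voices.filter((v) => v.locale === lang);
    if (exact.length) return prefer(exact);
    const base = lang.split("-")[0];
    const byBase = voices.filter((v) => v.locale.split("-")[0] === base);
    if (byBase.length) return prefer(byBase);
  }
  return voices[0];
}
```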
Provider Options
Each provider has specific options accessible via providerOptions:
```ts
await client.transcribe({
  provider: "assemblyai",
  audio: audioBuffer,
  providerOptions: {
    speechModel: "universal",
    pollInterval: 3000,
    timeout: 300000,
  },
});

await client.synthesize({
  provider: "elevenlabs",
  text: "Hello",
  voice: "Rachel",
  providerOptions: {
    modelId: "eleven_multilingual_v2",
    stability: 0.5,
    similarityBoost: 0.75,
  },
});
```

Language Detection
```ts
// With ffmpeg (recommended for long audio — samples clips from different positions)
const languages = await client.detectLocales({
  provider: "azure",
  audio: audioBuffer,
  ffmpegPath: "/usr/local/bin/ffmpeg",
});

// Without ffmpeg (truncates to first ~30s)
const languages = await client.detectLocales({
  provider: "azure",
  audio: audioBuffer,
  maxBytes: 500_000,
});
```

Transcript Snippets
`groupWordsToSnippets` groups word-level timestamps from a transcription result into readable, time-bounded snippets — useful for subtitles, chunked display, or downstream processing.
```ts
import { groupWordsToSnippets } from "@pico-brief/speech-services";

const result = await client.transcribe({
  provider: "openai",
  audio: audioBuffer,
});
const snippets = groupWordsToSnippets(result.words);
// [
//   { text: "Hello how are you", time: 0.0, duration: 1.2 },
//   { text: "I'm doing well thanks", time: 2.1, duration: 0.9 },
//   ...
// ]
```

A new snippet boundary is created when:
- The gap between consecutive words exceeds `gap` seconds (default: 0.4s, a natural pause boundary)
- The current snippet already spans more than `existingDuration` seconds (default: 10s, preventing excessively long snippets)
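The rules above can be sketched as a simple accumulator. This is a simplified illustration of the grouping logic, assuming words shaped like `{ text, time, duration }`; the library's actual word shape and edge-case handling may differ:

```ts
interface Word {
  text: string;
  time: number;     // start time in seconds
  duration: number; // length in seconds
}
type Snippet = Word;

// Start a new snippet when the inter-word pause exceeds `gap`, or when the
// snippet under construction already spans `existingDuration` seconds.
function groupWords(
  words: Word[],
  { gap = 0.4, existingDuration = 10 }: { gap?: number; existingDuration?: number } = {},
): Snippet[] {
  const snippets: Snippet[] = [];
  let current: Word[] = [];

  const flush = () => {
    if (current.length === 0) return;
    const first = current[0];
    const last = current[current.length - 1];
    snippets.push({
      text: current.map((w) => w.text).join(" "),
      time: first.time,
      duration: last.time + last.duration - first.time,
    });
    current = [];
  };

  for (const word of words) {
    if (current.length > 0) {
      const prev = current[current.length - 1];
      const pause = word.time - (prev.time + prev.duration);
      const span = prev.time + prev.duration - current[0].time;
      if (pause > gap || span > existingDuration) flush();
    }
    current.push(word);
  }
  flush();
  return snippets;
}
```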
Both thresholds are configurable:
```ts
const snippets = groupWordsToSnippets(result.words, {
  gap: 0.6,             // more tolerant of pauses
  existingDuration: 5,  // shorter snippets
});
```

Error Handling
All errors are thrown as `SpeechServiceError` with structured fields:
```ts
import { SpeechServiceError } from "@pico-brief/speech-services";

try {
  await client.transcribe({ provider: "openai", audio: buffer });
} catch (err) {
  if (err instanceof SpeechServiceError) {
    console.log(err.code);       // "API_ERROR", "TIMEOUT", "NOT_CONFIGURED", etc.
    console.log(err.provider);   // "openai"
    console.log(err.statusCode); // 401
  }
}
```

License
MIT
