@pyai/sdk
v0.2.1
Official TypeScript/JavaScript SDK for PyAI — speech-to-text (Hear), text-to-speech (Speak), realtime voice agents (Omni), and call compliance (Trace).
Official TypeScript/JavaScript SDK for PyAI — the all-in-one voice AI platform: lightning-fast speech-to-text, ultra-realistic text-to-speech, end-to-end realtime voice agents, and automatic call compliance. Zero dependencies; runs in the browser and Node 18+.
PyAI products
- Hear: Lightning-fast, telephony-native speech-to-text. Whisper-compatible transcription tuned for real phone-call audio, with live streaming partials so your app reacts mid-sentence, plus async batch transcription for big archives.
  POST /v1/audio/transcriptions
- Speak: Ultra-realistic text-to-speech that starts speaking in tens of milliseconds. Stream lifelike, expressive voices, choose from 36 studio-quality presets, or clone any voice instantly, for free.
  POST /v1/audio/speech
- Omni (flagship): One API for a complete, end-to-end voice AI agent. A single WebSocket where your agent listens, thinks, and speaks, grounded in your knowledge bases and tools, with human-like turn-taking and instant barge-in. There is no STT, LLM, or TTS to stitch together yourself.
  wss://api.pyai.com/v1/omni
- Trace (flagship): The compliance API that keeps your AI agents safe. Trace automatically checks every call for HIPAA, TCPA, and PII risks (plus your own brand-voice rules), flags the exact rule broken, redacts sensitive data, and seals each call with a tamper-evident audit trail, so a risky conversation never slips through.
  GET /v1/trace/interactions
- Cue: Realtime turn detection + knowledge-grounded context for your own stack. Bring your own LLM and voice; Cue handles the hard part: knowing the instant a speaker finishes and surfacing the right context.
  wss://api.pyai.com/v1/audio/transcriptions/stream
- Telephony: Instant managed phone numbers for your voice agents. Provision a US number and route live calls straight into an Omni agent, with no carrier contracts and no telephony glue.
  POST /v1/telephony/numbers
The API contract is the OpenAPI document at https://api.pyai.com/openapi.json.
This SDK wraps it with typed errors, automatic retries, and a realtime helper.
Install
npm install @pyai/sdk
Quickstart
import PyAI from "@pyai/sdk";
const pyai = new PyAI({ apiKey: process.env.PYAI_API_KEY! });
// Text-to-speech
const audio = await pyai.audio.speech({ input: "Hello from PyAI.", voice: "stock_emma_en_gb" });
await Bun.write("hello.wav", audio); // Bun runtime only; in Node use fs.promises.writeFile
// Text-to-speech, streamed — start playing/forwarding at the first chunk
// (tens of ms) instead of waiting for the whole clip. Use mp3 for smooth
// progressive playback.
const stream = await pyai.audio.speechStream({ input: "Hello from PyAI.", voice: "stock_emma_en_gb", response_format: "mp3" });
for await (const chunk of stream) writeToSpeakerOrResponse(chunk);
// Voices
const { data: voices } = await pyai.voices.list({ gender: "female" });
// Async transcription (safe retry with an idempotency key)
const job = await pyai.transcriptionJobs.create(
  { audio_url: "https://example.com/call.wav", diarize: true },
  { idempotencyKey: crypto.randomUUID() },
);
const done = await pyai.transcriptionJobs.get(job.job_id);
Realtime (Omni)
omni.connect() opens an agentic-voice session and hides the wire protocol —
including its frame-key asymmetry (your control frames are keyed on type,
the server's frames are keyed on event). It sends a type-keyed configure
the instant the socket opens and routes server frames to typed callbacks, so you
can't trip the #1 Omni integration bug (a hand-rolled {"event":"configure"}
is acked but silently dropped, giving you a connected session with zero turns):
// Omni is zero-state: the key's org authorizes the session — nothing to create.
const omni = pyai.omni.connect({
  rate: 16000, // 24000 browser · 16000 wideband telephony · 8000 G.711/Twilio
  configure: { voice_id: "stock_emma_en_gb", persona: "You are a receptionist." },
  onAudio: (chunk) => speaker.write(chunk), // binary agent audio: play it out
  onTranscript: (f) => console.log(f.text),
  onError: (e) => console.error(e),
});
omni.sendAudio(pcm16Chunk); // stream caller audio continuously (server-side VAD)
omni.sendDtmf("5");
omni.close();
From the browser, mint an ephemeral token server-side with
pyai.omni.createSession({ allowedOrigins }) and pass it as token so the page
never holds a secret key:
const omni = pyai.omni.connect({ token: session.token, configure: { voice_id, persona } });
Omni uses the native wss://api.pyai.com/v1/omni surface and is zero-state: there is no agent to create, and sessionLabel is an optional opaque tag (never required). Need the raw socket? pyai.realtimeURL({ product: "omni" }) + pyai.realtimeSubprotocol() (or pyai.connectRealtime()) still work; product: "flow" uses /v1/realtime. The older /v2/omni/chat URL and the agentId option are deprecated but still work.
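For anyone working on the raw socket, the frame-key asymmetry described above can be illustrated without the SDK. This is a minimal sketch that assumes only what this section states: client control frames are keyed on type and server frames are keyed on event. The server event names ("transcript", "error"), the payload layout, and both function names are hypothetical placeholders, not the documented wire protocol.

```typescript
// Sketch of the Omni frame-key asymmetry. Outgoing control frames use `type`;
// incoming server frames use `event`. Event names and payload fields below
// are hypothetical and only illustrate the routing idea.

type OmniConfigure = { voice_id: string; persona: string };

// Outgoing: the discriminator key must be `type`. Per the README, a frame
// keyed on `event` is acked but silently dropped.
function buildConfigureFrame(configure: OmniConfigure): string {
  return JSON.stringify({ type: "configure", ...configure });
}

type ServerHandlers = {
  onTranscript?: (text: string) => void;
  onError?: (message: string) => void;
  onUnknown?: (event: string) => void;
};

// Incoming: dispatch on `event`, never on `type`.
function routeServerFrame(raw: string, handlers: ServerHandlers): void {
  const frame = JSON.parse(raw) as { event?: unknown; text?: unknown; message?: unknown };
  if (typeof frame.event !== "string") return; // not an event-keyed frame
  switch (frame.event) {
    case "transcript": // hypothetical event name
      handlers.onTranscript?.(String(frame.text ?? ""));
      break;
    case "error": // hypothetical event name
      handlers.onError?.(String(frame.message ?? ""));
      break;
    default:
      handlers.onUnknown?.(frame.event);
  }
}
```

Keeping the two directions in separate functions makes it harder to reuse the server's key when building a client frame, which is the bug the paragraph above warns about.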
Streaming speech-to-text (Hear / Cue)
transcriptions.stream() hides the WebSocket frame protocol behind callbacks.
It opens wss://api.pyai.com/v1/audio/transcriptions/stream (key carried as the
WS subprotocol, so it works in the browser), routes the wire frames to
onPartial/onFinal/onError, and gives you sendAudio, commit(), and
close():
const hear = pyai.audio.transcriptions.stream({
  sampleRate: 16000,
  onPartial: (f) => console.log("…", f.text),
  onFinal: (f) => console.log("✓", f.text, `(${f.audio_ms}ms)`),
  onError: (e) => console.error(e),
});
micChunks.on("data", (pcm16) => hear.sendAudio(pcm16)); // binary frames
vad.on("end", () => hear.commit()); // force-finalize an utterance
// hear.close() also flushes a final for any buffered audio
Frame types, WS close codes, and error codes are exported as named
constants so you never hardcode a magic string:
import { HearFrameType, WSCloseCode, ErrorCode } from "@pyai/sdk";
HearFrameType.SpeechFinal; // "speech_final"
WSCloseCode.OverCapacity; // 4429
ErrorCode.CreditExhausted; // "credit_exhausted"
Set grounding: true to turn the stream into Cue (turn detection + KB
context): the SDK sends the grounding config on open and final/speech_final
frames then carry a grounding array of top KB passages.
In Node, pass a WebSocket implementation if there's no global one:
transcriptions.stream({ webSocket: (await import("ws")).WebSocket }).
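The examples above pass PCM16 chunks to sendAudio, but browser microphone APIs typically deliver Float32 samples. The following is a rough sketch of the conversion, assuming 16-bit little-endian mono PCM. The 20 ms frame size and both helper names are illustrative assumptions; this README does not state a required chunk size.

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    const value = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(i * 2, value, true); // true = little-endian
  }
  return bytes;
}

// Split PCM16 bytes into fixed-duration frames (the last one may be shorter).
// frameMs = 20 is an assumed, conventional frame size.
function splitIntoFrames(pcm: Uint8Array, sampleRate: number, frameMs = 20): Uint8Array[] {
  const frameBytes = Math.floor((sampleRate * frameMs) / 1000) * 2; // 2 bytes per sample
  if (frameBytes <= 0) throw new Error("frame size must be at least one sample");
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += frameBytes) {
    frames.push(pcm.subarray(offset, Math.min(offset + frameBytes, pcm.length)));
  }
  return frames;
}
```

Each returned frame could then be handed to hear.sendAudio(frame), assuming sendAudio accepts a Uint8Array.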
Speak audio formats (incl. telephony G.711)
audio.speech encodes server-side into any of eight formats via response_format
— the audio comes back already in the shape you need, so telephony callers can
drop the hand-rolled resampler + μ-law encoder entirely:
// Twilio/SIP-ready in one param: raw 8 kHz mono μ-law, no client-side DSP.
const ulaw = await pyai.audio.speech({
  input: "Your appointment is confirmed.",
  voice: "stock_emma_en_gb",
  response_format: "g711_ulaw", // -> audio/basic, forced 8 kHz
});
// base64-encode `ulaw` straight into a Twilio media frame.

| response_format | sample rates (Hz) | Content-Type |
|---|---|---|
| mp3 (default) | 8000 / 16000 / 24000 / 48000 | audio/mpeg |
| wav | 8000 / 16000 / 24000 / 48000 | audio/wav |
| opus | 8000 / 16000 / 24000 / 48000 | audio/ogg |
| aac | 8000 / 16000 / 24000 / 48000 | audio/aac |
| flac | 8000 / 16000 / 24000 / 48000 | audio/flac |
| pcm (raw int16 LE, no header) | 8000 / 16000 / 24000 / 48000 | audio/pcm |
| g711_ulaw | 8000 (forced) | audio/basic |
| g711_alaw | 8000 (forced) | audio/basic |
sample_rate is optional — omit it for the engine's native 24 kHz (g711_* is
always 8 kHz). The set is typed (SpeechFormat) and exported as SPEECH_FORMATS
/ SPEECH_SAMPLE_RATES for dropdowns and validation. Any other value is a
400 unsupported_format; omit response_format for the default mp3.
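To catch a bad format before the request goes out, the rules in the table can be mirrored in a small client-side check. This sketch defines its own local constants so it runs standalone; in real code the exported SPEECH_FORMATS and SPEECH_SAMPLE_RATES would presumably take their place. The function name and the thrown error messages are hypothetical, and the sample-rate check is a local guess at what the server would reject.

```typescript
// Local copies of the values in the table above (the SDK exports equivalents).
const FORMATS = ["mp3", "wav", "opus", "aac", "flac", "pcm", "g711_ulaw", "g711_alaw"] as const;
type Format = (typeof FORMATS)[number];
const SAMPLE_RATES = [8000, 16000, 24000, 48000] as const;

// Resolve the effective format and sample rate for a Speak request.
function resolveSpeechAudio(format: string = "mp3", sampleRate?: number): { format: Format; sampleRate: number } {
  if ((FORMATS as readonly string[]).indexOf(format) === -1) {
    throw new Error(`unsupported_format: ${format}`);
  }
  const f = format as Format;
  // G.711 is always 8 kHz, whatever the caller asked for.
  if (f === "g711_ulaw" || f === "g711_alaw") return { format: f, sampleRate: 8000 };
  const rate = sampleRate ?? 24000; // engine-native rate when omitted
  if ((SAMPLE_RATES as readonly number[]).indexOf(rate) === -1) {
    throw new Error(`unsupported sample rate: ${rate}`);
  }
  return { format: f, sampleRate: rate };
}
```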
See examples/speak-telephony-formats for the full before/after: ~120 lines of resampler + μ-law replaced by one param, with Node (@pyai/twilio), Python, and raw-curl snippets.
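As a rough illustration of the "base64-encode into a Twilio media frame" step, the sketch below wraps μ-law bytes in the media message shape used by Twilio Media Streams. It assumes the audio is available as a Uint8Array (this README does not state the return type of audio.speech) and that streamSid comes from the stream's start message. The helper names and the 160-byte (20 ms at 8 kHz) chunk size are illustrative choices.

```typescript
// Base64-encode raw bytes using the global btoa (browsers and Node 16+).
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i] ?? 0);
  return btoa(binary);
}

// Wrap 8 kHz mono mu-law audio in Twilio Media Streams "media" messages.
// At 8 kHz mu-law is 1 byte per sample, so 160 bytes is 20 ms of audio.
function toTwilioMediaMessages(ulaw: Uint8Array, streamSid: string, frameBytes = 160): string[] {
  if (frameBytes <= 0) throw new Error("frameBytes must be positive");
  const messages: string[] = [];
  for (let offset = 0; offset < ulaw.length; offset += frameBytes) {
    const payload = toBase64(ulaw.subarray(offset, offset + frameBytes));
    messages.push(JSON.stringify({ event: "media", streamSid, media: { payload } }));
  }
  return messages;
}
```

Each string would be sent as one text message on the Twilio WebSocket, for example with a loop over ws.send(message).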
More APIs: clones, telephony, trace
// Voice clones (Speak)
const { data: clones } = await pyai.clones.list();
const clone = await pyai.clones.create({ name: "Brand VO", file: refAudioBlob });
await pyai.clones.delete(clone.id);
// Managed phone numbers (Telephony)
const { data: avail } = await pyai.telephony.numbers.available({ areaCode: "415" });
const num = await pyai.telephony.numbers.buy({ phone_number: avail[0]!.phone_number, agent_id: "agent_123" });
await pyai.telephony.numbers.assign(num.id, "agent_123");
await pyai.telephony.numbers.release(num.id);
// Compliance (Trace)
const { data: calls } = await pyai.trace.interactions.list({ verdict: "FAIL" });
const detail = await pyai.trace.interactions.get(calls[0]!.id);
await pyai.trace.config.set({ agent_id: "agent_123", enabled: true });
const exposure = await pyai.trace.exposure(30);
// Per-call eval scorecard (timeline + quality metrics). These are additive and
// forward-compatible — present once the engine emits them, so reading them is
// always safe (the timeline reader returns [] until then).
const timeline = await pyai.trace.callTimeline(detail.id); // TraceTimelineTurn[]
const quality = detail.quality_metrics; // { wer?, ttfb_ms?, turn_p95_ms?, vaqi?, … }
Reproducible runs (evals)
audio.speech and audio.transcriptions.create take optional seed and
temperature for deterministic eval runs. They're forward-compatible — honored
once the engine supports them and otherwise ignored — so it's always safe to send:
await pyai.audio.speech({ input: "Hello", voice: "stock_emma_en_gb", seed: 42, temperature: 0 });
await pyai.audio.transcriptions.create({ file: wavBlob, seed: 42 });
Errors
Failures throw PyAIError with a stable code (branch on it, not the message):
import { PyAIError } from "@pyai/sdk";
try {
  await pyai.audio.speech({ input: "hi" });
} catch (err) {
  if (err instanceof PyAIError && err.code === "credit_exhausted") {
    // out of prepaid credit: add credit or use a sandbox key
  }
}
Common codes: unauthorized, forbidden, credit_exhausted,
rate_limit_exceeded, concurrency_limit_exceeded, idempotency_conflict.
429/5xx are retried automatically (honoring Retry-After); tune with
new PyAI({ apiKey, maxRetries }).
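For a sense of what "honoring Retry-After" usually involves, here is a small sketch of a retry-delay calculation. It is not the SDK's actual implementation: the function names, the 500 ms base, the 30 s cap, and the exponential fallback are all assumptions. The header parsing follows the two standard Retry-After forms, delta-seconds and an HTTP-date.

```typescript
// Statuses the README says are retried automatically: 429 and 5xx.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before the next attempt. Prefers the server's Retry-After header and
// falls back to capped exponential backoff (base and cap are assumed values).
function retryDelayMs(
  attempt: number, // 0 for the first retry
  retryAfter: string | null,
  now: number = Date.now(),
  baseMs = 500,
  capMs = 30000,
): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    if (isFinite(seconds)) return Math.max(0, seconds * 1000); // delta-seconds form
    const date = Date.parse(retryAfter);
    if (!isNaN(date)) return Math.max(0, date - now); // HTTP-date form
  }
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

Production retry loops often add random jitter to the fallback delay so that many clients do not retry in lockstep; it is left out here to keep the function deterministic.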
CLI (pyai)
Installing the package also provides a pyai command. pyai doctor runs a
deeper diagnosis — it introspects your key/scopes via GET /v1/me (skipped
gracefully if the route isn't deployed yet), checks endpoint liveness, runs a
Speak→Hear round-trip, and prints remediation hints for any failure:
export PYAI_API_KEY=pyai_test_...
npx pyai doctor
# PASS key (/v1/me) — env=test; 3 scope(s): hear:transcribe, voice:synthesize, hear:stream
# PASS models.list — 12 models
# PASS voices.list — 38 voices
# PASS speak→hear round-trip — synth 45210 bytes → "the quick brown fox…"
# Diagnosis: healthy. Key, endpoint, and a Speak→Hear round-trip all work.
npx pyai smoke # lighter: models + voices + speak
Other commands:
pyai models
pyai voices --gender female --region en_us
pyai speak --text "Hello" --voice stock_emma_en_gb --out hello.wav
pyai transcribe --url https://example.com/call.wav --diarize --poll
Auth comes from PYAI_API_KEY / PYAI_BASE_URL (or --api-key / --base-url).
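The --poll flag above waits for an async transcription job to finish. In application code, a similar loop around transcriptionJobs.get might look like the sketch below. The status field and its "completed" / "failed" values are hypothetical, since this README shows only job_id on the job object, and the interval and attempt limits are arbitrary defaults.

```typescript
// Hypothetical job shape: only `job_id` appears in the README; `status` and
// its values are assumptions made for illustration.
type TranscriptionJob = { job_id: string; status: string };

type PollOptions = {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
};

// Poll a job until it reaches a terminal state or the attempt budget runs out.
async function pollJob(
  getJob: (jobId: string) => Promise<TranscriptionJob>,
  jobId: string,
  opts: PollOptions = {},
): Promise<TranscriptionJob> {
  const intervalMs = opts.intervalMs ?? 2000;
  const maxAttempts = opts.maxAttempts ?? 150;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob(jobId);
    if (job.status === "completed" || job.status === "failed") return job;
    await sleep(intervalMs);
  }
  throw new Error(`job ${jobId} still pending after ${maxAttempts} polls`);
}
```

With the SDK this could be called as pollJob((id) => pyai.transcriptionJobs.get(id), job.job_id), assuming the returned object carries a comparable status field.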
Develop
npm install
npm test # node --test, fetch injected (no network)
npm run build # emits dist/ (incl. the pyai CLI bin)