@pyai/sdk

v0.2.1

Published

6 days ago

Official TypeScript/JavaScript SDK for PyAI — speech-to-text (Hear), text-to-speech (Speak), realtime voice agents (Omni), and call compliance (Trace).

Maintainers: atomsai

Keywords: pyai, voice-ai, tts, stt, speech, voice-agents, realtime, transcription, compliance, openai-compatible

Official TypeScript/JavaScript SDK for PyAI — the all-in-one voice AI platform: lightning-fast speech-to-text, ultra-realistic text-to-speech, end-to-end realtime voice agents, and automatic call compliance. Zero dependencies; runs in the browser and Node 18+.

PyAI products

Hear — Lightning-fast, telephony-native speech-to-text. Whisper-compatible transcription tuned for real phone-call audio, with live streaming partials so your app reacts mid-sentence, plus async batch transcription for big archives. POST /v1/audio/transcriptions
Speak — Ultra-realistic text-to-speech that starts speaking in tens of milliseconds. Stream lifelike, expressive voices, choose from 36 studio-quality presets, or clone any voice instantly — for free. POST /v1/audio/speech
Omni (flagship) — One API for a complete, end-to-end voice AI agent. A single WebSocket where your agent listens, thinks, and speaks — grounded in your knowledge bases and tools, with human-like turn-taking and instant barge-in — no STT, LLM, or TTS to stitch together yourself. wss://api.pyai.com/v1/omni
Trace (flagship) — The compliance API that keeps your AI agents safe. Trace automatically checks every call for HIPAA, TCPA, and PII risks (plus your own brand-voice rules), flags the exact rule broken, redacts sensitive data, and seals each call with a tamper-evident audit trail — so a risky conversation never slips through. GET /v1/trace/interactions
Cue — Realtime turn detection + knowledge-grounded context for your own stack. Bring your own LLM and voice; Cue nails the hard part — knowing the instant a speaker finishes and surfacing the right context. wss://api.pyai.com/v1/audio/transcriptions/stream
Telephony — Instant managed phone numbers for your voice agents. Provision a US number and route live calls straight into an Omni agent — no carrier contracts, no telephony glue. POST /v1/telephony/numbers

The contract is https://api.pyai.com/openapi.json. This SDK wraps it ergonomically with typed errors, automatic retries, and a realtime helper.

Install

npm install @pyai/sdk

Quickstart

import PyAI from "@pyai/sdk";

const pyai = new PyAI({ apiKey: process.env.PYAI_API_KEY! });

// Text-to-speech
const audio = await pyai.audio.speech({ input: "Hello from PyAI.", voice: "stock_emma_en_gb", response_format: "wav" }); // default format is mp3
if (typeof Bun !== "undefined") await Bun.write("hello.wav", audio); // in Node, use fs.writeFile instead

// Text-to-speech, streamed — start playing/forwarding at the first chunk
// (tens of ms) instead of waiting for the whole clip. Use mp3 for smooth
// progressive playback.
const stream = await pyai.audio.speechStream({ input: "Hello from PyAI.", voice: "stock_emma_en_gb", response_format: "mp3" });
for await (const chunk of stream) writeToSpeakerOrResponse(chunk);

// Voices
const { data: voices } = await pyai.voices.list({ gender: "female" });

// Async transcription (safe retry with an idempotency key)
const job = await pyai.transcriptionJobs.create(
  { audio_url: "https://example.com/call.wav", diarize: true },
  { idempotencyKey: crypto.randomUUID() },
);
const done = await pyai.transcriptionJobs.get(job.job_id);
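
A single transcriptionJobs.get call returns the job as it stands at that moment, so a batch job usually needs a polling loop around it. The sketch below shows one way to write that loop against a generic fetch function instead of the SDK itself. The names pollUntil and PollOptions, the default timings, and the status values used in the usage note are assumptions for illustration, not part of @pyai/sdk.

```typescript
// Generic polling helper (illustrative, not an SDK export).
// `fetchOnce` would wrap a call such as () => pyai.transcriptionJobs.get(job.job_id).
// `isDone` decides when to stop; the job's real status field is not assumed here.
interface PollOptions {
  intervalMs?: number;    // first delay between polls
  maxIntervalMs?: number; // upper bound for the backoff
  timeoutMs?: number;     // overall deadline
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntil<T>(
  fetchOnce: () => Promise<T>,
  isDone: (value: T) => boolean,
  { intervalMs = 1000, maxIntervalMs = 10000, timeoutMs = 300000 }: PollOptions = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let delay = intervalMs;
  for (;;) {
    const value = await fetchOnce();
    if (isDone(value)) return value;
    // Give up before sleeping past the deadline.
    if (Date.now() + delay > deadline) {
      throw new Error(`polling timed out after ${timeoutMs} ms`);
    }
    await sleep(delay);
    delay = Math.min(delay * 2, maxIntervalMs); // exponential backoff, capped
  }
}
```

Usage would look something like pollUntil(() => pyai.transcriptionJobs.get(job.job_id), (j) => j.status === "completed"), where the status field and its "completed" value are hypothetical. Check the OpenAPI contract for the actual job schema.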

Realtime (Omni)

omni.connect() opens an agentic-voice session and hides the wire protocol — including its frame-key asymmetry (your control frames are keyed on type, the server's frames are keyed on event). It sends a type-keyed configure the instant the socket opens and routes server frames to typed callbacks, so you can't trip the #1 Omni integration bug (a hand-rolled {"event":"configure"} is acked but silently dropped, giving you a connected session with zero turns):

// Omni is zero-state: the key's org authorizes the session — nothing to create.
const omni = pyai.omni.connect({
  rate: 16000, // 24000 browser · 16000 wideband telephony · 8000 G.711/Twilio
  configure: { voice_id: "stock_emma_en_gb", persona: "You are a receptionist." },
  onAudio: (chunk) => speaker.write(chunk), // binary agent audio — play it out
  onTranscript: (f) => console.log(f.text),
  onError: (e) => console.error(e),
});

omni.sendAudio(pcm16Chunk); // stream caller audio continuously (server-side VAD)
omni.sendDtmf("5");
omni.close();

From the browser, mint an ephemeral token server-side with pyai.omni.createSession({ allowedOrigins }) and pass it as token so the page never holds a secret key:

const omni = pyai.omni.connect({ token: session.token, configure: { voice_id, persona } });

Omni uses the native wss://api.pyai.com/v1/omni surface and is zero-state: there is no agent to create, and sessionLabel is an optional opaque tag (never required). Need the raw socket? pyai.realtimeURL({ product: "omni" }) + pyai.realtimeSubprotocol() (or pyai.connectRealtime()) still work; product: "flow" uses /v1/realtime. The older /v2/omni/chat URL and the agentId option are deprecated but still work.
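
If you do work on the raw socket, the frame-key asymmetry described above is the part to get right: what you send is keyed on type, what you receive is keyed on event. The sketch below illustrates that split in isolation. The helper names, the flat payload layout, and the "transcript" event name are assumptions for illustration; only the type/event keying comes from this README.

```typescript
// Server -> client frames are keyed on `event`.
type ServerFrame = { event: string } & Record<string, unknown>;
type Handler = (frame: ServerFrame) => void;

// Client -> server control frames are keyed on `type`.
// Spreading the payload flat next to `type` is an assumption about the wire layout.
function controlFrame(type: string, payload: Record<string, unknown> = {}): string {
  return JSON.stringify({ ...payload, type }); // `type` is written last so it always wins
}

// Dispatch one raw text frame by its `event` key.
// Returns true when a handler ran, false when the frame was ignored.
function routeServerFrame(raw: string, handlers: Map<string, Handler>): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not JSON (binary audio would be handled elsewhere)
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const event = (parsed as { event?: unknown }).event;
  if (typeof event !== "string") return false;
  const handler = handlers.get(event);
  if (handler === undefined) return false;
  handler(parsed as ServerFrame);
  return true;
}
```

A Map is used for the handlers so that an event name such as "constructor" cannot accidentally resolve to an inherited object property.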

Streaming speech-to-text (Hear / Cue)

transcriptions.stream() hides the WebSocket frame protocol behind callbacks. It opens wss://api.pyai.com/v1/audio/transcriptions/stream (key carried as the WS subprotocol, so it works in the browser), routes the wire frames to onPartial/onFinal/onError, and gives you sendAudio(), commit(), and close():

const hear = pyai.audio.transcriptions.stream({
  sampleRate: 16000,
  onPartial: (f) => console.log("…", f.text),
  onFinal: (f) => console.log("✓", f.text, `(${f.audio_ms}ms)`),
  onError: (e) => console.error(e),
});

micChunks.on("data", (pcm16) => hear.sendAudio(pcm16)); // binary frames
vad.on("end", () => hear.commit());                     // force-finalize an utterance
// hear.close() also flushes a final for any buffered audio
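
Browser capture APIs typically hand you Float32 samples in the range [-1, 1], while sendAudio above is fed PCM16 chunks. The sketch below shows the usual conversion. It assumes the stream expects 16-bit little-endian samples (the byte order is not stated in this section), and floatToPcm16le is a hypothetical helper name, not an SDK export.

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input saturates instead of wrapping around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative and positive halves scale differently: int16 spans -32768..32767.
    const v = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
    view.setInt16(i * 2, v, true); // true = little-endian
  }
  return bytes;
}
```

Writing through a DataView keeps the byte order explicit instead of relying on the platform's native endianness, which is what a plain Int16Array would use.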

Frame types, WS close codes, and error codes are exported as named constants so you never hardcode a magic string:

import { HearFrameType, WSCloseCode, ErrorCode } from "@pyai/sdk";
HearFrameType.SpeechFinal; // "speech_final"
WSCloseCode.OverCapacity;  // 4429
ErrorCode.CreditExhausted; // "credit_exhausted"

Set grounding: true to turn the stream into Cue (turn detection + KB context): the SDK sends the grounding config on open, and final/speech_final frames then carry a grounding array of the top knowledge-base (KB) passages.

In Node, pass a WebSocket implementation if there's no global one: transcriptions.stream({ webSocket: (await import("ws")).WebSocket }).

Speak audio formats (incl. telephony G.711)

audio.speech encodes server-side into any of eight formats via response_format — the audio comes back already in the shape you need, so telephony callers can drop the hand-rolled resampler + μ-law encoder entirely:

// Twilio/SIP-ready in one param: raw 8 kHz mono μ-law, no client-side DSP.
const ulaw = await pyai.audio.speech({
  input: "Your appointment is confirmed.",
  voice: "stock_emma_en_gb",
  response_format: "g711_ulaw", // -> audio/basic, forced 8 kHz
});
// base64-encode `ulaw` straight into a Twilio media frame.
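
The last step in the comment above can be sketched as follows. It splits μ-law bytes into 20 ms frames and wraps each one in a Twilio Media Streams "media" message. It assumes the audio is already available as a Uint8Array (the SDK's return type is not stated here), and toBase64 and twilioMediaMessages are hypothetical helper names. The 20 ms frame size is a common telephony convention, not something this README prescribes.

```typescript
// 8 kHz mono mu-law is one byte per sample, so 20 ms of audio is 160 bytes.
const ULAW_BYTES_PER_20MS = 160;

// Base64-encode raw bytes using the global btoa (browsers and Node 18+).
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i] ?? 0);
  }
  return btoa(binary);
}

// Build the JSON text messages to send back over a Twilio Media Streams socket.
// `streamSid` is the identifier Twilio supplies when the stream starts.
function twilioMediaMessages(
  ulaw: Uint8Array,
  streamSid: string,
  frameBytes: number = ULAW_BYTES_PER_20MS,
): string[] {
  const messages: string[] = [];
  for (let offset = 0; offset < ulaw.length; offset += frameBytes) {
    const frame = ulaw.subarray(offset, offset + frameBytes); // last frame may be short
    messages.push(
      JSON.stringify({ event: "media", streamSid, media: { payload: toBase64(frame) } }),
    );
  }
  return messages;
}
```

Each returned string can be passed to the socket's send() as is.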

| response_format | sample rates (Hz) | Content-Type |
|---|---|---|
| mp3 (default) | 8000 / 16000 / 24000 / 48000 | audio/mpeg |
| wav | 8000 / 16000 / 24000 / 48000 | audio/wav |
| opus | 8000 / 16000 / 24000 / 48000 | audio/ogg |
| aac | 8000 / 16000 / 24000 / 48000 | audio/aac |
| flac | 8000 / 16000 / 24000 / 48000 | audio/flac |
| pcm (raw int16 LE, no header) | 8000 / 16000 / 24000 / 48000 | audio/pcm |
| g711_ulaw | 8000 (forced) | audio/basic |
| g711_alaw | 8000 (forced) | audio/basic |

sample_rate is optional — omit it for the engine's native 24 kHz (g711_* is always 8 kHz). The set is typed (SpeechFormat) and exported as SPEECH_FORMATS / SPEECH_SAMPLE_RATES for dropdowns and validation. Any other value is a 400 unsupported_format; omit response_format for the default mp3.
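
The rules above can be mirrored client-side to validate a request before it is sent. The sketch below uses local stand-in tables copied from this section, not the SDK's SPEECH_FORMATS / SPEECH_SAMPLE_RATES exports, whose exact shape is not shown here. resolveSpeechOutput is a hypothetical helper, and rejecting an unlisted sample rate locally is an assumption about how you might want to fail early.

```typescript
// Local stand-ins mirroring the documented formats and Content-Types.
const FORMAT_CONTENT_TYPES = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  opus: "audio/ogg",
  aac: "audio/aac",
  flac: "audio/flac",
  pcm: "audio/pcm",
  g711_ulaw: "audio/basic",
  g711_alaw: "audio/basic",
} as const;
type Format = keyof typeof FORMAT_CONTENT_TYPES;

const SAMPLE_RATES: readonly number[] = [8000, 16000, 24000, 48000];

interface SpeechOutput {
  format: Format;
  sampleRate: number;
  contentType: string;
}

// Work out what a request would produce: the default format is mp3, the
// default rate is 24 kHz, and G.711 output is always 8 kHz.
function resolveSpeechOutput(format: string = "mp3", sampleRate?: number): SpeechOutput {
  if (!Object.prototype.hasOwnProperty.call(FORMAT_CONTENT_TYPES, format)) {
    throw new Error(`unsupported_format: ${format}`);
  }
  const f = format as Format;
  const contentType = FORMAT_CONTENT_TYPES[f];
  if (f === "g711_ulaw" || f === "g711_alaw") {
    return { format: f, sampleRate: 8000, contentType }; // any requested rate is overridden
  }
  const rate = sampleRate ?? 24000;
  if (SAMPLE_RATES.indexOf(rate) === -1) {
    throw new Error(`unsupported sample_rate: ${rate}`); // local check, an assumption
  }
  return { format: f, sampleRate: rate, contentType };
}
```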

See examples/speak-telephony-formats for the full before/after: ~120 lines of resampler + μ-law replaced by one param, with Node (@pyai/twilio), Python, and raw-curl snippets.

More APIs: clones, telephony, trace

// Voice clones (Speak)
const { data: clones } = await pyai.clones.list();
const clone = await pyai.clones.create({ name: "Brand VO", file: refAudioBlob });
await pyai.clones.delete(clone.id);

// Managed phone numbers (Telephony)
const { data: avail } = await pyai.telephony.numbers.available({ areaCode: "415" });
const num = await pyai.telephony.numbers.buy({ phone_number: avail[0]!.phone_number, agent_id: "agent_123" });
await pyai.telephony.numbers.assign(num.id, "agent_123");
await pyai.telephony.numbers.release(num.id);

// Compliance (Trace)
const { data: calls } = await pyai.trace.interactions.list({ verdict: "FAIL" });
const detail = await pyai.trace.interactions.get(calls[0]!.id);
await pyai.trace.config.set({ agent_id: "agent_123", enabled: true });
const exposure = await pyai.trace.exposure(30);

// Per-call eval scorecard (timeline + quality metrics). These are additive and
// forward-compatible — present once the engine emits them, so reading them is
// always safe (the timeline reader returns [] until then).
const timeline = await pyai.trace.callTimeline(detail.id); // TraceTimelineTurn[]
const quality = detail.quality_metrics;                    // { wer?, ttfb_ms?, turn_p95_ms?, vaqi?, … }

Reproducible runs (evals)

audio.speech and audio.transcriptions.create take optional seed and temperature for deterministic eval runs. They're forward-compatible — honored once the engine supports them and otherwise ignored — so it's always safe to send:

await pyai.audio.speech({ input: "Hello", voice: "stock_emma_en_gb", seed: 42, temperature: 0 });
await pyai.audio.transcriptions.create({ file: wavBlob, seed: 42 });

Errors

Failures throw PyAIError with a stable code (branch on it, not the message):

import { PyAIError } from "@pyai/sdk";

try {
  await pyai.audio.speech({ input: "hi" });
} catch (err) {
  if (err instanceof PyAIError && err.code === "credit_exhausted") {
    // out of prepaid credit — add credit or use a sandbox key
  }
}

Common codes: unauthorized, forbidden, credit_exhausted, rate_limit_exceeded, concurrency_limit_exceeded, idempotency_conflict. 429/5xx are retried automatically (honoring Retry-After); tune with new PyAI({ apiKey, maxRetries }).
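
For readers who want to see what "retried automatically, honoring Retry-After" generally involves, here is a sketch of the delay calculation. It illustrates the common technique and is not the SDK's actual implementation. The function name, the 500 ms base delay, and the 8 s cap are assumptions, and jitter is left out to keep the example deterministic.

```typescript
// Decide how long to wait before retrying a failed HTTP response.
// Returns null when the status is not retryable (anything except 429 and 5xx).
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  attempt: number, // 0 for the first retry
  nowMs: number = Date.now(),
): number | null {
  const retryable = status === 429 || (status >= 500 && status <= 599);
  if (!retryable) return null;

  if (retryAfter !== null) {
    const value = retryAfter.trim();
    // Retry-After is either a whole number of seconds...
    if (/^\d+$/.test(value)) return Number(value) * 1000;
    // ...or an HTTP date.
    const at = Date.parse(value);
    if (!Number.isNaN(at)) return Math.max(0, at - nowMs);
  }

  // No usable header: fall back to capped exponential backoff.
  return Math.min(500 * Math.pow(2, attempt), 8000);
}
```

A caller would stop once attempt reaches its maxRetries setting and rethrow the last error.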

CLI (`pyai`)

Installing the package also provides a pyai command. pyai doctor runs the deeper of its two diagnostics (pyai smoke is the lighter one): it introspects your key/scopes via GET /v1/me (skipped gracefully if the route isn't deployed yet), checks endpoint liveness, runs a Speak→Hear round-trip, and prints remediation hints for any failure:

export PYAI_API_KEY=pyai_test_...
npx pyai doctor
# PASS  key (/v1/me)  — env=test; 3 scope(s): hear:transcribe, voice:synthesize, hear:stream
# PASS  models.list  — 12 models
# PASS  voices.list  — 38 voices
# PASS  speak→hear round-trip  — synth 45210 bytes → "the quick brown fox…"
# Diagnosis: healthy. Key, endpoint, and a Speak→Hear round-trip all work.

npx pyai smoke   # lighter: models + voices + speak

Other commands:

pyai models
pyai voices --gender female --region en_us
pyai speak --text "Hello" --voice stock_emma_en_gb --out hello.wav
pyai transcribe --url https://example.com/call.wav --diarize --poll

Auth comes from PYAI_API_KEY / PYAI_BASE_URL (or --api-key / --base-url).

Develop

npm install
npm test         # node --test, fetch injected (no network)
npm run build    # emits dist/ (incl. the pyai CLI bin)
