ordi-ai-agent-sdk v0.1.8
# 🤖 ORDI AI SDK
An SDK for building AI agents that can interact with users in real time, providing both text and audio responses. This SDK connects to the ORDI AI Gateway, which serves as a bridge to powerful language models and tools.
## Dependencies

| Dependency | Version |
| ---------- | ------- |
| ws         | ^8.0.0  |
## Install

npm

```bash
npm install ordi-ai-agent-sdk@latest
```

yarn

```bash
yarn add ordi-ai-agent-sdk@latest
```
## Example
```js
import { AgentClient } from "ordi-ai-agent-sdk";
import { GoogleGenAI } from "@google/genai";
import wav from "wav";

// Initialize Gemini TTS
const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
});

async function saveWaveFile(
  filename,
  pcmData,
  channels = 1,
  rate = 24000,
  sampleWidth = 2
) {
  return new Promise((resolve, reject) => {
    const writer = new wav.FileWriter(filename, {
      channels,
      sampleRate: rate,
      bitDepth: sampleWidth * 8,
    });
    writer.on("finish", resolve);
    writer.on("error", reject);
    writer.write(pcmData);
    writer.end();
  });
}

// Convert text to speech using Gemini
async function generateSpeech(text) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-preview-tts",
    contents: [{ parts: [{ text }] }],
    config: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: { voiceName: "Kore" },
        },
      },
    },
  });

  const data =
    response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
  if (!data) {
    throw new Error("No audio data returned from Gemini");
  }
  const audioBuffer = Buffer.from(data, "base64");
  const filename = `tts-${Date.now()}.wav`;
  await saveWaveFile(filename, audioBuffer);
  console.log("🔊 Audio saved:", filename);
}

// Initialize ORDI Agent
const agent = new AgentClient(
  {
    gatewayUrl: "ws://localhost:8000/ws/chat",
    sessionId: "session_tts_demo",
  },
  {
    name: "Jack Daniels",
    storeName: "ORDI Restaurant",
    tableNumber: "12",
  }
);

// Listen for spoken output from the agent
agent.onAudio = async (text) => {
  console.log("🔊 Agent audio text:", text);
  // Convert to speech
  await generateSpeech(text);
};

async function main() {
  await agent.connect();

  const stream = agent.streamMessage("Hello! Can you recommend a dish?");
  for await (const chunk of stream) {
    if (chunk?.delta) {
      process.stdout.write(chunk.delta);
    }
  }
  console.log("\n");
}

main();
```

## Configuration
| Field | Type | Description |
| ------------ | ------ | ----------------------------------- |
| gatewayUrl | string | WebSocket endpoint for the AI Agent |
| sessionId | string | Optional session identifier |
```js
{
  gatewayUrl: "ws://localhost:8000/ws/chat",
  sessionId: "session_123"
}
```

## User configuration
User metadata helps the AI personalize responses.
| Field | Type | Description |
| --------------- | ------ | ------------------------ |
| name | string | User's name |
| storeName | string | Restaurant or store name |
| tableNumber | string | Current table |
| email | string | User email |
| contactNumber | string | Phone number |
```js
{
  name: "Jack Daniels",
  storeName: "ORDI Restaurant",
  tableNumber: "12",
  email: "[email protected]",
  contactNumber: "0123456789"
}
```

## Sending a message
Messages are streamed using an async iterator.
```js
const stream = agent.streamMessage("Show me the menu");

for await (const chunk of stream) {
  if (chunk?.delta) {
    process.stdout.write(chunk.delta);
  }
}
```

Output (Markdown content is printed as it arrives, while audio responses are handled separately):
> *I have received your message and am checking on that for you now.*
>
> Hello, **Jack Daniels**! I'm sorry to hear you're hungry, but you've come to the right place.
>
> Welcome to **ORDI Restaurant**. Since you are at **Table 12**, I can help you find something delicious right away. Would you like to see our full menu, or are you in the mood for something specific like Vietnamese food or perhaps a pizza?

## Handling Audio Responses
Audio responses are delivered via the `onAudio` callback.
```js
agent.onAudio = (audioText) => {
  console.log("Audio:", audioText);
};
```

You can forward this text to any Text-To-Speech (TTS) engine.
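If the TTS engine is slower than the agent, `onAudio` callbacks can overlap and clips may play out of order. One way to handle this is to serialize the work through a promise chain. This is an illustrative sketch, not part of the SDK; `createAudioQueue` and `speak` are hypothetical names:

```js
// Illustrative helper (not part of the SDK): serialize TTS work so
// audio clips are generated in the order the agent produced them,
// even if onAudio fires again before the previous clip finished.
function createAudioQueue(speak) {
  let chain = Promise.resolve();
  return (text) => {
    // Append each clip to the chain so only one speak() runs at a time
    chain = chain.then(() => speak(text));
    return chain;
  };
}

// Usage: agent.onAudio = createAudioQueue(generateSpeech);
```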
## Implementing an Audio Model (Gemini)
npm

```bash
npm install @google/genai
```

or yarn

```bash
yarn add @google/genai
```

Sample:
```js
async function generateSpeech(text) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-preview-tts",
    contents: [{ parts: [{ text }] }],
    config: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: { voiceName: "Kore" },
        },
      },
    },
  });

  const data =
    response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
  if (!data) {
    throw new Error("No audio data returned from Gemini");
  }
  const audioBuffer = Buffer.from(data, "base64");
  const filename = `tts-${Date.now()}.wav`;
  await saveWaveFile(filename, audioBuffer);
  console.log("🔊 Audio saved:", filename);
}

// Listen for spoken output from the agent
agent.onAudio = async (text) => {
  // Convert to speech
  await generateSpeech(text);
};
```

## Streaming Architecture
The ORDI AI SDK streams events from the server in real time.
### Server events
| Status | Description |
| -------------- | ----------------------------- |
| text_output | Visual UI text stream |
| audio_output | Spoken response stream |
| tool_ack | Tool execution acknowledgment |
| done | End of response |
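As a sketch of how a client might consume these events, the table above maps onto a simple dispatch function. The event shape (`{ status, delta, text, message }`) is an assumption for illustration; the SDK performs this routing internally:

```js
// Hypothetical dispatcher mirroring the server-event table above.
// The event field names (delta, text, message) are illustrative assumptions.
function createDispatcher({ onText, onAudio, onToolAck, onDone }) {
  return (event) => {
    switch (event.status) {
      case "text_output": // visual UI text stream
        onText?.(event.delta);
        break;
      case "audio_output": // spoken response stream
        onAudio?.(event.text);
        break;
      case "tool_ack": // tool execution acknowledgment
        onToolAck?.(event.message);
        break;
      case "done": // end of response
        onDone?.();
        break;
    }
  };
}
```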
- `text_output` → UI stream
- `audio_output` → `onAudio` callback
- `tool_ack` → status message (a quick acknowledgment of user input)
- `done` → stream ends

## SDK Architecture
```text
Application
    │
    ▼
AgentClient (SDK)
    │
    ▼
WebSocketTransport
    │
    ▼
ORDI AI Gateway
    │
    ▼
LangChain AI Agent
```
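The WebSocketTransport layer in this diagram typically has to buffer messages sent before the socket finishes connecting. A minimal sketch of that idea, under the assumption that a queue-and-flush strategy is used (the class name and API here are hypothetical, not the SDK's internals):

```js
// Illustrative transport sketch: queue outgoing messages until the
// underlying socket is open, then flush them in order. The real
// WebSocketTransport inside the SDK may differ.
class QueuedTransport {
  constructor(socket) {
    this.socket = socket; // anything with a send(msg) method
    this.queue = [];
    this.open = false;
  }

  // Called once the WebSocket "open" event fires
  markOpen() {
    this.open = true;
    for (const msg of this.queue) this.socket.send(msg);
    this.queue = [];
  }

  send(msg) {
    if (this.open) this.socket.send(msg);
    else this.queue.push(msg); // buffer until the gateway connects
  }
}
```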