Odyssey Audio/Video SDK (MediaSoup + Web Audio)
This package exposes `OdysseySpatialComms`, a thin TypeScript client that glues together:
- MediaSoup SFU for ultra-low-latency audio/video routing
- Web Audio API for Apple-like spatial mixing via `SpatialAudioManager`
- Socket telemetry (position + direction) so every browser hears/sees everyone exactly where they are in the 3D world
It mirrors the production SDK used by Odyssey V2 and ships ready-to-drop into any Web UI (Vue, React, plain JS).
Feature Highlights
- 🔌 One class to rule it all – `OdysseySpatialComms` wires transports, producers, consumers, and room state.
- 🧭 Accurate pose propagation – `updatePosition()` streams listener pose to the SFU while `participant-position-updated` keeps the local store in sync.
- 🎧 Studio-grade spatial audio – each remote participant gets a dedicated Web Audio graph: denoiser → high-pass → low-pass → HRTF `PannerNode` → adaptive gain → master compressor. Uses the Web Audio API's HRTF panning model for accurate left/right/front/back positioning based on distance and direction, with custom AudioWorklet processors for noise cancellation and voice tuning.
- 🎥 Camera-ready streams – video tracks are exposed separately so UI layers can render muted `<video>` tags while audio stays inside Web Audio.
- 🔁 EventEmitter contract – subscribe to `room-joined`, `consumer-created`, `participant-position-updated`, etc., without touching Socket.IO directly.
Quick Start
```ts
import {
  OdysseySpatialComms,
  Direction,
  Position,
} from "@newgameplusinc/odyssey-official-audio-video-sdk";

const sdk = new OdysseySpatialComms("https://mediasoup-server.example.com");

// 1) Join a room
await sdk.joinRoom({
  roomId: "demo-room",
  userId: "user-123",
  deviceId: "device-123",
  position: { x: 0, y: 0, z: 0 },
  direction: { x: 0, y: 1, z: 0 },
});

// 2) Produce local media
const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
for (const track of stream.getTracks()) {
  await sdk.produceTrack(track);
}

// 3) Handle remote tracks
sdk.on("consumer-created", async ({ participant, track }) => {
  if (track.kind === "video") {
    attachVideo(track, participant.participantId);
  }
});

// 4) Keep spatial audio honest
sdk.updatePosition(currentPos, currentDir);
sdk.setListenerFromLSD(listenerPos, cameraPos, lookAtPos);
```

Audio Flow (Server ↔ Browser)
```
┌──────────────┐  update-position   ┌──────────────┐  pose + tracks  ┌──────────────────┐
│ Browser LSD  │ ─────────────────▶ │ MediaSoup SFU│ ──────────────▶ │  SDK Event Bus   │
│ (Unreal data)│                    │ + Socket.IO  │                 │  (EventManager)  │
└──────┬───────┘                    └──────┬───────┘                 └─────────┬────────┘
       │                                   │                          track + pose
       │                                   │                                  ▼
       │                          ┌────────▼────────┐             ┌──────────────────┐
       │ audio RTP                │ consumer-created│             │ SpatialAudioMgr  │
       └─────────────────────────▶│ setup per-user  │◀────────────│ (Web Audio API)  │
                                  └────────┬────────┘             │  - Denoiser      │
                                           │                      │  - HP / LP       │
                                           │                      │  - HRTF Panner   │
                                           ▼                      │  - Gain + Comp   │
                                   Web Audio Graph                └─────────┬────────┘
                                           │                                │
                                           ▼                                ▼
                              Listener ears (Left/Right)              System Output
```

Web Audio Algorithms
- Coordinate normalization – Unreal sends centimeters; `SpatialAudioManager` auto-detects large values and converts to meters once.
- Orientation math – `setListenerFromLSD()` builds forward/right/up vectors from camera/LookAt to keep the listener aligned with head movement.
- Dynamic distance gain – `updateSpatialAudio()` measures distance from listener → source and applies a smooth rolloff curve, so distant avatars fade to silence (see the sketch after this list).
- Noise handling – the AudioWorklet denoiser now runs an adaptive multi-band gate (per W3C AudioWorklet guidance) before the high/low-pass filters, stripping constant HVAC/fan noise even when the speaker is close. A newly added silence gate mutes tracks entirely after ~250 ms of sub-noise-floor energy, eliminating hiss during dead air without touching spatial cues.
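The exact curve lives in `SpatialAudioManager`; as a rough sketch of the normalization + rolloff idea (the threshold and radii below are illustrative constants, not the SDK's real values):

```ts
// Illustrative sketch only – CM_THRESHOLD, REF_DISTANCE, and MAX_DISTANCE
// are hypothetical names; the shipped rolloff lives in SpatialAudioManager.
const CM_THRESHOLD = 100; // coordinates larger than this are assumed to be centimeters
const REF_DISTANCE = 1;   // meters: full gain inside this radius
const MAX_DISTANCE = 30;  // meters: fades to silence beyond this radius

type Vec3 = { x: number; y: number; z: number };

function toMeters(p: Vec3): Vec3 {
  // Unreal sends centimeters; auto-detect large values and convert once.
  const scale =
    Math.max(Math.abs(p.x), Math.abs(p.y), Math.abs(p.z)) > CM_THRESHOLD ? 0.01 : 1;
  return { x: p.x * scale, y: p.y * scale, z: p.z * scale };
}

function rolloffGain(listener: Vec3, source: Vec3): number {
  const a = toMeters(listener);
  const b = toMeters(source);
  const d = Math.hypot(b.x - a.x, b.y - a.y, b.z - a.z);
  if (d <= REF_DISTANCE) return 1;
  if (d >= MAX_DISTANCE) return 0;
  // Smoothstep between the reference and max radii for a gentle fade.
  const t = (d - REF_DISTANCE) / (MAX_DISTANCE - REF_DISTANCE);
  return 1 - t * t * (3 - 2 * t);
}
```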
Noise-Cancellation Stack (What’s Included)
- Adaptive denoiser worklet – learns each participant's noise floor in real time, then applies a multi-band downward expander plus dynamic low/high-pass shaping. `speechBoost` lifts the low/mid band only when speech confidence is high, keeping consonants bright without reintroducing floor noise; `highBandGate` + `highBandAttack`/`highBandRelease` clamp constant fan hiss in the 4–12 kHz band whenever `speechPresence` is low, so background whoosh never leaks through live mics.
- Optional voice enhancement – autocorrelation-derived confidence (inspired by the tuner article) can raise the reduction floor when speech is present to keep vocals bright.
- Silence gate – if energy stays below `silenceFloor` for a configurable hold window, the track ramps to true silence, then wakes instantly once voice energy returns.
- Classic filters – fixed high-pass/low-pass filters shave off rumble and hiss before signals reach the HRTF panner.
These layers run entirely in Web Audio, so you can ship “AirPods-style” background rejection in any browser without native code.
```ts
const sdk = new OdysseySpatialComms(serverUrl, {
  denoiser: {
    threshold: 0.008,
    maxReduction: 0.88,
    hissCut: 0.52,
    holdMs: 260,
    voiceBoost: 0.65,
    voiceSensitivity: 0.33,
    voiceEnhancement: true,
    silenceFloor: 0.00075,
    silenceHoldMs: 520,
    silenceReleaseMs: 160,
    speechBoost: 0.35,
    highBandGate: 0.7,
    highBandAttack: 0.25,
    highBandRelease: 0.12,
  },
});
```
Voice enhancement (autocorrelation-based speech detection) is off by default to keep the gate extra quiet; enable it when you want brighter close-talk voicing. Tweak `silenceFloor` / `silenceHoldMs` if you need either more aggressive hiss removal or softer gating.
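For intuition, here is a minimal `AudioWorkletProcessor` implementing just the silence-gate layer; the shipped worklet also does the multi-band expansion and speech detection, and every constant below is illustrative:

```ts
// silence-gate-processor.ts – a minimal sketch of the silence-gate layer only.
// The shipped worklet also does multi-band expansion and speech detection;
// all constants here are illustrative, not the SDK's real defaults.
class SilenceGateProcessor extends AudioWorkletProcessor {
  private belowFloorSamples = 0;
  private gain = 1;

  process(inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const input = inputs[0]?.[0];
    const output = outputs[0]?.[0];
    if (!input || !output) return true;

    // RMS energy of this 128-sample render quantum.
    let sum = 0;
    for (let i = 0; i < input.length; i++) sum += input[i] * input[i];
    const rms = Math.sqrt(sum / input.length);

    const silenceFloor = 0.00075;                  // cf. silenceFloor above
    const holdSamples = (520 / 1000) * sampleRate; // cf. silenceHoldMs above
    this.belowFloorSamples =
      rms < silenceFloor ? this.belowFloorSamples + input.length : 0;

    if (this.belowFloorSamples < holdSamples) {
      this.gain = 1;        // instant wake: pass speech through untouched
      output.set(input);
    } else {
      for (let i = 0; i < input.length; i++) {
        this.gain *= 0.995; // smooth ramp toward true silence
        output[i] = input[i] * this.gain;
      }
    }
    return true;
  }
}

registerProcessor("silence-gate", SilenceGateProcessor);
```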
How Spatial Audio Is Built
- Telemetry ingestion – each LSD packet is passed through `setListenerFromLSD(listenerPos, cameraPos, lookAtPos)` so the Web Audio listener matches the player's real head/camera pose.
- Per-participant node graph – when `consumer-created` yields a remote audio track, `setupSpatialAudioForParticipant()` spins up an isolated graph: `MediaStreamSource → (optional) Denoiser Worklet → High-Pass → Low-Pass → Panner (HRTF) → Gain → Master Compressor` (see the sketch after this list).
- Position + direction updates – every `participant-position-updated` event calls `updateSpatialAudio(participantId, position, direction)`. The position feeds the panner's XYZ, while the direction vector sets the source orientation so voices project forward relative to avatar facing.
- Distance-aware gain – the manager stores the latest listener pose and computes the Euclidean distance to each remote participant on every update. A custom rolloff curve adjusts gain before the compressor, giving the "someone on my left / far away" perception without blowing out master levels.
- Left/right rendering – because the panner uses `panningModel = "HRTF"`, browsers feed the processed signal into the user's audio hardware with head-related transfer functions, producing natural interaural time/intensity differences.
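In plain Web Audio terms, the per-participant chain looks roughly like this (filter cutoffs and distance settings are illustrative; the real graph also splices in the denoiser worklet):

```ts
// Sketch of the per-participant chain using standard Web Audio nodes.
// The real setupSpatialAudioForParticipant() also inserts the denoiser worklet,
// and the 80 Hz / 12 kHz cutoffs here are assumptions, not the SDK's values.
function buildSpatialChain(ctx: AudioContext, stream: MediaStream): PannerNode {
  const source = ctx.createMediaStreamSource(stream);

  const highPass = new BiquadFilterNode(ctx, { type: "highpass", frequency: 80 });  // shave rumble
  const lowPass = new BiquadFilterNode(ctx, { type: "lowpass", frequency: 12000 }); // shave hiss

  const panner = new PannerNode(ctx, {
    panningModel: "HRTF",     // head-related transfer functions for 3D cues
    distanceModel: "inverse",
    refDistance: 1,
  });

  const gain = new GainNode(ctx, { gain: 1 });        // driven by the custom rolloff
  const compressor = new DynamicsCompressorNode(ctx); // master level safety

  source
    .connect(highPass)
    .connect(lowPass)
    .connect(panner)
    .connect(gain)
    .connect(compressor)
    .connect(ctx.destination);

  return panner; // positionX/Y/Z + orientationX/Y/Z updated on each pose packet
}
```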
Video Flow (Capture ↔ Rendering)
```
┌──────────────┐  produceTrack   ┌──────────────┐   RTP   ┌──────────────┐
│ getUserMedia │ ──────────────▶ │ MediaSoup SDK│ ──────▶ │ MediaSoup SFU│
└──────┬───────┘                 │  (Odyssey)   │         └──────┬───────┘
       │                         └──────┬───────┘                │
       │     consumer-created           │ track                  │
       ▼                                ▼                        │
┌──────────────┐                 ┌───────────────┐               │
│ Vue/React UI │ ◀────────────── │ SDK Event Bus │ ◀─────────────┘
│ (muted video │                 │ exposes media │
│  elements)   │                 │    tracks     │
└──────────────┘                 └───────────────┘
```

Core Classes
- `src/index.ts` – `OdysseySpatialComms` (socket lifecycle, producers/consumers, event surface).
- `src/MediasoupManager.ts` – transport helpers for produce/consume/resume.
- `src/SpatialAudioManager.ts` – Web Audio orchestration (listener transforms, per-participant chains, denoiser, distance math).
- `src/EventManager.ts` – lightweight EventEmitter used by the entire SDK.
Integration Checklist
- Instantiate once per page/tab and keep it in a store (Vuex, Redux, Zustand, etc.).
- Pipe LSD/Lap data from your rendering engine into `updatePosition()` + `setListenerFromLSD()` at ~10 Hz (see the sketch after this list).
- Render videos muted – never attach remote audio tracks straight to the DOM; let `SpatialAudioManager` own playback.
- Push avatar telemetry back to Unreal so `remoteSpatialData` can render minimaps/circles (see Odyssey V2 `sendMediaSoupParticipantsToUnreal`).
- Monitor logs – the browser console shows `🎧 SDK`, `📍 SDK`, and `🎚️ [Spatial Audio]` statements for every critical hop.
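A minimal way to satisfy the second checklist item, assuming a hypothetical `getEnginePose()` helper that surfaces your renderer's latest LSD data:

```ts
// getEnginePose() is hypothetical – replace with however your engine exposes LSD packets.
declare function getEnginePose(): {
  position: Position;
  direction: Direction;
  cameraPos: Position;
  lookAtPos: Position;
};

// 100 ms interval ≈ the ~10 Hz cadence recommended above.
const poseTimer = setInterval(() => {
  const pose = getEnginePose();
  sdk.updatePosition(pose.position, pose.direction);
  sdk.setListenerFromLSD(pose.position, pose.cameraPos, pose.lookAtPos);
}, 100);

// Remember to clearInterval(poseTimer) when the user leaves the room.
```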
Server Contract (Socket.IO events)
| Event | Direction | Payload |
|-------|-----------|---------|
| `join-room` | client → server | `{ roomId, userId, deviceId, position, direction }` |
| `room-joined` | server → client | `RoomJoinedData` (router caps, participants snapshot) |
| `update-position` | client → server | `{ participantId, conferenceId, position, direction }` |
| `participant-position-updated` | server → client | `{ participantId, position, direction, mediaState }` |
| `consumer-created` | server → client | `{ participantId, track (kind), position, direction }` |
| `participant-media-state-updated` | server → client | `{ participantId, mediaState }` |
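For reference, the payloads written out as TypeScript, inferred from the table and the Quick Start (the `mediaState` shape is an assumption, not the SDK's published typings):

```ts
// Inferred from the event table – not the SDK's published .d.ts.
interface Position { x: number; y: number; z: number }
interface Direction { x: number; y: number; z: number }

interface JoinRoomPayload {
  roomId: string;
  userId: string;
  deviceId: string;
  position: Position;
  direction: Direction;
}

interface ParticipantPositionUpdated {
  participantId: string;
  position: Position;
  direction: Direction;
  mediaState?: { audio: boolean; video: boolean }; // assumed shape
}
```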
Development Tips
- Run `pnpm install && pnpm build` inside `mediasoup-sdk-test` to publish a fresh build.
- Use `pnpm watch` while iterating so the TypeScript output under `dist/` refreshes live.
- The SDK targets evergreen browsers; for Safari < 16.4 you may need to polyfill AudioWorklet or disable the denoiser via `new SpatialAudioManager({ denoiser: { enabled: false } })`.
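A sketch of that Safari fallback using plain feature detection (the constructor call mirrors the tip above; the detection wiring itself is an assumption about your setup):

```ts
// Feature-detect AudioWorklet before enabling the denoiser; per the tip
// above, Safari < 16.4 may need a polyfill or this disabled fallback.
const hasAudioWorklet =
  typeof AudioContext !== "undefined" && "audioWorklet" in AudioContext.prototype;

const spatialAudio = new SpatialAudioManager({
  denoiser: { enabled: hasAudioWorklet },
});
```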
Have questions or want to extend the SDK? Start with `SpatialAudioManager` – that’s where most of the “real-world” behavior (distance feel, stereo cues, denoiser) lives.
