@newgameplusinc/odyssey-audio-video-sdk-dev
v1.0.353
Odyssey Spatial Audio & Video SDK using MediaSoup for real-time communication
Odyssey Audio/Video SDK (MediaSoup + Web Audio)
This package exposes OdysseySpatialComms, a thin TypeScript client that glues together:
code structure:
src/
├── index.ts                       # Main SDK entry (OdysseySpatialComms class)
├── types/                         # All TypeScript interfaces
│   ├── position.ts                # Position, Direction, Rotation types
│   ├── participant.ts             # Participant, MediaState types
│   ├── events.ts                  # All event types (OdysseyEvent, etc.)
│   ├── room.ts                    # Room-related types
│   └── index.ts                   # Re-exports all types
├── utils/
│   ├── spatial/                   # Spatial audio calculations
│   │   ├── distance-calc.ts       # Distance calculations
│   │   ├── gain-calc.ts           # Logarithmic gain
│   │   ├── pan-calc.ts            # Stereo panning
│   │   ├── head-position.ts       # Head/body position
│   │   ├── listener-calc.ts       # Listener orientation
│   │   └── angle-calc.ts          # Azimuth/angle calculations
│   ├── position/                  # Position utilities
│   │   ├── normalize.ts           # Unit normalization (cm/m)
│   │   ├── snap.ts                # Position snapping cache
│   │   └── coordinates.ts         # Unreal↔Standard conversion
│   ├── smoothing/                 # Audio smoothing
│   │   ├── pan-smoothing.ts       # Pan interpolation
│   │   └── gain-smoothing.ts      # Gain interpolation
│   └── audio/                     # Audio quality
│       ├── clarity-score.ts       # Voice clarity
│       └── voice-filter.ts        # Voice filtering
├── channels/
│   ├── spatial/                   # Spatial audio channel
│   │   ├── SpatialAudioChannel.ts # Main spatial processor
│   │   └── SpatialAudioTypes.ts   # Channel-specific types
│   └── huddle/                    # Huddle/private channel
│       ├── HuddleChannel.ts       # Huddle channel manager
│       └── HuddleTypes.ts         # Huddle types
├── audio/                         # Audio processing
│   ├── AudioPipeline.ts           # Master audio chain
│   ├── AudioNodeFactory.ts        # Web Audio node factory
│   └── MLNoiseSuppressor.ts       # TensorFlow noise suppression
├── core/                          # Core managers
│   ├── EventManager.ts            # Event emitter base
│   └── MediasoupManager.ts        # WebRTC transport manager
└── sdk/
    └── index.ts                   # Public SDK exports
- MediaSoup SFU for ultra-low-latency audio/video routing
- Web Audio API for Apple-like spatial mixing via SpatialAudioManager
- Socket telemetry (position + direction) so every browser hears/sees everyone exactly where they are in the 3D world
It mirrors the production SDK used by Odyssey V2 and ships ready-to-drop into any Web UI (Vue, React, plain JS).
Complete flow from frontend to server to SDK:
UNREAL ENGINE
    Sends: pos=(4130, 220, 700) cm in Unreal coords (X=fwd, Y=right, Z=up)
        │
        ▼
VUE (MediaSoupHub.vue)
    Transform: Unreal → Standard coords + cm → meters

    position = {
      x: data.pos.y / 100, // UE Y (right)   → Standard X (right)   = 2.2m
      y: data.pos.z / 100, // UE Z (up)      → Standard Y (up)      = 7.0m
      z: data.pos.x / 100  // UE X (forward) → Standard Z (forward) = 41.3m
    }

    Output: pos=(2.2, 7.0, 41.3) meters in Standard coords
        │
        ▼
SDK (index.ts → updatePosition)
    Passes position directly to server via socket.emit("update-position")
    Also sets listener position via setListenerFromLSD()
        │
        ▼
SERVER (server.ts) - PASS-THROUGH MODE
    1. Receive position from client
    2. AUTO-DETECT UNITS: if maxAxis > 50 → divide by 100 (cm→m)
    3. NO SMOOTHING: Pass through real-time position directly
    4. BROADCAST: Send normalized position (meters) to all clients
    Result: SDK receives REAL positions for accurate distance-based gain
        │
        ▼
SDK (SpatialAudioManager.ts) - Receiving remote positions
    1. normalizePositionUnits(): if maxAxis > 50 → divide by 100 (backup)
    2. snapPosition(): ignore movements < 15cm (reduces jitter)
    3. computeHeadPosition(): add +1.6m to Y for head height
    4. calculateLogarithmicGain(): cubic falloff 100%→0% over 0.5m→15m
    5. calculatePanning(): based on rot.y (listener yaw)
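The Vue-side transform in the flow above can be sketched as a tiny pure function. This is an illustrative helper, not an actual SDK export:

```typescript
// Unreal: X = forward, Y = right, Z = up, units in centimeters.
// Standard: X = right, Y = up, Z = forward, units in meters.
interface Vec3 { x: number; y: number; z: number; }

// Hypothetical helper mirroring the MediaSoupHub.vue transform.
function unrealToStandard(pos: Vec3): Vec3 {
  return {
    x: pos.y / 100, // UE Y (right)   → Standard X (right)
    y: pos.z / 100, // UE Z (up)      → Standard Y (up)
    z: pos.x / 100, // UE X (forward) → Standard Z (forward)
  };
}

// The example from the flow: (4130, 220, 700) cm → (2.2, 7.0, 41.3) m
const std = unrealToStandard({ x: 4130, y: 220, z: 700 });
console.log(std); // { x: 2.2, y: 7, z: 41.3 }
```

Note the axis swap happens at the same time as the cm → m division, so a single pass over the incoming packet is enough.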
What Happens on a Sudden Position Change (e.g. a 5m Jump)
SUDDEN TELEPORT: Person jumps from 2m → 7m instantly

1. SERVER detects jump > 5m
   └── Lerps 30% toward new position each update
       Frame 1: 2.0m → 3.5m (30% of 5m jump)
       Frame 2: 3.5m → 4.55m
       Frame 3: 4.55m → 5.29m
       Frame 4: 5.29m → 5.80m
       ... eventually reaches 7m

2. SDK receives smoothed positions
   └── Calculates gain for each position

3. WEB AUDIO smooths gain changes
   └── setTargetAtTime(gain, time, 0.1) = ~300ms smooth
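The server-side lerp above can be reproduced with a few lines. This is a simplified reading of the diagram (the 30% factor is from the diagram; `lerpFrames` is an illustrative name, not the server's actual function):

```typescript
const LERP_FACTOR = 0.3; // move 30% toward the target each update

// Once a jump > 5m is detected, the server lerps toward the target each
// update until it converges. Returns the first n smoothed positions.
function lerpFrames(start: number, target: number, n: number): number[] {
  const frames: number[] = [];
  let pos = start;
  for (let i = 0; i < n; i++) {
    pos = pos + (target - pos) * LERP_FACTOR; // 30% of the remaining gap
    frames.push(pos);
  }
  return frames;
}

// 2m → 7m teleport: ≈ [3.5, 4.55, 5.285, 5.7995],
// matching the diagram's 3.5 → 4.55 → 5.29 → 5.80
console.log(lerpFrames(2, 7, 4));
```

Because each step covers 30% of the *remaining* gap, the position converges asymptotically; the Web Audio `setTargetAtTime` smoothing then hides the per-frame steps.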
Coordinate System (World Space)
All spatial calculations are performed relative to a world origin (datum) at (0, 0, 0):
+Z (Forward/North)
↑
10 |
| B (15, 8) ← Speaker
8 | /
| / 5.83m distance
6 | /
| /
5 | A (10, 5) ← YOU (Listener, facing 0°)
| ↑
3 | | Your right ear →
| |
1 |
|
0 +--+--+--+--+--+--+--+--+--→ +X (Right/East)
0 2 4 6 8 10 12 14 16
↙ (into page)
+Y (Up/Height)
Key Points:
- Datum (0,0,0): World origin - all positions measured from here
- X-axis: Right/Left (positive = right, negative = left)
- Y-axis: Up/Down (height above ground)
- Z-axis: Forward/Back (positive = forward/north, negative = back/south)
- Distance: 3D Euclidean distance = √(Δx² + Δy² + Δz²)
- Panning: Calculated from X-Z plane position relative to listener rotation
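As a worked example of the distance formula, take listener A(10, 5) and speaker B(15, 8) from the diagram (both at the same height, so Δy = 0):

```typescript
interface Vec3 { x: number; y: number; z: number; }

// 3D Euclidean distance, as in the formula above.
function distance3D(a: Vec3, b: Vec3): number {
  const dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
  return Math.sqrt(dx * dx + dy * dy + dz * dz);
}

// A at (x=10, z=5), B at (x=15, z=8), same height:
const d = distance3D({ x: 10, y: 0, z: 5 }, { x: 15, y: 0, z: 8 });
console.log(d.toFixed(2)); // "5.83" → √(5² + 3²) = √34 ≈ 5.83m, as in the diagram
```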
Coordinate Transform (Unreal → Standard):
// Unreal: X=forward, Y=right, Z=up
// Standard SDK: X=right, Y=up, Z=forward
position = {
x: unrealPos.y / 100, // UE Y (right) → X (right)
y: unrealPos.z / 100, // UE Z (up) → Y (up)
z: unrealPos.x / 100 // UE X (forward) → Z (forward)
}
Feature Highlights
- 🔌 One class to rule it all – OdysseySpatialComms wires transports, producers, consumers, and room state.
- 🧭 Accurate pose propagation – updatePosition() streams listener pose to the SFU while participant-position-updated keeps the local store in sync.
- 🎧 Studio-grade spatial audio – each remote participant gets a dedicated Web Audio graph: ML denoiser (ScriptProcessorNode) → limiter → high-pass → low-pass → stereo panner → adaptive gain → master compressor. The ML denoiser is a trained 3-layer GRU model (872K params, val_loss=0.1636) running fully client-side via TensorFlow.js.
- 🎚️ Crystal-clear audio processing – a finely tuned audio pipeline featuring a gentle compressor, multi-stage filtering, and a smart denoiser prevents audio dropouts and echo. The result is a more natural, continuous voice without distracting artifacts.
- 🧭 Position-based spatial panning – updatePosition forwards positions to Web Audio, which calculates panning based on WHERE the speaker is relative to the listener (not which way they face). Uses the listener's right-vector projection with a 5m pan radius for natural left/right placement.
- 🤖 ML Noise Suppression (Active) – a TensorFlow.js GRU model (odyssey_adaptive_denoiser) runs as a ScriptProcessorNode wired as the first node in every participant's audio chain. Loads non-blocking in the background; operates in pass-through mode until the model is ready, then switches to ML denoising automatically. No fallback: if it fails, the error is logged to the console.
- 🔄 ICE Connection Stability – automatic ICE restart on transport disconnect for robust connections. The SDK requests an ICE restart from the server when the transport enters the disconnected state, enabling faster recovery from network issues without a full reconnection.
Quick Start
import {
OdysseySpatialComms,
Direction,
Position,
} from "@newgameplusinc/odyssey-audio-video-sdk-dev";
const sdk = new OdysseySpatialComms("https://mediasoup-server.example.com");
// 1) Join a room
await sdk.joinRoom({
roomId: "demo-room",
userId: "user-123",
deviceId: "device-123",
position: { x: 0, y: 0, z: 0 },
direction: { x: 0, y: 0, z: 1 }, // Forward vector (facing +Z)
});
// 2) Produce local media
const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
for (const track of stream.getTracks()) {
await sdk.produceTrack(track);
}
// 3) Handle remote tracks
sdk.on("consumer-created", async ({ participant, track }) => {
if (track.kind === "video") {
attachVideo(track, participant.participantId);
}
});
// 4) Keep spatial audio updated with all 3 data types
const position = { x: 10, y: 0, z: 20 }; // World coordinates (meters)
const direction = { x: 0, y: 0, z: 1 }; // Forward vector (normalized)
const rot = { x: 0, y: 45, z: 0 }; // Rotation angles: pitch, yaw, roll (degrees)
// Send position update with rotation to server
sdk.updatePosition(position, direction, { rot });
// Update local listener for spatial audio (cameraPos and lookAtPos come from your engine's camera)
sdk.setListenerFromLSD(position, cameraPos, lookAtPos, rot);
Audio Flow (Server ↔ Browser)
┌──────────────┐ update-position ┌──────────────┐ pose + tracks ┌──────────────────┐
│ Browser LSD │ ──────────────────▶ │ MediaSoup SFU│ ────────────────▶ │ SDK Event Bus │
│ (Unreal data)│ │ + Socket.IO │ │ (EventManager) │
└──────┬───────┘ └──────┬───────┘ └──────────┬────────┘
│ │ track + pose
│ │ ▼
│ ┌────────▼────────┐ ┌──────────────────┐
│ audio RTP │ consumer-created│ │ SpatialAudioMgr │
└──────────────────────────▶│ setup per-user │◀──────────────────────│ (Web Audio API) │
└────────┬────────┘ │ - Denoiser │
│ │ - HP / LP │
│ │ - StereoPanner │
▼ │ - Gain + Comp │
Web Audio Graph └──────────┬───────┘
│ │
▼ ▼
Listener ears (Left/Right)          System Output
Video Flow (Capture ↔ Rendering)
┌──────────────┐ produceTrack ┌──────────────┐ RTP ┌──────────────┐
│ getUserMedia │ ───────────────▶ │ MediaSoup SDK│ ──────▶ │ MediaSoup SFU│
└──────┬───────┘ │ (Odyssey) │ └──────┬───────┘
│ └──────┬───────┘ │
│ consumer-created │ track │
▼ ▼ │
┌──────────────┐ ┌──────────────┐ │
│ Vue/React UI │ ◀─────────────── │ SDK Event Bus │ ◀──────────────┘
│ (muted video │ │ exposes media │
│ elements) │ │ tracks │
└──────────────┘ └──────────────┘
Video Track Flow:
- Capture: getUserMedia() captures video from camera or screen
- Produce: sdk.produceTrack(track, { isScreenshare: true }) sends to SFU
- Route: MediaSoup SFU routes video RTP to other participants
- Consume: SDK receives a consumer-created event with the video track
- Render: UI attaches the track to a muted <video> element (audio handled separately)
Web Audio Algorithms
Coordinate normalization – Unreal sends centimeters; SpatialAudioManager auto-detects large values and converts to meters once.
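A minimal sketch of that auto-detection, assuming the >50 threshold described in the server flow earlier (the function name mirrors the one listed in the code structure, but this body is illustrative):

```typescript
interface Vec3 { x: number; y: number; z: number; }

// If any axis exceeds 50, the value is assumed to be centimeters,
// so convert to meters exactly once. Positions already in meters pass through.
function normalizePositionUnits(pos: Vec3): Vec3 {
  const maxAxis = Math.max(Math.abs(pos.x), Math.abs(pos.y), Math.abs(pos.z));
  if (maxAxis > 50) {
    return { x: pos.x / 100, y: pos.y / 100, z: pos.z / 100 }; // cm → m
  }
  return pos; // already meters
}

console.log(normalizePositionUnits({ x: 220, y: 700, z: 4130 })); // { x: 2.2, y: 7, z: 41.3 }
console.log(normalizePositionUnits({ x: 2.2, y: 7, z: 41.3 }));   // unchanged
```

The same check runs on the server and again in SpatialAudioManager as a backup, so a value is never divided twice: once converted, every axis sits well below 50.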
360° angle-based stereo panning – setListenerFromLSD() calculates the listener's right-ear vector from their yaw (rot.y). When updateSpatialAudio() runs, it uses atan2 to calculate the angle from listener to speaker, then applies sin(angle) for natural panning. This gives full left/right separation at ±90° angles. Speaker's rotation is ignored – only their position relative to listener matters.
Dynamic distance gain – updateSpatialAudio() measures distance from listener → source and applies a CUBIC EXPONENTIAL falloff (0.5m-15m range). Voices gradually fade from 100% (0.5m) to complete silence at 15m+ (hard cutoff). The cubic (1-normalized)³ formula creates clearly noticeable volume changes as you move. Distance calculated from listener's HEAD position to participant's HEAD position (body + 1.6m height). Master compressor is DISABLED to ensure gain changes are audible.
Noise handling – a TensorFlow.js GRU model (odyssey_adaptive_denoiser, 872K params, val_loss=0.1636) runs in a ScriptProcessorNode as the FIRST node in every participant's chain, applying a learned spectral mask before the high/low-pass filters. Audio passes through unchanged until the model finishes loading, then ML denoising becomes active automatically with no user action required.
Spatial Audio System (CLOCKWISE Rotation)
Core Algorithm (Full 360° Support)
The panning calculation uses position-based projection onto the listener's right-ear axis:
// Step 1: Calculate listener's right vector from yaw (CLOCKWISE rotation)
const yawRad = (rot.y * Math.PI) / 180;
listenerRight = {
x: Math.cos(yawRad),
z: -Math.sin(yawRad) // NEGATIVE sine for CLOCKWISE rotation
};
// Step 2: Vector from listener to speaker
vecToSource = {
x: speakerPos.x - listenerPos.x,
z: speakerPos.z - listenerPos.z
};
// Step 3: Calculate forward vector (90° CW from right)
listenerForward = { x: -listenerRight.z, z: listenerRight.x };
// Step 4: Project onto both axes
dxLocal = vecToSource.x * listenerRight.x + vecToSource.z * listenerRight.z; // Right/Left
dzLocal = vecToSource.x * listenerForward.x + vecToSource.z * listenerForward.z; // Front/Back
// Step 5: Calculate angle using atan2 (gives -π to +π radians)
angleToSource = Math.atan2(dxLocal, dzLocal);
// Step 6: Convert to pan value using sine (-1 to +1)
// 90° (right) = +1.0, 270° (left) = -1.0, 0°/180° (front/back) = 0.0
rawPan = Math.sin(angleToSource);
// Step 7: Apply smoothing to prevent jitter
smoothedPan = smoothPanValue(participantId, rawPan);
Key Principles
| Principle | Description |
|----------------------------|-------------------------------------------------------------|
| Position-based | Panning based on WHERE speaker is, NOT where they're looking |
| Listener yaw matters | Your rot.y determines which direction is "right" |
| Speaker rotation ignored | Their facing direction does NOT affect panning |
| Full 360° support | cos/sin trigonometry handles any angle automatically |
Listener Right Vector by Yaw (CLOCKWISE Rotation)
| Yaw  | Facing     | listenerRight (x, z) | Right Ear Faces | Left Ear Faces |
|------|------------|----------------------|-----------------|----------------|
| 0°   | +Z (fwd)   | (1.0, 0.0)           | +X              | -X             |
| 90°  | +X (right) | (0.0, -1.0)          | -Z              | +Z             |
| 180° | -Z (back)  | (-1.0, 0.0)          | -X              | +X             |
| 270° | -X (left)  | (0.0, 1.0)           | +Z              | -Z             |
Pan Value to Left/Right Gain
| panValue | Left Ear | Right Ear | Angle       | Description    |
|----------|----------|-----------|-------------|----------------|
| -1.0     | 100%     | 0%        | 270° (left) | Full LEFT      |
| -0.71    | 85%      | 15%       | 315°/225°   | Diagonal LEFT  |
| 0.0      | 50%      | 50%       | 0°/180°     | CENTER         |
| +0.71    | 15%      | 85%       | 45°/135°    | Diagonal RIGHT |
| +1.0     | 0%       | 100%      | 90° (right) | Full RIGHT     |
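Putting the right-vector, projection, and atan2 steps together, a self-contained sketch reproduces the values in the tables above (assumed to mirror the SDK's math, not copied from its source):

```typescript
// Pan for a speaker heard by a listener facing yawDeg
// (clockwise, 0° = +Z forward). Returns -1 (full LEFT) … +1 (full RIGHT).
function computePan(
  listener: { x: number; z: number },
  speaker: { x: number; z: number },
  yawDeg: number
): number {
  const yaw = (yawDeg * Math.PI) / 180;
  const right = { x: Math.cos(yaw), z: -Math.sin(yaw) }; // negative sine = clockwise
  const fwd = { x: -right.z, z: right.x };               // forward, 90° from right
  const v = { x: speaker.x - listener.x, z: speaker.z - listener.z };
  const dxLocal = v.x * right.x + v.z * right.z; // right/left component
  const dzLocal = v.x * fwd.x + v.z * fwd.z;     // front/back component
  return Math.sin(Math.atan2(dxLocal, dzLocal));
}

const o = { x: 0, z: 0 };
console.log(computePan(o, { x: 5, z: 0 }, 0));  // ≈ +1   (due right → full RIGHT)
console.log(computePan(o, { x: 0, z: 5 }, 0));  // ≈ 0    (dead ahead → CENTER)
console.log(computePan(o, { x: 5, z: 5 }, 0));  // ≈ 0.71 (45° → diagonal RIGHT)
console.log(computePan(o, { x: 0, z: 5 }, 90)); // ≈ -1   (listener turned right; same speaker is now LEFT)
```

The last call shows why the speaker's own rotation never appears in the formula: turning the *listener* 90° flips a dead-ahead source to full LEFT, exactly as the yaw table predicts.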
Anti-Jitter Smoothing (3 Layers)
Layer 1: Gain Change Threshold Filter (2.5%)
const GAIN_CHANGE_THRESHOLD = 0.025; // 2.5%
if (Math.abs(newGain - currentGain) / 100 < GAIN_CHANGE_THRESHOLD) {
return currentGain; // Ignore micro-jitter (movements ≤40cm)
}
Layer 2: Adaptive EMA for Pan
// Normal: 70% smoothing for stability
smoothedPan = previousPan * 0.7 + newPan * 0.3;
// Near center: 50% smoothing for moderate response
if (bothNearCenter) {
smoothedPan = previousPan * 0.5 + newPan * 0.5;
}
// Full flip (likely jitter): 85% HEAVY smoothing
if (signFlipped && panChange > 1.0) {
smoothedPan = previousPan * 0.85 + newPan * 0.15;
}
Layer 3: Audio API Ramp Time
stereoPanner.pan.setTargetAtTime(panValue, currentTime, 0.08); // 80ms pan
gainNode.gain.setTargetAtTime(gainValue, currentTime, 0.05); // 50ms gain
Distance-Based Gain: CUBIC EXPONENTIAL Falloff (HARD CUTOFF at 15m)
| Distance | Gain | Description |
|:---------------|:------:|:---------------------------------|
| 0.0 - 0.5m | 100% | Full volume (intimate) |
| 1.0m | ~90% | Very close - still loud |
| 2.0m | ~72% | Normal talking - NOTICEABLE |
| 3.0m | ~57% | Across table - CLEARLY QUIETER |
| 5.0m | ~33% | Across room - MUCH QUIETER |
| 7.0m | ~17% | Far end of room - very faint |
| 10.0m | ~4% | Barely audible |
| ≥15.0m | 0% | Silent (HARD CUTOFF) |
CUBIC EXPONENTIAL falloff formula:
private calculateLogarithmicGain(distance: number): number {
const minDistance = 0.5; // Full volume at 0.5m or closer
const maxDistance = 15.0; // Silent at 15m or farther - HARD CUTOFF
if (distance <= minDistance) return 100; // Full volume
if (distance >= maxDistance) return 0; // Silent - HARD CUTOFF
// CUBIC: (1 - normalized)³ for NOTICEABLE volume changes
const range = maxDistance - minDistance; // 14.5m
const normalizedDistance = (distance - minDistance) / range;
const remainingRatio = 1 - normalizedDistance;
return 100 * remainingRatio * remainingRatio * remainingRatio;
}
Why Cubic (not Linear or Quadratic)?
- Linear: Too gradual - hard to notice volume changes
- Quadratic: Not steep enough for 15m range
- Cubic: Perfect balance - clearly noticeable with proper 15m silence
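The distance/gain table can be sanity-checked by evaluating the formula directly; this is a standalone copy of the calculateLogarithmicGain() logic above:

```typescript
// Cubic falloff with hard cutoff, same constants as the snippet above.
function gainPercent(distance: number): number {
  const minDistance = 0.5;  // full volume at 0.5m or closer
  const maxDistance = 15.0; // silent at 15m or farther (hard cutoff)
  if (distance <= minDistance) return 100;
  if (distance >= maxDistance) return 0;
  const n = (distance - minDistance) / (maxDistance - minDistance);
  return 100 * Math.pow(1 - n, 3); // (1 - normalized)³
}

for (const d of [0.5, 1, 2, 3, 5, 7, 10, 15]) {
  console.log(`${d}m → ${gainPercent(d).toFixed(0)}%`);
}
// 0.5m → 100%, 1m → 90%, 2m → 72%, 3m → 57%,
// 5m → 33%, 7m → 17%, 10m → 4%, 15m → 0%  (matches the table)
```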
Smoothing: Web Audio's setTargetAtTime() handles all smoothing:
// Time constant 0.05 = ~150ms smooth transition (no clicks)
gainNode.gain.setTargetAtTime(gainValue, currentTime, 0.05);
Note: Master compressor is DISABLED to ensure gain changes are clearly audible.
🎯 Audio Stability System
Layer 1: Gain Change Threshold Filter (2.5%)
const GAIN_CHANGE_THRESHOLD = 0.025; // 2.5%
if (Math.abs(newGain - currentGain) / 100 < GAIN_CHANGE_THRESHOLD) {
return currentGain; // Ignore micro-jitter (movements ≤40cm)
}
Layer 2: SDK Position Snapping
positionSnapThreshold = 40cm
If movement < 40cm → use cached position (ignores pixel streaming jitter)
Layer 3: Web Audio Smoothing
// Gain changes are smoothed by Web Audio API directly
// 50ms time constant for smooth transitions
gainNode.gain.setTargetAtTime(gainValue, currentTime, 0.05); // τ = 50ms (~150ms to settle)
stereoPanner.pan.setTargetAtTime(panValue, currentTime, 0.08); // τ = 80ms (~240ms to settle)
Why simplified? Previous rate limiting caused gain to get stuck at low values. Web Audio's built-in smoothing is sufficient and more reliable.
🎯 Enterprise-Grade Gain Smoothing (v1.0.202+)
Problem Solved: With 40+ people in a room, rapid position updates (60+ Hz) caused instantaneous gain changes that created audible clicks, pops, and "pit pit" crackling noise. This was caused by setValueAtTime() creating waveform discontinuities.
Throttling = "Wait a bit before doing the same thing again" to prevent overload! 🎯
Person walking 10 meters:
Without Throttling:
||||||||||||||||||||||||||||||||||||| 600 updates
↑ Every single tiny movement = update
Result: Clicks, pops, CPU overload 😖
With Throttling (16ms):
| | | | | | | | | | 60 updates
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
16ms gaps between updates
Result: Smooth, efficient, perfect 😎
The Solution: Intelligent Throttling + Adaptive Ramping
// OLD (Causes Clicks):
nodes.gain.gain.setValueAtTime(gainValue, currentTime); // ❌ Instant jump
// NEW (Butter Smooth):
nodes.gain.gain.cancelScheduledValues(currentTime);
nodes.gain.gain.setValueAtTime(lastGain, currentTime);
nodes.gain.gain.linearRampToValueAtTime(gainValue, currentTime + rampTime); // ✅ Smooth transition
Performance Characteristics
| Participant Count | Position Updates/sec | Throttled Updates/sec | CPU Impact | Audio Quality |
|:-----------------:|:--------------------:|:---------------------:|:----------:|:-------------:|
| 2-5               | ~300                 | ~60                   | Low        | Perfect ✅    |
| 10                | ~600                 | ~120                  | Low        | Perfect ✅    |
| 20                | ~1,200               | ~240                  | Medium     | Perfect ✅    |
| 40                | ~2,400               | ~480                  | Medium     | Perfect ✅    |
| 100               | ~6,000               | ~600                  | Medium     | Perfect ✅    |
Intelligent Throttling Logic
// Throttle: skip the update if it arrived too recently AND the gain change is small
const isSignificantChange = gainDelta > 0.1; // >10% change
if (timeSinceLastUpdate < 16 && !isSignificantChange) { // 16ms ≈ one 60Hz frame
  return; // Skip this update, wait for next frame
}
Key Features:
- ✅ Time-based throttling: Maximum 60Hz per participant (16ms interval)
- ✅ Significance bypass: Large changes (>10%) bypass throttle immediately
- ✅ Per-participant tracking: Each person has independent throttle state
- ✅ Standing participants: Minimal updates when not moving (saves CPU)
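The per-participant throttle described above can be sketched with a small state map. This is a simplified illustration of the described behavior, not the SDK's internal code:

```typescript
const THROTTLE_MS = 16;        // ~60Hz per participant
const SIGNIFICANT_DELTA = 0.1; // >10% gain change bypasses the throttle

interface ThrottleState { lastUpdateMs: number; lastGain: number; }
const throttleState = new Map<string, ThrottleState>();

// Returns true if this gain update should be applied now.
function shouldApplyGainUpdate(participantId: string, gain: number, nowMs: number): boolean {
  const state = throttleState.get(participantId);
  if (!state) {
    throttleState.set(participantId, { lastUpdateMs: nowMs, lastGain: gain });
    return true; // first update always applies
  }
  const gainDelta = Math.abs(gain - state.lastGain);
  const elapsed = nowMs - state.lastUpdateMs;
  if (elapsed < THROTTLE_MS && gainDelta <= SIGNIFICANT_DELTA) {
    return false; // too soon and too small: skip, wait for the next frame
  }
  state.lastUpdateMs = nowMs;
  state.lastGain = gain;
  return true;
}

console.log(shouldApplyGainUpdate("p1", 0.8, 0));   // true  (first update)
console.log(shouldApplyGainUpdate("p1", 0.79, 5));  // false (5ms later, 1% change)
console.log(shouldApplyGainUpdate("p1", 0.5, 10));  // true  (30% change bypasses throttle)
console.log(shouldApplyGainUpdate("p1", 0.49, 30)); // true  (20ms elapsed)
```

Because each participant has its own entry in the map, a teleporting participant never waits behind forty idle ones.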
Adaptive Ramp Time
The system automatically adjusts ramp time based on gain change magnitude:
| Gain Change | Ramp Time | User Experience                       |
|:-----------:|:---------:|:--------------------------------------|
| < 5%        | 15ms      | Instant feel, imperceptible smoothing |
| 5-20%       | 15-35ms   | Smooth transition, natural            |
| 20-30%      | 35-45ms   | Very smooth, no artifacts             |
| > 30%       | 50ms      | Ultra smooth, prevents any clicking   |
Formula:
rampTime = Math.min(
0.015 + (gainDelta * 0.1), // Base 15ms + scaled by change
0.050 // Max 50ms cap
);
Real-World Scenarios
| Scenario               | Behavior                            | Result                     |
|------------------------|-------------------------------------|----------------------------|
| Person walking nearby  | Small gain changes → 15-25ms ramps  | Feels instant, zero clicks |
| Person runs past you   | Large gain changes → 40-50ms ramps  | Smooth volume sweep        |
| 40 people, 20 moving   | ~1200 updates → throttled to ~240   | Perfect audio, low CPU     |
| Person stands still    | Updates skipped entirely            | Zero CPU usage             |
| Person teleports close | >10% change bypasses throttle       | Immediate volume update    |
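Plugging the table's numbers into the ramp formula confirms the behavior (the function name is illustrative; the constants are the ones in the formula above):

```typescript
// Adaptive ramp: base 15ms plus 100ms per unit of gain change, capped at 50ms.
function rampTimeSec(gainDelta: number): number {
  return Math.min(
    0.015 + gainDelta * 0.1, // base 15ms + scaled by change
    0.05                     // max 50ms cap
  );
}

console.log(rampTimeSec(0.02)); // ≈ 0.017 → 17ms for a 2% change (feels instant)
console.log(rampTimeSec(0.2));  // ≈ 0.035 → 35ms for a 20% change
console.log(rampTimeSec(0.5));  // 0.05   → capped at 50ms for big jumps
```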
Error Handling & Fallback
try {
// Smooth ramping
nodes.gain.gain.linearRampToValueAtTime(gainValue, currentTime + rampTime);
} catch (err) {
// Fallback: Direct value setting (rare edge case)
console.warn(`Gain scheduling failed, using instant set:`, err);
nodes.gain.gain.value = gainValue;
}
Why This Works
Root Cause: Instantaneous gain changes create waveform discontinuities:
Old Method: New Method:
Volume Volume
↑ ↑
│ ╱╲ ╱╲ │ ╱╲ ╱╲
│ ╱ ╲ ╱ ╲ │ ╱ ╲ ╱ ╲
│ ╱ ╲╱ ╲ │ ╱ ╲╱ ╲
│ ╱ ╲ │ ╱ ╲
│ ╱ ╲ │ ╱ ╲
──┼────────────────────→ Time ──┼────────────────────→ Time
0  ← JUMP! Click here!         0  ← Smooth ramp here!
Technical Details:
- Rapid gain jumps = discontinuous waveform = audible click
- With 60Hz position updates × 40 people = 2400 potential clicks/sec
- Linear ramping = continuous waveform = zero artifacts
- Throttling reduces update frequency by ~60% (saves CPU + audio thread)
Network Resilience
Server-Side:
// Opus codec with Forward Error Correction
useinbandfec: 1 // Automatically recovers lost packets
ptime: 20 // 20ms frames for low latency
Why Non-Spatial Audio Worked Fine:
- Non-spatial audio: Single static gain value, rarely changes
- Spatial audio: Per-frame position updates = rapid gain changes
- The issue wasn't network - it was rapid gain value changes in Web Audio API
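On the client side, these Opus settings correspond to mediasoup-client's producer codec options. The sketch below uses field names from mediasoup-client's documented ProducerCodecOptions; whether the SDK sets exactly these values internally is an assumption:

```typescript
// Sketch: codec options a mediasoup-client produce() call could pass to
// enable in-band FEC and 20ms packetization.
const audioCodecOptions = {
  opusFec: true,  // maps to useinbandfec=1 in the Opus SDP fmtp line
  opusPtime: 20,  // 20ms frames for low latency
  opusDtx: true,  // optional: discontinuous transmission during silence
};

// e.g. await transport.produce({ track, codecOptions: audioCodecOptions });
```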
🎛️ Audio Processing Settings
Design Goal: Crystal clear voice with no echo, pumping, or bathroom effect.
🔊 Master Compressor
| Setting | Value | Purpose |
|:--------------|:---------:|:-------------------------------------|
| Threshold | -18 dB | Only compress loud peaks |
| Knee | 40 dB | Soft knee for natural sound |
| Ratio | 3:1 | Gentle compression, no pumping |
| Attack | 10 ms | Fast enough to catch peaks |
| Release | 150 ms | Fast release prevents echo tail |
| Master Gain | 1.0 | Unity gain for clean signal |
🎚️ Filter Chain
| Filter | Frequency | Q Value | Purpose |
|:----------------|:-----------:|:-------:|:--------------------------------|
| Highpass | 100 Hz | 0.5 | Remove room boom/rumble |
| Lowpass | 10 kHz | 0.5 | Open sound, no ringing |
| Voice Boost | 180 Hz | 0.5 | ❌ Disabled (prevents echo) |
| Dynamic Lowpass | 12 kHz | 0.5 | Natural treble preservation |
🛡️ Per-Participant Limiter
| Setting | Value | Purpose |
|:-----------|:---------:|:--------------------------------------|
| Threshold | -6 dB | Only activate near clipping |
| Knee | 3 dB | Hard knee = true limiter |
| Ratio | 20:1 | High ratio catches peaks cleanly |
| Attack | 1 ms | Ultra-fast peak catching |
| Release | 50 ms | Fast release = no pumping |
🎤 Denoiser (ML — GRU ScriptProcessorNode)
| Parameter | Value | Purpose |
|:------------------|:-------------------------:|:--------------------------------------------------|
| Model | odyssey_adaptive_denoiser | 3-layer GRU, UINT8 quantized TF.js |
| Params | 872,448 | Trained 100 epochs, val_loss=0.1636 |
| Buffer size | 4096 samples (~85ms) | ScriptProcessorNode synchronous processing |
| Backend | WebGL (GPU) | Reported at load: [MLNoiseSuppressor] TF.js backend ready: webgl |
| Pass-through | Yes (while loading) | Audio unaffected until model is ready |
| Normalization | mean=0.3953, std=0.1442 | Stats loaded from normalization_stats.json |
🔗 Audio Chain
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ AUDIO PROCESSING CHAIN │
├──────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ MediaStream ML Denoiser Per-Participant Spatial Master │
│ Source → (GRU model) → Limiter → Filters → Panner → Compressor │
│ │ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼ │
│ [WebRTC] [ScriptProcessor [Peak Catch] [HP 100Hz] [Stereo [3:1 Ratio] │
│ Track 872K GRU model] [-6dB] LP 10kHz] L/R Pan] Output │
│ (pass-through │
│ while loading) │
└──────────────────────────────────────────────────────────────────────────────────────────┘
Detailed Chain:
Source → MLScriptProcessor (GRU denoiser) → Limiter → HighPass(100Hz) → VoiceBand → LowPass(10kHz) →
DynamicLP(12kHz) → MonoDownmix → StereoUpmix → StereoPanner → Gain → MasterCompressor → Output
Spatial Audio Flowchart
┌─────────────────────────────────────────────────────────────────────────────┐
│ SPATIAL AUDIO PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────────┐ ┌──────────────────┐
│ LISTENER DATA │ │ SPEAKER DATA │
│ pos, rot (yaw) │ │ pos (x, y, z) │
└────────┬─────────┘ └────────┬─────────┘
│ │
▼ │
┌──────────────────┐ │
│ Calculate Right │ │
│ Vector │ │
│ cos(yaw), -sin() │ │
└────────┬─────────┘ │
│ │
▼ ▼
┌─────────────────────────────────────────────┐
│ VECTOR TO SPEAKER │
│ vecToSource = speakerPos - listenerPos │
└──────────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ DOT PRODUCT (Projection) │
│ dxLocal = vecToSource · listenerRight │
│ (positive = RIGHT, negative = LEFT) │
└──────────────────────┬──────────────────────┘
│
┌────────────┴────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ NORMALIZE PAN │ │ CALCULATE DIST │
│ sin(atan2(dx,dz))│ │ dist = |vecTo| │
│ range: -1 to +1 │ │ 0.5m → 15m range │
└────────┬─────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ ANTI-JITTER │ │ EXPONENTIAL GAIN │
│ - Threshold 2.5% │ │ gain = (1-norm)³ │
│ - EMA 70% │ │ × 100% │
│ - Ramp 80ms │ │ 0% at 15m │
└────────┬─────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ StereoPanner │ │ GainNode │
│ L/R balance │ │ volume │
└────────┬─────────┘ └────────┬─────────┘
│ │
└──────────┬──────────────┘
▼
┌──────────────────┐
│ AUDIO OUTPUT │
│ (headphones) │
└──────────────────┘
360° Spatial Audio Diagram (Top View)
0° (Front)
L100% R100%
↑
│
315° │ 45°
L100% R58% │ L58% R100%
↖ │ ↗
↖ │ ↗
↖ │ ↗
↖ │ ↗
270° ←─────────────── 🎧 ───────────────→ 90°
(Left) Listener (Right)
L100% R40% yaw=0° L40% R100%
↙ │ ↘
↙ │ ↘
↙ │ ↘
↙ │ ↘
L100% R58% │ L58% R100%
225° │ 135°
│
↓
L100% R100%
180° (Behind)
Legend:
- 🎧 = Listener at origin, facing 0° (forward)
- Angles = Speaker position around listener
- L/R % = Left/Right ear volume for speaker at that position
Configuration
| Parameter | Value | Description |
|------------------------|--------|-----------------------------|
| positionPanRadius | 5.0m | Distance for full L/R pan |
| nearDistance | 0.5m | Full gain threshold |
| farDistance | 15.0m | Silence threshold (hard cutoff) |
| panSmoothingFactor | 0.5 | Normal smoothing |
| panChangeThreshold | 0.02 | Jitter ignore threshold |
| panRampTime | 0.15s | Audio transition time |
| headHeight | 1.6m | Added to body Y |
Console Logs Reference
// Mediasoup server URL being used
[Odyssey] Connecting to MediaSoup server: https://...
// ML model loading
[MLNoiseSuppressor] Initializing TF.js backend: webgl
[MLNoiseSuppressor] Model loaded — 872,448 params | backend: webgl
[Odyssey] ML Noise Suppression loaded and active
// ML model failure (audio still works — pass-through mode)
[Odyssey] ML Noise Suppression failed to load: <error>
// ML active per participant
[SpatialAudioChannel] ML noise suppression ACTIVE — model loaded from <url>
// Listener position update
📍 [SDK Listener] pos=(x, y, z) rot=(pitch, yaw, roll)
// Speaker position received
🎧 [SDK Rx] <id> bodyPos=(x, y, z) rot=(pitch, yaw, roll)
// Spatial audio calculation
🎧 SPATIAL AUDIO [<id>] dist=Xm dxLocal=Xm rawPan=X smoothPan=X pan(L=X%,R=X%) gain=X% listenerRight=(x,z) vecToSrc=(x,z)
Server Contract (Socket.IO Events)
| Event | Direction | Payload |
|----------------------------------|------------------|-----------------------------------------------------------------------------|
| join-room | client → server | {roomId, userId, deviceId, position, direction} |
| room-joined | server → client | RoomJoinedData (router caps, participants snapshot) |
| update-position | client → server | {participantId, conferenceId, position, direction, rot, cameraDistance} |
| participant-position-updated | server → client | {participantId, position, direction, rot, mediaState, pan} |
| consumer-created | server → client | {participantId, track(kind), position, direction, appData} |
| participant-media-state-updated| server → client | {participantId, mediaState} |
| all-participants-update | server → client | {roomId, participants[]} |
| new-participant | server → client | {participantId, userId, position, direction} |
| participant-left | server → client | {participantId} |
Position Data Types (Critical for Spatial Audio)
The SDK sends three separate data types to the server for accurate spatial audio:
| Data Type | Structure | Description |
|--------------|----------------------------------|---------------------------------------------------------------|
| position | {x, y, z} in meters | World coordinates - WHERE the player is located |
| direction | {x, y, z} normalized vector | Forward direction - which way the player is LOOKING (unit vector) |
| rot | {x, y, z} in degrees | Euler rotation angles - pitch(x), yaw(y), roll(z) |
IMPORTANT: rot.y (yaw) is critical for spatial audio left/right ear calculation:
- The listener's yaw determines their ear orientation
listenerRight = { x: cos(yaw), z: -sin(yaw) }
- Speakers are panned based on their position projected onto the listener's right axis
// Frontend sends all 3 data types:
sdk.updatePosition(position, direction, {
rot, // Rotation angles (pitch, yaw, roll) in degrees
cameraDistance,
screenPos,
});
// Server broadcasts to other clients:
socket.emit("participant-position-updated", {
position, // World coordinates
direction, // Forward vector
rot, // Rotation angles - yaw used for L/R audio
...
});
Noise-Cancellation Stack (What's Included)
| Layer | Purpose |
|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| Adaptive denoiser worklet | Learns each participant's noise floor in real time, applies multi-band downward expander plus dynamic low/high-pass shaping |
| speechBoost | Lifts the low/mid band only when speech confidence is high, keeping consonants bright without reintroducing floor noise |
| highBandGate | Clamps constant fan hiss in the 4–12 kHz band whenever speechPresence is low |
| Silence gate | If energy stays below silenceFloor for configurable hold window, track ramps to true silence, wakes instantly on voice return|
| Classic filters | Fixed high-pass (80Hz) / low-pass (8kHz) shave off rumble and hiss before signals reach the panner |
Configuration example:
const sdk = new OdysseySpatialComms(serverUrl, {
denoiser: {
threshold: 0.008,
maxReduction: 0.88,
hissCut: 0.52,
holdMs: 260,
voiceBoost: 0.65,
voiceSensitivity: 0.33,
voiceEnhancement: true,
silenceFloor: 0.00075,
silenceHoldMs: 520,
silenceReleaseMs: 160,
speechBoost: 0.35,
highBandGate: 0.7,
highBandAttack: 0.25,
highBandRelease: 0.12,
},
});
How Spatial Audio Is Built
| Step | Description |
|-------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1. Telemetry ingestion | Each LSD packet is passed through setListenerFromLSD(listenerPos, cameraPos, lookAtPos, rot) so the Web Audio listener matches the player's real head/camera pose |
| 2. Per-participant graph | When consumer-created yields a remote audio track, setupSpatialAudioForParticipant() spins up: Source → Compressor → Denoiser → HP → LP → StereoPanner → Gain |
| 3. Position updates | Every participant-position-updated event calls updateSpatialAudio(participantId, position, rot). Position feeds panning, rot provides listener's yaw |
| 4. Distance-aware gain | The manager computes Euclidean distance to each remote participant and applies the cubic falloff curve (0.5m–15m range) for more perceptible volume changes |
| 5. Anti-jitter smoothing | 3-layer system: threshold filter (0.02), EMA smoothing (0.5), SNAP behavior (0.2 for direction changes) |
| 6. Left/right rendering | StereoPannerNode outputs processed signal with accurate L/R separation based on position projection |
Integration Checklist
- [ ] Instantiate once per page/tab and keep it in a store (Vuex, Redux, Zustand, etc.)
- [ ] Pipe LSD/Lap data from your rendering engine into updatePosition() + setListenerFromLSD() at ~10 Hz
- [ ] Render videos muted – never attach remote audio tracks straight to the DOM; let SpatialAudioManager own playback
- [ ] Resume audio context – call sdk.resumeAudio() on first user interaction (required by browsers)
- [ ] Handle consumer-created – attach video tracks to the UI; audio is handled automatically by spatial audio
- [ ] Monitor logs – the browser console shows 🎧 SDK, 📍 SDK, and 🎚️ [Spatial Audio] statements for every critical hop
- [ ] Push avatar telemetry back to Unreal so remoteSpatialData can render minimaps/circles
Core Classes
| File | Purpose |
|-----------------------------|--------------------------------------------------------------------------------------|
| src/index.ts | OdysseySpatialComms – socket lifecycle, producers/consumers, event surface |
| src/core/MediasoupManager.ts | Transport helpers for produce/consume/resume |
| src/channels/spatial/SpatialAudioChannel.ts | Web Audio orchestration (listener transforms, per-participant chains, ML denoiser node) |
| src/audio/MLNoiseSuppressor.ts | TensorFlow.js GRU denoiser — odyssey_adaptive_denoiser model, 872K params, val_loss=0.1636 |
| src/core/EventManager.ts | Lightweight EventEmitter used by the entire SDK |
| src/types/index.ts | TypeScript interfaces for Position, Direction, Participant, MediaState, etc. |
Development Tips
- Run pnpm install && pnpm build inside odyssey-mediasoup-sdk to publish a fresh build
- Use pnpm watch while iterating so TypeScript outputs live under dist/
- The SDK targets evergreen browsers; Safari <16.4 needs WebGL support for TF.js (all modern Safari versions have this)
- Have questions or want to extend the SDK? Start with SpatialAudioManager – that's where most of the "real-world" behavior (distance feel, stereo cues, denoiser) lives
- ML Noise Suppression: initialized automatically at SDK startup using the odyssey_adaptive_denoiser model from public/odyssey_adaptive_denoiser/model.json. No manual call needed. Watch the browser console for [Odyssey] ML Noise Suppression loaded and active
Development
# Install dependencies
pnpm install
# Build
pnpm build
# Watch mode
pnpm watch
Related Documentation
- HEAD_POSITION_DATA_FLOW.md – Detailed panning algorithm with 360° tables
- SPATIAL_AUDIO_IMPLEMENTATION.md – Implementation summary with examples
- audio-position-wisecalulation.md – Position-wise calculation reference
