auralwise_cli

v1.0.8, published 2 days ago

CLI for the AuralWise audio intelligence API: transcription, speaker diarization, audio event detection


Maintainer: andiezhu

Keywords: asr, speech-to-text, transcription, speaker-diarization, audio-events, auralwise, cli, audio, voice

AuralWise CLI

Chinese documentation (中文文档)

Command-line interface for the AuralWise Speech Intelligence API.

A single API call returns transcription, speaker diarization, speaker embeddings, word-level timestamps, and 521-class audio event detection.

Features

Speech Transcription: 99 languages, plus a dedicated Chinese engine (optimize_zh) that is faster and more accurate on Chinese audio
Speaker Diarization: automatic speaker-count detection and per-segment speaker labels
Speaker Embeddings: 192-dim voice-print vectors for matching speakers across recordings
Timestamps: word-level (~10 ms) or, with the Chinese engine, segment-level (~100 ms) precision
Audio Event Detection: 521 AudioSet sound event classes (applause, cough, music, keyboard, etc.)
VAD: Voice Activity Detection segments
Batch Mode: half-price processing on off-peak GPU capacity, delivered within 24 hours
Local Transcoding + Vocal Enhancement: when ffmpeg is available, audio is transcoded to mono 16 kHz 32 kbps MP3 and passed through a vocal enhancement filter chain (denoise, high-pass, EQ, loudness normalization) before upload, typically shrinking uploads 10-20× and improving ASR accuracy on noisy recordings

Installation

npm install -g auralwise_cli

Requires Node.js >= 18.

Quick Start

# Set your API key (get one at https://auralwise.cn)
export AURALWISE_API_KEY=asr_xxxxxxxxxxxxxxxxxxxx

# Transcribe from URL — waits for completion and prints results
auralwise transcribe https://example.com/meeting.mp3

# Transcribe a local file (auto base64 upload)
auralwise transcribe ./recording.wav

# Chinese optimization mode (faster, cheaper for Chinese audio)
auralwise transcribe ./meeting.mp3 --optimize-zh --language zh

# Submit without waiting
auralwise transcribe https://example.com/audio.mp3 --no-wait

# Get JSON output
auralwise transcribe ./audio.mp3 --json --output result.json

Commands

`auralwise transcribe <source>`

Submit an audio file for processing. <source> can be an HTTP(S) URL or a local file path.

Input modes:

URL mode — Pass an https://... URL. By default the CLI downloads and transcodes the audio locally before upload; if ffmpeg is missing or transcoding is disabled, the URL is submitted directly to the API.
File mode — Pass a local file path; the CLI reads, transcodes, and uploads as base64.
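
The source handling above boils down to a small decision table. The sketch below reproduces it in bash so you can predict which path an invocation will take; upload_plan is a hypothetical helper, not part of the CLI, and it only models the ffmpeg and --no-transcode checks (not the upload size cap).

```bash
# Hypothetical helper: print the path the CLI takes for a given source.
# Args: source, transcode flag (1 = default, 0 = --no-transcode),
#       optional ffmpeg flag (1/0; detected from PATH when omitted).
upload_plan() {
  local src="$1" transcode="${2:-1}" have_ffmpeg="${3:-}"
  if [ -z "$have_ffmpeg" ]; then
    if command -v ffmpeg >/dev/null 2>&1; then have_ffmpeg=1; else have_ffmpeg=0; fi
  fi
  case "$src" in
    http://*|https://*)
      if [ "$transcode" = 1 ] && [ "$have_ffmpeg" = 1 ]; then
        echo "download, transcode, upload"
      else
        echo "submit URL directly"
      fi ;;
    *)
      if [ "$transcode" = 1 ] && [ "$have_ffmpeg" = 1 ]; then
        echo "transcode, upload as base64"
      else
        echo "upload original file as base64"
      fi ;;
  esac
}
```

For example, `upload_plan https://example.com/meeting.mp3 0` reports that the URL is submitted directly, as with --no-transcode.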

Local transcoding pipeline (default ON):

When ffmpeg is available on your PATH, the CLI first converts your audio to mono 16kHz 32kbps MP3 with a vocal enhancement filter chain (afftdn denoise → 80Hz highpass → two-band EQ → dynaudnorm loudness). This typically shrinks uploads by 10-20× and yields cleaner ASR on noisy recordings. Temp files are deleted immediately after the upload succeeds.

If ffmpeg isn't installed, the CLI prints a one-time warning and submits the original file/URL unchanged.
If the filter chain is incompatible with a particular source, it falls back to a plain transcode (same format, no filters).
Upload size is capped at 150 MB. A local file exceeding the limit aborts with an error; a URL source that exceeds the limit after transcoding falls back to submitting the URL directly.
Pass --no-transcode to skip transcoding entirely and upload the file as-is.
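
To inspect roughly what gets uploaded, the transcoding step can be approximated by hand with ffmpeg. This is a sketch, not the CLI's actual invocation: the filters (afftdn, highpass, equalizer, dynaudnorm) and output options are standard ffmpeg, but the two EQ bands use placeholder values because the exact settings are not documented here, and enhance_chain / transcode_for_upload are hypothetical names.

```bash
# Hypothetical reconstruction of the pre-upload transcode (not the CLI's own code).

# FFT denoise -> 80 Hz high-pass -> two EQ bands -> dynamic loudness normalization.
# PLACEHOLDER EQ values: the real frequencies and gains are not documented.
enhance_chain() {
  printf '%s\n' "afftdn,highpass=f=80,equalizer=f=250:t=q:w=1:g=-2,equalizer=f=3000:t=q:w=1:g=3,dynaudnorm"
}

# Transcode $1 to a mono 16 kHz 32 kbps MP3 at $2.
# Tries the enhancement chain first, then falls back to a plain transcode.
transcode_for_upload() {
  local in="$1" out="$2"
  local opts=(-hide_banner -loglevel error -y -i "$in" -vn -ac 1 -ar 16000 -c:a libmp3lame -b:a 32k)
  if ffmpeg "${opts[@]}" -af "$(enhance_chain)" "$out"; then
    return 0
  fi
  echo "filter chain failed; retrying without filters" >&2
  ffmpeg "${opts[@]}" "$out"
}
```

Usage: `transcode_for_upload ./recording.wav /tmp/recording.mp3 && ls -l ./recording.wav /tmp/recording.mp3` shows the size reduction for that file.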

Upstream service limits (validated locally before upload):

| Limit | Value | Notes |
|-------|-------|-------|
| Minimum file size | 1 KB | Anything smaller is rejected as not a valid audio file |
| Maximum file size | 2 GB | Upstream hard cap |
| Maximum duration | 5 hours | Probed via ffprobe when available; otherwise enforced server-side |
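
Scripts that feed many files to the CLI can apply the same limits up front. A minimal sketch: preflight and mp3_bytes are hypothetical helpers, the code assumes binary units (1 KB = 1,024 bytes, 2 GB = 2 × 1,024³ bytes), which the table does not specify, and like the CLI it skips the duration check when ffprobe is missing.

```bash
# Hypothetical pre-flight check mirroring the documented upstream limits.
# ASSUMPTION: binary units (1 KB = 1024 B, 2 GB = 2 * 1024^3 B).
MIN_BYTES=1024
MAX_BYTES=$((2 * 1024 * 1024 * 1024))
MAX_SECONDS=$((5 * 3600))

preflight() {
  local f="$1" size dur
  [ -f "$f" ] || { echo "not a file: $f" >&2; return 1; }
  size=$(( $(wc -c < "$f") ))   # arithmetic expansion strips wc's padding
  if [ "$size" -lt "$MIN_BYTES" ]; then
    echo "too small ($size bytes): $f" >&2; return 1
  fi
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "too large ($size bytes): $f" >&2; return 1
  fi
  # Duration check only when ffprobe exists; otherwise the server enforces it.
  if command -v ffprobe >/dev/null 2>&1; then
    dur=$(ffprobe -v error -show_entries format=duration \
          -of default=noprint_wrappers=1:nokey=1 "$f" 2>/dev/null)
    case "$dur" in
      ''|*[!0-9.]*) ;;   # unknown or non-numeric duration: leave it to the server
      *) if awk -v d="$dur" -v m="$MAX_SECONDS" 'BEGIN { exit !(d > m) }'; then
           echo "too long (${dur}s): $f" >&2; return 1
         fi ;;
    esac
  fi
  return 0
}

# Approximate size of the transcoded upload: 32 kbps = 4000 bytes per second.
mp3_bytes() { echo $(( $1 * 32000 / 8 )); }
```

At 32 kbps the transcoded MP3 grows by 32,000 / 8 = 4,000 bytes per second, so `mp3_bytes 18000` (the 5-hour maximum) gives 72,000,000 bytes, about 72 MB, which stays under the 150 MB upload cap.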

Common options:

| Option | Description |
|--------|-------------|
| `--language <lang>` | ASR language code (zh, en, ja, ...); auto-detected if omitted |
| `--optimize-zh` | Use the dedicated Chinese engine (faster, cheaper, segment-level timestamps) |
| `--no-asr` | Disable transcription |
| `--no-diarize` | Disable speaker diarization |
| `--no-events` | Disable audio event detection |
| `--no-transcode` | Skip local transcoding; upload the original file as-is |
| `--hotwords <words>` | Boost recognition of specific words (comma-separated) |
| `--num-speakers <n>` | Fixed number of speakers |
| `--max-speakers <n>` | Maximum number of speakers for auto-detection (default: 10) |
| `--batch` | Use batch mode (half-price, delivered within 24h) |
| `--no-wait` | Return immediately after task creation |
| `--json` | Output the result as JSON |
| `--output <file>` | Save the result to a file |
| `--callback-url <url>` | Webhook URL for completion notification |

Advanced ASR options:

| Option | Description |
|--------|-------------|
| `--beam-size <n>` | Beam search width (default: 5) |
| `--temperature <n>` | Decoding temperature (default: 0.0) |
| `--initial-prompt <text>` | Prompt text to guide transcription style |
| `--vad-threshold <n>` | VAD sensitivity, 0-1 (default: 0.35) |
| `--events-threshold <n>` | Audio event confidence threshold (default: 0.3) |
| `--events-classes <list>` | Detect only the listed event classes (comma-separated) |

`auralwise tasks`

List your tasks with optional filtering.

auralwise tasks                          # List all tasks
auralwise tasks --status done            # Only completed tasks
auralwise tasks --page 2 --page-size 50  # Pagination
auralwise tasks --json                   # JSON output

`auralwise task <id>`

Get details of a specific task.

auralwise task 550e8400-e29b-41d4-a716-446655440000
auralwise task 550e8400-e29b-41d4-a716-446655440000 --json

`auralwise result <id>`

Retrieve the full result of a completed task.

auralwise result <task-id>                     # Pretty-printed output
auralwise result <task-id> --json              # JSON output
auralwise result <task-id> --output result.json  # Save to file

`auralwise delete <id>`

Delete a task and its associated files.

auralwise delete <task-id>           # With confirmation prompt
auralwise delete <task-id> --force   # Skip confirmation

`auralwise events`

Browse the 521 AudioSet sound event classes.

auralwise events                       # List all 521 classes
auralwise events --search Cough        # Search by name
auralwise events --category Music      # Filter by category
auralwise events --json                # JSON output

Configuration

API Key

Set your API key with the --api-key flag or the AURALWISE_API_KEY environment variable:

# Environment variable (recommended)
export AURALWISE_API_KEY=asr_xxxxxxxxxxxxxxxxxxxx

# Or pass directly
auralwise --api-key asr_xxxx transcribe ./audio.mp3
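
In wrapper scripts it helps to fail fast when no key is configured. A minimal sketch: require_api_key is a hypothetical helper, and the asr_ prefix check only reflects the example keys shown above, so it warns instead of failing.

```bash
# Hypothetical guard for scripts that call the CLI.
require_api_key() {
  if [ -z "${AURALWISE_API_KEY:-}" ]; then
    echo "AURALWISE_API_KEY is not set (or pass --api-key)" >&2
    return 1
  fi
  # ASSUMPTION: keys start with asr_, based only on the examples in this README.
  case "$AURALWISE_API_KEY" in
    asr_*) ;;
    *) echo "warning: AURALWISE_API_KEY does not start with asr_" >&2 ;;
  esac
  return 0
}
```

Usage: `require_api_key || exit 1` at the top of a script, before the first `auralwise transcribe` call.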

Base URL

Override the API endpoint (default: https://api.auralwise.cn/v1):

auralwise --base-url https://your-private-instance.com/v1 transcribe ./audio.mp3

Language

The CLI supports English and Chinese interfaces:

auralwise --locale zh --help           # Chinese interface
auralwise --locale en transcribe --help  # English interface (default)

Examples

Meeting transcription with speaker diarization

auralwise transcribe ./meeting.mp3 \
  --optimize-zh \
  --language zh \
  --max-speakers 5 \
  --output meeting_result.json

Batch processing (half-price)

# Submit in batch mode — processed during off-peak hours, 50% discount
auralwise transcribe https://storage.example.com/archive.mp3 \
  --batch \
  --no-wait \
  --callback-url https://your-server.com/webhook
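
To push a whole archive through batch mode, loop over the files with the same flags. A sketch: submit_dir_batch is a hypothetical helper, the webhook URL is a placeholder, and because it submits local files, each one goes through the local transcoding path described under `auralwise transcribe`.

```bash
# Hypothetical helper: submit every .mp3 in a directory as a batch task.
# Usage: submit_dir_batch ./archive https://your-server.com/webhook
submit_dir_batch() {
  local dir="$1" callback="$2" f failed=0
  for f in "$dir"/*.mp3; do
    [ -e "$f" ] || continue   # no .mp3 files: the glob is left unexpanded
    if ! auralwise transcribe "$f" --batch --no-wait --callback-url "$callback"; then
      echo "submit failed: $f" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]   # non-zero status if any submission failed
}
```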

Audio event detection only

auralwise transcribe ./audio.mp3 \
  --no-asr \
  --no-diarize \
  --events-classes "Cough,Music,Applause" \
  --json

Transcription only (no diarization, no events)

auralwise transcribe ./podcast.mp3 \
  --no-diarize \
  --no-events \
  --hotwords "AuralWise,PGPU" \
  --output transcript.json

Output Format

Pretty-printed (default)

Audio Duration: 5.3min
Language: zh (99%)
Speakers: 2

Transcription

[0:00.5 - 0:02.3] SPEAKER_0: This is the first sentence
[0:02.5 - 0:04.1] SPEAKER_1: And this is the reply

Audio Events

[0:45.0 - 0:45.9] Cough (87%)
[1:20.0 - 1:25.0] Music (92%)

Speaker Embeddings

  SPEAKER_0: 25 segments, 192-dim vector
  SPEAKER_1: 18 segments, 192-dim vector
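
The 192-dim embeddings are what make cross-recording speaker matching possible. This README does not say how to compare them; cosine similarity is a common choice for speaker embeddings and is assumed here. cosine_sim is a hypothetical helper that takes two comma-separated vectors; extracting them from the --json output depends on the response schema, and any "same speaker" threshold must be tuned on your own data.

```bash
# Hypothetical helper: cosine similarity of two comma-separated vectors,
# e.g. two 192-dim speaker embeddings. Prints a value in [-1, 1].
cosine_sim() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); m = split(b, y, ",")
    if (n != m || n == 0) { print "dimension mismatch" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) { dot += x[i] * y[i]; na += x[i] ^ 2; nb += y[i] ^ 2 }
    if (na == 0 || nb == 0) { print "zero vector" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}
```

Scores closer to 1 indicate more similar voices.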

JSON (`--json`)

Prints the full API response. See the API documentation for the complete schema.
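
When post-processing the JSON, you may want timestamps in the same m:ss.s style as the pretty-printed output. A small sketch, assuming the JSON reports times in seconds (check the schema in the API documentation); fmt_ts is a hypothetical helper, and since the pretty printer's format past one hour is not shown, this version simply keeps counting minutes.

```bash
# Hypothetical helper: format seconds as m:ss.s, e.g. 80 -> 1:20.0.
# Works in whole tenths of a second so rounding never yields "x:60.0".
fmt_ts() {
  awk -v t="$1" 'BEGIN {
    d = int(t * 10 + 0.5)   # total tenths of a second, rounded
    m = int(d / 600)        # whole minutes
    printf "%d:%04.1f\n", m, (d - m * 600) / 10
  }'
}
```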

Pricing

| Capability | Standard | Batch (50% off) |
|-----------|----------|-----------------|
| Chinese transcription | ¥0.27/hr | ¥0.14/hr |
| General transcription (with word timestamps) | ¥1.20/hr | ¥0.60/hr |
| Speaker diarization (labels + embeddings) | +¥0.40/hr | +¥0.20/hr |
| Audio event detection (521 classes) | +¥0.10/hr | +¥0.05/hr |

Example: 100 hours of Chinese meetings with all features in batch mode costs (¥0.14 + ¥0.20 + ¥0.05) × 100 hr = ¥0.39/hr × 100 hr = ¥39.
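
The total follows from the table: add the per-hour rates of the enabled features, then multiply by the hours. A sketch with the rates copied from the table; estimate_cost is a hypothetical helper, and billing granularity is not documented here, so treat its output as an estimate.

```bash
# Hypothetical cost estimator (CNY) using the rates in the table above.
# Args: hours, engine (zh|general), mode (standard|batch), diarize (1|0), events (1|0)
estimate_cost() {
  awk -v h="$1" -v engine="$2" -v mode="$3" -v dia="$4" -v ev="$5" 'BEGIN {
    if (mode == "batch") { zh = 0.14; gen = 0.60; d = 0.20; e = 0.05 }
    else                 { zh = 0.27; gen = 1.20; d = 0.40; e = 0.10 }
    rate = (engine == "zh") ? zh : gen   # base transcription rate
    if (dia == 1) rate += d              # speaker diarization add-on
    if (ev == 1)  rate += e              # audio event detection add-on
    printf "%.2f\n", h * rate
  }'
}
```

For example, `estimate_cost 100 zh batch 1 1` prints 39.00, matching the example above.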

API Documentation

Full API reference: https://auralwise.cn/api-docs

License

MIT
