@tricoteuses/transcription-videos (v0.1.1)

Obtain the transcription of Assemblée/Sénat videos by providing an .m3u8 video link as input.
# tricoteuses-transcription-videos

Node.js/TypeScript pipeline to transcribe French Parliament videos (with speaker diarization), from either a .m3u8 URL or a WAV file extracted via ffmpeg.

Output: a JSON array in the Compte-Rendu format of the Assemblée's Data:

```json
[
  {
    "code_grammaire": "PAROLE_GENERIQUE",
    "ordre_absolu_seance": "4",
    "orateurs": { "orateur": { "nom": "speaker A", "id": "", "qualite": "" } },
    "texte": { "_": "Merci monsieur le rapporteur général." }
  },
  {
    "code_grammaire": "PAROLE_GENERIQUE",
    "ordre_absolu_seance": "5",
    "orateurs": { "orateur": { "nom": "speaker D", "id": "", "qualite": "" } },
    "texte": { "_": "Merci monsieur le président, mesdames et messieurs, ..." }
  }
]
```

Timestamps are in milliseconds. `speaker` is a letter (A, B, C…).

Plug-and-play architecture via providers: currently `AssemblyAI` and `Deepgram`. You can plug in additional models later without changing application code.
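The entry shape above can be described with TypeScript types roughly like the following. This is an illustrative sketch, not the package's actual exports; the `toCompteRendu` helper and `RawSegment` name are assumptions based on the format shown here.

```typescript
// Illustrative types mirroring the Compte-Rendu entries shown above.
interface Orateur {
  nom: string;      // "speaker A", "speaker B", …
  id: string;       // empty until the speaker is identified
  qualite: string;
}

interface CompteRenduEntry {
  code_grammaire: "PAROLE_GENERIQUE";
  ordre_absolu_seance: string;           // sequential order, as a string
  orateurs: { orateur: Orateur };
  texte: { _: string };
}

// Hypothetical raw segment: millisecond timestamps, letter speakers.
interface RawSegment { start_ms: number; end_ms: number; speaker: string; text: string }

// Hypothetical mapping from raw diarized segments to Compte-Rendu entries.
function toCompteRendu(segments: RawSegment[], firstOrder = 1): CompteRenduEntry[] {
  return segments.map((s, i) => ({
    code_grammaire: "PAROLE_GENERIQUE",
    ordre_absolu_seance: String(firstOrder + i),
    orateurs: { orateur: { nom: `speaker ${s.speaker}`, id: "", qualite: "" } },
    texte: { _: s.text },
  }));
}
```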
## Table of Contents
- Prerequisites
- Installation
- Usage (single video)
- Usage (batch by reunion UIDs)
- Code Architecture
- Swap providers later
- License
## Prerequisites

- Node.js ≥ 20
- npm
- ffmpeg available in your `PATH` (to extract audio from `.m3u8`): check with `ffmpeg -version`
- Assemblée dataset prepared: must contain `Agenda__nettoye/` for the target legislature.
## Installation

```bash
npm install
cp .env.example .env   # add your model keys
```

## Usage (single video, useful for model testing)

1) Select your provider in `.env`

Set `TRANSCRIPTION_PROVIDER` to one of `deepgram` or `assemblyai`.
2) From a .m3u8 URL (audio extraction + transcription)

```bash
# create the output folders if needed
mkdir -p ./out ./audios

npm run transcribe -- --m3u8 "https://videos-an.vodalys.com/.../master.m3u8" --out ./audios/reunion.wav --ss 0 --t 800 --save ./out/transcript-{model_name}.json
```

CLI Options

- `--ss`: start offset (seconds)
- `--t`: duration (seconds)
- `--out`: WAV path; if omitted, defaults to `os.tmpdir()`
- `--save`: output JSON path (default: `./transcript.json`)
- `--lang fr`: force the language (otherwise uses the `.env` default)
- `--diarize false`: disable diarization (enabled by default)
3) From an existing WAV audio file

```bash
npm run transcribe -- --file C:/path/to/reunion.wav --save ./out/transcript-{model_name}.json
```

## Usage (batch by reunion UIDs, useful in prod)
Process only specific Assemblée reunion UIDs using the dataset loaders. For each UID:

- read `reunion.urlVideo`,
- extract audio to `./audios/<uid>.wav` (skip ffmpeg if the WAV already exists),
- transcribe + diarize the full video with the current provider,
- write segments to `$ASSEMBLEE_DATA_DIR/Videos_<ROMAN_LEGISLATURE>_nettoye/<uid>/transcript.json` (plus an `info.json` with basic metadata).
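The per-UID flow above can be sketched as follows. This is a minimal sketch with the I/O helpers injected as parameters; the real script's loaders, flags, and defaults live in `src/scripts/transcribe_reunions.ts`, and the helper names here are assumptions.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type Transcriber = (opts: { filePath: string; language: string; diarize: boolean })
  => Promise<{ segments: unknown[] }>;

// Sketch of the batch flow: reuse the WAV if present, otherwise extract it,
// then transcribe and write transcript.json under the dataset directory.
async function processUid(
  uid: string,
  opts: {
    dataDir: string;
    legislature: string;                               // roman numeral in paths, e.g. "XVII"
    urlVideo: string;                                  // from reunion.urlVideo
    audioDir?: string;                                 // default ./audios
    extractAudio: (url: string, wav: string) => Promise<void>;
    transcribe: Transcriber;
  },
): Promise<string> {
  const wavPath = path.join(opts.audioDir ?? "./audios", `${uid}.wav`);

  // Skip ffmpeg if the WAV already exists.
  if (!fs.existsSync(wavPath)) {
    fs.mkdirSync(path.dirname(wavPath), { recursive: true });
    await opts.extractAudio(opts.urlVideo, wavPath);
  }

  const result = await opts.transcribe({ filePath: wavPath, language: "fr", diarize: true });

  const outDir = path.join(opts.dataDir, `Videos_${opts.legislature}_nettoye`, uid);
  fs.mkdirSync(outDir, { recursive: true });
  const outFile = path.join(outDir, "transcript.json");
  fs.writeFileSync(outFile, JSON.stringify(result.segments, null, 2));
  return outFile;
}
```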
CLI Options

- `--dataDir`: absolute path to the Assemblée dataset (or as the 1st positional argument). Required.
- `-l, --legislature`: legislature number (e.g., `16` or `17`).
- `-s, --fromSession`: session number to start from (Sénat only).
- `--uids`: comma-separated UIDs (e.g., `uid1,uid2`).
- `--uid`: repeatable UID flag (can be used multiple times).
- `--lang`, `--language`: language code (e.g., `fr`).
- `--diarize`: enable diarization (default: `true`).
- `--no-diarize`: disable diarization.
- `--keepWav`: keep extracted WAV files (default: `true`).
- `--no-keepWav`: delete the WAV after successful transcription.
- `--audioDir`: directory for WAV files (default: `./audios`).
- `--reextract`: force re-extraction even if the WAV exists (default: `false`).
- `--ss`: start offset (seconds).
- `--t`: duration (seconds).
- `-p, --provider`: transcription provider (`assemblyai` | `deepgram`).
Examples

Transcribe all reunions from the 17th legislature (max 50):

```bash
npm run transcribe:reunions ../assemblee-data -- --legislature 17 --provider assemblyai --max 50
```

Force re-extraction + a start timecode to trim the WAV:

```bash
npm run transcribe:reunions -- --dataDir /abs/path/assemblee-data -l 17 --uid RUANR5... --reextract true --ss 796

# Transcribe one AN reunion
npm run transcribe:reunions ../assemblee-data -- --legislature 17 --transcriptsDir ../assemblee-data/transcripts --audioDir ../assemblee-data/audios --provider deepgram --chambre AN --uid RUANR5L17S2025IDC453375

# Transcribe one SN reunion
npm run transcribe:reunions ../senat-data -- --fromSession 2025 --transcriptsDir ../senat-data/transcripts --audioDir ../senat-data/audios --provider deepgram --chambre SN --uid RUSN20251016IDODDF-900
```

## Usage - Transcription Live
Live transcription continuously transcribes an HLS .m3u8 stream, with automatic retries and a clean stop when the stream ends.
It is designed for one job per live stream (e.g., one Kubernetes pod per debate).

Basic CLI usage

```bash
npm run transcribe:live -- --url "https://videos-an.vodalys.com/live/.../index.m3u8" --out ./live-transcripts/live-$(date +%s).ndjson
```

Live CLI options

- `--url` (required): HLS `.m3u8` live URL
- `--out`: NDJSON output file (default: `./live-transcripts/live-<timestamp>.ndjson`)
- `--lang`: language code (default from `.env`)
- `--diarize` / `--no-diarize`: enable/disable diarization (default: enabled)
- `--provider`: transcription provider (`deepgram`, `assemblyai`, …)
- `--model`: provider-specific model (optional)
- `--punctuate` / `--no-punctuate`: enable/disable punctuation
- `--maxMinutes`: stop automatically after N minutes (POC / safety)
Output format (NDJSON)

The output file is append-only, one JSON object per line:

```
{"type":"meta","msg":"live transcription start","url":"..."}
{"type":"segment","start_ms":123400,"end_ms":127800,"speaker":"Speaker A","text":"Hello everyone"}
{"type":"segment","start_ms":128000,"end_ms":132200,"speaker":"Speaker B","text":"Thank you"}
{"type":"meta","msg":"session ended","durSec":6400}
```

This format allows streaming ingestion, retries without duplicates, and easy replay.
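A consumer replaying such a file might look like this. This is an illustrative sketch (not part of the package API), assuming that `(start_ms, speaker)` uniquely identifies a segment so retried sessions can be re-ingested without duplicates.

```typescript
// Replay an NDJSON transcript: keep only "segment" events and deduplicate
// on (start_ms, speaker), so a session retried mid-stream can be ingested twice.
interface LiveEvent {
  type: "meta" | "segment";
  start_ms?: number;
  end_ms?: number;
  speaker?: string;
  text?: string;
  msg?: string;
}

function replaySegments(ndjson: string): LiveEvent[] {
  const seen = new Set<string>();
  const out: LiveEvent[] = [];
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;                 // skip blank trailing lines
    const ev = JSON.parse(line) as LiveEvent;
    if (ev.type !== "segment") continue;        // drop meta events
    const key = `${ev.start_ms}:${ev.speaker}`; // assumed dedup key
    if (seen.has(key)) continue;                // a retry rewrote this segment
    seen.add(key);
    out.push(ev);
  }
  return out;
}
```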
Error handling & retries
- On stream errors or disconnects, the script retries automatically with a short backoff.
- If a session ends too quickly, it is retried until the minimum valid duration is reached.
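The retry behavior can be sketched as below. The backoff and minimum-duration constants are assumptions for illustration, not the script's actual values.

```typescript
// Illustrative retry loop: re-run the live session on errors or on sessions
// that end too quickly, with a short backoff between attempts.
const MIN_VALID_SESSION_SEC = 30;   // assumed threshold
const BACKOFF_MS = 2000;            // assumed short backoff

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// runSession returns the session duration in seconds, or throws on stream error.
async function transcribeWithRetries(
  runSession: () => Promise<number>,
  maxAttempts = 5,
  backoffMs = BACKOFF_MS,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const durSec = await runSession();
      if (durSec >= MIN_VALID_SESSION_SEC) return durSec;
      // Session ended too quickly: fall through and retry.
    } catch {
      // Stream error or disconnect: fall through and retry.
    }
    if (attempt < maxAttempts) await sleep(backoffMs);
  }
  throw new Error("live transcription failed after retries");
}
```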
Production integration

Typical flow:

- The API detects a new `DebatDirect`
- A job/pod is started for this live stream
- `transcribe:live` runs for this single stream
- Segments are pushed incrementally to the API
- When the live ends, the job exits and the debate is marked `TERMINE`

Rule: 1 live = 1 process.
## Code Architecture

```
src/
├─ config/
│  └─ env.ts                    # .env loading & validation
├─ types/
│  └─ transcription.ts          # common types (segments in ms, speakers, metadata)
├─ providers/
│  ├─ TranscriptionProvider.ts  # generic interface
│  ├─ assemblyai.ts             # AssemblyAI implementation
│  ├─ deepgram.ts               # Deepgram implementation
│  ├─ mistral.ts                # Mistral implementation
│  └─ index.ts                  # provider factory based on .env
├─ utils/
│  ├─ ffmpeg.ts                 # .m3u8 → WAV mono 16k extraction
│  └─ transcribe.ts             # single function used by scripts/services
└─ scripts/
   ├─ transcribe_reunions.ts
   └─ transcribe_live.ts
```

## Swap providers later
Application code always calls:

```ts
const result = await transcribeVideo({
  filePath: '/tmp/reunion.wav',
  language: 'fr',
  diarize: true,
});
```

To add another provider:

1. Create `src/providers/myProvider.ts` implementing `TranscriptionProvider`.
2. Add a `case` in `src/providers/index.ts` and a `.env` value (`TRANSCRIPTION_PROVIDER=myProvider`).
3. Map the new API's response to the same types (segments in ms, letter speakers).
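A new provider might look like the sketch below. The `TranscriptionProvider` shape shown here is inferred from this README's description (segments in ms, letter speakers); check `src/providers/TranscriptionProvider.ts` for the real interface, and the external API call is stubbed.

```typescript
// Common segment type: millisecond timestamps, letter speakers.
interface Segment { start_ms: number; end_ms: number; speaker: string; text: string }

// Assumed shape of the generic interface (see src/providers/TranscriptionProvider.ts).
interface TranscriptionProvider {
  name: string;
  transcribe(opts: { filePath: string; language: string; diarize: boolean }): Promise<Segment[]>;
}

// Hypothetical provider mapping some API's response to the common types.
const myProvider: TranscriptionProvider = {
  name: "myProvider",
  async transcribe(_opts) {
    // Call the external API here; below is a stubbed response for illustration.
    const apiResponse = [
      { startSec: 0.5, endSec: 2.1, spk: 0, words: "Bonjour à tous." },
    ];
    const letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return apiResponse.map((r) => ({
      start_ms: Math.round(r.startSec * 1000),  // provider seconds → ms
      end_ms: Math.round(r.endSec * 1000),
      speaker: letters[r.spk] ?? "?",           // numeric speaker id → letter
      text: r.words,
    }));
  },
};
```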
## Docker

Build the image:

```bash
docker build -t transcriber:dev .
```

Run (change `ABSOLUTE_PATH_TO_ASSEMBLEE_DATA`):

```bash
docker run \
  --env-file .env \
  -e LEGISLATURE=17 \
  -v "/ABSOLUTE_PATH_TO_ASSEMBLEE_DATA:/app/assemblee-data" \
  transcriber:dev
```

## License
AGPL-3.0-or-later
