@tricoteuses/transcription-videos (v0.1.1)

Obtain the transcription of Assemblée/Sénat videos by providing an .m3u8 video link as input.
# tricoteuses-transcription-videos

Node.js/TypeScript pipeline to transcribe French Parliament videos (with speaker diarization), from either a .m3u8 URL or a WAV file extracted via ffmpeg.

Output: a JSON array in the Compte-Rendu format of the Assemblée's Data:

```json
[
  {
    "code_grammaire": "PAROLE_GENERIQUE",
    "ordre_absolu_seance": "4",
    "orateurs": { "orateur": { "nom": "speaker A", "id": "", "qualite": "" } },
    "texte": { "_": "Merci monsieur le rapporteur général." }
  },
  {
    "code_grammaire": "PAROLE_GENERIQUE",
    "ordre_absolu_seance": "5",
    "orateurs": { "orateur": { "nom": "speaker D", "id": "", "qualite": "" } },
    "texte": { "_": "Merci monsieur le président, mesdames et messieurs, ..." }
  }
]
```

Timestamps are in milliseconds. `speaker` is a letter (A, B, C…).

Plug-and-play architecture via providers: currently `AssemblyAI` and `Deepgram`. You can plug in additional models later without changing application code.
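The entry shape above can be described with TypeScript types roughly like the following. This is an illustrative sketch, not the package's actual exports; the `toCompteRendu` helper and `RawSegment` name are assumptions based on the format shown here.

```typescript
// Illustrative types mirroring the Compte-Rendu entries shown above.
interface Orateur {
  nom: string;      // "speaker A", "speaker B", …
  id: string;       // empty until the speaker is identified
  qualite: string;
}

interface CompteRenduEntry {
  code_grammaire: "PAROLE_GENERIQUE";
  ordre_absolu_seance: string;           // sequential order, as a string
  orateurs: { orateur: Orateur };
  texte: { _: string };
}

// Hypothetical raw segment: millisecond timestamps, letter speakers.
interface RawSegment { start_ms: number; end_ms: number; speaker: string; text: string }

// Hypothetical mapping from raw diarized segments to Compte-Rendu entries.
function toCompteRendu(segments: RawSegment[], firstOrder = 1): CompteRenduEntry[] {
  return segments.map((s, i) => ({
    code_grammaire: "PAROLE_GENERIQUE",
    ordre_absolu_seance: String(firstOrder + i),
    orateurs: { orateur: { nom: `speaker ${s.speaker}`, id: "", qualite: "" } },
    texte: { _: s.text },
  }));
}
```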
## Table of Contents
- Prerequisites
- Installation
- Usage (single video)
- Usage (batch by reunion UIDs)
- Code Architecture
- Swap providers later
- License
## Prerequisites

- Node.js ≥ 20
- npm
- ffmpeg available in your `PATH` (to extract audio from `.m3u8`): check with `ffmpeg -version`
- Assemblée dataset prepared: must contain `Agenda__nettoye/` for the target legislature.
## Installation

```bash
npm install
cp .env.example .env   # add your model keys
```

## Usage (single video, useful for model testing)

1) Select your provider in `.env`

Set `TRANSCRIPTION_PROVIDER` to one of `deepgram` or `assemblyai`.
2) From a .m3u8 URL (audio extraction + transcription)

```bash
# create the output folders if needed
mkdir -p ./out ./audios

npm run transcribe -- --m3u8 "https://videos-an.vodalys.com/.../master.m3u8" --out ./audios/reunion.wav --ss 0 --t 800 --save ./out/transcript-{model_name}.json
```

CLI Options

- `--ss`: start offset (seconds)
- `--t`: duration (seconds)
- `--out`: WAV path; if omitted, defaults to `os.tmpdir()`
- `--save`: output JSON path (default: `./transcript.json`)
- `--lang fr`: force the language (otherwise uses the `.env` default)
- `--diarize false`: disable diarization (enabled by default)
3) From an existing WAV audio file

```bash
npm run transcribe -- --file C:/path/to/reunion.wav --save ./out/transcript-{model_name}.json
```

## Usage (batch by reunion UIDs, useful in prod)
Process only specific Assemblée reunion UIDs using the dataset loaders. For each UID:

- read `reunion.urlVideo`,
- extract audio to `./audios/<uid>.wav` (skip ffmpeg if the WAV already exists),
- transcribe + diarize the full video with the current provider,
- write segments to `$ASSEMBLEE_DATA_DIR/Videos_<ROMAN_LEGISLATURE>_nettoye/<uid>/transcript.json` (plus an `info.json` with basic metadata).
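The per-UID flow above can be sketched as follows. This is a minimal sketch with the I/O helpers injected as parameters; the real script's loaders, flags, and defaults live in `src/scripts/transcribe_reunions.ts`, and the helper names here are assumptions.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type Transcriber = (opts: { filePath: string; language: string; diarize: boolean })
  => Promise<{ segments: unknown[] }>;

// Sketch of the batch flow: reuse the WAV if present, otherwise extract it,
// then transcribe and write transcript.json under the dataset directory.
async function processUid(
  uid: string,
  opts: {
    dataDir: string;
    legislature: string;                               // roman numeral in paths, e.g. "XVII"
    urlVideo: string;                                  // from reunion.urlVideo
    audioDir?: string;                                 // default ./audios
    extractAudio: (url: string, wav: string) => Promise<void>;
    transcribe: Transcriber;
  },
): Promise<string> {
  const wavPath = path.join(opts.audioDir ?? "./audios", `${uid}.wav`);

  // Skip ffmpeg if the WAV already exists.
  if (!fs.existsSync(wavPath)) {
    fs.mkdirSync(path.dirname(wavPath), { recursive: true });
    await opts.extractAudio(opts.urlVideo, wavPath);
  }

  const result = await opts.transcribe({ filePath: wavPath, language: "fr", diarize: true });

  const outDir = path.join(opts.dataDir, `Videos_${opts.legislature}_nettoye`, uid);
  fs.mkdirSync(outDir, { recursive: true });
  const outFile = path.join(outDir, "transcript.json");
  fs.writeFileSync(outFile, JSON.stringify(result.segments, null, 2));
  return outFile;
}
```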
CLI Options

- `--dataDir`: absolute path to the Assemblée dataset (or as the 1st positional argument). Required.
- `-l, --legislature`: legislature number (e.g., `16` or `17`).
- `-s, --fromSession`: session number to start from (Sénat only).
- `--uids`: comma-separated UIDs (e.g., `uid1,uid2`).
- `--uid`: repeatable UID flag (can be used multiple times).
- `--lang`, `--language`: language code (e.g., `fr`).
- `--diarize`: enable diarization (default: `true`).
- `--no-diarize`: disable diarization.
- `--keepWav`: keep extracted WAV files (default: `true`).
- `--no-keepWav`: delete the WAV after successful transcription.
- `--audioDir`: directory for WAV files (default: `./audios`).
- `--reextract`: force re-extraction even if the WAV exists (default: `false`).
- `--ss`: start offset (seconds).
- `--t`: duration (seconds).
- `-p, --provider`: transcription provider (`assemblyai` | `deepgram`).
Examples

Transcribe all reunions from the 17th legislature (max 50):

```bash
npm run transcribe:reunions ../assemblee-data -- --legislature 17 --provider assemblyai --max 50
```

Force re-extraction + a start timecode to trim the WAV:

```bash
npm run transcribe:reunions -- --dataDir /abs/path/assemblee-data -l 17 --uid RUANR5... --reextract true --ss 796

# Transcribe one AN reunion
npm run transcribe:reunions ../assemblee-data -- --legislature 17 --transcriptsDir ../assemblee-data/transcripts --audioDir ../assemblee-data/audios --provider deepgram --chambre AN --uid RUANR5L17S2025IDC453375

# Transcribe one SN reunion
npm run transcribe:reunions ../senat-data -- --fromSession 2025 --transcriptsDir ../senat-data/transcripts --audioDir ../senat-data/audios --provider deepgram --chambre SN --uid RUSN20251016IDODDF-900
```

## Usage - Transcription Live
Live transcription continuously transcribes an HLS .m3u8 stream, with automatic retries and a clean stop when the stream ends.
It is designed for one job per live stream (e.g., one Kubernetes pod per debate).

Basic CLI usage

```bash
npm run transcribe:live -- --url "https://videos-an.vodalys.com/live/.../index.m3u8" --out ./live-transcripts/live-$(date +%s).ndjson
```

Live CLI options

- `--url` (required): HLS `.m3u8` live URL
- `--out`: NDJSON output file (default: `./live-transcripts/live-<timestamp>.ndjson`)
- `--lang`: language code (default from `.env`)
- `--diarize` / `--no-diarize`: enable/disable diarization (default: enabled)
- `--provider`: transcription provider (`deepgram`, `assemblyai`, …)
- `--model`: provider-specific model (optional)
- `--punctuate` / `--no-punctuate`: enable/disable punctuation
- `--maxMinutes`: stop automatically after N minutes (POC / safety)
Output format (NDJSON)

The output file is append-only, one JSON object per line:

```
{"type":"meta","msg":"live transcription start","url":"..."}
{"type":"segment","start_ms":123400,"end_ms":127800,"speaker":"Speaker A","text":"Hello everyone"}
{"type":"segment","start_ms":128000,"end_ms":132200,"speaker":"Speaker B","text":"Thank you"}
{"type":"meta","msg":"session ended","durSec":6400}
```

This format allows streaming ingestion, retries without duplicates, and easy replay.
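A consumer replaying such a file might look like this. This is an illustrative sketch (not part of the package API), assuming that `(start_ms, speaker)` uniquely identifies a segment so retried sessions can be re-ingested without duplicates.

```typescript
// Replay an NDJSON transcript: keep only "segment" events and deduplicate
// on (start_ms, speaker), so a session retried mid-stream can be ingested twice.
interface LiveEvent {
  type: "meta" | "segment";
  start_ms?: number;
  end_ms?: number;
  speaker?: string;
  text?: string;
  msg?: string;
}

function replaySegments(ndjson: string): LiveEvent[] {
  const seen = new Set<string>();
  const out: LiveEvent[] = [];
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;                 // skip blank trailing lines
    const ev = JSON.parse(line) as LiveEvent;
    if (ev.type !== "segment") continue;        // drop meta events
    const key = `${ev.start_ms}:${ev.speaker}`; // assumed dedup key
    if (seen.has(key)) continue;                // a retry rewrote this segment
    seen.add(key);
    out.push(ev);
  }
  return out;
}
```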
Error handling & retries
- On stream errors or disconnects, the script retries automatically with a short backoff.
- If a session ends too quickly, it is retried until the minimum valid duration is reached.
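The retry behavior can be sketched as below. The backoff and minimum-duration constants are assumptions for illustration, not the script's actual values.

```typescript
// Illustrative retry loop: re-run the live session on errors or on sessions
// that end too quickly, with a short backoff between attempts.
const MIN_VALID_SESSION_SEC = 30;   // assumed threshold
const BACKOFF_MS = 2000;            // assumed short backoff

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// runSession returns the session duration in seconds, or throws on stream error.
async function transcribeWithRetries(
  runSession: () => Promise<number>,
  maxAttempts = 5,
  backoffMs = BACKOFF_MS,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const durSec = await runSession();
      if (durSec >= MIN_VALID_SESSION_SEC) return durSec;
      // Session ended too quickly: fall through and retry.
    } catch {
      // Stream error or disconnect: fall through and retry.
    }
    if (attempt < maxAttempts) await sleep(backoffMs);
  }
  throw new Error("live transcription failed after retries");
}
```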
Production integration

Typical flow:

- The API detects a new `DebatDirect`
- A job/pod is started for this live stream
- `transcribe:live` runs for this single stream
- Segments are pushed incrementally to the API
- When the live ends, the job exits and the debate is marked `TERMINE`

Rule: 1 live = 1 process.
## Code Architecture

```
src/
├─ config/
│  └─ env.ts                    # .env loading & validation
├─ types/
│  └─ transcription.ts          # common types (segments in ms, speakers, metadata)
├─ providers/
│  ├─ TranscriptionProvider.ts  # generic interface
│  ├─ assemblyai.ts             # AssemblyAI implementation
│  ├─ deepgram.ts               # Deepgram implementation
│  ├─ mistral.ts                # Mistral implementation
│  └─ index.ts                  # provider factory based on .env
├─ utils/
│  ├─ ffmpeg.ts                 # .m3u8 → WAV mono 16k extraction
│  └─ transcribe.ts             # single function used by scripts/services
└─ scripts/
   ├─ transcribe_reunions.ts
   └─ transcribe_live.ts
```

## Swap providers later
Application code always calls:

```ts
const result = await transcribeVideo({
  filePath: '/tmp/reunion.wav',
  language: 'fr',
  diarize: true,
});
```

To add another provider:

1. Create `src/providers/myProvider.ts` implementing `TranscriptionProvider`.
2. Add a `case` in `src/providers/index.ts` and a `.env` value (`TRANSCRIPTION_PROVIDER=myProvider`).
3. Map the new API's response to the same types (segments in ms, letter speakers).
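A new provider might look like the sketch below. The `TranscriptionProvider` shape shown here is inferred from this README's description (segments in ms, letter speakers); check `src/providers/TranscriptionProvider.ts` for the real interface, and the external API call is stubbed.

```typescript
// Common segment type: millisecond timestamps, letter speakers.
interface Segment { start_ms: number; end_ms: number; speaker: string; text: string }

// Assumed shape of the generic interface (see src/providers/TranscriptionProvider.ts).
interface TranscriptionProvider {
  name: string;
  transcribe(opts: { filePath: string; language: string; diarize: boolean }): Promise<Segment[]>;
}

// Hypothetical provider mapping some API's response to the common types.
const myProvider: TranscriptionProvider = {
  name: "myProvider",
  async transcribe(_opts) {
    // Call the external API here; below is a stubbed response for illustration.
    const apiResponse = [
      { startSec: 0.5, endSec: 2.1, spk: 0, words: "Bonjour à tous." },
    ];
    const letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return apiResponse.map((r) => ({
      start_ms: Math.round(r.startSec * 1000),  // provider seconds → ms
      end_ms: Math.round(r.endSec * 1000),
      speaker: letters[r.spk] ?? "?",           // numeric speaker id → letter
      text: r.words,
    }));
  },
};
```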
## Docker

Build the image:

```bash
docker build -t transcriber:dev .
```

Run (change `ABSOLUTE_PATH_TO_ASSEMBLEE_DATA`):

```bash
docker run \
  --env-file .env \
  -e LEGISLATURE=17 \
  -v "/ABSOLUTE_PATH_TO_ASSEMBLEE_DATA:/app/assemblee-data" \
  transcriber:dev
```

## License
AGPL-3.0-or-later
