
@tricoteuses/transcription-videos

v0.1.1


Obtains transcriptions of Assemblée/Sénat videos, given an m3u8 video link as input


tricoteuses-transcription-videos

Node.js/TypeScript pipeline to transcribe French Parliament videos (with speaker diarization), from either a .m3u8 URL or a WAV file extracted via ffmpeg.

  • Output: a JSON array in the Compte-Rendu format of the Assemblée's open data:

    [
      {
        "code_grammaire": "PAROLE_GENERIQUE",
        "ordre_absolu_seance": "4",
        "orateurs": {
          "orateur": {
            "nom": "speaker A",
            "id": "",
            "qualite": ""
          }
        },
        "texte": {
          "_": "Merci monsieur le rapporteur général."
        }
      },
      {
        "code_grammaire": "PAROLE_GENERIQUE",
        "ordre_absolu_seance": "5",
        "orateurs": {
          "orateur": {
            "nom": "speaker D",
            "id": "",
            "qualite": ""
          }
        },
        "texte": {
          "_": "Merci monsieur le président, mesdames et messieurs, ..."
        }
      }
    ]

    Timestamps are in milliseconds; speakers are identified by letters (A, B, C…).

  • Plug-and-play architecture via providers: currently AssemblyAI and Deepgram. You can plug additional models later without changing application code.
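The output format above can be expressed as TypeScript types. This is a sketch with hypothetical type names; the real definitions live in src/types/transcription.ts:

```typescript
// Hypothetical type names; check src/types/transcription.ts for the real ones.
interface Orateur {
  nom: string;     // "speaker A", "speaker B", … (diarization letter)
  id: string;      // empty until matched to a known deputy/senator
  qualite: string;
}

interface CompteRenduParagraphe {
  code_grammaire: "PAROLE_GENERIQUE";
  ordre_absolu_seance: string; // sequential order within the sitting, as a string
  orateurs: { orateur: Orateur };
  texte: { _: string };
}

// A transcript is an array of paragraphs, mirroring the JSON above.
type Transcript = CompteRenduParagraphe[];

const example: Transcript = [
  {
    code_grammaire: "PAROLE_GENERIQUE",
    ordre_absolu_seance: "4",
    orateurs: { orateur: { nom: "speaker A", id: "", qualite: "" } },
    texte: { _: "Merci monsieur le rapporteur général." },
  },
];
```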


Prerequisites

  • Node.js ≥ 20
  • npm
  • ffmpeg available in your PATH (to extract audio from .m3u8):
    ffmpeg -version
  • Assemblée dataset prepared:
    • Must contain: Agenda__nettoye/ for the target legislature.

Installation

npm install
cp .env.example .env   # add your provider API keys

Usage (single video, useful for model testing)

1) Select your provider in .env

Set TRANSCRIPTION_PROVIDER to deepgram or assemblyai.

2) From a .m3u8 URL (audio extraction + transcription)

# create the output folders if needed
mkdir -p ./out ./audios

npm run transcribe -- \
  --m3u8 "https://videos-an.vodalys.com/.../master.m3u8" \
  --out ./audios/reunion.wav \
  --ss 0 \
  --t 800 \
  --save ./out/transcript-{model_name}.json

CLI Options

  • --ss: start offset (seconds)
  • --t: duration (seconds)
  • --out: WAV path (default: os.tmpdir())
  • --save: output JSON path (default: ./transcript.json)
  • --lang: force language, e.g. fr (otherwise uses the .env default)
  • --diarize false: disable diarization (enabled by default)

3) From an existing WAV audio file

npm run transcribe -- \
  --file C:/path/to/reunion.wav \
  --save ./out/transcript-{model_name}.json

Usage (batch by reunion UIDs, useful in prod)

Process only specific Assemblée reunion UIDs using the dataset loaders. For each UID:

  1. read reunion.urlVideo,
  2. extract audio to ./audios/<uid>.wav (skip ffmpeg if the WAV already exists),
  3. transcribe + diarize the full video with the current provider,
  4. write segments to $ASSEMBLEE_DATA_DIR/Videos_<ROMAN_LEGISLATURE>_nettoye/<uid>/transcript.json
    (+ info.json with basic metadata).
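The output path embeds the legislature as a Roman numeral (e.g. Videos_XVII_nettoye for legislature 17). A minimal sketch of that conversion, with a hypothetical helper name:

```typescript
// Hypothetical helper: converts a legislature number (e.g. 17) to the Roman
// numeral used in the dataset directory name (e.g. "XVII").
function toRoman(n: number): string {
  const table: [number, string][] = [
    [1000, "M"], [900, "CM"], [500, "D"], [400, "CD"],
    [100, "C"], [90, "XC"], [50, "L"], [40, "XL"],
    [10, "X"], [9, "IX"], [5, "V"], [4, "IV"], [1, "I"],
  ];
  let out = "";
  for (const [value, symbol] of table) {
    while (n >= value) { out += symbol; n -= value; }
  }
  return out;
}

// e.g. the transcript directory for legislature 17 and a given uid:
const uid = "RUANR5L17S2025IDC453375";
const dir = `Videos_${toRoman(17)}_nettoye/${uid}`;
```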

CLI Options

  • --dataDir: Absolute path to Assemblée dataset (or as 1st positional). Required.
  • -l, --legislature: Legislature number (e.g., 16 or 17).
  • -s, --fromSession: Session number to start from (Sénat only).
  • --uids: Comma-separated UIDs (e.g., uid1,uid2).
  • --uid: Repeatable UID flag (can be used multiple times).
  • --lang, --language: Language code (e.g., fr).
  • --diarize: Enable diarization (default: true).
  • --no-diarize: Disable diarization.
  • --keepWav: Keep extracted WAV files (default: true).
  • --no-keepWav: Delete WAV after successful transcription.
  • --audioDir: Directory for WAV files (default: ./audios).
  • --reextract: Force re-extraction even if WAV exists (default: false).
  • --ss: Start offset (seconds).
  • --t: Duration (seconds).
  • -p, --provider: Transcription provider (assemblyai | deepgram).

Examples

Transcribe all reunions from the 17th legislature (max 50):

npm run transcribe:reunions ../assemblee-data -- --legislature 17 --provider assemblyai --max 50

Force re-extraction with a start offset (--ss) to trim the WAV:

npm run transcribe:reunions -- \
  --dataDir /abs/path/assemblee-data \
  -l 17 \
  --uid RUANR5... \
  --reextract true \
  --ss 796

# Transcribe one AN reunion
npm run transcribe:reunions ../assemblee-data -- \
  --legislature 17 \
  --transcriptsDir ../assemblee-data/transcripts \
  --audioDir ../assemblee-data/audios \
  --provider deepgram \
  --chambre AN \
  --uid RUANR5L17S2025IDC453375

# Transcribe one SN reunion
npm run transcribe:reunions ../senat-data -- \
  --fromSession 2025 \
  --transcriptsDir ../senat-data/transcripts \
  --audioDir ../senat-data/audios \
  --provider deepgram \
  --chambre SN \
  --uid RUSN20251016IDODDF-900

Usage - Transcription Live

Live transcription continuously transcribes an HLS .m3u8 stream, with automatic retries and a clean stop when the stream ends.
It is designed for one job per live (e.g. one Kubernetes pod per debate).

Basic CLI usage

npm run transcribe:live -- \
  --url "https://videos-an.vodalys.com/live/.../index.m3u8" \
  --out ./live-transcripts/live-$(date +%s).ndjson

Live CLI options

  • --url (required): HLS .m3u8 live URL
  • --out: NDJSON output file (default: ./live-transcripts/live-<timestamp>.ndjson)
  • --lang: language code (default from .env)
  • --diarize / --no-diarize: enable/disable diarization (default: enabled)
  • --provider: transcription provider (deepgram, assemblyai, …)
  • --model: provider-specific model (optional)
  • --punctuate / --no-punctuate: enable/disable punctuation
  • --maxMinutes: stop automatically after N minutes (POC / safety)

Output format (NDJSON)

The output file is append-only, one JSON object per line:

{"type":"meta","msg":"live transcription start","url":"..."}
{"type":"segment","start_ms":123400,"end_ms":127800,"speaker":"Speaker A","text":"Hello everyone"}
{"type":"segment","start_ms":128000,"end_ms":132200,"speaker":"Speaker B","text":"Thank you"}
{"type":"meta","msg":"session ended","durSec":6400}

This format allows streaming ingestion, retries without duplicates, and easy replay.
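A consumer of this file only needs to read it line by line, parse each line as JSON, and keep the segment records. A minimal sketch, with types inferred from the example above:

```typescript
// Sketch: parse an NDJSON transcript into segment objects, skipping meta lines.
// Field names follow the NDJSON example above.
interface LiveSegment {
  type: "segment";
  start_ms: number;
  end_ms: number;
  speaker: string;
  text: string;
}

function parseSegments(ndjson: string): LiveSegment[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line))
    .filter((obj): obj is LiveSegment => obj.type === "segment");
}

const sample = [
  '{"type":"meta","msg":"live transcription start","url":"..."}',
  '{"type":"segment","start_ms":123400,"end_ms":127800,"speaker":"Speaker A","text":"Hello everyone"}',
  '{"type":"meta","msg":"session ended","durSec":6400}',
].join("\n");

const segments = parseSegments(sample); // one segment, from Speaker A
```

Because each line is self-delimiting, an ingester can resume from any byte offset at a line boundary, which is what makes retries without duplicates cheap.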

Error handling & retries

  • On stream errors or disconnects, the script retries automatically with a short backoff.
  • If a session ends too quickly, it is retried until the minimum valid duration is reached.
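The retry behaviour described above can be sketched as a small generic helper (names and backoff schedule are illustrative, not the script's actual values):

```typescript
// Sketch of retry-with-backoff: retry an async operation with a short,
// growing delay between attempts. Hypothetical helper, not the real code.
async function withRetries<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // linear backoff: 1s, 2s, 3s, …
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
      }
    }
  }
  throw lastError;
}
```

The "session ended too quickly" check maps naturally onto this: treat a too-short session as a failure inside `op`, so the same loop re-runs it.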

Production integration

Typical flow:

  1. API detects a new DebatDirect
  2. A job/pod is started for this live
  3. transcribe:live runs for this single stream
  4. Segments are pushed incrementally to the API
  5. When the live ends, the job exits and the debate is marked TERMINE

Rule: 1 live = 1 process.

Code Architecture

src/
├─ config/
│  └─ env.ts                   # .env loading & validation
├─ types/
│  └─ transcription.ts         # common types (segments in ms, speakers, metadata)
├─ providers/
│  ├─ TranscriptionProvider.ts # generic interface
│  ├─ assemblyai.ts            # AssemblyAI implementation
│  ├─ deepgram.ts              # Deepgram implementation
│  ├─ mistral.ts               # Mistral implementation
│  └─ index.ts                 # provider factory based on .env
├─ utils/
│  ├─ ffmpeg.ts                # .m3u8 → WAV mono 16k extraction
│  └─ transcribe.ts            # single function used by scripts/services
└─ scripts/
   ├─ transcribe_reunions.ts
   └─ transcribe_live.ts

Swap providers later

Application code always calls:

const result = await transcribeVideo({
  filePath: '/tmp/reunion.wav',
  language: 'fr',
  diarize: true,
});

To add another provider:

  1. Create src/providers/myProvider.ts implementing TranscriptionProvider.
  2. Add a case in src/providers/index.ts and a .env value (TRANSCRIPTION_PROVIDER=myProvider).
  3. Map the new API’s response to the same types (segments in ms, letter speakers).
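The steps above can be sketched as follows. The interface shape is an assumption based on the transcribeVideo call and the segment format shown earlier; check src/providers/TranscriptionProvider.ts for the real contract:

```typescript
// src/providers/myProvider.ts (sketch; hypothetical interface shape)
interface TranscriptSegment {
  start_ms: number;
  end_ms: number;
  speaker: string; // letter: "A", "B", …
  text: string;
}

interface TranscribeOptions {
  filePath: string;
  language?: string;
  diarize?: boolean;
}

interface TranscriptionProvider {
  name: string;
  transcribe(opts: TranscribeOptions): Promise<TranscriptSegment[]>;
}

// Exported in the real file so src/providers/index.ts can select it
// when TRANSCRIPTION_PROVIDER=myProvider.
const myProvider: TranscriptionProvider = {
  name: "myProvider",
  async transcribe(opts) {
    // 1. call the new API with opts.filePath / opts.language / opts.diarize
    // 2. map its response to TranscriptSegment[] (timestamps in ms, letter speakers)
    return [];
  },
};
```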

Docker

Build the image

docker build -t transcriber:dev .

Run (replace ABSOLUTE_PATH_TO_ASSEMBLEE_DATA with your dataset path)

docker run \
  --env-file .env \
  -e LEGISLATURE=17 \
  -v "/ABSOLUTE_PATH_TO_ASSEMBLEE_DATA:/app/assemblee-data" \
  transcriber:dev

License

AGPL-3.0-or-later