@ohnrshyp/metadata

v1.0.2

Published

11 days ago

Unified AI-assisted descriptive audio metadata extraction pipeline (Mood, Genre, Instruments, Vocals, Embeddings)

Downloads

437

0High
0Medium
0Low

focusjordan

audio metadata clap panns wav2vec2 zero-shot-classification

@ohnrshyp/metadata

Unified hybrid neural AI audio metadata extraction pipeline.

This package integrates multiple deep learning and digital signal processing (DSP) models to automatically extract comprehensive structural and semantic metadata tags from raw audio files. It is the intelligence layer powering the auto-tagging and verification capabilities of the ORBIT protocol.

🚀 Key AI Pipelines

🧠 Zero-Shot Mood & Vocal Tagger (LAION-CLAP): Uses contrastive text-audio embeddings to detect acoustic mood, vocal presence, vocal gender, and singing style.
🎹 Primary Instrument Identification (PANNs): Leverages Pretrained Audio Neural Networks (CNN14 architecture) to analyze and score 50+ instruments.
🏷️ Genre Classifier (wav2vec2): Classifies the primary musical genre over 10 structural music classifications.
🧬 2048-dim Audio Embeddings (PANNs): Generates highly descriptive semantic vector representations for similarity indexing in vector databases.
⏱️ Classical DSP Features: Integrates tempo (BPM), key/scale, RMS energy, and integrated loudness (dB) directly.
🔗 Stem-Aware Analysis (Demucs): Optionally splits audio into vocal, bass, drum, and instrumental stems for isolated checking.

🔬 Model & Pipeline Architecture

The extraction orchestrator combines local JavaScript-based inferences and Python subprocess modules.

                    ┌─────────────────────────┐
                    │       Audio Input       │
                    └─────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   LAION-CLAP    │     │      PANNs      │     │    wav2vec2     │
│  (Transformers) │     │ (PyTorch Python)│     │  (Transformers) │
└─────────────────┘     └─────────────────┘     └─────────────────┘
     (Mood, Vocals)     (Instruments, Tag)          (10 Genres)
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 ▼
                    ┌─────────────────────────┐
                    │   DSP Signal Engine     │
                    │   (BPM, Key, Loudness)  │
                    └─────────────────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │    Unified JSON/DB      │
                    │     Metadata Object     │
                    └─────────────────────────┘

1. LAION-CLAP Inferences

CLAP models contain joint audio and text encoders trained to project audio waveforms and textual descriptions into a shared latent space. Zero-shot tags are extracted by computing the cosine similarity between the audio embedding $E_a$ and candidate text prompt embeddings $E_t$: $$\text{Confidence}(i) = \frac{e^{S(E_a, E_{t, i}) / \tau}}{\sum_k e^{S(E_a, E_{t, k}) / \tau}}$$ This outputs mood keywords and vocal properties without retraining.

2. PANNs (Pretrained Audio Neural Networks)

PANNs maps the audio signal into a log-mel spectrogram and runs a CNN14 architecture trained on AudioSet (527 classes). By filtering AudioSet indices, we isolate specific instrument classes and extract the 2048-dimensional output from the global pooling layer to serve as our dense audio embedding.

📦 Installation

Node.js (NPM Package)

npm install @ohnrshyp/metadata

🛠️ API Reference

`extractMetadata(input, [options])`

Extracts comprehensive structural and semantic metadata from an audio source.

Parameters:
- input (Buffer | string): Raw binary buffer or absolute path to the target audio file.
- options (Object, optional):
  - includeEmbedding (boolean): Include the 2048-dim PANNs audio vector. Default is false.
  - verbose (boolean): Enable diagnostic logging. Default is false.
  - config (Object): Advanced configuration overrides (see below).

Returns: Promise<Object> containing the following schema:

{
  "genre": [
    { "label": "Electronic", "confidence": 0.8912 }
  ],
  "mood": [
    { "label": "Energetic", "confidence": 0.8412 },
    { "label": "Bright", "confidence": 0.7612 }
  ],
  "instruments": [
    { "label": "Synthesizer", "confidence": 0.912 },
    { "label": "Drum Kit", "confidence": 0.824 }
  ],
  "vocals": {
    "present": true,
    "confidence": 0.9412,
    "gender": "female"
  },
  "bpm": {
    "value": 128.0,
    "confidence": 0.8912
  },
  "key": {
    "value": "A minor",
    "confidence": 0.7412
  },
  "energy": 0.8124,
  "loudness_db": -12.42,
  "danceability": 0.8942,
  "duration": 210.42,
  "sample_rate": 22050,
  "extractionStatus": {
    "clap": "success",
    "panns": "success",
    "genreClassifier": "success",
    "audioAnalysis": "success",
    "embedding": "success"
  }
}

`extractClapOnly(input, [options])`

Performs only CLAP mood/vocals classification. Avoids PyTorch loading overhead, completing in $< 1$ second.

`extractAudioAnalysisOnly(input, [options])`

Performs only CPU-only classical DSP feature extraction (BPM, key, loudness). No machine learning model is loaded.

`checkEnvironment()`

Runs validation diagnostics across all sub-modules.

Returns: Promise<Object> detailing availability flags.

`formatForDatabase(result)`

Formats the extraction result into the JSON structure expected by the ai_metadata JSONB column in PostgreSQL.

`formatEmbeddingForDatabase(embedding)`

Converts the Float32Array embedding vector into the PostgreSQL pgvector string format "[v1,v2,...]" for query parameters.

⚙️ Configuration Flags

Override defaults in the options parameter:

const result = await metadata.extractMetadata(audioBuffer, {
  config: {
    enableClap: true,             // Zero-shot mood/vocals
    enablePanns: true,            // Instrument tagging
    enableGenreClassifier: true,  // wav2vec2 genre classification
    enableEmbedding: true,        // Generate 2048-dim vector
    enableAudioAnalysis: true,    // Run BPM/Key DSP
    enableDemucs: false,          // Run Demucs stem separation
    failOnError: false            // Bypass partial module failures
  }
});

💻 Code Examples

Full Metadata Extraction & Database Formatting

const metadata = require('@ohnrshyp/metadata');
const fs = require('fs');

async function run() {
  const audioBuffer = fs.readFileSync('dance-track.mp3');

  try {
    // Extract metadata including PANNs embedding
    const rawResult = await metadata.extractMetadata(audioBuffer, {
      includeEmbedding: true,
      verbose: true
    });

    console.log(`Primary Genre: ${rawResult.genre[0].label} (${(rawResult.genre[0].confidence * 100).toFixed(1)}%)`);
    console.log(`Mood Tags: ${rawResult.mood.map(m => m.label).join(', ')}`);
    console.log(`BPM: ${rawResult.bpm.value}`);

    // Format for pg/pgvector insertion
    const dbMetadataField = metadata.formatForDatabase(rawResult);
    const dbVectorParam = metadata.formatEmbeddingForDatabase(rawResult.embedding);

    console.log('Formatted pgvector String Prefix:', dbVectorParam.substring(0, 50));
    
  } catch (error) {
    console.error('Metadata extraction failed:', error.message);
  }
}

run();

📄 License

Licensed under the Apache License, Version 2.0 (the "License"). See LICENSE in the project root for details.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@ohnrshyp/metadata

🚀 Key AI Pipelines

🔬 Model & Pipeline Architecture

1. LAION-CLAP Inferences

2. PANNs (Pretrained Audio Neural Networks)

📦 Installation

Node.js (NPM Package)

🛠️ API Reference

extractMetadata(input, [options])

extractClapOnly(input, [options])

extractAudioAnalysisOnly(input, [options])

checkEnvironment()

formatForDatabase(result)

formatEmbeddingForDatabase(embedding)

⚙️ Configuration Flags

💻 Code Examples

Full Metadata Extraction & Database Formatting

📄 License

`extractMetadata(input, [options])`

`extractClapOnly(input, [options])`

`extractAudioAnalysisOnly(input, [options])`

`checkEnvironment()`

`formatForDatabase(result)`

`formatEmbeddingForDatabase(embedding)`