@memberjunction/content-autotagging

v5.28.0

MemberJunction Content Autotagging Application


AI-powered content ingestion, autotagging, and vectorization engine for MemberJunction. Scans content from multiple sources (local files, websites, RSS feeds, cloud storage), extracts text from documents, uses LLMs to generate weighted tags and metadata attributes, and vectorizes content for semantic search.

Overview

The @memberjunction/content-autotagging package provides an extensible framework for ingesting content from diverse sources and leveraging AI models to extract meaningful tags, summaries, and metadata. Built on the MemberJunction platform, it helps organizations automatically organize and categorize their content. The engine uses the managed "Content Autotagging" AI prompt via AIPromptRunner (rather than direct BaseLLM calls), enabling prompt versioning, model routing, and centralized prompt management.

graph TD
    A["AutotagBaseEngine<br/>(Orchestrator)"] --> B["Content Sources"]
    B --> C["Local File System"]
    B --> D["Websites"]
    B --> E["RSS Feeds"]
    B --> F["Cloud Storage<br/>(Azure Blob)"]

    A --> G["Text Extraction"]
    G --> H["PDF Parser"]
    G --> I["Office Parser"]
    G --> J["HTML Parser<br/>(Cheerio)"]

    A --> K["AIPromptRunner<br/>(Content Autotagging prompt)"]
    K --> L["Tag Generation<br/>(with weights)"]
    K --> M["Attribute Extraction"]

    A --> V["Vectorization"]
    V --> W["Embedding Model"]
    V --> X["Vector DB Upsert"]

    A --> N["Content Items<br/>(Database)"]
    A --> O["Content Item Attributes<br/>(Database)"]

    style A fill:#2d6a9f,stroke:#1a4971,color:#fff
    style B fill:#7c5295,stroke:#563a6b,color:#fff
    style C fill:#2d8659,stroke:#1a5c3a,color:#fff
    style D fill:#2d8659,stroke:#1a5c3a,color:#fff
    style E fill:#2d8659,stroke:#1a5c3a,color:#fff
    style F fill:#2d8659,stroke:#1a5c3a,color:#fff
    style G fill:#b8762f,stroke:#8a5722,color:#fff
    style K fill:#7c5295,stroke:#563a6b,color:#fff
    style V fill:#2d6a9f,stroke:#1a4971,color:#fff
    style N fill:#2d6a9f,stroke:#1a4971,color:#fff
    style O fill:#2d6a9f,stroke:#1a4971,color:#fff

Key Features

AIPromptRunner integration: Uses the managed "Content Autotagging" prompt, enabling prompt versioning and model routing through MJ's prompt management system (no direct BaseLLM calls)
Tag weights: Each generated tag includes a relevance weight (0.0 to 1.0) indicating how strongly the tag relates to the content
Batch processing: Configurable batch size (default: 20) with concurrent processing within each batch
Parallel tagging + vectorization: Tagging and vectorization run in parallel for maximum throughput
Per-source/type embedding model selection: Cascade resolution for the embedding model and vector index: source override, then content type default, then global fallback (first active vector index)
Real-time progress reporting: AutotagProgressCallback provides per-item progress updates during processing
Graceful provider skip: Providers skip gracefully when no content sources are configured for their type
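
The batch processing feature above can be pictured with a minimal sketch. It assumes batches run one after another while the items inside a batch run concurrently. The helper name processInBatches and its parameters are hypothetical and not part of the package's API; the engine's internal scheduling may differ.

```typescript
// Hypothetical helper for illustration only; not exported by the package.
async function processInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
  onProgress?: (processed: number, total: number) => void
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Items inside one batch run concurrently; batches run sequentially.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
    // Report progress once per completed batch.
    onProgress?.(results.length, items.length);
  }
  return results;
}
```

A smaller batch size bounds how many LLM or embedding calls are in flight at once, at the cost of more sequential rounds.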

Installation

npm install @memberjunction/content-autotagging

Content Processing Pipeline

sequenceDiagram
    participant Source as Content Source
    participant Engine as AutotagBaseEngine
    participant Extract as Text Extractor
    participant Prompt as AIPromptRunner
    participant Vec as Embedding + VectorDB
    participant DB as Database

    Source->>Engine: Provide content items
    Engine->>Engine: Change detection (checksum)
    Engine->>Extract: Extract text (PDF/Office/HTML)
    Extract-->>Engine: Raw text
    Engine->>Engine: Chunk text for token limits
    par Tagging
        Engine->>Prompt: Run "Content Autotagging" prompt
        Prompt-->>Engine: Tags (with weights) + Attributes
    and Vectorization
        Engine->>Vec: Embed text + upsert to vector DB
        Vec-->>Engine: Vectorization result
    end
    Engine->>DB: Save ContentItem + Tags + Attributes
    Engine->>DB: Create ProcessRun record
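
Two steps of the pipeline above, change detection and the parallel tagging/vectorization block, can be sketched as follows. This is an illustration under stated assumptions: SHA-256 is assumed as the checksum algorithm (the package does not document which one it uses), and checksum, hasChanged, and tagAndVectorize are hypothetical names, not engine methods.

```typescript
import { createHash } from 'node:crypto';

// Assumption: SHA-256 over the extracted text; the real algorithm may differ.
function checksum(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// True when the item is new (no stored checksum) or its text has changed.
function hasChanged(text: string, storedChecksum: string | null): boolean {
  return storedChecksum === null || checksum(text) !== storedChecksum;
}

// Runs the two independent steps concurrently, mirroring the `par` block.
// The tag and vectorize callbacks are placeholders supplied by the caller.
async function tagAndVectorize<TTags, TVec>(
  text: string,
  tag: (text: string) => Promise<TTags>,
  vectorize: (text: string) => Promise<TVec>
): Promise<{ tags: TTags; vectors: TVec }> {
  const [tags, vectors] = await Promise.all([tag(text), vectorize(text)]);
  return { tags, vectors };
}
```

Skipping unchanged items before extraction and prompting avoids paying for LLM and embedding calls on content that has already been processed.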

Content Sources

| Source | Class | Description |
|--------|-------|-------------|
| Local Files | AutotagLocalFileSystem | Scans local directories for documents |
| Websites | AutotagWebsite | Crawls web pages and extracts content |
| RSS Feeds | AutotagRSSFeed | Parses RSS/Atom feeds for articles |
| Azure Blob | AutotagAzureBlob | Processes files from Azure Blob Storage |

All sources extend AutotagBase, which provides the common interface for content discovery and ingestion. Each source's Autotag() method accepts an optional AutotagProgressCallback for real-time progress reporting. Sources skip gracefully when no content sources of their type are configured in the database.

Supported File Formats

| Format | Library | Extensions |
|--------|---------|------------|
| PDF | pdf-parse | .pdf |
| Office Documents | officeparser | .docx, .xlsx, .pptx |
| HTML/Web Pages | cheerio | .html, .htm |
| Plain Text | Native | .txt, .md, .csv |
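
The table above amounts to a lookup from file extension to extraction library. A minimal sketch of such a dispatch follows; the ExtractorKind type and the extractorFor function are hypothetical names used for illustration, and the engine's actual selection logic is not shown in this README.

```typescript
// Library names come from the table above; 'native' stands for plain-text reads.
type ExtractorKind = 'pdf-parse' | 'officeparser' | 'cheerio' | 'native';

const EXTRACTOR_BY_EXTENSION: ReadonlyMap<string, ExtractorKind> = new Map<string, ExtractorKind>([
  ['.pdf', 'pdf-parse'],
  ['.docx', 'officeparser'],
  ['.xlsx', 'officeparser'],
  ['.pptx', 'officeparser'],
  ['.html', 'cheerio'],
  ['.htm', 'cheerio'],
  ['.txt', 'native'],
  ['.md', 'native'],
  ['.csv', 'native'],
]);

// Returns the extractor for a file name, or undefined for unsupported formats.
function extractorFor(fileName: string): ExtractorKind | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) return undefined; // no extension at all
  // Extensions are matched case-insensitively.
  return EXTRACTOR_BY_EXTENSION.get(fileName.slice(dot).toLowerCase());
}
```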

Tag Weights

The LLM prompt returns tags with relevance weights between 0.0 and 1.0 indicating how strongly each tag relates to the content. Both old-style (plain string array) and new-style (object with tag + weight) responses are supported:

// New format (preferred) — returned by the "Content Autotagging" prompt
[
  { "tag": "machine learning", "weight": 0.95 },
  { "tag": "neural networks", "weight": 0.82 },
  { "tag": "data science", "weight": 0.70 }
]

// Legacy format — auto-normalized with weight 1.0
["machine learning", "neural networks", "data science"]
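
Accepting both response shapes can be sketched as a small normalization step. The names WeightedTag, RawTag, and normalizeTags are hypothetical; treating a missing weight as 1.0 and clamping out-of-range weights into 0.0 to 1.0 are assumptions of this sketch, not documented engine behavior.

```typescript
interface WeightedTag {
  tag: string;
  weight: number;
}

// Either a legacy plain string or a new-style object from the prompt.
type RawTag = string | { tag: string; weight?: number };

function normalizeTags(raw: readonly RawTag[]): WeightedTag[] {
  return raw.map((entry) => {
    // Legacy format: plain strings are normalized to weight 1.0.
    if (typeof entry === 'string') {
      return { tag: entry, weight: 1.0 };
    }
    // Assumption: a missing or non-finite weight falls back to 1.0.
    const weight =
      typeof entry.weight === 'number' && Number.isFinite(entry.weight)
        ? entry.weight
        : 1.0;
    // Assumption: clamp into the documented 0.0 to 1.0 range.
    return { tag: entry.tag, weight: Math.min(1, Math.max(0, weight)) };
  });
}
```

Normalizing once at the boundary lets the rest of the pipeline work with a single tag shape.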

Embedding Model and Vector Index Resolution

The engine resolves the embedding model and vector index for each content item using a three-level cascade:

Content Source override: If the source has EmbeddingModelID and VectorIndexID set, those are used
Content Type default: If the source has no override, the content type's defaults are used
Global fallback: If neither the source nor the content type specifies them, the first active vector index in the system is used

Items sharing the same (embeddingModel, vectorIndex) pair are grouped and processed together for efficient batching.
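
The cascade and the grouping step can be sketched as below. The field names EmbeddingModelID and VectorIndexID come from this README; everything else (EmbeddingTarget, resolveTarget, groupByTarget, and passing the global fallback in as an already resolved pair) is a hypothetical simplification. The sketch also assumes that a level applies only when both IDs are set.

```typescript
interface EmbeddingTarget {
  embeddingModelID: string;
  vectorIndexID: string;
}

// Shape shared by the source override and the content type default.
interface EmbeddingConfig {
  EmbeddingModelID?: string | null;
  VectorIndexID?: string | null;
}

function resolveTarget(
  source: EmbeddingConfig,
  contentType: EmbeddingConfig,
  globalFallback: EmbeddingTarget | null
): EmbeddingTarget | null {
  // 1. Content source override.
  if (source.EmbeddingModelID && source.VectorIndexID) {
    return { embeddingModelID: source.EmbeddingModelID, vectorIndexID: source.VectorIndexID };
  }
  // 2. Content type default.
  if (contentType.EmbeddingModelID && contentType.VectorIndexID) {
    return { embeddingModelID: contentType.EmbeddingModelID, vectorIndexID: contentType.VectorIndexID };
  }
  // 3. Global fallback (may be null when no active vector index exists).
  return globalFallback;
}

// Groups items by their (embeddingModel, vectorIndex) pair for batching.
function groupByTarget<T>(
  items: readonly T[],
  targetOf: (item: T) => EmbeddingTarget | null
): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const item of items) {
    const target = targetOf(item);
    if (!target) continue; // nothing to vectorize against
    const key = `${target.embeddingModelID}|${target.vectorIndexID}`;
    const group = groups.get(key);
    if (group) {
      group.push(item);
    } else {
      groups.set(key, [item]);
    }
  }
  return groups;
}
```

Grouping by the resolved pair means each embedding model is invoked on one contiguous set of items and each vector index receives its upserts together.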

Usage

RSS Feed Processing

import { AutotagRSSFeed } from '@memberjunction/content-autotagging';

const rssTagger = new AutotagRSSFeed();
await rssTagger.Autotag(contextUser, (processed, total, currentItem) => {
    console.log(`[${processed}/${total}] Processing: ${currentItem}`);
});

Website Content Processing

import { AutotagWebsite } from '@memberjunction/content-autotagging';

const websiteTagger = new AutotagWebsite();
await websiteTagger.Autotag(contextUser);

Local File System Processing

import { AutotagLocalFileSystem } from '@memberjunction/content-autotagging';

const fileTagger = new AutotagLocalFileSystem();
await fileTagger.Autotag(contextUser);

Azure Blob Storage Processing

import { AutotagAzureBlob } from '@memberjunction/content-autotagging';

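// Fail fast if the connection string is missing instead of passing undefined
const connectionString = process.env.AZURE_STORAGE_CONNECTION_STRING;
if (!connectionString) {
  throw new Error('AZURE_STORAGE_CONNECTION_STRING is not set');
}

const blobTagger = new AutotagAzureBlob(connectionString, 'your-container-name');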
await blobTagger.Authenticate();
await blobTagger.Autotag(contextUser);

Direct Engine Usage

import { AutotagBaseEngine } from '@memberjunction/content-autotagging';

const engine = AutotagBaseEngine.Instance;

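// contentItems, tagMap, contextUser, and batchSize are assumed to be defined by the caller

// Process content items with custom batch size
await engine.ExtractTextAndProcessWithLLM(contentItems, contextUser, batchSize);

// Vectorize content items. The engine pipeline runs this step in parallel with tagging;
// in this snippet it starts only after the await above has completed.
const result = await engine.VectorizeContentItems(contentItems, tagMap, contextUser, batchSize);
console.log(`Vectorized: ${result.vectorized}, Skipped: ${result.skipped}`);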

Creating a Custom Content Source

import { AutotagBase, AutotagProgressCallback } from '@memberjunction/content-autotagging';
import { RegisterClass } from '@memberjunction/global';

@RegisterClass(AutotagBase, 'AutotagCustomSource')
export class AutotagCustomSource extends AutotagBase {
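  public async SetContentItemsToProcess(contentSources) {
    // Collect the content items discovered across all configured sources
    const contentItems = [];
    // Fetch and create content items from your custom source, pushing each into contentItems
    return contentItems;
  }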

  public async Autotag(contextUser, onProgress?: AutotagProgressCallback) {
    const contentSourceTypeID = await this.engine.setSubclassContentSourceType(
      'Custom Source', contextUser
    );
    const contentSources = await this.engine.getAllContentSources(
      contextUser, contentSourceTypeID
    );
    if (contentSources.length === 0) return; // Skip gracefully
    const contentItems = await this.SetContentItemsToProcess(contentSources);
    await this.engine.ExtractTextAndProcessWithLLM(contentItems, contextUser);
  }
}

Database Entities

| Entity | Purpose |
|--------|---------|
| Content Sources | Configuration for each content source (with optional EmbeddingModelID/VectorIndexID overrides) |
| Content Items | Individual pieces of content with extracted text |
| Content Item Tags | AI-generated tags with relevance weights (0.0 to 1.0) |
| Content Item Attributes | Additional extracted metadata |
| Content Process Runs | Processing history and audit trail |
| Content Types | Content categorization definitions (with default EmbeddingModelID/VectorIndexID) |
| Content Source Types | Source type definitions |
| Content File Types | Supported file format definitions |

Dependencies

| Package | Purpose |
|---------|---------|
| @memberjunction/core | Entity system and metadata |
| @memberjunction/global | Class registration |
| @memberjunction/core-entities | Content entity types |
| @memberjunction/ai | Embedding model integration |
| @memberjunction/aiengine | AI Engine for prompt cache access |
| @memberjunction/ai-prompts | AIPromptRunner for managed prompt execution |
| @memberjunction/ai-core-plus | AIPromptParams types |
| @memberjunction/ai-vectors | TextChunker for content chunking |
| @memberjunction/ai-vectordb | VectorDBBase for vector storage |
| pdf-parse | PDF text extraction |
| officeparser | Office document parsing |
| cheerio | HTML parsing |
| axios | HTTP requests for web content |
| rss-parser | RSS feed parsing |

License

ISC
