@memberjunction/ai-vectors

v5.22.0

MemberJunction: AI Vectors Module

@memberjunction/ai-vectors

Core foundation package for vector operations in MemberJunction. Provides text processing utilities (chunking, extraction), base classes for vectorization pipelines, and interfaces for embedding providers and vector databases.

Installation

npm install @memberjunction/ai-vectors

What's Included

| Export | Type | Purpose |
|---|---|---|
| TextChunker | Class | Token-aware text splitting with sentence, paragraph, and fixed strategies |
| TextExtractor | Class | HTML stripping, entity decoding, MIME-type routing, token truncation |
| VectorBase | Class | Base class providing RunView, Metadata, and AIEngine integration for subclasses |
| IEmbedding | Interface | Contract for single and batch text embedding generation |
| IVectorDatabase | Interface | Contract for vector database management (create/delete/list indexes) |
| IVectorIndex | Interface | Contract for CRUD operations on vector records within an index |
| ChunkTextParams | Type | Configuration for TextChunker.ChunkText() |
| TextChunk | Type | Output chunk with text, offsets, token count, and index |
| PageRecordsParams | Type | Paginated entity record retrieval configuration |

Architecture

graph TD
    subgraph Core["@memberjunction/ai-vectors"]
        TC["TextChunker"]
        TE["TextExtractor"]
        VB["VectorBase"]
        IE["IEmbedding"]
        IVD["IVectorDatabase"]
        IVI["IVectorIndex"]
    end

    subgraph MJCore["MemberJunction Core"]
        MD["Metadata"]
        RV["RunView"]
        BE["BaseEntity"]
    end

    subgraph AIEngine["AI Engine"]
        AIM["AIEngine.Instance"]
        MOD["Embedding Models"]
        VDB["Vector Databases"]
    end

    subgraph Consumers["Consumer Packages"]
        SYNC["ai-vector-sync"]
        DUPE["ai-vector-dupe"]
    end

    VB --> MD
    VB --> RV
    VB --> BE
    VB --> AIM
    AIM --> MOD
    AIM --> VDB
    SYNC --> VB
    SYNC --> TC
    SYNC --> TE
    DUPE --> VB

    style Core fill:#2d6a9f,stroke:#1a4971,color:#fff
    style MJCore fill:#2d8659,stroke:#1a5c3a,color:#fff
    style AIEngine fill:#b8762f,stroke:#8a5722,color:#fff
    style Consumers fill:#7c5295,stroke:#563a6b,color:#fff

TextChunker

Token-aware text splitting that respects natural language boundaries. All methods are static.

Strategies

| Strategy | Splits On | Best For |
|---|---|---|
| sentence | Sentence-ending punctuation (. ! ?) | Prose, articles, descriptions |
| paragraph | Double newlines (\n\n) | Structured documents, Markdown, reports |
| fixed | Whitespace boundaries at the character limit | Logs, code, unstructured data |

Basic Usage

import { TextChunker, ChunkTextParams, TextChunk } from '@memberjunction/ai-vectors';

const article = `Machine learning models require training data.
The quality of training data directly impacts model performance.
Data preprocessing is a critical step in any ML pipeline.

Feature engineering transforms raw data into meaningful representations.
Good features can dramatically improve model accuracy.`;

// Sentence strategy (default)
const chunks: TextChunk[] = TextChunker.ChunkText({
    Text: article,
    MaxChunkTokens: 128,
    Strategy: 'sentence'
});

for (const chunk of chunks) {
    console.log(`Chunk ${chunk.Index}: ${chunk.TokenCount} tokens, offset ${chunk.StartOffset}-${chunk.EndOffset}`);
    console.log(chunk.Text);
}

Paragraph Strategy

const markdownDoc = `## Introduction

This document covers the architecture of our data pipeline.
It handles ingestion, transformation, and storage.

## Processing

Records are validated against schema constraints.
Invalid records are routed to a dead-letter queue.

## Storage

Processed data is stored in both relational and vector databases.
Vector embeddings enable semantic search across all records.`;

const chunks = TextChunker.ChunkText({
    Text: markdownDoc,
    MaxChunkTokens: 256,
    Strategy: 'paragraph'
});
// Each paragraph becomes a chunk (or paragraphs merge if they fit together)

Fixed Strategy

const logData = `2024-01-15T10:00:00Z INFO Server started on port 4000
2024-01-15T10:00:01Z INFO Connected to database
2024-01-15T10:00:02Z WARN High memory usage detected: 85%
2024-01-15T10:00:03Z ERROR Connection timeout after 30000ms`;

const chunks = TextChunker.ChunkText({
    Text: logData,
    MaxChunkTokens: 64,
    Strategy: 'fixed'
});

Configuring Overlap

Overlap repeats trailing content from the previous chunk at the start of the next chunk, preserving context across chunk boundaries. Defaults to 10% of MaxChunkTokens.

// Explicit overlap: 50 tokens of shared context between chunks
const chunks = TextChunker.ChunkText({
    Text: longDocument,
    MaxChunkTokens: 512,
    OverlapTokens: 50,
    Strategy: 'sentence'
});

// No overlap
const chunks = TextChunker.ChunkText({
    Text: longDocument,
    MaxChunkTokens: 512,
    OverlapTokens: 0,
    Strategy: 'sentence'
});

Token Estimation

EstimateTokenCount provides a fast approximation using the ~4 characters per token heuristic for English text. This is suitable for chunking where exact counts are not critical.

const tokens = TextChunker.EstimateTokenCount('This is a sample sentence.');
// Returns: 7 (26 characters / 4)

// For production accuracy with specific models, use tiktoken directly
// and pass the result to MaxChunkTokens for precise control
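For illustration, the heuristic itself is small enough to sketch. This is a hypothetical re-implementation, not the package's source; the real method may round or special-case short strings differently:

```typescript
// Hypothetical re-implementation of the ~4 chars/token heuristic (illustration only).
function estimateTokenCount(text: string): number {
    return Math.ceil(text.length / 4);
}

console.log(estimateTokenCount('This is a sample sentence.')); // 26 chars -> 7
```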

TextChunk Output Shape

Each chunk includes full position metadata for traceability back to the source:

interface TextChunk {
    Text: string;        // The chunk text content
    StartOffset: number; // Start character offset in original text
    EndOffset: number;   // End character offset (exclusive)
    TokenCount: number;  // Approximate token count
    Index: number;       // 0-based chunk index
}

TextExtractor

Static utilities for extracting clean plain text from various content formats. Dependency-light (regex-based, no DOM parser required).

HTML Extraction

import { TextExtractor } from '@memberjunction/ai-vectors';

const html = `
<html>
<head><style>body { color: red; }</style></head>
<body>
  <h1>Welcome</h1>
  <p>This is a <strong>formatted</strong> paragraph with &amp; entities.</p>
  <script>alert('removed');</script>
  <ul>
    <li>Item one</li>
    <li>Item two</li>
  </ul>
</body>
</html>`;

const text = TextExtractor.ExtractFromHTML(html);
// "Welcome\nThis is a formatted paragraph with & entities.\nItem one\nItem two"

What it does:

  • Removes <script> and <style> elements entirely
  • Converts block-level elements (<p>, <div>, <h1>-<h6>, <li>, <br>, etc.) to newlines
  • Strips all remaining HTML tags
  • Decodes named entities (&amp;, &lt;, &gt;, &quot;, &nbsp;, &mdash;, &hellip;, etc.)
  • Decodes numeric entities (decimal &#169; and hex &#xA9;)
  • Normalizes whitespace (collapses runs of spaces, limits consecutive newlines to 2)
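The steps above can be sketched as a regex pipeline. This is an illustration of the approach, not the package's actual implementation, which covers many more entities and edge cases:

```typescript
// Simplified sketch of regex-based HTML text extraction (illustration only).
const NAMED_ENTITIES: Record<string, string> = {
    '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&nbsp;': ' ',
};

function extractFromHtml(html: string): string {
    return html
        // Remove <script> and <style> elements entirely
        .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '')
        // Convert common block-level boundaries to newlines
        .replace(/<\/?(p|div|h[1-6]|li|br)[^>]*>/gi, '\n')
        // Strip all remaining tags
        .replace(/<[^>]+>/g, '')
        // Decode a handful of named entities
        .replace(/&(?:amp|lt|gt|quot|nbsp);/g, (m: string) => NAMED_ENTITIES[m] ?? m)
        // Decode decimal (&#169;) and hex (&#xA9;) numeric entities
        .replace(/&#(x?)([0-9a-fA-F]+);/g, (_m: string, hex: string, code: string) =>
            String.fromCodePoint(parseInt(code, hex ? 16 : 10)))
        // Normalize whitespace: collapse spaces/tabs, cap newline runs at two
        .replace(/[ \t]+/g, ' ')
        .replace(/[ \t]*\n[ \t]*/g, '\n')
        .replace(/\n{3,}/g, '\n\n')
        .trim();
}
```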

Plain Text Normalization

const raw = "  Some text\x00with\x07control\x1Fcharacters\n\n\n\n\nand  extra   spaces  ";
const clean = TextExtractor.ExtractFromPlainText(raw);
// "Some textwithcontrolcharacters\n\nand extra spaces"

Removes control characters (\x00-\x1F except \n and \t), normalizes whitespace, trims.

MIME-Type Routing

// Automatically selects the right extraction method
const fromHTML = TextExtractor.ExtractByMimeType(htmlContent, 'text/html');
const fromPlain = TextExtractor.ExtractByMimeType(plainContent, 'text/plain');
const fromCSV = TextExtractor.ExtractByMimeType(csvContent, 'text/csv');  // Falls back to plain text

// For binary formats (PDF, DOCX), extract text with a dedicated library first,
// then pass through ExtractFromPlainText for normalization:
// const pdfText = await pdfParse(buffer);
// const clean = TextExtractor.ExtractFromPlainText(pdfText);

Token Truncation

// Truncate text to fit within a model's context window
const truncated = TextExtractor.TruncateToTokenLimit(veryLongText, 8192);
// Truncates at the last whitespace boundary before the estimated character limit
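The documented behavior can be sketched with the same ~4 chars/token estimate. This is a hypothetical re-implementation for illustration, not the package's source:

```typescript
// Illustrative sketch: truncate to ~maxTokens using the 4 chars/token estimate,
// backing up to the last whitespace so words are not cut mid-way.
function truncateToTokenLimit(text: string, maxTokens: number): string {
    const maxChars = maxTokens * 4;
    if (text.length <= maxChars) return text;
    const slice = text.slice(0, maxChars);
    const lastSpace = slice.lastIndexOf(' ');
    return lastSpace > 0 ? slice.slice(0, lastSpace) : slice;
}
```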

VectorBase

Abstract base class that downstream vector packages extend. Provides integrated access to MemberJunction's Metadata, RunView, and AIEngine systems.

Class Diagram

classDiagram
    class VectorBase {
        +Metadata : Metadata
        +RunView : RunView
        +CurrentUser : UserInfo
        #GetRecordsByEntityID(entityID, recordIDs?) BaseEntity[]
        #PageRecordsByEntityID~T~(params) T[]
        #GetAIModel(id?) MJAIModelEntityExtended
        #GetVectorDatabase(id?) MJVectorDatabaseEntity
        #RunViewForSingleValue~T~(entityName, filter) T | null
        #SaveEntity(entity) boolean
        #BuildExtraFilter(compositeKeys) string
    }

Extending VectorBase

import { VectorBase, PageRecordsParams } from '@memberjunction/ai-vectors';
import { BaseEntity } from '@memberjunction/core';

export class MyVectorProcessor extends VectorBase {
    async ProcessEntity(entityId: string): Promise<void> {
        // Load all records for an entity
        const records = await this.GetRecordsByEntityID(entityId);

        // Access configured AI models and vector databases
        const model = this.GetAIModel();       // First available embedding model
        const vectorDb = this.GetVectorDatabase(); // First available vector DB

        for (const record of records) {
            // Generate embeddings, upsert into vector DB
        }
    }

    async ProcessInPages(entityId: string): Promise<void> {
        let page = 1;
        let hasMore = true;

        while (hasMore) {
            const records = await this.PageRecordsByEntityID<Record<string, unknown>>({
                EntityID: entityId,
                PageNumber: page,
                PageSize: 100,
                ResultType: 'simple',
                Filter: "Status = 'Active'"
            });
            hasMore = records.length === 100;
            page++;
        }
    }
}

Filtering with Composite Keys

import { VectorBase } from '@memberjunction/ai-vectors';
import { CompositeKey } from '@memberjunction/core';

class FilteredProcessor extends VectorBase {
    async GetSpecificRecords(entityId: string): Promise<void> {
        const keys: CompositeKey[] = [
            { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
            { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] }
        ];

        // Generates: (ID = 'abc-123') OR (ID = 'def-456')
        const records = await this.GetRecordsByEntityID(entityId, keys);
    }
}
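Based on the documented output shape, the filter construction can be sketched as follows. The CompositeKey typings here are minimal stand-ins for @memberjunction/core's, and real code must also escape values:

```typescript
// Minimal stand-in types mirroring @memberjunction/core's CompositeKey shape (assumption).
interface KeyValuePair { FieldName: string; Value: string; }
interface CompositeKey { KeyValuePairs: KeyValuePair[]; }

// Sketch of composite-key -> SQL filter conversion; real code must escape values.
function buildExtraFilter(keys: CompositeKey[]): string {
    return keys
        .map(k => '(' + k.KeyValuePairs
            .map(p => `${p.FieldName} = '${p.Value}'`)
            .join(' AND ') + ')')
        .join(' OR ');
}
```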

API Reference

TextChunker (Static Methods)

| Method | Parameters | Returns | Description |
|---|---|---|---|
| ChunkText | params: ChunkTextParams | TextChunk[] | Split text into token-bounded chunks using the specified strategy |
| EstimateTokenCount | text: string | number | Fast token count approximation (~4 chars/token) |

TextExtractor (Static Methods)

| Method | Parameters | Returns | Description |
|---|---|---|---|
| ExtractFromHTML | html: string | string | Strip tags, decode entities, normalize whitespace |
| ExtractFromPlainText | text: string | string | Remove control characters, normalize whitespace |
| ExtractByMimeType | content: string, mimeType: string | string | Route to the appropriate extraction method by MIME type |
| TruncateToTokenLimit | text: string, maxTokens: number | string | Truncate at whitespace boundary within the token budget |

VectorBase (Protected Methods for Subclasses)

| Method | Returns | Description |
|---|---|---|
| GetRecordsByEntityID(entityID, recordIDs?) | Promise<BaseEntity[]> | Load entity records, optionally filtered by composite keys |
| PageRecordsByEntityID<T>(params) | Promise<T[]> | Paginated retrieval with configurable page size and filter |
| GetAIModel(id?) | MJAIModelEntityExtended | Locate an embedding model by ID or get the first available |
| GetVectorDatabase(id?) | MJVectorDatabaseEntity | Locate a vector database by ID or get the first available |
| RunViewForSingleValue<T>(entityName, filter) | Promise<T \| null> | Query for a single entity record matching a filter |
| SaveEntity(entity) | Promise<boolean> | Save a BaseEntity with CurrentUser context applied |
| BuildExtraFilter(compositeKeys) | string | Convert a CompositeKey array to a SQL filter string |

Interfaces

| Interface | Methods | Purpose |
|---|---|---|
| IEmbedding | createEmbedding, createBatchEmbedding | Text embedding generation |
| IVectorDatabase | listIndexes, createIndex, deleteIndex, editIndex | Vector database management |
| IVectorIndex | createRecord(s), getRecord(s), updateRecord(s), deleteRecord(s) | Vector record CRUD |
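To show how the contracts are meant to be consumed, here is a toy embedder written against an assumed IEmbedding shape. The interface below is a local stand-in; the actual parameter and return types in the package may differ:

```typescript
// Assumed shape of IEmbedding; the real interface in @memberjunction/ai-vectors may differ.
interface IEmbedding {
    createEmbedding(text: string): Promise<number[]>;
    createBatchEmbedding(texts: string[]): Promise<number[][]>;
}

// Toy deterministic embedder: buckets character codes into a fixed-size vector.
// Useful only for wiring and testing, not for real semantic similarity.
class ToyEmbedding implements IEmbedding {
    constructor(private dims: number = 8) {}

    async createEmbedding(text: string): Promise<number[]> {
        const vec = new Array<number>(this.dims).fill(0);
        for (let i = 0; i < text.length; i++) {
            vec[text.charCodeAt(i) % this.dims] += 1;
        }
        return vec;
    }

    async createBatchEmbedding(texts: string[]): Promise<number[][]> {
        return Promise.all(texts.map(t => this.createEmbedding(t)));
    }
}
```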

Package Ecosystem

| Package | Depends On Core | Purpose |
|---|---|---|
| @memberjunction/ai-vectordb | No (peer) | Abstract vector database interface |
| @memberjunction/ai-vector-sync | Yes | Entity-to-vector synchronization |
| @memberjunction/ai-vector-dupe | Yes | Duplicate detection via vector similarity |
| @memberjunction/ai-vectors-memory | No | In-memory vector search and clustering |
| @memberjunction/ai-vectors-pinecone | No | Pinecone implementation of VectorDBBase |

Further Reading

  • Text Processing Guide -- in-depth guide on chunking strategies, overlap tuning, HTML edge cases, and integration with vectorization/autotagging pipelines

Development

# Build
npm run build

# Run tests
npm run test

# Watch mode
npm run test:watch

License

ISC