# @memberjunction/ai-vectors
Core foundation package for vector operations in MemberJunction. Provides text processing utilities (chunking, extraction), base classes for vectorization pipelines, and interfaces for embedding providers and vector databases.
## Installation

```bash
npm install @memberjunction/ai-vectors
```

## What's Included
| Export | Type | Purpose |
|---|---|---|
| `TextChunker` | Class | Token-aware text splitting with sentence, paragraph, and fixed strategies |
| `TextExtractor` | Class | HTML stripping, entity decoding, MIME-type routing, token truncation |
| `VectorBase` | Class | Base class providing RunView, Metadata, AIEngine integration for subclasses |
| `IEmbedding` | Interface | Contract for single and batch text embedding generation |
| `IVectorDatabase` | Interface | Contract for vector database management (create/delete/list indexes) |
| `IVectorIndex` | Interface | Contract for CRUD operations on vector records within an index |
| `ChunkTextParams` | Type | Configuration for `TextChunker.ChunkText()` |
| `TextChunk` | Type | Output chunk with text, offsets, token count, and index |
| `PageRecordsParams` | Type | Paginated entity record retrieval configuration |
## Architecture

```mermaid
graph TD
    subgraph Core["@memberjunction/ai-vectors"]
        TC["TextChunker"]
        TE["TextExtractor"]
        VB["VectorBase"]
        IE["IEmbedding"]
        IVD["IVectorDatabase"]
        IVI["IVectorIndex"]
    end
    subgraph MJCore["MemberJunction Core"]
        MD["Metadata"]
        RV["RunView"]
        BE["BaseEntity"]
    end
    subgraph AIEngine["AI Engine"]
        AIM["AIEngine.Instance"]
        MOD["Embedding Models"]
        VDB["Vector Databases"]
    end
    subgraph Consumers["Consumer Packages"]
        SYNC["ai-vector-sync"]
        DUPE["ai-vector-dupe"]
    end
    VB --> MD
    VB --> RV
    VB --> BE
    VB --> AIM
    AIM --> MOD
    AIM --> VDB
    SYNC --> VB
    SYNC --> TC
    SYNC --> TE
    DUPE --> VB
    style Core fill:#2d6a9f,stroke:#1a4971,color:#fff
    style MJCore fill:#2d8659,stroke:#1a5c3a,color:#fff
    style AIEngine fill:#b8762f,stroke:#8a5722,color:#fff
    style Consumers fill:#7c5295,stroke:#563a6b,color:#fff
```

## TextChunker
Token-aware text splitting that respects natural language boundaries. All methods are static.
### Strategies

| Strategy | Splits On | Best For |
|---|---|---|
| `sentence` | Sentence-ending punctuation (`.` `!` `?`) | Prose, articles, descriptions |
| `paragraph` | Double newlines (`\n\n`) | Structured documents, Markdown, reports |
| `fixed` | Whitespace boundaries at the character limit | Logs, code, unstructured data |
### Basic Usage

```typescript
import { TextChunker, ChunkTextParams, TextChunk } from '@memberjunction/ai-vectors';

const article = `Machine learning models require training data.
The quality of training data directly impacts model performance.
Data preprocessing is a critical step in any ML pipeline.
Feature engineering transforms raw data into meaningful representations.
Good features can dramatically improve model accuracy.`;

// Sentence strategy (default)
const chunks: TextChunk[] = TextChunker.ChunkText({
  Text: article,
  MaxChunkTokens: 128,
  Strategy: 'sentence'
});

for (const chunk of chunks) {
  console.log(`Chunk ${chunk.Index}: ${chunk.TokenCount} tokens, offset ${chunk.StartOffset}-${chunk.EndOffset}`);
  console.log(chunk.Text);
}
```

### Paragraph Strategy
```typescript
const markdownDoc = `## Introduction
This document covers the architecture of our data pipeline.
It handles ingestion, transformation, and storage.

## Processing
Records are validated against schema constraints.
Invalid records are routed to a dead-letter queue.

## Storage
Processed data is stored in both relational and vector databases.
Vector embeddings enable semantic search across all records.`;

const chunks = TextChunker.ChunkText({
  Text: markdownDoc,
  MaxChunkTokens: 256,
  Strategy: 'paragraph'
});
// Each paragraph becomes a chunk (or paragraphs merge if they fit together)
```

### Fixed Strategy
```typescript
const logData = `2024-01-15T10:00:00Z INFO Server started on port 4000
2024-01-15T10:00:01Z INFO Connected to database
2024-01-15T10:00:02Z WARN High memory usage detected: 85%
2024-01-15T10:00:03Z ERROR Connection timeout after 30000ms`;

const chunks = TextChunker.ChunkText({
  Text: logData,
  MaxChunkTokens: 64,
  Strategy: 'fixed'
});
```

### Configuring Overlap
Overlap repeats trailing content from the previous chunk at the start of the next chunk, preserving context across chunk boundaries. It defaults to 10% of `MaxChunkTokens`.
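To picture the mechanic, here is a strategy-agnostic sketch of windowed overlap over words (illustrative only; the real `TextChunker` operates on estimated tokens and respects sentence or paragraph boundaries):

```typescript
// Each window repeats the last `overlap` words of the previous window,
// so content spanning a boundary appears in both chunks.
function chunkWords(words: string[], size: number, overlap: number): string[][] {
  const step = size - overlap;
  const out: string[][] = [];
  for (let i = 0; i < words.length; i += step) {
    out.push(words.slice(i, i + size));
    if (i + size >= words.length) break;
  }
  return out;
}

chunkWords('a b c d e f g h i j'.split(' '), 4, 1);
// [['a','b','c','d'], ['d','e','f','g'], ['g','h','i','j']]
```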
```typescript
// Explicit overlap: 50 tokens of shared context between chunks
const chunks = TextChunker.ChunkText({
  Text: longDocument,
  MaxChunkTokens: 512,
  OverlapTokens: 50,
  Strategy: 'sentence'
});

// No overlap
const noOverlapChunks = TextChunker.ChunkText({
  Text: longDocument,
  MaxChunkTokens: 512,
  OverlapTokens: 0,
  Strategy: 'sentence'
});
```

### Token Estimation
`EstimateTokenCount` provides a fast approximation using the ~4-characters-per-token heuristic for English text. This is suitable for chunking, where exact counts are not critical.
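The heuristic amounts to a ceiling division by four; a standalone re-implementation (a sketch, not the library's actual code) makes the rounding explicit:

```typescript
// ~4 characters per token: 26 chars -> ceil(26 / 4) = 7 tokens.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

estimateTokens('This is a sample sentence.'); // 7
```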
```typescript
const tokens = TextChunker.EstimateTokenCount('This is a sample sentence.');
// Returns: 7 (26 characters / 4)

// For production accuracy with specific models, use tiktoken directly
// and pass the result to MaxChunkTokens for precise control
```

### TextChunk Output Shape
Each chunk includes full position metadata for traceability back to the source:
```typescript
interface TextChunk {
  Text: string;        // The chunk text content
  StartOffset: number; // Start character offset in original text
  EndOffset: number;   // End character offset (exclusive)
  TokenCount: number;  // Approximate token count
  Index: number;       // 0-based chunk index
}
```

## TextExtractor
Static utilities for extracting clean plain text from various content formats. Dependency-light (regex-based, no DOM parser required).
### HTML Extraction

```typescript
import { TextExtractor } from '@memberjunction/ai-vectors';

const html = `
<html>
<head><style>body { color: red; }</style></head>
<body>
<h1>Welcome</h1>
<p>This is a <strong>formatted</strong> paragraph with &amp; entities.</p>
<script>alert('removed');</script>
<ul>
<li>Item one</li>
<li>Item two</li>
</ul>
</body>
</html>`;

const text = TextExtractor.ExtractFromHTML(html);
// "Welcome\nThis is a formatted paragraph with & entities.\nItem one\nItem two"
```

**What it does:**

- Removes `<script>` and `<style>` elements entirely
- Converts block-level elements (`<p>`, `<div>`, `<h1>`-`<h6>`, `<li>`, `<br>`, etc.) to newlines
- Strips all remaining HTML tags
- Decodes named entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&nbsp;`, `&mdash;`, `&hellip;`, etc.)
- Decodes numeric entities (decimal `&#169;` and hex `&#xA9;`)
- Normalizes whitespace (collapses runs of spaces, limits consecutive newlines to 2)
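Those steps can be approximated in standalone code (an illustrative sketch; the library decodes far more entities and handles more edge cases):

```typescript
// Simplified HTML-to-text following the same step order as ExtractFromHTML.
function htmlToText(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '')           // drop script/style blocks
    .replace(/<\/?(p|div|h[1-6]|li|br|ul|ol)[^>]*>/gi, '\n')  // block elements -> newlines
    .replace(/<[^>]+>/g, '')                                  // strip remaining tags
    .replace(/&amp;/g, '&').replace(/&lt;/g, '<')             // decode a few named entities
    .replace(/&gt;/g, '>').replace(/&quot;/g, '"')
    .replace(/&#(\d+);/g, (_, d) => String.fromCharCode(Number(d))) // decimal numeric entities
    .replace(/[^\S\n]+/g, ' ')                                // collapse non-newline whitespace
    .replace(/\s*\n\s*/g, '\n')                               // tidy whitespace around newlines
    .replace(/\n{3,}/g, '\n\n')                               // cap consecutive newlines at two
    .trim();
}
```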
### Plain Text Normalization

```typescript
const raw = " Some text\x00with\x07control\x1Fcharacters\n\n\n\n\nand extra spaces ";
const clean = TextExtractor.ExtractFromPlainText(raw);
// "Some textwithcontrolcharacters\n\nand extra spaces"
```

Removes control characters (`\x00`-`\x1F` except `\n` and `\t`), normalizes whitespace, and trims.
MIME-Type Routing
// Automatically selects the right extraction method
const fromHTML = TextExtractor.ExtractByMimeType(htmlContent, 'text/html');
const fromPlain = TextExtractor.ExtractByMimeType(plainContent, 'text/plain');
const fromCSV = TextExtractor.ExtractByMimeType(csvContent, 'text/csv'); // Falls back to plain text
// For binary formats (PDF, DOCX), extract text with a dedicated library first,
// then pass through ExtractFromPlainText for normalization:
// const pdfText = await pdfParse(buffer);
// const clean = TextExtractor.ExtractFromPlainText(pdfText);Token Truncation
```typescript
// Truncate text to fit within a model's context window
const truncated = TextExtractor.TruncateToTokenLimit(veryLongText, 8192);
// Truncates at the last whitespace boundary before the estimated character limit
```

## VectorBase

Abstract base class that downstream vector packages extend. Provides integrated access to MemberJunction's Metadata, RunView, and AIEngine systems.

### Class Diagram

```mermaid
classDiagram
    class VectorBase {
        +Metadata : Metadata
        +RunView : RunView
        +CurrentUser : UserInfo
        #GetRecordsByEntityID(entityID, recordIDs?) BaseEntity[]
        #PageRecordsByEntityID~T~(params) T[]
        #GetAIModel(id?) MJAIModelEntityExtended
        #GetVectorDatabase(id?) MJVectorDatabaseEntity
        #RunViewForSingleValue~T~(entityName, filter) T | null
        #SaveEntity(entity) boolean
        #BuildExtraFilter(compositeKeys) string
    }
```

### Extending VectorBase
```typescript
import { VectorBase, PageRecordsParams } from '@memberjunction/ai-vectors';
import { BaseEntity } from '@memberjunction/core';

export class MyVectorProcessor extends VectorBase {
  async ProcessEntity(entityId: string): Promise<void> {
    // Load all records for an entity
    const records = await this.GetRecordsByEntityID(entityId);

    // Access configured AI models and vector databases
    const model = this.GetAIModel();           // First available embedding model
    const vectorDb = this.GetVectorDatabase(); // First available vector DB

    for (const record of records) {
      // Generate embeddings, upsert into vector DB
    }
  }

  async ProcessInPages(entityId: string): Promise<void> {
    let page = 1;
    let hasMore = true;
    while (hasMore) {
      const records = await this.PageRecordsByEntityID<Record<string, unknown>>({
        EntityID: entityId,
        PageNumber: page,
        PageSize: 100,
        ResultType: 'simple',
        Filter: "Status = 'Active'"
      });
      hasMore = records.length === 100;
      page++;
    }
  }
}
```

### Filtering with Composite Keys
```typescript
import { VectorBase } from '@memberjunction/ai-vectors';
import { CompositeKey } from '@memberjunction/core';

class FilteredProcessor extends VectorBase {
  async GetSpecificRecords(entityId: string): Promise<void> {
    const keys: CompositeKey[] = [
      { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
      { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] }
    ];

    // Generates: (ID = 'abc-123') OR (ID = 'def-456')
    const records = await this.GetRecordsByEntityID(entityId, keys);
  }
}
```

## API Reference

### TextChunker (Static Methods)
| Method | Parameters | Returns | Description |
|---|---|---|---|
| `ChunkText` | `params: ChunkTextParams` | `TextChunk[]` | Split text into token-bounded chunks using the specified strategy |
| `EstimateTokenCount` | `text: string` | `number` | Fast token-count approximation (~4 chars/token) |
### TextExtractor (Static Methods)

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `ExtractFromHTML` | `html: string` | `string` | Strip tags, decode entities, normalize whitespace |
| `ExtractFromPlainText` | `text: string` | `string` | Remove control characters, normalize whitespace |
| `ExtractByMimeType` | `content: string, mimeType: string` | `string` | Route to the appropriate extraction method by MIME type |
| `TruncateToTokenLimit` | `text: string, maxTokens: number` | `string` | Truncate at a whitespace boundary within the token budget |
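`TruncateToTokenLimit`'s behavior can be sketched with the same ~4-chars-per-token heuristic (illustrative only; details such as rounding and fallback behavior may differ in the real implementation):

```typescript
// Cut at the last whitespace boundary before maxTokens * 4 characters.
function truncateToTokenLimit(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  if (text.length <= maxChars) return text;
  const cut = text.lastIndexOf(' ', maxChars);
  return text.slice(0, cut > 0 ? cut : maxChars);
}

truncateToTokenLimit('aaa bbb ccc', 2); // "aaa bbb"
```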
### VectorBase (Protected Methods for Subclasses)

| Method | Returns | Description |
|---|---|---|
| `GetRecordsByEntityID(entityID, recordIDs?)` | `Promise<BaseEntity[]>` | Load entity records, optionally filtered by composite keys |
| `PageRecordsByEntityID<T>(params)` | `Promise<T[]>` | Paginated retrieval with configurable page size and filter |
| `GetAIModel(id?)` | `MJAIModelEntityExtended` | Locate an embedding model by ID, or get the first available |
| `GetVectorDatabase(id?)` | `MJVectorDatabaseEntity` | Locate a vector database by ID, or get the first available |
| `RunViewForSingleValue<T>(entityName, filter)` | `Promise<T \| null>` | Query for a single entity record matching a filter |
| `SaveEntity(entity)` | `Promise<boolean>` | Save a BaseEntity with CurrentUser context applied |
| `BuildExtraFilter(compositeKeys)` | `string` | Convert a CompositeKey array to a SQL filter string |
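For intuition, `BuildExtraFilter`'s documented output can be reproduced with a standalone sketch using simplified stand-in types (the real `CompositeKey` class lives in `@memberjunction/core`):

```typescript
type KeyValuePair = { FieldName: string; Value: string };
type CompositeKeyLike = { KeyValuePairs: KeyValuePair[] };

// AND the pairs within one key; OR the parenthesized clauses across keys.
function buildExtraFilter(keys: CompositeKeyLike[]): string {
  return keys
    .map(k => `(${k.KeyValuePairs.map(p => `${p.FieldName} = '${p.Value}'`).join(' AND ')})`)
    .join(' OR ');
}

buildExtraFilter([
  { KeyValuePairs: [{ FieldName: 'ID', Value: 'abc-123' }] },
  { KeyValuePairs: [{ FieldName: 'ID', Value: 'def-456' }] }
]);
// "(ID = 'abc-123') OR (ID = 'def-456')"
```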
### Interfaces

| Interface | Methods | Purpose |
|---|---|---|
| `IEmbedding` | `createEmbedding`, `createBatchEmbedding` | Text embedding generation |
| `IVectorDatabase` | `listIndexes`, `createIndex`, `deleteIndex`, `editIndex` | Vector database management |
| `IVectorIndex` | `createRecord(s)`, `getRecord(s)`, `updateRecord(s)`, `deleteRecord(s)` | Vector record CRUD |
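A deterministic stub is often handy when wiring pipelines in tests. The shapes below are hypothetical (only the method names come from the table above; the real `IEmbedding` defines its own parameter and result types):

```typescript
// Hypothetical interface shape, for illustration only.
interface IEmbeddingLike {
  createEmbedding(text: string): Promise<number[]>;
  createBatchEmbedding(texts: string[]): Promise<number[][]>;
}

// Deterministic fake: the same text always yields the same 3-dimensional vector.
class FakeEmbedding implements IEmbeddingLike {
  async createEmbedding(text: string): Promise<number[]> {
    return [text.length % 7, text.length % 11, text.length % 13];
  }
  async createBatchEmbedding(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map(t => this.createEmbedding(t)));
  }
}
```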
## Package Ecosystem

| Package | Depends On Core | Purpose |
|---|---|---|
| `@memberjunction/ai-vectordb` | No (peer) | Abstract vector database interface |
| `@memberjunction/ai-vector-sync` | Yes | Entity-to-vector synchronization |
| `@memberjunction/ai-vector-dupe` | Yes | Duplicate detection via vector similarity |
| `@memberjunction/ai-vectors-memory` | No | In-memory vector search and clustering |
| `@memberjunction/ai-vectors-pinecone` | No | Pinecone implementation of VectorDBBase |
## Further Reading

- Text Processing Guide: an in-depth guide to chunking strategies, overlap tuning, HTML edge cases, and integration with vectorization/autotagging pipelines
## Development

```bash
# Build
npm run build

# Run tests
npm run test

# Watch mode
npm run test:watch
```

## License

ISC
