@toolkit-p2p/search

v0.1.0

Published

6 months ago

Full-text search with inverted index for Orbit runtime

roseyballs

p2p search full-text-search inverted-index orbit

Full-text search with inverted index and BM25 ranking for P2P applications.

Features

Inverted Index: Fast full-text search with term-based indexing
BM25 Ranking: Probabilistic ranking for relevance-based results
Porter Stemming: English word stemming for better matching
Stop Words: Common word filtering for improved relevance
TTL Management: Automatic document expiration (max 48 hours)
Metadata Filtering: Filter results by document metadata
Pagination: Limit and offset support for large result sets
Highlight Generation: Extract relevant snippets from matches
Configurable: Custom stop words, stemming options, term length limits

Installation

pnpm add @toolkit-p2p/search

Quick Start

import { SearchIndex } from '@toolkit-p2p/search';

// Create a search index
const index = new SearchIndex();

// Add documents
index.addDocument({
  id: 'doc1',
  content: 'TypeScript is a typed superset of JavaScript',
  metadata: { category: 'programming' },
  timestamp: Date.now(),
  ttl: 3600000, // 1 hour
});

// Search
const results = index.search({ query: 'typescript' });

results.forEach((result) => {
  console.log(`${result.id}: ${result.score.toFixed(3)}`);
  console.log(result.content);
});

API Reference

SearchIndex

Constructor

new SearchIndex(options?: IndexOptions)

Options:

enableStemming?: boolean - Enable Porter stemming (default: true)
enableStopWords?: boolean - Filter common words (default: true)
customStopWords?: string[] - Additional stop words to filter
minTermLength?: number - Minimum term length (default: 2)
maxTermLength?: number - Maximum term length (default: 50)
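The documented defaults can be pictured as a merge of user options over a defaults object. The sketch below is illustrative only: `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names, the empty default for `customStopWords` is an assumption, and the package may resolve and validate options differently.

```typescript
// Mirrors the option names listed above.
interface IndexOptions {
  enableStemming?: boolean;
  enableStopWords?: boolean;
  customStopWords?: string[];
  minTermLength?: number;
  maxTermLength?: number;
}

// Defaults as documented in this README.
const DEFAULT_OPTIONS: Required<IndexOptions> = {
  enableStemming: true,
  enableStopWords: true,
  customStopWords: [], // assumption: no extra stop words unless provided
  minTermLength: 2,
  maxTermLength: 50,
};

// Hypothetical helper: fill in any option the caller left out.
function resolveOptions(options: IndexOptions = {}): Required<IndexOptions> {
  const resolved: Required<IndexOptions> = { ...DEFAULT_OPTIONS, ...options };
  // Assumed sanity check; the README does not say how the package handles this.
  if (resolved.minTermLength > resolved.maxTermLength) {
    throw new RangeError('minTermLength must not exceed maxTermLength');
  }
  return resolved;
}

// Only the overridden field changes; everything else keeps its default.
const noStemming = resolveOptions({ enableStemming: false });
```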

Methods

addDocument(doc: SearchDocument): void

Add or update a document in the index.

index.addDocument({
  id: 'unique-id',
  content: 'Document content to index',
  metadata: { category: 'example' },
  timestamp: Date.now(),
  ttl: 3600000, // milliseconds
});

Note: TTL cannot exceed 48 hours (172,800,000 ms).
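This README does not say what happens when a TTL above the limit is passed, so callers may want to guard the value before calling addDocument. A minimal caller-side sketch, where `MAX_TTL_MS` and `clampTtl` are hypothetical names and not part of the package:

```typescript
// 48 hours expressed in milliseconds (172,800,000 ms).
const MAX_TTL_MS = 48 * 60 * 60 * 1000;

// Hypothetical helper: reject nonsensical TTLs and cap the rest at the maximum.
function clampTtl(requestedMs: number): number {
  if (!Number.isFinite(requestedMs) || requestedMs <= 0) {
    throw new RangeError(`TTL must be a positive number of milliseconds, got ${requestedMs}`);
  }
  return Math.min(requestedMs, MAX_TTL_MS);
}

// A one-week request is capped at the 48-hour maximum.
const cappedTtl = clampTtl(7 * 24 * 60 * 60 * 1000);
```

Whether to clamp or to throw on an oversized TTL is a design choice for the caller; clamping is shown here only because it keeps the document indexable.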

removeDocument(docId: string): boolean

Remove a document from the index. Returns true if the document was found and removed.

search(query: SearchQuery): SearchResult[]

Search for documents matching the query.

const results = index.search({
  query: 'search terms',
  limit: 20, // default: 20
  offset: 0, // default: 0
  filters: { category: 'example' }, // optional metadata filters
});

Returns results sorted by relevance, highest first. Each result carries a BM25 score in the 0-1 range.

getStats(): SearchStats

Get index statistics.

const stats = index.getStats();
console.log(`Documents: ${stats.documentCount}`);
console.log(`Terms: ${stats.termCount}`);
console.log(`Avg length: ${stats.averageDocumentLength}`);
console.log(`Memory: ${stats.memoryUsage} bytes`);

purgeExpired(now: number): number

Remove expired documents. Returns the number of documents purged.

const purged = index.purgeExpired(Date.now());
console.log(`Purged ${purged} expired documents`);

Types

SearchDocument

interface SearchDocument {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
  timestamp: number; // Unix timestamp in milliseconds
  ttl: number; // Time-to-live in milliseconds (max 48 hours)
}

SearchQuery

interface SearchQuery {
  query: string; // Search terms (AND logic)
  limit?: number; // Max results (default: 20)
  offset?: number; // Pagination offset (default: 0)
  filters?: Record<string, unknown>; // Metadata filters
}

SearchResult

interface SearchResult {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
  score: number; // BM25 score (0-1, higher is better)
  highlights?: string[]; // Matching snippets
}

Examples

Basic Search

See examples/basic-usage.ts for a complete example.

Advanced Features

See examples/advanced-usage.ts for examples of:

TTL management and document expiration
Metadata filtering
Pagination
Custom index options
Document updates
Highlight generation

TTL Management

// Add document with 1-hour TTL
index.addDocument({
  id: 'temp-doc',
  content: 'Temporary content',
  timestamp: Date.now(),
  ttl: 3600000, // 1 hour
});

// Later, purge expired documents
setInterval(() => {
  const purged = index.purgeExpired(Date.now());
  console.log(`Purged ${purged} expired documents`);
}, 60000); // Check every minute
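The purge step can be thought of as a linear scan over stored documents. The sketch below assumes a document expires once `now >= timestamp + ttl`; that rule, and the names `ExpiringDoc`, `isExpired`, and `purgeExpiredDocs`, are illustrative assumptions and not the package's internals.

```typescript
// Only the fields needed to decide expiry.
interface ExpiringDoc {
  id: string;
  timestamp: number; // Unix time in milliseconds
  ttl: number; // time-to-live in milliseconds
}

// Assumed rule: a document is expired once its timestamp plus TTL has passed.
function isExpired(doc: ExpiringDoc, now: number): boolean {
  return now >= doc.timestamp + doc.ttl;
}

// Visit every stored document once, delete the expired ones, return the count.
function purgeExpiredDocs(docs: Map<string, ExpiringDoc>, now: number): number {
  let purged = 0;
  docs.forEach((doc, id) => {
    if (isExpired(doc, now)) {
      docs.delete(id); // deleting during Map.forEach is safe in JavaScript
      purged += 1;
    }
  });
  return purged;
}
```

Because every document is visited, the cost grows with the total number of documents, which is why purging on a timer rather than on every search is a reasonable default.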

Metadata Filtering

// Add documents with metadata
index.addDocument({
  id: 'doc1',
  content: 'Tutorial for beginners',
  metadata: { category: 'tutorial', difficulty: 'beginner' },
  timestamp: Date.now(),
  ttl: 86400000, // 24 hours
});

// Search with metadata filter
const tutorials = index.search({
  query: 'tutorial',
  filters: { category: 'tutorial' },
});
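The exact matching rules for filters are not spelled out in this README. One plausible reading, sketched below, is that every key in `filters` must be present in the document's metadata with a strictly equal value. `matchesFilters` is a hypothetical helper written for illustration.

```typescript
type Metadata = Record<string, unknown>;

// Assumed semantics: all filter keys must match by strict equality (===).
function matchesFilters(metadata: Metadata | undefined, filters: Metadata | undefined): boolean {
  if (!filters) return true; // no filters means everything matches
  const keys = Object.keys(filters);
  if (!metadata) return keys.length === 0; // a document without metadata only matches an empty filter
  return keys.every((key) => metadata[key] === filters[key]);
}

const docMetadata = { category: 'tutorial', difficulty: 'beginner' };
const isTutorial = matchesFilters(docMetadata, { category: 'tutorial' });
const isReference = matchesFilters(docMetadata, { category: 'reference' });
```

Under this reading, nested objects and arrays in metadata would only match by reference, so flat primitive values are the safest thing to filter on.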

Pagination

// Get first page (results 0-9)
const page1 = index.search({
  query: 'search term',
  limit: 10,
  offset: 0,
});

// Get second page (results 10-19)
const page2 = index.search({
  query: 'search term',
  limit: 10,
  offset: 10,
});
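To walk every page, the offset can be advanced by the page size until a short page comes back. The sketch below is self-contained: `collectAllPages` is a hypothetical helper, and `fakeSearch` is a stand-in for index.search that slices a fixed, already-ranked list.

```typescript
interface PageQuery {
  query: string;
  limit: number;
  offset: number;
}

type SearchFn<T> = (q: PageQuery) => T[];

// Fetch page after page until a page shorter than pageSize signals the end.
function collectAllPages<T>(search: SearchFn<T>, query: string, pageSize: number): T[] {
  if (pageSize <= 0) {
    throw new RangeError('pageSize must be positive');
  }
  const all: T[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = search({ query, limit: pageSize, offset });
    all.push(...page);
    if (page.length < pageSize) break;
  }
  return all;
}

// Stand-in for index.search: 25 ranked ids, sliced by offset and limit.
const ranked: string[] = Array.from({ length: 25 }, (_, i) => `doc${i}`);
const fakeSearch: SearchFn<string> = ({ limit, offset }) => ranked.slice(offset, offset + limit);

const everything = collectAllPages(fakeSearch, 'search term', 10);
```

If documents are added or expire between page requests, offsets can shift, so results may repeat or be skipped across pages.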

How it Works

Inverted Index

The inverted index maps each term to a list of documents containing that term, along with frequency and position information.

Term         → Postings
"javascript" → [{docId: "1", frequency: 2, positions: [0, 5]}, ...]
"typescript" → [{docId: "1", frequency: 1, positions: [3]}, ...]

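As a rough illustration of the structure above, the following sketch builds term-to-postings entries from already-tokenized documents. The `Posting` shape follows the diagram; `indexTerms` and the sample token lists are hypothetical and are not the package's internals.

```typescript
// Shape taken from the diagram above.
interface Posting {
  docId: string;
  frequency: number;
  positions: number[];
}

type InvertedIndex = Map<string, Posting[]>;

// Add one document's tokens to the index, recording where each term occurs.
function indexTerms(index: InvertedIndex, docId: string, terms: string[]): void {
  // First collect the positions of each distinct term within this document.
  const positionsByTerm = new Map<string, number[]>();
  terms.forEach((term, position) => {
    const positions = positionsByTerm.get(term) ?? [];
    positions.push(position);
    positionsByTerm.set(term, positions);
  });

  // Then append one posting per distinct term.
  positionsByTerm.forEach((positions, term) => {
    const postings = index.get(term) ?? [];
    postings.push({ docId, frequency: positions.length, positions });
    index.set(term, postings);
  });
}

const exampleIndex: InvertedIndex = new Map();
indexTerms(exampleIndex, '1', ['javascript', 'is', 'fun', 'typescript', 'extends', 'javascript']);
indexTerms(exampleIndex, '2', ['typescript', 'adds', 'types']);
```

Looking up a term is then a single map access, and a multi-term query only has to visit the postings of the terms it contains.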
BM25 Ranking

Documents are ranked using the BM25 algorithm, which considers:

Term Frequency (TF): How often terms appear in the document
Inverse Document Frequency (IDF): How rare terms are across all documents
Document Length: Normalized by average document length

Parameters:

k1 = 1.2: Controls term frequency saturation
b = 0.75: Controls length normalization
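For reference, the textbook BM25 term score with these parameters can be sketched as follows. This is the standard formula, not a copy of the package's code: the IDF variant (with +1 inside the logarithm) is an assumption, and raw scores from this formula are not confined to 0-1, so whatever normalization the package applies is not shown here.

```typescript
const K1 = 1.2; // term frequency saturation
const B = 0.75; // length normalization

// Inverse document frequency; the "+1" keeps the value non-negative.
function idf(totalDocs: number, docsWithTerm: number): number {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

interface TermStats {
  tf: number; // occurrences of the term in this document
  docLength: number; // number of terms in this document
  avgDocLength: number; // average document length across the index
  totalDocs: number; // documents in the index
  docsWithTerm: number; // documents containing the term
}

// Contribution of one query term to one document's score.
function bm25Term(s: TermStats): number {
  const lengthNorm = 1 - B + B * (s.docLength / s.avgDocLength);
  return (idf(s.totalDocs, s.docsWithTerm) * (s.tf * (K1 + 1))) / (s.tf + K1 * lengthNorm);
}

// A document's score for a query is the sum over the query's terms.
function bm25Score(terms: TermStats[]): number {
  return terms.reduce((sum, s) => sum + bm25Term(s), 0);
}
```

With k1 = 1.2, repeated occurrences of a term add less and less to the score, and with b = 0.75 a document longer than average is penalized relative to a shorter one with the same term frequency.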

Text Processing Pipeline

Lowercase: Convert to lowercase
Tokenize: Extract alphanumeric sequences
Filter: Remove short terms and stop words
Stem: Apply Porter stemming (if enabled)
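The four steps above can be sketched as a single function. This is a simplified illustration: `STOP_WORDS` is a tiny sample list rather than the package's built-in one, the tokenizer only recognizes ASCII letters and digits, and the Porter stemmer is left as an optional hook (`stem`) instead of being implemented.

```typescript
// Tiny sample list for illustration; the real built-in list is not shown in this README.
const STOP_WORDS = new Set<string>(['a', 'an', 'and', 'is', 'of', 'the', 'to']);

interface TokenizeOptions {
  minTermLength: number;
  maxTermLength: number;
  stem?: (term: string) => string; // plug in a Porter stemmer here if desired
}

function tokenize(
  text: string,
  options: TokenizeOptions = { minTermLength: 2, maxTermLength: 50 },
): string[] {
  const { minTermLength, maxTermLength, stem } = options;
  // 1. Lowercase, 2. extract alphanumeric runs.
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return tokens
    // 3. Drop terms outside the length limits, then stop words.
    .filter((t) => t.length >= minTermLength && t.length <= maxTermLength)
    .filter((t) => !STOP_WORDS.has(t))
    // 4. Stem if a stemmer was supplied.
    .map((t) => (stem ? stem(t) : t));
}

const sampleTerms = tokenize('TypeScript is a typed superset of JavaScript');
// sampleTerms: ['typescript', 'typed', 'superset', 'javascript']
```

The same function has to be applied to both documents and queries; otherwise a stemmed index term would never match an unstemmed query term.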

Performance Considerations

Memory: Index stores all documents and term positions in memory
Search Speed: O(terms × docs_per_term) for search operations
Indexing Speed: O(terms) for adding documents
TTL Purge: O(total_docs) to purge expired documents

Best practices:

Set appropriate TTLs to limit memory usage
Purge expired documents periodically
Use metadata filters to reduce result set size
Consider pagination for large result sets

Testing

Run tests:

pnpm test

Test coverage:

28 test cases
Document management (add, remove, update)
Search functionality (single term, multiple terms, ranking)
Tokenization (stemming, stop words, case-insensitive)
TTL and expiration
Metadata filtering
Highlights
Statistics
BM25 scoring

License

MIT

Author

Aaron Rosenthal
