@toolkit-p2p/search
v0.1.0
Full-text search with inverted index for Orbit runtime
Full-text search with inverted index and BM25 ranking for P2P applications.
Features
- Inverted Index: Fast full-text search with term-based indexing
- BM25 Ranking: Probabilistic ranking for relevance-based results
- Porter Stemming: English word stemming for better matching
- Stop Words: Common word filtering for improved relevance
- TTL Management: Automatic document expiration (max 48 hours)
- Metadata Filtering: Filter results by document metadata
- Pagination: Limit and offset support for large result sets
- Highlight Generation: Extract relevant snippets from matches
- Configurable: Custom stop words, stemming options, term length limits
Installation
pnpm add @toolkit-p2p/search
Quick Start
import { SearchIndex } from '@toolkit-p2p/search';
// Create a search index
const index = new SearchIndex();
// Add documents
index.addDocument({
  id: 'doc1',
  content: 'TypeScript is a typed superset of JavaScript',
  metadata: { category: 'programming' },
  timestamp: Date.now(),
  ttl: 3600000, // 1 hour
});
// Search
const results = index.search({ query: 'typescript' });
results.forEach((result) => {
  console.log(`${result.id}: ${result.score.toFixed(3)}`);
  console.log(result.content);
});
API Reference
SearchIndex
Constructor
new SearchIndex(options?: IndexOptions)
Options:
- enableStemming?: boolean - Enable Porter stemming (default: true)
- enableStopWords?: boolean - Filter common words (default: true)
- customStopWords?: string[] - Additional stop words to filter
- minTermLength?: number - Minimum term length (default: 2)
- maxTermLength?: number - Maximum term length (default: 50)
Methods
addDocument(doc: SearchDocument): void
Add or update a document in the index.
index.addDocument({
  id: 'unique-id',
  content: 'Document content to index',
  metadata: { category: 'example' },
  timestamp: Date.now(),
  ttl: 3600000, // milliseconds
});
Note: TTL cannot exceed 48 hours (172,800,000 ms).
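The note above does not say what happens when a TTL exceeds the limit, so callers may want to validate values before indexing. A minimal sketch, assuming a hypothetical `assertValidTtl` helper and `MAX_TTL_MS` constant (neither is part of the package):

```typescript
// Hypothetical helper: not part of @toolkit-p2p/search.
// 48 hours in milliseconds, matching the limit stated in the note above.
const MAX_TTL_MS = 48 * 60 * 60 * 1000; // 172,800,000

// Returns the TTL unchanged if it is a positive finite number within the cap,
// otherwise throws a RangeError.
function assertValidTtl(ttl: number): number {
  if (!Number.isFinite(ttl) || ttl <= 0) {
    throw new RangeError(`TTL must be a positive number of milliseconds, got ${ttl}`);
  }
  if (ttl > MAX_TTL_MS) {
    throw new RangeError(`TTL ${ttl} ms exceeds the 48-hour maximum (${MAX_TTL_MS} ms)`);
  }
  return ttl;
}
```

A document could then be added with `ttl: assertValidTtl(3600000)`, so an out-of-range value fails at the call site.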
removeDocument(docId: string): boolean
Remove a document from the index. Returns true if document was found and removed.
search(query: SearchQuery): SearchResult[]
Search for documents matching the query.
const results = index.search({
  query: 'search terms',
  limit: 20, // default: 20
  offset: 0, // default: 0
  filters: { category: 'example' }, // optional metadata filters
});
Returns results sorted by relevance (BM25 score, 0-1 range).
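The `filters` option is described only as optional metadata filters. One plausible reading, shown here as a sketch and not as the package's confirmed behavior, is that a document matches when every filter key is strictly equal to the corresponding metadata value. `matchesFilters` is a hypothetical helper written for this example.

```typescript
// Hypothetical helper illustrating one possible filter semantics:
// every key in `filters` must be strictly equal (===) in `metadata`.
// The package's actual matching rules may differ.
type Metadata = Record<string, unknown>;

function matchesFilters(metadata: Metadata | undefined, filters: Metadata | undefined): boolean {
  if (!filters) return true; // no filters: everything matches
  const meta: Metadata = metadata ?? {};
  return Object.keys(filters).every((key) => meta[key] === filters[key]);
}

const docs: Array<{ id: string; metadata?: Metadata }> = [
  { id: 'doc1', metadata: { category: 'tutorial', difficulty: 'beginner' } },
  { id: 'doc2', metadata: { category: 'reference' } },
  { id: 'doc3' }, // no metadata at all
];

const tutorialIds = docs
  .filter((doc) => matchesFilters(doc.metadata, { category: 'tutorial' }))
  .map((doc) => doc.id);
console.log(tutorialIds); // only 'doc1' matches
```

Under this reading, supplying several keys narrows the result set, because all of them must match.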
getStats(): SearchStats
Get index statistics.
const stats = index.getStats();
console.log(`Documents: ${stats.documentCount}`);
console.log(`Terms: ${stats.termCount}`);
console.log(`Avg length: ${stats.averageDocumentLength}`);
console.log(`Memory: ${stats.memoryUsage} bytes`);
purgeExpired(now: number): number
Remove expired documents. Returns the number of documents purged.
const purged = index.purgeExpired(Date.now());
console.log(`Purged ${purged} expired documents`);
Types
SearchDocument
interface SearchDocument {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
  timestamp: number; // Unix timestamp in milliseconds
  ttl: number; // Time-to-live in milliseconds (max 48 hours)
}
SearchQuery
interface SearchQuery {
  query: string; // Search terms (AND logic)
  limit?: number; // Max results (default: 20)
  offset?: number; // Pagination offset (default: 0)
  filters?: Record<string, unknown>; // Metadata filters
}
SearchResult
interface SearchResult {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
  score: number; // BM25 score (0-1, higher is better)
  highlights?: string[]; // Matching snippets
}
Examples
Basic Search
See examples/basic-usage.ts for a complete example.
Advanced Features
See examples/advanced-usage.ts for examples of:
- TTL management and document expiration
- Metadata filtering
- Pagination
- Custom index options
- Document updates
- Highlight generation
TTL Management
// Add document with 1-hour TTL
index.addDocument({
  id: 'temp-doc',
  content: 'Temporary content',
  timestamp: Date.now(),
  ttl: 3600000, // 1 hour
});
// Later, purge expired documents
setInterval(() => {
  const purged = index.purgeExpired(Date.now());
  console.log(`Purged ${purged} expired documents`);
}, 60000); // Check every minute
Metadata Filtering
// Add documents with metadata
index.addDocument({
  id: 'doc1',
  content: 'Tutorial for beginners',
  metadata: { category: 'tutorial', difficulty: 'beginner' },
  timestamp: Date.now(),
  ttl: 86400000, // 24 hours
});
// Search with metadata filter
const tutorials = index.search({
  query: 'tutorial',
  filters: { category: 'tutorial' },
});
Pagination
// Get first page (results 0-9)
const page1 = index.search({
  query: 'search term',
  limit: 10,
  offset: 0,
});
// Get second page (results 10-19)
const page2 = index.search({
  query: 'search term',
  limit: 10,
  offset: 10,
});
How it Works
Inverted Index
The inverted index maps each term to a list of documents containing that term, along with frequency and position information.
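As an illustration of that mapping, the sketch below builds a small inverted index that records per-document frequency and positions. The `Posting` shape follows the frequency and position information described above; `buildIndex` and its simplified tokenizer are assumptions made for this example, not the package's internals.

```typescript
// Illustrative only: a tiny inverted index, not the package's internal structure.
interface Posting {
  docId: string;
  frequency: number;   // how many times the term occurs in the document
  positions: number[]; // token offsets of each occurrence
}

function buildIndex(docs: Array<{ id: string; content: string }>): Map<string, Posting[]> {
  const index = new Map<string, Posting[]>();
  docs.forEach((doc) => {
    // Simplified tokenizer: lowercase, then take alphanumeric runs.
    const tokens: string[] = doc.content.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    // Collect one posting per distinct term in this document.
    const perDoc = new Map<string, Posting>();
    tokens.forEach((term, position) => {
      let posting = perDoc.get(term);
      if (!posting) {
        posting = { docId: doc.id, frequency: 0, positions: [] };
        perDoc.set(term, posting);
      }
      posting.frequency += 1;
      posting.positions.push(position);
    });
    // Append this document's postings to the global term lists.
    perDoc.forEach((posting, term) => {
      const postings = index.get(term) ?? [];
      postings.push(posting);
      index.set(term, postings);
    });
  });
  return index;
}

const index = buildIndex([
  { id: '1', content: 'JavaScript and TypeScript compile to JavaScript' },
  { id: '2', content: 'TypeScript adds types' },
]);
console.log(index.get('javascript'));
// one posting for doc '1': frequency 2, positions [0, 5]
```

Looking up a term is then a single map access, and only the documents in that term's posting list need to be scored.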
Term → Postings
"javascript" → [{docId: "1", frequency: 2, positions: [0, 5]}, ...]
"typescript" → [{docId: "1", frequency: 1, positions: [3]}, ...]
BM25 Ranking
Documents are ranked using the BM25 algorithm, which considers:
- Term Frequency (TF): How often terms appear in the document
- Inverse Document Frequency (IDF): How rare terms are across all documents
- Document Length: Normalized by average document length
Parameters:
- k1 = 1.2: Controls term frequency saturation
- b = 0.75: Controls length normalization
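To make the roles of k1 and b concrete, here is a sketch of the standard Okapi BM25 score for one term in one document, using the parameter values listed above. The IDF variant (with `1 +` inside the logarithm, which keeps it non-negative) is one common choice and an assumption here; the package may use a different variant. This sketch returns a raw score and applies no normalization to a 0-1 range.

```typescript
// Sketch of the standard BM25 term score; not the package's actual code.
const K1 = 1.2;  // term frequency saturation
const B = 0.75;  // length normalization strength

interface TermStats {
  termFrequency: number; // occurrences of the term in this document
  docLength: number;     // number of terms in this document
  avgDocLength: number;  // average document length across the index
  totalDocs: number;     // documents in the index
  docsWithTerm: number;  // documents containing the term
}

function bm25TermScore(s: TermStats): number {
  // IDF: rarer terms get a larger weight.
  const idf = Math.log(1 + (s.totalDocs - s.docsWithTerm + 0.5) / (s.docsWithTerm + 0.5));
  // Length normalization: longer-than-average documents are penalized when B > 0.
  const lengthNorm = 1 - B + B * (s.docLength / s.avgDocLength);
  // Saturating term frequency: repeated occurrences give diminishing returns.
  const tf = (s.termFrequency * (K1 + 1)) / (s.termFrequency + K1 * lengthNorm);
  return idf * tf;
}

const base = { docLength: 100, avgDocLength: 100, totalDocs: 1000, docsWithTerm: 10 };
const once = bm25TermScore({ ...base, termFrequency: 1 });
const tenTimes = bm25TermScore({ ...base, termFrequency: 10 });
// Ten occurrences score higher than one, but far less than ten times higher.
console.log(once.toFixed(3), tenTimes.toFixed(3));
```

For a multi-term query, the per-term scores would be summed for each candidate document.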
Text Processing Pipeline
- Lowercase: Convert to lowercase
- Tokenize: Extract alphanumeric sequences
- Filter: Remove short terms and stop words
- Stem: Apply Porter stemming (if enabled)
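The four steps can be sketched as a single function. This is an approximation under stated assumptions: the stop-word list is a tiny sample, and `naiveStem` is a deliberately crude suffix stripper standing in for the Porter stemmer, which has many more rules. The default length limits (2 and 50) are taken from the constructor options documented in the API reference.

```typescript
// Sketch of the lowercase -> tokenize -> filter -> stem pipeline.
// STOP_WORDS and naiveStem are simplified stand-ins, not the package's own.
const STOP_WORDS = new Set(['a', 'an', 'and', 'is', 'of', 'the', 'to']);

// Crude placeholder for Porter stemming: strips a few common suffixes only.
function naiveStem(term: string): string {
  for (const suffix of ['ing', 'ed', 's']) {
    if (term.length > suffix.length + 2 && term.endsWith(suffix)) {
      return term.slice(0, -suffix.length);
    }
  }
  return term;
}

function processText(text: string, minLen = 2, maxLen = 50, stem = true): string[] {
  const lowered = text.toLowerCase();                         // 1. lowercase
  const tokens: string[] = lowered.match(/[a-z0-9]+/g) ?? []; // 2. tokenize
  const kept = tokens.filter(                                 // 3. filter
    (t) => t.length >= minLen && t.length <= maxLen && !STOP_WORDS.has(t),
  );
  return stem ? kept.map(naiveStem) : kept;                   // 4. stem
}

console.log(processText('TypeScript is a typed superset of JavaScript'));
// [ 'typescript', 'typ', 'superset', 'javascript' ]
```

The same function would need to run on both documents and queries, so that a query term and an indexed term reduce to the same form.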
Performance Considerations
- Memory: Index stores all documents and term positions in memory
- Search Speed: O(terms × docs_per_term) for search operations
- Indexing Speed: O(terms) for adding documents
- TTL Purge: O(total_docs) to purge expired documents
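The O(total_docs) purge cost listed above follows if expiry is checked once per stored document. A sketch, assuming a document expires once `timestamp + ttl` is at or before `now` (the README does not spell out the exact comparison) and using a hypothetical `purgeExpiredDocs` function over a plain Map:

```typescript
// Illustrative linear scan; the package's real purge logic may differ.
interface StoredDoc {
  id: string;
  timestamp: number; // creation time, Unix ms
  ttl: number;       // time-to-live, ms
}

function purgeExpiredDocs(docs: Map<string, StoredDoc>, now: number): number {
  const expiredIds: string[] = [];
  // Visit every stored document once: O(total_docs).
  docs.forEach((doc, id) => {
    if (doc.timestamp + doc.ttl <= now) expiredIds.push(id);
  });
  // Delete after the scan so the map is not mutated while iterating.
  expiredIds.forEach((id) => docs.delete(id));
  return expiredIds.length;
}

const store = new Map<string, StoredDoc>([
  ['old', { id: 'old', timestamp: 0, ttl: 1000 }],
  ['fresh', { id: 'fresh', timestamp: 5000, ttl: 60000 }],
]);
console.log(purgeExpiredDocs(store, 10000)); // 1 ('old' expired at t = 1000)
```

A full implementation would also have to remove each purged document's postings from the inverted index, which this sketch omits.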
Best practices:
- Set appropriate TTLs to limit memory usage
- Purge expired documents periodically
- Use metadata filters to reduce result set size
- Consider pagination for large result sets
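For the pagination advice, a small helper can walk through pages using `limit` and `offset` until a short page signals the end. `fetchAllPages` and the `SearchFn` type are hypothetical; the stand-in search function below only mimics the documented `limit`/`offset` behavior so that the example runs without the package.

```typescript
// Hypothetical paging helper; PageQuery mirrors the documented query shape.
interface PageQuery { query: string; limit?: number; offset?: number }
type SearchFn<T> = (q: PageQuery) => T[];

function fetchAllPages<T>(search: SearchFn<T>, query: string, pageSize = 10): T[] {
  if (pageSize <= 0) throw new RangeError('pageSize must be positive');
  const all: T[] = [];
  let offset = 0;
  for (;;) {
    const page = search({ query, limit: pageSize, offset });
    all.push(...page);
    if (page.length < pageSize) break; // short (or empty) page: no more results
    offset += pageSize;
  }
  return all;
}

// Stand-in for index.search: 25 fake hits, sliced by limit and offset.
const fakeHits = Array.from({ length: 25 }, (_, i) => `doc${i}`);
const fakeSearch: SearchFn<string> = ({ limit = 20, offset = 0 }) =>
  fakeHits.slice(offset, offset + limit);

console.log(fetchAllPages(fakeSearch, 'search term').length); // 25
```

With a real index, the first argument could be `(q) => index.search(q)`. Collecting every page defeats the memory benefit of pagination, so this pattern suits exports and tests more than interactive views.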
Testing
Run tests:
pnpm test
Test coverage:
- 28 test cases
- Document management (add, remove, update)
- Search functionality (single term, multiple terms, ranking)
- Tokenization (stemming, stop words, case-insensitive)
- TTL and expiration
- Metadata filtering
- Highlights
- Statistics
- BM25 scoring
License
MIT
Author
Aaron Rosenthal
