searchmix
v1.0.2
Lightning-fast full-text search engine for Markdown, EPUB, PDF, TXT, and SRT documents powered by SQLite FTS5
SearchMix
A powerful JavaScript library for indexing and searching Markdown, EPUB, PDF, TXT, and SRT documents using SQLite FTS5 (Full-Text Search).
Features
- Fast Full-Text Search - Powered by SQLite FTS5 with BM25 ranking
- Smart Indexing - Automatically detects new and modified files, only reindexes what changed
- Multiple Formats - Support for Markdown (.md), EPUB, PDF, TXT, and SRT files
- Tags - Organize documents with multiple tags per document
- Method Chaining - Fluent API for easy composition
- Buffer Support - Index content directly from memory
- Advanced Search - FTS5 syntax with column-specific and boolean queries
- Context Snippets - Shows where matches occur with surrounding text
- No Duplicates - Automatic duplicate detection and updating
- Accent & Case Insensitive - Search "mediterraneo" to find "MEDITERRÁNEO"
- Zero Configuration - Works out of the box with sensible defaults
Installation
npm install searchmix
Quick Start
import SearchMix from "searchmix";
const searcher = new SearchMix();
// Index a folder; only new/modified files are reindexed on subsequent calls
await searcher.addDocument("./docs");
// Search returns a flat list of ranked snippets
const { results, totalCount, totalSnippets } = searcher.search("mediterraneo", {
  limit: 1, // max documents to return
  limitSnippets: 10 // max snippets per document
});
// totalCount → 3 (matching documents)
// totalSnippets → 8 (total snippets across all documents)
for (const s of results) {
  console.log(s.documentTitle || s.documentPath); // → "Don Quijote"
  console.log(s.heading?.text ?? "(no heading)"); // → "Capítulo XV"
  console.log(s.getText({ length: 200, offset: -80 })); // → "...en las costas del mediterráneo..."
}
searcher.getStats();
// → { totalDocs: 39, tags: { spa: 39 } }
API Reference
Constructor
new SearchMix({
  dbPath = "./db/searchmix.db",
  includeCodeBlocks = false,
  weights = { title: 10.0, h1: 5.0, body: 1.0 }
} = {})
Options:
- dbPath (string) - Path to SQLite database file. Default: "./db/searchmix.db"
- includeCodeBlocks (boolean) - Include code blocks in body text. Default: false
- weights (object) - BM25 ranking weights for title, h1, and body. Default: { title: 10.0, h1: 5.0, body: 1.0 }
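As a rough illustration of what the weights option expresses, the sketch below combines per-column relevance scores into one score with a weighted sum. It is a simplification for intuition only, not SearchMix's internals and not the actual BM25 formula; `combineScores`, `DEFAULT_WEIGHTS`, and the per-column score values are hypothetical.

```javascript
// Hypothetical helper: combine per-column relevance scores with weights.
// Simplified weighted sum for intuition; not the real BM25 computation.
const DEFAULT_WEIGHTS = { title: 10.0, h1: 5.0, body: 1.0 };

function combineScores(columnScores, weights = DEFAULT_WEIGHTS) {
  let total = 0;
  for (const [column, weight] of Object.entries(weights)) {
    // Columns without a match contribute nothing.
    total += weight * (columnScores[column] ?? 0);
  }
  return total;
}

// A title match outweighs an equally strong body match.
const titleHit = combineScores({ title: 1 });
const bodyHit = combineScores({ body: 1 });
console.log(titleHit, bodyHit); // 10 1
```

Raising the title weight widens the gap between documents that match in the title and those that match only in the body.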
Methods
addDocument(pathOrBuffer, options)
Add document(s) to the index. Returns this for chaining.
Parameters:
- pathOrBuffer (string|Buffer) - Can be:
  - Path to a file (.md, .markdown, .epub, .pdf, .txt, .srt)
  - Path to a directory (scans recursively)
  - Buffer containing Markdown content
- options (object)
  - tags (string|string[]) - Tags for the document. Default: []. Language is auto-detected and added automatically.
  - exclude (array) - Patterns to exclude when scanning. Default: ["node_modules", ".git"]
  - recursive (boolean) - Scan directories recursively. Default: true
  - skipExisting (boolean) - Skip documents already indexed. Default: true
  - update (boolean) - Update existing documents instead of skipping. Default: false
  - checkModified (boolean) - Check file modification time and reindex if changed. Default: true
Smart Indexing:
SearchMix automatically detects and handles changes:
- New files: Automatically added to the index
- Modified files: Detected by modification time and reindexed automatically
- Unchanged files: Skipped (fast - no reindexing needed)
This means you can safely call addDocument() repeatedly without worrying about duplicates or performance - it will only reindex files that have actually changed!
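The change detection described above can be pictured as comparing each file's modification time against the one recorded at indexing time. The following is a minimal sketch under that assumption; `classifyFile`, `planReindex`, and the in-memory `indexed` map are hypothetical and do not reflect how SearchMix stores its state.

```javascript
// Hypothetical sketch of modification-time change detection.
// `indexed` maps a path to the mtime (ms) recorded when it was last indexed.
function classifyFile(indexed, path, mtimeMs) {
  if (!indexed.has(path)) return "new";
  return mtimeMs > indexed.get(path) ? "modified" : "unchanged";
}

function planReindex(indexed, files) {
  // Keep only the files that actually need work.
  return files
    .map(({ path, mtimeMs }) => ({ path, status: classifyFile(indexed, path, mtimeMs) }))
    .filter((f) => f.status !== "unchanged");
}

const indexed = new Map([
  ["docs/a.md", 1000],
  ["docs/b.md", 2000],
]);
const plan = planReindex(indexed, [
  { path: "docs/a.md", mtimeMs: 1000 }, // unchanged
  { path: "docs/b.md", mtimeMs: 2500 }, // modified
  { path: "docs/c.md", mtimeMs: 3000 }, // new
]);
console.log(plan.map((f) => `${f.path}: ${f.status}`));
```

In a real Node.js program the timestamps could come from `fs.statSync(path).mtimeMs`.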
Note: PDF, EPUB, TXT, and SRT files are converted asynchronously. For immediate search results, use Markdown files or wait ~2 seconds after adding converted files.
Example:
// First call: indexes all documents
await searcher.addDocument("./docs");
// Second call: only indexes new or modified files (very fast!)
await searcher.addDocument("./docs");
// Organize by tags
searcher
.addDocument("./notes", { tags: "notes" })
.addDocument("./book.epub", { tags: "books" })
.addDocument(Buffer.from("# Note\nContent"), { tags: ["quick", "notes"] });
// Force update all existing documents
searcher.addDocument("./notes", { update: true });
// Disable automatic change detection
searcher.addDocument("./docs", { checkModified: false });
// Don't skip, always re-index (not recommended)
searcher.addDocument("./notes", { skipExisting: false });
Smart Duplicate Handling:
- By default (skipExisting: true), documents already in the index are automatically skipped
- Set update: true to re-index existing documents with latest content
- Set skipExisting: false to always re-index (creates duplicates if path exists)
search(query, options)
Search indexed documents. Returns a flat list of snippets where each snippet includes both match context and document metadata.
Parameters:
- query (string) - Search query (supports FTS5 syntax)
- options (object)
  - limit (number) - Maximum documents to search. Default: 20
  - minScore (number|null) - Minimum score threshold. Default: null
  - tags (string|string[]|null) - Filter by tag(s). Documents matching any tag + untagged docs are returned. Default: null
  - snippets (boolean) - Include text snippets showing where matches occur. Default: true
  - snippetLength (number) - Characters of context around matches. Default: 150
  - limitSnippets (number) - Maximum snippets per document. Default: 5
  - count (boolean) - Execute COUNT query for totalCount. Default: true
Returns: { results: [Snippet, ...], totalCount: number, totalSnippets: number }
- results - Array of Snippet objects (flat list)
- totalCount - Total number of matching documents
- totalSnippets - Total number of snippets returned
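The tags option is documented as returning documents that match any requested tag, plus untagged documents. A small sketch of that rule as a predicate follows; `matchesTagFilter` and the sample documents are hypothetical and only mirror the documented behavior, not the library's query code.

```javascript
// Hypothetical predicate mirroring the documented tag filter:
// a document passes if it has any requested tag, or has no tags at all.
function matchesTagFilter(docTags, filter) {
  if (filter == null) return true; // no filter: everything passes
  const wanted = Array.isArray(filter) ? filter : [filter];
  if (docTags.length === 0) return true; // untagged documents are kept
  return docTags.some((tag) => wanted.includes(tag));
}

const sampleDocs = [
  { path: "a.md", tags: ["notes"] },
  { path: "b.md", tags: ["books", "spa"] },
  { path: "c.md", tags: [] },
];
const kept = sampleDocs.filter((d) => matchesTagFilter(d.tags, ["notes", "docs"]));
console.log(kept.map((d) => d.path)); // a.md and c.md pass, b.md does not
```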
Each Snippet includes:
Document metadata:
- documentPath - Document path
- documentTitle - Document title
- tags - Array of tags assigned to the document
- rank - BM25 relevance score
Match context:
- text - Text fragment showing the match with context
- section - Where found: 'title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', or 'body'
- position - Character position in document
Navigation (optional):
- heading - Heading details (id, type, text, depth)
- sectionId - Unique section ID
- parentId - Parent section ID reference
- childrenIds - Array of child section IDs
- contentCount - Number of content blocks
Methods:
- getText() - Get extended text around match
- getParent() - Navigate to parent section
- getChildren() - Get child sections
- getContent() - Get section content blocks
- getBreadcrumbs() - Get full hierarchy path
- And more... (see Navigable Snippets section)
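To make the text and position fields more concrete, here is a minimal sketch of cutting a context window around a match. It assumes snippetLength is applied on each side of the match, which the description above does not specify; `extractSnippet` is a hypothetical helper, not part of the SearchMix API.

```javascript
// Hypothetical sketch: cut a context window around a match position.
// Assumes `snippetLength` characters of context on each side of the match.
function extractSnippet(body, term, snippetLength = 150) {
  const position = body.toLowerCase().indexOf(term.toLowerCase());
  if (position === -1) return null; // no match, no snippet
  const start = Math.max(0, position - snippetLength);
  const end = Math.min(body.length, position + term.length + snippetLength);
  const prefix = start > 0 ? "..." : ""; // mark truncation at the front
  const suffix = end < body.length ? "..." : ""; // mark truncation at the back
  return { position, text: prefix + body.slice(start, end) + suffix };
}

const sampleBody = "Backups run nightly. The database is dumped before each backup starts.";
const sampleSnippet = extractSnippet(sampleBody, "database", 10);
console.log(sampleSnippet);
```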
FTS5 Query Syntax:
// Simple search - returns flat list of snippets
const results = searcher.search("postgres backup");
console.log(`Found ${results.totalSnippets} snippets in ${results.totalCount} documents`);
results.results.forEach(snippet => {
console.log(snippet.documentTitle);
console.log(`Found in ${snippet.section}: "${snippet.text}"`);
console.log(`Rank: ${snippet.rank}`);
});
// Get only one snippet per document
const single = searcher.search("postgres", { allOccurrences: false });
// Returns one snippet per matching document
// Column-specific search
searcher.search("title:searchmix");
// Boolean operators
searcher.search("markdown OR sqlite");
searcher.search("sqlite NOT backup");
// Phrase search
searcher.search('"full text search"');
// Search filtered by tag
searcher.search("api", { tags: "docs" });
// Filter by relevance
searcher.search("database", { minScore: 0.5 });
// Control snippet length
searcher.search("database", { snippetLength: 200 });
// Disable snippets for faster queries
searcher.search("database", { snippets: false });
get(path)
Get a document by exact path.
Parameters:
- path (string) - Document path
Returns: Document object or null if not found.
const doc = searcher.get("./docs/README.md");
// { path, title, h1, h2, h3, h4, h5, h6, body, tags, structure, sections_index }
getMultiple(pattern)
Get multiple documents by glob pattern.
Parameters:
- pattern (string) - Glob pattern (e.g., "journals/2025-05*.md")
Returns: Array of document objects.
const docs = searcher.getMultiple("./docs/**/*.md");
removeDocument(path)
Remove a document from the index. Returns this for chaining.
searcher.removeDocument("./old-note.md");
removeByTag(tagName)
Remove all documents that have a specific tag. Returns this for chaining.
searcher.removeByTag("temp");
hasDocument(path)
Check if a document exists in the index.
Parameters:
- path (string) - Document path
Returns: boolean - True if document exists
if (!searcher.hasDocument("./README.md")) {
  searcher.addDocument("./README.md");
}
getStats(options)
Get statistics about indexed documents.
Parameters:
- options (object)
  - tag (string|null) - Get stats for a specific tag. Default: null
Returns: Statistics object.
// All tags
const stats = searcher.getStats();
// { totalDocs: 150, tags: { notes: 80, books: 50, spa: 20 } }
// Specific tag
const notesStats = searcher.getStats({ tag: "notes" });
// { totalDocs: 80, tag: "notes" }
clear()
Clear all documents from the database.
searcher.clear();
close()
Close the database connection.
searcher.close();
Usage Examples
Basic Usage
import SearchMix from "searchmix";
const searcher = new SearchMix();
// Add documents - automatically skips if already indexed
searcher
.addDocument("./docs")
.addDocument("./README.md");
// Search - returns flat list of snippets
const results = searcher.search("api documentation");
results.results.forEach(snippet => {
console.log(`${snippet.documentTitle}`);
console.log(` Found in ${snippet.section}:`);
console.log(` "${snippet.text}"`);
});
searcher.close();
Note: Running this multiple times will only index documents once. Already indexed documents are automatically skipped for better performance.
Using Tags
const searcher = new SearchMix({
dbPath: "./search.db",
weights: { title: 15.0, h1: 5.0, body: 1.0 }
});
// Organize documents with tags (supports multiple tags per document)
await searcher.addDocument("~/notes", { tags: "notes" });
await searcher.addDocument("~/library", { tags: "books" });
await searcher.addDocument("~/work/docs", { tags: ["docs", "work"] });
// Search filtered by tag
const bookResults = searcher.search("javascript", { tags: "books" });
// Search filtered by multiple tags (matches any)
const workResults = searcher.search("javascript", { tags: ["docs", "notes"] });
// Search across all documents
const allResults = searcher.search("javascript");
searcher.close();
Working with EPUB Files
const searcher = new SearchMix();
// Index EPUB files (automatically converted to markdown)
await searcher.addDocument("~/library/book1.epub", { tags: "books" });
await searcher.addDocument("~/library", { tags: "books" }); // Scans for all .epub files
// Search within books
const results = searcher.search("chapter", { tags: "books" });
searcher.close();
Buffer Content
const searcher = new SearchMix();
// Index markdown from memory
const content = Buffer.from(`
# My Note
Quick thoughts about the project.
## Ideas
- Implement feature X
- Optimize performance
`);
searcher.addDocument(content);
// Search the buffer content
const results = searcher.search("feature");
searcher.close();
Advanced Search with Snippets
const searcher = new SearchMix();
searcher.addDocument("./docs");
// Search - returns flat list of snippets (default: all occurrences)
const results = searcher.search("database");
console.log(`Found ${results.totalSnippets} snippets in ${results.totalCount} documents`);
results.results.forEach(snippet => {
console.log(`\n${snippet.documentTitle}`);
console.log(`Found in ${snippet.section}:`);
console.log(`"${snippet.text}"`);
console.log(`Relevance: ${snippet.rank}`);
});
// Search with one snippet per document
const singleResults = searcher.search("database", {
allOccurrences: false
});
// Returns only the first/best match per document
// Column-specific search
const titleResults = searcher.search("title:searchmix");
// Boolean operators
const orResults = searcher.search("markdown OR epub");
const notResults = searcher.search("database NOT backup");
// Phrase search
const phraseResults = searcher.search('"full text search"');
// Control snippet length
const longSnippets = searcher.search("api documentation", {
snippetLength: 300 // More context
});
// Limit occurrences per document
const limitedResults = searcher.search("api", {
allOccurrences: true,
maxOccurrences: 3 // Max 3 snippets per document
});
// Disable snippets for faster queries
const fastResults = searcher.search("api", {
snippets: false // Only metadata
});
// Combine with options
const relevantResults = searcher.search("api documentation", {
minScore: 0.3,
limit: 10,
tags: "docs",
snippetLength: 200
});
searcher.close();
Finding All Occurrences
const searcher = new SearchMix();
searcher.addDocument("./docs");
// Find all occurrences - returns flat list of snippets (default behavior)
const results = searcher.search("javascript", {
allOccurrences: true, // Default: true
maxOccurrences: 10 // Max per document
});
console.log(`Found ${results.totalSnippets} total snippets in ${results.totalCount} documents\n`);
// Group by document if needed
const byDocument = new Map();
results.results.forEach(snippet => {
if (!byDocument.has(snippet.documentPath)) {
byDocument.set(snippet.documentPath, []);
}
byDocument.get(snippet.documentPath).push(snippet);
});
// Display grouped results
byDocument.forEach((snippets, docPath) => {
console.log(`\n${snippets[0].documentTitle} (${snippets.length} occurrences)`);
snippets.forEach((snippet, index) => {
console.log(` ${index + 1}. Found in ${snippet.section} at position ${snippet.position}:`);
console.log(` "${snippet.text}"\n`);
});
});
searcher.close();
Navigable Snippets (Lightweight with IDs)
Snippets include hierarchical navigation using lightweight ID references instead of full objects, reducing memory usage by up to 99.9%:
const searcher = new SearchMix();
searcher.addDocument("./docs");
const results = searcher.search("async", {
allOccurrences: true,
maxOccurrences: 5
});
results.results.forEach(snippet => {
  console.log(`Text: "${snippet.text}"`);
  console.log(`Section: ${snippet.section}`);
  // Current heading information
  if (snippet.heading) {
    console.log(`Heading: ${snippet.heading.text} (${snippet.heading.type})`);
    console.log(`ID: ${snippet.heading.id}`);
  }
  // Lightweight references (IDs only)
  if (snippet.parentId) {
    console.log(`Parent ID: ${snippet.parentId}`);
  }
  if (snippet.childrenIds) {
    console.log(`Children IDs: ${snippet.childrenIds.join(', ')}`);
  }
  if (snippet.contentCount) {
    console.log(`Content blocks: ${snippet.contentCount}`);
  }
});
searcher.close();
Navigate Using Snippet Methods:
Snippets are objects with navigation methods for easy traversal:
const snippet = results.results[0];
// Navigate to parent (auto-loads details)
if (snippet.hasParent()) {
const parent = snippet.getParent();
console.log(`Parent: ${parent.text}`);
}
// Navigate to children
if (snippet.hasChildren()) {
const children = snippet.getChildren();
children.forEach(child => {
console.log(`Child: ${child.text}`);
});
}
// Get content
if (snippet.hasContent()) {
const content = snippet.getContent();
content.forEach(block => {
console.log(`[${block.type}] ${block.text}`);
});
}
// Get breadcrumbs
console.log(snippet.getBreadcrumbsText());
// "Manual > Features > Async/Await"
// Get siblings
const siblings = snippet.getSiblings();
siblings.forEach(sibling => {
console.log(`Sibling: ${sibling.text}`);
});
Snippet Methods:
- hasParent() - Check if has parent section
- hasChildren() - Check if has child sections
- hasContent() - Check if has content blocks
- getParent() - Get parent section details
- getChildren() - Get all child sections
- getChild(index) - Get specific child by index
- getContent() - Get full content blocks
- getDetails() - Get complete section details
- getBreadcrumbs() - Get full hierarchy path as array
- getBreadcrumbsText(separator) - Get breadcrumbs as string
- getAncestorAtDepth(depth) - Find ancestor at specific level
- getSiblings() - Get sections at same level
- toString() - String representation
- toJSON() - Plain object for serialization
Advanced: Direct Access (if needed):
You can also use getHeadingById() directly:
const details = searcher.getHeadingById(
snippet.documentPath,
snippet.heading.id
);
Snippet Properties (Lightweight):
- text - The snippet text with context
- section - Section type: 'title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', or 'body'
- position - Character position in the original text
- documentPath - Document path (for use with getHeadingById)
- sectionId - Unique section ID
- heading (optional) - Basic heading info:
  - id - Heading ID
  - type - Heading type (e.g., 'h2')
  - text - Heading text
  - depth - Heading level (1-6)
- parentId (optional) - Parent section ID reference
- childrenIds (optional) - Array of child section ID references
- contentCount (optional) - Number of content blocks
Memory Optimization:
For a 240-section book:
- Before: 1928 KB loaded per search
- Now: 1.74 KB loaded per search (99.9% reduction!)
- Details loaded on-demand: Only ~1-5 KB when calling getHeadingById()
Use Cases:
- Memory Efficient: Handle large documents without memory issues
- Fast Searches: Don't load unnecessary data upfront
- On-Demand Navigation: Load parent/child/content details only when needed
- Breadcrumbs: Build navigation paths by traversing parent IDs
- Context Loading: Get full section content when user clicks on a result
See demo/lightweight-navigation.js for complete examples.
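The idea of building breadcrumbs by traversing parent IDs can be sketched with a plain lookup table. The section index, its IDs, and the `breadcrumbs` function below are hypothetical stand-ins for illustration; in SearchMix the same result is exposed through getBreadcrumbs() and getBreadcrumbsText().

```javascript
// Hypothetical section index: id -> { text, parentId }. Only IDs link sections.
const sections = new Map([
  ["s1", { text: "Manual", parentId: null }],
  ["s2", { text: "Features", parentId: "s1" }],
  ["s3", { text: "Async/Await", parentId: "s2" }],
]);

function breadcrumbs(index, sectionId) {
  const path = [];
  const seen = new Set(); // guards against accidental cycles
  let id = sectionId;
  while (id != null && index.has(id) && !seen.has(id)) {
    seen.add(id);
    const section = index.get(id);
    path.unshift(section.text); // ancestors go in front
    id = section.parentId;
  }
  return path;
}

console.log(breadcrumbs(sections, "s3").join(" > ")); // Manual > Features > Async/Await
```

Because each entry holds only a parent ID, a search result can stay small and the full path is resolved only when it is requested.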
Smart Indexing (Automatic Duplicate Prevention)
const searcher = new SearchMix();
// By default, documents are automatically skipped if already indexed
searcher.addDocument("./docs"); // First time - indexes all files
searcher.addDocument("./docs"); // Second time - skips (already indexed)
// Force update existing documents
searcher.addDocument("./docs", { update: true }); // Re-indexes everything
// Check if specific document exists (optional, library handles this automatically)
if (!searcher.hasDocument("./new-doc.md")) {
console.log("New document will be indexed");
}
searcher.close();
Benefits:
- No manual checking needed
- Fast re-runs (skips already indexed documents)
- Prevents accidental duplicates
- Use update: true when documents have changed
Accent & Case Insensitive Search
SearchMix automatically normalizes text for searching, making searches insensitive to accents and case:
const searcher = new SearchMix();
// Index document with accented text
const doc = Buffer.from(`
# Viajes por el Mediterráneo
## MEDITERRÁNEO I
El mar Mediterráneo es importante.
## Visita a París
París es la capital de Francia.
`);
searcher.addDocument(doc);
// All these queries will find the same results:
searcher.search("mediterraneo"); // Finds "MEDITERRÁNEO", "Mediterráneo"
searcher.search("MEDITERRÁNEO"); // Same results
searcher.search("Mediterráneo"); // Same results
searcher.search("paris"); // Finds "París"
searcher.search("París"); // Same results
// Works with field-specific search too
searcher.search("headings:mediterraneo"); // Finds headings with "MEDITERRÁNEO"
searcher.close();
Benefits:
- Search naturally without worrying about accents or case
- Especially useful for multilingual content (Spanish, French, Portuguese, etc.)
- Original formatting is preserved in results and snippets
- Works with all FTS5 query operators (AND, OR, NOT, phrases, etc.)
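One common way to get accent- and case-insensitive matching is to normalize both the indexed text and the query before comparing them. The sketch below shows that technique with standard JavaScript string methods; `normalizeForSearch` is a hypothetical helper, and the source does not confirm that SearchMix normalizes text this way.

```javascript
// Hypothetical normalizer: strip diacritics and lowercase for comparison.
function normalizeForSearch(text) {
  return text
    .normalize("NFD") // split letters from their combining accents
    .replace(/\p{M}/gu, "") // drop the combining marks
    .toLowerCase();
}

console.log(normalizeForSearch("MEDITERRÁNEO")); // mediterraneo
console.log(normalizeForSearch("París") === normalizeForSearch("paris")); // true
```

Only the comparison form is normalized; the original text can still be stored and shown unchanged in results.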
Note: If you have an existing database created before this feature, you'll need to re-index your documents to enable accent-insensitive search:
const searcher = new SearchMix();
searcher.addDocument("./docs", { update: true }); // Re-index with new schema
How It Works
SearchMix uses SQLite's FTS5 (Full-Text Search 5) extension to provide fast, efficient full-text search capabilities:
- Parsing - Documents are parsed to extract structured content:
  - title - First h1 heading
  - headings - All other headings (h2-h6)
  - body - Paragraph text (and optionally code blocks)
  - Supported formats: Markdown, EPUB, PDF, TXT, and SRT are automatically converted to a searchable format
- Indexing - Content is stored in an FTS5 virtual table with separate columns for title, headings, and body
- Ranking - Search results are ranked using the BM25 algorithm with configurable weights
- Tags - Documents can be assigned multiple tags for organization and filtered searching
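The parsing step above can be sketched as a small line-based pass over Markdown that separates the title, the remaining headings, and the body text. This is a simplified illustration with a hypothetical `parseMarkdown` function; it ignores code fences and other Markdown constructs and is not the parser SearchMix uses.

```javascript
// Hypothetical line-based parser: splits Markdown into title, headings, body.
function parseMarkdown(markdown) {
  const doc = { title: "", headings: [], body: [] };
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      const text = match[2].trim();
      if (match[1].length === 1 && !doc.title) {
        doc.title = text; // first h1 becomes the title
      } else {
        doc.headings.push(text); // every other heading
      }
    } else if (line) {
      doc.body.push(line); // non-empty, non-heading lines form the body
    }
  }
  return { ...doc, body: doc.body.join("\n") };
}

const parsed = parseMarkdown("# My Note\nQuick thoughts.\n## Ideas\n- Implement feature X");
console.log(parsed);
```

Keeping the three parts separate is what allows each one to be stored in its own column and weighted differently at ranking time.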
Supported File Types
- Markdown - .md, .markdown
- EPUB - .epub (automatically converted to Markdown)
- PDF - .pdf (automatically converted to Markdown)
- Plain Text - .txt (automatically converted to Markdown)
- Subtitles - .srt (automatically converted to Markdown with timestamps)
Configuration
Database Path
By default, SearchMix uses ./db/searchmix.db. You can customize this:
const searcher = new SearchMix({ dbPath: "./db/path.db" });
BM25 Weights
Adjust ranking weights to prioritize different parts of documents:
const searcher = new SearchMix({
  weights: {
    title: 15.0, // Matches in the title are most important
    h1: 5.0, // H1 headings are moderately important
    body: 1.0 // Body text has normal weight
  }
});
Code Blocks
Include code blocks in the searchable body text:
const searcher = new SearchMix({ includeCodeBlocks: true });
Performance Tips
- Use tags to organize documents and narrow search scope
- Set minScore to filter out irrelevant results
- Use limit to control the number of results returned
- For large directories, consider excluding irrelevant paths with the exclude option
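As an illustration of the last tip, the sketch below filters candidate paths against an exclude list before any indexing work is done. It assumes each pattern is compared against whole path segments, which the documentation does not specify; `filterExcluded` is a hypothetical helper, not the library's scanner.

```javascript
// Hypothetical filter: drop paths that contain an excluded segment.
// Assumes patterns match whole path segments; the real matching rule may differ.
function filterExcluded(paths, exclude = ["node_modules", ".git"]) {
  return paths.filter((p) => {
    const segments = p.split("/");
    return !exclude.some((pattern) => segments.includes(pattern));
  });
}

const candidates = [
  "docs/guide.md",
  "node_modules/pkg/README.md",
  ".git/description.md",
  "notes/.github/todo.md",
];
console.log(filterExcluded(candidates));
```

Skipping these directories early avoids reading and converting files that would never be useful search results.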
Error Handling
try {
  const searcher = new SearchMix();
  searcher.addDocument("./nonexistent");
} catch (error) {
  console.error("Error:", error.message);
  // "Path does not exist: ./nonexistent"
}
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
