static-research-engine
v1.0.2
SRE (Static Research Engine)
Transform documents into structured, queryable span artifacts with intelligent search and ranking
SRE is a modular TypeScript pipeline that transforms text-based documents into structured, queryable data artifacts. It provides document segmentation, hierarchical structure tracking, lexical search, and TF-IDF relevance ranking—all with a clean, deterministic API.
Features
- 📝 Document Processing - Parse Markdown and plain text with auto-format detection
- ✂️ Span Segmentation - Split documents into paragraph spans with metadata
- 🏗️ Structural Hints - Track hierarchical structure (chapters, sections, headings)
- 📖 Runtime Reader API - Efficient read-only access to artifacts with O(1) lookups
- 🔍 Lexical Search - Fast, case-insensitive token matching with AND queries
- 🎯 TF-IDF Ranking - Relevance scoring with length normalization
- ⚡ Zero Runtime Dependencies - Lightweight reader with no external deps
- 🛠️ CLI Tools - Build pipeline and search utilities
- 🔬 Deterministic - Identical input produces identical output
- 📊 Quality Metrics - Build reports with span statistics and warnings
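As a rough illustration of the lexical search feature listed above, the sketch below shows one way case-insensitive token matching with AND semantics could work over an inverted index. It is not SRE's actual implementation: the `Span` shape, `tokenize`, `buildIndex`, and `lexicalSearch` are hypothetical names chosen for this example, and only the `span:000001` ID format is taken from this README.

```typescript
// Hypothetical span shape; the real artifact schema may differ.
interface Span {
  id: string
  order: number
  text: string
}

// Lowercase the text and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0)
}

// Build an inverted index: token -> set of span orders containing it.
function buildIndex(spans: Span[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>()
  for (const span of spans) {
    for (const token of tokenize(span.text)) {
      let postings = index.get(token)
      if (!postings) {
        postings = new Set<number>()
        index.set(token, postings)
      }
      postings.add(span.order)
    }
  }
  return index
}

// AND query: a span matches only if it contains every query token.
// Results keep document order because `spans` is filtered in place.
function lexicalSearch(
  spans: Span[],
  index: Map<string, Set<number>>,
  query: string
): Span[] {
  const tokens = tokenize(query)
  if (tokens.length === 0) return []
  return spans.filter((span) =>
    tokens.every((token) => index.get(token)?.has(span.order) ?? false)
  )
}

const demoSpans: Span[] = [
  { id: 'span:000001', order: 0, text: 'Error handling in the parser.' },
  { id: 'span:000002', order: 1, text: 'Handling large files.' },
  { id: 'span:000003', order: 2, text: 'The parser reports every ERROR.' },
]
const demoIndex = buildIndex(demoSpans)
const demoHits = lexicalSearch(demoSpans, demoIndex, 'error parser')
// demoHits holds span:000001 and span:000003, in document order
```

Because every query token must be present, adding terms can only narrow the result set.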
Why SRE?
The Problem: LLMs are great at reasoning but terrible at reading large documents efficiently. Traditional RAG (Retrieval-Augmented Generation) systems are dynamic, probabilistic, and transient—each query reinterprets embeddings without persistent, deterministic understanding of the source text.
The Solution: SRE compiles documents into static, structured knowledge artifacts — like a build system for language understanding. Build once, query forever.
How SRE Complements RAG
SRE does not replace RAG — it enhances it. Each serves a different role:
- RAG provides immediate, dynamic context using embeddings for fast recall
- SRE provides persistent, deterministic structure with full provenance
When combined:
- RAG finds relevant snippets (dynamic recall)
- SRE expands context by traversing the structured corpus (deterministic discovery)
RAG tells the agent where to look. SRE gives it everything it needs once it's there.
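The hand-off described above can be sketched as follows: a retrieval step (not shown) yields a span ID, and a deterministic lookup widens it to the surrounding paragraphs. This is a minimal sketch under stated assumptions, not SRE's code: `ContextSpan` and `expandContext` are hypothetical, and whether the real `reader.neighbors()` includes the hit itself is not specified in this README.

```typescript
// Hypothetical, simplified span shape for illustration only.
interface ContextSpan {
  id: string
  text: string
}

// Return the hit plus up to `before` preceding and `after` following spans,
// clamped to the document bounds. Unknown IDs yield an empty result.
function expandContext(
  ordered: ContextSpan[],
  hitId: string,
  before: number,
  after: number
): ContextSpan[] {
  // A Map gives O(1) id -> position lookup. It is built per call here for
  // brevity; a long-lived reader would build it once at load time.
  const position = new Map<string, number>()
  ordered.forEach((span, i) => position.set(span.id, i))

  const center = position.get(hitId)
  if (center === undefined) return []
  const start = Math.max(0, center - before)
  const end = Math.min(ordered.length, center + after + 1)
  return ordered.slice(start, end)
}

const corpus: ContextSpan[] = [
  { id: 'span:000001', text: 'Introduction.' },
  { id: 'span:000002', text: 'Errors are returned as values.' },
  { id: 'span:000003', text: 'Callers must check each result.' },
  { id: 'span:000004', text: 'Unrelated appendix.' },
]

// Suppose a vector search (not shown) returned span:000002 as its best match.
const expanded = expandContext(corpus, 'span:000002', 1, 1)
const expandedIds = expanded.map((s) => s.id)
// expandedIds: span:000001, span:000002, span:000003
```

The same input always produces the same window, which is the property the static side contributes to the pairing.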
Static vs Dynamic Retrieval
| Aspect | Traditional RAG | SRE |
|--------|----------------|-----|
| Data volatility | Reinterprets embeddings per query | Fixed, compiled spans and indexes |
| Cost | Requires vector DB access | One-time compile, static files |
| Determinism | May vary by model or threshold | Bitwise reproducible builds |
| Hosting | Needs live vector DB | Works from static JSON on any filesystem |
| Explainability | Depends on vector similarity | Full provenance with manifest + nodeMap |
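One way to check the reproducibility claim in the table is to fingerprint the artifacts from two builds and compare them. The sketch below is an assumption-laden illustration, not part of SRE: `stableStringify` and `fnv1a` are hypothetical helpers, and FNV-1a is used only because it fits in a few lines of dependency-free TypeScript. A real check would more likely hash the output files with SHA-256.

```typescript
// Serialize with object keys sorted, so key insertion order cannot change
// the output. Assumes JSON-compatible input (no undefined, functions, cycles).
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return '[' + value.map((v) => stableStringify(v)).join(',') + ']'
  }
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>
    const keys = Object.keys(record).sort()
    const parts = keys.map(
      (k) => JSON.stringify(k) + ':' + stableStringify(record[k])
    )
    return '{' + parts.join(',') + '}'
  }
  return JSON.stringify(value)
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits.
// A stand-in fingerprint for this example, not a cryptographic hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, '0')
}

// Two hypothetical manifests with the same content but different key order.
const buildA = { title: 'Guide', spanCount: 2 }
const buildB = { spanCount: 2, title: 'Guide' }
const sameFingerprint =
  fnv1a(stableStringify(buildA)) === fnv1a(stableStringify(buildB))
// sameFingerprint is true: key order does not affect the serialized form
```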
Who It's For
✅ Engineers and researchers who need:
- Reproducible, explainable document retrieval for LLM pipelines
- Offline corpus preparation for LLM reasoning, QA, or summarization
- Static, local corpus foundation to complement RAG systems
- Provenance, structure, and deterministic builds
✅ Use cases:
- Knowledge bases and documentation compilers
- Offline research assistants and LLM tools
- Dataset preparation for fine-tuning or evaluation
- Analytical indexing (law, science, policy, technical docs)
📖 Read more: See ABOUT.md for the complete philosophy, including detailed comparison with RAG and how they work together.
Installation
From npm
# Global installation
npm install -g static-research-engine
# Project installation
npm install static-research-engine
From source
# Clone the repository
git clone https://github.com/phillt/SRE.git
cd SRE
# Install dependencies
npm install
# Build TypeScript
npm run build
Quick Start
1. Build a corpus from a document
# Process a Markdown file
sre input.md -o output/
# Process a plain text file
sre input.txt -o output/ --format=txt
# With verbose output
sre input.md -o output/ -v
This creates:
- manifest.json - Document metadata
- spans.json - Array of paragraph spans
- nodeMap.json - Hierarchical structure (for Markdown)
- buildReport.json - Quality metrics
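To make the span artifacts listed above more concrete, here is a minimal sketch of how a Markdown document might be segmented into paragraph spans that remember their enclosing headings. It is an illustration only: `ParagraphSpan`, `headingPath`, and `segmentMarkdown` are hypothetical names, the 0-based `order` is an assumption, and the sketch assumes headings are separated from paragraphs by blank lines. Only the `span:000001` ID format comes from this README.

```typescript
// Hypothetical span shape; fields other than the ID format are assumptions.
interface ParagraphSpan {
  id: string // zero-padded, e.g. "span:000001"
  order: number // 0-based position in the document (assumed)
  text: string
  headingPath: string[] // enclosing headings, outermost first
}

function segmentMarkdown(source: string): ParagraphSpan[] {
  const spans: ParagraphSpan[] = []
  const headingPath: string[] = []
  // Normalize line endings, then split on one or more blank lines.
  const blocks = source.replace(/\r\n/g, '\n').split(/\n\s*\n/)
  for (const block of blocks) {
    const text = block.trim()
    if (text.length === 0) continue

    // Count leading '#' characters to detect a single-line ATX heading.
    let level = 0
    while (level < text.length && text.charAt(level) === '#') level++
    const isHeading =
      level >= 1 &&
      level <= 6 &&
      /\s/.test(text.charAt(level)) &&
      !text.includes('\n')
    if (isHeading) {
      // Drop headings at the same or deeper level, then record this one.
      headingPath.length = Math.min(headingPath.length, level - 1)
      headingPath.push(text.slice(level).trim())
      continue
    }

    const order = spans.length
    spans.push({
      id: 'span:' + String(order + 1).padStart(6, '0'),
      order,
      text,
      headingPath: [...headingPath],
    })
  }
  return spans
}

const sampleDoc = [
  '# Guide',
  '',
  'Intro paragraph.',
  '',
  '## Errors',
  '',
  'Errors are returned',
  'as values.',
  '',
  '# Appendix',
  '',
  'Notes.',
].join('\n')
const segmented = segmentMarkdown(sampleDoc)
// Three spans, with heading paths Guide, Guide/Errors, and Appendix
```

The function has no hidden state, so running it twice on the same input yields identical spans.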
2. Search the corpus
# Basic search
sre-search output/ "your query"
# With TF-IDF ranking
sre-search output/ "error handling" --rank=tfidf
# Limit results
sre-search output/ "section" --rank=tfidf --limit=5
3. Use the Reader API
import { createReader } from 'static-research-engine'
// Load artifacts
const reader = await createReader('output/')
// Get document info
const manifest = reader.getManifest()
console.log(`${manifest.title}: ${manifest.spanCount} spans`)
// Search
const results = reader.search('error handling')
// Search with ranking
const ranked = reader.search('error handling', { rank: 'tfidf' })
// Get span by ID
const span = reader.getSpan('span:000001')
// Get context around a span
const contextIds = reader.neighbors('span:000003', { before: 1, after: 1 })
// Navigate sections
const sections = reader.listSections()
const section = reader.getSection('sec:000001')
CLI Tools
sre - Main build tool
Transform documents into span artifacts.
sre <input-file> [options]
Options:
-o, --output <dir> Output directory (default: dist/)
--format <fmt> Force format: md, txt (default: auto-detect)
-v, --verbose Verbose output
-h, --help Display help
Examples:
# Auto-detect format from extension
sre document.md -o dist/
# Force plain text parsing
sre notes.txt --format=txt -o output/
# Verbose mode
sre book.md -o book-output/ -v
sre-search - Search with ranking
Query span artifacts with optional TF-IDF ranking.
sre-search <output-dir> <query> [options]
Options:
--limit=N Limit results to N spans
--rank=tfidf Enable TF-IDF relevance ranking
Examples:
sre-search dist/ "error handling"
sre-search dist/ "section" --rank=tfidf --limit=5
API Documentation
Reader API
The Reader class provides read-only access to artifacts:
import { createReader } from 'static-research-engine'
const reader = await createReader('output-dir/')
// Document metadata
reader.getManifest(): Manifest
reader.getSpanCount(): number
reader.getBuildReport(): BuildReport | undefined
reader.getNodeMap(): NodeMap | undefined
// Span access
reader.getSpan(id: string): Span | undefined
reader.getByOrder(order: number): Span | undefined
reader.neighbors(id: string, opts?: NeighborsOptions): string[]
// Structure navigation
reader.listSections(): string[]
reader.getSection(id: string): { paragraphIds: string[] } | undefined
// Search
reader.search(query: string, opts?: SearchOptions): Span[]
reader.enableTfCache(size?: number): void
Search Options
interface SearchOptions {
limit?: number // Maximum results
rank?: 'none' | 'tfidf' // Ranking method (default: 'none')
}
Examples:
// Unranked search (document order)
const results = reader.search('error')
// Ranked by TF-IDF
const ranked = reader.search('error', { rank: 'tfidf' })
// Top 10 most relevant
const top10 = reader.search('query', { rank: 'tfidf', limit: 10 })
// Enable TF caching for better performance
reader.enableTfCache(100)
const cached = reader.search('query', { rank: 'tfidf' })
See demo/reader/README.md and demo/search/README.md for detailed API documentation.
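For readers unfamiliar with TF-IDF, the sketch below shows one common way to score spans with length normalization. The exact formula SRE uses is not stated in this README, so treat this as an assumption: `rankTfIdf`, `toTokens`, and the `ln(1 + N/df)` weighting are illustrative choices, not the library's definition.

```typescript
interface RankSpan {
  id: string
  text: string
}

interface Scored {
  id: string
  score: number
}

function toTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0)
}

function rankTfIdf(spans: RankSpan[], query: string, limit?: number): Scored[] {
  const queryTokens = Array.from(new Set(toTokens(query)))
  if (queryTokens.length === 0) return []
  const tokenized = spans.map((s) => ({ id: s.id, tokens: toTokens(s.text) }))

  // Document frequency: number of spans containing each query token.
  const df = new Map<string, number>()
  for (const token of queryTokens) {
    df.set(token, tokenized.filter((s) => s.tokens.includes(token)).length)
  }

  const scored: Scored[] = []
  for (const span of tokenized) {
    // AND semantics: skip spans that miss any query token.
    if (!queryTokens.every((t) => span.tokens.includes(t))) continue
    let score = 0
    for (const token of queryTokens) {
      const tf = span.tokens.filter((t) => t === token).length
      const idf = Math.log(1 + spans.length / (df.get(token) ?? 1))
      // Dividing by span length keeps long spans from winning on raw counts.
      score += (tf / span.tokens.length) * idf
    }
    scored.push({ id: span.id, score })
  }

  // Highest score first; ties keep document order (sort is stable in ES2019+).
  scored.sort((a, b) => b.score - a.score)
  return limit === undefined ? scored : scored.slice(0, limit)
}

const rankDemo: RankSpan[] = [
  {
    id: 'span:000001',
    text: 'This long section mentions an error only once among many other words',
  },
  { id: 'span:000002', text: 'Error handling basics' },
  { id: 'span:000003', text: 'Installation steps' },
]
const rankedIds = rankTfIdf(rankDemo, 'error').map((r) => r.id)
// rankedIds: span:000002 first, then span:000001
```

The short span outranks the long one even though each mentions the term once, which is the effect length normalization is meant to produce.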
Demos & Examples
The demo/ directory contains interactive demonstrations and comprehensive tests:
# Run interactive demos
node demo/reader/demo.js # Reader API demo
node demo/search/demo.js # Search demo
node demo/ranking/demo.js # TF-IDF ranking demo
# Run verification tests
node demo/reader/verify.js # 26 tests
node demo/search/verify.js # 17 tests
node demo/ranking/verify.js # 12 tests
# Example CLI tool
node demo/reader/example-cli.js output/ info
See demo/README.md for the complete demo guide.
Project Structure
SRE/
├── src/ # TypeScript source
│ ├── cli/ # Command-line interface
│ ├── pipeline/ # Build orchestration
│ ├── core/ # Pure logic and schemas
│ ├── adapters/ # I/O (readers, writers)
│ └── utils/ # Shared utilities
├── bin/ # Production CLI tools
├── demo/ # Interactive demos and tests
│ ├── reader/ # Reader API demos (26 tests)
│ ├── search/ # Search demos (17 tests)
│ ├── ranking/ # Ranking demos (12 tests)
│ └── format-tracking/ # Format detection tests
├── docs/ # Technical implementation docs
└── dist/ # Compiled JavaScript (after build)
See CLAUDE.md for detailed architecture documentation.
Development
Setup
# Clone and install
git clone https://github.com/phillt/SRE.git
cd SRE
npm install
# Build
npm run build
# Development mode (auto-rebuild)
npm run dev
# Format code
npm run format
Running Tests
# Build first
npm run build
# Generate test corpus
node dist/cli/index.js demo/test-input/sample.md -o dist/final-test
node dist/cli/index.js demo/test-input/sample.txt -o dist/test-txt
# Run all verification tests
node demo/reader/verify.js && \
node demo/search/verify.js && \
node demo/ranking/verify.js
# Run demos
node demo/reader/demo.js
node demo/search/demo.js
node demo/ranking/demo.js
Code Style
This project uses Prettier for code formatting:
# Format code
npm run format
# Check formatting
npm run format:check
Architecture
SRE follows a layered architecture:
- CLI Layer - User interface and argument parsing
- Pipeline Layer - Orchestrates build process
- Core Layer - Pure logic, schemas, transformations
- Adapters Layer - I/O operations (filesystem, etc.)
- Utils Layer - Shared utilities
Design Principles:
- Pure core, mutable edges
- Schema-driven development with Zod
- Single responsibility per module
- Deterministic output
- Zero runtime dependencies for Reader
See CLAUDE.md for complete architecture details.
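The "pure core, mutable edges" principle can be illustrated with a small sketch: a pure function computes a result, and a thin adapter with an injected writer performs the I/O. Everything named here (`QualityReport`, `computeReport`, `writeReport`, the `minChars` threshold) is hypothetical and is not taken from the SRE source. Only the `buildReport.json` file name appears in this README.

```typescript
// Core: pure function, no I/O. The same input always gives the same output.
interface ReportInput {
  id: string
  text: string
}

interface QualityReport {
  spanCount: number
  averageChars: number
  warnings: string[]
}

function computeReport(spans: ReportInput[], minChars: number): QualityReport {
  const total = spans.reduce((sum, s) => sum + s.text.length, 0)
  const warnings = spans
    .filter((s) => s.text.length < minChars)
    .map((s) => `${s.id}: shorter than ${minChars} characters`)
  return {
    spanCount: spans.length,
    averageChars: spans.length === 0 ? 0 : total / spans.length,
    warnings,
  }
}

// Edge: the only place that touches the outside world. The writer is
// injected, so tests can pass an in-memory stand-in for the filesystem.
type WriteFile = (path: string, contents: string) => void

function writeReport(
  outputDir: string,
  report: QualityReport,
  write: WriteFile
): void {
  write(`${outputDir}/buildReport.json`, JSON.stringify(report, null, 2))
}

// In-memory writer used here in place of a real filesystem adapter.
const written = new Map<string, string>()
const qualityReport = computeReport(
  [
    { id: 'span:000001', text: 'A reasonably long paragraph of text.' },
    { id: 'span:000002', text: 'Short.' },
  ],
  10
)
writeReport('output', qualityReport, (path, contents) => {
  written.set(path, contents)
})
```

Keeping the computation free of I/O is what makes deterministic output straightforward to test.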
Contributing
We welcome contributions! Please see CONTRIBUTING.md for:
- Code of Conduct
- How to report bugs
- How to suggest features
- Development workflow
- Pull request process
- Testing requirements
Quick Start for Contributors:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests (npm run build && node demo/reader/verify.js && node demo/search/verify.js && node demo/ranking/verify.js)
- Format code (npm run format)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Documentation
Philosophy & Overview
- ABOUT.md - Why SRE? Philosophy, design rationale, and comparison with RAG
User Documentation
- Demo Guide - Interactive examples and verification tests
- Reader API - Runtime API documentation
- Lexical Search - Search functionality
- TF-IDF Ranking - Relevance ranking
Technical Documentation
- CLAUDE.md - Architecture and development guide
- Implementation Docs - Technical implementation details
- Feature docs in demo/*/ directories
Roadmap
Potential future enhancements:
- [ ] BM25 ranking algorithm
- [ ] Semantic search with embeddings
- [ ] PDF and EPUB support
- [ ] Boolean search operators (AND, OR, NOT)
- [ ] Phrase matching ("exact phrase" queries)
- [ ] Fuzzy matching for typos
- [ ] Incremental updates to artifacts
- [ ] HTTP API server
- [ ] Web UI for exploration
Performance
- Index Building: < 10ms for 1,000 spans
- Lexical Search: < 1ms for typical queries
- TF-IDF Ranking: < 3ms for ranked queries
- Memory: ~1KB per span in memory
License
MIT License - Copyright (c) 2024 phillt
Acknowledgments
Built with:
- TypeScript - Type-safe JavaScript
- Zod - Schema validation
- Commander - CLI framework
- Prettier - Code formatting
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Demo Guide
Made with ❤️ by phillt
