@voidwire/corpus
v0.4.0
Personal knowledge pipeline — extracts signals from sources, deduplicates, and serves hybrid (FTS5 + vector) search over a local SQLite store
What it is
Corpus collects signals from project artifacts (transcripts, operator logs, session state, READMEs, commits, notes) into a holding tank, runs LLM-based consolidation (dedup + synthesis), and writes the result to a local SQLite database with FTS5 full-text search and sqlite-vec vector embeddings.
The library exposes a single entry point: a CorpusApi factory that opens the corpus DB and provides hybridSearch (FTS5 BM25 + vector cosine fused), plus the universal ingestion gate insertCorpusEntry.
A corpus CLI ships alongside the library for batch indexing, search, health checks, and one-shot migrations.
Install
bun add @voidwire/corpus
Bun >= 1.0.0 required. Corpus uses bun:sqlite and is not Node-compatible without compilation changes.
Public API
The package exports exactly five symbols from @voidwire/corpus:
createCorpusDb(dbPath: string): CorpusApi
Factory that returns a CorpusApi instance for the corpus SQLite database at dbPath. Initialization is lazy: the database file is not touched until the first method call on the returned api, at which point the database is opened (or created), migrations run, and the sqlite-vec extension is loaded.
getCorpusPath(): string
Returns the configured corpus DB path from ~/.config/corpus/config.toml. Lazy — getConfig() fires only when this function is called, never at module evaluation. Use as the argument to createCorpusDb for consumers that want to honour the user's configured path.
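The lazy behaviour described for createCorpusDb and getCorpusPath can be pictured with a small deferred-initializer pattern. This is a minimal sketch, not the package's actual implementation: the lazy helper, getConfigSketch, and the example path are all hypothetical names introduced for illustration.

```typescript
// Hypothetical helper, not part of @voidwire/corpus: defers an expensive
// initializer until the first call and caches the result afterwards.
function lazy<T>(init: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: init() };
    }
    return cached.value;
  };
}

// Stand-in for reading ~/.config/corpus/config.toml; counts how often it runs.
let configReads = 0;
const getConfigSketch = lazy(() => {
  configReads += 1;
  return { corpus_path: "/tmp/example/corpus.db" }; // assumed example value
});

// Nothing has been read at module evaluation time.
const readsBeforeFirstCall = configReads;

// The first call triggers the read; later calls reuse the cached value.
const firstPath = getConfigSketch().corpus_path;
const secondPath = getConfigSketch().corpus_path;
```

The practical effect is that importing a module built this way has no filesystem side effects; the cost is paid once, at first use.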
interface CorpusApi
Methods (selected): getDb(), insertCorpusEntry(entry, embedding, opts?), hybridSearch(query, options?), plus query helpers (queryByProject, queryByAgentType, etc.) and a few internal helpers used by the consolidation engine.
interface HybridSearchOptions
{ limit?, source?, type?, project?, sourceFilter?, sourceExclude?, includeSuperseded? } — passed to hybridSearch. sourceFilter whitelists sources; sourceExclude blacklists them (exclusion wins on conflict).
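The interaction between sourceFilter and sourceExclude can be sketched as a standalone predicate. Only the two option names come from this README; the string-array types, the sourceAllowed function, and the source names in the example are assumptions made for illustration, and the library's real matching logic may differ.

```typescript
// Assumed shape: both options are taken to be arrays of source names.
interface SourceFilters {
  sourceFilter?: string[];
  sourceExclude?: string[];
}

// Hypothetical predicate illustrating "exclusion wins on conflict".
function sourceAllowed(source: string, opts: SourceFilters): boolean {
  // An excluded source is rejected even if it is also whitelisted.
  if (opts.sourceExclude?.includes(source)) return false;
  // A non-empty whitelist admits only the sources it names.
  if (opts.sourceFilter && opts.sourceFilter.length > 0) {
    return opts.sourceFilter.includes(source);
  }
  // With no whitelist, everything not excluded passes.
  return true;
}

// "notes" appears in both lists, so it is dropped.
const opts: SourceFilters = {
  sourceFilter: ["commits", "notes"],
  sourceExclude: ["notes"],
};
const kept = ["commits", "notes", "transcripts"].filter((s) =>
  sourceAllowed(s, opts),
);
```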
interface HybridSearchResult
{ rowid, source, title, content, topic, type, timestamp, score, vectorScore, textScore } — one row per match, fused score in [0, 1].
Quick start
import { createCorpusDb, getCorpusPath } from "@voidwire/corpus";
const corpus = createCorpusDb(getCorpusPath());
const results = await corpus.hybridSearch("session state worker hook", {
limit: 10,
});
for (const r of results) {
console.log(`${r.score.toFixed(3)} ${r.source} ${r.title}`);
}
Configuration
Corpus reads its config from ~/.config/corpus/config.toml. Required keys (validated at first getConfig() call — missing keys throw):
- custom_sqlite: path to the SQLite library that supports loadable extensions
- llm_service: service identifier consumed by @voidwire/llm-core
- transcripts_dir: Claude Code transcript root
- projects_dir: root of project directories scraped for state-worker signals
- assistant_sessions_dir: session JSONL root
- assistant_operators_dir: operator log root
- pipeline_log_path: absolute path for pipeline append-log
- corpus_path: absolute path to corpus.db
- tank_path: absolute path to the holding-tank DB
Optional indexer paths and embed-provider overrides are also read from this file. Run corpus init to write a starter config.
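The required-key check described above might look roughly like the following. This is a sketch under stated assumptions: the key names are taken from this README, but findMissingKeys, assertRequiredKeys, the error message, and the rule that every value must be a non-empty string are hypothetical and not confirmed by the package.

```typescript
// Required keys as listed in this README.
const REQUIRED_KEYS = [
  "custom_sqlite",
  "llm_service",
  "transcripts_dir",
  "projects_dir",
  "assistant_sessions_dir",
  "assistant_operators_dir",
  "pipeline_log_path",
  "corpus_path",
  "tank_path",
] as const;

// Parsed TOML is modelled as a plain record for this sketch.
type RawConfig = Record<string, unknown>;

// Hypothetical helper: returns the required keys that are absent or empty.
function findMissingKeys(raw: RawConfig): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = raw[key];
    return typeof value !== "string" || value.length === 0;
  });
}

// Hypothetical validator: throws on the first call that sees missing keys.
function assertRequiredKeys(raw: RawConfig): void {
  const missing = findMissingKeys(raw);
  if (missing.length > 0) {
    throw new Error(`config is missing required keys: ${missing.join(", ")}`);
  }
}

// A config with only corpus_path set fails validation.
let validationError = "";
try {
  assertRequiredKeys({ corpus_path: "/tmp/example/corpus.db" });
} catch (err) {
  validationError = (err as Error).message;
}
```

Reporting all missing keys at once, rather than failing on the first, lets a user fix the file in a single pass.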
CLI
The package installs a corpus binary:
- corpus init [--dry-run] [--force]: bootstrap config.toml and XDG dirs
- corpus index [source] [--rebuild]: run batch indexers. corpus index personal is idempotent: re-runs skip already-indexed entries (matched by a stable input_hash) and retire entries removed from source. Add --refresh-enrichment (personal/all) to clear all personal rows and re-index with fresh LLM enrichment; run this once after upgrading to the idempotent indexer to reset pre-existing personal rows that predate input_hash.
- corpus search "<query>" [--source X] [--project Y] [--type Z] [--limit N] [--json]: hybrid search
- corpus doctor: read-only health check (config, DB, sqlite-vec, embed, LLM, plugins)
- corpus stats [--growth] [--starvation] [--json]: corpus composition + optional views
- corpus aging [--dry-run] [--verbose]: type-aware nightly aging (transition rows to confidence=stale / kind=archived); pinned rows are protected
- corpus delete --source <name> [--confirm]: cascade-delete every row of a source across all tables in one transaction; dry-run by default, --confirm mutates and writes an audit JSONL to ~/.local/state/corpus/audit/
- corpus migrate-from-lore [--dry-run] [--resume] [--lore-db <path>]: one-shot migration from a legacy lore.db
Every subcommand accepts --help / -h for usage; corpus --help lists all subcommands.
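The idempotent behaviour of corpus index personal (skip entries whose input_hash is already indexed, retire entries that have disappeared from source) can be illustrated with a small set comparison. This is a sketch of the idea only: planIndexRun, IndexPlan, and the example hash strings are hypothetical, and the real indexer's hashing and storage are not shown here.

```typescript
interface IndexPlan {
  toIndex: string[]; // hashes present in source but not yet indexed
  toSkip: string[]; // hashes already indexed and still present in source
  toRetire: string[]; // hashes indexed earlier but no longer in source
}

// Hypothetical planner: compares the hashes found in source with the
// hashes already stored, and decides what a run needs to do.
function planIndexRun(
  sourceHashes: string[],
  indexedHashes: string[],
): IndexPlan {
  const inSource = new Set(sourceHashes);
  const indexed = new Set(indexedHashes);
  return {
    toIndex: sourceHashes.filter((h) => !indexed.has(h)),
    toSkip: sourceHashes.filter((h) => indexed.has(h)),
    toRetire: indexedHashes.filter((h) => !inSource.has(h)),
  };
}

// First run: two new entries, one unchanged, one removed from source.
const firstRun = planIndexRun(["a1", "b2", "c3"], ["a1", "z9"]);

// Re-run with nothing changed: every entry is skipped, nothing is retired.
const secondRun = planIndexRun(["a1", "b2", "c3"], ["a1", "b2", "c3"]);
```

Because the plan depends only on the two hash sets, running it again after a successful index produces no new work, which is the property that makes re-runs safe.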
launchd schedules (operator install)
Two launchd plists ship in the repo. Install per-user with:
cp info.voidwire.corpus.pipeline.plist ~/Library/LaunchAgents/
cp info.voidwire.corpus.aging.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/info.voidwire.corpus.pipeline.plist # daily 03:00
launchctl load ~/Library/LaunchAgents/info.voidwire.corpus.aging.plist # daily 04:00
Exit codes: 0 success, 1 usage error, 2 runtime error.
Runtime requirement
Bun >= 1.0.0. The package ships TypeScript source (.ts) — Bun resolves it directly. There is no Node build target.
License
MIT
