@voidwire/corpus

v0.4.0

Published

10 days ago

Personal knowledge pipeline — extracts signals from sources, deduplicates, and serves hybrid (FTS5 + vector) search over a local SQLite store

Downloads

280

0High
0Medium
0Low

nickpending

corpus knowledge search sqlite sqlite-vec fts5 embeddings bun cli

@voidwire/corpus

Personal knowledge pipeline. Extracts signals from sources, deduplicates, and serves hybrid (FTS5 + vector) search over a local SQLite store.

What it is

Corpus collects signals from project artifacts (transcripts, operator logs, session state, READMEs, commits, notes) into a holding tank, runs LLM-based consolidation (dedup + synthesis), and writes the result to a local SQLite database with FTS5 full-text search and sqlite-vec vector embeddings.

The library exposes a single entry point: a CorpusApi factory that opens the corpus DB and provides hybridSearch (FTS5 BM25 + vector cosine fused), plus the universal ingestion gate insertCorpusEntry.

A corpus CLI ships alongside the library for batch indexing, search, health checks, and one-shot migrations.

Install

bun add @voidwire/corpus

Bun >= 1.0.0 required. Corpus uses bun:sqlite and is not Node-compatible without compilation changes.

Public API

The package exports exactly five symbols from @voidwire/corpus:

`createCorpusDb(dbPath: string): CorpusApi`

Factory that opens (or creates) the corpus SQLite database at dbPath, runs migrations, loads the sqlite-vec extension, and returns a CorpusApi instance. Lazy: the database file is not touched until the first method call on the returned api.

`getCorpusPath(): string`

Returns the configured corpus DB path from ~/.config/corpus/config.toml. Lazy — getConfig() fires only when this function is called, never at module evaluation. Use as the argument to createCorpusDb for consumers that want to honour the user's configured path.

`interface CorpusApi`

Methods (selected): getDb(), insertCorpusEntry(entry, embedding, opts?), hybridSearch(query, options?), plus query helpers (queryByProject, queryByAgentType, etc.) and a few internal helpers used by the consolidation engine.

`interface HybridSearchOptions`

{ limit?, source?, type?, project?, sourceFilter?, sourceExclude?, includeSuperseded? } — passed to hybridSearch. sourceFilter whitelists sources; sourceExclude blacklists them (exclusion wins on conflict).

`interface HybridSearchResult`

{ rowid, source, title, content, topic, type, timestamp, score, vectorScore, textScore } — one row per match, fused score in [0, 1].

Quick start

import { createCorpusDb, getCorpusPath } from "@voidwire/corpus";

const corpus = createCorpusDb(getCorpusPath());
const results = await corpus.hybridSearch("session state worker hook", {
  limit: 10,
});
for (const r of results) {
  console.log(`${r.score.toFixed(3)}  ${r.source}  ${r.title}`);
}

Configuration

Corpus reads its config from ~/.config/corpus/config.toml. Required keys (validated at first getConfig() call — missing keys throw):

custom_sqlite — path to the SQLite library that supports loadable extensions
llm_service — service identifier consumed by @voidwire/llm-core
transcripts_dir — Claude Code transcript root
projects_dir — root of project directories scraped for state-worker signals
assistant_sessions_dir — session JSONL root
assistant_operators_dir — operator log root
pipeline_log_path — absolute path for pipeline append-log
corpus_path — absolute path to corpus.db
tank_path — absolute path to the holding-tank DB

Optional indexer paths and embed-provider overrides are also read from this file. Run corpus init to write a starter config.

CLI

The package installs a corpus binary:

corpus init [--dry-run] [--force] — bootstrap config.toml and XDG dirs
corpus index [source] [--rebuild] — run batch indexers. corpus index personal is idempotent: re-runs skip already-indexed entries (matched by a stable input_hash) and retire entries removed from source. Add --refresh-enrichment (personal/all) to clear all personal rows and re-index with fresh LLM enrichment — run this once after upgrading to the idempotent indexer to reset pre-existing personal rows that predate input_hash.
corpus search "<query>" [--source X] [--project Y] [--type Z] [--limit N] [--json] — hybrid search
corpus doctor — read-only health check (config, DB, sqlite-vec, embed, LLM, plugins)
corpus stats [--growth] [--starvation] [--json] — corpus composition + optional views
corpus aging [--dry-run] [--verbose] — type-aware nightly aging (transition rows to confidence=stale / kind=archived); pinned rows are protected
corpus delete --source <name> [--confirm] — cascade-delete every row of a source across all tables in one transaction; dry-run by default, --confirm mutates and writes an audit JSONL to ~/.local/state/corpus/audit/
corpus migrate-from-lore [--dry-run] [--resume] [--lore-db <path>] — one-shot migration from a legacy lore.db

Every subcommand accepts --help / -h for usage; corpus --help lists all subcommands.

launchd schedules (operator install)

Two launchd plists ship in the repo. Install per-user with:

cp info.voidwire.corpus.pipeline.plist ~/Library/LaunchAgents/
cp info.voidwire.corpus.aging.plist    ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/info.voidwire.corpus.pipeline.plist  # daily 03:00
launchctl load ~/Library/LaunchAgents/info.voidwire.corpus.aging.plist     # daily 04:00

Exit codes: 0 success, 1 usage error, 2 runtime error.

Runtime requirement

Bun >= 1.0.0. The package ships TypeScript source (.ts) — Bun resolves it directly. There is no Node build target.

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@voidwire/corpus

What it is

Install

Public API

createCorpusDb(dbPath: string): CorpusApi

getCorpusPath(): string

interface CorpusApi

interface HybridSearchOptions

interface HybridSearchResult

Quick start

Configuration

CLI

launchd schedules (operator install)

Runtime requirement

License

`createCorpusDb(dbPath: string): CorpusApi`

`getCorpusPath(): string`

`interface CorpusApi`

`interface HybridSearchOptions`

`interface HybridSearchResult`