@nusoft/nuos-build-catalogue
v0.28.0
Published
NuOS build-catalogue tooling: semantic search (WU 110) + migration runner that lifts markdown artefacts into JSON-backed workflow records (WU 111, Phase G).
nuos-build-catalogue
Indexes the NuOS build catalogue (docs/build/, docs/contracts/, docs/philosophy/, docs/guides/) into NuVector for semantic search. Implements WU 110.
This is the first concrete step in NuOS taking over its own build, per D040. Before WU 110, finding things in the catalogue meant grep. After WU 110, it means semantic queries with metadata filters.
Setup
npm install
The embedder is selected via NUOS_CATALOGUE_EMBEDDER:
| Value | Provider | Default model | Dimensions | Notes |
|---|---|---|---|---|
| ollama (default) | Local Ollama | qwen3-embedding:8b | 4096 | Sovereignty by default. No network egress. Override the model with NUOS_CATALOGUE_OLLAMA_MODEL=qwen3-embedding:4b (2560 dims) or qwen3-embedding:0.6b (1024 dims) for smaller boxes. Needs ollama serve running and the model pulled (ollama pull qwen3-embedding:8b). |
| vertex | Google Vertex | text-embedding-005 | 768 | Cloud Google. Needs GOOGLE_CLOUD_PROJECT plus a Vertex access token (set GOOGLE_VERTEX_ACCESS_TOKEN, or have gcloud on PATH and run gcloud auth application-default login). |
| openai | OpenAI | text-embedding-3-small | 1536 | Cloud OpenAI. Needs OPENAI_API_KEY. |
| stub | Hash-based, no API | — | 384 | Tests + dev only. Results are noisy. |
Switching embedder (or model variant) requires a full reindex (rm -rf .nuos-catalogue && npm run index) because dimensions differ.
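The selection rules in the table can be sketched as a small resolver. This is an illustration only, not the package's actual code: `EmbedderChoice`, `resolveEmbedder`, and `needsFullReindex` are hypothetical names, and the only facts used are the environment variables, default models, and dimensions listed above.

```typescript
// Hypothetical helper types and names; values mirror the embedder table above.
type Provider = "ollama" | "vertex" | "openai" | "stub";

interface EmbedderChoice {
  provider: Provider;
  model: string | null; // the stub embedder has no model
  dimensions: number;
}

// Dimensions for the Ollama model variants named in the table.
const OLLAMA_DIMENSIONS: Record<string, number> = {
  "qwen3-embedding:8b": 4096,
  "qwen3-embedding:4b": 2560,
  "qwen3-embedding:0.6b": 1024,
};

function resolveEmbedder(env: Record<string, string | undefined>): EmbedderChoice {
  const value = env.NUOS_CATALOGUE_EMBEDDER ?? "ollama";
  switch (value) {
    case "ollama": {
      const model = env.NUOS_CATALOGUE_OLLAMA_MODEL ?? "qwen3-embedding:8b";
      const dimensions = OLLAMA_DIMENSIONS[model];
      if (dimensions === undefined) {
        throw new Error(`Unknown Ollama embedding model: ${model}`);
      }
      return { provider: "ollama", model, dimensions };
    }
    case "vertex":
      return { provider: "vertex", model: "text-embedding-005", dimensions: 768 };
    case "openai":
      return { provider: "openai", model: "text-embedding-3-small", dimensions: 1536 };
    case "stub":
      return { provider: "stub", model: null, dimensions: 384 };
    default:
      throw new Error(`Unsupported NUOS_CATALOGUE_EMBEDDER value: ${value}`);
  }
}

// Vectors from different embedders or model variants are not comparable,
// so any change in provider, model, or width means the index must be rebuilt.
function needsFullReindex(previous: EmbedderChoice, next: EmbedderChoice): boolean {
  return (
    previous.provider !== next.provider ||
    previous.model !== next.model ||
    previous.dimensions !== next.dimensions
  );
}
```

For example, `resolveEmbedder({})` yields the Ollama 8b default at 4096 dimensions, and switching to `qwen3-embedding:0.6b` makes `needsFullReindex` return `true`.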
Quick start
# Pre-flight (one time):
ollama serve # in another shell
ollama pull qwen3-embedding:8b # ~4.7 GB download
# Index the catalogue (first time — takes ~20 min on 8b)
npm run index
# Search
npm run search -- "module boundary enforcement"
npm run search -- "epistemic discipline" --kind=decision --limit=5
npm run search -- "EHCP lifecycle" --json
# Re-run after editing a few files (only changed files re-embed)
npm run index
Storage
Index lives at .nuos-catalogue/index.nv (file-backed NuVector store, REDB underneath) plus a sibling hashes.json mapping each file path to its content SHA-256 + the chunk IDs it produced. Both are committed to git so the index state is reproducible across machines (Topology A per D041).
If .nuos-catalogue/index.nv ever gets corrupt or out-of-sync, delete the directory and re-run npm run index.
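To make the role of hashes.json concrete, here is a minimal sketch of hash-based change detection. The `HashEntry` shape, the `planReindex` name, and the idea of collecting stale chunk IDs for cleanup are assumptions for illustration; the source only states that the file maps each path to its content SHA-256 and the chunk IDs it produced.

```typescript
// Hypothetical shape of one hashes.json entry (field names are assumptions).
interface HashEntry {
  sha256: string;
  chunkIds: string[];
}
type HashManifest = Record<string, HashEntry>;

interface ReindexPlan {
  toEmbed: string[];       // new or modified files
  unchanged: string[];     // content hash matches the manifest
  staleChunkIds: string[]; // chunks produced by modified or deleted files
}

// Compare the stored manifest with freshly computed content hashes.
function planReindex(
  previous: HashManifest,
  currentHashes: Record<string, string>,
): ReindexPlan {
  const plan: ReindexPlan = { toEmbed: [], unchanged: [], staleChunkIds: [] };

  for (const path of Object.keys(currentHashes).sort()) {
    const before = previous[path];
    if (before !== undefined && before.sha256 === currentHashes[path]) {
      plan.unchanged.push(path);
    } else {
      plan.toEmbed.push(path);
      // A modified file's old chunks no longer describe its content.
      if (before !== undefined) plan.staleChunkIds.push(...before.chunkIds);
    }
  }

  // Files present in the manifest but missing on disk were deleted.
  for (const path of Object.keys(previous).sort()) {
    const entry = previous[path];
    if (entry !== undefined && !(path in currentHashes)) {
      plan.staleChunkIds.push(...entry.chunkIds);
    }
  }
  return plan;
}
```

Sorting the paths keeps the plan deterministic, which matters when the resulting state is committed to git and compared across machines.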
Verification gate
Before any other code lands, the verification gate proves that the published NuVector package actually persists file-backed storage across process restarts. Re-run it any time you bump the NuVector dep or suspect storage is broken:
npm run verify-storage
Pass = file storage works in the published binary; the indexer can use it. Fail = something has regressed and the package needs a Postgres fallback or a NuVector fix.
Architecture in one paragraph
crawl walks the catalogue picking up .md files (skipping _index.md, done/, archive/, superseded/). Each file goes to chunkMarkdown which splits on H1/H2/H3 boundaries (preserving code fences) into ~600-token chunks with deterministic IDs. extractMetadata produces structured metadata per file (kind, idInKind, status, date, cross-refs). The Embedder then turns chunk texts into Float32Array vectors. The orchestrator (runIndex) only re-embeds files whose content hash has changed since the last run, then upserts them into NuVector as nuwiki_article_summary records. runSearch embeds the query and calls searchKnowledge to retrieve the top-K hits, which the formatter renders as a human-readable list or JSON.
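The chunking step described above could look roughly like the sketch below. It is not the real chunkMarkdown: the function name `chunkMarkdownSketch`, the `<path>#<ordinal>` ID scheme, and the whitespace-based token estimate are all assumptions made for illustration. It shows the two properties the paragraph names: splitting on H1/H2/H3 boundaries and never splitting inside a code fence.

```typescript
interface Chunk {
  id: string;      // hypothetical deterministic scheme: "<path>#<ordinal>"
  heading: string; // nearest H1-H3 heading, or "" before the first one
  text: string;
}

const HEADING = /^#{1,3}\s+(.*)$/;
const FENCE = /^\s*(```|~~~)/;

// Crude token estimate (whitespace-separated words); a real tokenizer would differ.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function chunkMarkdownSketch(filePath: string, markdown: string, maxTokens = 600): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buffer: string[] = [];
  let inFence = false;

  // Emit the buffered lines as one chunk, skipping empty buffers.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text !== "") {
      chunks.push({ id: `${filePath}#${chunks.length}`, heading, text });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    if (FENCE.test(line)) {
      // Simplification: any fence marker toggles the state, whatever its type.
      inFence = !inFence;
      buffer.push(line);
      continue;
    }
    // Lines starting with "#" inside a fence are code, not headings.
    const match = inFence ? null : HEADING.exec(line);
    if (match) {
      flush(); // a heading closes the previous chunk
      heading = match[1].trim();
      buffer.push(line);
      continue;
    }
    // Outside a fence, a blank line is a safe place to split an oversized chunk.
    if (!inFence && line.trim() === "" && estimateTokens(buffer.join("\n")) >= maxTokens) {
      flush();
      continue;
    }
    buffer.push(line);
  }
  flush();
  return chunks;
}
```

Because the IDs depend only on the file path and chunk order, re-running the sketch on unchanged content produces the same IDs, which is what makes upserts idempotent.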
Out of scope (Phase 0)
- Auto-running on commit — that's WU 128.
- A GUI — CLI only.
- Writing to NuVector via NuFlow workflows — that's WU 111.
- Compiling NuWiki articles from indexed content — WU 113–115.
- Multi-user / concurrent indexing.
- Adopting @nusoft/nuos; this package uses NuVector directly until WU 130 ships.
Known API quirks (NuVector v0.1.0)
Discovered during the WU 110 implementation; documented here so future contributors don't burn time rediscovering them:
- embedding must be a Float32Array, not a plain number[]. Plain arrays fail with "Get TypedArray info failed on NvMemoryRecord.embedding".
- Search results expose the upsert-time id as ref on each item (asymmetry with the input shape).
- tenant belongs on MemoryRecord (upsert) but not on SearchKnowledgeRequest; the store-level tenant from NuVector.open scopes search automatically.
- searchKnowledge operates on Layer 1 records (nuwiki_article_summary); to make a chunk retrievable directly, index it as nuwiki_article_summary. nuwiki_section is for sections within an enclosing article and requires searchSectionsInArticles with the article IDs.
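The first two quirks are easy to guard against with small helpers. The sketch below deliberately does not call NuVector: `UpsertRecordSketch` and `SearchHitSketch` are local, hypothetical stand-ins that carry only the fields named above (id, tenant, embedding, ref), and the helper names are invented for illustration.

```typescript
// Local stand-ins for the shapes described above; not NuVector's own type definitions.
interface UpsertRecordSketch {
  id: string;
  tenant: string;          // belongs on the upsert record, not on the search request
  embedding: Float32Array; // must be a typed array, never number[]
}

interface SearchHitSketch {
  ref: string; // carries the id that was supplied at upsert time
}

// Convert a plain array (for example, parsed from an HTTP JSON response) into a Float32Array.
function toEmbedding(values: readonly number[]): Float32Array {
  if (values.length === 0) {
    throw new Error("Embedding must not be empty");
  }
  for (const v of values) {
    if (!Number.isFinite(v)) {
      throw new Error("Embedding contains a non-finite value");
    }
  }
  return Float32Array.from(values);
}

function buildRecord(id: string, tenant: string, values: readonly number[]): UpsertRecordSketch {
  return { id, tenant, embedding: toEmbedding(values) };
}

// Map search hits back to the ids used at upsert time.
function idsFromHits(hits: readonly SearchHitSketch[]): string[] {
  return hits.map((hit) => hit.ref);
}
```

Converting at the boundary where embeddings enter the program means the typed-array requirement is enforced once, rather than at every upsert call site.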
Tests
npx tsx --test tests/
13 tests across chunk, metadata, crawl cover the indexing primitives. End-to-end is exercised by running npm run index then npm run search against the real catalogue.
License
Private; not published to npm.
