@adia-ai/a2ui-corpus
v0.6.47
AdiaUI A2UI training corpus — canonical v0.9 catalog + chunks + eval fixtures + feedback + gap registry. Consumed by the compose engine's retrieval layer + the MCP pipeline.
Corpus and operational-learning artifacts for the gen-UI pipeline — chunks (the canonical retrieval surface), feedback, gaps, eval fixtures. Pure data plus the scripts that maintain it. No runtime.
The pipeline reads this package; this package never reads the pipeline. See
@adia-ai/a2ui-compose for engine code, @adia-ai/web-components for UI atoms, @adia-ai/a2ui-runtime for the A2UI runtime (renderer, registry, streams, wiring), and @adia-ai/a2ui-mcp for the MCP server.
Install
npm install @adia-ai/a2ui-corpus
This package is pure data and is typically consumed transitively by @adia-ai/a2ui-compose and @adia-ai/a2ui-mcp, which list it as a runtime dependency. Direct installs are useful for offline eval tooling or building bespoke retrieval layers.
Dependency direction
a2ui-compose ──reads──▶ a2ui-corpus
a2ui-mcp ──reads──▶ a2ui-compose, a2ui-corpus
web-components ──used-by──▶ apps/, playgrounds/, catalog/ (chunk sources)
No back-writes. No circular reads. Web-components ships UI atoms only;
corpus lives here; runtime lives in a2ui-compose. The chunk pipeline
harvests data-chunk-tagged regions from the apps that consume
web-components, so the read-direction stays one-way.
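The one-way rule can be checked mechanically. The sketch below is illustrative only: the edge list is copied from the diagram above, while `findCycle` is a hypothetical helper, not something this package ships.

```javascript
// Read edges taken from the dependency diagram; the checker is an illustration.
const reads = {
  'a2ui-compose': ['a2ui-corpus'],
  'a2ui-mcp': ['a2ui-compose', 'a2ui-corpus'],
  'a2ui-corpus': [],
};

// Depth-first search that returns the first cycle found, or null if none exists.
function findCycle(graph) {
  const state = new Map(); // node -> 'visiting' | 'done'
  const path = [];

  function visit(node) {
    if (state.get(node) === 'done') return null;
    if (state.get(node) === 'visiting') {
      // The node is already on the current path, so the path closes a loop.
      return [...path.slice(path.indexOf(node)), node];
    }
    state.set(node, 'visiting');
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    state.set(node, 'done');
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

console.log(findCycle(reads)); // null: nothing reads back into its reader
```

A back-write such as `'a2ui-corpus': ['a2ui-compose']` would make the function return the offending loop instead of `null`.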
Glossary
The corpus has converged on one retrievable concept: the chunk.
| Term | Source of truth | Granularity | What it carries | Engine that consumes it |
| --- | --- | --- | --- | --- |
| chunk | chunks/<id>.json (one file per chunk) + _index.json | A single labeled HTML region, harvested via data-chunk markers | html, intent, domain, kind (block / page / panel), keywords, metadata (when annotated), template (transpiled A2UI tree, when annotated) | chunk-zettel + monolithic-pro + zettel (composition-library wraps annotated chunks) |
Annotated vs raw chunks: every chunk has source/page provenance,
but only chunks carrying data-chunk-{domain,description,keywords,kind}
attributes on their source HTML become retrievable as compositions
(the harvester's transpile pass produces a template + lifts the
metadata block onto them). Raw chunks remain as substrate for
nested-expand reference resolution but don't compete in retrieval.
Grounding rule (locked v0.4.6, enforced v0.4.7): every retrievable
unit MUST trace to a real page under site/pages/, apps/,
playgrounds/, or catalog/ via its data-chunk-* annotations. No
hand-authored ungrounded JSON lives in corpus/.
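As a rough illustration of the two rules above, the sketch below separates annotated from raw chunks and checks grounding. The field names follow the glossary table, but the exact JSON layout (in particular `source.page`) is an assumption, and both helpers are hypothetical.

```javascript
// Roots named by the grounding rule.
const GROUNDED_ROOTS = ['site/pages/', 'apps/', 'playgrounds/', 'catalog/'];

// Annotated chunks carry a metadata block and a transpiled template;
// raw chunks have neither and stay out of retrieval.
function isRetrievable(chunk) {
  return Boolean(chunk.metadata && chunk.template);
}

// A chunk counts as grounded when its source page sits under an allowed root.
// `source.page` is an assumed field name.
function isGrounded(chunk) {
  const page = chunk.source?.page ?? '';
  return GROUNDED_ROOTS.some((root) => page.startsWith(root));
}

// Hypothetical records for illustration only.
const annotated = {
  id: 'billing-summary',
  source: { page: 'apps/billing/app/summary/summary.contents.html' },
  metadata: { domain: 'billing', kind: 'block', keywords: ['billing'] },
  template: { component: 'Card', children: [] },
};
const raw = { id: 'footer-links', source: { page: 'site/pages/home.html' } };

console.log(isRetrievable(annotated), isGrounded(annotated)); // true true
console.log(isRetrievable(raw), isGrounded(raw)); // false true
```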
Historical note: Fragments (atomic A2UI sub-trees with named slots, §37, 2026-05-12), patterns (hand-authored full-canvas A2UI templates,
patterns/ dir), and compositions (hand-authored multi-section A2UI surfaces, compositions/ dir) were earlier corpus formats. Fragments retired in v0.4.4 (§37). Patterns + compositions retired in v0.4.7 (§72, the v0.4.6 carryover of §65). Their retrievable equivalents were either (a) already represented in the chunks corpus via annotated data-chunk-* regions, or (b) deemed non-grounded under the locked rule and deleted in §73 Mode-C triage (target for v0.4.7). See docs/journal/2026/05/2026-05-12.md §§ 36-42, §65, §72 for the multi-arc retirement narrative.
Layout
a2ui/corpus/
├── chunks/ retrievable + raw chunks (~190 entries)
│ — one JSON per chunk; carries `source`,
│ `metadata` (when annotated), `template`
│ (transpiled A2UI tree, when annotated).
│ Harvested by `scripts/build/harvest-chunks.mjs`
│ from data-chunk markers across site/pages/*,
│ apps/*, playgrounds/*, catalog/*.
│ └── _index.json harvester output — name + by-kind tallies +
│ normalized chunk list (shape consumed by
│ chunk-loader + composition-library)
│
├── evals/ held-out.jsonl + eval fixtures
├── feedback/ daily JSONL — user feedback events
├── gaps/ gap registry — prompts with missing coverage
├── scripts/ maintenance tooling (extract, ingest, feedback, ticket)
│ └── chunk-library.js in-memory loader + keyword/semantic search over chunks/
│
├── catalog-a2ui_0_9.json aggregated artifact — what a2ui-compose + a2ui-mcp read
├── catalog-a2ui_0_9_rules.txt natural-language composition rules (per component)
├── common_types.json shared A2UI type shapes
├── chunk-embeddings.json pre-computed embeddings for chunk semantic search (0.0.4+)
├── functions.json declarative wiring-engine function catalog
├── manifest.json extraction metadata (what / when / counts)
├── pattern-specs.md written specs for each pattern category (historical reference)
└── data-flow.md how the signal sources feed the pipeline
What's committed vs generated vs published
| Kind | Committed | Published to npm | Source of truth |
|----------------------------|:---------:|:----------------:|----------------------------------------|
| Chunks (chunks/) | ✓ | ✓ | scripts/build/harvest-chunks.mjs (from data-chunk markers across site/pages, apps, playgrounds, catalog) |
| catalog-a2ui_0_9.json | ✓ | ✓ | npm run components (assembled from yamls in packages/web-components/) |
| chunk-embeddings.json | ✓ | ✗ since 0.2.1| scripts/build/embeddings-chunks.mjs |
| Feedback JSONL | ✓ | ✓ | Written by @adia-ai/a2ui-retrieval at runtime |
| Gap registry | ✓ | ✓ | Written by @adia-ai/a2ui-retrieval at runtime |
Extracted artifacts are committed for convenience (avoids a build step to
read the pipeline), but the scripts are authoritative — regenerate via
npm run harvest:chunks (chunks), npm run components (catalog), or
npm run build:embeddings:chunks (chunk embeddings) if anything drifts.
Why embeddings ship via git, not npm
chunk-embeddings.json (~20 MB) is committed in git so the monorepo's
own pipeline runs without a network round-trip, but excluded from the
published npm tarball — every npm i @adia-ai/a2ui-corpus was pulling
~20 MB of pre-computed float arrays that consumers had no reliable way
to address (the chunk-embedding-retriever resolves them via a
relative-path that breaks under a node_modules/@adia-ai/a2ui-corpus/
install layout).
The companion pattern-embeddings.json retired in v0.4.7 §72 along with the patterns/ source directory. Its only consumer was concept-mapper.js (dead post-v0.4.6 §64 retirement of pattern-library.js). The pattern-embeddings build script (scripts/build/embeddings.mjs) and the matching embedding-retriever.js were retired in the same arc.
Consumers who want embedding-based retrieval either:
- Regenerate locally: npm run build:embeddings:chunks produces the chunk-embeddings.json file in your node_modules/@adia-ai/a2ui-corpus/ checkout. Requires API access to your embedding provider.
- Use the keyword-only fallback: chunk-library.searchChunks() works without embeddings; the embedding-aware searchChunksAsync path falls through to keyword scoring when the index file is absent (chunk-embedding-retriever.js returns null gracefully).
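The fall-through behaviour described in the second option can be sketched as follows. This is not the package's implementation: `keywordSearch`, `searchWithFallback`, the scoring rule, and the chunk shape are all assumptions made for illustration. The only detail taken from the text is that a retriever returning `null` hands control to keyword scoring.

```javascript
// Count how many query terms appear among a chunk's keywords (assumed scoring rule).
function keywordScore(query, chunk) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const keywords = new Set(chunk.keywords.map((k) => k.toLowerCase()));
  return terms.filter((term) => keywords.has(term)).length;
}

// Rank chunks by keyword overlap, dropping chunks that match nothing.
function keywordSearch(query, chunks) {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: keywordScore(query, chunk) }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Try the embedding retriever first; a null result means "no index available",
// so fall through to keyword scoring.
async function searchWithFallback(query, chunks, embeddingRetriever) {
  const hits = await embeddingRetriever(query);
  return hits ?? keywordSearch(query, chunks);
}

// Hypothetical chunks and a retriever that behaves as if the index file is absent.
const sampleChunks = [
  { id: 'billing-summary', keywords: ['billing', 'invoice', 'summary'] },
  { id: 'login-form', keywords: ['login', 'auth', 'form'] },
];
const missingIndexRetriever = async () => null;

searchWithFallback('billing summary', sampleChunks, missingIndexRetriever)
  .then((hits) => console.log(hits));
```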
Embedding model pinning
The provider and model recorded in each *-embeddings.json header
are the source of truth at query time. The retrievers
(chunk-embedding-retriever.js, embedding-retriever.js) re-resolve
the same embedder from those header fields — they do not auto-pick
a different provider when the recorded one's API key is unset, because
cross-model cosine similarity is meaningless and same-provider/
different-model emits different-dim vectors that cosine()
short-circuits to 0 (silent retrieval failure).
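To make the failure mode concrete, here is a minimal sketch of a cosine function that returns 0 on a dimension mismatch, plus a header-driven resolver that refuses to switch providers. Both functions are hypothetical; the header fields (`provider`, `model`) and the choice to throw when the key is unset are assumptions, not the retrievers' actual behaviour.

```javascript
// Cosine similarity that returns 0 when the vectors have different dimensions.
function cosine(a, b) {
  if (a.length !== b.length) return 0; // e.g. 1536-dim query vs 1024-dim index
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Env variable per provider, taken from the pinned-defaults table.
const PROVIDER_ENV = { openai: 'OPENAI_API_KEY', voyage: 'VOYAGE_API_KEY' };

// Re-bind to the provider/model recorded in the index header; never fall back
// to a different provider. Throwing here is an assumed policy.
function resolveEmbedder(header, env) {
  const envVar = PROVIDER_ENV[header.provider];
  if (!envVar) throw new Error(`Unknown provider: ${header.provider}`);
  if (!env[envVar]) {
    throw new Error(`${envVar} is unset; refusing to switch away from ${header.provider}/${header.model}`);
  }
  return { provider: header.provider, model: header.model };
}

console.log(cosine([1, 0], [1, 0]));    // 1
console.log(cosine([1, 0], [1, 0, 0])); // 0: mismatched dimensions score nothing
```

Because a mismatch scores 0 instead of raising, every candidate looks equally irrelevant, which is why pinning to the header matters more than picking whichever key happens to be set.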
Currently pinned defaults:
| Provider | Model | Dims | Env |
|---|---|---:|---|
| openai | text-embedding-3-small | 1536 | OPENAI_API_KEY |
| voyage | voyage-3-lite | 1024 | VOYAGE_API_KEY |
detectProvider() (in packages/a2ui/retrieval/embedding/embedding-provider.js)
prefers Voyage when both keys are present (denser vectors, lower cost).
The build:embeddings* scripts record the chosen provider/model into the
.json header, so subsequent reads always re-bind to the same model.
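The selection order can be pictured with the sketch below. It is not the code in embedding-provider.js: `pickProvider` is a hypothetical stand-in that uses only the preference order and the defaults from the table above.

```javascript
// Pinned defaults copied from the table above.
const DEFAULTS = {
  voyage: { provider: 'voyage', model: 'voyage-3-lite', dims: 1024, env: 'VOYAGE_API_KEY' },
  openai: { provider: 'openai', model: 'text-embedding-3-small', dims: 1536, env: 'OPENAI_API_KEY' },
};

// Prefer Voyage when its key is present, then OpenAI; null when neither key is set.
function pickProvider(env) {
  for (const name of ['voyage', 'openai']) {
    if (env[DEFAULTS[name].env]) return DEFAULTS[name];
  }
  return null;
}

console.log(pickProvider({ OPENAI_API_KEY: 'a', VOYAGE_API_KEY: 'b' }).provider); // voyage
console.log(pickProvider({ OPENAI_API_KEY: 'a' }).provider); // openai
```

This choice applies at build time only; at query time the recorded header decides, as described above.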
When upgrading to a new model (e.g. text-embedding-3-small →
text-embedding-3-large):
- Update the default in embedding-provider.js.
- Rebuild the chunk index (npm run build:embeddings:chunks).
- Verify with npm run check:embeddings-fresh that both index headers record the new model.
- Re-run npm run eval:diff -- --engine zettel to confirm the new model doesn't regress retrieval quality. Different models score queries differently, so thresholds in chunk-synthesizer.js may need a re-look, though the absolute keyword score floor remains independent.
Don't mix models. If one index records voyage-3-lite and the other
records text-embedding-3-small, the retrievers will load both fine
but the rankings will be incomparable across the two corpora.
Scripts
All run from repo root via npm:
npm run harvest:chunks # full re-harvest of chunks/ from data-chunk markers
npm run components # regenerate v0.9 sidecars + assemble catalog-a2ui_0_9.json
npm run components -- --verify # fail if catalog/sidecars are stale vs yamls
npm run feedback:report # human-readable feedback digest
npm run feedback:promote # promote high-confidence feedback → new training data
npm run ticket # open ticket tracker
npm run ticket:list # list open tickets
npm run ticket:create # create a ticket against corpus/pipeline
Script inventory (scripts/):
| Script | Purpose |
|---------------------------|------------------------------------------------------|
| chunk-library.js | In-memory loader + keyword/semantic search over chunks/ |
| feedback-report.js | Aggregates feedback JSONL into a readable digest |
| feedback-promote.js | Moves high-confidence feedback into training data |
| ticket.mjs | Corpus/pipeline issue tracker |
Retired in v0.4.7 §72 (with the patterns/ + compositions/ dirs): extract.js, ingest.js, run-pipeline.mjs, build-pattern-index.mjs. These fed the pattern-library retrieval surface (pattern-library.js, retired v0.4.6 §64) and the exemplar→chunks pipeline (retired v0.4.4 §36). Pattern embeddings (scripts/build/embeddings.mjs) and the grounded-corpus triage audit (scripts/audit/grounded-corpus-triage.mjs) also retired in the same arc; the corpus is now one-format (chunks-only) and harvester-driven.
Repo-side build scripts (not in tarball; run from the workspace root):
| Script | Purpose |
|-------------------------------------|----------------------------------------------------------|
| npm run harvest:chunks | Walks site/pages/, apps/, playgrounds/, catalog/, harvests every [data-chunk] element, writes chunks/<name>.json + _index.json |
| npm run build:embeddings:chunks | Generates chunk-embeddings.json (~190 chunks × 1536d) |
Exports
// Catalog — the aggregated read-target for engines.
// Carries per-component aliases under `components[name].x-adiaui.synonyms.tags`.
import catalog from '@adia-ai/a2ui-corpus';
// Chunk corpus (since 0.0.3 / 0.0.4)
import chunkIndex from '@adia-ai/a2ui-corpus/chunks'; // _index.json (metadata only)
import { searchChunks, searchChunksAsync, getChunk }
from '@adia-ai/a2ui-corpus/chunk-library'; // in-memory query API
Authoring order: demo page → data-chunk marker → training
When adding coverage for a new intent:
- Live demo page: author the HTML in apps/<name>/app/<demo>/<demo>.contents.html (or under playgrounds/ or catalog/) using pure primitive composition. See repo-root AGENTS.md.
- Tag the reusable region: add data-chunk="<slug>" + data-chunk-kind="<kind>" (block / page / panel / field) on the element. The harvester extracts the bounding HTML on the next build. See docs/specs/genui-chunk-marker.md for the marker convention.
- Harvest + ingest: npm run harvest:chunks writes chunks/<slug>.json and refreshes _index.json. npm run pipeline does the full extract → ingest → catalog refresh.
- Verify: npm run eval:diff -- --engine zettel should still hold coverage ≥ 83%, avgScore ≥ 88 (per the regression floors in AGENTS.md).
See data-flow.md for the full pipeline (chunks → feedback).
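For a feel of what the tagging step exposes to the harvester, the sketch below pulls data-chunk attributes out of an HTML string. It is a simplification under stated assumptions: it scans opening tags with regular expressions, whereas the real harvester (scripts/build/harvest-chunks.mjs) extracts the bounding HTML and may work quite differently. `findChunkMarkers` and the sample markup are hypothetical.

```javascript
// Collect data-chunk* attributes from every opening tag that carries data-chunk="...".
function findChunkMarkers(html) {
  const markers = [];
  const tagPattern = /<[a-zA-Z][^>]*\sdata-chunk="[^"]*"[^>]*>/g;
  const attrPattern = /\sdata-chunk(?:-([a-z]+))?="([^"]*)"/g;
  for (const [tag] of html.matchAll(tagPattern)) {
    const marker = {};
    for (const [, suffix, value] of tag.matchAll(attrPattern)) {
      // The bare data-chunk attribute holds the slug; suffixed ones keep their suffix.
      marker[suffix ?? 'slug'] = value;
    }
    markers.push(marker);
  }
  return markers;
}

// Hypothetical demo markup.
const demoHtml = `
<main>
  <section data-chunk="billing-summary" data-chunk-kind="block" data-chunk-domain="billing">
    <h2>Billing</h2>
  </section>
  <p>Untagged content is ignored.</p>
</main>`;

console.log(findChunkMarkers(demoHtml));
// [ { slug: 'billing-summary', kind: 'block', domain: 'billing' } ]
```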
Regression floors
The pipeline must hold these thresholds — tracked in the held-out benchmark:
- Fragment reuse ratio ≥ 29.9% — 167 refs / 559 composition nodes
- Zettel: coverage 100%, avgScore ≥ 88, MRR ≥ 0.94
- Monolithic: coverage 100%, avgScore ≥ 95
- Dogfood: 20/20 intents at avg ≥ 95
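The floors above lend themselves to a simple automated gate. The sketch below is a hypothetical checker, not the benchmark harness: the floor values and the 167 / 559 figure come from the list, while the result shape and `checkFloors` are assumptions.

```javascript
// Fragment reuse ratio from the figures above: 167 refs over 559 composition nodes.
const reuseRatio = 167 / 559;
console.log(`${(reuseRatio * 100).toFixed(1)}%`); // 29.9%

// Floors copied from the list; the object layout is an assumption.
const FLOORS = {
  zettel: { coverage: 100, avgScore: 88, mrr: 0.94 },
  monolithic: { coverage: 100, avgScore: 95 },
};

// Return one message per metric that is missing or below its floor.
function checkFloors(measured, floors) {
  const failures = [];
  for (const [engine, limits] of Object.entries(floors)) {
    for (const [metric, floor] of Object.entries(limits)) {
      const value = measured[engine]?.[metric];
      if (value === undefined || value < floor) {
        failures.push(`${engine}.${metric}: ${value} < ${floor}`);
      }
    }
  }
  return failures;
}

// Hypothetical run in which Zettel's average score slipped by one point.
const run = {
  zettel: { coverage: 100, avgScore: 87, mrr: 0.95 },
  monolithic: { coverage: 100, avgScore: 96 },
};
console.log(checkFloors(run, FLOORS)); // [ 'zettel.avgScore: 87 < 88' ]
```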
What this package does NOT contain
- Pipeline runtime: gen-ui/
- UI custom elements: web-components/
- MCP transport: gen-ui-mcp/
- Site / playground UI: /site/
If a file here is .js / .mjs, it's a maintenance script, not runtime.
Runtime readers go through gen-ui/retrieval/*.
License
MIT
