@adia-ai/a2ui-corpus

v0.8.3

Published

an hour ago

AdiaUI A2UI training corpus — canonical v0.9 catalog + chunks + eval fixtures + feedback + gap registry. Consumed by the compose engine's retrieval layer + the MCP pipeline.


kimgish-adia

@adia-ai/a2ui-corpus

Corpus and operational-learning artifacts for the gen-UI pipeline — chunks (the canonical retrieval surface), feedback, gaps, eval fixtures. Pure data plus the scripts that maintain it. No runtime.

The pipeline reads this package; this package never reads the pipeline. See @adia-ai/a2ui-compose for engine code, @adia-ai/web-components for UI atoms, @adia-ai/a2ui-runtime for the A2UI runtime (renderer, registry, streams, wiring), @adia-ai/a2ui-mcp for the MCP server.

Install

npm install @adia-ai/a2ui-corpus

Pure data — typically consumed transitively by @adia-ai/a2ui-compose and @adia-ai/a2ui-mcp, which list it as a runtime dependency. Direct installs are useful for offline eval tooling or building bespoke retrieval layers.

Dependency direction

a2ui-compose   ──reads──▶  a2ui-corpus
a2ui-mcp       ──reads──▶  a2ui-compose, a2ui-corpus
web-components ──used-by──▶  apps/, playgrounds/, catalog/ (chunk sources)

No back-writes. No circular reads. Web-components ships UI atoms only; corpus lives here; runtime lives in a2ui-compose. The chunk pipeline harvests data-chunk-tagged regions from the apps that consume web-components, so the read-direction stays one-way.

Glossary

The corpus has converged on one retrievable concept: the chunk.

| Term | Source of truth | Granularity | What it carries | Engine that consumes it |
| --- | --- | --- | --- | --- |
| chunk | chunks/<id>.json (one file per chunk) + _index.json | A single labeled HTML region, harvested via data-chunk markers | html, intent, domain, kind (block / page / panel), keywords, metadata (when annotated), template (transpiled A2UI tree, when annotated) | chunk-zettel + monolithic-pro + zettel (composition-library wraps annotated chunks) |

Annotated vs raw chunks: every chunk has source/page provenance, but only chunks carrying data-chunk-{domain,description,keywords,kind} attributes on their source HTML become retrievable as compositions (the harvester's transpile pass produces a template + lifts the metadata block onto them). Raw chunks remain as substrate for nested-expand reference resolution but don't compete in retrieval.
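The annotated/raw split can be pictured as a simple predicate over a chunk record. This is a minimal sketch, not the harvester's or the loader's actual code: the function name `isRetrievable` is hypothetical, and it assumes the four annotation fields end up under `metadata`, which the real chunk schema may not do.

```javascript
// Hypothetical predicate. Field placement (domain/description/keywords/kind
// under `metadata`) is an assumption for illustration only.
const ANNOTATION_FIELDS = ['domain', 'description', 'keywords', 'kind'];

function isRetrievable(chunk) {
  const meta = chunk.metadata;
  const annotated =
    meta != null && ANNOTATION_FIELDS.every((field) => meta[field] != null);
  // Only annotated chunks get a transpiled template, so require both.
  return annotated && chunk.template != null;
}

// A raw chunk: provenance only, no metadata block, no template.
const rawChunk = {
  id: 'hero-raw',
  html: '<section>Welcome</section>',
  source: { page: 'site/pages/home.html' },
};

// The same region once its source HTML carries data-chunk-* annotations.
const annotatedChunk = {
  ...rawChunk,
  id: 'hero',
  metadata: {
    domain: 'marketing',
    description: 'Hero banner',
    keywords: ['hero', 'banner'],
    kind: 'block',
  },
  template: { component: 'Section', children: [] },
};

console.log(isRetrievable(rawChunk));       // false
console.log(isRetrievable(annotatedChunk)); // true
```

Raw chunks stay loadable for reference resolution in this picture; they are simply filtered out before ranking.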

Grounding rule (locked v0.4.6, enforced v0.4.7): every retrievable unit MUST trace to a real page under site/pages/, apps/, playgrounds/, or catalog/ via its data-chunk-* annotations. No hand-authored ungrounded JSON lives in corpus/.
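As an illustration of the grounding rule, a provenance check could look like the sketch below. The helper `isGrounded` and the idea of passing it a plain repo-relative path are assumptions; how the real enforcement reads provenance from a chunk is not specified here.

```javascript
// The four roots named by the grounding rule.
const GROUNDED_ROOTS = ['site/pages/', 'apps/', 'playgrounds/', 'catalog/'];

// Hypothetical helper: true when a repo-relative source path sits under
// one of the grounded roots.
function isGrounded(sourcePath) {
  if (typeof sourcePath !== 'string') return false;
  // Normalize Windows separators and a leading "./" before comparing.
  const normalized = sourcePath.replace(/\\/g, '/').replace(/^\.\//, '');
  return GROUNDED_ROOTS.some((root) => normalized.startsWith(root));
}

console.log(isGrounded('apps/demo/app/login/login.contents.html')); // true
console.log(isGrounded('./playgrounds/forms/index.html'));          // true
console.log(isGrounded('corpus/hand-authored.json'));               // false
```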

Historical note: Fragments (atomic A2UI sub-trees with named slots, §37, 2026-05-12), patterns (hand-authored full-canvas A2UI templates, patterns/ dir), and compositions (hand-authored multi-section A2UI surfaces, compositions/ dir) were earlier corpus formats. Fragments were retired in v0.4.4 (§37); patterns and compositions were retired in v0.4.7 (§72, the v0.4.6 carryover of §65). Their retrievable equivalents were either (a) already represented in the chunks corpus via annotated data-chunk-* regions, or (b) deemed non-grounded under the locked rule and deleted in the §73 Mode-C triage (targeted for v0.4.7). See .claude/docs/journal/2026/05/2026-05-12.md §§36-42, §65, §72 for the multi-arc retirement narrative.

Layout

a2ui/corpus/
├── chunks/                  retrievable + raw chunks (~190 entries)
│                              — one JSON per chunk; carries `source`,
│                              `metadata` (when annotated), `template`
│                              (transpiled A2UI tree, when annotated).
│                              Harvested by `scripts/build/harvest-chunks.mjs`
│                              from data-chunk markers across site/pages/*,
│                              apps/*, playgrounds/*, catalog/*.
│   └── _index.json            harvester output — name + by-kind tallies +
│                              normalized chunk list (shape consumed by
│                              chunk-loader + composition-library)
│
├── evals/                   held-out.jsonl + eval fixtures
├── feedback/                daily JSONL — user feedback events
├── gaps/                    gap registry — prompts with missing coverage
├── scripts/                 maintenance tooling (extract, ingest, feedback, ticket)
│   └── chunk-library.js       in-memory loader + keyword/semantic search over chunks/
│
├── catalog-a2ui_0_9.json       aggregated artifact — what a2ui-compose + a2ui-mcp read
├── catalog-a2ui_0_9_rules.txt  natural-language composition rules (per component)
├── common_types.json           shared A2UI type shapes
├── chunk-embeddings.json       pre-computed embeddings for chunk semantic search (0.0.4+)
├── functions.json              declarative wiring-engine function catalog
├── manifest.json               extraction metadata (what / when / counts)
├── pattern-specs.md            written specs for each pattern category (historical reference)
└── data-flow.md                how the signal sources feed the pipeline

What's committed vs generated vs published

| Kind | Committed | Published to npm | Source of truth |
|----------------------------|:---------:|:----------------:|----------------------------------------|
| Chunks (chunks/) | ✓ | ✓ | scripts/build/harvest-chunks.mjs (from data-chunk markers across site/pages, apps, playgrounds, catalog) |
| catalog-a2ui_0_9.json | ✓ | ✓ | npm run components (assembled from yamls in packages/web-components/) |
| chunk-embeddings.json | ✓ | ✗ since 0.2.1 | scripts/build/embeddings-chunks.mjs |
| Feedback JSONL | ✓ | ✓ | Written by @adia-ai/a2ui-retrieval at runtime |
| Gap registry | ✓ | ✓ | Written by @adia-ai/a2ui-retrieval at runtime |

Extracted artifacts are committed for convenience (avoids a build step to read the pipeline), but the scripts are authoritative — regenerate via npm run harvest:chunks (chunks), npm run components (catalog), or npm run build:embeddings:chunks (chunk embeddings) if anything drifts.

Why embeddings ship via git, not npm

chunk-embeddings.json (~20 MB) is committed to git so the monorepo's own pipeline runs without a network round-trip, but it is excluded from the published npm tarball. Previously, every npm i @adia-ai/a2ui-corpus pulled ~20 MB of pre-computed float arrays that consumers had no reliable way to address: the chunk-embedding-retriever resolves them via a relative path that breaks under a node_modules/@adia-ai/a2ui-corpus/ install layout.

The companion pattern-embeddings.json retired in v0.4.7 §72 along with the patterns/ source directory. Its only consumer was concept-mapper.js (dead post-v0.4.6 §64 retirement of pattern-library.js). The pattern-embeddings build script (scripts/build/embeddings.mjs) and the matching embedding-retriever.js were retired in the same arc.

Consumers who want embedding-based retrieval either:

- Regenerate locally: npm run build:embeddings:chunks produces the chunk-embeddings.json file in your node_modules/@adia-ai/a2ui-corpus/ checkout. Requires API access to your embedding provider.
- Use the keyword-only fallback: chunk-library.searchChunks() works without embeddings; the embedding-aware searchChunksAsync path falls through to keyword scoring when the index file is absent (chunk-embedding-retriever.js returns null gracefully).
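The fall-through behaviour described above can be sketched as follows. This is not the code in chunk-library.js: `keywordSearch`, `searchWithFallback`, the injected `loadIndex` / `semanticSearch` callbacks, and the term-overlap scoring are all hypothetical stand-ins chosen to keep the example self-contained.

```javascript
// Hypothetical keyword scorer: counts query terms found in chunk keywords.
function keywordSearch(chunks, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return chunks
    .map((chunk) => ({
      id: chunk.id,
      score: terms.filter((term) => chunk.keywords.includes(term)).length,
    }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Hypothetical async path: use embeddings when the index loads,
// otherwise fall through to keyword scoring.
async function searchWithFallback(chunks, query, loadIndex, semanticSearch) {
  const index = await loadIndex(); // assumed to resolve to null when the file is absent
  if (index === null) return keywordSearch(chunks, query);
  return semanticSearch(index, query);
}

const chunks = [
  { id: 'login-form', keywords: ['login', 'form', 'auth'] },
  { id: 'pricing-table', keywords: ['pricing', 'table'] },
];

// Simulate a published install where chunk-embeddings.json is missing.
const missingIndex = async () => null;
const unusedSemantic = () => {
  throw new Error('semantic search should not run without an index');
};

searchWithFallback(chunks, 'login form', missingIndex, unusedSemantic).then(
  (hits) => console.log(hits) // [ { id: 'login-form', score: 2 } ]
);
```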

Embedding model pinning

The provider and model recorded in each *-embeddings.json header are the source of truth at query time. The retrievers (chunk-embedding-retriever.js, embedding-retriever.js) re-resolve the same embedder from those header fields. They do not auto-pick a different provider when the recorded one's API key is unset, for two reasons: cross-model cosine similarity is meaningless, and a different model from the same provider emits vectors of a different dimension, which cosine() short-circuits to 0 (a silent retrieval failure).
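To make the silent-failure mode concrete, here is a minimal sketch of a cosine function that returns 0 on a dimension mismatch, next to a header-driven embedder lookup that refuses to switch providers. Both functions are illustrative assumptions: the real `cosine()` may differ in detail, and `resolveEmbedder` is a hypothetical name. In particular, this sketch throws when the key is unset, whereas the real retrievers may return null instead.

```javascript
// Sketch of a cosine similarity that short-circuits on mismatched dims.
function cosine(a, b) {
  if (a.length !== b.length) return 0; // no error raised, rankings just collapse
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical resolver: binds to the provider recorded in the index header
// and never substitutes another one.
const ENV_KEYS = { openai: 'OPENAI_API_KEY', voyage: 'VOYAGE_API_KEY' };

function resolveEmbedder(header, env) {
  const envKey = ENV_KEYS[header.provider];
  if (!envKey) throw new Error(`Unknown provider in header: ${header.provider}`);
  if (!env[envKey]) {
    throw new Error(`${envKey} is unset; refusing to switch away from ${header.model}`);
  }
  return { provider: header.provider, model: header.model };
}

const stored = new Array(1536).fill(0.1); // vector built at index time
const query = new Array(1024).fill(0.1);  // query embedded with a different model

console.log(cosine([1, 0], [1, 0])); // 1
console.log(cosine(stored, query));  // 0
```

A score of 0 for every chunk looks like "no good match" rather than a configuration error, which is why the header is treated as binding.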

Currently pinned defaults:

| Provider | Model | Dims | Env |
|---|---|---:|---|
| openai | text-embedding-3-small | 1536 | OPENAI_API_KEY |
| voyage | voyage-3-lite | 1024 | VOYAGE_API_KEY |

detectProvider() (in packages/a2ui/retrieval/embedding/embedding-provider.js) prefers Voyage when both keys are present (denser vectors, lower cost). The build:embeddings* scripts record the chosen provider/model into the .json header, so subsequent reads always re-bind to the same model.
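The selection order can be sketched like this. It is an approximation of the described behaviour, not the contents of embedding-provider.js: the name `detectProviderSketch`, the `PINNED_DEFAULTS` table shape, and the null return when no key is set are assumptions.

```javascript
// Pinned defaults, copied from the table above.
const PINNED_DEFAULTS = {
  voyage: { model: 'voyage-3-lite', dims: 1024, envKey: 'VOYAGE_API_KEY' },
  openai: { model: 'text-embedding-3-small', dims: 1536, envKey: 'OPENAI_API_KEY' },
};

// Preference order: Voyage first, then OpenAI.
const PREFERENCE = ['voyage', 'openai'];

// Hypothetical stand-in for detectProvider(); takes the environment as an
// argument so it can be exercised without touching process.env.
function detectProviderSketch(env) {
  for (const provider of PREFERENCE) {
    const { model, dims, envKey } = PINNED_DEFAULTS[provider];
    if (env[envKey]) return { provider, model, dims };
  }
  return null; // assumption: no key set means no embedder
}

console.log(detectProviderSketch({ OPENAI_API_KEY: 'a', VOYAGE_API_KEY: 'b' }).provider); // voyage
console.log(detectProviderSketch({ OPENAI_API_KEY: 'a' }).provider);                      // openai
console.log(detectProviderSketch({}));                                                    // null
```

Detection of this kind applies at build time only; once a header is written, reads re-bind to the recorded model regardless of which keys are present.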

When upgrading to a new model (e.g. text-embedding-3-small → text-embedding-3-large):

1. Update the default in embedding-provider.js.
2. Rebuild the chunk index (npm run build:embeddings:chunks).
3. Verify with npm run check:embeddings-fresh that both index headers record the new model.
4. Re-run npm run eval:diff -- --engine zettel to confirm the new model doesn't regress retrieval quality (different models score queries differently; thresholds in chunk-synthesizer.js may need a re-look, though the absolute keyword score floor remains independent).

Don't mix models. If one index records voyage-3-lite and the other records text-embedding-3-small, the retrievers will load both fine but the rankings will be incomparable across the two corpora.
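A pre-flight check for mixed models could be as small as the sketch below. It is hypothetical: `findModelMismatch` is not an existing script, and it assumes each index header exposes `provider` and `model` fields as described in this section.

```javascript
// Hypothetical check: returns null when every header records the same
// provider/model pair, otherwise a message naming the distinct pairs.
function findModelMismatch(headers) {
  const seen = new Set(headers.map((h) => `${h.provider}/${h.model}`));
  if (seen.size <= 1) return null;
  return `Mixed embedding models: ${[...seen].join(', ')}`;
}

const consistent = [
  { provider: 'voyage', model: 'voyage-3-lite' },
  { provider: 'voyage', model: 'voyage-3-lite' },
];
const mixed = [
  { provider: 'voyage', model: 'voyage-3-lite' },
  { provider: 'openai', model: 'text-embedding-3-small' },
];

console.log(findModelMismatch(consistent)); // null
console.log(findModelMismatch(mixed));
// Mixed embedding models: voyage/voyage-3-lite, openai/text-embedding-3-small
```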

Scripts

All run from repo root via npm:

npm run harvest:chunks       # full re-harvest of chunks/ from data-chunk markers
npm run components           # regenerate v0.9 sidecars + assemble catalog-a2ui_0_9.json
npm run components -- --verify   # fail if catalog/sidecars are stale vs yamls

npm run feedback:report      # human-readable feedback digest
npm run feedback:promote     # promote high-confidence feedback → new training data

npm run ticket               # open ticket tracker
npm run ticket:list          # list open tickets
npm run ticket:create        # create a ticket against corpus/pipeline

Script inventory (scripts/):

| Script | Purpose |
|---------------------------|------------------------------------------------------|
| chunk-library.js | In-memory loader + keyword/semantic search over chunks/ |
| feedback-report.js | Aggregates feedback JSONL into a readable digest |
| feedback-promote.js | Moves high-confidence feedback into training data |
| ticket.mjs | Corpus/pipeline issue tracker |

Retired in v0.4.7 §72 (with the patterns/ + compositions/ dirs): extract.js, ingest.js, run-pipeline.mjs, build-pattern-index.mjs. These fed the pattern-library retrieval surface (pattern-library.js, retired v0.4.6 §64) and the exemplar→chunks pipeline (retired v0.4.4 §36). Pattern embeddings (scripts/build/embeddings.mjs) and the grounded-corpus triage audit (scripts/audit/grounded-corpus-triage.mjs) also retired in the same arc — the corpus is now one-format (chunks-only) and harvester-driven.

Repo-side build scripts (not in tarball; run from the workspace root):

| Script | Purpose |
|-------------------------------------|----------------------------------------------------------|
| npm run harvest:chunks | Walks site/pages/, apps/, playgrounds/, catalog/, harvests every [data-chunk] element, writes chunks/<name>.json + _index.json |
| npm run build:embeddings:chunks | Generates chunk-embeddings.json (~190 chunks × 1536d) |

Exports

// Catalog — the aggregated read-target for engines.
// Carries per-component aliases under `components[name].x-adiaui.synonyms.tags`.
import catalog from '@adia-ai/a2ui-corpus';

// Chunk corpus (since 0.0.3 / 0.0.4)
import chunkIndex from '@adia-ai/a2ui-corpus/chunks';   // _index.json (metadata only)
import { searchChunks, searchChunksAsync, getChunk }
  from '@adia-ai/a2ui-corpus/chunk-library';            // in-memory query API

Authoring order — demo page → `data-chunk` marker → training

When adding coverage for a new intent:

1. Live demo page: author the HTML in apps/<name>/app/<demo>/<demo>.contents.html (or under playgrounds/ / catalog/) using pure primitive composition. See repo-root AGENTS.md.
2. Tag the reusable region: add data-chunk="<slug>" + data-chunk-kind="<kind>" (block / page / panel / field) on the element. The harvester extracts the bounding HTML on the next build. See .claude/docs/specs/genui-chunk-marker.md for the marker convention.
3. Harvest + ingest: npm run harvest:chunks writes chunks/<slug>.json and refreshes _index.json. npm run pipeline does the full extract → ingest → catalog refresh.
4. Verify: npm run eval:diff -- --engine zettel should still hold coverage ≥ 83%, avgScore ≥ 88 (per the regression floors in AGENTS.md).
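To show what a tagged region looks like and what a harvester might lift from it, here is a small sketch that reads the data-chunk markers off an opening tag. The function `readChunkMarkers`, the `slug` key, and the sample attribute values are hypothetical, and the real harvest-chunks.mjs presumably uses a proper HTML parser rather than a regular expression.

```javascript
// Hypothetical helper: collects data-chunk and data-chunk-* attributes
// from a single opening tag into a plain object.
function readChunkMarkers(openTag) {
  const markers = {};
  const attribute = /\bdata-chunk(?:-([a-z]+))?="([^"]*)"/g;
  for (const [, suffix, value] of openTag.matchAll(attribute)) {
    // Bare data-chunk carries the slug; data-chunk-<name> carries metadata.
    markers[suffix ?? 'slug'] = value;
  }
  return markers;
}

// An annotated region as it might appear in a demo page (values are made up).
const openTag =
  '<section data-chunk="pricing-table" data-chunk-kind="block" ' +
  'data-chunk-domain="billing" data-chunk-keywords="pricing, plans">';

console.log(readChunkMarkers(openTag));
// { slug: 'pricing-table', kind: 'block', domain: 'billing', keywords: 'pricing, plans' }
```

A region tagged with only data-chunk and data-chunk-kind would yield just `slug` and `kind` here, which corresponds to a raw chunk in the glossary's terms.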

See data-flow.md for the full pipeline (chunks → feedback).

Regression floors

The pipeline must hold these thresholds — tracked in the held-out benchmark:

- Fragment reuse ratio ≥ 29.9% (167 refs / 559 composition nodes)
- Zettel: coverage 100%, avgScore ≥ 88, MRR ≥ 0.94
- Monolithic: coverage 100%, avgScore ≥ 95
- Dogfood: 20/20 intents at avg ≥ 95
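The floors above lend themselves to a mechanical check. The sketch below is an assumption about how such a gate could be written, not the benchmark's actual code: `checkFloors` and the summary field names (`coverage`, `avgScore`, `mrr`) are hypothetical. It also reproduces the reuse-ratio arithmetic from the first floor.

```javascript
// Floors copied from the list above (Zettel and Monolithic only).
const FLOORS = {
  zettel: { coverage: 100, avgScore: 88, mrr: 0.94 },
  monolithic: { coverage: 100, avgScore: 95 },
};

// Hypothetical gate: returns a list of violated floors (empty when all hold).
function checkFloors(engine, summary) {
  const floors = FLOORS[engine];
  if (!floors) throw new Error(`No floors recorded for engine: ${engine}`);
  return Object.entries(floors)
    .filter(([metric, floor]) => !(summary[metric] >= floor))
    .map(([metric, floor]) => `${metric}: ${summary[metric]} < ${floor}`);
}

console.log(checkFloors('zettel', { coverage: 100, avgScore: 90.2, mrr: 0.95 })); // []
console.log(checkFloors('zettel', { coverage: 100, avgScore: 85, mrr: 0.95 }));
// [ 'avgScore: 85 < 88' ]

// Fragment reuse ratio: 167 refs out of 559 composition nodes.
const reuseRatio = 167 / 559;
console.log((reuseRatio * 100).toFixed(1)); // 29.9
```

A missing metric fails the gate as well, since `undefined >= floor` is false.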

What this package does NOT contain

- Pipeline runtime: gen-ui/
- UI custom elements: web-components/
- MCP transport: gen-ui-mcp/
- Site / playground UI: /site/

If a file here is .js / .mjs, it's a maintenance script, not runtime. Runtime readers go through gen-ui/retrieval/*.

License

MIT
