@nexus-mindgarden/granite-test

v0.0.7

Published

17 days ago

Granite-Floor test-coverage framework — plugin-author-declared MCP-tool test-suites + chatbus-event reporter for Oracle's @floor aggregator. v0.0.6: aligns to spec v1.3 FROZEN (Oracle 2026-05-31) adding target_kind ('plugin-tool' | 'host-tool', FIRST clos

mrdewitt88

nexus-mindgarden granite granite-floor llm-testing mcp tool-coverage ci

Granite-Floor test-coverage framework for cluster-wide MCP-tool quality measurement.

Pre-release skeleton — v0.0.1. Package shape is settled, runtime implementation lands once Oracle's granite-floor.event.v1 spec freezes (2026-05-27) AND the v0.7.0 decision-gate hits (≥2 consumer-side adoption signals from wiz-mind + plug-elec).

Purpose

The cluster-goal: by the September trade fair (Messe), 100% of MCP-tools across ~27 plugin-repos must be reliably callable from Granite-4-tiny-4bit (sovereign-AI floor = product-differential).

Today: ~80% callable, no systematic measurement. This package provides the missing measurement-layer.

Architecture (3-side ownership)

Oracle      = schema + aggregator + dashboard (@floor target)
plug-tmpl   = @nexus-mindgarden/granite-test  ← THIS PACKAGE
              + create-plugin CLI integration
              + .github/workflows/granite-test.yml.template
              + granite-test.config.ts template
wiz-mind    = @nexus-mindgarden/granite-pilot-runner  (runner-core, peerDep)
Repos       = granite-test.config.ts (consumer)

This package wraps @nexus-mindgarden/granite-pilot-runner (runner-mechanics owned by wiz-mind, extracted from their narrative-domain Granite-Pilot) with tool-call-domain-specifics:

Config-shape (defineGraniteToolTest)
Transport-adapter (reportToCluster → chatbus @floor reserved-role)
Fail-categorization (6 fixed categories per spec v1)

Plugin-author usage (post-decision-gate)

// granite-test.config.ts in your plugin-repo root
import { defineGraniteToolTest } from '@nexus-mindgarden/granite-test'

export default [
  // Plugin-tool (this plugin owns + serves it):
  defineGraniteToolTest({
    tool: 'plug-elec.kabel.dimensionierung',
    persona: 'user',
    cases: [
      {
        case_id: 'kabel.16A-25m',
        prompt: 'Dimensioniere 16A Drehstromkreis 25m Länge',
        expected_tool_args: { strom: 16, phasen: 3, laenge: 25 },
        max_latency_ms: 8000,
      },
    ],
  }),

  // Host-tool (v0.0.6+, spec v1.3) — this plugin only consumes:
  defineGraniteToolTest({
    tool: 'image.generate',              // un-prefixed host-shared name
    persona: 'user',
    target_kind: 'host-tool',             // NEW v0.0.6 (spec v1.3)
    target_host: 'theseus',               // canonical theseus | v8 | v8-fam | markview
    cases: [
      {
        case_id: 'image.generate.pixel-tile-256',
        prompt: 'pixel-art forest tile 256x256',
        expected_tool_args: { width: 256, height: 256 },
        max_latency_ms: 30000,
      },
    ],
  }),

  defineGraniteToolTest({
    tool: 'plug-elec.project.delete',
    persona: 'admin',
    cases: [
      // …
    ],
  }),
]

Tool-Count-Cap & chunking (v0.0.7+, spec v1.4 FROZEN 2026-05-31)

Granite-4-h-tiny tool-selection capacity saturates at ~10–15 tools per context. Past the cap, pass-rate regresses sharply (v8-fam: 10→25 tools = 80%→56%; v8-corp K=10 = 72.3% plateau vs single-tool ceiling 88.5%). From v0.0.7, plugin-authors can enforce the cap with defineGraniteTestSuite() + toolCountPolicy:

import { defineGraniteTestSuite, defineGraniteToolTest } from '@nexus-mindgarden/granite-test'

export default defineGraniteTestSuite({
  // v0.0.7+ canonical cap-policy per Tool-Count-Cap RFC §4.2
  toolCountPolicy: {
    maxToolsPerRun: 10,            // cluster-canonical for granite-4-h-tiny
    chunkBy: 'tool-prefix',         // group by first dot-segment of tool-name
    chunkLatencyBudgetMs: 60_000,   // optional per-chunk override
    allowSubChunking: true,         // graceful auto-resolution if chunk > cap
  },
  tenantContext: { tenant_id: 'dev' },
  tools: [
    // 25+ tools — runner auto-chunks by first dot-segment:
    defineGraniteToolTest({ tool: 'calendar.events.create', ... }),  // → chunk 'calendar'
    defineGraniteToolTest({ tool: 'calendar.events.list', ... }),    // → chunk 'calendar'
    defineGraniteToolTest({ tool: 'notes.create', ... }),            // → chunk 'notes'
    defineGraniteToolTest({ tool: 'meals.plans_create', ... }),       // → chunk 'meals'
    // ...
  ],
})

Each chunk runs as a separate granite-batch with ≤maxToolsPerRun tools in Granite's context. Events emit chunk_id (first dot-segment) + chunk_size (per-chunk tool-count) + tools_in_context (actual context-window size) for per-chunk aggregator dashboards.

Runtime chunking-logic lives in granite-pilot-runner (wiz-mind owns). This package ships the config-shape + event-fields only.
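
The grouping rule described above (first dot-segment, at most maxToolsPerRun tools per chunk, optional sub-chunking) can be sketched as follows. This is an illustrative sketch only, not the granite-pilot-runner implementation: the function name chunkByToolPrefix and the ToolChunk type are hypothetical, and reusing the same chunk_id for every sub-chunk is an assumption.

```typescript
// Hypothetical sketch; the real chunking logic lives in granite-pilot-runner.
interface ToolChunk {
  chunk_id: string; // first dot-segment of the tool name
  tools: string[];  // tools placed in Granite's context for this chunk
}

function chunkByToolPrefix(
  toolNames: string[],
  maxToolsPerRun: number,
  allowSubChunking: boolean,
): ToolChunk[] {
  if (maxToolsPerRun < 1) {
    throw new Error("maxToolsPerRun must be at least 1");
  }

  // Group by first dot-segment, keeping first-seen prefix order.
  const groups: ToolChunk[] = [];
  for (const name of toolNames) {
    const prefix = name.split(".")[0] ?? name;
    let group = groups.filter((g) => g.chunk_id === prefix)[0];
    if (group === undefined) {
      group = { chunk_id: prefix, tools: [] };
      groups.push(group);
    }
    group.tools.push(name);
  }

  const chunks: ToolChunk[] = [];
  for (const group of groups) {
    if (group.tools.length <= maxToolsPerRun) {
      chunks.push(group);
      continue;
    }
    if (!allowSubChunking) {
      throw new Error(
        `chunk '${group.chunk_id}' has ${group.tools.length} tools, cap is ${maxToolsPerRun}`,
      );
    }
    // Assumption: an oversized group is split into consecutive slices
    // that all keep the same chunk_id.
    for (let i = 0; i < group.tools.length; i += maxToolsPerRun) {
      chunks.push({
        chunk_id: group.chunk_id,
        tools: group.tools.slice(i, i + maxToolsPerRun),
      });
    }
  }
  return chunks;
}

// Tool names taken from the config example above.
const exampleChunks = chunkByToolPrefix(
  ["calendar.events.create", "calendar.events.list", "notes.create", "meals.plans_create"],
  10,
  true,
);
```

In this sketch each returned chunk's tools.length corresponds to the chunk_size field, under the assumption that only the chunk's own tools are placed in context.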

Canonical RFC: https://github.com/MrDewitt88/TeamMindV8/blob/main/docs/granite-floor-RFC-tool-count-cap.md

0-anchor for SOJM-domain: narrative/structured-output-domain emitters (no tool-selection) emit tools_in_context: 0 so they land in a separable bucket. Cluster-canonical cross-domain finding: "fewer-fields-in-context beats stronger-prompt" generalizes from tool-selection to verbatim-output (plug-elec ET-Mind Pass-3 3c reduced-block = −75% missing per RFC §2.4).

Host-shared tools (v0.0.6+, spec v1.3 FROZEN 2026-05-31)

Three host-shared callMcp tools land via agent's feat/host-tool-routing triple-landing 2026-05-31:

| Tool | Hosts | actorClass v1 | Wire-spec source |
|---|---|---|---|
| image.generate | theseus :3401 (Bonsai sidecar §2.5) | 'user' only | @theseus/tools-image-schema |
| image.remove_background | theseus (ISNet via @imgly/background-removal-node §2.6) | 'user' only | @theseus/tools-image-schema (same package, 2 tools) |
| agent.complete | theseus :3400 (agent-socket §2.7 (a)+(b)) | 'user' + 'system' | @theseus/agent-complete-schema v0.15.0 FROZEN |

Plugin-authors emit target_kind: 'host-tool' + target_host: 'theseus' for granite-coverage of these tools. Aggregator dedupes by (target_host, tool) (read-side /api/granite-floor/host-tools rollup follow-on; today's tools_summary continues grouping by (repo, tool, persona, mode) for attribution-by-emitter).
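
The difference between the two grouping keys can be illustrated with a short sketch. It assumes events carry the fields named in this README; the function names emitterKey and hostToolKey, the FloorEventKeyFields type, and the "|" separator are hypothetical and not part of Oracle's aggregator.

```typescript
// Hypothetical sketch of the two grouping keys; not Oracle's aggregator code.
interface FloorEventKeyFields {
  repo: string;
  tool: string;
  persona: "user" | "admin" | "any";
  mode: "ci" | "wild";
  target_kind?: "plugin-tool" | "host-tool";
  target_host?: string;
}

// Attribution-by-emitter: the grouping used by today's tools_summary.
function emitterKey(e: FloorEventKeyFields): string {
  return [e.repo, e.tool, e.persona, e.mode].join("|");
}

// Host-tool rollup: one key per (target_host, tool), regardless of emitter.
// Returns null for events that do not target a host-tool.
function hostToolKey(e: FloorEventKeyFields): string | null {
  if (e.target_kind !== "host-tool" || !e.target_host) {
    return null;
  }
  return [e.target_host, e.tool].join("|");
}

// Two different plugins covering the same host-tool.
const fromPlugElec: FloorEventKeyFields = {
  repo: "plug-elec",
  tool: "image.generate",
  persona: "user",
  mode: "ci",
  target_kind: "host-tool",
  target_host: "theseus",
};
const fromWizMind: FloorEventKeyFields = { ...fromPlugElec, repo: "wiz-mind" };
```

Here the two events keep separate emitter keys but share one host-tool key, which is the deduplication the rollup relies on.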

Run locally:

pnpm granite-test

CI: the .github/workflows/granite-test.yml.template (plug-tmpl-shipped) runs on every commit + reports events to Oracle's aggregator.

Subpath exports

| Subpath | Purpose |
|---|---|
| @nexus-mindgarden/granite-test | Main API: defineGraniteToolTest, reportToCluster |
| @nexus-mindgarden/granite-test/types | Type-only re-exports (zero runtime, for compile-time imports) |
| @nexus-mindgarden/granite-test/reporter | Transport API: reportToCluster, ReportToClusterError |

Persona-field design (agent msg #701)

Required field per tool-test. Three values:

user — User-Agent mymind-mode (User-Privileges)
admin — Kiara-Persona mymind-mode (Admin-Privileges)
any — persona-agnostic (default, surfaced as "unclassified" in dashboard)

Oracle's dashboard drills down Kiara-Admin pass-rate vs User-Agent pass-rate per tool. Divergence signals that the persona system prompts work differently well on Granite. Buckets are disjoint (plug-tmpl vote in Q#2, msg #713): any is its own bucket, not double-counted in user+admin.
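
A minimal sketch of disjoint persona buckets, assuming each result carries only persona and outcome. The name passRateByPersona and the CaseResult type are hypothetical; the dashboard's actual aggregation is owned by Oracle and may differ.

```typescript
// Hypothetical sketch of per-persona pass-rates with disjoint buckets.
type Persona = "user" | "admin" | "any";

interface CaseResult {
  persona: Persona;
  outcome: "pass" | "fail";
}

function passRateByPersona(results: CaseResult[]): Record<Persona, number | null> {
  const counts: Record<Persona, { pass: number; total: number }> = {
    user: { pass: 0, total: 0 },
    admin: { pass: 0, total: 0 },
    any: { pass: 0, total: 0 },
  };

  // Each result is counted in exactly one bucket; 'any' is never
  // folded into 'user' or 'admin'.
  for (const r of results) {
    const bucket = counts[r.persona];
    bucket.total += 1;
    if (r.outcome === "pass") {
      bucket.pass += 1;
    }
  }

  // An empty bucket has no pass-rate, so it is reported as null.
  const rate = (p: Persona): number | null =>
    counts[p].total === 0 ? null : counts[p].pass / counts[p].total;

  return { user: rate("user"), admin: rate("admin"), any: rate("any") };
}

const personaRates = passRateByPersona([
  { persona: "user", outcome: "pass" },
  { persona: "user", outcome: "fail" },
  { persona: "admin", outcome: "pass" },
  { persona: "any", outcome: "fail" },
]);
```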

Event-shape (per Oracle spec v1)

Every test-case-result becomes one granite-floor.event.v1 event:

{
  event_kind: 'granite-floor.event.v1',
  run_id: '<uuid>',
  case_id: 'kabel.16A-25m',
  repo: 'plug-elec',
  tool: 'plug-elec.kabel.dimensionierung',
  persona: 'user',
  mode: 'ci',         // or 'wild' for mymind-observed in-the-wild
  outcome: 'pass',    // or 'fail'
  fail_category: null,  // one of the 6 fixed categories if outcome='fail'
  fail_detail: null,
  model: 'granite-4-h-tiny-4bit',
  latency_ms: 3421,
  timestamp: '2026-05-24T11:30:00.000Z',
  multiturn: { step_count: 1, failed_at_step: null, expected_tools: [...] },
  replay_bundle: { user_prompt, granite_output, tool_state },
}

Transport: chatbus post_message with to_role="@floor" (reserved-virtual-role, bypasses chat-stream-inbox), thread="granite-floor". Total event size capped at 64 KB.
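
An emitter may want to check the size cap before posting. The sketch below assumes the cap means 64 × 1024 bytes of the UTF-8 JSON serialization; the README only states "64 KB", so both the byte count and the encoding are assumptions, and the helper names are hypothetical.

```typescript
// Hypothetical pre-flight size check; byte limit and encoding are assumptions.
const MAX_EVENT_BYTES = 64 * 1024;

// Size of the event as UTF-8 encoded JSON.
function eventByteSize(event: Record<string, unknown>): number {
  return new TextEncoder().encode(JSON.stringify(event)).length;
}

function fitsEventCap(event: Record<string, unknown>): boolean {
  return eventByteSize(event) <= MAX_EVENT_BYTES;
}

const smallEvent = {
  event_kind: "granite-floor.event.v1",
  case_id: "kabel.16A-25m",
  outcome: "pass",
};

// A replay bundle with a very long model output pushes the event over the cap.
const oversizedEvent = {
  ...smallEvent,
  replay_bundle: { granite_output: "a".repeat(70_000) },
};
```

Measuring bytes rather than string length matters for non-ASCII prompts: a character such as "ä" occupies two bytes in UTF-8.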

Fail-categories (6 fixed in v1, new = v2)

| Category | When |
|---|---|
| schema-issue | Granite output ≠ Zod-input-schema |
| multiturn-state-loss | State-context from step N missing in step N+1 |
| hallucination | Tool does not exist / args invented |
| silent-fail | No tool-call although the prompt required one |
| length-exceeded | Output exceeded max-token budget |
| latency-spike | Call exceeded case.max_latency_ms |
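
The six category names can be captured as a closed union so that unknown values are rejected at the boundary. This is a sketch, not the package's exported types: FAIL_CATEGORIES, FailCategory, isFailCategory and isLatencySpike are hypothetical names. Only the latency rule is shown as code because it is the only one the table defines numerically.

```typescript
// Hypothetical sketch of the v1 fail-category set as a closed union.
const FAIL_CATEGORIES = [
  "schema-issue",
  "multiturn-state-loss",
  "hallucination",
  "silent-fail",
  "length-exceeded",
  "latency-spike",
] as const;

type FailCategory = (typeof FAIL_CATEGORIES)[number];

// Runtime guard for values arriving as untyped JSON.
function isFailCategory(value: unknown): value is FailCategory {
  return typeof value === "string" && FAIL_CATEGORIES.some((c) => c === value);
}

// latency-spike: the call exceeded the case's max_latency_ms.
function isLatencySpike(latencyMs: number, maxLatencyMs: number): boolean {
  return latencyMs > maxLatencyMs;
}
```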

Cross-References

Oracle spec: docs/granite-floor-spec.md (Oracle repo, msg #708)
chatbus thread: #contracts 2026-05-24 (msg #693 agent ask → #708 Oracle spec → #713 plug-tmpl ack)
Pre-design DM: plug-tmpl msg #713 (3 votes + #6 backpressure batching-proposal)
Ownership map: agent msg #701
Decision-gate candidates: wiz-mind (msg #709), plug-elec (agent ping pending)

Live smoke-test (opt-in)

The test-suite includes a live smoke-test that catches silent wire-shape regressions (the class of bug v0.0.3 had: it built a wire-shape that returned 200 but was never persisted). To run:

GRANITE_TEST_LIVE_SMOKE=1 \
CHATBUS_ENDPOINT=http://127.0.0.1:7878/api/messages \
pnpm test

The test:

Emits a marker-event via reportToCluster() with unique run_id
Polls GET /api/granite-floor/health before + after
Asserts events_total counter incremented (catches silent-200-no-persist class of bug)
(Optional) Verifies marker-event appears in /api/granite-floor/runs?repo=plug-tmpl

Without both env-vars set, the test is skipped (vitest reports as skip, not fail) — safe for any CI environment.
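
The gating and the counter assertion can be sketched as two pure helpers. This is not the package's test code: the helper names are hypothetical, treating only the value "1" as enabling the flag is an assumption, and the health payload is assumed to expose a numeric events_total field.

```typescript
// Hypothetical helpers mirroring the smoke-test's skip rule and assertion.
type Env = Record<string, string | undefined>;

// The live test runs only when both env-vars are set.
// Assumption: only the value "1" enables GRANITE_TEST_LIVE_SMOKE.
function shouldRunLiveSmoke(env: Env): boolean {
  return env["GRANITE_TEST_LIVE_SMOKE"] === "1" && Boolean(env["CHATBUS_ENDPOINT"]);
}

// Assumed shape of the /api/granite-floor/health response.
interface HealthSnapshot {
  events_total: number;
}

// A 200 response alone proves nothing; the counter must have moved.
function counterIncremented(before: HealthSnapshot, after: HealthSnapshot): boolean {
  return after.events_total > before.events_total;
}
```

In a vitest suite, a predicate like shouldRunLiveSmoke would typically gate the test so that it is reported as skipped rather than failed.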

Plugin-authors should run this opt-in test at least once after upgrading granite-test or chatbus-web to catch wire-shape regressions before they cause silent 0-emission across the cluster.

Status

✅ Skeleton landed (this package)
⏳ Oracle spec-freeze 2026-05-27
⏳ wiz-mind extract @nexus-mindgarden/granite-pilot-runner
⏳ Decision-gate: wiz-mind + plug-elec adoption signals
⏳ Full impl (post-decision-gate)
⏳ create-plugin CLI integration
⏳ GitHub Actions workflow template

License

MIT
