
tryassay v0.21.0
AI code verification CLI — find bugs that tests miss, linters ignore, and code review overlooks
Downloads: 2,238

Readme

Assay

AI Code Verification

Find bugs that tests miss, linters ignore, and code review overlooks.

Built on the LUCID methodology — Leveraging Unverified Claims Into Deliverables.

MIT License | GitHub stars | Node.js 20+ | TypeScript | Paper DOI | HumanEval | SWE-bench

Website | Paper | Methodology Guide | Prior Art | CLI Reference


Patent Notice: The verification methodology implemented by Assay is the subject of U.S. Provisional Patent Application No. 63/980,048, filed February 11, 2026, assigned to Rock Steady Systems LLC. The software is licensed under the MIT License. Use of the software does not grant any patent license beyond the rights conveyed by the MIT License.


Benchmark Results

Assay was evaluated on two standard code generation benchmarks. All results validated by running real test suites, not LLM judgment.

| Benchmark | Baseline | Assay | Improvement |
|-----------|----------|-------|-------------|
| HumanEval pass@1 | 86.6% | 98.8% | +14.1% |
| HumanEval pass@5 | -- | 100% (164/164) | All problems solved |
| SWE-bench resolve@1 | 18.3% | 25.0% | +36.4% |
| SWE-bench best-of-5 | -- | 30.3% (91/300) | +65.5% |

Key finding: LLM-as-judge verification actually performs worse at higher k values (97.2% vs 100% for Assay at k=5) because it hallucinates false positives. Structured claim extraction avoids this failure mode.

Full benchmark data: results/ | Benchmark report


The Problem

Every AI development workflow treats hallucination as the enemy. Spec-Driven Development writes precise specs to prevent it. Prompt engineering constrains it. Guardrails filter it out.

But three independent formal proofs have established that hallucination cannot be eliminated from LLMs:

  • Xu et al. (2024) -- learning theory proof that LLMs must hallucinate as general problem solvers
  • Banerjee et al. (2024) -- Gödel's Incompleteness Theorem applied to LLM architecture
  • Karpowicz (2025) -- impossibility theorem via mechanism design and transformer analysis

If hallucination is mathematically inevitable, suppressing it is fighting thermodynamics. Assay harnesses it instead.

The Insight

When you ask an AI to write Terms of Service for an application that doesn't exist, it doesn't say "this application doesn't exist." It confabulates. It invents specific capabilities, data handling procedures, user rights, performance guarantees, and limitations -- all in the authoritative, precise language that legal documents demand.

Every one of those hallucinated claims is a testable requirement.

A single hallucinated ToS produces 80--150 testable claims spanning functionality, security, data privacy, performance, operations, and legal compliance. No human requirements-gathering process generates this breadth in 30 seconds.
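For illustration, one such claim can be written down as a small structured record. The TypeScript sketch below is hypothetical: the field names are not the actual tryassay schema, just a plausible shape for a categorized, traceable claim.

```typescript
// Illustrative shape for an extracted claim. Field names are hypothetical,
// not taken from the real tryassay data model.
type Category = 'functionality' | 'security' | 'data-privacy' | 'operational' | 'legal';
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface Claim {
  id: string;           // stable identifier, e.g. "CLM-042"
  text: string;         // the declarative statement lifted from the ToS
  category: Category;
  severity: Severity;
  sourceClause: string; // traceability back to the ToS clause
}

// Example: one claim derived from a hallucinated data-handling clause.
const claim: Claim = {
  id: 'CLM-042',
  text: 'User data is encrypted at rest using AES-256',
  category: 'security',
  severity: 'critical',
  sourceClause: 'Section 7.2 (Data Handling)',
};

console.log(`${claim.id} [${claim.severity}] ${claim.text}`);
// prints: CLM-042 [critical] User data is encrypted at rest using AES-256
```

Each record like this is what the Extract phase emits; the Converge phase later attaches a verdict to it.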


How Assay Works

Assay implements the LUCID methodology -- a six-phase iterative cycle that converges hallucinated fiction toward verified reality:

                        THE ASSAY CYCLE
    +------------------------------------------------------+
    |                                                      |
    |  +------------+   +--------------+   +------------+  |
    |  | 1. DESCRIBE|-->|2. HALLUCINATE|-->| 3. EXTRACT |  |
    |  |            |   |              |   |            |  |
    |  | Loose idea |   | AI writes    |   | Each claim |  |
    |  | of the app |   | ToS as if    |   | = testable |  |
    |  |            |   | app is live  |   | req        |  |
    |  +------------+   +--------------+   +-----+------+  |
    |                                            |         |
    |  +------------+   +--------------+         |         |
    |  | 5. CONVERGE|<--|   4. BUILD   |<--------+         |
    |  |            |   |              |                   |
    |  | Verify ToS |   | Implement    |                   |
    |  | vs reality |   | until code   |                   |
    |  |            |   | satisfies    |                   |
    |  +-----+------+   +--------------+                   |
    |        |                                             |
    |   Gap found?                                         |
    |   YES --> Fix --> Re-verify                          |
    |   NO  --> Continue                                   |
    |        |                                             |
    |  +-----v--------+                                    |
    |  | 6. REGENERATE|  Feed verified reality back.       |
    |  |              |  AI writes updated ToS.            |
    |  | New ToS from |  New hallucinations = new reqs.    |
    |  | updated state|------------------------------------+
    |  +--------------+  Loop to step 3
    |
    +-- EXIT: Delta between ToS and reality is acceptable

Phase Details

| Phase | What Happens | Output |
|-------|--------------|--------|
| 1. Describe | Give the AI a loose, intentionally incomplete description. The gaps are where hallucination does its best work. | Seed description |
| 2. Hallucinate | AI writes Terms of Service as if the app is live in production with paying customers. Legal language forces precision -- no hedging allowed. | 400--600 lines of dense legal text |
| 3. Extract | Parse every declarative statement into a structured, testable claim with category, severity, and traceability back to the ToS clause. | 80--150 categorized claims |
| 4. Build | Implement the application using any methodology (TDD, agile, etc.). The ToS-derived claims are the acceptance criteria. | Working code |
| 5. Converge | Verify every claim against the actual codebase. Assign verdicts: PASS, PARTIAL, FAIL, or N/A. Generate a gap report. | Compliance score + gap report |
| 6. Regenerate | Feed verified reality back to the AI. It writes an updated ToS -- keeping what's real, revising what's partial, and hallucinating new features. | Next iteration's specification |

Convergence

With each iteration:

  • The ratio of accurate-to-hallucinated claims increases
  • New hallucinations become more contextually grounded
  • The gap between spec and reality shrinks
  • The application grows in directions the AI deems plausible for the domain

Exit condition: The team decides the delta is acceptable. This is a human judgment call, not an automated threshold.


Empirical Results

Assay was applied to a production Next.js application (~30,000 lines of TypeScript, 200+ files):

| Iteration | Compliance | PASS | PARTIAL | FAIL | N/A |
|-----------|------------|------|---------|------|-----|
| 1 | ~35% (est.) | -- | -- | -- | -- |
| 3 | 57.3% | 38 | 15 | 32 | 6 |
| 4 | 69.8% | 47 | 18 | 20 | 6 |
| 5 | 83.2% | 61 | 15 | 9 | 6 |
| 6 | 90.8% | 68 | 12 | 5 | 6 |

Compliance Over Iterations:

100% |
 90% |                                     *  90.8%
 80% |                              *  83.2%
 70% |                       *  69.8%
 60% |                *  57.3%
 50% |
 40% |
 35% |  *  ~35%
     +--+------+------+------+------+------+--
        1      2      3      4      5      6
                    Iteration

Total cost for 6 iterations: ~$17 in API tokens.

The 5 remaining FAIL claims after convergence were all genuine missing functionality -- not false positives. The hallucinated ToS correctly identified requirements a production app should have.


Why Terms of Service?

ToS is the ideal hallucination vehicle because the document format forces specificity across every dimension of a software product simultaneously:

| ToS Section | Produces | Example Claim |
|-------------|----------|---------------|
| Service Description | Feature requirements | "The Service allows batch processing of up to 10,000 records" |
| Acceptable Use | Input validation rules | "Users may not upload files exceeding 50MB" |
| Data Handling | Privacy & security requirements | "User data is encrypted at rest using AES-256" |
| Limitations | Performance boundaries | "The Service supports up to 10,000 concurrent users" |
| SLA / Uptime | Reliability requirements | "The Service maintains 99.9% uptime" |
| Termination | Account lifecycle requirements | "Data is retained for 30 days post-deletion" |
| Liability | Error handling requirements | "Graceful degradation on third-party API failure" |
| Modifications | Versioning requirements | "Users are notified 30 days before changes" |

Legal language cannot be vague. "The Service may do things" is not a valid legal clause. The format forces the AI to hallucinate precisely.


Quick Start

Prerequisites

  • Node.js 20+
  • An Anthropic API key

Installation

# Clone the repository
git clone https://github.com/gtsbahamas/hallucination-reversing-system.git
cd hallucination-reversing-system

# Install dependencies
npm install

# Build the CLI
npm run build

# Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."

Generate Verified Code (Fastest Way to Try Assay)

npm install -g tryassay
export ANTHROPIC_API_KEY="sk-ant-..."

# Generate code with Layer 2 verification built in
tryassay generate --task "Write a function that validates email addresses" --lang typescript --verbose

Verification runs automatically. You get the code + a proof showing every claim that was checked. No "skip verification" option exists.

Or Use the SDK Programmatically

import { AssaySDK } from 'tryassay';

const sdk = new AssaySDK();
const result = await sdk.generate({
  task: 'Write a CSV parser that handles quoted fields',
  language: 'typescript'
});

console.log(result.verified);              // true if all critical claims pass
console.log(result.verification.passed);   // claims that passed
console.log(result.verification.formalStats); // deterministic vs LLM verdicts

Run the Full LUCID Cycle

# 1. Initialize an Assay project
tryassay init

# 2. Generate a hallucinated Terms of Service
tryassay hallucinate

# 3. Extract testable claims from the hallucination
tryassay extract

# 4. Verify claims against your codebase
tryassay verify --repo /path/to/your/project

# 5. Generate a gap report
tryassay report

# 6. Generate remediation tasks for gaps
tryassay remediate --repo /path/to/your/project

# 7. After fixing gaps, regenerate for the next iteration
tryassay regenerate

Each iteration stores artifacts in .assay/iterations/{N}/, maintaining a complete audit trail.

Assess a Codebase (One Command)

# Assess any local project or GitHub repo
tryassay assess /path/to/your/project
tryassay assess https://github.com/org/repo --publish

SDK (Programmatic API)

npm install tryassay

import { AssaySDK } from 'tryassay';

const sdk = new AssaySDK({ anthropicApiKey: 'sk-ant-...' });

// Generate verified code — verification runs automatically
const result = await sdk.generate({
  task: 'Write a function that parses CSV with quoted fields',
  language: 'typescript'
});

// result.verified: boolean — did all critical/high claims pass?
// result.code: string — the generated code
// result.verification: full claim-by-claim proof
// result.verification.formalStats: { formallyVerified, llmVerified, formalOverrides }

// Verify existing code
const check = await sdk.verify({
  code: myCode,
  language: 'typescript'
});
// check.claims: every implicit claim extracted
// check.verifications: each claim's verdict (PASS/PARTIAL/FAIL)

The SDK enforces Layer 2 — generate() always returns a verification proof alongside the code. The formal verifier runs deterministic pattern checks (regex, not LLM) and can override LLM verdicts. On its first production call, it caught the LLM hallucinating PASS on code with SQL injection.
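For intuition, a deterministic check in this spirit can be as small as a regex rule that overrides an optimistic LLM verdict. The sketch below is illustrative only, not Assay's actual formal verifier; it flags SQL built by string concatenation, the class of bug described above.

```typescript
// Illustrative deterministic check (regex, no LLM). This is NOT the real
// formal verifier, just a sketch of the idea: a pattern rule whose FAIL
// overrides an LLM verdict of PASS.
type Verdict = 'PASS' | 'FAIL';

// Flag SQL built by concatenating or interpolating a variable into a query.
const SQL_CONCAT = /(SELECT|INSERT|UPDATE|DELETE)[^;]*["'`]\s*\+|\$\{[^}]+\}[^`]*(FROM|WHERE)/i;

function checkSqlInjection(code: string): Verdict {
  return SQL_CONCAT.test(code) ? 'FAIL' : 'PASS';
}

// The deterministic rule wins: an LLM PASS cannot survive a formal FAIL.
function finalVerdict(llmVerdict: Verdict, code: string): Verdict {
  return checkSqlInjection(code) === 'FAIL' ? 'FAIL' : llmVerdict;
}

const unsafe = `db.query("SELECT * FROM users WHERE id = " + userId);`;
const safe = `db.query("SELECT * FROM users WHERE id = ?", [userId]);`;

console.log(finalVerdict('PASS', unsafe)); // prints "FAIL": formal rule overrides
console.log(finalVerdict('PASS', safe));   // prints "PASS"
```

A real implementation would carry many such rules per claim category; the point is that each one is cheap, deterministic, and immune to hallucinated optimism.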


MCP Server (Claude Code, Cursor, Windsurf)

Add Assay verification as a native tool in your AI editor with one config block.

npm install -g assay-mcp

Claude Code (~/.claude/settings.json):

{
  "mcpServers": {
    "assay": {
      "command": "npx",
      "args": ["-y", "assay-mcp"],
      "env": { "ASSAY_API_KEY": "ak_live_your_key_here" }
    }
  }
}

Then ask your AI assistant: "Verify this file with Assay" or "Generate a verified function that parses CSV".

Assay catches what the AI missed and shows you exactly what would have shipped without verification.

Get a free API key at tryassay.ai. See mcp-server/README.md for full docs.


GitHub Action

Add Assay verification to your CI/CD pipeline. Every PR gets a verification report as a comment.

- uses: gtsbahamas/hallucination-reversing-system/github-action@v1
  with:
    assay-api-key: ${{ secrets.ASSAY_API_KEY }}

Two modes: Assay API (recommended, uses your Assay key) or BYOK (bring your own Anthropic key for self-hosted verification). See github-action/README.md for full docs.


CLI Reference

| Command | Description |
|---------|-------------|
| tryassay generate | Generate verified code -- Layer 2 verification below the model call |
| tryassay assess <target> | Run autonomous LVR assessment against a codebase or GitHub URL |
| tryassay init | Initialize project configuration |
| tryassay hallucinate | Generate a hallucinated ToS/API docs/user manual |
| tryassay describe | Fetch an existing ToS from a URL |
| tryassay extract | Extract testable claims from a document |
| tryassay verify | Verify extracted claims against a codebase |
| tryassay report | Generate a gap report from verification results |
| tryassay remediate | Generate code-level fix tasks from gaps |
| tryassay regenerate | Feed verified reality back, regenerate spec for next iteration |
| tryassay reverse | Generate code with hallucination prevention |
| tryassay runtime start | Start the Verified Agent Runtime |

Options

tryassay generate --task "..." --lang typescript --verbose  # Verified code generation
tryassay assess /path/to/project --publish                  # Assess + publish dashboard
tryassay hallucinate --type tos|api-docs|user-manual        # Document type (default: tos)
tryassay extract --iteration 3                              # Specify iteration (default: latest)
tryassay verify --repo /path/to/code --iteration 3          # Verify specific iteration
tryassay remediate --threshold 95                           # Set compliance target (default: 95%)

Scoring Methodology

Assay assigns four verdicts to each claim:

| Verdict | Meaning | Score Weight |
|---------|---------|--------------|
| PASS | Code fully implements the claim | 1.0 |
| PARTIAL | Code partially implements (some aspects missing) | 0.5 |
| FAIL | Code does not implement or contradicts the claim | 0.0 |
| N/A | Cannot be verified from code (e.g., legal-only claims) | Excluded |

Compliance score:

Score = (PASS + 0.5 * PARTIAL) / (Total - N/A) * 100
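The formula translates directly to code. A minimal sketch with made-up tallies (not the README's empirical data):

```typescript
// Compliance score: PASS = 1.0, PARTIAL = 0.5, FAIL = 0.0, N/A excluded.
interface Tally { pass: number; partial: number; fail: number; na: number; }

function complianceScore({ pass, partial, na, fail }: Tally): number {
  const total = pass + partial + fail + na;
  const scored = total - na; // N/A claims are excluded from the denominator
  return ((pass + 0.5 * partial) / scored) * 100;
}

// Illustrative tally only, not the iteration data from the table above.
const score = complianceScore({ pass: 8, partial: 4, fail: 2, na: 1 });
console.log(score.toFixed(1)); // prints "71.4"  -> (8 + 0.5*4) / (15 - 1) * 100
```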

Claims are categorized by type and severity:

| Category | Examples |
|----------|----------|
| functionality | Features, user workflows, UI components |
| security | Encryption, auth, access control |
| data-privacy | Data handling, retention, deletion |
| operational | Performance, uptime, monitoring |
| legal | Terms, disclaimers, compliance |

| Severity | Meaning |
|----------|---------|
| critical | Security breach or data loss if false |
| high | Core functionality broken if false |
| medium | Important but not showstopping |
| low | Nice-to-have or cosmetic |


The Neuroscience Behind Assay

Assay is not an arbitrary methodology. It is grounded in three convergent lines of evidence from cognitive neuroscience:

1. Transformers = Hippocampal Pattern Completion

Ramsauer et al. (2020) showed that transformer self-attention is mathematically equivalent to the update rule of modern (continuous-state) Hopfield networks -- the same family of associative memory computations attributed to the hippocampal CA3 network. When an LLM generates text about a nonexistent app, it performs pattern completion from partial cues, filling gaps with plausible details. This is identical to how human memory reconstructs events -- some accurate, some confabulated.

2. Perception as Controlled Hallucination

The predictive processing framework (Friston, Clark, Seth) holds that the brain is a prediction machine. As Anil Seth states: "We're all hallucinating all the time; when we agree about our hallucinations, we call it reality." Hallucination and perception are the same generative process under different constraint levels. Assay deliberately operates unconstrained during the Hallucinate phase, then progressively introduces constraint through Converge and Regenerate.

3. The REBUS Model (Relaxed Beliefs Under Psychedelics)

Carhart-Harris and Friston (2019) showed that psychedelics relax the brain's top-down constraints, enabling novel associations that rigid priors normally suppress. This maps directly to LLM temperature: higher temperature = more novel (and hallucination-prone) outputs. Assay exploits this by generating freely at "high temperature," then constraining iteratively -- just as the brain reintegrates psychedelic insights under normal conditions.

The Naming

The LUCID methodology is named for lucid dreaming -- the state where a dreamer becomes metacognitively aware they are dreaming while remaining in the dream. A lucid dreamer does not fight the dream. They participate with awareness, harvesting creative content while maintaining the ability to distinguish generated from real. Assay applies this principle to AI-generated code: harness the hallucination, don't suppress it.


How Assay Differs From Traditional Approaches

| Approach | Hallucination Stance | Spec Source | Convergence Loop | Verification |
|----------|----------------------|-------------|------------------|--------------|
| Spec-Driven Development (GitHub, 2025) | Prevents | Human-written | No | Spec compliance |
| Readme-Driven Development (Preston-Werner, 2010) | N/A | Human-written | No | Manual |
| Design Fiction (Sterling, 2005) | Intentional (human) | Human fiction | Loose | Informal |
| Vibe Coding (Karpathy, 2025) | Tolerates | Human prompt | No | Ad hoc |
| Protein Hallucination (Baker, Nobel 2024) | Exploits | Neural network | Validate-only | Lab synthesis |
| Assay | Exploits | AI-hallucinated ToS | Yes | Codebase verification |

Assay is the only methodology that combines AI-generated specification, deliberate hallucination exploitation, and iterative convergence verification against a real codebase.

The closest analogue is David Baker's protein hallucination -- where neural network "dreams" serve as blueprints for novel biological structures. That insight earned the 2024 Nobel Prize in Chemistry. Assay applies the identical principle to software engineering.


Real-World Application

Assay was developed and dogfooded on production applications, including an event photography platform and an AI agent platform. The gap analysis from a real Assay iteration looks like this:

Iteration 1: CrowdPics TV (112 claims extracted)
+----------------------------------+
|  REAL          36  (32%)  ====   |
|  PARTIAL       13  (12%)  ==     |
|  HALLUCINATED  63  (56%)  ====== |
+----------------------------------+

Each HALLUCINATED claim is a missing feature.
Each PARTIAL claim is incomplete work.
The gap IS the backlog.

After iterative remediation and regeneration, compliance converges toward 90%+. The remaining gaps are genuine missing functionality that serves as a prioritized development roadmap.
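In code, "the gap is the backlog" is just a filter and sort over verdicts. A minimal sketch, using hypothetical claim records rather than tryassay's real data model:

```typescript
// Illustrative only: turn verified claims into a prioritized backlog.
type Verdict = 'PASS' | 'PARTIAL' | 'FAIL' | 'N/A';
type Severity = 'critical' | 'high' | 'medium' | 'low';
interface VerifiedClaim { text: string; verdict: Verdict; severity: Severity; }

const SEVERITY_RANK: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Every FAIL or PARTIAL claim becomes a backlog item, highest severity first.
function backlog(claims: VerifiedClaim[]): VerifiedClaim[] {
  return claims
    .filter(c => c.verdict === 'FAIL' || c.verdict === 'PARTIAL')
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}

// Hypothetical verdicts, not taken from the CrowdPics TV iteration above.
const claims: VerifiedClaim[] = [
  { text: 'Data encrypted at rest', verdict: 'FAIL', severity: 'critical' },
  { text: 'Batch upload supported', verdict: 'PARTIAL', severity: 'medium' },
  { text: 'Login works', verdict: 'PASS', severity: 'high' },
];

console.log(backlog(claims).map(c => c.text)); // critical FAIL first, then the PARTIAL
```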


Project Structure

hallucination-reversing-system/
├── src/
│   ├── cli.ts                  # CLI entry point (Commander.js)
│   ├── index.ts                # SDK entry point (package export)
│   ├── sdk/                    # Programmatic SDK
│   │   ├── index.ts            # AssaySDK class
│   │   ├── types.ts            # SDK type definitions
│   │   ├── forward-verify.ts   # Claim extraction + verification pipeline
│   │   └── verified-generate.ts # Generate-verify-regenerate loop
│   ├── commands/               # CLI commands
│   │   ├── generate.ts         # Verified code generation
│   │   ├── assess.ts           # Autonomous LVR assessment
│   │   ├── runtime.ts          # Verified Agent Runtime commands
│   │   └── ...                 # init, hallucinate, extract, verify, etc.
│   ├── lib/                    # Core modules
│   │   ├── anthropic.ts        # Claude SDK wrapper
│   │   ├── formal-verifier.ts  # Deterministic formal verifier (no LLM)
│   │   ├── spec-synthesizer.ts # Spec synthesis from task descriptions
│   │   ├── constraint-engine.ts # Domain constraint generation
│   │   ├── guided-generator.ts # Constraint-guided code generation
│   │   └── ...
│   └── runtime/                # Verified Agent Runtime
│       ├── agent-loop.ts       # Core agent decision-execution cycle
│       ├── layer2-guardian.ts   # 4-level immutability enforcement
│       ├── kill-switch.ts      # Strict hierarchy kill switch
│       └── ...                 # 30+ runtime modules
├── api/                        # Vercel serverless API
│   ├── v1/forward.ts           # Forward verification endpoint
│   └── lib/formal-verifier.ts  # Production formal verifier
├── mcp-server/                 # MCP server for AI editors
├── github-action/              # GitHub Action for CI/CD
├── docs/                       # Research papers, plans, specs
└── results/                    # Benchmark data (HumanEval, SWE-bench)

Publications

| Venue | Status | Link |
|-------|--------|------|
| Zenodo (peer-reviewed DOI) | Published | 10.5281/zenodo.18522644 |
| arXiv | Submitted | arxiv-submission/main.pdf |
| CHI 2026 Workshop | In progress | chi-submission/ |


Token Economics

Running a full Assay iteration is inexpensive:

| Phase | Input Tokens | Output Tokens | Cost (approx.) |
|-------|--------------|---------------|----------------|
| Hallucinate | ~2K | ~12K | $0.15 |
| Extract | ~15K | ~8K | $0.25 |
| Verify | ~80K | ~20K | $1.50 |
| Remediate | ~30K | ~15K | $0.60 |
| Regenerate | ~20K | ~12K | $0.40 |
| Full iteration | | | ~$2.90 |

A complete 6-iteration cycle that achieves 90%+ compliance costs approximately $17 in API tokens -- producing a verified specification with 91 claims, a gap report, and a prioritized remediation plan.


Principles

  1. Hallucination is signal, not noise. The AI's confabulations reveal what a plausible version of the application looks like.
  2. Legal language enforces precision. ToS cannot be vague. The format forces the AI to hallucinate precisely.
  3. The gap is the backlog. The difference between what the ToS claims and what the code does is your task list.
  4. Reality is the only test. A claim is satisfied when verified against running code, not when code is written.
  5. The loop is the methodology. Assay is not one-shot generation. It is iterative convergence between fiction and reality.
  6. Verification requires external ground truth. LLMs cannot self-correct without external feedback (Huang et al., ICLR 2024). The codebase is the ground truth.

Contributing

Contributions are welcome. Areas where help is particularly valuable:

  • Multi-document hallucination -- Extending beyond ToS to API docs, user manuals, privacy policies, and compliance certifications simultaneously
  • Formal verification integration -- Replacing LLM-based verification with property-based testing, model checking, or static analysis for specific claim categories
  • CI/CD integration -- Running Assay in continuous integration pipelines for specification-drift detection
  • Language support -- The CLI currently targets TypeScript/JavaScript codebases; other languages need codebase indexing adapters
  • Benchmarking -- Comparing initial hallucination quality across different LLMs (Claude, GPT-4, Gemini, Llama)

Development

git clone https://github.com/gtsbahamas/hallucination-reversing-system.git
cd hallucination-reversing-system
npm install
npm run dev    # Watch mode (TypeScript compilation)
npm run build  # Production build

FAQ

Q: Isn't this just "make stuff up and hope for the best"?

No. The hallucination is the input, not the output. Assay verifies every claim against the actual codebase. Unverified claims are surfaced as gaps. Nothing ships without evidence. The methodology is closer to the scientific method: hypothesize (hallucinate), test (verify), refine (regenerate).

Q: Why not just write requirements manually?

You can. But no human writes 91 testable requirements spanning functionality, security, data privacy, performance, operations, and legal compliance in 30 seconds. Assay generates comprehensive first-draft specifications at machine speed, then converges them toward reality through verification.

Q: Does this actually work in production?

Yes. Assay was developed while building production applications. The empirical results (57% to 91% compliance over 6 iterations) come from a real codebase with 30,000+ lines of TypeScript. The remaining gaps were genuine missing functionality, not false positives.

Q: How is this different from vibe coding?

Vibe coding tolerates hallucination in the code. Assay exploits hallucination in the specification and then demands rigorous verification of the code against that specification. The verification loop is the critical difference -- vibe coding has no convergence mechanism.

Q: What models does Assay support?

The CLI currently uses Anthropic's Claude via the official SDK. The architecture is model-agnostic -- any LLM capable of generating structured legal text and performing code analysis can be substituted.


Citation

@article{wells2026lucid,
  title={LUCID: Leveraging Unverified Claims Into Deliverables},
  author={Wells, Ty},
  year={2026},
  doi={10.5281/zenodo.18522644},
  url={https://github.com/gtsbahamas/hallucination-reversing-system}
}

License

MIT -- Use it, fork it, build on it.

Copyright (c) 2026 Rock Steady Systems LLC.


"Normal specification is hallucination constrained by reality. Assay is the first development tool that uses this principle: generate freely, then constrain iteratively, just as the brain does."

Built by Rock Steady Systems LLC