forgeagent
v3.0.0-alpha.12
AgentForge — AI Agent Factory. Requirements in, verified deliverables out.
AgentForge — Intent Compiler for AI Agents
Define contracts. Compile agents. Verify deliverables.
AgentForge is a domain-agnostic intent compiler. It takes structured goals—contracts, specs, requirements—and executes them through dynamically generated agents with bounded capabilities and evidence-based verification.
What AgentForge actually is:
- An agent compiler: proposed agents are validated, normalized, and policy-checked before instantiation
- A contract enforcement system: VCC (Verifiable Contract Criteria) define acceptance criteria before execution
- A verification framework: deliverables must pass auditable checks to be accepted
- A bounded execution runtime: agents operate within capability, tool, and resource constraints
What AgentForge is NOT:
- A chat assistant with extra steps
- A "let AI figure it out" autonomous system
- A software-only framework — AgentForge is domain-agnostic; domain behavior is derived from VCC, DomainContextExtractor, and learned patterns
- A replacement for human judgment on critical decisions
What AgentForge compiles:
- A plan (what will be produced)
- Contracts (what "done" means)
- Execution steps (how work proceeds)
- Evidence requirements (how success is proven)
The output is not "an answer"—it is a bundle of artifacts + verification results.
This is not a chat assistant. AgentForge is an intent compiler + bounded execution system that turns contracts into reproducible, verifiable deliverables.
What AgentForge Actually Does
Project → Analysis → Team Design → Agent Compilation → Execution → Review → Deliverables
The critical step is Agent Compilation: proposed agents are validated, normalized, and policy-checked before they are allowed to exist.
You give AgentForge a project—specifications, research notes, datasets, PRDs, codebases, or a single document. AgentForge figures out what experts are needed, generates them, enforces capability boundaries, and executes through quality gates.
# Analyze any project
agentforge analyze ./your-project --dry-run
# Domain: financial-analysis (91% confidence)
# Recommended Roles:
# • Quantitative Analyst
# • Portfolio Optimizer
# • Compliance Auditor
# Key Open Questions:
# ? Are regulatory constraints fully captured?
Why AgentForge Exists
| Typical AI Tools | AgentForge |
| ----------------------- | --------------------------------- |
| You explain context | AgentForge analyzes your input |
| One generic assistant | Dynamically generated specialists |
| Stateless conversations | Persistent learning across runs |
| Manual orchestration | Compiler-enforced execution |
| Writes snippets | Produces reviewable deliverables |
You don't manage the AI. The project tells AgentForge what's needed.
AgentForge v3.0 — Domain-Agnostic Agent Compiler
AgentForge v3.0 removes all hardcoded agent types.
There are no predefined roles, no enums, no fixed domains.
Instead:
- The project is analyzed
- Roles are recommended dynamically
- Roles are normalized & deduplicated
- Personas are researched by LLMs
- Capabilities are enforced by policy
- Agents are compiled, validated, and only then instantiated
AgentForge will reject agents that violate capability, security, or determinism guarantees—even if an LLM proposes them.
const { agents, result } = await agentforge.generateAgentTeam({
  projectName: 'Risk Platform',
  domain: 'quantitative finance',
  description: 'Portfolio risk optimization system',
  techStack: ['Python', 'NumPy', 'PostgreSQL'],
});
Generated roles are not limited to software development.
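The compile gate described earlier (validate, normalize, deduplicate, policy-check before instantiation) can be sketched as a pure function over proposed roles. Everything here — the `ProposedAgent` shape, `compileAgents`, and the capability whitelist — is illustrative, not AgentForge's actual API:

```typescript
// Hypothetical sketch of a compile gate: proposed roles are normalized,
// deduplicated, and policy-checked before any agent may be instantiated.
interface ProposedAgent {
  role: string;
  capabilities: string[];
}

// Illustrative capability whitelist; a real policy would be richer.
const ALLOWED_CAPABILITIES = new Set(["read_files", "write_files", "run_tests"]);

function compileAgents(proposed: ProposedAgent[]): ProposedAgent[] {
  const seen = new Set<string>();
  const compiled: ProposedAgent[] = [];
  for (const agent of proposed) {
    const role = agent.role.trim().toLowerCase(); // normalize role names
    if (seen.has(role)) continue;                 // deduplicate overlapping roles
    const illegal = agent.capabilities.filter(c => !ALLOWED_CAPABILITIES.has(c));
    if (illegal.length > 0) {
      // Policy gate: reject the agent even though an LLM proposed it.
      throw new Error(`Rejected "${agent.role}": disallowed capabilities ${illegal.join(", ")}`);
    }
    seen.add(role);
    compiled.push({ role, capabilities: agent.capabilities });
  }
  return compiled;
}
```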
What Makes v3.0 Different (and Hard)
| Feature | Why It Matters |
| -------------------------- | ------------------------------------- |
| Dynamic Roles | Any domain, any expertise |
| Persona ≠ Capability | LLM creativity without security risk |
| Deterministic Backends | Same input → same execution |
| Role Normalization | No duplicate or overlapping agents |
| Compiler Gate | Invalid agents cannot run |
| Invariant Tests | Architecture is enforced, not implied |
AgentForge v3.0 behaves more like a compiler than a framework.
See: v3.0 Design Contract
VCC — Verifiable Contract Criteria
VCC is AgentForge's contract-driven quality system. Instead of hoping outputs are good, you define acceptance criteria upfront in YAML and AgentForge enforces them.
# vcc_research.yml
acceptance:
  - acId: AC-RES-1
    targetDeliverables: [A1]
    type: structure
    severity: must
    rule:
      requiredSections: [Queries, Sources, Methodology, Limitations]
How it works:
- Define VCC spec (YAML) with acceptance criteria
- Pass the VCC to execution via task.metadata.vcc
- The quality pipeline scores artifacts against the criteria
- Failed criteria trigger refinement with specific feedback
- Task fails if MUST criteria aren't satisfied
Type-safe integration:
// VCCContext enforces both fields required together
const vccContext: VCCContext = {
  vcc: loadedSpec, // Required
  artifactId: 'A1' // Required
};
// Pass to execution - scoring happens automatically
pipeline.assessQuality(output, { task, vccContext });
VCC supports:
- Structure criteria — Required sections/headers
- Rubric criteria — LLM-evaluated quality dimensions
- Schema criteria — JSON/data format validation
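As a rough illustration of how a structure criterion might be scored, here is a minimal check for required markdown sections. `checkStructure` and its return shape are assumptions made for this sketch, not AgentForge's real scoring code:

```typescript
// Illustrative scorer for a VCC "structure" criterion: verify that a
// markdown artifact contains every required section header.
function checkStructure(
  artifact: string,
  requiredSections: string[],
): { pass: boolean; missing: string[] } {
  // Collect header text from lines beginning with one or more '#'.
  const headers = artifact
    .split("\n")
    .filter(line => line.startsWith("#"))
    .map(line => line.replace(/^#+\s*/, "").trim());
  const missing = requiredSections.filter(s => !headers.includes(s));
  // Failed criteria would feed "missing" back as refinement feedback.
  return { pass: missing.length === 0, missing };
}
```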
Codebase Navigation & Agent Context
AgentForge now ships a 3-tier filesystem pyramid for agent navigability — all backends (Groq, Gemini, local models, Claude) get structural project context automatically:
| Tier | File | Content | Token cost |
|------|------|---------|------------|
| 1 | CLAUDE.md → Module Map section | 29-row module index (entry points, purposes) | ~870, once per session |
| 2 | src/*/README.md | Key files, entry point, dependency map per module | On-demand via scout |
| 3 | Source files | Implementation | On-demand |
27 per-directory README index files now ship in src/, covering every module from src/core/ through src/tui/. Each lists real file names, real exports, entry points, and Depends On / Depended On By relationships — not guesses.
projectIndexFile config option — the tier-1 file is configurable, not hardcoded to CLAUDE.md. Projects with PROJECT_CONTEXT.md, README.md, or any other index file work automatically:
# Use a different index file
agentforge orchestrate --project-index-file PROJECT_CONTEXT.md
# Via env var
AgentForge_PROJECT_INDEX_FILE=README.md agentforge orchestrate
The ScoutAgent loads and injects the Module Map section (or the full file, up to 3000 chars) into every agent's context before task execution. SubagentBootupRitual does the same for directly-spawned subagents.
VCC Resolution
- VCC is now resolved once at the composition root (orchestrate.ts) before engine creation
- ResolvedVCCContext provides a 4-tier precedence chain: explicit VCC → PRD resource constraints → model defaults → system defaults
- All 3 execution engines (local, API, Claude Code) consume the same resolved values — no split-brain threshold resolution
- All resolved properties are readonly after construction
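The precedence chain can be sketched as a simple coalescing resolver. The `Thresholds` shape and field names below are illustrative, not the actual ResolvedVCCContext API:

```typescript
// Sketch of a 4-tier precedence chain: explicit VCC values win, then PRD
// constraints, then model defaults, then system defaults.
interface Thresholds {
  qualityThreshold?: number;
  maxRetries?: number;
}

function resolveThresholds(
  explicitVcc: Thresholds,
  prd: Thresholds,
  modelDefaults: Thresholds,
  systemDefaults: Required<Thresholds>,
): Required<Thresholds> {
  // First defined value along the chain wins.
  const pick = <K extends keyof Thresholds>(key: K): number =>
    (explicitVcc[key] ?? prd[key] ?? modelDefaults[key] ?? systemDefaults[key]) as number;
  // Freezing makes the resolved values effectively readonly after construction.
  return Object.freeze({
    qualityThreshold: pick("qualityThreshold"),
    maxRetries: pick("maxRetries"),
  });
}
```

Because resolution happens once, every engine consumes the same frozen object and no engine re-derives its own thresholds.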
Execution System (Production-Grade)
In v3.0, all generated agents execute exclusively through this system.
AgentForge includes a full execution engine with:
- Task state machine
- Verification gates (checks, analysis, security)
- AI + human review
- Rework loops
- Workspace isolation (e.g., via Git worktrees)
- Audit logging
pending → in_progress → quality_gate → ai_review → human_review → done
↑ ↓ ↓
└────── in_rework ──────────────┘
This system is always on by default. Risky changes cannot silently bypass review.
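The lifecycle above can be modeled as an explicit transition table, so that only listed transitions are legal and review states cannot be skipped. The exact set of legal transitions here (in particular, which states may enter in_rework) is inferred from the diagram and is an assumption, not AgentForge's actual implementation:

```typescript
// Sketch of the task lifecycle as an explicit state machine.
type TaskState =
  | "pending" | "in_progress" | "quality_gate"
  | "ai_review" | "human_review" | "in_rework" | "done";

// Assumed legal transitions, read off the diagram above.
const TRANSITIONS: Record<TaskState, TaskState[]> = {
  pending: ["in_progress"],
  in_progress: ["quality_gate"],
  quality_gate: ["ai_review", "in_rework"],
  ai_review: ["human_review", "in_rework"],
  human_review: ["done", "in_rework"],
  in_rework: ["in_progress"],
  done: [],
};

function transition(from: TaskState, to: TaskState): TaskState {
  if (!TRANSITIONS[from].includes(to)) {
    // A task can never jump straight from pending to done.
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```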
See: Execution System
Engine Architecture
LocalExecutionEngine decomposed from 3,820 to 1,307 lines — now a thin coordinator delegating to focused modules:
| Module | Lines | Responsibility |
|--------|-------|----------------|
| modes/atomic-workflow.ts | 1,667 | Atomic decomposition, remediation, compile gate |
| modes/critic-workflow.ts | 456 | VCC + generic critic passes |
| modes/ralph-workflow.ts | 65 | Ralph loop strategy |
| shared/domain-detection.ts | 477 | Tech stack detection, domain context |
| shared/execution-pipeline.ts | 97 | Quality pipeline factory shared by all engines |
| core/BaseExecutionEngine.ts | 90 | Abstract base class with template method pattern |
Provider-Agnostic by Design
AgentForge works with:
- Claude (recommended)
- OpenAI
- Local LLMs (Ollama, LM Studio)
- Custom providers via plugins
Switch providers without changing code or architecture.
No Domain Packs Needed
AgentForge is domain-agnostic by design—no pluggable validator packs are required. Domain-specific behavior emerges automatically from three mechanisms that are always active:
- VCC (Verifiable Contract Criteria) — acceptance criteria defined upfront in YAML drive what "done" means for any domain
- DomainContextExtractor — detects tech stack, language, and domain signals from the project itself
- SQLite pattern learning — persists rubrics, thresholds, and quality signals across runs, improving accuracy automatically
There is no plugin API to implement, no pack to install, and no domain to register. Point AgentForge at any project and it adapts.
Quick Start
git clone https://github.com/Platano78/AgentForge.git
cd AgentForge
npm install
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
npm run build
npm test
# Generate a team from a codebase
npm run agentforge create-team --analyze ./project
See: Quick Start Guide | CLI Reference
Who This Is For
AgentForge is built for:
- Engineers shipping complex systems
- Teams tired of "AI assistants" that don't understand context
- Projects where quality, review, and determinism matter
- Anyone who wants AI to behave like an organization, not a chatbot
Not designed for casual chat-based coding or one-off prompt experiments.
Quality Scoring System
AgentForge uses an LLM Judge to evaluate every generated artifact against domain-specific rubrics. The judge model is always different from the generator model to prevent self-evaluation bias.
4-Tier Rubric Lookup
Rubrics are resolved through a cascading lookup that balances speed and cost:
| Tier | Source | Latency | Cost |
|------|--------|---------|------|
| 1 | Memory cache — session-scoped, instant hit | ~0ms | Free |
| 2 | Hardcoded rubrics — built-in domain defaults | ~0ms | Free |
| 3 | SQLite learned rubrics — persisted across sessions | ~1ms | Free |
| 4 | LLM generation — on-demand for unknown domains | ~2-5s | 1 API call |
If all tiers fail, the system falls back to the general rubric.
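The cascade can be sketched as a chain of lookups where the first hit wins. The `Rubric` shape and the map-backed tiers below are stand-ins for this sketch, not the real cache, SQLite, or LLM interfaces:

```typescript
// Sketch of a cascading rubric lookup: each tier is tried in order,
// the first hit wins, and a general rubric is the final fallback.
type Rubric = { domain: string; criteria: string[] };

const GENERAL_RUBRIC: Rubric = { domain: "general", criteria: ["correctness", "clarity"] };

function lookupRubric(
  domain: string,
  memoryCache: Map<string, Rubric>,   // Tier 1: session-scoped
  hardcoded: Map<string, Rubric>,     // Tier 2: built-in defaults
  sqliteLearned: Map<string, Rubric>, // Tier 3: persisted across sessions
  llmGenerate: (domain: string) => Rubric | undefined, // Tier 4: on-demand
): Rubric {
  const rubric =
    memoryCache.get(domain) ??
    hardcoded.get(domain) ??
    sqliteLearned.get(domain) ??
    llmGenerate(domain);
  if (rubric) memoryCache.set(domain, rubric); // warm the session cache
  return rubric ?? GENERAL_RUBRIC;             // all tiers failed
}
```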
Supported Domain Rubrics
Hardcoded rubrics ship for these domains:
- unity — correctness, performance (GC avoidance, pooling), architecture (MonoBehaviour, ScriptableObjects), maintainability, documentation
- godot — correctness (GDScript 4.x), node-lifecycle, signals-and-patterns, type-safety, performance
- typescript — correctness (strict mode), type-safety, architecture (SOLID), error-handling, testability
- general — correctness, clarity, completeness, best-practices
Any domain not listed above will trigger Tier 3/4 lookup and the result is cached for future use.
Judge Model Selection
The judge provider is selected by checking API keys in priority order:
| Priority | Provider | Model | Notes |
|----------|----------|-------|-------|
| 1 | Claude | claude-3-5-haiku-20241022 | High quality, good rate limits |
| 2 | Gemini | gemini-2.0-flash | Fast, generous free tier |
| 3 | DeepSeek | deepseek-chat | Strong reasoning |
| 4 | Groq | llama-3.3-70b-versatile | Fast but limited free tier |
If no API key is available, quality assessment is skipped and a warning is logged.
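This selection logic can be sketched as a priority-ordered scan of the environment, using the env-var names documented in this README (the function itself is illustrative):

```typescript
// Sketch of judge selection: the first provider whose API key is present
// wins; with no keys, quality assessment is skipped.
const JUDGE_PRIORITY: Array<{ provider: string; envKeys: string[] }> = [
  { provider: "claude", envKeys: ["ANTHROPIC_API_KEY"] },
  { provider: "gemini", envKeys: ["GEMINI_API_KEY", "GOOGLE_API_KEY"] },
  { provider: "deepseek", envKeys: ["DEEPSEEK_API_KEY"] },
  { provider: "groq", envKeys: ["GROQ_API_KEY"] },
];

function selectJudge(env: Record<string, string | undefined>): string | null {
  for (const { provider, envKeys } of JUDGE_PRIORITY) {
    if (envKeys.some(key => env[key])) return provider;
  }
  return null; // no key available: skip assessment and log a warning
}
```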
Quality Recommendations
Each evaluation produces a score (0-100) and a recommendation:
- pass (80-100) — artifact accepted, task complete
- revise (50-79) — automatic retry with criterion-specific feedback
- fail (0-49) — task marked as failed
Per-criterion scores and feedback are tracked for all domains, enabling continuous rubric improvement.
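The score bands translate directly into a small function. The band edges come from this README; the function name is just for illustration:

```typescript
// Map a 0-100 quality score to a recommendation using the bands above.
type Recommendation = "pass" | "revise" | "fail";

function recommend(score: number): Recommendation {
  if (score >= 80) return "pass";   // 80-100: artifact accepted
  if (score >= 50) return "revise"; // 50-79: retry with criterion-specific feedback
  return "fail";                    // 0-49: task marked as failed
}
```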
Atomic Task Execution
Atomic mode decomposes a PRD (Product Requirements Document) into file-level tasks, each targeting the creation of a single file in 10 minutes or less. This replaces monolithic "generate the whole project" prompts with bounded, verifiable units of work.
How It Works
- Decomposition — The PRD is analyzed and split into atomic tasks with explicit dependencies
- Execution — Tasks run in dependency order, each with fresh LLM context
- Quality gate — Each task output is scored by the LLM Judge
- Retry — Tasks scoring "revise" (50-79%) are automatically re-generated with feedback
- Persistence — Task specs and results are stored in SQLite for resume capability
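Step 2 above — running tasks in dependency order — is classically done with a topological sort. Here is one plausible sketch using Kahn's algorithm, with an illustrative task shape (AgentForge's actual scheduler may differ):

```typescript
// Order atomic tasks so every task runs after all of its dependencies.
interface AtomicTask {
  id: string;
  dependsOn: string[];
}

function dependencyOrder(tasks: AtomicTask[]): string[] {
  // Count unmet dependencies per task, and invert the edges.
  const indegree = new Map(tasks.map(t => [t.id, t.dependsOn.length]));
  const dependents = new Map<string, string[]>();
  for (const t of tasks) {
    for (const dep of t.dependsOn) {
      dependents.set(dep, [...(dependents.get(dep) ?? []), t.id]);
    }
  }
  // Start with tasks that have no dependencies.
  const ready = tasks.filter(t => t.dependsOn.length === 0).map(t => t.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      indegree.set(next, indegree.get(next)! - 1);
      if (indegree.get(next) === 0) ready.push(next);
    }
  }
  // Any leftover tasks imply a cycle, which cannot be scheduled.
  if (order.length !== tasks.length) throw new Error("Dependency cycle detected");
  return order;
}
```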
Default Configuration
All numeric defaults are centralized in src/config/engine-defaults.ts and overridable via the AgentForge_{CONSTANT_NAME} environment variable.
| Setting | Default | CLI Flag | Env Override |
|---------|---------|----------|--------------|
| Max tasks | 50 | --atomic-max-tasks | AgentForge_DEFAULT_ATOMIC_MAX_TASKS |
| Quality threshold | 55 | --quality-threshold | AgentForge_DEFAULT_QUALITY_THRESHOLD |
| Max retries | 2 | — | AgentForge_DEFAULT_ATOMIC_TASK_RETRIES |
| Target minutes/task | 10 | --atomic-target-minutes | AgentForge_DEFAULT_ATOMIC_TARGET_MINUTES |
| Max quality retries | 1 | — | AgentForge_DEFAULT_MAX_REGEN_ATTEMPTS |
| TDD mode | off | --atomic-tdd | — |
| Compile gate timeout | 120s | — | AgentForge_COMPILE_GATE_TIMEOUT_MS |
| Task timeout | 300s | — | AgentForge_DEFAULT_ATOMIC_TASK_TIMEOUT_MS |
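The AgentForge_{CONSTANT_NAME} override convention can be sketched as a small helper. `envDefault` is a hypothetical name; the real implementation may parse or validate differently:

```typescript
// Return a numeric default, replaced by AgentForge_{NAME} from the
// environment when that variable parses as a finite number.
function envDefault(
  name: string,
  fallback: number,
  env: Record<string, string | undefined>,
): number {
  const raw = env[`AgentForge_${name}`];
  const parsed = raw === undefined ? NaN : Number(raw);
  // Ignore unset or non-numeric overrides rather than failing.
  return Number.isFinite(parsed) ? parsed : fallback;
}
```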
TDD Mode
When --atomic-tdd is enabled, each task follows the TDD cycle:
- RED — Generate the test file first (expects failure)
- GREEN — Generate the implementation to make tests pass
- REFACTOR — Optional cleanup pass
CLI Flags
# Enable TDD cycles
agentforge orchestrate -p myproject --atomic-tdd
# Limit decomposition to 30 tasks
agentforge orchestrate -p myproject --atomic-max-tasks 30
# Disable atomic mode (legacy behavior)
agentforge orchestrate -p myproject --no-atomic
# Resume an interrupted workflow
agentforge orchestrate -p myproject --resume
# Stop on first failure
agentforge orchestrate -p myproject --atomic-stop-on-failure
Execution Modes
AgentForge supports four execution modes. All modes use the same agent compiler and workflow structure but differ in how tasks are dispatched.
autonomous-local (Mode 2)
Fully autonomous execution using a local LLM. Zero API cost.
agentforge orchestrate -p myproject -m autonomous-local \
--local-endpoint http://localhost:8080
Supports llamacpp-router, Ollama, LM Studio, or any OpenAI-compatible local server. The loaded model is auto-detected from the server.
autonomous-api (Mode 1)
Fully autonomous execution using a cloud API. Highest quality output.
agentforge orchestrate -p myproject -m autonomous-api \
--api-provider claude
Supports claude, openai, groq, nvidia, deepseek, together, and custom providers.
interactive
Human-guided execution with approval checkpoints between tasks.
agentforge orchestrate -p myproject -m interactive \
--checkpoint-threshold 0.75
The --checkpoint-threshold (0.0-1.0) controls how often AgentForge pauses for human review. Lower values mean more checkpoints.
claude-code
Persona-based orchestration through Claude Code CLI. Generates agent personas and a launcher script.
agentforge orchestrate -p myproject -m claude-code \
--output ./agentforge-exports
Auto-detects available CLI orchestrators (Claude Code, Gemini CLI, Ollama).
Environment Variables
Required (at least one for quality scoring)
| Variable | Purpose |
|----------|---------|
| ANTHROPIC_API_KEY | Claude API access (judge + autonomous-api mode) |
Judge Fallbacks (checked in order)
| Variable | Purpose |
|----------|---------|
| GEMINI_API_KEY or GOOGLE_API_KEY | Gemini Flash judge provider |
| DEEPSEEK_API_KEY | DeepSeek judge provider |
| GROQ_API_KEY | Groq judge provider |
Execution
| Variable | Purpose |
|----------|---------|
| LOCAL_LLM_ENDPOINT | Local LLM server URL (e.g., http://localhost:8080) |
| LOCAL_LLM_MODEL | Override local model name (auto-detected if omitted) |
| OPENAI_API_KEY | OpenAI provider for autonomous-api mode |
Observability (optional)
| Variable | Purpose |
|----------|---------|
| LANGSMITH_API_KEY | LangSmith tracing and observability |
Documentation
| Document | Description |
|----------|-------------|
| Quick Start | Get running in 5 minutes |
| CLI Reference | Command-line usage |
| Architecture | System design |
| v3.0 Design Contract | Agent compiler guarantees |
| Execution System | Task lifecycle and review |
| Configuration | Environment and settings |
| Custom Providers | Provider plugin system |
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Navigating the codebase? AgentForge uses a 3-tier navigation pyramid:
- CLAUDE.md → Module Map — 29-row table mapping every src/ module to its purpose and entry point. Start here for any architecture or module-location question.
- src/*/README.md — per-module index with key files, exports, entry point, and dependency relationships.
- Source files — implementation detail.
If you're unsure where a capability lives, the Module Map in CLAUDE.md resolves it without grepping.
git checkout -b feature/amazing-feature
npm test
# Submit pull request
License
Apache License 2.0 - see LICENSE
In One Sentence
AgentForge turns projects into reproducible, verifiable execution pipelines.
