forgeagent
v3.0.0-alpha.12
AgentForge — AI Agent Factory. Requirements in, verified deliverables out.
AgentForge — Intent Compiler for AI Agents
Define contracts. Compile agents. Verify deliverables.
AgentForge is a domain-agnostic intent compiler. It takes structured goals—contracts, specs, requirements—and executes them through dynamically generated agents with bounded capabilities and evidence-based verification.
What AgentForge actually is:
- An agent compiler: proposed agents are validated, normalized, and policy-checked before instantiation
- A contract enforcement system: VCC (Verifiable Contract Criteria) define acceptance criteria before execution
- A verification framework: deliverables must pass auditable checks to be accepted
- A bounded execution runtime: agents operate within capability, tool, and resource constraints
What AgentForge is NOT:
- A chat assistant with extra steps
- A "let AI figure it out" autonomous system
- A software-only framework — AgentForge is domain-agnostic; domain behavior is derived from VCC, DomainContextExtractor, and learned patterns
- A replacement for human judgment on critical decisions
What AgentForge compiles:
- A plan (what will be produced)
- Contracts (what "done" means)
- Execution steps (how work proceeds)
- Evidence requirements (how success is proven)
The output is not "an answer"—it is a bundle of artifacts + verification results.
This is not a chat assistant. AgentForge is an intent compiler + bounded execution system that turns contracts into reproducible, verifiable deliverables.
What AgentForge Actually Does
Project → Analysis → Team Design → Agent Compilation → Execution → Review → Deliverables
The critical step is Agent Compilation: proposed agents are validated, normalized, and policy-checked before they are allowed to exist.
You give AgentForge a project—specifications, research notes, datasets, PRDs, codebases, or a single document. AgentForge figures out what experts are needed, generates them, enforces capability boundaries, and executes through quality gates.
# Analyze any project
agentforge analyze ./your-project --dry-run
# Domain: financial-analysis (91% confidence)
# Recommended Roles:
# • Quantitative Analyst
# • Portfolio Optimizer
# • Compliance Auditor
# Key Open Questions:
# ? Are regulatory constraints fully captured?
Why AgentForge Exists
| Typical AI Tools | AgentForge |
| ----------------------- | --------------------------------- |
| You explain context | AgentForge analyzes your input |
| One generic assistant | Dynamically generated specialists |
| Stateless conversations | Persistent learning across runs |
| Manual orchestration | Compiler-enforced execution |
| Writes snippets | Produces reviewable deliverables |
You don't manage the AI. The project tells AgentForge what's needed.
AgentForge v3.0 — Domain-Agnostic Agent Compiler
AgentForge v3.0 removes all hardcoded agent types.
There are no predefined roles, no enums, no fixed domains.
Instead:
- The project is analyzed
- Roles are recommended dynamically
- Roles are normalized & deduplicated
- Personas are researched by LLMs
- Capabilities are enforced by policy
- Agents are compiled, validated, and only then instantiated
AgentForge will reject agents that violate capability, security, or determinism guarantees—even if an LLM proposes them.
const { agents, result } = await agentforge.generateAgentTeam({
  projectName: 'Risk Platform',
  domain: 'quantitative finance',
  description: 'Portfolio risk optimization system',
  techStack: ['Python', 'NumPy', 'PostgreSQL'],
});
Generated roles are not limited to software development.
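The compile gate described earlier (validate, normalize, deduplicate, policy-check before instantiation) can be sketched as a pure function over proposed roles. Everything here — the `ProposedAgent` shape, `compileAgents`, and the capability whitelist — is illustrative, not AgentForge's actual API:

```typescript
// Hypothetical sketch of a compile gate: proposed roles are normalized,
// deduplicated, and policy-checked before any agent may be instantiated.
interface ProposedAgent {
  role: string;
  capabilities: string[];
}

// Illustrative capability whitelist; a real policy would be richer.
const ALLOWED_CAPABILITIES = new Set(["read_files", "write_files", "run_tests"]);

function compileAgents(proposed: ProposedAgent[]): ProposedAgent[] {
  const seen = new Set<string>();
  const compiled: ProposedAgent[] = [];
  for (const agent of proposed) {
    const role = agent.role.trim().toLowerCase(); // normalize role names
    if (seen.has(role)) continue;                 // deduplicate overlapping roles
    const illegal = agent.capabilities.filter(c => !ALLOWED_CAPABILITIES.has(c));
    if (illegal.length > 0) {
      // Policy gate: reject the agent even though an LLM proposed it.
      throw new Error(`Rejected "${agent.role}": disallowed capabilities ${illegal.join(", ")}`);
    }
    seen.add(role);
    compiled.push({ role, capabilities: agent.capabilities });
  }
  return compiled;
}
```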
What Makes v3.0 Different (and Hard)
| Feature | Why It Matters |
| -------------------------- | ------------------------------------- |
| Dynamic Roles | Any domain, any expertise |
| Persona ≠ Capability | LLM creativity without security risk |
| Deterministic Backends | Same input → same execution |
| Role Normalization | No duplicate or overlapping agents |
| Compiler Gate | Invalid agents cannot run |
| Invariant Tests | Architecture is enforced, not implied |
AgentForge v3.0 behaves more like a compiler than a framework.
See: v3.0 Design Contract
VCC — Verifiable Contract Criteria
VCC is AgentForge's contract-driven quality system. Instead of hoping outputs are good, you define acceptance criteria upfront in YAML and AgentForge enforces them.
# vcc_research.yml
acceptance:
  - acId: AC-RES-1
    targetDeliverables: [A1]
    type: structure
    severity: must
    rule:
      requiredSections: [Queries, Sources, Methodology, Limitations]
How it works:
- Define VCC spec (YAML) with acceptance criteria
- Pass the VCC to execution via task.metadata.vcc
- The quality pipeline scores artifacts against the criteria
- Failed criteria trigger refinement with specific feedback
- Task fails if MUST criteria aren't satisfied
Type-safe integration:
// VCCContext enforces both fields required together
const vccContext: VCCContext = {
  vcc: loadedSpec, // Required
  artifactId: 'A1' // Required
};
// Pass to execution - scoring happens automatically
pipeline.assessQuality(output, { task, vccContext });
VCC supports:
- Structure criteria — Required sections/headers
- Rubric criteria — LLM-evaluated quality dimensions
- Schema criteria — JSON/data format validation
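As a rough illustration of how a structure criterion might be scored, here is a minimal check for required markdown sections. `checkStructure` and its return shape are assumptions made for this sketch, not AgentForge's real scoring code:

```typescript
// Illustrative scorer for a VCC "structure" criterion: verify that a
// markdown artifact contains every required section header.
function checkStructure(
  artifact: string,
  requiredSections: string[],
): { pass: boolean; missing: string[] } {
  // Collect header text from lines beginning with one or more '#'.
  const headers = artifact
    .split("\n")
    .filter(line => line.startsWith("#"))
    .map(line => line.replace(/^#+\s*/, "").trim());
  const missing = requiredSections.filter(s => !headers.includes(s));
  // Failed criteria would feed "missing" back as refinement feedback.
  return { pass: missing.length === 0, missing };
}
```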
Codebase Navigation & Agent Context
AgentForge now ships a 3-tier filesystem pyramid for agent navigability — all backends (Groq, Gemini, local models, Claude) get structural project context automatically:
| Tier | File | Content | Token cost |
|------|------|---------|------------|
| 1 | CLAUDE.md → Module Map section | 29-row module index (entry points, purposes) | ~870, once per session |
| 2 | src/*/README.md | Key files, entry point, dependency map per module | On-demand via scout |
| 3 | Source files | Implementation | On-demand |
27 per-directory README index files now ship in src/, covering every module from src/core/ through src/tui/. Each lists real file names, real exports, entry points, and Depends On / Depended On By relationships — not guesses.
projectIndexFile config option — the tier-1 file is configurable, not hardcoded to CLAUDE.md. Projects with PROJECT_CONTEXT.md, README.md, or any other index file work automatically:
# Use a different index file
agentforge orchestrate --project-index-file PROJECT_CONTEXT.md
# Via env var
AgentForge_PROJECT_INDEX_FILE=README.md agentforge orchestrate
The ScoutAgent loads and injects the Module Map section (or the full file, up to 3000 chars) into every agent's context before task execution. SubagentBootupRitual does the same for directly-spawned subagents.
VCC Resolution
- VCC is now resolved once at the composition root (orchestrate.ts) before engine creation
- ResolvedVCCContext provides a 4-tier precedence chain: explicit VCC → PRD resource constraints → model defaults → system defaults
- All 3 execution engines (local, API, Claude Code) consume the same resolved values — no split-brain threshold resolution
- All resolved properties are readonly after construction
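The precedence chain can be sketched as a simple coalescing resolver. The `Thresholds` shape and field names below are illustrative, not the actual ResolvedVCCContext API:

```typescript
// Sketch of a 4-tier precedence chain: explicit VCC values win, then PRD
// constraints, then model defaults, then system defaults.
interface Thresholds {
  qualityThreshold?: number;
  maxRetries?: number;
}

function resolveThresholds(
  explicitVcc: Thresholds,
  prd: Thresholds,
  modelDefaults: Thresholds,
  systemDefaults: Required<Thresholds>,
): Required<Thresholds> {
  // First defined value along the chain wins.
  const pick = <K extends keyof Thresholds>(key: K): number =>
    (explicitVcc[key] ?? prd[key] ?? modelDefaults[key] ?? systemDefaults[key]) as number;
  // Freezing makes the resolved values effectively readonly after construction.
  return Object.freeze({
    qualityThreshold: pick("qualityThreshold"),
    maxRetries: pick("maxRetries"),
  });
}
```

Because resolution happens once, every engine consumes the same frozen object and no engine re-derives its own thresholds.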
Execution System (Production-Grade)
In v3.0, all generated agents execute exclusively through this system.
AgentForge includes a full execution engine with:
- Task state machine
- Verification gates (checks, analysis, security)
- AI + human review
- Rework loops
- Workspace isolation (e.g., via Git worktrees)
- Audit logging
pending → in_progress → quality_gate → ai_review → human_review → done
↑ ↓ ↓
└────── in_rework ──────────────┘
This system is always on by default. Risky changes cannot silently bypass review.
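The lifecycle above can be modeled as an explicit transition table, so that only listed transitions are legal and review states cannot be skipped. The exact set of legal transitions here (in particular, which states may enter in_rework) is inferred from the diagram and is an assumption, not AgentForge's actual implementation:

```typescript
// Sketch of the task lifecycle as an explicit state machine.
type TaskState =
  | "pending" | "in_progress" | "quality_gate"
  | "ai_review" | "human_review" | "in_rework" | "done";

// Assumed legal transitions, read off the diagram above.
const TRANSITIONS: Record<TaskState, TaskState[]> = {
  pending: ["in_progress"],
  in_progress: ["quality_gate"],
  quality_gate: ["ai_review", "in_rework"],
  ai_review: ["human_review", "in_rework"],
  human_review: ["done", "in_rework"],
  in_rework: ["in_progress"],
  done: [],
};

function transition(from: TaskState, to: TaskState): TaskState {
  if (!TRANSITIONS[from].includes(to)) {
    // A task can never jump straight from pending to done.
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```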
See: Execution System
Engine Architecture
LocalExecutionEngine decomposed from 3,820 to 1,307 lines — now a thin coordinator delegating to focused modules:
| Module | Lines | Responsibility |
|--------|-------|----------------|
| modes/atomic-workflow.ts | 1,667 | Atomic decomposition, remediation, compile gate |
| modes/critic-workflow.ts | 456 | VCC + generic critic passes |
| modes/ralph-workflow.ts | 65 | Ralph loop strategy |
| shared/domain-detection.ts | 477 | Tech stack detection, domain context |
| shared/execution-pipeline.ts | 97 | Quality pipeline factory shared by all engines |
| core/BaseExecutionEngine.ts | 90 | Abstract base class with template method pattern |
Provider-Agnostic by Design
AgentForge works with:
- Claude (recommended)
- OpenAI
- Local LLMs (Ollama, LM Studio)
- Custom providers via plugins
Switch providers without changing code or architecture.
No Domain Packs Needed
AgentForge is domain-agnostic by design—no pluggable validator packs are required. Domain-specific behavior emerges automatically from three mechanisms that are always active:
- VCC (Verifiable Contract Criteria) — acceptance criteria defined upfront in YAML drive what "done" means for any domain
- DomainContextExtractor — detects tech stack, language, and domain signals from the project itself
- SQLite pattern learning — persists rubrics, thresholds, and quality signals across runs, improving accuracy automatically
There is no plugin API to implement, no pack to install, and no domain to register. Point AgentForge at any project and it adapts.
Quick Start
git clone https://github.com/Platano78/AgentForge.git
cd AgentForge
npm install
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
npm run build
npm test
# Generate a team from a codebase
npm run agentforge create-team --analyze ./project
See: Quick Start Guide | CLI Reference
Who This Is For
AgentForge is built for:
- Engineers shipping complex systems
- Teams tired of "AI assistants" that don't understand context
- Projects where quality, review, and determinism matter
- Anyone who wants AI to behave like an organization, not a chatbot
Not designed for casual chat-based coding or one-off prompt experiments.
Quality Scoring System
AgentForge uses an LLM Judge to evaluate every generated artifact against domain-specific rubrics. The judge model is always different from the generator model to prevent self-evaluation bias.
4-Tier Rubric Lookup
Rubrics are resolved through a cascading lookup that balances speed and cost:
| Tier | Source | Latency | Cost |
|------|--------|---------|------|
| 1 | Memory cache — session-scoped, instant hit | ~0ms | Free |
| 2 | Hardcoded rubrics — built-in domain defaults | ~0ms | Free |
| 3 | SQLite learned rubrics — persisted across sessions | ~1ms | Free |
| 4 | LLM generation — on-demand for unknown domains | ~2-5s | 1 API call |
If all tiers fail, the system falls back to the general rubric.
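The cascade can be sketched as a chain of lookups where the first hit wins. The `Rubric` shape and the map-backed tiers below are stand-ins for this sketch, not the real cache, SQLite, or LLM interfaces:

```typescript
// Sketch of a cascading rubric lookup: each tier is tried in order,
// the first hit wins, and a general rubric is the final fallback.
type Rubric = { domain: string; criteria: string[] };

const GENERAL_RUBRIC: Rubric = { domain: "general", criteria: ["correctness", "clarity"] };

function lookupRubric(
  domain: string,
  memoryCache: Map<string, Rubric>,   // Tier 1: session-scoped
  hardcoded: Map<string, Rubric>,     // Tier 2: built-in defaults
  sqliteLearned: Map<string, Rubric>, // Tier 3: persisted across sessions
  llmGenerate: (domain: string) => Rubric | undefined, // Tier 4: on-demand
): Rubric {
  const rubric =
    memoryCache.get(domain) ??
    hardcoded.get(domain) ??
    sqliteLearned.get(domain) ??
    llmGenerate(domain);
  if (rubric) memoryCache.set(domain, rubric); // warm the session cache
  return rubric ?? GENERAL_RUBRIC;             // all tiers failed
}
```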
Supported Domain Rubrics
Hardcoded rubrics ship for these domains:
- unity — correctness, performance (GC avoidance, pooling), architecture (MonoBehaviour, ScriptableObjects), maintainability, documentation
- godot — correctness (GDScript 4.x), node-lifecycle, signals-and-patterns, type-safety, performance
- typescript — correctness (strict mode), type-safety, architecture (SOLID), error-handling, testability
- general — correctness, clarity, completeness, best-practices
Any domain not listed above will trigger Tier 3/4 lookup and the result is cached for future use.
Judge Model Selection
The judge provider is selected by checking API keys in priority order:
| Priority | Provider | Model | Notes |
|----------|----------|-------|-------|
| 1 | Claude | claude-3-5-haiku-20241022 | High quality, good rate limits |
| 2 | Gemini | gemini-2.0-flash | Fast, generous free tier |
| 3 | DeepSeek | deepseek-chat | Strong reasoning |
| 4 | Groq | llama-3.3-70b-versatile | Fast but limited free tier |
If no API key is available, quality assessment is skipped and a warning is logged.
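This selection logic can be sketched as a priority-ordered scan of the environment, using the env-var names documented in this README (the function itself is illustrative):

```typescript
// Sketch of judge selection: the first provider whose API key is present
// wins; with no keys, quality assessment is skipped.
const JUDGE_PRIORITY: Array<{ provider: string; envKeys: string[] }> = [
  { provider: "claude", envKeys: ["ANTHROPIC_API_KEY"] },
  { provider: "gemini", envKeys: ["GEMINI_API_KEY", "GOOGLE_API_KEY"] },
  { provider: "deepseek", envKeys: ["DEEPSEEK_API_KEY"] },
  { provider: "groq", envKeys: ["GROQ_API_KEY"] },
];

function selectJudge(env: Record<string, string | undefined>): string | null {
  for (const { provider, envKeys } of JUDGE_PRIORITY) {
    if (envKeys.some(key => env[key])) return provider;
  }
  return null; // no key available: skip assessment and log a warning
}
```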
Quality Recommendations
Each evaluation produces a score (0-100) and a recommendation:
- pass (80-100) — artifact accepted, task complete
- revise (50-79) — automatic retry with criterion-specific feedback
- fail (0-49) — task marked as failed
Per-criterion scores and feedback are tracked for all domains, enabling continuous rubric improvement.
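The score bands translate directly into a small function. The band edges come from this README; the function name is just for illustration:

```typescript
// Map a 0-100 quality score to a recommendation using the bands above.
type Recommendation = "pass" | "revise" | "fail";

function recommend(score: number): Recommendation {
  if (score >= 80) return "pass";   // 80-100: artifact accepted
  if (score >= 50) return "revise"; // 50-79: retry with criterion-specific feedback
  return "fail";                    // 0-49: task marked as failed
}
```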
Atomic Task Execution
Atomic mode decomposes a PRD (Product Requirements Document) into file-level tasks, each targeting the creation of a single file in 10 minutes or less. This replaces monolithic "generate the whole project" prompts with bounded, verifiable units of work.
How It Works
- Decomposition — The PRD is analyzed and split into atomic tasks with explicit dependencies
- Execution — Tasks run in dependency order, each with fresh LLM context
- Quality gate — Each task output is scored by the LLM Judge
- Retry — Tasks scoring "revise" (50-79%) are automatically re-generated with feedback
- Persistence — Task specs and results are stored in SQLite for resume capability
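Step 2 above — running tasks in dependency order — is classically done with a topological sort. Here is one plausible sketch using Kahn's algorithm, with an illustrative task shape (AgentForge's actual scheduler may differ):

```typescript
// Order atomic tasks so every task runs after all of its dependencies.
interface AtomicTask {
  id: string;
  dependsOn: string[];
}

function dependencyOrder(tasks: AtomicTask[]): string[] {
  // Count unmet dependencies per task, and invert the edges.
  const indegree = new Map(tasks.map(t => [t.id, t.dependsOn.length]));
  const dependents = new Map<string, string[]>();
  for (const t of tasks) {
    for (const dep of t.dependsOn) {
      dependents.set(dep, [...(dependents.get(dep) ?? []), t.id]);
    }
  }
  // Start with tasks that have no dependencies.
  const ready = tasks.filter(t => t.dependsOn.length === 0).map(t => t.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      indegree.set(next, indegree.get(next)! - 1);
      if (indegree.get(next) === 0) ready.push(next);
    }
  }
  // Any leftover tasks imply a cycle, which cannot be scheduled.
  if (order.length !== tasks.length) throw new Error("Dependency cycle detected");
  return order;
}
```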
Default Configuration
All numeric defaults are centralized in src/config/engine-defaults.ts and overridable via the AgentForge_{CONSTANT_NAME} environment variable.
| Setting | Default | CLI Flag | Env Override |
|---------|---------|----------|--------------|
| Max tasks | 50 | --atomic-max-tasks | AgentForge_DEFAULT_ATOMIC_MAX_TASKS |
| Quality threshold | 55 | --quality-threshold | AgentForge_DEFAULT_QUALITY_THRESHOLD |
| Max retries | 2 | — | AgentForge_DEFAULT_ATOMIC_TASK_RETRIES |
| Target minutes/task | 10 | --atomic-target-minutes | AgentForge_DEFAULT_ATOMIC_TARGET_MINUTES |
| Max quality retries | 1 | — | AgentForge_DEFAULT_MAX_REGEN_ATTEMPTS |
| TDD mode | off | --atomic-tdd | — |
| Compile gate timeout | 120s | — | AgentForge_COMPILE_GATE_TIMEOUT_MS |
| Task timeout | 300s | — | AgentForge_DEFAULT_ATOMIC_TASK_TIMEOUT_MS |
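The AgentForge_{CONSTANT_NAME} override convention can be sketched as a small helper. `envDefault` is a hypothetical name; the real implementation may parse or validate differently:

```typescript
// Return a numeric default, replaced by AgentForge_{NAME} from the
// environment when that variable parses as a finite number.
function envDefault(
  name: string,
  fallback: number,
  env: Record<string, string | undefined>,
): number {
  const raw = env[`AgentForge_${name}`];
  const parsed = raw === undefined ? NaN : Number(raw);
  // Ignore unset or non-numeric overrides rather than failing.
  return Number.isFinite(parsed) ? parsed : fallback;
}
```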
TDD Mode
When --atomic-tdd is enabled, each task follows the TDD cycle:
- RED — Generate the test file first (expects failure)
- GREEN — Generate the implementation to make tests pass
- REFACTOR — Optional cleanup pass
CLI Flags
# Enable TDD cycles
agentforge orchestrate -p myproject --atomic-tdd
# Limit decomposition to 30 tasks
agentforge orchestrate -p myproject --atomic-max-tasks 30
# Disable atomic mode (legacy behavior)
agentforge orchestrate -p myproject --no-atomic
# Resume an interrupted workflow
agentforge orchestrate -p myproject --resume
# Stop on first failure
agentforge orchestrate -p myproject --atomic-stop-on-failure
Execution Modes
AgentForge supports four execution modes. All modes use the same agent compiler and workflow structure but differ in how tasks are dispatched.
autonomous-local (Mode 2)
Fully autonomous execution using a local LLM. Zero API cost.
agentforge orchestrate -p myproject -m autonomous-local \
--local-endpoint http://localhost:8080
Supports llamacpp-router, Ollama, LM Studio, or any OpenAI-compatible local server. The loaded model is auto-detected from the server.
autonomous-api (Mode 1)
Fully autonomous execution using a cloud API. Highest quality output.
agentforge orchestrate -p myproject -m autonomous-api \
--api-provider claude
Supports claude, openai, groq, nvidia, deepseek, together, and custom providers.
interactive
Human-guided execution with approval checkpoints between tasks.
agentforge orchestrate -p myproject -m interactive \
--checkpoint-threshold 0.75
The --checkpoint-threshold (0.0-1.0) controls how often AgentForge pauses for human review. Lower values mean more checkpoints.
claude-code
Persona-based orchestration through Claude Code CLI. Generates agent personas and a launcher script.
agentforge orchestrate -p myproject -m claude-code \
--output ./agentforge-exports
Auto-detects available CLI orchestrators (Claude Code, Gemini CLI, Ollama).
Environment Variables
Required (at least one for quality scoring)
| Variable | Purpose |
|----------|---------|
| ANTHROPIC_API_KEY | Claude API access (judge + autonomous-api mode) |
Judge Fallbacks (checked in order)
| Variable | Purpose |
|----------|---------|
| GEMINI_API_KEY or GOOGLE_API_KEY | Gemini Flash judge provider |
| DEEPSEEK_API_KEY | DeepSeek judge provider |
| GROQ_API_KEY | Groq judge provider |
Execution
| Variable | Purpose |
|----------|---------|
| LOCAL_LLM_ENDPOINT | Local LLM server URL (e.g., http://localhost:8080) |
| LOCAL_LLM_MODEL | Override local model name (auto-detected if omitted) |
| OPENAI_API_KEY | OpenAI provider for autonomous-api mode |
Observability (optional)
| Variable | Purpose |
|----------|---------|
| LANGSMITH_API_KEY | LangSmith tracing and observability |
Documentation
| Document | Description |
|----------|-------------|
| Quick Start | Get running in 5 minutes |
| CLI Reference | Command-line usage |
| Architecture | System design |
| v3.0 Design Contract | Agent compiler guarantees |
| Execution System | Task lifecycle and review |
| Configuration | Environment and settings |
| Custom Providers | Provider plugin system |
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Navigating the codebase? AgentForge uses a 3-tier navigation pyramid:
- CLAUDE.md → Module Map — 29-row table mapping every src/ module to its purpose and entry point. Start here for any architecture or module-location question.
- src/*/README.md — per-module index with key files, exports, entry point, and dependency relationships.
- Source files — implementation detail.
If you're unsure where a capability lives, the Module Map in CLAUDE.md resolves it without grepping.
git checkout -b feature/amazing-feature
npm test
# Submit pull request
License
Apache License 2.0 - see LICENSE
In One Sentence
AgentForge turns projects into reproducible, verifiable execution pipelines.
