useathena

v0.4.3

Published

12 hours ago

athena captures decision traces and turns them into reusable tacit knowledge for agents.

0High
0Medium
0Low

Athena captures decision traces from real work and turns them into reusable tacit knowledge for agents. It is not a generic memory store or enterprise context graph. It is the local-first tacit knowledge layer that makes context actionable: one place where how experts decide becomes transferable across Claude, Codex, browser agents, internal agents, and whatever comes next.

Most agent failures in expert work are not caused by missing public facts. They happen because the agent lacks the unwritten judgment that connects context to action: tone boundaries, account-specific exceptions, precedent from past decisions, approval paths, when to escalate, what not to assume, what "good enough" looks like, and which correction from last time should generalize to a new, non-identical task.

Athena watches the moments where that judgment becomes visible, records them as immutable decision traces, turns those traces into evidence-backed hypotheses, validates where they transfer, and serves the right rules back to agents before they act. The goal is not to replay old answers. It is to help agents recognize structurally similar situations and apply the team's judgment to new work.

Athena does not try to extract someone's whole mind. It captures decision traces, asks rare clarifying questions only when a correction is high-value, filters weak traces before learning, and validates reusable judgment through future outcomes.

work -> decision traces -> tacit hypotheses -> reusable tacit knowledge -> agent briefs -> new-task outcomes -> better rules

npx useathena onboard --yes

Product thesis

Athena is a local-first tacit knowledge layer built from decision traces.

Context graphs answer what exists, where it came from, and how it is connected. Athena focuses on the missing layer: how an expert turned that context into action. It can draw on documents, messages, CRM records, repos, and other context sources, but its durable asset is the decision trace — the correction, approval, override, choice, rationale, boundary, and outcome that explain why a decision was allowed and what pattern should transfer.

The user should not have to write a giant operating manual for every assistant or rebuild the same context inside every vendor's tool. Instead, Athena should learn from the normal feedback loop:

The user edits, rejects, approves, or redirects agent output.
Athena captures that judgment moment as an immutable decision trace.
A model-backed engine infers candidate rules with supporting examples and boundaries.
Replay validation decides whether rules can be served; observed outcomes decide whether they become trusted.
Agents call Athena for a brief before acting on new work.
Outcomes feed back into confidence, stale detection, and counterexamples.

This is why Athena should sit outside any one agent. Agents can observe local corrections, but Athena owns the shared evidence, validation, and transfer logic. A future agent should be able to handle a new task by retrieving structurally similar traces, not by finding an exact precedent or relearning the same lesson in a vendor silo.

The north-star metric is a declining correction rate per served brief.

Product shape

Athena is designed as two layers:

Athena Harness — the open local runtime: capture surfaces, event schema, local store, MCP tools, connectors, outcome tracking, and a baseline inference engine that can run with local or bring-your-own models.
Athena Brain — the future hosted intelligence layer: stronger extraction, deterministic validators, replay at scale, confidence calibration, evals, and eventually specialist models trained only where users have explicitly consented.

Today, this repo contains the Harness plus the baseline local engine. The Brain is roadmap, not a dependency: Athena should stay useful even when someone runs it fully locally with their own model.

Licensing note: the 0.4.0 package is public but still UNLICENSED while the final open-source boundary is decided. Treat the code as source-available for evaluation, pilots, and design-partner use, not as an open-source grant yet.

Where we're at

Athena 0.4.0 is a public beta of the local Harness: the full loop works end-to-end on real work, the package installs from npm/GitHub, and the core question is now empirical: whether captured decision traces reduce repeated corrections for real users.

Working today:

Capture — a Chrome extension for LinkedIn and Gmail draft edits, plus opt-in capture on any other site through a config-driven generic adapter; Claude Code and Codex sensor hooks for corrections, approvals, and remember: notes; CLI and MCP capture for everything else. Corrections get semantic edit labels such as tone, specificity, risk, timing, and ask-strength; decisions can carry structured options, rankings, choices, and rationale.
Self-configuring sensors — for opted-in sites, Athena infers capture selectors from sanitized DOM outlines using the configured model and syncs them back to the extension (athena site-config).
Store — local SQLite as the source of truth; raw decision traces private by default; every learned rule cites the evidence that produced it.
Learn — a model-agnostic hypothesis engine (claude/codex CLI, Anthropic or OpenAI-compatible APIs, Ollama) that auto-triggers as evidence accumulates, filters weak/noisy traces before inference, surfaces rare one-question micro-probes for high-value corrections, treats approvals/probes/mixed edit directions as boundary evidence, replay-validates rules before serving, and extracts cited entity facts from the same evidence.
Serve — an MCP surface of exactly four tools; athena_brief is meant to run silently before substantive work and auto-skips tiny/status-only prompts unless forced. Briefs carry applicable rules, confidence, boundaries, citations, do-not-assume items, and a knowledge map. The map is there to navigate supporting context, not to turn Athena into a broad company brain.
Outcomes — agents register drafts, code, plans, and action proposals; Claude Code and Codex hooks can also auto-register the latest assistant output against the most recent matching brief when the user approves or corrects it. When a sensor later sees what the human actually did with the output, the outcome records itself; explicit skipped, escalated, abandoned, no-output, and unknown closures keep the loop from rotting. Rule confidence adjusts, and athena outcomes plus athena metrics show the full funnel, rule fires, accepted/corrected outputs, edit distance, trace quality, closure rate, and correction-rate trend.
Connectors — Notion, Slack, and Google (Drive + Gmail) sync into cited, searchable sources that feed briefs.
Onboarding — one command (npx useathena onboard) sets up the store, model provider, hooks, MCP, background service, and extension.

Known beta caveats:

Google consent-screen verification (today: unverified-app interstitial, ~100-user cap).
Chrome Web Store listing (today: load-unpacked from the installed package).
Browser smoke tests and extension selector hardening.
Embeddings or semantic search, if lexical search stops being enough.
Team workspaces and rule-promotion flows.

Roadmap

The strategy in one line: an open, local-first decision-trace layer that earns trust by design, serves reusable tacit knowledge across the agent ecosystem, and later adds a hosted intelligence layer that has to prove on evals that it beats a frontier model running locally.

1. Prove the loop (current phase) Make the north-star metric measurable end-to-end, remove install friction (Chrome Web Store, OAuth verification), and drive daily use with public-beta testers. This phase is done when the correction rate visibly declines for real users. Everything else waits on that evidence.

2. Open Athena Harness and the baseline engine Once the license is chosen, commit the boundary publicly: Athena Harness — capture surfaces, event schema, local store, MCP tools, connectors, outcome tracking, and the baseline engine — stays open. Version the schema as a spec others can adopt. Start a private eval corpus of graded real-world traces (right rule, overgeneralized, one-off), and add opt-in, content-free lifecycle telemetry (rule inferred → served → upheld/overridden) that anyone can verify in the open code.

3. Distribution, then teams Public launch. Expand capture to wherever agents live — more editors, CLIs, and sites. Ship the first multiplayer features on the open engine: shared workspaces and rule promotion (a pull request, but for judgment rules). Bring on design partners, and build enterprise plumbing — SSO, audit logs, admin controls, managed deployment — when customers demand it, not before.

4. Athena Brain The hosted intelligence layer: stronger extraction validators, replay at scale, and confidence calibration from real outcomes, benchmarked against the open baseline on the eval corpus — launched only when the margin is one users can feel. Individuals get a free hosted quota; contributing anonymized traces is optional and never required. No training on private content by default; paid tiers add contractual privacy, no-training guarantees, admin controls, and managed deployment. A specialist extraction model comes last, trained only on consented traces.

Standing rules along the way:

Nothing ever moves from open to closed.
The eval corpus compounds continuously — real traces get graded as they come in.
Every product decision tests against one number: correction rate per served brief.

The MVP loop

The MVP is not a broad knowledge base or a generic context graph. It is a working closed loop for reusable tacit knowledge:

Capture high-signal decision traces from real workflows.
- CLI and MCP capture for manual or agent-recorded events.
- Claude Code hook for explicit remember: / athena: notes and redirect-shaped prompts.
- Chrome extension sensor for LinkedIn, Gmail, and any opted-in site: paste a generated draft, edit it, send it, and Athena captures the before/after as a correction or approval.
Store evidence locally and safely.
- SQLite is the source of truth.
- Raw decision traces are private by default.
- Every learned rule cites the evidence that produced it.
- Agents can propose evidence and outcomes, but cannot directly mutate durable rules.
Infer transferable tacit rules — and explicit facts — from decision-trace evidence.
- The hypothesis engine clusters examples by domain.
- Weak traces such as generic clarification questions, hook test notes, and missing-contrast captures are kept as evidence but filtered out of learning.
- It infers cues, expectancies, goals, rules, and boundary conditions.
- Rules capture reasoning patterns that can apply to new, related situations, not "when X happened, repeat Y" automation.
- High-value corrections can surface one micro-probe such as "When should Athena not apply this?" or "What told you this was off?" Capture surfaces attach the answer to the same event when they can ask before recording. Probe policy has cooldowns; Athena should not ask on ordinary prompts or weak traces.
- It replay-validates against held-out examples before a rule can be served.
- It also extracts entity facts ("Acme's renewal is in September") — deduped, cited, and staleness-tracked; restatements confirm instead of duplicating.
- Inference is model-backed, not a fake deterministic string matcher — and it triggers itself: once enough new evidence accumulates, a background athena learn runs automatically (disable with ATHENA_AUTO_LEARN=off).
Validate before serving; review is optional.
- Replay-validated rules are served to agents immediately, flagged with caveats.
- Rules graduate to confidently-served on their own through repeated upheld outcomes — no human in the loop required.
- athena review is an accelerator and audit surface: approving promotes a rule instantly, rejecting retires it (kept for audit). It is never a gate.
Brief agents before they act on new work.
- MCP exposes exactly four tools: athena_brief, athena_open, athena_record, and athena_search.
- Agents should call athena_brief silently before substantive tasks, not every tiny prompt. In auto mode, low-value/status-only prompts are skipped without persisting a brief; mode=force exists for deliberate inspection.
- A brief returns applicable rules, confidence, boundaries, citations, entity facts, do-not-assume items, open questions, and a readiness verdict.
- The brief tells an agent which past judgment patterns match the current task, where the match is weak, and when safer boundaries require inspection or a human check.
- It also carries a knowledge map — what else athena knows and the query that retrieves each slice — so agents pull what they need instead of being flooded.
Learn from outcomes — mostly without anyone reporting.
- Agents register the artifact they produced (athena_record type=output); Claude Code and Codex hooks also try to auto-register the latest assistant answer against the latest matching brief when the next user prompt is an approval or correction. When a sensor captures what the human actually did, the outcome records itself: approval → uncorrected, edit → corrected with the diff as a counterexample. Matching is conservative; a false match would corrupt trust.
- Short approval sign-offs in Claude Code/Codex ("lgtm", "ship it") count as observed approvals; unresolved drafts expire to unknown after the match window.
- Explicit reporting still works everywhere sensors can't see.
- Rules gain trust when upheld and lose trust when overridden.
- Repeated counterexamples make rules stale.

Install

Requires Node 22.5+ (Athena uses the built-in node:sqlite module).

npx useathena onboard          # interactive
npx useathena onboard --yes    # accept every default

Working from a branch, or without registry access? The same flow runs straight from GitHub for anyone with repo access:

npx github:ErikLit005/athena onboard

Onboarding does everything: creates the local store, detects which model providers you have (claude CLI, codex CLI, Anthropic or OpenAI API keys, a running Ollama), verifies the one you pick with a real inference call, installs itself globally so hooks survive the npx cache, wires up Claude Code and Codex when present (sensor hook + MCP registration), installs the browser-sensor API as a login service, offers to connect a knowledge source (Notion — so briefs have real facts on day one), and prints the extension setup with its token. --yes takes the first detected provider and says yes to all of it (knowledge sources need a token paste, so non-interactive runs print the athena connect pointer instead); opt out per-piece with --no-hook, --no-service, --no-connect, --skip-test, or --model <spec>.

Already installed? athena update pulls the latest published version, restarts the background serve service on the new code, and prints the load-unpacked extension reload path while the Chrome Web Store listing is not live.

Model providers are spec strings, set once by setup (or per-run via ATHENA_MODEL). The built-in default is cli:claude:sonnet.

cli:claude[:model]          Claude subscription via the claude CLI — no API key
cli:codex[:model]           ChatGPT subscription via the codex CLI — no API key
anthropic[:model]           Anthropic API (ANTHROPIC_API_KEY)
openai:<model>              OpenAI API (OPENAI_API_KEY)
openai:<model>@<baseUrl>    any OpenAI-compatible server (LM Studio, vLLM, …)
ollama:<model>              local Ollama

Architecture

src/core       Types, ids, refs, fixtures
src/store      SQLite store and enforced invariants
src/capture    SensorEvent -> immutable JudgmentInstance ingestion
src/engine     Model-backed hypothesis inference and replay validation
src/serve      Agent brief compilation and outcome recording
src/eval       Golden scenarios, harness, and judges
src/mcp        MCP tools for agents
src/cli        CLI command logic and interactive review loop
src/api        Localhost API for browser sensors
src/model      ModelClient interface and CLI-backed providers
src/connect    Knowledge connectors: provider clients, token store, sync runner
src/sensors    Local sensors such as the Claude Code hook
apps/chrome-extension
               MV3 sensor for LinkedIn, Gmail, and opted-in sites
docs           Schema and research notes

Important contracts:

docs/schema.md is the product/domain contract.
docs/research/2026-06-11-tacit-capture-research.md contains the research and competitive landscape behind the design.
apps/chrome-extension/README.md documents the browser sensor capture policy.

Local development

git clone https://github.com/ErikLit005/athena.git && cd athena
npm install
npm run verify      # typecheck + tests

Useful commands:

npm run typecheck
npm test
npm run eval       # model-backed; requires ATHENA_MODEL/CLI auth
npm run build       # compile to dist/ (what a global install runs)
npm run athena -- status

The default store path is ~/.athena/athena.db. Override it with ATHENA_DB for tests or isolated local experiments. Config (model choice, optional API keys) lives in ~/.athena/config.json.

Running Athena locally

Initialize the local store and print MCP setup instructions (or run athena setup for the guided version):

npm run athena -- init

Capture an example correction:

npm run athena -- capture correction \
  --summary "edited a LinkedIn connection note to be warmer and less salesy" \
  --domain linkedin.outreach \
  --before "Would love to connect and discuss how we can help your team." \
  --after "Saw your post on GTM hiring. Would be glad to connect."

Learn candidate rules (and entity facts) from captured evidence — this also runs automatically in the background once enough new evidence accumulates:

npm run athena -- learn --domain linkedin.outreach
npm run athena -- facts

Review inferred rules:

npm run athena -- review

Ask for the kind of brief an agent should get before acting:

npm run athena -- brief "draft a LinkedIn connection note to a GTM leader" --domain linkedin.outreach

Start the localhost API for the Chrome extension (or install it as a login service with serve install):

npm run athena -- serve

Check the browser sensor setup and reload path:

npm run athena -- extension help

Export everything athena has learned as one JSON bundle (--redact strips raw text):

npm run athena -- export

Design principles

Instances are immutable evidence; hypotheses are revisable views over evidence.
Behavior is ground truth; rationale is labeled hypothesis.
Transfer beats replay: rules should help agents handle new, structurally similar work, not memorize exact prior outputs.
Everything cites.
Privacy fields are mandatory on capture-derived records.
SQLite is the source of truth; markdown is a projection.
The engine is model-backed and model-agnostic.
The agent surface stays small and auditable.
Rules earn trust autonomously — replay validation to be served, upheld outcomes to be served confidently. Human review accelerates and audits; it never gates.
Human interaction concentrates at critical moments: briefs say ask_human when athena's knowledge is thin or contradictory, and counterexamples surface for audit.

Where to help

Good near-term contribution areas:

Hardening the Chrome extension against LinkedIn and Gmail DOM changes.
Hardening the generic adapter and site-config loop for opted-in sites.
Adding browser smoke tests for the extension capture flow.
Publishing the Chrome extension to the Web Store.
Expanding eval scenarios from real corrections, especially cases with subtle boundaries or counterexamples.

Before changing core behavior, read docs/schema.md. Before changing product direction, read the research notes and keep the north-star metric in mind: fewer repeated corrections per served brief.

Broader vision

The first wedge is personal: Athena should make one person's agents learn the transferable decision patterns behind their standards, preferences, and judgment over time.

The broader vision is an evidence-backed tacit knowledge layer for teams and organizations: personal tacit rules stay private by default, vetted rules get promoted to a workspace, and experts spend less time repeating corrections and more time reviewing high-leverage rules. Organizations accumulate operational judgment by preserving decision traces and validating where they transfer — not by pretending every important rule can be written down in advance.

If Athena works, an agent should not merely remember what happened or retrieve more context. It should apply how the expert decided to new work.