pando-proxy
v0.1.33
By pando (https://getpando.ai) under the MIT License. Run Codex through a local OpenAI Responses-compatible memory proxy with one npx command. The proxy enforces a one-tier exact-piece memory sieve across rounds and keeps an archive-backed recall path for
pando-proxy
pando-proxy is a local Codex wrapper that rewrites each Responses request through a strict
active-task working set.
The important invariant is simple:
- active memory is one active task plus exact kept pieces
- the next forwarded prompt contains that exact kept set
- exact duplicate content is shown once, with duplicate source markers on the canonical kept piece
- dropped material leaves active memory, but raw round sources stay in the per-session archive
- if older exact material is needed later, the agent can use recall({offset,limit}) against the archive, up to 3 times in that round, with no per-call item cap
There are no memory groups, summaries as source material, embeddings, projection layers, hidden
omitted-memory tiers, or retained-state tags. Exact answers must come from visible pieces or
archive recall.
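The duplicate rule above can be sketched as follows. This is a minimal illustration, not the proxy's implementation: the `Piece` shape and the `collapseDuplicates` name are hypothetical, and the map is keyed on the exact content string as a stand-in for a content hash.

```typescript
// Hypothetical piece shape, for illustration only.
interface Piece {
  id: string;
  sourceId: string;
  content: string;
  duplicateSourceIds: string[];
}

// Keep the first piece seen for each exact content string and record the
// source ids of later duplicates on that canonical piece.
function collapseDuplicates(pieces: Piece[]): Piece[] {
  const canonicalByContent = new Map<string, Piece>();
  for (const piece of pieces) {
    const canonical = canonicalByContent.get(piece.content);
    if (canonical === undefined) {
      // Copy so the caller's input is not mutated.
      canonicalByContent.set(piece.content, {
        ...piece,
        duplicateSourceIds: piece.duplicateSourceIds.slice(),
      });
    } else {
      canonical.duplicateSourceIds.push(piece.sourceId);
    }
  }
  // Map iteration preserves insertion order, so first-seen order is kept.
  return Array.from(canonicalByContent.values());
}
```

Only byte-identical content collapses here; near-duplicates stay separate pieces, which matches the "exact" wording of the invariant.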
Validation Policy
For the active-memory redesign in this repository:
- do not treat unit tests as product proof
- validate with live E2E runs against the real backend
- inspect logs and persisted state as the primary verification method
Unit tests are still useful for local regressions in the state transition code.
Current Design
The runtime is built around:
- activeTask: the single current executable task, its title, and its active piece ids
- archivedTasks: previous titled task bundles, revivable by negative relative index
- pieces: exact retained user messages plus assistant reasoning/talk/tool-result/tool-call chunks
- processedSourceIds: source ids already seen and archived
- archive: raw original sources kept for explicit recovery, not normal prompt memory
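As a rough picture of how these fields might fit together, here is a sketch in TypeScript. Only the five top-level field names come from the list above; every nested shape, the `PieceKind` values, and `emptySessionState` are assumptions made for illustration.

```typescript
// Assumed piece kinds, derived loosely from the prose description of pieces.
type PieceKind = "user" | "reasoning" | "talk" | "tool_result" | "tool_call";

interface ExactPiece {
  id: string;
  kind: PieceKind;
  content: string; // exact text, never a summary
}

interface TaskBundle {
  title: string;
  pieceIds: string[];
}

interface ArchivedSource {
  sourceId: string;
  raw: string; // raw original source, reachable only through recall
}

interface SessionState {
  activeTask: TaskBundle;
  archivedTasks: TaskBundle[];
  pieces: Record<string, ExactPiece>;
  processedSourceIds: string[];
  archive: ArchivedSource[];
}

// Hypothetical helper: a fresh session with one empty active task.
function emptySessionState(title: string): SessionState {
  return {
    activeTask: { title, pieceIds: [] },
    archivedTasks: [],
    pieces: {},
    processedSourceIds: [],
    archive: [],
  };
}
```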
Normal end-of-round flow:
- collect new round sources, including user input, assistant talk/reasoning, tool calls, and tool results
- run source_chunk_batch for non-user sources and task_route in parallel
  - user messages are never split; each user message is one atomic whole piece
  - source_chunk_batch always uses the configured full/overflow structured model, currently gpt-5.4, with low reasoning effort when the model supports it and priority service tier
  - chunking is biased toward coherent multi-chunk output because bad splits fail closed and later pruning can keep too much; clearly single-piece sources still remain whole
  - task_route and piece_drop_batch use the configured small structured model when the request fits, currently gpt-5.4-mini, and overflow only when needed
  - if chunking fails, returns malformed output, omits a requested source, or is too large for the structured window, that source is kept whole
  - valid model-selected text chunks must be exact and lossless: returned chunks joined together must equal the raw source text exactly
- materialize exact new pieces
- apply the task route
- for same_task and revive_task, collapse exact duplicate new pieces by content hash while recording duplicate source markers
- build the routed candidate active set
- run piece_drop_batch over full-payload batches sized under the prune budget
- keep everything not dropped with an accepted concrete reason, including the local sanity guard that rejects non-structural drops if they would leave only assistant output after non-assistant evidence existed
- collapse surviving exact duplicates; on new_task, old/new duplicates are intentionally collapsed only after prune and prefer the new piece as canonical
- persist the active task title, archived task bundles, and surviving exact pieces
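The lossless chunking rule in the flow above lends itself to a small check. The sketch below assumes a hypothetical `acceptChunks` helper; it is not taken from src/chunking.ts, and it shows only the "joined chunks must equal the raw source, otherwise keep the source whole" behavior.

```typescript
// Accept model-proposed chunks only when they reproduce the raw source
// exactly; in every other case fall back to one whole piece.
function acceptChunks(rawSource: string, proposed: string[] | undefined): string[] {
  // Missing or empty output counts as a chunking failure.
  if (proposed === undefined || proposed.length === 0) {
    return [rawSource];
  }
  // Lossless check: concatenating the chunks must give back the source.
  if (proposed.join("") !== rawSource) {
    return [rawSource];
  }
  return proposed;
}
```

For example, `acceptChunks("abc def", ["abc ", "def"])` keeps both chunks, while `acceptChunks("abc def", ["abc", "def"])` drops the split because the space was lost.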
Task switching is structural. On new_task, the previous active task identity is always archived as
a complete bundle and a fresh active task is created. The old task's exact pieces are still
evaluated alongside the new round pieces; any old piece that still belongs to the new task is copied
into the new active working set. The old task itself is never kept active.
Task routing sees the current task title, the full exact active pieces, full new user messages, and
only the five newest archived task cards by default. Archived cards are shown newest-first with
relative indexes -1 through -5; the route model can request the next linear page when it needs
older task titles.
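The archived-card paging can be pictured like this. The sketch makes two assumptions that the text does not confirm: archived task titles are stored oldest-first, and later pages continue the relative index at -6, -7, and so on. `TaskCard` and `archivedCardPage` are hypothetical names.

```typescript
interface TaskCard {
  relativeIndex: number; // -1 is the most recently archived task
  title: string;
}

// Return one page of archived task cards, newest first.
// Assumes archivedTitles is ordered oldest-first.
function archivedCardPage(archivedTitles: string[], page = 0, pageSize = 5): TaskCard[] {
  const newestFirst = archivedTitles.slice().reverse();
  const start = page * pageSize;
  return newestFirst.slice(start, start + pageSize).map((title, i) => ({
    relativeIndex: -(start + i + 1),
    title,
  }));
}
```

With seven archived tasks, page 0 yields indexes -1 through -5 and page 1 yields the remaining two cards.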
Normal request flow:
- load session state
- materialize active-piece payloads for rendering
- inject one synthetic <pando_task_memory> developer block
- forward to upstream
- if the model explicitly calls recall, resolve archived sources locally
- finalize memory after the upstream round completes
Active Memory vs Archive
These two surfaces are intentionally separate.
Active memory:
- one active task
- exact surviving pieces only
- always shown in the next rewritten prompt
- what survives is exactly what crosses the prompt boundary
Archive:
- raw original round sources on disk
- not part of normal prompt construction
- only reachable through explicit recall
- call-count bounded to at most 3 recall calls per round, with no per-call item cap
The archive is a recovery surface, not a second active-memory tier.
recall
The proxy may inject one local function tool:
- name: recall
- arguments: { offset, limit }
- max uses per round: 3
- per-call item cap: none
Guidance injected to the model:
- prefer answering from active memory first
- use recall only when exact needed material is not visible in active memory
- when using it, err on requesting more archive coverage rather than too little
The tool result explicitly marks returned content as archive content and includes:
- requestedOffset
- requestedLimit
- returnedCount
- remainingArchivedSourceCount
- exact archived source payloads
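A local resolver for this tool might look roughly like the sketch below. The four result field names come from the list above; everything else is an assumption: the `sources` field name, treating `offset` as a plain index into the archive array, counting `remainingArchivedSourceCount` as the sources after the returned window, and throwing on a fourth call.

```typescript
interface RecallArgs {
  offset: number;
  limit: number;
}

interface RecallResult {
  requestedOffset: number;
  requestedLimit: number;
  returnedCount: number;
  remainingArchivedSourceCount: number;
  sources: string[]; // hypothetical field name for the exact archived payloads
}

const MAX_RECALLS_PER_ROUND = 3;

// Build a recall function for one round; the call counter resets per round.
function makeRoundRecall(archive: string[]): (args: RecallArgs) => RecallResult {
  let used = 0;
  return (args: RecallArgs): RecallResult => {
    if (used >= MAX_RECALLS_PER_ROUND) {
      throw new Error("recall limit reached for this round");
    }
    used += 1;
    const start = Math.max(0, args.offset);
    // No per-call item cap: the window is bounded only by the archive itself.
    const sources = archive.slice(start, start + Math.max(0, args.limit));
    return {
      requestedOffset: args.offset,
      requestedLimit: args.limit,
      returnedCount: sources.length,
      remainingArchivedSourceCount: Math.max(0, archive.length - (start + sources.length)),
      sources,
    };
  };
}
```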
Quickstart
Requires:
- Deno
- Codex on PATH
- Codex already logged in
Typical use:
deno run --allow-net --allow-env --allow-read --allow-write --allow-run \
src/main.ts \
exec \
--sandbox read-only \
"inspect this repo"

Resume with the exact thread id printed by the wrapper:
deno run --allow-net --allow-env --allow-read --allow-write --allow-run \
src/main.ts \
exec resume 019dc204-22fb-7c50-95ad-2f2508254945 \
--sandbox read-only \
"continue"

Prefer exact thread ids; treat --last as a fallback only.
Auth
Live calls resolve auth in this order:
- OPENAI_API_KEY
- ~/.codex/auth.json via tokens.access_token
If Codex is already logged in, that is usually enough.
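The lookup order can be expressed as a small pure function. This is a sketch under assumptions: `resolveAuthToken` and `CodexAuthFile` are hypothetical names, the caller is assumed to have already read the environment and parsed ~/.codex/auth.json, and only the `tokens.access_token` path is taken from the text.

```typescript
// Only the tokens.access_token path is documented; other fields are ignored.
interface CodexAuthFile {
  tokens?: { access_token?: string };
}

// Resolve a bearer token: the environment variable wins, then the auth file.
function resolveAuthToken(
  env: Record<string, string | undefined>,
  authFile: CodexAuthFile | undefined,
): string | undefined {
  const fromEnv = env["OPENAI_API_KEY"];
  if (fromEnv) {
    return fromEnv;
  }
  return authFile?.tokens?.access_token;
}
```

Keeping file and environment access outside the function makes the precedence easy to test without touching real credentials.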
Live E2E Workflow
For real validation, use one fixed state dir and one fixed log file per session:
deno run --allow-net --allow-env --allow-read --allow-write --allow-run \
src/main.ts \
--proxy-log-file /tmp/pando-test.jsonl \
--proxy-state-dir /tmp/pando-test-state \
exec \
--sandbox read-only \
-o /tmp/round1.txt \
"round 1 prompt"

Then resume the same exact thread id:
deno run --allow-net --allow-env --allow-read --allow-write --allow-run \
src/main.ts \
--proxy-log-file /tmp/pando-test.jsonl \
--proxy-state-dir /tmp/pando-test-state \
exec resume 019dc204-22fb-7c50-95ad-2f2508254945 \
--sandbox read-only \
-o /tmp/round2.txt \
"round 2 prompt"

Inspect after each run:
- incoming_request with the original Responses body
- materialized_memory_loaded
- rewritten_context with the original and rewritten bodies
- upstream_loop_iteration
- upstream_request
- upstream_response
- structured_model_request
- structured_model_response
- memory_round_chunked
- memory_round_decision
  - raw prune decisions, all candidate ids, accepted drop ids, sanity-rejected drop ids, kept/dropped old/new ids, and duplicate ids
- memory_round_updated
- memory_update_inputs
- memory_state_saved
- archive_recall
- structured_model_usage
- structured_model_skipped
- round_complete
--proxy-log is the one switch for full proxy data-flow logging. It writes every main-model
request/response, internal structured-model request/response, memory materialization, chunking
input/output, prune decision, archive recall payload, and round summary to JSONL. Use
--proxy-log-file <path> when you want a stable path. Direct serve mode has equivalent --log
and --log-file flags. Authorization and token fields are redacted, but user prompts, tool outputs,
model outputs, and retained memory payloads are intentionally present for debugging.
Wrapper stderr also prints at exit:
- estimated input tokens without the proxy
- billed all-in tokens with the proxy
- proxy overhead tokens from internal manager calls
Repo Map
- ACTIVE_MEMORY_REDESIGN_PLAN.md — implemented target
- CONTEXT_MEMORY_DESIGN.md — one-task sieve design
- DESIGN_PRINCIPLES.md — architecture rules
- WORKING_SET_PRUNE_DESIGN.md — full-payload prune design
- MEMORY_OPERATIONS.md — round-by-round operations
- REFERENCE.md — concrete runtime types and contracts
- LIVE_E2E.md — live validation loop
- MEMORY_DIAGRAMS.md — simplified diagrams
- npm-publishing.md — package checks and npm release flow
Key runtime files:
- src/memory_state.ts
- src/memory_pipeline.ts
- src/working_set_manager.ts
- src/chunking.ts
- src/prompt_view.ts
- src/upstream.ts
- src/store.ts
- src/server.ts
Benchmarks
Replay and benchmark material remains in this repository, but the benchmark docs are now limited to current active-task working-set runtime measurements.
Start with:
- QUICK_BENCHMARKS.md for the short current-results table
- BENCHMARKS.md for methodology, artifacts, and interpretation
Latest completed benchmark snapshots include:
- real-LLM replay on 10 public SWE-bench Verified devstral trajectories: average per-turn context 8,115.7 -> 4,378.5 approximate input tokens (46.0% lower), with 10/10 replay jobs completed and 0 manager errors
- Metabase #42434 long-session task run: proxy-estimated forwarded context total 156,193,363 -> 29,797,331 approximate input tokens (80.9% lower), with 7/7 Codex exits and 7/7 clean diffs in both baseline and proxy conditions
- deterministic full-corpus public replay on 345 SWE-bench Verified devstral trajectories: average context 15,199 -> 1,093 approximate input tokens (92.8% lower)
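The percentages above follow from the before and after token counts. A minimal check, using a hypothetical `percentLower` helper that rounds to one decimal place:

```typescript
// Percentage reduction from `before` to `after`, rounded to one decimal.
function percentLower(before: number, after: number): number {
  return Math.round((1 - after / before) * 1000) / 10;
}

console.log(percentLower(8115.7, 4378.5));        // real-LLM replay
console.log(percentLower(156193363, 29797331));   // Metabase long-session run
console.log(percentLower(15199, 1093));           // deterministic full-corpus replay
```

These reproduce the reported 46.0%, 80.9%, and 92.8% reductions.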
The defensible current claim is context-window reduction. Task-level correctness evidence currently comes from the Metabase run's exit status, clean diff checks, and oracle file overlap; it is not a substitute for a full repository test suite.
