casestudies — system map (TOON)
Format: Token-Oriented Object Notation.
Audience: LLMs. Describes ONLY code currently in this repo. No roadmap, no aspirations.
Paths are repo-relative. "pkg" = package name. "cli" = installed binary name.
meta: pkg_name: casestudies cli_name: casestudies version: 0.1.0 description: Intention-driven PDF reading guide generator module_type: esm repo_root_abs: /home/barnyak/work/spectra node_required: ">=20" python_required: ">=3.11" package_manager: pnpm workspaces[1]: . nvmrc: 20
purpose: one_line: Ingests one PDF + a stated reader intention, runs a plan→read→synthesize LLM pipeline, and emits a structured "reading guide" (markdown + JSON). paradigm: Local POC. Single-user CLI. State in local SQLite + runs/ dir. No server, no queue, no auth. unit_of_work: run (row in runs table, uuid + integer id). One run = one intention over one PDF.
external_binaries_required[4]{name,purpose,checked_by}:
  pdftotext,TSV words+bbox and plain reading-order text,scripts/bootstrap.sh
  pdfinfo,Metadata + per-page box sizes,scripts/bootstrap.sh
  pdfimages,Image manifest via -list,scripts/bootstrap.sh
  qpdf,PDF outline (bookmarks) via --json,scripts/bootstrap.sh
node_deps[6]{name,version,role}:
  @anthropic-ai/sdk,^0.37.0,Claude API client
  better-sqlite3,^11.3.0,Sync SQLite driver for state.sqlite
  commander,^12.1.0,CLI command framework
  pino,^9.5.0,Structured JSON logger
  pino-pretty,^11.3.0,Dev-mode pretty log transport
  zod,^3.23.8,Schema validation of LLM JSON outputs and config
python_deps[3]{name,version,role}:
  fastapi,==0.115.5,HTTP extraction service
  uvicorn[standard],==0.32.1,ASGI server that hosts extract_service
  pydantic,==2.10.3,Request body validation in FastAPI endpoints
env_vars[5]{name,purpose,default,where_read}:
  ANTHROPIC_API_KEY,Auth for the Anthropic SDK,unset (required),src/llm/client.js
  CASESTUDIES_STATE_PATH,Override path to state.sqlite,runs/state.sqlite,src/db/index.js
  CASESTUDIES_RUNS_DIR,Override the runs/ dir (state.sqlite + per-run meta),runs/,src/db/index.js
  CASESTUDIES_LOG_LEVEL,pino level,info,src/util/logger.js
  CASESTUDIES_LOG_PRETTY,Set to 1 to use the pino-pretty transport,unset,src/util/logger.js
env_loading: mechanism: bin/casestudies.js probes $CWD/.env then <repo_root>/.env. Uses process.loadEnvFile when available (Node >=20.12), else a built-in tiny parser (regex ^KEY=VALUE with optional ""/'' quoting). Only the first file found is loaded. Does NOT overwrite already-set env vars. note: No dotenv dependency.
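env_loading_sketch: a minimal JS sketch of the fallback path above. The repo-root lookup and helper names are illustrative; only the behavior (first file found wins, existing env vars never overwritten) comes from this section:

```js
import fs from 'node:fs';
import path from 'node:path';

// Illustrative fallback for Node builds without process.loadEnvFile.
function loadEnvFallback(file) {
  const lineRe = /^([A-Za-z_][A-Za-z0-9_]*)=(.*)$/; // regex ^KEY=VALUE
  for (const raw of fs.readFileSync(file, 'utf8').split('\n')) {
    const m = lineRe.exec(raw.trim());
    if (!m) continue;
    // Strip one pair of optional double or single quotes around the value.
    const value = m[2].replace(/^"(.*)"$/, '$1').replace(/^'(.*)'$/, '$1');
    if (!(m[1] in process.env)) process.env[m[1]] = value; // never overwrite
  }
}

const repoRoot = path.resolve('.'); // stand-in for the real repo-root lookup
const candidate = [path.join(process.cwd(), '.env'), path.join(repoRoot, '.env')]
  .find((p) => fs.existsSync(p)); // only the first found file is loaded
if (candidate) {
  if (typeof process.loadEnvFile === 'function') process.loadEnvFile(candidate);
  else loadEnvFallback(candidate);
}
```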
────────────────────────────────────────────────────────────────────────────
ARCHITECTURE
────────────────────────────────────────────────────────────────────────────
architecture:
  top_level_processes[2]{name,language,role,transport}:
    casestudies-cli,Node.js,CLI entry + pipeline orchestrator + DB writer + LLM client,n/a
    extract_service,Python (FastAPI/uvicorn),PDF extraction worker (pdftotext/pdfinfo/pdfimages/qpdf),HTTP on 127.0.0.1:$port
process_lifecycle:
- When a pipeline run begins, the Node process spawns the Python extract_service as a child (uvicorn) bound to 127.0.0.1 on a configurable port (default 8765).
- Node polls GET /health every 100ms until healthy or startup_timeout (default 10000ms).
- On SIGINT/SIGTERM, Node calls onShutdown handlers, which kill the child (SIGTERM → SIGKILL after 2s) and mark the run status 'paused'.
- The pipeline always calls extract_client.unload({doc_id}) and shutdownExtract() in finally blocks.
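lifecycle_sketch: a JS sketch of the spawn + health-poll handshake above (real code: src/extraction/lifecycle.js; function name and error text are illustrative):

```js
import { spawn } from 'node:child_process';

async function startExtractServiceSketch({ port = 8765, startupTimeoutMs = 10000 } = {}) {
  const child = spawn(
    'python/.venv/bin/python',
    ['-m', 'uvicorn', 'python.extract_service:app',
     '--host', '127.0.0.1', '--port', String(port), '--log-level', 'warning'],
    { env: { ...process.env, PYTHONPATH: process.cwd() } }, // PYTHONPATH=$REPO_ROOT
  );
  const deadline = Date.now() + startupTimeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`http://127.0.0.1:${port}/health`);
      if (res.ok) return child; // healthy: hand the child back to the orchestrator
    } catch { /* not listening yet */ }
    await new Promise((r) => setTimeout(r, 100)); // poll every 100ms
  }
  child.kill('SIGTERM'); // the real code escalates to SIGKILL after 2s
  throw new Error(`extract_service not healthy within ${startupTimeoutMs}ms`);
}
```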
data_stores[2]{name,location,purpose}:
  sqlite,runs/state.sqlite (WAL mode),All durable state — runs/documents/pages/segments/llm_calls/plan_revisions/synthesis
  filesystem,runs/run-<id>/,Per-run metadata directory. Currently only contains source.txt (absolute PDF path). Debug prompt dumps go to runs/debug/run-<id>/ when --debug-prompts is set.
run_state_machine:
  states[8]: created,extracting,planning,reading,synthesizing,completed,failed,paused
  transitions[9]{from,to,trigger}:
    created,extracting,run start (orchestrator begins)
    extracting,planning,after python service load + docmap persist
    planning,reading,after initial planner call persists segments
    reading,synthesizing,when no pending segments remain and not shutting down
    synthesizing,completed,after synthesizer row written + completed_at stamped
    any,paused,SIGINT/SIGTERM shutdown handler
    any,failed,exception caught in runPipeline try/catch (error_json populated)
    failed,reading,'run resume' (allowRetry:true) resets failed segments to pending (on resume)
    paused/failed/reading/planning/etc,extracting,'run resume' re-enters from the top of runPipeline (python service is re-loaded from the persisted docmap)
────────────────────────────────────────────────────────────────────────────
PIPELINE (end-to-end, exactly as implemented)
────────────────────────────────────────────────────────────────────────────
pipeline_entry:
  function: runPipeline({ runId, sourcePath, allowRetry?, debugPrompts? })
  file: src/pipeline/orchestrator.js
  preconditions:
  - run row exists with a status != 'completed' (completed throws)
  - status != 'failed' unless allowRetry=true
  on_allow_retry: UPDATE segments SET status='pending' WHERE run_id=? AND status='failed'; clear runs.error_json.
  debug_prompts: When true, sets the reader request debug dump dir to runs/debug/run-<id>/; reader.js calls setDebugDir(); call.js writes reader-<segment_id>-seq<N>.json per reader call.
pipeline_stages[4]{idx,name,file,what_it_does}:
  1,extraction,src/extraction/lifecycle.js + src/pipeline/orchestrator.js,Boot python extract_service → POST /load {pdf_path,doc_id:'doc1',…} → receives {docmap,pages} → inserts one documents row + one pages row per page (page_json is the full per-page json from the extractor).
  2,planner,src/pipeline/planner.js,Single LLM call that segments the PDF into s01…sNN. Persists rows to segments with status='pending'. Reuses existing segments if any already exist (resume path).
  3,reader loop,src/pipeline/reader.js + orchestrator.js,Loops: pick the next status='pending' segment (ORDER BY idx LIMIT 1). Fetch text for its page range via GET /text. Call the LLM with tool_use forcing record_segment_notes. Persist notes/claims/etc and mark status='completed'. If the reader emits plan_feedback, a revision planner runs (subject to budgets), superseding the remaining pending segments and appending replacements.
  4,synthesizer,src/pipeline/synthesizer.js,Single LLM call over ALL completed segments' notes_md in idx order. Upserts one synthesis row per run.
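orchestrator_sketch: a control-flow skeleton of the four stages above, including the finally-block cleanup this doc guarantees. Stage functions are injected because this illustrates the flow only, not the real implementation in src/pipeline/orchestrator.js:

```js
async function runPipelineSketch(runId, stages) {
  const { setStatus, extract, plan, read, synthesize, recordFailure, cleanup } = stages;
  try {
    setStatus(runId, 'extracting');
    await extract(runId);       // POST /load, persist documents + pages rows
    setStatus(runId, 'planning');
    await plan(runId);          // reuses existing segments on the resume path
    setStatus(runId, 'reading');
    await read(runId);          // drains status='pending' segments in idx order
    setStatus(runId, 'synthesizing');
    await synthesize(runId);
    setStatus(runId, 'completed');
  } catch (err) {
    recordFailure(runId, err);  // populates runs.error_json
    throw err;
  } finally {
    await cleanup(runId);       // extract_client.unload({doc_id}) + shutdownExtract()
  }
}
```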
────────────────────────────────────────────────────────────────────────────
CLI (commander.js tree)
────────────────────────────────────────────────────────────────────────────
cli:
  bin: bin/casestudies.js
  builder: buildProgram() in src/cli/index.js
  id_resolution:
    file: src/cli/resolve.js
    rule: A run id arg matching /^\d+$/ is treated as the integer primary key; otherwise it is treated as a uuid prefix. Prefix match uses uuid LIKE 'prefix%' with LIMIT 2 — ambiguous prefixes throw.
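resolve_sketch: a JS sketch of the rule above using the better-sqlite3 API (real code: src/cli/resolve.js; error messages are illustrative):

```js
function resolveRunSketch(db, arg) {
  if (/^\d+$/.test(arg)) {
    // Numeric arg: integer primary key.
    return db.prepare('SELECT * FROM runs WHERE id = ?').get(Number(arg));
  }
  // Otherwise a uuid prefix; LIMIT 2 detects ambiguity with a single query.
  const rows = db.prepare('SELECT * FROM runs WHERE uuid LIKE ? LIMIT 2').all(`${arg}%`);
  if (rows.length === 0) throw new Error(`no run matches '${arg}'`);
  if (rows.length > 1) throw new Error(`ambiguous run prefix '${arg}'`);
  return rows[0];
}
```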
cli_commands[22]{path,args,options,action_summary,file}:
  run new,<input>,--intention <text> (required) --config <json_file> --name <name>,Resolve PDF from <input> (file or dir with exactly one .pdf). Insert runs row (status='created'). Merge CLI --config JSON over DEFAULT_CONFIG. Write runs/run-<id>/source.txt.,src/cli/run.js
  run start,<id>,--debug-prompts,resolveRun → read source.txt → runPipeline({ runId,sourcePath,debugPrompts }). Throws if source.txt absent.,src/cli/run.js
  run resume,<id>,--debug-prompts,resolveRun → runPipeline({ runId, allowRetry:true, sourcePath }). Throws if run is already completed.,src/cli/run.js
  run list,none,--status <status>,Print id/uuid-prefix/status/name/created_at rows from runs ordered by id DESC. Optional status filter.,src/cli/run.js
  run show,<id>,none,Print JSON block with run cols + segment counts (active vs completed) + llm_calls count + parsed error_json.,src/cli/run.js
  run delete,<id>,none,DELETE FROM runs WHERE id=? (cascades to documents/pages/segments/llm_calls/plan_revisions/synthesis via FK).,src/cli/run.js
  intention show,<id>,none,Print runs.intention.,src/cli/intention.js
  intention set,<id> <text>,none,UPDATE runs.intention. Throws unless status='created'.,src/cli/intention.js
  config get,<id> [key],none,Print config_json (full JSON if no key; else the JSON value).,src/cli/config.js
  config set,<id> <key> <value>,none,Parse/coerce <value> to the key's zod-schema type. Throws if key unknown. Throws if key ∈ LOCKED_UNTIL_CREATED and status!='created'. Re-validates the full config with ConfigSchema.parse then writes back.,src/cli/config.js
  doc list,<id>,none,Rows: doc_id/bytes/source_path.,src/cli/doc.js
  doc show,<id> <doc_id>,none,Prints a slimmed docmap summary JSON: source + metadata + outline_entries count + heading_candidates count + image_count + extraction block.,src/cli/doc.js
  doc pages,<id> <doc_id>,none,Print page_num one per line (ordered).,src/cli/doc.js
  segments list,<id>,none,One line per segment: segment_id, padded status, pp <page_start>-<page_end>, title.,src/cli/segments.js
  segments show,<id> <segment_id>,none,Prints a JSON block with all parsed JSON sidecars (tags/cross_refs/claims/baseline_deltas/gaps/plan_feedback) + notes_md + timestamps.,src/cli/segments.js
  segments revisions,<id>,none,Lists plan_revisions rows for the run.,src/cli/segments.js
  calls list,<id>,--role <planner|reader|synthesizer> --since <value>,One line per llm_calls row (seq/role/segment_id/token counts/latency/error).,src/cli/calls.js
  calls show,<id> <seq>,none,Prints a JSON block with full request_json + response_json + usage_json for one call.,src/cli/calls.js
  guide show,<id>,none,Prints renderMarkdown(db,runId) to stdout. Uses synthesis + completed segments. See markdown_rendering section below.,src/cli/guide.js
  guide export,<id> <path>,--format <md|json> (default md),Writes markdown (same as show) OR buildJson(db,runId) JSON to <path>.,src/cli/guide.js
  export,<id> <output_dir>,none,Writes one file: <output_dir>/run.json containing {run,documents,segments,llm_calls,plan_revisions,synthesis} raw SQLite rows.,src/cli/export.js
  import,<sqlite_or_dir>,none,Stub — writes "import: not implemented in POC" to stderr and exit(2).,src/cli/export.js
────────────────────────────────────────────────────────────────────────────
DATABASE (src/db/schema.sql — canonical DDL)
────────────────────────────────────────────────────────────────────────────
db:
  engine: SQLite (better-sqlite3)
  pragmas[2]: journal_mode=WAL, foreign_keys=ON
  schema_version: 2 (row in meta table)
  migrations_file: src/db/migrate.js
  migrate_behavior:
  - applySchema(db) runs schema.sql via db.exec (all CREATE TABLE IF NOT EXISTS).
  - rebuildSynthesisIfStale: if the synthesis table has a preamble_md column (legacy v1), create synthesis_v2 with the v2 shape, copy rows, DROP the original, RENAME v2 → synthesis.
  - migrateSegmentColumns: idempotently ALTER TABLE segments ADD COLUMN for claims_json, baseline_deltas_json, gaps_json; swallows "duplicate column name" errors.
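migration_sketch: a JS sketch of the idempotent ALTER TABLE pattern in migrate_behavior (real code: src/db/migrate.js; the helper name is illustrative):

```js
// db: an open better-sqlite3 handle.
function addColumnIfMissing(db, table, columnDef) {
  try {
    db.exec(`ALTER TABLE ${table} ADD COLUMN ${columnDef}`);
  } catch (err) {
    // SQLite raises "duplicate column name" when the column already exists;
    // swallowing that specific error makes the migration safe to re-run.
    if (!/duplicate column name/i.test(String(err.message))) throw err;
  }
}

for (const col of ['claims_json TEXT', 'baseline_deltas_json TEXT', 'gaps_json TEXT']) {
  addColumnIfMissing(db, 'segments', col);
}
```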
tables[8]{name,purpose,pk,unique,fk_cascade}:
  meta,Schema metadata,"key","key","—"
  runs,One per pipeline invocation,"id AUTOINCREMENT","uuid","—"
  documents,One PDF per run (currently always one),"id","(run_id,doc_id)","run_id→runs ON DELETE CASCADE"
  pages,Per-page extracted JSON,"id","(document_id,page_num)","document_id→documents ON DELETE CASCADE"
  segments,Reading plan entries + reader output,"id","(run_id,segment_id)","run_id→runs + document_id→documents ON DELETE CASCADE"
  plan_revisions,Audit trail of revision planner calls,"id","—","run_id→runs ON DELETE CASCADE"
  llm_calls,One row per Anthropic messages.create invocation,"id","—","run_id→runs ON DELETE CASCADE"
  synthesis,Final synthesizer output (one per run),"id","run_id","run_id→runs ON DELETE CASCADE"
runs_cols[11]{col,type,notes}:
  id,INTEGER PK AUTOINCREMENT,—
  uuid,TEXT UNIQUE NOT NULL,Generated by crypto.randomUUID()
  name,TEXT,Default: <shortUuid>-<slugify(intention,5)>
  intention,TEXT NOT NULL,Reader goal. Locked once status≠'created' (enforced by intention.js).
  config_json,TEXT NOT NULL,Full ConfigSchema-validated JSON
  status,TEXT NOT NULL CHECK,One of created/extracting/planning/reading/synthesizing/completed/failed/paused
  error_json,TEXT,Populated on failure: {message,stack}
  created_at,TEXT NOT NULL,strftime('%Y-%m-%dT%H:%M:%fZ','now') default
  updated_at,TEXT NOT NULL,Same format
  started_at,TEXT,Set when the run first transitions to 'extracting'
  completed_at,TEXT,Set when the run transitions to 'completed'
documents_cols[7]{col,type,notes}:
  id,INTEGER PK,—
  run_id,INTEGER FK,—
  doc_id,TEXT NOT NULL,Currently always literal 'doc1' from orchestrator
  source_path,TEXT NOT NULL,Absolute PDF path from docmap.source.path
  sha256,TEXT NOT NULL,From docmap.source.sha256 (hex)
  bytes,INTEGER NOT NULL,os.stat(path).st_size
  docmap_json,TEXT NOT NULL,Full docmap (see docmap_shape below)
pages_cols[4]{col,type,notes}:
  id,INTEGER PK,—
  document_id,INTEGER FK,—
  page_num,INTEGER NOT NULL,1-based
  page_json,TEXT NOT NULL,{page,width_pt,height_pt,text_reading_order,blocks[],images[]}
segments_cols[17]{col,type,notes}:
  id,INTEGER PK,—
  run_id,INTEGER FK,—
  document_id,INTEGER FK,—
  segment_id,TEXT NOT NULL,Format: s<NN> (e.g. s01). Regex in zod: /^s\d{2,}$/
  idx,INTEGER NOT NULL,Insertion order — controls processing and rendering
  title,TEXT NOT NULL,From planner
  page_start,INTEGER NOT NULL,Inclusive
  page_end,INTEGER NOT NULL,Inclusive
  why_this_boundary,TEXT,Planner rationale
  expected_relevance,TEXT,Planner rationale
  status,TEXT NOT NULL CHECK,pending|in_progress|completed|superseded|failed
  notes_md,TEXT,Reader output — ≤150 words of narrative (per reader prompt)
  claims_json,TEXT,Array (3-6) of Claim objects — see schemas
  baseline_deltas_json,TEXT,Array (2-3) of BaselineDelta objects
  gaps_json,TEXT,Array (0-2) of Gap objects
  cross_refs_json,TEXT,Array of prior segment_ids
  plan_feedback_json,TEXT,Nullable PlanFeedback object — triggers revision planner
segments_indexes[2]: idx_segments_run_idx(run_id,idx), idx_segments_run_status(run_id,status)
plan_revisions_cols[6]{col,type,notes}:
  id,INTEGER PK,—
  run_id,INTEGER FK,—
  triggered_by_segment_id,TEXT,segment_id whose plan_feedback caused the revision
  before_plan_json,TEXT NOT NULL,Snapshot of the plan before revision
  after_plan_json,TEXT NOT NULL,The new_segments array inserted in this revision (NOT the full post-revision plan)
  rationale,TEXT,feedback.rationale or ''
llm_calls_cols[15]{col,type,notes}:
  id,INTEGER PK,—
  run_id,INTEGER FK,—
  seq,INTEGER NOT NULL,1-based within a run (COALESCE(MAX(seq),0)+1)
  role,TEXT NOT NULL CHECK,planner|reader|synthesizer
  segment_id,TEXT,Null for planner-initial and synthesizer; set for reader and revision-planner
  model,TEXT NOT NULL,Model passed to the Anthropic API
  request_json,TEXT NOT NULL,Full payload sent (system+messages+tools+tool_choice if set)
  response_json,TEXT,Full SDK response object
  usage_json,TEXT,resp.usage
  input_tokens,INTEGER,usage.input_tokens
  output_tokens,INTEGER,usage.output_tokens
  cache_creation_tokens,INTEGER,usage.cache_creation_input_tokens
  cache_read_tokens,INTEGER,usage.cache_read_input_tokens
  latency_ms,INTEGER,Client-side wall time
  error,TEXT,String of the caught exception (null on success)
llm_calls_indexes[1]: idx_llm_calls_run_role(run_id,role)
synthesis_cols[6]{col,type,notes}:
  id,INTEGER PK,—
  run_id,INTEGER UNIQUE FK,Ensures exactly one synthesis row per run
  document_shape_md,TEXT NOT NULL,Free-text 1-2 paragraphs
  portability_notes_json,TEXT NOT NULL,"{generalizes:string[],medium_bound:string[]}"
  threads_json,TEXT NOT NULL,Thread[] — see schemas
  tensions_json,TEXT NOT NULL DEFAULT '[]',Tension[] — see schemas
────────────────────────────────────────────────────────────────────────────
CONFIG (src/util/config.js — ConfigSchema — zod, .strict())
────────────────────────────────────────────────────────────────────────────
config:
  validator: ConfigSchema.parse (zod, strict mode — extra keys rejected)
  default_source: ConfigSchema.parse({}) — every key has a zod default
  file_override: run new --config merges a JSON file on top of defaults (via mergeConfig).
  per_key_override: config set <id> <key> <value>. coerceValue(key,value) routes strings to the key's zod type: 'true'/'false' → boolean, integer/float regex → Number, else the raw string.
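coercion_sketch: a JS sketch of the coerceValue routing above (real code: src/util/config.js; the exact regexes are assumptions):

```js
function coerceValueSketch(value) {
  if (value === 'true') return true;                      // 'true'/'false' → boolean
  if (value === 'false') return false;
  if (/^-?\d+$/.test(value)) return Number(value);        // integer regex
  if (/^-?\d+\.\d+$/.test(value)) return Number(value);   // float regex
  return value;                                           // else: raw string
}
// e.g. `casestudies config set <id> reader_temperature 0.3` stores the number
// 0.3; ConfigSchema.parse then re-validates the whole config before write-back.
```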
config_keys[29]{key,type,default,lockable_after_created}:
  planner_model,string,claude-sonnet-4-6,true
  reader_model,string,claude-opus-4-7,true
  synthesizer_model,string,claude-opus-4-7,true
  planner_temperature,number,0,true
  reader_temperature,number,0.3,true
  synthesizer_temperature,number,0.3,true
  planner_max_tokens,int,4096,true
  reader_max_tokens,int,2500,true
  synthesizer_max_tokens,int,4000,true
  max_revisions_per_segment,int,2,false
  max_total_revisions_formula,string,ceil(n/3),false
  segment_min_pages,int,2,false
  segment_max_pages,int,30,false
  segments_per_10_pages,number,1.0,false
  segment_count_floor,int,4,false
  segment_count_ceiling,int,30,false
  segment_wallclock_timeout_s,int,300,false
  python_service_port,int,8765,false
  python_service_startup_timeout_ms,int,10000,false
  extraction_nodiag,bool,true,true
  extraction_cropbox,bool,false,true
  preview_char_length,int,200,true
  heading_min_pt,number,13,true
  heading_tier_count,int,3,true
  strip_boilerplate,bool,true,true
  boilerplate_min_doc_fraction,number,0.30,true
  boilerplate_band_frac,number,0.10,true
  cache_ttl,string,5m,false
  max_estimated_cost_usd,number,5.0,false
config_locked_set[17]:
- planner_model
- reader_model
- synthesizer_model
- planner_temperature
- reader_temperature
- synthesizer_temperature
- planner_max_tokens
- reader_max_tokens
- synthesizer_max_tokens
- extraction_nodiag
- extraction_cropbox
- preview_char_length
- heading_min_pt
- heading_tier_count
- strip_boilerplate
- boilerplate_min_doc_fraction
- boilerplate_band_frac
config_observations:
- cache_ttl is stored in config but never read anywhere in code (not wired to Anthropic cache_control ttl).
- max_estimated_cost_usd is stored in config but never read anywhere in code (no cost-guard logic).
- ConfigSchema is .strict(); unknown keys throw on insertion or update.
────────────────────────────────────────────────────────────────────────────
PYTHON EXTRACTION SERVICE (python/)
────────────────────────────────────────────────────────────────────────────
python_service:
  app: python/extract_service.py
  framework: FastAPI + uvicorn
  bind: 127.0.0.1:$port (port from config, default 8765)
  spawned_by: src/extraction/lifecycle.js startExtractService()
  spawn_cmd: python/.venv/bin/python -m uvicorn python.extract_service:app --host 127.0.0.1 --port $port --log-level warning
  spawn_env: process.env merged with PYTHONPATH=$REPO_ROOT
  startup_probe: GET /health every 100ms until 200 or deadline
  in_memory_cache:
    store: OrderedDict doc_id → {docmap, pages}
    capacity: 10 (pops oldest via popitem(last=False))
    lifetime: process-lifetime (not disk-persistent)
python_endpoints[4]{method,path,request,response,timeout_ms_from_node}:
  GET,/health,none,"{ok,version,cached_docs[]}",5000
  POST,/load,LoadBody (see schema),"{doc_id,docmap,pages:{<page_str>:<page_json>}}",600000
  GET,/text,"?doc_id=<id>&start=<n>&end=<n>","{pages:[{page,text_reading_order,images}]}",30000
  POST,/unload,"{doc_id}","{ok:true}",5000
python_load_body_fields[10]{field,type,default}:
  pdf_path,str,(required)
  doc_id,str,(required)
  nodiag,bool,true
  cropbox,bool,false
  preview_char_length,int,200
  heading_min_pt,float,13.0
  heading_tier_count,int,3
  strip_boilerplate,bool,true
  boilerplate_min_doc_fraction,float,0.30
  boilerplate_band_frac,float,0.10
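extract_client_sketch: a JS sketch of the Node-side calls against the endpoints above (real code: src/extraction/client.js; function names and the hard-coded port are illustrative):

```js
const BASE = 'http://127.0.0.1:8765'; // the real code derives the port from config

async function loadDoc(pdfPath) {
  const res = await fetch(`${BASE}/load`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    // Extraction flags (nodiag, cropbox, …) ride along in the same LoadBody.
    body: JSON.stringify({ pdf_path: pdfPath, doc_id: 'doc1' }),
  });
  if (!res.ok) throw new Error(`load failed: ${res.status}`);
  return res.json(); // {doc_id,docmap,pages}
}

async function getText(docId, start, end) {
  const qs = new URLSearchParams({ doc_id: docId, start: String(start), end: String(end) });
  const res = await fetch(`${BASE}/text?${qs}`);
  if (!res.ok) throw new Error(`text failed: ${res.status}`);
  return res.json(); // {pages:[{page,text_reading_order,images}]}
}
```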
python_extractors[7]{file,role,external_cmd,fail_mode}:
  python/extractors/qpdf_outline.py,Parse PDF bookmarks via qpdf --json (tries --json-output=2 then falls back to plain --json),qpdf,Soft-fail: returns {source:'none',entries:[],warnings:[reason]} on any error or missing outline.
  python/extractors/pdfinfo.py,Metadata (title/author/etc) + per-page sizes via 'pdfinfo -isodates -box' then 'pdfinfo -box -f 1 -l N',pdfinfo,Hard-fail: raises RuntimeError if returncode≠0.
  python/extractors/pdftotext_tsv.py,Word-level TSV (level=5 rows only): bbox_pt + max_word_height_pt + mean_word_height_pt. Groups words by (block_num,line_num) into blocks with line lists.,pdftotext -tsv [-nodiag],Hard-fail: raises RuntimeError on returncode≠0.
  python/extractors/pdftotext_plain.py,Reading-order plain text. NFC-normalized. Line-breaking hyphens rejoined via regex (\w+)-\n(\w+) → \1\2. Split on form-feed (\f) per page; trailing empty page popped.,pdftotext [-nodiag],Hard-fail: raises RuntimeError on returncode≠0.
  python/extractors/images.py,Parse 'pdfimages -list' rows into {image_id,page,bbox_pt:[0,0,width,height],inferred_kind:'figure',status:'unprocessed'}.,pdfimages -list,Soft-fail: returns ([],version) on returncode≠0 or <3 header lines.
  python/extractors/headings.py,Heading inference. Build a histogram of line max_word_height_pt (>=heading_min_pt) in 1pt bins. Take the top tier_count bins by count desc then height desc; sort tallest→smallest as tiers 1..N. Emit a candidate line per match with tier/page/y_pt/text.,(none),Returns {source:'bbox_height_clustering',tiers:[],candidates:[]} when no lines pass the min-pt filter.
  python/extractors/boilerplate.py,Detect repeating top/bottom-band lines and strip them. band: line y-center within the top or bottom page_height×band_frac band. normalized: lowercase + digits→'#' + whitespace-squashed. A pattern seen on ≥ max(2, round(total_pages×min_doc_fraction)) pages is removed. Rewrites both block text and reading_pages. Skips entirely if total_pages<3.,(none),Returns (blocks,pages,report). report has patterns_removed[] + lines_removed_total + pages_affected + config echo.
docmap_builder: python/extractors/docmap.py — build(pdf_path, doc_id, ...flags) returns (docmap, per_page_json).
docmap_shape:
  top_level_fields[9]: schema_version,doc_id,source,metadata,pages,outline,headings_inferred,images,extraction
  schema_version_value: 1
  source: "{path,sha256,bytes}"
  metadata: "{title,author,subject,creator,producer,creation_date,modification_date,pdf_version,page_count,encrypted,tagged,has_outline}"
  pages_entry: "{page,width_pt,height_pt,text_length,word_count,image_count,preview}"
  outline: "{source:'qpdf'|'none',entries:[{entry_id,level,title,page}]}"
  headings_inferred: "{source:'bbox_height_clustering',tiers:[{tier,min_height_pt,max_height_pt,count}],candidates:[{heading_id,tier,page,y_pt,text,bbox_height_pt}]}"
  images: "[{image_id,page,bbox_pt,inferred_kind,status}]"
  extraction: "{extracted_at,tool_chain:[qpdf_v,pdftotext_v,pdfinfo_v,pdfimages_v],flags_used:{nodiag,tsv,cropbox,strip_boilerplate},boilerplate:<report|null>,warnings:string[]}"
per_page_json_shape: "{page,width_pt,height_pt,text_reading_order,blocks:[{block_id,bbox_pt,lines:[{line_id,bbox_pt,text,word_count,max_word_height_pt,mean_word_height_pt}]}],images:[{image_id,bbox_pt}]}"
────────────────────────────────────────────────────────────────────────────
LLM LAYER (src/llm/)
────────────────────────────────────────────────────────────────────────────
llm_client: file: src/llm/client.js sdk: @anthropic-ai/sdk init: new Anthropic() with no args — SDK picks up ANTHROPIC_API_KEY from env. Cached singleton. guard: Throws "ANTHROPIC_API_KEY is not set" before construction if env absent.
llm_call_wrapper:
file: src/llm/call.js
export: callAndValidate({db,runId,role,segmentId,model,maxTokens,temperature,system,messages,schema,tool?})
responsibilities:
- Generate next seq for llm_calls (COALESCE(MAX(seq),0)+1 per run_id).
- Build request: {model,max_tokens,system,messages[,temperature][,tools,tool_choice]}.
- Omit temperature if model matches /claude-(opus|sonnet|haiku)-(\d+)-(\d+)/ AND (major>4 OR (major==4 AND minor>=7)) — covers opus-4-7, which rejects temperature (see the sketch after this list).
- Enforce that system is string OR array; messages' content is string OR array.
- Call client.messages.create(payload) and time it (Date.now()).
- Write one llm_calls row before returning — both success and failure paths. Persists request_json + response_json + usage_json + cache_* + latency_ms + error.
- Optionally dump reader requests to the debug dir per seq when setDebugDir was called.
- Extract output: if tool given and the response has a tool_use block matching tool.name → use block.input as parsed object (API already parsed JSON); else concatenate text blocks with \n.
- Validate with provided zod schema.
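temperature_rule_sketch: the temperature-omission check from the list above as a JS predicate (real code: src/llm/call.js; the function name is illustrative):

```js
function shouldOmitTemperature(model) {
  const m = /claude-(opus|sonnet|haiku)-(\d+)-(\d+)/.exec(model);
  if (!m) return false;              // unknown naming: pass temperature through
  const major = Number(m[2]);
  const minor = Number(m[3]);
  // major>4, or major==4 with minor>=7: covers opus-4-7, which rejects temperature.
  return major > 4 || (major === 4 && minor >= 7);
}
```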
json_repair_retry:
applies_to: text-path responses only (planner, synthesizer)
trigger: JSON parse failure OR zod validation failure
action: One additional LLM call with [...originalMessages, {role:'assistant',content:<raw response text>}, {role:'user',content:'Your last response could not be used. Error: <error>\n\nReturn the corrected JSON only. No fences, no prose.'}]. A new llm_calls row is written. Hard-fail if the repair call also fails to parse or validate.
tool_path_validation:
- Runs zod safeParse on tool_input.
- On failure: throws 'tool validation failed: …'. No repair retry in tool path.
- Note: the jsdoc comment above callAndValidate describes a "log and return anyway" policy, but the implemented behavior is to throw.
lenient_json_parser: parseJsonLenient(raw) — strips ``` fences and an optional 'json' language tag, then slices between the first '{' and the last '}'. Exported from src/llm/schemas.js.
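lenient_parse_sketch: a JS sketch of parseJsonLenient's described behavior (real code: src/llm/schemas.js; the error message is illustrative):

```js
function parseJsonLenientSketch(raw) {
  // Strip fences (three backticks) and an optional 'json' language tag.
  const s = raw.replace(/`{3}(?:json)?/g, '').trim();
  // Slice between the first '{' and the last '}' to drop surrounding prose.
  const start = s.indexOf('{');
  const end = s.lastIndexOf('}');
  if (start === -1 || end <= start) throw new Error('no JSON object found');
  return JSON.parse(s.slice(start, end + 1));
}
```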
llm_debug_dump: module_state: _debugDir (closure in src/llm/call.js) set via setDebugDir(dir). trigger_path: the orchestrator sets it when --debug-prompts; the reader calls dumpDebug for each reader request. artifact: runs/debug/run-<id>/reader-<segment_id>-seq<N>.json containing the full request body.
zod_schemas:
  file: src/llm/schemas.js
  planner_output: {segments: PlannerSegment[] >=1} .strict()
  planner_segment:
    id_regex: ^s\d{2,}$
    fields: id,title(>=1),page_start(int>=1),page_end(int>=1),why_this_boundary(>=1),expected_relevance(>=1)
    refinement: page_end >= page_start
  reader_output: .strict() — notes_md(>=1) + tags (enum array, default []) + claims (3..6) + baseline_deltas (2..3) + gaps (<=2, default []) + cross_refs (string[], default []) + plan_feedback (nullable optional)
  claim: {id,title,stance,evidence:{page(int>=0),quote},ui_translation,confidence in ('direct','inferred')}
  baseline_delta: {baseline_assumption,source_deviation,why_it_matters}
  gap: {topic,why_notable}
  plan_feedback: {kind in ('split','merge_next','expand_end','reframe'),rationale,suggested_boundary? int>=1} .strict()
  synthesizer_output: .strict() — document_shape(string>=1) + portability_notes{generalizes[3..6],medium_bound[3..6]} + threads(Thread[], default []) + tensions(Tension[0..5], default [])
  thread: {title,segment_ids:string[] default [],why,strength in ('dominant','strong','weak'),generalizes_beyond_source:bool}
  tension: {description,segments_involved:string[],resolution}
reader_tool:
  name: record_segment_notes (exported as READER_TOOL_NAME)
  mode: tool_use forcing via tool_choice={type:'tool',name}
  input_schema: JSON Schema object mirroring the reader_output zod shape (src/llm/schemas.js READER_TOOL_INPUT_SCHEMA). Notable: tags minItems=2 maxItems=5; claims minItems=3 maxItems=6; baseline_deltas minItems=2 maxItems=3; gaps maxItems=2; plan_feedback oneOf [{type:null},{object…}].
  note: The zod schema allows tags to default to empty while the JSON Schema requires minItems=2 — the discrepancy is tolerated because the API enforces the JSON Schema side.
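tool_forcing_sketch: a JS sketch of a forced tool_use call shaped like the reader's. The Anthropic SDK accepts tool_choice {type:'tool',name}; READER_TOOL_INPUT_SCHEMA, systemBlocks, and userContent are assumed to be in scope:

```js
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// READER_TOOL_INPUT_SCHEMA, systemBlocks, userContent: assumed defined elsewhere
// (the real constants live in src/llm/schemas.js and src/llm/prompts/reader.js).
const resp = await client.messages.create({
  model: 'claude-opus-4-7', // reader_model from config
  max_tokens: 2500,
  system: systemBlocks,
  messages: [{ role: 'user', content: userContent }],
  tools: [{ name: 'record_segment_notes', input_schema: READER_TOOL_INPUT_SCHEMA }],
  tool_choice: { type: 'tool', name: 'record_segment_notes' }, // force this tool
});

// The API returns tool arguments already parsed, so no JSON.parse step is needed.
const block = resp.content.find((b) => b.type === 'tool_use' && b.name === 'record_segment_notes');
const readerOutput = block.input; // then validated with ReaderOutputSchema (zod)
```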
tag_vocabulary: TAG_VOCABULARY[17]:
- color
- typography
- spacing
- hierarchy
- assets
- governance
- prohibitions
- variants
- composition
- migration
- contrast
- wayfinding
- tokens
- reproduction
- alignment
- naming
- anti-ornament
────────────────────────────────────────────────────────────────────────────
PROMPTS (src/llm/prompts/*.js — builders + system prompts)
────────────────────────────────────────────────────────────────────────────
prompts:
  planner:
    file: src/llm/prompts/planner.js
    builders[2]: buildPlannerMessages({docmap,intention,config}), buildPlannerRevisionMessages({docmap,intention,config,currentPlan,completedNotesMd,feedback,triggerSegment})
    system_prompt_essence: Planning agent. Segment the PDF. Respect semantic boundaries (headings, topic shifts). Min 2 / max 30 pages per segment. Number segments s01, s02, ... Each segment emits why_this_boundary + expected_relevance. Output valid JSON only — no fences, no prose.
    target_segment_count_formula: ceil(page_count × segments_per_10_pages / 10), clamped to [segment_count_floor, segment_count_ceiling] — passed as a prose hint in the constraints block, not enforced by code.
    condensed_docmap_sent_to_model: "{doc_id,metadata,outline,headings_inferred:{source,tiers,candidates[:400]},previews}"
    previews_trim: trimPreviews picks page previews for pages in the outline/heading anchor sets and caps the total at 80.
    revision_prompt_adds: existing plan JSON + notes_md of completed segments + the feedback object + the trigger segment_id. Re-plan only the REMAINING (not-yet-completed) segments. Keep numbering continuous after the last used segment id.
  reader:
    file: src/llm/prompts/reader.js
    builder: buildCacheableReaderPayload({intention,completedSegments,currentSegment,segmentPages})
    system_structure[2_blocks]:
    - {type:'text',text:READER_SYSTEM} — base instructions. NOT cached explicitly.
    - {type:'text',text:intention,cache_control:{type:'ephemeral'}} — caches the intention as the second system block.
    user_content_blocks[2]:
    - {type:'text',text:<guide_block>,cache_control:{type:'ephemeral'}} — the running guide text assembled from completed segments ([<segment_id>: <title>]\n<notes_md>) joined by '\n\n---\n\n'. Falls back to '(no prior segments)' when empty.
    - {type:'text',text:'## Segment <segment_id>: <title> (pp <start>-<end>)\n\n<segment_text>'} — the current segment's reading-order text. NO cache_control (varies per segment).
    segment_text_format: '--- page <N> ---\n<page.text_reading_order>' joined by '\n\n'.
    cache_invariants_verified_by: scripts/test-reader-determinism.js — asserts (1) identical inputs → byte-identical payload and (2) the guide block for segment N is a strict string-prefix of the guide block for segment N+1.
    system_essence: "Reader agent, not summarizer. Emits 3-6 claims + 2-3 baseline_deltas + 0-2 gaps + notes_md ≤150 words + 2-5 tags from the fixed vocabulary + optional plan_feedback. Each claim has id c<NN>, title, stance, evidence{page,quote≤25 words verbatim}, ui_translation, confidence (direct|inferred)."
  synthesizer:
    file: src/llm/prompts/synthesizer.js
    builder: buildSynthesizerMessages({intention,notesMd})
    system_essence: "Synthesizer agent. Emits exactly four fields: document_shape (1-2 paragraphs), portability_notes{generalizes[3-6],medium_bound[3-6]}, threads[] (5-10 with title/segment_ids/why/strength/generalizes_beyond_source), tensions[] (0-5). Output valid JSON only — no fences."
    notes_md_format: '### <segment_id>: <title> (pp <start>-<end>)\n\n<notes_md>' joined by '\n\n---\n\n'.
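reader_payload_sketch: a JS sketch of the cache-friendly block layout described under reader (real builder: buildCacheableReaderPayload in src/llm/prompts/reader.js; READER_SYSTEM is assumed imported):

```js
function buildReaderPayloadSketch({ intention, completedSegments, currentSegment, segmentText }) {
  const guide = completedSegments.length
    ? completedSegments
        .map((s) => `[${s.segment_id}: ${s.title}]\n${s.notes_md}`)
        .join('\n\n---\n\n')
    : '(no prior segments)';
  return {
    system: [
      { type: 'text', text: READER_SYSTEM }, // base instructions, not cached explicitly
      { type: 'text', text: intention, cache_control: { type: 'ephemeral' } },
    ],
    messages: [{
      role: 'user',
      content: [
        // The guide block grows append-only, so segment N's block is a strict
        // prefix of segment N+1's, which keeps the prompt cache warm.
        { type: 'text', text: guide, cache_control: { type: 'ephemeral' } },
        {
          type: 'text', // varies per segment, so deliberately NOT cached
          text: `## Segment ${currentSegment.segment_id}: ${currentSegment.title} ` +
                `(pp ${currentSegment.page_start}-${currentSegment.page_end})\n\n${segmentText}`,
        },
      ],
    }],
  };
}
```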
────────────────────────────────────────────────────────────────────────────
READER LOOP + REVISION LOGIC (orchestrator.js + planner.js + reader.js)
────────────────────────────────────────────────────────────────────────────
reader_loop:
  selection_sql: SELECT … FROM segments WHERE run_id=? AND status='pending' ORDER BY idx ASC LIMIT 1
  termination: exits when the SELECT returns no row OR isShuttingDown().
  per_iteration_flow:
  - Mark the segment row status='in_progress'.
  - GET /text?doc_id=<id>&start=<page_start>&end=<page_end> on the python service for the page range.
  - Build the cacheable reader payload using completed segments with idx < the current segment's idx.
  - Call callAndValidate with role='reader', tool=READER_TOOL_{NAME,INPUT_SCHEMA}, schema=ReaderOutputSchema. Wrapped in withTimeout(…, segment_wallclock_timeout_s*1000ms).
  - On success: persist notes_md + all JSON sidecars + status='completed' + completed_at. Log cache_hit_rate = cache_read / (cache_read + cache_create + input_tokens).
  - On failure: UPDATE segments SET status='failed'; continue the loop (does NOT abort the run immediately — the loop next picks up any remaining pending segment). If none remain pending, the run proceeds to the synthesizer even with failed segments present. Failed segments are not rendered in the guide (rendering filters on status='completed').
revision_logic:
  trigger: The reader's successful result includes plan_feedback!=null with a non-null kind.
  per_segment_cap: config.max_revisions_per_segment (default 2) counts revisions triggered by a single segment_id (in-memory revisionCounters Map).
  total_cap: maxTotalRevisionsForFormula(formula,initialCount) — parses a 'ceil(n/K)' regex, else defaults to ceil(n/3). initialCount = COUNT(segments WHERE status!='superseded') at reader start.
  revision_call: runRevisionPlanner → buildPlannerRevisionMessages (includes existing plan + completed notes + feedback + trigger segment). Model/tokens/temperature use planner_* config.
  persistence: persistRevisionSegments() transaction:
  - UPDATE segments SET status='superseded' WHERE status='pending' (kills the remaining pending queue).
  - INSERT new segments with idx continuing from existingMax+1 and segment_ids renumbered to continue after the last existing sNN.
  - INSERT into plan_revisions with before_plan_json=currentPlan, after_plan_json=renumbered-new-segments, rationale=feedback.rationale||''.
  counters_update: revisionCounters[segment_id]++ ; totalRevisions++.
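revision_budget_sketch: a JS sketch of the total-cap formula parser above (the real helper is maxTotalRevisionsForFormula; the exact regex is an assumption consistent with the 'ceil(n/K)' description):

```js
function maxTotalRevisionsSketch(formula, initialCount) {
  const m = /^ceil\(n\/(\d+)\)$/.exec(formula);
  const k = m ? Number(m[1]) : 3; // unparseable formulas fall back to ceil(n/3)
  return Math.ceil(initialCount / k);
}
// e.g. 12 non-superseded segments at reader start with the default 'ceil(n/3)'
// → at most 4 total revisions for the run.
```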
────────────────────────────────────────────────────────────────────────────
RENDERING (src/cli/guide.js)
────────────────────────────────────────────────────────────────────────────
markdown_rendering:
  function: renderMarkdown(db,runId)
  order: synthesis header → Document shape → Generalizes/Medium-bound → Tensions → Threads → Segments (in idx order, status='completed' only).
  per_segment_render: "### <segment_id>: <title> (pp <start>-<end>)" + optional Tags + optional Cross-refs + notes_md + optional Claims (id/title/stance/evidence/ui_translation/confidence) + optional Baseline Deltas + optional Gaps.
  no_synthesis_fallback: '# Reading Guide (synthesis pending)' then straight to Segments.
json_rendering: function: buildJson(db,runId) shape: "{run:{id,uuid,name,intention},synthesis:{document_shape,portability_notes,threads,tensions}|null,segments:[{segment_id,idx,title,page_start,page_end,status,notes_md,claims,baseline_deltas,gaps,tags,cross_refs}]}"
────────────────────────────────────────────────────────────────────────────
SHUTDOWN & LOGGING
────────────────────────────────────────────────────────────────────────────
shutdown:
  file: src/util/shutdown.js
  signals_handled: SIGINT, SIGTERM
  handler_ordering: reverse registration order (LIFO).
  exit_code: 130
  isShuttingDown_flag: set true on the first signal — polled by the reader loop to break between iterations.
  pipeline_shutdown_registrations[1]: Stop the python child + mark run status='paused'.
logger:
  file: src/util/logger.js
  library: pino
  level_env: CASESTUDIES_LOG_LEVEL (default 'info')
  pretty_env: CASESTUDIES_LOG_PRETTY=1 → transport pino-pretty
  notable_events[log_msg]:
  - run.start / run.complete / run.failed / run.paused / run.paused.shutdown / run.retry.reset_failed_segments
  - extraction.start / extraction.complete / extraction.reload
  - planner.start / planner.complete / planner.revision / planner.revision.failed / planner.revision.skipped_per_segment_cap / planner.revision.skipped_total_cap
  - reader.segment.start / reader.segment.complete / reader.segment.failed
  - synthesizer.start / synthesizer.complete
  - extract service ready / debug.prompts.enabled
────────────────────────────────────────────────────────────────────────────
SCRIPTS
────────────────────────────────────────────────────────────────────────────
scripts[5]{path,purpose,notes}:
  scripts/bootstrap.sh,Verify pdftotext/pdfinfo/pdfimages/qpdf/python3>=3.11/node>=20/pnpm. Then 'pnpm install' + create python/.venv + pip install requirements.txt.,Exits non-zero with hints on any missing dep.
  scripts/dev-python.sh,Run extract_service manually: python/.venv/bin/python -m uvicorn python.extract_service:app --host 127.0.0.1 --port $PORT (default 8765).,For dev/testing of the Python HTTP API independent of Node.
  scripts/run-pipeline.sh,Create a run from the ./input PDF with $INTENTION (default 'To build a compact and nuanced set of guidelines for UI development'); start it with CASESTUDIES_LOG_PRETTY=1 in the background; poll 'run show' every POLL_INTERVAL (default 10s); print + export the guide on completion; append size-comparison diagnostics to the exported markdown.,Writes the log to runs/_demo_logs/run-<id>.log and the guide export to runs/_demo_logs/guide-run-<id>.md.
  scripts/compare-sizes.sh,Extract text from a PDF via pdftotext and diff char/word counts against a markdown file; prints Δ, percent change, ratio, and a proportional bar.,Takes <input.pdf> <output.md>; colors disabled when stdout is not a TTY.
  scripts/test-reader-determinism.js,Verifies the buildCacheableReaderPayload byte-identical property and that the cached guide block is a strict prefix as completed_segments grows. Also verifies cache_control placement.,Pure unit test with in-memory fixtures — no DB, no network.
────────────────────────────────────────────────────────────────────────────
FILE LAYOUT (complete)
────────────────────────────────────────────────────────────────────────────
files[53]{path,role}:
  bin/casestudies.js,CLI entry — loads .env then runs buildProgram().parseAsync(argv).
  src/cli/index.js,Builds the commander tree (registers 8 command groups).
  src/cli/run.js,'run new/start/resume/list/show/delete' commands + readSourceForRun helper.
  src/cli/intention.js,'intention show/set' commands.
  src/cli/config.js,'config get/set' commands.
  src/cli/doc.js,'doc list/show/pages' commands.
  src/cli/segments.js,'segments list/show/revisions' commands.
  src/cli/calls.js,'calls list/show' commands.
  src/cli/guide.js,'guide show/export' + renderMarkdown + buildJson.
  src/cli/export.js,'export <id> <output_dir>' writes run.json; 'import' is a stub.
  src/cli/resolve.js,resolveRun(db,arg) — integer id or uuid prefix.
  src/db/index.js,openDb/closeDb/resolveRunsDir/resolveStatePath/newUuid; cached singleton DB.
  src/db/migrate.js,applySchema + synthesis-v1→v2 rebuild + segments ALTER TABLE migrations.
  src/db/schema.sql,Canonical DDL (schema_version=2).
  src/extraction/client.js,Fetch-based HTTP client for the python extract service (health/load/text/unload).
  src/extraction/lifecycle.js,startExtractService() — spawn uvicorn child + poll /health + shutdown().
  src/llm/client.js,Cached Anthropic SDK singleton.
  src/llm/call.js,callAndValidate — the single-entry LLM invocation wrapper + setDebugDir.
  src/llm/schemas.js,All zod schemas + READER_TOOL_NAME/INPUT_SCHEMA + parseJsonLenient.
  src/llm/prompts/planner.js,Planner system prompt + message builders (initial + revision).
  src/llm/prompts/reader.js,TAG_VOCABULARY + reader system text + buildCacheableReaderPayload.
  src/llm/prompts/synthesizer.js,Synthesizer system prompt + buildSynthesizerMessages.
  src/pipeline/orchestrator.js,runPipeline — full extract→plan→read→synthesize state machine with retry/shutdown.
  src/pipeline/planner.js,runInitialPlanner + runRevisionPlanner + persistSegments + persistRevisionSegments.
  src/pipeline/reader.js,readSegment — per-segment fetch+LLM+persist; persistReaderOutput + withTimeout.
  src/pipeline/synthesizer.js,runSynthesizer — fetch completed segments → LLM → upsert synthesis row.
  src/util/config.js,ConfigSchema + LOCKED_UNTIL_CREATED + DEFAULT_CONFIG + loadConfigFile + mergeConfig + coerceValue.
  src/util/ids.js,slugify + sha256Hex + shortUuid + formatSegmentId (some only partially used).
  src/util/logger.js,createLogger + 'log' singleton (pino).
  src/util/shutdown.js,onShutdown + isShuttingDown — SIGINT/SIGTERM graceful exit.
  python/extract_service.py,FastAPI app — /health /load /text /unload endpoints + OrderedDict doc cache (cap 10).
  python/extractors/docmap.py,build() — orchestrates all other extractors into a docmap + per_page_json tuple.
  python/extractors/pdfinfo.py,Wrap 'pdfinfo -isodates -box' + per-page 'pdfinfo -box -f 1 -l N'.
  python/extractors/pdftotext_tsv.py,Wrap 'pdftotext -tsv [-nodiag]' into page→blocks→lines structure.
  python/extractors/pdftotext_plain.py,Wrap 'pdftotext [-nodiag]' → NFC-normalize → rejoin line-break hyphens → split on \f.
  python/extractors/qpdf_outline.py,Wrap 'qpdf --json [--json-output=2]' and walk the outlines tree → (entry_id,level,title,page) list.
  python/extractors/images.py,Wrap 'pdfimages -list' → manifest rows keyed by image_id.
  python/extractors/headings.py,Heading tier inference via max_word_height_pt histogram bins.
  python/extractors/boilerplate.py,Detect/strip repeating top/bottom-band lines across pages.
  python/extractors/__init__.py,Empty package marker.
  python/requirements.txt,fastapi==0.115.5 + uvicorn[standard]==0.32.1 + pydantic==2.10.3.
  scripts/bootstrap.sh,Dep check + install.
  scripts/dev-python.sh,Standalone python service runner.
  scripts/run-pipeline.sh,End-to-end pipeline wrapper (run + export + size diagnostics).
  scripts/compare-sizes.sh,PDF→MD size comparison utility.
  scripts/test-reader-determinism.js,Cache-determinism unit test.
  packages/cli/package.json,"@spectraforge/cli — publishable npm package (bin: spectraforge)."
  packages/cli/bin/spectraforge.js,Bin entrypoint; imports src/index.js.
  packages/cli/src/index.js,Commander tree (run, workspace, --version).
  packages/cli/src/run.js,'run <input>' — validates input, ensures workspace, symlinks input, invokes scripts/run-pipeline.sh.
  packages/cli/src/workspace.js,ensureWorkspace (clone or fetch+reset origin/$REPO_REF) + ensureBootstrapped (runs scripts/bootstrap.sh if node_modules or python/.venv missing).
  packages/cli/src/paths.js,workspaceDir() — $SPECTRAFORGE_WORKSPACE or the XDG/macOS/Windows equivalent under spectraforge/workspace.
  .github/workflows/publish-cli.yml,"On tag v* or workflow_dispatch: pnpm install --ignore-scripts, verify the tag matches the packages/cli version, npm publish --access public --provenance (uses secrets.NPM_TOKEN)."
────────────────────────────────────────────────────────────────────────────
ON-DISK LAYOUT AT RUNTIME (under cwd or CASESTUDIES_RUNS_DIR)
────────────────────────────────────────────────────────────────────────────
runtime_fs:
  runs/state.sqlite,"SQLite state DB (WAL mode). Location overridable via CASESTUDIES_STATE_PATH; otherwise CASESTUDIES_RUNS_DIR/state.sqlite or cwd/runs/state.sqlite."
  runs/run-<id>/source.txt,"Absolute PDF path written by 'run new'. Consumed by 'run start' and 'run resume'."
  runs/debug/run-<id>/reader-<segment_id>-seq<N>.json,"Reader request dumps when --debug-prompts was passed to start/resume."
  runs/_demo_logs/*,"Created by scripts/run-pipeline.sh (log + guide export with appended size diagnostics)."
────────────────────────────────────────────────────────────────────────────
USAGE (happy path — exact commands that work today)
────────────────────────────────────────────────────────────────────────────
happy_path_commands[6]:
- cp .env.example .env && edit .env (or export ANTHROPIC_API_KEY=…)
- bash scripts/bootstrap.sh
- drop a .pdf into ./input
- node bin/casestudies.js run new ./input --intention "To build a compact and nuanced set of guidelines for UI development"
- node bin/casestudies.js run start <id>
- node bin/casestudies.js guide export <id> guide.md
end_user_cli[3]:
- npm install -g @spectraforge/cli
- export ANTHROPIC_API_KEY=…
- spectraforge run ./input # clones galaticstarforge/spectra into $XDG_DATA_HOME/spectraforge/workspace, bootstraps on first run, symlinks ./input into the workspace, runs scripts/run-pipeline.sh.
cli_release_flow[3]:
- bump version in packages/cli/package.json
- git tag v<version> && git push --tags
- .github/workflows/publish-cli.yml verifies the tag matches the package version, then runs 'npm publish --access public --provenance' using secrets.NPM_TOKEN.
known_gotchas:
- input/ directory must contain exactly one .pdf (resolvePdfFromInput rejects 0 or >1 files for the POC).
- 'run new' ONLY writes source.txt to disk after the DB insert; if you lose that file before 'run start', readSourceForRun throws. (After first extraction the source can in principle be recovered from documents.source_path via resolveSourceFromRun, but that path is only reached when the sourcePath arg is falsy, and 'run resume' always passes the value readSourceForRun read from the same source.txt. Net effect: losing source.txt before first extraction is fatal.)
- extraction.reload on resume re-sends the same flags the initial load used AFTER parsing config_json — if those locked config values changed upstream they would still be frozen at run-new time.
- The reader is wrapped in a per-segment wallclock timer (segment_wallclock_timeout_s). Exceeding it marks only that segment status='failed'; run continues with remaining pending.
- 'import' is a stub; export/import round-trip is not implemented.
