@c4a/extract
v0.5.41-beta.7
Code extraction framework for C4A. It owns the language-plugin protocol, repository runner, raw code snapshot contract, digest generation, and shared Tree-sitter parsing utilities.
Role in the Monorepo
@c4a/extract is the protocol and runner layer under context capture --code.
- Language plugins implement ExtractionPlugin and return ExtractionResult v2.
- The runner loads one or more plugins, scans repository modules, emits progress/module-error/summary events, and can build a raw code snapshot payload.
- @c4a/context-cli writes the snapshot under .context/raw/aspect/code/<source-slug>/<snapshot-id>/.
- context compile --code <source-slug> reads that snapshot and materializes package/category/symbol knowledge Nodes.
Depends on: @c4a/core, web-tree-sitter, zod
Depended on by: @c4a/extract-ts, @c4a/context-cli, @c4a/daemon, @c4a/e2e
Protocol Layers
1. Language Plugin Protocol
Language packages implement ExtractionPlugin from protocol.ts.
```typescript
interface ExtractionPlugin {
  id: string;
  languages: string[];
  packageManagers: string[];
  canHandle(source: SourceInfo): boolean;
  detectEntries(manifest: ManifestInfo, fs: FileSystem): Promise<EntryDetectionResult>;
  extractSymbols(entries: EntryFile[], fs: FileSystem): Promise<ExtractionResult>;
  detectPatterns?(fs: FileSystem): Promise<PatternDetectionResult>;
}
```
Important constraints:
- Plugins read through FileSystem; they should not access node:fs directly.
- detectEntries() must return package identity, package kind, language, optional version, and entry files.
- extractSymbols() must return ExtractionResult v2 with stable symbols and relations.
- detectEntries() is called before extractSymbols(); plugins may keep per-detection package context between those calls.
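To illustrate where canHandle() sits in the protocol, the sketch below filters a list of loaded plugins down to those that claim a source. It is not the runner's actual code: SourceInfo is reduced to the manifests[].type field that appears in this README's skeleton, and PluginLike, selectPlugins, and both sample plugins are hypothetical names introduced for illustration.

```typescript
// Hypothetical, reduced stand-ins for the real protocol types.
interface SourceInfo {
  manifests: { type: string }[];
}

interface PluginLike {
  id: string;
  languages: string[];
  canHandle(source: SourceInfo): boolean;
}

// Assumption: every loaded plugin whose canHandle() accepts the source is kept.
function selectPlugins(plugins: PluginLike[], source: SourceInfo): PluginLike[] {
  return plugins.filter((plugin) => plugin.canHandle(source));
}

// Two illustrative plugins keyed on manifest type.
const tsLike: PluginLike = {
  id: "example-ts",
  languages: ["typescript"],
  canHandle: (source) => source.manifests.some((m) => m.type === "package.json"),
};

const pyLike: PluginLike = {
  id: "example-python",
  languages: ["python"],
  canHandle: (source) => source.manifests.some((m) => m.type === "pyproject.toml"),
};

const selected = selectPlugins([tsLike, pyLike], {
  manifests: [{ type: "pyproject.toml" }],
});
console.log(selected.map((plugin) => plugin.id));
```

Keeping canHandle() synchronous and free of file access, as the interface signature suggests, makes this kind of dispatch cheap to run before any parsing starts.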
2. ExtractionResult v2
Every plugin returns:
```typescript
{
  version: "2",
  meta: { extractedAt, pluginId, commitHash, language },
  package: { name, kind, language, version? },
  files: [{ path, language, lines }],
  symbols: SymbolInfo[],
  relations: RelationInfo[],
  stats: { files, lines, exportedSymbols, internalSymbols, relations }
}
```
SymbolInfo supports:
- identity: name, kind, visibility, file, line, endLine
- structure: nested members
- type surfaces: params, returnType, typeAnnotation, extends, implements, propsType, unionValues
- source documentation: doc
RelationInfo supports code edges such as imports, imports_type, calls, extends, implements, param_type, return_type, of_type, depends_on, and contains.
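As a rough illustration of how the stats block relates to the other fields, the sketch below derives it from files, symbols, and relations. Several things here are assumptions rather than protocol facts: the visibility values "exported" and "internal", the choice to count only top-level symbols, and the helper name computeStats. The types are reduced to the fields the sketch needs.

```typescript
// Reduced stand-ins for the README's shapes; only the fields used here.
type Visibility = "exported" | "internal"; // assumed values, not confirmed by the protocol

interface FileInfo {
  path: string;
  language: string;
  lines: number;
}

interface SymbolInfo {
  name: string;
  kind: string;
  visibility: Visibility;
  file: string;
  line: number;
  endLine: number;
  members?: SymbolInfo[];
}

// Endpoint fields are omitted because the README does not name them.
interface RelationInfo {
  kind: string;
}

interface ExtractionStats {
  files: number;
  lines: number;
  exportedSymbols: number;
  internalSymbols: number;
  relations: number;
}

function computeStats(
  files: FileInfo[],
  symbols: SymbolInfo[],
  relations: RelationInfo[],
): ExtractionStats {
  return {
    files: files.length,
    lines: files.reduce((total, file) => total + file.lines, 0),
    // Assumption: only top-level symbols are counted, not nested members.
    exportedSymbols: symbols.filter((s) => s.visibility === "exported").length,
    internalSymbols: symbols.filter((s) => s.visibility === "internal").length,
    relations: relations.length,
  };
}

const stats = computeStats(
  [
    { path: "src/index.ts", language: "typescript", lines: 10 },
    { path: "src/util.ts", language: "typescript", lines: 5 },
  ],
  [
    { name: "Button", kind: "class", visibility: "exported", file: "src/index.ts", line: 1, endLine: 8 },
    { name: "helper", kind: "function", visibility: "internal", file: "src/util.ts", line: 1, endLine: 4 },
  ],
  [{ kind: "imports" }],
);
console.log(stats);
```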
3. Repository Runner Protocol
The package exposes c4a-extract-code, an NDJSON runner used by context capture --code.
Input is JSON on stdin:
```json
{
  "repoPath": "/path/to/repo",
  "modules": ["packages/example"],
  "commitHash": "abc123",
  "pathFilter": {},
  "plugins": [{ "package": "@c4a/extract-ts", "exportName": "TypeScriptPlugin" }],
  "snapshot": {
    "sourceId": "aspect:code:example",
    "sourceSlug": "example",
    "snapshotId": "code-abc123-deadbeef",
    "codeSnapshotContractVersion": "<contract-version>",
    "scriptHash": "sha256:...",
    "toolchain": {
      "manager_package": "@c4a/context-cli",
      "manager_version": "<manager-version>",
      "runner_package": "@c4a/extract",
      "runner_package_version": "<runner-version>",
      "runner_bin": "c4a-extract-code",
      "plugin_package": "@c4a/extract-ts",
      "plugin_package_version": "<plugin-version>",
      "plugin_export": "TypeScriptPlugin"
    }
  }
}
```
Output is one JSON object per line:
```text
{ "type": "progress", "phase": "scanning|parsing|uploading", ... }
{ "type": "module_error", "module_name": "...", "module_path": "...", "error": "..." }
{ "type": "summary", "extraction": ..., "snapshot": ... }
{ "type": "error", "code": "runner-failed", "message": "..." }
```
The runner does not write .context directly. It returns snapshot files in the summary; @c4a/context-cli validates and writes them atomically.
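A consumer of the runner has to split stdout into lines and dispatch on the type field. The sketch below shows one way to do that, assuming the whole output is already available as a string; a real caller would more likely read the stream line by line. RunnerEvent, parseRunnerOutput, and describeEvents are hypothetical names, and the event types carry only the fields listed in this README.

```typescript
// One variant per "type" value documented above.
type RunnerEvent =
  | { type: "progress"; phase: "scanning" | "parsing" | "uploading" }
  | { type: "module_error"; module_name: string; module_path: string; error: string }
  | { type: "summary"; extraction: unknown; snapshot: unknown }
  | { type: "error"; code: string; message: string };

const KNOWN_TYPES = new Set(["progress", "module_error", "summary", "error"]);

// Parse NDJSON text into events, rejecting lines with an unknown "type".
function parseRunnerOutput(stdout: string): RunnerEvent[] {
  const events: RunnerEvent[] = [];
  for (const line of stdout.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    const value: unknown = JSON.parse(trimmed);
    if (typeof value !== "object" || value === null) {
      throw new Error(`Runner line is not a JSON object: ${trimmed}`);
    }
    const type = (value as { type?: unknown }).type;
    if (typeof type !== "string" || !KNOWN_TYPES.has(type)) {
      throw new Error(`Unknown runner event: ${trimmed}`);
    }
    // Only the discriminator is checked; the remaining fields are trusted.
    events.push(value as RunnerEvent);
  }
  return events;
}

// Turn each event into a short human-readable line.
function describeEvents(events: RunnerEvent[]): string[] {
  return events.map((event) => {
    switch (event.type) {
      case "progress":
        return `progress: ${event.phase}`;
      case "module_error":
        return `module ${event.module_name} failed: ${event.error}`;
      case "summary":
        return "summary received";
      case "error":
        return `runner error ${event.code}: ${event.message}`;
    }
  });
}

const sample = [
  '{"type":"progress","phase":"scanning"}',
  '{"type":"module_error","module_name":"example","module_path":"packages/example","error":"parse failed"}',
  '{"type":"summary","extraction":{},"snapshot":{}}',
].join("\n");

console.log(describeEvents(parseRunnerOutput(sample)));
```

Because the sketch validates only the discriminator, a stricter consumer might check each variant's fields with a schema library before trusting them.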
4. Raw Code Snapshot Contract
When snapshot input is provided, the runner builds these files:
| File | Purpose |
|---|---|
| source.yaml | Source manifest for the code aspect source |
| manifest.json | Snapshot manifest: contract version, toolchain, counts, hash, dirty state |
| _meta.yaml | Backward-compatible snapshot metadata and input summary |
| digests.jsonl | Per-module digest rows with version, hash, dirty state, and digest payload |
| source-files.jsonl | Source-to-module/digest mapping |
| packages.jsonl | Package rows: name, kind, language, module path, optional version/description |
| symbols.jsonl | Flat symbol rows; nested members are flattened and retain package/module fields |
| edges.jsonl | Code relation rows with package/module/version/hash fields |
@c4a/context-cli validates this contract before projection. Required fields include package/module identity, version labels on digest/source-file/edge rows, symbol identity fields, and matching edge/digest versions.
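The table notes that nested members are flattened in symbols.jsonl while keeping package/module fields. A minimal sketch of that flattening follows. The row field names (package, module, name, kind, parent), the dotted naming of members, and the helpers flattenSymbols and toJsonl are assumptions for illustration, not the contract's actual schema.

```typescript
// Reduced symbol shape: just enough to show nesting.
interface SymbolNode {
  name: string;
  kind: string;
  members?: SymbolNode[];
}

// Hypothetical flat row; the real contract defines its own field names.
interface SymbolRow {
  package: string;
  module: string;
  name: string;
  kind: string;
  parent: string | null;
}

// Depth-first flattening: every member becomes its own row and
// repeats the package/module fields of its owner.
function flattenSymbols(
  symbols: SymbolNode[],
  pkg: string,
  modulePath: string,
  parent: string | null = null,
): SymbolRow[] {
  const rows: SymbolRow[] = [];
  for (const symbol of symbols) {
    // Assumed naming: members are qualified with their parent's name.
    const qualified = parent === null ? symbol.name : `${parent}.${symbol.name}`;
    rows.push({ package: pkg, module: modulePath, name: qualified, kind: symbol.kind, parent });
    rows.push(...flattenSymbols(symbol.members ?? [], pkg, modulePath, qualified));
  }
  return rows;
}

// Serialize rows as JSON Lines: one JSON object per line.
function toJsonl(rows: object[]): string {
  return rows.length === 0 ? "" : rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
}

const rows = flattenSymbols(
  [{ name: "Button", kind: "class", members: [{ name: "render", kind: "method" }] }],
  "example",
  "packages/example",
);
console.log(toJsonl(rows));
```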
During projection, code-owned Sections receive code source_ref values derived from these rows:
- package rows: src-N#package:<package>@<hash>
- symbol rows: src-N#symbol:<locator>:<kind>@<hash>
These refs are verified against the raw code snapshot JSONL indexes. They are separate from prose evidence refs, because code snapshots use evidence.mode: none and do not create raw block manifests.
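The two ref shapes above can be formatted and parsed with plain string handling. The sketch below is one possible reading of the format: it treats the locator as an opaque string, assumes that neither the kind nor the hash contains ":" or "@" in a way that breaks the split, and uses hypothetical names (CodeSourceRef, formatRef, parseRef) and a made-up locator value.

```typescript
type CodeSourceRef =
  | { source: string; target: "package"; name: string; hash: string }
  | { source: string; target: "symbol"; locator: string; kind: string; hash: string };

function formatRef(ref: CodeSourceRef): string {
  return ref.target === "package"
    ? `${ref.source}#package:${ref.name}@${ref.hash}`
    : `${ref.source}#symbol:${ref.locator}:${ref.kind}@${ref.hash}`;
}

function parseRef(text: string): CodeSourceRef {
  const fragmentAt = text.indexOf("#");
  // Scoped package names such as @c4a/extract contain "@", so split on the last one.
  const hashAt = text.lastIndexOf("@");
  if (fragmentAt < 0 || hashAt < fragmentAt) {
    throw new Error(`Not a code source ref: ${text}`);
  }
  const source = text.slice(0, fragmentAt);
  const body = text.slice(fragmentAt + 1, hashAt);
  const hash = text.slice(hashAt + 1);

  if (body.startsWith("package:")) {
    return { source, target: "package", name: body.slice("package:".length), hash };
  }
  if (body.startsWith("symbol:")) {
    const rest = body.slice("symbol:".length);
    // Assumption: the kind has no ":", so the last ":" separates locator from kind.
    const kindAt = rest.lastIndexOf(":");
    if (kindAt < 0) throw new Error(`Symbol ref is missing a kind: ${text}`);
    return {
      source,
      target: "symbol",
      locator: rest.slice(0, kindAt),
      kind: rest.slice(kindAt + 1),
      hash,
    };
  }
  throw new Error(`Unknown code source ref target: ${text}`);
}

const packageRef = parseRef("src-1#package:@c4a/extract@abc123");
const symbolRef = parseRef("src-1#symbol:src/button.ts:Button:class@abc123");
console.log(packageRef, symbolRef);
console.log(formatRef(symbolRef));
```

Splitting on the last "@" and the last ":" keeps scoped package names and locators with embedded colons intact.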
Writing a New Language Plugin
Create a package such as @c4a/extract-python and export an ExtractionPlugin.
Minimum requirements:
- Detect the language manifest in canHandle() and detectEntries().
- Return stable package identity: package name, kind, language, and version when available.
- Return subPackages when one manifest represents a nested package layout.
- Resolve public entry files so exported symbols can be distinguished from internal symbols.
- Emit SymbolInfo[] with stable name, kind, visibility, file, line, and endLine.
- Emit RelationInfo[] for imports and important type/inheritance/use edges.
- Keep paths module-relative inside the plugin; the repository runner prefixes them to repo-relative paths.
- Register the plugin in the runner input used by context capture --code.
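The path requirement can be pictured as follows: the plugin reports src/index.ts, and the runner turns it into packages/example/src/index.ts. The helper below is a hypothetical sketch of that prefixing step, not the runner's implementation; it assumes forward-slash paths and does not resolve ".." segments.

```typescript
// Join a module path and a module-relative file path into a repo-relative path.
function toRepoRelative(modulePath: string, filePath: string): string {
  // Drop a leading "./" and any trailing slashes from the module path.
  const base = modulePath.replace(/^\.\/+/, "").replace(/\/+$/, "");
  // Drop a leading "./" or "/" from the file path so it stays relative.
  const file = filePath.replace(/^\.\/+/, "").replace(/^\/+/, "");
  // A module at the repository root needs no prefix.
  return base === "" || base === "." ? file : `${base}/${file}`;
}

console.log(toRepoRelative("packages/example", "src/index.ts"));
console.log(toRepoRelative(".", "src/index.ts"));
```

Keeping plugin output module-relative means the same plugin result stays valid if the module is moved within the repository.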
Example skeleton:
```typescript
import type {
  EntryDetectionResult,
  EntryFile,
  ExtractionPlugin,
  ExtractionResult,
  FileSystem,
  ManifestInfo,
  SourceInfo,
} from "@c4a/extract";

export class PythonPlugin implements ExtractionPlugin {
  readonly id = "c4a-extract-python";
  readonly languages = ["python"];
  readonly packageManagers = ["pip"];

  canHandle(source: SourceInfo): boolean {
    return source.manifests.some((manifest) => manifest.type === "pyproject.toml");
  }

  async detectEntries(manifest: ManifestInfo, fs: FileSystem): Promise<EntryDetectionResult> {
    // Parse pyproject.toml/setup metadata and return package + entry files.
    throw new Error("detectEntries is not implemented yet");
  }

  async extractSymbols(entries: EntryFile[], fs: FileSystem): Promise<ExtractionResult> {
    // Parse entry graph, classify exported/internal symbols, emit relations.
    throw new Error("extractSymbols is not implemented yet");
  }
}
```
Relationship to Code Compile and Obsidian
@c4a/extract is upstream of code compile; it does not render knowledge itself.
- context capture --code uses the runner to produce raw code snapshots.
- context compile --code consumes packages.jsonl, symbols.jsonl, edges.jsonl, and digests.jsonl to build package/category/symbol Nodes such as pkg, pkg/components, and pkg/symbol/button.
- Obsidian Render reads the compiled Markdown, _edges.yaml, and _external.yaml. It does not read raw runner snapshots directly.
That means language plugins affect Obsidian only through the compiled knowledge graph: better symbols/relations produce better symbol Nodes, graph edges, source refs, and source-status chips.
Development
bun run --filter @c4a/extract build
bun run --filter @c4a/extract typecheck
bun run --filter @c4a/extract test
bun run --filter @c4a/extract lint