structurecc
v4.0.0
Claude Code plugin for extracting structured data from documents using native vision and parallel Task agents
Document Structure Extraction for Claude Code
Extract structured data from PDFs, Word documents, and images using Claude's native vision capabilities and parallel Task agents.
Installation
npx structurecc
This installs the plugin to ~/.claude/plugins/structurecc/.
Usage
Single Document
/structure document.pdf
/structure lab_image.png
/structure report.docx
Batch Processing
/structure:batch ./documents/
/structure:batch ./patient_files/ --output ./extracted/
Supported Formats
| Format | Extension | Notes |
|--------|-----------|-------|
| PDF | .pdf | Multi-page supported, chunked for large documents |
| Word | .docx, .doc | Text and embedded images extracted |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | Single-page extraction |
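If you want to check a folder before running a batch, the table above can be expressed as a small lookup. This is only a sketch: the `classify` and `supported_files` helpers are hypothetical and not part of structurecc, and the case-insensitive extension match is an assumption rather than documented plugin behavior.

```python
from pathlib import Path

# Extension lists copied from the Supported Formats table.
SUPPORTED_FORMATS = {
    "PDF": {".pdf"},
    "Word": {".docx", ".doc"},
    "Images": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
}


def classify(path):
    """Return the format name for a path, or None if its extension is not listed."""
    # Assumption: extensions are compared case-insensitively.
    suffix = Path(path).suffix.lower()
    for name, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return name
    return None


def supported_files(folder):
    """Hypothetical pre-check: files in a folder whose extension appears in the table."""
    return sorted(
        p for p in Path(folder).iterdir() if p.is_file() and classify(p) is not None
    )
```

For example, `classify("report.docx")` returns `"Word"` and `classify("notes.txt")` returns `None`.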
Output
For each document, structurecc generates:
document_extracted/
├── chunks/ # Individual chunk extractions (for debugging)
├── structure.json # Complete structured extraction
└── STRUCTURE.md # Human-readable markdown summary
structure.json
{
"source": "/path/to/document.pdf",
"extracted": "2026-01-30T14:30:22Z",
"pages": [
{
"page": 1,
"elements": [
{
"id": "element_1",
"type": "table",
"title": "Table 1. Lab Results",
"data": {
"headers": ["Test", "Result", "Units", "Reference"],
"rows": [
["Glucose", "126", "mg/dL", "70-100"]
]
},
"confidence": 0.98
}
]
}
],
"summary": {
"total_pages": 5,
"tables": 3,
"figures": 4,
"equations": 1,
"average_confidence": 0.94
}
}
Architecture
structurecc uses a chunk-based parallel processing approach:
- Document Analysis - Determine page count and split into chunks (5 pages each)
- Parallel Extraction - Launch one Task agent per chunk for parallel processing
- Chunk Merge - Combine chunk results with page offset correction
- Output Generation - Create JSON and Markdown outputs
Document (20 pages)
│
├── Chunk 1 (Pages 1-5) → Agent 1
├── Chunk 2 (Pages 6-10) → Agent 2
├── Chunk 3 (Pages 11-15) → Agent 3
└── Chunk 4 (Pages 16-20) → Agent 4
│
▼
Merged Output
This approach:
- Maximizes throughput via parallel processing
- Preserves context within chunks (figures and captions stay together)
- Uses Claude's native vision (no external APIs)
- Gives each agent a 200K-token context window for thorough extraction
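The chunking and merge steps can be sketched in a few lines. This is an illustration, not the plugin's actual implementation: `plan_chunks` and `merge_chunks` are hypothetical names, and the sketch assumes each chunk agent numbers its pages from 1 within the chunk, which is why an offset is needed when merging.

```python
CHUNK_SIZE = 5  # pages per chunk, as stated in the Document Analysis step


def plan_chunks(total_pages, chunk_size=CHUNK_SIZE):
    """Return inclusive (first_page, last_page) ranges that cover the document."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def merge_chunks(chunk_results, chunk_size=CHUNK_SIZE):
    """Combine per-chunk page lists into one list with document-wide page numbers.

    Assumption: each chunk result is a list of page dicts whose "page" field
    starts at 1 within the chunk, so chunk i is shifted by i * chunk_size.
    """
    merged = []
    for index, pages in enumerate(chunk_results):
        offset = index * chunk_size
        for page in pages:
            # Copy the page dict and rewrite only its page number.
            merged.append({**page, "page": page["page"] + offset})
    return merged
```

For the 20-page document in the diagram, `plan_chunks(20)` returns `[(1, 5), (6, 10), (11, 15), (16, 20)]`. A 12-page document would end with a shorter final chunk, `(11, 12)`.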
Element Types
Tables
Extracted with:
- Headers and all rows
- Cell values with exact formatting
- Flags (H, L, *, †)
- Footnotes
- Merged cell information
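Once a table has been extracted, its `headers` and `rows` can be turned into one record per row for downstream use. The sketch below relies only on the fields shown in the structure.json example; `table_to_records` is a hypothetical helper, and because that example does not show how flags, footnotes, or merged cells are stored, they are left out here.

```python
def table_to_records(element):
    """Turn a table element into one dict per row, keyed by header."""
    if element.get("type") != "table":
        raise ValueError("expected an element with type 'table'")
    headers = element["data"]["headers"]
    return [dict(zip(headers, row)) for row in element["data"]["rows"]]


# Element copied from the structure.json example.
example = {
    "id": "element_1",
    "type": "table",
    "title": "Table 1. Lab Results",
    "data": {
        "headers": ["Test", "Result", "Units", "Reference"],
        "rows": [["Glucose", "126", "mg/dL", "70-100"]],
    },
    "confidence": 0.98,
}

records = table_to_records(example)
```

Here `records[0]` is `{"Test": "Glucose", "Result": "126", "Units": "mg/dL", "Reference": "70-100"}`. Values stay as strings, matching the exact formatting preserved in the extraction.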
Figures
Supports various figure types:
- Charts/Graphs: Line, bar, scatter, pie with data series and axes
- Scientific Images: Western blots, gels, micrographs
- Diagrams: Flowcharts, illustrations, photographs
Each figure includes:
- Title and caption
- Data points (when visible)
- Axis labels and ranges
- Annotations and legends
Equations
Extracted as:
- LaTeX representation
- Plain text fallback
- Variable definitions
Text Blocks
Captured with:
- Full content
- Type (header, paragraph, caption, footnote)
- Formatting information
Confidence Scores
Every element includes a confidence score (0.0-1.0):
| Score | Meaning |
|-------|---------|
| 0.95-1.00 | Crystal clear extraction |
| 0.85-0.94 | Clear with minor uncertainty |
| 0.70-0.84 | Readable but some ambiguity |
| < 0.70 | Needs manual verification |
Low-confidence items are flagged in the output for review.
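To build your own review list from structure.json, the score bands above can be applied directly to each element's `confidence` field. This is a sketch under assumptions: the function names are hypothetical, the 0.70 cutoff is taken from the last row of the table, and scores that fall between two listed bands (for example 0.945) are assigned to the lower band. How structurecc itself marks flagged items is not shown here.

```python
import json

REVIEW_THRESHOLD = 0.70  # below this, the table says "Needs manual verification"


def confidence_band(score):
    """Map a confidence score to the meaning given in the table."""
    if score >= 0.95:
        return "Crystal clear extraction"
    if score >= 0.85:
        return "Clear with minor uncertainty"
    if score >= 0.70:
        return "Readable but some ambiguity"
    return "Needs manual verification"


def elements_needing_review(structure, threshold=REVIEW_THRESHOLD):
    """Return (page number, element id, confidence) for low-confidence elements."""
    return [
        (page["page"], element["id"], element["confidence"])
        for page in structure["pages"]
        for element in page["elements"]
        if element["confidence"] < threshold
    ]


def load_structure(path):
    """Read a structure.json file produced for a document."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical use would be `elements_needing_review(load_structure("document_extracted/structure.json"))`, which returns an empty list when every element scores 0.70 or higher.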
Use Cases
- Medical Lab Results: Extract patient data from PDF reports
- Research Papers: Structure tables and figures from publications
- Scientific Images: Transcribe gel/blot data for documentation
- Patient Records: Batch process document folders
- Data Digitization: Convert scanned documents to structured data
Requirements
- Claude Code CLI
- No external dependencies (uses Claude's native capabilities)
How It Works
structurecc leverages Claude's multimodal capabilities:
- Claude Vision: Reads PDFs and images natively without OCR
- Parallel Agents: Task tool spawns chunk agents for parallel processing
- Structured Output: JSON schema ensures consistent, parseable output
- Markdown Summary: Human-readable format for quick review
No web searches, no external APIs, no Python dependencies. Just Claude + document = structured data.
Limitations
- Very large documents (100+ pages) may require multiple runs
- Handwritten content has lower accuracy than printed text
- Low-resolution images may have reduced confidence scores
- Complex nested tables may require manual verification
License
MIT
