structurecc

v4.0.0

Published

4 months ago

Claude Code plugin for extracting structured data from documents using native vision and parallel Task agents

jacweath

claude-code plugin document-extraction pdf vision structured-data pharmacogenomics medical-documents

structurecc

Document Structure Extraction for Claude Code

Extract structured data from PDFs, Word documents, and images using Claude's native vision capabilities and parallel Task agents.

Installation

npx structurecc

This installs the plugin to ~/.claude/plugins/structurecc/.

Usage

Single Document

/structure document.pdf
/structure lab_image.png
/structure report.docx

Batch Processing

/structure:batch ./documents/
/structure:batch ./patient_files/ --output ./extracted/

Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| PDF | .pdf | Multi-page supported, chunked for large documents |
| Word | .docx, .doc | Text and embedded images extracted |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | Single-page extraction |
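
Before running a batch, it can be useful to check which files in a folder fall under the formats listed above. The following is a minimal sketch, not part of structurecc: the `SUPPORTED` mapping and the `classify` helper are illustrative names, and the case-insensitive matching is an assumption rather than documented plugin behavior.

```python
from pathlib import Path

# Extension -> format label, copied from the Supported Formats table.
SUPPORTED = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".tiff": "Images",
    ".bmp": "Images",
}


def classify(path):
    """Return the format label for a path, or None if the extension is not listed."""
    # Assumption: extensions are compared case-insensitively.
    return SUPPORTED.get(Path(path).suffix.lower())
```

For example, `classify("report.docx")` gives `"Word"`, while a file such as `notes.txt` gives `None` and could be skipped or reported before invoking `/structure:batch`.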

Output

For each document, structurecc generates:

document_extracted/
├── chunks/              # Individual chunk extractions (for debugging)
├── structure.json       # Complete structured extraction
└── STRUCTURE.md         # Human-readable markdown summary
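
A quick way to confirm that a run produced everything in the tree above is to check for the three entries by name. This is a sketch under the assumption that the layout is exactly as shown; `missing_outputs` is a hypothetical helper, not something the plugin ships.

```python
from pathlib import Path

# File and folder names taken from the output tree above.
EXPECTED = ("chunks", "structure.json", "STRUCTURE.md")


def missing_outputs(output_dir):
    """Return the expected entries that are absent from an output folder."""
    root = Path(output_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty list means all three entries are present; anything else names what is missing from that folder.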

structure.json

{
  "source": "/path/to/document.pdf",
  "extracted": "2026-01-30T14:30:22Z",
  "pages": [
    {
      "page": 1,
      "elements": [
        {
          "id": "element_1",
          "type": "table",
          "title": "Table 1. Lab Results",
          "data": {
            "headers": ["Test", "Result", "Units", "Reference"],
            "rows": [
              ["Glucose", "126", "mg/dL", "70-100"]
            ]
          },
          "confidence": 0.98
        }
      ]
    }
  ],
  "summary": {
    "total_pages": 5,
    "tables": 3,
    "figures": 4,
    "equations": 1,
    "average_confidence": 0.94
  }
}
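
Because the output is plain JSON, it can be read with any JSON library. The sketch below counts elements by type; it assumes only the fields visible in the example above (`pages`, `elements`, `type`), and `count_element_types` is an illustrative helper name.

```python
import json
from collections import Counter


def count_element_types(doc):
    """Count elements by their "type" field across all pages."""
    return Counter(
        element["type"]
        for page in doc.get("pages", [])
        for element in page.get("elements", [])
    )


# A trimmed copy of the example above, inlined so the snippet runs on its own.
doc = json.loads("""
{
  "pages": [
    {"page": 1, "elements": [
      {"id": "element_1", "type": "table", "confidence": 0.98}
    ]}
  ]
}
""")
print(count_element_types(doc))
```

In practice `doc` would come from `json.load` on a real `structure.json` file rather than an inline string.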

Architecture

structurecc uses a chunk-based parallel processing approach:

1. Document Analysis - Determine page count and split into chunks (5 pages each)
2. Parallel Extraction - Launch one Task agent per chunk for parallel processing
3. Chunk Merge - Combine chunk results with page offset correction
4. Output Generation - Create JSON and Markdown outputs
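
The chunking in the first step can be sketched as a small function that turns a page count into inclusive page ranges. This illustrates the idea and is not the plugin's actual code: `plan_chunks` is a hypothetical name, and letting the final chunk be shorter than 5 pages is an assumption.

```python
def plan_chunks(total_pages, chunk_size=5):
    """Return inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        # Assumption: the last chunk is simply shorter when pages run out.
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

With the default chunk size, a 20-page document yields four ranges: `(1, 5)`, `(6, 10)`, `(11, 15)`, and `(16, 20)`.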

Document (20 pages)
       │
       ├── Chunk 1 (Pages 1-5)   → Agent 1
       ├── Chunk 2 (Pages 6-10)  → Agent 2
       ├── Chunk 3 (Pages 11-15) → Agent 3
       └── Chunk 4 (Pages 16-20) → Agent 4
               │
               ▼
         Merged Output
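
The merge step with page offset correction might look roughly like the sketch below. It assumes that each agent numbers pages from 1 within its own chunk, so the merge adds the chunk's starting offset; that numbering scheme and the `merge_chunks` helper are assumptions for illustration, not confirmed plugin internals.

```python
def merge_chunks(chunks):
    """Merge chunk results into one page list with document-level page numbers.

    `chunks` is a list of (first_page, pages) pairs, where each entry in
    `pages` is a dict whose "page" value is chunk-local and starts at 1.
    """
    merged = []
    for first_page, pages in chunks:
        offset = first_page - 1
        for page in pages:
            # Copy the page dict and rewrite only its page number.
            merged.append({**page, "page": page["page"] + offset})
    # Chunks may finish in any order, so sort by corrected page number.
    merged.sort(key=lambda p: p["page"])
    return merged
```

Sorting at the end means the result does not depend on which agent finishes first.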

This approach:

Maximizes throughput via parallel processing
Preserves context within chunks (figures and captions stay together)
Uses Claude's native vision (no external APIs)
Gives each agent a 200K-token context window for thorough extraction

Element Types

Tables

Extracted with:

Headers and all rows
Cell values with exact formatting
Flags (H, L, *, †)
Footnotes
Merged cell information
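
For downstream use, a table element can be turned into one record per row. The sketch below relies only on the `data.headers` and `data.rows` fields shown in the structure.json example; how flags, footnotes, and merged cells are represented is not documented here, so they are left out. `table_to_records` is an illustrative helper name.

```python
def table_to_records(element):
    """Convert a table element into a list of {header: cell} dicts."""
    data = element["data"]
    headers = data["headers"]
    return [dict(zip(headers, row)) for row in data["rows"]]


# The table element from the structure.json example.
table = {
    "id": "element_1",
    "type": "table",
    "title": "Table 1. Lab Results",
    "data": {
        "headers": ["Test", "Result", "Units", "Reference"],
        "rows": [["Glucose", "126", "mg/dL", "70-100"]],
    },
    "confidence": 0.98,
}
records = table_to_records(table)
```

Cell values stay as strings, matching the example, so numeric conversion is left to the caller.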

Figures

Supports various figure types:

Charts/Graphs: Line, bar, scatter, pie with data series and axes
Scientific Images: Western blots, gels, micrographs
Diagrams: Flowcharts, illustrations, photographs

Each figure includes:

Title and caption
Data points (when visible)
Axis labels and ranges
Annotations and legends

Equations

Extracted as:

LaTeX representation
Plain text fallback
Variable definitions

Text Blocks

Captured with:

Full content
Type (header, paragraph, caption, footnote)
Formatting information

Confidence Scores

Every element includes a confidence score (0.0-1.0):

| Score | Meaning |
|-------|---------|
| 0.95-1.00 | Crystal clear extraction |
| 0.85-0.94 | Clear with minor uncertainty |
| 0.70-0.84 | Readable but some ambiguity |
| < 0.70 | Needs manual verification |

Low-confidence items are flagged in the output for review.
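
The score bands can also be applied when post-processing structure.json. The sketch below maps a score to its band and lists element ids that fall under a review threshold. Two things are assumptions: the bands are treated as contiguous (a score of 0.945 counts as "Clear with minor uncertainty"), and 0.70 is used as the review cutoff because that is where the table says manual verification is needed. Both function names are hypothetical.

```python
def confidence_band(score):
    """Map a confidence score to the meaning given in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.95:
        return "Crystal clear extraction"
    if score >= 0.85:
        return "Clear with minor uncertainty"
    if score >= 0.70:
        return "Readable but some ambiguity"
    return "Needs manual verification"


def needs_review(doc, threshold=0.70):
    """Return ids of elements whose confidence is below the threshold."""
    return [
        element["id"]
        for page in doc.get("pages", [])
        for element in page.get("elements", [])
        # Assumption: an element without a score is treated as needing review.
        if element.get("confidence", 0.0) < threshold
    ]
```

Raising `threshold` to 0.85 would also surface the "Readable but some ambiguity" band for a stricter review pass.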

Use Cases

Medical Lab Results: Extract patient data from PDF reports
Research Papers: Structure tables and figures from publications
Scientific Images: Transcribe gel/blot data for documentation
Patient Records: Batch process document folders
Data Digitization: Convert scanned documents to structured data

Requirements

Claude Code CLI
No external dependencies (uses Claude's native capabilities)

How It Works

structurecc leverages Claude's multimodal capabilities:

Claude Vision: Reads PDFs and images natively without OCR
Parallel Agents: Task tool spawns chunk agents for parallel processing
Structured Output: JSON schema ensures consistent, parseable output
Markdown Summary: Human-readable format for quick review

No web searches, no external APIs, no Python dependencies. Just Claude + document = structured data.

Limitations

Very large documents (100+ pages) may require multiple runs
Handwritten content has lower accuracy than printed text
Low-resolution images may have reduced confidence scores
Complex nested tables may require manual verification

License

MIT
