glmmedia-ocr v0.1.0
Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama. Fully self-contained — zero ongoing maintenance after install.
```
npm install -g glmmedia-ocr
glmmedia-ocr scan invoice.pdf
# → invoice.md written
```

Table of Contents
- Requirements
- Installation
- Quick Start
- CLI Reference
- How It Works
- Architecture
- Output Format
- Configuration
- GPU Support
- Troubleshooting
- Project Structure
- License
Requirements
Only two things need to be on your machine before installing:
| Requirement | Why | Where |
|---|---|---|
| Python 3.12 or 3.13 | Runs the GLM-OCR SDK | python.org |
| Ollama (installed, not necessarily running) | Serves the glm-ocr model locally | ollama.com/download |
That's it. Everything else — the Python virtual environment, all dependencies, and the Ollama process lifecycle — is managed automatically by the package.
Note: Python 3.14+ is not yet supported. The GLM-OCR SDK and its dependencies (PyTorch, Transformers) only publish wheels for Python 3.10–3.13.
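The version gate above can be sketched as a small preflight check. This is a hypothetical helper, not the package's actual code:

```python
import sys

# Hypothetical preflight helper mirroring the requirement above:
# only CPython 3.12 and 3.13 are accepted.
def python_supported(version_info=sys.version_info) -> bool:
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in (12, 13)
```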
Installation
npm (recommended)
```
npm install -g glmmedia-ocr
```

This triggers a postinstall script that:
- Creates a dedicated Python virtual environment inside the package (`.venv/`)
- Installs `glmocr[selfhosted]` with CPU-only PyTorch into the venv
- Verifies the installation by importing the SDK
The first install takes a few minutes while pip downloads ~1-2GB of dependencies. This is a one-time cost.
pip
```
pip install .
```

Or from source:

```
git clone https://github.com/glmmedia-ocr/glmmedia-ocr.git
cd glmmedia-ocr
pip install .
```

This installs the same dependencies directly into your Python environment and registers the glmmedia-ocr CLI command. Both npm and pip packages provide the exact same functionality and CLI interface.
GPU install (optional)
By default, the npm package installs CPU-only PyTorch to avoid GPU resource competition with Ollama. If you have a GPU and want to use it for layout detection:
```
# npm
GLMOCR_GPU=1 npm install -g glmmedia-ocr

# pip — pip resolves CUDA PyTorch by default
pip install .
```

Reinstall / repair
```
# npm
npm rebuild glmmedia-ocr

# pip
pip install --force-reinstall .
```

Quick Start
```
# Single PDF
glmmedia-ocr scan invoice.pdf

# Single image
glmmedia-ocr scan receipt.png

# Multiple images
glmmedia-ocr scan page1.png page2.png page3.png

# Mixed PDFs and images
glmmedia-ocr scan report.pdf page1.png page2.png

# All images in a directory
glmmedia-ocr scan ./images/

# All images in directory + subdirectories
glmmedia-ocr scan ./images/ --recursive

# Shell glob
glmmedia-ocr scan *.png

# Custom output path
glmmedia-ocr scan contract.pdf --output ./results/contract.md

# Higher DPI for better OCR quality
glmmedia-ocr scan receipt.pdf --dpi 300

# Connect to a remote Ollama instance
glmmedia-ocr scan report.pdf --ollama-host 192.168.1.100:11434

# Faster processing with parallel workers
glmmedia-ocr scan book.pdf --concurrency 2

# Debug logging to see layout detection progress
glmmedia-ocr scan document.pdf --log-level DEBUG
```

First run
On the very first run, the CLI will:
- Detect that Ollama is not running and start it automatically
- Detect that the `glm-ocr:latest` model is not pulled and download it (~2.2GB)
- Process your input
- Shut down Ollama on exit (since it started it)
Subsequent runs skip steps 1 and 2 if Ollama is already running and the model is cached.
CLI Reference
```
glmmedia-ocr scan <input...> [options]

Inputs:
  <file.pdf>                 Single PDF file
  <image.png>                Single image file (PNG, JPEG, WebP, BMP, TIFF, GIF)
  <img1.png> <img2.png> ...  Multiple image files
  <directory>/               Directory of images (use --recursive for subfolders)

Input/Output:
  --output <path>            Output .md path (default: auto-generated from input names)
  --recursive                Scan directories recursively for images

Rendering:
  --dpi <number>             Render DPI for PDFs (default: 200)
  --image-format <format>    Image format: PNG, JPEG, WEBP (default: PNG)
  --min-pixels <number>      Minimum image pixels (default: 12544)
  --max-pixels <number>      Maximum image pixels (default: 71372800)
  --patch-expand-factor <n>  Patch expansion factor (default: 1)
  --t-patch-size <n>         T-patch size (default: 2)
  --image-expect-length <n>  Image expect length (default: 6144)

Generation:
  --max-tokens <number>      Max generation tokens (default: 8192)
  --temperature <float>      Sampling temperature (default: 0.0)
  --top-p <float>            Top-p sampling (default: 0.00001)
  --top-k <number>           Top-k sampling (default: 1)
  --repetition-penalty <float>  Repetition penalty (default: 1.1)

Layout (PP-DocLayoutV3):
  --layout-device <device>   Device: cpu, cuda, cuda:N (default: cpu)
  --layout-model-dir <path>  Custom layout model directory
  --layout-threshold <float> Detection threshold (default: 0.3)
  --layout-batch-size <n>    Layout batch size (default: 1)
  --layout-use-polygon       Use polygon masks for cropping
  --no-layout-nms            Disable layout NMS
  --layout-merge-mode <mode> Merge overlapping bboxes: large|small (default: large)
  --layout-workers <n>       Layout workers (default: 1)

Result formatting:
  --output-format <format>      Output: markdown, json, both (default: markdown)
  --no-merge-formula-numbers    Disable formula number merging
  --no-merge-text-blocks        Disable text block merging
  --no-format-bullet-points     Disable bullet point formatting

Pipeline:
  --concurrency <number>     Parallel OCR workers (default: 1)
  --page-maxsize <number>    Page queue max size (default: 100)
  --region-maxsize <number>  Region queue max size (default: 2000)

Ollama / API:
  --ollama-host <host>       Ollama host (default: localhost:11434)
  --ollama-num-ctx <n>       Ollama num_ctx for glm-ocr (default: 8192; 0 = omit)
  --api-scheme <scheme>      API scheme: http, https (default: auto)
  --api-key <key>            API key for MaaS providers
  --verify-ssl               Enable SSL verification
  --connect-timeout <seconds>   Connect timeout (default: 30)
  --request-timeout <seconds>   Request timeout (default: 120)

MaaS (Zhipu Cloud):
  --maas                     Enable MaaS mode (disables local OCR)
  --maas-api-url <url>       MaaS API URL
  --maas-model <model>       MaaS model name
  --maas-api-key <key>       MaaS API key
  --no-maas-verify-ssl       Disable MaaS SSL verification
  --maas-connect-timeout <s>    MaaS connect timeout (default: 30)
  --maas-request-timeout <s>    MaaS request timeout (default: 300)
  --maas-retry-attempts <n>     MaaS retry attempts (default: 2)

Logging:
  --log-level <level>        Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
```

Flag Details
Inputs
| Input type | Description |
|---|---|
| <file.pdf> | One or more PDF files. Each page becomes `<!-- PAGE N -->` in output. |
| <image.png> | One or more image files. Supported: PNG, JPEG, WebP, BMP, TIFF, GIF. |
| <file.pdf> <img.png> | Mixed PDFs and images. Pages are merged in input order. |
| <directory>/ | Directory of images. Scans flat by default; use --recursive for subfolders. |
Input/Output
| Flag | Default | Description |
|---|---|---|
| --output | auto-generated | Where to write the Markdown output. Single input → <name>.md. Multiple inputs → <name1>_<name2>_output.md. --output overrides all. |
| --recursive | off | When a directory is passed, recurse into subdirectories for images. |
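The default output naming rule above can be sketched as a small helper. The function name is hypothetical; it mirrors the behavior described in the table:

```python
from pathlib import Path

# Hypothetical sketch of the default naming rule: one input -> <name>.md,
# several inputs -> <name1>_<name2>_output.md (overridden by --output).
def default_output_path(inputs: list[str]) -> str:
    stems = [Path(p).stem for p in inputs]
    if len(stems) == 1:
        return f"{stems[0]}.md"
    return "_".join(stems) + "_output.md"
```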
Rendering
| Flag | Default | Description |
|---|---|---|
| --dpi | 200 | Resolution for rendering PDF pages to images. Higher DPI improves OCR accuracy but increases processing time and memory usage. Recommended: 200-300. |
| --image-format | PNG | Format for images sent to the OCR API. PNG is lossless (best for code, diagrams). JPEG is smaller (best for text documents). WEBP is smallest but may not be supported by all backends. |
| --min-pixels | 12544 | Minimum image pixel count (112×112). Images smaller than this are upscaled. |
| --max-pixels | 71372800 | Maximum image pixel count. Images larger than this are downscaled. |
| --patch-expand-factor | 1 | Patch expansion factor for image processing. |
| --t-patch-size | 2 | T-patch size for image processing. |
| --image-expect-length | 6144 | Expected image token length. |
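The min/max pixel bounds can be understood as an area clamp: scale the image so its pixel count lands inside `[min_pixels, max_pixels]` while preserving aspect ratio. This is assumed logic for illustration, not the SDK's exact resampling code:

```python
import math

# Sketch: clamp the image area into [min_pixels, max_pixels],
# preserving aspect ratio (assumed logic, not the SDK's exact code).
def fit_pixels(width: int, height: int,
               min_pixels: int = 12544, max_pixels: int = 71372800) -> tuple[int, int]:
    area = width * height
    if area < min_pixels:
        scale = math.sqrt(min_pixels / area)   # upscale small images
    elif area > max_pixels:
        scale = math.sqrt(max_pixels / area)   # downscale huge images
    else:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))
```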
Generation
| Flag | Default | Description |
|---|---|---|
| --max-tokens | 8192 | Maximum tokens generated per region. Increase for very dense pages. |
| --temperature | 0.0 | Sampling temperature. 0.0 = deterministic (recommended for OCR). |
| --top-p | 0.00001 | Top-p (nucleus) sampling. Keep very low for OCR. |
| --top-k | 1 | Top-k sampling. 1 = always pick the most likely token. |
| --repetition-penalty | 1.1 | Penalty for repeating tokens. Prevents the model from getting stuck in loops. |
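These flags map onto Ollama's generation options. The mapping below is an assumed sketch of the request body sent to `/api/generate`; note that Ollama's option is named `repeat_penalty` and its token limit is `num_predict`, while the CLI flags use `--repetition-penalty` and `--max-tokens`:

```python
# Sketch (assumed mapping) of how the generation flags could translate into
# an Ollama /api/generate request body for a base64-encoded region image.
def build_generate_payload(prompt: str, image_b64: str,
                           model: str = "glm-ocr:latest",
                           max_tokens: int = 8192, temperature: float = 0.0,
                           top_p: float = 0.00001, top_k: int = 1,
                           repetition_penalty: float = 1.1,
                           num_ctx: int = 8192) -> dict:
    options = {
        "num_predict": max_tokens,        # Ollama's name for max tokens
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repetition_penalty,  # Ollama's name for the penalty
    }
    if num_ctx:                            # 0 = omit, per --ollama-num-ctx
        options["num_ctx"] = num_ctx
    return {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,
        "options": options,
    }
```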
Layout (PP-DocLayoutV3)
| Flag | Default | Description |
|---|---|---|
| --layout-device | cpu | Device for the PP-DocLayoutV3 layout detection model. cpu avoids GPU memory competition with Ollama. Use cuda or cuda:N for GPU. |
| --layout-model-dir | (SDK default) | Path to a custom PP-DocLayoutV3 model directory. Leave unset to use the SDK's built-in default. |
| --layout-threshold | 0.3 | Confidence threshold for layout detection. Lower values detect more regions (may include false positives). |
| --layout-batch-size | 1 | Max images per layout model forward pass. Reduce to 1 if OOM. |
| --layout-use-polygon | off | Use polygon masks for region cropping instead of bounding boxes. More precise for rotated or staggered layouts. |
| --no-layout-nms | off | Disable non-maximum suppression for layout detection. |
| --layout-merge-mode | large | How to merge overlapping bounding boxes. large keeps the larger region, small keeps the smaller one. |
| --layout-workers | 1 | Number of layout detection workers. |
Result Formatting
| Flag | Default | Description |
|---|---|---|
| --output-format | markdown | Output format: markdown, json, or both. |
| --no-merge-formula-numbers | off | Disable automatic merging of formula numbers with their equations. |
| --no-merge-text-blocks | off | Disable automatic merging of adjacent text blocks. |
| --no-format-bullet-points | off | Disable automatic bullet point formatting normalization. |
Pipeline
| Flag | Default | Description |
|---|---|---|
| --concurrency | 1 | Number of parallel OCR workers. Increase for faster processing on multi-page documents. Set to 1 for maximum stability with Ollama. |
| --page-maxsize | 100 | Maximum number of pages queued for processing. |
| --region-maxsize | 2000 | Maximum number of regions queued for OCR. |
Ollama / API
| Flag | Default | Description |
|---|---|---|
| --ollama-host | localhost:11434 | Ollama server address. Use this to connect to a remote or non-standard Ollama instance. |
| --ollama-num-ctx | 8192 | Ollama num_ctx parameter for glm-ocr. Prevents GGML tensor size crashes. Set to 0 to omit. |
| --api-scheme | auto | API URL scheme: http or https. Auto-detects based on port (HTTPS if 443). |
| --api-key | null | API key for MaaS providers (Zhipu, OpenAI, etc.). |
| --verify-ssl | off | Enable SSL certificate verification for API requests. |
| --connect-timeout | 30 | Connection timeout in seconds. |
| --request-timeout | 120 | Request timeout in seconds. |
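The `--api-scheme auto` rule is simple enough to state as code. This is a sketch of the documented behavior (explicit scheme wins; otherwise HTTPS only for port 443), not the CLI's actual implementation:

```python
# Sketch of --api-scheme auto-detection: explicit scheme wins,
# otherwise pick https only when the host's port is 443.
def resolve_scheme(host: str, scheme: str = "auto") -> str:
    if scheme != "auto":
        return scheme
    port = host.rsplit(":", 1)[1] if ":" in host else ""
    return "https" if port == "443" else "http"
```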
MaaS (Zhipu Cloud)
| Flag | Default | Description |
|---|---|---|
| --maas | off | Enable MaaS mode. Sends requests directly to Zhipu's cloud API. Disables local OCR and Ollama checks. |
| --maas-api-url | Zhipu default | MaaS API endpoint URL. |
| --maas-model | glm-ocr | MaaS model name. |
| --maas-api-key | null | MaaS API key (or set ZHIPU_API_KEY env var). |
| --no-maas-verify-ssl | off | Disable SSL verification for MaaS requests. |
| --maas-connect-timeout | 30 | MaaS connection timeout in seconds. |
| --maas-request-timeout | 300 | MaaS request timeout in seconds. |
| --maas-retry-attempts | 2 | Number of retry attempts for transient MaaS errors. |
Logging
| Flag | Default | Description |
|---|---|---|
| --log-level | INFO | Log level: DEBUG, INFO, WARNING, ERROR. Use DEBUG to see detailed timing and layout detection progress. |
How It Works
Startup Sequence
```
glmmedia-ocr scan invoice.pdf
│
├─ 1. Preflight Checks
│   ├─ Python 3.12 or 3.13 found?
│   ├─ Ollama binary on PATH? (skipped if --maas)
│   └─ GLM-OCR SDK importable in managed venv?
│
├─ 2. Ollama Lifecycle (skipped if --maas)
│   ├─ Is Ollama already running? (GET localhost:11434)
│   ├─ If yes → use it, leave it running after exit
│   └─ If no → spawn ollama serve, wait until healthy
│
├─ 3. Model Check (skipped if --maas)
│   ├─ Is glm-ocr:latest pulled? (ollama list)
│   └─ If no → ollama pull glm-ocr:latest (~2.2GB, one-time)
│
├─ 4. Pipeline Execution
│   ├─ PDF: Render pages to images (pypdfium2, in-memory, capped to 2000px)
│   │  Images: Load and cap to 2000px (no rendering step)
│   ├─ Run layout detection (PP-DocLayoutV3) — progress logged to stderr
│   ├─ OCR each region via Ollama (/api/generate) or MaaS
│   └─ Merge results with page markers
│
└─ 5. Cleanup
    ├─ Write output .md
    └─ Shut down Ollama (only if CLI started it)
```

Ollama Ownership Tracking
The CLI tracks whether it started Ollama or found it already running:
| Scenario | CLI behavior |
|---|---|
| Ollama was already running | Uses it, leaves it running on exit |
| CLI started Ollama | Shuts it down on normal exit, SIGINT, or SIGTERM |
| CLI crashes | Still shuts down Ollama via signal trap |
This means you can run Ollama manually before using the CLI, and it won't be touched.
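The ownership rule can be sketched as a tiny lifecycle object. This is an assumed shape, not the actual CLI code; it only terminates the server process if this process spawned it:

```python
import atexit
import signal
import subprocess

# Sketch (assumed shape) of the ownership rule: shut Ollama down at exit
# only if this process started it.
class OllamaLifecycle:
    def __init__(self):
        self.proc = None                 # set only when we spawn ollama serve

    def ensure_running(self, already_running: bool):
        if already_running:
            return                       # found running -> leave it alone on exit
        self.proc = subprocess.Popen(["ollama", "serve"])
        atexit.register(self.shutdown)
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, lambda *_: self.shutdown())

    def shutdown(self):
        if self.proc is not None:        # only terminate what we own
            self.proc.terminate()
            self.proc = None
```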
Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                          User (CLI)                          │
│     glmmedia-ocr scan invoice.pdf  (or *.png, ./images/)     │
└──────────────────────────────┬───────────────────────────────┘
                               │
┌──────────────────────────────▼───────────────────────────────┐
│                 bin/glmmedia-ocr.js (Node.js)                │
│                                                              │
│  ┌─────────────┐   ┌──────────────┐   ┌───────────────────┐  │
│  │  Preflight  │   │    Ollama    │   │    Model Check    │  │
│  │   Checks    │   │   Lifecycle  │   │  (pull if needed) │  │
│  └──────┬──────┘   └──────┬───────┘   └─────────┬─────────┘  │
│         │                 │                     │            │
│         └─────────────────┼─────────────────────┘            │
│                           │                                  │
│              ┌────────────▼────────────┐                     │
│              │     Resolve inputs      │                     │
│              │  (files, dirs, globs)   │                     │
│              └────────────┬────────────┘                     │
│              ┌────────────▼────────────┐                     │
│              │  Generate config.yaml   │                     │
│              │   (full SDK template)   │                     │
│              └────────────┬────────────┘                     │
│              ┌────────────▼────────────┐                     │
│              │  Spawn Python Pipeline  │                     │
│              │     lib/pipeline.py     │                     │
│              └────────────┬────────────┘                     │
└───────────────────────────┼──────────────────────────────────┘
                            │
┌───────────────────────────▼──────────────────────────────────┐
│                   lib/pipeline.py (Python)                   │
│                                                              │
│  ┌──────────────────┐    ┌──────────────────────────────┐    │
│  │  PDF: pypdfium2  │    │   GlmOcr SDK (selfhosted)    │    │
│  │  Image: PIL open │───▶│  ┌────────────────────────┐  │    │
│  │   (2000px cap)   │    │  │     PP-DocLayoutV3     │  │    │
│  └──────────────────┘    │  │   (Transformers + CPU  │  │    │
│                          │  │  PyTorch layout detect)│  │    │
│                          │  └───────────┬────────────┘  │    │
│                          │  ┌───────────▼────────────┐  │    │
│                          │  │       OCRClient        │  │    │
│                          │  │ → Ollama /api/generate │  │    │
│                          │  └────────────────────────┘  │    │
│                          └──────────────┬───────────────┘    │
│                          ┌──────────────▼──────────────┐     │
│                          │    Merge + Page Markers     │     │
│                          │        → output.md          │     │
│                          └─────────────────────────────┘     │
└──────────────────────────────────────────────────────────────┘
```

Key Design Decisions
| Decision | Rationale |
|---|---|
| Managed .venv | The package owns its Python environment. Never touches the user's global Python. Reproducible, isolated, self-contained. |
| CPU-only PyTorch by default | Avoids GPU memory competition with Ollama. Smaller venv (~1-2GB vs 4GB+). Layout detection on CPU is fast enough for most documents. |
| Ollama /api/generate mode | Official GLM-OCR recommendation for Ollama. More stable than the OpenAI-compatible endpoint for vision requests. |
| pypdfium2 for PDF rendering | Ships its own PDFium binary in the wheel. Zero system dependencies. Renders directly to PIL images in-memory — no temp files, no subprocess calls. |
| 2000px image cap | Balances OCR quality with model stability. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS. Prevents GGML tensor size crashes on Ollama. |
| Full SDK config | Generates a complete config.yaml matching the SDK's template on every run. All 50+ options are exposed as CLI flags. |
| Per-page error tolerance | A failed page gets a placeholder in the output. The rest of the document continues processing. |
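The 2000px cap works out to a simple longest-side resize. The helper below is a sketch of that arithmetic (the pipeline applies the equivalent resize with PIL's LANCZOS filter; the function name is hypothetical):

```python
# Sketch of the 2000px longest-side cap: images whose longest dimension
# exceeds the cap are scaled down proportionally.
def capped_size(width: int, height: int, cap: int = 2000) -> tuple[int, int]:
    longest = max(width, height)
    if longest <= cap:
        return width, height        # already within bounds, no resize
    scale = cap / longest
    return round(width * scale), round(height * scale)
```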
Output Format
The output Markdown file contains clear page boundaries:
```
<!-- PAGE 1 -->

# Invoice

**Invoice Number:** INV-2024-0042
**Date:** January 15, 2024

| Item | Quantity | Price |
|------|----------|-------|
| Widget A | 10 | $50.00 |
| Widget B | 5 | $75.00 |

**Total: $875.00**

---

<!-- PAGE 2 -->

## Terms and Conditions

1. Payment is due within 30 days.
2. Late payments incur a 2% monthly fee.

---
```

Page Markers
Each page is delimited by:
- `<!-- PAGE N -->` — HTML comment identifying the page number
- `---` — Markdown horizontal rule as a visual separator
Failed Pages
If a page fails OCR (e.g., Ollama timeout, model error), it gets a placeholder:
```
<!-- PAGE 4 -->
<!-- PAGE 4: OCR failed — API request failed after 3 attempts -->
---
```

The rest of the document continues processing normally.
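The page-marker format, including failed-page placeholders, can be reproduced with a short merge helper. This is a hypothetical sketch, not the pipeline's actual merge code:

```python
# Sketch: merge per-page results into one Markdown document.
# A page whose body is None failed OCR and gets a placeholder comment.
def merge_pages(pages: list, errors: dict) -> str:
    parts = []
    for i, body in enumerate(pages, start=1):
        if body is None:
            reason = errors.get(i, "unknown error")
            parts.append(f"<!-- PAGE {i} -->\n"
                         f"<!-- PAGE {i}: OCR failed — {reason} -->\n\n---")
        else:
            parts.append(f"<!-- PAGE {i} -->\n\n{body}\n\n---")
    return "\n\n".join(parts) + "\n"
```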
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| GLMOCR_GPU | 0 | Set to 1 during install to use GPU PyTorch instead of CPU-only. |
Internal Config (auto-generated)
The CLI generates a temporary YAML config for each run. All SDK options are exposed as CLI flags:
```yaml
# Example of generated config (abbreviated)
pipeline:
  maas:
    enabled: false
  ocr_api:
    api_host: localhost
    api_port: 11434
    api_path: /api/generate
    api_mode: ollama_generate
    model: glm-ocr:latest
    connect_timeout: 30
    request_timeout: 120
  max_workers: 1
  page_maxsize: 100
  region_maxsize: 2000
  page_loader:
    max_tokens: 8192
    temperature: 0.0
    top_p: 0.00001
    top_k: 1
    repetition_penalty: 1.1
    image_format: PNG
    min_pixels: 12544
    max_pixels: 71372800
  result_formatter:
    output_format: markdown
    enable_merge_formula_numbers: true
    enable_merge_text_blocks: true
    enable_format_bullet_points: true
  layout:
    device: "cpu"
    threshold: 0.3
    batch_size: 1
    use_polygon: false
    layout_nms: true
    layout_merge_bboxes_mode: large
```

This config is written to a temp directory before each run and cleaned up afterward. Users don't need to manage it manually.
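Assembling this structure from CLI flags amounts to building a nested dict and dumping it to YAML. The sketch below assumes the abbreviated field names shown above and is not the CLI's actual config generator:

```python
# Sketch (assumed structure): assemble a config dict from a few CLI flags.
# The real CLI would dump this to YAML in a temp directory before each run.
def build_config(ollama_host: str = "localhost:11434",
                 concurrency: int = 1) -> dict:
    host, _, port = ollama_host.partition(":")
    return {
        "pipeline": {
            "maas": {"enabled": False},
            "ocr_api": {
                "api_host": host,
                "api_port": int(port or 11434),
                "api_path": "/api/generate",
                "api_mode": "ollama_generate",
                "model": "glm-ocr:latest",
            },
            "max_workers": concurrency,
            "layout": {"device": "cpu", "threshold": 0.3},
        }
    }
```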
GPU Support
The default installation uses CPU-only PyTorch for layout detection. This is intentional:
- No GPU competition — Ollama loads the glm-ocr model into GPU VRAM. Running layout detection on the same GPU can cause OOM errors.
- Smaller venv — CPU PyTorch is ~500MB vs ~4GB for CUDA.
- Fast enough — PP-DocLayoutV3 is lightweight and runs quickly on CPU for typical document sizes.
Enabling GPU
If you have ample GPU memory and want faster layout detection:
```
# Uninstall the CPU-only version
npm uninstall -g glmmedia-ocr

# Reinstall with GPU PyTorch
GLMOCR_GPU=1 npm install -g glmmedia-ocr
```

Then use `--layout-device cuda` when scanning:

```
glmmedia-ocr scan document.pdf --layout-device cuda
```

Recommended GPU Setup
If running both Ollama (glm-ocr model) and layout detection on the same GPU:
- GPU with 12GB+ VRAM — glm-ocr takes ~2.2GB, layout detection takes ~1-2GB
- Use `--concurrency 1` — avoids queuing multiple OCR requests that could spike memory
- Monitor with `nvidia-smi` — watch for OOM during processing
Troubleshooting
Python not found or unsupported version
```
✗ Python 3.12+ not found on PATH. Install from python.org
```

Fix: Install Python 3.12 or 3.13 from python.org. Make sure it's on your PATH. Python 3.14+ is not yet supported because key dependencies (PyTorch, Transformers) don't publish 3.14 wheels yet.
```
# Verify
python --version   # Should show 3.12.x or 3.13.x
```

Ollama not found
```
✗ Ollama not found on PATH. Install from https://ollama.com/download
```

Fix: Install Ollama from ollama.com/download.
```
# Verify
ollama --version
```

SDK installation failed
```
✗ GLM-OCR SDK installation failed. Run 'npm rebuild glmmedia-ocr' to retry.
```

Fix: Rebuild the package:

```
npm rebuild glmmedia-ocr
```

If that fails, try a clean reinstall:

```
npm uninstall -g glmmedia-ocr
npm install -g glmmedia-ocr
```

Model pull failed
```
✗ ollama pull failed with code 1
```

Fix: Check your internet connection and try again. The model is ~2.2GB and requires a stable connection.

```
# Manual pull to debug
ollama pull glm-ocr:latest
```

Ollama won't start
```
✗ Ollama did not become healthy within 15s
```

Fix: Start Ollama manually and check for errors:

```
ollama serve
# In another terminal:
ollama list
```

If Ollama is already running on a different port, use `--ollama-host`:

```
glmmedia-ocr scan document.pdf --ollama-host localhost:11435
```

OCR timeout on large documents
```
Error: OCR failed — API request failed after 3 attempts
```

Fix: Increase the request timeout or reduce concurrency:

```
# Reduce to single worker (most stable)
glmmedia-ocr scan large-document.pdf --concurrency 1

# Allow slower responses from the model
glmmedia-ocr scan large-document.pdf --request-timeout 300

# If using a remote Ollama, ensure the network is stable
glmmedia-ocr scan document.pdf --ollama-host 192.168.1.100:11434
```

Out of memory
```
Error: CUDA out of memory
```

Fix: Use CPU for layout detection:

```
glmmedia-ocr scan document.pdf --layout-device cpu
```

Or reduce concurrency:

```
glmmedia-ocr scan document.pdf --concurrency 1
```

Corrupt or encrypted PDF
```
Error: Failed to render PDF: ...
```

Fix: Ensure the PDF is valid and not password-protected. The current version does not support encrypted PDFs. Use a tool like qpdf to decrypt first:

```
qpdf --decrypt --password=your-password input.pdf decrypted.pdf
glmmedia-ocr scan decrypted.pdf
```

No image files found in directory
```
✗ No image files found in directory: ./images/
```

Fix: Ensure the directory contains supported image files (PNG, JPEG, WebP, BMP, TIFF, GIF). Use `--recursive` if images are in subdirectories:

```
glmmedia-ocr scan ./images/ --recursive
```

Input not found
```
✗ Input not found: ./missing.pdf
```

Fix: Check the file path and ensure the input exists.
Project Structure
```
glmmedia-ocr/
├── bin/
│   └── glmmedia-ocr.js     # npm CLI entry point
│                           #  - Thin wrapper: finds .venv Python
│                           #  - Delegates to lib/pipeline.py
│
├── scripts/
│   └── postinstall.js      # npm package setup
│                           #  - Creates .venv
│                           #  - pip install glmocr[selfhosted] + CPU torch
│                           #  - Verifies installation
│
├── lib/
│   └── pipeline.py         # PDF/Image-to-Markdown pipeline (npm path)
│                           #  - pypdfium2: PDF → PIL images (2000px cap)
│                           #  - PIL: load images directly (2000px cap)
│                           #  - GlmOcr SDK: layout detection + OCR
│                           #  - Logging: surfaces SDK progress to stderr
│                           #  - Merge with page markers → .md
│
├── src/glmmedia_ocr/       # Pure Python CLI package (pip path)
│   ├── __init__.py         # Package version
│   ├── __main__.py         # python -m glmmedia_ocr entry
│   ├── cli.py              # Full CLI: args, Ollama, config, spinner
│   ├── config.py           # Config YAML generation
│   ├── inputs.py           # Input resolution (files, dirs, types)
│   ├── ollama.py           # Ollama lifecycle management
│   ├── pipeline.py         # Rendering + OCR + output
│   └── spinner.py          # Animated terminal spinner
│
├── pyproject.toml          # Python package metadata + deps
├── .venv/                  # Created at npm install time (gitignored)
├── .gitignore
├── package.json            # npm package metadata
└── README.md
```

Distribution Channels
| Channel | Entry point | Code path |
|---|---|---|
| npm | bin/glmmedia-ocr.js | JS wrapper → lib/pipeline.py |
| pip | src/glmmedia_ocr/cli.py | Pure Python (full implementation) |
Both provide the same CLI interface and functionality. They are independent implementations — changes to one should be mirrored in the other.
What's NOT Here
| Not included | Why |
|---|---|
| node_modules/ | Zero npm dependencies — uses Node.js built-ins only |
| vendor/poppler/ | pypdfium2 ships its own PDFium binary in its pip wheel |
| config.yaml | Generated dynamically per run, cleaned up after |
| *.md output files | Generated by the CLI, not part of the package |
| dist/, build/, *.egg-info/ | Build artifacts (gitignored) |
Under the Hood
Input Resolution
The CLI accepts PDFs, images, and directories. When a directory is passed, it collects all supported image files (flat or recursive with `--recursive`). Mixed input types (PDF + image) are supported — pages are merged in input order into a single output file with sequential `<!-- PAGE N -->` markers.
PDF Rendering
Uses pypdfium2, which bundles the PDFium engine (same as Chromium). Renders PDF pages directly to PIL images in-memory at the specified DPI. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS resampling. No temp files, no subprocess calls, no system dependencies.
Image Loading
Images are opened with PIL and capped to 2000px on their longest dimension via LANCZOS resampling. This ensures consistent quality while preventing GGML tensor size crashes on Ollama.
Layout Detection
Uses PP-DocLayoutV3 via HuggingFace Transformers. Detects text blocks, tables, formulas, images, and other regions on each page. Runs on CPU by default to avoid GPU memory competition with Ollama. Progress is logged to stderr when --log-level DEBUG is used.
OCR
Each detected region is sent to the glm-ocr model via Ollama's native /api/generate endpoint. The model returns structured Markdown for each region.
Result Merging
Per-page results are merged with `<!-- PAGE N -->` markers and `---` separators. Failed pages get error placeholders instead of aborting the entire document.
License
MIT
