glmmedia-ocr v0.1.0
Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama. Fully self-contained — zero ongoing maintenance after install.
```
npm install -g glmmedia-ocr
glmmedia-ocr scan invoice.pdf
# → invoice.md written
```

Table of Contents
- Requirements
- Installation
- Quick Start
- CLI Reference
- How It Works
- Architecture
- Output Format
- Configuration
- GPU Support
- Troubleshooting
- Project Structure
- License
Requirements
Only two things need to be on your machine before installing:
| Requirement | Why | Where |
|---|---|---|
| Python 3.12 or 3.13 | Runs the GLM-OCR SDK | python.org |
| Ollama (installed, not necessarily running) | Serves the glm-ocr model locally | ollama.com/download |
That's it. Everything else — the Python virtual environment, all dependencies, and the Ollama process lifecycle — is managed automatically by the package.
Note: Python 3.14+ is not yet supported. The GLM-OCR SDK and its dependencies (PyTorch, Transformers) only publish wheels for Python 3.10–3.13.
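The version gate above can be sketched as a small preflight check. This is a hypothetical helper, not the package's actual code:

```python
import sys

# Hypothetical preflight helper mirroring the requirement above:
# only CPython 3.12 and 3.13 are accepted.
def python_supported(version_info=sys.version_info) -> bool:
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in (12, 13)
```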
Installation
npm (recommended)
```
npm install -g glmmedia-ocr
```

This triggers a postinstall script that:
- Creates a dedicated Python virtual environment inside the package (`.venv/`)
- Installs `glmocr[selfhosted]` with CPU-only PyTorch into the venv
- Verifies the installation by importing the SDK
The first install takes a few minutes while pip downloads ~1-2GB of dependencies. This is a one-time cost.
pip
```
pip install .
```

Or from source:

```
git clone https://github.com/glmmedia-ocr/glmmedia-ocr.git
cd glmmedia-ocr
pip install .
```

This installs the same dependencies directly into your Python environment and registers the glmmedia-ocr CLI command. Both npm and pip packages provide the exact same functionality and CLI interface.
GPU install (optional)
By default, the npm package installs CPU-only PyTorch to avoid GPU resource competition with Ollama. If you have a GPU and want to use it for layout detection:
```
# npm
GLMOCR_GPU=1 npm install -g glmmedia-ocr

# pip — pip resolves CUDA PyTorch by default
pip install .
```

Reinstall / repair
```
# npm
npm rebuild glmmedia-ocr

# pip
pip install --force-reinstall .
```

Quick Start
```
# Single PDF
glmmedia-ocr scan invoice.pdf

# Single image
glmmedia-ocr scan receipt.png

# Multiple images
glmmedia-ocr scan page1.png page2.png page3.png

# Mixed PDFs and images
glmmedia-ocr scan report.pdf page1.png page2.png

# All images in a directory
glmmedia-ocr scan ./images/

# All images in directory + subdirectories
glmmedia-ocr scan ./images/ --recursive

# Shell glob
glmmedia-ocr scan *.png

# Custom output path
glmmedia-ocr scan contract.pdf --output ./results/contract.md

# Higher DPI for better OCR quality
glmmedia-ocr scan receipt.pdf --dpi 300

# Connect to a remote Ollama instance
glmmedia-ocr scan report.pdf --ollama-host 192.168.1.100:11434

# Faster processing with parallel workers
glmmedia-ocr scan book.pdf --concurrency 2

# Debug logging to see layout detection progress
glmmedia-ocr scan document.pdf --log-level DEBUG
```

First run
On the very first run, the CLI will:
- Detect that Ollama is not running and start it automatically
- Detect that the `glm-ocr:latest` model is not pulled and download it (~2.2GB)
- Process your input
- Shut down Ollama on exit (since it started it)
Subsequent runs skip steps 1 and 2 if Ollama is already running and the model is cached.
CLI Reference
```
glmmedia-ocr scan <input...> [options]

Inputs:
  <file.pdf>                 Single PDF file
  <image.png>                Single image file (PNG, JPEG, WebP, BMP, TIFF, GIF)
  <img1.png> <img2.png> ...  Multiple image files
  <directory>/               Directory of images (use --recursive for subfolders)

Input/Output:
  --output <path>            Output .md path (default: auto-generated from input names)
  --recursive                Scan directories recursively for images

Rendering:
  --dpi <number>             Render DPI for PDFs (default: 200)
  --image-format <format>    Image format: PNG, JPEG, WEBP (default: PNG)
  --min-pixels <number>      Minimum image pixels (default: 12544)
  --max-pixels <number>      Maximum image pixels (default: 71372800)
  --patch-expand-factor <n>  Patch expansion factor (default: 1)
  --t-patch-size <n>         T-patch size (default: 2)
  --image-expect-length <n>  Image expect length (default: 6144)

Generation:
  --max-tokens <number>      Max generation tokens (default: 8192)
  --temperature <float>      Sampling temperature (default: 0.0)
  --top-p <float>            Top-p sampling (default: 0.00001)
  --top-k <number>           Top-k sampling (default: 1)
  --repetition-penalty <float>  Repetition penalty (default: 1.1)

Layout (PP-DocLayoutV3):
  --layout-device <device>   Device: cpu, cuda, cuda:N (default: cpu)
  --layout-model-dir <path>  Custom layout model directory
  --layout-threshold <float> Detection threshold (default: 0.3)
  --layout-batch-size <n>    Layout batch size (default: 1)
  --layout-use-polygon       Use polygon masks for cropping
  --no-layout-nms            Disable layout NMS
  --layout-merge-mode <mode> Merge overlapping bboxes: large|small (default: large)
  --layout-workers <n>       Layout workers (default: 1)

Result formatting:
  --output-format <format>      Output: markdown, json, both (default: markdown)
  --no-merge-formula-numbers    Disable formula number merging
  --no-merge-text-blocks        Disable text block merging
  --no-format-bullet-points     Disable bullet point formatting

Pipeline:
  --concurrency <number>     Parallel OCR workers (default: 1)
  --page-maxsize <number>    Page queue max size (default: 100)
  --region-maxsize <number>  Region queue max size (default: 2000)

Ollama / API:
  --ollama-host <host>       Ollama host (default: localhost:11434)
  --ollama-num-ctx <n>       Ollama num_ctx for glm-ocr (default: 8192; 0 = omit)
  --api-scheme <scheme>      API scheme: http, https (default: auto)
  --api-key <key>            API key for MaaS providers
  --verify-ssl               Enable SSL verification
  --connect-timeout <seconds>   Connect timeout (default: 30)
  --request-timeout <seconds>   Request timeout (default: 120)

MaaS (Zhipu Cloud):
  --maas                     Enable MaaS mode (disables local OCR)
  --maas-api-url <url>       MaaS API URL
  --maas-model <model>       MaaS model name
  --maas-api-key <key>       MaaS API key
  --no-maas-verify-ssl       Disable MaaS SSL verification
  --maas-connect-timeout <s>    MaaS connect timeout (default: 30)
  --maas-request-timeout <s>    MaaS request timeout (default: 300)
  --maas-retry-attempts <n>     MaaS retry attempts (default: 2)

Logging:
  --log-level <level>        Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
```

Flag Details
Inputs
| Input type | Description |
|---|---|
| <file.pdf> | One or more PDF files. Each page becomes `<!-- PAGE N -->` in output. |
| <image.png> | One or more image files. Supported: PNG, JPEG, WebP, BMP, TIFF, GIF. |
| <file.pdf> <img.png> | Mixed PDFs and images. Pages are merged in input order. |
| <directory>/ | Directory of images. Scans flat by default; use --recursive for subfolders. |
Input/Output
| Flag | Default | Description |
|---|---|---|
| --output | auto-generated | Where to write the Markdown output. Single input → <name>.md. Multiple inputs → <name1>_<name2>_output.md. --output overrides all. |
| --recursive | off | When a directory is passed, recurse into subdirectories for images. |
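The default output naming rule above can be sketched as a small helper. The function name is hypothetical; it mirrors the behavior described in the table:

```python
from pathlib import Path

# Hypothetical sketch of the default naming rule: one input -> <name>.md,
# several inputs -> <name1>_<name2>_output.md (overridden by --output).
def default_output_path(inputs: list[str]) -> str:
    stems = [Path(p).stem for p in inputs]
    if len(stems) == 1:
        return f"{stems[0]}.md"
    return "_".join(stems) + "_output.md"
```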
Rendering
| Flag | Default | Description |
|---|---|---|
| --dpi | 200 | Resolution for rendering PDF pages to images. Higher DPI improves OCR accuracy but increases processing time and memory usage. Recommended: 200-300. |
| --image-format | PNG | Format for images sent to the OCR API. PNG is lossless (best for code, diagrams). JPEG is smaller (best for text documents). WEBP is smallest but may not be supported by all backends. |
| --min-pixels | 12544 | Minimum image pixel count (112×112). Images smaller than this are upscaled. |
| --max-pixels | 71372800 | Maximum image pixel count. Images larger than this are downscaled. |
| --patch-expand-factor | 1 | Patch expansion factor for image processing. |
| --t-patch-size | 2 | T-patch size for image processing. |
| --image-expect-length | 6144 | Expected image token length. |
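The min/max pixel bounds can be understood as an area clamp: scale the image so its pixel count lands inside `[min_pixels, max_pixels]` while preserving aspect ratio. This is assumed logic for illustration, not the SDK's exact resampling code:

```python
import math

# Sketch: clamp the image area into [min_pixels, max_pixels],
# preserving aspect ratio (assumed logic, not the SDK's exact code).
def fit_pixels(width: int, height: int,
               min_pixels: int = 12544, max_pixels: int = 71372800) -> tuple[int, int]:
    area = width * height
    if area < min_pixels:
        scale = math.sqrt(min_pixels / area)   # upscale small images
    elif area > max_pixels:
        scale = math.sqrt(max_pixels / area)   # downscale huge images
    else:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))
```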
Generation
| Flag | Default | Description |
|---|---|---|
| --max-tokens | 8192 | Maximum tokens generated per region. Increase for very dense pages. |
| --temperature | 0.0 | Sampling temperature. 0.0 = deterministic (recommended for OCR). |
| --top-p | 0.00001 | Top-p (nucleus) sampling. Keep very low for OCR. |
| --top-k | 1 | Top-k sampling. 1 = always pick the most likely token. |
| --repetition-penalty | 1.1 | Penalty for repeating tokens. Prevents the model from getting stuck in loops. |
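These flags map onto Ollama's generation options. The mapping below is an assumed sketch of the request body sent to `/api/generate`; note that Ollama's option is named `repeat_penalty` and its token limit is `num_predict`, while the CLI flags use `--repetition-penalty` and `--max-tokens`:

```python
# Sketch (assumed mapping) of how the generation flags could translate into
# an Ollama /api/generate request body for a base64-encoded region image.
def build_generate_payload(prompt: str, image_b64: str,
                           model: str = "glm-ocr:latest",
                           max_tokens: int = 8192, temperature: float = 0.0,
                           top_p: float = 0.00001, top_k: int = 1,
                           repetition_penalty: float = 1.1,
                           num_ctx: int = 8192) -> dict:
    options = {
        "num_predict": max_tokens,        # Ollama's name for max tokens
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repetition_penalty,  # Ollama's name for the penalty
    }
    if num_ctx:                            # 0 = omit, per --ollama-num-ctx
        options["num_ctx"] = num_ctx
    return {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,
        "options": options,
    }
```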
Layout (PP-DocLayoutV3)
| Flag | Default | Description |
|---|---|---|
| --layout-device | cpu | Device for the PP-DocLayoutV3 layout detection model. cpu avoids GPU memory competition with Ollama. Use cuda or cuda:N for GPU. |
| --layout-model-dir | (SDK default) | Path to a custom PP-DocLayoutV3 model directory. Leave unset to use the SDK's built-in default. |
| --layout-threshold | 0.3 | Confidence threshold for layout detection. Lower values detect more regions (may include false positives). |
| --layout-batch-size | 1 | Max images per layout model forward pass. Reduce to 1 if OOM. |
| --layout-use-polygon | off | Use polygon masks for region cropping instead of bounding boxes. More precise for rotated or staggered layouts. |
| --no-layout-nms | off | Disable non-maximum suppression for layout detection. |
| --layout-merge-mode | large | How to merge overlapping bounding boxes. large keeps the larger region, small keeps the smaller one. |
| --layout-workers | 1 | Number of layout detection workers. |
Result Formatting
| Flag | Default | Description |
|---|---|---|
| --output-format | markdown | Output format: markdown, json, or both. |
| --no-merge-formula-numbers | off | Disable automatic merging of formula numbers with their equations. |
| --no-merge-text-blocks | off | Disable automatic merging of adjacent text blocks. |
| --no-format-bullet-points | off | Disable automatic bullet point formatting normalization. |
Pipeline
| Flag | Default | Description |
|---|---|---|
| --concurrency | 1 | Number of parallel OCR workers. Increase for faster processing on multi-page documents. Set to 1 for maximum stability with Ollama. |
| --page-maxsize | 100 | Maximum number of pages queued for processing. |
| --region-maxsize | 2000 | Maximum number of regions queued for OCR. |
Ollama / API
| Flag | Default | Description |
|---|---|---|
| --ollama-host | localhost:11434 | Ollama server address. Use this to connect to a remote or non-standard Ollama instance. |
| --ollama-num-ctx | 8192 | Ollama num_ctx parameter for glm-ocr. Prevents GGML tensor size crashes. Set to 0 to omit. |
| --api-scheme | auto | API URL scheme: http or https. Auto-detects based on port (HTTPS if 443). |
| --api-key | null | API key for MaaS providers (Zhipu, OpenAI, etc.). |
| --verify-ssl | off | Enable SSL certificate verification for API requests. |
| --connect-timeout | 30 | Connection timeout in seconds. |
| --request-timeout | 120 | Request timeout in seconds. |
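The `--api-scheme auto` rule is simple enough to state as code. This is a sketch of the documented behavior (explicit scheme wins; otherwise HTTPS only for port 443), not the CLI's actual implementation:

```python
# Sketch of --api-scheme auto-detection: explicit scheme wins,
# otherwise pick https only when the host's port is 443.
def resolve_scheme(host: str, scheme: str = "auto") -> str:
    if scheme != "auto":
        return scheme
    port = host.rsplit(":", 1)[1] if ":" in host else ""
    return "https" if port == "443" else "http"
```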
MaaS (Zhipu Cloud)
| Flag | Default | Description |
|---|---|---|
| --maas | off | Enable MaaS mode. Sends requests directly to Zhipu's cloud API. Disables local OCR and Ollama checks. |
| --maas-api-url | Zhipu default | MaaS API endpoint URL. |
| --maas-model | glm-ocr | MaaS model name. |
| --maas-api-key | null | MaaS API key (or set ZHIPU_API_KEY env var). |
| --no-maas-verify-ssl | off | Disable SSL verification for MaaS requests. |
| --maas-connect-timeout | 30 | MaaS connection timeout in seconds. |
| --maas-request-timeout | 300 | MaaS request timeout in seconds. |
| --maas-retry-attempts | 2 | Number of retry attempts for transient MaaS errors. |
Logging
| Flag | Default | Description |
|---|---|---|
| --log-level | INFO | Log level: DEBUG, INFO, WARNING, ERROR. Use DEBUG to see detailed timing and layout detection progress. |
How It Works
Startup Sequence
```
glmmedia-ocr scan invoice.pdf
│
├─ 1. Preflight Checks
│   ├─ Python 3.12 or 3.13 found?
│   ├─ Ollama binary on PATH? (skipped if --maas)
│   └─ GLM-OCR SDK importable in managed venv?
│
├─ 2. Ollama Lifecycle (skipped if --maas)
│   ├─ Is Ollama already running? (GET localhost:11434)
│   ├─ If yes → use it, leave it running after exit
│   └─ If no → spawn ollama serve, wait until healthy
│
├─ 3. Model Check (skipped if --maas)
│   ├─ Is glm-ocr:latest pulled? (ollama list)
│   └─ If no → ollama pull glm-ocr:latest (~2.2GB, one-time)
│
├─ 4. Pipeline Execution
│   ├─ PDF: Render pages to images (pypdfium2, in-memory, capped to 2000px)
│   │  Images: Load and cap to 2000px (no rendering step)
│   ├─ Run layout detection (PP-DocLayoutV3) — progress logged to stderr
│   ├─ OCR each region via Ollama (/api/generate) or MaaS
│   └─ Merge results with page markers
│
└─ 5. Cleanup
    ├─ Write output .md
    └─ Shut down Ollama (only if CLI started it)
```

Ollama Ownership Tracking
The CLI tracks whether it started Ollama or found it already running:
| Scenario | CLI behavior |
|---|---|
| Ollama was already running | Uses it, leaves it running on exit |
| CLI started Ollama | Shuts it down on normal exit, SIGINT, or SIGTERM |
| CLI crashes | Still shuts down Ollama via signal trap |
This means you can run Ollama manually before using the CLI, and it won't be touched.
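The ownership rule can be sketched as a tiny lifecycle object. This is an assumed shape, not the actual CLI code; it only terminates the server process if this process spawned it:

```python
import atexit
import signal
import subprocess

# Sketch (assumed shape) of the ownership rule: shut Ollama down at exit
# only if this process started it.
class OllamaLifecycle:
    def __init__(self):
        self.proc = None                 # set only when we spawn ollama serve

    def ensure_running(self, already_running: bool):
        if already_running:
            return                       # found running -> leave it alone on exit
        self.proc = subprocess.Popen(["ollama", "serve"])
        atexit.register(self.shutdown)
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, lambda *_: self.shutdown())

    def shutdown(self):
        if self.proc is not None:        # only terminate what we own
            self.proc.terminate()
            self.proc = None
```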
Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                          User (CLI)                          │
│     glmmedia-ocr scan invoice.pdf  (or *.png, ./images/)     │
└──────────────────────────────┬───────────────────────────────┘
                               │
┌──────────────────────────────▼───────────────────────────────┐
│                 bin/glmmedia-ocr.js (Node.js)                │
│                                                              │
│  ┌─────────────┐   ┌──────────────┐   ┌───────────────────┐  │
│  │  Preflight  │   │    Ollama    │   │    Model Check    │  │
│  │   Checks    │   │   Lifecycle  │   │  (pull if needed) │  │
│  └──────┬──────┘   └──────┬───────┘   └─────────┬─────────┘  │
│         │                 │                     │            │
│         └─────────────────┼─────────────────────┘            │
│                           │                                  │
│              ┌────────────▼────────────┐                     │
│              │     Resolve inputs      │                     │
│              │  (files, dirs, globs)   │                     │
│              └────────────┬────────────┘                     │
│              ┌────────────▼────────────┐                     │
│              │  Generate config.yaml   │                     │
│              │   (full SDK template)   │                     │
│              └────────────┬────────────┘                     │
│              ┌────────────▼────────────┐                     │
│              │  Spawn Python Pipeline  │                     │
│              │     lib/pipeline.py     │                     │
│              └────────────┬────────────┘                     │
└───────────────────────────┼──────────────────────────────────┘
                            │
┌───────────────────────────▼──────────────────────────────────┐
│                   lib/pipeline.py (Python)                   │
│                                                              │
│  ┌──────────────────┐    ┌──────────────────────────────┐    │
│  │  PDF: pypdfium2  │    │   GlmOcr SDK (selfhosted)    │    │
│  │  Image: PIL open │───▶│  ┌────────────────────────┐  │    │
│  │   (2000px cap)   │    │  │     PP-DocLayoutV3     │  │    │
│  └──────────────────┘    │  │   (Transformers + CPU  │  │    │
│                          │  │  PyTorch layout detect)│  │    │
│                          │  └───────────┬────────────┘  │    │
│                          │  ┌───────────▼────────────┐  │    │
│                          │  │       OCRClient        │  │    │
│                          │  │ → Ollama /api/generate │  │    │
│                          │  └────────────────────────┘  │    │
│                          └──────────────┬───────────────┘    │
│                          ┌──────────────▼──────────────┐     │
│                          │    Merge + Page Markers     │     │
│                          │        → output.md          │     │
│                          └─────────────────────────────┘     │
└──────────────────────────────────────────────────────────────┘
```

Key Design Decisions
| Decision | Rationale |
|---|---|
| Managed .venv | The package owns its Python environment. Never touches the user's global Python. Reproducible, isolated, self-contained. |
| CPU-only PyTorch by default | Avoids GPU memory competition with Ollama. Smaller venv (~1-2GB vs 4GB+). Layout detection on CPU is fast enough for most documents. |
| Ollama /api/generate mode | Official GLM-OCR recommendation for Ollama. More stable than the OpenAI-compatible endpoint for vision requests. |
| pypdfium2 for PDF rendering | Ships its own PDFium binary in the wheel. Zero system dependencies. Renders directly to PIL images in-memory — no temp files, no subprocess calls. |
| 2000px image cap | Balances OCR quality with model stability. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS. Prevents GGML tensor size crashes on Ollama. |
| Full SDK config | Generates a complete config.yaml matching the SDK's template on every run. All 50+ options are exposed as CLI flags. |
| Per-page error tolerance | A failed page gets a placeholder in the output. The rest of the document continues processing. |
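The 2000px cap works out to a simple longest-side resize. The helper below is a sketch of that arithmetic (the pipeline applies the equivalent resize with PIL's LANCZOS filter; the function name is hypothetical):

```python
# Sketch of the 2000px longest-side cap: images whose longest dimension
# exceeds the cap are scaled down proportionally.
def capped_size(width: int, height: int, cap: int = 2000) -> tuple[int, int]:
    longest = max(width, height)
    if longest <= cap:
        return width, height        # already within bounds, no resize
    scale = cap / longest
    return round(width * scale), round(height * scale)
```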
Output Format
The output Markdown file contains clear page boundaries:
```
<!-- PAGE 1 -->

# Invoice

**Invoice Number:** INV-2024-0042
**Date:** January 15, 2024

| Item | Quantity | Price |
|------|----------|-------|
| Widget A | 10 | $50.00 |
| Widget B | 5 | $75.00 |

**Total: $875.00**

---

<!-- PAGE 2 -->

## Terms and Conditions

1. Payment is due within 30 days.
2. Late payments incur a 2% monthly fee.

---
```

Page Markers
Each page is delimited by:
- `<!-- PAGE N -->` — HTML comment identifying the page number
- `---` — Markdown horizontal rule as a visual separator
Failed Pages
If a page fails OCR (e.g., Ollama timeout, model error), it gets a placeholder:
```
<!-- PAGE 4 -->
<!-- PAGE 4: OCR failed — API request failed after 3 attempts -->
---
```

The rest of the document continues processing normally.
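The page-marker format, including failed-page placeholders, can be reproduced with a short merge helper. This is a hypothetical sketch, not the pipeline's actual merge code:

```python
# Sketch: merge per-page results into one Markdown document.
# A page whose body is None failed OCR and gets a placeholder comment.
def merge_pages(pages: list, errors: dict) -> str:
    parts = []
    for i, body in enumerate(pages, start=1):
        if body is None:
            reason = errors.get(i, "unknown error")
            parts.append(f"<!-- PAGE {i} -->\n"
                         f"<!-- PAGE {i}: OCR failed — {reason} -->\n\n---")
        else:
            parts.append(f"<!-- PAGE {i} -->\n\n{body}\n\n---")
    return "\n\n".join(parts) + "\n"
```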
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| GLMOCR_GPU | 0 | Set to 1 during install to use GPU PyTorch instead of CPU-only. |
Internal Config (auto-generated)
The CLI generates a temporary YAML config for each run. All SDK options are exposed as CLI flags:
```yaml
# Example of generated config (abbreviated)
pipeline:
  maas:
    enabled: false
  ocr_api:
    api_host: localhost
    api_port: 11434
    api_path: /api/generate
    api_mode: ollama_generate
    model: glm-ocr:latest
    connect_timeout: 30
    request_timeout: 120
  max_workers: 1
  page_maxsize: 100
  region_maxsize: 2000
  page_loader:
    max_tokens: 8192
    temperature: 0.0
    top_p: 0.00001
    top_k: 1
    repetition_penalty: 1.1
    image_format: PNG
    min_pixels: 12544
    max_pixels: 71372800
  result_formatter:
    output_format: markdown
    enable_merge_formula_numbers: true
    enable_merge_text_blocks: true
    enable_format_bullet_points: true
  layout:
    device: "cpu"
    threshold: 0.3
    batch_size: 1
    use_polygon: false
    layout_nms: true
    layout_merge_bboxes_mode: large
```

This config is written to a temp directory before each run and cleaned up afterward. Users don't need to manage it manually.
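Assembling this structure from CLI flags amounts to building a nested dict and dumping it to YAML. The sketch below assumes the abbreviated field names shown above and is not the CLI's actual config generator:

```python
# Sketch (assumed structure): assemble a config dict from a few CLI flags.
# The real CLI would dump this to YAML in a temp directory before each run.
def build_config(ollama_host: str = "localhost:11434",
                 concurrency: int = 1) -> dict:
    host, _, port = ollama_host.partition(":")
    return {
        "pipeline": {
            "maas": {"enabled": False},
            "ocr_api": {
                "api_host": host,
                "api_port": int(port or 11434),
                "api_path": "/api/generate",
                "api_mode": "ollama_generate",
                "model": "glm-ocr:latest",
            },
            "max_workers": concurrency,
            "layout": {"device": "cpu", "threshold": 0.3},
        }
    }
```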
GPU Support
The default installation uses CPU-only PyTorch for layout detection. This is intentional:
- No GPU competition — Ollama loads the glm-ocr model into GPU VRAM. Running layout detection on the same GPU can cause OOM errors.
- Smaller venv — CPU PyTorch is ~500MB vs ~4GB for CUDA.
- Fast enough — PP-DocLayoutV3 is lightweight and runs quickly on CPU for typical document sizes.
Enabling GPU
If you have ample GPU memory and want faster layout detection:
```
# Uninstall the CPU-only version
npm uninstall -g glmmedia-ocr

# Reinstall with GPU PyTorch
GLMOCR_GPU=1 npm install -g glmmedia-ocr
```

Then use `--layout-device cuda` when scanning:

```
glmmedia-ocr scan document.pdf --layout-device cuda
```

Recommended GPU Setup
If running both Ollama (glm-ocr model) and layout detection on the same GPU:
- GPU with 12GB+ VRAM — glm-ocr takes ~2.2GB, layout detection takes ~1-2GB
- Use `--concurrency 1` — avoids queuing multiple OCR requests that could spike memory
- Monitor with `nvidia-smi` — watch for OOM during processing
Troubleshooting
Python not found or unsupported version
```
✗ Python 3.12+ not found on PATH. Install from python.org
```

Fix: Install Python 3.12 or 3.13 from python.org. Make sure it's on your PATH. Python 3.14+ is not yet supported because key dependencies (PyTorch, Transformers) don't publish 3.14 wheels yet.
```
# Verify
python --version   # Should show 3.12.x or 3.13.x
```

Ollama not found
```
✗ Ollama not found on PATH. Install from https://ollama.com/download
```

Fix: Install Ollama from ollama.com/download.
```
# Verify
ollama --version
```

SDK installation failed
```
✗ GLM-OCR SDK installation failed. Run 'npm rebuild glmmedia-ocr' to retry.
```

Fix: Rebuild the package:

```
npm rebuild glmmedia-ocr
```

If that fails, try a clean reinstall:

```
npm uninstall -g glmmedia-ocr
npm install -g glmmedia-ocr
```

Model pull failed
```
✗ ollama pull failed with code 1
```

Fix: Check your internet connection and try again. The model is ~2.2GB and requires a stable connection.

```
# Manual pull to debug
ollama pull glm-ocr:latest
```

Ollama won't start
```
✗ Ollama did not become healthy within 15s
```

Fix: Start Ollama manually and check for errors:

```
ollama serve
# In another terminal:
ollama list
```

If Ollama is already running on a different port, use `--ollama-host`:

```
glmmedia-ocr scan document.pdf --ollama-host localhost:11435
```

OCR timeout on large documents
```
Error: OCR failed — API request failed after 3 attempts
```

Fix: Increase the request timeout or reduce concurrency:

```
# Reduce to single worker (most stable)
glmmedia-ocr scan large-document.pdf --concurrency 1

# Allow slower responses from the model
glmmedia-ocr scan large-document.pdf --request-timeout 300

# If using a remote Ollama, ensure the network is stable
glmmedia-ocr scan document.pdf --ollama-host 192.168.1.100:11434
```

Out of memory
```
Error: CUDA out of memory
```

Fix: Use CPU for layout detection:

```
glmmedia-ocr scan document.pdf --layout-device cpu
```

Or reduce concurrency:

```
glmmedia-ocr scan document.pdf --concurrency 1
```

Corrupt or encrypted PDF
```
Error: Failed to render PDF: ...
```

Fix: Ensure the PDF is valid and not password-protected. The current version does not support encrypted PDFs. Use a tool like qpdf to decrypt first:

```
qpdf --decrypt --password=your-password input.pdf decrypted.pdf
glmmedia-ocr scan decrypted.pdf
```

No image files found in directory
```
✗ No image files found in directory: ./images/
```

Fix: Ensure the directory contains supported image files (PNG, JPEG, WebP, BMP, TIFF, GIF). Use `--recursive` if images are in subdirectories:

```
glmmedia-ocr scan ./images/ --recursive
```

Input not found
```
✗ Input not found: ./missing.pdf
```

Fix: Check the file path and ensure the input exists.
Project Structure
```
glmmedia-ocr/
├── bin/
│   └── glmmedia-ocr.js     # npm CLI entry point
│                           #  - Thin wrapper: finds .venv Python
│                           #  - Delegates to lib/pipeline.py
│
├── scripts/
│   └── postinstall.js      # npm package setup
│                           #  - Creates .venv
│                           #  - pip install glmocr[selfhosted] + CPU torch
│                           #  - Verifies installation
│
├── lib/
│   └── pipeline.py         # PDF/Image-to-Markdown pipeline (npm path)
│                           #  - pypdfium2: PDF → PIL images (2000px cap)
│                           #  - PIL: load images directly (2000px cap)
│                           #  - GlmOcr SDK: layout detection + OCR
│                           #  - Logging: surfaces SDK progress to stderr
│                           #  - Merge with page markers → .md
│
├── src/glmmedia_ocr/       # Pure Python CLI package (pip path)
│   ├── __init__.py         # Package version
│   ├── __main__.py         # python -m glmmedia_ocr entry
│   ├── cli.py              # Full CLI: args, Ollama, config, spinner
│   ├── config.py           # Config YAML generation
│   ├── inputs.py           # Input resolution (files, dirs, types)
│   ├── ollama.py           # Ollama lifecycle management
│   ├── pipeline.py         # Rendering + OCR + output
│   └── spinner.py          # Animated terminal spinner
│
├── pyproject.toml          # Python package metadata + deps
├── .venv/                  # Created at npm install time (gitignored)
├── .gitignore
├── package.json            # npm package metadata
└── README.md
```

Distribution Channels
| Channel | Entry point | Code path |
|---|---|---|
| npm | bin/glmmedia-ocr.js | JS wrapper → lib/pipeline.py |
| pip | src/glmmedia_ocr/cli.py | Pure Python (full implementation) |
Both provide the same CLI interface and functionality. They are independent implementations — changes to one should be mirrored in the other.
What's NOT Here
| Not included | Why |
|---|---|
| node_modules/ | Zero npm dependencies — uses Node.js built-ins only |
| vendor/poppler/ | pypdfium2 ships its own PDFium binary in its pip wheel |
| config.yaml | Generated dynamically per run, cleaned up after |
| *.md output files | Generated by the CLI, not part of the package |
| dist/, build/, *.egg-info/ | Build artifacts (gitignored) |
Under the Hood
Input Resolution
The CLI accepts PDFs, images, and directories. When a directory is passed, it collects all supported image files (flat or recursive with `--recursive`). Mixed input types (PDF + image) are supported — pages are merged in input order into a single output file with sequential `<!-- PAGE N -->` markers.
PDF Rendering
Uses pypdfium2, which bundles the PDFium engine (same as Chromium). Renders PDF pages directly to PIL images in-memory at the specified DPI. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS resampling. No temp files, no subprocess calls, no system dependencies.
Image Loading
Images are opened with PIL and capped to 2000px on their longest dimension via LANCZOS resampling. This ensures consistent quality while preventing GGML tensor size crashes on Ollama.
Layout Detection
Uses PP-DocLayoutV3 via HuggingFace Transformers. Detects text blocks, tables, formulas, images, and other regions on each page. Runs on CPU by default to avoid GPU memory competition with Ollama. Progress is logged to stderr when --log-level DEBUG is used.
OCR
Each detected region is sent to the glm-ocr model via Ollama's native /api/generate endpoint. The model returns structured Markdown for each region.
Result Merging
Per-page results are merged with `<!-- PAGE N -->` markers and `---` separators. Failed pages get error placeholders instead of aborting the entire document.
License
MIT
