@pisanvs/mistralocr-cli

v1.0.2

Published

14 days ago

CLI for the Mistral OCR API — converts documents and images to Markdown

pisanvs

mistralocr-cli

Convert PDFs, images, and documents to Markdown using the Mistral OCR API.

mistralocr-cli is a fast command-line tool with built-in caching that extracts text from scanned documents and images. It splits large files into chunks automatically, retries on transient errors, and can embed or save extracted images alongside the Markdown output.

Features

📄 Multi-format support — PDFs, images (PNG/JPEG/WebP/…), Word, PowerPoint, EPUB, and more
🗂️ Smart caching — SHA-256 based; re-uses previous results so you never pay for the same file twice
✂️ Automatic chunking — splits large PDFs into chunks and reassembles the result
🔁 Retry with back-off — handles rate limits and transient network errors gracefully
🖼️ Image extraction — embed images as base64 data URIs, or save them to a folder
📊 Table extraction — save each detected table to its own Markdown file
🔄 Auto-conversion — converts legacy .doc / .ppt / .xls files via LibreOffice (if installed)
📣 Verbose mode — real-time spinners and detailed progress output

Requirements

| Requirement | Notes |
|---|---|
| Node.js ≥ 20 | Required |
| Mistral API key | Get one at console.mistral.ai |
| pdfinfo (optional) | Enables automatic page-count detection and chunking for PDFs. Install via poppler-utils |
| LibreOffice (optional) | Required only for converting legacy .doc, .ppt, .xls files |

Installation

Global (recommended)

npm install -g @pisanvs/mistralocr-cli

Local / per-project

npm install @pisanvs/mistralocr-cli
npx mistralocr <file> [options]

API key

Set your Mistral API key as an environment variable (recommended):

export MISTRAL_API_KEY="your-api-key-here"

Or pass it inline with --api-key on every command.
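
The flag table in the CLI Reference says that --api-key overrides the MISTRAL_API_KEY environment variable. A minimal sketch of that precedence follows; the function name resolveApiKey and the error message are hypothetical and are not taken from the tool's source.

```typescript
// Hypothetical helper: illustrates the documented precedence only.
function resolveApiKey(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): string {
  // An explicit --api-key value wins; otherwise fall back to the env var.
  const key = flagValue || env["MISTRAL_API_KEY"];
  if (!key) {
    throw new Error("No API key: set MISTRAL_API_KEY or pass --api-key");
  }
  return key;
}
```

In a Node.js script this could be called as resolveApiKey(flagValue, process.env).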

Quick Start

# Set your API key
export MISTRAL_API_KEY="your-api-key-here"

# Convert a PDF and print Markdown to the terminal
mistralocr report.pdf

# Save to a file instead
mistralocr report.pdf --output report.md

Usage

mistralocr [file] [options]

CLI Reference

| Flag | Description | Default |
|---|---|---|
| [file] | Path to the document or image file to process | (required, except with --clear-cache or --version) |
| -k, --api-key <key> | Mistral API key (overrides MISTRAL_API_KEY env var) | (none) |
| -o, --output <file> | Write Markdown output to a file instead of stdout | stdout |
| -m, --model <model> | OCR model to use | mistral-ocr-latest |
| -p, --pages <range> | Page range to process (see Page Ranges) | all pages |
| --include-images | Embed extracted images as base64 data URIs in the Markdown | false |
| --extract-images <dir> | Save extracted images to <dir>; Markdown will reference them relatively | (none) |
| --extract-tables <dir> | Save each detected table as a separate .md file in <dir> | (none) |
| --bbox-annotation <json> | JSON schema for bounding-box annotation format | (none) |
| --document-annotation <json> | JSON schema for document-level annotation format | (none) |
| --no-cache | Bypass cache and always call the API | false |
| --clear-cache | Delete the cache directory and exit | (none) |
| --cache-dir <dir> | Custom cache directory path | .mistralocr-cache |
| --chunk-size <n> | Pages per API call; set to 0 to disable chunking | 50 |
| --max-retries <n> | Maximum retry attempts per API call | 3 |
| --retry-delay <ms> | Initial retry back-off in milliseconds | 1000 |
| -v, --verbose | Show detailed progress information | false |
| -V, --version | Print version number and exit | (none) |

Examples

Basic OCR

Convert a PDF and stream Markdown to your terminal:

mistralocr scan.pdf

Convert a JPEG image:

mistralocr photo.jpg

Save Output to a File

mistralocr report.pdf --output report.md

Process Specific Pages

Process only pages 1–5:

mistralocr book.pdf --pages "1-5" --output chapter1.md

Process pages 1, 3, and 7–10:

mistralocr book.pdf --pages "1,3,7-10" --output selection.md

Extract Images to a Folder

Images are saved to ./images/ and the Markdown contains relative links to them:

mistralocr report.pdf --extract-images ./images --output report.md

Files are named <basename>-p<page>-img<n>.<ext> (e.g. report-p1-img1.png).

Embed Images as Base64

All images are inlined as data URIs — useful for fully self-contained documents:

mistralocr invoice.pdf --include-images --output invoice.md

Extract Tables to Separate Files

Each detected table is saved as its own .md file:

mistralocr data.pdf --extract-tables ./tables --output data.md

Table files are named <basename>-p<page>-table<n>.md.
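
The naming patterns for extracted images and tables can be sketched as two small helpers. The function names are hypothetical, and deriving the basename by stripping the directory and the last extension is an assumption based on the report.pdf example above.

```typescript
// Assumption: <basename> is the input file name without directory or extension.
function baseNameOf(filePath: string): string {
  const file = filePath.split(/[\\/]/).pop() ?? filePath;
  const dot = file.lastIndexOf(".");
  return dot > 0 ? file.slice(0, dot) : file;
}

// <basename>-p<page>-img<n>.<ext>
function imageFileName(basename: string, page: number, n: number, ext: string): string {
  return `${basename}-p${page}-img${n}.${ext}`;
}

// <basename>-p<page>-table<n>.md
function tableFileName(basename: string, page: number, n: number): string {
  return `${basename}-p${page}-table${n}.md`;
}
```

For example, imageFileName(baseNameOf("docs/report.pdf"), 1, 1, "png") yields report-p1-img1.png, matching the documented example.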

Skip Cache / Clear Cache

Force a fresh API call (ignore any cached result):

mistralocr report.pdf --no-cache

Delete all cached results:

mistralocr --clear-cache

Use a custom cache location:

mistralocr report.pdf --cache-dir /tmp/my-cache

Auto-Convert Legacy Office Files

If LibreOffice is installed, .doc, .ppt, .xls, and similar files are automatically converted before OCR:

mistralocr legacy-document.doc --output result.md

Tune Chunking and Retries

Process 100 pages at a time instead of the default 50:

mistralocr large-book.pdf --chunk-size 100

Disable chunking entirely:

mistralocr small.pdf --chunk-size 0
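
To make the effect of --chunk-size concrete, here is a minimal sketch of how a page count and a chunk size could map to per-call page ranges. The names planChunks and PageChunk are hypothetical, and the tool's actual splitting logic may differ.

```typescript
// 1-indexed, inclusive page range handled by one API call.
type PageChunk = { start: number; end: number };

// Hypothetical planner: not taken from the tool's source.
function planChunks(totalPages: number, chunkSize: number): PageChunk[] {
  if (totalPages < 1) return [];
  // Per the flag table, 0 disables chunking: one range covers every page.
  if (chunkSize <= 0 || chunkSize >= totalPages) {
    return [{ start: 1, end: totalPages }];
  }
  const chunks: PageChunk[] = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize - 1, totalPages) });
  }
  return chunks;
}
```

Under these assumptions, a 120-page PDF with the default size of 50 gives three ranges (1 to 50, 51 to 100, 101 to 120), while a size of 0 gives a single range of 1 to 120.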

Increase retry attempts and set a longer initial delay for slow connections:

mistralocr report.pdf --max-retries 5 --retry-delay 2000
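
The interaction between --max-retries and --retry-delay can be sketched as follows. This assumes the delay doubles after each failed attempt, which the documentation does not state; it only calls --retry-delay the initial back-off. The names backoffDelays and withRetry are hypothetical.

```typescript
// Assumption: exponential back-off that doubles on every retry.
function backoffDelays(maxRetries: number, initialDelayMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) => initialDelayMs * 2 ** i);
}

// Hypothetical wrapper. This sketch retries on any error; a real client
// would retry only rate limits and transient network failures.
async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  initialDelayMs = 1000,
): Promise<T> {
  const delays = backoffDelays(maxRetries, initialDelayMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= delays.length) throw err;
      const waitMs = delays[attempt];
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

With the defaults (3 retries, 1000 ms) and the doubling assumption, the waits would be 1000, 2000, and 4000 ms before the call finally fails.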

Verbose Output

See real-time progress spinners and detailed logs:

mistralocr report.pdf --verbose --output report.md

Supported Formats

Documents

| Format | Extension(s) |
|---|---|
| PDF | .pdf |
| Word | .docx, .doc (auto-converted via LibreOffice) |
| PowerPoint | .pptx, .ppt (auto-converted via LibreOffice) |
| EPUB | .epub |
| RTF | .rtf |
| OpenDocument Text | .odt |
| LaTeX | .tex |
| Jupyter Notebook | .ipynb |
| BibTeX | .bib |
| FictionBook | .fb2 |
| OPML | .opml |
| XML (DocBook/JATS) | .xml |
| Troff/Man | .1, .man |

Images

| Format | Extension(s) |
|---|---|
| JPEG | .jpg, .jpeg |
| PNG | .png |
| WebP | .webp |
| GIF | .gif |
| TIFF | .tiff, .tif |
| BMP | .bmp |
| AVIF | .avif |
| HEIC/HEIF | .heic, .heif |

Caching

Results are cached locally, so repeated runs on the same file with the same options return the cached result without calling the API.

Location: .mistralocr-cache/ in the current directory (override with --cache-dir)
Key: SHA-256 hash of the file contents + a hash of the relevant options (model, page range, image settings)
Invalidation: The cache is automatically invalidated when the file changes or any option that affects the output changes
Bypass: Use --no-cache to force a fresh API call
Clear: Use --clear-cache to delete all cached results
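
The key scheme described above (a SHA-256 hash of the file contents combined with a hash of the relevant options) can be sketched like this. The CacheOptions shape, the function names, the canonical JSON serialization, and the way the two hashes are joined are all assumptions; the tool's real key format is not documented here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical option shape; the real set of hashed options may differ.
interface CacheOptions {
  model: string;
  pages?: string;
  includeImages?: boolean;
}

function sha256Hex(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hypothetical key builder: content hash plus options hash.
function cacheKey(fileContents: Uint8Array, options: CacheOptions): string {
  // Serialize with sorted keys so property order does not change the key.
  const sortedKeys = Object.keys(options).sort();
  const canonical = JSON.stringify(options, sortedKeys);
  return `${sha256Hex(fileContents)}-${sha256Hex(canonical)}`;
}
```

Because both the file bytes and the options feed the key, editing the file or changing an option such as the model produces a different key, which is how the invalidation rule above falls out of the design.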

Page Ranges

The --pages option accepts 1-indexed page numbers in several formats:

| Format | Example | Pages processed |
|---|---|---|
| Single page | 5 | 5 |
| Range | 1-5 | 1, 2, 3, 4, 5 |
| List | 1,3,5 | 1, 3, 5 |
| Mixed | 1-3,7,10-12 | 1, 2, 3, 7, 10, 11, 12 |
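
The accepted formats can be illustrated with a small parser. This is a sketch, not the tool's implementation: the function name parsePageRanges is hypothetical, and tolerating whitespace, removing duplicates, sorting the result, and rejecting reversed ranges are assumptions.

```typescript
// Hypothetical parser for specs such as "5", "1-5", "1,3,5", "1-3,7,10-12".
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    // Either a single number or "start-end", with optional whitespace.
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part);
    if (!match) throw new Error(`Invalid page range: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    // Pages are 1-indexed, and a range must not run backwards.
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

For the mixed example in the table, parsePageRanges("1-3,7,10-12") returns [1, 2, 3, 7, 10, 11, 12].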

Development

# Clone the repository
git clone https://github.com/pisanvs/mistralocr-cli.git
cd mistralocr-cli

# Install dependencies
npm install

# Run in development mode (TypeScript, no build step)
npm run dev -- report.pdf

# Build to dist/
npm run build

# Type-check without building
npm run typecheck

License

MIT