macos-vision

v1.4.0

Apple Vision OCR + image/PDF analysis for Node.js, with optional Ollama-driven Markdown pipeline — native, fast, offline

macos-vision

Apple Vision for Node.js — native, fast, offline. Now with an optional Ollama-driven Markdown pipeline.

Uses macOS's built-in Vision framework via a compiled Swift binary. Works completely offline. No cloud services, no API keys, no Python, zero runtime dependencies.

Requirements

  • macOS 12+
  • Node.js 18+
  • Xcode Command Line Tools (xcode-select --install)
  • Ollama running locally — only if you use the Markdown pipeline

Installation

npm install macos-vision

The native Swift binaries (vision-helper, pdf-helper) are compiled automatically on install.

What you get

| Capability | Engine | Network |
|---|---|---|
| OCR (text + bounding boxes) | Apple Vision | offline |
| Face / barcode / rectangle / document detection | Apple Vision | offline |
| Image classification | Apple Vision | offline |
| Layout inference (lines, paragraphs, reading order) | heuristic in TypeScript | offline |
| PDF rasterization | PDFKit (pdf-helper) | offline |
| Image / PDF → Markdown | Apple Vision OCR + local LLM via Ollama | local LLM call |


CLI

# OCR — plain text (default)
npx macos-vision photo.jpg

# Structured OCR blocks with bounding boxes
npx macos-vision --blocks photo.jpg

# Detections
npx macos-vision --faces photo.jpg
npx macos-vision --barcodes photo.jpg
npx macos-vision --rectangles photo.jpg
npx macos-vision --document photo.jpg
npx macos-vision --classify photo.jpg

# Run all detections at once
npx macos-vision --all photo.jpg

# Image / PDF → Markdown via VisionScribe + Ollama
npx macos-vision --markdown invoice.pdf -o notes.md
npx macos-vision --markdown receipt.jpg --stdout
npx macos-vision --markdown scan.png --model llama3.2

Multiple Vision flags can be combined: npx macos-vision --blocks --faces --classify photo.jpg. Structured results are printed as JSON to stdout.

CLI flags

| Flag | Description |
|---|---|
| --ocr | Plain text OCR (default when no flag is given) |
| --blocks | OCR with bounding boxes (JSON) |
| --faces / --barcodes / --rectangles / --document / --classify | Vision detections (JSON) |
| --all | Run every Vision detection at once |
| --markdown | Convert image / PDF to Markdown via VisionScribe + Ollama |
| --model <name> | Ollama model (default: mistral-nemo). Only used with --markdown |
| --ollama-url <url> | Ollama base URL (default: http://localhost:11434). Only used with --markdown |
| -o, --output <path> | Write Markdown to a file. Only used with --markdown |
| --stdout | Print Markdown to stdout instead of a file. Only used with --markdown |
| --help | Show usage |


API — Vision

import {
  ocr,
  detectFaces,
  detectBarcodes,
  detectRectangles,
  detectDocument,
  classify,
  inferLayout,
} from 'macos-vision';

// OCR — plain text
const text = await ocr('photo.jpg');

// OCR — structured blocks with bounding boxes
const blocks = await ocr('photo.jpg', { format: 'blocks' });

// Detect faces / barcodes / rectangles / document boundary
const faces = await detectFaces('photo.jpg');
const codes = await detectBarcodes('invoice.jpg');
const rects = await detectRectangles('document.jpg');
const doc = await detectDocument('photo.jpg'); // DocumentBounds | null

// Classify image content
const labels = await classify('photo.jpg');

// Layout inference — unified reading-order-sorted representation
const layout = inferLayout({ textBlocks: blocks, faces, barcodes: codes });

Layout inference

inferLayout merges raw Vision results into a unified LayoutBlock[] sorted in reading order (top-to-bottom, left-to-right). Text blocks are grouped into lines and paragraphs using geometric heuristics.

import { ocr, detectFaces, detectBarcodes, inferLayout } from 'macos-vision';

const blocks   = await ocr('page.png', { format: 'blocks' });
const faces    = await detectFaces('page.png');
const barcodes = await detectBarcodes('page.png');

const layout = inferLayout({ textBlocks: blocks, faces, barcodes });

for (const block of layout) {
  if (block.kind === 'text') {
    console.log(`[p${block.paragraphId} l${block.lineId}] ${block.text}`);
  } else {
    console.log(`[${block.kind}] at (${block.x.toFixed(2)}, ${block.y.toFixed(2)})`);
  }
}

LayoutBlock is a discriminated union — use block.kind to narrow the type:

| kind | Extra fields |
|--------|-------------|
| 'text' | text, lineId, paragraphId |
| 'barcode' | value, type |
| 'face' | — |
| 'rectangle' | — |
| 'document' | — |
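
As a rough illustration of the narrowing described above, the union could be declared along these lines. This is a sketch reconstructed from the table and from the fields used in the loop (kind, x, y). The names LayoutBlockSketch and describeBlock are hypothetical, the numeric types of lineId and paragraphId are assumed, and the real declarations in src/index.ts may differ.

```typescript
// Hypothetical mirror of LayoutBlock, reconstructed from the table above.
// The package's real declarations may carry more fields.
interface BaseBlock {
  x: number; // normalized 0–1
  y: number; // normalized 0–1
}

interface TextBlockSketch extends BaseBlock {
  kind: 'text';
  text: string;
  lineId: number;      // assumed to be numeric
  paragraphId: number; // assumed to be numeric
}

interface BarcodeBlockSketch extends BaseBlock {
  kind: 'barcode';
  value: string;
  type: string;
}

interface ShapeBlockSketch extends BaseBlock {
  kind: 'face' | 'rectangle' | 'document';
}

type LayoutBlockSketch = TextBlockSketch | BarcodeBlockSketch | ShapeBlockSketch;

// Hypothetical helper: the switch narrows on `kind`, so each branch only
// sees the fields that exist for that variant.
function describeBlock(block: LayoutBlockSketch): string {
  switch (block.kind) {
    case 'text':
      return `text p${block.paragraphId} l${block.lineId}: ${block.text}`;
    case 'barcode':
      return `barcode ${block.type}: ${block.value}`;
    default:
      return `${block.kind} at (${block.x.toFixed(2)}, ${block.y.toFixed(2)})`;
  }
}
```

Accessing block.text outside the 'text' branch would be a compile-time error, which is the main benefit of checking kind first.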

Note: Layout inference is a heuristic layer. It does not understand multi-column layouts or rotated text. Treat it as structured input for downstream tools, not as ground truth.
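
To give a feel for what a geometric heuristic of this kind can look like, here is a minimal sketch that groups boxes into lines by vertical position and then reads each line left to right. It is an illustration only, not the algorithm inferLayout uses; the Box and LineSketch types, the groupIntoLines name, and the half-height tolerance are all assumptions.

```typescript
interface Box {
  text: string;
  x: number;      // 0–1 from left
  y: number;      // 0–1 from top
  width: number;  // 0–1
  height: number; // 0–1
}

interface LineSketch {
  centre: number;    // vertical centre of the first box placed on the line
  tolerance: number; // how far another box's centre may deviate
  items: Box[];
}

// Illustrative heuristic only: a box joins an existing line when its
// vertical centre lies within half the height of that line's first box.
function groupIntoLines(blocks: Box[]): Box[][] {
  const lines: LineSketch[] = [];
  const topToBottom = [...blocks].sort((a, b) => a.y - b.y);
  for (const block of topToBottom) {
    const centre = block.y + block.height / 2;
    const line = lines.find((l) => Math.abs(l.centre - centre) <= l.tolerance);
    if (line) {
      line.items.push(block);
    } else {
      lines.push({ centre, tolerance: block.height / 2, items: [block] });
    }
  }
  // Lines were created top to bottom; within each line, read left to right.
  return lines.map((l) => [...l.items].sort((a, b) => a.x - b.x));
}
```

A grouping like this also shows why multi-column pages are hard for purely geometric approaches: two columns at the same height end up on one line.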


API — Markdown pipeline (VisionScribe)

VisionScribe converts an image or PDF to Markdown by combining Apple Vision OCR with a local LLM (via Ollama). The LLM never sees the image — it only formats text that Vision already extracted. This keeps image processing local and reduces the risk of vision-model hallucinations, but Markdown reconstruction is still best-effort and depends on the local model and document complexity.

Prerequisites

brew install ollama
ollama serve            # keep this running
ollama pull mistral-nemo

Quick start

import { VisionScribe } from 'macos-vision';

const scribe = new VisionScribe();
const markdown = await scribe.toMarkdown('receipt.png');
console.log(markdown);

For a narrower import surface that pulls in only the markdown sub-module:

import { VisionScribe } from 'macos-vision/markdown';

How it works

Image / PDF
  │
  ▼
Apple Vision OCR          ← macOS native text extraction
  │  VisionBlock[] per page
  ▼
Per-page layout inference ← each page processed independently (page-local coords)
  │  paragraphId, lineId, y
  ▼
Chunker                   ← batches paragraphs to fit the LLM output window
  │  ParagraphGroup[][]
  ▼
Ollama /api/chat          ← system prompt as role:"system", OCR text as role:"user"
  │  temperature=0, top_p=1, num_predict=-1
  ▼
Markdown string           ← chunk results joined with blank lines

The system prompt asks the model to preserve the source text, avoid summarising, and avoid adding content. OCR text is wrapped in <ocr_source> tags so the model is less likely to treat document text as user instructions. Each page is processed separately, so paragraph coordinates from different pages are never mixed.
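
The request sent to Ollama can be pictured roughly as follows. This is a sketch built from the parameters in the diagram above, not the package's code: buildChatRequest is a hypothetical helper, the system prompt wording is a placeholder, and stream: false is an assumption made to keep the example simple.

```typescript
interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

interface ChatRequest {
  model: string;
  stream: boolean;
  messages: ChatMessage[];
  options: { temperature: number; top_p: number; num_predict: number };
}

// Placeholder wording; the package's actual system prompt is not reproduced here.
const SYSTEM_PROMPT =
  'Format the text inside <ocr_source> as Markdown. Preserve the wording, do not summarise, do not add content.';

// Builds the JSON body for a POST to {ollamaUrl}/api/chat.
function buildChatRequest(ocrText: string, model = 'mistral-nemo'): ChatRequest {
  return {
    model,
    stream: false, // assumption: ask for one JSON response instead of a stream
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      // Wrapping the OCR text marks it as data rather than instructions.
      { role: 'user', content: `<ocr_source>\n${ocrText}\n</ocr_source>` },
    ],
    options: { temperature: 0, top_p: 1, num_predict: -1 },
  };
}
```

With stream set to false, Ollama's chat endpoint returns a single JSON object whose message.content field holds the model's reply.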

new VisionScribe(options?)

| Option | Type | Default | Description |
|---|---|---|---|
| model | string | 'mistral-nemo' | Ollama model name |
| ollamaUrl | string | 'http://localhost:11434' | Base URL of the Ollama server |
| skipPing | boolean | false | Skip per-call Ollama health check (useful in batch loops) |
| chunkSizeTokens | number | 1800 | Max estimated output tokens per LLM chunk. Lower = more chunks (safer for small models); higher = fewer calls but risks hitting model output limits |
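
To make the chunkSizeTokens trade-off concrete, here is a minimal greedy chunker. It is a sketch under stated assumptions, not the package's implementation: ParagraphSketch, estimateTokens, and chunkParagraphs are hypothetical names, and the four-characters-per-token estimate is a common rule of thumb rather than anything documented for this package.

```typescript
interface ParagraphSketch {
  paragraphId: number;
  text: string;
}

// Rough assumption: about four characters per token. The package's own
// estimator may work differently.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily packs paragraphs into chunks whose estimated size stays within
// the budget. A single oversized paragraph still gets a chunk of its own.
function chunkParagraphs(
  paragraphs: ParagraphSketch[],
  chunkSizeTokens = 1800,
): ParagraphSketch[][] {
  const chunks: ParagraphSketch[][] = [];
  let current: ParagraphSketch[] = [];
  let used = 0;
  for (const paragraph of paragraphs) {
    const cost = estimateTokens(paragraph.text);
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current.length > 0 && used + cost > chunkSizeTokens) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(paragraph);
    used += cost;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Lowering the budget produces more, smaller chunks and therefore more LLM calls; raising it does the opposite, which matches the trade-off described in the table.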

scribe.toMarkdown(imagePath)

  • Accepts PNG, JPEG, HEIC, HEIF, TIFF, GIF, BMP, WebP and PDF
  • Returns an empty string '' if no text is detected
  • Throws OllamaUnavailableError if the Ollama server is not reachable (unless skipPing: true)

Batch processing

import { VisionScribe, OllamaUnavailableError } from 'macos-vision';

const scribe = new VisionScribe({ skipPing: true });

for (const file of files) {
  try {
    const md = await scribe.toMarkdown(file);
    // …
  } catch (e) {
    if (e instanceof OllamaUnavailableError) {
      console.error(e.message);
      break;
    }
    throw e;
  }
}

Known limitations

  • Local model fidelity: small models (mistral-nemo, gemma) may occasionally summarise or paraphrase long, dense documents. Larger models (llama3.1:70b, qwen2.5:32b) produce significantly better fidelity.
  • Tables: multi-column table layouts are partially supported. OCR reads cells in reading order but the LLM may not always reconstruct correct Markdown table syntax.
  • Images / charts: non-textual content (photos, diagrams, charts) is ignored — only text blocks extracted by Apple Vision are processed.
  • Markdown fidelity: the prompt strongly asks for faithful reconstruction, but LLM output is not a cryptographic or deterministic guarantee. Review important legal, financial, or compliance documents before relying on the generated Markdown.

Migrating from macos-vision-md

The standalone macos-vision-md package has been merged into macos-vision as of v2.0.0. The old package will keep working as a thin re-export shim, but new projects should depend on macos-vision directly.

- import { VisionScribe } from 'macos-vision-md';
+ import { VisionScribe } from 'macos-vision';

- macos-vision-md invoice.pdf -o notes.md
+ macos-vision --markdown invoice.pdf -o notes.md

The VisionScribe API, the system prompt, and the chunking strategy are unchanged. OllamaUnavailableError, VisionScribeOptions, and ParagraphGroup are now exported from macos-vision.


API reference — types

ocr(imagePath, options?)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| imagePath | string | — | Path to image (PNG, JPG, JPEG, WEBP) or PDF |
| options.format | 'text' \| 'blocks' | 'text' | Plain text or structured blocks with coordinates |
| options.startPage | number | 1 | PDFs only — first page to OCR, 1-based. Ignored for images. |
| options.maxPages | number | all | PDFs only — maximum number of pages to OCR. Ignored for images. |

Returns Promise<string> or Promise<VisionBlock[]>.

interface VisionBlock {
  text: string
  x: number       // 0–1 from left
  y: number       // 0–1 from top
  width: number   // 0–1
  height: number  // 0–1
  confidence: number
  page?: number   // 0-based, only for PDFs
}
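
Because the coordinates are normalized, drawing a box over the source image needs a conversion to pixels. The following is a minimal sketch, assuming the origin is the top-left corner as the comments above state; NormalizedBox, PixelRect, and toPixelRect are hypothetical names, and the image dimensions have to come from elsewhere.

```typescript
interface NormalizedBox {
  x: number;      // 0–1 from left
  y: number;      // 0–1 from top
  width: number;  // 0–1
  height: number; // 0–1
}

interface PixelRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Scales a normalized box to whole pixels for an image of the given size.
function toPixelRect(box: NormalizedBox, imageWidth: number, imageHeight: number): PixelRect {
  return {
    left: Math.round(box.x * imageWidth),
    top: Math.round(box.y * imageHeight),
    width: Math.round(box.width * imageWidth),
    height: Math.round(box.height * imageHeight),
  };
}
```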

PDF page range

Both ocr() and rasterizePdf() accept startPage (1-based) and maxPages to process a subset of pages — useful when the caller only needs a preview, the first few pages, or a specific section of a long document.

import { ocr, rasterizePdf } from 'macos-vision';

// First two pages only
const headText = await ocr('report.pdf', { startPage: 1, maxPages: 2 });

// Page 5 only, as structured blocks
const blocks = await ocr('report.pdf', { format: 'blocks', startPage: 5, maxPages: 1 });

// Rasterize a range without OCR
const { pages } = await rasterizePdf('report.pdf', { startPage: 1, maxPages: 2 });

From the CLI:

macos-vision --start-page 1 --max-pages 2 report.pdf
macos-vision --blocks --start-page 5 --max-pages 1 report.pdf

Notes:

  • Values must be integers >= 1; values that break this rule throw RangeError (JS) or exit non-zero (CLI).
  • A startPage past the end of the document is not treated as out of range: it returns an empty result, not an error.
  • VisionBlock.page and PdfPage.page in the response are still 0-based (legacy behaviour).
  • For non-PDF inputs, both options are silently ignored.
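
The mix of 1-based options and 0-based results is easy to trip over, so here is a small sketch of the rules above. resolvePageRange is a hypothetical helper, not part of the package, and treating non-integer values as a RangeError is an assumption drawn from the first note.

```typescript
// Hypothetical helper: turns the documented 1-based startPage / maxPages
// options into the 0-based page indexes that appear in results.
function resolvePageRange(pageCount: number, startPage = 1, maxPages?: number): number[] {
  for (const value of [startPage, maxPages]) {
    if (value !== undefined && (!Number.isInteger(value) || value < 1)) {
      throw new RangeError(`page options must be integers >= 1, got ${value}`);
    }
  }
  const first = startPage - 1; // 1-based option -> 0-based index
  if (first >= pageCount) return []; // past the end: empty result, not an error
  const last = Math.min(pageCount, first + (maxPages ?? pageCount));
  const indexes: number[] = [];
  for (let page = first; page < last; page++) indexes.push(page);
  return indexes;
}
```

For example, startPage: 5 with maxPages: 1 on a ten-page PDF resolves to the single index 4, which is the value a result's page field would carry under the 0-based convention.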

detectFaces(imagePath) / detectBarcodes(imagePath) / detectRectangles(imagePath) / detectDocument(imagePath) / classify(imagePath)

See src/index.ts for full type declarations.


Why macos-vision?

| | macos-vision | Tesseract.js | Cloud APIs |
|---|---|---|---|
| Offline OCR | ✅ | ✅ | ❌ |
| Offline image → Markdown | ✅ (with local Ollama) | ❌ | ❌ |
| No API key | ✅ | ✅ | ❌ |
| Native speed | ✅ | ❌ | — |
| Zero runtime deps | ✅ | ❌ | ❌ |
| OCR with bounding boxes | ✅ | ✅ | ✅ |
| Face / barcode / document detection | ✅ | ❌ | ✅ |
| Image classification | ✅ | ❌ | ✅ |
| macOS only | ✅ | ❌ | ❌ |

Apple Vision is the same engine used by macOS Spotlight, Live Text, and Shortcuts — highly optimized and accurate.

OCR evaluation notes

In internal tests on anonymized scanned contracts, forms, declarations, and UI screenshots, Apple Vision OCR produced fewer OCR artifacts than Tesseract in most cases. The strongest gains were on multi-column contract-style scans, where Apple Vision preserved substantially more usable text with far fewer artifacts. On simpler UI screenshots, both engines performed similarly.

These results are directional rather than a public benchmark suite. The corpus is not included in this repository, and future benchmark fixtures should use synthetic or public-domain documents only.

License

MIT