just-bash-ocr

v1.0.0

Published

2 months ago

just-bash plugin: OCR for scanned documents, PDFs, invoices, and receipts. Tesseract.js engine, multi-language, offline.

Downloads

0High
0Medium
0Low

rckflr

just-bash ocr tesseract pdf invoice receipt document-scanning agent-shell

just-bash-ocr

A just-bash plugin for OCR (optical character recognition). Extract text from scanned documents, PDFs, invoices, and receipts. Powered by Tesseract.js — works offline, supports 100+ languages.

Install

npm install just-bash-ocr

Quick Start

import { Bash, InMemoryFs } from "just-bash";
import { createOcrPlugin } from "just-bash-ocr";
import { readFileSync } from "fs";

const bash = new Bash({
  fs: new InMemoryFs({
    "/docs/invoice.png": readFileSync("scanned-invoice.png"),
    "/docs/contract.pdf": readFileSync("contract.pdf"),
  }),
  customCommands: createOcrPlugin({ language: "eng" }),
});

// Extract text from a scanned image
const r = await bash.exec(`ocr extract /docs/invoice.png`);
// → {"text":"INVOICE #12345\nTotal: $1,234.56","confidence":92.5,"language":"eng"}

// Extract text from a PDF
const r2 = await bash.exec(`ocr pdf /docs/contract.pdf`);
// → {"text":"...","pages":3,"method":"text-layer","perPage":[...]}

// Detailed scan with bounding boxes
const r3 = await bash.exec(`ocr scan /docs/invoice.png`);
// → {"text":"...","blocks":[{"text":"INVOICE","bbox":{"x0":30,"y0":10,...}},...],"words":42}

Command Reference

`ocr extract <path>` — Basic text extraction

ocr extract /docs/scan.png [--lang=eng]
ocr extract /docs/scan.jpg --lang=eng+spa   # multi-language

Returns { text, confidence, language }. Confidence is 0-100 (Tesseract's estimate of accuracy).

Supports PNG, JPG, TIFF, BMP, and WebP. If a PDF path is given, automatically routes to the PDF handler.

`ocr scan <path>` — Detailed extraction with layout

ocr scan /docs/invoice.png [--lang=eng]

Returns { text, confidence, blocks[], lines, words, language }. Each block includes:

text — the extracted text
confidence — per-block confidence
bbox — bounding box { x0, y0, x1, y1 } in pixels
type — always "text" (future: "table", "image")

Useful for understanding document layout and extracting specific regions.

`ocr pdf <path>` — PDF text extraction

ocr pdf /docs/report.pdf [--lang=eng] [--pages=1-3] [--force-ocr]

Three modes:

Text-layer PDFs: extracts text directly (fast, no OCR needed). Returns method: "text-layer".
Hybrid PDFs: some pages have text, some don't. Returns method: "hybrid" with per-page info.
Scanned PDFs: fully image-based. Currently returns an error with guidance to convert pages to images first.

Flags:

--pages=1-3,5 — extract only specific pages
--force-ocr — skip text-layer extraction, always use OCR
--lang=eng+spa — OCR language(s)

`ocr languages` — Available languages

ocr languages

Lists common languages with codes. Use codes with --lang:

eng — English
spa — Spanish
eng+spa — Multi-language (slower but handles mixed documents)

Full list: Tesseract Data Files

Integration with the just-bash Ecosystem

Combine with other plugins for end-to-end document processing:

import { createOcrPlugin } from "just-bash-ocr";
import { createReportPlugin } from "just-bash-report";
import { createWikiPlugin } from "just-bash-wiki";

const bash = new Bash({
  fs: new InMemoryFs({ "/docs/invoice.png": imageBuffer }),
  customCommands: [
    ...createOcrPlugin({ language: "spa" }),
    ...createReportPlugin({ rootDir: "/data" }),
    ...createWikiPlugin({ rootDir: "/wiki" }),
  ],
});

// 1. OCR the invoice
const ocr = await bash.exec(`ocr extract /docs/invoice.png`);
const text = JSON.parse(ocr.stdout).text;

// 2. Agent parses the text and stores structured data
await bash.exec(`db invoices insert '${JSON.stringify({
  vendor: "Acme Corp",
  total: 1234.56,
  date: "2026-05-02",
  raw_text: text,
})}'`);

// 3. Generate a report
await bash.exec(`report auto invoices --title="Invoices Dashboard"`);

Options

createOcrPlugin({
  language: "eng",     // Default language (default: "eng")
  dataPath: "/path",   // Tesseract trained data path (default: auto-download)
});

How It Works

Image/PDF is read from just-bash's virtual filesystem as a buffer
For PDFs: pdf-parse extracts the text layer first
For images (or scanned PDFs): Tesseract.js runs OCR via WASM
Results returned as JSON to the LLM agent

Tesseract trained data (~15MB for English) is downloaded automatically on first use and cached.

Ecosystem

| Package | Description | |---------|-------------| | just-bash-ocr | OCR for scanned documents (this package) | | just-bash-data | db + vec commands | | just-bash-wiki | LLM-maintained knowledge base | | just-bash-report | Dashboards, invoices, static sites |

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

just-bash-ocr

Install

Quick Start

Command Reference

ocr extract <path> — Basic text extraction

ocr scan <path> — Detailed extraction with layout

ocr pdf <path> — PDF text extraction

ocr languages — Available languages

Integration with the just-bash Ecosystem

Options

How It Works

Ecosystem

License

`ocr extract <path>` — Basic text extraction

`ocr scan <path>` — Detailed extraction with layout

`ocr pdf <path>` — PDF text extraction

`ocr languages` — Available languages