just-bash-ocr
v1.0.0
Published
just-bash plugin: OCR for scanned documents, PDFs, invoices, and receipts. Tesseract.js engine, multi-language, offline.
Maintainers
Readme
just-bash-ocr
A just-bash plugin for OCR (optical character recognition). Extract text from scanned documents, PDFs, invoices, and receipts. Powered by Tesseract.js — works offline, supports 100+ languages.
Install
npm install just-bash-ocrQuick Start
import { Bash, InMemoryFs } from "just-bash";
import { createOcrPlugin } from "just-bash-ocr";
import { readFileSync } from "fs";
const bash = new Bash({
fs: new InMemoryFs({
"/docs/invoice.png": readFileSync("scanned-invoice.png"),
"/docs/contract.pdf": readFileSync("contract.pdf"),
}),
customCommands: createOcrPlugin({ language: "eng" }),
});
// Extract text from a scanned image
const r = await bash.exec(`ocr extract /docs/invoice.png`);
// → {"text":"INVOICE #12345\nTotal: $1,234.56","confidence":92.5,"language":"eng"}
// Extract text from a PDF
const r2 = await bash.exec(`ocr pdf /docs/contract.pdf`);
// → {"text":"...","pages":3,"method":"text-layer","perPage":[...]}
// Detailed scan with bounding boxes
const r3 = await bash.exec(`ocr scan /docs/invoice.png`);
// → {"text":"...","blocks":[{"text":"INVOICE","bbox":{"x0":30,"y0":10,...}},...],"words":42}Command Reference
ocr extract <path> — Basic text extraction
ocr extract /docs/scan.png [--lang=eng]
ocr extract /docs/scan.jpg --lang=eng+spa # multi-languageReturns { text, confidence, language }. Confidence is 0-100 (Tesseract's estimate of accuracy).
Supports PNG, JPG, TIFF, BMP, and WebP. If a PDF path is given, automatically routes to the PDF handler.
ocr scan <path> — Detailed extraction with layout
ocr scan /docs/invoice.png [--lang=eng]Returns { text, confidence, blocks[], lines, words, language }. Each block includes:
text— the extracted textconfidence— per-block confidencebbox— bounding box{ x0, y0, x1, y1 }in pixelstype— always"text"(future:"table","image")
Useful for understanding document layout and extracting specific regions.
ocr pdf <path> — PDF text extraction
ocr pdf /docs/report.pdf [--lang=eng] [--pages=1-3] [--force-ocr]Three modes:
- Text-layer PDFs: extracts text directly (fast, no OCR needed). Returns
method: "text-layer". - Hybrid PDFs: some pages have text, some don't. Returns
method: "hybrid"with per-page info. - Scanned PDFs: fully image-based. Currently returns an error with guidance to convert pages to images first.
Flags:
--pages=1-3,5— extract only specific pages--force-ocr— skip text-layer extraction, always use OCR--lang=eng+spa— OCR language(s)
ocr languages — Available languages
ocr languagesLists common languages with codes. Use codes with --lang:
eng— Englishspa— Spanisheng+spa— Multi-language (slower but handles mixed documents)
Full list: Tesseract Data Files
Integration with the just-bash Ecosystem
Combine with other plugins for end-to-end document processing:
import { createOcrPlugin } from "just-bash-ocr";
import { createReportPlugin } from "just-bash-report";
import { createWikiPlugin } from "just-bash-wiki";
const bash = new Bash({
fs: new InMemoryFs({ "/docs/invoice.png": imageBuffer }),
customCommands: [
...createOcrPlugin({ language: "spa" }),
...createReportPlugin({ rootDir: "/data" }),
...createWikiPlugin({ rootDir: "/wiki" }),
],
});
// 1. OCR the invoice
const ocr = await bash.exec(`ocr extract /docs/invoice.png`);
const text = JSON.parse(ocr.stdout).text;
// 2. Agent parses the text and stores structured data
await bash.exec(`db invoices insert '${JSON.stringify({
vendor: "Acme Corp",
total: 1234.56,
date: "2026-05-02",
raw_text: text,
})}'`);
// 3. Generate a report
await bash.exec(`report auto invoices --title="Invoices Dashboard"`);Options
createOcrPlugin({
language: "eng", // Default language (default: "eng")
dataPath: "/path", // Tesseract trained data path (default: auto-download)
});How It Works
- Image/PDF is read from just-bash's virtual filesystem as a buffer
- For PDFs:
pdf-parseextracts the text layer first - For images (or scanned PDFs): Tesseract.js runs OCR via WASM
- Results returned as JSON to the LLM agent
Tesseract trained data (~15MB for English) is downloaded automatically on first use and cached.
Ecosystem
| Package | Description |
|---------|-------------|
| just-bash-ocr | OCR for scanned documents (this package) |
| just-bash-data | db + vec commands |
| just-bash-wiki | LLM-maintained knowledge base |
| just-bash-report | Dashboards, invoices, static sites |
License
MIT
