@zzwz/liteparse-vllm

v1.5.3-custom.1

Published

a month ago

Open-source PDF parsing with spatial text extraction and OCR processing, with custom Codex OCR and GLM-OCR servers

zzwz

pdf parser ocr text-extraction pdf-to-text document-parsing

LiteParse OCR vLLM

This repository is an independent custom OCR fork of upstream run-llama/liteparse. The upstream project remains the base LiteParse implementation and source reference; this repo carries local custom work for GLM-OCR, vLLM offline packaging, LM Studio diagnostics, Codex OCR diagnostics, agent skills, and release packaging under a separate package name.

Repository identity:

Fork repo: https://github.com/lwyBZss8924d/liteparse-ocr-vllm.git
Upstream repo: https://github.com/run-llama/liteparse.git
Custom branch: custom/vllm-ocr-main
Upstream mirror branch: main
npm package: @zzwz/liteparse-vllm
Current custom version: 1.5.3-custom.1, based on upstream v1.5.3

Do not publish custom OCR releases from main. Keep upstream syncs on main, merge them into custom/vllm-ocr-main, and publish this fork from the custom branch with custom tags such as v1.5.3-custom.1.
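
The tag convention above can be checked mechanically before a release. The sketch below is illustrative only: parseCustomTag is a hypothetical helper that does not exist in this repository, and its pattern is inferred from the single example tag v1.5.3-custom.1.

```typescript
// Hypothetical helper: not part of LiteParse. The pattern is inferred from
// the example tag v1.5.3-custom.1 (upstream version plus a custom revision).
interface CustomTag {
  upstreamTag: string;    // e.g. "v1.5.3"
  customRevision: number; // e.g. 1
}

function parseCustomTag(tag: string): CustomTag | null {
  const match = /^v(\d+\.\d+\.\d+)-custom\.(\d+)$/.exec(tag);
  if (!match) return null; // plain upstream tags such as v1.5.3 are rejected
  return { upstreamTag: `v${match[1]}`, customRevision: Number(match[2]) };
}

console.log(parseCustomTag("v1.5.3-custom.1"));
console.log(parseCustomTag("v1.5.3"));
```

A release script could refuse to publish when this returns null, which keeps plain upstream tags from being published under the custom package name.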

Overview

LiteParse OCR vLLM keeps LiteParse's local-first parser and standard OCR HTTP contract, then adds custom advanced OCR packaging for local VLM workflows.

Fast Text Parsing: Spatial text parsing using PDF.js
Flexible OCR System:
- Built-in: Tesseract.js for the zero-setup local path
- Baseline HTTP Servers: EasyOCR, PaddleOCR, or any custom /ocr service
- GLM-OCR SDK Pipeline: PP-DocLayout-backed layout boxes normalized into LiteParse OCR results
- vLLM Offline Image: optional GPU-accelerated Docker image tar for air-gapped GLM-OCR model serving; the GLM-OCR SDK pipeline itself can run without this image
- LM Studio Direct Diagnostics: lightweight local model smoke tests with degraded fallback boxes
- Codex OCR Diagnostics: online/authenticated multimodal page-understanding artifacts
- Standard API: unchanged multipart POST /ocr contract with results[].text, results[].bbox, and results[].confidence
Screenshot Generation: Generate high-quality page screenshots for LLM agents
Multiple Output Formats: JSON and Text
Bounding Boxes: Precise text positioning information
Standalone CLI: Baseline parsing runs locally; Codex OCR remains online/authenticated only
Multi-platform: Linux, macOS (Intel/ARM), Windows

Installation

CLI Tool

Option 1: Global Install (Recommended)

Install globally via npm to use the lit command anywhere:

npm i -g @zzwz/liteparse-vllm

Then use it:

lit parse document.pdf
lit screenshot document.pdf

For macOS and Linux users who want the upstream package instead of this custom OCR fork, liteparse can also be installed via brew:

brew tap run-llama/liteparse
brew install llamaindex-liteparse

Option 2: Install from Source

You can clone the repo and install the CLI globally from source:

git clone https://github.com/lwyBZss8924d/liteparse-ocr-vllm.git
cd liteparse-ocr-vllm
git switch custom/vllm-ocr-main
npm install
npm run build
npm pack
npm install -g ./zzwz-liteparse-vllm-*.tgz

For a release-grade offline npm tarball, build on Linux x64 so native runtime dependencies match the target host:

npm ci
npm run build
npm prune --omit=dev
npm pack --dry-run --json
npm run smoke:offline-npm-tgz
npm pack

Agent Skill

This fork keeps its custom agent skill source in the repository so OCR commands, package names, and vLLM/GLM-OCR workflows stay aligned with this custom build:

skills/liteparse-cli-tools-custom-collection/

Use npm run validate:agent-skills before publishing changes, then npm run sync:agent-skills:dry-run and npm run sync:agent-skills to refresh the installed runtime projection under /Users/arthur/.agents/skills/liteparse-cli-tools-custom-collection. Do not edit the installed projection directly.

Usage

Parse Files

# Basic parsing
lit parse document.pdf

# Parse with specific format
lit parse document.pdf --format json -o output.json

# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Parse without OCR
lit parse document.pdf --no-ocr

# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -

# Parse with the official GLM-OCR SDK layout pipeline as a LiteParse OCR server
# (start the server first, e.g. in a separate terminal)
lit glmocr-ocr-server
lit parse document.pdf --ocr-server-url http://127.0.0.1:8831/ocr --format json

# Parse with the Codex OCR server for multimodal page understanding
# (start the server first, e.g. in a separate terminal)
lit codex-ocr-server
lit parse document.pdf --ocr-server-url http://127.0.0.1:8833/ocr --format json
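
The --target-pages value used above is a comma-separated list of single pages and inclusive ranges. The following is a minimal sketch of how such a spec expands into page numbers. It assumes 1-based pages, and expandPageSpec is a hypothetical helper: LiteParse's own parsing may treat edge cases differently.

```typescript
// Hypothetical helper: illustrates the "1-5,10,15-20" syntax only.
// Assumes 1-based page numbers; LiteParse's own parser may differ.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    const range = /^(\d+)-(\d+)$/.exec(token);
    if (range) {
      const start = Number(range[1]);
      const end = Number(range[2]);
      if (start < 1 || end < start) throw new Error(`Invalid range: ${token}`);
      for (let p = start; p <= end; p++) pages.add(p);
    } else if (/^\d+$/.test(token) && Number(token) >= 1) {
      pages.add(Number(token));
    } else {
      throw new Error(`Invalid page token: ${token}`);
    }
  }
  // Deduplicate and return the pages in ascending order.
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-5,10,15-20").join(", "));
// prints: 1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20
```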

Batch Parsing

You can also parse an entire directory of documents:

lit batch-parse ./input-directory ./output-directory

Generate Screenshots

Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.

# Screenshot all pages
lit screenshot document.pdf -o ./screenshots

# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots

# Screenshot page range
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots

Library Usage

Install as a dependency in your project:

npm install @zzwz/liteparse-vllm
# or
pnpm add @zzwz/liteparse-vllm

import { LiteParse } from '@zzwz/liteparse-vllm';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);

Buffer / Uint8Array Input

You can pass raw bytes directly instead of a file path, which is useful for remote files:

import { LiteParse } from '@zzwz/liteparse-vllm';
import { readFile } from 'fs/promises';

const parser = new LiteParse();

// From a file read
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);

// From an HTTP response
const response = await fetch('https://example.com/document.pdf');
const buffer = Buffer.from(await response.arrayBuffer());
const result2 = await parser.parse(buffer);

Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion. Screenshots also work with buffer input:

const screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);

Browser Usage

LiteParse's core parsing engine (PDF.js text extraction, grid projection, OCR via Tesseract.js) can run in the browser. Since the library has Node-only dependencies (sharp, fs, child_process), you'll need a bundler like Vite to swap those out with browser stubs.

Vite Configuration

The key is a Vite plugin that redirects Node-only source files to browser-safe replacements, plus resolve.alias entries that stub out Node built-in modules:

// vite.config.ts
import { defineConfig, type Plugin } from "vite";
import { resolve, dirname } from "node:path";

// Node-only files → browser stubs (you write these)
const FILE_REDIRECTS = [
  { match: /\/engines\/pdf\/pdfium-renderer(\.js|\.ts)?$/, target: "stubs/pdfium-renderer.ts" },
  { match: /\/engines\/pdf\/pdfjsImporter(\.js|\.ts)?$/,   target: "stubs/pdfjsImporter.ts" },
  { match: /\/engines\/ocr\/http-simple(\.js|\.ts)?$/,     target: "stubs/http-simple.ts" },
  { match: /\/conversion\/convertToPdf(\.js|\.ts)?$/,      target: "stubs/convertToPdf.ts" },
  { match: /\/processing\/gridDebugLogger(\.js|\.ts)?$/,   target: "stubs/gridDebugLogger.ts" },
  { match: /\/processing\/gridVisualizer(\.js|\.ts)?$/,    target: "stubs/gridVisualizer.ts" },
];

function liteparseNodeRedirects(): Plugin {
  return {
    name: "liteparse-node-redirects",
    enforce: "pre",
    async resolveId(source, importer) {
      if (!importer) return null;
      const abs = source.startsWith(".") ? resolve(dirname(importer), source) : source;
      for (const { match, target } of FILE_REDIRECTS) {
        if (match.test(abs) || match.test(source)) return resolve(target);
      }
      return null;
    },
  };
}

// Vite expects absolute paths for file-system aliases, so resolve the stubs
// against the working directory (the same base the plugin above uses).
const EMPTY_STUB = resolve("stubs/empty.ts");

export default defineConfig({
  plugins: [liteparseNodeRedirects()],
  optimizeDeps: { include: ["tesseract.js"] },
  resolve: {
    alias: [
      { find: "node:fs/promises", replacement: EMPTY_STUB },
      { find: "node:fs",          replacement: EMPTY_STUB },
      { find: "node:url",         replacement: EMPTY_STUB },
      { find: "node:path",        replacement: EMPTY_STUB },
      { find: "node:os",          replacement: EMPTY_STUB },
      { find: "node:child_process", replacement: EMPTY_STUB },
      { find: /^fs$/,             replacement: EMPTY_STUB },
      { find: /^path$/,           replacement: EMPTY_STUB },
      { find: /^os$/,             replacement: EMPTY_STUB },
      { find: /^child_process$/,  replacement: EMPTY_STUB },
      { find: "form-data",        replacement: EMPTY_STUB },
      { find: "axios",            replacement: EMPTY_STUB },
      { find: "file-type",        replacement: resolve("stubs/file-type.ts") },
    ],
  },
});

See scripts/browser-compat/ for a complete working example with all the stub files.

What works in the browser

PDF parsing from Uint8Array input (use file.arrayBuffer() to get bytes from a <input type="file">)
OCR via Tesseract.js (runs in Web Workers, fetches language data from CDN on first use)
Text and JSON output formats

What doesn't work

File path input (pass Uint8Array instead)
DOCX/XLSX/PPTX/image conversion (requires LibreOffice/ImageMagick)
HTTP OCR server backend
Screenshots (these use PDFium + sharp, which are native Node addons)

CLI Options

Parse Command

$ lit parse --help
Usage: lit parse [options] <file>

Parse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)

Options:
  -o, --output <file>     Output file path
  --format <format>       Output format: json|text (default: "text")
  --ocr-server-url <url>  HTTP OCR server URL (uses Tesseract if not provided)
  --no-ocr                Disable OCR
  --ocr-language <lang>   OCR language(s) (default: "en")
  --num-workers <n>       Number of pages to OCR in parallel (default: CPU cores - 1)
  --max-pages <n>         Max pages to parse (default: "10000")
  --target-pages <pages>  Target pages (e.g., "1-5,10,15-20")
  --dpi <dpi>             DPI for rendering (default: "150")
  --no-precise-bbox       Disable precise bounding boxes
  --preserve-small-text   Preserve very small text
  --password <password>   Password for encrypted/protected documents
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

Batch Parse Command

$ lit batch-parse --help
Usage: lit batch-parse [options] <input-dir> <output-dir>

Parse multiple documents in batch mode (reuses PDF engine for efficiency)

Options:
  --format <format>       Output format: json|text (default: "text")
  --ocr-server-url <url>  HTTP OCR server URL (uses Tesseract if not provided)
  --no-ocr                Disable OCR
  --ocr-language <lang>   OCR language(s) (default: "en")
  --num-workers <n>       Number of pages to OCR in parallel (default: CPU cores - 1)
  --max-pages <n>         Max pages to parse per file (default: "10000")
  --dpi <dpi>             DPI for rendering (default: "150")
  --no-precise-bbox       Disable precise bounding boxes
  --recursive             Recursively search input directory
  --extension <ext>       Only process files with this extension (e.g., ".pdf")
  --password <password>   Password for encrypted/protected documents (applied to all files)
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

Screenshot Command

$ lit screenshot --help
Usage: lit screenshot [options] <file>

Generate screenshots of PDF pages

Options:
  -o, --output-dir <dir>  Output directory for screenshots (default: "./screenshots")
  --target-pages <pages>  Page numbers to screenshot (e.g., "1,3,5" or "1-5")
  --dpi <dpi>             DPI for rendering (default: "150")
  --format <format>       Image format: png|jpg (default: "png")
  --password <password>   Password for encrypted/protected documents
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

OCR Setup

Default: Tesseract.js

# Tesseract is enabled by default
lit parse document.pdf

# Specify language
lit parse document.pdf --ocr-language fra

# Disable OCR
lit parse document.pdf --no-ocr

By default, Tesseract.js downloads language data from the internet on first use. For offline or air-gapped environments, set the TESSDATA_PREFIX environment variable to a directory containing pre-downloaded .traineddata files:

export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

You can also pass tessdataPath in the library config:

const parser = new LiteParse({ tessdataPath: '/path/to/tessdata' });
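
Before going offline, it can help to confirm that the tessdata directory actually contains the language files you plan to use. The sketch below is a hypothetical preflight check, not a LiteParse API. It assumes each language code maps to a file named <code>.traineddata and that several languages may be joined with "+", which is the usual Tesseract convention.

```typescript
// Hypothetical preflight check, not a LiteParse API. Assumes each language
// code maps to "<code>.traineddata" and that multiple languages may be
// joined with "+" (the usual Tesseract convention).
function missingTraineddata(filesInDir: string[], languages: string): string[] {
  const present = new Set(filesInDir);
  return languages
    .split("+")
    .map((code) => code.trim())
    .filter((code) => code.length > 0 && !present.has(`${code}.traineddata`));
}

// In Node, filesInDir could come from fs.readdirSync() on the directory
// that TESSDATA_PREFIX points to.
console.log(missingTraineddata(["eng.traineddata"], "eng+fra").join(", "));
// prints: fra
```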

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines:

EasyOCR
PaddleOCR
GLM-OCR SDK Pipeline
LM Studio GLM-OCR direct wrapper
Codex OCR CLI/server (lit codex-ocr-server)

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).

The API requires:

POST /ocr endpoint
Accepts file and language parameters
Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }

See the example servers in ocr/easyocr/ and ocr/paddleocr/ as templates.
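
When writing a custom server or client, the response contract above can be expressed as a type plus a runtime check. This is a sketch derived only from the three fields listed here. The type and function names are illustrative and are not exports of @zzwz/liteparse-vllm, and OCR_API_SPEC.md remains the authoritative description.

```typescript
// Shape taken from the contract above; names are illustrative only.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r["text"] === "string" &&
    typeof r["confidence"] === "number" &&
    Array.isArray(r["bbox"]) &&
    r["bbox"].length === 4 &&
    r["bbox"].every((n) => typeof n === "number" && Number.isFinite(n))
  );
}

function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as Record<string, unknown>)["results"];
  return Array.isArray(results) && results.every((item) => isOcrResult(item));
}

const sample: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}',
);
console.log(isOcrResponse(sample));                             // true
console.log(isOcrResponse({ results: [{ text: "Hello" }] }));   // false
```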

Optional: GLM-OCR SDK Pipeline

For layout/table/formula-heavy documents, LiteParse can expose the official GLM-OCR SDK self-hosted pipeline as a Custom HTTP OCR server. This path uses PP-DocLayout for layout boxes, then calls a model runtime such as LM Studio for crop OCR:

# Python service path, matching the EasyOCR/PaddleOCR adapter style:
cd ocr/glmocr
uv run server.py

# Or Node-managed wrapper:
# Starts http://127.0.0.1:8831/ocr
# If the model is installed but not loaded, this runs:
# lms load glm-ocr-g32-mixed_4_8-mlx --identifier glm-ocr-g32-mixed_4_8-mlx -y
lit glmocr-ocr-server

lit parse document.pdf \
  --ocr-server-url http://127.0.0.1:8831/ocr \
  --format json

Advanced document-pipeline tooling writes page images, raw GLM-OCR SDK artifacts, LiteParse /ocr result JSON, and final Markdown/JSON:

lit glmocr-pipeline \
  --path document.pdf \
  --output ./glmocr-output \
  --target-pages "1-3"

Use --no-auto-load when you want LiteParse to fail fast instead of calling lms load. Use --model-runtime openai-compatible --ocr-api-url <url> or --model-runtime ollama --ocr-api-url <url> when the GLM-OCR model is hosted outside LM Studio.

Docker: Default Codex OCR Server and Optional vLLM GLM-OCR

The GLM-OCR SDK development path does not require this Docker image and is not GPU-only: cd ocr/glmocr && uv run server.py can run with CPU layout detection and a local LM Studio or other OpenAI-compatible model runtime. The Docker target is an optional vLLM serving package for air-gapped deployment, where a Linux x64 NVIDIA GPU host is expected for practical GLM-OCR model inference.

The image also contains codex-ocr-server, and the default Docker profile is codex. With no profile argument, the container starts a LiteParse-compatible OCR server on 0.0.0.0:8833 using LITEPARSE_CODEX_HOME=/codex-home. The mounted Codex home must provide either Codex auth/config or a custom model_provider config for a local/proxy model endpoint.

The image contains the LiteParse custom CLI, Node runtime dependencies, @openai/codex-sdk, the pinned GLM-OCR SDK, vLLM runtime, zai-org/GLM-OCR, and PaddlePaddle/PP-DocLayoutV3_safetensors.

docker build -f Dockerfile.glmocr-offline \
  -t liteparse-glmocr-vllm-offline:1.5.3-custom.1 \
  --build-arg VLLM_BASE_IMAGE=vllm/vllm-openai@sha256:9eff9734a30b6713a8566217d36f8277630fd2d31cec7f0a0292835901a23aa4 \
  --build-arg GLM_OCR_SDK_REF=cef4d0ea120d1741f5cefe8985eee45f6c8eff1d \
  --build-arg GLM_OCR_MODEL_REVISION=cb34f33832c51008c86436a3b2217bbe4adbe0b8 \
  --build-arg PP_DOCLAYOUT_MODEL_REVISION=3ec586e86ed9245a567bb13395a3db64d5c077cc \
  .

docker save \
  -o liteparse-glmocr-vllm-offline-1.5.3-custom.1.tar \
  liteparse-glmocr-vllm-offline:1.5.3-custom.1

On the deployment host:

docker load -i liteparse-glmocr-vllm-offline-1.5.3-custom.1.tar

# Default profile: codex-ocr-server on :8833.
docker run --rm -p 8833:8833 \
  -e LITEPARSE_CODEX_HOME=/codex-home \
  -v "$HOME/.codex:/codex-home" \
  liteparse-glmocr-vllm-offline:1.5.3-custom.1

# Optional vLLM GLM-OCR profile.
docker run --rm --gpus all --ipc=host --network=none \
  liteparse-glmocr-vllm-offline:1.5.3-custom.1 smoke

docker run --rm --gpus all --ipc=host -p 8831:8831 \
  -e LITEPARSE_OCR_PROFILE=glmocr-vllm \
  liteparse-glmocr-vllm-offline:1.5.3-custom.1

The codex profile starts lit codex-ocr-server on port 8833. The glmocr-vllm profile starts vllm serve /opt/models/glm-ocr on port 8000, waits for /v1/models, then starts lit glmocr-ocr-server on port 8831 with --layout-model-dir /opt/models/pp-doclayout. The image sets HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 at runtime; build the image online once, then distribute the saved tar.
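
The start order of the glmocr-vllm profile (serve the model, wait for /v1/models, then start the OCR server) amounts to a readiness poll. The sketch below shows that pattern in isolation. It is not the image's actual entrypoint, which this README does not show, and every name in it is hypothetical.

```typescript
// Sketch of a readiness poll; the image's real entrypoint may work
// differently. All names here are hypothetical.
interface WaitOptions {
  timeoutMs: number;
  intervalMs: number;
}

async function waitUntilReady(
  check: () => Promise<boolean>,
  { timeoutMs, intervalMs }: WaitOptions,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // A failed check (for example, connection refused) counts as "not ready".
    if (await check().catch(() => false)) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// A real check for the vLLM endpoint described above (port 8000) could be:
//   () => fetch("http://127.0.0.1:8000/v1/models").then((r) => r.ok)
// Here a fake check that succeeds on the third attempt stands in for it.
let attempts = 0;
waitUntilReady(async () => ++attempts >= 3, { timeoutMs: 1000, intervalMs: 10 })
  .then((ready) => console.log({ ready, attempts }));
```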

On a Linux x64 NVIDIA GPU host, run the release gate script after copying the tar:

scripts/validate-glmocr-offline-gpu.sh \
  liteparse-glmocr-vllm-offline-1.5.3-custom.1.tar

This script loads the tar, checks image metadata, verifies Docker GPU runtime availability, runs the in-image offline smoke under --network=none, then validates container-internal /health, POST /ocr, and lit parse --ocr-server-url http://127.0.0.1:8831/ocr. On local hosts without NVIDIA GPU support, keep this as an explicit unverified gate and rerun it on the GPU deployment host.

Codex OCR deployment options:

Mount a trusted Codex home: -v "$HOME/.codex:/codex-home" -e LITEPARSE_CODEX_HOME=/codex-home. This may include auth.json from codex login and config.toml; treat auth.json as a secret.
Use a custom Codex model provider in /codex-home/config.toml, then set model_provider to that provider id. Codex custom providers define base_url, wire_api, auth, and optional headers under [model_providers.<id>].
The current official Codex config schema documents wire_api = "responses" for custom providers. To use an OpenAI Chat Completions-compatible local endpoint as the Codex provider, put an adapter or proxy in front of it that exposes a Responses/Open Responses-compatible API, unless your pinned Codex version documents another supported wire_api.

Example local Open Responses-compatible Codex config:

# /codex-home/config.toml
#:schema https://developers.openai.com/codex/config-schema.json

model = "local-vision-model"
model_provider = "local-open-responses"
model_reasoning_effort = "medium"

[model_providers.local-open-responses]
name = "Local Open Responses provider"
base_url = "http://host.docker.internal:1234/v1"
wire_api = "responses"
# env_key = "LOCAL_RESPONSES_API_KEY"

References: Codex custom model providers, Codex alternative provider auth, Codex config reference, Codex config schema, OpenAI Responses API, and AI SDK Open Responses provider.

Optional: LM Studio GLM-OCR Direct Wrapper

The legacy direct wrapper remains available for quick single-image or OCR/text smoke tests:

lit lmstudio-ocr page.png --mode text --json
lit lmstudio-ocr-server

Direct mode sends the page or crop straight to LM Studio and may produce fallback line boxes when the model output has no reliable bbox_2d. Use glmocr-ocr-server or glmocr-pipeline when official GLM-OCR layout bboxes are required.

Optional: Codex OCR Server and Pipeline

For agentic multimodal OCR, LiteParse can expose OpenAI Codex as a Custom HTTP OCR server while preserving the standard /ocr response shape:

# Uses @openai/codex-sdk by default.
# Live tests should set HOME to a temp dir containing .codex/auth.json.
lit codex-ocr-server

lit parse document.pdf \
  --ocr-server-url http://127.0.0.1:8833/ocr \
  --format json

The Codex server also exposes POST /ocr/analyze for a full advanced artifact with page Markdown, page metadata, layout regions, segmented assets, annotations, conversion results, model metadata, and provenance. Use --backend app-server to try the experimental codex app-server JSON-RPC wrapper instead of the default SDK path.

Advanced document-pipeline tooling renders supported documents and images into page PNGs, runs Codex OCR per page, and writes page artifacts plus final Markdown/JSON:

lit codex-ocr-pipeline \
  --path document.pdf \
  --output ./codex-ocr-output \
  --target-pages "1-3" \
  --json

The artifact tree includes pages/, codex/, liteparse/, assets/<type>/, annotations/, final/document.md, final/document.json, and manifest.json. Final Markdown includes a LiteParse structured OCR context section that promotes page metadata, selected layout regions, and segmented asset details for downstream QA. Codex bounding boxes are model-inferred visual localization evidence and include codex_bboxes_are_model_inferred warnings; use --strict-bbox to drop regions without usable boxes.
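
To make the effect of --strict-bbox concrete, the sketch below filters a list of regions down to those with a usable box. The region shape and the rule for "usable" (four finite numbers enclosing a positive area) are assumptions made for illustration. The real artifact schema and the exact rule the CLI applies are not documented here.

```typescript
// Hypothetical region shape and filter; the real artifact schema and the
// exact --strict-bbox rule are not documented in this README.
interface LayoutRegion {
  label: string;
  bbox?: [number, number, number, number]; // [x1, y1, x2, y2] when present
}

function hasUsableBbox(region: LayoutRegion): boolean {
  const b = region.bbox;
  if (!b || !b.every((n) => Number.isFinite(n))) return false;
  const [x1, y1, x2, y2] = b;
  return x2 > x1 && y2 > y1; // assumed: a usable box has positive area
}

function applyStrictBbox(regions: LayoutRegion[]): LayoutRegion[] {
  return regions.filter(hasUsableBbox);
}

const kept = applyStrictBbox([
  { label: "table", bbox: [40, 100, 560, 320] },
  { label: "figure" },                            // no box at all
  { label: "caption", bbox: [40, 330, 40, 350] }, // zero width
]);
console.log(kept.map((r) => r.label).join(", "));
// prints: table
```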

Multi-Format Input Support

LiteParse can automatically convert various document formats to PDF before parsing, so it also handles inputs that PDF-only parsing tools cannot.

Supported Input Formats

Office Documents (via LibreOffice)

Word: .doc, .docx, .docm, .odt, .rtf
PowerPoint: .ppt, .pptx, .pptm, .odp
Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv

Just install the dependency and LiteParse will automatically convert these formats to PDF for parsing:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

# Windows
choco install libreoffice-fresh # might require admin permissions

On Windows, you might need to add the directory containing the LibreOffice CLI executable (generally C:\Program Files\LibreOffice\program) to the PATH environment variable and restart the machine.

Images (via ImageMagick)

Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

Just install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):

# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

# Windows
choco install imagemagick.app # might require admin permissions

Environment Variables

| Variable | Description |
|----------|-------------|
| TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet. |
| LITEPARSE_TMPDIR | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (os.tmpdir()). Useful in containerized or read-only filesystem environments. |
| LITEPARSE_LMSTUDIO_BASE_URL | Base URL for LM Studio GLM-OCR tooling. Defaults to http://localhost:1234. |
| LITEPARSE_GLM_OCR_MODEL | LM Studio model identifier. Defaults to glm-ocr-g32-mixed_4_8-mlx. |
| LITEPARSE_LMSTUDIO_API_KEY | Optional bearer token for LM Studio-compatible deployments. |
| LITEPARSE_LMSTUDIO_AUTO_LOAD | Set to 0 or false to disable automatic lms load for local LM Studio models. |
| LITEPARSE_GLMOCR_ROOT | GLM-OCR SDK root used by lit glmocr-ocr-server. Docker defaults to /opt/glm-ocr-sdk; local installs may omit it when glmocr is importable. |
| LITEPARSE_GLMOCR_LAYOUT_MODEL_DIR | PP-DocLayout model directory or Hub identifier. Docker defaults to /opt/models/pp-doclayout. |
| HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE | Set to 1 in the offline Docker image so Hugging Face and Transformers use only bundled model artifacts. |
| LITEPARSE_CODEX_HOME | Codex state directory for Codex OCR. Use $HOME/.codex for live development/testing so OAuth tokens and config remain separate from normal Codex state. |
| LITEPARSE_CODEX_OCR_MODEL | Default Codex OCR model. Defaults to gpt-5.5; use gpt-5.4-mini for cheaper smoke tests. |
| LITEPARSE_CODEX_OCR_REASONING | Default Codex OCR reasoning effort. Defaults to medium; the pipeline command defaults to high. |
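
As a rough illustration of how the LM Studio variables and their documented defaults fit together, the sketch below resolves them from an environment map. The defaults are copied from the table, but resolveLmStudioSettings and its field names are hypothetical and do not describe LiteParse internals.

```typescript
// Illustrative only: defaults are copied from the table above, but the
// function and field names are hypothetical, not LiteParse internals.
type Env = Record<string, string | undefined>;

interface LmStudioSettings {
  baseUrl: string;
  model: string;
  apiKey: string | undefined;
  autoLoad: boolean;
}

function resolveLmStudioSettings(env: Env): LmStudioSettings {
  const autoLoadRaw = (env["LITEPARSE_LMSTUDIO_AUTO_LOAD"] ?? "").trim();
  return {
    baseUrl: env["LITEPARSE_LMSTUDIO_BASE_URL"] ?? "http://localhost:1234",
    model: env["LITEPARSE_GLM_OCR_MODEL"] ?? "glm-ocr-g32-mixed_4_8-mlx",
    apiKey: env["LITEPARSE_LMSTUDIO_API_KEY"],
    // The table says "0" or "false" disables automatic `lms load`.
    autoLoad: !(autoLoadRaw === "0" || autoLoadRaw === "false"),
  };
}

// In Node you would typically pass process.env here.
console.log(resolveLmStudioSettings({ LITEPARSE_LMSTUDIO_AUTO_LOAD: "false" }));
```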

Configuration

You can configure parsing options via CLI flags or a JSON config file. The config file allows you to set sensible defaults and override as needed.

Config File Example

Create a liteparse.config.json file:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "preserveVerySmallText": false,
  "password": "optional_password"
}

For HTTP OCR servers, just add ocrServerUrl:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Use with:

lit parse document.pdf --config liteparse.config.json
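
The interaction between defaults, the config file, and CLI flags can be pictured as a layered merge. The sketch below assumes the precedence is built-in defaults, then the config file, then CLI flags. That order is a common convention rather than something this README states, and mergeConfig is a hypothetical function. Field names come from the config example above and default values from the CLI help text.

```typescript
// Assumed merge order: built-in defaults < config file < CLI flags.
// Not LiteParse's actual code; names and precedence are illustrative.
interface ParseConfig {
  ocrLanguage: string;
  ocrEnabled: boolean;
  maxPages: number;
  dpi: number;
  outputFormat: "json" | "text";
  ocrServerUrl?: string;
}

const DEFAULTS: ParseConfig = {
  ocrLanguage: "en",
  ocrEnabled: true,
  maxPages: 10000,
  dpi: 150,
  outputFormat: "text",
};

function mergeConfig(
  fileConfig: Partial<ParseConfig>,
  cliFlags: Partial<ParseConfig>,
): ParseConfig {
  // Later spreads win, so CLI flags override the file, which overrides defaults.
  return { ...DEFAULTS, ...fileConfig, ...cliFlags };
}

const merged = mergeConfig(
  { ocrServerUrl: "http://localhost:8828/ocr", outputFormat: "json" },
  { dpi: 300 },
);
console.log(merged.outputFormat, merged.dpi, merged.ocrServerUrl);
// prints: json 300 http://localhost:8828/ocr
```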

Development

We provide a detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.

# Install dependencies
npm install

# Build TypeScript (Linux/macOS)
npm run build

# Build TypeScript (Windows)
npm run build:windows

# Watch mode
npm run dev

# Test parsing
npm test

License

This custom LiteParse fork and its npm package are licensed under Apache-2.0; see this repository's LICENSE.

Third-party model and runtime notices for optional GLM-OCR deployments:

The GLM-OCR SDK repository code is Apache-2.0.
The GLM-OCR model zai-org/GLM-OCR is MIT licensed according to its model card.
The GLM-OCR pipeline uses PP-DocLayoutV3 for document layout analysis; the PaddlePaddle/PP-DocLayoutV3_safetensors component is Apache-2.0 licensed according to the GLM-OCR model card.

If you build or distribute the optional offline Docker image tar, retain the required notices for LiteParse, GLM-OCR, the GLM-OCR model, PP-DocLayoutV3, vLLM, Node runtime dependencies, and Python runtime dependencies included in that image.

Credits

Built on top of:

PDF.js - PDF parsing engine
Tesseract.js - In-process OCR engine
EasyOCR - HTTP OCR server (optional)
PaddleOCR - HTTP OCR server (optional)
Sharp - Image processing