docmarrow

v1.1.1

Published

8 days ago

DocMarrow - Pure TypeScript document parser for PDF → Markdown, JSON and RAG chunks.

DocMarrow

Convert PDF, Word (DOCX), Excel (XLSX), PowerPoint (PPTX) and HTML to clean Markdown or structured JSON, plus RAG-ready chunks for LLM pipelines. Parsing is layout-aware and needs no Python and no servers: it runs in Node, the browser and edge runtimes. Optional OCR for scanned PDFs is available via @docmarrow/ocr.

▶ Try the live demo: drop a document and watch it become Markdown, JSON or chunks in your browser. Nothing is uploaded; the file never leaves your machine.

npm install docmarrow

import { parseDocument } from "docmarrow";
import { readFile } from "node:fs/promises";

// Format (PDF or DOCX) is autodetected from the bytes.
const doc = await parseDocument(new Uint8Array(await readFile("report.pdf")));
console.log(doc.markdown);
console.log(doc.meta); // { format, pageCount, hasText, title?, warnings[] }

const chunks = doc.chunks({ maxTokens: 512, overlap: 64 });
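
To make the maxTokens and overlap options above more concrete, here is a minimal sketch of one common way such parameters are interpreted: a fixed-size sliding window in which consecutive chunks share `overlap` tokens. This is not DocMarrow's chunker, which is not documented here and may well split on layout boundaries instead. `windowTokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```typescript
// Illustrative only: a fixed-size sliding window over pre-split tokens.
// `windowTokens` is a hypothetical helper, not part of the docmarrow API.
function windowTokens(
  tokens: readonly string[],
  maxTokens: number,
  overlap: number,
): string[][] {
  if (maxTokens <= 0) throw new RangeError("maxTokens must be positive");
  if (overlap < 0 || overlap >= maxTokens) {
    throw new RangeError("overlap must be in [0, maxTokens)");
  }
  const step = maxTokens - overlap; // how far each window advances
  const result: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    result.push(tokens.slice(start, start + maxTokens));
    if (start + maxTokens >= tokens.length) break; // this window reached the end
  }
  return result;
}

// Whitespace splitting is an assumption standing in for a real tokenizer.
const demoTokens = "a b c d e f g h i j".split(" ");
const demoWindows = windowTokens(demoTokens, 4, 1);
console.log(demoWindows.map((w) => w.join(" ")));
// three windows: "a b c d", "d e f g", "g h i j"
```

Under this reading, maxTokens: 512 with overlap: 64 would advance each window by 448 tokens, so the last 64 tokens of one chunk reappear at the start of the next.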

doc exposes markdown, blocks, json, pages, meta, and chunks(). Block types: heading, paragraph, list, table, code, quote. A CLI is included:

npx docmarrow report.pdf -o report.md --json report.json
npx docmarrow notes.docx -o notes.md
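
The block types listed above lend themselves to simple structural checks, such as counting how many tables or headings a document produced. The sketch below assumes each block carries a `type` field holding one of the six names; that field name and the `BlockLike` shape are assumptions for illustration, so check the project README for the real block shape before relying on it.

```typescript
// The six block type names come from the README; the field name `type`
// and the BlockLike shape are assumptions made for this sketch.
type BlockType = "heading" | "paragraph" | "list" | "table" | "code" | "quote";

interface BlockLike {
  type: BlockType;
}

// Tally how many blocks of each type appear in a document.
function countBlockTypes(blocks: readonly BlockLike[]): Record<BlockType, number> {
  const counts: Record<BlockType, number> = {
    heading: 0,
    paragraph: 0,
    list: 0,
    table: 0,
    code: 0,
    quote: 0,
  };
  for (const block of blocks) counts[block.type] += 1;
  return counts;
}

// Stand-in data; with the library you would presumably pass doc.blocks instead.
const sampleBlocks: BlockLike[] = [
  { type: "heading" },
  { type: "paragraph" },
  { type: "paragraph" },
  { type: "table" },
];
console.log(countBlockTypes(sampleBlocks));
```

A tally like this is a cheap sanity check after parsing, for example to flag a report that was expected to contain tables but yielded none.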

This is a self-contained package (the layout core, the pdfjs-dist PDF backend and the OOXML DOCX backend are bundled in). See the project README for the full API, how it works, the in-browser playground, and current limitations (digital PDFs and DOCX; no OCR; geometric PDF tables).
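
The usage example says the format is autodetected from the bytes. How DocMarrow does this is not described here; one common approach is magic-byte sniffing, sketched below. A PDF normally begins with `%PDF-`, and a DOCX file is a ZIP container whose first bytes are `PK\x03\x04`. `sniffFormat` and `startsWith` are hypothetical helpers, not part of the docmarrow API.

```typescript
// Hypothetical helpers illustrating magic-byte sniffing; not docmarrow's code.
type SniffedFormat = "pdf" | "zip" | "unknown";

// True when `bytes` begins with the given signature.
function startsWith(bytes: Uint8Array, signature: readonly number[]): boolean {
  if (bytes.length < signature.length) return false;
  return signature.every((value, i) => bytes[i] === value);
}

function sniffFormat(bytes: Uint8Array): SniffedFormat {
  // "%PDF-" is the standard PDF header.
  if (startsWith(bytes, [0x25, 0x50, 0x44, 0x46, 0x2d])) return "pdf";
  // "PK\x03\x04" is the ZIP local file header signature. DOCX is a ZIP
  // container, but so are XLSX and PPTX, so telling them apart requires
  // inspecting the archive contents.
  if (startsWith(bytes, [0x50, 0x4b, 0x03, 0x04])) return "zip";
  return "unknown";
}

console.log(sniffFormat(new TextEncoder().encode("%PDF-1.7\n"))); // pdf
console.log(sniffFormat(new Uint8Array([0x50, 0x4b, 0x03, 0x04]))); // zip
```

Sniffing the leading bytes avoids trusting file extensions, which matters when input arrives as a raw Uint8Array with no filename attached.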

Dual-licensed: AGPL-3.0-or-later or a commercial license.
