mineru-parser
v0.1.0
Parse MinerU PDF extraction JSON output into clean Markdown
MinerU extracts PDF content into structured JSON with text, images, tables, charts, and layout metadata. This package converts that JSON into readable Markdown, handling headings, tables, lists, inline equations, page filtering, and more.
Installation
npm install mineru-parser
# or
bun add mineru-parser
Quick Start
import { ContentListParser, ContentListParserV2 } from 'mineru-parser';
import data from './mineru-output.json';
// v1 format — flat array of blocks
const parser = new ContentListParser(data);
const markdown = parser.parse();
console.log(markdown);
// v2 format: array of pages, each an array of blocks (pass v2 output here, not the v1 data used above)
const parserV2 = new ContentListParserV2(data);
const markdownV2 = parserV2.parse();
console.log(markdownV2);
Supported Formats
v1 (flat blocks)
Each block has a type and page_idx:
[
{ "type": "text", "text": "Chapter 1", "text_level": 1, "page_idx": 0, ... },
{ "type": "text", "text": "Lorem ipsum...", "page_idx": 0, ... },
{ "type": "image", "img_path": "images/photo.jpg", "page_idx": 0, ... },
{ "type": "table", "table_body": "<table>...</table>", "page_idx": 1, ... }
]
v2 (paginated blocks)
Top-level array represents pages; each page is an array of blocks:
[
[
{ "type": "title", "content": { "title_content": [...], "level": 1 }, ... },
{ "type": "paragraph", "content": { "paragraph_content": [...] }, ... }
],
[
{ "type": "image", "content": { "image_source": { "path": "..." } }, ... }
]
]
API
ContentListParser (v1)
import { ContentListParser, type Block } from 'mineru-parser';
const parser = new ContentListParser(blocks); // blocks: Block[]
// Parse everything; returns a Markdown string
const markdown: string = parser.parse();
// Parse specific pages (1-based)
parser.parsePages(5); // single page
parser.parsePages([1, 3, 7]); // multiple pages
parser.parsePages({ start: 2, end: 5 }); // inclusive range
ContentListParserV2 (v2)
import { ContentListParserV2, type Block } from 'mineru-parser';
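const parser = new ContentListParserV2(pages); // pages: Block[][]
// Same API as v1; both methods return a Markdown string
const markdown: string = parser.parse();
const selected: string = parser.parsePages(filter); // filter: number, number[], or { start, end }
parseBlock (standalone)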
Parse a single block without instantiating the class:
import { parseBlock, parseBlockV2 } from 'mineru-parser';
const markdown = parseBlock(block); // v1
const markdownV2 = parseBlockV2(block); // v2
Page Filtering
Both parsers support flexible page selection:
// Single page
parser.parsePages(1);
// Multiple pages
parser.parsePages([1, 3, 5]);
// Inclusive range (auto-swaps if reversed)
parser.parsePages({ start: 5, end: 2 }); // pages 2, 3, 4, 5
Invalid inputs throw descriptive errors:
parser.parsePages(0); // RangeError: Page number must be a positive integer
parser.parsePages(-1); // RangeError
parser.parsePages(NaN); // RangeError
Block Types
| Type | v1 | v2 | Markdown Output |
|------|:--:|:--:|:----------------|
| text / title | ✅ | ✅ | # Heading or plain text |
| paragraph | — | ✅ | Plain text |
| image | ✅ | ✅ | Markdown image |
| table | ✅ | ✅ | Markdown table (HTML → Markdown) |
| chart | ✅ | ✅ | Markdown image + content |
| header / page_header | ✅ | ✅ | ### Header |
| page_number | ✅ | ✅ | <!-- page N --> |
| footer / page_footer | ✅ | ✅ | <!-- footer: text --> |
| index | — | ✅ | Bullet list |
| list | — | ✅ | Ordered/unordered list |
| equation_inline | — | ✅ | $latex$ |
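The mapping in the table can be pictured as a dispatch on the block's type. The following is a minimal sketch for a few v1 block types, not the package's actual implementation: `SketchBlock` and `renderV1Block` are hypothetical names, and the exact output strings (in particular the empty alt text for images) are assumptions chosen to match the table.

```typescript
// Hypothetical sketch only: this is NOT the mineru-parser source.
// Field names follow the v1 examples in this README.
type SketchBlock = {
  type: string;
  text?: string;
  text_level?: number;
  img_path?: string;
  page_idx: number;
};

function renderV1Block(block: SketchBlock): string {
  const text = block.text ?? "";
  switch (block.type) {
    case "text":
      // A positive text_level marks a heading; otherwise emit plain text.
      if (block.text_level !== undefined && block.text_level > 0) {
        return `${"#".repeat(block.text_level)} ${text}`;
      }
      return text;
    case "image":
      // Assumed image syntax with empty alt text.
      return block.img_path ? `![](${block.img_path})` : "";
    case "header":
      return `### ${text}`;
    case "page_number":
      return `<!-- page ${text} -->`;
    case "footer":
      return `<!-- footer: ${text} -->`;
    default:
      // Mirrors the documented behavior: warn and return an empty string.
      console.warn(`Unknown block type: ${block.type}`);
      return "";
  }
}

console.log(
  renderV1Block({ type: "text", text: "Chapter 1", text_level: 1, page_idx: 0 })
); // "# Chapter 1"
```

Tables, charts, and the v2-only types are left out here because their conversion depends on details this README does not spell out.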
Error Handling
- Constructor: throws TypeError if the input is not an array
- Page filters: throws RangeError for invalid page numbers; throws TypeError for completely invalid filters
- Unknown block types: logs a warning and returns an empty string
- Malformed HTML tables: returns an empty string gracefully
- Missing properties: defensive checks with warnings, never crashes
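The page-filter rules (single page, list of pages, inclusive range with auto-swap, RangeError for bad page numbers, TypeError for unusable filters) can be illustrated with a small standalone function. This is a sketch under assumptions, not the package's code: `assertPage` and `resolvePages` are hypothetical names, and the TypeError message is invented for illustration.

```typescript
// Hypothetical sketch only: this is NOT the mineru-parser source.

// Page numbers are 1-based, so anything that is not a positive integer is rejected.
function assertPage(n: number): number {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("Page number must be a positive integer");
  }
  return n;
}

// Expand any supported filter shape into an explicit list of page numbers.
function resolvePages(filter: unknown): number[] {
  if (typeof filter === "number") {
    return [assertPage(filter)];
  }
  if (Array.isArray(filter)) {
    return filter.map((n) => assertPage(n as number));
  }
  if (
    typeof filter === "object" &&
    filter !== null &&
    "start" in filter &&
    "end" in filter
  ) {
    const { start, end } = filter as { start: number; end: number };
    assertPage(start);
    assertPage(end);
    // Auto-swap reversed ranges so { start: 5, end: 2 } behaves like { start: 2, end: 5 }.
    const lo = Math.min(start, end);
    const hi = Math.max(start, end);
    return Array.from({ length: hi - lo + 1 }, (_, i) => lo + i);
  }
  // Assumed message; the README only states that a TypeError is thrown.
  throw new TypeError("Invalid page filter");
}

console.log(resolvePages({ start: 5, end: 2 })); // [2, 3, 4, 5]
```

Validating before expanding the range means a filter such as `{ start: 0, end: 3 }` fails fast rather than silently dropping page 0.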
Contributing
# Install dependencies
bun install
# Run tests
bun test
# Build
bun run build
License
MIT © Souvik De
