kordoc

v1.4.1

Parse Korean documents (HWP, HWPX, PDF) to Markdown

kordoc

I'll parse them all: the Korean Document Platform.

Parse, compare, extract, and generate Korean documents. HWP, HWPX, PDF — all of them.


What's New in v1.4.0

  • Document Compare — Diff two documents at IR level. Cross-format (HWP vs HWPX) supported.
  • Form Field Recognition — Extract label-value pairs from government forms automatically.
  • Structured Parsing — Access IRBlock[] and DocumentMetadata directly, not just markdown.
  • Page Range Parsing — Parse only pages 1-3: parse(buffer, { pages: "1-3" }).
  • Markdown to HWPX — Reverse conversion. Generate valid HWPX files from markdown.
  • OCR Integration — Pluggable OCR for image-based PDFs (bring your own provider).
  • Watch Mode — kordoc watch ./incoming --webhook https://... for auto-conversion.
  • 7 MCP Tools — parse_document, detect_format, parse_metadata, parse_pages, parse_table, compare_documents, parse_form.
  • Error Codes — Structured code field: "ENCRYPTED", "ZIP_BOMB", "IMAGE_BASED_PDF", etc.
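
A structured `code` field makes failures easy to branch on. The sketch below maps the three codes named above to operator-facing hints. It assumes a failed result carries `success: false` next to `code`; the hint texts, the `describeFailure` name, and the fallback branch are illustrative and not part of kordoc.

```javascript
// Hints for the error codes listed in the release notes.
// The wording of each hint is an assumption made for this example.
const HINTS = new Map([
  ["ENCRYPTED", "Document is encrypted; request an unprotected copy."],
  ["ZIP_BOMB", "Archive looks like a ZIP bomb; rejecting the upload."],
  ["IMAGE_BASED_PDF", "PDF has no text layer; retry with an OCR provider."],
])

// Hypothetical helper: returns null on success, a hint string on failure.
function describeFailure(result) {
  if (result.success) return null
  return HINTS.get(result.code) ?? `Unhandled error code: ${result.code}`
}

console.log(describeFailure({ success: false, code: "IMAGE_BASED_PDF" }))
```

Codes that are not in the map fall through to a generic message, so newly introduced codes are still reported instead of being silently dropped.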

Why kordoc?

South Korea's government runs on HWP — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of .hwp files. Extracting text from them has always been a nightmare.

kordoc was born from that document hell. Built by a Korean civil servant who spent 7 years buried under HWP files. Battle-tested across 5 real government projects. If a Korean public servant wrote it, kordoc can parse it.


Installation

npm install kordoc

# PDF support (optional)
npm install pdfjs-dist

Quick Start

Parse Any Document

import { parse } from "kordoc"
import { readFileSync } from "fs"

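const file = readFileSync("document.hwpx")
// A Node.js Buffer can be a view into a larger pooled ArrayBuffer, so copy out exactly the file's bytes
const buffer = file.buffer.slice(file.byteOffset, file.byteOffset + file.byteLength)
const result = await parse(buffer)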

if (result.success) {
  console.log(result.markdown)       // Markdown text
  console.log(result.blocks)         // IRBlock[] structured data
  console.log(result.metadata)       // { title, author, createdAt, ... }
}

Compare Two Documents

import { compare } from "kordoc"

const diff = await compare(bufferA, bufferB)
// diff.stats → { added: 3, removed: 1, modified: 5, unchanged: 42 }
// diff.diffs → BlockDiff[] with cell-level table diffs

Cross-format supported: compare HWP against HWPX of the same document.
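
For logs or review dashboards, the counters in `diff.stats` can be condensed into one line. This is a minimal sketch that assumes the four counters are per-block counts, as the example values above suggest; `summarizeStats` is a hypothetical helper, not a kordoc export.

```javascript
// Summarize a stats object shaped like the `diff.stats` example above.
function summarizeStats(stats) {
  const total = stats.added + stats.removed + stats.modified + stats.unchanged
  const changed = total - stats.unchanged
  // Guard against empty documents to avoid dividing by zero.
  const percent = total === 0 ? 0 : Math.round((changed / total) * 100)
  return `${changed} of ${total} blocks changed (${percent}%)`
}

console.log(summarizeStats({ added: 3, removed: 1, modified: 5, unchanged: 42 }))
// prints "9 of 51 blocks changed (18%)"
```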

Extract Form Fields

import { parse, extractFormFields } from "kordoc"

const result = await parse(buffer)
if (result.success) {
  const form = extractFormFields(result.blocks)
  // form.fields → [{ label: "성명", value: "홍길동", row: 0, col: 0 }, ...]
  // form.confidence → 0.85
}
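
Downstream code usually wants to look a value up by its label rather than scan the array. The sketch below builds such a lookup from objects shaped like the `form.fields` entries above. `fieldsToMap` is a hypothetical helper, and the second sample entry is invented to show how a repeated label is handled.

```javascript
// Build a label → value lookup from a list shaped like `form.fields`.
function fieldsToMap(fields) {
  const byLabel = new Map()
  for (const { label, value } of fields) {
    // Keep the first occurrence when a label appears more than once.
    if (!byLabel.has(label)) byLabel.set(label, value)
  }
  return byLabel
}

const sampleFields = [
  { label: "성명", value: "홍길동", row: 0, col: 0 },
  { label: "성명", value: "(repeated label)", row: 3, col: 0 }, // invented entry
]

console.log(fieldsToMap(sampleFields).get("성명")) // prints "홍길동"
```

Keeping the first occurrence is a design choice for this example; a form with legitimately repeated labels may need the row and column kept as part of the key.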

Generate HWPX from Markdown

import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"

const hwpxBuffer = await markdownToHwpx("# Title\n\nParagraph text\n\n| A | B |\n| --- | --- |\n| 1 | 2 |")
writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))

Parse Specific Pages

const firstThree = await parse(buffer, { pages: "1-3" })    // pages 1-3 only
const selected = await parse(buffer, { pages: [1, 5, 10] }) // specific pages
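
If page selections come from user input, it can help to validate them before calling `parse`. The sketch below normalizes the two documented forms, a `"start-end"` string and an array of page numbers, into a sorted list. `normalizePages` is a hypothetical helper, and kordoc may accept other spellings that this example rejects.

```javascript
// Normalize a page selection into a sorted array of 1-based page numbers.
function normalizePages(pages) {
  if (Array.isArray(pages)) {
    // Remove duplicates and sort numerically.
    return [...new Set(pages)].sort((a, b) => a - b)
  }
  const match = /^(\d+)-(\d+)$/.exec(pages)
  if (!match) throw new Error(`Unsupported page spec: ${pages}`)
  const start = Number(match[1])
  const end = Number(match[2])
  if (start < 1 || end < start) throw new Error(`Invalid page range: ${pages}`)
  return Array.from({ length: end - start + 1 }, (_, i) => start + i)
}

console.log(normalizePages("1-3"))      // [ 1, 2, 3 ]
console.log(normalizePages([10, 1, 5])) // [ 1, 5, 10 ]
```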

OCR for Image-Based PDFs

const result = await parse(buffer, {
  ocr: async (pageImage, pageNumber, mimeType) => {
    return await myOcrService.recognize(pageImage) // Tesseract, Claude Vision, etc.
  }
})
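
Remote OCR services fail occasionally, and one bad page should not necessarily sink a whole document. The wrapper below is a sketch of one way to handle that. It assumes the callback resolves to the recognized text as a string and that an empty string is an acceptable result for a failed page; neither is confirmed by the kordoc docs. `withFallback` and `flakyOcr` are hypothetical names.

```javascript
// Wrap an OCR callback so a failure on one page yields empty text
// instead of a rejected promise.
function withFallback(ocr, onError = () => {}) {
  return async (pageImage, pageNumber, mimeType) => {
    try {
      return await ocr(pageImage, pageNumber, mimeType)
    } catch (err) {
      onError(pageNumber, err)
      return ""
    }
  }
}

// Stand-in provider for demonstration only: it fails on page 2.
const flakyOcr = async (_pageImage, pageNumber) => {
  if (pageNumber === 2) throw new Error("OCR service timeout")
  return `text of page ${pageNumber}`
}

const failedPages = []
const safeOcr = withFallback(flakyOcr, (page) => failedPages.push(page))

safeOcr(new Uint8Array(), 1, "image/png").then((text) => console.log(text))
```

The wrapped function keeps the same three-argument signature, so it can be passed as the `ocr` option in place of the raw provider.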

CLI

npx kordoc document.hwpx                          # stdout
npx kordoc document.hwp -o output.md              # save to file
npx kordoc *.pdf -d ./converted/                  # batch convert
npx kordoc report.hwpx --format json              # JSON with blocks + metadata
npx kordoc report.hwpx --pages 1-3                # page range
npx kordoc watch ./incoming -d ./output            # watch mode
npx kordoc watch ./docs --webhook https://api/hook # webhook notification

MCP Server (Claude / Cursor / Windsurf)

{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}

7 Tools:

| Tool | Description |
|------|-------------|
| parse_document | Parse HWP/HWPX/PDF → Markdown with metadata |
| detect_format | Detect file format via magic bytes |
| parse_metadata | Extract metadata only (fast, no full parse) |
| parse_pages | Parse specific page range |
| parse_table | Extract Nth table from document |
| compare_documents | Diff two documents (cross-format) |
| parse_form | Extract form fields as structured JSON |
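
Magic-byte detection means checking the first few bytes of a file against known signatures. The sketch below illustrates the idea for the three container types involved: ZIP (used by HWPX), OLE2/CFB (used by HWP 5.x), and PDF. It is not kordoc's implementation, `sniffFormat` is a hypothetical name, and a ZIP signature alone cannot tell HWPX apart from other ZIP-based formats.

```javascript
// File signatures for the containers behind each format.
const SIGNATURES = [
  // ZIP local file header "PK\x03\x04" (HWPX is a ZIP archive)
  { type: "hwpx", bytes: [0x50, 0x4b, 0x03, 0x04] },
  // OLE2 / Compound File Binary header (HWP 5.x)
  { type: "hwp", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] },
  // "%PDF"
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] },
]

// Accepts an ArrayBuffer or a typed array and returns a format label.
function sniffFormat(input) {
  const data = new Uint8Array(input)
  for (const { type, bytes } of SIGNATURES) {
    if (bytes.every((b, i) => data[i] === b)) return type
  }
  return "unknown"
}

console.log(sniffFormat(new TextEncoder().encode("%PDF-1.7"))) // prints "pdf"
```

A production detector would go on to inspect the archive contents before reporting "hwpx", since any ZIP file starts with the same four bytes.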

API Reference

Core

| Function | Description |
|----------|-------------|
| parse(buffer, options?) | Auto-detect format, parse to Markdown + IRBlock[] |
| parseHwpx(buffer, options?) | HWPX only |
| parseHwp(buffer, options?) | HWP 5.x only |
| parsePdf(buffer, options?) | PDF only |
| detectFormat(buffer) | Returns "hwpx" \| "hwp" \| "pdf" \| "unknown" |

Advanced

| Function | Description |
|----------|-------------|
| compare(bufferA, bufferB, options?) | Document diff at IR level |
| extractFormFields(blocks) | Form field recognition from IRBlock[] |
| markdownToHwpx(markdown) | Markdown → HWPX reverse conversion |
| blocksToMarkdown(blocks) | IRBlock[] → Markdown string |

Types

import type {
  ParseResult, ParseSuccess, ParseFailure, FileType,
  IRBlock, IRTable, IRCell, CellContext,
  DocumentMetadata, ParseOptions, ErrorCode,
  DiffResult, BlockDiff, CellDiff, DiffChangeType,
  FormField, FormResult,
  OcrProvider, WatchOptions,
} from "kordoc"

Supported Formats

| Format | Engine | Features |
|--------|--------|----------|
| HWPX (Hancom 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
| HWP 5.x (Hancom legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
| PDF | pdfjs-dist | Line grouping, table detection, image PDF + OCR |

Security

Production-grade hardening: ZIP bomb protection, XXE/Billion Laughs prevention, decompression bomb guard, path traversal guard, MCP error sanitization, file size limits (500MB). See SECURITY.md for details.
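
To make one of these guards concrete, the sketch below shows a typical path traversal check for archive entry names. It illustrates the general technique only and is not taken from kordoc's source; `isSafeEntryName` is a hypothetical name, and the sample entry names are illustrative.

```javascript
// Reject archive entry names that could escape the extraction directory.
function isSafeEntryName(name) {
  // Empty names and embedded NUL bytes are never legitimate.
  if (name.length === 0 || name.includes("\0")) return false
  // Treat backslashes as separators so Windows-style paths are covered.
  const normalized = name.replace(/\\/g, "/")
  // Absolute paths and drive letters point outside the target directory.
  if (normalized.startsWith("/") || /^[A-Za-z]:/.test(normalized)) return false
  // Any ".." segment can climb out of the target directory.
  return !normalized.split("/").includes("..")
}

console.log(isSafeEntryName("Contents/section0.xml")) // true
console.log(isSafeEntryName("../../etc/passwd"))      // false
```

Rejecting every `..` segment is stricter than resolving the path and checking the result, but it is simpler to audit and well-formed archives rarely need such segments.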

Credits

Production-tested across 5 Korean government projects: school curriculum plans, facility inspection reports, legal document annexes, municipal newsletters, and public data extraction tools. Thousands of real government documents parsed.

License

MIT