file2md

v1.4.55

A TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with image and layout preservation

A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with advanced layout preservation, image extraction, chart conversion, and Korean language support.

✨ Features

  • 🔄 Multiple Format Support: PDF, DOCX, XLSX, PPTX, HWP, HWPX
  • 🎨 Layout Preservation: Maintains document structure, tables, and formatting
  • 🖼️ Image Extraction: Extracts embedded images from DOCX, PPTX, HWP, and HWPX documents
  • 📊 Chart Conversion: Converts charts to Markdown tables
  • 📝 List & Table Support: Proper nested lists and complex tables
  • 🌏 Korean Language Support: Full support for HWP/HWPX Korean document formats
  • 🔒 Type Safety: Full TypeScript support with comprehensive types
  • Modern ESM: ES2022 modules with CommonJS compatibility
  • 🚀 Zero Config: Works out of the box
  • 📄 PDF Text Extraction: Enhanced text extraction with layout detection

Note: XLSX image extraction is planned but not yet supported.

📦 Installation

npm install file2md

🚀 Quick Start

TypeScript / ES Modules

import { convert } from 'file2md';

// Convert from file path
const result = await convert('./document.pdf');
console.log(result.markdown);

// Convert with options
const pptxResult = await convert('./presentation.pptx', {
  imageDir: 'extracted-images',
  preserveLayout: true,
  extractCharts: true,
  extractImages: true
});

console.log(`✅ Converted successfully!`);
console.log(`📄 Markdown length: ${pptxResult.markdown.length}`);
console.log(`🖼️ Images extracted: ${pptxResult.images.length}`);
console.log(`📊 Charts found: ${pptxResult.charts.length}`);
console.log(`⏱️ Processing time: ${pptxResult.metadata.processingTime}ms`);

Korean Document Support (HWP/HWPX)

import { convert } from 'file2md';

// Convert Korean HWP document
const hwpResult = await convert('./document.hwp', {
  imageDir: 'hwp-images',
  preserveLayout: true,
  extractImages: true
});

// Convert Korean HWPX document (XML-based format)
const hwpxResult = await convert('./document.hwpx', {
  imageDir: 'hwpx-images',
  preserveLayout: true,
  extractImages: true
});

console.log(`🇰🇷 HWP content: ${hwpResult.markdown.substring(0, 100)}...`);
console.log(`📄 HWPX pages: ${hwpxResult.metadata.pageCount}`);

CommonJS

const { convert } = require('file2md');

// CommonJS has no top-level await, so use the returned promise directly
convert('./document.docx')
  .then((result) => console.log(result.markdown))
  .catch(console.error);

From Buffer

import { convert } from 'file2md';
import { readFile } from 'fs/promises';

const buffer = await readFile('./document.xlsx');
const result = await convert(buffer, {
  imageDir: 'spreadsheet-images'
});

📋 API Reference

convert(input, options?)

Parameters:

  • input: string | Buffer - File path or buffer containing document data
  • options?: ConvertOptions - Conversion options

Returns: Promise<ConversionResult>

Options

interface ConvertOptions {
  imageDir?: string;        // Directory for extracted images (default: 'images')
  outputDir?: string;       // Output directory for slide screenshots (PPTX, falls back to imageDir)
  preserveLayout?: boolean; // Maintain document layout (default: true)
  extractCharts?: boolean;  // Convert charts to tables (default: true)
  extractImages?: boolean;  // Extract embedded images (default: true)
  maxPages?: number;        // Max pages for PDFs (default: unlimited)
}
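
The defaults listed in the comments above can be made explicit at a call site. The following is a minimal sketch, not the library's internal code: `resolveOptions` and `ResolvedOptions` are hypothetical names, and `ConvertOptions` is re-declared locally so the snippet stands alone. It assumes `outputDir` falls back to `imageDir`, as the interface comment describes.

```typescript
// Local copy of the options shape documented above, so the sketch is self-contained.
interface ConvertOptions {
  imageDir?: string;
  outputDir?: string;
  preserveLayout?: boolean;
  extractCharts?: boolean;
  extractImages?: boolean;
  maxPages?: number;
}

// Hypothetical type: every option with a documented default is required.
type ResolvedOptions = Required<Omit<ConvertOptions, 'maxPages'>> & {
  maxPages: number | undefined;
};

// Hypothetical helper: fill in the defaults stated in the README comments.
function resolveOptions(options: ConvertOptions = {}): ResolvedOptions {
  const imageDir = options.imageDir ?? 'images';
  return {
    imageDir,
    outputDir: options.outputDir ?? imageDir, // documented fallback
    preserveLayout: options.preserveLayout ?? true,
    extractCharts: options.extractCharts ?? true,
    extractImages: options.extractImages ?? true,
    maxPages: options.maxPages, // undefined stands for "unlimited"
  };
}

console.log(resolveOptions({ imageDir: 'pics', extractCharts: false }));
```

Logging the resolved options next to a conversion result can make it easier to see which settings were in effect for a given run.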

Result

interface ConversionResult {
  markdown: string;           // Generated Markdown content
  images: ImageData[];        // Extracted image information
  charts: ChartData[];        // Extracted chart data
  metadata: DocumentMetadata; // Document metadata with processing info
}

🎯 Format-Specific Features

📄 PDF

  • Text extraction with layout enhancement
  • Table detection and formatting
  • List recognition (bullets, numbers)
  • Heading detection (ALL CAPS, colons)
  • No image extraction (PDFs are processed as text only)

📝 DOCX

  • Heading hierarchy (H1-H6)
  • Text formatting (bold, italic)
  • Complex tables with merged cells
  • Nested lists with proper indentation
  • Embedded images and charts
  • Cell styling (alignment, colors)
  • Font size preservation and formatting

📊 XLSX

  • Multiple worksheets as separate sections
  • Cell formatting (bold, colors, alignment)
  • Data type preservation
  • Chart extraction to data tables
  • Conditional formatting notes
  • Shared strings handling for large files

🎬 PPTX

  • Slide-by-slide organization
  • Text positioning and layout
  • Image placement per slide
  • Table extraction from slides
  • Multi-column layouts
  • Title extraction from document properties
  • Chart and image inline embedding

🇰🇷 HWP (Korean)

  • Binary format parsing using hwp.js
  • Korean text extraction with proper encoding
  • Image extraction from embedded content
  • Layout preservation for Korean documents
  • Copyright message filtering for clean output

🇰🇷 HWPX (Korean XML)

  • XML-based format parsing with JSZip
  • Multiple section support for large documents
  • Relationship mapping for image references
  • OWPML structure parsing
  • Enhanced Korean text processing
  • BinData image extraction from ZIP archive

🖼️ Image Handling

Images are automatically extracted and saved to the specified directory:

const result = await convert('./presentation.pptx', {
  imageDir: 'my-images'
});

// Result structure:
// my-images/
// ├── image_1.png
// ├── image_2.jpg
// └── chart_1.png

// Markdown will contain:
// ![Slide 1 Image](my-images/image_1.png)

Note: PDF files are processed as text-only. Use dedicated PDF tools for image extraction if needed.
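
Because the generated Markdown refers to extracted images by path, as in the example above, it can be handy to list those references, for instance to check that every referenced file exists. The sketch below is one possible approach: `listImageRefs` and `ImageRef` are hypothetical names, not part of file2md, and the pattern only handles the simple `![alt](path)` form without titles or nested brackets.

```typescript
// Hypothetical shape for one image reference found in Markdown text.
interface ImageRef {
  alt: string;
  path: string;
}

// Collect references of the simple form ![alt](path).
function listImageRefs(markdown: string): ImageRef[] {
  const pattern = /!\[([^\]]*)\]\(([^)\s]+)\)/g;
  const refs: ImageRef[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    refs.push({ alt: match[1], path: match[2] });
  }
  return refs;
}

const sample =
  'Intro\n\n![Slide 1 Image](my-images/image_1.png)\n\nText ![](my-images/image_2.jpg)';
console.log(listImageRefs(sample));
```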

📊 Chart Conversion

Charts are converted to Markdown tables:

#### Chart 1: Sales Data

| Category | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Revenue | 100 | 150 | 200 | 250 |
| Profit | 20 | 30 | 45 | 60 |
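
To illustrate how tabular chart data maps onto the output shown above, here is a small sketch that renders a table in the same layout. `ChartSketch` and `chartToMarkdown` are hypothetical and exist only for this example; the library's actual `ChartData` type and rendering code may differ.

```typescript
// Hypothetical chart shape for illustration only.
interface ChartSketch {
  title: string;
  columns: string[]; // header cells, e.g. ['Category', 'Q1', ...]
  rows: Array<Array<string | number>>;
}

// Render a heading followed by a Markdown table, mirroring the README sample.
function chartToMarkdown(chart: ChartSketch, index: number): string {
  const line = (cells: Array<string | number>) => `| ${cells.join(' | ')} |`;
  return [
    `#### Chart ${index}: ${chart.title}`,
    '',
    line(chart.columns),
    line(chart.columns.map(() => '---')),
    ...chart.rows.map((row) => line(row)),
  ].join('\n');
}

console.log(
  chartToMarkdown(
    {
      title: 'Sales Data',
      columns: ['Category', 'Q1', 'Q2', 'Q3', 'Q4'],
      rows: [
        ['Revenue', 100, 150, 200, 250],
        ['Profit', 20, 30, 45, 60],
      ],
    },
    1,
  ),
);
```

This sketch does not escape `|` characters inside cells, so values containing a pipe would need extra handling.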

🛡️ Error Handling

import { 
  convert, 
  UnsupportedFormatError, 
  FileNotFoundError,
  ParseError 
} from 'file2md';

try {
  const result = await convert('./document.pdf');
} catch (error) {
  if (error instanceof UnsupportedFormatError) {
    console.error('Unsupported file format');
  } else if (error instanceof FileNotFoundError) {
    console.error('File not found');
  } else if (error instanceof ParseError) {
    console.error('Failed to parse document:', error.message);
  } else {
    throw error; // Not an error type handled here, so re-throw it
  }
}
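
When many call sites need the same handling, one option is to turn thrown errors into a value that callers must inspect. The wrapper below is a sketch under stated assumptions: `attempt` and `Outcome` are hypothetical names, and `kind` is taken from the error's constructor name, which matches a class name such as `ParseError` only if the code has not been minified.

```typescript
// Result type that makes the failure case explicit instead of throwing.
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; kind: string; message: string };

// `run` is any async operation, for example () => convert('./document.pdf').
async function attempt<T>(run: () => Promise<T>): Promise<Outcome<T>> {
  try {
    return { ok: true, value: await run() };
  } catch (error) {
    if (error instanceof Error) {
      return { ok: false, kind: error.constructor.name, message: error.message };
    }
    // Non-Error values (strings, numbers, ...) can be thrown as well.
    return { ok: false, kind: 'Unknown', message: String(error) };
  }
}

// Usage with a stand-in operation that always fails.
attempt(async () => {
  throw new TypeError('bad input');
}).then((outcome) => console.log(outcome));
```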

🧪 Advanced Usage

Batch Processing

import { convert } from 'file2md';
import { readdir } from 'fs/promises';

async function convertFolder(folderPath: string) {
  const files = await readdir(folderPath);
  const results = [];
  
  for (const file of files) {
    if (file.match(/\.(pdf|docx|xlsx|pptx|hwp|hwpx)$/i)) {
      try {
        const result = await convert(`${folderPath}/${file}`, {
          imageDir: 'batch-images',
          extractImages: true
        });
        results.push({ file, success: true, result });
      } catch (error) {
        results.push({ file, success: false, error });
      }
    }
  }
  
  return results;
}
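
The loop above converts files one at a time. For larger folders, a bounded number of conversions could run concurrently instead. The helper below is a generic sketch, not something file2md provides: `mapWithLimit` and `Settled` are hypothetical names, and the worker is passed in, so with file2md it would be something like `(file) => convert(file, options)`.

```typescript
// Outcome of one item: either a value or the error that was thrown.
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order, and failures do not stop the batch.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<Settled<R>[]> {
  const results: Settled<R>[] = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Usage with a stand-in worker that doubles numbers.
mapWithLimit([1, 2, 3], 2, async (n) => n * 2).then((out) => console.log(out));
```

Whether concurrent conversions can safely share one `imageDir` is not stated in this README, so giving each file its own image directory may be the safer choice when running in parallel.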

Large Document Processing

import { convert } from 'file2md';

// Optimize for large documents
const result = await convert('./large-document.pdf', {
  maxPages: 50,              // Limit PDF processing
  preserveLayout: true       // Keep layout analysis
});

// Enhanced PPTX processing
const pptxResult = await convert('./presentation.pptx', {
  outputDir: 'slides',       // Separate directory for slides
  extractCharts: true,       // Extract chart data
  extractImages: true        // Extract embedded images
});

// Performance metrics are available in metadata
console.log('Performance Metrics:');
console.log(`- Processing time: ${result.metadata.processingTime}ms`);
console.log(`- Pages processed: ${result.metadata.pageCount}`);
console.log(`- Images extracted: ${result.metadata.imageCount}`);
console.log(`- File type: ${result.metadata.fileType}`);

📊 Supported Formats

| Format | Extension | Layout | Images | Charts | Tables | Lists |
| --- | --- | --- | --- | --- | --- | --- |
| PDF | .pdf | ✅ | ❌ | ❌ | ✅ | ✅ |
| Word | .docx | ✅ | ✅ | ✅ | ✅ | ✅ |
| Excel | .xlsx | ✅ | ❌ | ✅ | ✅ | ❌ |
| PowerPoint | .pptx | ✅ | ✅ | ✅ | ✅ | ❌ |
| HWP | .hwp | ✅ | ✅ | ❌ | ❌ | ✅ |
| HWPX | .hwpx | ✅ | ✅ | ❌ | ❌ | ✅ |

Note: PDF processing focuses on text extraction with enhanced layout detection. For PDF image extraction, consider using dedicated PDF processing tools.
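
The support matrix can also be kept in code, for example to warn before asking for images from a PDF. The lookup below is a sketch transcribed by hand from the table in this section: `SUPPORT` and `supports` are hypothetical names and not part of the file2md API.

```typescript
type Feature = 'layout' | 'images' | 'charts' | 'tables' | 'lists';
type Extension = 'pdf' | 'docx' | 'xlsx' | 'pptx' | 'hwp' | 'hwpx';

// Transcribed from the Supported Formats table; update it if the table changes.
const SUPPORT: Record<Extension, Record<Feature, boolean>> = {
  pdf: { layout: true, images: false, charts: false, tables: true, lists: true },
  docx: { layout: true, images: true, charts: true, tables: true, lists: true },
  xlsx: { layout: true, images: false, charts: true, tables: true, lists: false },
  pptx: { layout: true, images: true, charts: true, tables: true, lists: false },
  hwp: { layout: true, images: true, charts: false, tables: false, lists: true },
  hwpx: { layout: true, images: true, charts: false, tables: false, lists: true },
};

// Returns undefined when the file has no extension or one that is not in the table.
function supports(fileName: string, feature: Feature): boolean | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) return undefined;
  const ext = fileName.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(SUPPORT, ext)
    ? SUPPORT[ext as Extension][feature]
    : undefined;
}

console.log(supports('report.PDF', 'images'));
console.log(supports('deck.pptx', 'charts'));
```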

🌏 Korean Document Support

file2md includes comprehensive support for Korean document formats:

HWP (한글)

  • Binary format used by Hangul (한글) word processor
  • Legacy format still widely used in Korean organizations
  • Full text extraction with Korean character encoding
  • Image extraction support

HWPX (한글 XML)

  • Modern XML-based format, successor to HWP
  • ZIP archive structure with XML content files
  • Enhanced parsing with relationship mapping
  • Multiple sections and complex document support
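
The container difference between the two formats shows up in the first bytes of a file. This README does not describe how file2md identifies formats, so the following is only an illustrative sketch with a hypothetical `sniffContainer` function. It relies on well-known signatures: `%PDF` for PDF, `PK\x03\x04` for ZIP archives (HWPX, and also DOCX, XLSX, PPTX), and the OLE compound file header used by HWP 5.x binaries.

```typescript
type Container = 'pdf' | 'zip' | 'ole' | 'unknown';

// Classify a file by its leading bytes. A Node.js Buffer is a Uint8Array,
// so the result of fs.readFile can be passed in directly.
function sniffContainer(bytes: Uint8Array): Container {
  const startsWith = (signature: number[]) =>
    signature.every((byte, i) => bytes[i] === byte);

  if (startsWith([0x25, 0x50, 0x44, 0x46])) return 'pdf'; // "%PDF"
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return 'zip'; // "PK\x03\x04"
  if (startsWith([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1])) return 'ole';
  return 'unknown';
}

console.log(sniffContainer(Uint8Array.from([0x50, 0x4b, 0x03, 0x04, 0x14])));
```

A `zip` result alone cannot tell HWPX apart from DOCX, XLSX, or PPTX; that would require looking at the entries inside the archive.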

Usage Examples

import { convert } from 'file2md';

// Convert Korean documents
const koreanDocs = [
  'report.hwp',      // Legacy binary format
  'document.hwpx',   // Modern XML format
  'presentation.pptx'
];

for (const doc of koreanDocs) {
  const result = await convert(doc, {
    imageDir: 'korean-docs-images',
    preserveLayout: true
  });
  
  console.log(`📄 ${doc}: ${result.markdown.length} characters`);
  console.log(`🖼️ Images: ${result.images.length}`);
  console.log(`⏱️ Processed in ${result.metadata.processingTime}ms`);
}

🔧 Performance & Configuration

The library is optimized for performance with sensible defaults:

  • Zero configuration - Works out of the box
  • Efficient processing - Optimized for various document sizes
  • Memory management - Proper cleanup of temporary resources
  • Type safety - Full TypeScript support

Performance metrics are included in the conversion result for monitoring and optimization.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone the repository
git clone https://github.com/ricky-clevi/file2md.git
cd file2md

# Install dependencies
npm install

# Run tests
npm test

# Build the project
npm run build

# Run linting
npm run lint

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with ❤️ and TypeScript · 🖼️ Enhanced with intelligent document parsing · 🇰🇷 Korean document support