file2md
v1.4.55
A TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with image and layout preservation
file2md
A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with advanced layout preservation, image extraction, chart conversion, and Korean language support.
✨ Features
- 🔄 Multiple Format Support: PDF, DOCX, XLSX, PPTX, HWP, HWPX
- 🎨 Layout Preservation: Maintains document structure, tables, and formatting
- 🖼️ Image Extraction: Extract embedded images from DOCX, PPTX, HWP documents
- 📊 Chart Conversion: Converts charts to Markdown tables
- 📝 List & Table Support: Proper nested lists and complex tables
- 🌏 Korean Language Support: Full support for HWP/HWPX Korean document formats
- 🔒 Type Safety: Full TypeScript support with comprehensive types
- ⚡ Modern ESM: ES2022 modules with CommonJS compatibility
- 🚀 Zero Config: Works out of the box
- 📄 PDF Text Extraction: Enhanced text extraction with layout detection
Note: XLSX image extraction is planned but not yet supported.
📦 Installation
npm install file2md
🚀 Quick Start
TypeScript / ES Modules
import { convert } from 'file2md';
// Convert from a file path
const result = await convert('./document.pdf');
console.log(result.markdown);
// Convert with options (separate variable name, so both calls can share one scope)
const pptxResult = await convert('./presentation.pptx', {
  imageDir: 'extracted-images',
  preserveLayout: true,
  extractCharts: true,
  extractImages: true
});
console.log(`✅ Converted successfully!`);
console.log(`📄 Markdown length: ${pptxResult.markdown.length}`);
console.log(`🖼️ Images extracted: ${pptxResult.images.length}`);
console.log(`📊 Charts found: ${pptxResult.charts.length}`);
console.log(`⏱️ Processing time: ${pptxResult.metadata.processingTime}ms`);
Korean Document Support (HWP/HWPX)
import { convert } from 'file2md';
// Convert Korean HWP document
const hwpResult = await convert('./document.hwp', {
  imageDir: 'hwp-images',
  preserveLayout: true,
  extractImages: true
});
// Convert Korean HWPX document (XML-based format)
const hwpxResult = await convert('./document.hwpx', {
  imageDir: 'hwpx-images',
  preserveLayout: true,
  extractImages: true
});
console.log(`🇰🇷 HWP content: ${hwpResult.markdown.substring(0, 100)}...`);
console.log(`📄 HWPX pages: ${hwpxResult.metadata.pageCount}`);
CommonJS
const { convert } = require('file2md');
// CommonJS has no top-level await, so use the returned promise directly
convert('./document.docx')
  .then((result) => console.log(result.markdown))
  .catch(console.error);
From Buffer
import { convert } from 'file2md';
import { readFile } from 'fs/promises';
const buffer = await readFile('./document.xlsx');
const result = await convert(buffer, {
  imageDir: 'spreadsheet-images'
});
📋 API Reference
convert(input, options?)
Parameters:
- input: string | Buffer - File path or buffer containing document data
- options?: ConvertOptions - Conversion options
Returns: Promise<ConversionResult>
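When `input` is a Buffer there is no file extension to inspect, so a caller may want to check what kind of container the bytes hold before handing them to `convert`. The sketch below is illustrative only: `sniffContainer` and the `Container` type are hypothetical names, and this README does not say how file2md itself detects formats. It relies on well-known file signatures: PDF files start with `%PDF`, ZIP-based formats (DOCX, XLSX, PPTX, HWPX) start with `PK\x03\x04`, and HWP 5.x files are OLE compound files.

```typescript
// Hypothetical helper, not part of the file2md API.
type Container = 'pdf' | 'zip' | 'ole' | 'unknown';

const SIGNATURES: ReadonlyArray<{ kind: Container; bytes: number[] }> = [
  { kind: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { kind: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04": DOCX, XLSX, PPTX, HWPX
  { kind: 'ole', bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // OLE compound file: HWP 5.x
];

// Compare the leading bytes of the input against each known signature.
function sniffContainer(data: Uint8Array): Container {
  for (const { kind, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return kind;
    }
  }
  return 'unknown';
}
```

A Node.js `Buffer` is a `Uint8Array`, so the result of `readFile` can be passed in directly. A `zip` result only identifies the container; telling DOCX, XLSX, PPTX, and HWPX apart would require looking inside the archive.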
Options
interface ConvertOptions {
  imageDir?: string;        // Directory for extracted images (default: 'images')
  outputDir?: string;       // Output directory for slide screenshots (PPTX, falls back to imageDir)
  preserveLayout?: boolean; // Maintain document layout (default: true)
  extractCharts?: boolean;  // Convert charts to tables (default: true)
  extractImages?: boolean;  // Extract embedded images (default: true)
  maxPages?: number;        // Max pages for PDFs (default: unlimited)
}
Result
interface ConversionResult {
  markdown: string;           // Generated Markdown content
  images: ImageData[];        // Extracted image information
  charts: ChartData[];        // Extracted chart data
  metadata: DocumentMetadata; // Document metadata with processing info
}
🎯 Format-Specific Features
📄 PDF
- ✅ Text extraction with layout enhancement
- ✅ Table detection and formatting
- ✅ List recognition (bullets, numbers)
- ✅ Heading detection (ALL CAPS, colons)
- ❌ Image extraction (text-only processing)
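The heading detection listed above (ALL CAPS, colons) can be pictured as a simple per-line heuristic. The following is a minimal sketch under assumptions: `looksLikeHeading`, the 80-character cutoff, and the three-letter minimum are all hypothetical, and the library's actual rules may differ.

```typescript
// Hypothetical heuristic, not the library's implementation.
function looksLikeHeading(line: string): boolean {
  const text = line.trim();
  // Empty lines and long lines are treated as body text (80 is an arbitrary cutoff).
  if (text.length === 0 || text.length > 80) return false;

  // Keep ASCII letters only; scripts without letter case never count as ALL CAPS here.
  const letters = text.replace(/[^A-Za-z]/g, '');
  const allCaps = letters.length >= 3 && letters === letters.toUpperCase();

  const endsWithColon = text.endsWith(':');
  return allCaps || endsWithColon;
}
```

A heuristic like this will produce false positives (for example, a sentence that introduces a list and ends with a colon), which is one reason PDF output may need manual review.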
📝 DOCX
- ✅ Heading hierarchy (H1-H6)
- ✅ Text formatting (bold, italic)
- ✅ Complex tables with merged cells
- ✅ Nested lists with proper indentation
- ✅ Embedded images and charts
- ✅ Cell styling (alignment, colors)
- ✅ Font size preservation and formatting
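To illustrate what "nested lists with proper indentation" means in the Markdown output, here is a small sketch that renders list items with a nesting level. The `ListItem` shape and `renderList` are hypothetical and not taken from file2md. It indents four spaces per level, which CommonMark accepts as nesting under both `-` and `1.` markers.

```typescript
// Hypothetical data shape for a flattened list; not a file2md type.
interface ListItem {
  text: string;
  level: number;    // 0 = top level
  ordered: boolean; // true renders "1.", false renders "-"
}

// Render each item on its own line, indented by four spaces per nesting level.
function renderList(items: ListItem[]): string {
  return items
    .map(({ text, level, ordered }) => {
      const indent = '    '.repeat(level);
      const marker = ordered ? '1.' : '-';
      return `${indent}${marker} ${text}`;
    })
    .join('\n');
}
```

Using `1.` for every ordered item is valid Markdown: renderers number the items sequentially.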
📊 XLSX
- ✅ Multiple worksheets as separate sections
- ✅ Cell formatting (bold, colors, alignment)
- ✅ Data type preservation
- ✅ Chart extraction to data tables
- ✅ Conditional formatting notes
- ✅ Shared strings handling for large files
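Some background on the shared strings item: in an XLSX file, a cell whose `t` attribute is `s` does not store its text inline. Its `<v>` value is a zero-based index into the workbook's shared string table. The sketch below shows that lookup; `RawCell` and `resolveCell` are hypothetical names used for illustration, not file2md internals.

```typescript
// Hypothetical minimal view of a parsed worksheet cell.
interface RawCell {
  type?: 's' | 'n' | 'b' | 'str'; // value of the cell's `t` attribute; absent means numeric
  value: string;                  // text content of the cell's <v> element
}

// Turn a raw cell into a JavaScript value, resolving shared string indexes.
function resolveCell(cell: RawCell, sharedStrings: string[]): string | number | boolean {
  switch (cell.type) {
    case 's': {
      const index = Number.parseInt(cell.value, 10);
      const text = sharedStrings[index];
      if (text === undefined) {
        throw new RangeError(`Shared string ${cell.value} not found`);
      }
      return text;
    }
    case 'b':
      return cell.value === '1';
    case 'str':
      return cell.value; // formula result stored as inline text
    default:
      return Number(cell.value);
  }
}
```

Because each distinct string is stored once, large spreadsheets with repeated labels stay small on disk, and a converter only needs to load the table once per workbook.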
🎬 PPTX
- ✅ Slide-by-slide organization
- ✅ Text positioning and layout
- ✅ Image placement per slide
- ✅ Table extraction from slides
- ✅ Multi-column layouts
- ✅ Title extraction from document properties
- ✅ Chart and image inline embedding
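Charts embedded this way end up as Markdown tables in the format shown in the Chart Conversion section. As a rough sketch of that transformation, the code below turns a chart's categories and series into such a table. The `Chart` and `ChartSeries` shapes and `chartToMarkdown` are assumptions for illustration; the actual `ChartData` type is not documented in this README.

```typescript
// Hypothetical chart shape; the library's ChartData type may differ.
interface ChartSeries {
  name: string;
  values: number[];
}

interface Chart {
  title: string;
  categories: string[]; // column labels, e.g. quarters
  series: ChartSeries[]; // one table row per series
}

// Build a heading plus a Markdown table with one row per series.
function chartToMarkdown(chart: Chart, index: number): string {
  const header = ['Category', ...chart.categories];
  const lines = [
    `#### Chart ${index}: ${chart.title}`,
    `| ${header.join(' | ')} |`,
    `| ${header.map(() => '---').join(' | ')} |`,
    ...chart.series.map((s) => `| ${[s.name, ...s.values].join(' | ')} |`),
  ];
  return lines.join('\n');
}
```

Called with a "Sales Data" chart that has categories Q1 to Q4 and the series Revenue and Profit, this produces the same table as the example in the Chart Conversion section.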
🇰🇷 HWP (Korean)
- ✅ Binary format parsing using hwp.js
- ✅ Korean text extraction with proper encoding
- ✅ Image extraction from embedded content
- ✅ Layout preservation for Korean documents
- ✅ Copyright message filtering for clean output
🇰🇷 HWPX (Korean XML)
- ✅ XML-based format parsing with JSZip
- ✅ Multiple section support for large documents
- ✅ Relationship mapping for image references
- ✅ OWPML structure parsing
- ✅ Enhanced Korean text processing
- ✅ BinData image extraction from ZIP archive
🖼️ Image Handling
Images are automatically extracted and saved to the specified directory:
const result = await convert('./presentation.pptx', {
  imageDir: 'my-images'
});
// Result structure:
// my-images/
// ├── image_1.png
// ├── image_2.jpg
// └── chart_1.png
// The generated Markdown will contain references to these images.
Note: PDF files are processed as text-only. Use dedicated PDF tools for image extraction if needed.
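For readers who post-process the output, the sketch below shows how a Markdown image reference pointing into the image directory could be built. `imageReference` is a hypothetical helper, and the exact link text and path format that file2md emits are not shown in this README, so treat the output format as an assumption.

```typescript
// Hypothetical helper, not part of the file2md API.
function imageReference(imageDir: string, fileName: string, altText: string): string {
  // Normalize Windows separators and drop trailing slashes from the directory.
  const dir = imageDir.replace(/\\/g, '/').replace(/\/+$/, '');
  const path = dir === '' ? fileName : `${dir}/${fileName}`;
  // Percent-encode spaces and similar characters so the link target stays valid.
  return `![${altText}](${encodeURI(path)})`;
}
```

For example, `imageReference('my-images', 'image_1.png', 'Image 1')` yields `![Image 1](my-images/image_1.png)`.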
📊 Chart Conversion
Charts are converted to Markdown tables:
#### Chart 1: Sales Data
| Category | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Revenue | 100 | 150 | 200 | 250 |
| Profit | 20 | 30 | 45 | 60 |
🛡️ Error Handling
import {
  convert,
  UnsupportedFormatError,
  FileNotFoundError,
  ParseError
} from 'file2md';
try {
  const result = await convert('./document.pdf');
} catch (error) {
  if (error instanceof UnsupportedFormatError) {
    console.error('Unsupported file format');
  } else if (error instanceof FileNotFoundError) {
    console.error('File not found');
  } else if (error instanceof ParseError) {
    console.error('Failed to parse document:', error.message);
  } else {
    throw error; // Rethrow anything unexpected instead of silently swallowing it
  }
}
🧪 Advanced Usage
Batch Processing
import { convert } from 'file2md';
import { readdir } from 'fs/promises';
async function convertFolder(folderPath: string) {
  const files = await readdir(folderPath);
  const results = [];
  for (const file of files) {
    if (file.match(/\.(pdf|docx|xlsx|pptx|hwp|hwpx)$/i)) {
      try {
        const result = await convert(`${folderPath}/${file}`, {
          imageDir: 'batch-images',
          extractImages: true
        });
        results.push({ file, success: true, result });
      } catch (error) {
        results.push({ file, success: false, error });
      }
    }
  }
  return results;
}
Large Document Processing
import { convert } from 'file2md';
// Optimize for large documents
const result = await convert('./large-document.pdf', {
  maxPages: 50,        // Limit PDF processing
  preserveLayout: true // Keep layout analysis
});
// Enhanced PPTX processing
const pptxResult = await convert('./presentation.pptx', {
  outputDir: 'slides', // Separate directory for slides
  extractCharts: true, // Extract chart data
  extractImages: true  // Extract embedded images
});
// Performance metrics are available in metadata
console.log('Performance Metrics:');
console.log(`- Processing time: ${result.metadata.processingTime}ms`);
console.log(`- Pages processed: ${result.metadata.pageCount}`);
console.log(`- Images extracted: ${result.metadata.imageCount}`);
console.log(`- File type: ${result.metadata.fileType}`);
📊 Supported Formats
| Format | Extension | Layout | Images | Charts | Tables | Lists |
|--------|-----------|---------|---------|---------|---------|--------|
| PDF | .pdf | ✅ | ❌ | ❌ | ✅ | ✅ |
| Word | .docx | ✅ | ✅ | ✅ | ✅ | ✅ |
| Excel | .xlsx | ✅ | ❌ | ✅ | ✅ | ❌ |
| PowerPoint | .pptx | ✅ | ✅ | ✅ | ✅ | ❌ |
| HWP | .hwp | ✅ | ✅ | ❌ | ❌ | ✅ |
| HWPX | .hwpx | ✅ | ✅ | ❌ | ❌ | ✅ |
Note: PDF processing focuses on text extraction with enhanced layout detection. For PDF image extraction, consider using dedicated PDF processing tools.
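The table above can also be encoded as a lookup, which is handy when deciding up front whether a conversion is worth requesting for a given file. This is a minimal sketch: `CAPABILITIES` and `supports` are hypothetical names, and the values are copied from the table rather than read from the library.

```typescript
type Feature = 'layout' | 'images' | 'charts' | 'tables' | 'lists';
type Extension = 'pdf' | 'docx' | 'xlsx' | 'pptx' | 'hwp' | 'hwpx';

// Values transcribed from the Supported Formats table.
const CAPABILITIES: Record<Extension, Record<Feature, boolean>> = {
  pdf:  { layout: true, images: false, charts: false, tables: true,  lists: true },
  docx: { layout: true, images: true,  charts: true,  tables: true,  lists: true },
  xlsx: { layout: true, images: false, charts: true,  tables: true,  lists: false },
  pptx: { layout: true, images: true,  charts: true,  tables: true,  lists: false },
  hwp:  { layout: true, images: true,  charts: false, tables: false, lists: true },
  hwpx: { layout: true, images: true,  charts: false, tables: false, lists: true },
};

// Look up a feature by file name; unknown extensions are reported as unsupported.
function supports(fileName: string, feature: Feature): boolean {
  const ext = fileName.slice(fileName.lastIndexOf('.') + 1).toLowerCase();
  const known = Object.prototype.hasOwnProperty.call(CAPABILITIES, ext);
  return known ? CAPABILITIES[ext as Extension][feature] : false;
}
```

For instance, `supports('report.PDF', 'images')` is `false`, matching the note that PDFs are processed as text only.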
🌏 Korean Document Support
file2md includes comprehensive support for Korean document formats:
HWP (한글)
- Binary format used by Hangul (한글) word processor
- Legacy format still widely used in Korean organizations
- Full text extraction with Korean character encoding
- Image extraction support
HWPX (한글 XML)
- Modern XML-based format, successor to HWP
- ZIP archive structure with XML content files
- Enhanced parsing with relationship mapping
- Multiple sections and complex document support
Usage Examples
import { convert } from 'file2md';
// Convert Korean documents
const koreanDocs = [
  'report.hwp',      // Legacy binary format
  'document.hwpx',   // Modern XML format
  'presentation.pptx'
];
for (const doc of koreanDocs) {
  const result = await convert(doc, {
    imageDir: 'korean-docs-images',
    preserveLayout: true
  });
  console.log(`📄 ${doc}: ${result.markdown.length} characters`);
  console.log(`🖼️ Images: ${result.images.length}`);
  console.log(`⏱️ Processed in ${result.metadata.processingTime}ms`);
}
🔧 Performance & Configuration
The library is optimized for performance with sensible defaults:
- Zero configuration - Works out of the box
- Efficient processing - Optimized for various document sizes
- Memory management - Proper cleanup of temporary resources
- Type safety - Full TypeScript support
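The defaults documented in `ConvertOptions` can be made explicit with a small helper, which is useful when logging or testing the effective configuration. This is a sketch only: `resolveOptions` and `ResolvedOptions` are hypothetical, the interface is a local copy of the one in the API Reference, and the library may resolve its defaults differently internally.

```typescript
// Local copy of the documented options shape, so the sketch is self-contained.
interface ConvertOptions {
  imageDir?: string;
  outputDir?: string;
  preserveLayout?: boolean;
  extractCharts?: boolean;
  extractImages?: boolean;
  maxPages?: number;
}

type ResolvedOptions = Required<ConvertOptions>;

// Apply the defaults stated in the API Reference comments.
function resolveOptions(options: ConvertOptions = {}): ResolvedOptions {
  const imageDir = options.imageDir ?? 'images';
  return {
    imageDir,
    outputDir: options.outputDir ?? imageDir, // documented fallback to imageDir
    preserveLayout: options.preserveLayout ?? true,
    extractCharts: options.extractCharts ?? true,
    extractImages: options.extractImages ?? true,
    maxPages: options.maxPages ?? Number.POSITIVE_INFINITY, // "unlimited"
  };
}
```

Using `??` rather than `||` keeps explicit `false` values, so `{ extractImages: false }` is not overwritten by the default.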
Performance metrics are included in the conversion result for monitoring and optimization.
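As one possible way to use those metrics, the sketch below condenses the metadata fields that appear in the examples (`fileType`, `pageCount`, `imageCount`, `processingTime`) into a single log line. `MetadataLike` and `summarizeMetadata` are hypothetical, and treating every field as optional is an assumption, since the full `DocumentMetadata` type is not shown here.

```typescript
// Hypothetical subset of the metadata; every field is assumed optional.
interface MetadataLike {
  fileType?: string;
  pageCount?: number;
  imageCount?: number;
  processingTime?: number; // milliseconds
}

// Build a compact one-line summary, skipping fields that are absent.
function summarizeMetadata(meta: MetadataLike): string {
  const parts: string[] = [];
  if (meta.fileType !== undefined) parts.push(`type=${meta.fileType}`);
  if (meta.pageCount !== undefined) parts.push(`pages=${meta.pageCount}`);
  if (meta.imageCount !== undefined) parts.push(`images=${meta.imageCount}`);
  if (meta.processingTime !== undefined) {
    parts.push(`time=${meta.processingTime}ms`);
    if (meta.pageCount !== undefined && meta.processingTime > 0) {
      const rate = meta.pageCount / (meta.processingTime / 1000);
      parts.push(`rate=${rate.toFixed(1)} pages/s`);
    }
  }
  return parts.join(' ');
}
```

Passing `result.metadata` from a conversion gives a line suitable for structured logs or a batch report.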
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Clone the repository
git clone https://github.com/ricky-clevi/file2md.git
cd file2md
# Install dependencies
npm install
# Run tests
npm test
# Build the project
npm run build
# Run linting
npm run lint
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ and TypeScript • 🖼️ Enhanced with intelligent document parsing • 🇰🇷 Korean document support
