@rlyle1179/pdf2html

v4.4.7

PDF to HTML or text conversion using Apache Tika. Also generates PDF thumbnails using Apache PDFBox.

pdf2html

Convert PDF files to HTML, extract text, generate thumbnails, extract images, and extract metadata using Apache Tika and PDFBox.

🚀 Features

  • PDF to HTML conversion - Maintains formatting and structure
  • Text extraction - Extract plain text content from PDFs
  • Page-by-page processing - Process PDFs page by page
  • Metadata extraction - Extract author, title, creation date, and more
  • Thumbnail generation - Generate preview images from PDF pages
  • Image extraction - Extract all embedded images from PDFs
  • Buffer support - Process PDFs from memory buffers or file paths
  • TypeScript support - Full type definitions included
  • Async/Promise based - Modern async API
  • Configurable - Extensive options for customization

📋 Prerequisites

  • Node.js >= 12.0.0
  • Java Runtime Environment (JRE) >= 8
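
Both prerequisites can be checked from Node.js before installing. The snippet below is only a sketch and not part of pdf2html: `parseMajor` and `checkPrerequisites` are hypothetical helper names, and the Java check only confirms that a `java` executable runs from PATH; it does not parse the reported Java version.

```javascript
const { spawnSync } = require('child_process');

// Extract the major version number from a string such as "v18.17.0" or "12.0.0".
function parseMajor(version) {
    return parseInt(String(version).replace(/^v/, '').split('.')[0], 10);
}

// Hypothetical helper: reports whether Node.js >= 12 and a `java` binary are available.
function checkPrerequisites() {
    const nodeOk = parseMajor(process.versions.node) >= 12;
    // spawnSync sets `error` when the binary cannot be started (for example, not on PATH).
    const result = spawnSync('java', ['-version']);
    const javaOk = !result.error && result.status === 0;
    return { nodeOk, javaOk };
}

console.log(checkPrerequisites());
```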

📦 Installation

Using npm:

npm install pdf2html

Using yarn:

yarn add pdf2html

Using pnpm:

pnpm add pdf2html

The installation process will automatically download the required Apache Tika and PDFBox JAR files. You'll see a progress indicator during the download.

🔧 Basic Usage

Convert PDF to HTML

const pdf2html = require('pdf2html');
const fs = require('fs');

// Note: `await` must run inside an async function (or an ES module).

// From file path
const htmlFromPath = await pdf2html.html('path/to/document.pdf');
console.log(htmlFromPath);

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlFromBuffer = await pdf2html.html(pdfBuffer);
console.log(htmlFromBuffer);

// With options
const htmlWithOptions = await pdf2html.html(pdfBuffer, {
    maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
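
A common next step is writing the converted HTML to disk. The sketch below assumes a hypothetical helper, `convertToFile`, that takes the converter as a parameter so it can be exercised without Java installed; the only thing it relies on is an `html(input, options)` method like the one shown above.

```javascript
const fs = require('fs');

// Hypothetical helper: `converter` is assumed to expose html(input, options),
// as the pdf2html module does in the examples above.
async function convertToFile(converter, input, outputPath, options = {}) {
    const html = await converter.html(input, options);
    await fs.promises.writeFile(outputPath, html, 'utf8');
    return outputPath;
}

// Usage (requires pdf2html and Java to be installed):
// const pdf2html = require('pdf2html');
// convertToFile(pdf2html, 'path/to/document.pdf', 'document.html')
//     .then((file) => console.log('Saved', file))
//     .catch((error) => console.error(error.message));
```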

Extract Text

// From file path
const textFromPath = await pdf2html.text('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const textFromBuffer = await pdf2html.text(pdfBuffer);
console.log(textFromBuffer);

Process Pages Individually

// From file path
const pagesFromPath = await pdf2html.pages('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlPages = await pdf2html.pages(pdfBuffer);
htmlPages.forEach((page, index) => {
    console.log(`Page ${index + 1}:`, page);
});

// Get text for each page
const textPages = await pdf2html.pages(pdfBuffer, {
    text: true,
});
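
Per-page text makes it easy to locate content by page. The following sketch uses a hypothetical helper, `findPagesContaining`, and stand-in data in place of a real `pages(..., { text: true })` result.

```javascript
// Hypothetical helper: given an array of per-page text, return the 1-based
// page numbers whose text contains `term` (case-insensitive).
function findPagesContaining(textPages, term) {
    const needle = term.toLowerCase();
    return textPages
        .map((text, index) => ({ text, pageNumber: index + 1 }))
        .filter(({ text }) => text.toLowerCase().includes(needle))
        .map(({ pageNumber }) => pageNumber);
}

// Stand-in data; in practice this would come from pdf2html.pages(input, { text: true }).
const samplePages = ['Introduction and scope', 'Methods', 'Results and Scope of work'];
console.log(findPagesContaining(samplePages, 'scope')); // [ 1, 3 ]
```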

Extract Metadata

// From file path or buffer
const metadata = await pdf2html.meta(pdfBuffer);
console.log(metadata);
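// Illustrative output. Actual keys may use Tika-style names such as
// 'dc:title' or 'pdf:producer' (see Type Definitions).
// {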
//   title: 'Document Title',
//   author: 'John Doe',
//   subject: 'Document Subject',
//   keywords: 'pdf, conversion',
//   creator: 'Microsoft Word',
//   producer: 'Adobe PDF Library',
//   creationDate: '2023-01-01T00:00:00Z',
//   modificationDate: '2023-01-02T00:00:00Z',
//   pages: 10
// }
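
Because this README shows metadata keys in two styles (plain names in the example output, prefixed names such as 'dc:title' in the type definitions), code that reads metadata may want to tolerate both. This is a sketch under that assumption; `pick` and `summarizeMetadata` are hypothetical helpers, not part of the package.

```javascript
// Hypothetical helper: return the value of the first key that is present and non-empty.
function pick(metadata, keys) {
    for (const key of keys) {
        if (metadata[key] !== undefined && metadata[key] !== '') {
            return metadata[key];
        }
    }
    return undefined;
}

// Collect a few common fields, accepting either key style.
function summarizeMetadata(metadata) {
    return {
        title: pick(metadata, ['dc:title', 'title']),
        author: pick(metadata, ['dc:creator', 'author']),
        producer: pick(metadata, ['pdf:producer', 'producer']),
    };
}

// Stand-in object; in practice this would come from pdf2html.meta(input).
console.log(summarizeMetadata({ 'dc:title': 'Report', author: 'Jane' }));
```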

Generate Thumbnails

// From file path
const thumbnailFromPath = await pdf2html.thumbnail('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const thumbnailFromBuffer = await pdf2html.thumbnail(pdfBuffer);
console.log('Thumbnail saved to:', thumbnailFromBuffer);

// Custom thumbnail options
const customThumbnail = await pdf2html.thumbnail(pdfBuffer, {
    page: 1, // Page number (default: 1)
    imageType: 'png', // 'png' or 'jpg' (default: 'png')
    width: 300, // Width in pixels (default: 160)
    height: 400, // Height in pixels (default: 226)
});
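
When only a target width is known, a matching height can be derived from the default 160 x 226 proportions. This is plain arithmetic for building the options object; `thumbnailSizeForWidth` is a hypothetical helper, and nothing here assumes how the library itself scales images.

```javascript
// Default thumbnail dimensions documented above.
const DEFAULT_WIDTH = 160;
const DEFAULT_HEIGHT = 226;

// Hypothetical helper: keep the default aspect ratio for a custom width.
function thumbnailSizeForWidth(width) {
    return { width, height: Math.round((width * DEFAULT_HEIGHT) / DEFAULT_WIDTH) };
}

console.log(thumbnailSizeForWidth(320)); // { width: 320, height: 452 }
```

The result can be spread into the thumbnail options, for example `{ page: 1, ...thumbnailSizeForWidth(320) }`.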

Extract Images

// From file path
const imagePaths = await pdf2html.extractImages('path/to/document.pdf');
console.log('Extracted images:', imagePaths);
// Output: ['/absolute/path/to/files/image/document1.jpg', '/absolute/path/to/files/image/document2.png', ...]

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const imagesFromBuffer = await pdf2html.extractImages(pdfBuffer);

// With custom output directory
const imagesInCustomDir = await pdf2html.extractImages(pdfBuffer, {
    outputDirectory: './extracted-images', // Custom output directory
});

// With custom buffer size for large PDFs
const imagesFromLargePdf = await pdf2html.extractImages('large-document.pdf', {
    outputDirectory: './output',
    maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
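
Since the extracted files can mix formats, grouping the returned paths by extension is often useful. A minimal sketch, with `groupByExtension` as a hypothetical helper and stand-in paths shaped like the output shown above:

```javascript
const path = require('path');

// Hypothetical helper: group file paths by lowercase extension.
function groupByExtension(imagePaths) {
    const groups = {};
    for (const imagePath of imagePaths) {
        const ext = path.extname(imagePath).toLowerCase() || '(none)';
        (groups[ext] = groups[ext] || []).push(imagePath);
    }
    return groups;
}

// Stand-in data; in practice this would come from pdf2html.extractImages(input).
const sampleImages = ['/files/image/document1.jpg', '/files/image/document2.png', '/files/image/document3.JPG'];
console.log(groupByExtension(sampleImages));
```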

💻 TypeScript Support

This package includes TypeScript type definitions out of the box. No need to install @types/pdf2html.

Basic TypeScript Usage

import { readFileSync } from 'fs';
import * as pdf2html from 'pdf2html';
// or
import { html, text, pages, meta, thumbnail, extractImages, PDFMetadata, PDFProcessingError } from 'pdf2html';

async function convertPDF() {
    try {
        // All methods accept string paths or Buffers
        const pdfData = readFileSync('document.pdf');
        const htmlContent: string = await pdf2html.html('document.pdf');
        const textContent: string = await pdf2html.text(pdfData);

        // Full type safety for options
        const thumbnailPath = await pdf2html.thumbnail('document.pdf', {
            page: 1, // number
            imageType: 'png', // 'png' | 'jpg'
            width: 300, // number
            height: 400, // number
        });

        // TypeScript knows the shape of metadata
        const metadata: PDFMetadata = await pdf2html.meta('document.pdf');
        console.log(metadata['pdf:producer']); // string | undefined
        console.log(metadata.resourceName); // string | undefined
    } catch (error) {
        if (error instanceof pdf2html.PDFProcessingError) {
            console.error('PDF processing failed:', error.message);
            console.error('Exit code:', error.exitCode);
        }
    }
}

Type Definitions

// Input types - all methods accept either file paths or Buffers
type PDFInput = string | Buffer;

// Options interfaces
interface ProcessingOptions {
    maxBuffer?: number; // Maximum buffer size in bytes
}

interface PageOptions extends ProcessingOptions {
    text?: boolean; // Extract text instead of HTML
}

interface ThumbnailOptions extends ProcessingOptions {
    page?: number; // Page number (default: 1)
    imageType?: 'png' | 'jpg'; // Image format (default: 'png')
    width?: number; // Width in pixels (default: 160)
    height?: number; // Height in pixels (default: 226)
}

// Metadata structure with common fields
interface PDFMetadata {
    'pdf:PDFVersion'?: string;
    'pdf:producer'?: string;
    'xmp:CreatorTool'?: string;
    'dc:title'?: string;
    'dc:creator'?: string;
    resourceName?: string;
    [key: string]: any; // Allows additional properties
}

// Error class
class PDFProcessingError extends Error {
    command?: string; // The command that failed
    exitCode?: number; // The process exit code
}

IntelliSense Support

Full IntelliSense support in VS Code and other TypeScript-aware editors:

  • Auto-completion for all methods and options
  • Inline documentation on hover
  • Type checking at compile time
  • Catch errors before runtime

Advanced TypeScript Usage

import * as pdf2html from 'pdf2html';
import { PDFProcessor, utils } from 'pdf2html';

// Using the PDFProcessor class directly
const html = await PDFProcessor.toHTML('document.pdf');

// Using utility classes
const { FileManager, HTMLParser } = utils;
await FileManager.ensureDirectories();

// Type guards
function isPDFProcessingError(error: unknown): error is pdf2html.PDFProcessingError {
    return error instanceof pdf2html.PDFProcessingError;
}

// Generic helper with proper typing
async function processPDFSafely<T>(operation: () => Promise<T>, fallback: T): Promise<T> {
    try {
        return await operation();
    } catch (error) {
        if (isPDFProcessingError(error)) {
            console.error(`PDF operation failed: ${error.message}`);
        }
        return fallback;
    }
}

// Usage
const pages = await processPDFSafely(
    () => pdf2html.pages('document.pdf', { text: true }),
    [] // fallback to empty array
);

⚙️ Advanced Configuration

Buffer Size Configuration

By default, the maximum buffer size is 2MB. For large PDFs, you may need to increase this:

const options = {
    maxBuffer: 1024 * 1024 * 50, // 50MB buffer
};

// Apply to any method
await pdf2html.html('large-file.pdf', options);
await pdf2html.text('large-file.pdf', options);
await pdf2html.pages('large-file.pdf', options);
await pdf2html.meta('large-file.pdf', options);
await pdf2html.thumbnail('large-file.pdf', options);
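
Rather than guessing a size up front, the buffer can be grown on demand. The sketch below retries an operation with a doubled `maxBuffer` until it succeeds or a limit is reached. `withGrowingBuffer` and `looksLikeBufferError` are hypothetical helpers, and the test for a buffer overflow (an error message mentioning "maxBuffer") is an assumption; pass a custom `isBufferError` if the library reports it differently.

```javascript
const MB = 1024 * 1024;

// Assumption: a buffer overflow surfaces as an error whose message mentions "maxBuffer".
function looksLikeBufferError(error) {
    return /maxBuffer/i.test(String(error && error.message));
}

// Hypothetical helper: call operation({ maxBuffer }) and double the buffer on overflow.
async function withGrowingBuffer(operation, { start = 2 * MB, limit = 64 * MB, isBufferError = looksLikeBufferError } = {}) {
    let maxBuffer = start;
    for (;;) {
        try {
            return await operation({ maxBuffer });
        } catch (error) {
            // Give up on unrelated errors or when doubling would exceed the limit.
            if (!isBufferError(error) || maxBuffer * 2 > limit) {
                throw error;
            }
            maxBuffer *= 2;
        }
    }
}

// Usage (requires pdf2html and Java to be installed):
// const pdf2html = require('pdf2html');
// withGrowingBuffer((options) => pdf2html.html('large-file.pdf', options))
//     .then((html) => console.log(html.length));
```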

Error Handling

Always wrap your calls in try-catch blocks for proper error handling:

try {
    const html = await pdf2html.html('document.pdf');
    // Process HTML
} catch (error) {
    if (error.code === 'ENOENT') {
        console.error('PDF file not found');
    } else if (error.message.includes('Java')) {
        console.error('Java is not installed or not in PATH');
    } else {
        console.error('PDF processing failed:', error.message);
    }
}

🏗️ API Reference

pdf2html.html(input, [options])

Converts PDF to HTML format.

  • input string | Buffer - Path to the PDF file or PDF buffer
  • options object (optional)
    • maxBuffer number - Maximum buffer size in bytes (default: 2MB)
  • Returns: Promise<string> - HTML content

pdf2html.text(input, [options])

Extracts text from PDF.

  • input string | Buffer - Path to the PDF file or PDF buffer
  • options object (optional)
    • maxBuffer number - Maximum buffer size in bytes
  • Returns: Promise<string> - Extracted text

pdf2html.pages(input, [options])

Processes PDF page by page.

  • input string | Buffer - Path to the PDF file or PDF buffer
  • options object (optional)
    • text boolean - Extract text instead of HTML (default: false)
    • maxBuffer number - Maximum buffer size in bytes
  • Returns: Promise<string[]> - Array of HTML or text strings

pdf2html.meta(input, [options])

Extracts PDF metadata.

  • input string | Buffer - Path to the PDF file or PDF buffer
  • options object (optional)
    • maxBuffer number - Maximum buffer size in bytes
  • Returns: Promise<object> - Metadata object

pdf2html.thumbnail(input, [options])

Generates a thumbnail image from PDF.

  • input string | Buffer - Path to the PDF file or PDF buffer
  • options object (optional)
    • page number - Page to thumbnail (default: 1)
    • imageType string - 'png' or 'jpg' (default: 'png')
    • width number - Thumbnail width (default: 160)
    • height number - Thumbnail height (default: 226)
    • maxBuffer number - Maximum buffer size in bytes
  • Returns: Promise<string> - Path to generated thumbnail

pdf2html.extractImages(input, [options])

Extracts embedded images from PDF.

  • input string | Buffer - Path to the PDF file or PDF buffer
  • options object (optional)
    • outputDirectory string - Directory where extracted images are written
    • maxBuffer number - Maximum buffer size in bytes
  • Returns: Promise<string[]> - Paths to the extracted images

🔧 Manual Dependency Installation

If automatic download fails (e.g., due to network restrictions), you can manually download the dependencies:

  1. Create the vendor directory:

    mkdir -p node_modules/pdf2html/vendor

  2. Download the required JAR files:

    cd node_modules/pdf2html/vendor

    # Download Apache PDFBox
    wget https://archive.apache.org/dist/pdfbox/2.0.34/pdfbox-app-2.0.34.jar

    # Download Apache Tika
    wget https://archive.apache.org/dist/tika/3.2.0/tika-app-3.2.0.jar

  3. Verify the files are in place (run from the project root):

    ls -la node_modules/pdf2html/vendor/
    # Should show both JAR files
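
The same verification can be done from Node.js, which helps in CI or on systems without `ls`. This is a sketch: `missingJars` is a hypothetical helper, and the file names are taken from the download commands above.

```javascript
const fs = require('fs');
const path = require('path');

// File names taken from the download commands above.
const REQUIRED_JARS = ['pdfbox-app-2.0.34.jar', 'tika-app-3.2.0.jar'];

// Hypothetical helper: list the required JAR files that are not present in vendorDir.
function missingJars(vendorDir) {
    return REQUIRED_JARS.filter((jar) => !fs.existsSync(path.join(vendorDir, jar)));
}

// Run from the project root; an empty array means both files are in place.
console.log(missingJars(path.join('node_modules', 'pdf2html', 'vendor')));
```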

🐛 Troubleshooting

Common Issues

  1. "Java is not installed"

    • Install Java JRE 8 or higher
    • Ensure java is in your system PATH
    • Verify with: java -version

  2. "File not found" errors

    • Check that the PDF path is correct
    • Use absolute paths for better reliability
    • Ensure the file has read permissions

  3. "Buffer size exceeded"

    • Increase maxBuffer option
    • Process large PDFs page by page
    • Consider splitting very large PDFs

  4. "Download failed during installation"

    • Check internet connection
    • Try manual installation (see above)
    • Check proxy settings if behind firewall

Debug Mode

Enable debug output for troubleshooting:

DEBUG=pdf2html node your-script.js

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

📊 Dependencies

  • Production: Apache Tika 3.2.0, Apache PDFBox 2.0.34
  • Development: See package.json for development dependencies

Made with ❤️ by the pdf2html community