extract2md

v2.0.0

Client-side PDF to Markdown conversion with OCR and optional LLM rewrite. Core dependencies bundled for offline use.

Extract2MD - Enhanced PDF to Markdown Converter

A powerful client-side JavaScript library for converting PDFs to Markdown with multiple extraction methods and optional LLM enhancement. Now with scenario-specific methods for different use cases.

🚀 Quick Start

Extract2MD now offers 5 distinct scenarios for different conversion needs:

import Extract2MDConverter from 'extract2md';

// Scenario 1: Quick conversion only
const markdown1 = await Extract2MDConverter.quickConvertOnly(pdfFile);

// Scenario 2: High accuracy OCR conversion only  
const markdown2 = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);

// Scenario 3: Quick conversion + LLM enhancement
const markdown3 = await Extract2MDConverter.quickConvertWithLLM(pdfFile);

// Scenario 4: High accuracy conversion + LLM enhancement
const markdown4 = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);

// Scenario 5: Combined extraction + LLM enhancement (most comprehensive)
const markdown5 = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);

📋 Scenarios Explained

Scenario 1: Quick Convert Only

  • Use case: Fast conversion when PDF has selectable text
  • Method: quickConvertOnly(pdfFile, config?)
  • Tech: PDF.js text extraction only
  • Output: Basic markdown formatting

Scenario 2: High Accuracy Convert Only

  • Use case: PDFs with images, scanned documents, complex layouts
  • Method: highAccuracyConvertOnly(pdfFile, config?)
  • Tech: Tesseract.js OCR
  • Output: Markdown from OCR extraction

Scenario 3: Quick Convert + LLM

  • Use case: Fast extraction with AI enhancement for better formatting
  • Method: quickConvertWithLLM(pdfFile, config?)
  • Tech: PDF.js + WebLLM
  • Output: AI-enhanced markdown with improved structure and clarity

Scenario 4: High Accuracy + LLM

  • Use case: OCR extraction with AI enhancement
  • Method: highAccuracyConvertWithLLM(pdfFile, config?)
  • Tech: Tesseract.js OCR + WebLLM
  • Output: AI-enhanced markdown from OCR

Scenario 5: Combined + LLM (Recommended)

  • Use case: Most comprehensive conversion using both extraction methods
  • Method: combinedConvertWithLLM(pdfFile, config?)
  • Tech: PDF.js + Tesseract.js + WebLLM with specialized prompts
  • Output: Best possible markdown leveraging strengths of both extraction methods
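
The use cases above suggest a simple decision rule for choosing a method. The sketch below is one possible way to encode it. `pickScenarioMethod` and its three flags are hypothetical names introduced here for illustration, not part of the library, and the rules are an assumption drawn from the scenario descriptions.

```javascript
// Hypothetical helper: maps document traits to one of the five method names
// listed above. The decision rules are an assumption, not library behavior.
function pickScenarioMethod({ hasSelectableText, hasScannedContent, useLLM }) {
  if (!useLLM) {
    // Without the LLM step, only scenarios 1 and 2 apply.
    return hasScannedContent || !hasSelectableText
      ? "highAccuracyConvertOnly"
      : "quickConvertOnly";
  }
  if (hasSelectableText && hasScannedContent) {
    // Both extraction methods have something to contribute (scenario 5).
    return "combinedConvertWithLLM";
  }
  return hasSelectableText ? "quickConvertWithLLM" : "highAccuracyConvertWithLLM";
}

// Usage, assuming Extract2MDConverter, pdfFile and config are in scope:
//   const method = pickScenarioMethod({ hasSelectableText: true, hasScannedContent: true, useLLM: true });
//   const markdown = await Extract2MDConverter[method](pdfFile, config);
```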

⚙️ Configuration

Create a configuration object or JSON file to customize behavior:

const config = {
  // PDF.js Worker
  pdfJsWorkerSrc: "../pdf.worker.min.mjs",
  
  // Tesseract OCR Settings
  tesseract: {
    workerPath: "./tesseract-worker.min.js",
    corePath: "./tesseract-core.wasm.js", 
    langPath: "./lang-data/",
    language: "eng",
    options: {}
  },
  
  // LLM Configuration
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    // Optional: Custom model
    customModel: {
      model: "https://huggingface.co/mlc-ai/your-model/resolve/main/",
      model_id: "YourModel-ID",
      model_lib: "https://example.com/your-model.wasm",
      required_features: ["shader-f16"],
      overrides: { conv_template: "qwen" }
    },
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  
  // System Prompt Customizations
  systemPrompts: {
    singleExtraction: "Focus on preserving code examples exactly.",
    combinedExtraction: "Pay attention to tables and diagrams from OCR."
  },
  
  // Processing Options
  processing: {
    splitPascalCase: false,
    pdfRenderScale: 2.5,
    postProcessRules: [
      { find: /\bAPI\b/g, replace: "API" }
    ]
  },
  
  // Progress Tracking
  progressCallback: (progress) => {
    console.log(`${progress.stage}: ${progress.message}`);
    if (progress.currentPage) {
      console.log(`Page ${progress.currentPage}/${progress.totalPages}`);
    }
  }
};

// Use configuration
const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
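
The `postProcessRules` entry in the configuration above replaces `API` with itself, so it does not show a visible effect. To preview what a rule list would do to some text before passing it to the library, a small stand-alone helper can be useful. This is only a sketch: `previewPostProcessRules` is a hypothetical name, and it assumes rules are applied in array order with `String.prototype.replace`, which may not match how the library applies them.

```javascript
// Hypothetical preview helper. Assumes rules are applied in array order via
// String.prototype.replace; the library's actual behavior may differ.
function previewPostProcessRules(text, rules) {
  return rules.reduce((out, rule) => out.replace(rule.find, rule.replace), text);
}

const rules = [
  { find: /\bapi\b/gi, replace: "API" }, // normalize casing of "api"
  { find: /[ \t]+$/gm, replace: "" }     // trim trailing spaces on each line
];

const preview = previewPostProcessRules("The Api returns JSON.  \nCall the api twice.", rules);
console.log(preview);
```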

🔧 Advanced Usage

Using Individual Components

import { 
  WebLLMEngine, 
  OutputParser, 
  SystemPrompts,
  ConfigValidator 
} from 'extract2md';

// Validate configuration
const validatedConfig = ConfigValidator.validate(userConfig);

// Initialize WebLLM engine
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();

// Generate text
const result = await engine.generate("Your prompt here");

// Parse output
const parser = new OutputParser();
const cleanMarkdown = parser.parse(result);

Custom System Prompts

The library uses different system prompts for different scenarios:

// For scenarios 3 & 4 (single extraction)
const singlePrompt = SystemPrompts.getSingleExtractionPrompt(
  "Additional instruction: Preserve all technical terms."
);

// For scenario 5 (combined extraction) 
const combinedPrompt = SystemPrompts.getCombinedExtractionPrompt(
  "Focus on creating comprehensive documentation."
);

Configuration from JSON

import { ConfigValidator } from 'extract2md';

// Load from JSON string
const config = ConfigValidator.fromJSON(configJsonString);

// Use with any scenario
const result = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);
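
One thing to keep in mind with JSON-based configuration: JSON cannot represent functions or regular expressions, so fields such as `progressCallback` and the `find` patterns in `postProcessRules` do not survive a round trip through `JSON.stringify`. The sketch below demonstrates this with plain JavaScript and shows one way to re-attach those fields in code afterwards. How `ConfigValidator.fromJSON` itself treats such fields is not documented here, so treat the re-attachment step as an assumption.

```javascript
const fullConfig = {
  tesseract: { language: "eng" },
  processing: {
    pdfRenderScale: 2.5,
    postProcessRules: [{ find: /\bAPI\b/g, replace: "API" }]
  },
  progressCallback: (progress) => console.log(progress.stage)
};

// Simulate saving the config to a JSON file and loading it again.
const roundTripped = JSON.parse(JSON.stringify(fullConfig));

console.log("progressCallback" in roundTripped);               // false
console.log(roundTripped.processing.postProcessRules[0].find); // {}

// Re-attach the runtime-only fields after loading the JSON part.
const restored = {
  ...roundTripped,
  processing: {
    ...roundTripped.processing,
    postProcessRules: fullConfig.processing.postProcessRules
  },
  progressCallback: fullConfig.progressCallback
};
```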

🎯 Error Handling & Progress Tracking

const config = {
  progressCallback: (progress) => {
    switch (progress.stage) {
      case 'scenario_5_start':
        console.log('Starting combined conversion...');
        break;
      case 'webllm_load_progress':
        console.log(`Loading model: ${progress.progress}%`);
        break;
      case 'ocr_page_process':
        console.log(`OCR: ${progress.currentPage}/${progress.totalPages}`);
        break;
      case 'webllm_generate_start':
        console.log('AI enhancement in progress...');
        break;
      case 'scenario_5_complete':
        console.log('Conversion completed!');
        break;
      default:
        console.log(`${progress.stage}: ${progress.message}`);
    }
    
    if (progress.error) {
      console.error('Error:', progress.error);
    }
  }
};

try {
  const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
  console.log('Success:', result);
} catch (error) {
  console.error('Conversion failed:', error.message);
}
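
If the LLM step fails (for example because the model cannot be downloaded, or the browser lacks the WebGPU support that WebLLM relies on), an application may prefer degraded output over no output. The following is a minimal sketch of such a fallback. `convertWithFallback` is a hypothetical wrapper, not a library function, and the choice of `quickConvertOnly` as the fallback is an assumption.

```javascript
// Hypothetical wrapper: `converter` is any object exposing the scenario
// methods shown above (for example the Extract2MDConverter default export).
async function convertWithFallback(converter, pdfFile, config = {}) {
  try {
    const markdown = await converter.combinedConvertWithLLM(pdfFile, config);
    return { markdown, usedFallback: false };
  } catch (error) {
    // Assumption: plain text extraction is an acceptable degraded result.
    console.warn("LLM conversion failed, falling back:", error.message);
    const markdown = await converter.quickConvertOnly(pdfFile, config);
    return { markdown, usedFallback: true };
  }
}

// Usage, assuming Extract2MDConverter, pdfFile and config are in scope:
//   const { markdown, usedFallback } = await convertWithFallback(Extract2MDConverter, pdfFile, config);
```

Passing the converter in as an argument keeps the wrapper easy to test with a stub. If the fallback call also throws, the error propagates to the caller as usual.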

🔄 Migration from Legacy API

If you're using the old API, you can still access it:

import Extract2MDConverter, { LegacyExtract2MDConverter } from 'extract2md';

// Old way (variables renamed so both styles can live in one scope)
const converter = new LegacyExtract2MDConverter(options);
const legacyQuick = await converter.quickConvert(pdfFile);
const legacyOcr = await converter.highAccuracyConvert(pdfFile);
const legacyEnhanced = await converter.llmRewrite(text); // text: previously extracted text

// New way (recommended)
const quick = await Extract2MDConverter.quickConvertOnly(pdfFile, config);
const ocr = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile, config);
const enhanced = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config); // extracts and rewrites in one call

🌟 Features

  • 5 Scenario-Specific Methods: Choose the right approach for your use case
  • WebLLM Integration: Client-side AI enhancement with Qwen models
  • Custom Model Support: Use your own trained models
  • Advanced Output Parsing: Automatic removal of thinking tags and formatting
  • Comprehensive Configuration: Fine-tune every aspect of the conversion
  • Progress Tracking: Real-time updates for UI integration
  • TypeScript Support: Full type definitions included
  • Backwards Compatible: Legacy API still available
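
To illustrate what "removal of thinking tags" means in practice: some reasoning-capable models, including Qwen3, can emit their intermediate reasoning inside `<think>...</think>` blocks before the actual answer. The sketch below shows the general idea with a single regular expression. It is not the library's `OutputParser`, and `stripThinkingTags` is a hypothetical name used only for this example.

```javascript
// Illustrative only: not the library's OutputParser. Assumes the model wraps
// its reasoning in <think>...</think> blocks ahead of the markdown answer.
function stripThinkingTags(raw) {
  // Non-greedy match so several separate blocks are each removed.
  return raw.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}

const cleaned = stripThinkingTags("<think>Plan the headings.</think>\n# Title\n\nBody text.");
console.log(cleaned);
```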

📚 TypeScript Support

Full TypeScript definitions are included:

import Extract2MDConverter, { 
  Extract2MDConfig, 
  ProgressReport,
  CustomModelConfig 
} from 'extract2md';

const config: Extract2MDConfig = {
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  progressCallback: (progress: ProgressReport) => {
    console.log(progress.stage, progress.message);
  }
};

const result: string = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);

🏗️ Installation & Deployment

NPM Installation

npm install extract2md

CDN Usage

<script src="https://unpkg.com/extract2md@2.0.0/dist/assets/extract2md.umd.js"></script>
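<script>
    // Available as global Extract2MD
    // Classic scripts cannot use top-level await, so wrap the call in an async function
    (async () => {
        const result = await Extract2MD.Extract2MDConverter.quickConvertOnly(pdfFile);
    })();
</script>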

Worker Files Configuration

The package requires worker files for PDF.js and Tesseract.js. These are automatically copied during build:

// Default worker paths (adjust for your deployment)
const config = {
    pdfJsWorkerSrc: "/pdf.worker.min.mjs",
    tesseract: {
        workerPath: "/tesseract-worker.min.js",
        corePath: "/tesseract-core.wasm.js"
    }
};
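
When the worker files are served from a versioned asset directory or a CDN, it can help to derive all paths from one base URL instead of repeating it. The helper below is a sketch: `buildWorkerConfig` is a hypothetical name, the example host is a placeholder, and it assumes the library accepts absolute URLs for these settings.

```javascript
// Hypothetical helper: resolves the worker file names used above against a
// base URL where your deployment serves static assets.
function buildWorkerConfig(assetBaseUrl) {
  // A trailing slash makes URL resolution treat the base as a directory.
  const base = assetBaseUrl.endsWith("/") ? assetBaseUrl : `${assetBaseUrl}/`;
  const resolve = (file) => new URL(file, base).href;
  return {
    pdfJsWorkerSrc: resolve("pdf.worker.min.mjs"),
    tesseract: {
      workerPath: resolve("tesseract-worker.min.js"),
      corePath: resolve("tesseract-core.wasm.js"),
      langPath: resolve("lang-data/")
    }
  };
}

// Placeholder host, for illustration only.
const workerConfig = buildWorkerConfig("https://cdn.example.com/assets");
console.log(workerConfig.pdfJsWorkerSrc);
```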

Bundle Size Considerations

  • Total Size: ~11 MB (includes OCR and PDF processing)
  • PDF.js: ~950 KB
  • Tesseract.js: ~4.5 MB
  • WebLLM: Variable (model-dependent)

Use lazy loading and code splitting for production deployments.
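
One way to apply lazy loading is to defer the import until the user actually starts a conversion, and to reuse the same promise for later calls. The sketch below shows that pattern. `createLazyConverter` is a hypothetical helper, and the import function is passed in so the example does not depend on a particular bundler.

```javascript
// Hypothetical memoized loader: `importFn` is injected so the sketch does not
// depend on a bundler; in an app it would be () => import("extract2md").
function createLazyConverter(importFn) {
  let pending = null;
  return function loadConverter() {
    if (!pending) {
      pending = importFn()
        .then((mod) => mod.default ?? mod) // prefer the default export
        .catch((error) => {
          pending = null; // allow a retry after a failed load
          throw error;
        });
    }
    return pending;
  };
}

// Usage in an application (not executed here):
//   const loadConverter = createLazyConverter(() => import("extract2md"));
//   const Extract2MDConverter = await loadConverter();
//   const markdown = await Extract2MDConverter.quickConvertOnly(pdfFile);
```

With most bundlers, a dynamic `import()` like this also creates a separate chunk, so the library stays out of the initial bundle.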

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

Contributions welcome! Please read the contributing guidelines before submitting PRs.

🐛 Issues

Report issues on the GitHub Issues page.