simple-pdf

v0.1.14

A simple PDF parser based on PDF.js

Readme

⚠️ On the state of this package ⚠️

I currently don't have the time to actively maintain this repo. That being said, I still believe this library is useful to anyone looking to work with PDFs in Node, as it simplifies a lot of what's cumbersome about PDFs. However, I can't recommend using it in production while I cannot actively maintain it. I do greatly welcome PRs and issues and will try to respond ASAP.

If your company would benefit from using this package, please consider sponsoring me and the package's continued development and feel free to reach out with any questions regarding maintenance.

simple-pdf

simple-pdf aims to be a simple drop-in module for extracting text and images from PDF files. It exposes a promise-based and an event-based API.

Features

  • Extracts both text and images
  • Handles most image encodings

Reasons not to use this library

Let's be real. This might not be the library for you. Here are a few reasons why.

  • Slow with images - Images can be embedded in a PDF in many different ways. To ensure that all types of images can be extracted, we render the whole PDF and then use sharp to extract the images from the rendered page. This adds extra processing time for pages that contain images (unless you disable image extraction).
  • New to the game - This library is brand new and hasn't been battle tested yet. If you're looking for a reliable solution, this library might not be the best choice for you.
  • No automated testing - Though I'm working on this 🙃

Examples

Minimal example:

const fs = require('fs')
const { SimplePDFParser } = require('simple-pdf')

const fileBuffer = fs.readFileSync('somefile.pdf')

const parser = new SimplePDFParser(fileBuffer)

parser.parse().then((result) => {
  console.log(result)
})

More examples can be found in the examples directory and can be run with the following commands:

npm run example:events
npm run example:promises

Installation

npm i simple-pdf

Docs

The only exposed interface is the SimplePDFParser class. It takes a Buffer containing a PDF file as well as an optional options object.

new SimplePDFParser(fileBuffer, {
  // options
})

Options

|Option|Value type|Default value|Description|
|-|-|-|-|
|paragraphThreshold|integer|25|The minimum distance between two lines on the y-axis to consider them part of separate paragraphs. This option only affects the parse method.|
|lineThreshold|integer|1|The maximum distance between two text elements on the y-axis for them to be considered part of the same line. PDFs usually suffer from issues with floating point numbers. This value is used to give a little room for error. You shouldn't have to change this value unless you're dealing with PDFs generated with OCR or other odd PDFs.|
|imageScale|integer|2|Scaling applied to the PDF before extracting images. A higher value results in greater image resolution, but quadratically increases rendering times.|
|extractImages|boolean|true|Controls whether or not to extract images. Image extraction requires rendering of each page, which might take a long time depending on the size of the PDF, the configured imageScale and the underlying hardware. If you don't need to extract images, setting this option to false is recommended.|
|ignoreEmptyText|boolean|true|Controls whether or not to ignore empty text elements. Text elements are considered empty if their text content contains nothing but whitespace.|
|joinParagraphs|boolean|false|Controls whether or not to join paragraphs. Enabling this option will join each line that's not separated by a non-text element (paragraph break or image), which will effectively make each line contain a paragraph. Paragraph breaks will be omitted from the final output. This option only affects the parse method.|
|imageOutputFormat|string|png|Controls what format the image is exported as. Defaults to 'png'. Passed directly to Sharp: https://sharp.pixelplumbing.com/api-output#toformat|
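
To make the two threshold options more concrete, here is a minimal sketch of how y-axis thresholds of this kind can be applied. It is not simple-pdf's internal code. The functions groupIntoLines and splitIntoParagraphs and the sample elements are hypothetical, and the sketch assumes that elements whose y-coordinates differ by no more than lineThreshold share a line, while lines at least paragraphThreshold apart start a new paragraph.

```javascript
// Illustrative only: NOT simple-pdf's internal implementation.
// Groups text elements into lines using a y-axis tolerance.
function groupIntoLines (elements, lineThreshold = 1) {
  // Sort by y so that elements on the same line end up adjacent.
  const sorted = [...elements].sort((a, b) => a.y - b.y || a.x - b.x)
  const lines = []
  for (const el of sorted) {
    const last = lines[lines.length - 1]
    if (last && Math.abs(el.y - last.y) <= lineThreshold) {
      last.elements.push(el)
    } else {
      lines.push({ y: el.y, elements: [el] })
    }
  }
  // Within each line, order elements from left to right.
  for (const line of lines) line.elements.sort((a, b) => a.x - b.x)
  return lines
}

// Starts a new paragraph whenever two consecutive lines are far apart.
function splitIntoParagraphs (lines, paragraphThreshold = 25) {
  const paragraphs = []
  let current = []
  let previous = null
  for (const line of lines) {
    if (previous && Math.abs(line.y - previous.y) >= paragraphThreshold) {
      paragraphs.push(current)
      current = []
    }
    current.push(line)
    previous = line
  }
  if (current.length > 0) paragraphs.push(current)
  return paragraphs
}

// Hypothetical sample data; note the tiny floating point drift on 'world'.
const elements = [
  { x: 50, y: 100.0004, text: 'world' },
  { x: 10, y: 100, text: 'Hello' },
  { x: 10, y: 112, text: 'Second line' },
  { x: 10, y: 150, text: 'New paragraph' }
]

const paragraphs = splitIntoParagraphs(groupIntoLines(elements))
const paragraphTexts = paragraphs.map((paragraph) =>
  paragraph.map((line) => line.elements.map((el) => el.text).join(' ')))
console.log(paragraphTexts)
```

With the default values, the drift of 0.0004 is absorbed by the line tolerance, and only the 38-unit gap before the last element is large enough to count as a paragraph break.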

Basic parsing

This is probably the easiest way to use this library. It parses all pages in parallel and returns the result when finished. Paragraphs and lines are automatically joined based on the options passed to the constructor.

Example:

const parser = new SimplePDFParser(fileBuffer)

const result = await parser.parse()

Result:

[
  {
    "type": "text",
    "pageIndex": 0,
    "items": [
      {
        "text": "Lorem ipsum",
        "font": "g_d0_f1"
      }
    ]
  },
  {
    "type": "image",
    "pageIndex": 0,
    "imageBuffer": Buffer
  }
]
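
A result of this shape can be post-processed with plain JavaScript. The sketch below collects the text of each page and counts the images. The helpers textByPage and countImages are hypothetical and not part of simple-pdf, the sample data stands in for a real parse result, and joining the items of an element without a separator is an assumption.

```javascript
// Sample data in the documented result shape. The empty Buffer is a
// stand-in for real image data.
const result = [
  { type: 'text', pageIndex: 0, items: [{ text: 'Lorem ipsum', font: 'g_d0_f1' }] },
  { type: 'image', pageIndex: 0, imageBuffer: Buffer.from([]) },
  {
    type: 'text',
    pageIndex: 1,
    items: [
      { text: 'dolor ', font: 'g_d0_f1' },
      { text: 'sit amet', font: 'g_d0_f2' }
    ]
  }
]

// Hypothetical helper: maps each page index to an array of its text lines.
// Elements of any type other than 'text' are skipped.
function textByPage (elements) {
  const pages = new Map()
  for (const element of elements) {
    if (element.type !== 'text') continue
    // Assumption: the items of one element are concatenated as-is.
    const line = element.items.map((item) => item.text).join('')
    const lines = pages.get(element.pageIndex) || []
    lines.push(line)
    pages.set(element.pageIndex, lines)
  }
  return pages
}

// Hypothetical helper: counts the image elements in a result.
function countImages (elements) {
  return elements.filter((element) => element.type === 'image').length
}

const pages = textByPage(result)
console.log(pages.get(0), pages.get(1), countImages(result))
```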

Advanced parsing

If you need more granular control of the resulting data structure, you might want to use advanced parsing. You can either just await the result or use the events to process each page as soon as it has finished parsing. Note that pages are not guaranteed to be returned in order.

Example:

const parser = new SimplePDFParser(fileBuffer)

// Called with each page
parser.on('page', (page) => {
  console.log(`Page ${page.index}:`)
  console.log('Text elements: ', page.textElements)
  console.log('Image elements:', page.imageElements)
})

// Called when the parsing is finished
parser.on('done', () => {
  console.log('Parser done')
})

// This must be run even if you just use the events API, but then you may ignore the return value
const result = await parser.parseRaw()

Result (each page):

{
  index: 0, // Page index
  textElements: [{
    x: 123.456,
    y: 654.321,
    items: [{
      text: 'Lorem ipsum',
      font: 'g_d0_f1'
    }]
  }],
  imageElements: [{
    x: 4.2,
    y: 83.11,
    width: 120,
    height: 80,
    imageBuffer: Buffer
  }]
}
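
Because pages can arrive out of order, it may help to collect them first and sort by index before further processing. The sketch below shows one way to do that with page objects of the shape above. The helpers onPage, orderPages and pageText are hypothetical, and the sample pages are made up. In real use, onPage would be registered with parser.on('page', onPage).

```javascript
// Collects pages as they arrive, in whatever order that happens to be.
const received = []
function onPage (page) {
  received.push(page)
}

// Simulate three pages arriving out of order (made-up sample data).
onPage({ index: 2, textElements: [], imageElements: [] })
onPage({
  index: 0,
  textElements: [
    { x: 10, y: 20, items: [{ text: 'Lorem ', font: 'g_d0_f1' }, { text: 'ipsum', font: 'g_d0_f1' }] },
    { x: 10, y: 40, items: [{ text: 'dolor sit amet', font: 'g_d0_f1' }] }
  ],
  imageElements: []
})
onPage({ index: 1, textElements: [], imageElements: [] })

// Hypothetical helper: returns a new array sorted by page index.
function orderPages (pages) {
  return [...pages].sort((a, b) => a.index - b.index)
}

// Hypothetical helper: joins the text of a page, one text element per row.
// Elements are kept in the order they appear in textElements.
function pageText (page) {
  return page.textElements
    .map((element) => element.items.map((item) => item.text).join(''))
    .join('\n')
}

const ordered = orderPages(received)
console.log(ordered.map((page) => page.index))
console.log(pageText(ordered[0]))
```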

Roadmap

More of a todo, but let's call it a roadmap

  • [ ] Tests
    • [ ] Better coverage
    • [ ] Windows - Something is wrong either with the library or the tests (https://github.com/scriptcoded/simple-pdf/runs/1048499489)
  • [ ] Make a logo (everyone likes a logo)
  • [ ] Rewrite codebase in TypeScript
  • [ ] Improve image extraction
  • [ ] Set up automatic CI/CD pipeline for NPM deployment
  • [ ] Simplify the API

Tests

Tests can be run with the following commands:

npm run build
npm run test

Contributing

Contributions and PRs are very welcome! PRs should target the develop branch.

We use the Airbnb style guide. Please run ESLint before committing any changes:

npx eslint src
npx eslint src --fix

License

This project is licensed under the MIT license.