pdf-metadata-extractor

v1.1.0

Extract text, fonts, colors, and layout metadata from PDF files. Supports file paths, URLs, and Buffers. Includes line/word grouping helpers for structured text output.

Downloads

109

pdf-metadata-extractor

Extract text elements, fonts, colors, images, and vector graphics metadata from PDF files. Supports file paths, URLs, and Buffers. Includes text-grouping helpers to reconstruct lines and words from raw PDF glyph streams.

Requirements

  • Node.js ≥ 18 (uses native fetch and modern zlib)
  • pnpm (recommended) or npm

Installation

pnpm add pdf-metadata-extractor
# or, with npm
npm install pdf-metadata-extractor

Quick start

import { extractPDF } from "pdf-metadata-extractor";

const result = await extractPDF("./document.pdf");
console.log(result.totalPages);            // number of pages
console.log(result.fonts);                 // font list with real names, family, style, weight
console.log(result.pages[0].textElements); // raw TextElement[] for page 1
console.log(result.pages[0].rectElements); // colored rectangles / shapes
console.log(result.pages[0].imageElements); // embedded images with position + metadata

Input can be a file path, a URL (https), or a Buffer:

import fs from "node:fs";

await extractPDF("./local.pdf");                    // file path
await extractPDF("https://example.com/file.pdf");   // https URL
await extractPDF(fs.readFileSync("./local.pdf"));   // Buffer
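The README does not say how the three input kinds are told apart. The sketch below is one plausible way to do it, not the package's actual logic: `classifyInput` and `InputKind` are hypothetical names, and `Uint8Array` stands in for `Buffer` (which extends it) so the snippet needs no Node typings.

```typescript
// Hypothetical helper, not part of pdf-metadata-extractor.
type InputKind = "url" | "path" | "buffer";

function classifyInput(input: string | Uint8Array): InputKind {
  // Anything that is not a string is treated as raw PDF bytes.
  if (typeof input !== "string") return "buffer";
  // The README only mentions https URLs, so only that scheme is matched here.
  if (/^https:\/\//i.test(input)) return "url";
  // Every other string is assumed to be a file system path.
  return "path";
}

console.log(classifyInput("./local.pdf"));
console.log(classifyInput("https://example.com/file.pdf"));
console.log(classifyInput(new Uint8Array([0x25, 0x50, 0x44, 0x46])));
```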

Text grouping

Raw TextElement[] contains individual glyphs or characters. Use the grouping helpers to reconstruct human-readable text:

Lines + words in one call

import { extractPDF, extractTextStructure } from "pdf-metadata-extractor";

const result = await extractPDF("./document.pdf");
for (const page of result.pages) {
  const lines = extractTextStructure(page.textElements);
  for (const line of lines) {
    console.log(line.text);           // full line string
    for (const word of line.words) {
      console.log(word.text, word.x, word.y, word.fontSize, word.fontFamily);
    }
  }
}

Words only (flat list)

import { extractWords } from "pdf-metadata-extractor";

// `page` is one entry of result.pages returned by extractPDF()
const words = extractWords(page.textElements);
// returns TextWord[] in reading order (top-to-bottom, left-to-right)

Step by step

import { groupIntoLines, groupIntoWords } from "pdf-metadata-extractor";

// `page` is one entry of result.pages returned by extractPDF()
const lines = groupIntoLines(page.textElements);        // TextLine[]
const words = groupIntoWords(lines[0].elements);        // TextWord[] for the first line

Working with graphics

Colored rectangles and paths

for (const rect of page.rectElements) {
  console.log(rect.x, rect.y, rect.width, rect.height);
  console.log(rect.fillColor);   // RGB | null  e.g. { r: 244, g: 233, b: 215 }
  console.log(rect.strokeColor); // RGB | null
}

Images

for (const img of page.imageElements) {
  console.log(img.name);                          // pdfjs internal XObject name
  console.log(img.x, img.y, img.width, img.height); // display bounding box (pts)
  console.log(img.imageWidth, img.imageHeight);   // source pixel dimensions
  console.log(img.colorSpace, img.filter);        // e.g. "ICCBased", "DCTDecode"
}

Graphic summary

const { imageCount, vectorCount } = page.graphicSummary;
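Because every page carries its own summary, document-wide totals can be derived with a single reduce. This is a minimal sketch over locally declared types that mirror the documented shapes; `PageLike` and `totalGraphics` are illustrative names, not package exports.

```typescript
// Local mirrors of the documented shapes, so the sketch runs on its own.
interface GraphicSummary {
  vectorCount: number;
  imageCount: number;
}
interface PageLike {
  pageNumber: number;
  graphicSummary: GraphicSummary;
}

// Hypothetical helper: sums the per-page summaries into one document total.
function totalGraphics(pages: PageLike[]): GraphicSummary {
  return pages.reduce<GraphicSummary>(
    (acc, page) => ({
      vectorCount: acc.vectorCount + page.graphicSummary.vectorCount,
      imageCount: acc.imageCount + page.graphicSummary.imageCount,
    }),
    { vectorCount: 0, imageCount: 0 },
  );
}

const totals = totalGraphics([
  { pageNumber: 1, graphicSummary: { vectorCount: 22, imageCount: 1 } },
  { pageNumber: 2, graphicSummary: { vectorCount: 3, imageCount: 0 } },
]);
console.log(totals); // { vectorCount: 25, imageCount: 1 }
```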

API

extractPDF(input, options?)

| Parameter | Type | Description |
|-----------|------|-------------|
| input | string \| Buffer | File path, https URL, or raw Buffer |
| options.loadExif | boolean | (reserved, not yet active) |

Returns Promise<PDFResult>.


extractTextStructure(elements, lineTolerance?, gapFactor?)

Groups raw TextElement[] into lines with words nested inside.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| elements | TextElement[] | — | Raw elements from page.textElements |
| lineTolerance | number | 2 | Max Y-delta (pts) to treat two elements as the same line |
| gapFactor | number | 0.4 | Word-gap threshold as a fraction of fontSize |

Returns TextLineWithWords[].


extractWords(elements, lineTolerance?, gapFactor?)

Convenience wrapper: calls groupIntoLines, then flat-maps groupIntoWords over the resulting lines.

Returns TextWord[] in reading order.


groupIntoLines(elements, tolerance?)

Buckets elements by Y coordinate (within tolerance pts), then sorts the lines top-to-bottom and the elements within each line left-to-right.

Returns TextLine[].
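The bucketing idea can be sketched as follows. This is an illustration of the described behavior, not the package's source: `bucketIntoLines`, `Positioned`, and `Line` are hypothetical names, and the sketch assumes a top-left origin so that a smaller Y means higher on the page.

```typescript
// Minimal local shapes; the real TextElement has many more fields.
interface Positioned {
  text: string;
  x: number;
  y: number;
}
interface Line {
  y: number; // Y of the first element placed in the bucket
  text: string;
  elements: Positioned[];
}

// Hypothetical sketch of Y-bucketing with a tolerance in points.
function bucketIntoLines(elements: Positioned[], tolerance = 2): Line[] {
  const lines: Line[] = [];
  for (const el of elements) {
    // Reuse an existing bucket if its Y is close enough, else open a new one.
    const line = lines.find((l) => Math.abs(l.y - el.y) <= tolerance);
    if (line) line.elements.push(el);
    else lines.push({ y: el.y, text: "", elements: [el] });
  }
  // Assumption: top-left origin, so ascending Y is top-to-bottom.
  lines.sort((a, b) => a.y - b.y);
  for (const line of lines) {
    line.elements.sort((a, b) => a.x - b.x); // left-to-right
    line.text = line.elements.map((e) => e.text).join("");
  }
  return lines;
}

const demo = bucketIntoLines([
  { text: "b", x: 20, y: 10.5 },
  { text: "a", x: 10, y: 10 },
  { text: "c", x: 10, y: 30 },
]);
console.log(demo.map((l) => l.text)); // [ 'ab', 'c' ]
```

If the package uses a bottom-left origin for text instead, the line sort would need to be reversed.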


groupIntoWords(elements, gapFactor?)

Split a single line's elements into words by detecting:

  • Explicit whitespace elements (word boundary unless letter-spacing heuristic applies)
  • Large X gaps (gap > fontSize × gapFactor)

Letter-spacing heuristic: a space element sandwiched between two single-character elements is treated as decorative letter-spacing (e.g. Canva-generated PDFs) and merged into the current word rather than creating a word boundary.

Returns TextWord[].
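The two boundary rules and the letter-spacing heuristic can be sketched roughly like this. It is a simplified reading of the description above, not the package's implementation: `splitIntoWords` and `Glyph` are hypothetical names, the function returns plain strings rather than TextWord objects, and dropping the decorative space from the word text is an assumption.

```typescript
// Minimal local shape; one entry per raw text element on a single line.
interface Glyph {
  text: string;
  x: number;
  width: number;
  fontSize: number;
}

// Hypothetical sketch: expects elements already sorted left-to-right.
function splitIntoWords(line: Glyph[], gapFactor = 0.4): string[] {
  const words: string[] = [];
  let current = "";
  let prev: Glyph | null = null;

  const flush = () => {
    if (current) words.push(current);
    current = "";
  };

  for (let i = 0; i < line.length; i++) {
    const el = line[i];
    if (el.text.trim() === "") {
      // Whitespace element: boundary unless it sits between two single characters.
      const before = line[i - 1];
      const after = line[i + 1];
      const decorative =
        before !== undefined &&
        after !== undefined &&
        before.text.trim().length === 1 &&
        after.text.trim().length === 1;
      if (!decorative) flush();
      prev = el; // measure the next gap from the end of the space
      continue;
    }
    // Large horizontal gap relative to the font size starts a new word.
    if (prev && el.x - (prev.x + prev.width) > el.fontSize * gapFactor) flush();
    current += el.text;
    prev = el;
  }
  flush();
  return words;
}

const g = (text: string, x: number, width: number): Glyph => ({ text, x, width, fontSize: 10 });
console.log(splitIntoWords([g("Hello", 0, 30), g("World", 40, 30)])); // [ 'Hello', 'World' ]
console.log(splitIntoWords([g("A", 0, 5), g(" ", 5, 2), g("B", 7, 5)])); // [ 'AB' ]
```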


getBoundingBox(elements)

Returns the tight BoundingBox (x, y, width, height) that encloses all elements, or null if the list is empty.
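A tight box is the min/max envelope of the element rectangles. The following is a small sketch of that computation under the documented contract (null for an empty list); `tightBox` and `Box` are illustrative names, not the package's own code.

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical sketch: smallest box that encloses every element rectangle.
function tightBox(elements: Box[]): Box | null {
  if (elements.length === 0) return null;
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const el of elements) {
    minX = Math.min(minX, el.x);
    minY = Math.min(minY, el.y);
    maxX = Math.max(maxX, el.x + el.width);
    maxY = Math.max(maxY, el.y + el.height);
  }
  return { x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}

console.log(
  tightBox([
    { x: 10, y: 20, width: 30, height: 10 },
    { x: 50, y: 25, width: 20, height: 10 },
  ]),
); // { x: 10, y: 20, width: 60, height: 15 }
```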


filterByRegion(elements, box)

Returns only the elements whose top-left point falls inside the given BoundingBox.
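The top-left test amounts to a point-in-rectangle check on each element's (x, y). Below is a minimal sketch with hypothetical names (`inRegion`, `Box`); treating the box edges as inclusive is an assumption, since the README does not specify edge behavior.

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical sketch: keep elements whose (x, y) lies within the box.
function inRegion<T extends { x: number; y: number }>(elements: T[], box: Box): T[] {
  return elements.filter(
    (el) =>
      el.x >= box.x &&
      el.x <= box.x + box.width && // assumption: edges are inclusive
      el.y >= box.y &&
      el.y <= box.y + box.height,
  );
}

const header: Box = { x: 0, y: 0, width: 600, height: 100 };
const kept = inRegion(
  [
    { text: "Title", x: 72, y: 40 },
    { text: "Body", x: 72, y: 200 },
  ],
  header,
);
console.log(kept.map((e) => e.text)); // [ 'Title' ]
```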


Color utilities

import { rgbFromArray, rgbToHex, BLACK, WHITE } from "pdf-metadata-extractor";

rgbFromArray([0.2, 0.4, 0.6]);   // { r: 51, g: 102, b: 153 }
rgbToHex({ r: 255, g: 0, b: 0 }); // "#ff0000"
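For readers who want to see the arithmetic behind those two results, here is a standalone sketch that reproduces them. `toRgb` and `toHex` are hypothetical stand-ins written for illustration; clamping and rounding are assumptions, and the package's own functions may handle out-of-range values differently.

```typescript
interface RGB {
  r: number;
  g: number;
  b: number;
}

// Hypothetical stand-in for rgbFromArray: maps 0..1 components to 0..255.
function toRgb([r, g, b]: [number, number, number]): RGB {
  const to255 = (v: number) => Math.round(Math.min(1, Math.max(0, v)) * 255);
  return { r: to255(r), g: to255(g), b: to255(b) };
}

// Hypothetical stand-in for rgbToHex: two lowercase hex digits per channel.
function toHex({ r, g, b }: RGB): string {
  const hex = (v: number) => v.toString(16).padStart(2, "0");
  return `#${hex(r)}${hex(g)}${hex(b)}`;
}

console.log(toRgb([0.2, 0.4, 0.6])); // { r: 51, g: 102, b: 153 }
console.log(toHex({ r: 255, g: 0, b: 0 })); // #ff0000
```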

Matrix utilities

import { getFontSizeFromMatrix, getXFromMatrix, getYFromMatrix } from "pdf-metadata-extractor";
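The README does not document the signatures of these helpers. As background only, the sketch below shows how position and font size are commonly read from a six-number PDF transform matrix `[a, b, c, d, e, f]`; the function names are hypothetical and the package's helpers may take different arguments or use a different formula.

```typescript
// PDF transform matrix in [a, b, c, d, e, f] order.
type Matrix = [number, number, number, number, number, number];

// Translation components give the origin of the text run.
const translationX = (m: Matrix): number => m[4];
const translationY = (m: Matrix): number => m[5];

// Length of the (c, d) vector: the vertical scale, which equals the font
// size for unrotated text and stays correct when the text is rotated.
const verticalScale = (m: Matrix): number => Math.hypot(m[2], m[3]);

const m: Matrix = [12, 0, 0, 12, 72, 740];
console.log(translationX(m), translationY(m), verticalScale(m)); // 72 740 12
```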

Types

PDFResult

interface PDFResult {
  file?: string;            // basename of the source file (if path was given)
  totalPages: number;
  source: string;           // detected creator app ("Word", "Canva", "Inkscape", …)
  isPrintPDF: boolean;      // true if produced by a print driver
  info: Record<string, unknown>;  // raw PDF metadata (Title, Author, Creator, …)
  fonts: FontInfo[];        // deduplicated font list for the whole document
  pages: PageResult[];
}

PageResult

interface PageResult {
  pageNumber: number;       // 1-based
  width: number;            // pts
  height: number;           // pts
  pageType: "text" | "image" | "hybrid" | "vector" | "unknown";
  elements: PageElement[];  // all elements combined (text + rect + path + image)
  textElements: TextElement[];
  imageElements: ImageElement[];
  rectElements: RectElement[];
  pathElements: PathElement[];
  xobjectElements: XObjectElement[];
  graphicSummary: GraphicSummary;
}
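Since each element interface carries a literal `type` tag, the combined `elements` array can be processed by switching or indexing on that tag. A minimal sketch, using a local `ElementLike` type in place of the package's PageElement and a hypothetical `countByType` helper:

```typescript
// The four tags documented for the combined elements array.
type ElementKind = "text" | "rect" | "path" | "image";
interface ElementLike {
  type: ElementKind;
}

// Hypothetical helper: tallies elements per type tag.
function countByType(elements: ElementLike[]): Record<ElementKind, number> {
  const counts: Record<ElementKind, number> = { text: 0, rect: 0, path: 0, image: 0 };
  for (const el of elements) counts[el.type] += 1;
  return counts;
}

console.log(
  countByType([{ type: "text" }, { type: "text" }, { type: "rect" }, { type: "image" }]),
); // { text: 2, rect: 1, path: 0, image: 1 }
```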

TextElement

interface TextElement {
  type: "text";
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
  fontSize: number;
  fontFamily: string | null;
  fontStyle: string | null;      // "italic" | "normal" | null
  fontWeight: number | null;     // 400 | 700 | null
  fontRealName: string | null;   // e.g. "OpenSans-Regular"
  fontSubtype: string | null;    // "Type1" | "TrueType" | "CIDFontType2" | …
  isSubsetFont: boolean | null;
  color: RGB;
}

RectElement

interface RectElement {
  type: "rect";
  x: number;
  y: number;
  width: number;
  height: number;
  fillColor: RGB | null;
  strokeColor: RGB | null;
  strokeWidth: number | null;
}

PathElement

interface PathElement {
  type: "path";
  x: number;
  y: number;
  width: number;
  height: number;
  fillColor: RGB | null;
  strokeColor: RGB | null;
  strokeWidth: number | null;
}

ImageElement

interface ImageElement {
  type: "image";
  name: string;              // XObject resource name from the PDF
  x: number;                 // display position (pts, top-left origin)
  y: number;
  width: number;             // display size (pts)
  height: number;
  imageWidth: number | undefined;      // source pixel width
  imageHeight: number | undefined;     // source pixel height
  colorSpace: string | null | undefined;  // e.g. "ICCBased", "DeviceRGB"
  bitsPerComponent: number | undefined;
  filter: string | null | undefined;  // e.g. "DCTDecode", "FlateDecode"
  imageMask: boolean | undefined;
}

TextWord

interface TextWord {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
  fontSize: number;
  fontRealName: string | null;
  fontFamily: string | null;
  fontStyle: string | null;
  fontWeight: number | null;
  color: RGB;
  elements: TextElement[];   // constituent raw elements
}

TextLine / TextLineWithWords

interface TextLine {
  y: number;            // representative Y coordinate of the line
  text: string;         // full line text (joined elements)
  elements: TextElement[];
}

interface TextLineWithWords extends TextLine {
  words: TextWord[];
}

FontInfo

interface FontInfo {
  key: string;           // PDF resource key ("F4", "f-0-0", …)
  realName: string | null;     // "OpenSans-Regular"
  baseFontRaw: string | null;  // raw /BaseFont value (may include subset prefix)
  isSubset: boolean;     // true if baseFontRaw starts with "XXXXXX+"
  subtype: string | null;      // "TrueType" | "Type1" | "CIDFontType2" | …
  encoding: string | null;
  fontFamily: string | null;
  fontStyle: string | null;
  fontWeight: number | null;
  italicAngle: number | null;
}
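The "XXXXXX+" prefix is the subset tag that PDF producers prepend to /BaseFont for subsetted fonts: six uppercase letters followed by a plus sign. A minimal sketch of detecting and stripping it, with a hypothetical helper name; whether the package derives realName exactly this way is an assumption.

```typescript
// Hypothetical helper: detects a six-letter subset tag and strips it.
function splitSubsetPrefix(baseFontRaw: string): { isSubset: boolean; name: string } {
  const match = /^[A-Z]{6}\+(.+)$/.exec(baseFontRaw);
  return match
    ? { isSubset: true, name: match[1] }
    : { isSubset: false, name: baseFontRaw };
}

console.log(splitSubsetPrefix("ABCDEF+OpenSans-Regular")); // { isSubset: true, name: 'OpenSans-Regular' }
console.log(splitSubsetPrefix("Helvetica")); // { isSubset: false, name: 'Helvetica' }
```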

GraphicSummary

interface GraphicSummary {
  vectorCount: number;   // total rect + path elements on the page
  imageCount: number;    // total image elements on the page
}

JSON output example

{
  "file": "document.pdf",
  "totalPages": 1,
  "source": "Canva",
  "isPrintPDF": false,
  "fonts": [
    {
      "key": "F4",
      "realName": "OpenSans-Regular",
      "isSubset": true,
      "subtype": "TrueType",
      "fontFamily": "OpenSans",
      "fontStyle": "normal",
      "fontWeight": 400
    }
  ],
  "pages": [
    {
      "pageNumber": 1,
      "width": 595.28,
      "height": 841.89,
      "pageType": "hybrid",
      "graphicSummary": { "vectorCount": 22, "imageCount": 1 },
      "lines": [
        {
          "y": 740,
          "text": "Hello World",
          "words": [
            {
              "text": "Hello",
              "x": 72, "y": 740,
              "width": 42.5, "height": 14,
              "fontSize": 14,
              "fontFamily": "OpenSans",
              "fontStyle": "normal",
              "fontWeight": 400,
              "color": { "r": 0, "g": 0, "b": 0 }
            }
          ]
        }
      ]
    }
  ]
}

Development

# Install dependencies
pnpm install

# Build
pnpm run build

# Run tests
pnpm run test

# Run example (requires Node 18+)
pnpm run example

License

MIT