@chnaaam/ppdf
v0.1.3
A TypeScript PDF object extraction library inspired by pdfplumber.
ppdf
ppdf is a TypeScript PDF extraction library inspired by pdfplumber.
It is built for Node.js and focuses on:
- page access
- character-level extraction
- text, word, and search helpers
- lines, rects, curves, images, and annotations
- top-left coordinate handling like pdfplumber
- bbox-based page filtering
Install
yarn add @chnaaam/ppdf
Quick Start
import { PPDF } from "@chnaaam/ppdf";
const pdf = await PPDF.open("./sample.pdf");
const page = await pdf.getPage(1);
const text = await page.extractText();
const chars = await page.getChars();
const words = await page.extractWords();
const links = await page.getHyperlinks();
console.log({
  pageCount: pdf.pageCount,
  text,
  firstChar: chars[0],
  firstWord: words[0],
  links,
});
await pdf.close();
Open A PDF
You can open a PDF from a file path, Uint8Array, or ArrayBuffer.
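import { readFile } from "node:fs/promises";
import { PPDF } from "@chnaaam/ppdf";
// Example inputs: the same file as raw bytes and as a separate ArrayBuffer copy.
const bytes = new Uint8Array(await readFile("./document.pdf"));
const arrayBuffer = bytes.slice().buffer;
const fromPath = await PPDF.open("./document.pdf");
const fromBytes = await PPDF.open(bytes);
const fromBuffer = await PPDF.open(arrayBuffer);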
await fromPath.close();
await fromBytes.close();
await fromBuffer.close();
Optional open options:
const pdf = await PPDF.open("./protected.pdf", {
  password: "secret",
  stopAtErrors: false,
});
Read Pages
const pdf = await PPDF.open("./document.pdf");
console.log(pdf.pageCount);
const page1 = await pdf.getPage(1);
const pages = await pdf.getPages();
console.log(page1.width, page1.height);
console.log(pages.length);
await pdf.close();
Extract Text
Plain text
const text = await page.extractText();
console.log(text);
Characters
const chars = await page.getChars();
for (const char of chars.slice(0, 5)) {
  console.log(char.text, char.x0, char.top, char.x1, char.bottom);
}
Each char includes:
- text
- fontname
- size
- matrix
- x0, top, x1, bottom
- width, height
- page_number, doctop
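The positional fields listed above are enough to rebuild simple layout structure. As a minimal sketch (not ppdf's own algorithm), the helper below groups chars into visual lines by comparing their top values. The CharBox interface, the groupIntoLines and lineText helpers, and the default tolerance of 3 are all assumptions made for illustration; only the field names come from the list above.

```typescript
// Hypothetical minimal shape: only the fields this sketch needs, named after
// the char fields listed in the README. The real ppdf char type may differ.
interface CharBox {
  text: string;
  x0: number;
  top: number;
  x1: number;
  bottom: number;
}

// Group chars into visual lines. A char joins the current line when its `top`
// is within `tolerance` of the first char of that line; otherwise it starts
// a new line. The tolerance value is an illustrative assumption.
function groupIntoLines(chars: CharBox[], tolerance = 3): CharBox[][] {
  const sorted = [...chars].sort((a, b) => a.top - b.top || a.x0 - b.x0);
  const lines: CharBox[][] = [];
  for (const ch of sorted) {
    const current = lines[lines.length - 1];
    if (current !== undefined && Math.abs(ch.top - current[0].top) <= tolerance) {
      current.push(ch);
    } else {
      lines.push([ch]);
    }
  }
  // Restore left-to-right reading order within each line.
  for (const line of lines) {
    line.sort((a, b) => a.x0 - b.x0);
  }
  return lines;
}

// Concatenate the text of one line, without inserting spaces.
function lineText(line: CharBox[]): string {
  return line.map((c) => c.text).join("");
}
```

In practice the input would be the array returned by page.getChars(), assuming its elements expose these fields as plain numbers and strings.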
Words
const words = await page.extractWords();
for (const word of words.slice(0, 5)) {
  console.log(word.text, word.x0, word.top, word.x1, word.bottom);
}
Search
Literal search:
const matches = await page.search("invoice", { regex: false });
Regex search:
const matches = await page.search(/total:\s+\$?\d+(?:\.\d+)?/i);
Extract Shapes And Other Objects
const lines = await page.getLines();
const rects = await page.getRects();
const curves = await page.getCurves();
const images = await page.getImages();
const annotations = await page.getAnnotations();
const hyperlinks = await page.getHyperlinks();
You can also collect everything in one call:
const objects = await page.getObjects();
Or across the whole document:
const allObjects = await pdf.getObjects();
Crop And Filter By Bounding Box
Bounding boxes use pdfplumber-style top-left coordinates:
type BBox = [x0: number, top: number, x1: number, bottom: number];
Crop to a region
const region = page.crop([50, 100, 300, 220]);
const regionChars = await region.getChars();
Keep only objects fully within a region
const inner = page.withinBBox([50, 100, 300, 220]);
Exclude a region
const outer = page.outsideBBox([50, 100, 300, 220]);
Filter with a predicate
const boldishChars = page.filter(
  (obj) => obj.object_type === "char" && "fontname" in obj && obj.fontname.includes("Bold"),
);
Coordinate System
ppdf normalizes coordinates to a top-left origin, matching pdfplumber.
That means:
- x0/x1 grow from left to right
- top/bottom grow from top to bottom
- doctop is the top offset in document space across pages
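To make the convention concrete, here is a minimal sketch of how a box in native PDF space (origin at the bottom-left, y growing upward) could be mapped to top-left coordinates, together with a containment test against a [x0, top, x1, bottom] bounding box. The PdfBox and TopLeftBox interfaces and the toTopLeft and isFullyWithin helpers are hypothetical names, not part of ppdf. The doctop calculation follows pdfplumber's definition (top plus the heights of all preceding pages); that ppdf computes it the same way is an assumption.

```typescript
// Native PDF space: origin at the bottom-left corner, y grows upward.
interface PdfBox {
  x0: number;
  y0: number; // lower edge
  x1: number;
  y1: number; // upper edge
}

// Top-left space as described in the README: `top` and `bottom` grow downward.
interface TopLeftBox {
  x0: number;
  top: number;
  x1: number;
  bottom: number;
  doctop: number;
}

type BBox = [x0: number, top: number, x1: number, bottom: number];

// Flip the y axis against the page height. `precedingHeights` holds the
// heights of all earlier pages, so `doctop` is measured from the top of the
// first page (pdfplumber's definition, assumed here for ppdf).
function toTopLeft(
  box: PdfBox,
  pageHeight: number,
  precedingHeights: number[] = [],
): TopLeftBox {
  const top = pageHeight - box.y1;
  const bottom = pageHeight - box.y0;
  const offset = precedingHeights.reduce((sum, h) => sum + h, 0);
  return { x0: box.x0, top, x1: box.x1, bottom, doctop: top + offset };
}

// True when the object lies entirely inside the bounding box.
function isFullyWithin(
  obj: { x0: number; top: number; x1: number; bottom: number },
  [x0, top, x1, bottom]: BBox,
): boolean {
  return obj.x0 >= x0 && obj.top >= top && obj.x1 <= x1 && obj.bottom <= bottom;
}
```

A containment test of this kind is one plausible reading of "fully within a region"; how ppdf treats objects that only touch or partially overlap the box edge is not specified here.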
End-To-End Example
import { PPDF } from "@chnaaam/ppdf";
const pdf = await PPDF.open("./report.pdf");
for (const page of await pdf.getPages()) {
  const words = await page.extractWords();
  const links = await page.getHyperlinks();
  console.log(`page ${page.pageNumber}`);
  console.log(`words: ${words.length}`);
  console.log(`links: ${links.length}`);
}
await pdf.close();
Local Development
Install dependencies:
yarn
Type-check:
yarn run check
Build:
yarn build
Run tests:
yarn test
Run the character-accuracy comparison test:
yarn vitest run test/char-accuracy.test.ts
Render character bounding boxes over page images:
node --import tsx ./test/render-char-bboxes.ts ./reference_pdf/ref_pdf1.pdf ./tmp/ref1-compare 2 compare
Notes
- ppdf is currently aimed at machine-generated PDFs.
- Character geometry is designed to be close to pdfplumber, but full feature parity is not finished yet.
- Some PDFs may still differ in font fallback behavior or CID text decoding because ppdf uses PDF.js internally.
