@icjia/pdf-search-index

v1.4.0

Full-text PDF, DOCX, PPTX, XLSX search for static sites — Apache Solr for client-side apps, without Solr.

Apache Solr for client-side apps — without Solr. Build-time text extraction from PDF, DOCX, PPTX, XLSX that turns every document on your site into a searchable row and ships the index as a static JSON file. No JVM, no Tika service, no search server, no native deps — Node at build time, JSON at runtime.

Multi-format added in 1.1. PDF support is bundled (unpdf); DOCX/PPTX/XLSX are unlocked by installing the optional officeparser peer dep. The package emits a uniform IndexedDocument row with a format discriminator, so downstream search engines (Fuse.js, MiniSearch, FlexSearch, …) handle all four formats identically.

@icjia/pdf-search-index is the core library: it fetches documents, dispatches to the appropriate extractor (unpdf for PDF, officeparser for Office formats), and returns plain IndexedDocument[] rows. The output is fully framework-agnostic — first-party integrations ship for Astro 5 and Nuxt 4 (see Adapter packages below), and the same core works just as well from a prebuild script in Next.js, SvelteKit, Remix, Eleventy, Vite/Vue, or vanilla HTML.

Fuse.js is recommended but optional. The plain-JSON output drops into Fuse.js, MiniSearch, Orama, Lunr, FlexSearch, Pagefind, Typesense, MeiliSearch, or Algolia — your call. The /fuse and /snippet entry points are conveniences for Fuse callers, not gatekeepers; the core indexPdfs / extractPdfText / extractPdfsFromBody functions don't require fuse.js at all.

Why this replaces Solr for static / Jamstack apps: the typical Solr+Tika deployment is a JVM service, a schema, a managed index, and a network round-trip per query — enormous overhead when your corpus is the 50–500 PDFs your CMS or public/ folder already publishes. This package collapses Solr's build-time stage (the Tika part) into a pnpm build hook and lets the framework you already use serve the JSON result. Zero ops, zero servers, zero JVM tuning.

ESM only. Node 20 LTS / 22 LTS. MIT.

Install

npm install @icjia/pdf-search-index
# or
pnpm add @icjia/pdf-search-index
# or
yarn add @icjia/pdf-search-index

Optional peer dependency — fuse.js@^7 — only when you import the /fuse or /snippet subpaths. The package's peerDependencies range is "^7.0.0 || >=7.4.0-beta.0" so both stable Fuse 7.x and the Fuse 7.4 beta channel resolve. The v1.0.3 examples in this monorepo pin to 7.4.0-beta.6 to demo the newest beta surface; consumers who prefer stability can pin to a stable ~7.3.0 and the peer still resolves.

Security

Status as of v1.3.0 (last audited 2026-05-17): Every Critical and Important finding from the original audit against the core package is either remediated and verified in a shipped release, or has a documented active mitigation while the structural fix lands in a future release. Zero unaddressed exploitable issues against the documented usage envelope. Six independent audit passes confirm this: initial 1.0.1 + 1.0.3 delta + 1.0.5 verification (2026-05-16); 1.1.0 multi-format + 1.2.0 perf/security-extension + 1.3.0 search-engine-entries (2026-05-17). The v1.2 release closed I6 (maxUrls cap) and the inflate-bomb deferral; v1.3 ships two new search-engine adapter entries (/flexsearch, /pagefind) plus officeparser-as-direct-dep (supply-chain hardening). Remaining tracked items: C2 SSRF allowlist (v1.4), I2 cache-key normalization (v2.0), I5 CLI sitemap hardening (v1.4), full officeparser source vendoring (v1.4).

Remediation scorecard (core-relevant items)

| Severity | Found | Remediated & verified | Tracked for v1.1+ (mitigated) | Exploitable now |
| --- | --- | --- | --- | --- |
| Critical | 4 | 3 — C1, C3, C4 (shipped 1.0.2) | 1 — C2 SSRF (CI egress-filter mitigation) | 0 |
| Important | 6 | 5 — I1, I3, I4, I7, I8 (shipped 1.0.2) | 1 — I6 maxUrls cap (developer input) | 0 |
| Minor | 3 | 2 — M2, M3 (shipped 1.0.2) | 1 — M1 MIME validation (defense-in-depth) | 0 |

(C5 is an Astro-adapter finding; see the Astro package README. I2 + I5 are deferred at the monorepo level — see the top-level scorecard for the full picture.)

Per-finding remediation detail

| ID | What was found | What was specifically remediated | Verified by | Status |
| --- | --- | --- | --- | --- |
| C1 | ReDoS in extractPdfUrlsFromMarkdown — 130 KB pathological body stalled a build for 50 s. | Bounded greedy quantifiers {1,2048} URL / {0,1024} query; markdown bodies > 1 MB skipped with a warning before the scan. | test/security.test.ts "C1: ReDoS — handles a long hostile payload in under 200ms" (130 KB body now scans < 200 ms). | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| C3 | fetchPdfBytes buffered the entire response body before checking maxBytes — multi-GB PDF → OOM. | Content-Length checked first; if absent, body streamed via getReader() aborting once running total > maxBytes. Default lowered 100 MB → 32 MB. | test/security.test.ts "C3: aborts streaming download once running total exceeds maxBytes" + "default maxBytes is 32 MB". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| C4 | MCP cacheDir attacker-controlled — prompt-injected LLM could write outside the cache. | Every MCP tool's cacheDir routed through safeCacheDir(), jailed under <os.tmpdir>/pdf-search-index-mcp/. clearCache strict-allowlist-filters its deletion target to <16hex>.txt / <16hex>.meta.json. | test/security.test.ts "C4: safeCacheDir jail — rejects an absolute path outside the safe base" (4 tests) + "clearCache: allowlist only deletes cache-pattern filenames". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| I1 | Internal URLs (/admin/secret.pdf path components) leaked into CI failure logs. | New scrubUrl(url) export drops path/query/fragment. All console.warn paths that include a URL route through it. Full URL gated behind debug: true. | test/security.test.ts "I1 / M3: scrubUrl ... returns origin only for a normal URL" + "omits the path from failure logs by default". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| I3 | Compression-bomb PDFs decompressing to hundreds of MB of text. | New maxExtractedTextChars ExtractOptions field (default 5,000,000). Truncates above the cap and logs a warning. Raise it if a real PDF in your corpus has more text. | test/security.test.ts "I3: maxExtractedTextChars cap — truncates extracted text above the cap and logs a warning". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| I4 | PDF text containing literal </script> broke out of <script type="application/json"> islands. | New top-level safeJSONForHTML(obj, indent?) export. Escapes <, <!--, U+2028, U+2029. Used by the CLI --out writer; available as a public export for consumers that inline rows themselves. | test/security.test.ts "I4: safeJSONForHTML — escapes <so cannot break out of a <script> embedding" + "escapes U+2028 and U+2029 line separators". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| I7 | Cache write TOCTOU + non-atomic write — parallel builds could see partial files; corruption silent. | Both files written to .tmp.<pid>.<rand> then renamed atomically. Sidecar carries a contentSha (SHA-256 of text). readCache verifies; mismatch → cache miss. | test/security.test.ts "I7: cache writes are atomic and content-hashed" (4 tests including "never returns a corrupt (mismatched-hash) hit under concurrent writes"). | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| I8 | pdf.js PasswordException and other parse errors logged verbatim — leaked encrypted-PDF state. | Parse errors categorized into 'encrypted PDF' / 'corrupt PDF structure' / 'PDF font error' / 'PDF parse error'. Full message gated behind debug: true. | test/security.test.ts "I8: categorized parse-error logging — categorizes xref/structure errors as 'corrupt PDF structure'". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| M2 | Cache files world-readable on POSIX. | writeCache writes files with mode 0o600; cache dir created with mode 0o700. POSIX-only; no-op on Windows. | test/security.test.ts "I7: cache writes ... writes both files with mode 0o600" (M2 pinned under the I7 describe). | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| M3 | ASCII control chars in URLs / error messages survived into terminal as escape sequences. | Control chars (\x00–\x1f, \x7f) replaced with ? before strings reach console.warn. Applies to both URL and error-message paths. | test/security.test.ts "I1 / M3: scrubUrl drops path/query and strips control chars". | ✅ Fixed in 1.0.2; verified at 1.0.5 |
| C2 | SSRF — indexPdfs will fetch any URL incl. http://169.254.169.254/ (AWS metadata). | Active mitigation: outbound network policy at the CI level. Structural fix tracked for v1.1: allowPrivateHosts: boolean opt-in flag — deferred to avoid breaking intranet-PDF consumers. | n/a — deferral intentional; CI egress-filter mitigation documented. | ⚠️ Mitigated; v1.1 allowlist tracked |
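
The C3 row describes aborting a streamed download once the running total passes maxBytes. A minimal sketch of that idea follows. It is not the package's fetchPdfBytes source: `ChunkReader` and `readWithCap` are hypothetical names, and the reader interface only mirrors the shape of what `getReader()` returns so the example runs without a network.

```typescript
// Hypothetical sketch of a capped streaming read; not the package's implementation.
// ChunkReader mirrors the read() shape of a ReadableStream reader.
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
}

async function readWithCap(reader: ChunkReader, maxBytes: number): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (!value) continue;
    total += value.byteLength;
    // Stop as soon as the running total passes the cap, before buffering more.
    if (total > maxBytes) {
      throw new Error(`response exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  // Concatenate the accepted chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

The point of checking inside the loop is that memory use stays bounded by the cap plus one chunk, regardless of how large the remote file claims to be.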

For the complete picture across the monorepo (including the Astro-adapter finding C5, the I2 / I5 deferred items, and the v1.0.5 verification pass report), read the top-level README's Security section and the Security considerations & audit history. 115 tests pass at v1.0.5; every ✅ row above is pinned by a named test in test/security.test.ts.
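
To illustrate the I4 finding, here is a rough sketch of HTML-safe JSON serialization for a `<script type="application/json">` island. `escapeJsonForScript` is a hypothetical stand-in written for this example, not the source of safeJSONForHTML; it assumes that escaping every `<` is enough to neutralize both `</script>` and `<!--`.

```typescript
// Hypothetical helper illustrating the escaping idea; the real safeJSONForHTML may differ.
function escapeJsonForScript(value: unknown, indent?: number): string {
  // JSON.stringify returns undefined for inputs such as undefined; fall back to 'null'.
  const json = JSON.stringify(value, null, indent) ?? 'null';
  return json
    .replace(/</g, '\\u003c') // covers </script> and <!--
    .replace(/\u2028/g, '\\u2028') // line separator
    .replace(/\u2029/g, '\\u2029'); // paragraph separator
}

const hostileRow = { title: 'Report', text: 'see </script><script>alert(1)</script>' };
const safeJson = escapeJsonForScript(hostileRow);
// safeJson contains no literal '<', yet JSON.parse(safeJson) restores the original text.
```

Because the replacements are valid JSON escape sequences, the consumer can still call `JSON.parse` on the island's text content without any extra decoding step.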

The 30-second integration

import Fuse from 'fuse.js';
import { indexPdfs } from '@icjia/pdf-search-index';

// Extract text from each PDF; one row per URL
const pdfRows = await indexPdfs([
  'https://example.com/annual-report-2024.pdf',
  'https://example.com/faqs.pdf',
]);

// yourPageRows: the rows you already build for your site's pages
const allRows = [...yourPageRows, ...pdfRows];
const fuse = new Fuse(allRows, { keys: ['title', 'text'], includeMatches: true });

That's it. Each row is { id, url, title, text, pages?, extractedAt? }. Failed extractions return rows with text: '' (the build doesn't fail unless you pass --strict to the CLI).
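
Since a failed extraction comes back as a row with `text: ''` rather than an error, a build script may want to separate those rows before indexing. The sketch below assumes only the row shape stated above; `Row` and `partitionRows` are hypothetical names, not package exports.

```typescript
// Local copy of the documented row shape, so the sketch runs on its own.
interface Row {
  id: string;
  url: string;
  title: string;
  text: string;
  pages?: number;
  extractedAt?: string;
}

// Split rows into those with extracted text and those whose extraction failed.
function partitionRows(rows: Row[]): { indexed: Row[]; failed: Row[] } {
  const indexed: Row[] = [];
  const failed: Row[] = [];
  for (const row of rows) {
    (row.text === '' ? failed : indexed).push(row);
  }
  return { indexed, failed };
}

const sampleRows: Row[] = [
  { id: 'pdf-1', url: 'https://example.com/a.pdf', title: 'A', text: 'hello' },
  { id: 'pdf-2', url: 'https://example.com/b.pdf', title: 'B', text: '' },
];
const { indexed, failed } = partitionRows(sampleRows);
console.log(`${indexed.length} indexed, ${failed.length} failed`);
```

Whether to drop the failed rows or keep them searchable by title is a judgment call; keeping them means a document still shows up in results even when its body text could not be read.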

For highlighted snippets in results:

import { snippetHTMLFor } from '@icjia/pdf-search-index/snippet';

// `fuse` is a Fuse instance created with includeMatches: true
for (const r of fuse.search('stigma')) {
  console.log(r.item.title, snippetHTMLFor(r));
  // → "Stigma PDF For Posting" "…recovery from substance use disorder is hampered by <mark>stigma</mark>…"
}

snippetHTMLFor accepts { contextChars?, matchKey?, collapseWhitespace?, maxSnippets?, separator? }. The maxSnippets option (added 1.0.3, default 1 for backward compatibility) renders up to N non-overlapping highlighted spans per result, joined by separator (default ' … '). Useful for surfacing several passages from a long PDF that matches the query in multiple regions.
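
For intuition about what a highlighted ±N-character snippet involves, here is a simplified single-match sketch. It is not snippetHTMLFor: `markSnippet` and `escapeHtml` are hypothetical helpers, and the only borrowed convention is that Fuse.js reports match indices as inclusive `[start, end]` pairs.

```typescript
// Escape text so PDF content cannot inject markup into the results page.
function escapeHtml(s: string): string {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// start and end are inclusive character offsets of the match within text.
function markSnippet(text: string, start: number, end: number, contextChars = 40): string {
  const from = Math.max(0, start - contextChars);
  const to = Math.min(text.length, end + 1 + contextChars);
  const before = escapeHtml(text.slice(from, start));
  const hit = escapeHtml(text.slice(start, end + 1));
  const after = escapeHtml(text.slice(end + 1, to));
  // Add an ellipsis on each side that was actually truncated.
  return (from > 0 ? '…' : '') + before + '<mark>' + hit + '</mark>' + after + (to < text.length ? '…' : '');
}

const passage = 'recovery is hampered by stigma in many settings';
const matchStart = passage.indexOf('stigma');
console.log(markSnippet(passage, matchStart, matchStart + 5, 10));
```

A multi-region version would repeat this per match range, skip ranges that overlap an earlier window, and join the pieces with the separator.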

Core API

| Function | What it does |
| --- | --- |
| extractPdfText(url, options?) | Fetch one PDF, return its text. Lowest-level entry point. |
| indexPdfs(urls, options?) | Batch-index an array of URLs / { url, title?, id? } entries. Dedupes by URL. Concurrency 4 by default. |
| extractPdfsFromBody(markdown, opts?) | Scan a markdown body for PDF URLs and index each. |
| extractPdfUrlsFromMarkdown(markdown) | URL-discovery without extraction — handy for debugging "why is my index empty?". |
| safeJSONForHTML(obj, indent?) | HTML-safe JSON.stringify — escapes <, -->, U+2028/2029 for <script> embedding. |
| scrubUrl(url) | Drop path/query/fragment, return protocol://host only (used internally for failure logs). |
| createFuseIndex({ urls, fuseOptions }) | Convenience wrapper: index + build a Fuse instance in one call (from /fuse subpath). |
| snippetHTMLFor(fuseResult, options?) | <mark>-highlighted ±N-char snippet from a Fuse match (from /snippet subpath). Supports maxSnippets for multi-region rendering. |

Full option tables, semantics, and edge-case behavior are in the top-level README's Core API section.
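
As a rough illustration of what scrubbing a URL for logs means, the sketch below reduces a URL to its origin and replaces ASCII control characters. `originOnly` is a hypothetical name and the fallback string is an assumption; the package's scrubUrl may behave differently on edge cases.

```typescript
// Hypothetical log-scrubbing helper; not the package's scrubUrl source.
function originOnly(rawUrl: string): string {
  // Replace ASCII control characters so they cannot reach the terminal as escape sequences.
  const cleaned = rawUrl.replace(/[\x00-\x1f\x7f]/g, '?');
  try {
    // origin is scheme + host (+ non-default port); path, query, and fragment are dropped.
    return new URL(cleaned).origin;
  } catch {
    // Assumption: unparseable input is replaced wholesale rather than echoed back.
    return '[invalid URL]';
  }
}

console.log(originOnly('https://example.com/admin/secret.pdf?token=abc#page=2'));
```

Logging only the origin keeps CI output useful for spotting which host failed while leaving internal paths and signed-URL tokens out of build logs.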

Entry points

| Subpath | Purpose | Peer dep needed |
| --- | --- | --- |
| . | Core extraction API (indexPdfs, extractPdfText, extractPdfsFromBody, safeJSONForHTML, scrubUrl) | none |
| /fuse | createFuseIndex — convenience wrapper | fuse.js@^7 |
| /snippet | snippetHTMLFor: <mark>-highlighted snippets | fuse.js@^7 |
| /mcp | MCP server entry — invoked via pdf-search-index-mcp bin (not direct import) | none |
| bin: pdf-search-index | CLI for one-shot indexing, sitemap scan, search, cache management | none |
| bin: pdf-search-index-mcp | MCP server bin for LLM workflows | none |

The IndexedPdf row shape

interface IndexedPdf {
  id: string; // 'pdf-' + first 12 hex chars of SHA-256(url) — stable across rebuilds
  url: string;
  title: string; // markdown link text > pdf.js info-dict Title > humanized filename
  text: string; // empty string on extraction failure (not an error)
  pages?: number;
  extractedAt?: string; // ISO timestamp; OMITTED on cache hits so JSON is byte-stable
}

pages and extractedAt are optional. extractedAt is omitted on cache hits so the JSON stays byte-stable across rebuilds — diffs stay reviewable and CDN caching works.
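
The id comment above describes a content-independent, URL-derived identifier. A minimal sketch of that derivation, assuming the URL is hashed as a UTF-8 string with no normalization (the package may normalize first); `stableId` is a hypothetical name:

```typescript
import { createHash } from 'node:crypto';

// 'pdf-' + first 12 hex characters of SHA-256(url), as described in the interface comment.
// Assumption: the raw URL string is hashed as-is.
function stableId(url: string): string {
  const hex = createHash('sha256').update(url, 'utf8').digest('hex');
  return 'pdf-' + hex.slice(0, 12);
}

console.log(stableId('https://example.com/annual-report-2024.pdf'));
```

Deriving the id from the URL alone means it does not change when the PDF's contents are updated, so bookmarks or analytics keyed on the id survive a rebuild.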

Options summary

indexPdfs accepts IndexPdfsOptions = ExtractOptions & { concurrency? }:

| Option | Type | Default | What it's for |
| --- | --- | --- | --- |
| cacheDir | string | '.pdf-cache' | Where extracted text is cached on disk |
| fetch | typeof fetch | global fetch | The escape hatch — auth headers, file:// URLs, signed URLs |
| fetchTimeout | number (ms) | 30000 | Abort the fetch after this many ms |
| maxBytes | number | 32 * 1024 * 1024 (32 MB) | Reject PDFs larger than this. Lowered from 100 MB in 1.0.2 |
| maxExtractedTextChars | number | 5_000_000 (5 million characters) | Truncate extracted text above this length (compression-bomb defense) |
| concurrency | number | 4 | Parallel fetches via p-limit |
| cache | 'use' \| 'bypass' \| 'refresh' | 'use' | bypass skips read+write; refresh overwrites; use is read-through |
| mergePages | boolean | true | When false, extractPdfText returns one entry per page |
| debug | boolean | false | When true, failure logs include full URLs and underlying error messages |
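
To show what the concurrency option controls, here is a small self-contained sketch of URL deduplication followed by a bounded worker pool. The package itself uses p-limit per the table; `mapWithConcurrency` and `dedupe` are hypothetical helpers that only illustrate the general technique.

```typescript
// Remove duplicate URLs while preserving first-seen order.
function dedupe(urls: string[]): string[] {
  return Array.from(new Set(urls));
}

// Run worker over items with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function run(): Promise<void> {
    // Each runner claims the next unprocessed index until none remain.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}

// Usage: at most 4 simulated extractions run at once.
mapWithConcurrency(dedupe(['a.pdf', 'b.pdf', 'a.pdf']), 4, async (url) => url.length).then(
  (lengths) => console.log(lengths),
);
```

A low limit is gentler on the origin server and on build-machine memory; a higher one shortens builds when the PDFs live on a CDN that tolerates parallel requests.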

CLI quick reference

# One-shot: index URLs to JSON on stdout
npx @icjia/pdf-search-index https://...pdf https://...pdf

# From a file (one URL per line, # comments allowed)
npx @icjia/pdf-search-index --from urls.txt

# From a sitemap (scans pages for PDF links, indexes them)
npx @icjia/pdf-search-index --from-sitemap https://example.com/sitemap.xml

# Write to a file instead of stdout
npx @icjia/pdf-search-index --out public/searchIndex.json https://...pdf

# Force re-extraction
npx @icjia/pdf-search-index --refresh https://...pdf
npx @icjia/pdf-search-index --refresh-all https://...pdf

# Sanity check / search / cache management
npx @icjia/pdf-search-index verify https://...pdf
npx @icjia/pdf-search-index search index.json "drug testing"
npx @icjia/pdf-search-index cache ls
npx @icjia/pdf-search-index cache rm <url>
npx @icjia/pdf-search-index cache clear

Exit code is 0 even when individual PDFs fail (the index stays valid; failed rows have text: ''). Pass --strict to flip to exit 1 for CI where a broken upload pipeline should fail the build.
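
The exit-code rule can be stated as a small function. This is a sketch of the documented behavior, not the CLI's source; `exitCodeFor` is a hypothetical name and it treats any row with empty text as a failed extraction.

```typescript
// Default: always 0. With strict: 1 if any row failed extraction (text === '').
function exitCodeFor(rows: { text: string }[], strict: boolean): number {
  const anyFailed = rows.some((row) => row.text === '');
  return strict && anyFailed ? 1 : 0;
}

const cliRows = [{ text: 'extracted text' }, { text: '' }];
console.log(exitCodeFor(cliRows, false), exitCodeFor(cliRows, true));
```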

Full CLI option table and output formats: README CLI section.

MCP server

For LLM workflows where the model needs to search inside PDFs during a conversation:

npx -p @icjia/pdf-search-index@latest pdf-search-index-mcp

Always use @latest when wiring into Claude Desktop / Cursor / any MCP-aware client so the client picks up security patches and bug fixes automatically. Sample config:

{
  "servers": {
    "pdf-search": {
      "command": "npx",
      "args": ["-p", "@icjia/pdf-search-index@latest", "pdf-search-index-mcp"]
    }
  }
}

Tools: extract_pdf, index_pdfs, get_pdf_index, search_pdfs, clear_cache, get_status. Since v1.0.2, every tool's cacheDir argument is jailed under <os.tmpdir>/pdf-search-index-mcp/ — LLM-supplied paths can't escape the safe base. Full MCP details: README MCP section.
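
For readers curious how a directory jail of this kind can work, here is a minimal sketch. It is not the package's safeCacheDir: `jailCacheDir` and `SAFE_BASE` are hypothetical names, the check is purely lexical, and it does not resolve symlinks.

```typescript
import * as os from 'node:os';
import * as path from 'node:path';

// Safe base directory, matching the location named in the text above.
const SAFE_BASE = path.join(os.tmpdir(), 'pdf-search-index-mcp');

// Resolve a caller-supplied directory under SAFE_BASE and reject anything that escapes it.
function jailCacheDir(requested?: string): string {
  const resolved = path.resolve(SAFE_BASE, requested ?? '.');
  const rel = path.relative(SAFE_BASE, resolved);
  const escapes = rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error('cacheDir must stay inside the safe base');
  }
  return resolved;
}

console.log(jailCacheDir('project-a'));
```

Comparing the relative path rather than using a plain string prefix check avoids accepting sibling directories such as `pdf-search-index-mcp-evil`.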

Where to learn more

The top-level README is the source of truth. Key sections:

  • Where your PDFs can live — static /public/, external CMS (Strapi v3/v4/v5, Sanity, Contentful, Drupal), external CDN (S3, R2), local file://. Includes Strapi quirks: relative URLs, token-gated media, structured media relations.
  • Using a search engine other than Fuse.js — recipes for MiniSearch, Orama, Lunr, FlexSearch, Pagefind, Typesense, MeiliSearch, Algolia.
  • Security considerations — trust model, v1.0.2 hardening details, embedding the index into HTML.
  • Examples — seven runnable example sites covering every integration pattern.

Adapter packages

| Adapter | For |
| --- | --- |
| @icjia/astro-pdf-search-index | Astro 5 — emits public/<endpoint>.json from configured content collections at build time |
| @icjia/nuxt-pdf-search-index | Nuxt 4 — server helpers for mixed CMS + @nuxt/content sites |

For frameworks without an adapter (Vite, Next.js, Eleventy, SvelteKit, Remix, plain Node, etc.), use this package directly with a prebuild script — see the AGENTS.md "Path A" recipe.

Versioning

Currently at 1.0.3 (documentation + ecosystem release, additive snippetHTMLFor maxSnippets option, default DEFAULT_FUSE_OPTIONS.threshold lowered to 0.2, fuse.js dev/example pin moved to 7.4.0-beta.6, second adversarial red/blue team audit pass on the v1.0.3 deltas on 2026-05-16 — see CHANGELOG.md and the top-level audit history). All three packages in this monorepo move in lockstep.

License

MIT