@qicaixin/quick-pdf-search
v1.0.2
MCP server for fast PDF parsing and BM25 search via MinerU.
MCP (Model Context Protocol) server for fast PDF parsing and search. It uses the MinerU API to parse PDFs, stores the parsed output locally, and provides BM25 search over layout.json using wink-nlp + wink-bm25-text-search.
Features
- MCP stdio server with parse_pdf and search_pdf tools
- MinerU API integration (upload, poll, download, extract)
- Cache by file-content hash + meta validation to avoid duplicate API calls
- Search uses layout.json only (no fallback to content_list)
Requirements
- Node.js 18+ (tested on Node 22)
- MinerU API token
Install
npm install
Use via npx (after publishing)
npx @qicaixin/quick-pdf-search
Use via npm install
Global install
npm install -g @qicaixin/quick-pdf-search
quick-pdf-search
Local dependency
npm install @qicaixin/quick-pdf-search
npx @qicaixin/quick-pdf-search
Install directly from GitHub
npm install github:labveritas/quick-pdf-search
With branch or tag:
npm install github:labveritas/quick-pdf-search#main
Run the MCP server (stdio)
npm start
Environment variables
- MINERU_TOKEN (required for parsing)
- QPS_OUTPUT_ROOT (optional) output root directory
  Default: ~/.cache/quick-pdf-search-mcp
- MINERU_REQUEST_TIMEOUT_MS (optional) per-request timeout for MinerU API calls
  Default: 30000
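As a rough illustration of how these variables and their defaults could be resolved at startup, here is a minimal sketch. The function name resolveConfig, its homeDir parameter, and the returned field names are hypothetical and not taken from the package source; only the variable names and default values come from the list above. Falling back to the default on an invalid timeout value is also an assumption.

```javascript
// Hypothetical helper: resolve settings from environment variables,
// applying the documented defaults. Not the package's actual code.
function resolveConfig(env, homeDir) {
  // Assumption: an unset, empty, or non-positive timeout falls back to 30000 ms.
  const parsed = Number(env.MINERU_REQUEST_TIMEOUT_MS);
  const requestTimeoutMs = Number.isFinite(parsed) && parsed > 0 ? parsed : 30000;

  return {
    // May be undefined; the token is required for parsing.
    token: env.MINERU_TOKEN,
    // QPS_OUTPUT_ROOT falls back to ~/.cache/quick-pdf-search-mcp.
    outputRoot: env.QPS_OUTPUT_ROOT || `${homeDir}/.cache/quick-pdf-search-mcp`,
    requestTimeoutMs,
  };
}
```

In a real process this would typically be called with process.env and the result of os.homedir().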
Tools
parse_pdf
Parse a local PDF via MinerU and cache results on disk.
Input:
- file_path (string, required)
- data_id (string, optional) override cache key
- output_root (string, optional) override output root dir
- model_version (string, optional) MinerU model version (default: pipeline)
- base_url (string, optional) MinerU API base URL
- timeout (number, optional) total wait timeout in seconds
- poll_interval (number, optional) polling interval in seconds
- keep_zip (boolean, optional) keep the downloaded zip
- force (boolean, optional) re-parse even if cached
- token (string, optional) MinerU token (falls back to MINERU_TOKEN)
Output:
- file_path
- data_id (derived from file content hash by default)
- output_dir
- layout_path
- content_blocks (para_blocks total)
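The timeout and poll_interval inputs suggest a poll-until-done loop bounded by a total time budget. The sketch below shows that general pattern only. pollUntilDone and the shape of the check callback are hypothetical, and no MinerU endpoint or response format is assumed here.

```javascript
// Generic polling helper (illustrative; not the package's MinerU client).
// `check` is a caller-supplied async function returning { done, value }.
async function pollUntilDone(check, timeoutSec, pollIntervalSec) {
  const deadline = Date.now() + timeoutSec * 1000;
  for (;;) {
    const { done, value } = await check();
    if (done) return value;
    // Give up if the next wait would exceed the total time budget.
    if (Date.now() + pollIntervalSec * 1000 > deadline) {
      throw new Error(`Timed out after ${timeoutSec}s waiting for parse result`);
    }
    await new Promise((resolve) => setTimeout(resolve, pollIntervalSec * 1000));
  }
}
```

Passing the status check in as a callback keeps the timing logic separate from whatever HTTP calls the real client makes.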
search_pdf
Search parsed PDFs with BM25.
Input:
- query (string, required)
- output_dir (string, optional) directory containing layout.json (or a direct layout.json path)
- top_k (number, optional, 1-50)
Output:
- results array with text, snippet, page_idx, block_index, etc.
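For readers unfamiliar with BM25, the following is a compact plain-JavaScript sketch of the ranking idea. It is not the wink-nlp or wink-bm25-text-search API that the package uses. The ASCII-only tokenizer, the k1 = 1.2 and b = 0.75 parameters (common textbook defaults), and the function names are all assumptions made for illustration.

```javascript
// Illustrative BM25 ranking over text blocks (plain JS stand-in, not wink's API).
function tokenize(text) {
  // Assumption: lowercase ASCII word tokens are good enough for a demo.
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

function bm25Search(blocks, query, topK, k1 = 1.2, b = 0.75) {
  const docs = blocks.map(tokenize);
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / (docs.length || 1);

  // Document frequency: number of blocks containing each term.
  const df = new Map();
  for (const d of docs) {
    for (const term of new Set(d)) df.set(term, (df.get(term) || 0) + 1);
  }

  const queryTerms = [...new Set(tokenize(query))];
  const scored = docs.map((d, blockIndex) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = d.filter((t) => t === term).length;
      if (tf === 0) continue;
      const n = df.get(term);
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // Term frequency saturates via k1; b normalizes by block length.
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * d.length) / avgLen));
    }
    return { block_index: blockIndex, score, text: blocks[blockIndex] };
  });

  // Clamp topK to the documented 1-50 range and drop non-matching blocks.
  const limit = Math.min(50, Math.max(1, topK));
  return scored
    .filter((r) => r.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

Calling bm25Search(blocks, 'test plan', 3) returns at most three matching blocks, highest score first.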
list_cached_pdfs
List cached parsed PDFs under the output root.
Input:
- output_root (string, optional) override output root dir
Output:
- items array with data_id, file_path, output_dir, layout_path, content_blocks, etc.
get_page_content
Return content for a specific page index.
Input:
- page_idx (number, required, 0-based)
- output_dir (string, optional) directory containing layout.json (or a direct layout.json path)
Output:
- combined_text (string)
- blocks array with block_index, type, text, bbox
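To make the output shape concrete, here is a minimal sketch of how a page's blocks could be collected into combined_text and blocks. It assumes the blocks were already flattened from layout.json into records shaped like { page_idx, type, text, bbox }, that block_index is the position in that flattened list, and that texts are joined with newlines. All three are assumptions, and getPageContent is a hypothetical name.

```javascript
// Illustrative only: assumes blocks were already flattened from layout.json into
// records shaped like { page_idx, type, text, bbox }; the real layout may differ.
function getPageContent(allBlocks, pageIdx) {
  const blocks = allBlocks
    // Assumption: block_index is the block's position in the flattened list.
    .map((block, blockIndex) => ({ ...block, block_index: blockIndex }))
    .filter((block) => block.page_idx === pageIdx)
    .map(({ block_index, type, text, bbox }) => ({ block_index, type, text, bbox }));

  // Joining with a newline is an assumption about how combined_text is built.
  const combined_text = blocks
    .map((block) => block.text)
    .filter(Boolean)
    .join('\n');
  return { combined_text, blocks };
}
```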
Cache behavior
- data_id is derived from file content SHA256 by default.
- output/<data_id>/meta.json stores hash, size, mtimeMs.
- If layout.json exists and meta.json matches the current file hash+size, parsing is skipped.
- Use force: true to re-parse.
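The rules above can be sketched roughly as follows. contentHash and shouldSkipParse are hypothetical names and the argument shape is an assumption; only the SHA256 digest, the hash and size comparison, and the force override come from the description. The dynamic import is used so the snippet works from both ESM and CommonJS.

```javascript
// SHA-256 hex digest of the file bytes; by default this also serves as data_id.
async function contentHash(bytes) {
  // Dynamic import keeps the snippet usable from both ESM and CommonJS.
  const { createHash } = await import('node:crypto');
  return createHash('sha256').update(bytes).digest('hex');
}

// Hypothetical decision helper mirroring the cache rules described above.
// `meta` is the parsed meta.json (or null); `bytes` is the PDF as a Buffer.
async function shouldSkipParse({ layoutExists, meta, bytes, force }) {
  if (force || !layoutExists || !meta) return false;
  const hash = await contentHash(bytes);
  return meta.hash === hash && meta.size === bytes.length;
}
```

Because the key is the content hash rather than the path, a renamed or moved copy of the same PDF would still hit the cache under this scheme.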
Quick test (stdio client)
export MINERU_TOKEN=your_token
node --input-type=module - <<'JS'
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
import { CallToolResultSchema } from '@modelcontextprotocol/sdk/types.js';
const transport = new StdioClientTransport({
  command: 'node',
  args: ['src/index.js'],
  cwd: process.cwd(),
  env: { ...process.env },
  stderr: 'inherit'
});
const client = new Client({ name: 'test-client', version: '0.0.1' });
await client.connect(transport);
const parse = await client.request({
  method: 'tools/call',
  params: { name: 'parse_pdf', arguments: { file_path: '/absolute/path/to/file.pdf' } }
}, CallToolResultSchema, { timeout: 600000 });
console.log(parse.content[0].text);
const search = await client.request({
  method: 'tools/call',
  params: { name: 'search_pdf', arguments: { query: 'testplan' } }
}, CallToolResultSchema);
console.log(search.content[0].text);
await transport.close();
JS
Project layout
src/
  index.js   # MCP server + search
  mineru.js  # MinerU API client + caching
output/      # (optional) local parsed outputs if you override output_root
Release (Git tag for npm publish via GitHub Actions)
This repo is configured to publish to npm when a tag matching v* is pushed.
# 1) bump version
npm version patch --no-git-tag-version
# 2) commit version bump
git add package.json package-lock.json
git commit -m "bump version"
# 3) push commit
git push
# 4) create tag and push
git tag vX.Y.Z
git push origin vX.Y.Z
Notes:
- The GitHub Action uses the NPM_TOKEN secret and publishes with --access public.
- The tag version must match the package.json version.
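Before pushing a tag, the version match can be checked with something as small as the sketch below. tagMatchesVersion is a hypothetical helper and not part of the repo or its GitHub Action; it assumes tags are the package.json version prefixed with "v".

```javascript
// Hypothetical pre-release check: does a tag like "v1.0.2" match
// the "version" field read from package.json?
function tagMatchesVersion(tag, version) {
  return tag === `v${version}`;
}
```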
