search-papers v0.1.2
Multi-source academic paper search library (Semantic Scholar, Google Scholar, arXiv)
# search-papers
A TypeScript library for searching academic papers across multiple sources, returning structured and normalized results.
Built on ghostfetch for robust HTTP requests with browser fingerprint spoofing and anti-bot bypass.
## Features
- Multi-source search - Query Google Scholar, Semantic Scholar, and arXiv in parallel
- Unified `Paper` interface - All sources return the same structured format regardless of origin
- Deduplication - Automatically merges duplicate papers across sources using DOI, canonical URL, and title matching
- Impact Factor ranking - Results sorted by journal Impact Factor (static mapping of ~100 journals)
- Citation & reference lookup - Retrieve citing/referenced papers via Semantic Scholar
- Anti-bot bypass - ghostfetch handles browser spoofing, JS challenge solving, and redirect tracking
- Partial failure tolerance - If one source fails, results from other sources are still returned
## Requirements
- Node.js >= 22.0.0
## Installation

```sh
npm install search-papers
```

## Quick Start

```ts
import { searchPapers, getPaper } from 'search-papers';

// Search across all sources
const result = await searchPapers('attention is all you need', {
  limit: 10,
});
console.log(result.papers);

// Search specific sources only
const arxivOnly = await searchPapers('transformer', {
  sources: ['arxiv'],
  limit: 5,
  sort: 'date',
});

// Look up a single paper by DOI
const paper = await getPaper('10.48550/arXiv.1706.03762');
console.log(paper?.title);
```

## API
### searchPapers(query, options?)
Search for papers across multiple sources simultaneously.
```ts
const result = await searchPapers('deep learning', {
  sources: ['semantic_scholar', 'google_scholar', 'arxiv'], // default: all
  limit: 10, // default: 10
  offset: 0,
  year: { from: 2020, to: 2024 },
  sort: 'relevance', // 'relevance' | 'date' | 'citations'
  client: {
    semanticScholarApiKey: 'your-key', // optional
    proxy: 'http://proxy:8080', // optional
    timeout: 15000, // default: 15000ms
  },
});
```

Returns: `SearchResult`
```ts
interface SearchResult {
  query: string;
  totalResults?: number;
  papers: Paper[];
  nextPageToken?: string;
  source: SourceType;
  errors?: SourceError[]; // errors from failed sources
}
```

### getPaper(doi, options?)
Look up a single paper by DOI using Semantic Scholar.
```ts
const paper = await getPaper('10.1038/nature14539');
// Returns Paper | null
```

## Paper Interface
Every paper returned by any source conforms to this interface:
```ts
interface Paper {
  title: string;
  authors: Author[];
  abstract?: string;
  year?: number;
  venue?: string; // journal or conference name
  doi?: string;
  url: string; // link to the paper
  canonicalUrl?: string; // final redirect URL
  pdfUrl?: string;
  citationCount?: number;
  impactFactor?: number; // journal Impact Factor
  source: SourceType; // 'google_scholar' | 'semantic_scholar' | 'arxiv'
  sourceId?: string; // source-specific ID
  tags?: string[]; // e.g. arXiv categories
  references?: string[];
}
```

## Using Individual Sources
For more control, use source classes directly:
```ts
import { createClient, SemanticScholarSource, GoogleScholarSource, ArxivSource } from 'search-papers';

const client = createClient();

// Semantic Scholar (implements CitationSource)
const s2 = new SemanticScholarSource(client);
const result = await s2.search('transformer', { limit: 5 });
const paper = await s2.getPaper('DOI:10.48550/arXiv.1706.03762');
const citations = await s2.getCitations('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
const references = await s2.getReferences('204e3073870fae3d05bcbc2f6a8e263d9b72e776');

// Google Scholar (implements PaperSource)
const gs = new GoogleScholarSource(client);
const gsResult = await gs.search('deep learning', { limit: 10 });

// arXiv (implements PaperSource)
const arxiv = new ArxivSource(client);
const arxivResult = await arxiv.search('neural network', { limit: 10 });
const arxivPaper = await arxiv.getPaper('1706.03762');

await client.destroy();
```

## Sources
| Source | Type | Search | Get Paper | Citations | References | Notes |
|--------|------|--------|-----------|-----------|------------|-------|
| Semantic Scholar | API (JSON) | Yes | Yes (DOI, paperId, etc.) | Yes | Yes | Optional API key for dedicated rate limit |
| Google Scholar | Scraping (HTML) | Yes | Yes (title search) | No | No | CAPTCHA risk, 2-5s random delay |
| arXiv | API (Atom XML) | Yes | Yes (arXiv ID) | No | No | 3s minimum delay between requests |
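The per-source delays in the Notes column can be enforced with a simple throttle. A minimal self-contained sketch of the idea (illustrative only, not the library's internal implementation; `makeThrottle` is a hypothetical name):

```ts
// Request throttle: guarantees a minimum gap between consecutive calls,
// e.g. arXiv's fixed 3 s delay or Google Scholar's random 2-5 s delay.
function makeThrottle(minDelayMs: () => number): () => Promise<void> {
  let last = 0;
  return async function throttle(): Promise<void> {
    const wait = last + minDelayMs() - Date.now();
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    last = Date.now();
  };
}

// arXiv: fixed 3 s gap; Google Scholar: random 2-5 s gap
const arxivThrottle = makeThrottle(() => 3000);
const scholarThrottle = makeThrottle(() => 2000 + Math.random() * 3000);
```

Calling `await arxivThrottle()` before each request then spaces requests at least 3 s apart.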
## Search Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| sources | SourceType[] | All 3 sources | Which sources to query |
| limit | number | 10 | Max results to return |
| offset | number | 0 | Pagination offset |
| year | { from?, to? } | - | Publication year range filter |
| sort | string | 'relevance' | Sort order: 'relevance', 'date', 'citations' |
## Client Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| browser | string | 'Chrome_131' | Browser to spoof |
| timeout | number | 15000 | Request timeout in ms |
| proxy | string | - | HTTP proxy URL |
| proxyPool | string[] | - | Proxy pool with round-robin rotation |
| semanticScholarApiKey | string | - | Semantic Scholar API key |
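The `proxyPool` option rotates proxies round-robin. The behavior can be sketched as follows (a self-contained illustration, not the library's internals; `makeProxyRotator` is a hypothetical name):

```ts
// Round-robin rotation over a proxy pool: each request takes the next
// proxy in order, wrapping back to the start of the list.
function makeProxyRotator(pool: string[]): () => string {
  let i = 0;
  return function nextProxy(): string {
    const proxy = pool[i % pool.length];
    i++;
    return proxy;
  };
}

const next = makeProxyRotator(['http://p1:8080', 'http://p2:8080']);
next(); // 'http://p1:8080'
next(); // 'http://p2:8080'
next(); // 'http://p1:8080' (wraps around)
```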
## How It Works
- Parallel queries - All selected sources are queried simultaneously via `Promise.allSettled`
- Canonical URL resolution - ghostfetch follows redirects to determine the final URL of each paper
- Impact Factor lookup - Each paper's venue is matched against a static journal Impact Factor table
- Deduplication - Papers are deduplicated using DOI > canonical URL > normalized title (in priority order), merging metadata from multiple sources
- Sorting - Results are sorted by Impact Factor (descending), then by citation count
- Limit - Final results are trimmed to the requested limit
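The deduplication and sorting steps above can be sketched as follows (a simplified, self-contained illustration built on a subset of the `Paper` fields; the library's actual merge logic may differ):

```ts
interface PaperLike {
  title: string;
  doi?: string;
  canonicalUrl?: string;
  impactFactor?: number;
  citationCount?: number;
}

// Dedup key priority: DOI > canonical URL > normalized title.
function dedupKey(p: PaperLike): string {
  if (p.doi) return `doi:${p.doi.toLowerCase()}`;
  if (p.canonicalUrl) return `url:${p.canonicalUrl}`;
  return `title:${p.title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()}`;
}

function dedupeAndSort(papers: PaperLike[]): PaperLike[] {
  const byKey = new Map<string, PaperLike>();
  for (const p of papers) {
    const key = dedupKey(p);
    const existing = byKey.get(key);
    // Merge: keep the first-seen paper, filling in fields it lacks.
    byKey.set(key, existing ? { ...p, ...existing } : p);
  }
  // Sort by Impact Factor (descending), then citation count (descending).
  return [...byKey.values()].sort(
    (a, b) =>
      (b.impactFactor ?? 0) - (a.impactFactor ?? 0) ||
      (b.citationCount ?? 0) - (a.citationCount ?? 0),
  );
}
```

With this key priority, two records sharing a DOI collapse into one even when their titles differ in casing or punctuation.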
## Error Handling
The library uses partial failure tolerance. If one source fails, results from other sources are still returned:
```ts
const result = await searchPapers('query');
if (result.errors) {
  for (const err of result.errors) {
    console.warn(`${err.source}: ${err.message} (${err.code})`);
    // err.code: 'RATE_LIMITED' | 'CAPTCHA' | 'TIMEOUT' | 'NETWORK_ERROR' | 'PARSE_ERROR' | 'UNKNOWN'
  }
}
// result.papers still contains results from successful sources
```

## Development
```sh
npm run build      # Build ESM + CJS + .d.ts via tsup
npm run lint       # Type check with tsc --noEmit
npm run test       # Run unit tests
npm run test:live  # Run live tests (requires LIVE_TEST=true)
```

## License
MIT
