search-papers v0.1.2
Multi-source academic paper search library (Semantic Scholar, Google Scholar, arXiv)
# search-papers
A TypeScript library for searching academic papers across multiple sources, returning structured and normalized results.
Built on ghostfetch for robust HTTP requests with browser fingerprint spoofing and anti-bot bypass.
## Features
- Multi-source search - Query Google Scholar, Semantic Scholar, and arXiv in parallel
- Unified `Paper` interface - All sources return the same structured format regardless of origin
- Deduplication - Automatically merges duplicate papers across sources using DOI, canonical URL, and title matching
- Impact Factor ranking - Results sorted by journal Impact Factor (static mapping of ~100 journals)
- Citation & reference lookup - Retrieve citing/referenced papers via Semantic Scholar
- Anti-bot bypass - ghostfetch handles browser spoofing, JS challenge solving, and redirect tracking
- Partial failure tolerance - If one source fails, results from other sources are still returned
## Requirements
- Node.js >= 22.0.0
## Installation

```sh
npm install search-papers
```

## Quick Start

```ts
import { searchPapers, getPaper } from 'search-papers';

// Search across all sources
const result = await searchPapers('attention is all you need', {
  limit: 10,
});
console.log(result.papers);

// Search specific sources only
const arxivOnly = await searchPapers('transformer', {
  sources: ['arxiv'],
  limit: 5,
  sort: 'date',
});

// Look up a single paper by DOI
const paper = await getPaper('10.48550/arXiv.1706.03762');
console.log(paper?.title);
```

## API
### searchPapers(query, options?)
Search for papers across multiple sources simultaneously.
```ts
const result = await searchPapers('deep learning', {
  sources: ['semantic_scholar', 'google_scholar', 'arxiv'], // default: all
  limit: 10, // default: 10
  offset: 0,
  year: { from: 2020, to: 2024 },
  sort: 'relevance', // 'relevance' | 'date' | 'citations'
  client: {
    semanticScholarApiKey: 'your-key', // optional
    proxy: 'http://proxy:8080', // optional
    timeout: 15000, // default: 15000ms
  },
});
```

Returns: `SearchResult`
```ts
interface SearchResult {
  query: string;
  totalResults?: number;
  papers: Paper[];
  nextPageToken?: string;
  source: SourceType;
  errors?: SourceError[]; // errors from failed sources
}
```

### getPaper(doi, options?)
Look up a single paper by DOI using Semantic Scholar.
```ts
const paper = await getPaper('10.1038/nature14539');
// Returns Paper | null
```

## Paper Interface
Every paper returned by any source conforms to this interface:
```ts
interface Paper {
  title: string;
  authors: Author[];
  abstract?: string;
  year?: number;
  venue?: string; // journal or conference name
  doi?: string;
  url: string; // link to the paper
  canonicalUrl?: string; // final redirect URL
  pdfUrl?: string;
  citationCount?: number;
  impactFactor?: number; // journal Impact Factor
  source: SourceType; // 'google_scholar' | 'semantic_scholar' | 'arxiv'
  sourceId?: string; // source-specific ID
  tags?: string[]; // e.g. arXiv categories
  references?: string[];
}
```

## Using Individual Sources
For more control, use source classes directly:
```ts
import { createClient, SemanticScholarSource, GoogleScholarSource, ArxivSource } from 'search-papers';

const client = createClient();

// Semantic Scholar (implements CitationSource)
const s2 = new SemanticScholarSource(client);
const result = await s2.search('transformer', { limit: 5 });
const paper = await s2.getPaper('DOI:10.48550/arXiv.1706.03762');
const citations = await s2.getCitations('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
const references = await s2.getReferences('204e3073870fae3d05bcbc2f6a8e263d9b72e776');

// Google Scholar (implements PaperSource)
const gs = new GoogleScholarSource(client);
const gsResult = await gs.search('deep learning', { limit: 10 });

// arXiv (implements PaperSource)
const arxiv = new ArxivSource(client);
const arxivResult = await arxiv.search('neural network', { limit: 10 });
const arxivPaper = await arxiv.getPaper('1706.03762');

await client.destroy();
```

## Sources
| Source | Type | Search | Get Paper | Citations | References | Notes |
|--------|------|--------|-----------|-----------|------------|-------|
| Semantic Scholar | API (JSON) | Yes | Yes (DOI, paperId, etc.) | Yes | Yes | Optional API key for dedicated rate limit |
| Google Scholar | Scraping (HTML) | Yes | Yes (title search) | No | No | CAPTCHA risk, 2-5s random delay |
| arXiv | API (Atom XML) | Yes | Yes (arXiv ID) | No | No | 3s minimum delay between requests |
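The per-source delays in the Notes column can be enforced with a simple throttle. A minimal self-contained sketch of the idea (illustrative only, not the library's internal implementation; `makeThrottle` is a hypothetical name):

```ts
// Request throttle: guarantees a minimum gap between consecutive calls,
// e.g. arXiv's fixed 3 s delay or Google Scholar's random 2-5 s delay.
function makeThrottle(minDelayMs: () => number): () => Promise<void> {
  let last = 0;
  return async function throttle(): Promise<void> {
    const wait = last + minDelayMs() - Date.now();
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    last = Date.now();
  };
}

// arXiv: fixed 3 s gap; Google Scholar: random 2-5 s gap
const arxivThrottle = makeThrottle(() => 3000);
const scholarThrottle = makeThrottle(() => 2000 + Math.random() * 3000);
```

Calling `await arxivThrottle()` before each request then spaces requests at least 3 s apart.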
## Search Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| sources | SourceType[] | All 3 sources | Which sources to query |
| limit | number | 10 | Max results to return |
| offset | number | 0 | Pagination offset |
| year | { from?, to? } | - | Publication year range filter |
| sort | string | 'relevance' | Sort order: 'relevance', 'date', 'citations' |
## Client Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| browser | string | 'Chrome_131' | Browser to spoof |
| timeout | number | 15000 | Request timeout in ms |
| proxy | string | - | HTTP proxy URL |
| proxyPool | string[] | - | Proxy pool with round-robin rotation |
| semanticScholarApiKey | string | - | Semantic Scholar API key |
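The `proxyPool` option rotates proxies round-robin. The behavior can be sketched as follows (a self-contained illustration, not the library's internals; `makeProxyRotator` is a hypothetical name):

```ts
// Round-robin rotation over a proxy pool: each request takes the next
// proxy in order, wrapping back to the start of the list.
function makeProxyRotator(pool: string[]): () => string {
  let i = 0;
  return function nextProxy(): string {
    const proxy = pool[i % pool.length];
    i++;
    return proxy;
  };
}

const next = makeProxyRotator(['http://p1:8080', 'http://p2:8080']);
next(); // 'http://p1:8080'
next(); // 'http://p2:8080'
next(); // 'http://p1:8080' (wraps around)
```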
## How It Works
- Parallel queries - All selected sources are queried simultaneously via `Promise.allSettled`
- Canonical URL resolution - ghostfetch follows redirects to determine the final URL of each paper
- Impact Factor lookup - Each paper's venue is matched against a static journal Impact Factor table
- Deduplication - Papers are deduplicated using DOI > canonical URL > normalized title (in priority order), merging metadata from multiple sources
- Sorting - Results are sorted by Impact Factor (descending), then by citation count
- Limit - Final results are trimmed to the requested limit
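The deduplication and sorting steps above can be sketched as follows (a simplified, self-contained illustration built on a subset of the `Paper` fields; the library's actual merge logic may differ):

```ts
interface PaperLike {
  title: string;
  doi?: string;
  canonicalUrl?: string;
  impactFactor?: number;
  citationCount?: number;
}

// Dedup key priority: DOI > canonical URL > normalized title.
function dedupKey(p: PaperLike): string {
  if (p.doi) return `doi:${p.doi.toLowerCase()}`;
  if (p.canonicalUrl) return `url:${p.canonicalUrl}`;
  return `title:${p.title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()}`;
}

function dedupeAndSort(papers: PaperLike[]): PaperLike[] {
  const byKey = new Map<string, PaperLike>();
  for (const p of papers) {
    const key = dedupKey(p);
    const existing = byKey.get(key);
    // Merge: keep the first-seen paper, filling in fields it lacks.
    byKey.set(key, existing ? { ...p, ...existing } : p);
  }
  // Sort by Impact Factor (descending), then citation count (descending).
  return [...byKey.values()].sort(
    (a, b) =>
      (b.impactFactor ?? 0) - (a.impactFactor ?? 0) ||
      (b.citationCount ?? 0) - (a.citationCount ?? 0),
  );
}
```

With this key priority, two records sharing a DOI collapse into one even when their titles differ in casing or punctuation.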
## Error Handling
The library uses partial failure tolerance. If one source fails, results from other sources are still returned:
```ts
const result = await searchPapers('query');
if (result.errors) {
  for (const err of result.errors) {
    console.warn(`${err.source}: ${err.message} (${err.code})`);
    // err.code: 'RATE_LIMITED' | 'CAPTCHA' | 'TIMEOUT' | 'NETWORK_ERROR' | 'PARSE_ERROR' | 'UNKNOWN'
  }
}
// result.papers still contains results from successful sources
```

## Development
```sh
npm run build      # Build ESM + CJS + .d.ts via tsup
npm run lint       # Type check with tsc --noEmit
npm run test       # Run unit tests
npm run test:live  # Run live tests (requires LIVE_TEST=true)
```

## License
MIT
