ncbi-mcp

v2.1.0

Published

4 months ago

Model Context Protocol (MCP) server that integrates NCBI Entrez capabilities.


noahzeidenberg

NCBI Model Context Protocol (MCP)

A Python implementation of the Model Context Protocol for interacting with NCBI databases.

Setup

Clone this repository
Install dependencies:
```
pip install -r requirements.txt
```

Create a .env file with your NCBI API key and contact email:

NCBI_API_KEY=your_api_key_here
[email protected]
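How ncbi_mcp.py reads this file is not shown here. As a rough illustration only, a .env file of this shape can be parsed with the standard library; `load_env` is a hypothetical helper, not part of this project.

```python
import os


def load_env(text):
    """Parse KEY=VALUE lines; skip blanks, comments, and lines without '='."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical file contents, mirroring the placeholder shown above.
settings = load_env("# NCBI credentials\nNCBI_API_KEY=your_api_key_here\n")

# Export to the process environment without overriding existing variables.
for key, value in settings.items():
    os.environ.setdefault(key, value)
```

Using `setdefault` means a key already exported in the shell takes precedence over the file, which is a common convention but an assumption here.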

Running the MCP Server

python ncbi_mcp.py

Using with Cursor/Claude

Once the MCP server is running, you can interact with it using natural language in Cursor/Claude.

Using Natural Language Queries

You can use natural language to perform searches and retrieve information:

tools/call
{
  "name": "nlp-query",
  "arguments": {
    "query": "Find research articles about BRCA1"
  }
}

Or more simply, just use the query directly:

@ncbi-mcp Find research articles about BRCA1
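For clients that talk to the server directly rather than through Cursor/Claude, the tools/call payload shown earlier in this section would normally travel inside a JSON-RPC 2.0 envelope. The sketch below assumes the server follows the standard MCP message shape; `make_tool_call` is a hypothetical helper name.

```python
import json
from itertools import count

# Monotonic request ids so responses can be matched to requests.
_request_ids = count(1)


def make_tool_call(name, arguments):
    """Wrap a tool name and its arguments in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = make_tool_call("nlp-query", {"query": "Find research articles about BRCA1"})
# One request per line is the layout a .jsonl test file would use.
line = json.dumps(request)
```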

Example Natural Language Queries

Here are some example natural language queries you can try:

Gene function information:

@ncbi-mcp Please summarize the function of TNF-alpha

Genome size and statistics:

@ncbi-mcp How big is the genome for Saccharomyces cerevisiae?

Assembly statistics:

@ncbi-mcp What are the reported L50 and N50 statistics for the most recent E. coli genome?

Dataset counts:

@ncbi-mcp How many datasets are available in the biosample database for b16f10 mouse melanoma cells?

Search for scientific articles:

@ncbi-mcp Find the latest research on COVID-19 vaccines

Get gene information:
```
@ncbi-mcp Tell me about the BRCA1 gene
```

Fetch genome information:

@ncbi-mcp Get genome information for Homo sapiens

Testing

To test the MCP server, run the included test script, optionally passing one of the bundled test files:

# Test natural language query functionality (default)
.\run_test.bat

# Test all tools
.\run_test.bat all

# Test specific test file
.\run_test.bat test_all_tools.jsonl

# Test high-level tools
.\run_test.bat test_high_level_tools.jsonl

The test script will:

Start the MCP server in background
Send test requests from the specified file
Wait for a few seconds to allow processing
Terminate the server and display the output

This approach is used because the MCP server is designed to run continuously as a service. For manual testing without automatic termination, you can use:

# Run manually with any test file
type test_nlp_query.jsonl | python ncbi_mcp.py

The test files contain example JSON-RPC requests that simulate how Cursor/Claude would interact with the MCP server.
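The start, send, wait, terminate sequence described above can be approximated in Python on any platform. This is a sketch of the idea, not a port of run_test.bat: `run_requests` is a hypothetical helper, and it assumes the server reads one JSON request per line from stdin and writes responses to stdout.

```python
import json
import subprocess
import sys


def run_requests(command, requests, timeout=10.0):
    """Start `command`, send one JSON request per line on stdin, return its stdout."""
    payload = "".join(json.dumps(request) + "\n" for request in requests)
    proc = subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        # communicate() writes the payload, closes stdin, and waits for exit.
        out, _ = proc.communicate(payload, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A long-running server will not exit by itself, so stop it explicitly.
        proc.kill()
        out, _ = proc.communicate()
    return out


# Example (not executed here): run_requests([sys.executable, "ncbi_mcp.py"], requests)
```

The timeout plays the role of the "wait for a few seconds" step: a process that is still running after that is killed and whatever it printed is returned.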

Available Tools

The NCBI MCP provides both high-level tools that understand natural language and low-level tools for direct database interaction.

Tool Usage Guidelines for LLMs

Recommended Workflow Patterns

For most biological queries, start with nlp-query. It handles complex questions and automatically routes them to the appropriate specialized tool.

Common Research Workflows:

Gene Analysis Workflow:
- Start with nlp-query for general gene questions
- Use summarize-gene for comprehensive gene information
- Use get_gene_info for detailed structured data
- Use ncbi-search + ncbi-fetch for specific database queries

Genome Analysis Workflow:
- Use genome-stats for organism genome statistics
- Use get_genome_info for detailed genome metadata
- Use count-datasets to explore available genome assemblies

Literature Research Workflow:
- Use nlp-query for natural language literature searches
- Use ncbi-search with database="pubmed" for precise searches
- Use ncbi-fetch to get full publication details

Dataset Discovery Workflow:
- Use count-datasets to assess data availability
- Use nlp-query to explore datasets with natural language
- Use ncbi-search for systematic database exploration

E-utilities Workflow (Advanced):
- Use ncbi-info to discover available databases
- Use ncbi-global-query to see which databases contain your search term
- Use ncbi-search to find specific UIDs in target databases
- Use ncbi-summary to get overview information about records
- Use ncbi-fetch to retrieve complete records
- Use ncbi-link to find related records across databases

Cross-Database Analysis Workflow:
- Use ncbi-search to find genes of interest
- Use ncbi-link to find related proteins, structures, or literature
- Use ncbi-summary to get metadata about related records
- Use ncbi-fetch to retrieve detailed information
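As an illustration of chaining tools, the search-then-fetch part of the Literature Research Workflow might look like the sketch below. The argument names follow the ncbi-search and ncbi-fetch examples in this README; the shape of the search result (an `"ids"` list) and the `call_tool` transport function are assumptions, since the response format is not documented here.

```python
def search_request(term, database="pubmed"):
    """Build the ncbi-search tool payload."""
    return {"name": "ncbi-search", "arguments": {"database": database, "term": term}}


def fetch_request(ids, database="pubmed"):
    """Build the ncbi-fetch tool payload for a list of UIDs."""
    return {
        "name": "ncbi-fetch",
        "arguments": {"database": database, "ids": [str(uid) for uid in ids]},
    }


def literature_workflow(term, call_tool, max_records=5):
    """Search first, then fetch the top hits; returns None when nothing matched."""
    search_result = call_tool(search_request(term))
    # ASSUMPTION: the search result exposes matching UIDs under "ids".
    ids = search_result.get("ids", [])[:max_records]
    if not ids:
        return None
    return call_tool(fetch_request(ids))
```

Passing `call_tool` in as a parameter keeps the workflow independent of how requests actually reach the server, so it can be exercised with a stub.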

Tool Selection Guide

High-Level Tools (Recommended for most users):

nlp-query: Use for general biological questions, complex queries, and when you're unsure which tool to use
summarize-gene: Use for comprehensive gene analysis and understanding gene function
genome-stats: Use for genome size, assembly quality, and organism comparison
count-datasets: Use for research planning and data availability assessment
get_gene_info: Use for detailed, structured gene information
get_genome_info: Use for detailed, structured genome information

Low-Level E-utilities Tools (For advanced users):

ncbi-search (ESearch): Use for precise database searches with specific filters, Boolean operators, and field qualifiers
ncbi-fetch (EFetch): Use to retrieve complete records after searching; supports multiple formats (GenBank, FASTA, XML)
ncbi-summary (ESummary): Use to get document summaries without fetching complete records
ncbi-link (ELink): Use to find related records across databases (e.g., gene to protein, protein to structure)
ncbi-info (EInfo): Use to discover available databases and their capabilities
ncbi-global-query (EGQuery): Use to search across all databases simultaneously
ncbi-spell (ESpell): Use to get spelling suggestions for search terms
ncbi-citation-match (ECitMatch): Use to find PMIDs from citation information
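For orientation, each low-level tool is named after an NCBI E-utility, and each E-utility has a public endpoint under the E-utilities base URL. The mapping below pairs the tool names from this list with those endpoints; whether ncbi_mcp.py calls them in exactly this way is an assumption.

```python
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Tool names come from this README; endpoint names are the public E-utilities.
TOOL_TO_ENDPOINT = {
    "ncbi-search": "esearch.fcgi",
    "ncbi-fetch": "efetch.fcgi",
    "ncbi-summary": "esummary.fcgi",
    "ncbi-link": "elink.fcgi",
    "ncbi-info": "einfo.fcgi",
    "ncbi-global-query": "egquery.fcgi",
    "ncbi-spell": "espell.fcgi",
    "ncbi-citation-match": "ecitmatch.cgi",
}


def endpoint_url(tool_name):
    """Return the E-utility URL a low-level tool presumably wraps."""
    return EUTILS_BASE + TOOL_TO_ENDPOINT[tool_name]
```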

Biological Context and Terminology

Understanding NCBI Databases:

Gene: Contains gene records with symbols, names, functions, and genomic locations
Protein: Contains protein sequences and annotations
Nucleotide: Contains DNA/RNA sequences (genes, transcripts, genomic regions)
PubMed: Contains scientific literature and publications
BioSample: Contains biological sample metadata (tissues, cell lines, etc.)
BioProject: Contains research project information
SRA: Contains raw sequencing data
Assembly: Contains genome assembly information

Common Biological Terms:

Gene Symbol: Short abbreviation (e.g., BRCA1, TP53, TNF)
Gene ID: Unique NCBI identifier (e.g., 672 for BRCA1)
Accession: Unique sequence identifier (e.g., NM_001126114.3)
N50/L50: Assembly contiguity metrics (a larger N50 and a smaller L50 generally indicate a more contiguous assembly)
Reference Genome: High-quality representative genome for a species
Organism: Use scientific names (Homo sapiens) or common names (human)
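To make the N50/L50 entry concrete: sort the contigs from longest to shortest and add up their lengths until the running total reaches half of the assembly size. N50 is the length of the contig that crosses that mark, and L50 is how many contigs it took. The contig lengths below are made up for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for contig_count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, contig_count


# Total is 290, so the halfway mark is 145; 80 + 70 = 150 crosses it.
n50, l50 = n50_l50([80, 70, 50, 40, 30, 20])
print(n50, l50)  # 70 2
```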

Search Strategies:

Use specific gene symbols for precise results
Include organism names to avoid ambiguity
Use Boolean operators (AND, OR, NOT) for complex searches
Use field qualifiers like [Gene], [Organism], [Protein Name] for targeted searches
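These strategies combine naturally into a single Entrez search term. A small sketch of building one, using hypothetical helper names (`field_clause`, `combine`); multi-word values are quoted so they are treated as a phrase.

```python
def field_clause(value, field):
    """Attach a field qualifier, e.g. BRCA1[Gene]; quote multi-word values."""
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]"


def combine(clauses, operator="AND"):
    """Join clauses with one Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(clauses)


term = combine(
    [field_clause("BRCA1", "Gene"), field_clause("Homo sapiens", "Organism")]
)
print(term)  # BRCA1[Gene] AND "Homo sapiens"[Organism]
```

The resulting string can be passed as the `term` argument of ncbi-search.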

High-Level Tools

Natural Language Query Processor

tools/call
{
  "name": "nlp-query",
  "arguments": {
    "query": "Please summarize the function of TNF-alpha"
  }
}

Gene Summarizer

tools/call
{
  "name": "summarize-gene",
  "arguments": {
    "gene_name": "BRCA1"
  }
}

Genome Statistics

tools/call
{
  "name": "genome-stats",
  "arguments": {
    "organism": "Escherichia coli"
  }
}

Dataset Counter

tools/call
{
  "name": "count-datasets",
  "arguments": {
    "database": "biosample",
    "query": "mouse melanoma b16f10"
  }
}

Low-Level Tools

Search NCBI Databases

tools/call
{
  "name": "ncbi-search",
  "arguments": {
    "database": "pubmed",
    "term": "BRCA1",
    "filters": {
      "organism": "Homo sapiens",
      "date_range": {
        "start": "2020"
      }
    }
  }
}
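How the server turns `filters` into an actual ESearch request is not documented here. One plausible mapping, shown purely as an assumption, folds the organism into the search term and expresses the date range with the standard ESearch `mindate`, `maxdate`, and `datetype` parameters. `esearch_params` is a hypothetical helper.

```python
from urllib.parse import urlencode


def esearch_params(term, database, organism=None, start_year=None, end_year=None):
    """Translate tool-style filters into ESearch query parameters (assumed mapping)."""
    clauses = [term]
    if organism:
        # Field availability varies by database; ncbi-info can list the fields.
        clauses.append(f'"{organism}"[Organism]')
    params = {"db": database, "term": " AND ".join(clauses), "retmode": "json"}
    if start_year or end_year:
        # ESearch expects mindate and maxdate together, so fill the open end.
        params["datetype"] = "pdat"
        params["mindate"] = start_year or "1900"
        params["maxdate"] = end_year or "3000"
    return params


params = esearch_params("BRCA1", "pubmed", organism="Homo sapiens", start_year="2020")
query_string = urlencode(params)
```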

Fetch NCBI Records

tools/call
{
  "name": "ncbi-fetch",
  "arguments": {
    "database": "gene",
    "ids": ["70"],
    "rettype": "gb"
  }
}

Get Gene Information

tools/call
{
  "name": "get_gene_info",
  "arguments": {
    "gene_id": "672"
  }
}

Get Genome Information

tools/call
{
  "name": "get_genome_info",
  "arguments": {
    "organism": "Homo sapiens",
    "reference": true
  }
}

License

Apache-2.0