dnagent

v0.3.0

Published

a month ago

AI-powered DNA research assistant

rlipas

dna genomics genetics 23andme ancestrydna ai-agent cli bioinformatics research

🧬 DNAgent — AI-Powered DNA Research Assistant

DNAgent is a CLI-based AI agent that loads raw DNA data files, searches and cross-references genetic variants against gene databases and public annotation APIs (ClinVar, gnomAD), and helps researchers and curious individuals explore possible implications of a broad range of genes.

⚠️ For research and educational purposes only. DNAgent does not provide medical diagnoses or clinical recommendations. Always consult a qualified healthcare provider for medical decisions.

👀 Preview

$ dnagent --file ../dna/AncestryDNA.txt --model openrouter/deepseek/deepseek-v4-flash

 📄 DNA File Loaded
──────────────────────────────────────────────────
  Path:    ../dna/AncestryDNA.txt
  Records:  675.356
  Chr:      26
  Het:      202.834
  Hom:      472.522

══════════════════════════════════════════════

🧬 DNAgent — DNA Research Assistant
   Model: openrouter/deepseek/deepseek-v4-flash  |  Thinking: medium
──────────────────────────────────────────────

Welcome to DNAgent — DNA Research Assistant

To get started, you can:
  • Search a variant:  "Search for rs4680"
  • Look up a gene:    "Tell me about the MTHFR gene"
  • Analyze variants:  "Analyze variants in the BRCA1 gene"
  • Annotate:          "What do we know about rs4680?"
  • Get help:          ?help

Data file: ../dna/AncestryDNA.txt — ready to explore!
──────────────────────────────────────────────
🧬

## 👋 Welcome to DNAgent!

Great, your AncestryDNA file has been loaded successfully with **675,356 records**. Let me give you a quick overview of what stands out:

### 📊 Key Observations About Your Data

| Metric | Value | Notes |
|--------|-------|-------|
| **Total SNPs** | 675,356 | Solid coverage across the genome |
| **Heterozygous** | 202,834 | ~30% — variants where you have two different alleles |
| **Homozygous Alt** | 472,522 | ~70% — variants where you have two copies of the non-reference allele |
| **Insertions** | 6,389 | These are interesting — AncestryDNA flags these as structural variants |
| **Deletions** | 0 | None detected in this dataset |

**Chromosome coverage** is good across all autosomes (1–22), plus **25,218 records on the X chromosome** (Chr 23), a handful on Chr 24 (Y — 3 records), Chr 25 (Mitochondrial — 36 records), and **263 records on Chr 26** (likely the XY pseudoautosomal region / PAR).

### 🔍 What You Can Explore Here

Here are some things you might want to check out:

1. **🧬 Common health & trait genes**
   - **MTHFR** — folate metabolism (a very popular one to check)
   - **CYP1A2** — caffeine metabolism ("slow" vs "fast" metabolizer)
   - **ACTN3** — muscle performance/power athlete trait
   - **LCT** — lactose tolerance
   - **APOE** — apolipoprotein E (lipid metabolism)
   - **BRCA1/BRCA2** — breast cancer-associated genes

2. **📋 See a full summary** of your data statistics with `dna_summary`

3. **🔎 Search for specific variants** by rsID (e.g., `rs4680` for COMT, `rs9939609` for FTO)

4. **🧪 Classify all variants within a gene** with `analyze_variants`

5. **📖 Look up gene information** with `gene_info`

──────────────────────────────
⚠️ **Important reminder:** I'm here for **educational and research purposes only**. All findings are raw genetic associations; they don't constitute medical advice. Always consult a qualified healthcare professional for clinical decisions.

──────────────────────────────
**Where would you like to start?** Want to look at a specific gene, search for a particular variant by rsID, or explore interesting variants in a trait-related gene?

💬 Analyze BRCA1, BRCA2 and related genes.
🧬

Absolutely! Let me start gathering information about **BRCA1**, **BRCA2**, and analyze the variants in your DNA for these genes.

Great data so far! Let me now look up the **heterozygous variants** in more detail and also check some **BRCA-related genes** (PALB2, CHEK2, ATM).

Now let me also analyze the variants in these related genes.

Let me get more details on some of these interesting variants.

Here's a comprehensive summary of your **BRCA pathway analysis**! 🧬

──────────────────────────────
## 📋 BRCA1 & BRCA2 — The Core Genes

### BRCA1 (Chromosome 17, 81.2 kb)
| Metric | Value |
|--------|-------|
| **Total variants in region** | 20 |
| **Heterozygous** | 4 (all common polymorphisms) |
| **Homozygous** | 16 |
| **Known Pathogenic** | ⚪ **None detected** |

Your 4 heterozygous variants (rs9908256, rs45484397, rs9630726, rs73309596) are all **common variants** with high population frequencies (11%–65% globally) — no clinical significance reported.

### BRCA2 (Chromosome 13, 84.2 kb)
| Metric | Value |
|--------|-------|
| **Total variants in region** | 2,777 |
| **Heterozygous** | 19 |
| **Homozygous** | 2,111 |
| **Insertions/Deletions** | 647 |

One heterozygous variant caught my attention:

> ⚠️ **rs144848**: **Uncertain Significance** (ClinVar)
> - Your genotype: **AC** (heterozygous)
> - Condition: Hereditary breast ovarian cancer syndrome
> - **What this means:** This variant has been flagged by one lab as having uncertain significance (VUS). This is *not* the same as being pathogenic; it means there's currently insufficient evidence to classify it as benign or harmful. Many VUS variants are later reclassified as benign as more data accumulates.

──────────────────────────────
## 🧩 Related Genes

### PALB2 (Partner & Localizer of BRCA2) — Chromosome 16
| Metric | Value |
|--------|-------|
| **Total variants** | 159 |
| **Heterozygous** | 1 (rs240744 — common, ~86% globally ✅) |
| **ClinVar flagged** | 5 flagged as Pathogenic/Likely Pathogenic |

**Important context on the flagged variants:** For all 5 flagged variants (rs118203998, rs879254113, rs587782050, rs180177136, rs864622192), your genotype shows **homozygous reference alleles** — meaning you do **not** carry the disease-associated alternate allele for any of them. For example:
- **rs118203998** (Pathogenic): You have **GG** (reference), not **C** (pathogenic)
- **rs180177136** (Likely Pathogenic): You have **TT** (reference), not **C** (pathogenic)
- Both are extremely rare (gnomAD frequency: **0.0%** worldwide)

✅ **No pathogenic PALB2 variants detected in your data.**

### CHEK2 (Checkpoint Kinase 2) — Chromosome 22
| Metric | Value |
|--------|-------|
| **Total variants** | 4 |
| **Heterozygous** | 1 (rs5752718 — ~27% globally ✅) |
| **Known Pathogenic** | ⚪ **None detected** |

### ATM (Ataxia Telangiectasia Mutated) — Chromosome 11
| Metric | Value |
|--------|-------|
| **Total variants** | 44 |
| **Heterozygous** | 2 (common polymorphisms ✅) |
| **Known Pathogenic flagged** | 8 flagged but you carry **reference (non-risk) alleles** ✅ |

──────────────────────────────
## ✅ Summary Table

| Gene | Variants Checked | Pathogenic Found? | Notable Findings |
|:-----|:----------------:|:-----------------:|:-----------------|
| **BRCA1** | 20 | ❌ No | All common variants |
| **BRCA2** | 2,777 | ❌ No | **rs144848 — VUS** (uncertain significance) |
| **PALB2** | 159 | ❌ No | Reference alleles at all pathogenic sites |
| **CHEK2** | 4 | ❌ No | All common variants |
| **ATM** | 44 | ❌ No | All common variants |

──────────────────────────────
### 📌 Important Research & Educational Context

🔬 **On rs144848 (BRCA2 VUS):** Variants of Uncertain Significance are quite common in direct-to-consumer DNA data. They represent areas where science hasn't yet reached a conclusion. Most VUS variants eventually get reclassified as benign. This does NOT mean you carry a disease-causing mutation.

⚠️ **Critical reminder:** This analysis is for **educational and research purposes only**. It is based on raw genotype data from AncestryDNA, which is not clinical-grade sequencing and does NOT provide medical diagnoses. The variants listed as "Pathogenic" in ClinVar that flagged in your results showed you carry the **reference** (non-pathogenic) allele.

If you have a personal or family history of breast/ovarian cancer, please consult a **genetic counselor or healthcare provider** for proper clinical testing and interpretation.

──────────────────────────────
**Would you like me to explore any of these variants more deeply, or look into other genes or traits?** I can check out things like pharmacogenetics (drug metabolism), other cancer-related genes, or traits like lactose tolerance, caffeine metabolism, and more!

Quick Start

Install from npm

npm install -g dnagent

Requires Node.js 22+ and an API key for at least one LLM provider (see API Keys below).

# Basic usage with a DNA file
dnagent --file ./genome.txt

# Specify a model and thinking level
dnagent --file ./genome.txt --model openrouter/deepseek/deepseek-v4-flash --thinking low

# Save a timestamped session log
dnagent --file ./genome.txt --log-file my-session.log

Build from source

git clone https://codeberg.org/rlipas/dnagent.git
cd dnagent
npm install
npm run build
make check                          # build → lint → test (149 tests)
node dist/main.js --file ./genome.txt

API Keys

DNAgent uses the pi SDK as its agent harness, which supports many LLM providers. Set your API key via an environment variable:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export DEEPSEEK_API_KEY=sk-...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=sk-or-...

All supported providers and their environment variables:

| Provider | Environment Variable |
|---|---|
| Anthropic | ANTHROPIC_API_KEY |
| Azure OpenAI | AZURE_OPENAI_API_KEY |
| Cerebras | CEREBRAS_API_KEY |
| Cloudflare AI Gateway | CLOUDFLARE_API_KEY + CLOUDFLARE_ACCOUNT_ID + CLOUDFLARE_GATEWAY_ID |
| Cloudflare Workers AI | CLOUDFLARE_API_KEY + CLOUDFLARE_ACCOUNT_ID |
| DeepSeek | DEEPSEEK_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Google Gemini | GEMINI_API_KEY |
| Groq | GROQ_API_KEY |
| Hugging Face | HF_TOKEN |
| Kimi For Coding | KIMI_API_KEY |
| MiniMax | MINIMAX_API_KEY |
| MiniMax (China) | MINIMAX_CN_API_KEY |
| Mistral | MISTRAL_API_KEY |
| OpenCode Go / Zen | OPENCODE_API_KEY |
| OpenAI | OPENAI_API_KEY |
| OpenRouter | OPENROUTER_API_KEY |
| Together AI | TOGETHER_API_KEY |
| Vercel AI Gateway | AI_GATEWAY_API_KEY |
| xAI | XAI_API_KEY |
| Xiaomi MiMo | XIAOMI_API_KEY |
| Xiaomi MiMo Token Plan (China) | XIAOMI_TOKEN_PLAN_CN_API_KEY |
| Xiaomi MiMo Token Plan (Amsterdam) | XIAOMI_TOKEN_PLAN_AMS_API_KEY |
| Xiaomi MiMo Token Plan (Singapore) | XIAOMI_TOKEN_PLAN_SGP_API_KEY |
| ZAI | ZAI_API_KEY |

Keys can also be stored in ~/.pi/agent/auth.json (created with 0600 permissions, user read/write only):

{
  "anthropic": { "type": "api_key", "key": "sk-ant-..." },
  "deepseek": { "type": "api_key", "key": "sk-..." },
  "openrouter": { "type": "api_key", "key": "sk-or-..." }
}

For more details, see the pi SDK providers documentation.

Features

| Capability | How |
|---|---|
| Load raw DNA files | 23andMe (4-column) and AncestryDNA (5-column) formats, CRLF-tolerant |
| Search variants | By rsID, gene symbol, or chromosome region, with ClinVar annotations |
| Gene information | 18 built-in well-studied genes + live MyGene.info API lookups with JSON cache |
| Variant annotation | MyVariant.info integration: ClinVar clinical significance, gnomAD population frequencies, dbNSFP functional predictions |
| Variant analysis | Classify heterozygous, homozygous, insertions, deletions per gene |
| Session logs | Timestamped plain-text logs of every session with --log-file |
| Summary statistics | Total records, chromosome distribution, het/hom counts, deletions/insertions |
| Literature search | Ground findings against published research via Europe PMC (full abstracts) |
| Natural language chat | Ask questions in plain English; the LLM interprets and uses tools |
| Multi-line editor | :edit opens $EDITOR for composing complex multi-line queries |
| Verbose debugging | --verbose shows tool calls, args, results, and reasoning |
| Model selection | --model provider/id and --thinking off\|low\|medium\|high\|xhigh |
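To make the "Search variants" row concrete, here is a minimal sketch of how rsID and chromosome-region searches over loaded records could work. The `DnaRecord` and `Region` shapes and both function names are hypothetical and chosen for illustration; the project's real types live in src/dna/types.ts and may differ. A gene-symbol search could plausibly reuse the region search with the gene's coordinates, but that is an assumption.

```typescript
// Hypothetical record shape for illustration only.
interface DnaRecord {
  rsid: string;
  chromosome: string;
  position: number;
  genotype: string;
}

// A chromosome region with inclusive start and end positions (assumed convention).
interface Region {
  chromosome: string;
  start: number;
  end: number;
}

// Exact, case-insensitive rsID match.
function findByRsid(records: DnaRecord[], rsid: string): DnaRecord | undefined {
  const wanted = rsid.trim().toLowerCase();
  return records.find((r) => r.rsid.toLowerCase() === wanted);
}

// All records inside the region, sorted by position.
// filter() returns a new array, so the input is not reordered.
function findInRegion(records: DnaRecord[], region: Region): DnaRecord[] {
  return records
    .filter(
      (r) =>
        r.chromosome === region.chromosome &&
        r.position >= region.start &&
        r.position <= region.end,
    )
    .sort((a, b) => a.position - b.position);
}
```

A linear scan like this is simple and is likely fast enough for a few hundred thousand records held in memory; an index keyed by rsID or chromosome would be the natural next step if it were not.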

CLI Flags

--file <path>       Pre-load a DNA file on startup
--model <id>        Specify model (format: provider/id, e.g. openrouter/openai/gpt-5-mini)
--thinking <level>  Thinking level: off, low, medium, high, xhigh (default: medium)
--verbose / -v      Show tool calls, reasoning, and debug output
--debug             Show system prompt, active tools, and raw tool results
--list-models       List all available models and exit
--log-file <path>   Save a timestamped session log (user input, AI thinking, replies, tool calls)
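The `--model` and `--thinking` values follow a small grammar that can be sketched as below. The function names are hypothetical, and splitting on the first slash only is an assumption inferred from examples such as openrouter/openai/gpt-5-mini, where the model id itself contains a slash. The real CLI may validate differently.

```typescript
// Thinking levels as documented in the flag list above.
const THINKING_LEVELS = ["off", "low", "medium", "high", "xhigh"] as const;
type ThinkingLevel = (typeof THINKING_LEVELS)[number];

interface ModelRef {
  provider: string;
  id: string;
}

// Split "provider/id" on the FIRST slash only (assumption), so ids that
// contain slashes themselves are kept whole.
function parseModelFlag(value: string): ModelRef {
  const slash = value.indexOf("/");
  if (slash <= 0 || slash === value.length - 1) {
    throw new Error(`Expected provider/id, got "${value}"`);
  }
  return { provider: value.slice(0, slash), id: value.slice(slash + 1) };
}

// Fall back to the documented default when the flag is absent.
function parseThinkingFlag(value: string | undefined): ThinkingLevel {
  if (value === undefined) return "medium";
  const match = THINKING_LEVELS.find((level) => level === value);
  if (match === undefined) {
    throw new Error(`Unknown thinking level "${value}"`);
  }
  return match;
}
```

With this sketch, `parseModelFlag("openrouter/deepseek/deepseek-v4-flash")` yields provider `openrouter` and id `deepseek/deepseek-v4-flash`.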

In-Chat Commands

?model          Show current model + list available
?help           Show help
:edit (or :e)   Open $EDITOR for multi-line input
exit            Quit the program

The Tools (LLM-callable)

The AI agent has access to exactly 6 tools — no filesystem or code-editing tools:

| Tool | What it does |
|---|---|
| search_dna | Search by rsID, gene symbol, or chromosome region. Returns variants enriched with ClinVar clinical significance annotations. |
| dna_summary | Summary statistics: total records, chromosome distribution, heterozygous/homozygous counts. |
| gene_info | Look up a gene by symbol or name: location, function, associated conditions. Uses static reference, JSON cache, and MyGene.info API. |
| variant_info | Look up detailed annotations for a specific rsID. Returns ClinVar significance, gnomAD population frequencies, and functional consequence from MyVariant.info. |
| analyze_variants | Classify all variants in a gene region as heterozygous, homozygous, or structural. Flags variants with known clinical significance. |
| search_literature | Search published biomedical literature via Europe PMC. Returns authors, title, journal, year, and full-length abstract (max 10 results). |
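The classification that analyze_variants and dna_summary report can be illustrated with a short sketch. The marker sets come from the Format Compatibility table in this README (deletions: `--`, `-`, `D`, `DD`; insertions: `I`, `II`). The function names, the `unknown` bucket, and the decision to accept only two-base A/C/G/T calls are assumptions, not the project's actual code.

```typescript
type VariantClass =
  | "heterozygous"
  | "homozygous"
  | "insertion"
  | "deletion"
  | "unknown";

// Markers as listed in the Format Compatibility table.
const DELETION_MARKERS = new Set(["--", "-", "D", "DD"]);
const INSERTION_MARKERS = new Set(["I", "II"]);

function classifyGenotype(genotype: string): VariantClass {
  const g = genotype.trim().toUpperCase();
  if (DELETION_MARKERS.has(g)) return "deletion";
  if (INSERTION_MARKERS.has(g)) return "insertion";
  // Anything that is not a two-base call lands in "unknown" (assumption).
  if (!/^[ACGT]{2}$/.test(g)) return "unknown";
  return g[0] === g[1] ? "homozygous" : "heterozygous";
}

// Tally classes across a list of genotypes, as a summary tool might.
function countClasses(genotypes: string[]): Record<VariantClass, number> {
  const counts: Record<VariantClass, number> = {
    heterozygous: 0,
    homozygous: 0,
    insertion: 0,
    deletion: 0,
    unknown: 0,
  };
  for (const g of genotypes) counts[classifyGenotype(g)] += 1;
  return counts;
}
```

Note that a genotype alone only says whether the two alleles match; deciding whether a homozygous call is reference or alternate needs an external reference allele, which this sketch does not attempt.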

Docker

A pre-built image is available on Docker Hub so you can run DNAgent without installing Node.js or any dependencies locally.

Pull from Docker Hub

docker pull rlipas/dnagent

Run with a DNA file

Mount your DNA file into the container with a volume (-v) and pass your API key as an environment variable:

docker run --rm -it \
  -v "$(pwd)/genome.txt:/data/genome.txt:ro" \
  -e OPENROUTER_API_KEY=sk-or-v1-... \
  rlipas/dnagent \
  --file /data/genome.txt

💡 Persistent cache: DNAgent caches gene and variant lookups in ~/.dnagent/cache/ inside the container. To persist the cache across runs, add -v dnagent-cache:/home/dnagent/.dnagent/cache.

Using other providers

Replace the environment variable with the one for your provider. For example, to use DeepSeek:

docker run --rm -it \
  -v "$(pwd)/genome.txt:/data/genome.txt:ro" \
  -e DEEPSEEK_API_KEY=sk-... \
  rlipas/dnagent \
  --file /data/genome.txt

Specifying a model

Use the --model flag to pick a specific model. For example, with OpenRouter:

docker run --rm -it \
  -v "$(pwd)/genome.txt:/data/genome.txt:ro" \
  -e OPENROUTER_API_KEY=sk-or-v1-... \
  rlipas/dnagent \
  --model openrouter/openai/gpt-5-mini \
  --file /data/genome.txt

All flags work as usual

docker run --rm -it \
  -v "$(pwd)/genome.txt:/data/genome.txt:ro" \
  -e OPENROUTER_API_KEY=sk-or-v1-... \
  rlipas/dnagent \
  --file /data/genome.txt \
  --verbose --thinking low

Any flags described in the CLI Flags section are passed directly after rlipas/dnagent in the docker run command.

Build locally

To build the image yourself instead of pulling from Docker Hub:

docker build -t dnagent .

Then replace rlipas/dnagent with dnagent in any of the docker run commands above.

Built-in Gene Reference

18 well-studied genes included out of the box (with GRCh37/hg19 coordinates):

| Gene | Function |
|---|---|
| MTHFR | Folate metabolism |
| BRCA1 / BRCA2 | DNA repair, tumor suppression |
| APOE | Lipid metabolism, Alzheimer's research |
| CYP2C9 / CYP1A2 / CYP3A4 | Drug metabolism |
| VKORC1 | Warfarin sensitivity |
| LCT | Lactose tolerance |
| HFE | Iron regulation |
| SLC6A4 / COMT | Neurotransmitter regulation |
| ACTN3 | Muscle fiber type |
| FTO | Energy metabolism |
| MC1R | Pigmentation |
| G6PD | Red blood cell enzyme |
| TPMT | Thiopurine metabolism |
| NAT2 | Acetylator phenotype |

Any gene not in the static set is automatically fetched from the MyGene.info API and cached to ~/.dnagent/cache/genes.json.
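The lookup order described above (static set, then cache, then MyGene.info) can be sketched as a local resolution step. This is an illustration under stated assumptions: `GeneInfo`, `resolveGene`, and `rememberGene` are hypothetical names, symbols are assumed to be matched case-insensitively, and the network call and the write to genes.json are deliberately left out.

```typescript
// Hypothetical, trimmed-down gene record for illustration.
interface GeneInfo {
  symbol: string;
  description: string;
}

// Either a local hit, or a signal that the caller must query the remote API.
type GeneLookup =
  | { source: "static" | "cache"; gene: GeneInfo }
  | { source: "api-needed"; symbol: string };

// Check the built-in set first, then the cache; otherwise report a miss.
function resolveGene(
  symbol: string,
  staticGenes: Map<string, GeneInfo>,
  cache: Map<string, GeneInfo>,
): GeneLookup {
  const key = symbol.trim().toUpperCase(); // normalization is an assumption
  const builtIn = staticGenes.get(key);
  if (builtIn !== undefined) return { source: "static", gene: builtIn };
  const cached = cache.get(key);
  if (cached !== undefined) return { source: "cache", gene: cached };
  return { source: "api-needed", symbol: key };
}

// After a successful remote fetch, store the result so the next lookup is local.
function rememberGene(cache: Map<string, GeneInfo>, gene: GeneInfo): void {
  cache.set(gene.symbol.toUpperCase(), gene);
}
```

Checking the static set before the cache means a stale cache entry can never shadow a built-in gene, which seems like the safer ordering for curated reference data.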

Project Structure

src/
├── main.ts                        # CLI entry: args, session, events, interactive loop
├── cli/
│   ├── output.ts                   # Colored terminal output (chalk) + markdown-to-ANSI
│   └── main.test.ts                # 42 integration tests: output, markdown, error panels
├── logger/
│   ├── session-log.ts              # Timestamped session log writer (--log-file)
│   └── session-log.test.ts         # 19 tests: buffering, interleaving, disk I/O
├── dna/
│   ├── types.ts                    # Core types (DnaRecord, DnaDataSet, GeneInfo)
│   ├── parser.ts                   # File parser, search, summary (pure logic)
│   ├── parser.test.ts              # 13 parse + 10 genotype tests
│   ├── gene-reference.ts           # 18 built-in genes (pure logic + data)
│   ├── gene-reference.test.ts      # 9 tests
│   ├── api-client.ts               # HTTP fetch with retry, timeout, rate limiting
│   ├── api-client.test.ts          # 11 tests
│   ├── gene-cache.ts               # JSON file cache (genes.json + variants.json)
│   ├── gene-cache.test.ts          # 10 tests
│   ├── gene-service.ts             # Composite: static → cache → MyGene.info API
│   ├── gene-service.test.ts        # 7 tests
│   ├── variant-annotation.ts       # MyVariant.info client + formatting
│   ├── variant-annotation.test.ts  # 8 tests
│   ├── literature-service.ts       # Europe PMC client + formatting
│   └── literature-service.test.ts  # 14 tests
└── agent/
    ├── index.ts                    # Barrel export
    ├── dna-state.ts                # In-memory session state (current dataset)
    └── tools/
        ├── search-dna.ts           # search_dna (with ClinVar annotations)
        ├── dna-summary.ts          # dna_summary
        ├── gene-info.ts            # gene_info (enriched: MyGene.info data)
        ├── variant-info.ts         # variant_info (rsID → ClinVar, gnomAD)
        ├── search-literature.ts    # search_literature (Europe PMC literature search)
        └── analyze-variants.ts     # analyze_variants (with ClinVar flags)

Format Compatibility

| Format | Columns | Example | Status |
|---|---|---|---|
| 23andMe | rsid \t chromosome \t position \t genotype | rs4477212\t1\t82154\tAA | ✅ |
| AncestryDNA | rsid \t chromosome \t position \t allele1 \t allele2 | rs3131972\t1\t752721\tA\tA | ✅ |
| CRLF endings | Both formats tolerate \r\n | | ✅ |
| Deletions | --, -, D, DD | | ✅ |
| Insertions | I, II | | ✅ |
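As an illustration of the two layouts in the table, the following sketch parses a single line of either format into one record shape. It is not the project's parser (that lives in src/dna/parser.ts). The `DnaRecord` shape and `parseDnaLine` name are hypothetical, and skipping `#` comment lines and a non-numeric header row is an assumption about how such files typically look.

```typescript
// Hypothetical record shape for illustration only.
interface DnaRecord {
  rsid: string;
  chromosome: string;
  position: number;
  genotype: string; // e.g. "AA", "AG", "II", "--"
}

// Parse one line from a 23andMe (4-column) or AncestryDNA (5-column) file.
// Returns null for blank lines, comments, header rows, and malformed lines.
function parseDnaLine(rawLine: string): DnaRecord | null {
  const line = rawLine.replace(/\r$/, ""); // tolerate CRLF endings
  if (line.trim() === "" || line.startsWith("#")) return null;

  const cols = line.split("\t");
  if (cols.length !== 4 && cols.length !== 5) return null;

  const [rsid, chromosome, positionText] = cols;
  const position = Number.parseInt(positionText, 10);
  if (Number.isNaN(position)) return null; // e.g. a column-name header row

  // 23andMe has one genotype column; AncestryDNA has allele1 + allele2.
  const genotype = cols.length === 4 ? cols[3] : cols[3] + cols[4];
  return { rsid, chromosome, position, genotype };
}
```

Joining the two AncestryDNA allele columns into one genotype string lets the rest of the code treat both formats identically.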

Privacy

Your raw DNA file stays on your machine. DNAgent loads the file into memory, searches it locally with custom tools, and sends only the results of tool calls (for example, the variants matched by a search) to the LLM provider.

Files matching AncestryDNA.txt and dna-data-*.zip are git-ignored to prevent accidental commits of personal genomic data.

Pipeline

make check     # build + lint + test (149 tests)
make build     # tsc
make test      # run unit tests via node --test
make lint      # tsc --noEmit

Current test stats: 149 tests across 33 suites, all passing.

Project Goals

This project was built with three goals in mind:

1. Learn about building AI agents: DNAgent was created almost entirely using the pi coding agent, an AI-assisted development harness powered by open-weight LLMs (primarily DeepSeek V4 Pro and DeepSeek V4 Flash).
2. Make genomic exploration accessible: give researchers, students, and curious individuals a natural-language interface to explore raw DNA data, search variants, look up gene information, and cross-reference public annotations, without needing to write code.
3. Evaluate AI-assisted software engineering: we wanted to test the current state of AI agent-assisted development on a real, non-trivial project. The result was a pleasant surprise: the vast majority of the code, tests, and documentation was generated through conversations with the agent, following a strict protocol of vertical-slice development, TDD for pure logic, and the one-command pipeline (make check). The quality and coverage achieved with open-weight models far exceeded our expectations.

Public genetic data for testing is available from the Personal Genome Project: https://my.pgp-hms.org/public_genetic_data?data_type=23andMe

How It Was Built

DNAgent was created almost entirely using the pi coding agent — an AI-assisted development harness. The majority of the code, tests, and documentation was generated through conversations with DeepSeek V4 Pro and DeepSeek V4 Flash, two open-weight large language models.

The agent follows a strict protocol (see AGENTS.md) enforcing vertical-slice development, TDD for pure logic, and the "One-Command Rule" (make check). Every module has corresponding unit tests, and the 6 DNA/literature tools are the only interface the LLM sees.

Development

# First time setup
npm install

# Iterate
# 1. Edit source files in src/
# 2. Run make check to verify
# 3. Run node dist/main.js to test interactively

Built with:

- pi coding agent SDK: agent session, tool system, event hooks
- TypeScript 5.7+ with strict mode
- TypeBox: runtime parameter validation for tools
- Node.js 22: native test runner, ESM
- Chalk: colored terminal output
- Marked: markdown-to-ANSI rendering
- MyGene.info / MyVariant.info: public REST APIs for gene and variant annotations

License

MIT — see LICENSE.