dkg-arxiv

v0.1.0

Published

3 months ago

Source integration that ingests arXiv papers into a Project's Working Memory on a DKG v10 node.

tomazot

dkg origintrail arxiv knowledge-graph rdf mcp

dkg-arxiv

Source integration that ingests arXiv papers into a DKG v10 Project's Working Memory as Knowledge Assets, with full provenance, status tags, and preserved upstream licence.

This is a Round 1 submission to the OriginTrail DKG v10 integrations bounty programme (cfi-dkgv10-r1).

Status

Version 0.1.0, beta. Phases P0–P5 have landed: core ingest pipeline, resilience, stdio MCP server, keyword search, recurring tracking, and opt-in full-text extraction. The package is not yet published to npm, so testers build from source (see Get the binary). Tested against a live DKG v10 node on macOS / Node 22.

If you are testing this for the bounty review or for a real Project, please report issues as described in Reporting issues.

What you can test today

| Flow | Entry point | Primary verification |
| --- | --- | --- |
| One-shot ingest of N papers on a topic | dkg-arxiv ingest or UI | dkg-arxiv search returns the paper, your agent answers from it |
| Full-text extraction (licence-gated) | --mode fulltext | Per-paper outcome reports resultMode: fulltext |
| Keyword search across already-ingested papers | dkg-arxiv search | Hit list with arXiv IDs and titles |
| Recurring topic tracking | dkg-arxiv track + dkg-arxiv tick | tick --dry-run lists due topics; tick runs them |
| MCP tools from Cursor / Claude Desktop | dkg-arxiv mcp serve | Four tools listed: arxiv_ingest, arxiv_search, arxiv_status, arxiv_track |
| Browser-based panel | dkg-arxiv ui serve | Project picker → topic → run → per-paper summary |
| Re-ingest over stub-poisoned data | "Re-ingest" toggle in UI, or force option | Stub-only papers get a fresh real-LLM extraction |

What is not in scope for Round 1:

No PUBLISH to Verified Memory.
No SHARE performed by this integration. SHARE is a Curator-authority operation done through the user's agent (in conversation), not through this CLI or its MCP tools.
No agent-side adapter. This is a source integration; your existing agent reads from WM through whatever adapter it already uses.

Prerequisites

Before the first run, please confirm:

Node 22 or newer. node -v should print v22.x or higher. The repo includes an .nvmrc, so nvm use picks the right version.
A v10 DKG node running locally. The integration calls http://127.0.0.1:9200 by default. Override with DKG_API_URL if your node listens elsewhere.
A daemon auth token. The integration reads ~/.dkg/auth.token by default (one token per line, comments allowed). Override with DKG_AUTH_TOKEN env var or --auth-token-path <path> flag.
At least one Project / Context Graph the user can write to. Pick one in the v10 dashboard, or pass an existing Context Graph ID with --project <id>.
An LLM source (optional but strongly recommended). Without one, ingest refuses to run unless DKG_ARXIV_ALLOW_STUB=1 is set, in which case it falls back to a 200-character-snippet stub. See LLM extraction.

If your daemon does not yet have a configured Project, create one through the v10 dashboard first; the integration does not create Projects.
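The token lookup described above can be sketched as two small pure functions. This is an illustration under stated assumptions, not the integration's actual code: the function names are hypothetical, `#` as the comment marker is a guess (the text only says "comments allowed"), and picking the first usable line when the file holds several tokens is also an assumption.

```typescript
type Env = Record<string, string | undefined>;

/** Return the first non-empty, non-comment line of a token file, or undefined. */
function parseTokenFile(contents: string): string | undefined {
  for (const raw of contents.split(/\r?\n/)) {
    const line = raw.trim();
    // Assumption: comment lines start with "#".
    if (line === "" || line.startsWith("#")) continue;
    return line;
  }
  return undefined;
}

/** DKG_AUTH_TOKEN overrides the file; otherwise fall back to the file contents. */
function resolveAuthToken(env: Env, fileContents: string | undefined): string | undefined {
  const fromEnv = env["DKG_AUTH_TOKEN"]?.trim();
  if (fromEnv) return fromEnv;
  return fileContents === undefined ? undefined : parseTokenFile(fileContents);
}
```

A caller would read ~/.dkg/auth.token (or the path given by --auth-token-path), pass its text as `fileContents`, and treat an `undefined` result as the "No DKG auth token available" case.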

Get the binary

The package is not yet on npm. Build it from source:

```
git clone https://github.com/<owner>/dkg-arxiv.git
cd dkg-arxiv
nvm use            # picks Node 22 from .nvmrc
npm install
npm run build      # produces dist/cli/index.js with a #!/usr/bin/env node shebang
npm link           # exposes `dkg-arxiv` on your PATH
```

Test that it is on the PATH:

```
dkg-arxiv --version    # 0.1.0
```

If you do not want to npm link, every command works as node ./dist/cli/index.js … from the repo root.

First run (5 minutes via UI)

The browser panel is the fastest tester surface. It does the auth-token plumbing for you and reports per-paper outcomes inline.

```
export OPENAI_API_KEY="sk-..."         # any OpenAI-compatible key works; see LLM extraction
dkg-arxiv ui serve                     # opens http://127.0.0.1:<random-port>
```

In the panel:

Pick a Project from the dropdown (the panel calls /api/context-graph/list against your daemon).
Type a topic, e.g. mechanistic interpretability. Optionally click "Suggest topics" to see what topics the agent is already tracking in this Project's WM.
Set max papers (default 20). For a fast first test, lower it to 3.
Pick extraction mode. abstract is the recommended default and free-tier safe; fulltext downloads the PDF, runs the LLM over the full text, and is gated by upstream licence.
Press Run. The summary updates per-paper as it ingests, with arXiv links, status, and the realised extraction mode.

If you see "LLM: StubLlmClient (no LLM source available)" at the top of the run, your OPENAI_API_KEY is not visible to the integration. Stop, set the key, and rerun.
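The licence gate on fulltext mode can be pictured as a per-paper decision about which mode is actually used. The sketch below is a guess at the shape of that decision, not the integration's code: the function names are hypothetical, matching on licence URL substrings is an assumption, and so is treating a missing licence as "abstract only".

```typescript
type Mode = "abstract" | "fulltext";

/** True when the licence URL looks like arXiv-nonexclusive or a CC BY-NC variant. */
function blocksFullText(licenceUrl: string): boolean {
  const url = licenceUrl.toLowerCase();
  // Assumption: licences are identified by their URL, as in arXiv metadata.
  return url.includes("arxiv.org/licenses/nonexclusive") || url.includes("/licenses/by-nc");
}

/** Mode realised for one paper, given the requested mode and its licence. */
function realisedMode(requested: Mode, licenceUrl: string | undefined): Mode {
  if (requested === "abstract") return "abstract";
  // Assumption: an unknown licence is treated conservatively.
  if (licenceUrl === undefined || blocksFullText(licenceUrl)) return "abstract";
  return "fulltext";
}
```

Under these assumptions, a fulltext run over a mixed batch reports resultMode: fulltext for permissively licensed papers and resultMode: abstract for the rest, which matches the per-paper summary described above.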

CLI reference

```
# One-shot ingest of the top N arXiv papers on a topic into a Project's WM
dkg-arxiv ingest "retrieval augmented generation" --project rag-research --top 20

# Opt-in full-text extraction. Refuses on arXiv-nonexclusive / CC-BY-NC variants
# and falls back to abstract for those papers.
dkg-arxiv ingest "rag" --project rag-research --top 5 --mode fulltext

# Keyword search across previously-ingested papers in WM
dkg-arxiv search "BM25 hybrid retrieval" --project rag-research --limit 5

# Schedule a recurring ingest. Stored as a TrackedTopic Knowledge Asset;
# fired by `dkg-arxiv tick` from cron / launchd / systemd.
dkg-arxiv track "mechanistic interpretability" --project interp --interval weekly

# Fire all due tracked topics in a Project (run once a day from cron)
dkg-arxiv tick --project interp
dkg-arxiv tick --project interp --dry-run    # preview which topics would run

# Inspect ingestion job status
dkg-arxiv status --project rag-research

# Resume an interrupted job
dkg-arxiv resume --job <jobId>

# Run as a stdio MCP server (normally spawned by the MCP client)
dkg-arxiv mcp serve

# Open the source-integration panel in your browser
dkg-arxiv ui serve              # auto-opens http://127.0.0.1:<random-port>
dkg-arxiv ui serve -p 8910      # pin a port
dkg-arxiv ui serve --no-open    # for headless / SSH

# Diagnose where writes land / what the WM view actually returns
dkg-arxiv inspect --project rag-research --assertion-name <name> --arxiv-id <id>
```
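To make the track / tick split concrete, here is a minimal sketch of how a tick might decide which tracked topics are due. It is an assumption about the logic, not the shipped implementation: the `TrackedTopic` field names are hypothetical, and only the `weekly` interval appears in this README (`daily` is included purely for illustration).

```typescript
// Assumption: "daily" is an illustrative extra value; only "weekly" is documented.
type Interval = "daily" | "weekly";

interface TrackedTopic {
  topic: string;
  interval: Interval;
  lastRunAt?: string; // ISO 8601 timestamp of the last successful run, if any
}

const DAY_MS = 24 * 60 * 60 * 1000;
const INTERVAL_MS: Record<Interval, number> = { daily: DAY_MS, weekly: 7 * DAY_MS };

/** Topics that have never run, or whose interval has fully elapsed at `now`. */
function dueTopics(topics: TrackedTopic[], now: Date): TrackedTopic[] {
  return topics.filter((t) => {
    if (t.lastRunAt === undefined) return true;
    return now.getTime() - Date.parse(t.lastRunAt) >= INTERVAL_MS[t.interval];
  });
}
```

With logic of this shape, running tick once a day from cron is harmless for weekly topics: they are simply not due on six of the seven runs, and --dry-run would list the same set without ingesting.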

Verifying it worked

After an ingest run:

Check the run summary. Each paper has an action (written, upgraded, skipped, failed) and a resultMode (abstract or fulltext). A successful real-LLM ingest looks like action: written, resultMode: abstract with the arXiv ID and title.
Search for one of the ingested papers.
```
dkg-arxiv search "<a phrase from the abstract>" --project <id>
```
You should see the paper in the hit list within seconds.
Use dkg-arxiv inspect if anything looks off. It prints node identity, agent list, the resolved view-graph IRIs, and probes the WM with a direct ASK so you can see whether the paper actually landed where you expected.
Ask your agent. Open whatever agent is wired into your Project (the v10 dashboard agent, a Cursor or Claude Desktop session bound to your daemon, or any other agent with a v10 adapter), and ask "what came out on <topic> recently?". The agent reads from your Project's WM through its existing adapter and should cite arXiv URLs.

If search returns nothing but the run summary said written, the most common cause is an agentAddress / view scoping mismatch; dkg-arxiv inspect shows you which view your writes ended up in.

LLM extraction

The integration resolves an LLM source in this priority order:

DKG daemon (preferred). If the daemon exposes POST /api/llm/extract-json (a proposed upstream endpoint, not yet shipped in v10), the integration uses the user's UI-configured LLM. Their API key never leaves the daemon. No env var setup on the integration side. The integration probes for endpoint support at startup; if 404, falls through to step 2.
OpenAI-compatible direct. If OPENAI_API_KEY (or DKG_ARXIV_LLM_API_KEY) is set, ingest uses an OpenAI-compatible chat-completions API. Defaults: model gpt-4o-mini, base URL https://api.openai.com/v1. Override with DKG_ARXIV_LLM_MODEL and DKG_ARXIV_LLM_BASE_URL to point at any compatible endpoint (Together, Groq, Ollama, etc).
Stub fallback (opt-in only). Without a real LLM source, ingest refuses to run. Set DKG_ARXIV_ALLOW_STUB=1 to opt in to a low-fidelity ingest that records a 200-character abstract excerpt as a single low-confidence claim. Useful for plumbing tests; not useful for real Knowledge Asset content.

The resolved source is printed at the top of every ingest run (e.g. LLM: gpt-4o-mini via https://api.openai.com/v1).
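The priority order above can be read as a small pure function over the environment and the result of the startup probe. The sketch below is illustrative only: the type and function names are hypothetical, and which key wins when both OPENAI_API_KEY and DKG_ARXIV_LLM_API_KEY are set is an assumption (the integration-specific one is preferred here).

```typescript
type Env = Record<string, string | undefined>;

type LlmSource =
  | { kind: "daemon" }
  | { kind: "openai-compatible"; apiKey: string; model: string; baseUrl: string }
  | { kind: "stub" }
  | { kind: "refuse" };

function resolveLlmSource(env: Env, daemonHasExtractJson: boolean): LlmSource {
  // 1. Daemon endpoint, if the startup probe found it (a 404 means "not supported").
  if (daemonHasExtractJson) return { kind: "daemon" };

  // 2. OpenAI-compatible direct. Key precedence is an assumption.
  const apiKey = env["DKG_ARXIV_LLM_API_KEY"] || env["OPENAI_API_KEY"];
  if (apiKey) {
    return {
      kind: "openai-compatible",
      apiKey,
      model: env["DKG_ARXIV_LLM_MODEL"] || "gpt-4o-mini",
      baseUrl: env["DKG_ARXIV_LLM_BASE_URL"] || "https://api.openai.com/v1",
    };
  }

  // 3. Stub only on explicit opt-in; otherwise refuse to run.
  return env["DKG_ARXIV_ALLOW_STUB"] === "1" ? { kind: "stub" } : { kind: "refuse" };
}
```

Because the daemon endpoint does not yet exist in v10, the second argument is effectively always `false` today, so the outcome is decided by the key and the stub opt-in.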

Setting an OpenAI key

```
# One-shot for the current shell
export OPENAI_API_KEY="sk-..."
dkg-arxiv ingest "..." --project ...

# Persistent (zsh)
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.zshrc
source ~/.zshrc
```

You can also drop the key into the integration's own env var so it does not leak into other tools that read OPENAI_API_KEY:

```
export DKG_ARXIV_LLM_API_KEY="sk-..."
```

MCP entry point

The same core code is exposed as a stdio MCP server with four tools: arxiv_ingest, arxiv_search, arxiv_status, arxiv_track. Install via the dkg-integrations registry (which renders the command + args block for your MCP client's config), or wire it manually:

```
// ~/.cursor/mcp.json (and similar files for Claude Desktop)
{
  "mcpServers": {
    "dkg-arxiv": {
      "command": "npx",
      "args": ["-y", "dkg-arxiv", "mcp", "serve"],
      "env": {
        "DKG_API_URL": "http://127.0.0.1:9200",
        "DKG_AUTH_TOKEN": "<from ~/.dkg/auth.token>",
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

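Until the package is published, replace the command / args with an absolute path to the built CLI:

```
"command": "/absolute/path/to/dkg-arxiv/dist/cli/index.js",
"args": ["mcp", "serve"]
```

If the built file is not executable on your system, set the command to node and put the absolute path first in args, mirroring the node ./dist/cli/index.js form from Get the binary.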

Each ingested paper becomes a Knowledge Asset with the IRI ual:dkg:arxiv/{arxivId}, the canonical arXiv URL preserved as prov:wasDerivedFrom, the upstream licence recorded as a nested node, and an extraction status tag.
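As an illustration of the identifiers involved, the sketch below builds the Knowledge Asset IRI and the provenance triple for one arXiv ID. The helper names are hypothetical, the `prov:` prefix is expanded to the W3C PROV namespace, and using the `https://arxiv.org/abs/` form as the canonical URL is an assumption. The licence node and status tag are left out because their predicates are not specified here.

```typescript
const PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom";

interface Triple {
  subject: string;
  predicate: string;
  object: string;
}

/** IRI pattern quoted in this README: ual:dkg:arxiv/{arxivId}. */
function paperIri(arxivId: string): string {
  return `ual:dkg:arxiv/${arxivId}`;
}

/** Provenance link from the Knowledge Asset back to arXiv (URL form is an assumption). */
function provenanceTriple(arxivId: string): Triple {
  return {
    subject: paperIri(arxivId),
    predicate: PROV_WAS_DERIVED_FROM,
    object: `https://arxiv.org/abs/${arxivId}`,
  };
}

/** Serialise a triple whose three terms are all IRIs as one N-Triples line. */
function toNTriple(t: Triple): string {
  return `<${t.subject}> <${t.predicate}> <${t.object}> .`;
}
```

Printing `toNTriple(provenanceTriple("<id>"))` for a paper you just ingested gives a concrete subject IRI to look for in the dkg-arxiv inspect output.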

How the agent reads it

dkg-arxiv is a source integration. It writes to your Project's WM and returns. Your existing agent (the v10 dashboard agent, a Cursor or Claude Desktop session bound to the local mcp-server, or any other agent with a v10 adapter) reads the ingested content from WM through its existing adapter. There is no MCP-tool registration to do, no agent-side wiring; you ingest, your agent answers.

When you want to share a Knowledge Collection of ingested papers with your team, ask your agent in conversation. The agent invokes the SHARE Curator-authority operation through its own adapter; this integration does not perform SHARE itself.

Source-integration panel UI

dkg-arxiv ui serve starts a small loopback HTTP server (vanilla JS, no framework, no build step) that lets you drive ingestion from a browser without leaving the loop. The server reads ~/.dkg/auth.token at startup and injects Authorization: Bearer <token> on every daemon call. The browser only ever talks to the local UI server, so the token never reaches your browser tab. The server listens on 127.0.0.1 only.
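The token handling described here amounts to the UI server building each daemon request itself, with the token added server-side. A minimal sketch, assuming a hypothetical helper name and plain string joining of the base URL and path:

```typescript
interface DaemonRequest {
  url: string;
  headers: Record<string, string>;
}

/**
 * Build the URL and headers for one daemon call made by the local UI server.
 * The browser never sees `token`; only the loopback server calls this.
 */
function buildDaemonRequest(apiUrl: string, path: string, token: string): DaemonRequest {
  const base = apiUrl.replace(/\/+$/, ""); // drop trailing slashes
  const suffix = path.startsWith("/") ? path : `/${path}`;
  return {
    url: base + suffix,
    headers: { Authorization: `Bearer ${token}` },
  };
}
```

For example, the Project dropdown would correspond to a call built from the default http://127.0.0.1:9200 base and the /api/context-graph/list path mentioned in First run.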

Future integrations (PubMed, RSS, GitHub releases, EUR-Lex...) can copy the same panel shape; nothing in the UI is arXiv-specific beyond the field labels.

Troubleshooting

No DKG auth token available. The integration could not read a token. Confirm ~/.dkg/auth.token exists and is readable, or set DKG_AUTH_TOKEN in your shell. The error message prints the path it tried.

No LLM source available; refuse to fall back to stub. Neither OPENAI_API_KEY nor DKG_ARXIV_LLM_API_KEY is set, and the daemon does not yet expose /api/llm/extract-json. Set a key (see LLM extraction) or set DKG_ARXIV_ALLOW_STUB=1 if you want a no-LLM smoke test.

Run summary shows LLM: StubLlmClient even though I set OPENAI_API_KEY. The env var is not visible to the CLI process. Confirm with echo $OPENAI_API_KEY in the same shell you launch the CLI from. If you set it in ~/.zshrc, run source ~/.zshrc first or open a new shell.

Papers were ingested but dkg-arxiv search returns nothing. This is almost always a view-scoping mismatch (the WM view your reads target is not the one your writes landed in). Run dkg-arxiv inspect --project <id> --arxiv-id <id-you-just-ingested>. The output prints the resolved agent address, view graphs, and a probe ASK against each plausible scoping; the row that returns true is where your data lives.

Re-ingesting the same topic skips every paper. That is the dedup gate working. By default, papers with a real-LLM extraction at the requested tier are not re-ingested. To force, use the Re-ingest checkbox in the UI, or pass the force option to the API. Forced runs write timestamp-suffixed assertions alongside any prior ones; nothing is overwritten in place.

Older runs landed as stub-only and now I have configured a real LLM. Just run again. As of v0.1.0, dedup treats stub-poisoned assertions as "needs re-extraction" and overwrites them on the next real-LLM run.
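The two dedup behaviours above (skip on repeat, re-extract over stubs) can be summarised as one decision per paper. This is a sketch of the rule as described, not the actual gate: the names are hypothetical, the mapping onto the written / upgraded / skipped actions is an assumption, "at the requested tier" is read as "at or above", and the current run is assumed to use a real LLM. The `"stub-llm"` literal is the extractionModel value mentioned under Known limitations.

```typescript
type Mode = "abstract" | "fulltext";
type Action = "written" | "upgraded" | "skipped";

interface ExistingExtraction {
  extractionModel: string; // e.g. "gpt-4o-mini", or "stub-llm" for stub runs
  mode: Mode;
}

const STUB_MODEL = "stub-llm";
const TIER: Record<Mode, number> = { abstract: 0, fulltext: 1 };

/** Decide what a real-LLM run should do with one paper. */
function dedupDecision(
  existing: ExistingExtraction | undefined,
  requested: Mode,
  force: boolean,
): Action {
  if (existing === undefined) return "written"; // never ingested
  if (force) return "written"; // forced runs add a new assertion alongside
  if (existing.extractionModel === STUB_MODEL) return "upgraded"; // stub-poisoned
  if (TIER[existing.mode] < TIER[requested]) return "upgraded"; // abstract -> fulltext
  return "skipped"; // real extraction already at the requested tier
}
```

Read this way, "every paper skipped" on a second identical run is the last branch firing for each paper, which is the expected outcome rather than a failure.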

Warning: Empty 'FlateDecode' stream flooding the log. Fixed in v0.1.0 (the pdfjs verbosity is set to errors-only). If you still see them, you are running an older build; rerun npm run build.

MCP client cannot find the server. If the package is not yet published, your MCP config has to point at an absolute path on disk; npx -y dkg-arxiv will fail. See the absolute-path block in MCP entry point.

Reporting issues

When something does not work, please file an issue with:

The CLI command or UI action you ran (verbatim, including flags).
The full run summary (per-paper outcomes for an ingest; the stack trace for a hard error).
The output of dkg-arxiv inspect --project <id> if WM reads do not match expectations.
Your Node version (node -v) and DKG node version (dkg --version or whichever the daemon prints).
Whether the LLM banner at the top of the run shows a real model or StubLlmClient.

Issues go on the GitHub repo; for the bounty review window, please CC the maintainer noted in the registry entry.

Known limitations

npm publish not yet executed; testers build from source.
track / tick runs ingest synchronously inside one process; large topics ingested daily across many Projects may take meaningful wall time. No queue manager yet.
Full-text extraction depends on unpdf (which wraps a worker-thread pdfjs). It works on the arXiv papers exercised in testing, but malformed or scanned PDFs return little text. Affected papers are recorded with extracted-needs-review status, not silently skipped.
The dedup gate identifies "real" extractions by the literal extractionModel value. An external writer that posts assertions with extractionModel = "stub-llm" would be ignored on re-ingest. Within the dkg-arxiv pipeline this is correct; for adversarial inputs it is not a security boundary.
The integration probes the daemon for /api/llm/extract-json at startup, but that endpoint does not yet exist in v10. Until the upstream daemon ships it, every install needs an OPENAI_API_KEY (or DKG_ARXIV_LLM_API_KEY) for real extraction.

Development

```
npm install
npm test                    # unit + contract tests, no live node required
DKG_ARXIV_RUN_INTEGRATION=1 npm test  # also runs integration tests against a live v10 node
npm run lint                # tsc --noEmit
npm run build               # tsc to dist/
```

License

MIT. See LICENSE.
