@neuralsea/petri-mcp-knowledge-server
v0.2.1
Configurable MCP Knowledge Server with crawler+cache+local index (Petri Workspace Indexer).
Petri MCP Knowledge Server (crawler + cache + local index)
A configurable Model Context Protocol (MCP) server that exposes knowledge tools to AI agents.
It can:
- Index local workspace folders using `@neuralsea/workspace-indexer` (the Petri indexer)
- Crawl and index documentation websites (given a home/start URI) with:
  - configurable depth / page limits
  - optional robots.txt compliance
  - no external links (stays within allowed hosts/prefixes)
  - HTML → clean text extraction (Readability)
  - heading/section chunking
  - caching to disk, then indexing via Petri (SQLite FTS + embeddings)
- Index Confluence via REST API pagination (CQL)
- Index SharePoint / other systems via a configurable JSON pagination mode (you supply endpoints/paths)
- Index a GitHub Enterprise wiki either by crawling the wiki website or (recommended) by cloning the `.wiki.git` repository
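Putting those options together, a crawl source might look like the following sketch. The field names `id`, `type`, `mode`, and `startUrls` are illustrative assumptions; `maxDepth`, `maxPages`, `allowedUrlPrefixes`, and `respectRobotsTxt` are the knobs mentioned elsewhere in this README. Check `knowledge-server.config.example.json` for the real schema.

```json
{
  "sources": [
    {
      "id": "k8s-docs",
      "type": "http-crawl-index",
      "mode": "crawl",
      "startUrls": ["https://kubernetes.io/docs/home/"],
      "allowedUrlPrefixes": ["https://kubernetes.io/docs/"],
      "maxDepth": 3,
      "maxPages": 200,
      "respectRobotsTxt": true
    }
  ]
}
```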
Important: You must have permission to crawl/index the sites you configure. For public sites, keep depth/page limits low and respect robots.txt unless you have explicit permission.
Install
```bash
npm install
npm run build
```

Run

```bash
cp knowledge-server.config.example.json knowledge-server.config.json
# edit paths/domains/URLs
# set any required tokens
export INTERNAL_DOCS_TOKEN="..."
export CONFLUENCE_TOKEN="..."
export SHAREPOINT_TOKEN="..."
export GHE_TOKEN="..."
node build/index.js --config ./knowledge-server.config.json
```

This server uses the stdio transport. Do not write logs to stdout; that will break the MCP protocol. Logs go to stderr.
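Because the server speaks stdio, an MCP-capable client launches it as a subprocess. A minimal sketch of a Claude-Desktop-style `mcpServers` entry (the exact config file location and schema depend on your client; the paths below are placeholders):

```json
{
  "mcpServers": {
    "petri-knowledge": {
      "command": "node",
      "args": ["/path/to/build/index.js", "--config", "/path/to/knowledge-server.config.json"],
      "env": { "CONFLUENCE_TOKEN": "..." }
    }
  }
}
```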
MCP Tools
- `knowledge_list_sources`
- `knowledge_health`
- `knowledge_search` (cross-source)
- `knowledge_read` (read a URL/id; returns cached clean text where available)
- `knowledge_sync` (crawl/index refresh per source or all)
- `local_read_file` (restricted to allow-listed roots)
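Over the stdio transport, clients invoke these tools with JSON-RPC `tools/call` requests (the method name comes from the MCP specification). A sketch of what a `knowledge_search` call might look like on the wire; the `arguments` shape here is an assumption, not taken from this server's published schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "knowledge_search",
    "arguments": { "query": "rolling update strategy", "limit": 5 }
  }
}
```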
Crawler behaviour
The `http-crawl-index` source supports multiple discovery modes:

- `crawl`: BFS crawl from start URLs, following internal links only
- `sitemap`: consume sitemap.xml (and optional sitemap indexes)
- `confluence`: enumerate pages via the Confluence REST API search (CQL)
- `json-api`: generic JSON pagination enumerator (useful for SharePoint via Microsoft Graph, or custom systems)
- `github-wiki-git`: clone/pull a `.wiki.git` repository, then index it as files
After caching pages/chunks to disk, the server indexes the cache folder using the Petri indexer.
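The `crawl` mode's core loop can be sketched as a bounded breadth-first search. This is an illustration of the technique, not the server's actual code; the link fetcher is injected so the sketch needs no network access:

```typescript
// Sketch of the `crawl` discovery mode: BFS from start URLs with
// maxDepth / maxPages limits and an allowed-prefix filter.
// A real crawler fetches links asynchronously; a sync callback keeps
// the sketch minimal.
type FetchLinks = (url: string) => string[];

interface CrawlLimits {
  maxDepth: number;
  maxPages: number;
  allowedUrlPrefixes: string[];
}

function bfsCrawl(
  startUrls: string[],
  limits: CrawlLimits,
  fetchLinks: FetchLinks,
): string[] {
  const allowed = (url: string) =>
    limits.allowedUrlPrefixes.some((p) => url.startsWith(p));
  const visited = new Set<string>();
  const order: string[] = [];
  let frontier = startUrls.filter(allowed);

  for (let depth = 0; depth <= limits.maxDepth; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.has(url)) continue;
      if (order.length >= limits.maxPages) return order;
      visited.add(url);
      order.push(url);
      // Follow internal links only (those matching an allowed prefix).
      next.push(...fetchLinks(url).filter(allowed));
    }
    frontier = next;
  }
  return order;
}
```

Pages outside `allowedUrlPrefixes` are never enqueued, which is what keeps the crawl from wandering off-site.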
Notes for StackOverflow / Kubernetes docs
You can configure public sites (e.g. https://kubernetes.io/) as `http-crawl-index` sources.
For very large sites:

- set `maxDepth` low (e.g. 2–4)
- set `maxPages` low
- restrict crawling with `allowedUrlPrefixes`, e.g. `https://kubernetes.io/docs/`
StackOverflow is enormous and rate-limited; use strict limits and keep `respectRobotsTxt: true`.
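A polite crawler also spaces out its requests. This README does not document the server's rate-limiting knobs, so the following is a generic throttle sketch: it computes how long a caller should wait so that successive requests are at least `intervalMs` apart. The injectable clock is only there to make the sketch testable:

```typescript
// Minimal request throttle: callers ask how long to sleep before the
// next request so that requests are spaced at least `intervalMs` apart.
function makeThrottle(intervalMs: number, now: () => number = Date.now) {
  let nextAllowed = 0;
  return function waitTime(): number {
    const t = now();
    const wait = Math.max(0, nextAllowed - t);
    // Reserve the next slot, whether or not the caller had to wait.
    nextAllowed = Math.max(t, nextAllowed) + intervalMs;
    return wait;
  };
}
```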
Embeddings
This server indexes content using Petri's hybrid retrieval (SQLite FTS + embeddings). Configure the embeddings provider per source:
- `ollama` (local): requires Ollama running and the embedding model pulled (default `nomic-embed-text`).
- `openai` (hosted): requires an API key via `apiKeyEnv` (default `OPENAI_API_KEY`).
- `hash`: deterministic baseline with no external dependencies (lower quality).
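The idea behind a `hash` provider — deterministic vectors with no model or network — can be sketched as feature hashing over tokens. This illustrates the general technique, not Petri's actual implementation:

```typescript
// Feature-hashing embedding: each token increments one of `dim` buckets,
// chosen by an FNV-1a hash of the token. Deterministic and dependency-free,
// but much lower quality than a learned embedding model.
function hashEmbed(text: string, dim = 64): number[] {
  const vec = new Array<number>(dim).fill(0);
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    let h = 2166136261; // FNV-1a offset basis
    for (let i = 0; i < token.length; i++) {
      h = Math.imul(h ^ token.charCodeAt(i), 16777619); // FNV prime
    }
    vec[(h >>> 0) % dim] += 1;
  }
  // L2-normalize so cosine similarity reduces to a dot product.
  const norm = Math.hypot(...vec) || 1;
  return vec.map((v) => v / norm);
}
```

Because the hash is fixed, the same text always maps to the same vector, which is what makes the baseline usable for indexing without any external service.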
See `knowledge-server.config.example.json` for examples.
