@neuralsea/petri-mcp-knowledge-server
v0.2.1
Configurable MCP Knowledge Server with crawler+cache+local index (Petri Workspace Indexer).
Petri MCP Knowledge Server (crawler + cache + local index)
A configurable Model Context Protocol (MCP) server that exposes knowledge tools to AI agents.
It can:
- Index local workspace folders using `@neuralsea/workspace-indexer` (the Petri indexer)
- Crawl and index documentation websites (given a home/start URI) with:
  - configurable depth / page limits
  - optional robots.txt compliance
  - no external links (stays within allowed hosts/prefixes)
  - HTML → clean text extraction (Readability)
  - heading/section chunking
  - caching to disk, then indexing via Petri (SQLite FTS + embeddings)
- Index Confluence via REST API pagination (CQL)
- Index SharePoint / other systems via a configurable JSON pagination mode (you supply endpoints/paths)
- Index a GitHub Enterprise wiki either by crawling the wiki website or (recommended) by cloning the `.wiki.git` repository
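Putting those options together, a crawl source might look like the following sketch. The field names `id`, `type`, `mode`, and `startUrls` are illustrative assumptions; `maxDepth`, `maxPages`, `allowedUrlPrefixes`, and `respectRobotsTxt` are the knobs mentioned elsewhere in this README. Check `knowledge-server.config.example.json` for the real schema.

```json
{
  "sources": [
    {
      "id": "k8s-docs",
      "type": "http-crawl-index",
      "mode": "crawl",
      "startUrls": ["https://kubernetes.io/docs/home/"],
      "allowedUrlPrefixes": ["https://kubernetes.io/docs/"],
      "maxDepth": 3,
      "maxPages": 200,
      "respectRobotsTxt": true
    }
  ]
}
```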
Important: You must have permission to crawl/index the sites you configure. For public sites, keep depth/page limits low and respect robots.txt unless you have explicit permission.
Install
```bash
npm install
npm run build
```

Run

```bash
cp knowledge-server.config.example.json knowledge-server.config.json
# edit paths/domains/URLs
# set any required tokens
export INTERNAL_DOCS_TOKEN="..."
export CONFLUENCE_TOKEN="..."
export SHAREPOINT_TOKEN="..."
export GHE_TOKEN="..."
node build/index.js --config ./knowledge-server.config.json
```

This server uses the stdio transport. Do not write logs to stdout; that will break the MCP protocol. Logs go to stderr.
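Because the server speaks stdio, an MCP-capable client launches it as a subprocess. A minimal sketch of a Claude-Desktop-style `mcpServers` entry (the exact config file location and schema depend on your client; the paths below are placeholders):

```json
{
  "mcpServers": {
    "petri-knowledge": {
      "command": "node",
      "args": ["/path/to/build/index.js", "--config", "/path/to/knowledge-server.config.json"],
      "env": { "CONFLUENCE_TOKEN": "..." }
    }
  }
}
```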
MCP Tools
- `knowledge_list_sources`
- `knowledge_health`
- `knowledge_search` (cross-source)
- `knowledge_read` (read a URL/id; returns cached clean text where available)
- `knowledge_sync` (crawl/index refresh per source or all)
- `local_read_file` (restricted to allow-listed roots)
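Over the stdio transport, clients invoke these tools with JSON-RPC `tools/call` requests (the method name comes from the MCP specification). A sketch of what a `knowledge_search` call might look like on the wire; the `arguments` shape here is an assumption, not taken from this server's published schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "knowledge_search",
    "arguments": { "query": "rolling update strategy", "limit": 5 }
  }
}
```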
Crawler behaviour
The `http-crawl-index` source supports multiple discovery modes:

- `crawl`: BFS crawl from start URLs, following internal links only
- `sitemap`: consume sitemap.xml (and optional sitemap indexes)
- `confluence`: enumerate pages via the Confluence REST API search (CQL)
- `json-api`: generic JSON pagination enumerator (useful for SharePoint via Microsoft Graph, or custom systems)
- `github-wiki-git`: clone/pull a `.wiki.git` repository, then index it as files
After caching pages/chunks to disk, the server indexes the cache folder using the Petri indexer.
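The `crawl` mode's core loop can be sketched as a bounded breadth-first search. This is an illustration of the technique, not the server's actual code; the link fetcher is injected so the sketch needs no network access:

```typescript
// Sketch of the `crawl` discovery mode: BFS from start URLs with
// maxDepth / maxPages limits and an allowed-prefix filter.
// A real crawler fetches links asynchronously; a sync callback keeps
// the sketch minimal.
type FetchLinks = (url: string) => string[];

interface CrawlLimits {
  maxDepth: number;
  maxPages: number;
  allowedUrlPrefixes: string[];
}

function bfsCrawl(
  startUrls: string[],
  limits: CrawlLimits,
  fetchLinks: FetchLinks,
): string[] {
  const allowed = (url: string) =>
    limits.allowedUrlPrefixes.some((p) => url.startsWith(p));
  const visited = new Set<string>();
  const order: string[] = [];
  let frontier = startUrls.filter(allowed);

  for (let depth = 0; depth <= limits.maxDepth; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.has(url)) continue;
      if (order.length >= limits.maxPages) return order;
      visited.add(url);
      order.push(url);
      // Follow internal links only (those matching an allowed prefix).
      next.push(...fetchLinks(url).filter(allowed));
    }
    frontier = next;
  }
  return order;
}
```

Pages outside `allowedUrlPrefixes` are never enqueued, which is what keeps the crawl from wandering off-site.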
Notes for StackOverflow / Kubernetes docs
You can configure public sites (e.g. https://kubernetes.io/) as `http-crawl-index` sources.
For very large sites:

- set `maxDepth` low (e.g. 2–4)
- set `maxPages` low
- restrict crawling with `allowedUrlPrefixes`, e.g. `https://kubernetes.io/docs/`
StackOverflow is enormous and rate-limited; use strict limits and keep `respectRobotsTxt: true`.
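A polite crawler also spaces out its requests. This README does not document the server's rate-limiting knobs, so the following is a generic throttle sketch: it computes how long a caller should wait so that successive requests are at least `intervalMs` apart. The injectable clock is only there to make the sketch testable:

```typescript
// Minimal request throttle: callers ask how long to sleep before the
// next request so that requests are spaced at least `intervalMs` apart.
function makeThrottle(intervalMs: number, now: () => number = Date.now) {
  let nextAllowed = 0;
  return function waitTime(): number {
    const t = now();
    const wait = Math.max(0, nextAllowed - t);
    // Reserve the next slot, whether or not the caller had to wait.
    nextAllowed = Math.max(t, nextAllowed) + intervalMs;
    return wait;
  };
}
```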
Embeddings
This server indexes content using Petri's hybrid retrieval (SQLite FTS + embeddings). Configure the embeddings provider per source:
- `ollama` (local): requires Ollama running and the embedding model pulled (default `nomic-embed-text`).
- `openai` (hosted): requires an API key via `apiKeyEnv` (default `OPENAI_API_KEY`).
- `hash`: deterministic baseline with no external dependencies (lower quality).
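The idea behind a `hash` provider — deterministic vectors with no model or network — can be sketched as feature hashing over tokens. This illustrates the general technique, not Petri's actual implementation:

```typescript
// Feature-hashing embedding: each token increments one of `dim` buckets,
// chosen by an FNV-1a hash of the token. Deterministic and dependency-free,
// but much lower quality than a learned embedding model.
function hashEmbed(text: string, dim = 64): number[] {
  const vec = new Array<number>(dim).fill(0);
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    let h = 2166136261; // FNV-1a offset basis
    for (let i = 0; i < token.length; i++) {
      h = Math.imul(h ^ token.charCodeAt(i), 16777619); // FNV prime
    }
    vec[(h >>> 0) % dim] += 1;
  }
  // L2-normalize so cosine similarity reduces to a dot product.
  const norm = Math.hypot(...vec) || 1;
  return vec.map((v) => v / norm);
}
```

Because the hash is fixed, the same text always maps to the same vector, which is what makes the baseline usable for indexing without any external service.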
See `knowledge-server.config.example.json` for examples.
