thorbit-knowledge-graph
v0.4.1
Build canonical entity libraries (Wikidata, Wikipedia, DBpedia, Freebase mid) from any website using TextRazor entity linking, then emit grounded schema.org JSON-LD (entity IDs locked, prose written by an LLM). Ships a CLI, an MCP server, and a deployable
thorbit-knowledge-graph
Build canonical entity libraries from any website. Point it at a site (or a list of pages) and it extracts and links every salient entity to its canonical IDs — Wikidata QID, Wikipedia, DBpedia, productontology, and the Google/Freebase /m/ mid — then dedupes, cleans, categorizes, and ranks them into a tidy library, ready for a final AI cleanup pass.
One repo, three surfaces over one engine (src/engine/):
- CLI (bin/cli.mjs): run it locally, zero-config.
- MCP server (mcp/): expose it to Claude; runs the engine locally or calls the hosted API.
- Vercel API (api/): host it on thorbit.ai with server-side keys, gated by approved API keys.
Entity linking is content-grounded (via TextRazor), so it resolves terms in context — "rehab" becomes Drug rehabilitation, not a band's album.
CLI
npm install
cp .env.example .env # paste one or more free TextRazor keys (textrazor.com)
node bin/cli.mjs run --url https://example-rehab.com --niche drug-rehab --out ./out
node bin/cli.mjs run --urls "https://a.com/p1,https://a.com/p2" --niche default
node bin/cli.mjs clean --raw ./out/raw.json --niche drug-rehab
node bin/cli.mjs merge --into ./rehab-seed.json --from ./out/library.json --label example-rehab.com
Outputs in out/: raw.json (everything), library.json (deduped, cleaned, ranked), library-core.json (high-confidence subset), geo.json (locations split out), summary.json. Niches: drug-rehab, roofing, default. Add your own in src/engine/niches.mjs.
If the package is installed globally (npm i -g . or via npm), the same commands are available as thorbit-kg, for example thorbit-kg run ...
MCP
Tools:
- kg_build_library: scraped pages / site / url-list → library; lean result + saved reference.
- kg_emit_schema: one page → finished schema.org JSON-LD.
- kg_emit_schema_bulk: many pages → JSON-LD, in parallel.
- kg_library_save / _list / _get / _approve / _remove: the saved-KG catalog.
- kg_resolve_term: one phrase → IDs, no crawl.
- kg_merge_seed: accumulate a per-niche seed.
Results are lean by design: the full library is saved and referenced by path, never dumped into context. The other bucket in each result is the deliberate hand-off for the model to finish.
Saved-KG catalog (build once, reuse by name)
A named, browsable catalog of libraries so you don't rebuild every time. Build → save → approve → reuse:
kg_build_library(...) → libraryUrl
kg_library_save({ name: "acme", libraryUrl }) → saved as PENDING
kg_library_list() → approved only (includePending:true to review the queue)
kg_library_get({ name: "acme" }) → inspect the entities before approving
kg_library_approve({ name: "acme" }) → now usable
kg_emit_schema({ pageType: "home", libraryName: "acme", business })
The approval gate is a security boundary, not a formality. A scraped library is untrusted content: an entity name or description could carry a prompt-injection payload that reaches generated schema. So a saved KG is unusable until a human approves it: kg_emit_schema/_bulk refuse an unapproved libraryName (403), the model is instructed to treat library content as data (never instructions), and it must never self-approve. Builds are not auto-registered; you choose what enters the catalog. Catalog + libraries persist in Vercel Blob.
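The gate described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory catalog whose records carry a `status` field; the real catalog persists in Vercel Blob and its record shape is not documented here. `catalog` and `resolveLibrary` are illustrative names, not part of the package's API.

```javascript
// Hypothetical record shape: { status: "pending" | "approved", libraryUrl }.
const catalog = new Map([
  ["acme", { status: "pending", libraryUrl: "https://example.invalid/acme.json" }],
]);

// Refuse any library that a human has not approved yet.
function resolveLibrary(name) {
  const entry = catalog.get(name);
  if (!entry) {
    return { status: 404, error: `unknown library: ${name}` };
  }
  if (entry.status !== "approved") {
    // Unapproved content is treated as untrusted and never reaches the emitter.
    return { status: 403, error: `library "${name}" is pending approval` };
  }
  return { status: 200, libraryUrl: entry.libraryUrl };
}
```

In this sketch the only way out of the 403 branch is a change to the stored record, which mirrors the rule that approval is a separate, human-initiated step.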
Emit schema (kg_emit_schema)
Turns one page into finished schema.org JSON-LD for its type (home / service / about / blog). The split is the point: the knowsAbout/about/mentions DefinedTerm blocks are linked from the grounded library — every @id, name, and 5-link sameAs (Wikidata, Wikipedia, DBpedia, productontology, Google/Freebase mid) is copied verbatim and locked, never invented — while the prose (per-term descriptions, audience pain points, serviceOutput, teaches, positioning) is written by an LLM (MiniMax M3 on OpenRouter by default, OPENROUTER_MODEL to override). After generation, every DefinedTerm is re-locked to the library's canonical IDs and any entity the model invented is dropped — so a hallucinated sameAs can't ship.
kg_emit_schema({
pageType: "service",
content: "<the page text>", // builds the library, or pass library / libraryUrl from kg_build_library
business: { name, url, telephone, address, geo, areaServed, aggregateRating } // used verbatim, never fabricated
}) → { jsonld, report:{ entitiesUsed, droppedHallucinated, model } }
One page per call (~50–165s on M3, a reasoning model). After the model returns, every DefinedTerm is re-locked to the library's canonical IDs, invented entities are dropped, and a geo guard strips any areaServed sameAs that doesn't match a business.areaServed you supplied (so a wrong place ID can't ship either). Needs OPENROUTER_API_KEY in local mode; in hosted mode the key lives server-side. Templates live in src/templates/.
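The re-lock step can be pictured roughly like this. It is a sketch under assumptions, not the engine's code: it assumes library entries and model-written terms both carry `@id`, `name`, and `sameAs`, and `relockTerms` is a hypothetical helper name.

```javascript
// Overwrite identity fields from the library and drop anything the library
// does not contain. Field names are assumed for illustration.
function relockTerms(terms, library) {
  const byId = new Map(library.map((entity) => [entity["@id"], entity]));
  const kept = [];
  let droppedHallucinated = 0;

  for (const term of terms) {
    const canonical = byId.get(term["@id"]);
    if (!canonical) {
      droppedHallucinated += 1; // the model invented this entity
      continue;
    }
    // Keep the model's prose (e.g. description), restore the locked identity.
    kept.push({
      ...term,
      "@id": canonical["@id"],
      name: canonical.name,
      sameAs: [...canonical.sameAs],
    });
  }
  return { kept, droppedHallucinated };
}
```

Because identity fields are always copied from the library after generation, a `sameAs` the model altered is overwritten rather than trusted.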
Bulk — kg_emit_schema_bulk takes a pages array (each with its own pageType + content) plus a shared business and optional shared library/libraryUrl, and emits them in parallel (bounded concurrency, default 3). Each page is a separate model call, so the per-page timeout holds and one slow page never blocks the rest. Lean by default (save: true) — returns a schemaUrl/savedTo + report per page; set save: false for full inline JSON-LD.
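Bounded concurrency of the kind described above can be sketched with a small worker-lane pattern. This is an illustrative implementation, not the server's own; `mapWithConcurrency` and the result shape are hypothetical names chosen for the example.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await worker(items[i], i) };
      } catch (err) {
        // A failed page is recorded; it does not abort the other pages.
        results[i] = { ok: false, error: String(err) };
      }
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With a limit of 3, a slow page occupies one lane while the other two keep draining the queue, which is the property the paragraph above relies on.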
kg_emit_schema_bulk({
business: { name, url, telephone, areaServed },
libraryUrl: "<from kg_build_library>",
pages: [
{ pageType: "home", content: "..." },
{ pageType: "service", content: "..." },
{ pageType: "about", content: "..." }
]
}) → { count, ok, failed, results: [{ pageType, schemaUrl, report }] }
Pair it with a scraper (the intended flow)
This server is the entity-linking half. A scraper gathers the page content (handling JS rendering, blocks, and full-site crawls a plain fetch can't), then you hand that content to kg_build_library — it links every entity to Wikidata/Wikipedia/DBpedia/Freebase, dedupes, cleans, and ranks. With MCP Scraper:
1. extract_url("https://example.com") → scraped page content (HTML or markdown)
2. kg_build_library({ pages: [{ url, content }], niche: "default" })
→ linked, cleaned entity library
content can be HTML, markdown, or plain text; it's submitted to TextRazor with the right cleanup mode automatically (pass format to be explicit). Fallback (no scraper): give kg_build_library a url or urls and it self-fetches with a built-in plain-HTTP crawler (no JS rendering).
Register with Claude Code / Desktop:
{
"mcpServers": {
"thorbit-knowledge-graph": {
"command": "npx",
"args": ["-y", "thorbit-knowledge-graph"]
}
}
}
- Local mode: put TextRazor keys in the MCP's environment; it runs the engine itself.
- Hosted mode: set THORBIT_KG_SERVER_URL + THORBIT_KG_API_KEY and it calls your Vercel API instead, so friends need no keys of their own.
Verify: npm run smoke, npm run test:mcp (real stdio handshake).
Vercel API (hosted)
Endpoints: POST /api/kg/build, POST /api/kg/resolve, GET /api/kg/library/:id. TextRazor keys live server-side; results store in Vercel Blob.
Auth (now): approved keys, no database, no billing. Mint one per person and add it to KG_API_KEYS:
node scripts/genkey.mjs alice # -> kg_alice_… give this to Alice
Every request needs x-api-key: <approved key>. Revoke = remove from KG_API_KEYS + redeploy. (Later this swaps for the thorbit-subscription model, same endpoints.)
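A database-free key check along these lines might look like the sketch below. It assumes only what is stated above (a comma-separated KG_API_KEYS value and an x-api-key header); `parseApprovedKeys` and `isAuthorized` are hypothetical helper names, and the whitespace handling is an assumption.

```javascript
// Parse the comma-separated allow-list, ignoring blanks and stray spaces.
function parseApprovedKeys(envValue) {
  return new Set(
    (envValue ?? "")
      .split(",")
      .map((key) => key.trim())
      .filter(Boolean)
  );
}

// `headers` is assumed to be a plain object with lower-cased header names.
function isAuthorized(headers, approvedKeys) {
  const key = headers["x-api-key"];
  return typeof key === "string" && approvedKeys.has(key);
}
```

Since the allow-list is read from the environment, removing a key only takes effect once the deployment picks up the new value, which matches the "remove + redeploy" revocation step.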
Deploy:
vercel link
vercel env add TEXTRAZOR_API_KEY # repeat _2.._N for your rotation pool
vercel env add KG_API_KEYS # comma-separated minted keys
vercel env add BLOB_READ_WRITE_TOKEN # Vercel Blob
vercel env add BLOB_PUBLIC_BASE # https://<store>.public.blob.vercel-storage.com
vercel deploy --prod
api/kg/build is maxDuration: 300; on the Hobby plan, cap pages with max≈20.
How it cleans
Dedupe by wikidataId → drop universal boilerplate (cookie/GDPR/analytics/marketing) + generic bare nouns → split geo → categorize by knowledge-graph type + niche keywords → rank by relevance × page-spread × domain-fit. It never invents IDs — no Freebase mid means googleKg/freebaseId are null, so you never ship a sameAs you can't verify.
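The dedupe and ranking stages can be sketched as below. This is a simplified illustration rather than the engine's code: the mention shape (`wikidataId`, `name`, `relevance`, `pageUrl`), the `dedupeAndRank` name, and the way page-spread and domain-fit are computed are all assumptions. The sketch also skips mentions without a QID, which the real engine may handle differently.

```javascript
// Collapse mentions to one entry per Wikidata QID, then rank by
// relevance × page-spread × domain-fit.
function dedupeAndRank(mentions, totalPages, domainFit = () => 1) {
  const byId = new Map();
  for (const m of mentions) {
    if (!m.wikidataId) continue; // assumption: nothing canonical to dedupe on
    const entry = byId.get(m.wikidataId) ?? {
      wikidataId: m.wikidataId,
      name: m.name,
      relevance: 0,
      pages: new Set(),
    };
    entry.relevance = Math.max(entry.relevance, m.relevance);
    entry.pages.add(m.pageUrl);
    byId.set(m.wikidataId, entry);
  }

  return [...byId.values()]
    .map((e) => ({
      wikidataId: e.wikidataId,
      name: e.name,
      // Page-spread is modelled here as the fraction of pages mentioning the entity.
      score: e.relevance * (e.pages.size / totalPages) * domainFit(e),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Under this scoring, an entity mentioned on every page with moderate relevance can outrank one that is highly relevant on a single page.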
Cost
TextRazor free tier is 500 requests/day per key (rotated); a build is ~1 call/page. kg_resolve_term hits only the free Wikidata API.
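As a rough capacity estimate from those figures, the arithmetic can be written out as below. It assumes exactly one TextRazor call per page and perfectly even rotation across keys; `daysToBuild` is a hypothetical helper, not part of the package.

```javascript
// Free-tier figure quoted above: 500 requests per key per day.
const FREE_REQUESTS_PER_KEY_PER_DAY = 500;

// Days needed to process `pageCount` pages with `keyCount` rotated keys.
function daysToBuild(pageCount, keyCount) {
  const dailyBudget = keyCount * FREE_REQUESTS_PER_KEY_PER_DAY;
  return Math.ceil(pageCount / dailyBudget);
}

console.log(daysToBuild(1200, 1)); // 3
console.log(daysToBuild(1200, 3)); // 1
```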
MIT licensed.
