thorbit-knowledge-graph

v0.4.1

Published

3 days ago

Build canonical entity libraries (Wikidata, Wikipedia, DBpedia, Freebase mid) from any website using TextRazor entity linking, then emit grounded schema.org JSON-LD (entity IDs locked, prose written by an LLM). Ships a CLI, an MCP server, and a deployable


vilovieta

knowledge-graph entity-linking textrazor wikidata dbpedia schema seo geo llmo mcp

Build canonical entity libraries from any website. Point it at a site (or a list of pages) and it extracts and links every salient entity to its canonical IDs — Wikidata QID, Wikipedia, DBpedia, productontology, and the Google/Freebase /m/ mid — then dedupes, cleans, categorizes, and ranks them into a tidy library, ready for a final AI cleanup pass.

One repo, three surfaces over one engine (src/engine/):

CLI (bin/cli.mjs) — run it locally, zero-config.
MCP server (mcp/) — expose it to Claude; runs the engine locally or calls the hosted API.
Vercel API (api/) — host it on thorbit.ai with server-side keys, gated by approved API keys.

Entity linking is content-grounded (via TextRazor), so it resolves terms in context — "rehab" becomes Drug rehabilitation, not a band's album.

CLI

npm install
cp .env.example .env            # paste one or more free TextRazor keys (textrazor.com)

node bin/cli.mjs run --url https://example-rehab.com --niche drug-rehab --out ./out
node bin/cli.mjs run --urls "https://a.com/p1,https://a.com/p2" --niche default
node bin/cli.mjs clean --raw ./out/raw.json --niche drug-rehab
node bin/cli.mjs merge --into ./rehab-seed.json --from ./out/library.json --label example-rehab.com

Outputs in out/: raw.json (everything), library.json (deduped, cleaned, ranked), library-core.json (high-confidence subset), geo.json (locations split out), summary.json. Niches: drug-rehab, roofing, default — add your own in src/engine/niches.mjs.

If installed globally (npm i -g . or from the npm registry), the same commands are available as thorbit-kg, for example thorbit-kg run ...

MCP

Tools:

kg_build_library (scraped pages / site / url-list → library; lean result + saved reference)
kg_emit_schema (one page → finished schema.org JSON-LD)
kg_emit_schema_bulk (many pages → JSON-LD, in parallel)
kg_library_save / _list / _get / _approve / _remove (the saved-KG catalog)
kg_resolve_term (one phrase → IDs, no crawl)
kg_merge_seed (accumulate a per-niche seed)

Results are lean by design: the full library is saved and referenced by path, never dumped into context. The other bucket in each result is the deliberate hand-off for the model to finish.

Saved-KG catalog (build once, reuse by name)

A named, browsable catalog of libraries so you don't rebuild every time. Build → save → approve → reuse:

kg_build_library(...)                              → libraryUrl
kg_library_save({ name: "acme", libraryUrl })      → saved as PENDING
kg_library_list()                                  → approved only (includePending:true to review the queue)
kg_library_get({ name: "acme" })                   → inspect the entities before approving
kg_library_approve({ name: "acme" })               → now usable
kg_emit_schema({ pageType: "home", libraryName: "acme", business })

The approval gate is a security boundary, not a formality. A scraped library is untrusted content — an entity name or description could carry a prompt-injection payload that reaches generated schema. So a saved KG is unusable until a human approves it: kg_emit_schema/_bulk refuse an unapproved libraryName (403), the model is instructed to treat library content as data (never instructions), and it must never self-approve. Builds are not auto-registered — you choose what enters the catalog. Catalog + libraries persist in Vercel Blob.
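The gate described above can be pictured as a small state machine. The following is a minimal sketch, not the package's code: the in-memory Map stands in for the Vercel Blob catalog, the function names and the reviewer argument are hypothetical, and the 404 for an unknown name is an assumption (only the 403 for an unapproved library comes from the text).

```javascript
// Hypothetical in-memory catalog; the real one persists in Vercel Blob.
const catalog = new Map();

// Saving never approves: every new entry starts as pending.
function saveLibrary(name, libraryUrl) {
  catalog.set(name, { libraryUrl, status: "pending", reviewer: null });
}

// Approval requires a named human reviewer (hypothetical way to record it).
function approveLibrary(name, reviewer) {
  const entry = catalog.get(name);
  if (!entry) throw new Error(`unknown library: ${name}`);
  if (typeof reviewer !== "string" || reviewer.trim() === "") {
    throw new Error("approval needs a human reviewer");
  }
  entry.status = "approved";
  entry.reviewer = reviewer;
}

// What an emit call would check before touching library content.
function resolveForEmit(name) {
  const entry = catalog.get(name);
  if (!entry) return { status: 404, error: "library not found" }; // assumed code
  if (entry.status !== "approved") {
    return { status: 403, error: "library not approved" };
  }
  return { status: 200, libraryUrl: entry.libraryUrl };
}
```

The point of the shape is that there is no code path from save to usable that skips approveLibrary.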

Emit schema (kg_emit_schema)

Turns one page into finished schema.org JSON-LD for its type (home / service / about / blog). The split is the point: the knowsAbout/about/mentions DefinedTerm blocks are linked from the grounded library — every @id, name, and 5-link sameAs (Wikidata, Wikipedia, DBpedia, productontology, Google/Freebase mid) is copied verbatim and locked, never invented — while the prose (per-term descriptions, audience pain points, serviceOutput, teaches, positioning) is written by an LLM (MiniMax M3 on OpenRouter by default, OPENROUTER_MODEL to override). After generation, every DefinedTerm is re-locked to the library's canonical IDs and any entity the model invented is dropped — so a hallucinated sameAs can't ship.

kg_emit_schema({
  pageType: "service",
  content: "<the page text>",          // builds the library, or pass library / libraryUrl from kg_build_library
  business: { name, url, telephone, address, geo, areaServed, aggregateRating }  // used verbatim, never fabricated
})  →  { jsonld, report:{ entitiesUsed, droppedHallucinated, model } }

One page per call (~50–165 s on M3, a reasoning model). On top of the re-lock step, a geo guard strips any areaServed sameAs that doesn't match a business.areaServed you supplied, so a wrong place ID can't ship either. Local mode needs OPENROUTER_API_KEY; in hosted mode the key lives server-side. Templates live in src/templates/.
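The re-lock step can be sketched as follows. This is an illustration under assumed shapes, not the package's implementation: it assumes library entries and generated DefinedTerm nodes both carry "@id", name, and sameAs, and the function name relockDefinedTerms is hypothetical. The geo guard is not modeled here.

```javascript
// Assumed shapes: library entries and DefinedTerm nodes carry "@id", name, sameAs.
function relockDefinedTerms(terms, library) {
  const byId = new Map(library.map((entry) => [entry["@id"], entry]));
  const kept = [];
  const dropped = [];
  for (const term of terms) {
    const canonical = byId.get(term["@id"]);
    if (!canonical) {
      // Not in the grounded library: treat it as invented and drop it.
      dropped.push(term.name ?? term["@id"]);
      continue;
    }
    // Keep the model's prose, overwrite identity fields with the library copy.
    kept.push({
      ...term,
      "@id": canonical["@id"],
      name: canonical.name,
      sameAs: [...canonical.sameAs],
    });
  }
  return { kept, dropped };
}
```

Overwriting the identity fields unconditionally, rather than comparing them first, means a subtly altered sameAs is corrected even when the @id itself is valid.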

Bulk — kg_emit_schema_bulk takes a pages array (each with its own pageType + content) plus a shared business and optional shared library/libraryUrl, and emits them in parallel (bounded concurrency, default 3). Each page is a separate model call, so the per-page timeout holds and one slow page never blocks the rest. Lean by default (save: true) — returns a schemaUrl/savedTo + report per page; set save: false for full inline JSON-LD.

kg_emit_schema_bulk({
  business: { name, url, telephone, areaServed },
  libraryUrl: "<from kg_build_library>",
  pages: [
    { pageType: "home",    content: "..." },
    { pageType: "service", content: "..." },
    { pageType: "about",   content: "..." }
  ]
})  →  { count, ok, failed, results: [{ pageType, schemaUrl, report }] }
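Bounded concurrency of the kind described for bulk emission can be sketched with a small worker pool. This is a generic pattern, not the package's code: mapWithConcurrency and the result shape are hypothetical, and per-page timeouts are left out.

```javascript
// Runs worker(item) over items with at most `limit` calls in flight.
// A failure is recorded for that item only; the remaining items still run.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function run() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (err) {
        results[index] = { ok: false, error: String(err?.message ?? err) };
      }
    }
  }

  // Start up to `limit` runners; each pulls items until none are left.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

Because each runner claims the next index as soon as it finishes its current item, a slow page occupies one slot while the others keep draining the queue.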

Pair it with a scraper (the intended flow)

This server is the entity-linking half. A scraper gathers the page content (handling JS rendering, blocks, and full-site crawls a plain fetch can't), then you hand that content to kg_build_library — it links every entity to Wikidata/Wikipedia/DBpedia/Freebase, dedupes, cleans, and ranks. With MCP Scraper:

1. extract_url("https://example.com")            → scraped page content (HTML or markdown)
2. kg_build_library({ pages: [{ url, content }], niche: "default" })
                                                  → linked, cleaned entity library

content can be HTML, markdown, or plain text — it's submitted to TextRazor with the right cleanup mode automatically (pass format to be explicit). Fallback (no scraper): give kg_build_library a url or urls and it self-fetches with a built-in plain-HTTP crawler (no JS rendering).
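How the format might be detected is not spelled out here, so the following is only a guess at the idea: a regex heuristic that classifies content and picks a TextRazor cleanup mode. The function names, the heuristics, and the mapping (cleanHTML for HTML, raw otherwise) are all assumptions, not the package's behavior.

```javascript
// Hypothetical heuristic: classify page content as "html", "markdown", or "text".
function detectFormat(content) {
  // Common block-level tags are a strong sign of HTML.
  if (/<(html|body|div|p|h[1-6]|article|section)\b[^>]*>/i.test(content)) {
    return "html";
  }
  // ATX headings, inline links, or bullet items suggest markdown.
  if (/^#{1,6}\s|\[[^\]]+\]\([^)]+\)|^\s*[-*]\s/m.test(content)) {
    return "markdown";
  }
  return "text";
}

// Assumed mapping to TextRazor's cleanup.mode request parameter.
function cleanupModeFor(format) {
  return format === "html" ? "cleanHTML" : "raw";
}
```

Passing format explicitly, as the text suggests, sidesteps any heuristic like this for content that mixes markdown and inline HTML.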

{
  "mcpServers": {
    "thorbit-knowledge-graph": {
      "command": "npx",
      "args": ["-y", "thorbit-knowledge-graph"]
    }
  }
}

Local mode: put TextRazor keys in the MCP's environment; it runs the engine itself.
Hosted mode: set THORBIT_KG_SERVER_URL + THORBIT_KG_API_KEY and it calls your Vercel API instead — friends need no keys of their own.
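The choice between the two modes could look roughly like this. It is a sketch under assumptions: selectMode is a hypothetical name, and falling back to local mode when only one of the two variables is set is a guess rather than documented behavior.

```javascript
// Picks hosted mode only when both hosted variables are present.
function selectMode(env) {
  const serverUrl = env.THORBIT_KG_SERVER_URL;
  const apiKey = env.THORBIT_KG_API_KEY;
  if (serverUrl && apiKey) {
    // Hosted: calls go to the Vercel API, authenticated with x-api-key.
    return { mode: "hosted", serverUrl, headers: { "x-api-key": apiKey } };
  }
  // Local: the engine runs in-process with TextRazor keys from the environment.
  return { mode: "local" };
}
```

In a real process the argument would be process.env; taking it as a parameter keeps the function easy to test.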

Verify: npm run smoke, npm run test:mcp (real stdio handshake).

Vercel API (hosted)

Endpoints: POST /api/kg/build, POST /api/kg/resolve, GET /api/kg/library/:id. TextRazor keys live server-side; results are stored in Vercel Blob.

Auth (now): approved keys, no database, no billing. Mint one per person and add it to KG_API_KEYS:

node scripts/genkey.mjs alice      # -> kg_alice_…   give this to Alice

Every request needs x-api-key: <approved key>. Revoke = remove from KG_API_KEYS + redeploy. (Later this swaps for the thorbit-subscription model, same endpoints.)
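A database-free key check along these lines is one way to read the scheme above. It is a sketch, not the deployed handler: the function names are hypothetical, and the 401/403 split is an assumption, since the text only says a key is required.

```javascript
// Parses the comma-separated KG_API_KEYS value into a set of approved keys.
function parseApprovedKeys(raw) {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((key) => key.trim())
      .filter(Boolean)
  );
}

// Checks the x-api-key header (lower-cased, as Node exposes header names).
function authorize(headers, approvedKeys) {
  const key = headers["x-api-key"];
  if (!key) return { status: 401, error: "missing x-api-key" }; // assumed code
  if (!approvedKeys.has(key)) return { status: 403, error: "key not approved" }; // assumed code
  return { status: 200 };
}
```

Since the set is built from an environment variable at startup, removing a key only takes effect after a redeploy, which matches the revoke procedure described above.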

Deploy:

vercel link
vercel env add TEXTRAZOR_API_KEY      # repeat _2.._N for your rotation pool
vercel env add KG_API_KEYS            # comma-separated minted keys
vercel env add BLOB_READ_WRITE_TOKEN  # Vercel Blob
vercel env add BLOB_PUBLIC_BASE       # https://<store>.public.blob.vercel-storage.com
vercel deploy --prod

api/kg/build is set to maxDuration: 300 (seconds); on the Hobby plan, cap pages with max≈20.

How it cleans

Dedupe by wikidataId → drop universal boilerplate (cookie/GDPR/analytics/marketing) + generic bare nouns → split geo → categorize by knowledge-graph type + niche keywords → rank by relevance × page-spread × domain-fit. It never invents IDs — no Freebase mid means googleKg/freebaseId are null, so you never ship a sameAs you can't verify.
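The dedupe and rank steps can be illustrated with a toy version. Everything about the data shape is assumed: each mention is taken to carry wikidataId, name, page, relevance, and domainFit, page-spread is computed as the share of pages an entity appears on, and merged entries keep their highest relevance. The real scoring may differ; boilerplate dropping, geo splitting, and categorization are not modeled.

```javascript
// Assumed mention shape: { wikidataId, name, page, relevance, domainFit }.
function dedupeAndRank(mentions, totalPages) {
  const byId = new Map();
  for (const m of mentions) {
    const entry = byId.get(m.wikidataId) ?? {
      wikidataId: m.wikidataId,
      name: m.name,
      relevance: 0,
      domainFit: m.domainFit,
      pages: new Set(),
    };
    // Merge duplicates: keep the best relevance, collect the pages seen.
    entry.relevance = Math.max(entry.relevance, m.relevance);
    entry.pages.add(m.page);
    byId.set(m.wikidataId, entry);
  }
  return [...byId.values()]
    .map((e) => ({
      wikidataId: e.wikidataId,
      name: e.name,
      // relevance × page-spread × domain-fit
      score: e.relevance * (e.pages.size / totalPages) * e.domainFit,
    }))
    .sort((a, b) => b.score - a.score);
}
```

A multiplicative score means an entity has to do reasonably well on all three factors; a highly relevant term that appears on one page of fifty, or that is off-niche, sinks quickly.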

Cost

TextRazor free tier is 500 requests/day per key (rotated); a build is ~1 call/page. kg_resolve_term hits only the free Wikidata API.
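Key rotation against a per-key daily quota can be sketched as a round-robin that skips exhausted keys. This is an illustration only: createKeyRotator is a hypothetical name, counts are kept in memory, and the daily reset is not modeled. With the 500 requests/day figure above, N keys give a budget of roughly 500 × N pages per day.

```javascript
const DAILY_LIMIT = 500; // free-tier requests per key per day, per the text

// Round-robin over keys, skipping any key that has used up its quota.
function createKeyRotator(keys, dailyLimit = DAILY_LIMIT) {
  const used = new Map(keys.map((key) => [key, 0]));
  let cursor = 0;
  return {
    // Returns the next usable key, or null when every key is exhausted.
    next() {
      for (let i = 0; i < keys.length; i++) {
        const key = keys[(cursor + i) % keys.length];
        if (used.get(key) < dailyLimit) {
          used.set(key, used.get(key) + 1);
          cursor = (cursor + i + 1) % keys.length;
          return key;
        }
      }
      return null;
    },
    // Total requests still available across all keys.
    remaining() {
      return keys.reduce((sum, key) => sum + (dailyLimit - used.get(key)), 0);
    },
  };
}
```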

MIT licensed.
