thorbit-knowledge-graph

v0.4.1

Build canonical entity libraries (Wikidata, Wikipedia, DBpedia, Freebase mid) from any website using TextRazor entity linking, then emit grounded schema.org JSON-LD (entity IDs locked, prose written by an LLM). Ships a CLI, an MCP server, and a deployable

thorbit-knowledge-graph

Build canonical entity libraries from any website. Point it at a site (or a list of pages) and it extracts and links every salient entity to its canonical IDs — Wikidata QID, Wikipedia, DBpedia, productontology, and the Google/Freebase /m/ mid — then dedupes, cleans, categorizes, and ranks them into a tidy library, ready for a final AI cleanup pass.

One repo, three surfaces over one engine (src/engine/):

  • CLI (bin/cli.mjs) — run it locally, zero-config.
  • MCP server (mcp/) — expose it to Claude; runs the engine locally or calls the hosted API.
  • Vercel API (api/) — host it on thorbit.ai with server-side keys, gated by approved API keys.

Entity linking is content-grounded (via TextRazor), so it resolves terms in context — "rehab" becomes Drug rehabilitation, not a band's album.


CLI

npm install
cp .env.example .env            # paste one or more free TextRazor keys (textrazor.com)

node bin/cli.mjs run --url https://example-rehab.com --niche drug-rehab --out ./out
node bin/cli.mjs run --urls "https://a.com/p1,https://a.com/p2" --niche default
node bin/cli.mjs clean --raw ./out/raw.json --niche drug-rehab
node bin/cli.mjs merge --into ./rehab-seed.json --from ./out/library.json --label example-rehab.com

Outputs in out/: raw.json (everything), library.json (deduped, cleaned, ranked), library-core.json (high-confidence subset), geo.json (locations split out), summary.json. Niches: drug-rehab, roofing, default — add your own in src/engine/niches.mjs.

If the package is installed (npm i -g . from the repo, or from npm), the same subcommands run under the thorbit-kg binary: thorbit-kg run ....


MCP

Tools:

  • kg_build_library: scraped pages / site / url-list → library; lean result + saved reference.
  • kg_emit_schema: one page → finished schema.org JSON-LD.
  • kg_emit_schema_bulk: many pages → JSON-LD, in parallel.
  • kg_library_save / _list / _get / _approve / _remove: the saved-KG catalog.
  • kg_resolve_term: one phrase → IDs, no crawl.
  • kg_merge_seed: accumulate a per-niche seed.

Results are lean by design: the full library is saved and referenced by path, never dumped into context. The "other" bucket in each result is the deliberate hand-off for the model to finish.

Saved-KG catalog (build once, reuse by name)

A named, browsable catalog of libraries so you don't rebuild every time. Build → save → approve → reuse:

kg_build_library(...)                              → libraryUrl
kg_library_save({ name: "acme", libraryUrl })      → saved as PENDING
kg_library_list()                                  → approved only (includePending:true to review the queue)
kg_library_get({ name: "acme" })                   → inspect the entities before approving
kg_library_approve({ name: "acme" })               → now usable
kg_emit_schema({ pageType: "home", libraryName: "acme", business })

The approval gate is a security boundary, not a formality. A scraped library is untrusted content — an entity name or description could carry a prompt-injection payload that reaches generated schema. So a saved KG is unusable until a human approves it: kg_emit_schema/_bulk refuse an unapproved libraryName (403), the model is instructed to treat library content as data (never instructions), and it must never self-approve. Builds are not auto-registered — you choose what enters the catalog. Catalog + libraries persist in Vercel Blob.
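The gate described above can be pictured roughly as follows. This is a minimal sketch, not the package's code: the in-memory catalog, the status strings, the 404 for unknown names, and the resolveApprovedLibrary helper are all hypothetical. Only the 403 for an unapproved libraryName comes from the README.

```javascript
// Hypothetical in-memory catalog; the README says the real one persists in Vercel Blob.
const catalog = new Map([
  ["acme", { status: "approved", libraryUrl: "https://example.com/acme.json" }],
  ["scraped-today", { status: "pending", libraryUrl: "https://example.com/new.json" }],
]);

// Returns { ok: true, libraryUrl } for approved entries, otherwise an
// error result carrying an HTTP-style status code.
function resolveApprovedLibrary(name, store = catalog) {
  const entry = store.get(name);
  if (!entry) {
    // Assumption: unknown names are reported as 404.
    return { ok: false, status: 404, error: `no saved library named "${name}"` };
  }
  if (entry.status !== "approved") {
    // Pending libraries are untrusted scraped content, so refuse to use them.
    return { ok: false, status: 403, error: `library "${name}" is not approved` };
  }
  return { ok: true, libraryUrl: entry.libraryUrl };
}

console.log(resolveApprovedLibrary("acme").ok);              // true
console.log(resolveApprovedLibrary("scraped-today").status); // 403
```

Keeping the check in one function that every emit path calls means there is no code path that reads a saved library without passing the gate.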

Emit schema (kg_emit_schema)

Turns one page into finished schema.org JSON-LD for its type (home / service / about / blog). The split is the point: the knowsAbout/about/mentions DefinedTerm blocks are linked from the grounded library — every @id, name, and 5-link sameAs (Wikidata, Wikipedia, DBpedia, productontology, Google/Freebase mid) is copied verbatim and locked, never invented — while the prose (per-term descriptions, audience pain points, serviceOutput, teaches, positioning) is written by an LLM (MiniMax M3 on OpenRouter by default, OPENROUTER_MODEL to override). After generation, every DefinedTerm is re-locked to the library's canonical IDs and any entity the model invented is dropped — so a hallucinated sameAs can't ship.

kg_emit_schema({
  pageType: "service",
  content: "<the page text>",          // builds the library, or pass library / libraryUrl from kg_build_library
  business: { name, url, telephone, address, geo, areaServed, aggregateRating }  // used verbatim, never fabricated
})  →  { jsonld, report:{ entitiesUsed, droppedHallucinated, model } }

One page per call (~50–165s on M3, a reasoning model). Beyond re-locking DefinedTerms and dropping invented entities, a geo guard strips any areaServed sameAs that doesn't match a business.areaServed you supplied, so a wrong place ID can't ship either. Needs OPENROUTER_API_KEY in local mode; in hosted mode the key lives server-side. Templates live in src/templates/.
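To make the re-lock step concrete, here is a minimal sketch of the idea, assuming the library is a Map keyed by @id. The relockDefinedTerms name, the data shapes, and the example.org identifiers are placeholders for illustration, not the package's actual internals; the geo guard is not covered here.

```javascript
// Hypothetical library: canonical entities keyed by their locked @id.
// The example.org identifiers are placeholders, not real Wikidata entries.
const library = new Map([
  ["https://example.org/id/drug-rehabilitation", {
    "@id": "https://example.org/id/drug-rehabilitation",
    name: "Drug rehabilitation",
    sameAs: ["https://example.org/wikidata/drug-rehabilitation"],
  }],
]);

// Overwrites @id, name, and sameAs with the library's values and drops
// any term whose @id is not in the library at all.
function relockDefinedTerms(terms, lib) {
  const kept = [];
  const dropped = [];
  for (const term of terms) {
    const canonical = lib.get(term["@id"]);
    if (!canonical) {
      dropped.push(term["@id"] ?? term.name); // invented by the model
      continue;
    }
    kept.push({
      ...term, // keeps model-written prose such as description
      "@id": canonical["@id"],
      name: canonical.name,
      sameAs: [...canonical.sameAs],
    });
  }
  return { kept, dropped };
}

// Simulated model output: one tampered term and one invented term.
const modelOutput = [
  {
    "@id": "https://example.org/id/drug-rehabilitation",
    name: "Rehab (album)",
    sameAs: ["https://example.org/wrong"],
    description: "Prose written by the model.",
  },
  { "@id": "https://example.org/id/invented", name: "Made-up entity", sameAs: [] },
];

const relocked = relockDefinedTerms(modelOutput, library);
console.log(relocked.kept[0].name); // Drug rehabilitation
console.log(relocked.dropped);      // contains only the invented @id
```

Because the identifiers are overwritten after generation rather than validated, the model's prose survives while anything it changed in the locked fields is discarded.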

Bulk: kg_emit_schema_bulk takes a pages array (each page with its own pageType + content) plus a shared business and an optional shared library/libraryUrl, and emits them in parallel (bounded concurrency, default 3). Each page is a separate model call, so the per-page timeout holds and one slow page never blocks the rest. Lean by default (save: true): returns a schemaUrl/savedTo + report per page; set save: false for full inline JSON-LD.

kg_emit_schema_bulk({
  business: { name, url, telephone, areaServed },
  libraryUrl: "<from kg_build_library>",
  pages: [
    { pageType: "home",    content: "..." },
    { pageType: "service", content: "..." },
    { pageType: "about",   content: "..." }
  ]
})  →  { count, ok, failed, results: [{ pageType, schemaUrl, report }] }
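Bounded concurrency of the kind described for the bulk tool is commonly written as a small pool of "lanes" that pull from a shared index. The sketch below is one generic way to do it, with a stand-in worker instead of a model call; mapWithConcurrency and the result shape are assumptions, and the per-page timeout is left out.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Failures are captured per item so one bad page does not reject the batch.
async function mapWithConcurrency(items, worker, limit = 3) {
  const results = new Array(items.length);
  let next = 0;

  async function runLane() {
    while (next < items.length) {
      const index = next++; // claimed synchronously, before any await
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (err) {
        results[index] = { ok: false, error: String(err?.message ?? err) };
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, runLane);
  await Promise.all(lanes);
  return results;
}

// Usage with a stand-in worker that sleeps briefly and fails on one page.
const pages = [
  { pageType: "home" },
  { pageType: "service" },
  { pageType: "about" },
  { pageType: "blog" },
];
let inFlight = 0;
let peak = 0;
const demo = mapWithConcurrency(pages, async (page) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 10));
  inFlight--;
  if (page.pageType === "blog") throw new Error("simulated failure");
  return `${page.pageType}.jsonld`;
}, 3);

demo.then((results) => {
  console.log(results.map((r) => r.ok), "peak in flight:", peak);
});
```

Results come back in input order regardless of which page finishes first, which makes it straightforward to build a per-page report like the results array shown above.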

Pair it with a scraper (the intended flow)

This server is the entity-linking half. A scraper gathers the page content (handling JS rendering, blocks, and full-site crawls a plain fetch can't), then you hand that content to kg_build_library — it links every entity to Wikidata/Wikipedia/DBpedia/Freebase, dedupes, cleans, and ranks. With MCP Scraper:

1. extract_url("https://example.com")            → scraped page content (HTML or markdown)
2. kg_build_library({ pages: [{ url, content }], niche: "default" })
                                                  → linked, cleaned entity library

content can be HTML, markdown, or plain text — it's submitted to TextRazor with the right cleanup mode automatically (pass format to be explicit). Fallback (no scraper): give kg_build_library a url or urls and it self-fetches with a built-in plain-HTTP crawler (no JS rendering).
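As a small illustration of step 2, the helper below shapes scraper output into the { pages, niche } argument shown in the flow. It is a sketch only: toBuildLibraryArgs is a hypothetical name, and the scraped array is assumed to hold { url, content } objects from whatever scraper you use.

```javascript
// Shapes scraper output into the { pages, niche } argument for kg_build_library.
// `scraped` is a hypothetical array of { url, content } results.
function toBuildLibraryArgs(scraped, niche = "default") {
  const pages = scraped
    // Skip pages the scraper returned empty, so they do not cost an API call.
    .filter((p) => typeof p.content === "string" && p.content.trim() !== "")
    // Keep only the two fields the flow above passes along.
    .map(({ url, content }) => ({ url, content }));
  if (pages.length === 0) throw new Error("no scraped pages with content");
  return { pages, niche };
}

const args = toBuildLibraryArgs(
  [
    { url: "https://example.com/", content: "# Home\nWe offer roofing services.", status: 200 },
    { url: "https://example.com/empty", content: "   " },
  ],
  "roofing"
);
console.log(args.pages.length, args.niche); // 1 roofing
```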

Register with Claude Code / Desktop:

{
  "mcpServers": {
    "thorbit-knowledge-graph": {
      "command": "npx",
      "args": ["-y", "thorbit-knowledge-graph"]
    }
  }
}
  • Local mode: put TextRazor keys in the MCP's environment; it runs the engine itself.
  • Hosted mode: set THORBIT_KG_SERVER_URL + THORBIT_KG_API_KEY and it calls your Vercel API instead — friends need no keys of their own.
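For hosted mode, the two variables would typically go in the server entry's environment. The sketch below generates such a config; it assumes your MCP client accepts an env object on each mcpServers entry (Claude Desktop's config does), and the URL and key shown are placeholders.

```javascript
// Builds an mcpServers entry; `env` is handed to the spawned server process.
// buildMcpConfig is a hypothetical helper for illustration.
function buildMcpConfig({ serverUrl, apiKey } = {}) {
  const entry = { command: "npx", args: ["-y", "thorbit-knowledge-graph"] };
  if (serverUrl && apiKey) {
    // Hosted mode: the server calls the Vercel API instead of running the engine.
    entry.env = { THORBIT_KG_SERVER_URL: serverUrl, THORBIT_KG_API_KEY: apiKey };
  }
  return { mcpServers: { "thorbit-knowledge-graph": entry } };
}

// Placeholder values; substitute your own deployment URL and minted key.
const hostedConfig = buildMcpConfig({
  serverUrl: "https://kg.example.com",
  apiKey: "kg_alice_REPLACE_ME",
});
console.log(JSON.stringify(hostedConfig, null, 2));
```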

Verify: npm run smoke, npm run test:mcp (real stdio handshake).


Vercel API (hosted)

Endpoints: POST /api/kg/build, POST /api/kg/resolve, GET /api/kg/library/:id. TextRazor keys live server-side; results store in Vercel Blob.

Auth (now): approved keys, no database, no billing. Mint one per person and add it to KG_API_KEYS:

node scripts/genkey.mjs alice      # -> kg_alice_…   give this to Alice

Every request needs x-api-key: <approved key>. Revoke = remove from KG_API_KEYS + redeploy. (Later this swaps for the thorbit-subscription model, same endpoints.)
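The approved-key check and a matching client request might look like the sketch below. Everything beyond the x-api-key header, the comma-separated KG_API_KEYS value, and the POST /api/kg/build path is an assumption: the helper names, the 401/403 split, and the request body shape (borrowed from the kg_build_library arguments) are illustrative only.

```javascript
// Parses a comma-separated KG_API_KEYS value into a Set of approved keys.
function parseApprovedKeys(raw = "") {
  return new Set(raw.split(",").map((k) => k.trim()).filter(Boolean));
}

// Server side: returns an HTTP-style status for a plain headers object
// whose names are lowercased. The 401 vs 403 split is an assumption.
function checkApiKey(headers, approvedKeys) {
  const key = headers["x-api-key"];
  if (!key) return { status: 401, error: "missing x-api-key" };
  if (!approvedKeys.has(key)) return { status: 403, error: "key not approved" };
  return { status: 200 };
}

// Client side: builds the arguments for fetch(url, options).
function buildKgBuildRequest(baseUrl, apiKey, body) {
  return {
    url: new URL("/api/kg/build", baseUrl).href,
    options: {
      method: "POST",
      headers: { "content-type": "application/json", "x-api-key": apiKey },
      body: JSON.stringify(body), // body shape is assumed, not documented here
    },
  };
}

const approved = parseApprovedKeys("kg_alice_example, kg_bob_example");
console.log(checkApiKey({ "x-api-key": "kg_alice_example" }, approved).status); // 200
console.log(checkApiKey({ "x-api-key": "kg_mallory" }, approved).status);       // 403

const request = buildKgBuildRequest("https://kg.example.com", "kg_alice_example", {
  pages: [{ url: "https://example.com/", content: "..." }],
  niche: "default",
});
// To send it: fetch(request.url, request.options)
console.log(request.url); // https://kg.example.com/api/kg/build
```

Revoking a key under this scheme is just removing it from the comma-separated value, which matches the remove-and-redeploy step described above.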

Deploy:

vercel link
vercel env add TEXTRAZOR_API_KEY      # repeat _2.._N for your rotation pool
vercel env add KG_API_KEYS            # comma-separated minted keys
vercel env add BLOB_READ_WRITE_TOKEN  # Vercel Blob
vercel env add BLOB_PUBLIC_BASE       # https://<store>.public.blob.vercel-storage.com
vercel deploy --prod

api/kg/build is configured with maxDuration: 300; on the Hobby plan, cap the page count with max≈20.


How it cleans

Dedupe by wikidataId → drop universal boilerplate (cookie/GDPR/analytics/marketing) + generic bare nouns → split geo → categorize by knowledge-graph type + niche keywords → rank by relevance × page-spread × domain-fit. It never invents IDs — no Freebase mid means googleKg/freebaseId are null, so you never ship a sameAs you can't verify.
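The dedupe and rank steps of that pipeline can be sketched as below. The entity shape, the way duplicates are merged (best relevance, union of pages), and the definition of page-spread as a fraction of crawled pages are all assumptions for illustration; the README only states the wikidataId key and the relevance × page-spread × domain-fit product. The Q-A and Q-B IDs are placeholders.

```javascript
// Hypothetical entity shape: { wikidataId, name, relevance, pages, domainFit }
function dedupeAndRank(entities, totalPages) {
  const byId = new Map();
  for (const e of entities) {
    const seen = byId.get(e.wikidataId);
    if (!seen) {
      byId.set(e.wikidataId, { ...e, pages: new Set(e.pages) });
    } else {
      // Merge duplicates: keep the best relevance, union the pages.
      seen.relevance = Math.max(seen.relevance, e.relevance);
      for (const p of e.pages) seen.pages.add(p);
    }
  }
  return [...byId.values()]
    .map((e) => {
      const pageSpread = e.pages.size / totalPages; // share of pages mentioning it
      return {
        wikidataId: e.wikidataId,
        name: e.name,
        score: e.relevance * pageSpread * e.domainFit,
      };
    })
    .sort((a, b) => b.score - a.score); // highest score first
}

const ranked = dedupeAndRank(
  [
    { wikidataId: "Q-A", name: "Drug rehabilitation", relevance: 0.9, pages: ["/", "/services"], domainFit: 1 },
    { wikidataId: "Q-A", name: "Drug rehabilitation", relevance: 0.7, pages: ["/about"], domainFit: 1 },
    { wikidataId: "Q-B", name: "Insurance", relevance: 0.8, pages: ["/"], domainFit: 0.5 },
  ],
  4
);
console.log(ranked.map((e) => e.name)); // Drug rehabilitation first, then Insurance
```

Multiplying the three factors means an entity has to do reasonably well on all of them: a highly relevant term that appears on one page, or that sits outside the niche, falls down the list.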

Cost

TextRazor free tier is 500 requests/day per key (rotated); a build is ~1 call/page. kg_resolve_term hits only the free Wikidata API.
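As a quick worked example of that budget: at roughly one call per page, each key covers about 500 pages a day. The helpers below are a sketch; the function names are made up, and round-robin is only one plausible reading of "rotated".

```javascript
const FREE_REQUESTS_PER_KEY_PER_DAY = 500; // figure quoted in the README

// Approximate pages per day, assuming ~1 TextRazor call per page.
function dailyPageBudget(keyCount) {
  return keyCount * FREE_REQUESTS_PER_KEY_PER_DAY;
}

// Smallest number of keys that covers a given daily page count.
function keysNeeded(pagesPerDay) {
  return Math.ceil(pagesPerDay / FREE_REQUESTS_PER_KEY_PER_DAY);
}

// Assumption: simple round-robin rotation over the key pool.
function makeKeyRotator(keys) {
  let i = 0;
  return () => keys[i++ % keys.length];
}

console.log(dailyPageBudget(3)); // 1500
console.log(keysNeeded(1200));   // 3

const nextKey = makeKeyRotator(["key-1", "key-2"]);
console.log(nextKey(), nextKey(), nextKey()); // key-1 key-2 key-1
```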

MIT licensed.