silktext

Lightweight, runtime-safe crawling → clean Markdown

A TypeScript library for crawling websites and extracting clean Markdown content, designed to work in any JavaScript runtime (Node, Bun, Deno, Convex, Workers).

Features

  • Runtime-agnostic: works in Convex, Node, Bun, Deno, and Workers
  • Clean extraction: Converts HTML to clean, consistent Markdown
  • Smart crawling: Respects robots.txt, rate limits, and crawl depth
  • Flexible strategies: Heuristic (fast) or Readability (thorough) extraction
  • No heavy deps: Core has minimal dependencies; heavy libs are optional
  • TypeScript-first: Full type safety with clear data contracts

Installation

npm install silktext

Quick Start

One-liner crawl

import { crawlSite } from "silktext";

await crawlSite({ 
  startUrls: ["https://example.com"], 
  maxDepth: 2, 
  onPage: (page) => console.log(page.markdown) 
});

Two-liner fetch → extract

import { fetchHtml, extract } from "silktext";

const resp = await fetchHtml({ fetch: globalThis.fetch }, { url: "https://example.com" });
const page = await extract(resp, { strategy: "heuristic" });
console.log(page.markdown);

Extract HTML directly

import { extractHtml } from "silktext";

const page = await extractHtml(htmlString, "https://example.com");
console.log(page.metadata.title, page.markdown);

Convex Usage

silktext works in both Convex runtime modes:

Default Convex Runtime (recommended)

Fast and lightweight with heuristic extraction:

import { action } from "./_generated/server";
import { crawlSite } from "silktext";

export const crawlDocs = action({
  args: {},
  handler: async (ctx) => {
    const pages: string[] = [];
    
    await crawlSite({
      startUrls: ["https://docs.example.com"],
      maxDepth: 2,
      sameOrigin: true,
      onPage: (page) => {
        pages.push(`# ${page.metadata.title}\n\n${page.markdown}`);
      }
    });
    
    return pages.join("\n\n---\n\n");
  }
});

Node.js Runtime in Convex

For advanced extraction with Readability:

"use node";  // Enable Node.js runtime

import { action } from "./_generated/server";
import { crawlSite, fetchHtml, extract } from "silktext";

export const crawlWithReadability = action({
  args: {},
  handler: async (ctx) => {
    const pages: string[] = [];
    
    await crawlSite({
      startUrls: ["https://article-site.com"],
      maxDepth: 1,
      onPage: async (page) => {
        // Re-fetch and extract with the Readability strategy (Node runtime only)
        const resp = await fetchHtml({ fetch: globalThis.fetch }, { url: page.url });
        const enhanced = await extract(resp, { strategy: "readability" });
        pages.push(enhanced.markdown);
      }
    });
    
    return pages;
  }
});

API Reference

crawlSite(options)

Crawl a website starting from one or more URLs.

interface CrawlOptions {
  startUrls: string[];              // URLs to start crawling from
  maxDepth?: number;                // Max crawl depth (default: 1)
  maxPages?: number;                // Max pages to crawl (default: 100)
  sameOrigin?: boolean;             // Stay on same host (default: true)
  allow?: (url: string) => boolean; // Custom allow filter
  deny?: (url: string) => boolean;  // Custom deny filter
  userAgent?: string;               // Custom user agent
  rate?: {                          // Rate limiting
    perHostRps?: number;            // Requests per second per host (default: 1)
    concurrent?: number;            // Concurrent requests (default: 4)
  };
  onPage?: (page: ExtractResult, ctx: CrawlContext) => Promise<void> | void;
}
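
For example, a crawl restricted to a docs section with custom filters and throttling might look like this (the URLs and filter rules are illustrative):

import { crawlSite } from "silktext";

await crawlSite({
  startUrls: ["https://example.com/docs"],
  maxDepth: 3,
  maxPages: 50,
  sameOrigin: true,
  allow: (url) => url.includes("/docs/"),  // only follow docs pages
  deny: (url) => url.endsWith(".pdf"),     // skip binary documents
  rate: { perHostRps: 2, concurrent: 2 },  // be polite: 2 RPS, 2 in flight
  onPage: (page) => console.log(page.url),
});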

fetchHtml(http, request)

Fetch HTML from a URL with proper headers and redirect handling.

interface FetchRequest {
  url: string;
  headers?: Record<string, string>;
}

const response = await fetchHtml(
  { fetch: globalThis.fetch }, 
  { url: "https://example.com" }
);

extract(response, options)

Extract clean Markdown and metadata from HTML.

interface ExtractOptions {
  strategy?: "heuristic" | "readability";  // Default: "heuristic"
  langHint?: string;                        // Language hint for extraction
}

const result = await extract(fetchResponse, { strategy: "heuristic" });

extractHtml(html, url, options)

Extract directly from an HTML string.

const result = await extractHtml(
  "<html>...</html>", 
  "https://example.com",
  { strategy: "readability" }
);

Extraction Strategies

Heuristic (default)

  • Fast, lightweight extraction
  • Removes navigation, ads, and other noise
  • Finds main content using common selectors
  • Convex-compatible: uses native DOMParser in Convex/browser
  • Falls back to regex-based extraction in Node without jsdom

Readability

  • Uses Mozilla's Readability algorithm
  • Better for complex article layouts
  • NOT Convex-compatible: requires dynamic imports
  • Only available in Node.js with jsdom installed (see the example below)
  • More accurate for news/blog content
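
A minimal sketch of using the Readability strategy in plain Node, assuming jsdom is installed alongside silktext (the URL is illustrative):

import { fetchHtml, extract } from "silktext";

const resp = await fetchHtml({ fetch: globalThis.fetch }, { url: "https://news.example.com/article" });
const page = await extract(resp, { strategy: "readability" });
console.log(page.markdown);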

Convex Compatibility

silktext is designed for maximum Convex compatibility:

Default Convex Runtime

Fully supported:

  • Heuristic extraction (uses native DOMParser)
  • All crawling features
  • URL normalization
  • Rate limiting
  • Metadata extraction

Convex with "use node"

Additional features:

  • Readability extraction strategy
  • Full jsdom-based parsing
  • All Node.js-specific features

Simply add "use node"; at the top of your action file to enable Node.js runtime and access all features.

Data Types

ExtractResult

interface ExtractResult {
  url: string;                         // Normalized URL
  markdown: string;                    // Clean Markdown content
  metadata: {
    title?: string;                    // Page title
    description?: string;              // Meta description
    og?: Record<string, string>;       // Open Graph tags
    twitter?: Record<string, string>;  // Twitter Card tags
    jsonLd?: unknown[];                // JSON-LD structured data
    links?: string[];                  // Discovered links
  };
}
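
A short sketch of consuming these fields (the sample html string is illustrative; optional metadata fields may be absent):

import { extractHtml } from "silktext";

const html = "<html><head><title>Docs</title></head><body><p>Hello</p></body></html>";
const page = await extractHtml(html, "https://example.com");
console.log(page.url);                              // normalized URL
console.log(page.metadata.title ?? "untitled");     // title is optional
console.log(page.metadata.links?.length ?? 0, "links discovered");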

Features

URL Normalization

  • Lowercases hostnames
  • Removes tracking parameters (utm_*, fbclid, gclid, etc.)
  • Strips URL fragments
  • Sorts query parameters
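
To make these rules concrete, here is a small sketch of the same steps built on the standard URL API; it is not silktext's internal code, and it handles only the tracking parameters named above:

function normalizeUrlSketch(raw: string): string {
  const url = new URL(raw);
  url.hash = "";  // strip fragments
  const kept = [...url.searchParams.entries()]
    .filter(([key]) => !key.startsWith("utm_") && key !== "fbclid" && key !== "gclid")
    .sort(([a], [b]) => a.localeCompare(b));  // sort query parameters
  url.search = new URLSearchParams(kept).toString();
  return url.toString();  // the URL constructor already lowercases the hostname
}

// "HTTPS://Example.COM/a?utm_source=x&b=2&a=1#top" → "https://example.com/a?a=1&b=2"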

Politeness

  • Respects rate limits (1 RPS per host by default)
  • Concurrent request limiting
  • Per-host token bucket rate limiting
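
The per-host token bucket can be pictured like this; a simplified sketch, not silktext's implementation:

class HostBucket {
  private tokens: number;
  private last = Date.now();

  constructor(private rps: number, private capacity: number = rps) {
    this.tokens = capacity;
  }

  // Resolve once a token is available, refilling at `rps` tokens per second.
  async take(): Promise<void> {
    for (;;) {
      const now = Date.now();
      this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.rps);
      this.last = now;
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      await new Promise((r) => setTimeout(r, ((1 - this.tokens) / this.rps) * 1000));
    }
  }
}

// One bucket per hostname, defaulting to 1 request per second.
const buckets = new Map<string, HostBucket>();
async function waitForHost(url: string): Promise<void> {
  const host = new URL(url).hostname;
  let bucket = buckets.get(host);
  if (!bucket) buckets.set(host, (bucket = new HostBucket(1)));
  await bucket.take();
}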

Content Extraction

  • Removes scripts, styles, navigation, footers
  • Preserves main content structure
  • Converts to clean Markdown with Turndown
  • Extracts all metadata (OG, Twitter, JSON-LD)
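
The shape of that pipeline can be approximated as follows; a sketch assuming a DOM is available (native DOMParser in Convex/browser, jsdom in Node) plus the turndown package, not silktext's internal code:

import TurndownService from "turndown";

function htmlToMarkdownSketch(html: string): string {
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Drop noise before conversion.
  doc.querySelectorAll("script, style, nav, footer").forEach((el) => el.remove());
  // Prefer an explicit main-content container, falling back to <body>.
  const main = doc.querySelector("main, article") ?? doc.body;
  return new TurndownService().turndown(main.innerHTML);
}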

License

MIT