# silktext

v0.1.0

Lightweight, runtime-safe crawling → clean Markdown
A TypeScript library for crawling websites and extracting clean Markdown content, designed to work in any JavaScript runtime (Node, Bun, Deno, Convex, Workers).
## Features
- Runtime-agnostic: Works everywhere - Convex, Node, Bun, Deno, Workers
- Clean extraction: Converts HTML to clean, consistent Markdown
- Smart crawling: Respects robots.txt, rate limits, and crawl depth
- Flexible strategies: Heuristic (fast) or Readability (thorough) extraction
- No heavy deps: Core has minimal dependencies; heavy libs are optional
- TypeScript-first: Full type safety with clear data contracts
## Installation

```sh
npm install silktext
```

## Quick Start
### One-liner crawl

```ts
import { crawlSite } from "silktext";

await crawlSite({
  startUrls: ["https://example.com"],
  maxDepth: 2,
  onPage: (page) => console.log(page.markdown)
});
```

### Two-liner fetch → extract
```ts
import { fetchHtml, extract } from "silktext";

const resp = await fetchHtml({ fetch: globalThis.fetch }, { url: "https://example.com" });
const page = await extract(resp, { strategy: "heuristic" });
console.log(page.markdown);
```

### Extract HTML directly
```ts
import { extractHtml } from "silktext";

const page = await extractHtml(htmlString, "https://example.com");
console.log(page.metadata.title, page.markdown);
```

## Convex Usage
silktext works in both Convex runtime modes:
### Default Convex Runtime (recommended)
Fast and lightweight with heuristic extraction:
```ts
import { action } from "./_generated/server";
import { crawlSite } from "silktext";

export const crawlDocs = action({
  args: {},
  handler: async (ctx) => {
    const pages: string[] = [];
    await crawlSite({
      startUrls: ["https://docs.example.com"],
      maxDepth: 2,
      sameOrigin: true,
      onPage: (page) => {
        pages.push(`# ${page.metadata.title}\n\n${page.markdown}`);
      }
    });
    return pages.join("\n\n---\n\n");
  }
});
```

### Node.js Runtime in Convex
For advanced extraction with Readability:

```ts
"use node"; // Enable the Node.js runtime

import { action } from "./_generated/server";
import { crawlSite, extract } from "silktext";

export const crawlWithReadability = action({
  args: {},
  handler: async (ctx) => {
    const pages: string[] = [];
    await crawlSite({
      startUrls: ["https://article-site.com"],
      maxDepth: 1,
      onPage: async (page) => {
        // The Readability strategy is available in the Node runtime
        const enhanced = await extract(page, { strategy: "readability" });
        pages.push(enhanced.markdown);
      }
    });
    return pages;
  }
});
```

## API Reference
### crawlSite(options)
Crawl a website starting from one or more URLs.
```ts
interface CrawlOptions {
  startUrls: string[];              // URLs to start crawling from
  maxDepth?: number;                // Max crawl depth (default: 1)
  maxPages?: number;                // Max pages to crawl (default: 100)
  sameOrigin?: boolean;             // Stay on same host (default: true)
  allow?: (url: string) => boolean; // Custom allow filter
  deny?: (url: string) => boolean;  // Custom deny filter
  userAgent?: string;               // Custom user agent
  rate?: {                          // Rate limiting
    perHostRps?: number;            // Requests per second per host (default: 1)
    concurrent?: number;            // Concurrent requests (default: 4)
  };
  onPage?: (page: ExtractResult, ctx: CrawlContext) => Promise<void> | void;
}
```

### fetchHtml(http, request)
Fetch HTML from a URL with proper headers and redirect handling.
```ts
interface FetchRequest {
  url: string;
  headers?: Record<string, string>;
}

const response = await fetchHtml(
  { fetch: globalThis.fetch },
  { url: "https://example.com" }
);
```

### extract(response, options)
Extract clean Markdown and metadata from HTML.
```ts
interface ExtractOptions {
  strategy?: "heuristic" | "readability"; // Default: "heuristic"
  langHint?: string;                      // Language hint for extraction
}

const result = await extract(fetchResponse, { strategy: "heuristic" });
```

### extractHtml(html, url, options)
Extract directly from an HTML string.
```ts
const result = await extractHtml(
  "<html>...</html>",
  "https://example.com",
  { strategy: "readability" }
);
```

## Extraction Strategies
### Heuristic (default)
- Fast, lightweight extraction
- Removes navigation, ads, and other noise
- Finds main content using common selectors
- Convex-compatible - uses native DOMParser in Convex/browser
- Falls back to regex-based extraction in Node without jsdom
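The jsdom-free fallback can be pictured as a crude pass that strips noise tags before looking for a main-content region. The sketch below is illustrative only; silktext's actual implementation may differ:

```typescript
// Crude regex-based extraction, similar in spirit to the jsdom-free fallback.
// Hypothetical helper — not silktext's real code.
function heuristicFallback(html: string): string {
  // Drop script/style/nav/header/footer/aside blocks entirely
  const noise = /<(script|style|nav|footer|header|aside)\b[\s\S]*?<\/\1>/gi;
  let cleaned = html.replace(noise, "");
  // Prefer an explicit main-content region when one exists
  const main = cleaned.match(/<(main|article)\b[^>]*>([\s\S]*?)<\/\1>/i);
  if (main) cleaned = main[2];
  // Strip remaining tags and collapse whitespace
  return cleaned.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}
```

A real extractor also has to handle nested and unclosed tags, which is why the library prefers a DOM parser when one is available.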
### Readability
- Uses Mozilla's Readability algorithm
- Better for complex article layouts
- NOT Convex-compatible - requires dynamic imports
- Only available in Node.js with jsdom installed
- More accurate for news/blog content
## Convex Compatibility
silktext is designed for maximum Convex compatibility:
### Default Convex Runtime
✅ Fully supported:
- Heuristic extraction (uses native DOMParser)
- All crawling features
- URL normalization
- Rate limiting
- Metadata extraction
### Convex with "use node"
✅ Additional features:
- Readability extraction strategy
- Full jsdom-based parsing
- All Node.js-specific features
Simply add "use node"; at the top of your action file to enable Node.js runtime and access all features.
## Data Types

### ExtractResult
```ts
interface ExtractResult {
  url: string;                        // Normalized URL
  markdown: string;                   // Clean Markdown content
  metadata: {
    title?: string;                   // Page title
    description?: string;             // Meta description
    og?: Record<string, string>;      // Open Graph tags
    twitter?: Record<string, string>; // Twitter Card tags
    jsonLd?: unknown[];               // JSON-LD structured data
    links?: string[];                 // Discovered links
  };
}
```

## Feature Details
### URL Normalization
- Lowercases hostnames
- Removes tracking parameters (utm_*, fbclid, gclid, etc.)
- Strips URL fragments
- Sorts query parameters
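Those rules amount to something like the following, sketched here with the standard `URL` API (an illustration, not silktext's internals):

```typescript
// Illustrative URL normalizer — not the library's actual implementation.
const TRACKING = /^(utm_|fbclid$|gclid$)/;

function normalizeUrl(raw: string): string {
  const url = new URL(raw); // the URL constructor already lowercases the hostname
  url.hash = "";            // strip fragments
  // Remove tracking parameters (copy keys first, since we delete while iterating)
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING.test(key)) url.searchParams.delete(key);
  }
  url.searchParams.sort();  // stable parameter order for deduplication
  return url.toString();
}
```

Normalizing before enqueueing is what lets the crawler treat `?a=1&b=2` and `?b=2&a=1#top` as the same page.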
### Politeness
- Respects rate limits (1 RPS per host by default)
- Concurrent request limiting
- Per-host token bucket rate limiting
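A per-host token bucket of the kind described above can be sketched as follows (a hypothetical implementation, not silktext's code):

```typescript
// Minimal per-host token bucket — illustrative sketch only.
class TokenBucket {
  private tokens: number;
  private last = Date.now();

  constructor(private rps: number, private capacity = 1) {
    this.tokens = capacity;
  }

  /** Try to consume one token; returns false when the host is rate-limited. */
  take(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, capped at capacity
    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.rps);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const buckets = new Map<string, TokenBucket>();

function allowRequest(url: string, rps = 1): boolean {
  const host = new URL(url).hostname;
  let bucket = buckets.get(host);
  if (!bucket) {
    bucket = new TokenBucket(rps);
    buckets.set(host, bucket);
  }
  return bucket.take();
}
```

Keying buckets by hostname is what lets a crawl stay polite to each site while still fetching several hosts concurrently.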
### Content Extraction
- Removes scripts, styles, navigation, footers
- Preserves main content structure
- Converts to clean Markdown with Turndown
- Extracts all metadata (OG, Twitter, JSON-LD)
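Open Graph extraction, for instance, boils down to reading `og:*` meta tags. The simplified regex sketch below assumes a fixed attribute order and double quotes; the library parses the DOM properly where available:

```typescript
// Simplified Open Graph scraper — illustrative only; assumes
// property comes before content and attributes use double quotes.
function extractOg(html: string): Record<string, string> {
  const og: Record<string, string> = {};
  const meta = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"/gi;
  let m: RegExpExecArray | null;
  while ((m = meta.exec(html)) !== null) {
    og[m[1]] = m[2];
  }
  return og;
}
```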
## License
MIT
