contextractor

v0.3.12

Extract web content from URLs with configurable extraction options

Contextractor

Extract clean, readable content from any website using Trafilatura.

Available as: PyPI | npm | Docker | Apify actor

Try the Playground to configure extraction settings and preview commands before running.

Install

pip install contextractor

or

npm install -g contextractor

Requires Python 3.12+ (pip) or Node.js 18+ (npm). Playwright Chromium is installed automatically.

Usage

contextractor https://example.com

Works with zero config. Pass URLs directly, or use a config file for complex setups:

contextractor https://example.com --precision --save json -o ./results
contextractor --config config.json --max-pages 10

CLI Options

contextractor [OPTIONS] [URLS...]

Crawl Settings:
  --config, -c          Path to JSON config file
  --output-dir, -o      Output directory
  --max-pages           Max pages to crawl (0 = unlimited)
  --crawl-depth         Max link depth from start URLs (0 = start only)
  --headless/--no-headless  Browser headless mode (default: headless)
  --max-concurrency     Max parallel requests (default: 50)
  --max-retries         Max request retries (default: 3)
  --max-results         Max results per crawl (0 = unlimited)

Proxy:
  --proxy-urls          Comma-separated proxy URLs (http://user:pass@host:port)
  --proxy-rotation      Rotation: recommended, perRequest, untilFailure

Browser:
  --launcher            Browser engine: chromium, firefox (default: chromium)
  --wait-until          Page load event: load, networkidle, domcontentloaded (default: load)
  --page-load-timeout   Timeout in seconds (default: 60)
  --ignore-cors         Disable CORS/CSP restrictions
  --close-cookie-modals Auto-dismiss cookie banners
  --max-scroll-height   Max scroll height in pixels (default: 5000)
  --ignore-ssl-errors   Skip SSL certificate verification
  --user-agent          Custom User-Agent string

Crawl Filtering:
  --globs               Comma-separated glob patterns to include
  --excludes            Comma-separated glob patterns to exclude
  --link-selector       CSS selector for links to follow
  --keep-url-fragments  Preserve URL fragments
  --respect-robots-txt  Honor robots.txt

Cookies & Headers:
  --cookies             JSON array of cookie objects
  --headers             JSON object of custom HTTP headers

Output Format:
  --save                Output formats, comma-separated (default: markdown)
                        Valid: markdown, html, text, json, jsonl, xml, xml-tei, all

Content Extraction:
  --precision           High precision mode (less noise)
  --recall              High recall mode (more content)
  --fast                Fast extraction mode (less thorough)
  --no-links            Exclude links from output
  --no-comments         Exclude comments from output
  --include-tables/--no-tables  Include tables (default: include)
  --include-images      Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate         Deduplicate extracted content
  --target-language     Filter by language (e.g. "en")
  --with-metadata/--no-metadata  Extract metadata (default: with)
  --prune-xpath         XPath patterns to remove from content

Diagnostics:
  --verbose, -v         Enable verbose logging

CLI flags override config file settings. Merge order: defaults → config file → CLI args
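The merge order can be pictured as successive dictionary updates, where each later source overwrites the earlier one. The following is a minimal sketch, not contextractor's actual code: the `merge_options` helper is hypothetical, the options are assumed to be flat, and `DEFAULTS` is only a small subset of the defaults listed in this README.

```python
from typing import Any

# Illustrative subset of defaults, taken from the option tables in this README.
DEFAULTS: dict[str, Any] = {"maxPages": 0, "crawlDepth": 0, "save": ["markdown"]}


def merge_options(config_file: dict[str, Any], cli_args: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: later sources win (defaults, then config file, then CLI)."""
    merged = dict(DEFAULTS)
    merged.update(config_file)
    # Only flags the user actually passed (not None) override earlier layers.
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged


config_file = {"crawlDepth": 1, "maxPages": 25}
cli_args = {"maxPages": 10, "crawlDepth": None}  # --max-pages 10, no --crawl-depth
print(merge_options(config_file, cli_args))
# {'maxPages': 10, 'crawlDepth': 1, 'save': ['markdown']}
```

Here `maxPages` comes from the CLI, `crawlDepth` from the config file, and `save` from the defaults.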

Config File (optional)

Use a JSON config file to set options:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "save": ["markdown"],
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}
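If you generate config files from a script, a sketch along these lines may help. It writes a config using field names from this README and builds the matching `--config` command. The `write_config` helper and the temporary path are illustrative assumptions, and the command is only printed, not executed.

```python
import json
import tempfile
from pathlib import Path


def write_config(path: Path, urls: list[str], crawl_depth: int = 0) -> list[str]:
    """Hypothetical helper: write a config file and return the matching CLI command."""
    config = {
        "urls": urls,
        "save": ["markdown"],
        "outputDir": "./output",
        "crawlDepth": crawl_depth,
        "trafilaturaConfig": {"favorPrecision": True, "deduplicate": True},
    }
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return ["contextractor", "--config", str(path)]


config_path = Path(tempfile.mkdtemp()) / "config.json"
command = write_config(config_path, ["https://example.com"], crawl_depth=1)
print(" ".join(command))
# To run it, pass the list to subprocess.run(command, check=True).
```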

Crawl Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| urls | array | [] | URLs to extract content from |
| maxPages | int | 0 | Max pages to crawl (0 = unlimited) |
| outputDir | string | "./output" | Directory for extracted content |
| crawlDepth | int | 0 | How deep to follow links (0 = start URLs only) |
| headless | bool | true | Browser headless mode |
| maxConcurrency | int | 50 | Max parallel browser pages |
| maxRetries | int | 3 | Max retries for failed requests |
| maxResults | int | 0 | Max results per crawl (0 = unlimited) |
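To see how crawlDepth and maxPages interact, here is a small breadth-first walk over an in-memory link graph. This is a sketch under stated assumptions, not contextractor's crawler: the `LINKS` site map and the `crawl` function are hypothetical, and the real tool may visit pages in a different order.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/blog": ["/blog/post-1"],
    "/docs/api": [],
    "/blog/post-1": [],
}


def crawl(start_urls: list[str], crawl_depth: int = 0, max_pages: int = 0) -> list[str]:
    """Breadth-first walk; 0 means start URLs only (depth) or unlimited (pages)."""
    visited: list[str] = []
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        if max_pages and len(visited) >= max_pages:
            break  # page budget used up
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= crawl_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl(["/"]))                               # ['/']
print(crawl(["/"], crawl_depth=1))                # ['/', '/docs', '/blog']
print(crawl(["/"], crawl_depth=2, max_pages=4))   # ['/', '/docs', '/blog', '/docs/api']
```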

Proxy Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| proxy.urls | array | [] | Proxy URLs (http://user:pass@host:port or socks5://host:port) |
| proxy.rotation | string | "recommended" | recommended, perRequest, untilFailure |
| proxy.tiered | array | [] | Tiered proxy escalation (config-file only) |

Browser Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| launcher | string | "chromium" | Browser engine: chromium, firefox |
| waitUntil | string | "load" | Page load event: load, networkidle, domcontentloaded |
| pageLoadTimeout | int | 60 | Page load timeout in seconds |
| ignoreCors | bool | false | Disable CORS/CSP restrictions |
| closeCookieModals | bool | true | Auto-dismiss cookie consent banners |
| maxScrollHeight | int | 5000 | Max scroll height in pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL certificate verification |
| userAgent | string | "" | Custom User-Agent string |

Crawl Filtering

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| globs | array | [] | Glob patterns for URLs to include |
| excludes | array | [] | Glob patterns for URLs to exclude |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | bool | false | Treat URLs with different fragments as different pages |
| respectRobotsTxt | bool | false | Honor robots.txt |
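The include and exclude patterns can be thought of as two filters applied to each discovered link. The sketch below uses Python's `fnmatch` as a stand-in matcher, so it is an approximation only: contextractor's own glob syntax may differ (for example in how `**` is handled), and the rule that an empty globs list allows every URL is an assumption. The `should_crawl` function is hypothetical.

```python
from fnmatch import fnmatchcase


def should_crawl(url: str, globs: list[str], excludes: list[str]) -> bool:
    """Hypothetical filter: a URL must match an include pattern and no exclude pattern."""
    included = not globs or any(fnmatchcase(url, pattern) for pattern in globs)
    excluded = any(fnmatchcase(url, pattern) for pattern in excludes)
    return included and not excluded


globs = ["https://docs.example.com/*"]
excludes = ["*/changelog/*"]
for url in [
    "https://docs.example.com/guide/install",
    "https://docs.example.com/changelog/v1",
    "https://example.com/pricing",
]:
    print(url, should_crawl(url, globs, excludes))
```

With these patterns only the first URL passes: the second is excluded by the changelog pattern and the third matches no include pattern.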

Cookies & Headers

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| cookies | array | [] | Initial cookies ([{"name": "...", "value": "...", "domain": "..."}]) |
| headers | object | {} | Custom HTTP headers ({"Authorization": "Bearer token"}) |

Output Format

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| save | array | ["markdown"] | Output formats: markdown, html, text, json, jsonl, xml, xml-tei, all |

Content Extraction

All options go under the trafilaturaConfig key in config files, or use the equivalent CLI flags:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| favorPrecision | bool | false | High precision, less noise |
| favorRecall | bool | false | High recall, more content |
| includeComments | bool | true | Include comments |
| includeTables | bool | true | Include tables |
| includeImages | bool | false | Include images |
| includeFormatting | bool | true | Preserve formatting |
| includeLinks | bool | true | Include links |
| deduplicate | bool | false | Deduplicate content |
| withMetadata | bool | true | Extract metadata (title, author, date) |
| targetLanguage | string | null | Filter by language (e.g. "en") |
| fast | bool | false | Fast mode (less thorough) |
| pruneXpath | array | null | XPath patterns to remove from content |
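The config file spells these fields in camelCase, while the Python API example in this README uses snake_case keyword arguments (favor_precision, include_tables, deduplicate). A small sketch for converting between the two follows. It assumes every field follows the same naming pattern, which this README confirms only for those three, and the `camel_to_snake` helper is hypothetical.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as "favorPrecision" to "favor_precision"."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


trafilatura_config = {"favorPrecision": True, "includeTables": True, "deduplicate": True}
kwargs = {camel_to_snake(key): value for key, value in trafilatura_config.items()}
print(kwargs)
# {'favor_precision': True, 'include_tables': True, 'deduplicate': True}
```

If the names line up, the resulting dictionary could be unpacked into `TrafilaturaConfig(**kwargs)`; check the contextractor-engine README before relying on that.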

Node.js API

Use contextractor as a library in your Node.js code:

const { extract } = require("contextractor");

// CommonJS has no top-level await, so run the calls inside an async function.
async function main() {
  // Extract a single URL
  await extract("https://example.com", {
    save: "markdown",
    outputDir: "./output",
  });

  // Multiple URLs with extraction options
  await extract(["https://a.com", "https://b.com"], {
    precision: true,
    noLinks: true,
    includeTables: true,
    save: ["markdown", "json"],
    outputDir: "./results",
  });

  // Using a config file
  await extract("https://example.com", { config: "./config.json" });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

ESM import:

import { extract } from "contextractor";

extract(urls, options) returns Promise<void>; output goes to outputDir or stdout. Option names are the camelCase forms of the names listed in CLI Options and Config File (for example, --no-links becomes noLinks).
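The option names in the Node.js example (precision, noLinks, includeTables, outputDir) line up with the CLI flags once the dashes are dropped. The sketch below shows that mapping in Python, purely as an illustration: the `flag_to_option` helper is hypothetical, and only the names that appear in this README are confirmed to follow the pattern.

```python
def flag_to_option(flag: str) -> str:
    """Convert a CLI flag such as "--no-links" to the camelCase name "noLinks"."""
    head, *rest = flag.lstrip("-").split("-")
    return head + "".join(part.capitalize() for part in rest)


for flag in ["--precision", "--no-links", "--include-tables", "--output-dir"]:
    print(flag, "->", flag_to_option(flag))
# --precision -> precision
# --no-links -> noLinks
# --include-tables -> includeTables
# --output-dir -> outputDir
```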

Python API

Install the extraction engine:

pip install contextractor-engine

Use ContentExtractor to extract content from HTML:

from contextractor_engine import ContentExtractor, TrafilaturaConfig

# Any HTML string works here, for example a page you have already downloaded
html = "<html><body><article><h1>Title</h1><p>Body text.</p></article></body></html>"

# Basic extraction
extractor = ContentExtractor()
result = extractor.extract(html, url="https://example.com", output_format="markdown")
print(result.content)

# High precision with custom config
config = TrafilaturaConfig(favor_precision=True, include_tables=True, deduplicate=True)
extractor = ContentExtractor(config=config)
result = extractor.extract(html, output_format="json")

Extract metadata:

meta = extractor.extract_metadata(html, url="https://example.com")
print(meta.title, meta.author, meta.date)

Available output formats: txt, markdown, json, xml, xmltei

See the contextractor-engine README for the full API reference.

Docker

docker run ghcr.io/contextractor/contextractor https://example.com

Save output to your local machine:

docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output

Use a config file:

docker run -v ./config.json:/config.json ghcr.io/contextractor/contextractor --config /config.json

All CLI flags work the same inside Docker.

Docker from Code

Call Docker extraction programmatically:

Node.js:

const { execSync } = require("child_process");
const result = execSync(
  "docker run ghcr.io/contextractor/contextractor https://example.com",
  { encoding: "utf-8" }
);
console.log(result);

Python:

import subprocess
result = subprocess.run(
    ["docker", "run", "ghcr.io/contextractor/contextractor", "https://example.com"],
    capture_output=True, text=True
)
print(result.stdout)

Volume mount for output:

docker run -v $(pwd)/output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
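The Python call and the volume mount can be combined into one helper. The sketch below builds the `docker run` argument list with an absolute host path and raises on a non-zero exit code. The function names are hypothetical, and `--rm` (remove the container after it exits) is an addition that is not in the commands above.

```python
import subprocess
from pathlib import Path

IMAGE = "ghcr.io/contextractor/contextractor"


def docker_command(url: str, output_dir: Path, extra_flags: tuple[str, ...] = ()) -> list[str]:
    """Build a `docker run` argument list that mounts output_dir at /output."""
    host_dir = output_dir.resolve()  # use an absolute host path for the bind mount
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/output",
        IMAGE, url, "-o", "/output",
        *extra_flags,
    ]


def run_extraction(url: str, output_dir: Path) -> str:
    """Run the container and return its stdout; raises CalledProcessError on failure."""
    output_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        docker_command(url, output_dir), capture_output=True, text=True, check=True
    )
    return result.stdout


# Printing the command does not require Docker to be installed.
print(" ".join(docker_command("https://example.com", Path("output"), ("--save", "json"))))
```

Passing the arguments as a list avoids shell quoting problems when the output path contains spaces.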

Output

One file per crawled page, named from the URL slug (e.g. example-com-page.md). Metadata (title, author, date) is included in the output header when available.
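Because each page becomes its own file, results can be loaded back with a directory scan. A minimal sketch, assuming markdown output with the .md extension shown in the example above: the `load_pages` helper is hypothetical, and the demo file is a stand-in rather than real contextractor output.

```python
import tempfile
from pathlib import Path


def load_pages(output_dir: Path, extension: str = "md") -> dict[str, str]:
    """Map each output file's name (without extension) to its text content."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(output_dir.glob(f"*.{extension}"))
    }


# Stand-in for a real output directory produced by a crawl.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "example-com-page.md").write_text("# Example\n\nBody text.\n", encoding="utf-8")

pages = load_pages(demo_dir)
print(list(pages))  # ['example-com-page']
```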

Platforms

  • npm: macOS arm64, Linux (x64, arm64), Windows x64
  • Docker: linux/amd64, linux/arm64

License

Apache-2.0

Docs version

2026-04-16T12:41:28Z