contextractor

v0.4.14

Published

10 days ago

Standalone CLI and npm library for Contextractor. Built on rs-trafilatura and Crawlee.

Contextractor

Standalone CLI and npm library for Contextractor — crawl any website and extract clean, boilerplate-free main-content text as txt, markdown, json, html, or raw original HTML, ready to feed LLMs, RAG pipelines, and vector databases. Use it as a command-line tool or call the programmatic Library API (createExtractor, extractOne) from your own code.

Built on rs-trafilatura (extraction) and Crawlee (a TypeScript crawler that drives Playwright, or fetches over plain HTTP with Cheerio).

The CLI ships four subcommands that cover the whole extract-to-files workflow: extract crawls one or more sites and saves every page to local storage, extract-one prints a single page straight to stdout (pipe it anywhere), export writes stored results out as files with a manifest.json index, and purge clears the storage again. Try it without installing (no browser needed): npx contextractor extract-one https://example.com/ --crawler-type cheerio.

The library exposes the same engine with the same camelCase options: createExtractor(...).run(urls) returns crawl results in memory, and extractOne(url) fetches exactly one page with nothing persisted. The adaptive crawler decides per page between a real browser and plain HTTP — or pin it to Chromium, Firefox, or the HTTP-only Cheerio client, which needs no browser install at all. Prebuilt native binaries ship for macOS, Linux, and Windows (Node 22+).

Why Contextractor

Contextractor ships the Rust port of Trafilatura as a native (napi-rs) binding — no Python extraction runtime. On the Scrapinghub article set it scores an F1 of 0.966 (precision 0.942, recall 0.991) — ahead of go-trafilatura (0.960) and the original Python Trafilatura (0.958); see the benchmark write-up for the methodology.

It is free and open source (Apache-2.0), runs locally with no API key and no per-page credits, and its Markdown output typically uses 80–90% fewer tokens than the raw HTML, which makes it cheap to feed to an LLM.

| | Contextractor | Firecrawl | Jina Reader | Crawl4AI |
| --- | --- | --- | --- | --- |
| Extraction engine | rs-trafilatura (heuristic + ML routing) | LLM / heuristic | ReaderLM neural model | LLM / heuristic |
| Runtime | Rust + Node (no Python engine) | hosted API / self-host | hosted API | Python |
| Surfaces | Apify Actor · npm CLI · npm library · PyPI | API · SDKs · self-hosted · MCP | API | Python library · crwl CLI · Docker REST API · MCP |
| Output formats | txt · markdown · json · html · original | markdown · html · etc. | markdown · html · text · screenshot · etc. | markdown · etc. |
| Crawling | Crawlee + Playwright (adaptive / browser / HTTP) | built-in | none (single URL) | built-in |

Installation

npm install contextractor        # local: library use
npm install -g contextractor     # global: the CLI on your PATH
npx playwright install chromium  # browser crawlers only

The default playwright-adaptive crawler needs the one-time npx playwright install chromium (the firefox crawler needs npx playwright install firefox); the cheerio crawler needs no browser at all.

Or run it on demand without installing (no browser needed):

npx contextractor extract-one https://www.iana.org/help/example-domains \
  --crawler-type cheerio

Output (Markdown on stdout, logs on stderr; trimmed here):

# Example Domains

As described in [RFC 2606](/go/rfc2606) and [RFC 6761](/go/rfc6761), a
number of domains such as example.com and example.org are maintained
for documentation purposes. These domains may be used as illustrative
examples in documents without prior coordination with us. They are not
available for registration or transfer.

Usage

contextractor extract [URLS...]
contextractor extract-one <url>

contextractor extract https://example.com
contextractor extract https://example.com --mode precision --save json-kvs
contextractor extract https://example.com --save markdown-dataset
contextractor extract --config-file config.json --max-requests-per-crawl 10
contextractor extract-one https://example.com/ | less

Subcommands

`extract`

Extract content from one or more URLs and save to storage.

A dataset record is pushed for every successful page (status: 'success'). Each content field in the record is a ContentNode: hash and bytes are always present, a *-dataset token adds the inline content, a *-kvs token adds the key, and a format routed to both destinations gets both. The record shape is identical to the Apify Actor's. Failed requests are always pushed with status: 'failed', and skipped URLs can be recorded with --store-skipped-urls. The CLI exits with code 2 when at least one request fails after retries.

contextractor extract https://example.com
contextractor extract https://a.com https://b.com --save txt-kvs
contextractor extract --start-urls-file urls.txt --storage ./my-archive

# markdown to both the dataset and the KVS, raw HTML to the KVS
contextractor extract https://example.com \
  --save markdown-dataset --save markdown-kvs \
  --save original-kvs

--start-urls-file <path> — read start URLs (one per line) from a file
--save <token> — repeatable format-destination token (e.g. markdown-kvs, original-dataset); default markdown-kvs
--storage <path> — storage directory for this run (default: ./storage or the XDG data dir). One --storage path fully identifies a run's storage — use different --storage dirs for different runs
--purge — purge the storage at --storage before extracting (clears the datasets/, key_value_stores/, and request_queues/ dirs)
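
To make the --save token format concrete, here is a minimal TypeScript sketch of how a format-destination token could be split and how repeated tokens combine. It is an illustration only, not Contextractor's own parser: parseSaveToken and routeTokens are hypothetical names, and the only facts taken from this README are the format list, the two extract destinations, and the markdown-kvs default.

```typescript
// Formats and destinations as listed for the extract --save flag.
const FORMATS = ['txt', 'markdown', 'json', 'html', 'original'] as const;
const DESTINATIONS = ['dataset', 'kvs'] as const;

type Format = (typeof FORMATS)[number];
type Destination = (typeof DESTINATIONS)[number];

interface SaveRouteParts {
  format: Format;
  destination: Destination;
}

// Hypothetical helper: split "markdown-kvs" into its format and destination.
function parseSaveToken(token: string): SaveRouteParts {
  const dash = token.lastIndexOf('-');
  const format = FORMATS.find((f) => f === token.slice(0, Math.max(dash, 0)));
  const destination = DESTINATIONS.find((d) => d === token.slice(dash + 1));
  if (dash < 0 || format === undefined || destination === undefined) {
    throw new Error(`Invalid --save token: ${token}`);
  }
  return { format, destination };
}

// Hypothetical helper: group repeated tokens so a format listed twice is
// routed to both destinations. Omitting the argument mirrors the documented
// default of markdown-kvs.
function routeTokens(tokens: string[] = ['markdown-kvs']): Map<Format, Set<Destination>> {
  const routes = new Map<Format, Set<Destination>>();
  for (const token of tokens) {
    const { format, destination } = parseSaveToken(token);
    const destinations = routes.get(format) ?? new Set<Destination>();
    destinations.add(destination);
    routes.set(format, destinations);
  }
  return routes;
}
```

With the three tokens from the example above (markdown-dataset, markdown-kvs, original-kvs), this sketch routes markdown to both destinations and original to the KVS only.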

`extract-one`

Extract a single URL (no link-following) and write the content to file(s) and/or stdout — no storage involved, nothing is persisted. With no --save it prints markdown to stdout (markdown-stdout), and all logs and progress go to stderr, so stdout stays clean and pipeable.

contextractor extract-one https://example.com/ | less
contextractor extract-one https://example.com/ --save txt-stdout > body.txt

# → report.md
contextractor extract-one https://example.com/ \
  --save markdown-file -o report

# → out/page.md + out/page.json
contextractor extract-one https://example.com/ \
  --save markdown-file --save json-file -o out/page

# → URL-slug names in snapshots/; original tagged .original.html
contextractor extract-one https://example.com/ \
  --save html-file --save original-file -o snapshots/

--save <token> — repeatable format-destination token; format txt|markdown|json|html|original, destination file|stdout (default: markdown-stdout). At most one format may target stdout — the stream carries that format's raw content only, never a JSON wrapper
-o, --output <path> — file path for the -file tokens (ignored for -stdout): a literal path for one format (the extension is appended only when the value has none), a base prefix for several (each format appends its own extension), or a directory (trailing slash or an existing dir) for URL-slug file names; absent → URL-slug names in the current directory
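
The -o rules above can be read as a small decision procedure. The TypeScript sketch below is one possible reading, not the CLI's actual code: resolveOutputPaths and slugFromUrl are hypothetical, the .txt and .html extensions and the slug algorithm are assumptions, and the existing-directory check is passed in as a callback so the function stays free of file-system access.

```typescript
type FileFormat = 'txt' | 'markdown' | 'json' | 'html' | 'original';

// .md, .json and .original.html appear in the examples above;
// .txt and .html are assumed.
const EXTENSIONS: Record<FileFormat, string> = {
  txt: '.txt',
  markdown: '.md',
  json: '.json',
  html: '.html',
  original: '.original.html',
};

// Assumed slug rule: drop the scheme, collapse non-alphanumerics to "-".
function slugFromUrl(url: string): string {
  const slug = url
    .replace(/^[a-z]+:\/\//i, '')
    .replace(/[^a-z0-9]+/gi, '-')
    .replace(/^-+|-+$/g, '');
  return slug === '' ? 'page' : slug;
}

// True when the last path segment contains a dot after its first character.
function hasExtension(path: string): boolean {
  const base = path.slice(path.lastIndexOf('/') + 1);
  return base.lastIndexOf('.') > 0;
}

function resolveOutputPaths(
  url: string,
  formats: FileFormat[],
  output?: string,
  isExistingDir: (path: string) => boolean = () => false,
): Map<FileFormat, string> {
  const paths = new Map<FileFormat, string>();
  if (output === undefined || output.endsWith('/') || isExistingDir(output)) {
    // Directory (or no -o at all): URL-slug file names.
    let dir = output ?? '';
    if (dir !== '' && !dir.endsWith('/')) dir += '/';
    const slug = slugFromUrl(url);
    for (const format of formats) paths.set(format, dir + slug + EXTENSIONS[format]);
    return paths;
  }
  for (const format of formats) {
    // One format: literal path, extension appended only when missing.
    // Several formats: the value is a base prefix for every file.
    const literal = formats.length === 1 && hasExtension(output);
    paths.set(format, literal ? output : output + EXTENSIONS[format]);
  }
  return paths;
}
```

Under these assumptions, -o report with markdown-file gives report.md, and -o out/page with markdown-file and json-file gives out/page.md and out/page.json, matching the examples above.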

All single-page flags from extract (--crawler-type, --proxy, --mode, --wait-for-selector, --cookies, …) work here too; the crawl and storage flags (--globs, --max-crawl-depth, --storage, --session-pool-name, …) belong to extract only. Exits 0 on success, 1 on failure, and 2 when the page was extracted but a requested format yielded no content (a warning goes to stderr; any other requested files are still written).

`export`

Export stored extraction content to a user-facing output directory. Reads the dataset record index and, for every success record, writes one file per saved format — using the inline content or fetching the key-value-store blob by key. File names are derived from the record title (then its URL, then page), and a manifest.json listing every record (including failed and skipped) is written alongside the files.
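
As a rough illustration of the "inline content or key-value-store blob" lookup described above, the TypeScript sketch below resolves the text for one content field. The ContentNode field names (hash, bytes, content, key) follow the wording of the extract section but are assumptions about the record shape, resolveContent is a hypothetical helper, and preferring inline content when both are present is also an assumption.

```typescript
// Assumed shape of a content field in a dataset record.
interface ContentNode {
  hash: string;
  bytes: number;
  content?: string; // inline content, present for *-dataset tokens
  key?: string; // key-value-store key, present for *-kvs tokens
}

// Caller-supplied lookup, e.g. a thin wrapper around a key-value store.
type BlobFetcher = (key: string) => Promise<string | null>;

// Prefer inline content; otherwise fetch the blob by key; otherwise null.
async function resolveContent(
  node: ContentNode,
  fetchBlob: BlobFetcher,
): Promise<string | null> {
  if (node.content !== undefined) return node.content;
  if (node.key !== undefined) return fetchBlob(node.key);
  return null;
}
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular storage client.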

contextractor export                 # → ./contextractor-output
contextractor export --output-dir ./out --storage ./my-archive

--output-dir <path> — output directory (default: ./contextractor-output)
--storage <path> — storage directory to read from

`purge`

Clears the storage at --storage — the datasets/, key_value_stores/, and request_queues/ dirs. Same as running extract --purge before a crawl.

contextractor purge                          # purge the resolved storage dir
contextractor purge --storage ./my-archive   # purge a specific storage dir

Storage directory resolution

Storage directory is resolved in this order (first match wins):

1. --storage CLI flag
2. CONTEXTRACTOR_STORAGE_DIR env var
3. CRAWLEE_STORAGE_DIR env var (Crawlee native compatibility)
4. ./storage if .actor/ or ./storage/ exists in the current working directory
5. ${XDG_DATA_HOME:-~/.local/share}/contextractor/storage (XDG fallback)
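
The precedence above can be sketched as a pure function. This is an illustrative reading of the list, not the package's resolver: resolveStorageDir and the StorageInputs shape are hypothetical, and the environment, working directory, home directory, and existence check are injected so the logic can be followed (and tested) without touching the real file system.

```typescript
interface StorageInputs {
  storageFlag?: string; // value of --storage, if given
  env: Record<string, string | undefined>;
  cwd: string;
  homeDir: string;
  exists: (path: string) => boolean;
}

// First match wins, in the documented order.
function resolveStorageDir(inputs: StorageInputs): string {
  const { storageFlag, env, cwd, homeDir, exists } = inputs;
  if (storageFlag) return storageFlag;

  const own = env['CONTEXTRACTOR_STORAGE_DIR'];
  if (own) return own;

  const crawlee = env['CRAWLEE_STORAGE_DIR'];
  if (crawlee) return crawlee;

  if (exists(`${cwd}/.actor`) || exists(`${cwd}/storage`)) {
    return `${cwd}/storage`;
  }

  // ${XDG_DATA_HOME:-~/.local/share}: fall back when unset or empty.
  const dataHome = env['XDG_DATA_HOME'] || `${homeDir}/.local/share`;
  return `${dataHome}/contextractor/storage`;
}
```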

CLI flags (extract)

The extract flag list below is generated from the same Commander program the binary uses. Negatable flags (--no-headless, --no-block-media, --no-close-cookie-modals, --no-tables, --no-images, --no-links, --no-comments) appear as separate rows.

| Option | Description |
| --- | --- |
| --start-urls-file | Read start URLs (one per line) from a file |
| --headless | Run browser in headless mode |
| --no-headless | Run browser with UI |
| --proxy | Proxy URL (repeatable) |
| --proxy-rotation | Proxy rotation: recommended, per-request, until-failure |
| --max-session-rotations | Max session rotations per request on block detection |
| --crawler-type | Crawler engine: adaptive, firefox, chromium, cheerio |
| --rendering-type-detection | Rendering type detection ratio 0–1 (adaptive only) |
| --wait-until | Page load event: load, domcontentloaded, networkidle, commit |
| --navigation-timeout | Navigation timeout in seconds |
| --block-media | Block images, stylesheets, fonts, PDFs, and ZIPs (default) |
| --no-block-media | Do not block media requests |
| --ignore-cors-and-csp | Disable CORS/CSP restrictions |
| --close-cookie-modals | Auto-dismiss cookie banners |
| --no-close-cookie-modals | Do not auto-dismiss cookie banners |
| --max-scroll-height | Max scroll height in pixels |
| --ignore-https-errors | Skip HTTPS certificate verification |
| --user-agent | Custom User-Agent string |
| --respect-robots-txt | Honor robots.txt |
| --cookies | JSON array of cookie objects |
| --headers | JSON object of custom HTTP headers |
| --max-retries | Max request retries |
| --mode | Extraction mode: precision (less noise), balanced (default), or recall (more content) |
| --no-links | Exclude links from output |
| --no-comments | Exclude comments from output |
| --no-tables | Exclude tables from output |
| --images | Include image alt text and captions |
| --no-images | Exclude image alt text and captions (default) |
| --language | Filter by language (e.g. en) |
| --verbose, -v | Enable verbose logging |
| --wait-for-dynamic-content | Maximum seconds to wait for dynamic content after navigation; the crawler continues as soon as the network is idle or this timeout elapses, whichever comes first (0 = disabled) |
| --wait-for-selector | CSS selector to wait for before extracting (fails on timeout) |
| --soft-wait-for-selector | CSS selector to wait for before extracting (continues on timeout) |
| --config-file, -c | Path to JSON config file |
| --purge | Purge the storage at --storage before extracting (datasets, KVS, request queues) |
| --max-requests-per-crawl | Max requests to handle (0 = unlimited) |
| --max-crawl-depth | Max link depth from start URLs (0 = unlimited) |
| --globs | Glob pattern to include (repeatable) |
| --exclude | Glob pattern to exclude (repeatable) |
| --selector | CSS selector for links to follow |
| --keep-url-fragment | Preserve URL fragments |
| --use-sitemaps | Discover and enqueue URLs from sitemap.xml at each start URL domain root |
| --initial-concurrency | Initial parallel requests (0 = Crawlee default) |
| --max-concurrency | Max parallel requests |
| --max-results | Max results per crawl (0 = unlimited) |
| --save | Format-destination token, e.g. markdown-kvs, original-dataset (repeatable). Format: txt\|markdown\|json\|html\|original; destination: dataset\|kvs. List a format twice to save to both. Saving original/html to the dataset risks OOM on large pages. |
| --deduplication | Deduplication level: minimal, standard (default), or aggressive |
| --session-pool-name | Named session pool for cross-run session sharing |
| --storage | Storage directory holding the datasets/key_value_stores/request_queues (default: ./storage or the XDG data dir) |
| --store-skipped-urls | Push skipped URL records to the dataset after crawl |

JSON config

Pass --config-file path/to/config.json. The file is validated by the Zod 4 schema in @contextractor/schema, so keys use the same camelCase shape as the Apify input schema. Orchestration flags (--proxy, --purge, --storage) are CLI-only and must be set on the command line. Shared schema fields like save are accepted in the JSON config. The datasetName, keyValueStoreName, and requestQueueName fields apply only to the Apify Actor — the CLI parses but ignores them and always uses the default storage buckets under --storage.

In JSON config and the Apify input schema, globs and exclude are arrays of { "glob": "..." } objects (e.g. "globs": [{ "glob": "https://example.com/**" }]), whereas the --globs / --exclude CLI flags take bare glob strings.

{
  "startUrls": [{ "url": "https://example.com" }],
  "headless": false,
  "maxRequestsPerCrawl": 10,
  "mode": "recall",
  "includeImages": true,
  "save": ["txt-dataset", "original-kvs"]
}

Config merge order: schema defaults → config file → explicit CLI args. Unknown keys are stripped by ContextractorInput.parse().
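
The merge order can be pictured as successive object spreads, with later layers overriding earlier ones. The TypeScript sketch below is a simplified model and not the CLI's implementation: mergeConfig is a hypothetical name, the merge is assumed to be shallow, and flags the user did not pass are assumed to arrive as undefined so they do not overwrite values from the config file.

```typescript
type ConfigLayer = Record<string, unknown>;

// schema defaults -> config file -> explicit CLI args (last one wins).
function mergeConfig(
  defaults: ConfigLayer,
  fileConfig: ConfigLayer,
  cliArgs: ConfigLayer,
): ConfigLayer {
  // Keep only the CLI values that were actually set.
  const explicit: ConfigLayer = {};
  for (const [key, value] of Object.entries(cliArgs)) {
    if (value !== undefined) explicit[key] = value;
  }
  return { ...defaults, ...fileConfig, ...explicit };
}
```

For example, with a config file that sets "mode": "recall" and a command line that passes only --max-requests-per-crawl 10, the merged result keeps mode recall from the file and takes maxRequestsPerCrawl 10 from the CLI.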

Enum values

These are the values for the JSON config and the Apify input schema. The crawlerType values below are the config/schema names — the --crawler-type CLI flag uses the short forms instead: playwright-adaptive → adaptive, playwright-firefox → firefox, playwright-chromium → chromium, and cheerio is the same in both. All other enums use identical values on the CLI and in config.
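
If you generate config files from CLI-style values, the short-form mapping described above fits in a small lookup table. The sketch below uses exactly the four pairs named in this section; toConfigCrawlerType is a hypothetical helper, not an export of the package.

```typescript
// CLI short form -> config/schema value, as listed above.
const CLI_TO_CONFIG = {
  adaptive: 'playwright-adaptive',
  firefox: 'playwright-firefox',
  chromium: 'playwright-chromium',
  cheerio: 'cheerio',
} as const;

type CliCrawlerType = keyof typeof CLI_TO_CONFIG;
type ConfigCrawlerType = (typeof CLI_TO_CONFIG)[CliCrawlerType];

// Hypothetical helper: translate a --crawler-type value for use in JSON config.
function toConfigCrawlerType(cliValue: string): ConfigCrawlerType {
  const known = Object.keys(CLI_TO_CONFIG) as CliCrawlerType[];
  const match = known.find((name) => name === cliValue);
  if (match === undefined) {
    throw new Error(`Unknown --crawler-type value: ${cliValue}`);
  }
  return CLI_TO_CONFIG[match];
}
```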

`crawlerType` (default `playwright-adaptive`)

| Value | Title |
| --- | --- |
| playwright-adaptive | Adaptive switching (Recommended) |
| playwright-firefox | Headless browser (Firefox+Playwright) |
| playwright-chromium | Headless browser (Chromium+Playwright) |
| cheerio | Raw HTTP client (Cheerio) |

`deduplication` (default `standard`)

| Value | Title |
| --- | --- |
| minimal | Minimal — Crawlee URL dedup only |
| standard | Standard — + canonical URL (default) |
| aggressive | Aggressive — + content hash |

`mode` (default `balanced`)

| Value | Title |
| --- | --- |
| precision | Precision (less noise) |
| balanced | Balanced (default) |
| recall | Recall (more content) |

`proxyRotation` (default `recommended`)

| Value | Title |
| --- | --- |
| recommended | Recommended |
| per-request | Rotate per request |
| until-failure | Use until failure |

`waitUntil` (default `load`)

| Value | Title |
| --- | --- |
| load | Load event |
| domcontentloaded | DOM content loaded |
| networkidle | Network idle |
| commit | Commit |

Library API

Run a crawl programmatically and get results back in memory — a thin, Crawlee-shaped facade (construct from a camelCase options object, then run(urls)):

import { createExtractor } from 'contextractor';

const extractor = createExtractor({
  save: ['txt-kvs'],
  includeHtml: false, // default — omit raw HTML from returned records
  deduplication: 'minimal',
  maxResultsPerCrawl: 10, // bounds the in-memory result set (0 = unlimited)
});

const { dataset, statistics } = await extractor.run(['https://example.com']);

const { requestsFinished, requestsFailed } = statistics;
console.log(`finished=${requestsFinished} failed=${requestsFailed}`);

// Iterate without loading everything, or grab the full array:
await dataset.forEach((record, i) => {
  console.log(i, record.url, 'depth:', record.crawlDepth);
});
const all = dataset.export(); // LibraryRecord[]

// Optionally export to a Crawlee key-value store:
await dataset.exportToJSON('results.json');
await dataset.exportToCSV('results.csv');

The returned dataset holds successful extractions only; failed and skipped requests are reflected in statistics (a subset of Crawlee's FinalStatistics). run() never throws on partial failure and never calls process.exit().
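
Because run() neither throws on partial failure nor exits the process, a script that wants CLI-like exit codes has to derive them from statistics itself. The sketch below shows one way to do that, assuming only the requestsFinished and requestsFailed fields used in the example above; RunStatistics and exitCodeFor are hypothetical names, and code 2 mirrors the extract subcommand's documented behavior.

```typescript
// The two statistics fields used in the example above.
interface RunStatistics {
  requestsFinished: number;
  requestsFailed: number;
}

// Mirror the extract subcommand: 2 when at least one request failed, else 0.
function exitCodeFor(statistics: RunStatistics): 0 | 2 {
  return statistics.requestsFailed > 0 ? 2 : 0;
}
```

In a Node script this could be applied as process.exitCode = exitCodeFor(statistics) after the run completes.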

Options use the same camelCase field names as the JSON config above (e.g. crawlerType, maxResultsPerCrawl, deduplication, save). save is a SaveRoute[] of format-destination tokens. Three knobs are library-only: includeHtml (default false; the returned record omits html, but rawHtmlHash / rawHtmlLength are always kept), storageDir (when set, full records are also written to disk in addition to being returned in memory), and logLevel (default warning, keeping stdout clean). In-memory results target small and medium crawls (guidance: under ~10k pages); for very large crawls, set storageDir and read results back from disk.

To use a proxy, pass proxyConfiguration: { proxyUrls: ['http://user:pass@host:port'] } (http/https/socks4/socks5 only). Apify Proxy is available only in the Apify Actor build.

Single page: `extractOne`

For one page, skip the crawl machinery — extractOne fetches exactly one URL (no link-following, nothing persisted) and returns the content directly:

import { extractOne } from 'contextractor';

// formats default: ['markdown']
const { markdown } = await extractOne('https://example.com');

const contents = await extractOne('https://example.com', {
  formats: ['markdown', 'json', 'original'], // 'original' = raw page HTML
  crawlerType: 'cheerio',
});

It resolves to a format → string map keyed by the requested formats and throws when the request fails. Valid formats are the SaveFormat union ('txt' | 'markdown' | 'json' | 'html' | 'original'; the SAVE_FORMATS array is exported too). Options are the single-page subset of the createExtractor options — same camelCase names, but no save, no storageDir, no includeHtml (use formats: ['original'] instead), and no crawl knobs: the run is pinned to a single URL.

Export and purge: `runExportAction`, `runPurgeAction`

The export and purge subcommands are library-callable too. Both resolve the storage directory with the same precedence as the CLI and never call process.exit():

import { runExportAction, runPurgeAction } from 'contextractor';

const exported = await runExportAction({
  outputDir: './out',
  storageDir: './my-archive',
});
console.log(exported.filesWritten, 'files —', exported.manifestPath);

const purged = await runPurgeAction({ storageDir: './my-archive' });
console.log('purged', purged.storageDir);

runExportAction(opts: ExportOpts): Promise<ExportResult> — writes stored results out as files plus a manifest.json index (the export subcommand's behavior). opts: outputDir (default ./contextractor-output) and storageDir (default: the resolved storage dir). Returns { outputDir, filesWritten, recordsTotal, manifestPath }.
runPurgeAction(opts?: PurgeOpts): Promise<PurgeResult> — clears the datasets/, key_value_stores/, and request_queues/ dirs (the purge subcommand's behavior). opts: storageDir. Returns { storageDir } with the resolved path.

Library use (Crawlee re-exports)

contextractor re-exports Crawlee's storage types for library consumers:

import {
  Dataset,
  KeyValueStore,
  Configuration,
} from 'contextractor';

const ds = await Dataset.open('my-dataset');
await ds.forEach((item) => console.log(item));

const kvs = await KeyValueStore.open('default');
const value = await kvs.getValue('my-key');

Requirements

Node 22+. The package ships prebuilt native extraction binaries for macOS (x64/arm64), Linux (x64/arm64 glibc), and Windows (x64) — no build toolchain needed.
Browser-based crawler types need a Playwright browser: npx playwright install chromium for adaptive/chromium, or npx playwright install firefox for the firefox crawler. The raw HTTP crawler (--crawler-type cheerio) needs none.

Contributing

Issues and pull requests are welcome at the issue tracker. The extraction engine, npm CLI, and Apify Actor all live in the same source repository.

License

Apache-2.0
