@tadalabz/descry

v1.0.0

Published 4 days ago

Configurable URL discovery and processing pipeline library with built-in content handling, retention-aware sharded JSON persistence, hookable context stages, and pluggable selector logic.

Vulnerabilities: 0 High, 0 Medium, 0 Low

Maintainers: tadalabz

Keywords: web-crawler, web-scraping, crawler-framework, content-extraction, data-pipeline, rss, sitemap, html-parser

Descry

Definition

descry (verb): to catch sight of something far off

Description

Descry is a URL discovery and processing pipeline library. It organizes the runtime lifecycle of each run into five stages:

descry(): emit candidate URLs plus discovery telemetry
see(): build the canonical context for one candidate
amass(): hydrate that context with content and extracted artifacts
select(): run pluggable selector logic against the hydrated context
remember(): persist the durable result of the run
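To make the order of the stages concrete, here is a minimal sketch of how one run might flow through them. It does not import or call the package: every function name ending in Stage, the context fields (url, content, artifacts, decision), the fixture map, and the in-memory Map store are hypothetical stand-ins chosen for illustration. The real shapes are described in docs/DATA_MODEL.md.

```javascript
// Illustrative stand-ins only: these functions mirror the stage names listed
// above but are NOT descry's real API, signatures, or data shapes.

// descry(): emit candidate URLs plus discovery telemetry.
function descryStage(seedUrls) {
  const candidates = seedUrls.map((url) => ({ url }));
  return { candidates, telemetry: { discovered: candidates.length } };
}

// see(): build the canonical context for one candidate.
function seeStage(candidate) {
  return { url: new URL(candidate.url).href, content: null, artifacts: {}, decision: null };
}

// amass(): hydrate the context with content and extracted artifacts.
// A real run would fetch over the network; this sketch reads from a fixture map.
function amassStage(context, fixtures) {
  const content = fixtures[context.url] ?? '';
  const titleMatch = content.match(/<title>(.*?)<\/title>/i);
  return { ...context, content, artifacts: { title: titleMatch ? titleMatch[1] : null } };
}

// select(): run pluggable selector logic against the hydrated context.
function selectStage(context, selector) {
  return { ...context, decision: selector(context) };
}

// remember(): persist the durable result. A Map stands in for the store.
function rememberStage(context, store) {
  store.set(context.url, { title: context.artifacts.title, decision: context.decision });
}

// One pass over all candidates, in stage order.
function runOnce(seedUrls, fixtures, selector, store) {
  const { candidates, telemetry } = descryStage(seedUrls);
  for (const candidate of candidates) {
    const hydrated = amassStage(seeStage(candidate), fixtures);
    rememberStage(selectStage(hydrated, selector), store);
  }
  return telemetry;
}

const store = new Map();
const fixtures = { 'https://example.com/a': '<title>Alpha</title>' };
const keepTitled = (context) => (context.artifacts.title ? 'keep' : 'skip');
const telemetry = runOnce(
  ['https://example.com/a', 'https://example.com/b'],
  fixtures,
  keepTitled,
  store
);
```

The sketch keeps each stage a small function that takes a context and returns a new one, so the selector is the only part that varies between runs.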

It is a good fit when you want repeatable discovery, a clear context model, pluggable selectors, built-in content handling, and shard-backed JSON persistence.

Out of the box, descry can create stores, fetch content, extract useful artifacts, persist results, re-read stored content on later runs, and emit structured runtime logs when a channel enables them. In most cases, what you bring is channel configuration and selector logic.
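As a rough picture of what "channel configuration and selector logic" could look like, the sketch below builds a selector from a small config object. The config keys (name, allowedHosts, minContentLength), the context fields, and the decision shape are all assumptions made for this example and are not taken from descry's documented contracts. See docs/SELECTOR_GUIDE.md and docs/DATA_MODEL.md for the actual ones.

```javascript
// Hypothetical shapes: neither the config keys nor the context fields below
// come from descry's documented data model.
const channelConfig = {
  name: 'example-news',
  allowedHosts: ['example.com'],
  minContentLength: 20,
};

// Build a selector: a pure function from a hydrated context to a decision.
function makeSelector(config) {
  return function select(context) {
    const host = new URL(context.url).hostname;
    if (!config.allowedHosts.includes(host)) {
      return { keep: false, reason: 'out-of-scope host' };
    }
    if ((context.content ?? '').length < config.minContentLength) {
      return { keep: false, reason: 'content too short' };
    }
    return { keep: true, reason: 'passed all checks' };
  };
}

const select = makeSelector(channelConfig);
const body = 'A body long enough to keep.';
const inScope = select({ url: 'https://example.com/post', content: body });
const offSite = select({ url: 'https://other.test/post', content: body });
const tooShort = select({ url: 'https://example.com/stub', content: 'tiny' });
```

Returning a reason alongside the decision is a design choice of this sketch; it makes later inspection of persisted results easier, but nothing here implies descry requires it.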

This package is currently built and verified on Node.js 24.x.

Best Fit

Descry is a strong fit for structured, channel-based discovery work:

repeated runs against durable stores
selector-driven processing decisions
channel designs that stay bounded in responsibility
discovery topologies that split or promote recurring hot areas into their own channels

It is not positioned as a monolithic high-scale crawler platform where one forever-hot channel owns an ever-growing universe of URLs.

Quality Statement

Descry is validated with automated tests across the pipeline, persistence, logging, promotion analysis, and public CLI/package surface, plus repeated multi-run scenario exercises that check durable-state behavior over time. These validations have been used to confirm backlog and recrawl behavior, known-work suppression, discovery-scope control, and promotion-trigger behavior under realistic channel workloads.

Canonical Scope

This document is the high-level introduction to the package.

It tells you:

what descry is,
what stages it owns,
what the published package contains,
where to go next.

When you want more detail, use:

docs/USAGE.md: basic getting-started usage and first pipeline setup
docs/RUNTIME_WALK_THROUGH.md: plain-English runtime walk-through of one pipeline run
docs/DATA_MODEL.md: exact candidate, context, decision, and persistence shapes
docs/CONTENT_HANDLING_GUIDE.md: built-in content handling and override contracts
docs/PERSISTENCE.md: persistence behavior, mechanics, and limitations
docs/SELECTOR_GUIDE.md: selector authoring guidance
docs/PRIMARY_CHANNEL_GUIDE.md: primary-channel seed guidance

Published Surface

The published package includes:

README.md: this concept summary
index.js: package entrypoint
src/: runtime implementation
docs/: the documentation set listed above
tools/create-store.js: public CLI for initializing default persistence stores
tools/analyze-channel.js: public CLI for reading or calculating canonical channel promotionAnalysis
tools/extract-channel.js: public CLI for creating one child channel from persisted or explicit promotionAnalysis
examples/: the packaged plain starter example and sample configuration
LICENSE: license text

Start with docs/USAGE.md for a simple first setup, then use examples/README.md for the packaged starter example.
