# walk-urls (experimental)
A minimal, Puppeteer-based link crawler with concurrency, depth-limiting, extension filtering, and flexible callbacks.
## Installation

```sh
npm i walk-urls
```

## Quick start
```js
import { walkURLs } from 'walk-urls'

const targetURL = new URL('https://example.com')

await walkURLs(targetURL, {
  // Only visit links on the same domain
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
    console.log('Page content:', page.content)
  },
})
```

## Examples
### With depth limit
```js
import { walkURLs } from 'walk-urls'

const targetURL = new URL('https://example.com')

await walkURLs(targetURL, {
  // Visit the initial page and its direct links
  depth: 1,
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
    console.log('Depth:', page.depth)
  },
})
```

### With concurrency
```js
import { walkURLs } from 'walk-urls'

const targetURL = new URL('https://example.com')

await walkURLs(targetURL, {
  // Visit up to 5 pages concurrently
  concurrency: 5,
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
  },
})
```

## API Reference
### `walkURLs(targetURL, options)`
Crawls links starting from `targetURL`. Returns a Promise that resolves once all pages have been processed, or rejects on internal errors (unless they are handled by `onError`).
#### Parameters
- `targetURL` (`string | URL`): The starting URL to crawl.
- `options` (`WalkURLsOptions`): Configuration object:

| Option | Type | Description |
| --- | --- | --- |
| `depth` | `number \| undefined` | Limits crawl depth. `depth = 0` visits only `targetURL`; `depth = 1` includes its children, etc. Default is `undefined` (no limit). |
| `concurrency` | `number \| undefined` | Number of pages processed in parallel. Defaults to `1` (serial crawling). |
| `onURL` | `(url: URL, meta: { href: string; depth: number }) => boolean \| void \| Promise<boolean \| void>` | Called before enqueuing a link. Return `false` to skip it. |
| `onPage` | `(page: Page) => void \| Promise<void>` | Called after navigating to a page. Can be used to extract or process HTML content. |
| `onError` | `(error: unknown, url: URL) => void \| Promise<void> \| undefined` | Called on errors (e.g., network errors, or non-2xx HTTP statuses if you handle them that way). If not provided, errors are logged to `console.error`. |
| `extensions` | `string[] \| null \| undefined` | File extensions recognized as HTML. Defaults to `[".html", ".htm"]`. If `null`, all links are followed. If you pass your own array, it completely overrides the default. |
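For instance, the options not shown in the earlier examples could be combined like this (a sketch based only on the table above):

```js
import { walkURLs } from 'walk-urls'

const targetURL = new URL('https://example.com')

await walkURLs(targetURL, {
  // Follow every link, regardless of file extension
  extensions: null,
  // Stay on the starting domain
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log(page.status, page.url.href)
  },
  // Handle failures instead of the default console.error logging
  onError: (error, url) => {
    console.warn('Failed to crawl', url.href, error)
  },
})
```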
#### Returns

`Promise<void>`: Resolves when the entire crawl finishes (or rejects on internal errors, unless you handle them in `onError`).
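If you prefer a single failure point rather than per-URL handling, the returned promise can be awaited in a try/catch (a sketch assuming the rejection behavior described above):

```js
import { walkURLs } from 'walk-urls'

try {
  await walkURLs(new URL('https://example.com'), {
    onPage: (page) => console.log(page.title),
  })
  console.log('Crawl finished')
} catch (error) {
  // Reached only if an internal error rejects the returned promise
  console.error('Crawl aborted:', error)
}
```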
### OnURL Type
```ts
type OnURL = (
  url: URL,
  metadata: { href: string; depth: number },
) => boolean | void | Promise<boolean | void>
```

- Return `false` to skip crawling `url`.
- All other return values include it in the crawl queue.
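For example, a callback of this type might restrict the crawl by both host and depth (a sketch using only the documented signature):

```js
import { walkURLs } from 'walk-urls'

const targetURL = new URL('https://example.com')

await walkURLs(targetURL, {
  onURL: (url, { depth }) => {
    // Skip external hosts and anything more than two levels deep
    if (url.hostname !== targetURL.hostname) return false
    if (depth > 2) return false
    // Returning nothing (undefined) lets the link be enqueued
  },
  onPage: (page) => console.log(page.depth, page.url.href),
})
```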
### Page Type
```ts
type Page = {
  title: string
  url: URL
  href: string
  content: string
  depth: number
  ok: boolean
  status: number
}
```

- `title`: `<title>` of the page.
- `url`: The final URL as a `URL` object.
- `href`: String form of the link from which we arrived here.
- `content`: Inner HTML (`page.content()`).
- `depth`: Depth relative to the starting URL.
- `ok`: `true` if the HTTP status was in the 200 range.
- `status`: Numeric HTTP status code.
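As an illustration, an `onPage` callback could collect these fields into a simple in-memory index (a sketch; the `index` map is not part of the library):

```js
import { walkURLs } from 'walk-urls'

const index = new Map()

await walkURLs(new URL('https://example.com'), {
  onPage: (page) => {
    // Ignore non-2xx responses
    if (!page.ok) return
    // Key by the final URL and keep a few fields
    index.set(page.url.href, {
      title: page.title,
      status: page.status,
      depth: page.depth,
      length: page.content.length,
    })
  },
})

console.log(`Indexed ${index.size} pages`)
```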
