icrawl

Crawl pages and generate HTML files corresponding to each URL path.

Features

  • With nginx, you can serve the crawled HTML to enable SEO for front-end rendered pages.
  • Built-in static server, so you can crawl pages directly from your build output folder
  • The HTML save path mirrors the URL path
  • Does not depend on any front-end framework
  • Provides both a Node API and a command-line interface

Examples

Node API

const path = require('path')
const Crawl = require('icrawl')

const crawl = new Crawl({
  requestTimeout: 10000,              // abort a page request after 10 seconds
  isNormalizeSourceURL: true,         // rewrite relative URLs to absolute ones in the saved HTML
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')  // directory where the crawled HTML is saved
})

crawl.start()

Configuration

.icrawlrc.js in your project root

const path = require('path')

module.exports = {
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',  
    'https://nodejs.org/api/url.html'
  ],
  path: path.resolve(__dirname, 'static')
}

package.json

"scripts": {
  "build": "icrawl"
}
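
If your pages need to be built before crawling, one option (a sketch; "webpack" here stands in for whatever your real build command is) is to let npm's postbuild hook run the crawl automatically after every build:

"scripts": {
  "build": "webpack",
  "postbuild": "icrawl"
}

With this setup, npm run build builds the site and then runs icrawl against the fresh output.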

options

  • options <Object>
    • viewport <Object> viewport size
      • width <Number>
      • height <Number>
    • maxPageCount <Number> Number of pages that can be opened in parallel, default: 10
    • isNormalizeSourceURL <Boolean | Object> Whether to convert the relative paths of images, anchors, links, and scripts to absolute paths in the crawled HTML. For example, when the crawled page's URL is http://www.example.com/example, /favicon.ico becomes http://www.example.com/favicon.ico. Each type can also be enabled individually. default: false
      • links <Boolean>
      • images <Boolean>
      • scripts <Boolean>
      • anchors <Boolean>
    • requestTimeout <Number> Number of milliseconds for request timeout, default: 30000ms, set to 0 to wait indefinitely
    • host <String> default: ''
    • routes <Array<String>> The list of routes to crawl; relative paths require the host option to be set
    • outputPath <String> Directory where the crawled HTML is saved
    • saveHTML <Boolean> Whether to save the crawled page as HTML, default: true
    • depth <Number | Object> Crawl depth. If page A is listed in routes (depth: 0), A contains a link to page B (depth: 1), and B contains a link to page C (depth: 2), then a depth of 2 crawls all three pages. default: 0
      • value <Number> page depth
      • include <RegExp> Only follow links matching this pattern, default: null
      • exclude <RegExp> Skip links matching this pattern, default: null
      • after <Function(Array<PageRoute>)> Callback invoked after page link collection completes, default: null
    • serverConfig <String | Object> If the pages to crawl are not already served, specify this option to start a local server. If it is a String, it is treated as the directory containing the pages. default: null (see the combined example after this list)
      • path <String> The directory containing the pages, for example your build output directory, so you can run icrawl after your build command or chain the two commands in scripts
      • port <Number> default: 3333
      • public <String> Required when isNormalizeSourceURL is also set to true; relative paths are resolved against this value
      • isFallback <Boolean> For SPAs, always rewrite the requested location to index.html
    • requestInterception <Object> Filter network requests; used judiciously this speeds up crawling. For example, there is usually no need to wait for images, CSS, fonts, or third-party scripts to load, since only the rendered HTML needs to be saved most of the time
      • include <RegExp>
      • exclude <RegExp>
    • progressBarStyle <Object> Progress bar style
      • prefix <String> default: ''
      • suffix <String> default: ''
      • remaining <String> default: '░'
      • completed <String> default: '█'
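
As a combined illustration, here is a sketch of a fuller .icrawlrc.js that exercises the depth, serverConfig, and requestInterception options described above. The directory names, port, and patterns are hypothetical; adjust them to your project.

const path = require('path')

module.exports = {
  host: 'http://localhost:3333',            // routes below are relative, so host is required
  routes: ['/'],                            // start crawling from the app root
  outputPath: path.resolve(__dirname, 'static'),
  depth: {
    value: 2,                               // follow links up to two levels deep from the routes
    include: /\/docs\//                     // only follow links under /docs/ (hypothetical pattern)
  },
  serverConfig: {
    path: path.resolve(__dirname, 'dist'),  // serve the build output locally (hypothetical dir)
    port: 3333,
    isFallback: true                        // SPA: fall back to index.html for unknown locations
  },
  requestInterception: {
    exclude: /\.(png|jpe?g|gif|woff2?|css)$/  // skip assets not needed for the rendered HTML
  }
}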

crawl.start()

return: Promise
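
Since start() returns a Promise, you can wait for the crawl to finish before running any follow-up step (a minimal sketch):

crawl.start()
  .then(() => console.log('crawl finished'))
  .catch((err) => console.error('crawl failed:', err))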

PageRoute

  • url <String> The URL of the page to crawl
  • root <PageRoute> The root PageRoute of the crawl chain
  • referer <PageRoute> The PageRoute that linked to this URL (its parent)
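
For example, since the depth.after callback documented above receives an Array<PageRoute>, you can use these fields to trace where each collected URL was discovered (a sketch; entry routes are assumed to have no referer):

depth: {
  value: 1,
  after: (pageRoutes) => {
    pageRoutes.forEach((route) => {
      const from = route.referer ? route.referer.url : '(entry route)'
      console.log(`${route.url} <- ${from}`)
    })
  }
}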

Tips

  • By configuring nginx to serve the crawled HTML, you can enable SEO for front-end rendered pages.
  • If you use nginx, you will need to install the set-misc-nginx-module, or install OpenResty directly.

License

MIT licensed.