wget-parser
v2.0.0
Parses the wget spider output into an object
Spider parser
Parses the spider output from wget into an object structure of links.
This object can then be processed further into a tree that reflects the hierarchy of a website, for example to implement sitemap generation.
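As a rough illustration of that further processing, the sketch below folds a list of links into a nested tree keyed by host and path segment. It is not part of this package: `buildTree` is a hypothetical helper, the `example.com` data is made up, and only the `url.host` and `url.pathname` fields from the parser output are assumed.

```javascript
// Made-up sample shaped like the parser output (only the fields used here).
var result = {
  links: [
    { url: { host: 'example.com', pathname: '/' } },
    { url: { host: 'example.com', pathname: '/docs/' } },
    { url: { host: 'example.com', pathname: '/docs/api/index.html' } },
    { url: { host: 'example.com', pathname: '/about.html' } }
  ],
  broken: []
};

// Hypothetical helper: nest links by host, then by each path segment.
function buildTree(links) {
  var root = {};
  links.forEach(function(link) {
    var host = link.url.host;
    var node = root[host] = root[host] || { children: {} };
    // Empty segments (leading and trailing slashes) are dropped.
    link.url.pathname.split('/').filter(Boolean).forEach(function(segment) {
      node = node.children[segment] = node.children[segment] || { children: {} };
    });
  });
  return root;
}

var tree = buildTree(result.links);
console.log(JSON.stringify(tree, null, 2));
```

A sitemap generator could then walk `tree` depth-first; keeping every node in the same `{ children: {} }` shape keeps that walk uniform.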
Tested using wget v1.15 on Linux.
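For orientation, the `line` values in the example output show that wget marks each request with a timestamped line followed by the URL. The sketch below turns one such line into an object with the same three fields. It is an assumption about how this could be done, not the package's actual implementation: `parseLine` and `REQUEST_LINE` are hypothetical names, and Node's legacy `url.parse` is used only because its field names match the example output.

```javascript
var url = require('url');

// Matches request lines such as:
//   --2016-02-10 16:11:57-- http://google.com/
var REQUEST_LINE = /^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)/;

// Hypothetical helper: returns a link object, or null for any other line.
function parseLine(line) {
  var match = REQUEST_LINE.exec(line);
  if (!match) {
    return null;
  }
  return { url: url.parse(match[1]), link: match[1], line: line };
}

var link = parseLine('--2016-02-10 16:11:57-- http://google.com/');
console.log(link.url.host, link.url.pathname);
```

Lines that do not match, such as wget's status messages, yield `null` and can simply be skipped by the caller.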
Usage
var parser = require('wget-parser')
  , buf = Buffer.alloc(0); // buffer should contain the spider output
console.dir(parser(buf));
parser.Parser: The parser class.
parser.Link: The class that represents a link.
parser.ParseStream: Parse stream class.
Stream support is available; see the test spec for example usage.
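The object returned by the parser carries a `links` array and a `broken` array (see the example output). A minimal sketch of consuming it in a script follows. It uses a hand-written sample object instead of real parser output, the `summarize` helper is hypothetical, and the exit-code rule simply mirrors the one described for the command line program.

```javascript
// Hand-written sample mirroring the shape of the parser result.
var result = {
  links: [
    { link: 'http://google.com/' },
    { link: 'http://www.google.co.id/?gws_rd=cr' }
  ],
  broken: []
};

// Hypothetical helper: collect the visited links and choose an exit code.
// Only the length of `broken` is inspected; the shape of its entries is
// not assumed here.
function summarize(result) {
  return {
    visited: result.links.map(function(item) { return item.link; }),
    exitCode: result.broken.length > 0 ? 1 : 0
  };
}

var summary = summarize(result);
console.log(summary.visited.join('\n'));
process.exitCode = summary.exitCode;
```

Setting `process.exitCode` rather than calling `process.exit()` lets pending output flush before the process ends.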
wget-parser
A program that reads from stdin and prints the result of the parse as JSON. It exits with code 1 if any broken links are found.
cat test/fixtures/mock.txt | wget-parser
cat test/fixtures/broken.txt | wget-parser; echo $?;
wget-spider
A program that performs a spider with wget and pipes the output to wget-parser:
wget-spider http://google.com
Output
Example output from the parser:
{
"links": [
{
"url": {
"protocol": "http:",
"slashes": true,
"auth": null,
"host": "google.com",
"port": null,
"hostname": "google.com",
"hash": null,
"search": null,
"query": null,
"pathname": "/",
"path": "/",
"href": "http://google.com/"
},
"link": "http://google.com/",
"line": "--2016-02-10 16:11:57-- http://google.com/"
},
{
"url": {
"protocol": "http:",
"slashes": true,
"auth": null,
"host": "www.google.co.id",
"port": null,
"hostname": "www.google.co.id",
"hash": null,
"search": "?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
"query": "gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
"pathname": "/",
"path": "/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
"href": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
},
"link": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
"line": "--2016-02-10 16:11:57-- http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
}
],
"broken": []
}
Developer
Test
To run the test suite:
npm test
Cover
To generate a code coverage report:
npm run cover
Lint
Run the source tree through jshint and jscs:
npm run lint
Clean
Remove generated files:
npm run clean
Readme
To build the readme file from the partial definitions:
npm run readme
Generated by mdp(1).
