robots-parse

v0.0.6

Published

2 years ago

A lightweight and simple robots.txt parser in node.


b4dnewz

cli-tool parser osint


Installation

npm install --save robots-parse

Usage

You can use the module to fetch and parse the robots.txt file of a domain, as in the example below:

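const robotsParse = require('robots-parse');

// Callback style: fetch and parse the robots.txt of the given domain
robotsParse('github.com', (err, res) => {
  if (err) return console.error('Error:', err);
  console.log('Result:', res);
});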

If no callback is specified, the function returns a promise instead:

const robotsParse = require('robots-parse');

(async () => {
  const res = await robotsParse('github.com');
  console.log('Result:', res);
})().catch(console.error);
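
For readers curious how one function can serve both calling styles, here is a minimal sketch of the pattern. It is not taken from the robots-parse source: `fetchRobots` and `scan` are hypothetical names, and the "fetch" is a stub that resolves immediately instead of touching the network.

```javascript
// Hypothetical stand-in for the real network fetch (an assumption for this sketch).
function fetchRobots(domain) {
  return Promise.resolve('User-agent: *\nDisallow: /private/');
}

// Accepts an optional Node-style callback; returns a promise when none is given.
function scan(domain, callback) {
  const promise = fetchRobots(domain).then((body) => ({ domain, body }));

  // No callback: hand the promise back to the caller.
  if (typeof callback !== 'function') {
    return promise;
  }

  // Callback given: route the outcome through it, error first.
  promise.then(
    (res) => callback(null, res),
    (err) => callback(err)
  );
  return undefined;
}

// Callback style
scan('example.com', (err, res) => {
  if (err) return console.error(err);
  console.log('callback:', res.domain);
});

// Promise style
scan('example.com').then((res) => console.log('promise:', res.domain));
```

Passing the success and error handlers as the two arguments of a single `then` call keeps the callback from being invoked twice if it throws.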

The built-in parser can also be used on its own to parse robots.txt content you already have, such as a local file or a string. The parser is synchronous, so no callback or promise is needed.

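// Assumes the `request` package is installed alongside robots-parse
const request = require('request');
const {parser} = require('robots-parse');

// Download the file yourself, then pass its content to the parser
request('https://google.com/robots.txt', (err, res, body) => {
  if (err) throw err;
  const object = parser(body);
  console.log(object);
});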

Parsing an existing local robots.txt file:

const fs = require('fs');
const {parser} = require('robots-parse');

// Read the file from disk and parse its content synchronously
const content = fs.readFileSync('./robots.txt', 'utf-8');
const object = parser(content);

console.log(object);

How it works

By default, the script fetches and parses the robots.txt file of the given website or domain and looks for the following rules:

Agents: A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent.
Host: Supported by Yandex (and not by Google, even though some posts say it is), this directive names the preferred host name, for example with or without the www prefix, that the search engine should show.
Allow: The allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Disallow: The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Sitemap: An absolute URL that points to a sitemap, a sitemap index file, or an equivalent URL.

If the robots.txt file is retrieved and parsed successfully, the result is an object containing the properties listed above. Each agent found carries its own allow and disallow rules; the same rules are also collected, without distinction by agent, in the root-level allow and disallow properties.
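
To make that shape concrete, here is a small self-contained sketch of a parser that produces a comparable structure. It is an illustration only, not the robots-parse implementation: the function name `parseRobots` and the property names `agents`, `sitemaps` and `host` are assumptions, and the real module may name or nest things differently.

```javascript
// Illustrative only: NOT the robots-parse source. Property names are assumptions.
function parseRobots(content) {
  const result = { agents: {}, allow: [], disallow: [], sitemaps: [], host: null };
  let current = [];         // agents that the following rules apply to
  let lastWasAgent = false; // consecutive User-agent lines form one group

  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue; // blank or malformed line

    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      if (!lastWasAgent) current = [];
      if (!result.agents[value]) result.agents[value] = { allow: [], disallow: [] };
      current.push(value);
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if ((field === 'allow' || field === 'disallow') && value) {
      // Store the rule per agent and in the root-level list.
      for (const agent of current) result.agents[agent][field].push(value);
      result[field].push(value);
    } else if (field === 'sitemap' && value) {
      result.sitemaps.push(value);
    } else if (field === 'host' && value) {
      result.host = value;
    }
    // Allow/Disallow lines with an empty path fall through and are ignored.
  }
  return result;
}

const sample = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/public.html',
  'Disallow:',
  '',
  'User-agent: Googlebot',
  'Disallow: /tmp/',
  'Sitemap: https://example.com/sitemap.xml',
  'Host: example.com',
].join('\n');

const parsed = parseRobots(sample);
console.log(parsed.agents['*'].disallow); // [ '/private/' ]
console.log(parsed.disallow);             // [ '/private/', '/tmp/' ]
```

The empty `Disallow:` line in the sample is skipped, matching the rule above that a directive without a path is ignored.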

You can read more about the robots.txt specification on its Google reference page.

Contributing

Create an issue and describe your idea
Fork the project (https://github.com/b4dnewz/robots-parse/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Write tests for your code (npm run test)
Publish the branch (git push origin my-new-feature)
Create a new Pull Request
