Tree-sitter-robots-txt
This is a tree-sitter grammar for parsing robots.txt files.
A robots.txt file is a text file that instructs web robots (often called crawlers or spiders) how to interact with pages on a website. The basic syntax and rules for a robots.txt file are:
- `User-agent` line: Specifies the robot(s) to which the rules apply.
  - `User-agent: *` - Applies to all robots.
  - `User-agent: Googlebot` - Applies specifically to Google's crawler.
- `Disallow` line: Specifies the files or directories that the specified robot(s) should not crawl.
  - `Disallow: /directory/` - Disallows crawling of the specified directory.
  - `Disallow: /file.html` - Disallows crawling of the specific file.
  - `Disallow: /` - Disallows crawling of the entire site.
- `Allow` line (optional): Overrides a disallow rule for a specific file or directory.
  - `Allow: /directory/file.html` - Allows crawling of a specific file within a disallowed directory.
- `Crawl-delay` line (optional): Specifies the delay in seconds between successive requests to the site.
  - `Crawl-delay: 10` - Sets a 10-second delay between requests.
- `Sitemap` line (optional): Directs robots to the location of the XML sitemap(s) for the website.
  - `Sitemap: https://www.example.com/sitemap.xml` - Specifies the location of the XML sitemap.
- Comments: Lines beginning with `#` are comments and are ignored by robots. They can be used to annotate the file for humans.
Example robots.txt file:
```
User-agent: *
Disallow: /admin/
Disallow: /private.html
Allow: /public.html
Crawl-delay: 5
Sitemap: https://www.example.com/sitemap.xml
# This is a comment explaining the robots.txt file.
```

Notes:
- Wildcards (`*`) can be used in `Disallow` directives, e.g., `Disallow: /*.pdf` to block all PDF files.
- Each directive (`User-agent`, `Disallow`, `Allow`, `Crawl-delay`, `Sitemap`) should be on a separate line.
- Multiple rules can be specified for different user agents or directories by repeating `User-agent` and the subsequent directives (see the example after this list).
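
For example, the following file (with illustrative paths) applies one group of rules to Googlebot and a stricter wildcard rule to all other robots:

```
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /*.pdf
Crawl-delay: 10
```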
It's important to note that while robots.txt files provide guidance to well-behaved crawlers, malicious or poorly programmed crawlers may ignore these instructions. Therefore, they are primarily used for managing how legitimate search engines and web crawlers interact with a website.
Features
- [x] Directives (`User-agent`, `Disallow`, `Allow`, `Crawl-delay`, `Sitemap`, `Host`)
- [x] Comments (`# comment`)
- [x] Unknown directives (`X-Robots-Tag`)
Developing
How to run & test:
```sh
npm install
npm run test
```
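
To try the parser on real input, the grammar can also be loaded from Node through the tree-sitter bindings. The snippet below is a minimal sketch; it assumes the package exposes a Node binding under its published name (`tree-sitter-robots_txt`) and that the `tree-sitter` package is installed alongside it. The exact node names in the printed tree depend on the grammar's rules.

```js
const Parser = require('tree-sitter');
// Assumption: the published package exposes a Node binding under this name.
const RobotsTxt = require('tree-sitter-robots_txt');

const parser = new Parser();
parser.setLanguage(RobotsTxt);

const source = [
  'User-agent: *',
  'Disallow: /admin/',
  '# internal area, keep crawlers out',
  '',
].join('\n');

const tree = parser.parse(source);
// Print the parse tree as an S-expression to inspect its node structure.
console.log(tree.rootNode.toString());
```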