cleaner-adblock
v2.0.11
A minimal domain scanner for ad-blocking filter lists. Finds dead domains and redirecting domains using headless browser automation.
Minimal Domain Scanner
A Node.js tool that scans adblock filter lists to identify dead domains and redirecting domains, helping maintain clean and efficient filter lists.
Overview
This tool parses adblock filter lists, checks the status of domains found in various rule types, and categorizes them into:
- Dead domains - Domains that don't resolve or return errors (should be removed)
- Redirecting domains - Domains that redirect to different domains (should be reviewed)
Features
- Multiple Rule Format Support: Handles uBlock Origin, AdGuard, and network rules
- Simple Domain Lists: Can also parse plain domain lists (one per line)
- Concurrent Processing: Checks multiple domains simultaneously for speed
- Smart Domain Variants: Optionally checks both domain.com and www.domain.com
- Similar Domain Filtering: Can ignore redirects to subdomains of the same base domain
- DNS Verification: Optionally verify dead domains with DNS lookups
- Ping Verification: Optionally verify dead domains with ping checks
- Export Cleaned Lists: Generate filter lists with dead domains removed
- Hosts File Support: Parse hosts file format (0.0.0.0/127.0.0.1 entries)
- Config File Support: Per-project and per-file settings via .cleanerconfig
- Comprehensive Error Handling: Detects DNS failures, timeouts, HTTP errors
- Graceful Shutdown: Clean browser cleanup on Ctrl+C
- Debug Modes: Various debug levels for troubleshooting
- Test Mode: Quick testing on a subset of domains
Installation
Prerequisites
- Node.js (v20 or higher recommended)
- npm (comes with Node.js)
Setup
# Clone or download the repository
git clone git@github.com:ryanbr/cleaner-adblock.git
cd cleaner-adblock
# Install dependencies
npm install puppeteer
Usage
Basic Usage
node cleaner-adblock.js <file>
This will scan the specified file and generate two timestamped output files.
Command-Line Options
node cleaner-adblock.js <file> [options]
Input Options
- --input=<file> - Specify input file to scan (alternative to positional arg)
- --simple-domains - Parse input as plain domain list (one per line) instead of filter rules
- --localhost - Parse hosts file format (0.0.0.0/127.0.0.1 domain)
Domain Checking Options
- --add-www - Check both domain.com and www.domain.com for bare domains
- --ignore-similar - Ignore redirects to subdomains of same base domain
- --check-dig - Verify dead domains with DNS lookup
- --check-dig-always - Only report domains with no DNS A records
- --check-ping - Verify dead domains with ping (checks both bare and www variants)
Output Options
- --export-list - Export cleaned filter list (removes dead domains)
- --remove-redirects - Always remove redirected domains from exported list
Performance Options
- --concurrency=N - Number of concurrent checks (1-50, default: 12)
- --disable-block-resources - Allow images/CSS/fonts to load (slower scans)
- --quick-disconnect - Disconnect once status is determined (faster scans)
Debug Options
- --debug - Enable basic debug output
- --debug-verbose - Enable verbose debug output
- --debug-network - Log network requests/responses
- --debug-browser - Log browser events
- --debug-all - Enable all debug options
Testing Options
- --test-mode - Only test first 5 domains (quick testing)
- --test-count=N - Only test first N domains
Display Options
- --color, --colour - Enable colored output
- --use-config=<file> - Use a custom config file instead of .cleanerconfig
Help
- --help or -h - Show help message
Examples
# Scan a filter list
node cleaner-adblock.js my_rules.txt
# Scan with input flag
node cleaner-adblock.js --input=my_rules.txt
# Scan a simple list of domains (one per line)
node cleaner-adblock.js domains.txt --simple-domains
# Scan a hosts file
node cleaner-adblock.js hosts.txt --localhost
# Check both domain.com and www.domain.com variants
node cleaner-adblock.js my_rules.txt --add-www
# Ignore subdomain redirects (reduces noise)
node cleaner-adblock.js my_rules.txt --ignore-similar
# Combine options
node cleaner-adblock.js my_rules.txt --add-www --ignore-similar
# Verify dead domains with ping before confirming
node cleaner-adblock.js my_rules.txt --check-ping
# Only report domains with no DNS records
node cleaner-adblock.js my_rules.txt --check-dig-always
# Export a cleaned filter list
node cleaner-adblock.js my_rules.txt --export-list
# Increase concurrency for faster scans
node cleaner-adblock.js my_rules.txt --concurrency=20
# Faster scans with quick disconnect
node cleaner-adblock.js my_rules.txt --quick-disconnect
# Debug mode for troubleshooting
node cleaner-adblock.js my_rules.txt --debug --test-mode
# Test first 10 domains with full debugging
node cleaner-adblock.js my_rules.txt --debug-all --test-count=10
Supported Rule Types
uBlock Origin / Cosmetic Rules
domain.com##.selector # Element hiding
domain.com##+js(scriptlet) # Scriptlet injection
domain.com#@#.selector # Exception rule
AdGuard Rules
domain.com##selector # Element hiding
domain.com#@#selector # Exception
domain.com#$#selector # CSS injection
domain.com#%#//scriptlet(...) # Scriptlet
domain.com#?#selector # Extended CSS
domain.com#@$?#selector # Extended CSS exception
domain1.com,domain2.com##selector # Multiple domains
Network Rules
/path$script,domain=example.com
||domain.com^$script,domain=site1.com|site2.com
Extracts domains from the domain= parameter.
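To illustrate the rule types listed in this section, here is a minimal sketch of how domains might be pulled out of cosmetic and network rules. The names `extractDomains`, `cleanDomains`, and `COSMETIC_SEPARATOR` are hypothetical and are not taken from cleaner-adblock.js; the tool's own parser may cover more cases.

```javascript
// Hypothetical sketch, not the tool's actual parser.
// Matches the separators shown above: ##, #@#, #$#, #%#, #?#, #@$?#
const COSMETIC_SEPARATOR = /#(?:@\$\?|@|\$|%|\?)?#/;

// Trim, lowercase, and de-duplicate a list of raw domain strings.
function cleanDomains(parts) {
  const seen = new Set();
  for (const part of parts) {
    const domain = part.trim().toLowerCase();
    if (domain) seen.add(domain);
  }
  return [...seen];
}

function extractDomains(line) {
  const rule = line.trim();
  if (rule === '' || rule.startsWith('!')) return []; // blank line or comment

  const cosmetic = rule.match(COSMETIC_SEPARATOR);
  if (cosmetic && cosmetic.index > 0) {
    // Cosmetic rule: domains are the comma-separated prefix before the separator.
    return cleanDomains(rule.slice(0, cosmetic.index).split(','));
  }

  // Network rule: read the pipe-separated list from the domain= option.
  const option = rule.match(/\$(?:.*,)?domain=([^,]+)/);
  return option ? cleanDomains(option[1].split('|')) : [];
}

console.log(extractDomains('||domain.com^$script,domain=site1.com|site2.com'));
```

A generic cosmetic rule with no domain prefix (for example `##.selector`) yields an empty list in this sketch, since there is nothing to check.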
Simple Domain Lists
When using --simple-domains:
example.com
another-domain.org
# Comments are ignored
domain1.com, domain2.com, domain3.net
Output Files
Output files are timestamped to avoid overwriting previous scans.
dead_domains_TIMESTAMP.txt
Contains domains that should be removed from filter lists:
- HTTP 404, 410, 5xx errors
- DNS resolution failures
- Connection timeouts
- Network errors
Format:
# Dead/Non-Existent Domains
# Scanned file: my_rules.txt
# Generated: 2025-11-08T10:30:00.000Z
# Total found: 15
example-dead.com # ERR_NAME_NOT_RESOLVED
old-site.net # HTTP 404
timeout-site.org # Navigation timeout of 25000ms exceeded
redirect_domains_TIMESTAMP.txt
Contains domains that redirect to different domains (review for potential rule updates):
Format:
# Redirecting Domains
# Scanned file: my_rules.txt
# Generated: 2025-11-08T10:30:00.000Z
# Total found: 8
old-domain.com → new-domain.com # https://new-domain.com/
example.org → example.com # https://example.com/
How It Works
- Parse Input File: Extracts unique domains from various filter rule formats
- Validate Domains: Filters out .onion domains, IP addresses, and localhost
- Expand Variants: Optionally creates domain variants with/without www
- Browser-Based Checking: Uses Puppeteer to:
  - Navigate to each domain
  - Follow redirects
  - Detect DNS failures
  - Handle HTTP errors
  - Capture timeouts
- DNS Verification: Optionally verifies dead domains with dig
- Ping Verification: Optionally verifies dead domains with ping
- Categorize Results: Separates dead domains from redirecting domains
- Generate Reports: Creates organized output files with explanations
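The categorization step can be sketched as a small pure function that takes the outcome of one browser navigation and decides which report it belongs in. The function name `classifyResult` and the shape of its input object are assumptions made for illustration; only the dead criteria (navigation errors, HTTP 404, 410, 5xx) and the redirect criterion (final host differs from the requested domain) come from this README.

```javascript
// Hypothetical sketch of the "Categorize Results" step.
// Input shape ({ domain, finalUrl, status, error }) is an assumption.
function isDeadStatus(status) {
  return status === 404 || status === 410 || (status >= 500 && status <= 599);
}

function classifyResult({ domain, finalUrl, status, error }) {
  // Navigation failed outright (DNS failure, timeout, connection error).
  if (error) return { category: 'dead', reason: error };

  // The server answered, but with a status the README lists as dead.
  if (isDeadStatus(status)) return { category: 'dead', reason: `HTTP ${status}` };

  // The page loaded; compare where we ended up with what we asked for.
  const finalHost = new URL(finalUrl).hostname.toLowerCase();
  if (finalHost !== domain.toLowerCase()) {
    return { category: 'redirect', target: finalHost, reason: finalUrl };
  }
  return { category: 'active' };
}

console.log(classifyResult({
  domain: 'old-domain.com',
  finalUrl: 'https://new-domain.com/',
  status: 200,
}));
```

This sketch treats any host change as a redirect, including a move to a subdomain of the same site; the --ignore-similar option described in this README exists to filter out that kind of result.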
Configuration
Config File (.cleanerconfig)
Create a .cleanerconfig file in your working directory to set defaults per project and per file:
{
  "concurrency": 12,
  "color": true,
  "ignoredDomains": [
    "cloudfront.net",
    "fastly.net",
    "googlesyndication.com"
  ],
  "files": {
    "easylist_specific_hide.txt": {
      "addWww": true,
      "checkPing": true,
      "ignoreSimilar": true,
      "concurrency": 20,
      "ignoredDomains": [
        "specific-to-this-list.net"
      ]
    }
  }
}
Priority order: CLI flags > per-file config > global config > defaults
Available config options: concurrency, addWww, ignoreSimilar, checkDig, checkDigAlways, checkPing, blockResources, exportList, removeRedirects, quickDisconnect, localhost, color
Per-file ignoredDomains are merged with global ignoredDomains (not replaced).
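The priority order and the ignoredDomains merge rule can be illustrated with a short sketch. The function name `resolveSettings` and the `DEFAULTS` object are hypothetical; apart from the documented concurrency default of 12, the default values shown are assumptions and may differ from the tool's own.

```javascript
// Hypothetical sketch of: CLI flags > per-file config > global config > defaults.
// Only concurrency: 12 is documented; the other defaults are assumptions.
const DEFAULTS = { concurrency: 12, addWww: false, ignoreSimilar: false, checkPing: false };

function resolveSettings(config, fileName, cliFlags) {
  const perFile = (config.files && config.files[fileName]) || {};

  // Separate the special keys from the plain option keys.
  const { files, ignoredDomains: globalIgnored = [], ...globalOptions } = config;
  const { ignoredDomains: fileIgnored = [], ...fileOptions } = perFile;

  return {
    ...DEFAULTS,      // lowest priority
    ...globalOptions,
    ...fileOptions,
    ...cliFlags,      // highest priority
    // Per-file ignored domains extend the global list instead of replacing it.
    ignoredDomains: [...new Set([...globalIgnored, ...fileIgnored])],
  };
}

const exampleConfig = {
  concurrency: 12,
  ignoredDomains: ['cloudfront.net'],
  files: { 'easylist_specific_hide.txt': { concurrency: 20, ignoredDomains: ['specific-to-this-list.net'] } },
};
console.log(resolveSettings(exampleConfig, 'easylist_specific_hide.txt', { concurrency: 6 }));
```

Later spreads overwrite earlier ones, so the order of the spread expressions is what encodes the priority.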
Default Settings
const TIMEOUT = 25000; // Page load timeout (25 seconds)
const FORCE_CLOSE_TIMEOUT = 60000; // Force-close timeout (60 seconds)
const CONCURRENCY = 12; // Concurrent domain checks (use --concurrency=N)
Special Features
--simple-domains Behavior
Parses input as a plain domain list instead of filter rules:
- One domain per line
- Supports comma-separated domains
- Ignores lines starting with #, !, or //
- Automatically strips protocols and paths
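The parsing rules above are simple enough to show in a few lines. This is a minimal sketch under the assumption that the input is already loaded as a string; the function name `parseSimpleDomains` is hypothetical and the tool's own implementation may differ in details such as case handling.

```javascript
// Hypothetical sketch of --simple-domains parsing.
function parseSimpleDomains(text) {
  const domains = new Set();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and comment lines starting with #, !, or //
    if (line === '' || /^(#|!|\/\/)/.test(line)) continue;

    for (const entry of line.split(',')) {
      const host = entry
        .trim()
        .replace(/^[a-z][a-z0-9+.-]*:\/\//i, '') // strip a leading protocol
        .replace(/[/?#].*$/, '')                  // strip path, query, fragment
        .toLowerCase();                           // lowercasing is an assumption
      if (host) domains.add(host);
    }
  }
  return [...domains];
}

console.log(parseSimpleDomains('example.com\n# Comments are ignored\ndomain1.com, domain2.com'));
```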
--check-dig and --check-dig-always Behavior
- --check-dig - Adds DNS A record info to dead domain output
- --check-dig-always - Filters dead domains to only include those with no DNS A records. Useful for confirming domains are truly dead vs temporarily unavailable.
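The --check-dig-always filter can be approximated as follows. Note that this sketch swaps in Node's built-in `dns.promises.resolve4` for the dig command the tool uses, so it shows the idea and not the tool's mechanism. The names `hasARecord` and `keepDomainsWithoutARecords` are hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```javascript
const dns = require('node:dns');

// Hypothetical sketch: true if the domain has at least one DNS A record.
// Uses Node's resolver instead of dig (an assumption made for illustration).
async function hasARecord(domain, resolve4 = dns.promises.resolve4) {
  try {
    const addresses = await resolve4(domain);
    return addresses.length > 0;
  } catch {
    // Any lookup failure (ENOTFOUND, ENODATA, timeout) counts as "no A record" here.
    return false;
  }
}

// Keep only the dead domains that also have no A record.
async function keepDomainsWithoutARecords(deadDomains, resolve4) {
  const checks = await Promise.all(
    deadDomains.map((domain) => hasARecord(domain, resolve4))
  );
  return deadDomains.filter((_, index) => !checks[index]);
}
```

Treating every lookup failure as "no record" is a simplification; a stricter version might retry on timeouts before deciding.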
--add-www Behavior
- domain.com → checks both domain.com AND www.domain.com
  - If either works, domain is marked as active
- sub.domain.com → only checks sub.domain.com (no www added)
- www.domain.com → only checks www.domain.com (already has www)
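The three cases above can be sketched as a single function. The name `expandVariants` is hypothetical, and the "bare domain" test used here (exactly two labels) is a simplifying assumption: it does not consult a public suffix list, so a domain such as example.co.uk would be treated like a subdomain.

```javascript
// Hypothetical sketch of --add-www variant expansion.
// Assumption: a "bare" domain is one with exactly two labels (no public suffix list).
function expandVariants(domain) {
  const host = domain.toLowerCase();
  const isBare = host.split('.').length === 2;
  return isBare ? [host, `www.${host}`] : [host];
}

console.log(expandVariants('domain.com'));     // both variants
console.log(expandVariants('sub.domain.com')); // unchanged
```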
--ignore-similar Behavior
Reduces noise from internal subdomain redirects:
- example.com → sub.example.com (ignored - same base domain)
- example.com → different.com (flagged - different domain)
Useful for sites that redirect to CDN or regional subdomains.
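A rough sketch of the same-base-domain comparison follows. The names `baseDomain` and `isSimilarRedirect` are hypothetical, and taking the last two labels as the base domain is an assumption that breaks for multi-part suffixes such as co.uk; the tool may use a more careful method.

```javascript
// Hypothetical sketch of the --ignore-similar comparison.
// Assumption: the base domain is the last two labels of the hostname.
function baseDomain(hostname) {
  return hostname.toLowerCase().split('.').slice(-2).join('.');
}

// True when the redirect stays within the same base domain and can be ignored.
function isSimilarRedirect(fromDomain, finalUrl) {
  const targetHost = new URL(finalUrl).hostname;
  return baseDomain(fromDomain) === baseDomain(targetHost);
}

console.log(isSimilarRedirect('example.com', 'https://sub.example.com/'));
console.log(isSimilarRedirect('example.com', 'https://different.com/'));
```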
--check-ping Behavior
Verifies dead domains with ICMP ping before confirming them as dead:
- Tries both domain.com and www.domain.com (1 packet, 3s timeout each)
- If either responds to ping, the domain is removed from the dead list
- Skips domains that returned HTTP errors (404/5xx) since they clearly have a running server
- Redirecting domains are moved to the dead list without pinging (they already responded)
- Runs after the browser scan in batches of 10 concurrent pings
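The batching behaviour described above might look roughly like the sketch below. Several things here are assumptions: the names `systemPing` and `confirmDeadWithPing`, the `{ domain, reason }` entry shape, the Linux-style ping flags (`-c 1 -W 3`), and the choice to keep HTTP-error entries on the dead list without pinging them. The sketch also batches by domain, so one batch of 10 domains issues up to 20 pings. Redirect handling is left out.

```javascript
const { execFile } = require('node:child_process');

// Assumption: Linux-style flags (-c packet count, -W timeout in seconds).
// Other platforms use different flags.
function systemPing(host) {
  return new Promise((resolve) => {
    execFile('ping', ['-c', '1', '-W', '3', host], (error) => resolve(!error));
  });
}

// Hypothetical sketch: returns the entries that are still considered dead.
async function confirmDeadWithPing(deadEntries, ping = systemPing, batchSize = 10) {
  const confirmed = [];
  for (let start = 0; start < deadEntries.length; start += batchSize) {
    const batch = deadEntries.slice(start, start + batchSize);
    const responded = await Promise.all(batch.map(async ({ domain, reason }) => {
      // HTTP errors mean a server answered, so pinging adds nothing.
      if (/^HTTP \d/.test(reason)) return false;
      const bare = domain.replace(/^www\./, '');
      const replies = await Promise.all([ping(bare), ping(`www.${bare}`)]);
      return replies.some(Boolean);
    }));
    batch.forEach((entry, index) => {
      if (!responded[index]) confirmed.push(entry);
    });
  }
  return confirmed;
}
```

Passing the ping function as a parameter keeps the batching logic testable without sending real ICMP packets.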
--localhost Behavior
Parses hosts file format:
- Matches lines like 0.0.0.0 domain.com or 127.0.0.1 domain.com
- Skips localhost entries
- Ignores comments starting with # or !
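The hosts-file rules above can be sketched like this. The function name `parseHostsFile` is hypothetical, and the set of names treated as "localhost entries" is an assumption; the tool may skip a longer list.

```javascript
// Hypothetical sketch of --localhost parsing.
const HOSTS_LINE = /^(?:0\.0\.0\.0|127\.0\.0\.1)\s+(\S+)/;

// Assumption: which hostnames count as "localhost entries".
const LOCAL_NAMES = new Set(['localhost', 'localhost.localdomain']);

function parseHostsFile(text) {
  const domains = new Set();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#') || line.startsWith('!')) continue;

    const match = line.match(HOSTS_LINE);
    if (!match) continue; // not a 0.0.0.0 / 127.0.0.1 entry

    const domain = match[1].toLowerCase();
    if (!LOCAL_NAMES.has(domain)) domains.add(domain);
  }
  return [...domains];
}

console.log(parseHostsFile('127.0.0.1 localhost\n0.0.0.0 ads.example.com'));
```

Because the pattern captures only the first hostname after the address, a trailing inline comment on the same line is dropped.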
--export-list Behavior
Generates a cleaned version of the original filter list with dead domains removed. The cleaned list is saved with a _cleaned_TIMESTAMP suffix.
Error Handling
The tool handles various error scenarios:
- DNS failures (ERR_NAME_NOT_RESOLVED)
- Connection errors (ERR_CONNECTION_REFUSED, ERR_CONNECTION_TIMED_OUT)
- HTTP status codes (404, 410, 5xx)
- SSL/Certificate errors (automatically ignored)
- Page load timeouts
- Navigation errors
Troubleshooting
Issue: "Cannot find module 'puppeteer'"
npm install puppeteer
Issue: Browser fails to launch
Try adding more Puppeteer args in the code:
args: [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage'
]
Issue: Too many timeouts
Increase the timeout value in the code:
const TIMEOUT = 35000; // 35 seconds
Issue: Running out of memory
Reduce concurrency:
node cleaner-adblock.js my_rules.txt --concurrency=6
Performance Tips
- Use --test-mode first to verify everything works
- Adjust concurrency with --concurrency=N based on your system resources
- Use --quick-disconnect for faster scans when you only need status codes
- Use --ignore-similar to reduce false positives
- Use default resource blocking (don't use --disable-block-resources) for faster scans
- Use .cleanerconfig to save per-file settings and avoid repeating flags
- Monitor system resources during large scans
- Consider splitting very large filter lists
Use Cases
- Filter List Maintenance: Identify outdated domains in adblock lists
- List Optimization: Remove dead domains to reduce list size
- Rule Updates: Find domains that need rule updates due to redirects
- Quality Assurance: Validate filter lists before distribution
- Domain Research: Analyze domain status across multiple filter lists
License
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Acknowledgments
Built with Puppeteer for reliable browser automation and domain checking.
Support
For issues, questions, or suggestions, please open an issue on GitHub.
