@blackridder22/google-news-scraper

v0.0.1

Published a month ago

Lightweight async scraper for Google News

Vulnerabilities: 0 High, 0 Medium, 0 Low

blackridder22

news scraper google google news news scraper crawler web crawler news crawler google crawler

Google News Scraper 📰

A lightweight, asynchronous scraper for Google News that retrieves articles, resolves redirects, and ensures clean data.

Features

🔍 Search & Topics: Scrape by search term or topic URL.
🔗 Smart Redirect Resolution: Automatically resolves Google's "ugly" tracking URLs to the direct publisher links.
🖼️ High-Quality Images: Extracts high-resolution images (og:image) from the source article, replacing low-quality Google thumbnails.
🧹 Auto-Filtering: Optional strict filtering to ensure you only get data with resolved URLs and clean images.
⏱️ Timeframe Support: Filter news by age in hours, days, or years (e.g., 1h, 7d, 1y).

Installation

npm install @blackridder22/google-news-scraper

Quick Start

const googleNewsScraper = require('@blackridder22/google-news-scraper');

(async () => {
    const articles = await googleNewsScraper({
        searchTerm: "Artificial Intelligence",
        prettyURLs: true,
        timeframe: "1d",
        filter: true, // Only return articles with resolved links and images
        puppeteerArgs: ['--no-sandbox']
    });

    console.log(articles);
})();
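
Scraping depends on a headless browser and on a remote site, so calls can fail at runtime. The following is a minimal sketch of a defensive wrapper, not part of the package: `scrapeSafely` is a hypothetical helper name, and the scraper function is passed in as an argument so the wrapper can be tried with a stub.

```javascript
// Hypothetical wrapper (not provided by the package): `scrape` is the function
// exported by the scraper, passed in so the wrapper can be tested with a stub.
async function scrapeSafely(scrape, options) {
    try {
        const articles = await scrape(options);
        // Guard against a non-array result so callers can always iterate.
        return Array.isArray(articles) ? articles : [];
    } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        console.error(`Scrape failed: ${reason}`);
        return [];
    }
}

// Usage (assumes the package is installed):
// const googleNewsScraper = require('@blackridder22/google-news-scraper');
// scrapeSafely(googleNewsScraper, { searchTerm: "Artificial Intelligence", timeframe: "1d" })
//     .then((articles) => console.log(articles.length));
```

Returning an empty array on failure keeps downstream code simple, at the cost of hiding the error from the caller; rethrowing may be preferable in a pipeline that needs to retry.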

Configuration Options

The function accepts a configuration object with the following properties:

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| searchTerm | string | null | The search query (e.g., "Crypto"). |
| baseUrl | string | ... | Alternate base URL (e.g., for specific topic pages). |
| prettyURLs | boolean | true | Resolve Google redirects to actual publisher URLs. |
| filter | boolean | false | New! If true, removes any article where the URL or Image could not be resolved (i.e., still points to news.google.com). |
| timeframe | string | 7d | Filter by age: h (hours), d (days), y (years). Example: 12h. |
| puppeteerArgs | array | [] | Additional flags for Puppeteer (e.g., ['--no-sandbox']). |
| limit | number | null | Limit the number of results returned. |
| getArticleContent | boolean | false | Experimental: Attempts to fetch full article text (slow). |
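
To make the defaults above concrete, here is a small sketch that merges caller overrides with the documented defaults and checks the `timeframe` string before a scrape is started. `buildOptions`, `DEFAULTS`, and `TIMEFRAME_PATTERN` are hypothetical names, and the accepted timeframe format (a positive integer followed by `h`, `d`, or `y`) is an assumption inferred from the examples, not a rule confirmed by the package.

```javascript
// Defaults copied from the options table. baseUrl is left out because its
// default value is not spelled out there.
const DEFAULTS = {
    searchTerm: null,
    prettyURLs: true,
    filter: false,
    timeframe: '7d',
    puppeteerArgs: [],
    limit: null,
    getArticleContent: false,
};

// Assumed format: a positive integer followed by h, d, or y (e.g. "12h").
const TIMEFRAME_PATTERN = /^[1-9]\d*[hdy]$/;

function buildOptions(overrides = {}) {
    const options = { ...DEFAULTS, ...overrides };
    // Copy the array so callers never share the defaults' instance.
    options.puppeteerArgs = [...options.puppeteerArgs];

    if (!TIMEFRAME_PATTERN.test(options.timeframe)) {
        throw new RangeError(`Invalid timeframe: ${options.timeframe}`);
    }
    if (options.limit !== null && (!Number.isInteger(options.limit) || options.limit < 1)) {
        throw new RangeError(`Invalid limit: ${options.limit}`);
    }
    return options;
}

console.log(buildOptions({ searchTerm: 'Crypto', timeframe: '12h', limit: 10 }));
```

The resulting object can be passed straight to the scraper function; validating up front avoids launching a browser for a request that has a malformed option.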

Output Format

Returns an array of article objects:

[
  {
    "title": "Example News Title",
    "link": "https://www.nytimes.com/...",
    "image": "https://www.nytimes.com/images/...",
    "source": "New York Times",
    "datetime": "2025-12-22T10:00:00.000Z",
    "time": "2 hours ago",
    "articleType": "regular"
  }
]
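
As an illustration of working with this shape, the sketch below sorts results newest first using the ISO `datetime` field and groups them by `source`. The sample data is made up, and `sortNewestFirst` and `groupBySource` are hypothetical helpers, not functions exported by the package.

```javascript
// Sample data in the documented output shape (values are invented).
const articles = [
    { title: 'A', source: 'New York Times', datetime: '2025-12-22T10:00:00.000Z' },
    { title: 'B', source: 'Reuters', datetime: '2025-12-22T11:30:00.000Z' },
    { title: 'C', source: 'New York Times', datetime: '2025-12-21T08:00:00.000Z' },
];

// Returns a new array ordered from most recent to oldest.
function sortNewestFirst(list) {
    return [...list].sort((a, b) => Date.parse(b.datetime) - Date.parse(a.datetime));
}

// Returns a Map from source name to the articles published by that source.
function groupBySource(list) {
    const groups = new Map();
    for (const article of list) {
        const bucket = groups.get(article.source) ?? [];
        bucket.push(article);
        groups.set(article.source, bucket);
    }
    return groups;
}

const sorted = sortNewestFirst(articles);
console.log(sorted.map((a) => a.title)); // [ 'B', 'A', 'C' ]
console.log(groupBySource(sorted).get('New York Times').length); // 2
```

Sorting on `datetime` rather than the human-readable `time` field avoids having to parse strings such as "2 hours ago".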

Why use `filter: true`?

Google News provides "tracking" URLs and internal thumbnail images (news.google.com/api/attachments/...).

Without Filter: You get 100% of results, but some may have ugly URLs or protected images.
With Filter: The scraper verifies everything. If it can't resolve the redirect or find a high-quality og:image on the publisher's site, it drops that result. You get fewer results, but they are guaranteed to be "clean".
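
For readers who prefer to keep every result and decide later, a similar check can be approximated on the client side. This is a sketch under the assumption that "unresolved" means the `link` or `image` host is still news.google.com; it is not the package's own filtering code, and `isClean` and `pointsToGoogleNews` are hypothetical names.

```javascript
// Assumption: an unresolved link or image still lives on this host.
const GOOGLE_NEWS_HOST = 'news.google.com';

function pointsToGoogleNews(value) {
    try {
        return new URL(value).hostname === GOOGLE_NEWS_HOST;
    } catch {
        // Missing or malformed URLs are treated as unresolved.
        return true;
    }
}

function isClean(article) {
    return !pointsToGoogleNews(article.link) && !pointsToGoogleNews(article.image);
}

const results = [
    { title: 'Resolved', link: 'https://www.nytimes.com/a', image: 'https://www.nytimes.com/images/a.jpg' },
    { title: 'Unresolved', link: 'https://news.google.com/rss/articles/abc', image: 'https://www.nytimes.com/images/b.jpg' },
];

console.log(results.filter(isClean).map((a) => a.title)); // [ 'Resolved' ]
```

Running with `filter: false` and applying a check like this afterwards makes it possible to log or count the dropped items, which the built-in option does not expose according to this README.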

License

MIT
