
osint-feed

v0.1.0

Config-driven news harvester for Node.js. Pulls articles from RSS feeds and HTML pages, deduplicates them, and produces a compact digest ready to feed into an LLM context window.

No AI inside. No opinions about your stack. Just articles in, structured data out.

Use RSS when you have it. Use HTML selectors when you do not. Filtering for your specific topic belongs in the app that consumes this library.

Why

You're building something that needs fresh news context — a SITREP generator, a threat monitor, a research assistant. You have 30+ sources across languages and formats. You need the data compact enough to fit in a Llama/GPT context window without blowing the budget.

Existing tools are either Python-only (newspaper4k), heavy self-hosted platforms (Huginn), or commercial APIs (Newscatcher, NewsAPI). Nothing in the JS/TS ecosystem does config-driven multi-source harvesting with built-in LLM-ready compression.

osint-feed fills that gap.

Install

npm install osint-feed

Requires Node.js 18+.

Quick Start

import { createHarvester } from "osint-feed";

const harvester = createHarvester({
  sources: [
    {
      id: "bbc-world",
      name: "BBC World",
      type: "rss",
      url: "https://feeds.bbci.co.uk/news/world/rss.xml",
      tags: ["global", "uk"],
      interval: 15,
    },
    {
      id: "nato",
      name: "NATO Newsroom",
      type: "html",
      url: "https://www.nato.int/cps/en/natohq/news.htm",
      tags: ["nato"],
      interval: 30,
      selectors: {
        article: ".event-list-item",
        title: "a span:first-child",
        link: "a",
        date: ".event-date",
      },
    },
  ],
});

// Fetch everything
const articles = await harvester.fetchAll();

// Or get an LLM-ready digest
const { articles: digest, stats } = await harvester.digest();
console.log(`${stats.totalFetched} articles -> ${stats.afterDedup} unique -> ${stats.estimatedTokens} tokens`);

Source Types

RSS / Atom

Works out of the box. No selectors needed — feeds are parsed automatically.

{
  id: "france24",
  name: "France24",
  type: "rss",
  url: "https://www.france24.com/en/rss",
  tags: ["global", "europe"],
  interval: 15,
}

HTML Scraping

You define CSS selectors per source. The library uses cheerio — no headless browser, no Puppeteer overhead.

This is still config-driven scraping: the library does not auto-discover article lists or infer what is relevant to your use case.

{
  id: "defence24",
  name: "Defence24",
  type: "html",
  url: "https://defence24.pl/",
  tags: ["poland", "defence"],
  interval: 15,
  selectors: {
    article: "article",        // repeating container
    title: "h2 a",             // title text (within article)
    link: "h2 a",              // link href (within article)
    date: "time",              // optional: publication date
    summary: ".lead",          // optional: description text
  },
}

API

createHarvester(options)

Creates a harvester instance. Options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| sources | SourceConfig[] | required | Array of source definitions |
| dedup.known | () => string[] | — | Returns hashes already in your DB (for cross-session dedup) |
| digest | DigestOptions | see below | Default digest settings |
| requestTimeout | number | 15000 | HTTP timeout in ms |
| requestGap | number | 1000 | Minimum ms between requests (rate limiting) |
| maxItemsPerSource | number | 50 | Cap articles returned from one source |
| fetch | Function | global fetch | Custom fetch for proxies/testing |
| onError | Function | — | Callback for per-source fetch or parse errors |
| onWarning | Function | — | Callback for non-fatal source diagnostics |

harvester.fetchAll()

Fetches all enabled sources. Returns Article[].

If one source fails, the method still returns articles from the remaining sources and reports the problem through onError when provided.
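This resilient behaviour can be approximated with Promise.allSettled. A minimal sketch of the pattern, not the library's internals (fetchAllResilient and fetchOne are hypothetical names):

```javascript
// Sketch: fetch each source independently so one failure
// doesn't discard results from the others.
async function fetchAllResilient(sources, fetchOne, onError) {
  const results = await Promise.allSettled(
    sources.map((source) => fetchOne(source))
  );
  const articles = [];
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      articles.push(...result.value);
    } else if (onError) {
      // Failed source: report it, keep going.
      onError(result.reason, sources[i]);
    }
  });
  return articles;
}
```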

harvester.fetch(sourceId)

Fetches a single source by ID.

harvester.fetchByTags(tags)

Fetches sources matching any of the given tags.
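"Any of the given tags" means a source is selected if at least one of its tags appears in the requested list. A sketch of that predicate (matchesTags is an illustrative name, not the library's API):

```javascript
// A source matches when its tag list overlaps the requested tags.
function matchesTags(source, tags) {
  return source.tags.some((tag) => tags.includes(tag));
}

matchesTags({ tags: ["global", "uk"] }, ["uk", "poland"]); // true
```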

harvester.digest(options?)

The main event. Fetches all sources, then runs the compression pipeline:

  1. Dedup — Groups similar headlines (Jaccard similarity) and keeps the richest version
  2. Sort — Newest first
  3. Tag budget — Caps articles per tag so no single region dominates
  4. Truncate — Cuts content to N characters per article
  5. Token budget — Trims from the bottom until under the token limit

const { articles, stats } = await harvester.digest({
  maxTokens: 12_000,           // total token budget
  maxArticlesPerTag: 10,       // max articles per tag group
  maxContentLength: 500,       // chars per article content
  similarityThreshold: 0.6,    // title dedup threshold (0-1)
  sort: "recency",
});

// stats.totalFetched     → 700  (raw from all sources)
// stats.afterDedup       → 200  (unique stories)
// stats.afterBudget      → 80   (within tag limits)
// stats.estimatedTokens  → 11800 (final count, within the 12k budget)
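The dedup in step 1 scores headline pairs with Jaccard similarity: the size of the token intersection over the size of the token union. A minimal sketch of that comparison, assuming simple lowercased word tokens (the library's exact tokenization may differ):

```javascript
// Jaccard similarity between two titles: |A ∩ B| / |A ∪ B|
// over lowercased word tokens.
function titleSimilarity(a, b) {
  const tokens = (s) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}

// With similarityThreshold: 0.6, these two headlines would be
// grouped as the same story:
titleSimilarity(
  "NATO summit opens in Vilnius",
  "NATO summit opens in Vilnius today"
); // 5 shared tokens / 6 total ≈ 0.83
```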

harvester.start(callbacks) / harvester.stop()

Runs sources on their configured intervals. You handle storage.

harvester.start({
  onArticles: async (articles, source) => {
    await db.insert("articles", articles);
    console.log(`${articles.length} new from ${source.name}`);
  },
  onError: (err, source) => {
    console.error(`${source.name} failed:`, err);
  },
  onWarning: (warning, source) => {
    console.warn(`${source.name}: ${warning.code} - ${warning.message}`);
  },
});

// Later:
harvester.stop();

Article Schema

interface Article {
  sourceId: string;          // matches source config id
  url: string;               // canonical article URL
  title: string;
  content: string | null;    // full text (when available)
  summary: string | null;    // short description
  publishedAt: Date | null;
  hash: string;              // SHA-256 of URL (dedup key)
  fetchedAt: Date;
  tags: string[];            // inherited from source
}
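The hash field is documented as a SHA-256 of the URL. A sketch of computing such a dedup key with Node's built-in crypto module (whether the library normalizes the URL first is not specified, so none is done here):

```javascript
import { createHash } from "node:crypto";

// Dedup key: hex-encoded SHA-256 of the canonical article URL.
function articleHash(url) {
  return createHash("sha256").update(url).digest("hex");
}

articleHash("https://example.com/story"); // 64-character hex string
```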

Dedup Across Sessions

The library handles within-batch dedup automatically. For cross-session dedup (don't re-process articles already in your DB), pass a known callback:

const harvester = createHarvester({
  sources,
  dedup: {
    known: async () => {
      const rows = await db.query("SELECT hash FROM articles");
      return rows.map(r => r.hash);
    },
  },
});

// fetchAll() now skips articles whose URL hash is already known
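Conceptually, the skip is a set-membership check against the hashes your callback returns. A sketch of the idea (filterKnown is an illustrative name, not part of the library's API):

```javascript
// Sketch: drop articles whose hash is already known.
// `known` is the callback from the dedup config; it may be async.
async function filterKnown(articles, known) {
  const seen = new Set(await known());
  return articles.filter((article) => !seen.has(article.hash));
}
```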

Diagnostics

The library keeps the happy path simple: fetchAll() and digest() still return article data directly.

Use onError and onWarning if you want visibility into partial failures or weak-quality source output.

  • onError covers hard failures like timeouts, HTTP errors, and parsing failures.
  • onWarning covers non-fatal issues like empty source results, missing publication dates, or per-source truncation.

This matches the typical small-library OSS pattern: easy defaults, optional hooks for logging and monitoring.

Scope and Limits

  • RSS and HTML are first-class source types.
  • HTML works best when you can define stable list selectors.
  • The library does not execute page JavaScript or run a headless browser.
  • The library does not decide what is relevant for your domain; apply your own filters downstream.

Use with Next.js

// app/api/feed/route.ts
import { createHarvester } from "osint-feed";

const harvester = createHarvester({ sources: [...] });

export async function GET() {
  const { articles, stats } = await harvester.digest({ maxTokens: 8000 });
  return Response.json({ articles, stats });
}

Use with Express

import express from "express";
import { createHarvester } from "osint-feed";

const app = express();
const harvester = createHarvester({ sources: [...] });

app.get("/digest", async (_req, res) => {
  const result = await harvester.digest();
  res.json(result);
});

How the Digest Math Works

Real numbers from a smoke test with 10 RSS + 3 HTML sources:

Raw fetch:         324 articles
After title dedup: 319 unique stories
After tag budget:  47  (8 per tag, 6 tags)
Estimated tokens:  5,781

That's about 4.5% of Llama 3's 128k context. Plenty of room for system prompt, history, and reasoning.

With 35 sources polling every 15 min you'd get ~700 articles/hour. The digest pipeline compresses that to ~80 articles / ~18k tokens. Adjust maxArticlesPerTag and maxTokens to taste.
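stats.estimatedTokens is necessarily an estimate. A common heuristic for English text, assumed here rather than confirmed for this library, is roughly four characters per token:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Real tokenizers (tiktoken, SentencePiece) will vary by source.
function estimateTokens(articles) {
  const chars = articles.reduce(
    (sum, a) => sum + a.title.length + (a.content ? a.content.length : 0),
    0
  );
  return Math.ceil(chars / 4);
}
```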

Dependencies

Just two: cheerio for HTML parsing, plus an RSS/Atom feed parser.

No headless browsers. No native modules. No bloat.

License

MIT

Disclaimer

This library is a tool for fetching and parsing publicly available web content. Users are responsible for compliance with target websites' terms of service and applicable laws. The authors assume no liability for how the library is used.