
osint-feed

v0.1.0

Config-driven news harvester for Node.js. Pulls articles from RSS feeds and HTML pages, deduplicates them, and produces a compact digest ready to feed into an LLM context window.

No AI inside. No opinions about your stack. Just articles in, structured data out.

Use RSS when you have it. Use HTML selectors when you do not. Filtering for your specific topic belongs in the app that consumes this library.

Why

You're building something that needs fresh news context — a SITREP generator, a threat monitor, a research assistant. You have 30+ sources across languages and formats. You need the data compact enough to fit in a Llama/GPT context window without blowing the budget.

Existing tools are either Python-only (newspaper4k), heavy self-hosted platforms (Huginn), or commercial APIs (Newscatcher, NewsAPI). Nothing in the JS/TS ecosystem does config-driven multi-source harvesting with built-in LLM-ready compression.

osint-feed fills that gap.

Install

npm install osint-feed

Requires Node.js 18+.

Quick Start

import { createHarvester } from "osint-feed";

const harvester = createHarvester({
  sources: [
    {
      id: "bbc-world",
      name: "BBC World",
      type: "rss",
      url: "https://feeds.bbci.co.uk/news/world/rss.xml",
      tags: ["global", "uk"],
      interval: 15,
    },
    {
      id: "nato",
      name: "NATO Newsroom",
      type: "html",
      url: "https://www.nato.int/cps/en/natohq/news.htm",
      tags: ["nato"],
      interval: 30,
      selectors: {
        article: ".event-list-item",
        title: "a span:first-child",
        link: "a",
        date: ".event-date",
      },
    },
  ],
});

// Fetch everything
const articles = await harvester.fetchAll();

// Or get an LLM-ready digest
const { articles: digest, stats } = await harvester.digest();
console.log(`${stats.totalFetched} articles -> ${stats.afterDedup} unique -> ${stats.estimatedTokens} tokens`);

Source Types

RSS / Atom

Works out of the box. No selectors needed — feeds are parsed automatically.

{
  id: "france24",
  name: "France24",
  type: "rss",
  url: "https://www.france24.com/en/rss",
  tags: ["global", "europe"],
  interval: 15,
}

HTML Scraping

You define CSS selectors per source. The library uses cheerio — no headless browser, no Puppeteer overhead.

This is still config-driven scraping: the library does not auto-discover article lists or infer what is relevant to your use case.

{
  id: "defence24",
  name: "Defence24",
  type: "html",
  url: "https://defence24.pl/",
  tags: ["poland", "defence"],
  interval: 15,
  selectors: {
    article: "article",        // repeating container
    title: "h2 a",             // title text (within article)
    link: "h2 a",              // link href (within article)
    date: "time",              // optional: publication date
    summary: ".lead",          // optional: description text
  },
}

API

createHarvester(options)

Creates a harvester instance. Options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| sources | SourceConfig[] | required | Array of source definitions |
| dedup.known | () => string[] | — | Returns hashes already in your DB (for cross-session dedup) |
| digest | DigestOptions | see below | Default digest settings |
| requestTimeout | number | 15000 | HTTP timeout in ms |
| requestGap | number | 1000 | Minimum ms between requests (rate limiting) |
| maxItemsPerSource | number | 50 | Cap articles returned from one source |
| fetch | Function | global fetch | Custom fetch for proxies/testing |
| onError | Function | — | Callback for per-source fetch or parse errors |
| onWarning | Function | — | Callback for non-fatal source diagnostics |

harvester.fetchAll()

Fetches all enabled sources. Returns Article[].

If one source fails, the method still returns articles from the remaining sources and reports the problem through onError when provided.
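This resilient behaviour can be approximated with Promise.allSettled. A minimal sketch of the pattern, not the library's internals (fetchAllResilient and fetchOne are hypothetical names):

```javascript
// Sketch: fetch each source independently so one failure
// doesn't discard results from the others.
async function fetchAllResilient(sources, fetchOne, onError) {
  const results = await Promise.allSettled(
    sources.map((source) => fetchOne(source))
  );
  const articles = [];
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      articles.push(...result.value);
    } else if (onError) {
      // Failed source: report it, keep going.
      onError(result.reason, sources[i]);
    }
  });
  return articles;
}
```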

harvester.fetch(sourceId)

Fetches a single source by ID.

harvester.fetchByTags(tags)

Fetches sources matching any of the given tags.
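"Any of the given tags" means a source is selected if at least one of its tags appears in the requested list. A sketch of that predicate (matchesTags is an illustrative name, not the library's API):

```javascript
// A source matches when its tag list overlaps the requested tags.
function matchesTags(source, tags) {
  return source.tags.some((tag) => tags.includes(tag));
}

matchesTags({ tags: ["global", "uk"] }, ["uk", "poland"]); // true
```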

harvester.digest(options?)

The main event. Fetches all sources, then runs the compression pipeline:

  1. Dedup — Groups similar headlines (Jaccard similarity) and keeps the richest version
  2. Sort — Newest first
  3. Tag budget — Caps articles per tag so no single region dominates
  4. Truncate — Cuts content to N characters per article
  5. Token budget — Trims from the bottom until under the token limit

const { articles, stats } = await harvester.digest({
  maxTokens: 12_000,           // total token budget
  maxArticlesPerTag: 10,       // max articles per tag group
  maxContentLength: 500,       // chars per article content
  similarityThreshold: 0.6,    // title dedup threshold (0-1)
  sort: "recency",
});

// stats.totalFetched     → 700  (raw from all sources)
// stats.afterDedup       → 200  (unique stories)
// stats.afterBudget      → 80   (within tag limits)
// stats.estimatedTokens  → 11800 (final count, within the 12k budget)
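The dedup in step 1 scores headline pairs with Jaccard similarity: the size of the token intersection over the size of the token union. A minimal sketch of that comparison, assuming simple lowercased word tokens (the library's exact tokenization may differ):

```javascript
// Jaccard similarity between two titles: |A ∩ B| / |A ∪ B|
// over lowercased word tokens.
function titleSimilarity(a, b) {
  const tokens = (s) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}

// With similarityThreshold: 0.6, these two headlines would be
// grouped as the same story:
titleSimilarity(
  "NATO summit opens in Vilnius",
  "NATO summit opens in Vilnius today"
); // 5 shared tokens / 6 total ≈ 0.83
```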

harvester.start(callbacks) / harvester.stop()

Runs sources on their configured intervals. You handle storage.

harvester.start({
  onArticles: async (articles, source) => {
    await db.insert("articles", articles);
    console.log(`${articles.length} new from ${source.name}`);
  },
  onError: (err, source) => {
    console.error(`${source.name} failed:`, err);
  },
  onWarning: (warning, source) => {
    console.warn(`${source.name}: ${warning.code} - ${warning.message}`);
  },
});

// Later:
harvester.stop();

Article Schema

interface Article {
  sourceId: string;          // matches source config id
  url: string;               // canonical article URL
  title: string;
  content: string | null;    // full text (when available)
  summary: string | null;    // short description
  publishedAt: Date | null;
  hash: string;              // SHA-256 of URL (dedup key)
  fetchedAt: Date;
  tags: string[];            // inherited from source
}
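The hash field is documented as a SHA-256 of the URL. A sketch of computing such a dedup key with Node's built-in crypto module (whether the library normalizes the URL first is not specified, so none is done here):

```javascript
import { createHash } from "node:crypto";

// Dedup key: hex-encoded SHA-256 of the canonical article URL.
function articleHash(url) {
  return createHash("sha256").update(url).digest("hex");
}

articleHash("https://example.com/story"); // 64-character hex string
```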

Dedup Across Sessions

The library handles within-batch dedup automatically. For cross-session dedup (don't re-process articles already in your DB), pass a known callback:

const harvester = createHarvester({
  sources,
  dedup: {
    known: async () => {
      const rows = await db.query("SELECT hash FROM articles");
      return rows.map(r => r.hash);
    },
  },
});

// fetchAll() now skips articles whose URL hash is already known
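Conceptually, the skip is a set-membership check against the hashes your callback returns. A sketch of the idea (filterKnown is an illustrative name, not part of the library's API):

```javascript
// Sketch: drop articles whose hash is already known.
// `known` is the callback from the dedup config; it may be async.
async function filterKnown(articles, known) {
  const seen = new Set(await known());
  return articles.filter((article) => !seen.has(article.hash));
}
```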

Diagnostics

The library keeps the happy path simple: fetchAll() and digest() still return article data directly.

Use onError and onWarning if you want visibility into partial failures or weak-quality source output.

  • onError covers hard failures like timeouts, HTTP errors, and parsing failures.
  • onWarning covers non-fatal issues like empty source results, missing publication dates, or per-source truncation.

This matches the typical small-library OSS pattern: easy defaults, optional hooks for logging and monitoring.

Scope and Limits

  • RSS and HTML are first-class source types.
  • HTML works best when you can define stable list selectors.
  • The library does not execute page JavaScript or run a headless browser.
  • The library does not decide what is relevant for your domain; apply your own filters downstream.

Use with Next.js

// app/api/feed/route.ts
import { createHarvester } from "osint-feed";

const harvester = createHarvester({ sources: [...] });

export async function GET() {
  const { articles, stats } = await harvester.digest({ maxTokens: 8000 });
  return Response.json({ articles, stats });
}

Use with Express

import express from "express";
import { createHarvester } from "osint-feed";

const app = express();
const harvester = createHarvester({ sources: [...] });

app.get("/digest", async (_req, res) => {
  const result = await harvester.digest();
  res.json(result);
});

How the Digest Math Works

Real numbers from a smoke test with 10 RSS + 3 HTML sources:

Raw fetch:         324 articles
After title dedup: 319 unique stories
After tag budget:  47  (8 per tag, 6 tags)
Estimated tokens:  5,781

That's about 4.5% of Llama 3's 128k context. Plenty of room for system prompt, history, and reasoning.

With 35 sources polling every 15 min you'd get ~700 articles/hour. The digest pipeline compresses that to ~80 articles / ~18k tokens. Adjust maxArticlesPerTag and maxTokens to taste.
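stats.estimatedTokens is necessarily an estimate. A common heuristic for English text, assumed here rather than confirmed for this library, is roughly four characters per token:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Real tokenizers (tiktoken, SentencePiece) will vary by source.
function estimateTokens(articles) {
  const chars = articles.reduce(
    (sum, a) => sum + a.title.length + (a.content ? a.content.length : 0),
    0
  );
  return Math.ceil(chars / 4);
}
```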

Dependencies

Just two: cheerio for HTML parsing, plus an RSS/Atom feed parser.

No headless browsers. No native modules. No bloat.

License

MIT

Disclaimer

This library is a tool for fetching and parsing publicly available web content. Users are responsible for compliance with target websites' terms of service and applicable laws. The authors assume no liability for how the library is used.