# 📚 javadocs-scraper
A TypeScript library to scrape Java objects information from a Javadocs website.
Specifically, it scrapes data (name, description, URL, etc.) about, and links together:
- Packages
- Classes
- Interfaces
- Object Type Parameters (Object Generics), on classes and interfaces
- Enums
- Annotations
- Fields
- Methods
Some extra data is also calculated after scraping, like method and field inheritance.
> [!CAUTION]
> Tested with Javadocs generated from Java 7 to Java 21. I cannot guarantee this will work with older or newer versions.
## Contents

- [📦 Installation and Usage](#-installation-and-usage)
- [🔒 Warnings](#-warnings)
- [🔍 Specifics](#-specifics)
  - [Query Strategies](#query-strategies)
## 📦 Installation and Usage

- Install with your preferred package manager:

```bash
npm install javadocs-scraper
yarn add javadocs-scraper
pnpm add javadocs-scraper
```

- Instantiate a `Scraper`:
```ts
import { Scraper } from 'javadocs-scraper';

// From an online URL:
const urlScraper = Scraper.fromURL('https://...');
// From a local path:
const pathScraper = Scraper.fromPath('./path/to/javadocs/index.html');
```

> [!NOTE]
> This package uses constructor dependency injection for every class. You can also instantiate `Scraper` with the `new` keyword, but you'll need to specify every dependency manually. The easier way is to use the static `fromX` methods, which will use the default implementations.

> [!TIP]
> Alternatively, you can provide your own `Fetcher` to fetch data from the Javadocs:
>
> ```ts
> import type { Fetcher } from 'javadocs-scraper';
>
> class MyFetcher implements Fetcher {
>   /** ... */
> }
>
> const myFetcher = new MyFetcher('https://...');
> const scraper = Scraper.with({ fetcher: myFetcher });
> ```
- Use the `Scraper` to scrape, and the resulting `Javadocs` to access the data:

```ts
const javadocs: Javadocs = await scraper.scrape();

/** for example */
const myInterface = javadocs.getInterface('org.example.Interface');
console.log(myInterface);
/**
 * {
 *   qualifiedName: 'org.example.Interface',
 *   package: { name: 'org.example', ... },
 *   url: 'https://.../Interface.html',
 *   description: { text: 'An example interface', html: '<p>An example interface</p>' },
 *   methods: Collection {...},
 *   fields: Collection {...},
 *   typeParameters: Collection {...},
 *   // and more data, check the docs!
 * }
 */
```

> [!TIP]
> The `Javadocs` object uses discord.js' `Collection` class to store all the scraped data. This is an extension of `Map` with utility methods, like `find()`, `reduce()`, etc.
>
> These collections are also typed as mutable, so any modification will be reflected in the backing `Javadocs`. This is by design: the library no longer uses this object once it's given to you, and doesn't care what you then do with it.
>
> Check the discord.js guide or the `Collection` docs for more info.
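For example, since fields like `methods` are `Collection`s, both the inherited `Map` API and the discord.js helpers work on them. A minimal sketch, continuing from the snippet above (it assumes each method entity exposes a `name` property, which isn't confirmed here — check the docs for the exact entity shape):

```ts
const myInterface = javadocs.getInterface('org.example.Interface');

if (myInterface) {
  // The plain Map API still works:
  console.log(myInterface.methods.size);

  // ...plus discord.js Collection helpers like find() and map().
  // `name` is an assumed property; check the entity docs.
  const toString = myInterface.methods.find((method) => method.name === 'toString');
  const names = myInterface.methods.map((method) => method.name);
  console.log(toString, names);
}
```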
## 🔒 Warnings

- Make sure not to spam a Javadocs website. It's your responsibility not to abuse the library, and to implement appropriate safeguards against abuse, like a cache.
- The `scrape()` method will take a while to scrape the entire website. Make sure to only run it when necessary, ideally only once in the entire program's lifecycle (see the sketch below).
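A minimal sketch of that second point (plain TypeScript, not a library feature): scrape lazily and memoize the promise, so every caller in the process shares a single scrape.

```ts
import { Scraper, type Javadocs } from 'javadocs-scraper';

const scraper = Scraper.fromURL('https://...');

let cached: Promise<Javadocs> | undefined;

/** Scrapes at most once per process; concurrent callers share the same promise. */
export function getJavadocs(): Promise<Javadocs> {
  cached ??= scraper.scrape();
  return cached;
}
```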
## 🔍 Specifics

There are distinct types of objects that hold the library together:

- A `Fetcher`¹, which makes requests to the Javadocs website.
- Entities², which represent a scraped object.
- `QueryStrategies`¹, which query the website through cheerio. These are needed because HTML classes and ids change between Javadoc versions.
- Scrapers¹, which scrape information from a given URL or cheerio object into a partial object.
- Partials², which represent a partially scraped object, that is, an object without circular references to other objects.
- A `ScraperCache`, which caches partial objects in memory.
- Patchers¹, which patch partials into full entities by linking them together.
- `Javadocs`, which is the final result of the scraping process.

¹ - Replaceable via constructor injection (see the sketch below).
² - Only a type, not available at runtime.
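For instance, since the `Fetcher` is replaceable, you could inject one that throttles requests (see the Warnings above). A hypothetical sketch — the method name and signature below are assumptions, not the package's actual `Fetcher` interface, so check the typings before implementing it:

```ts
// Hypothetical: assumes the Fetcher contract boils down to fetching a
// page's HTML by path. Verify the real interface in the package typings.
class ThrottledFetcher {
  private chain: Promise<unknown> = Promise.resolve();

  constructor(
    private readonly baseUrl: string,
    private readonly delayMs = 250,
  ) {}

  async fetch(path: string): Promise<string> {
    // Serialize requests and space them out by delayMs.
    const run = this.chain.then(async () => {
      await new Promise((resolve) => setTimeout(resolve, this.delayMs));
      const response = await globalThis.fetch(new URL(path, this.baseUrl));
      return response.text();
    });
    this.chain = run.catch(() => undefined); // keep the chain alive on errors
    return run;
  }
}

// Once it satisfies the real Fetcher interface, wire it in as shown earlier:
// const scraper = Scraper.with({ fetcher: new ThrottledFetcher('https://...') });
```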
The scraping process occurs in the following steps:

1. A `QueryStrategy` is chosen by the `QueryStrategyBundleFactory`.
2. The `RootScraper` iterates through every package in the Javadocs root.
3. Every package is fetched and passed to the `PackageScraper`.
4. The `PackageScraper` iterates through every class, interface, enum and annotation in the package, and passes each to the appropriate `Scraper`.
5. Each scraper creates a partial object and caches it in the `ScraperCache`.
6. Once everything is done, the `Scraper` uses the `Patchers` to patch the partial objects together, by passing the cache to each patcher.
7. The `Scraper` returns the patched objects in a `Javadocs` object.
> [!TIP]
> You can provide your own `QueryStrategyBundleFactory` to change the way the `QueryStrategy` is chosen:
>
> ```ts
> import { OnlineFetcher } from 'javadocs-scraper';
>
> const scraper = Scraper.with({
>   fetcher: new OnlineFetcher('https://...'), // or any other Fetcher implementation
>   strategyBundleFactory: ($root: CheerioAPI) => { /** ... */ },
> });
> ```
### Query Strategies

Query strategies help fetch data across Java versions without lengthy conditional code. A strategy doesn't actually know at runtime which Java version generated the Javadocs it's running on; instead, each strategy is made to support multiple versions at once.
In particular, the library provides two strategy "types", which are free to be extended:
#### Legacy strategies

For Javadocs 8 to 15. Some of the queries resemble those of the modern strategy because of the Java 13-15 Javadocs, which are a mix of legacy and modern, but from testing they mostly match legacy.
Legacy Javadocs don't have a consistent structure, so this strategy has a couple of workarounds, hacks and pre-compiled regexes to extract the data correctly.
#### Modern strategies

For Javadocs 16 to the latest supported version (21 at the time of writing). Some of the queries resemble those of the legacy strategy because of the Java 16 Javadocs, which are a mix of legacy and modern, but from testing they mostly match modern.
Modern Javadocs have a more consistent structure, with classes and ids that are easy to query directly.
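To illustrate the difference, here is a hypothetical version-agnostic query written directly against cheerio. The selectors are illustrative of the kind of drift a strategy has to absorb; they are not the library's actual queries:

```ts
import type { CheerioAPI } from 'cheerio';

// Illustrative only: modern Javadocs expose stable class names, while
// legacy pages need looser queries. These selectors are assumptions,
// not the library's real ones.
function classDescription($: CheerioAPI): string {
  // Modern (16+) pages mark the description section explicitly...
  const modern = $('.class-description .block').first().text().trim();
  if (modern) return modern;

  // ...while legacy pages need a looser, position-based query.
  return $('.description .block').first().text().trim();
}
```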
