
universities

v0.0.3

Comprehensive worldwide universities dataset with TypeScript API, CLI tools, and data processing utilities. Includes web scraping capabilities with respectful rate limiting for enriching university data.

Overview

universities is an evolving TypeScript/Node.js library and CLI that provides a structured, extensible dataset of the world's universities along with an enrichment pipeline that (optionally) visits institutional homepages to extract additional metadata.

Core goals:

  1. Provide immediate, zero‑network access to a clean base list of universities (name, domains, country info, website) sourced from the public world universities dataset.
  2. Offer an enrichment layer (opt‑in) that scrapes each university homepage respectfully (rate‑limited + retries) to infer or collect:
    • Descriptions / taglines / motto
    • Contact and location hints
    • Founding year
    • Academic programs & faculties (heuristic extraction)
    • Social media links
    • Institutional classification (public/private, research, technical, community, etc. — heuristic)
    • Degree levels (undergraduate / graduate / doctoral)
    • Data quality scoring for traceability
  3. Expose ergonomic programmatic APIs for search, filtering, and statistics.
  4. Provide a CLI for quick querying, enrichment, and aggregated stats generation.
  5. Remain transparent, reproducible, and respectful of target sites (configurable concurrency, caching, resumability, optional full‑dataset execution).

NOTE: Full automatic enrichment of every university (≈9k+) can take considerable time and should be run thoughtfully to avoid undue load on remote servers. The base dataset works instantly without enrichment.

Key Features

  • Base dataset loader (CSV → strongly typed objects)
  • In‑memory repository with searching, filtering, sorting and basic statistics
  • Extensible domain model (University, Program, Faculty, ranking + classification enums)
  • Heuristic scraper with retry + rate limiting queue
  • Batch enrichment script with per‑university JSON caching (resumable)
  • CLI with subcommands: list, enrich, stats
  • TypeScript declarations for consumption in TS or JS projects
  • Modular architecture to allow swapping scraping strategies or adding alternate data sources later (e.g., APIs, ranking feeds)

Installation

Install locally (library usage inside another project):

npm install universities

Or for global CLI usage (optional):

npm install -g universities

After a global install you can invoke the CLI via the universities command (see CLI section below). When using as a dependency, import from the package entry points.

Quick Start (CLI)

List the first 5 US universities:

universities list --country-code US --limit 5

Search by name fragment:

universities list --name polytechnic --limit 10

Output JSON instead of a table:

universities list --country-code CA --json --limit 3

Enrich a single university (fetch + parse homepage):

universities enrich https://www.mit.edu/

View aggregated stats (counts by type, size, and so on; output improves once enriched data exists):

universities stats

Programmatic Usage

import { loadBaseUniversities } from 'universities/dist/data/loadBase';
import { UniversityRepository } from 'universities/dist/repository/UniversityRepository';

async function example() {
  // Load the base CSV-backed dataset (no network access required).
  const base = await loadBaseUniversities();
  const repo = new UniversityRepository(base);

  // Up to 20 US universities whose name contains "state".
  const results = repo.search({ countryCode: 'US', name: 'state', limit: 20 });
  console.log(results.slice(0, 3));

  // Aggregated statistics for the loaded dataset.
  console.log(repo.stats());
}

example();

The scraper (UniversityScraper) is intentionally decoupled and lazily imported in the CLI to avoid pulling in ESM‑only dependencies when they are not needed. For programmatic enrichment, import it directly:

import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';
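For single‑university enrichment in code, a minimal sketch follows; the constructor options and the scrape method name are assumptions about the API rather than confirmed signatures, so check the class definition before relying on them.

import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';

async function enrichOne(url: string) {
  // Hypothetical usage: the method name and return shape are assumptions.
  const scraper = new UniversityScraper();
  const enriched = await scraper.scrape(url);
  console.log(enriched);
}

enrichOne('https://www.mit.edu/');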

Data Model (Simplified)

interface University {
  id: string; // Stable hash/id generation
  name: string;
  country: string;
  countryCode: string;
  alphaTwoCode?: string; // If present in source
  webPages: string[]; // One or more homepage URLs
  domains: string[]; // Domain(s)
  stateProvince?: string;
  // Enriched fields (optional until scraping):
  description?: string;
  motto?: string;
  foundingYear?: number;
  location?: string;
  contact?: { email?: string; phone?: string; address?: string };
  programs?: { name: string; degreeLevels?: string[] }[];
  faculties?: { name: string; description?: string }[];
  social?: { twitter?: string; facebook?: string; instagram?: string; linkedin?: string; youtube?: string };
  classification?: { type?: string; degreeLevel?: string[] };
  dataQuality?: { score: number; factors: string[] };
  enrichedAt?: string; // ISO timestamp when enrichment occurred
}

See the full definitions in src/types/University.ts for exhaustive enum types, search options, stats structure, and classification helpers.
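As a small illustration, the optional enriched fields support ordinary filtering once enrichment has run; this sketch assumes the type can be imported from a path mirroring the dist layout used above:

import type { University } from 'universities/dist/types/University';

// Keep only records that were enriched and report a founding year
// earlier than the given cutoff.
function foundedBefore(list: University[], year: number): University[] {
  return list.filter(
    (u) => u.enrichedAt !== undefined && u.foundingYear !== undefined && u.foundingYear < year,
  );
}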

Architecture Overview

Layered design:

  1. Source Layer (world-universities.csv) – raw dataset.
  2. Loader (loadBaseUniversities) – parses CSV into partial University objects.
  3. Domain Types (types/University.ts) – strongly typed schema + enums + search contracts.
  4. Repository (UniversityRepository) – in‑memory indexing, filtering, sorting, basic statistics.
  5. Scraper (UniversityScraper) – fetch + parse homepage, extraction heuristics, classification & data quality scoring (rate limited via queue).
  6. Enrichment Script (scripts/enrich.ts) – orchestrates batch scraping with caching to data/cache/*.json and writes aggregated enriched dataset.
  7. CLI (cli.ts) – user interface for listing, enrichment, and stats.

Scraper Heuristics (High-Level)

  • Fetch with retry & jitter backoff.
  • Extract <meta name="description">, first meaningful paragraph, or tagline patterns.
  • Look for contact info via regex (emails, phone numbers, address fragments).
  • Infer founding year via patterns like Established 18xx|19xx|20xx (see the sketch after this list).
  • Identify program/faculty keywords in navigation or section headers.
  • Collect social links by domain match (twitter.com, facebook.com, etc.).
  • Classify type (public/private/research/technical/community) by keyword/phrase heuristics.
  • Score data quality based on number & diversity of successfully extracted fields.
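For concreteness, the founding‑year heuristic could look roughly like the following; the actual patterns used by UniversityScraper may differ:

// Simplified sketch of the founding-year heuristic; the real extractor
// may use different or additional patterns.
const FOUNDING_PATTERN = /(?:established|founded)\s+(?:in\s+)?((?:18|19|20)\d{2})/i;

function inferFoundingYear(text: string): number | undefined {
  const match = FOUNDING_PATTERN.exec(text);
  return match ? Number(match[1]) : undefined;
}

// inferFoundingYear('Established in 1861, the institute...') returns 1861.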

Performance & Respectful Crawling

  • Concurrency controlled by a queue (configurable; sketched after this list).
  • Optional pauses / resume; per‑record caching prevents redundant fetches.
  • Future roadmap includes robots.txt parsing & adaptive politeness windows.
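The queue‑based throttling can be sketched with p-queue (the ESM‑only dependency mentioned under Troubleshooting); the limits below are illustrative, not the project's defaults:

import PQueue from 'p-queue';

// Illustrative limits: at most 3 fetches in flight, and at most
// 2 new requests started per second.
const queue = new PQueue({ concurrency: 3, interval: 1000, intervalCap: 2 });

async function fetchHomepage(url: string): Promise<string> {
  const response = await fetch(url);
  return response.text();
}

async function crawl(urls: string[]) {
  const pages = await Promise.all(urls.map((url) => queue.add(() => fetchHomepage(url))));
  console.log(`fetched ${pages.length} pages`);
}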

Enrichment Workflow

The batch enrichment script is optional and can be executed when you purposely want deeper metadata.

npm run build
node dist/scripts/enrich.js --concurrency 3 --resume

Flags (planned / implemented):

| Flag | Description |
| --------------------- | ----------------------------------------------------------------------------- |
| --concurrency <n> | Number of parallel fetches (default is modest to prevent overloading sites). |
| --resume | Skip already cached universities (looks in data/cache/). |
| --limit <n> | (Planned) Process only the first N universities for sampling. |
| --country-code <CC> | (Planned) Restrict enrichment to a country subset. |

Outputs:

  • data/cache/{universityId}.json – per‑university enriched snapshot.
  • data/enriched-universities.json – aggregated enriched dataset (written after run).
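After a completed run, the aggregated file can be consumed directly; a minimal sketch, assuming its contents follow the University shape above:

import { readFile } from 'node:fs/promises';
import type { University } from 'universities/dist/types/University';

async function loadEnriched(): Promise<University[]> {
  // Written by the enrichment script after a completed run.
  const raw = await readFile('data/enriched-universities.json', 'utf8');
  return JSON.parse(raw) as University[];
}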

CLI Reference

| Command | Purpose | Key Options |
| -------------- | ----------------------------------------------------------- | --------------------------------------------------- |
| list | Filter & display base (or partially enriched) universities | --name, --country, --country-code, --limit, --json |
| enrich <url> | Enrich a single university homepage | (none yet; uses internal defaults) |
| stats | Show aggregated statistics | None |

Examples:

universities list --name technology --limit 8
universities list --country-code GB --json --limit 5
universities enrich https://www.stanford.edu/
universities stats

Roadmap

  • [ ] Full dataset enrichment pipeline automation & snapshot publishing
  • [ ] Dual ESM + CJS distribution build (current workaround: lazy import for ESM‑only deps)
  • [ ] Robots.txt compliance & politeness policy configuration
  • [ ] Advanced classification (continent/region inference, size estimation heuristics, ranking ingestion)
  • [ ] Pluggable enrichment modules (e.g., ranking APIs, accreditation feeds)
  • [ ] Incremental persistent store (SQLite / LiteFS / DuckDB) for historical deltas
  • [ ] Comprehensive test suite (scraper mocks, repository edge cases, CLI integration)
  • [ ] Documentation site (API reference, enrichment metrics dashboard)
  • [ ] Progressive enrichment resume with queuing telemetry
  • [ ] Data provenance & reproducibility manifest (hashes, run metadata)

Testing

Run unit and integration tests:

npm test

Coverage reports are emitted to coverage/.

Contributing

We welcome contributions! Suggested steps:

  1. Fork & create a feature branch.
  2. Install dependencies: npm install.
  3. Run npm run build & ensure tests pass.
  4. Add or update tests for your change.
  5. Follow lint & formatting (npm run lint, npm run format).
  6. Submit a PR referencing any related issues.

Please consult (or propose) a CONTRIBUTING.md for evolving guidelines. Ethical scraping considerations and rate limiting are especially important—avoid aggressive concurrency.

Ethical & Legal Considerations

  • This project performs only light, homepage‑level scraping by default.
  • Always respect target site terms of service and robots.txt (planned feature for enforcement).
  • Do not use the enrichment pipeline to harvest personal data beyond institutional metadata.
  • Consider running enrichment in batches with conservative concurrency settings.

Troubleshooting

| Issue | Cause | Resolution |
| ------------------------------------ | ---------------------------------------------------------------- | ---------------------------------------------------------------------- |
| ERR_REQUIRE_ESM when using CLI list | ESM‑only dependency (p-queue) pulled into non‑enrichment path | Resolved via lazy import; update to the latest version of the package |
| Empty enrichment fields | Site structure variation | Re‑run later or inspect the HTML; heuristics will improve over time |
| Slow enrichment run | Network latency / conservative concurrency | Increase --concurrency cautiously |

Security

No secrets are stored. If you identify a security concern (e.g., vulnerable dependency or scraping misuse vector) please open an issue with reproduction details or use private disclosure if sensitive.

License

This repository is distributed under the terms of the MIT License. See LICENSE for details.

Acknowledgements

Inspired by the open university datasets community and contributors who maintain baseline CSV resources. Future improvements will strive for transparency, repeatability, and respectful data gathering.


Give a ⭐ if you find this useful and feel free to open issues for ideas or enhancements.


Generated documentation improvements are iterative; feel free to propose edits.


How To Contribute

Check the repository's open issues (including good first issues and help‑wanted labels) and pull requests to see how you might be able to help.

Installation

npm install

Running

npm start

or

npm run dev

Testing

npm test

Building

npm run build

Thanks to all Contributors 💪

  • Thank you for considering contributing.
  • Feel free to submit feature requests, UI updates, and bug reports as issues.
  • Check out the Contribution Guidelines for more information.
