universities
v0.0.3
Comprehensive worldwide universities dataset with TypeScript API, CLI tools, and data processing utilities. Includes web scraping capabilities with respectful rate limiting for enriching university data.
Overview
universities is an evolving TypeScript/Node.js library and CLI that provides a structured, extensible dataset of the world's universities along with an enrichment pipeline that (optionally) visits institutional homepages to extract additional metadata.
Core goals:
- Provide immediate, zero‑network access to a clean base list of universities (name, domains, country info, website) sourced from the public world universities dataset.
- Offer an enrichment layer (opt‑in) that scrapes each university homepage respectfully (rate‑limited + retries) to infer or collect:
  - Descriptions / taglines / motto
  - Contact and location hints
  - Founding year
  - Academic programs & faculties (heuristic extraction)
  - Social media links
  - Institutional classification (public/private, research, technical, community, etc. — heuristic)
  - Degree levels (undergraduate / graduate / doctoral)
  - Data quality scoring for traceability
- Expose ergonomic programmatic APIs for search, filtering, and statistics.
- Provide a CLI for quick querying, enrichment, and aggregated stats generation.
- Remain transparent, reproducible, and respectful of target sites (configurable concurrency, caching, resumability, optional full‑dataset execution).
NOTE: Full automatic enrichment of every university (≈9k+) can take considerable time and should be run thoughtfully to avoid undue load on remote servers. The base dataset works instantly without enrichment.
Key Features
- Base dataset loader (CSV → strongly typed objects)
- In‑memory repository with searching, filtering, sorting and basic statistics
- Extensible domain model (`University`, `Program`, `Faculty`, ranking + classification enums)
- Heuristic scraper with retry + rate limiting queue
- Batch enrichment script with per‑university JSON caching (resumable)
- CLI with subcommands: `list`, `enrich`, `stats`
- TypeScript declarations for consumption in TS or JS projects
- Modular architecture to allow swapping scraping strategies or adding alternate data sources later (e.g., APIs, ranking feeds)
Installation
Install locally (library usage inside another project):
```bash
npm install universities
```

Or for global CLI usage (optional):

```bash
npm install -g universities
```

After a global install you can invoke the CLI via the `universities` command (see CLI section below). When using as a dependency, import from the package entry points.
Quick Start (CLI)
List the first 5 US universities:
```bash
universities list --country-code US --limit 5
```

Search by name fragment:

```bash
universities list --name polytechnic --limit 10
```

Output JSON instead of a table:

```bash
universities list --country-code CA --json --limit 3
```

Enrich a single university (fetch + parse homepage):

```bash
universities enrich https://www.mit.edu/
```

View aggregated stats (counts by type, size, etc. — improves once enriched data exists):

```bash
universities stats
```

Programmatic Usage
```ts
import { loadBaseUniversities } from 'universities/dist/data/loadBase';
import { UniversityRepository } from 'universities/dist/repository/UniversityRepository';

async function example() {
  const base = await loadBaseUniversities();
  const repo = new UniversityRepository(base);
  const results = repo.search({ countryCode: 'US', name: 'state', limit: 20 });
  console.log(results.slice(0, 3));
  console.log(repo.stats());
}

example();
```

The scraper (`UniversityScraper`) is intentionally decoupled and lazily imported in the CLI to avoid pulling ESM‑only dependencies when unnecessary. For programmatic enrichment you can:

```ts
import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';
```
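A minimal enrichment sketch follows; the `scrape` method name and signature are assumptions for illustration rather than a confirmed API, so check the package's type declarations for the actual surface:

```ts
import { loadBaseUniversities } from 'universities/dist/data/loadBase';
import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';

async function enrichOne() {
  const base = await loadBaseUniversities();
  const mit = base.find((u) => u.domains.includes('mit.edu'));
  if (!mit) return;

  // Hypothetical call: the real method name/signature may differ.
  const scraper = new UniversityScraper();
  const enriched = await scraper.scrape(mit.webPages[0]);
  console.log(enriched.description, enriched.foundingYear);
}

enrichOne();
```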
Data Model (Simplified)
```ts
interface University {
  id: string;                 // Stable hash/id generation
  name: string;
  country: string;
  countryCode: string;
  alphaTwoCode?: string;      // If present in source
  webPages: string[];         // One or more homepage URLs
  domains: string[];          // Domain(s)
  stateProvince?: string;

  // Enriched fields (optional until scraping):
  description?: string;
  motto?: string;
  foundingYear?: number;
  location?: string;
  contact?: { email?: string; phone?: string; address?: string };
  programs?: { name: string; degreeLevels?: string[] }[];
  faculties?: { name: string; description?: string }[];
  social?: { twitter?: string; facebook?: string; instagram?: string; linkedin?: string; youtube?: string };
  classification?: { type?: string; degreeLevel?: string[] };
  dataQuality?: { score: number; factors: string[] };
  enrichedAt?: string;        // ISO timestamp when enrichment occurred
}
```

See the full definitions in `src/types/University.ts` for exhaustive enum types, search options, stats structure, and classification helpers.
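For orientation, a base (un-enriched) record might look like the following. The id scheme shown (truncated SHA-1 of the primary domain) is an assumption; the interface comment only promises "stable hash/id generation".

```ts
import { createHash } from 'node:crypto';

// Assumed scheme: derive a stable id by hashing the primary domain.
const domain = 'mit.edu';
const id = createHash('sha1').update(domain).digest('hex').slice(0, 12);

// `University` is the interface from the Data Model section above.
const record: University = {
  id,
  name: 'Massachusetts Institute of Technology',
  country: 'United States',
  countryCode: 'US',
  webPages: ['https://www.mit.edu/'],
  domains: [domain],
};
```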
Architecture Overview
Layered design:
- Source Layer (`world-universities.csv`) – raw dataset.
- Loader (`loadBaseUniversities`) – parses CSV into partial `University` objects.
- Domain Types (`types/University.ts`) – strongly typed schema + enums + search contracts.
- Repository (`UniversityRepository`) – in‑memory indexing, filtering, sorting, basic statistics.
- Scraper (`UniversityScraper`) – fetch + parse homepage, extraction heuristics, classification & data quality scoring (rate limited via queue).
- Enrichment Script (`scripts/enrich.ts`) – orchestrates batch scraping with caching to `data/cache/*.json` and writes aggregated enriched dataset.
- CLI (`cli.ts`) – user interface for listing, enrichment, and stats.
Scraper Heuristics (High-Level)
- Fetch with retry & jitter backoff.
- Extract `<meta name="description">`, first meaningful paragraph, or tagline patterns.
- Look for contact info via regex (emails, phone numbers, address fragments).
- Infer founding year via patterns like `Established 18xx|19xx|20xx`.
- Identify program/faculty keywords in navigation or section headers.
- Collect social links by domain match (twitter.com, facebook.com, etc.).
- Classify type (public/private/research/technical/community) by keyword/phrase heuristics.
- Score data quality based on number & diversity of successfully extracted fields.
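As a rough illustration of the heuristic style described above (the actual extraction code in `UniversityScraper` may differ):

```ts
// Sketch of the heuristic style described above; not the package's actual code.
const SOCIAL_HOSTS = ['twitter.com', 'facebook.com', 'instagram.com', 'linkedin.com', 'youtube.com'];

function inferFoundingYear(html: string): number | undefined {
  // Matches e.g. "Established 1861" or "Founded in 1920".
  const match = html.match(/\b(?:Established|Founded)(?:\s+in)?\s+((?:18|19|20)\d{2})\b/i);
  return match ? Number(match[1]) : undefined;
}

function collectSocialLinks(hrefs: string[]): string[] {
  // Keep only links whose URL contains a known social media host.
  return hrefs.filter((href) => SOCIAL_HOSTS.some((host) => href.includes(host)));
}

function scoreDataQuality(fields: Record<string, unknown>): { score: number; factors: string[] } {
  const factors = Object.entries(fields)
    .filter(([, v]) => v !== undefined && v !== null)
    .map(([k]) => k);
  // Naive score: fraction of fields successfully extracted.
  const total = Object.keys(fields).length || 1;
  return { score: factors.length / total, factors };
}
```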
Performance & Respectful Crawling
- Concurrency controlled by a queue (configurable).
- Optional pauses / resume; per‑record caching prevents redundant fetches.
- Future roadmap includes robots.txt parsing & adaptive politeness windows.
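A minimal sketch of the kind of rate-limited fetching this implies, using `p-queue` (already a dependency of this package); the settings shown are illustrative, not the project's defaults:

```ts
import PQueue from 'p-queue';

// Illustrative politeness settings, not the project's actual defaults.
const queue = new PQueue({
  concurrency: 3,   // at most 3 homepages in flight
  interval: 1000,   // rate window in ms
  intervalCap: 2,   // at most 2 new fetches per window
});

async function fetchWithRetry(url: string, attempts = 3): Promise<string> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return await res.text();
    } catch {
      // fall through to backoff
    }
    // Jittered exponential backoff between retries.
    await new Promise((r) => setTimeout(r, 2 ** i * 500 + Math.random() * 250));
  }
  throw new Error(`Failed to fetch ${url}`);
}

const html = await queue.add(() => fetchWithRetry('https://www.mit.edu/'));
```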
Enrichment Workflow
The batch enrichment script is optional and can be executed when you purposely want deeper metadata.
```bash
npm run build
node dist/scripts/enrich.js --concurrency 3 --resume
```

Flags (planned / implemented):
| Flag | Description |
| --------------------- | ------------------------------------------------------------------------- |
| --concurrency <n> | Number of parallel fetches (default modest to prevent overloading sites). |
| --resume | Skip already cached universities (looks in data/cache/). |
| --limit <n> | (Planned) Process only the first N universities for sampling. |
| --country-code <CC> | (Planned) Restrict enrichment to a country subset. |
Outputs:
- `data/cache/{universityId}.json` – per‑university enriched snapshot.
- `data/enriched-universities.json` – aggregated enriched dataset (written after run).
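In practice, `--resume` amounts to skipping any university whose cache file already exists. A sketch of that check, with the skip logic assumed from the outputs above:

```ts
import { existsSync, readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

const CACHE_DIR = 'data/cache';

// Skip-if-cached behavior as implied by --resume; the real script may differ.
// `University` is the interface from the Data Model section.
async function loadOrEnrich(
  u: University,
  enrich: (u: University) => Promise<University>,
): Promise<University> {
  const cachePath = join(CACHE_DIR, `${u.id}.json`);
  if (existsSync(cachePath)) {
    return JSON.parse(readFileSync(cachePath, 'utf8')) as University;
  }
  const enriched = await enrich(u);
  writeFileSync(cachePath, JSON.stringify(enriched, null, 2));
  return enriched;
}
```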
CLI Reference
| Command | Purpose | Key Options |
| -------------- | ---------------------------------------------------------- | ------------------------------------------------------------ |
| list | Filter & display base (or partially enriched) universities | --name, --country, --country-code, --limit, --json |
| enrich <url> | Enrich a single university homepage | (none yet; uses internal defaults) |
| stats | Show aggregated statistics | None |
Examples:
```bash
universities list --name technology --limit 8
universities list --country-code GB --json --limit 5
universities enrich https://www.stanford.edu/
universities stats
```

Roadmap
- [ ] Full dataset enrichment pipeline automation & snapshot publishing
- [ ] Dual ESM + CJS distribution build (current workaround: lazy import for ESM‑only deps)
- [ ] Robots.txt compliance & politeness policy configuration
- [ ] Advanced classification (continent/region inference, size estimation heuristics, ranking ingestion)
- [ ] Pluggable enrichment modules (e.g., ranking APIs, accreditation feeds)
- [ ] Incremental persistent store (SQLite / LiteFS / DuckDB) for historical deltas
- [ ] Comprehensive test suite (scraper mocks, repository edge cases, CLI integration)
- [ ] Documentation site (API reference, enrichment metrics dashboard)
- [ ] Progressive enrichment resume with queuing telemetry
- [ ] Data provenance & reproducibility manifest (hashes, run metadata)
Testing
Run unit and integration tests:
```bash
npm test
```

Coverage reports are emitted to `coverage/`.
Contributing
We welcome contributions! Suggested steps:
- Fork & create a feature branch.
- Install dependencies: `npm install`.
- Run `npm run build` & ensure tests pass.
- Add or update tests for your change.
- Follow lint & formatting (`npm run lint`, `npm run format`).
- Submit a PR referencing any related issues.
Please consult (or propose) a CONTRIBUTING.md for evolving guidelines. Ethical scraping considerations and rate limiting are especially important—avoid aggressive concurrency.
Ethical & Legal Considerations
- This project performs only light, homepage‑level scraping by default.
- Always respect target site terms of service and robots.txt (planned feature for enforcement).
- Do not use the enrichment pipeline to harvest personal data beyond institutional metadata.
- Consider running enrichment in batches with conservative concurrency settings.
Troubleshooting
| Issue | Cause | Resolution |
| --------------------------------------- | --------------------------------------------------------------- | --------------------------------------------------------------- |
| ERR_REQUIRE_ESM when using CLI list | ESM‑only dependency (p-queue) pulled into non‑enrichment path | Resolved via lazy import; update to latest version of package |
| Empty enrichment fields | Site structure variation | Re‑run later or inspect HTML; heuristics will improve over time |
| Slow enrichment run | Network latency / conservative concurrency | Increase --concurrency cautiously |
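The lazy-import fix mentioned in the first row defers the ESM-only module until it is actually needed; a sketch of the pattern (illustrative, not the CLI's exact code):

```ts
// Deferring the ESM-only dependency keeps `universities list` on a pure-CJS path.
async function getQueue(concurrency: number) {
  // Dynamic import() works from CommonJS, whereas require('p-queue') throws ERR_REQUIRE_ESM.
  const { default: PQueue } = await import('p-queue');
  return new PQueue({ concurrency });
}
```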
Security
No secrets are stored. If you identify a security concern (e.g., vulnerable dependency or scraping misuse vector) please open an issue with reproduction details or use private disclosure if sensitive.
License
This repository is distributed under the terms of the MIT License. See LICENSE for details.
Acknowledgements
Inspired by the open university datasets community and contributors who maintain baseline CSV resources. Future improvements will strive for transparency, repeatability, and respectful data gathering.
Give a ⭐ if you find this useful and feel free to open issues for ideas or enhancements.
Generated documentation improvements are iterative; feel free to propose edits.
How To Contribute
Installation

```bash
npm install
```

Running

```bash
npm start
```

or

```bash
npm run dev
```

Testing

```bash
npm test
```

Building

```bash
npm run build
```

Thanks to all Contributors 💪
- Thank you for considering contributing.
- Feel free to submit feature requests, UI updates, and bugs as issues.
- Check out the Contribution Guidelines for more information.
- Have a feature request? Feel free to create an issue for it.
Your Support means a lot
Give a ⭐ to show support for the project.
