universities
v0.0.3
Comprehensive worldwide universities dataset with TypeScript API, CLI tools, and data processing utilities. Includes web scraping capabilities with respectful rate limiting for enriching university data.
Overview
universities is an evolving TypeScript/Node.js library and CLI that provides a structured, extensible dataset of the world's universities along with an enrichment pipeline that (optionally) visits institutional homepages to extract additional metadata.
Core goals:
- Provide immediate, zero‑network access to a clean base list of universities (name, domains, country info, website) sourced from the public world universities dataset.
- Offer an enrichment layer (opt‑in) that scrapes each university homepage respectfully (rate‑limited + retries) to infer or collect:
  - Descriptions / taglines / motto
  - Contact and location hints
  - Founding year
  - Academic programs & faculties (heuristic extraction)
  - Social media links
  - Institutional classification (public/private, research, technical, community, etc. — heuristic)
  - Degree levels (undergraduate / graduate / doctoral)
  - Data quality scoring for traceability
- Expose ergonomic programmatic APIs for search, filtering, and statistics.
- Provide a CLI for quick querying, enrichment, and aggregated stats generation.
- Remain transparent, reproducible, and respectful of target sites (configurable concurrency, caching, resumability, optional full‑dataset execution).
NOTE: Full automatic enrichment of every university (≈9k+) can take considerable time and should be run thoughtfully to avoid undue load on remote servers. The base dataset works instantly without enrichment.
Key Features
- Base dataset loader (CSV → strongly typed objects)
- In‑memory repository with searching, filtering, sorting and basic statistics
- Extensible domain model (`University`, `Program`, `Faculty`, ranking + classification enums)
- Heuristic scraper with retry + rate limiting queue
- Batch enrichment script with per‑university JSON caching (resumable)
- CLI with subcommands: `list`, `enrich`, `stats`
- TypeScript declarations for consumption in TS or JS projects
- Modular architecture to allow swapping scraping strategies or adding alternate data sources later (e.g., APIs, ranking feeds)
Installation
Install locally (library usage inside another project):
```bash
npm install universities
```

Or for global CLI usage (optional):

```bash
npm install -g universities
```

After a global install you can invoke the CLI via the `universities` command (see CLI section below). When using as a dependency, import from the package entry points.
Quick Start (CLI)
List the first 5 US universities:
```bash
universities list --country-code US --limit 5
```

Search by name fragment:

```bash
universities list --name polytechnic --limit 10
```

Output JSON instead of a table:

```bash
universities list --country-code CA --json --limit 3
```

Enrich a single university (fetch + parse homepage):

```bash
universities enrich https://www.mit.edu/
```

View aggregated stats (counts by type, size, etc. — improves once enriched data exists):

```bash
universities stats
```

Programmatic Usage
```ts
import { loadBaseUniversities } from 'universities/dist/data/loadBase';
import { UniversityRepository } from 'universities/dist/repository/UniversityRepository';

async function example() {
  const base = await loadBaseUniversities();
  const repo = new UniversityRepository(base);
  const results = repo.search({ countryCode: 'US', name: 'state', limit: 20 });
  console.log(results.slice(0, 3));
  console.log(repo.stats());
}

example();
```

The scraper (`UniversityScraper`) is intentionally decoupled and lazily imported in the CLI to avoid pulling ESM‑only dependencies when unnecessary. For programmatic enrichment you can:

```ts
import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';
```
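A minimal enrichment sketch follows; the `scrape` method name and signature are assumptions for illustration rather than a confirmed API, so check the package's type declarations for the actual surface:

```ts
import { loadBaseUniversities } from 'universities/dist/data/loadBase';
import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';

async function enrichOne() {
  const base = await loadBaseUniversities();
  const mit = base.find((u) => u.domains.includes('mit.edu'));
  if (!mit) return;

  // Hypothetical call: the real method name/signature may differ.
  const scraper = new UniversityScraper();
  const enriched = await scraper.scrape(mit.webPages[0]);
  console.log(enriched.description, enriched.foundingYear);
}

enrichOne();
```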
Data Model (Simplified)
```ts
interface University {
  id: string;                 // Stable hash/id generation
  name: string;
  country: string;
  countryCode: string;
  alphaTwoCode?: string;      // If present in source
  webPages: string[];         // One or more homepage URLs
  domains: string[];          // Domain(s)
  stateProvince?: string;

  // Enriched fields (optional until scraping):
  description?: string;
  motto?: string;
  foundingYear?: number;
  location?: string;
  contact?: { email?: string; phone?: string; address?: string };
  programs?: { name: string; degreeLevels?: string[] }[];
  faculties?: { name: string; description?: string }[];
  social?: { twitter?: string; facebook?: string; instagram?: string; linkedin?: string; youtube?: string };
  classification?: { type?: string; degreeLevel?: string[] };
  dataQuality?: { score: number; factors: string[] };
  enrichedAt?: string;        // ISO timestamp when enrichment occurred
}
```

See the full definitions in `src/types/University.ts` for exhaustive enum types, search options, stats structure, and classification helpers.
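For orientation, a base (un-enriched) record might look like the following. The id scheme shown (truncated SHA-1 of the primary domain) is an assumption; the interface comment only promises "stable hash/id generation".

```ts
import { createHash } from 'node:crypto';

// Assumed scheme: derive a stable id by hashing the primary domain.
const domain = 'mit.edu';
const id = createHash('sha1').update(domain).digest('hex').slice(0, 12);

// `University` is the interface from the Data Model section above.
const record: University = {
  id,
  name: 'Massachusetts Institute of Technology',
  country: 'United States',
  countryCode: 'US',
  webPages: ['https://www.mit.edu/'],
  domains: [domain],
};
```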
Architecture Overview
Layered design:
- Source Layer (`world-universities.csv`) – raw dataset.
- Loader (`loadBaseUniversities`) – parses CSV into partial `University` objects.
- Domain Types (`types/University.ts`) – strongly typed schema + enums + search contracts.
- Repository (`UniversityRepository`) – in‑memory indexing, filtering, sorting, basic statistics.
- Scraper (`UniversityScraper`) – fetch + parse homepage, extraction heuristics, classification & data quality scoring (rate limited via queue).
- Enrichment Script (`scripts/enrich.ts`) – orchestrates batch scraping with caching to `data/cache/*.json` and writes aggregated enriched dataset.
- CLI (`cli.ts`) – user interface for listing, enrichment, and stats.
Scraper Heuristics (High-Level)
- Fetch with retry & jitter backoff.
- Extract `<meta name="description">`, first meaningful paragraph, or tagline patterns.
- Look for contact info via regex (emails, phone numbers, address fragments).
- Infer founding year via patterns like `Established 18xx|19xx|20xx`.
- Identify program/faculty keywords in navigation or section headers.
- Collect social links by domain match (twitter.com, facebook.com, etc.).
- Classify type (public/private/research/technical/community) by keyword/phrase heuristics.
- Score data quality based on number & diversity of successfully extracted fields.
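As a rough illustration of the heuristic style described above (the actual extraction code in `UniversityScraper` may differ):

```ts
// Sketch of the heuristic style described above; not the package's actual code.
const SOCIAL_HOSTS = ['twitter.com', 'facebook.com', 'instagram.com', 'linkedin.com', 'youtube.com'];

function inferFoundingYear(html: string): number | undefined {
  // Matches e.g. "Established 1861" or "Founded in 1920".
  const match = html.match(/\b(?:Established|Founded)(?:\s+in)?\s+((?:18|19|20)\d{2})\b/i);
  return match ? Number(match[1]) : undefined;
}

function collectSocialLinks(hrefs: string[]): string[] {
  // Keep only links whose URL contains a known social media host.
  return hrefs.filter((href) => SOCIAL_HOSTS.some((host) => href.includes(host)));
}

function scoreDataQuality(fields: Record<string, unknown>): { score: number; factors: string[] } {
  const factors = Object.entries(fields)
    .filter(([, v]) => v !== undefined && v !== null)
    .map(([k]) => k);
  // Naive score: fraction of fields successfully extracted.
  const total = Object.keys(fields).length || 1;
  return { score: factors.length / total, factors };
}
```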
Performance & Respectful Crawling
- Concurrency controlled by a queue (configurable).
- Optional pauses / resume; per‑record caching prevents redundant fetches.
- Future roadmap includes robots.txt parsing & adaptive politeness windows.
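A minimal sketch of the kind of rate-limited fetching this implies, using `p-queue` (already a dependency of this package); the settings shown are illustrative, not the project's defaults:

```ts
import PQueue from 'p-queue';

// Illustrative politeness settings, not the project's actual defaults.
const queue = new PQueue({
  concurrency: 3,   // at most 3 homepages in flight
  interval: 1000,   // rate window in ms
  intervalCap: 2,   // at most 2 new fetches per window
});

async function fetchWithRetry(url: string, attempts = 3): Promise<string> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return await res.text();
    } catch {
      // fall through to backoff
    }
    // Jittered exponential backoff between retries.
    await new Promise((r) => setTimeout(r, 2 ** i * 500 + Math.random() * 250));
  }
  throw new Error(`Failed to fetch ${url}`);
}

const html = await queue.add(() => fetchWithRetry('https://www.mit.edu/'));
```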
Enrichment Workflow
The batch enrichment script is optional and can be executed when you purposely want deeper metadata.
```bash
npm run build
node dist/scripts/enrich.js --concurrency 3 --resume
```

Flags (planned / implemented):
| Flag | Description |
| --------------------- | ------------------------------------------------------------------------- |
| --concurrency <n> | Number of parallel fetches (default modest to prevent overloading sites). |
| --resume | Skip already cached universities (looks in data/cache/). |
| --limit <n> | (Planned) Process only the first N universities for sampling. |
| --country-code <CC> | (Planned) Restrict enrichment to a country subset. |
Outputs:
- `data/cache/{universityId}.json` – per‑university enriched snapshot.
- `data/enriched-universities.json` – aggregated enriched dataset (written after run).
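In practice, `--resume` amounts to skipping any university whose cache file already exists. A sketch of that check, with the skip logic assumed from the outputs above:

```ts
import { existsSync, readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

const CACHE_DIR = 'data/cache';

// Skip-if-cached behavior as implied by --resume; the real script may differ.
// `University` is the interface from the Data Model section.
async function loadOrEnrich(
  u: University,
  enrich: (u: University) => Promise<University>,
): Promise<University> {
  const cachePath = join(CACHE_DIR, `${u.id}.json`);
  if (existsSync(cachePath)) {
    return JSON.parse(readFileSync(cachePath, 'utf8')) as University;
  }
  const enriched = await enrich(u);
  writeFileSync(cachePath, JSON.stringify(enriched, null, 2));
  return enriched;
}
```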
CLI Reference
| Command | Purpose | Key Options |
| -------------- | ---------------------------------------------------------- | ------------------------------------------------------------ |
| list | Filter & display base (or partially enriched) universities | --name, --country, --country-code, --limit, --json |
| enrich <url> | Enrich a single university homepage | (none yet; uses internal defaults) |
| stats | Show aggregated statistics | None |
Examples:
```bash
universities list --name technology --limit 8
universities list --country-code GB --json --limit 5
universities enrich https://www.stanford.edu/
universities stats
```

Roadmap
- [ ] Full dataset enrichment pipeline automation & snapshot publishing
- [ ] Dual ESM + CJS distribution build (current workaround: lazy import for ESM‑only deps)
- [ ] Robots.txt compliance & politeness policy configuration
- [ ] Advanced classification (continent/region inference, size estimation heuristics, ranking ingestion)
- [ ] Pluggable enrichment modules (e.g., ranking APIs, accreditation feeds)
- [ ] Incremental persistent store (SQLite / LiteFS / DuckDB) for historical deltas
- [ ] Comprehensive test suite (scraper mocks, repository edge cases, CLI integration)
- [ ] Documentation site (API reference, enrichment metrics dashboard)
- [ ] Progressive enrichment resume with queuing telemetry
- [ ] Data provenance & reproducibility manifest (hashes, run metadata)
Testing
Run unit and integration tests:
```bash
npm test
```

Coverage reports are emitted to `coverage/`.
Contributing
We welcome contributions! Suggested steps:
- Fork & create a feature branch.
- Install dependencies: `npm install`.
- Run `npm run build` & ensure tests pass.
- Add or update tests for your change.
- Follow lint & formatting (`npm run lint`, `npm run format`).
- Submit a PR referencing any related issues.
Please consult (or propose) a CONTRIBUTING.md for evolving guidelines. Ethical scraping considerations and rate limiting are especially important—avoid aggressive concurrency.
Ethical & Legal Considerations
- This project performs only light, homepage‑level scraping by default.
- Always respect target site terms of service and robots.txt (planned feature for enforcement).
- Do not use the enrichment pipeline to harvest personal data beyond institutional metadata.
- Consider running enrichment in batches with conservative concurrency settings.
Troubleshooting
| Issue | Cause | Resolution |
| --------------------------------------- | --------------------------------------------------------------- | --------------------------------------------------------------- |
| ERR_REQUIRE_ESM when using CLI list | ESM‑only dependency (p-queue) pulled into non‑enrichment path | Resolved via lazy import; update to latest version of package |
| Empty enrichment fields | Site structure variation | Re‑run later or inspect HTML; heuristics will improve over time |
| Slow enrichment run | Network latency / conservative concurrency | Increase --concurrency cautiously |
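The lazy-import fix mentioned in the first row defers the ESM-only module until it is actually needed; a sketch of the pattern (illustrative, not the CLI's exact code):

```ts
// Deferring the ESM-only dependency keeps `universities list` on a pure-CJS path.
async function getQueue(concurrency: number) {
  // Dynamic import() works from CommonJS, whereas require('p-queue') throws ERR_REQUIRE_ESM.
  const { default: PQueue } = await import('p-queue');
  return new PQueue({ concurrency });
}
```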
Security
No secrets are stored. If you identify a security concern (e.g., vulnerable dependency or scraping misuse vector) please open an issue with reproduction details or use private disclosure if sensitive.
License
This repository is distributed under the terms of the MIT License. See LICENSE for details.
Acknowledgements
Inspired by the open university datasets community and contributors who maintain baseline CSV resources. Future improvements will strive for transparency, repeatability, and respectful data gathering.
Give a ⭐ if you find this useful and feel free to open issues for ideas or enhancements.
Generated documentation improvements are iterative; feel free to propose edits.
How To Contribute
Installation

```bash
npm install
```

Running

```bash
npm start
```

or

```bash
npm run dev
```

Testing

```bash
npm test
```

Building

```bash
npm run build
```

Thanks to all Contributors 💪
- Thank you for considering contributing.
- Feel free to submit feature requests, UI updates, and bugs as issues.
- Check out the Contribution Guidelines for more information.
- Have a feature request? Feel free to create an issue for it.
Your Support means a lot
Give a ⭐ to show support for the project.
