guidelinescraper
v1.0.17
Scrape a Frontify brand portal and save every page as PDF and clean HTML
Frontify Guideline Scraper
Scrape a Frontify brand portal and save every guideline page as a PDF and clean semantic HTML.
How it works
- Discover — Queries Frontify's portal and document navigation APIs to build the full site tree (documents, pages, groups, headings, external links).
- Crawl — Visits every page with Playwright, expands accordions, forces lazy images to load, dismisses cookie/overlay dialogs, then saves a PDF and raw HTML snapshot.
- Clean — Strips the raw HTML down to semantic content (headings, text, images, tables) with no scripts, styles, or navigation chrome.
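The Crawl step above could look roughly like the following. This is a minimal sketch, not the package's actual code: `capturePage` is a hypothetical name, and the selectors used to find accordions and lazy images are assumptions about the page markup. The `page` argument is expected to behave like a Playwright `Page`.

```javascript
// Sketch of a per-page capture. `capturePage` and the selectors are
// illustrative assumptions, not taken from the package source.
async function capturePage(page, url, pdfPath) {
  // Wait for network activity to settle so API-driven content is rendered.
  await page.goto(url, { waitUntil: "networkidle" });

  // Runs in the browser context: open <details> accordions and switch
  // lazy images to eager loading so they show up in the PDF.
  await page.evaluate(() => {
    for (const el of document.querySelectorAll("details")) el.open = true;
    for (const img of document.querySelectorAll('img[loading="lazy"]')) {
      img.loading = "eager";
    }
  });

  // A4 page format with background graphics enabled.
  await page.pdf({ path: pdfPath, format: "A4", printBackground: true });

  // Return the raw HTML snapshot; the caller decides where to store it.
  return page.content();
}
```

With Playwright installed, `page` would typically come from `chromium.launch()` followed by `browser.newPage()`. Note that `page.pdf()` is only supported in headless Chromium.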
Setup
npm install
npx playwright install chromium
Usage
node crawl.mjs --url brand.uber.com
Or pass a full URL:
node crawl.mjs --url https://developer.frontify.com
Options
| Flag | Short | Description |
|------|-------|-------------|
| --url <url> | -u | Portal domain or full URL |
| --hub <id> | -h | Hub ID (auto-detected if omitted) |
| --cookie <str> | -c | Cookie header for authenticated portals |
| --help | | Show help |
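Since `--url` accepts either a bare domain or a full URL, the input has to be normalized somewhere. One way this might be done is sketched below; `normalizePortalUrl` is a hypothetical helper, and the choice to default to `https://` is an assumption.

```javascript
// Hypothetical helper: accept "brand.uber.com" or a full URL and return
// the portal origin and domain. Not taken from the package source.
function normalizePortalUrl(input) {
  const trimmed = input.trim();
  // Prepend https:// when no scheme is given (assumed default).
  const withScheme = /^https?:\/\//i.test(trimmed)
    ? trimmed
    : `https://${trimmed}`;
  // The global URL constructor throws a TypeError on invalid input.
  const url = new URL(withScheme);
  return { origin: url.origin, domain: url.hostname };
}

console.log(normalizePortalUrl("brand.uber.com"));
console.log(normalizePortalUrl("https://developer.frontify.com"));
```

The `domain` value is the kind of string that could be used for the `output/{domain}/` folder.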
These can also be set via environment variables or a .env file:
URL=brand.uber.com
HUB_ID=25
COOKIE=frontify-session-id=your-session-id
Output
output/{domain}/
  pdf/
    Group Name/
      Document Title.pdf
      Document Title/
        Page Title.pdf
  html/
    Group Name/
      Document Title.html
      ...
- PDF — Full-page A4 captures with background graphics, expanded accordions, and loaded lazy images.
- HTML — Cleaned semantic HTML: headings, paragraphs, images, tables. No scripts, styles, classes, or navigation elements. Wrapped in minimal readable CSS.
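To illustrate the folder layout above, here is a sketch of how group, document, and page titles could be mapped to an output path. Both `safeName` and `outputPath` are hypothetical helpers, and the character replacement rule is an assumption; the package may sanitize titles differently.

```javascript
// Hypothetical helpers illustrating the output layout; not from the package.
function safeName(title) {
  // Replace characters that are invalid in file names on common systems,
  // then collapse runs of whitespace.
  return title.replace(/[\/\\:*?"<>|]/g, "-").replace(/\s+/g, " ").trim();
}

function outputPath(domain, kind, segments) {
  // kind is "pdf" or "html"; segments is e.g. [group, document, page].
  const parts = segments.map(safeName);
  const file = `${parts.pop()}.${kind}`; // last segment becomes the file name
  return ["output", domain, kind, ...parts, file].join("/");
}

console.log(outputPath("brand.uber.com", "pdf", ["Group Name", "Document Title", "Page Title"]));
// prints "output/brand.uber.com/pdf/Group Name/Document Title/Page Title.pdf"
```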
Discover only
Run the discovery step standalone to inspect or save the navigation tree:
node discover.mjs --url brand.uber.com --output brand.uber.com.json
This outputs a JSON tree of the portal's structure without crawling any pages.
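Once the tree is saved, it can be inspected with a small recursive walk. The sketch below assumes a node shape of `{ type, title, children }`, which is a guess for illustration only; the actual schema written by `discover.mjs` may differ, and `listPages` is a hypothetical name.

```javascript
// Assumed (hypothetical) node shape: { type, title, children? }.
// Returns one "A / B / C" breadcrumb string per page node in the tree.
function listPages(node, trail = []) {
  const here = [...trail, node.title];
  if (node.type === "page") return [here.join(" / ")];
  // Non-page nodes without children (e.g. headings) contribute nothing.
  return (node.children ?? []).flatMap((child) => listPages(child, here));
}

// Example tree using the assumed shape.
const tree = {
  type: "portal",
  title: "brand.uber.com",
  children: [
    {
      type: "group",
      title: "Group Name",
      children: [
        {
          type: "document",
          title: "Document Title",
          children: [{ type: "page", title: "Page Title" }],
        },
      ],
    },
  ],
};

console.log(listPages(tree));
// prints [ 'brand.uber.com / Group Name / Document Title / Page Title' ]
```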
Clean HTML only
Re-clean previously scraped raw HTML:
node purge-html.mjs output/.raw/html output/clean
Authenticated portals
For portals that require login, copy your session cookie from your browser's developer tools and pass it with --cookie:
node crawl.mjs --url brand.uber.com --cookie "frontify-session-id=your-session-id"
Or add it to .env:
COOKIE=frontify-session-id=your-session-id
See .env.example for reference.
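For context, a cookie header string like the one above usually has to be split into individual cookies before a browser context can use it. The following is a sketch under that assumption; `parseCookieHeader` is a hypothetical helper, and how the package actually applies the cookie is not documented here. The returned objects follow the `{ name, value, domain, path }` shape that Playwright's `browserContext.addCookies()` accepts.

```javascript
// Hypothetical helper: turn a Cookie header string into cookie objects.
// Not taken from the package source.
function parseCookieHeader(header, domain) {
  return header
    .split(";")
    .map((pair) => pair.trim())
    .filter((pair) => pair.includes("=")) // skip empty or malformed parts
    .map((pair) => {
      const eq = pair.indexOf("="); // split on the first "=" only
      return {
        name: pair.slice(0, eq),
        value: pair.slice(eq + 1),
        domain,
        path: "/",
      };
    });
}

console.log(parseCookieHeader("frontify-session-id=your-session-id", "brand.uber.com"));
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.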
