google-sites-clone
v0.2.1
Clone any Google Sites page to static HTML. Two-pass pipeline: SingleFile + Puppeteer.
🌐 Google Sites Clone
Clone any Google Sites page to static HTML — own your content forever
Quick Start · Features · How It Works · Tech Stack · Roadmap
No more vendor lock-in. Your Google Sites content belongs to you. Paste a URL, get a complete static clone with all images, styles, and navigation — ready for self-hosting.
💡 Concept
Google Sites stores your content behind an SPA that search engines can't index and you can't export. google-sites-clone uses a two-pass pipeline (SingleFile + Puppeteer) to capture everything — CSS fidelity from SingleFile and clean semantic content from Puppeteer — then merges both into standalone HTML files with localized images and SEO metadata.
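The image localization step described above can be sketched roughly as follows. This is a minimal illustration, assuming the captured HTML embeds images as `data:image/...;base64,...` URIs in `src` attributes. The function name `localizeImages`, the hash-based file names, and the supported image types are hypothetical choices, not the project's actual implementation.

```javascript
// Hypothetical sketch of base64 -> local file conversion (not the project's lib/images.js).
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Matches src="data:image/<type>;base64,<payload>" as found in a single-file capture.
const DATA_URI = /src="data:image\/(png|jpeg|gif|webp);base64,([A-Za-z0-9+/=]+)"/g;

function localizeImages(html, outDir) {
  fs.mkdirSync(outDir, { recursive: true });
  return html.replace(DATA_URI, (_match, type, payload) => {
    const bytes = Buffer.from(payload, 'base64');
    // Assumption: name files by content hash so identical images are written once.
    const hash = crypto.createHash('sha1').update(bytes).digest('hex').slice(0, 12);
    const ext = type === 'jpeg' ? 'jpg' : type;
    const fileName = `${hash}.${ext}`;
    fs.writeFileSync(path.join(outDir, fileName), bytes);
    // Rewrite the attribute to point at the local file, relative to the page.
    return `src="${path.basename(outDir)}/${fileName}"`;
  });
}
```

Calling `localizeImages(html, 'site/images')` would return HTML whose `src` attributes point at `images/<hash>.<ext>`, with the decoded files written to disk.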
✨ Features
| Feature | Description |
|---------|-------------|
| 🔍 Auto-crawl | Discovers all pages from sidebar navigation automatically |
| 🎨 Two-pass pipeline | SingleFile for CSS/images + Puppeteer for clean content |
| 🖼️ Image localization | Downloads all images as local files (no CDN dependency) |
| 📺 YouTube thumbnails | Converts embedded iframes to clickable thumbnails |
| 🎬 Video grid | Injects CSS Grid of video thumbnails into SingleFile pages |
| 🗺️ SEO ready | Generates sitemap.xml + robots.txt |
| ⚡ Batch processing | 5 pages per batch with anti-rate-limit pauses |
| 🔄 SPA fallback | Internal navigation for pages that fail direct URL loading |
| 🚀 GitHub Pages deploy | One command to push to gh-pages branch |
| 📦 ZIP export | Create downloadable archive of cloned site |
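The batch processing feature in the table above can be illustrated with a small helper. This is a sketch under stated assumptions: the batch size of 5 comes from the table, while the helper name `processInBatches`, the `worker` callback, and the 2000 ms default pause are hypothetical.

```javascript
// Hypothetical helper: runs a worker over items in fixed-size batches,
// pausing between batches to reduce the chance of being rate limited.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processInBatches(items, worker, { batchSize = 5, pauseMs = 2000 } = {}) {
  const results = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Items inside one batch run concurrently; failures are recorded, not thrown.
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    results.push(...settled);
    const isLastBatch = start + batchSize >= items.length;
    if (!isLastBatch) await sleep(pauseMs);
  }
  return results;
}
```

Entries with `status: 'rejected'` mark pages that failed to load, which is the kind of case a fallback strategy such as the SPA fallback listed above could retry.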
🚀 Quick Start
npx google-sites-clone https://sites.google.com/view/your-site

git clone https://github.com/maximosovsky/google-sites-clone
cd google-sites-clone
npm install
node bin/gsclone.js https://sites.google.com/view/your-site

gsclone <url> [options]
Options:
-o, --output <dir> Output directory (default: ./clone)
--no-images Skip image localization
--no-youtube Skip YouTube thumbnail download
--serve Start local server after build
--custom-nav Use custom sidebar navigation
--inline Keep images inline (base64)
💡 How It Works
URL → [1. Crawl] → page-map.json (all pages + structure)
→ [2. SingleFile] → _pages/ CSS + base64 images (visual fidelity)
→ [3. Puppeteer] → _content/ Clean content + iframe sources
→ [4. Images] → site/images/ base64 → local files
→ [4b. Video] → site/thumbnails/ YouTube/Vimeo thumbs
→ [5. Build] → site/ iframe nav + pages + video grid + report

| Pass | Tool | Captures |
|------|------|----------|
| 1 | Puppeteer | Navigation structure → page-map.json |
| 2 | SingleFile CLI | CSS, base64 images, layout → _pages/ |
| 3 | Puppeteer (batch ×5) | Clean text, links, iframe srcs → _content/ |
| 4 | Base64 decoder | Images from SingleFile → site/images/ |
| 4b | Video scanner | YouTube/Vimeo thumbnails → site/thumbnails/ |
| 5 | Build script | iframe nav + video grid + report + sitemap → site/ |
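Pass 4b turns embedded videos into thumbnails. The sketch below shows one way the YouTube part could work: derive the video ID from an embed URL and build a clickable thumbnail link. The helper names are hypothetical, the Vimeo case and the download into site/thumbnails/ are omitted, and the markup is an assumption. The `img.youtube.com/vi/<id>/hqdefault.jpg` address is YouTube's standard thumbnail URL pattern.

```javascript
// Hypothetical helpers (not the project's lib/video.js).
function youtubeIdFromEmbed(src) {
  let url;
  try {
    // The base URL only matters for protocol-relative sources like //www.youtube.com/...
    url = new URL(src, 'https://example.invalid/');
  } catch {
    return null;
  }
  const host = url.hostname.replace(/^www\./, '');
  if (host !== 'youtube.com' && host !== 'youtube-nocookie.com') return null;
  // YouTube video IDs are 11 characters from this alphabet.
  const match = url.pathname.match(/^\/embed\/([A-Za-z0-9_-]{11})$/);
  return match ? match[1] : null;
}

function thumbnailLink(src) {
  const id = youtubeIdFromEmbed(src);
  if (!id) return null;
  return (
    `<a href="https://www.youtube.com/watch?v=${id}">` +
    `<img src="https://img.youtube.com/vi/${id}/hqdefault.jpg" alt="YouTube video ${id}">` +
    `</a>`
  );
}
```

A build step could replace each matching `<iframe>` with the returned markup and leave iframes for which `thumbnailLink` returns `null` untouched.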
🏗️ Tech Stack
| Layer | Technology |
|-------|------------|
| Runtime | Node.js 18+ |
| Content extraction | Puppeteer |
| CSS preservation | SingleFile CLI |
| CLI interface | Commander.js |

google-sites-clone/
├── bin/
│ └── gsclone.js # CLI entry point
├── lib/
│ ├── index.js # Pipeline orchestrator
│ ├── crawl.js # Auto-crawl navigation
│ ├── singlefile.js # SingleFile pass
│ ├── puppeteer.js # Puppeteer batch extraction
│ ├── images.js # Base64 → local images
│ ├── video.js # YouTube/Vimeo thumbnail download
│ ├── build.js # iframe nav + page assembly
│ └── report.js # Clone report dashboard
├── rebuild.js # Quick rebuild from cache
├── site/
│ ├── index.html # Landing page (gsclone.osovsky.com)
│ └── style.css # Design system
├── ARCHITECTURE.md
├── ROADMAP.md
├── llms.txt
└── package.json
🗺️ Roadmap
See ROADMAP.md for full details.
- [x] Core pipeline (SingleFile + Puppeteer)
- [x] CLI interface
- [x] Auto-crawl navigation
- [x] Image localization
- [x] iframe-based navigation (sidebar + content)
- [x] Clone report dashboard
- [x] Landing page (gsclone.osovsky.com)
- [x] YouTube/Vimeo thumbnail download
- [x] GitHub Pages deploy
- [x] ZIP export
- [ ] npm publish
🔀 Alternatives
| Tool | Approach | Google Sites (new) |
|------|----------|--------------------|
| HTTrack | Recursive wget-style crawl | ❌ Can't execute JavaScript, so it downloads an empty SPA shell |
| google-sites-backup | Google Sites API (GData) | ❌ Classic Sites only, API deprecated |
| generate-static-site | Headless SSR pre-render | ⚠️ Generic tool, no auto-crawl or Google Sites awareness |
| google-sites-clone | Puppeteer + SingleFile | ✅ Full SPA rendering, auto-crawl, CSS fidelity, image localization |

New Google Sites (2020+) is a single-page application — all content is rendered by JavaScript. Traditional crawlers see an empty page. That's why this project uses a headless browser.
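To make the "empty page" point concrete, here is a crude check for whether raw HTML fetched without a browser carries any readable content. It is only an illustration: the function names and the 200-character threshold are hypothetical, and regex-based tag stripping is a rough approximation, not a real HTML parser.

```javascript
// Crude heuristic for illustration only.
function visibleText(html) {
  return html
    // Drop script/style/noscript elements together with their contents.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Drop the remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

function looksLikeSpaShell(html, minChars = 200) {
  // Assumption: a page with very little text outside scripts is an unrendered shell.
  return visibleText(html).length < minChars;
}
```

A response that is mostly `<script>` tags around an empty container would be flagged by this check, while the same page serialized after rendering in a headless browser would normally pass it.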
🤝 Contributing
Fork the repository, create a feature/name branch, and open a pull request.
📄 License
Maxim Osovsky. Licensed under MIT.
