pagescan
v0.0.2
A CLI tool that extracts design elements from web pages and saves them as XML
PageScan
Extract design elements from any webpage and transform them into LLM-ready XML format.
PageScan is a powerful CLI tool that helps you capture HTML/CSS from web pages with intelligent optimization and automatic design token extraction. Perfect for design system analysis, UI component extraction, and AI-assisted design workflows.
Features
- Interactive Extraction - Launch a browser, set up the page exactly as you want, then extract with a single keypress
- Smart Optimization - Automatically removes scripts, comments, hidden elements, and unnecessary attributes
- Design Token Detection - Auto-categorizes CSS variables into colors, spacing, and typography
- LLM-Ready Output - Structured XML format optimized for AI analysis and processing
- Zero Configuration - Works out of the box with sensible defaults
Getting Started
Installation
Install globally via npm:
npm install -g pagescan
Or use directly with npx (no installation required):
npx pagescan https://example.com
Alternative Package Managers
Using pnpm:
pnpm add -g pagescan
Using yarn:
yarn global add pagescan
Quick Start
- Run PageScan with a URL:
pagescan https://github.com
- A browser window opens. Interact with the page (scroll, click, expand menus).
- Press Enter in the terminal when ready to capture.
- Find your output in the output/ directory as structured XML.
Usage
Basic Usage
pagescan <URL>
Example:
pagescan https://github.com
pagescan https://stripe.com/pricing
How It Works
- Launch - Run the command with your target URL
- Interact - A browser window opens - scroll, click, expand menus, etc.
- Capture - Press Enter in the terminal when ready
- Extract - PageScan automatically captures and optimizes the HTML/CSS
- Done - Find your structured XML in the output/ directory
Example Workflow
# Extract a component library
pagescan https://ui.shadcn.com/docs/components/button
# Capture a landing page
pagescan https://vercel.com
# Analyze a pricing page
pagescan https://stripe.com/pricing
Output Format
PageScan generates timestamped XML files in the output/ directory with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<design-extraction>
  <metadata>
    <url>https://example.com</url>
    <timestamp>2025-01-01T00:00:00.000Z</timestamp>
  </metadata>
  <styles>
    <design-tokens>
      <colors>
        <token name="--primary-color" value="#3498db" />
        <token name="--background" value="#ffffff" />
      </colors>
      <spacing>
        <token name="--spacing-md" value="16px" />
        <token name="--gap-lg" value="24px" />
      </spacing>
      <typography>
        <token name="--font-base" value="Arial, sans-serif" />
        <token name="--font-size-lg" value="18px" />
      </typography>
    </design-tokens>
    <global-styles><![CDATA[
      /* Optimized CSS content */
    ]]></global-styles>
  </styles>
  <structure>
    <html-content><![CDATA[
      <!-- Cleaned HTML structure -->
    ]]></html-content>
  </structure>
</design-extraction>
Optimization Features
PageScan intelligently cleans extracted HTML by removing:
- <script> tags and JavaScript code
- HTML comments
- Hidden elements (display:none, visibility:hidden)
- Elements with the hidden attribute
- Event handlers (onclick, onload, etc.)
- Long text content (preserving headings and buttons)
This results in clean, focused output perfect for design analysis and LLM processing.
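To make the cleanup rules above concrete, here is a minimal sketch covering three of them: script tags, HTML comments, and inline event handlers. It is not PageScan's actual optimizer. The function name `cleanHtml` is hypothetical, and the regex approach is a simplification chosen to keep the example dependency-free; a real implementation would more likely walk the DOM. Removing hidden elements and trimming long text is left out because both need a parsed document and computed styles.

```typescript
// Hypothetical helper, not part of PageScan's published API.
// Simplified regex-based cleanup; assumes attribute values contain no ">" character.
function cleanHtml(html: string): string {
  // Matches inline event-handler attributes such as onclick="..." or onload='...'.
  const eventHandler = /\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi;

  return html
    // Drop <script> elements together with their contents.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Strip event handlers, but only inside tags so body text is left alone.
    .replace(/<[^>]+>/g, (tag) => tag.replace(eventHandler, ""));
}

const sample =
  '<div onclick="go()" class="card"><!-- note --><script>alert(1)</script><button>Buy</button></div>';
const cleaned = cleanHtml(sample);
```

In this sketch, `cleaned` keeps the `class` attribute and the button while the script, the comment, and the `onclick` handler are gone.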
Design Token Classification
CSS custom properties are automatically categorized:
| Category | Detected Patterns |
|----------|------------------|
| Colors | color, background, bg, border-color, fill, stroke |
| Spacing | margin, padding, spacing, gap, inset |
| Typography | font, text, line-height, letter-spacing |
| Other | All other CSS variables |
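The table can be read as a name-based lookup. The sketch below is one possible interpretation, not the tool's actual classifier: the patterns are copied from the table, but `classifyToken`, the substring matching, and the order in which categories are checked are all assumptions.

```typescript
type TokenCategory = "colors" | "spacing" | "typography" | "other";

// Patterns copied from the table above; the check order is an assumption.
const CATEGORY_PATTERNS: Array<[TokenCategory, string[]]> = [
  ["colors", ["color", "background", "bg", "border-color", "fill", "stroke"]],
  ["spacing", ["margin", "padding", "spacing", "gap", "inset"]],
  ["typography", ["font", "text", "line-height", "letter-spacing"]],
];

// Hypothetical helper: returns the first category whose pattern occurs in the variable name.
function classifyToken(name: string): TokenCategory {
  const lowered = name.toLowerCase();
  for (const [category, patterns] of CATEGORY_PATTERNS) {
    if (patterns.some((pattern) => lowered.includes(pattern))) {
      return category;
    }
  }
  // Anything that matches no pattern falls into the catch-all row of the table.
  return "other";
}
```

Under these assumptions a name such as `--text-color` lands in colors, because colors are checked before typography. Plain substring matching can also misfire on short patterns like `bg`, so the real tool may well resolve such cases differently.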
Use Cases
- Design System Analysis - Extract and analyze design tokens from existing sites
- Component Libraries - Capture UI components for reference or reimplementation
- AI-Assisted Design - Feed structured design data to LLMs for analysis or generation
- Design Audits - Quickly capture and review design patterns across pages
- Documentation - Generate design documentation from live examples
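For the analysis-oriented use cases, a small script can pull the tokens back out of a generated file. This is a rough sketch that assumes the output matches the structure shown under Output Format. `readTokens` and `DesignToken` are hypothetical names, and the regex parsing is a shortcut that does not unescape XML entities; a proper XML parser would be the safer choice for real files.

```typescript
interface DesignToken {
  category: string;
  name: string;
  value: string;
}

// Hypothetical helper: collects <token> entries from the <design-tokens> block.
function readTokens(xml: string): DesignToken[] {
  const tokens: DesignToken[] = [];
  const block = /<design-tokens>([\s\S]*?)<\/design-tokens>/.exec(xml);
  if (block === null) {
    return tokens;
  }

  // Each child element (<colors>, <spacing>, ...) names a category.
  const categoryPattern = /<([a-z-]+)>([\s\S]*?)<\/\1>/g;
  let categoryMatch: RegExpExecArray | null;
  while ((categoryMatch = categoryPattern.exec(block[1])) !== null) {
    // Assumes the attribute order shown in the sample output: name, then value.
    const tokenPattern = /<token\s+name="([^"]*)"\s+value="([^"]*)"\s*\/>/g;
    let tokenMatch: RegExpExecArray | null;
    while ((tokenMatch = tokenPattern.exec(categoryMatch[2])) !== null) {
      tokens.push({
        category: categoryMatch[1],
        name: tokenMatch[1],
        value: tokenMatch[2],
      });
    }
  }
  return tokens;
}
```

The resulting array can then be serialized to JSON or pasted into an LLM prompt alongside the cleaned HTML.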
Development
Prerequisites
- Node.js 16+
- pnpm (recommended) or npm
Setup
git clone https://github.com/yourusername/design-extractor.git
cd design-extractor
pnpm install
Available Scripts
pnpm build # Build the project
pnpm dev <URL> # Run in development mode
pnpm test # Run tests
pnpm test:coverage # Run tests with coverage
pnpm lint # Lint code
pnpm typecheck # Type check
Project Structure
src/
├── index.ts # CLI entry point
├── core/ # Core logic
│ ├── extractor.ts # Page extraction
│ ├── optimizer.ts # HTML/CSS optimization
│ └── packer.ts # XML packaging
├── utils/ # Utilities
│ ├── css.ts # CSS helpers
│ └── xml.ts # XML helpers
├── types/ # Type definitions
│ └── index.ts
└── config.ts # Configuration
tests/
├── css-utils.test.ts # CSS utilities tests
├── xml-utils.test.ts # XML utilities tests
├── packer.test.ts # Packer tests
└── fixtures/ # Test data
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
MIT License - see LICENSE for details
Support
- Report bugs: GitHub Issues
- Questions: Open a discussion on GitHub
Made with ❤️ for designers and developers
