pagescan

v0.0.2

Published

3 months ago

A CLI tool that extracts design elements from web pages and saves them as XML


greatsumini

design extractor cli web scraping puppeteer design-system xml

PageScan

Extract design elements from any webpage and transform them into LLM-ready XML format.

PageScan is a CLI tool that captures HTML/CSS from web pages, strips content that is irrelevant to design analysis, and extracts design tokens automatically. It suits design system analysis, UI component extraction, and AI-assisted design workflows.

Features

Interactive Extraction - Launch a browser, set up the page exactly as you want, then extract with a single keypress
Smart Optimization - Automatically removes scripts, comments, hidden elements, and unnecessary attributes
Design Token Detection - Auto-categorizes CSS variables into colors, spacing, typography, and other
LLM-Ready Output - Structured XML format optimized for AI analysis and processing
Zero Configuration - Works out of the box with sensible defaults

Getting Started

Installation

Install globally via npm:

npm install -g pagescan

Or use directly with npx (no installation required):

npx pagescan https://example.com

Alternative Package Managers

Using pnpm:

pnpm add -g pagescan

Using yarn:

yarn global add pagescan

Quick Start

Run PageScan with a URL
```
pagescan https://github.com
```
A browser window opens - interact with the page (scroll, click, expand menus)
Press Enter in the terminal when you are ready to capture
Find your output in the output/ directory as structured XML

Usage

Basic Usage

pagescan <URL>

Examples:

pagescan https://github.com
pagescan https://stripe.com/pricing

How It Works

Launch - Run the command with your target URL
Interact - A browser window opens - scroll, click, expand menus, etc.
Capture - Press Enter in the terminal when ready
Extract - PageScan automatically captures and optimizes the HTML/CSS
Done - Find your structured XML in the output/ directory

Example Workflow

# Extract a component library
pagescan https://ui.shadcn.com/docs/components/button

# Capture a landing page
pagescan https://vercel.com

# Analyze a pricing page
pagescan https://stripe.com/pricing

Output Format

PageScan generates timestamped XML files in the output/ directory with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<design-extraction>
  <metadata>
    <url>https://example.com</url>
    <timestamp>2025-01-01T00:00:00.000Z</timestamp>
  </metadata>

  <styles>
    <design-tokens>
      <colors>
        <token name="--primary-color" value="#3498db" />
        <token name="--background" value="#ffffff" />
      </colors>
      <spacing>
        <token name="--spacing-md" value="16px" />
        <token name="--gap-lg" value="24px" />
      </spacing>
      <typography>
        <token name="--font-base" value="Arial, sans-serif" />
        <token name="--font-size-lg" value="18px" />
      </typography>
    </design-tokens>

    <global-styles><![CDATA[
      /* Optimized CSS content */
    ]]></global-styles>
  </styles>

  <structure>
    <html-content><![CDATA[
      <!-- Cleaned HTML structure -->
    ]]></html-content>
  </structure>
</design-extraction>
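
As a rough illustration of how a document with this shape could be assembled, here is a minimal TypeScript sketch. The `Extraction` type and the `escapeXml`, `cdata`, `tokenGroup`, and `buildXml` helpers are hypothetical names chosen for this example; they are not taken from PageScan's source, and the real packer may work differently.

```typescript
// Hypothetical types for illustration; PageScan's own types may differ.
interface Token {
  name: string;
  value: string;
}

interface Extraction {
  url: string;
  timestamp: string;
  tokens: { colors: Token[]; spacing: Token[]; typography: Token[] };
  css: string;
  html: string;
}

// Escape characters that cannot appear verbatim in XML text or attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// A CDATA section cannot contain "]]>", so split any occurrence across two sections.
function cdata(text: string): string {
  return "<![CDATA[" + text.replace(/\]\]>/g, "]]]]><![CDATA[>") + "]]>";
}

// Render one token category as indented lines.
function tokenGroup(tag: string, tokens: Token[]): string[] {
  return [
    `      <${tag}>`,
    ...tokens.map(
      (t) => `        <token name="${escapeXml(t.name)}" value="${escapeXml(t.value)}" />`
    ),
    `      </${tag}>`,
  ];
}

// Assemble the full document in the element order shown above.
function buildXml(data: Extraction): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<design-extraction>",
    "  <metadata>",
    `    <url>${escapeXml(data.url)}</url>`,
    `    <timestamp>${escapeXml(data.timestamp)}</timestamp>`,
    "  </metadata>",
    "  <styles>",
    "    <design-tokens>",
    ...tokenGroup("colors", data.tokens.colors),
    ...tokenGroup("spacing", data.tokens.spacing),
    ...tokenGroup("typography", data.tokens.typography),
    "    </design-tokens>",
    `    <global-styles>${cdata(data.css)}</global-styles>`,
    "  </styles>",
    "  <structure>",
    `    <html-content>${cdata(data.html)}</html-content>`,
    "  </structure>",
    "</design-extraction>",
  ].join("\n");
}
```

Wrapping the CSS and HTML in CDATA keeps them readable for an LLM, since angle brackets inside those sections do not need to be entity-escaped.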

Optimization Features

PageScan cleans the extracted HTML by removing:

<script> tags and JavaScript code
HTML comments
Hidden elements (display:none, visibility:hidden)
Elements with the hidden attribute
Event handlers (onclick, onload, etc.)
Long text content (preserving headings and buttons)

The result is compact output suited to design analysis and LLM processing.
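
To make three of these removal rules concrete, here is a minimal TypeScript sketch that works on an HTML string. It is a regex-based approximation written for this README, not PageScan's actual optimizer; the `cleanHtml` name is hypothetical. Hidden-element removal and text truncation are left out because they are more reliably done against a real DOM.

```typescript
// Regex-based approximation for illustration only. It can misfire on
// unusual markup (for example, "on...=" text inside attribute values).
function cleanHtml(html: string): string {
  return html
    // Remove <script> elements together with their contents.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // Remove HTML comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Remove inline event handler attributes such as onclick="..." or onload='...'.
    .replace(/\s+on[a-z]+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, "");
}

const sample =
  '<div onclick="go()" class="a"><!-- note --><script>alert(1)</script><button>Buy</button></div>';
console.log(cleanHtml(sample));
```

Scripts are removed before comments so that comment-like text inside a script body does not confuse the comment pattern.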

Design Token Classification

CSS custom properties are automatically categorized:

| Category | Detected Patterns |
|----------|------------------|
| Colors | color, background, bg, border-color, fill, stroke |
| Spacing | margin, padding, spacing, gap, inset |
| Typography | font, text, line-height, letter-spacing |
| Other | All other CSS variables |
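
The classification can be sketched as a substring match on the variable name. The TypeScript below is an illustrative sketch, not PageScan's implementation: the `classifyToken` name is hypothetical, and the tie-breaking rule (the longest matching pattern wins, so `--letter-spacing` lands in typography rather than spacing) is an assumption, since the table does not say how overlapping patterns are resolved.

```typescript
type TokenCategory = "colors" | "spacing" | "typography" | "other";

// Patterns copied from the table above, in table order.
const CATEGORY_PATTERNS: Array<[TokenCategory, string[]]> = [
  ["colors", ["color", "background", "bg", "border-color", "fill", "stroke"]],
  ["spacing", ["margin", "padding", "spacing", "gap", "inset"]],
  ["typography", ["font", "text", "line-height", "letter-spacing"]],
];

// Assumption: when several patterns match, the longest one decides the category;
// on equal length, the earlier table row wins.
function classifyToken(name: string): TokenCategory {
  const lower = name.toLowerCase();
  let best: TokenCategory = "other";
  let bestLength = 0;
  for (const [category, patterns] of CATEGORY_PATTERNS) {
    for (const pattern of patterns) {
      if (pattern.length > bestLength && lower.indexOf(pattern) !== -1) {
        best = category;
        bestLength = pattern.length;
      }
    }
  }
  return best;
}

console.log(classifyToken("--primary-color"));
console.log(classifyToken("--letter-spacing"));
```

A simpler first-match rule would also work, but it would place `--letter-spacing` under spacing because `spacing` is a substring of it.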

Use Cases

Design System Analysis - Extract and analyze design tokens from existing sites
Component Libraries - Capture UI components for reference or reimplementation
AI-Assisted Design - Feed structured design data to LLMs for analysis or generation
Design Audits - Quickly capture and review design patterns across pages
Documentation - Generate design documentation from live examples

Development

Prerequisites

Node.js 16+
pnpm (recommended) or npm

Setup

git clone https://github.com/yourusername/design-extractor.git
cd design-extractor
pnpm install

Available Scripts

pnpm build          # Build the project
pnpm dev <URL>      # Run in development mode
pnpm test           # Run tests
pnpm test:coverage  # Run tests with coverage
pnpm lint           # Lint code
pnpm typecheck      # Type check

Project Structure

src/
├── index.ts           # CLI entry point
├── core/              # Core logic
│   ├── extractor.ts   # Page extraction
│   ├── optimizer.ts   # HTML/CSS optimization
│   └── packer.ts      # XML packaging
├── utils/             # Utilities
│   ├── css.ts         # CSS helpers
│   └── xml.ts         # XML helpers
├── types/             # Type definitions
│   └── index.ts
└── config.ts          # Configuration

tests/
├── css-utils.test.ts  # CSS utilities tests
├── xml-utils.test.ts  # XML utilities tests
├── packer.test.ts     # Packer tests
└── fixtures/          # Test data

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

MIT License - see LICENSE for details

Support

Report bugs: GitHub Issues
Questions: Open a discussion on GitHub

Made with ❤️ for designers and developers