@osmn-byhn/htmlparser

v0.1.0

Published

4 months ago

Unified HTML/CSS/JS extractor that inlines styles and resolves script functions into a single JSON structure.

Downloads

Vulnerabilities: 0 High, 0 Medium, 0 Low

osmn-byhn

html css javascript extractor parser inline-styles unified-extraction scraping

@osmn-byhn/htmlparser 🚀

Unify HTML, CSS, and JS into a single, element-centric JSON structure.

This library is designed for developers who need to extract data from HTML while preserving its visual and functional context. Unlike traditional parsers, @osmn-byhn/htmlparser inlines styles from <style> tags and resolves JavaScript event handlers (like onclick) into their actual function bodies.
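To make the idea of style inlining concrete, here is a minimal sketch of how rules from a <style> block could be mapped onto an element's class attribute. It is a toy illustration, not the library's implementation: the helper names parseClassRules and stylesForClasses are hypothetical, and it only understands single-class selectors such as .btn.

```typescript
// Illustrative only: NOT the library's code. Handles single-class selectors
// (".btn { ... }") and ignores specificity, combinators, at-rules and comments.
type StyleMap = Record<string, string>;

// Collect declarations per class name from raw CSS text.
function parseClassRules(css: string): Record<string, StyleMap> {
  const rules: Record<string, StyleMap> = Object.create(null);
  const ruleRe = /\.([\w-]+)\s*\{([^}]*)\}/g;
  let m: RegExpExecArray | null;
  while ((m = ruleRe.exec(css)) !== null) {
    const cls = m[1];
    const decls: StyleMap = rules[cls] || {};
    for (const part of m[2].split(";")) {
      const idx = part.indexOf(":");
      if (idx === -1) continue; // skip empty or malformed declarations
      const prop = part.slice(0, idx).trim();
      if (prop) decls[prop] = part.slice(idx + 1).trim();
    }
    rules[cls] = decls;
  }
  return rules;
}

// Merge the declarations of every class listed in a class attribute.
// Later classes in the attribute win here; real CSS uses source order and specificity.
function stylesForClasses(classAttr: string, rules: Record<string, StyleMap>): StyleMap {
  const merged: StyleMap = {};
  for (const cls of classAttr.split(/\s+/)) {
    const decls = rules[cls];
    if (!decls) continue;
    for (const prop of Object.keys(decls)) merged[prop] = decls[prop];
  }
  return merged;
}

const demoRules = parseClassRules(".btn { color: red; }");
console.log(stylesForClasses("btn", demoRules)); // { color: 'red' }
```

A production-grade version would rely on a real CSS parser and selector engine; the sketch only shows the shape of the mapping from stylesheet rules to per-element style objects.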

🌟 Why use this?

Deep Extraction: Get more than the raw markup. Styles declared in <style> tags in the <head> are automatically mapped onto the elements they target in the <body>.
Function Intelligence: If an element has an onclick="doSomething()", this library searches the <script> tags, finds doSomething, and includes its full source code in the JSON entry for that element.
AI Friendly: The unified, self-contained JSON output is well suited as input to large language models (LLMs) for UI analysis, code generation, or automated testing.
Zero Heavy Dependencies: Built with performance and simplicity in mind.
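The "Function Intelligence" step described above can be pictured with a small sketch. This is an assumption about one possible approach, not the library's actual code: the hypothetical resolveHandler helper takes the text of an inline handler, extracts the function name, and pulls the matching function declaration out of the script source by balancing braces.

```typescript
// Illustrative only: NOT the library's implementation.
// Limitations: only classic `function name(...) { ... }` declarations are found,
// and braces inside strings, comments or default parameters will confuse it.
function resolveHandler(handler: string, scriptSource: string): string | null {
  // Treat the identifier before the first "(" as the function name.
  const nameMatch = /^\s*([A-Za-z_$][\w$]*)\s*\(/.exec(handler);
  if (!nameMatch) return null;
  // "$" is valid in identifiers but special in regular expressions, so escape it.
  const name = nameMatch[1].replace(/[$]/g, "\\$&");

  // Find the start of a declaration with that name.
  const decl = new RegExp(`function\\s+${name}\\s*\\(`).exec(scriptSource);
  if (!decl) return null;

  // Walk forward from the first "{" and balance braces to find the end of the body.
  const open = scriptSource.indexOf("{", decl.index);
  if (open === -1) return null;
  let depth = 0;
  for (let i = open; i < scriptSource.length; i++) {
    const ch = scriptSource[i];
    if (ch === "{") depth++;
    else if (ch === "}" && --depth === 0) {
      return scriptSource.slice(decl.index, i + 1);
    }
  }
  return null; // unbalanced braces
}

const demoScript = "function sayHi() { console.log('Hi!'); }";
console.log(resolveHandler("sayHi()", demoScript));
// prints: function sayHi() { console.log('Hi!'); }
```

A robust resolver would parse the script into an AST instead of scanning text, which also covers arrow functions and methods.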

📦 Installation

npm install @osmn-byhn/htmlparser
# or
pnpm add @osmn-byhn/htmlparser
# or
yarn add @osmn-byhn/htmlparser

🚀 Quick Start

TypeScript

import { extractUnifiedFromHTML } from "@osmn-byhn/htmlparser";

const html = `
  <html>
    <head>
      <style>.btn { color: red; }</style>
    </head>
    <body>
      <button class="btn" onclick="sayHi()">Click Me</button>
      <script>function sayHi() { console.log('Hi!'); }</script>
    </body>
  </html>
`;

async function main() {
  const result = await extractUnifiedFromHTML(html);
  
  const button = result.body.children[0];
  console.log(button.inlineStyle); // { color: 'red' }
  console.log(button.events.click.function); // "function sayHi() { ... }"
}

main();

JavaScript (ES Modules)

import { extractUnifiedFromHTML } from "@osmn-byhn/htmlparser";

const result = await extractUnifiedFromHTML('<div>Hello</div>');
console.log(result.body);

JavaScript (CommonJS)

const { extractUnifiedFromHTML } = require("@osmn-byhn/htmlparser");

extractUnifiedFromHTML('<div>Hello</div>').then(result => {
  console.log(result.body);
});

🛠️ Output Structure

The output is a UnifiedExtraction object. As the Quick Start examples show, its body property holds the root of the element tree.

Every element in that tree is a `UnifiedElement` object with this structure:

{
  "tag": "div",
  "id": "main-container",
  "class": "active primitive",
  "attrs": { "data-custom": "value" },
  "inlineStyle": { 
    "color": "red", 
    "font-size": "16px" 
  },
  "events": {
    "click": {
      "handler": "myFunc()",
      "function": "function myFunc() { ... }"
    }
  },
  "children": [ ... ],
  "textContent": "Hello World"
}
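For TypeScript users, the JSON shape above could be described roughly as follows. The interface names UnifiedElementLike and ResolvedEvent are hypothetical, and which fields are optional is a guess inferred from the example; the types actually exported by the package may differ. The findByTag helper is likewise only an illustration of walking the tree.

```typescript
// Assumed shape, inferred from the JSON example; not the package's own types.
interface ResolvedEvent {
  handler: string;    // raw attribute value, e.g. "myFunc()"
  function?: string;  // resolved source text, if any was found
}

interface UnifiedElementLike {
  tag: string;
  id?: string;
  class?: string;
  attrs?: Record<string, string>;
  inlineStyle?: Record<string, string>;
  events?: Record<string, ResolvedEvent>;
  children?: UnifiedElementLike[];
  textContent?: string;
}

// Depth-first search returning every element with the given tag, in document order.
function findByTag(root: UnifiedElementLike, tag: string): UnifiedElementLike[] {
  const found: UnifiedElementLike[] = [];
  const stack: UnifiedElementLike[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node.tag === tag) found.push(node);
    // Push children in reverse so the first child is visited first.
    const kids = node.children || [];
    for (let i = kids.length - 1; i >= 0; i--) stack.push(kids[i]);
  }
  return found;
}

const demoTree: UnifiedElementLike = {
  tag: "body",
  children: [
    { tag: "div", children: [{ tag: "button", id: "a" }] },
    { tag: "button", id: "b" },
  ],
};
console.log(findByTag(demoTree, "button").map((el) => el.id)); // [ 'a', 'b' ]
```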

🎯 Use Cases

Web Scraping: Extract data from modern web pages while keeping the styling info associated with the data points.
LLM / AI Processing: Convert messy HTML into a structured JSON format that AI can easily understand and reason about.
UI-to-Code: Build tools that convert existing websites into React/Vue/Tailwind components by having all styles and logic per-element.
Automated Audits: Programmatically check if elements have specific styles or correctly mapped event handlers.
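As a sketch of the automated-audit use case, the snippet below walks an element tree and reports event handlers that have no resolved function body. It assumes, hypothetically, that an unresolved handler simply lacks the function field; the AuditNode type and findUnresolvedHandlers helper are illustrative names, not part of the package.

```typescript
// Minimal assumed shape of the fields this audit needs; not the package's own types.
interface AuditNode {
  tag: string;
  events?: Record<string, { handler: string; function?: string }>;
  children?: AuditNode[];
}

// Recursively collect a message for every handler without a resolved function.
function findUnresolvedHandlers(node: AuditNode, path: string = node.tag): string[] {
  const problems: string[] = [];
  const events = node.events || {};
  for (const name of Object.keys(events)) {
    if (!events[name].function) {
      problems.push(`${path}: on${name}="${events[name].handler}" has no resolved function`);
    }
  }
  (node.children || []).forEach((child, i) => {
    problems.push(...findUnresolvedHandlers(child, `${path} > ${child.tag}[${i}]`));
  });
  return problems;
}

const auditTree: AuditNode = {
  tag: "body",
  children: [
    { tag: "button", events: { click: { handler: "sayHi()", function: "function sayHi() {}" } } },
    { tag: "a", events: { click: { handler: "go()" } } },
  ],
};
console.log(findUnresolvedHandlers(auditTree));
// [ 'body > a[1]: onclick="go()" has no resolved function' ]
```

In practice the tree would come from result.body rather than a hand-written literal.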
