destructure-html

v2.1.6

Published

3 years ago

destructure-html simplifies HTML element deconstruction and data extraction, making it effortless to extract desired information from complex HTML structures with its intuitive syntax and powerful features.


adxxtya

destructure-html is a lightweight package that simplifies HTML deconstruction and data extraction, making it easy to extract information and elements from complex HTML structures.

Install the package:

npm i destructure-html

This package:

- was created to extract relevant information seamlessly from scraped data
- enables destructuring of data that is in the form of HTML
- constructs data in any form from raw HTML

New features will be added and released on a regular basis.

🏃 Quick Start

CommonJS

// commonjs require statement
const dsh = require('destructure-html')


// scraped data from netflix
const htmlData = `
<div class="lolomoRow ltr-0" data-context="genre">42479280414AECBB...
<div class="lolomoRow ltr-0" data-context="continueWatching"><h2 class="rowHeader"...
<div class="lolomoRow  ltr-0" data-context="trendingNow"><h2 class="rowHeader ltr-0"...
`;


// Returns an array of src values, which may contain image URLs or other important data from the HTML content
const srcValues = dsh.grabSrcValues(htmlData)
console.log(srcValues);



// output: [
//  'https://occ-0-1947-2164.1.nflxso.net/dnm/api/v6/6gmvu2hxdfnQ55LZZjyzYR4kzGk/AAAABaJ71EC0meuaQJkcwU3H1IVx-9PSbCQ-1vzPySh7k3264YotnvQ9lQmPQP_S_cb95GRP9lUkJsTlkmGcIpqXspMai9q5C_2Mq-k.jpg?r=183',
//  ...
//  'https://occ-0-1947-2164.1.nflxso.net/dnm/api/v6/6gmvu2hxdfnQ55LZZjyzYR4kzGk/AAAABeo26eQTyK5t9xceCCE86N3JsqgZ2eCMMsHxyBzGx8UTvD8-aHTe6EAtYMbn5R4gfMWLRNbUhOZZljpBjZ8zTIiPJjt3L-3TWyKv-5fSvooKuS0sLg0v0oT9--ay1HFx3MU3.jpg?r=438' ]
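To make the idea of attribute extraction concrete, here is a minimal sketch of how src or href values could be collected with a regular expression. It is an illustration only and not the package's actual implementation: the function name `sketchGrabAttributeValues` and the sample markup are hypothetical, and a regex approach like this does not handle unquoted attribute values.

```javascript
// Hypothetical stand-in, NOT the package's implementation:
// collects the values of one attribute (for example src or href) with a regular expression.
function sketchGrabAttributeValues(html, attributeName) {
  // Match attributeName="value" or attributeName='value'. The leading \s ensures
  // that "data-src" is not mistaken for "src". The attribute name is assumed to be
  // a plain word, so it is not escaped for the regular expression.
  const pattern = new RegExp(`\\s${attributeName}\\s*=\\s*(["'])(.*?)\\1`, 'gi');
  const values = [];
  for (const match of html.matchAll(pattern)) {
    values.push(match[2]); // group 2 is the text between the quotes
  }
  return values;
}

// Sample markup made up for this sketch.
const sampleHtml = `
<a href="https://lorem-ipsum.com/browse/69">
  <img src="https://placeholder.com/first.png" alt="" />
</a>
<img data-src="ignored.png" src='https://placeholder.com/second.png' />
`;

const srcValues = sketchGrabAttributeValues(sampleHtml, 'src');   // the two image URLs
const hrefValues = sketchGrabAttributeValues(sampleHtml, 'href'); // the single link URL
console.log(srcValues);
console.log(hrefValues);
```

For real scraped pages, prefer the package functions shown above; the sketch only shows the general shape of the technique.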

ModernJS

// modernjs import statement
import { getContentByUniqueText } from 'destructure-html'


// scraped data from netflix
const exampleHtmlData = `
<div class="lolomoRow ltr-0" data-context="genre">42479280414AECBB...
<div class="lolomoRow ltr-0" data-context="continueWatching"><h2 class="rowHeader"...
<div class="lolomoRow  ltr-0" data-context="trendingNow"><h2 class="rowHeader ltr-0"...
`;


// Returns the complete HTML content of the element whose opening tag contains the unique text,
// such as a class or other attribute value that appears only once on the whole page
const htmlTag = getContentByUniqueText(exampleHtmlData, "continueWatching")
console.log(htmlTag);



// output: <div class="lolomoRow ltr-0" data-context="continueWatching"><h2 
// class="title">Continue Watching for Aditya</div><div class="aro - row 
// - header more - visible"><div><di ... div></div></div></div></div>
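The lookup by unique text can be pictured as two steps: locate the opening tag that contains the text, then walk forward until that element's matching closing tag. The following is a minimal sketch under that assumption. It is not the package's actual implementation; `sketchGetContentByUniqueText` and the sample markup are hypothetical, and the sketch assumes well-formed HTML with no `>` inside attribute values and no self-closing tags of the same name.

```javascript
// Hypothetical stand-in, NOT the package's implementation.
function sketchGetContentByUniqueText(html, uniqueText) {
  const textIndex = html.indexOf(uniqueText);
  if (textIndex === -1) return null;

  // The opening tag that contains the text starts at the nearest "<" before it.
  const start = html.lastIndexOf('<', textIndex);
  if (start === -1) return null;
  const nameMatch = /^<([a-zA-Z][\w-]*)/.exec(html.slice(start));
  if (!nameMatch) return null;
  const tagName = nameMatch[1];

  // Visit every opening and closing tag with the same name and track nesting depth.
  const tagPattern = new RegExp(`<(/?)${tagName}\\b[^>]*>`, 'gi');
  tagPattern.lastIndex = start;
  let depth = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    depth += match[1] === '/' ? -1 : 1;
    if (depth === 0) {
      // Depth is back to zero: this closing tag belongs to the element we started at.
      return html.slice(start, match.index + match[0].length);
    }
  }
  return null; // the element was never closed
}

// Sample markup made up for this sketch.
const sampleHtml = `
<div class="gray">
  <div class="blue" id="blue-div"><div>More text</div></div>
  <div class="blue"><div>Some More text</div></div>
</div>`;

const content = sketchGetContentByUniqueText(sampleHtml, 'blue-div');
console.log(content);
// <div class="blue" id="blue-div"><div>More text</div></div>
```

Counting nesting depth is what lets the sketch return the whole element even when it contains further tags of the same name.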

CDN package

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">


    <!-- The package can be imported via CDN links as well -->
    <script src="https://unpkg.com/destructure-html/lib/es5/index.js"></script>


    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    I don't know what I'm doing with my life.
</body>
</html>

✂️ What to use When && When to use What

To establish a clearer relationship with the table below, let's consider the following example data.

Example data (htmlData)

const htmlData = `
<div class="gray">
    <p>Some text</p>
    <div class="blue" id="blue-div">
      <div>More text</div>
      <a href="https://lorem-ipsum.com/browse/69">
        <img src="https://placeholder.com/first.png" alt="" />
      </a>
    </div>
    <div class="blue">
      <div>Some More text</div>
      <a href="https://lorem-ipsum.com/browse/420">
        <img src="https://placeholder.com/second.png" alt="" />
      </a>
    </div>
</div>
<div>
    <p>Another paragraph</p>
</div>
`;

| Function | Parameter(s) | Example | Output | Takes | Returns |
| --- | --- | --- | --- | --- | --- |
| grabHrefValues() | html (string) | `grabHrefValues(htmlData);` | `[ 'https://lorem-ipsum.com/browse/69', 'https://lorem-ipsum.com/browse/420' ]` | Accepts the HTML data as a parameter. | Returns an array of all href values found in the provided input. |
| grabSrcValues() | html (string) | `grabSrcValues(htmlData);` | `[ 'https://placeholder.com/first.png', 'https://placeholder.com/second.png' ]` | Accepts the HTML data as a parameter. | Returns an array of all src values found in the provided input. |
| findNestedTexts() | html (string) | `const htmlText = findNestedTexts(htmlData);` | `[ 'Some text', 'More text', 'Some More text', 'Another paragraph' ]` | Accepts the complete HTML data as a parameter. | Returns an array containing all the text found at different locations within the HTML data. |
| getContentById() | html (string), id (string) | `const htmlContent = getContentById(htmlData, "blue-div");` | `<div class="blue" id="blue-div"><div>More text</div><a href="https://lorem-ipsum.com/browse/69"><img src="https://placeholder.com/first.png" alt="" /></a></div>` | Accepts the HTML data and a unique ID as parameters. Important: the ID must be a unique identifier present only within this element. | Returns the entire HTML content of the specified element, including its tags and inner content, which can be used to extract text or other relevant data later. |
| findTagById() | htmlData, uniqueId | `findTagById(htmlData, "gray");` | `<div id="gray">` | Accepts the HTML data and a unique text identifier present within the HTML tag as parameters. | Returns the complete opening tag of the HTML element matching the specified ID, without its content or closing tag. |
| findTagByClass() | htmlData, className | `const htmlTag = findTagByClass(htmlData, "blue");` | `2` | Accepts the HTML data and a class name used for styling as parameters. | If there is a single HTML tag with the provided class name, returns a string containing the entire HTML tag, similar to findTagById(). If there are multiple HTML tags with the same class, returns the total count of occurrences. |
| getContentBetweenTags() | htmlData, openingTag | ``const htmlContent = getContentBetweenTags(htmlData, `<div class="gray">`);`` | `<div class="gray"> <p>Some text</p> <div class="blue"> <div>More text</div> </div> <div class="blue"> <div>Some More text</div> </div></div>` | Accepts the HTML data and the complete opening tag of a div element (obtained from either findTagById() or findTagByClass()) as parameters. | Returns all the HTML content starting from the specified opening tag, including all content within it, up to the closing tag. |
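Because findTagByClass() is described above as returning either a tag string (single match) or a count (multiple matches), calling code needs to branch on the type of the result before passing it on to getContentBetweenTags(). The sketch below shows one way to do that. The helper `describeClassLookup` is hypothetical and not part of the package, and how the package reports zero matches is not documented here, so the `none` branch is an assumption.

```javascript
// Hypothetical helper, not part of destructure-html.
// `result` stands for whatever findTagByClass() returned.
function describeClassLookup(result) {
  if (typeof result === 'number') {
    // Several elements share the class: only a count is available.
    return { kind: 'count', count: result };
  }
  if (typeof result === 'string' && result.length > 0) {
    // Exactly one element has the class: the opening tag can be reused,
    // for example as the second argument to getContentBetweenTags().
    return { kind: 'tag', tag: result };
  }
  // Assumption: anything else (empty string, null, undefined) means no match.
  return { kind: 'none' };
}

// Values shaped like the table's examples, used here instead of real package calls.
const multiple = describeClassLookup(2);
const single = describeClassLookup('<div class="gray">');
console.log(multiple.kind); // count
console.log(single.kind);   // tag
```

When the result is a count, a more specific identifier (such as an ID or another unique text) is needed before the element's content can be retrieved.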

🙌 Contributing

Contributions to destructure-html are welcome and encouraged! To contribute to the project, follow these steps:

1. Fork the repository and clone it to your local machine.
2. Set up your development environment.
3. Make changes or add new features to the codebase.
4. Write tests to ensure the code behaves as expected.
5. Commit your changes and push them to your forked repository.
6. Submit a pull request with a clear description of the changes you made and their purpose.

Your pull request will be reviewed by the maintainers, and any necessary feedback will be provided. Once your changes pass the review process, they will be merged into the main repository.

By contributing to destructure-html, you help improve the package and make it more robust for everyone to use.

📲 Contact me

If you have any questions, feedback, or need support with destructure-html, you can reach out to me through the following channels:

GitHub Issues: https://github.com/adxxtya/destructure-html/issues

I am always ready to assist you and appreciate any feedback or suggestions you may have.
