custom_html_parser

v0.0.2

JSON configured html parser for organized information extraction

Downloads

5

Custom HTML parser

Created to make extracting information from HTML documents easy.

Installing:

npm install custom_html_parser

Starting example:

import parser from 'custom_html_parser';
// html: the HTML document as a string; options: the JSON configuration described below
const result = parser.parse(html, options);

Description

This module runs a SAX-style parser under the hood and relies heavily on the htmlparser2 module. The goal was to create a module that can be set up entirely with a JSON file.

Here is how it works:

It searches for tags with a given name and attributes, and for tags nested inside them. Each search group is defined by a base_tag. Every time the parser finds a matching base tag, it creates a new result object. Everything it finds inside this search group (as defined by the base_tag and search_tags) is automatically added to that result object. Several search targets can be defined with the search_tags array.
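The matching rule described above can be pictured with a small sketch. This is not the package's source: `matchesTag` is a hypothetical helper, and it assumes that each entry in `attributes` is compared for exact string equality with the entry at the same index in `values`.

```javascript
// Illustrative sketch only, not the package's actual implementation.
// Assumption: a spec matches when the tag name is equal and every listed
// attribute has exactly the value found at the same index in `values`.
function matchesTag(spec, tagName, attribs) {
  if (spec.tag !== tagName) return false;
  return spec.attributes.every(
    (attrName, i) => attribs[attrName] === spec.values[i]
  );
}

const baseTag = {
  tag: 'div',
  attributes: ['class', 'title'],
  values: ['foo', 'bar'],
};

// Extra attributes on the element do not prevent a match.
const hit = matchesTag(baseTag, 'div', { class: 'foo', title: 'bar', id: 'x' });
// A missing `title` attribute means the spec does not match.
const miss = matchesTag(baseTag, 'div', { class: 'foo' });

console.log(hit, miss);
```

With empty `attributes` and `values` arrays, a spec in this sketch matches any tag with the given name.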

Example:

HTML document:
<div class="foo" title="bar" interesting-attribute="interesting-value">
  something
</div>
Options:
[
  {
    "base_tag": {
      "tag": "div", 
      "attributes": ["class", "title"],
      "values": ["foo", "bar"],
      
      "get_attributes": ["interesting-attribute", "missing_attribute"],
      "get_attributes_as": ["save_name", "save_name_2"],
      "prefix_attributes_with": ["custom_prefix ", "prefix_2 "],
      "empty_attributes_placeholders": ["", "missing"],

      "get_text": true,
      "get_text_as": "inside_text",
      "prefix_text_with": "another_custom_prefix "
    }
  }
]
Result:
[
  [
    {
      "save_name":["custom_prefix interesting-value"],
      "save_name_2":["missing"],
      "inside_text":["another_custom_prefix something"]
    }
  ]
]
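To make the attribute part of this result easier to follow, here is a rough sketch of how the four attribute arrays could combine. `collectAttributes` is a hypothetical helper, not part of the package. It assumes that a found value gets its prefix, that a placeholder is saved without a prefix (as `"missing"` is in the result above), and that an empty placeholder means nothing is saved.

```javascript
// Illustrative sketch only, not the package's actual implementation.
function collectAttributes(spec, attribs, result = {}) {
  const names = spec.get_attributes || [];
  names.forEach((name, i) => {
    const key = spec.get_attributes_as[i];
    const value = attribs[name];
    let saved;
    if (value !== undefined && value !== '') {
      // Found a value: apply the prefix at the same index, if any.
      saved = ((spec.prefix_attributes_with || [])[i] || '') + value;
    } else {
      // Missing or empty: fall back to the placeholder, without a prefix.
      saved = (spec.empty_attributes_placeholders || [])[i] || '';
    }
    if (saved === '') return; // assumption: an empty placeholder saves nothing
    if (!result[key]) result[key] = [];
    result[key].push(saved);
  });
  return result;
}

const spec = {
  get_attributes: ['interesting-attribute', 'missing_attribute'],
  get_attributes_as: ['save_name', 'save_name_2'],
  prefix_attributes_with: ['custom_prefix ', 'prefix_2 '],
  empty_attributes_placeholders: ['', 'missing'],
};
const attribs = {
  class: 'foo',
  title: 'bar',
  'interesting-attribute': 'interesting-value',
};

const collected = collectAttributes(spec, attribs);
console.log(JSON.stringify(collected));
```

Under these assumptions the sketch yields the same `save_name` and `save_name_2` entries as the result shown above.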

More complicated example:

This is where the strength of this approach shows. When tags repeat (as in a list), it is much easier to extract and organize information this way.

HTML document:
<ul class="hsy_ul" style="width: 1472px">
  <li class="hsy_li">
    <strong class="hsy_m">Jan</strong>
    <a title="Spring">
      <span>12 in.</span>
      <i style="height:56%;"></i>
    </a>
    <a title="Summer">
      <span>23 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Fall">
      <span>22 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Winter">
      <span>1 in.</span>
      <i style="height:57%;"></i>
    </a>
  </li>
</ul>
Options:
[
  {
    "base_tag": {
      "tag": "ul", 
      "attributes": ["class"],
      "values": ["hsy_ul"]
    },
    "search_tags": [
      [
        {
          "tag": "li", 
          "attributes": ["class"],
          "values": ["hsy_li"]
        },
        {
          "tag": "a", 
          "attributes": [],
          "values": [],

          "get_text": true,
          "get_text_as": "Depth",
          "empty_text_placeholder": "0 in.",
          "inside_tag_text": true,

          "get_attributes": ["title"],
          "get_attributes_as": ["Season"],
          "prefix_attributes_with": ["2017, "]
        }
      ]
    ]
  }
]
Result:
[
  [
    {
      "Season": ["2017, Spring","2017, Summer","2017, Fall","2017, Winter"],
      "Depth": ["12 in.","23 in.","22 in.","1 in."]
    }
  ]
]
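Each key in a result object holds an array, so related values sit at the same index in different arrays. When one record per match is more convenient, the arrays can be zipped on the consumer side. The sketch below is not part of the package: `toRows` is a hypothetical helper, and the sample object is a shortened, hand-written stand-in for a result object.

```javascript
// Illustrative consumer-side helper, not part of custom_html_parser.
// Turns { key: [v0, v1, ...], ... } into [{ key: v0, ... }, { key: v1, ... }].
function toRows(resultObject) {
  const keys = Object.keys(resultObject);
  const length = Math.max(0, ...keys.map((k) => resultObject[k].length));
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const key of keys) row[key] = resultObject[key][i];
    rows.push(row);
  }
  return rows;
}

// Hand-written sample in the same shape as a result object.
const sample = {
  Season: ['2017, Summer', '2017, Fall'],
  Depth: ['23 in.', '22 in.'],
};

const rows = toRows(sample);
console.log(rows);
```

The rows only line up if every match contributes a value to every key, which is one reason to set placeholders such as `empty_text_placeholder`.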

Documentation:

Every time a tag matches, you can save its attributes and/or the text between its opening and closing tags. Several options control how these values are saved. All of them can be used in both the base_tag and the search_tags, but whatever is found is always saved to the result object of the current base_tag.

| Option | Type | Required? | Meaning |
| ------------- | ------------- | ------------- |:-------------:|
| tag | string | Required | The name of the tag to search for |
| attributes | string[] | Required | The attribute names to check when matching the tag |
| values | string[] | Required | The expected value of each attribute, in the same order as attributes |
| get_text | boolean | Optional | Whether to save the text between the opening and closing tags. Default: false |
| get_text_as | string | Optional | The key under which the text between the tags is saved |
| prefix_text_with | string | Optional | A prefix for the text. Only added if some text is found |
| empty_text_placeholder | string | Optional | Saved instead if there is nothing between the tags. The prefix is not added to it |
| only_first_text | boolean | Optional | Whether to save only the text from the opening tag up to the first nested opening tag (or up to the end of the tag). Default: false |
| inside_tag_text | boolean | Optional | Whether to also save the text inside nested tags, as long as they are within the current tag. Default: false |
| get_attributes | string[] | Optional | Which attribute values to save if there is a match |
| get_attributes_as | string[] | Optional | The name under which each attribute value is saved |
| prefix_attributes_with | string[] | Optional | Prefixes to add to the attribute values |
| empty_attributes_placeholders | string[] | Optional | The placeholder to use instead of the value when an attribute is empty or does not exist |
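The difference between the text options in the table can be pictured on a toy element tree. This is only one reading of the descriptions, not the package's code: `collectText` and the node shape are hypothetical, and whitespace is collapsed here for readability, which the package may or may not do.

```javascript
// Illustrative sketch only, not the package's actual implementation.
// A node is { tag, children }; each child is a string (text) or another node.
function gather(node, opts, out) {
  for (const child of node.children) {
    if (typeof child === 'string') {
      out.push(child); // text directly inside the current tag
    } else if (opts.only_first_text) {
      break; // stop at the first nested opening tag
    } else if (opts.inside_tag_text) {
      gather(child, opts, out); // also take text inside nested tags
    }
    // default: nested tags are skipped, later direct text is still collected
  }
  return out;
}

function collectText(node, opts = {}) {
  return gather(node, opts, []).join('').replace(/\s+/g, ' ').trim();
}

const paragraph = {
  tag: 'p',
  children: ['Hello ', { tag: 'b', children: ['big'] }, ' world'],
};

const plain = collectText(paragraph);
const firstOnly = collectText(paragraph, { only_first_text: true });
const nested = collectText(paragraph, { inside_tag_text: true });
console.log([plain, firstOnly, nested]);
// [ 'Hello world', 'Hello', 'Hello big world' ]
```

In this reading, `inside_tag_text` is what lets an `a` tag pick up text that lives inside a nested `span`.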

String arrays that belong together are matched by position: they must all have the same length, and the values at the same index describe the same attribute. For example, given:

{
  "get_attributes": ["theme", "alt"],
  "get_attributes_as": ["style", "description"],
  "prefix_attributes_with": ["Image: ", "Text:"],
  "empty_attributes_placeholders": ["empty", ""]
}

The value of the tag's alt attribute will be saved under the name description and prefixed with Text:. If there is no alt value, nothing will be saved, because its placeholder is an empty string.
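Since arrays of different lengths would pair up the wrong entries, it can help to check a tag definition before handing it to the parser. The sketch below is an assumption-laden convenience, not something the package provides: `findLengthMismatches` and the grouping of option names are hypothetical, based only on the option names documented above.

```javascript
// Illustrative pre-flight check, not part of custom_html_parser.
// Assumption: these option arrays are the ones paired up by index.
const ARRAY_GROUPS = [
  ['attributes', 'values'],
  [
    'get_attributes',
    'get_attributes_as',
    'prefix_attributes_with',
    'empty_attributes_placeholders',
  ],
];

// Returns one message per group whose present arrays differ in length.
function findLengthMismatches(spec) {
  const problems = [];
  for (const group of ARRAY_GROUPS) {
    const present = group.filter((key) => Array.isArray(spec[key]));
    const lengths = new Set(present.map((key) => spec[key].length));
    if (lengths.size > 1) {
      problems.push(present.map((k) => `${k}(${spec[k].length})`).join(', '));
    }
  }
  return problems;
}

const good = {
  get_attributes: ['theme', 'alt'],
  get_attributes_as: ['style', 'description'],
  prefix_attributes_with: ['Image: ', 'Text:'],
  empty_attributes_placeholders: ['empty', ''],
};
const bad = { get_attributes: ['theme', 'alt'], get_attributes_as: ['style'] };

console.log(findLengthMismatches(good)); // []
console.log(findLengthMismatches(bad)); // [ 'get_attributes(2), get_attributes_as(1)' ]
```

Arrays that are left out entirely are ignored by this check, so optional settings stay optional.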

The hierarchy within search_tags runs from top to bottom: each entry is searched for inside the tag matched by the previous entry. Everything that is found is still saved to the same base_tag result. You can add more of these hierarchies to the same base tag by adding more arrays to search_tags.
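One way to picture a single hierarchy is as a chain that has to be satisfied by the tags currently open inside the base tag. The sketch below is a guess at the idea, not the package's code: `matchesChain` and `matchesTag` are hypothetical, and it assumes "inside" means any descendant rather than only a direct child.

```javascript
// Illustrative sketch only, not the package's actual implementation.
function matchesTag(spec, el) {
  if (spec.tag !== el.tag) return false;
  return spec.attributes.every((name, i) => el.attribs[name] === spec.values[i]);
}

// chain: one inner array of search_tags.
// openTags: elements currently open inside the base tag, outermost first;
// the last entry is the tag the parser has just opened.
function matchesChain(chain, openTags) {
  if (chain.length === 0 || openTags.length === 0) return false;
  const current = openTags[openTags.length - 1];
  if (!matchesTag(chain[chain.length - 1], current)) return false;
  // Earlier chain entries must match ancestors, in order from the top.
  let next = 0;
  for (const el of openTags.slice(0, -1)) {
    if (next < chain.length - 1 && matchesTag(chain[next], el)) next++;
  }
  return next === chain.length - 1;
}

const chain = [
  { tag: 'li', attributes: ['class'], values: ['hsy_li'] },
  { tag: 'a', attributes: [], values: [] },
];
const li = { tag: 'li', attribs: { class: 'hsy_li' } };
const a = { tag: 'a', attribs: { title: 'Spring' } };

const insideLi = matchesChain(chain, [li, a]);
const outsideLi = matchesChain(chain, [a]);
console.log(insideLi, outsideLi); // true false
```

An `a` tag only counts here when it is opened somewhere inside a matching `li`, which mirrors the top-to-bottom order of the entries.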

TODO: Make attributes and values optional parameters. Testing.


License

MIT