custom_html_parser

v0.0.2

Published

2 years ago

JSON configured html parser for organized information extraction

agyim

html parser data extraction

Custom HTML parser

Created to make extracting information from HTML documents easy.

Installing:

npm install custom_html_parser

Starting example:

import parser from 'custom_html_parser';

// html: the HTML document to parse; options: the configuration array described below.
const result = parser.parse(html, options);

Description

This module uses a SAX parser under the hood and relies heavily on the htmlparser2 module. The goal was to create a module that can be set up entirely with a JSON file.

Here is how it works:

The parser searches for tags with a given name and attributes, and for tags nested inside them. Each search group is anchored by a base_tag. Every time the parser finds a tag matching the base_tag, it creates a new result object, and everything it finds inside that search group (as defined in base_tag and search_tags) is added to that object. Several search targets can be defined in search_tags.
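To make the matching rule concrete, here is a minimal sketch of how a tag description could be compared against an opening tag. The helper `matchesTag` is hypothetical and not part of this package's API; the `(name, attribs)` shape mirrors what htmlparser2 passes to its `onopentag` callback, and exact string equality on attribute values is an assumption about how matching works.

```javascript
// Hypothetical helper, not part of custom_html_parser's API.
// Returns true when an opening tag satisfies a tag description
// (the tag name plus the parallel "attributes" / "values" arrays).
// Assumption: attribute values are compared with exact string equality.
function matchesTag(spec, name, attribs) {
  if (spec.tag !== name) return false;
  return spec.attributes.every(
    (attr, i) => attribs[attr] === spec.values[i]
  );
}

const baseTag = {
  tag: "div",
  attributes: ["class", "title"],
  values: ["foo", "bar"],
};

console.log(matchesTag(baseTag, "div", { class: "foo", title: "bar" }));  // true
console.log(matchesTag(baseTag, "div", { class: "foo" }));                // false
console.log(matchesTag(baseTag, "span", { class: "foo", title: "bar" })); // false
```

An empty `attributes` array matches any tag with the right name, which is how a description such as `"attributes": [], "values": []` would behave under this sketch.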

Example:

HTML document:

<div class="foo" title="bar" interesting-attribute="interesting-value">
  something
</div>

Options:

[
  {
    "base_tag": {
      "tag": "div", 
      "attributes": ["class", "title"],
      "values": ["foo", "bar"],
      
      "get_attributes": ["interesting-attribute", "missing_attribute"],
      "get_attributes_as": ["save_name", "save_name_2"],
      "prefix_attributes_with": ["custom_prefix ", "prefix_2 "],
      "empty_attributes_placeholders": ["", "missing"],

      "get_text": true,
      "get_text_as": "inside_text",
      "prefix_text_with": "another_custom_prefix "
    }
  }
]

Result:

[
  [
    {
      "save_name":["custom_prefix interesting-value"],
      "save_name_2":["missing"],
      "inside_text":["another_custom_prefix something"]
    }
  ]
]

More complicated example:

This is where the strength of this approach is easy to see. When tags repeat (as in a list), it is much easier to extract and organize information this way.

HTML document:

<ul class="hsy_ul" style="width: 1472px">
  <li class="hsy_li">
    <strong class="hsy_m">Jan</strong>
    <a title="Spring">
      <span>12 in.</span>
      <i style="height:56%;"></i>
    </a>
    <a title="Summer">
      <span>23 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Fall">
      <span>22 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Winter">
      <span>1 in.</span>
      <i style="height:57%;"></i>
    </a>
  </li>
</ul>

Options:

[
  {
    "base_tag": {
      "tag": "ul", 
      "attributes": ["class"],
      "values": ["hsy_ul"]
    },
    "search_tags": [
      [
        {
          "tag": "li", 
          "attributes": ["class"],
          "values": ["hsy_li"]
        },
        {
          "tag": "a", 
          "attributes": [],
          "values": [],

          "get_text": true,
          "get_text_as": "Depth",
          "empty_text_placeholder": "0 in.",
          "inside_tag_text": true,

          "get_attributes": ["title"],
          "get_attributes_as": ["Season"],
          "prefix_attributes_with": ["2017, "]
        }
      ]
    ]
  }
]

Result:

[
  [
    {
      "Season": ["2017, Spring","2017, Summer","2017, Fall","2017, Winter"],
      "Depth": ["12 in.","23 in.","22 in.","1 in."]
    }
  ]
]
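Because each key in a result object holds an array, the values usually need to be paired up by index before further processing. The sketch below shows one way to do that; `toRecords` is a hypothetical helper and not something the package provides, and it assumes the arrays stay aligned (which is where an option like `empty_text_placeholder` helps).

```javascript
// Hypothetical helper, not part of custom_html_parser's API.
// Turns one result object (parallel arrays) into a list of records.
// Assumption: values at the same index belong to the same matched tag.
function toRecords(resultObject) {
  const keys = Object.keys(resultObject);
  const length = Math.max(0, ...keys.map((key) => resultObject[key].length));
  const records = [];
  for (let i = 0; i < length; i++) {
    const record = {};
    for (const key of keys) record[key] = resultObject[key][i];
    records.push(record);
  }
  return records;
}

// Shortened sample in the same shape as a result object.
const sample = {
  Season: ["2017, Summer", "2017, Fall"],
  Depth: ["23 in.", "22 in."],
};

console.log(toRecords(sample));
```

If one array is shorter than the others, the missing fields come out as `undefined` in this sketch, which makes misaligned results easy to spot.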

Documentation:

Every time a matching tag occurs, you can save its attributes and/or the text between its opening and closing tags. Several options control how these values are saved. All of them can be added to both the base_tag and the entries in search_tags, but whatever is found is always saved to the result object of the current base_tag.

| Option | Type | Required | Meaning |
| ------------- | ------------- | ------------- |:-------------:|
| tag | string | Required | The name of the searched tag |
| attributes | string[] | Required | The attribute names to check when matching the tag |
| values | string[] | Required | The value each of those attributes must have on the searched tag |
| get_text | boolean | Optional | Whether to save the text between the opening and closing tags. Default: false |
| get_text_as | string | Optional | The key under which the text between the tags should be saved |
| prefix_text_with | string | Optional | A prefix for the text. Only added if some text is found |
| empty_text_placeholder | string | Optional | Saved instead if there is nothing between the tags. The prefix is not added to it |
| only_first_text | boolean | Optional | Whether to save only the text from the opening tag up to the first nested opening tag (or up to the end of the tag). Default: false |
| inside_tag_text | boolean | Optional | Whether to also save the text inside nested tags, as long as they are within the current tag. Default: false |
| get_attributes | string[] | Optional | Which attribute values to save if there is a match |
| get_attributes_as | string[] | Optional | The name under which each attribute value should be saved |
| prefix_attributes_with | string[] | Optional | Prefixes to add to the attribute values |
| empty_attributes_placeholders | string[] | Optional | The placeholder to use instead of the value when an attribute is empty or missing |

String array options that belong together are matched by index. This means that they must all have the same length, and the values at the same index address the same use case. In the case of:

{
  "get_attributes": ["theme", "alt"],
  "get_attributes_as": ["style", "description"],
  "prefix_attributes_with": ["Image: ", "Text:"],
  "empty_attributes_placeholders": ["empty", ""]
}

The value of the tag's alt attribute will be saved under the name description and prefixed with Text:. If there is no alt value, nothing will be saved.
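The index-based pairing can be sketched as follows. This is an illustration of the rules described above and not the library's source: `collectAttributes` is a hypothetical function, the defaults for omitted arrays are assumptions, and it returns plain strings where the parser itself collects values into arrays.

```javascript
// Hypothetical sketch, not the library's source.
// Applies the parallel attribute arrays to the attributes of one matched tag.
function collectAttributes(spec, attribs) {
  const names = spec.get_attributes || [];
  // Assumptions: omitted arrays fall back to the attribute name / empty strings.
  const saveAs = spec.get_attributes_as || names;
  const prefixes = spec.prefix_attributes_with || names.map(() => "");
  const placeholders = spec.empty_attributes_placeholders || names.map(() => "");

  for (const list of [saveAs, prefixes, placeholders]) {
    if (list.length !== names.length) {
      throw new Error("parallel attribute arrays must have the same length");
    }
  }

  const saved = {};
  names.forEach((name, i) => {
    const value = attribs[name];
    if (value !== undefined && value !== "") {
      // Found: the prefix is added to the value.
      saved[saveAs[i]] = prefixes[i] + value;
    } else if (placeholders[i] !== "") {
      // Missing or empty: the placeholder is saved without a prefix.
      saved[saveAs[i]] = placeholders[i];
    }
    // Missing with an empty placeholder: nothing is saved.
  });
  return saved;
}

const spec = {
  get_attributes: ["theme", "alt"],
  get_attributes_as: ["style", "description"],
  prefix_attributes_with: ["Image: ", "Text:"],
  empty_attributes_placeholders: ["empty", ""],
};

console.log(collectAttributes(spec, { alt: "A red barn" }));
// { style: 'empty', description: 'Text:A red barn' }
console.log(collectAttributes(spec, { theme: "dark" }));
// { style: 'Image: dark' }
```

Note that the prefix is used verbatim, so `"Text:"` without a trailing space produces `Text:A red barn`.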

The hierarchy within search_tags runs from top to bottom: each entry is searched for inside the tag described by the previous entry. Everything that is found is saved to the same base_tag result. You can add more of these hierarchies to the same base_tag by adding more arrays to search_tags.
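As an illustration, a configuration with two hierarchies under one base_tag might look like the sketch below. The tag names, class value, and save keys are made up for the example, and the options are written as a JavaScript object for convenience; the same structure can be kept in a JSON file.

```javascript
// Illustrative configuration only; tag names and keys are invented.
const options = [
  {
    base_tag: { tag: "table", attributes: ["class"], values: ["prices"] },
    search_tags: [
      // First hierarchy: "td" is searched inside "tr".
      [
        { tag: "tr", attributes: [], values: [] },
        { tag: "td", attributes: [], values: [], get_text: true, get_text_as: "cell" },
      ],
      // Second hierarchy: a single "caption" entry.
      [
        { tag: "caption", attributes: [], values: [], get_text: true, get_text_as: "caption" },
      ],
    ],
  },
];

// Each inner array is one chain, read from top to bottom.
for (const chain of options[0].search_tags) {
  console.log(chain.map((entry) => entry.tag).join(" -> "));
}
// tr -> td
// caption
```

Both chains would save into the same result object, under the keys `cell` and `caption`.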

TODO: Make attributes and values optional parameters. Add tests.

License

MIT
