html-data-parser

v1.1.13

Published

a month ago

Parse, search and stream HTML tabular data using Node.js and isaacs/sax-js.

drewletcher

html table parser scraper

html-data-parser 1.1.x

Parse and stream tabular data from HTML documents using Node.js and isaacs/sax-js.

This readme explains how to use html-data-parser in your code or as a console program using the command line interface (CLI).

Only supports HTML documents containing TABLE elements. Does not support parsing grid or other table like elements.

Installation

For use as a module in a Node.js project, see the Developer Guide below.

npm install html-data-parser

For use as a command line utility. Requires Node.js 18+.

npm -g install html-data-parser

Command Line Interface

Parse tabular data from an HTML document.

hdp <filename|URL> <output-file> --options=filename.json --heading=title --id=name --cells=# --headers=name1,name2,... --format=csv|json|rows

  `filename|URL` - path name or URL of HTML file to process, required.
  `output-file`  - local path name for output of parsed data, default stdout.
  `--options`    - JSON or JSONC file containing JSON object with hdp options, default: hdp.options.json.
  `--heading`    - text of heading to find in document that precedes desired data table, default none.
  `--id`         - TABLE element id attribute to find in document.
  `--cells`      - number of cells in a data row, minimum or "min-max", default = "1-256".
  `--headers`    - comma separated list of column names for data, default none; when omitted, the first table row is assumed to contain the column names.
  `--format`     - output data format CSV, JSON, or ROWS (JSON array of arrays), default JSON.
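
The `--cells` value can be either a single minimum or a "min-max" range. A minimal sketch of how such a value could be interpreted is shown below. The helper names parseCellsSpec and filterByCells are hypothetical and not part of html-data-parser, and the upper bound of 256 for a bare number is an assumption taken from the documented default of "1-256".

```javascript
// Hypothetical helper (not part of html-data-parser): turns a --cells value
// such as 7 or "1-256" into explicit bounds.
function parseCellsSpec(spec, defaultMax = 256) {
  const text = String(spec).trim();
  const match = /^(\d+)(?:-(\d+))?$/.exec(text);
  if (!match) throw new RangeError(`invalid cells value: ${text}`);
  const min = Number(match[1]);
  // Assumption: a bare number is a minimum, so the maximum falls back to the default.
  const max = match[2] === undefined ? defaultMax : Number(match[2]);
  if (min > max) throw new RangeError(`min exceeds max in: ${text}`);
  return { min, max };
}

// Keep only the rows whose cell count falls inside the bounds.
function filterByCells(rows, spec) {
  const { min, max } = parseCellsSpec(spec);
  return rows.filter((row) => row.length >= min && row.length <= max);
}

const sample = [["Report title"], ["County", "Total"], ["Dewitt", "52,297"]];
console.log(filterByCells(sample, "2-4")); // the single-cell title row is dropped
```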

Note: If the hdp command conflicts with another program on your system use htmldataparser instead.

Options File

The options file supports options for all html-data-parser modules. The parser reads plain JSON files or JSONC files with Javascript style comments. The default options file is hdp.options.json in the current working directory.

{
  /* HtmlDataParser options */

  // url - local path name or URL of HTML file to process, required.
  "url": "",
  // output - local path name for output of parsed data, default stdout.
  "output": "",
  // format - output data format CSV, JSON or rows, default JSON, rows is JSON array of arrays (rows).
  "format": "json",
  // heading - text of heading to find in document that precedes desired data table, default none.
  "heading": null,
  // id - TABLE element id attribute to find in document.
  "id": "",
  // cells - number of cells for a data row, minimum or "min-max", default = "1-256".
  "cells": "1-256",
  // newlines - preserve new lines in cell data, default: false.
  "newlines": false,
  // trim whitespace from output values, default: true.
  "trim": true,

  /* RowAsObjectTransform options */

  // hasHeaders - data has a header row, if true and headers set then headers overrides header row.
  "RowAsObject.hasHeader": false,
  // headers - comma separated list of column names for data, default none. When not defined the first table row encountered will be treated as column names.
  "RowAsObject.headers": [],

  /* RepeatCellTransform options */

  // column - column index of cell to repeat, default 0.
  "RepeatCell.column": 0,

  /* RepeatHeadingTransform options */

  // hasHeaders - data has a header row, if true and headers set then headers overrides header row.
  "RepeatHeading.hasHeader": true,
  // header - column name for the repeating heading field. Can optionally contain suffix :m:n with index for inserting into header and data rows.
  "RepeatHeading.header": "subheading:0:0"

  /* HTTP options */
  // see HTTP Options below

}

Note: Transform property names can be shortened to hasHeader, headers, column and header.
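
Because an options file may contain comments, passing its text directly to JSON.parse would fail. The following is a minimal sketch of stripping Javascript style comments before parsing. The function stripJsonComments is hypothetical and is not taken from the html-data-parser source; it does not handle trailing commas.

```javascript
// Hypothetical helper: removes // and /* */ comments from JSONC text while
// leaving anything inside double-quoted strings untouched.
function stripJsonComments(text) {
  let out = "";
  let i = 0;
  let inString = false;
  while (i < text.length) {
    const ch = text[i];
    const next = text[i + 1];
    if (inString) {
      out += ch;
      if (ch === "\\") {
        // Copy the escaped character so an escaped quote does not end the string.
        out += next ?? "";
        i += 2;
        continue;
      }
      if (ch === '"') inString = false;
      i += 1;
    } else if (ch === '"') {
      inString = true;
      out += ch;
      i += 1;
    } else if (ch === "/" && next === "/") {
      // Line comment: skip to the end of the line, keeping the newline.
      while (i < text.length && text[i] !== "\n") i += 1;
    } else if (ch === "/" && next === "*") {
      // Block comment: skip past the closing marker.
      const end = text.indexOf("*/", i + 2);
      i = end === -1 ? text.length : end + 2;
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}

const jsonc = `{
  // url - path or URL of the HTML file
  "url": "https://example.com/a//b.html", /* inline comment */
  "cells": "1-256"
}`;
const options = JSON.parse(stripJsonComments(jsonc));
console.log(options.url);
```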

Examples

hdp ./test/data/html/helloworld.html --headers="Greeting" --format=csv

hdp ./test/data/html/helloworld.html --id="cosmic" --headers="BigBang"

hdp ./test/data/html/ansi.html --heading="Congressional Districts"

hdp https://www.sos.state.tx.us/elections/historical/jan2024.shtml ./test/output/hdp/tx_voter_reg.json

hdp --options="./test/RepeatCell.options.json"

RepeatCell.options.json:
{
  "url": "./test/data/html/texas_jan2024.shtml",
  "output": "./test/output/hdp/repeat_cell.json",
  "format": "json",
  "cells": 7,
  "RepeatCell.column": 0
}

Developer Guide

Basic Usage

The parser processes the entire document then returns the row data as an array of arrays.

import { HtmlDataParser } from "html-data-parser";

let parser = new HtmlDataParser({url: "filename.html"});

async function parseDocument() {
  const rows = await parser.parse();
  // process the rows
}
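
As an illustration of the "process the rows" step, the sketch below splits the header row from the data rows and sums a numeric column. The sample rows are hypothetical values shaped like the array of arrays described above, and toNumber is a helper introduced only for this example.

```javascript
// Hypothetical sample shaped like the array of arrays returned by parse().
const rows = [
  ["County", "Precincts", "Date/Period", "Total"],
  ["Dewitt", "44", "JUL 2023", "52,297"],
  ["Dewitt", "44", "OCT 2023", "52,017"],
];

// Cell values are strings, so thousands separators are removed before conversion.
function toNumber(text) {
  const cleaned = text.replace(/,/g, "").trim();
  if (cleaned === "") return null;
  const n = Number(cleaned);
  return Number.isNaN(n) ? null : n;
}

const [header, ...data] = rows;
const totalIndex = header.indexOf("Total");
const grandTotal = data.reduce(
  (sum, row) => sum + (toNumber(row[totalIndex]) ?? 0),
  0
);
console.log(grandTotal);
```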

Using Event Interface

Listen to parser events and process each row as it is parsed from the document.

import { HtmlDataParser } from "html-data-parser";

let parser = new HtmlDataParser({url: "filename.html"});

parser.on('head', (head) => {
  // triggered by /HEAD end tag
  // zero or one event per document
  // see head object below
});

parser.on('data', (row) => {
  // process row, row is an array of cell values
});

parser.on('end', () => {
});

parser.on('error', (err) => {
  // log error
});

head = {
  url,       // starting url
  title,     // text from <head><title> element
  redirect,  // HTTP status code of last redirect - 301, 302, 307 or 308
  location   // URL of last redirect
}

Using Stream Interface

Use the Node.js Stream interface to process rows as they are parsed from the document.

import { HtmlDataReader } from "html-data-parser";
import { pipeline } from 'node:stream/promises';

let reader = new HtmlDataReader(options);
let writer = `<some writable that can handle Object Mode data>`

await pipeline(reader, writer);
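
Node.js Readable streams are async iterable, so a for await loop is an alternative to supplying a writable. The sketch below assumes HtmlDataReader behaves as a standard object mode Readable; to stay self-contained it uses an async generator, fakeReader, as a stand-in for the reader. Both fakeReader and collectRows are hypothetical names.

```javascript
// Stand-in for HtmlDataReader: any async iterable of row arrays works here.
async function* fakeReader() {
  yield ["County", "Total"];
  yield ["Dewitt", "52,297"];
}

// Consume rows one at a time as they arrive and gather them into an array.
async function collectRows(readable) {
  const rows = [];
  for await (const row of readable) {
    rows.push(row);
  }
  return rows;
}

collectRows(fakeReader()).then((rows) => {
  console.log(`received ${rows.length} rows`);
});
```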

Class HtmlDataParser

Given an HTML document, HtmlDataParser outputs an array of arrays (rows). Use the streaming classes HtmlDataReader and RowAsObjectTransform to convert the arrays to Javascript objects. With default settings HtmlDataParser will output the rows from all TABLE elements found in the document. Using the HtmlDataParser Options heading or id, the parser can filter content to retrieve the desired data TABLE in the document.

HtmlDataParser only works on a subset of HTML documents: those that contain TABLE elements, NOT other table like grid elements. The parser uses the isaacs/sax-js library to transform HTML table elements into rows of cells.

Rows and Cells terminology is used instead of Rows and Columns because the content in an HTML document flows rather than being strict rows/columns like database query results. Some rows may have more cells than other rows. For example, a heading or description paragraph will be a row (array) with one cell (string). See Notes below.

HtmlDataParser Options

The HtmlDataParser constructor takes an options object with the following fields. One of the url or data arguments is required.

{String|URL} url - The local path or URL of the HTML document.

{String|Uint8Array} data - HTML document in a string.

{Readable} rs - Readable stream for the HTML document.

Common Options:

{String|RegExp} heading - Heading, H1-H6 element, in the document after which the parser will look for a TABLE; optional, default: none. The parser does a string comparison or regexp match looking for the first occurrence of the heading value in a heading element. If neither heading nor id is specified then the data output contains all rows from all tables found in the document.

{String|RegExp} id - id attribute of the TABLE element in the document to parse for tabular data; optional, default: none. The parser does a string comparison against the id attribute of each TABLE element. If neither heading nor id is specified then the data output contains all rows from all tables found in the document.

{Number} cells - Minimum number of cells in tabular data; optional, default: 1. The parser will NOT output rows with fewer than cells number of cells.

{Boolean} newlines - Preserve new lines in cell data; optional, default: false. When false, newlines will be replaced by spaces. Preserving newline characters will keep the formatting of multiline text such as descriptions. However, newlines are problematic for cells containing multi-word identifiers and keywords that might be wrapped in the cell text.

{Boolean} trim - trim whitespace from output values, default: true.
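
The effect of the newlines and trim options on a single cell value can be sketched as follows. The function normalizeCell is hypothetical and only mirrors the behavior described above (newlines replaced by spaces when newlines is false, surrounding whitespace removed when trim is true); the parser's internal handling may differ.

```javascript
// Hypothetical helper mirroring the documented newlines and trim options.
function normalizeCell(text, { newlines = false, trim = true } = {}) {
  let value = text;
  // When newlines are not preserved, each line break becomes a single space.
  if (!newlines) value = value.replace(/\r?\n/g, " ");
  if (trim) value = value.trim();
  return value;
}

console.log(normalizeCell("  Congressional\nDistrict 5 "));
console.log(normalizeCell("  Congressional\nDistrict 5 ", { newlines: true }));
```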

HTTP Options

HTTP requests are made using Node.js HTTP modules. See the source code file lib/httpRequest.js for more details.

{Object} http - options to pass thru to HTTP request

{String} http.method - HTTP method, default is "GET"

{Object} http.params - object containing URL querystring parameters.

{Object} http.headers - object containing HTTP headers

{Array} http.cookies - array of HTTP cookie strings

{String} http.auth - string for Basic Authentication (Authorization header), i.e. "user:password".
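
To show how these fields typically map onto a request, here is a minimal sketch using the standard URL and Buffer globals. The function buildRequest is hypothetical; it is not the code in lib/httpRequest.js, and joining cookies into a single Cookie header is an assumption.

```javascript
// Hypothetical helper: maps the documented http options onto a URL and headers.
function buildRequest(url, http = {}) {
  const target = new URL(url);
  // http.params become URL querystring parameters.
  for (const [name, value] of Object.entries(http.params ?? {})) {
    target.searchParams.set(name, String(value));
  }
  const headers = { ...(http.headers ?? {}) };
  // Assumption: cookie strings are joined into one Cookie header.
  if (http.cookies?.length) headers["Cookie"] = http.cookies.join("; ");
  // Basic Authentication: base64 of "user:password" in the Authorization header.
  if (http.auth) {
    headers["Authorization"] = "Basic " + Buffer.from(http.auth).toString("base64");
  }
  return { url: target.href, method: http.method ?? "GET", headers };
}

const request = buildRequest("https://example.com/report.html", {
  params: { year: 2024 },
  auth: "user:password",
});
console.log(request.method, request.url);
```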

Class HtmlDataReader

HtmlDataReader is a Node.js Readable stream that operates in Object mode. It uses the HtmlDataParser event interface to stream one data row (array) per chunk.

HtmlDataReader Options

HtmlDataReader constructor options are the same as HtmlDataParser Options.

Class RowAsObjectTransform

HtmlDataReader operates in Object Mode. The reader outputs arrays (rows). To convert rows into Javascript objects use the RowAsObjectTransform transform. RowAsObjectTransform operates in Object mode where output is a Javascript Object of <name,value> pairs.

import { HtmlDataReader, RowAsObjectTransform } from "html-data-parser";
import { pipeline } from 'node:stream/promises';

let reader = new HtmlDataReader(options);
let transform1 = new RowAsObjectTransform(options);
let writable = `<some writable that can handle Object Mode data>`

await pipeline(reader, transform1, writable);

RowAsObjectTransform Options

RowAsObjectTransform constructor takes an options object with the following fields.

{String[]} headers - array of cell property names; optional, default: none. If a headers array is not specified then parser will assume the first row found contains cell property names.

{Boolean} hasHeaders - data has a header row, if true and headers options is set then provided headers override header row. Default is true.

If a row is encountered with more cells than in the headers array then the extra cell property names will be their ordinal positions. For example, if the data contains five cells but only three headers were specified with options = { headers: [ 'name', 'type', 'info' ] }, then the Javascript objects in the stream will contain { "name": "value1", "type": "value2", "info": "value3", "4": "value4", "5": "value5" }.
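
The naming rule for extra cells can be sketched with a small standalone function. rowToObject is a hypothetical helper that reproduces the behavior described above; it is not the transform's actual implementation.

```javascript
// Hypothetical helper: maps a row to an object, naming cells beyond the
// headers array by their one-based ordinal position.
function rowToObject(row, headers) {
  const obj = {};
  row.forEach((value, index) => {
    const name = headers[index] ?? String(index + 1);
    obj[name] = value;
  });
  return obj;
}

const record = rowToObject(
  ["value1", "value2", "value3", "value4", "value5"],
  ["name", "type", "info"]
);
console.log(record.name, record["4"], record["5"]);
```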

Class RepeatCellTransform

RepeatCellTransform will normalize data that was probably generated by a report writer. The specified cell will be repeated in following rows that contain one less cell.

import { HtmlDataReader, RepeatCellTransform } from "html-data-parser";
import { pipeline } from 'node:stream/promises';

let reader = new HtmlDataReader(options);
let transform1 = new RepeatCellTransform({ column: 0 });
let writable = <some writable that can handle Object Mode data>

await pipeline(reader, transform1, writable);

RepeatCellTransform Options

RepeatCellTransform constructor takes an options object with the following fields.

{Number} column - column index of cell to repeat, default 0.

Example

In this example "Dewitt" will be repeated in rows 2 and 3.

HTML Document

County   Precincts  Date/Period   Total
Dewitt          44  JUL 2023     52,297
                44  OCT 2023     52,017
                44  JAN 2024     51,712

Output

[ "County", "Precincts", "Date/Period", "Total" ]
[ "Dewitt", "44", "JUL 2023", "52,297" ]
[ "Dewitt", "44", "OCT 2023", "52,017" ]
[ "Dewitt", "44", "JAN 2024", "51,712" ]
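
The repeat logic in this example can be sketched as a plain function over an array of rows. repeatCell is a hypothetical helper, and treating the first row's length as the complete cell count is an assumption made for this sketch rather than documented behavior.

```javascript
// Hypothetical helper: fills in the repeat column for rows that are one cell short.
function repeatCell(rows, column = 0) {
  let expected = 0;     // cell count of a complete row (assumed from the first row)
  let repeatValue = ""; // last value seen in the repeat column
  return rows.map((row) => {
    if (expected === 0) expected = row.length;
    if (row.length === expected) {
      repeatValue = row[column];
      return row;
    }
    if (row.length === expected - 1) {
      const filled = [...row];
      filled.splice(column, 0, repeatValue);
      return filled;
    }
    return row; // rows of any other length pass through unchanged
  });
}

const parsed = [
  ["County", "Precincts", "Date/Period", "Total"],
  ["Dewitt", "44", "JUL 2023", "52,297"],
  ["44", "OCT 2023", "52,017"],
  ["44", "JAN 2024", "51,712"],
];
console.log(repeatCell(parsed, 0));
```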

Class RepeatHeadingTransform

RepeatHeadingTransform will normalize data that was probably generated by a report writer. Subheadings are rows containing a single cell interspersed in data rows. The header name is inserted into the header row. The subheading value will be repeated in rows that follow until another subheading is encountered.

import { HtmlDataReader, RepeatHeadingTransform } from "html-data-parser";
import { pipeline } from 'node:stream/promises';

let reader = new HtmlDataReader(options);
let transform1 = new RepeatHeadingTransform({header: "County:1:0"});
let writable = <some writable that can handle Object Mode data>

await pipeline(reader, transform1, writable);

RepeatHeadingTransform Options

RepeatHeadingTransform constructor takes an options object with the following fields.

{String} header - column name for the repeating heading field. Can optionally contain an index of where to insert the header in the header row. Default "heading:0".

{Boolean} hasHeaders - data has a header row, if true and headers options is set then provided headers override header row. Default is true.

Example

In this example options = {header: "County:1:0"}.

HTML Document

District  Precincts    Total

Congressional District 5
Maricopa        120  403,741
Pinal            30  102,512
Total:          150  506,253

Output

[ "District", "County", "Precincts", "Total" ]
[ "Congressional District 5", "Maricopa", "120", "403,741" ]
[ "Congressional District 5", "Pinal", "30", "102,512" ]
[ "Congressional District 5", "Total:", "150", "506,253" ]
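
The example above can be reproduced with a short standalone sketch. parseHeaderSpec and repeatHeading are hypothetical helpers; the sketch assumes the first row is the header row and that omitted indexes in the name:m:n form default to 0.

```javascript
// Hypothetical helper: splits "County:1:0" into a name and two insert positions.
function parseHeaderSpec(spec) {
  const [name, headerIndex = "0", dataIndex = "0"] = spec.split(":");
  return { name, headerIndex: Number(headerIndex), dataIndex: Number(dataIndex) };
}

// Hypothetical helper: removes single-cell subheading rows and repeats their
// value in the data rows that follow.
function repeatHeading(rows, spec = "heading:0") {
  const { name, headerIndex, dataIndex } = parseHeaderSpec(spec);
  const output = [];
  let subheading = "";
  let headerSeen = false;
  for (const row of rows) {
    if (!headerSeen) {
      // Assumption: the first row is the header row.
      const header = [...row];
      header.splice(headerIndex, 0, name);
      output.push(header);
      headerSeen = true;
    } else if (row.length === 1) {
      subheading = row[0]; // remember it, do not emit it
    } else {
      const filled = [...row];
      filled.splice(dataIndex, 0, subheading);
      output.push(filled);
    }
  }
  return output;
}

const report = [
  ["District", "Precincts", "Total"],
  ["Congressional District 5"],
  ["Maricopa", "120", "403,741"],
  ["Pinal", "30", "102,512"],
  ["Total:", "150", "506,253"],
];
console.log(repeatHeading(report, "County:1:0"));
```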

Class FormatCSV and FormatJSON

The htmldataparser CLI program uses the FormatCSV and FormatJSON transforms to stringify Javascript objects that can be saved to a file.

import { HtmlDataReader, RowAsObjectTransform, FormatCSV } from "html-data-parser";
import { pipeline } from 'node:stream/promises';

let reader = new HtmlDataReader(options);
let transform1 = new RowAsObjectTransform(options);
let transform2 = new FormatCSV();

await pipeline(reader, transform1, transform2, process.stdout);
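
For readers curious what CSV stringification involves, here is a minimal sketch. csvField and toCsv are hypothetical helpers, not the FormatCSV implementation; they apply the usual CSV quoting rule that fields containing commas, quotes or line breaks are wrapped in double quotes, which matters for values such as "52,297".

```javascript
// Hypothetical helper: quotes a field only when it contains a comma, quote or line break.
function csvField(value) {
  const text = value == null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical helper: writes a header line from the first object's keys,
// then one line per object.
function toCsv(objects) {
  if (objects.length === 0) return "";
  const names = Object.keys(objects[0]);
  const lines = [names.map(csvField).join(",")];
  for (const obj of objects) {
    lines.push(names.map((name) => csvField(obj[name])).join(","));
  }
  return lines.join("\n") + "\n";
}

console.log(toCsv([{ County: "Dewitt", Total: "52,297" }]));
```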

Examples

In the source code the html-data-parser.js program and the Javascript files in the /test folder are good examples of using the library modules.

Hello World

HelloWorld.html is a single page HTML document with the string "Hello, world!" positioned on the page. The HtmlDataParser output is one row with one cell.

[
  ["Hello, world!"]
]

To transform the row array into an object specify the headers option to RowAsObjectTransform transform.

let transform = new RowAsObjectTransform({
  headers: [ "Greeting" ]
})

Output as JSON objects:

[
  { "Greeting": "Hello, world!" }
]

Notes

Only supports HTML files containing TABLE elements. Does not support other table like grid elements.
Does not support identification of titles, headings, column headers, etc. by using style information for a cell.
Vertical spanning cells are parsed with the first row in which the cell is encountered. Subsequent rows will not contain the cell and will have one less cell. Currently, vertical spanning cells must be at the end of the row; otherwise the ordinal position of cells in the following rows will be incorrect, i.e. missing values are not supported.
