@willh/html-text-node-parser

v1.0.0

Published

7 months ago

A CLI tool to parse HTML files and extract text nodes with their byte positions


willh

html parser text-node cli

html-text-node-parser

A CLI tool to parse HTML files and extract text nodes with their byte positions in the source file.

Features

Parses HTML files and extracts all text nodes
Reports byte positions (begin/end) for each text node
Excludes text content from specific tags: STYLE, SCRIPT, NOSCRIPT, IFRAME, OBJECT, CODE, TEXTAREA, INPUT, SELECT
Outputs results in JSON format

Installation

npm install

Usage

node index.js <html-file-path>

Or if installed globally:

html-text-node-parser <html-file-path>

Example

Given an HTML file example.html:

<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <h1>Hello World</h1>
    <p>This is a <strong>test</strong> paragraph.</p>
</body>
</html>

Running the command:

node index.js example.html

The output is:

[
  {
    "begin": 41,
    "end": 50,
    "text": "Test Page"
  },
  {
    "begin": 82,
    "end": 93,
    "text": "Hello World"
  },
  {
    "begin": 106,
    "end": 116,
    "text": "This is a "
  },
  {
    "begin": 124,
    "end": 128,
    "text": "test"
  },
  {
    "begin": 137,
    "end": 148,
    "text": " paragraph."
  }
]

Note: Whitespace-only text nodes are also included in the output. The example above shows only the significant text nodes for clarity.
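
If only the significant nodes are wanted, the parsed JSON can be filtered after the fact. The following is a minimal sketch, not part of the tool: significantNodes is a hypothetical helper, and the two whitespace-only entries in the sample array are illustrative values for the example.html above.

```javascript
// Illustrative slice of the tool's output, including whitespace-only nodes.
const rawNodes = [
  { begin: 41, end: 50, text: 'Test Page' },
  { begin: 58, end: 59, text: '\n' },
  { begin: 82, end: 93, text: 'Hello World' },
  { begin: 98, end: 103, text: '\n    ' },
];

// Hypothetical helper: keep nodes with at least one non-whitespace character.
function significantNodes(list) {
  return list.filter((node) => node.text.trim().length > 0);
}

const significant = significantNodes(rawNodes);
console.log(significant.map((node) => node.text)); // [ 'Test Page', 'Hello World' ]
```

Filtering on the consumer side leaves begin and end untouched, so the remaining entries still point at the correct bytes in the source file.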

Output Format

The output is a JSON array where each element represents a text node with the following fields:

begin (number): The byte offset of the first byte of the text in the source file (UTF-8 encoded)
end (number): The byte offset just past the last byte of the text in the source file (exclusive), so end - begin is the byte length of the text as it appears in the source
text (string): The actual text content

Important: The positions are byte offsets, not character positions. This is particularly important when working with multi-byte characters (e.g., Chinese, Japanese, emoji) where a single character may occupy 2-4 bytes in UTF-8 encoding.
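
The difference shows up as soon as the source contains multi-byte characters. The sketch below uses a made-up one-line document and a hypothetical sliceByBytes helper; it assumes begin and end are applied to a Buffer holding the UTF-8 bytes of the file, not to a JavaScript string.

```javascript
// Illustrative source with multi-byte characters (not from the tool's tests).
const sample = '<p>你好</p><p>ok</p>';
const sampleBytes = Buffer.from(sample, 'utf8');

// Hypothetical helper: extract text by byte offsets (end is exclusive).
function sliceByBytes(buffer, begin, end) {
  return buffer.subarray(begin, end).toString('utf8');
}

// Byte layout: "<p>" = 3 bytes, "你好" = 6 bytes, "</p>" = 4, "<p>" = 3.
// So "ok" occupies bytes 16..18, although its string indices are 12..14.
console.log(sliceByBytes(sampleBytes, 3, 9));   // 你好
console.log(sliceByBytes(sampleBytes, 16, 18)); // ok
console.log(sample.slice(16, 18));              // p>  (wrong: string indices)
```

Using the offsets with String.prototype.slice only works for pure-ASCII files; reading the file as a Buffer is the safer default.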

Excluded Tags

The following HTML tags are excluded from text node extraction:

<style> - CSS styles
<script> - JavaScript code
<noscript> - Fallback content shown when scripting is unavailable
<iframe> - Embedded frames
<object> - Embedded objects
<code> - Code snippets
<textarea> - Text input areas
<input> - Form inputs
<select> - Dropdown selects
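
One way such an exclusion list can be applied is to stop descending whenever an excluded element is reached. The sketch below is an assumption about the general approach, not the tool's actual code: collectText and the hand-built tree are hypothetical, the tree only mimics the nodeName / childNodes / value shape of parse5's default tree, and it assumes the exclusion covers the whole subtree of an excluded tag.

```javascript
// Excluded tag names, lowercased as HTML parsers normally report them.
const EXCLUDED_TAGS = new Set([
  'style', 'script', 'noscript', 'iframe', 'object',
  'code', 'textarea', 'input', 'select',
]);

// Hypothetical walker: collect text node values, skipping excluded subtrees.
function collectText(node, out = []) {
  if (node.nodeName === '#text') {
    out.push(node.value);
    return out;
  }
  if (EXCLUDED_TAGS.has(node.nodeName)) {
    return out; // do not descend into excluded elements
  }
  for (const child of node.childNodes || []) {
    collectText(child, out);
  }
  return out;
}

// Hand-built stand-in for: <p>Run <code>npm test</code> first.</p>
const fragment = {
  nodeName: 'p',
  childNodes: [
    { nodeName: '#text', value: 'Run ' },
    { nodeName: 'code', childNodes: [{ nodeName: '#text', value: 'npm test' }] },
    { nodeName: '#text', value: ' first.' },
  ],
};

console.log(collectText(fragment)); // [ 'Run ', ' first.' ]
```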

Testing

npm test

Implementation Details

This tool uses parse5 to parse HTML with source location information. The byte positions reported correspond to the exact location of the text content in the original source file.

Note: In some edge cases involving document boundaries and complex HTML structures, parse5 may report text nodes that span across closing tags. This tool automatically detects and handles these cases by searching for the actual contiguous text in the source file.
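
This README does not describe how the byte positions are derived. As a rough illustration only, and assuming the parser reports offsets as indices into the JavaScript string (UTF-16 code units), a string index can be turned into a UTF-8 byte offset by measuring the byte length of everything before it. charToByteOffset is a hypothetical helper, not a function of this tool or of parse5.

```javascript
// Hypothetical helper: convert a string index into a UTF-8 byte offset.
// Assumes charOffset does not fall in the middle of a surrogate pair.
function charToByteOffset(source, charOffset) {
  return Buffer.byteLength(source.slice(0, charOffset), 'utf8');
}

const html = '<p>😀 hi</p>';

// String indices of the text "😀 hi": the emoji is 2 UTF-16 code units,
// so the text spans indices 3..8.
const startChar = 3;
const endChar = startChar + '😀 hi'.length;

// In UTF-8 the emoji is 4 bytes, so the same text spans bytes 3..10.
const beginByte = charToByteOffset(html, startChar);
const endByte = charToByteOffset(html, endChar);

console.log(beginByte, endByte); // 3 10
console.log(Buffer.from(html, 'utf8').subarray(beginByte, endByte).toString('utf8'));
```

Measuring the prefix on every call is quadratic for large files; a single pass that accumulates byte lengths would scale better, at the cost of a little more code.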
