@willh/html-text-node-parser
v1.0.0
A CLI tool to parse HTML files and extract text nodes with their byte positions
html-text-node-parser
A CLI tool to parse HTML files and extract text nodes with their byte positions in the source file.
Features
- Parses HTML files and extracts all text nodes
- Reports byte positions (begin/end) for each text node
- Excludes text content from specific tags: STYLE, SCRIPT, NOSCRIPT, IFRAME, OBJECT, CODE, TEXTAREA, INPUT, SELECT
- Outputs results in JSON format
Installation
npm install
Usage
node index.js <html-file-path>
Or if installed globally:
html-text-node-parser <html-file-path>
Example
Given an HTML file example.html:
<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <h1>Hello World</h1>
    <p>This is a <strong>test</strong> paragraph.</p>
</body>
</html>
Running the command:
node index.js example.html
Will output:
[
  {
    "begin": 41,
    "end": 50,
    "text": "Test Page"
  },
  {
    "begin": 82,
    "end": 93,
    "text": "Hello World"
  },
  {
    "begin": 106,
    "end": 116,
    "text": "This is a "
  },
  {
    "begin": 124,
    "end": 128,
    "text": "test"
  },
  {
    "begin": 137,
    "end": 148,
    "text": " paragraph."
  }
]
Note: Whitespace-only text nodes are also included in the output. The example above shows only the significant text nodes for clarity.
Output Format
The output is a JSON array where each element represents a text node with the following fields:
- begin (number): The byte offset where the text starts in the source file (UTF-8 encoded)
- end (number): The byte offset where the text ends in the source file (UTF-8 encoded); it is exclusive, so end - begin is the length of the text in bytes
- text (string): The actual text content
Important: The positions are byte offsets, not character positions. This is particularly important when working with multi-byte characters (e.g., Chinese, Japanese, emoji) where a single character may occupy 2-4 bytes in UTF-8 encoding.
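The difference between byte offsets and character positions can be illustrated with a small sketch. The source string and the entry below are made up for this illustration, with offsets worked out by hand rather than produced by running the tool; the point is only that begin and end should be applied to the raw bytes, not to the decoded string.

```javascript
// Hypothetical input, not taken from the tool's test suite.
const source = '<p>你好 world</p>';
const bytes = Buffer.from(source, 'utf8');

// An entry shaped like the tool's output, computed by hand:
// "<p>" is 3 bytes, "你好" is 6 bytes, " world" is 6 bytes.
const node = { begin: 3, end: 15, text: '你好 world' };

// Correct: slice the bytes, then decode the slice as UTF-8.
const fromBytes = bytes.subarray(node.begin, node.end).toString('utf8');

// Incorrect: the same numbers used as string indices overshoot,
// because each Chinese character is 1 string unit but 3 bytes.
const fromChars = source.slice(node.begin, node.end);

console.log(fromBytes); // 你好 world
console.log(fromChars); // 你好 world</p>
```

For ASCII-only files the two approaches give the same result, which is why this kind of mistake tends to go unnoticed until non-ASCII content appears.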
Excluded Tags
The following HTML tags are excluded from text node extraction:
- <style> - CSS styles
- <script> - JavaScript code
- <noscript> - Content for non-script browsers
- <iframe> - Embedded frames
- <object> - Embedded objects
- <code> - Code snippets
- <textarea> - Text input areas
- <input> - Form inputs
- <select> - Dropdown selects
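One way such an exclusion could work is to skip the whole subtree of an excluded element during traversal. The sketch below is an assumption about the approach, not the tool's actual code: collectText is a hypothetical helper, and the hand-built tree only imitates the nodeName / childNodes / value shape of a parsed document so the example runs without any dependencies.

```javascript
// Tag names from the list above; the traversal itself is an assumption.
const EXCLUDED = new Set([
  'style', 'script', 'noscript', 'iframe', 'object',
  'code', 'textarea', 'input', 'select',
]);

// Hypothetical helper: gather text node values, skipping excluded subtrees.
function collectText(node, out = []) {
  if (node.nodeName === '#text') {
    out.push(node.value);
    return out;
  }
  if (EXCLUDED.has(String(node.nodeName).toLowerCase())) {
    return out; // do not descend into excluded elements
  }
  for (const child of node.childNodes ?? []) {
    collectText(child, out);
  }
  return out;
}

// Hand-built tree standing in for parser output.
const tree = {
  nodeName: 'body',
  childNodes: [
    { nodeName: 'p', childNodes: [{ nodeName: '#text', value: 'Visible' }] },
    { nodeName: 'script', childNodes: [{ nodeName: '#text', value: 'var x = 1;' }] },
    { nodeName: 'code', childNodes: [{ nodeName: '#text', value: 'npm test' }] },
  ],
};

console.log(collectText(tree)); // [ 'Visible' ]
```

Skipping at the element level means text nested at any depth inside an excluded tag is dropped as well, not only its direct children.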
Testing
npm test
Implementation Details
This tool uses parse5 to parse HTML with source location information. The byte positions reported correspond to the exact location of the text content in the original source file.
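If a parser reports positions as JavaScript string indices rather than bytes (an assumption here; the README does not say how the conversion is done), a string index can be turned into a UTF-8 byte offset by measuring the encoded length of everything before it. The helper name charToByteOffset is hypothetical.

```javascript
// Hypothetical helper: convert a string index into a UTF-8 byte offset
// by encoding the prefix that precedes it.
function charToByteOffset(source, charIndex) {
  return Buffer.byteLength(source.slice(0, charIndex), 'utf8');
}

// Made-up input: "日本語" is 3 string units but 9 bytes in UTF-8.
const source = '<h1>日本語</h1><p>ok</p>';
const charStart = source.indexOf('ok');

console.log(charStart, charToByteOffset(source, charStart)); // 15 21
```

This re-encodes the prefix on every call, which is fine for a sketch; for large files a single pass that accumulates byte lengths would avoid the repeated work.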
Note: In some edge cases involving document boundaries and complex HTML structures, parse5 may report text nodes that span across closing tags. This tool automatically detects and handles these cases by searching for the actual contiguous text in the source file.
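The fallback described in the note could look roughly like the following. This is a guess at the general idea, not the tool's implementation: locateText is a hypothetical helper that searches the source bytes for the text's UTF-8 encoding, starting from a hint offset such as the position the parser reported.

```javascript
// Hypothetical helper: find the contiguous bytes of `text` in the source,
// starting the search at `hint`. Returns null when the text is not found.
function locateText(sourceBytes, text, hint = 0) {
  const needle = Buffer.from(text, 'utf8');
  const begin = sourceBytes.indexOf(needle, hint);
  if (begin === -1) return null;
  return { begin, end: begin + needle.length, text };
}

// Made-up input with the same text appearing twice.
const sourceBytes = Buffer.from('<p>Hello</p><p>Hello again</p>', 'utf8');

console.log(locateText(sourceBytes, 'Hello', 0)); // { begin: 3, end: 8, text: 'Hello' }
console.log(locateText(sourceBytes, 'Hello', 8)); // { begin: 15, end: 20, text: 'Hello' }
```

Starting from a hint rather than from offset 0 matters when the same text occurs more than once, as the second call shows.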
