🕷️ @xcrap/html-parser
A blazing-fast HTML parser for Node.js, powered by Rust and NAPI-RS
@xcrap/html-parser is an experimental HTML parsing library written in Rust, exposed to Node.js through the NAPI-RS framework. It is designed to be fast, lightweight, and to support both CSS selectors and XPath queries — with built-in support for result limits and element nesting.
Although part of the Xcrap scraping ecosystem, this library can be used as a standalone package in any Node.js project.
📋 Table of Contents
- ✨ Features
- ⚡ Performance
- 📦 Installation
- 🚀 Quick Start
- 📖 API Reference
- 🔍 Usage Examples
- 🏗️ Architecture
- 🛠️ Development
- 🤝 Contributing
- 📝 License
✨ Features
- ⚡ Blazing Fast — Core parsing is done in Rust; significantly faster than JS-based parsers at instance initialization.
- 🎯 Dual Query Support — Query elements using both CSS selectors (via `scraper`) and XPath expressions (via `sxd-xpath`).
- 🦥 Lazy Loading — Internal CSS and XPath engines are only initialized when first needed, reducing unnecessary overhead.
- 🔢 Built-in Limits — Pass a `limit` option to `selectMany` to cap the number of returned elements.
- 🌲 Element Traversal — Navigate nested elements using `selectFirst` and `selectMany` directly on `HTMLElement` instances.
- 🔒 Type-Safe — Fully typed TypeScript declarations included (`index.d.ts`).
- 🖥️ Platform Support — Pre-built native binary currently available for Windows x64 only; other platforms require compilation from source (see Development).
⚡ Performance
Benchmarks below compare parser initialization speed (instantiation time per file):
```
@xcrap/html-parser : 0.246214 ms/file ± 0.136808 ✅ Fastest
html-parser        : 36.825500 ms/file ± 28.855100
htmljs-parser      : 0.501577 ms/file ± 1.210800
html-dom-parser    : 2.180280 ms/file ± 1.796170
html5parser        : 1.674640 ms/file ± 1.222790
cheerio            : 8.679980 ms/file ± 6.328520
parse5             : 4.821180 ms/file ± 2.668220
htmlparser2        : 1.497390 ms/file ± 1.398040
htmlparser         : 16.171200 ms/file ± 109.076000
high5              : 2.982290 ms/file ± 1.927480
node-html-parser   : 2.901670 ms/file ± 1.908040
```

Benchmarks sourced from the node-html-parser repository.
The performance advantage comes from lazy loading: the internal Html (CSS engine) and Package (XPath engine) instances are only initialized on first use and reused across subsequent calls on the same parser instance.
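This memoization pattern can be sketched in TypeScript (a simplified illustration of the idea only — the real engines live in Rust behind the NAPI boundary, and `LazyParser` here is a hypothetical stand-in):

```typescript
// Simplified illustration of the lazy-engine pattern: the engine is
// built on first use, cached, and reused by every later query.
let cssEngineBuilds = 0

class LazyParser {
    private cssEngine: { find: (q: string) => string } | null = null

    constructor(private readonly html: string) {}

    private getCssEngine() {
        if (this.cssEngine === null) {
            cssEngineBuilds++ // the expensive parse happens only once
            this.cssEngine = {
                find: (q: string) => `match for ${q} in ${this.html.length} chars`,
            }
        }
        return this.cssEngine
    }

    selectFirst(query: string): string {
        return this.getCssEngine().find(query)
    }
}

const parser = new LazyParser("<p>hi</p>")
parser.selectFirst("p")
parser.selectFirst("p.other")
console.log(cssEngineBuilds) // 1 — engine built once, then reused
```

The same idea applies per engine: a parser that only ever receives CSS queries never pays the cost of initializing the XPath engine.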
📦 Installation
Install via your preferred package manager:
```bash
# npm
npm install @xcrap/html-parser

# yarn
yarn add @xcrap/html-parser

# pnpm
pnpm add @xcrap/html-parser
```

Requirements:

- Node.js >= 18.0.0
Native binaries are pre-built and distributed for the following platforms:
| Platform | Architecture | Support              |
|----------|--------------|----------------------|
| Windows  | x64          | ✅ Pre-built          |
| macOS    | x64          | 🔧 Build from source |
| macOS    | ARM64        | 🔧 Build from source |
| Linux    | x64 (GNU)    | 🔧 Build from source |
⚠️ Note: Currently only the Windows x64 binary is pre-built and included in the published package. Users on other platforms must compile the native addon locally — see the Development section for instructions.
🚀 Quick Start
```typescript
import { HtmlParser, css, xpath } from "@xcrap/html-parser"

const html = `
<html>
    <body>
        <h1 class="title">Hello World</h1>
        <ul>
            <li class="item">Item 1</li>
            <li class="item">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </body>
</html>
`

const parser = new HtmlParser(html)

// Select a single element using a CSS selector
const heading = parser.selectFirst({ query: css("h1") })
console.log(heading?.text) // "Hello World"

// Select multiple elements and limit results
const items = parser.selectMany({ query: css("li.item"), limit: 2 })
console.log(items.map(el => el.text)) // ["Item 1", "Item 2"]

// Use XPath instead
const firstItem = parser.selectFirst({ query: xpath("//li[@class='item']") })
console.log(firstItem?.text) // "Item 1"
```

CommonJS is also fully supported via `require`:

```javascript
const { parse, css, xpath } = require("@xcrap/html-parser")

const parser = parse(html)
```
📖 API Reference
HtmlParser / HTMLParser
The main entry point for parsing an HTML string. CSS and XPath engines are lazily initialized on first use and reused across subsequent queries.
Constructor
```typescript
new HtmlParser(content: string): HtmlParser
```

| Parameter | Type   | Description                   |
|-----------|--------|-------------------------------|
| content   | string | The raw HTML string to parse. |

Alias: You can also use the `parse(content: string)` function as a convenience wrapper:

```typescript
import { parse } from "@xcrap/html-parser"

const parser = parse(html)
```
selectFirst(options)
Selects the first element matching the given query.
```typescript
parser.selectFirst(options: SelectFirstOptions): HTMLElement | null
```

| Parameter     | Type        | Description                                     |
|---------------|-------------|-------------------------------------------------|
| options.query | QueryConfig | A query config built with `css()` or `xpath()`. |
Returns HTMLElement | null — null if no element matches.
selectMany(options)
Selects all elements matching the given query.
```typescript
parser.selectMany(options: SelectManyOptions): HTMLElement[]
```

| Parameter     | Type        | Description                                                                            |
|---------------|-------------|----------------------------------------------------------------------------------------|
| options.query | QueryConfig | A query config built with `css()` or `xpath()`.                                        |
| options.limit | number?     | Optional. Maximum number of elements to return. Values <= 0 are ignored (returns all). |
Returns HTMLElement[] — an empty array if no matches.
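The documented `limit` semantics ("values <= 0 are ignored") can be mirrored in a few lines of TypeScript — a sketch of the described behavior, not the library's actual source:

```typescript
// Sketch of the documented limit behavior: an undefined or
// non-positive limit means "return everything"; otherwise the
// result list is capped at `limit` elements.
function applyLimit<T>(matches: T[], limit?: number): T[] {
    if (limit === undefined || limit <= 0) return matches
    return matches.slice(0, limit)
}

const found = ["a", "b", "c"]
console.log(applyLimit(found, 2))  // ["a", "b"]
console.log(applyLimit(found, 0))  // ["a", "b", "c"] — non-positive limit ignored
console.log(applyLimit(found))     // ["a", "b", "c"]
```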
HTMLElement
Represents a matched DOM element. Provides properties and methods to inspect and traverse its content.
Note: `HTMLElement` instances also support `selectFirst` and `selectMany`, allowing scoped queries within a found element.
Properties
| Property | Type | Description |
|--------------|---------------------------|--------------------------------------------------------------------|
| outerHTML | string | The full HTML of the element, including its opening and closing tags. |
| innerHTML | string (getter) | The inner HTML content (children only, excluding the element's own tags). |
| text | string (getter) | The concatenated plain-text content of the element and its descendants. |
| id | string \| null (getter) | The element's id attribute, or null if not present. |
| tagName | string (getter) | The element's tag name in UPPERCASE (e.g., "DIV", "H1"). |
| className | string (getter) | The full class attribute string (e.g., "post featured"). |
| classList | string[] (getter) | An array of individual class names. Empty array if no class. |
| attributes | Record<string, string> (getter) | All attributes as a key-value object. |
| firstChild | HTMLElement \| null (getter) | The first child element, or null if none. |
| lastChild | HTMLElement \| null (getter) | The last child element, or null if none. |
Methods
getAttribute(name)
```typescript
element.getAttribute(name: string): string | null
```

Returns the value of the named attribute, or null if the attribute does not exist.
selectFirst(options)
```typescript
element.selectFirst(options: SelectFirstOptions): HTMLElement | null
```

Scoped version of `HtmlParser.selectFirst`. Searches within the current element.
selectMany(options)
```typescript
element.selectMany(options: SelectManyOptions): HTMLElement[]
```

Scoped version of `HtmlParser.selectMany`. Searches within the current element.
toString()
```typescript
element.toString(): string
```

Returns the `outerHTML` string of the element.
css() and xpath()
Helper functions to create typed QueryConfig objects.
```typescript
css(query: string): QueryConfig
xpath(query: string): QueryConfig
```

These functions are the recommended way to build query configurations: they ensure the correct query type is set.
```typescript
import { css, xpath } from "@xcrap/html-parser"

css("article.post")        // → { query: "article.post", type: QueryType.CSS }
xpath("//article[@class]") // → { query: "//article[@class]", type: QueryType.XPath }
```

Types
```typescript
// Identifies the query engine to use
export declare const enum QueryType {
    CSS = 0,
    XPath = 1,
}

// Holds a raw query string and its associated engine type
export interface QueryConfig {
    query: string
    type: QueryType
}

// Options for single-element selection
export interface SelectFirstOptions {
    query: QueryConfig
}

// Options for multi-element selection
export interface SelectManyOptions {
    query: QueryConfig
    limit?: number // <= 0 or undefined means no limit
}
```

🔍 Usage Examples
CSS Selectors
```typescript
import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
<main>
    <article id="post-1" class="post featured" data-author="alice">
        <h2 class="post-title">First Post</h2>
        <p class="excerpt">A short description.</p>
    </article>
    <article id="post-2" class="post" data-author="bob">
        <h2 class="post-title">Second Post</h2>
        <p class="excerpt">Another description.</p>
    </article>
</main>
`

const parser = new HtmlParser(html)

// Select by tag name
const firstArticle = parser.selectFirst({ query: css("article") })
console.log(firstArticle?.id) // "post-1"

// Select by class
const allPosts = parser.selectMany({ query: css(".post") })
console.log(allPosts.length) // 2

// Select by attribute
const featuredPost = parser.selectFirst({ query: css("[data-author='alice']") })
console.log(featuredPost?.getAttribute("data-author")) // "alice"

// Select with limit
const limited = parser.selectMany({ query: css("article"), limit: 1 })
console.log(limited.length) // 1
```

XPath Queries
```typescript
import { HtmlParser, xpath } from "@xcrap/html-parser"

const html = `
<ul>
    <li class="tag">rust</li>
    <li class="tag">napi</li>
    <li class="tag">nodejs</li>
</ul>
`

const parser = new HtmlParser(html)

// Select all <li> with class "tag"
const tags = parser.selectMany({ query: xpath("//li[@class='tag']") })
console.log(tags.map(t => t.text)) // ["rust", "napi", "nodejs"]

// Limit XPath results
const limited = parser.selectMany({ query: xpath("//li"), limit: 2 })
console.log(limited.length) // 2
```

Navigating Nested Elements
```typescript
import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
<nav id="main-nav">
    <ul>
        <li><a href="/home">Home</a></li>
        <li><a href="/about">About</a></li>
        <li><a href="/contact">Contact</a></li>
    </ul>
</nav>
`

const parser = new HtmlParser(html)

// Find the nav, then narrow down inside it
const nav = parser.selectFirst({ query: css("#main-nav") })

if (nav) {
    const links = nav.selectMany({ query: css("a") })

    links.forEach(link => {
        console.log(`${link.text} → ${link.getAttribute("href")}`)
        // "Home → /home"
        // "About → /about"
        // "Contact → /contact"
    })

    // First and last child shortcuts
    console.log(nav.firstChild?.tagName) // "UL"
    console.log(nav.lastChild?.tagName)  // "UL"
}
```

Working with Attributes
```typescript
import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
<a
    id="cta"
    class="btn btn-primary"
    href="https://example.com"
    target="_blank"
    data-track="click"
>
    Click here
</a>
`

const parser = new HtmlParser(html)
const link = parser.selectFirst({ query: css("a") })

if (link) {
    console.log(link.id)        // "cta"
    console.log(link.tagName)   // "A"
    console.log(link.className) // "btn btn-primary"
    console.log(link.classList) // ["btn", "btn-primary"]

    console.log(link.getAttribute("href"))    // "https://example.com"
    console.log(link.getAttribute("target"))  // "_blank"
    console.log(link.getAttribute("missing")) // null

    console.log(link.attributes)
    // {
    //     id: "cta",
    //     class: "btn btn-primary",
    //     href: "https://example.com",
    //     target: "_blank",
    //     "data-track": "click"
    // }
}
```

🏗️ Architecture
The library is structured as a native Node.js addon written in Rust, bridged via NAPI-RS.
```
src/
├── lib.rs            # Crate entry point; exposes the `parse()` function via NAPI
├── parser.rs         # HTMLParser struct — lazy-loads CSS (scraper) and XPath (sxd) engines
├── types.rs          # HTMLElement struct — all DOM properties and methods
├── engines.rs        # Internal: select_first/many by CSS and XPath (pure Rust)
└── query_builders.rs # css() and xpath() helper functions exposed to JS
```

Key Design Decisions
Lazy Initialization: `HTMLParser` holds `Option<Html>` and `Option<Package>` fields. Each engine is allocated on first use and reused automatically, so calling `selectFirst` (CSS) and then `selectMany` (XPath) on the same parser performs only two parsing passes total — one per engine.

Dual Engine: CSS queries use the `scraper` crate; XPath queries use `sxd-xpath` with `sxd_html` for HTML→XML normalization.

Zero-copy Approach: Elements are represented by their `outerHTML` string, avoiding complex lifetime management across the FFI boundary.
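As a rough TypeScript analogy for this last decision (hypothetical — the actual representation lives in Rust): an element that owns its own serialized HTML needs no references back into the original parse tree, so nothing has to outlive the FFI call that produced it.

```typescript
// Hypothetical analogy: each element carries its own outerHTML string
// instead of borrowing nodes from the original parse tree.
class SerializedElement {
    constructor(public readonly outerHTML: string) {}

    // Derive the tag name from the serialized form (uppercase, per the API)
    get tagName(): string {
        const match = this.outerHTML.match(/^<\s*([a-zA-Z][\w-]*)/)
        return match ? match[1].toUpperCase() : ""
    }

    // Mirrors the documented toString(): return the outerHTML string
    toString(): string {
        return this.outerHTML
    }
}

const el = new SerializedElement('<h1 class="title">Hello</h1>')
console.log(el.tagName)    // "H1"
console.log(el.toString()) // '<h1 class="title">Hello</h1>'
```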
Internal Rust Dependencies
| Crate | Version | Role |
|---------------|----------|-------------------------------------------|
| napi | 3.0.0 | NAPI-RS runtime for Node.js integration |
| napi-derive | 3.0.0 | Procedural macros for NAPI bindings |
| scraper | 0.25.0 | HTML parsing and CSS selector engine |
| sxd-document | 0.3.2 | XML document model (used for XPath) |
| sxd-xpath | 0.4.2 | XPath expression evaluator |
| sxd_html | 0.1.2 | HTML → sxd document converter |
🛠️ Development
Prerequisites
- Rust (stable toolchain) — Install
- Node.js >= 18 — Install
- Yarn >= 4 — `npm install -g yarn`
- NAPI-RS CLI — installed automatically via dev dependencies
Setup
```bash
# Clone the repository
git clone https://github.com/Xcrap-Cloud/html-parser.git
cd html-parser

# Install Node.js dependencies
yarn install
```

Building
```bash
# Build native addon in release mode
yarn build

# Build in debug mode (faster compilation, slower runtime)
yarn build:debug
```

The output binary (`html-parser.<platform>.node`) will be placed in the project root.
Running Tests
```bash
yarn test
```

Tests are written with AVA and located in the `__test__/` directory.
Formatting
```bash
# Format all (TypeScript/JS, Rust, TOML)
yarn format

# Individual formatters
yarn format:prettier # Prettier for TS/JS/JSON/YAML/Markdown
yarn format:rs       # cargo fmt for Rust
yarn format:toml     # Taplo for TOML files
```

Linting
```bash
yarn lint # OXLint for TypeScript/JavaScript files
```

🤝 Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a branch: `git checkout -b feat/your-feature` or `git checkout -b fix/your-bug`.
3. Make your changes, ensuring all tests pass: `yarn test`.
4. Format your code: `yarn format`.
5. Commit with a descriptive message: `git commit -m "feat: add support for XYZ"`.
6. Push your branch: `git push origin feat/your-feature`.
7. Open a Pull Request with a clear description of the changes.
Please see CONTRIBUTING.md for detailed guidelines.
📝 License
Distributed under the MIT License.
© Marcuth and contributors.
