@origints/html
v0.5.1
HTML parsing and manipulation for Origins with full lineage tracking
Features
- Parse HTML with source position tracking
- CSS selector queries (hast-util-select)
- Type-safe extractors for common elements
- Convert HTML to Markdown
- Navigation API for tree traversal
- Integrates with Origins transform registry
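To make the first feature concrete: hast-util-select, named above, operates on hast trees, whose nodes can carry a unist `position` (start and end line/column). The sketch below is not this package's API. It uses hand-written minimal types (`Point`, `Position`, and `HastNode` are illustrative names) and assumes the parsed tree keeps those positions, only to show what source position tracking on an element can look like.

```typescript
// Minimal hast-like types, written by hand for this sketch.
// They are illustrative and are NOT exports of @origints/html.
interface Point { line: number; column: number }
interface Position { start: Point; end: Point }
interface HastNode {
  type: 'root' | 'element' | 'text'
  tagName?: string
  value?: string
  position?: Position
  children?: HastNode[]
}

// Depth-first search for every element with the given tag name.
function findByTag(node: HastNode, tagName: string): HastNode[] {
  const own = node.type === 'element' && node.tagName === tagName ? [node] : []
  const nested = (node.children ?? []).flatMap(child => findByTag(child, tagName))
  return [...own, ...nested]
}

// Format where a node came from as "startLine:startCol-endLine:endCol".
function describePosition(node: HastNode): string {
  const p = node.position
  if (!p) return 'unknown'
  return `${p.start.line}:${p.start.column}-${p.end.line}:${p.end.column}`
}

// Hand-built tree standing in for a parse of:
//   <main>
//     <h1>Welcome</h1>
//   </main>
const tree: HastNode = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'main',
      position: { start: { line: 1, column: 1 }, end: { line: 3, column: 8 } },
      children: [
        {
          type: 'element',
          tagName: 'h1',
          position: { start: { line: 2, column: 3 }, end: { line: 2, column: 19 } },
          children: [{ type: 'text', value: 'Welcome' }],
        },
      ],
    },
  ],
}

const headingPositions = findByTag(tree, 'h1').map(describePosition)
console.log(headingPositions)
```

Keeping the position on each node is what lets an extracted value be traced back to the exact span of the input it came from.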
Installation
npm install @origints/html @origints/core
Usage with Planner
Extract content from an HTML file
import { Planner, loadFile, run } from '@origints/core'
import { parseHtml } from '@origints/html'

const plan = new Planner()
  .in(loadFile('page.html'))
  .mapIn(parseHtml())
  .emit((out, $) =>
    out
      .add('title', $.select('h1').text())
      .add('href', $.select('a').attr('href'))
  )
  .compile()

// readFile and registry must be defined by the caller; they are not shown in this snippet
const result = await run(plan, { readFile, registry })
// result.value: { title: 'Welcome', href: '/about' }
Extract collections with selectAll
Use selectAll() to extract data from all matching elements as an array:
// Extract all list item texts
const plan = new Planner()
  .in(loadFile('page.html'))
  .mapIn(parseHtml())
  .emit((out, $) =>
    out.add(
      'items',
      $.select('ul').selectAll('li', node => node.text())
    )
  )
  .compile()

const result = await run(plan, { readFile, registry })
// result.value: { items: ['First', 'Second', 'Third'] }
Extract structured data from repeated elements
// Extract href and text from all links
const plan = new Planner()
  .in(loadFile('page.html'))
  .mapIn(parseHtml())
  .emit((out, $) =>
    out.add(
      'links',
      $.selectAll('a', node => ({
        kind: 'object',
        properties: {
          href: node.attr('href'),
          text: node.text(),
        },
      }))
    )
  )
  .compile()

const result = await run(plan, { readFile, registry })
// result.value: {
//   links: [
//     { href: '/about', text: 'About' },
//     { href: '/contact', text: 'Contact' },
//   ]
// }
Extract children of an element
const plan = new Planner()
  .in(loadFile('page.html'))
  .mapIn(parseHtml())
  .emit((out, $) =>
    out.add(
      'sections',
      $.select('main').children(node => node.text())
    )
  )
  .compile()
Combine HTML with other data sources
// parseJson() is not listed in this package's API; its import is omitted here
const plan = new Planner()
  .in(loadFile('report.html'))
  .mapIn(parseHtml())
  .emit((out, $) => out.add('title', $.select('h1').text()))
  .in(loadFile('metadata.json'))
  .mapIn(parseJson())
  .emit((out, $) => out.add('author', $.get('author').string()))
  .compile()
Standalone usage (without Planner)
For direct HTML navigation and CSS selector queries:
import { parseHtmlImpl, HtmlNode } from '@origints/html'

const node = parseHtmlImpl.execute(htmlString) as HtmlNode

// CSS selector queries return Result types
const titleResult = node.select('h1')
if (titleResult.ok) {
  console.log(titleResult.value.text())
}

// Select by class
const introResult = node.select('.intro')
if (introResult.ok) {
  console.log(introResult.value.text())
}

// Extract attributes
const linkResult = node.select('a')
if (linkResult.ok) {
  const href = linkResult.value.attr('href')
  if (href.ok) {
    console.log(href.value)
  }
}

// Select all matching elements
const items = node.selectAll('li')
for (const item of items) {
  console.log(item.text())
}
Working with tables
const node = parseHtmlImpl.execute(htmlWithTable) as HtmlNode

const tableResult = node.select('table')
if (tableResult.ok) {
  console.log(tableResult.value.text())
}
Converting to Markdown
import { parseHtmlImpl, toMarkdown, HtmlNode } from '@origints/html'

const node = parseHtmlImpl.execute(htmlContent) as HtmlNode
const markdown = toMarkdown(node)
API
| Export | Description |
| ---------------------------------- | ----------------------------------------------------- |
| parseHtml(options?) | Create a transform AST for use with Planner.mapIn() |
| parseHtmlImpl | Sync transform implementation (string input) |
| parseHtmlAsyncImpl | Async transform implementation (string or stream) |
| registerHtmlTransforms(registry) | Register all HTML transforms with a registry |
| HtmlNode | Navigable wrapper with CSS selector support |
| toMarkdown(node) | Convert HTML to Markdown |
| toJson(node, options?) | Convert HtmlNode to JSON |
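
The standalone examples above check `.ok` before reading `.value`. When several lookups are chained, a small helper can cut that repetition. The sketch below defines its own `Result` type: only `ok` and `value` appear in this README, so the `error` field and the helper names `unwrapOr` and `andThen` are assumptions made for illustration, not exports of @origints/html. The `selectTag` and `readAttr` functions are hypothetical stand-ins for `node.select(...)` and `attr(...)`.

```typescript
// Local stand-in for the Result shape used in the standalone examples.
// Assumption: the failure branch carries an `error` field.
type Result<T, E = string> =
  | { ok: true; value: T }
  | { ok: false; error: E }

// Return the wrapped value, or a fallback when the lookup failed.
function unwrapOr<T, E>(result: Result<T, E>, fallback: T): T {
  return result.ok ? result.value : fallback
}

// Run the next lookup only if the previous one succeeded.
function andThen<T, U, E>(
  result: Result<T, E>,
  next: (value: T) => Result<U, E>
): Result<U, E> {
  return result.ok ? next(result.value) : result
}

// Hypothetical data and lookups, used only to exercise the helpers.
const elements: Record<string, Record<string, string>> = {
  a: { href: '/about' },
}

function selectTag(tag: string): Result<Record<string, string>> {
  const found = elements[tag]
  return found
    ? { ok: true, value: found }
    : { ok: false, error: `no <${tag}> element` }
}

function readAttr(el: Record<string, string>, name: string): Result<string> {
  const value = el[name]
  return value !== undefined
    ? { ok: true, value }
    : { ok: false, error: `no ${name} attribute` }
}

// Chained lookup that succeeds, and one that fails at the first step.
const href = unwrapOr(andThen(selectTag('a'), el => readAttr(el, 'href')), '#')
const missing = unwrapOr(andThen(selectTag('img'), el => readAttr(el, 'src')), '#')
console.log(href, missing)
```

Because `andThen` passes a failure through unchanged, the first failing step decides the outcome and later steps are skipped, which mirrors the nested `if (result.ok)` checks in the standalone examples.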
License
MIT
