n8n-nodes-cheerio-html-parser
v0.5.8
n8n node to parse HTML using Cheerio
This is a custom n8n node that uses Cheerio to parse HTML content.
Features
- Parse HTML using multiple CSS selectors
- Return each selector's result as an array (all matches) or a string (first match only)
- Remove unwanted elements (scripts, styles, navigation, etc.) before parsing
- Extract specific attributes from elements
Installation
- Clone this repository
- Install dependencies:
npm install
- Build the node:
npm run build
- Link to your n8n installation:
npm link
- In your n8n installation directory, run:
npm link n8n-nodes-cheerio-html-parser
Usage
- Add the "Cheerio HTML Parser" node to your workflow
- Input the HTML content you want to parse
- Add one or more selectors with:
  - Name: A unique identifier for this selector result
  - CSS Selector: The CSS selector to find elements (e.g., "div.content", "p.title", "#main")
  - Attribute (optional): Extract a specific attribute instead of text content
  - Return Single Item: Choose whether to return the first match or all matches
- Optionally specify elements to remove before parsing (e.g., "script, style, nav, footer")
- Connect the node to your workflow
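The selector fields listed above can be pictured as a small typed record. The following is a minimal sketch, not the node's actual source: the `SelectorConfig` interface borrows its field names from the JSON configuration shown in the examples, and `findDuplicateNames` is a hypothetical helper added only to illustrate why each name should be unique (the name becomes a key in the results object).

```typescript
// Hypothetical shape of one selector entry, mirroring the fields listed above.
interface SelectorConfig {
  name: string;        // unique key for this selector's result
  selector: string;    // CSS selector, e.g. "div.content p"
  attribute?: string;  // optional: extract this attribute instead of text content
  singleItem: boolean; // true = first match only, false = all matches
}

// Illustrative helper (not part of the node): returns names used more than once.
function findDuplicateNames(selectors: SelectorConfig[]): string[] {
  const counts: Record<string, number> = {};
  for (const { name } of selectors) {
    counts[name] = (counts[name] ?? 0) + 1;
  }
  return Object.keys(counts).filter((name) => (counts[name] ?? 0) > 1);
}

const selectors: SelectorConfig[] = [
  { name: "title", selector: "h1.title", singleItem: true },
  { name: "title", selector: "h2.title", singleItem: true },
  { name: "paragraphs", selector: "div.content p", singleItem: false },
];

// "title" is used twice, so it is the only name reported.
const duplicateNames = findDuplicateNames(selectors);
```

How the node itself handles duplicate names is not documented here, so checking for them before running the workflow is a cautious default rather than a requirement.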
Example
Input HTML:
<div class="content">
  <h1>Title</h1>
  <p>Some text</p>
</div>
With a selector named "title" and the CSS selector .content h1, the node will return:
{
  "results": {
    "title": "Title"
  },
  "totalElements": 1,
  "selectors": 1
}
Complete Example
Input HTML:
<div class="article">
  <h1 class="title">Welcome to my blog</h1>
  <div class="content">
    <p>First paragraph of content</p>
    <p>Second paragraph of content</p>
  </div>
</div>
Node Configuration:
{
  "selectors": [
    {
      "name": "title",
      "selector": "h1.title",
      "singleItem": true
    },
    {
      "name": "paragraphs",
      "selector": "div.content p",
      "singleItem": false
    }
  ]
}
Output:
{
  "results": {
    "title": "Welcome to my blog",
    "paragraphs": [
      "First paragraph of content",
      "Second paragraph of content"
    ]
  },
  "totalElements": 3,
  "selectors": 2
}
Advanced Example with Element Removal
Input HTML:
<html>
  <head>
    <script>console.log('analytics');</script>
    <style>.hidden { display: none; }</style>
  </head>
  <body>
    <nav>Navigation Menu</nav>
    <main>
      <h1 class="title">Article Title</h1>
      <div class="content">
        <p>Main content here</p>
      </div>
    </main>
    <footer>Footer content</footer>
  </body>
</html>
Node Configuration:
{
  "removeElements": "script, style, nav, footer",
  "selectors": [
    {
      "name": "title",
      "selector": "h1.title",
      "singleItem": true
    },
    {
      "name": "content",
      "selector": "div.content p",
      "singleItem": true
    },
    {
      "name": "titleClass",
      "selector": "h1.title",
      "attribute": "class",
      "singleItem": true
    }
  ]
}
Output:
{
  "results": {
    "title": "Article Title",
    "content": "Main content here",
    "titleClass": "title"
  },
  "totalElements": 3,
  "selectors": 3
}
Note: The script, style, nav, and footer elements were removed before parsing, so they don't interfere with the content extraction.
Node Structure
The node outputs an object with the following structure:
- results: An object containing the extracted data, with keys matching the selector names
- totalElements: The total number of elements found across all selectors
- selectors: The number of selectors that were processed
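To make the three output fields concrete, here is a minimal sketch of how such an object could be assembled once the matches for each selector are known. It is not the node's implementation: `ParserOutput`, `SelectorMatches`, and `buildOutput` are hypothetical names, the value types are assumptions, and two behaviours are guesses labeled in the comments (a single-item selector with no match yields `null`, and `totalElements` counts every element matched).

```typescript
// Output shape as described above; the value types are an assumption.
interface ParserOutput {
  results: Record<string, string | string[] | null>;
  totalElements: number;
  selectors: number;
}

// Hypothetical intermediate form: one selector's already-extracted matches.
interface SelectorMatches {
  name: string;
  singleItem: boolean;
  matches: string[];
}

function buildOutput(entries: SelectorMatches[]): ParserOutput {
  const results: ParserOutput["results"] = {};
  let totalElements = 0;
  for (const entry of entries) {
    // Assumption: a single-item selector with no match yields null.
    results[entry.name] = entry.singleItem
      ? (entry.matches[0] ?? null)
      : entry.matches;
    // Assumption: every matched element is counted, whether or not it is returned.
    totalElements += entry.matches.length;
  }
  return { results, totalElements, selectors: entries.length };
}

// Matches taken from the "Complete Example" in this README.
const output = buildOutput([
  { name: "title", singleItem: true, matches: ["Welcome to my blog"] },
  {
    name: "paragraphs",
    singleItem: false,
    matches: ["First paragraph of content", "Second paragraph of content"],
  },
]);
```

With the matches from the Complete Example, this sketch yields `totalElements` of 3 and `selectors` of 2, which agrees with the output documented there.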
Development
To run tests:
npm test