dirty-html-content-parser

v0.0.10

Published

4 years ago

Module for parsing content from dirty HTML.

alfredgodoy

nodejs-dirty-html-content-parser

Module for parsing content from dirty HTML.

It uses string diffs to extract content fragments from HTML documents. First, you register a reference HTML document with string position markers (start and end offsets) that define the different types of content. The module then uses this reference to find the same type of content in other HTML documents by brute-forcing for the smallest diff.
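The position markers are plain character offsets into the reference document, so they can be computed rather than counted by hand. The following is a minimal sketch of that idea. `markerFor` is a hypothetical helper that is not part of the module, and it assumes `end` is an exclusive offset in the style of `String.prototype.slice`, which the README does not confirm.

```javascript
// Hypothetical helper (not part of dirty-html-content-parser):
// locate a known fragment in the reference document and return its offsets.
// Assumption: `end` is exclusive, as with String.prototype.slice.
function markerFor(referenceHtml, fragment) {
  var start = referenceHtml.indexOf(fragment);
  if (start === -1) {
    throw new Error('Fragment not found in reference document');
  }
  return { start: start, end: start + fragment.length };
}

// A tiny stand-in for a real reference document.
var sampleReference = '<div><h1>Example title</h1><br />John Doe, Bagarmossen</div>';
var titleMarker = markerFor(sampleReference, '<h1>Example title</h1>');

// The offsets recover the fragment when used with slice().
console.log(titleMarker, sampleReference.slice(titleMarker.start, titleMarker.end));
```

If the module turns out to treat `end` as inclusive, subtract one from the computed value before passing it on.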

Because the module only uses string diffs, this method also works on dirty, invalid HTML.

To reduce the number of diffs to brute-force, all defined contents must be between tags (see the result in the example code below). That can be any kind of tag: an opening tag, a closing tag, or both. TODO: This must be fixed for version 0.0.0.0.0.1
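To see why the tag restriction shrinks the search space, here is a toy illustration of brute-forcing for the smallest diff. It is not the module's actual algorithm: it swaps in Levenshtein edit distance as the diff measure, compares only the fragment itself (not its surroundings), and every function name in it is hypothetical. Candidates are limited to substrings that start and end at a tag boundary, so the number of comparisons depends on the number of tags rather than the number of characters.

```javascript
// Toy illustration only; the module's real diffing strategy may differ.

// Classic Levenshtein edit distance between two strings.
function levenshtein(a, b) {
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;
  for (var i = 1; i <= a.length; i++) {
    var curr = [i];
    for (var k = 1; k <= b.length; k++) {
      var cost = a[i - 1] === b[k - 1] ? 0 : 1;
      curr[k] = Math.min(prev[k] + 1, curr[k - 1] + 1, prev[k - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Offsets just before each '<' and just after each '>'.
function tagBoundaries(html) {
  var cuts = [];
  for (var i = 0; i < html.length; i++) {
    if (html[i] === '<') cuts.push(i);
    if (html[i] === '>') cuts.push(i + 1);
  }
  return cuts;
}

// Try every substring delimited by two tag boundaries and keep the one
// with the smallest edit distance to the reference fragment.
function bestMatch(referenceFragment, html) {
  var cuts = tagBoundaries(html);
  var best = null;
  for (var s = 0; s < cuts.length; s++) {
    for (var e = s + 1; e < cuts.length; e++) {
      var candidate = html.slice(cuts[s], cuts[e]);
      var distance = levenshtein(referenceFragment, candidate);
      if (best === null || distance < best.distance) {
        best = { text: candidate, distance: distance };
      }
    }
  }
  return best;
}

var match = bestMatch(
  '<h1>Example title</h1>',
  '<div><h1>Another title</h1><p>Body</p></div>'
);
console.log(match.text);
```

Because the sketch never parses the markup, unclosed or mismatched tags in the input do not matter; only the positions of `<` and `>` are used.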

You can define a validator function in the reference to increase the chances of a proper match.

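// Assumption: the package exports the Parser constructor directly; check the module source if this fails.
var Parser = require('dirty-html-content-parser');

// referenceHtml (the marked-up reference document) and html (the document to parse) are strings supplied by the caller.
var parser = new Parser();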
parser.reference('title', {
	html: referenceHtml,
	start: 33431,
	end: 33479,
	validator: function (data) {
		// Accept only candidates that begin with an <h1> tag
		return data.indexOf('<h1>') === 0;
	}
});
parser.reference('author', {
	html: referenceHtml,
	start: 33482,
	end: 33533,
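	// Illustrative validator; the original snippet leaves this one unspecified
	validator: function (data) {
		return data.indexOf('<br />') === 0;
	}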
});
parser.parse(html, function (data) {
	console.dir(data);
	/*
		Example result:
		{
			title: '<h1>Example title</h1>',
			author: '<br />John Doe, Bagarmossen</div>'
		}
	*/
});
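The validator in the example above receives a candidate string and returns true to accept it. When several references need similar checks, a small factory keeps them consistent. This is a sketch under that reading of the example: `startsWithTag` is a hypothetical helper that is not provided by the module.

```javascript
// Hypothetical helper (not part of the module): build a validator that
// accepts a candidate only if it begins with the given tag.
function startsWithTag(tag) {
  return function (data) {
    return data.indexOf(tag) === 0;
  };
}

var titleValidator = startsWithTag('<h1>');
var authorValidator = startsWithTag('<br />');

// Candidates taken from the example result above.
console.log(titleValidator('<h1>Example title</h1>'));
console.log(authorValidator('<br />John Doe, Bagarmossen</div>'));
console.log(titleValidator('<p>Not a title</p>'));
```

A stricter validator could also check how the candidate ends, which would help when several fragments in a document begin with the same tag.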
