wce

v1.0.6

Published

2 years ago

Extract content of any web page by using various extraction libraries.

mxr576

readability grab read scrape spider article readability crawl readable web page content extract

Web page Content Extractor (wce)

Extract the content of any web page by using various content extractor libraries.

Currently the following ones are implemented:

This is the base module of the Webpage Content Extractor API module.

Usage example

var winston = require('winston');
var util = require('util');
var wce = require('wce');
var logger = new (winston.Logger)({});
logger.add(winston.transports.Console, {
  prettyPrint: true,
  colorize: true
});

var extractors = ['read-art', 'node-readability'];
var options = {};
var WCE = new wce(extractors, options);

try {
  WCE.extract('https://en.wikipedia.org/wiki/Hungary')
    .on('success', function (result, errors) {
      logger.log('info', result);
      if (errors && errors.length !== 0) {
        logger.log('warn', 'Extraction was successful, but there were some errors: %s', util.inspect(errors));
      }
    })
    .on('error', function (errors) {
      logger.log('error', 'Extraction failed with the following error(s): %s', util.inspect(errors));
    });
} catch (error) {
  logger.log('error', util.inspect(error));
}

WCE-Proxy

It is a built-in wrapper for content proxies. The wrapper can be used to retrieve previously extracted content of URLs from a cache through a REST API. This REST API can be built in any language and can store the content of a URL in any database, but when I wrote the wce-proxy wrapper I had a few expectations:

The content of a URL can be queried with a GET request, with the queried URL sent to the server in the url GET parameter. Ex.: http://wce-proxy/?url=http://cnn.com
- If the proxy finds content for the URL, it responds with a 200 HTTP status code and the response body contains the content of the URL.
- If no content is found for the URL, the response code is 204 and the body is empty.
- Any other status code is handled as an error. The proxy can send back error messages in the response body.
The proxy can accept data through a POST request. The request should contain two parameters: url and content.
- When the content of the URL has been successfully stored in the proxy's database, the proxy should return a 200 HTTP status code with the 'Success' message in the body.
- Any other status code is handled as an error; the response body can contain information about the reason.
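
The expectations above could be sketched on the proxy side roughly as follows. This is only an illustration and not part of wce: handleProxyRequest and the in-memory store are hypothetical names, the parameters are assumed to be already parsed from the request, and the 400 and 405 codes are just one possible choice for the "any other status code" error case.

```javascript
'use strict';

// Hypothetical in-memory cache; a real proxy could use any database.
var store = new Map();

// Hypothetical helper (not part of wce): maps a request onto the status
// codes described above. `method` is the HTTP method and `params` holds
// the already-parsed request parameters.
function handleProxyRequest(method, params) {
  if (method === 'GET') {
    if (!params.url) {
      // Assumed error code; the wrapper treats anything but 200/204 as an error.
      return { status: 400, body: 'Missing url parameter' };
    }
    if (store.has(params.url)) {
      // Content found: 200 with the content in the body.
      return { status: 200, body: store.get(params.url) };
    }
    // Content not found: 204 with an empty body.
    return { status: 204, body: '' };
  }
  if (method === 'POST') {
    if (!params.url || typeof params.content !== 'string') {
      return { status: 400, body: 'Both url and content are required' };
    }
    store.set(params.url, params.content);
    // Stored successfully: 200 with the 'Success' message.
    return { status: 200, body: 'Success' };
  }
  // Assumed error code for unsupported methods.
  return { status: 405, body: 'Method not allowed' };
}
```

Such a function could be called from the request handler of any HTTP server once the url and content parameters have been read from the query string or the request body.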

Licence

Apache License 2.0
