
docparse-parse-scraped-worker

v1.0.9

Parse data fetched by docparse scrapers

Docparse Scraper API server

Parse scraped documents

Startup

To start the parseScraped worker, execute

node parseScrapedWorker.js --config test/config.json

This will create a new Parser object which can receive remote requests to start parsing documents. See app.js for the construction of a new Parser object, and see the parseScrapedWorker.js file in the project root for details.

var ParseScrapedWorker = require('./index')
var config = require('nconf').defaults({
  seaport: {
    host: 'localhost',
    port: 4598
  }
})
var parser = new ParseScrapedWorker(config)

Parsing

Each ParseScrapedWorker object has a parseScraped function. This function should be called with a scraped document as the first parameter and a callback function as the second.

var inspect = require('eyespect').inspector()
var ParseScrapedWorker = require('./index')
var config = require('nconf').defaults({
  seaport: {
    host: 'localhost',
    port: 4598
  }
})
var parser = new ParseScrapedWorker(config)
var scrapedDoc = {
  supplierCode: 'HES',
  payload: {
    billDate: '2011-02-23 00:00:00 +00:00',
    accountNumber: 'fooAccountNumber',
    billNumber: 'fooBillNumber',
    loginID: 'fooLoginID',
    supplierCode: 'HES',
    textPages: ['foo page 1', 'bar page 2'] // these are the extracted text pages from the bill pdf file
  }
}

parser.parseScraped(scrapedDoc, function (err, reply) {
  if (err) {
    inspect(err, 'error parsing scraped document')
    return
  }
  inspect(reply, 'parsed scraped document correctly')
})

Sockets

A scraped document is parsed differently for each supplier, so in the DocParse system there needs to be a parsing server online for each supplier. When the ParseScrapedWorker is initiated, it binds to a request Axon socket for each supported supplier; see lib/getRemoteParseSockets.js for details. It also registers a service with seaport so that the supplier response sockets know where to connect.

The system currently supports HES (Hess), NST (NStar), NGE (NGrid Electric), and NGA (NGrid Gas). For HES, the ParseScrapedWorker would bind to a socket as follows

var seaport = require('seaport')
var axon = require('axon')
var config = require('nconf').defaults({
  seaport: {
    host: 'localhost',
    port: 4598
  }
})
var seaConfig = config.get('seaport')
var seaHost = seaConfig.host
var seaPort = seaConfig.port
var ports = seaport.connect(seaHost, seaPort)
var role = 'pushScrapedParseJobHES' // all push roles are in the format "pushScrapedParseJob<supplier code>"
var port = ports.register(role) // register with seaport so the remote HES parsing server knows where to connect
var socket = axon.socket('req')
socket.format('json')
socket.bind(port)

This binding happens for each supported supplier. The ParseScrapedWorker keeps a reference to each supplier-specific push socket in its self.sockets property. The sockets property is an object keyed by supplierCode. When the ParseScrapedWorker.parseScraped method is called, it looks up the appropriate supplier-specific socket and sends a request to the remote parsing server. Each supplier-specific socket is an Axon req socket.

Parsing Scraped Documents

Each instantiated ParseScrapedWorker object has a parseScraped function. This function is called with an unparsed scraped document and a callback. The parseScraped function gets the appropriate supplier-specific request socket and sends a new parsing request out to the remote supplier-specific parsing server.

Parser.prototype.parseScraped = function (doc, cb) {
  var self = this
  var supplierCode = doc.supplierCode
  var sockets = self.sockets
  var socket = sockets[supplierCode]
  socket.send(doc, cb)
}

In the example above, the parseScraped function actually calls the lib/parseRemote function. The functionality is the same, but parseRemote adds some timeout logic in case the remote parsing server goes down or the request fails for some reason.