

Express Crawler Snapshots

The purpose of this express middleware is to pre-render javascript-heavy pages for crawlers that can't execute javascript on their own. It is intended as a drop-in solution with minimal configuration.

It detects search engine crawler requests by inspecting the User-Agent header and proxies them to a phantomjs instance. Phantomjs renders the page fully, including any async javascript, and the resulting static html is proxied back to the crawler.

Please note: if you use html5 history (no hashbangs) in your application, don't add a <meta name="fragment" content="!"> tag, or this won't work correctly.

Features

  • Phantomjs process pooling
  • Request queuing when no phantomjs instances are immediately available
  • Automatic search engine bot detection via user agent string
  • 'escaped_fragment' semantics support
  • Forced timeout for phantomjs page renders
  • Optional caching

Requirements

Phantomjs 1.3+. The "phantomjs" binary must be available on the system path. See http://phantomjs.org/download.html for download & install instructions.

Install

npm install express-crawler-snapshots --save

Use

Just add it as express middleware, before your route handlers:


var express = require('express');
var crawlerSnapshots = require('express-crawler-snapshots');

var app = express();

//make sure you include the middleware before route handlers
app.use(crawlerSnapshots(/* {options} */));

app.use('/', require('./routes'));

Once that is done, open http://yourapp.com/?snapshot=true and view the page source to verify that it's working.
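You can also verify it from a script: a request with a bot-like User-Agent should receive the pre-rendered markup. This is just a hedged sketch; the host, port and User-Agent below are assumptions for illustration, not values required by the middleware.

var http = require('http');

// Hypothetical smoke test: request the local app with a crawler User-Agent
// and check that the body looks pre-rendered (script tags are stripped from
// snapshots, see "What it does" below).
http.get({
  host: 'localhost',   // assumes the app above is listening locally
  port: 3000,          // assumed port
  path: '/',
  headers: { 'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)' }
}, function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    console.log('status:', res.statusCode);
    console.log('looks pre-rendered:', body.indexOf('<script') === -1);
  });
});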

Options

Option | Default | Description
---------------|------------------|------------
timeout | 10000 | ms, how long to wait for the page to load on a phantomjs instance
delay | 200 | ms, how long to wait for javascript to settle on the page
snapshotTrigger | 'snapshot' | string, query param which, if present, will trigger a static page render
agents | see source | list of UA strings for crawler bots
shouldRender | snapshot trigger found in query params OR user agent matches one of the agents OR escaped_fragment found in query params | function(req, options) { return bool; }
protocol | same as request | string, 'http' or 'https'
domain | same as request | string. Use this if you want phantomjs to call 'localhost'
maxInstances | 1 | max number of phantomjs instances to use
logger | console | object that implements 'info', 'warn', 'error' methods. Set to null for silent operation
attempts | 1 | number of attempts to render a page, in case phantomjs crashes or times out. Set to > 1 if phantomjs is unstable for you
loadImages | true | should phantomjs load images. Careful: there's a mem leak with older versions of QT: https://github.com/ariya/phantomjs/issues/11390
maxPageLoads | 100 | if > 0, will kill the phantomjs instance after x pages have been loaded. Useful to work around mem leaks
phantomConfig | {} | an object which will be passed as config to PhantomJS
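As a quick sketch, several of these options can be combined in a single call. The values below are arbitrary examples rather than recommendations, and the custom shouldRender is just one possible predicate using the function(req, options) signature from the table above.

var express = require('express');
var crawlerSnapshots = require('express-crawler-snapshots');

var app = express();

app.use(crawlerSnapshots({
  timeout: 15000,       // give heavy pages a little longer to load
  delay: 500,           // wait longer for async javascript to settle
  maxInstances: 2,      // run up to two phantomjs processes
  loadImages: false,    // skip images to reduce memory use
  logger: null,         // silent operation
  // example predicate: only pre-render for requests that look like bots
  shouldRender: function (req, options) {
    return /bot|crawler|spider/i.test(req.headers['user-agent'] || '');
  }
}));

app.use('/', require('./routes'));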

Kill all phantom instances programmatically

In some rare cases you might want to kill all phantomjs instances programmatically. For example, an http server won't close if it's serving an app that has this middleware active and has spawned some phantomjs instances - the instances hold onto open connections.

var crawlerSnapshots = require('express-crawler-snapshots');
crawlerSnapshots.killAllInstances().then(function() {
   // done
});
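For example, this might be wired into a graceful shutdown. This is only a sketch: the port and signal handling are illustrative assumptions, and killAllInstances is assumed to return a promise, as in the snippet above.

var express = require('express');
var crawlerSnapshots = require('express-crawler-snapshots');

var app = express();
app.use(crawlerSnapshots());
app.use('/', require('./routes'));

var server = app.listen(3000);   // assumed port

process.on('SIGTERM', function () {
  // kill the phantomjs pool first, otherwise its open connections can keep
  // the http server from closing (see the note above)
  crawlerSnapshots.killAllInstances().then(function () {
    server.close(function () {
      process.exit(0);
    });
  });
});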

What it does

  1. A request passing through the middleware is inspected. It is handled by the middleware if it contains a search engine bot's user agent string, the 'snapshot' query param or the 'escaped_fragment' query param; otherwise it is passed through untouched
  2. The url is adjusted: if it contains the 'snapshot' parameter, it is removed; if it contains the 'escaped_fragment' parameter, it is transformed as per the google spec to use "#!" (see the sketch after this list)
  3. A phantomjs instance is retrieved from the pool; if none are available, the request is queued until one becomes available after a previous request completes
  4. Phantomjs renders the page
  5. <script> tags are removed from the result to prevent them from being executed again by whatever consumes it
  6. The resulting html is written to the response and the phantomjs instance is released back to the pool
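To make step 2 concrete, here is a rough sketch of the kind of url rewrite described by the google escaped_fragment spec. It is illustrative only and not the middleware's actual implementation; the function name is made up.

// Illustrative only: turn a crawler's _escaped_fragment_ URL back into a "#!" URL,
// e.g. /app?_escaped_fragment_=/about  ->  /app#!/about
function unescapeFragmentUrl(url) {
  var parts = url.split('?');
  if (parts.length < 2) return url;

  var query = parts[1].split('&');
  var rest = query.filter(function (p) {
    return p.indexOf('_escaped_fragment_=') !== 0;
  });
  var fragmentParam = query.filter(function (p) {
    return p.indexOf('_escaped_fragment_=') === 0;
  })[0];

  var base = parts[0] + (rest.length ? '?' + rest.join('&') : '');
  if (!fragmentParam) return base;

  // the fragment value is url-encoded in the query string
  var fragment = decodeURIComponent(fragmentParam.split('=')[1] || '');
  return base + '#!' + fragment;
}

console.log(unescapeFragmentUrl('/app?_escaped_fragment_=/about'));
// -> /app#!/about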

Phantomjs process management

New phantomjs processes are started when a bot request comes in, the number of active phantomjs processes is < maxInstances, and all active processes are currently rendering a page.
If maxInstances is reached, all phantomjs instances are busy and a new request comes in, the request is queued until a phantomjs instance becomes available. The queue operates on a first in, first out basis.
If a phantomjs process is killed from outside or dies, it is cleaned up gracefully and will be replaced on the next request - feel free to kill them on a whim :)
There's a hard timeout on opening a page and rendering content. If the timeout is reached and the render is still not complete, the phantomjs instance is assumed to be fubar and is forcefully killed.
Note that if an error happens while rendering a page, there are no retries by default (attempts defaults to 1) - the middleware produces an error.
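As a rough mental model only (not the middleware's actual code), the pooling and queuing behaviour described above could be sketched like this; the names acquireInstance, releaseInstance and startPhantom are made up for illustration.

// Conceptual sketch of the pool: reuse an idle instance, spawn while under
// the cap, otherwise wait in a first-in, first-out queue.
function acquireInstance(state, startPhantom) {
  var idle = state.pool.filter(function (p) { return !p.busy; })[0];
  if (idle) {
    idle.busy = true;
    return Promise.resolve(idle);              // reuse an idle instance
  }
  if (state.pool.length < state.maxInstances) {
    var fresh = startPhantom();                // spawn while under the cap
    fresh.busy = true;
    state.pool.push(fresh);
    return Promise.resolve(fresh);
  }
  // everything busy and cap reached: queue the request (FIFO)
  return new Promise(function (resolve) { state.queue.push(resolve); });
}

function releaseInstance(state, instance) {
  instance.busy = false;
  var next = state.queue.shift();              // first in, first out
  if (next) {
    instance.busy = true;
    next(instance);
  }
}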

Test

npm test