nutch-web-api

v1.0.0

Web API for Apache Nutch application

What is it

nutch-web-api is a RESTful API implementation for the Apache Nutch crawling application. The project is written entirely in Node.js and CoffeeScript, with the goal of simplifying usage and improving flexibility. The REST API is not a replacement for the Apache Nutch application; it simply provides a web interface for the Nutch commands.

Installation

Prerequisites

Apache Nutch Application

nutch-web-api requires the Apache Nutch application to be installed and running on the same server. For more information about downloading and getting started with Apache Nutch, refer to http://nutch.apache.org.

Node.js

Node.js is required to get the web application up and running. For more information about installing Node.js on your platform, visit http://nodejs.org/download/.

Downloading Source and Installing Dependencies

  • git clone https://github.com/wei-m-teh/nutch-web-api
  • npm install

Initial Project Setup

Environment Variables

By default, the project expects the following environment variables to be set:

  • NUTCH_HOME
  • JAVA_HOME

These environment variables can be overridden in the conf/env-.json file. For examples, refer to the configurations for the test and dev environments. Additionally, the standard NUTCH_OPT environment variable is picked up as additional options for running the Nutch application; it can also be overridden in conf/env-.json. Other variables used by nutch-web-api are as follows:

  • NUTCH_WEB_API_SERVER_HOST
  • NUTCH_WEB_API_SERVER_PORT
  • NUTCH_WEB_API_SOLR_URL
  • NUTCH_WEB_API_SEED_DIR (directory where the seed file is persisted)
  • NUTCH_WEB_API_DATA_DIR (directory where the embedded NeDB database stores its data)
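
As a rough illustration of how a client script might reuse the host and port settings above, the sketch below reads NUTCH_WEB_API_SERVER_HOST and NUTCH_WEB_API_SERVER_PORT and builds a base URL. The helper name base_url is hypothetical, and the fallback values localhost and 4000 are assumptions borrowed from the example URLs in this README, not confirmed project defaults.

```python
import os


def base_url(env=None):
    """Build the nutch-web-api base URL from environment variables.

    The fallbacks (localhost, 4000) are assumptions taken from the example
    URLs in this README; they are not confirmed project defaults.
    """
    env = os.environ if env is None else env
    host = env.get("NUTCH_WEB_API_SERVER_HOST", "localhost")
    port = env.get("NUTCH_WEB_API_SERVER_PORT", "4000")
    return f"http://{host}:{port}"
```

Passing a plain dictionary instead of os.environ makes the helper easy to try without touching the real environment.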

Starting And Stopping The Server

Start nutch-web-api

Execute the npm command to start the web application:

npm start

Stop nutch-web-api

npm stop

Supported HTTP Operations

nutch-web-api supports a crawler job that runs all the Nutch jobs in one call, as well as individual Nutch jobs for clients who want to invoke them one at a time. For details about each API operation, refer to the Swagger document hosted on the server and port of the web application, e.g. http://localhost:4000/api-docs

Invoke Nutch Crawler Job

This API executes the individual Nutch jobs in the following order:

  • inject, generate, fetch, parse, updatedb, solr index, solr delete duplicates

A failure in any of these jobs causes the whole crawl job to fail.

  • HTTP Method: POST

  • REST Endpoint: http://localhost:4000/nutch/crawl

  • Sample Request Payload:

{
  "identifier" : "sampleCrawl",
  "limit" : 5,
  "seeds" : [ "http://mysite1.com", "http://mysite2.com" ]
}
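
One possible way to submit this payload from a client is sketched below in Python, using only the standard library. The function name build_crawl_request is hypothetical, and the Content-Type header is an assumption: this README does not state which headers the server expects. The function only builds the request; nothing is sent until it is passed to urllib.request.urlopen.

```python
import json
import urllib.request


def build_crawl_request(identifier, limit, seeds, base="http://localhost:4000"):
    """Return a POST request for /nutch/crawl without sending it."""
    payload = {"identifier": identifier, "limit": limit, "seeds": list(seeds)}
    return urllib.request.Request(
        f"{base}/nutch/crawl",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server accepts a JSON body labeled this way.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a running server, urllib.request.urlopen(build_crawl_request("sampleCrawl", 5, ["http://mysite1.com"])) would submit the crawl job.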

Invoke Nutch Injector Job

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/inject
  • Sample Request Payload:
{
  "identifier" : "sampleCrawl"
}

Invoke Nutch Generator

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/generate
  • Sample Request Payload:
{
  "identifier" : "sampleCrawl",
  "batchId": "12134343"
}

Invoke Nutch Fetcher

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/fetch
  • Sample Request Payload:
{
  "identifier" : "sampleCrawl",
  "batchId": "12134343"
}

Invoke Nutch Parser

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/parse
  • Sample Request Payload:
{
  "identifier" : "sampleCrawl",
  "batchId": "12134343"
}

Invoke Nutch UpdateDb

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/updatedb
  • Sample Request Payload:
{
  "identifier" : "sampleCrawl"
}

Invoke Nutch SolrIndex

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/solrIndex
  • Sample Request Payload:
{
  "identifier" : "sampleCrawl"
}

Invoke Nutch Solr Delete Duplicates

  • HTTP Method: POST
  • REST Endpoint: http://localhost:4000/nutch/solr-delete-duplicates
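
The individual job endpoints above share one request shape, so a client could build them from a small table. The Python sketch below is one way to do that under stated assumptions: the names JOB_PATHS, BATCH_JOBS and build_job_request are hypothetical, the Content-Type header is assumed, and the solr-delete-duplicates request is sent without a body only because no sample payload is shown for it here.

```python
import json
import urllib.request

# Endpoint paths as listed in this README, in the order the crawl job runs them.
JOB_PATHS = ["inject", "generate", "fetch", "parse",
             "updatedb", "solrIndex", "solr-delete-duplicates"]

# Jobs whose sample payload in this README includes a batchId.
BATCH_JOBS = {"generate", "fetch", "parse"}


def build_job_request(job, identifier=None, batch_id=None,
                      base="http://localhost:4000"):
    """Return a POST request for a single Nutch job without sending it."""
    if job not in JOB_PATHS:
        raise ValueError(f"unknown job: {job}")
    payload = {}
    # Assumption: solr-delete-duplicates takes no body (none is shown above).
    if job != "solr-delete-duplicates":
        payload["identifier"] = identifier
    if job in BATCH_JOBS and batch_id is not None:
        payload["batchId"] = batch_id
    data = json.dumps(payload).encode("utf-8") if payload else None
    # Assumption: the server accepts a JSON body labeled this way.
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(f"{base}/nutch/{job}", data=data,
                                  headers=headers, method="POST")
```

For example, build_job_request("generate", "sampleCrawl", "12134343") produces the generator request shown earlier; passing it to urllib.request.urlopen would send it to a running server.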

Checking Nutch Job Status

By default, upon submitting a Nutch job request, an HTTP status code of 202 is returned, indicating that the server has received the request. A typical response looks like the following:

{
    "message": "injector job submitted successfully",
    "status": 202,
    "identifier": "testInjector"
}
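
A client will usually want to confirm that the job was accepted and keep the identifier for later status checks. The sketch below shows one way to do that in Python; the function name parse_submission is hypothetical, and it assumes the response body always has the fields shown in the sample above.

```python
import json


def parse_submission(body):
    """Return the job identifier if the body reports status 202, else raise."""
    reply = json.loads(body)
    if reply.get("status") != 202:
        raise RuntimeError(f"job not accepted: {reply.get('message')}")
    return reply["identifier"]
```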

The nutch job is executed asynchronously while the server continues to serve other requests. To check the status of a particular job, do one of the following:

  • Use the API to request the current job status. The URL for the up-to-date status of a job is: http://localhost:4000/nutch/status?identifier=&jobName=

A sample response from the request looks like the following:

{
    "identifier": "testInjector",
    "jobName": "INJECTOR",
    "status": "SUCCESS",
    "date": 1415761722588
}
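
Because jobs run asynchronously, a client typically polls the status endpoint until the job finishes. The Python sketch below shows one way to do that. All function names are hypothetical, and it assumes the status field arrives as a JSON string. Only SUCCESS appears in this README, so that is the only value treated as finished by default; the loop gives up after a fixed number of attempts instead of guessing other status names.

```python
import json
import time
import urllib.parse
import urllib.request


def status_url(identifier, job_name, base="http://localhost:4000"):
    """Build the status URL with both query parameters filled in."""
    query = urllib.parse.urlencode({"identifier": identifier, "jobName": job_name})
    return f"{base}/nutch/status?{query}"


def fetch_status(url):
    """GET the status URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(identifier, job_name, fetch=fetch_status,
                 done=("SUCCESS",), attempts=30, delay=2.0):
    """Poll until the reported status is in `done`, or raise TimeoutError."""
    url = status_url(identifier, job_name)
    for _ in range(attempts):
        reply = fetch(url)
        if reply.get("status") in done:
            return reply
        time.sleep(delay)
    raise TimeoutError(f"{job_name} for {identifier} did not finish in time")
```

The fetch parameter exists so the polling logic can be exercised with a stub instead of a live server.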

Job Name and Status Reference

The following table lists the valid Nutch job names.

| Job Name | Job Description |
| ------------- | ------------- |
| INJECTOR | Nutch Injector |
| GENERATOR | Nutch Generator |
| FETCHER | Nutch Fetcher |
| PARSER | Nutch Parser |
| DBUPDATE | Nutch DB Updater |
| SOLRINDEX | Nutch Solr Index |
| SOLRDELETEDUPS | Nutch Solr Delete Duplicates |
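
When checking status after submitting an individual job, a client needs the job name that corresponds to the endpoint it called. The Python sketch below pairs the two. The pairing is an assumption inferred from the matching names in this README, not something the project documents explicitly, and the names JOB_NAME_BY_PATH and job_name_for are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: each endpoint path maps to the job name it most closely matches.
JOB_NAME_BY_PATH = {
    "inject": "INJECTOR",
    "generate": "GENERATOR",
    "fetch": "FETCHER",
    "parse": "PARSER",
    "updatedb": "DBUPDATE",
    "solrIndex": "SOLRINDEX",
    "solr-delete-duplicates": "SOLRDELETEDUPS",
}


def job_name_for(endpoint):
    """Map an endpoint URL or path such as /nutch/inject to its job name."""
    path = urlparse(endpoint).path.rstrip("/")
    key = path.rsplit("/", 1)[-1]  # last path segment, e.g. "inject"
    try:
        return JOB_NAME_BY_PATH[key]
    except KeyError:
        raise ValueError(f"no job name known for endpoint: {endpoint}") from None
```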