npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, πŸ‘‹, I’m Ryan HefnerΒ  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you πŸ™

Β© 2024 – Pkg Stats / Ryan Hefner

@harvard-lil/wacz-preparator

v0.0.5

Published

πŸ“š CLI and Javascript library for packaging a remote web archive collection into a single WACZ file.

Downloads

8

Readme

wacz-preparator πŸ“š

npm version JavaScript Style Guide Linting

CLI and Javascript library for packaging a remote web archive collection into a single WACZ file.

wacz-preparator --extractor "archive-it" --username "lil" --password $PASSWORD --collection-id 12345

See also: wacz-exhibitor for embedding a self-contained web archive collection on a web page.


Summary


Foreword

⚠️πŸ₯ΌπŸ§ͺ Experimental:

This pipeline was originally developed in the context of The Harvard Library Innovation Lab's partnership with the Radcliffe Institute's Schlesinger Library on experimental access to web archives.

We have only tested it on The Schlesinger #meToo Web Archives collection and would welcome feedback from the community to help solidify it.

In particular, we would love to hear more about:

  • Any edge cases this pipeline currently doesn't account for.
  • General interest in exploring new ways of storing, copying, and giving access to web archives

Contact: [email protected]

πŸ‘† Back to the summary


How does it work?

Given a specific extractor and valid combination of credentials, wacz-preparator will perform the following steps in order to pull and package a remote web archives collection into a single WACZ file.

Example: Archive-It Extractor

| # | Description | Notes | | --- | --- | -- | | 01 | Check validity of credentials and access to the collection | | | 02 | Create local collection folder if not already present | Because the underlying files are kept around in that folder, processing can be interrupted, resumed, and run multiple times over. | | 03 | Pull Collection Information | | | 04 | Pull list of available WARC files | | | 05 | Pull crawl information for all WARC files | This includes retrieving seeds (urls).| | 06 | Pull page title for all of the crawled URLs | Will first try to fetch that information from the seed meta data. If not available, will try to pull that information from the Wayback Machine. | | 07 | Delete "loose" WARCs from local collection folder | This comparison allows for discarding WARC files that may have previously been pulled locally but are no longer part of the collection. | | 08 | Compare hashes of local WARC files against remote hashes (1) | This allows for determining what files need to be downloaded or re-downloaded. | | 09 | Pull WARC files | Only the files that are not already present locally will be pulled. | | 10 | Compare hashes of local WARC files against remote hashes (2) | At this stage, there should be no discrepancies. | | 11 | Build pages list | | | 12 | Prepare WACZ file | |

At the end of this process, a WACZ file named after the collection ID should be available (ie: 12345.wacz).

WACZ files can be read with any compatible playback software, such as replayweb.page.

Note: All of the operations that involve talking to the Archive-It API are run in parallel batches: the --concurrency option allows for determining how many requests can be run in parallel.

πŸ‘† Back to the summary


Getting Started

Dependencies

wacz-preparator requires Node.js 18+.

Compatibility

This program has been written for UNIX-like systems and is expected to work on Linux, Mac OS, and Windows Subsystem for Linux.

Installation

wacz-preparator is available on npmjs.org and can be installed as follows:

# As a CLI
npm install -g @harvard-lil/wacz-preparator

# As a library
npm install @harvard-lil/wacz-preparator --save

πŸ‘† Back to the summary


CLI

Here are a few examples of how wacz-preparator can be used in the command line to extract a full collection from Archive-It into a WACZ file:

# The program needs an Archive-It username, password, and collection-id to operate ...
wacz-preparator --extractor "archive-it" --username 'foo' --password 'bar' --collection-id 12345

# ... the latter can / should be passed as an environment variable
wacz-preparator --extractor "archive-it"  --username 'foo' --password $PASSWORD --collection-id 12345

# Unless specified otherwise with --output-path, wacz-preparator will work in the current directory
wacz-preparator --extractor "archive-it"  --output-path "/path/to/directory" --username 'foo' --password $PASSWORD --collection-id 12345

# The resulting WACZ file can be signed using an authsign-compatible endpoint.
# See: https://specs.webrecorder.net/wacz-auth/0.1.0/#implementations
wacz-preparator --extractor "archive-it" --signing-url "https://example.com/sign" --username foo --password $PASSWORD --collection-id 12345

# Use --help to list the available options, and see what the defaults are.
wacz-preparator --help
Usage: wacz-preparator [options]

πŸ“š CLI and Javascript library for packaging a remote web archive collection into a single WACZ file.
More info: https://github.com/harvard-lil/wacz-preparator

Options:
  -v, --version                 Display Library and CLI version.
  -e, --extractor <string>      Web Archiving platform to extract the collection from. (choices: "archive-it", default: "archive-it")
  -u, --username <string>       API username (required for Archive-it). (default: null)
  -p, --password <string>       API password (required for Archive-it). (default: null)
  -i, --collection-id <string>  Id of the collection to process (required for Archive-it). (default: null)
  -o, --output-path <string>    Path in which wacz-preparator will work. (default: pwd)
  -c, --concurrency <number>    Sets a limit for parallel requests to the Archive-It API. (default: 50)
  --auto-clear <bool>           Automatically delete the collection-specific folder that was created? (choices: "true", "false", default: "false")
  --signing-url <string>        Authsign-compatible endpoint for signing WACZ file.
  --signing-token <string>      Authentication token to --signing-url, if needed.
  --log-level <string>          Controls CLI verbosity. (choices: "silent", "trace", "debug", "info", "warn", "error", default: "info")
  -h, --help                    Show options list.

πŸ‘† Back to the summary


JavaScript Library

wacz-preparator can also be used as JavaScript library in a Node.js project.

Example: Using the Preparator.process() method

import { ArchiveItExtractor } from "@harvard-lil/wacz-preparator"

const collection = new ArchiveItExtractor({
  username: 'username', 
  password: 'password', 
  collectionId: 12345
})

if (await collection.process()) {
  // WACZ file is ready!
  // ... 
}

The process() method runs through all the steps described in the "How does it work?" section.

It is also possible to go through each individual step manually and customize the behavior of wacz-preparator.

πŸ‘† Back to the summary


Development

Standard JS

This codebase uses the Standard JS coding style.

  • npm run lint can be used to check formatting.
  • npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
  • Most IDEs can be configured to automatically check and enforce this coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.

Testing

⚠️ In its current state, this experimental codebase doesn't come with an automated test suite.

Available CLI

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Step-by-step NPM publishing helper
npm run publish-util

πŸ‘† Back to the summary