@openpets/internet-archive

v1.0.0

Published

2 months ago

Access the Internet Archive's vast digital library including the Wayback Machine, millions of books, audio recordings, videos, and more. Search items, retrieve metadata, check URL snapshots, download files, and list historical captures.


raggle_npm

opencode openpets plugin internet-archive archive-org wayback-machine digital-library web-archiving

Internet Archive Plugin for OpenPets

Access the Internet Archive's vast digital library including the Wayback Machine, millions of books, audio recordings, videos, and more. This plugin enables searching, metadata retrieval, Wayback Machine queries, and file downloads from archive.org.

Features

Search - Query archive.org's catalog with advanced Lucene syntax
Metadata - Retrieve detailed item information, files, and collections
Wayback Machine - Check if URLs are archived and retrieve snapshots
CDX API - List historical captures with timestamps and status codes
Downloads - Generate direct download links for item files

Installation

# Install the pet
pets add internet-archive

# Or for development
pets new internet-archive
cd pets/internet-archive
bun install

Configuration

Get Internet Archive Credentials

Create an account at https://archive.org/account/signup
Get your S3-like API keys at https://archive.org/account/s3.php
Set the environment variables:

# Copy and edit the configuration
cp .env.example .env

# Edit .env with your credentials:
IA_ACCESS_KEY=your_access_key
IA_SECRET_KEY=your_secret_key
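
As a rough illustration of how these two variables might be consumed, the sketch below builds the `Authorization` header that archive.org's S3-like API accepts (`LOW <access>:<secret>`). The helper name `buildAuthHeader` and the `Env` type are hypothetical and are not part of the plugin's documented interface.

```typescript
// Hypothetical helper: not taken from the plugin's source.
type Env = Record<string, string | undefined>;

function buildAuthHeader(env: Env): string | null {
  const access = env.IA_ACCESS_KEY?.trim();
  const secret = env.IA_SECRET_KEY?.trim();
  // Without both keys, the caller should fall back to anonymous requests.
  if (!access || !secret) return null;
  // archive.org's S3-like API expects "LOW <access>:<secret>".
  return `LOW ${access}:${secret}`;
}
```

In a Bun or Node process this would typically be called as `buildAuthHeader(process.env)`; a `null` result means the request should be sent without credentials.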

Required Permissions

Most features work with any account. For write operations (such as metadata updates), make sure your account has permission to edit the items involved.

Available Tools

1. Test Connection

Verify your credentials are working:

opencode run "test internet-archive connection"

2. Search Items

Search archive.org's catalog:

# Basic search
opencode run "search archive.org for public domain astronomy books"

# With filters
opencode run "search archive.org for 'mediatype:audio AND year:2020'"

# Specific collection
opencode run "search archive.org for items in collection NASA"

Search Query Syntax:

The plugin supports Internet Archive's Lucene-style query syntax:

title:shakespeare - Search by title
mediatype:audio - Filter by media type (texts, audio, movies, software, image; video items use the value movies)
year:2020 - Filter by year
subject:history - Filter by subject
collection:NASA - Search within a collection
creator:"Mark Twain" - Search by creator
language:eng - Filter by language code
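
For readers curious how field clauses like these could map onto an HTTP request, here is a minimal sketch that joins clauses with `AND` and builds a URL for archive.org's `advancedsearch.php` endpoint. The helper names (`combineClauses`, `buildSearchUrl`) and the choice of returned fields are assumptions for illustration; the plugin's own query handling may differ.

```typescript
// Hypothetical helpers; the plugin's internal query builder may differ.

// Join individual field clauses into one Lucene-style query string.
function combineClauses(clauses: string[]): string {
  return clauses
    .map((c) => c.trim())
    .filter((c) => c.length > 0)
    .join(" AND ");
}

// Build an advancedsearch.php URL that returns JSON.
function buildSearchUrl(query: string, rows: number = 50): string {
  const params = new URLSearchParams();
  params.set("q", query);
  // fl[] selects which fields each result includes (assumed selection).
  params.append("fl[]", "identifier");
  params.append("fl[]", "title");
  params.set("rows", String(rows));
  params.set("output", "json");
  return `https://archive.org/advancedsearch.php?${params.toString()}`;
}
```

For example, `buildSearchUrl(combineClauses(["mediatype:audio", "year:2020"]), 10)` yields a URL whose `q` parameter is `mediatype:audio AND year:2020` and whose `rows` parameter caps the result size at 10.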

3. Get Metadata

Retrieve detailed item information:

opencode run "get metadata for item goodytwoshoes00newyiala"
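
A lookup like this plausibly corresponds to archive.org's public metadata endpoint, `https://archive.org/metadata/<identifier>`. The sketch below shows the URL construction and a small summary of the response; the function names and the reduced `MetadataResponse` shape are illustrative assumptions, and the real response carries many more fields.

```typescript
// Hypothetical helpers; only a small, assumed subset of the response is typed.
interface MetadataResponse {
  metadata?: { title?: unknown; mediatype?: unknown };
  files?: { name: string }[];
}

function buildMetadataUrl(identifier: string): string {
  return `https://archive.org/metadata/${encodeURIComponent(identifier)}`;
}

// Reduce a metadata response to a title and a file count.
function summarizeItem(res: MetadataResponse): { title: string; fileCount: number } {
  return {
    // title is kept loose because some items store it as a list.
    title: String(res.metadata?.title ?? ""),
    fileCount: res.files?.length ?? 0,
  };
}
```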

4. Check URL (Wayback Machine)

Check if a URL is archived:

# Check current availability
opencode run "check if https://example.com is archived"

# Check specific date
opencode run "check if https://example.com is archived with timestamp 2020"
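
These checks line up with the Wayback Machine availability endpoint (`https://archive.org/wayback/available`), which takes a `url` and an optional `timestamp` of 1 to 14 digits (`YYYYMMDDhhmmss`). The following sketch, with hypothetical helper names, builds such a request and picks the closest snapshot out of the JSON response.

```typescript
// Hypothetical helpers; the plugin may structure this differently.
interface AvailabilityResponse {
  archived_snapshots?: {
    closest?: { available: boolean; url: string; timestamp: string; status: string };
  };
}

function buildAvailabilityUrl(url: string, timestamp?: string): string {
  const params = new URLSearchParams({ url });
  // A partial timestamp such as "2020" asks for the capture closest to that date.
  if (timestamp) params.set("timestamp", timestamp);
  return `https://archive.org/wayback/available?${params.toString()}`;
}

// Return the archived URL, or null when no capture is available.
function closestSnapshotUrl(res: AvailabilityResponse): string | null {
  const closest = res.archived_snapshots?.closest;
  return closest && closest.available ? closest.url : null;
}
```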

5. List Snapshots (CDX)

Get all historical captures:

# List all snapshots
opencode run "list snapshots of https://google.com"

# With date range
opencode run "list snapshots of https://google.com from 2020 to 2021"

# With filters
opencode run "list snapshots of https://example.com with filter statuscode:200"
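
To make the date range and filter options more concrete, here is a sketch of how such a request could be expressed against the Wayback CDX server (`https://web.archive.org/cdx/search/cdx`) using its `from`, `to`, `filter`, and `limit` parameters. With `output=json` the server returns an array of rows whose first row names the columns. The `CdxQuery` type and both function names are assumptions made for this example.

```typescript
// Hypothetical helpers illustrating a CDX request and its JSON output.
interface CdxQuery {
  url: string;
  from?: string; // e.g. "2020"
  to?: string; // e.g. "2021"
  filters?: string[]; // e.g. ["statuscode:200"]
  limit?: number;
}

function buildCdxUrl(q: CdxQuery): string {
  const params = new URLSearchParams({ url: q.url, output: "json" });
  if (q.from) params.set("from", q.from);
  if (q.to) params.set("to", q.to);
  for (const f of q.filters ?? []) params.append("filter", f);
  if (q.limit !== undefined) params.set("limit", String(q.limit));
  return `https://web.archive.org/cdx/search/cdx?${params.toString()}`;
}

// Convert header-plus-rows JSON output into one record per capture.
function parseCdxRows(rows: string[][]): Record<string, string>[] {
  const [header, ...captures] = rows;
  if (!header) return [];
  return captures.map((row) => {
    const record: Record<string, string> = {};
    header.forEach((name, i) => {
      record[name] = row[i] ?? "";
    });
    return record;
  });
}
```

Setting `limit` alongside a filter keeps responses small, which also helps with the rate limits described later in this document.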

6. Download Files

Get download links for item files:

# List all downloadable files
opencode run "download files from item nasa-image-library"

# Filter by format
opencode run "download PDF files from item classic-literature-collection"
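
Direct links on archive.org follow the pattern `https://archive.org/download/<identifier>/<filename>`. The sketch below shows one way links could be generated from an item's file list and narrowed by format; `buildDownloadLinks` and `ItemFile` are hypothetical names, and the case-insensitive substring match on `format` is an assumption about how a "PDF" filter might be applied (archive.org format labels look like "Text PDF").

```typescript
// Hypothetical helper; the file list would come from the item's metadata.
interface ItemFile {
  name: string;
  format?: string;
}

function buildDownloadLinks(identifier: string, files: ItemFile[], format?: string): string[] {
  const wanted = format?.toLowerCase();
  return files
    // Keep every file when no format is given; otherwise match loosely.
    .filter((f) => !wanted || (f.format ?? "").toLowerCase().includes(wanted))
    .map((f) => {
      // Encode each path segment but keep "/" so nested files still resolve.
      const path = f.name
        .split("/")
        .map((segment) => encodeURIComponent(segment))
        .join("/");
      return `https://archive.org/download/${encodeURIComponent(identifier)}/${path}`;
    });
}
```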

Example Workflows

Research Workflow

# 1. Search for vintage computing materials
opencode run "search archive.org for 'vintage computing manuals'"

# 2. Get detailed metadata for the most relevant item
opencode run "get metadata for item softwarelibrary"

# 3. Download the PDF version
opencode run "download PDF files from item softwarelibrary"

Wayback Research

# 1. Check if a site is archived
opencode run "check if https://old-website.com is archived"

# 2. Get snapshots from a specific year
opencode run "list snapshots of https://old-website.com from 2015 to 2016"

# 3. Access a specific snapshot (returns archived URL)
opencode run "check if https://old-website.com is archived with timestamp 20150601"

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| IA_ACCESS_KEY | Yes* | S3-like access key from archive.org |
| IA_SECRET_KEY | Yes* | S3-like secret key from archive.org |

*Most search and Wayback features work without credentials, but authentication is required for full metadata access and all write operations.

Rate Limiting

The Internet Archive enforces rate limits. The plugin automatically sends a descriptive User-Agent header to identify its requests. Please be respectful:

Add delays between bulk operations
Honor 429 Too Many Requests responses
Cache responses when possible
Use filters to limit result sizes
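
The first two points can be sketched as a small delay calculation for scripts that call the plugin in bulk. This is an illustrative helper, not the plugin's actual behavior: the name `retryDelayMs`, the 1-second base, and the 60-second cap are all assumptions, and whether archive.org sends a `Retry-After` header on a given response is not guaranteed.

```typescript
// Hypothetical helper: decide how long to wait before retrying a request.
// Returns null when the response is not a 429 and no retry is needed.
function retryDelayMs(status: number, retryAfter: string | null, attempt: number): number | null {
  if (status !== 429) return null;
  // Prefer the server's Retry-After value when it is a number of seconds.
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  // Otherwise back off exponentially (1s, 2s, 4s, ...) up to an assumed 60s cap.
  return Math.min(1000 * 2 ** attempt, 60000);
}
```

A caller would pass the response status, the value of `response.headers.get("retry-after")`, and a zero-based attempt counter, then sleep for the returned number of milliseconds before retrying.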

API Documentation

Bot Guidelines

This plugin follows Internet Archive's bot and LLM guidelines:

Includes descriptive User-Agent headers
Respects rate limits
Identifies itself as an automated tool
Uses authentication when available

Troubleshooting

Authentication Errors

# Test your credentials
opencode run "test internet-archive connection"

# Check your keys at:
# https://archive.org/account/s3.php

Item Not Found

Items use specific identifiers. To find an identifier:

Visit the item on archive.org
Look at the URL: https://archive.org/details/[IDENTIFIER]
Use that identifier in your queries
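
When identifiers are collected from links rather than typed by hand, the same steps can be automated. The sketch below is a hypothetical helper (`extractIdentifier` is not part of the plugin) that pulls the identifier out of a `/details/` URL and returns `null` for anything else.

```typescript
// Hypothetical helper: extract the item identifier from an archive.org details URL.
function extractIdentifier(input: string): string | null {
  try {
    const url = new URL(input);
    // Accept archive.org and its subdomains only.
    if (!/(^|\.)archive\.org$/.test(url.hostname)) return null;
    const [kind, identifier] = url.pathname.split("/").filter((part) => part.length > 0);
    if (kind !== "details" || !identifier) return null;
    return decodeURIComponent(identifier);
  } catch {
    // Not a parseable URL (or a malformed escape sequence).
    return null;
  }
}
```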

Rate Limited

If you receive 429 errors:

Wait a few minutes before retrying
Reduce result sizes with the rows parameter
Add more specific filters to your queries

Contributing

Contributions welcome! Please see the main OpenPets repository for contribution guidelines.

License

AGPL-3.0 - See LICENSE file for details.