@openpets/internet-archive
v1.0.0
Access the Internet Archive's vast digital library including the Wayback Machine, millions of books, audio recordings, videos, and more. Search items, retrieve metadata, check URL snapshots, download files, and list historical captures.
Internet Archive Plugin for OpenPets
Access the Internet Archive's vast digital library including the Wayback Machine, millions of books, audio recordings, videos, and more. This plugin enables searching, metadata retrieval, Wayback Machine queries, and file downloads from archive.org.
Features
- Search - Query archive.org's catalog with advanced Lucene syntax
- Metadata - Retrieve detailed item information, files, and collections
- Wayback Machine - Check if URLs are archived and retrieve snapshots
- CDX API - List historical captures with timestamps and status codes
- Downloads - Generate direct download links for item files
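For orientation, the metadata, Wayback Machine, CDX, and download features correspond to public archive.org endpoints. The sketch below only builds request URLs. The function names are hypothetical, and this is not necessarily how the plugin itself constructs its requests.

```python
from urllib.parse import quote, urlencode


def metadata_url(identifier):
    """Metadata endpoint for a single item."""
    return f"https://archive.org/metadata/{quote(identifier, safe='')}"


def wayback_availability_url(url, timestamp=None):
    """Availability endpoint: reports the snapshot closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        # 1-14 digits of YYYYMMDDhhmmss, so "2020" and "20150601" both work
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def cdx_url(url, from_ts=None, to_ts=None, status=None):
    """CDX endpoint: lists captures, optionally bounded by date and status."""
    params = [("url", url), ("output", "json")]
    if from_ts:
        params.append(("from", from_ts))
    if to_ts:
        params.append(("to", to_ts))
    if status:
        params.append(("filter", f"statuscode:{status}"))
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)


def download_url(identifier, filename):
    """Direct download link for one file of an item."""
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename)}"
    )
```

Any HTTP client can fetch these URLs. The file names for an item come from the `files` list in the metadata response.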
Installation
# Install the pet
pets add internet-archive
# Or for development
pets new internet-archive
cd pets/internet-archive
bun install
Configuration
Get Internet Archive Credentials
- Create an account at https://archive.org/account/signup
- Get your S3-like API keys at https://archive.org/account/s3.php
- Set the environment variables:
# Copy and edit the configuration
cp .env.example .env
# Edit .env with your credentials:
IA_ACCESS_KEY=your_access_key
IA_SECRET_KEY=your_secret_key
Required Permissions
Most features work with any account. For write operations (metadata updates), ensure your account has proper permissions.
Available Tools
1. Test Connection
Verify your credentials are working:
opencode run "test internet-archive connection"
2. Search Items
Search archive.org's catalog:
# Basic search
opencode run "search archive.org for public domain astronomy books"
# With filters
opencode run "search archive.org for 'mediatype:audio AND year:2020'"
# Specific collection
opencode run "search archive.org for items in collection NASA"
Search Query Syntax:
The plugin supports Internet Archive's Lucene-style query syntax:
- title:shakespeare - Search by title
- mediatype:audio - Filter by media type (texts, audio, movies for video, software, image)
- year:2020 - Filter by year
- subject:history - Filter by subject
- collection:NASA - Search within a collection
- creator:"Mark Twain" - Search by creator
- language:eng - Filter by language code
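As a rough illustration of how these field filters combine, the sketch below joins them into one Lucene-style query and builds a URL for archive.org's public advanced search endpoint. The helper names `build_query` and `search_url` are hypothetical, and whether the plugin assembles its queries this way is an assumption.

```python
from urllib.parse import urlencode


def build_query(**fields):
    """Join field:value clauses with AND, quoting values that contain spaces."""
    clauses = []
    for field, value in fields.items():
        value = str(value)
        if " " in value:
            value = f'"{value}"'
        clauses.append(f"{field}:{value}")
    return " AND ".join(clauses)


def search_url(query, rows=50, page=1, fields=("identifier", "title")):
    """Build an advancedsearch.php URL that returns JSON."""
    params = [("q", query), ("rows", rows), ("page", page), ("output", "json")]
    # fl[] is repeated once per field to return
    params += [("fl[]", name) for name in fields]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


query = build_query(mediatype="audio", year=2020)
print(query)                       # mediatype:audio AND year:2020
print(search_url(query, rows=10))
```

Keeping `rows` small limits the size of each response, which also helps with rate limits.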
3. Get Metadata
Retrieve detailed item information:
opencode run "get metadata for item goodytwoshoes00newyiala"
4. Check URL (Wayback Machine)
Check if a URL is archived:
# Check current availability
opencode run "check if https://example.com is archived"
# Check specific date
opencode run "check if https://example.com is archived with timestamp 2020"
5. List Snapshots (CDX)
Get all historical captures:
# List all snapshots
opencode run "list snapshots of https://google.com"
# With date range
opencode run "list snapshots of https://google.com from 2020 to 2021"
# With filters
opencode run "list snapshots of https://example.com with filter statuscode:200"
6. Download Files
Get download links for item files:
# List all downloadable files
opencode run "download files from item nasa-image-library"
# Filter by format
opencode run "download PDF files from item classic-literature-collection"
Example Workflows
Research Workflow
# 1. Search for vintage computing materials
opencode run "search archive.org for 'vintage computing manuals'"
# 2. Get detailed metadata for the most relevant item
opencode run "get metadata for item softwarelibrary"
# 3. Download the PDF version
opencode run "download PDF files from item softwarelibrary"
Wayback Research
# 1. Check if a site is archived
opencode run "check if https://old-website.com is archived"
# 2. Get snapshots from a specific year
opencode run "list snapshots of https://old-website.com from 2015 to 2016"
# 3. Access a specific snapshot (returns archived URL)
opencode run "check if https://old-website.com is archived with timestamp 20150601"
Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| IA_ACCESS_KEY | Yes* | S3-like access key from archive.org |
| IA_SECRET_KEY | Yes* | S3-like secret key from archive.org |
*Most search and Wayback features work without credentials, but authentication is required for full metadata access and all write operations.
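Because the keys are optional for read-only use, a client can attach them only when both are set. The sketch below assumes the `LOW access:secret` authorization scheme used by archive.org's S3-like API. The function name and the User-Agent string are hypothetical, and the plugin's own header handling may differ.

```python
import os


def request_headers(env=os.environ):
    """Build request headers, adding credentials only when both keys exist."""
    headers = {
        # Hypothetical identifier; use a string that describes your own tool
        "User-Agent": "internet-archive-example/0.1 (automated tool)",
    }
    access = env.get("IA_ACCESS_KEY")
    secret = env.get("IA_SECRET_KEY")
    if access and secret:
        headers["Authorization"] = f"LOW {access}:{secret}"
    return headers
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.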
Rate Limiting
The Internet Archive has rate limits in place. The plugin automatically includes proper User-Agent headers to identify requests. Please be respectful:
- Add delays between bulk operations
- Honor 429 Too Many Requests responses
- Cache responses when possible
- Use filters to limit result sizes
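For bulk scripts that call archive.org directly, one way to honor 429 responses is to wait for the `Retry-After` value when the server sends one and to back off exponentially otherwise. This is a minimal sketch: `fetch` is a hypothetical callable returning `(status, headers, body)`, and the retry limits are illustrative rather than values recommended by the Internet Archive.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with increasing delays while it returns 429."""
    for attempt in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)           # server-provided wait in seconds
        else:
            delay = base_delay * 2 ** attempt    # 1s, 2s, 4s, ...
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts: {url}")
```

The `sleep` parameter is injectable so the retry logic can be exercised without real waiting.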
API Documentation
Bot Guidelines
This plugin follows Internet Archive's bot and LLM guidelines:
- Includes descriptive User-Agent headers
- Respects rate limits
- Identifies itself as an automated tool
- Uses authentication when available
Troubleshooting
Authentication Errors
# Test your credentials
opencode run "test internet-archive connection"
# Check your keys at:
# https://archive.org/account/s3.php
Item Not Found
Items use specific identifiers. To find an identifier:
- Visit the item on archive.org
- Look at the URL: https://archive.org/details/[IDENTIFIER]
- Use that identifier in your queries
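The identifier lookup described above can also be scripted. The sketch below pulls the identifier out of a details URL; the function name is hypothetical and it only recognizes the `/details/` URL shape shown in this section.

```python
from urllib.parse import urlsplit


def identifier_from_url(url):
    """Return the item identifier from an archive.org details URL, or None."""
    parts = urlsplit(url)
    if parts.netloc not in ("archive.org", "www.archive.org"):
        return None
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) >= 2 and segments[0] == "details":
        return segments[1]
    return None


print(identifier_from_url("https://archive.org/details/goodytwoshoes00newyiala"))
# goodytwoshoes00newyiala
```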
Rate Limited
If you receive 429 errors:
- Wait a few minutes between requests
- Reduce result sizes with the rows parameter
- Add more specific filters to your queries
Contributing
Contributions welcome! Please see the main OpenPets repository for contribution guidelines.
License
AGPL-3.0 - See LICENSE file for details.
