@mdream/llms-db
v0.17.1
A CLI tool for managing a database of open-source project documentation converted to llms.txt format.
Features
- SQLite Database: Store metadata about crawled sites with full history
- Automated Crawling: Integrates with @mdream/crawl for website processing
- Artifact Management: Automatically compress and store crawled files
- llms.txt Generation: Generate master llms.txt files from database entries
- CLI Interface: Simple command-line interface for managing entries
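As a rough illustration of the llms.txt generation feature, the sketch below renders a list of entries as a markdown link list in the llms.txt style. The `DocEntry` shape, the `renderMasterLlmsTxt` helper, and the `## Docs` section title are assumptions made for this example; the package's actual output may be structured differently.

```typescript
// Hypothetical entry shape; the real database columns may differ.
interface DocEntry {
  name: string
  url: string
  description?: string
}

// Render entries as a markdown link list under a single title.
function renderMasterLlmsTxt(title: string, entries: DocEntry[]): string {
  const lines = [`# ${title}`, '', '## Docs', '']
  for (const entry of entries) {
    // The description is optional, so only append it when present.
    const suffix = entry.description ? `: ${entry.description}` : ''
    lines.push(`- [${entry.name}](${entry.url})${suffix}`)
  }
  return `${lines.join('\n')}\n`
}

const masterTxt = renderMasterLlmsTxt('Open-source documentation', [
  {
    name: 'Example Docs',
    url: 'https://docs.example.com',
    description: 'Documentation for Example project',
  },
])
console.log(masterTxt)
```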
Installation
pnpm install @mdream/llms-db
Usage
Run crawler on a URL
mdream-db run https://docs.example.com
# With options
mdream-db run https://docs.example.com --name "Example Docs" --description "Documentation for Example project"
# Force local mode (ignore production env vars)
mdream-db run https://docs.example.com --local
Re-crawl an existing entry
mdream-db recrawl "Example Docs"
# or by ID
mdream-db recrawl 1
List all entries
mdream-db list
Generate master llms.txt
mdream-db generate
# or specify output path
mdream-db generate ./my-llms.txt
Remove an entry
mdream-db remove "Example Docs"
CLI Options
run command
- --name <name>: Custom name for the entry (defaults to domain)
- --description <desc>: Description of the site
- --depth <number>: Crawl depth (default: 3)
- --max-pages <number>: Maximum pages to crawl
- --exclude <pattern>: Exclude URL patterns (can be used multiple times)
- --output <dir>: Output directory for crawled files
- --local: Force local mode (ignore production environment variables)
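The "defaults to domain" behavior of --name could look something like the sketch below. `resolveEntryName` is a hypothetical helper, and treating "domain" as the URL's hostname is an assumption; the CLI may normalize the name differently.

```typescript
// Hypothetical helper: use the explicit name if given, otherwise the hostname.
function resolveEntryName(url: string, name?: string): string {
  // Assumption: "domain" means the hostname part of the crawled URL.
  return name ?? new URL(url).hostname
}

console.log(resolveEntryName('https://docs.example.com/guide'))
console.log(resolveEntryName('https://docs.example.com', 'Example Docs'))
```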
recrawl command
- Accepts entry name or ID
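One way a command can accept either a name or an ID is sketched below. The `StoredEntry` shape and `findEntry` helper are hypothetical, and the rule that an all-digit argument is read as an ID is an assumption, not documented behavior.

```typescript
// Hypothetical minimal entry shape for the lookup.
interface StoredEntry {
  id: number
  name: string
}

function findEntry(entries: StoredEntry[], nameOrId: string): StoredEntry | undefined {
  // Assumption: an argument made only of digits is treated as a numeric ID.
  if (/^\d+$/.test(nameOrId)) {
    const id = Number(nameOrId)
    return entries.find(entry => entry.id === id)
  }
  // Anything else is matched against the entry name exactly.
  return entries.find(entry => entry.name === nameOrId)
}

const knownEntries: StoredEntry[] = [{ id: 1, name: 'Example Docs' }]
console.log(findEntry(knownEntries, '1'))
console.log(findEntry(knownEntries, 'Example Docs'))
```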
Architecture
The package uses a repository pattern with multiple storage backends:
- Repository Interface: LlmsRepository defines the contract for database operations
- Drizzle Implementation: DrizzleLlmsRepository provides SQLite/LibSQL implementation
- Storage Implementation: LlmsStorageRepository provides file-based storage using unstorage
- Type Safety: Full TypeScript support with schema inference
- Migrations: Managed through drizzle-kit for schema evolution
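To make the repository pattern concrete, here is a minimal sketch with one interface and an in-memory backend. `EntryRepository`, `InMemoryEntryRepository`, and `markCrawled` are hypothetical names, the interface is far smaller than the real LlmsRepository, and the 'pending' and 'failed' statuses are assumptions (only 'completed' appears in this README).

```typescript
type EntryStatus = 'pending' | 'completed' | 'failed'

interface Entry {
  id: number
  name: string
  url: string
  description?: string
  status: EntryStatus
}

type NewEntry = Omit<Entry, 'id' | 'status'>

// Hypothetical contract; callers program against this, not a concrete backend.
interface EntryRepository {
  createEntry(input: NewEntry): Promise<Entry>
  updateEntryStatus(id: number, status: EntryStatus): Promise<void>
  listEntries(): Promise<Entry[]>
}

// In-memory backend standing in for a SQLite or file-based implementation.
class InMemoryEntryRepository implements EntryRepository {
  private entries = new Map<number, Entry>()
  private nextId = 1

  async createEntry(input: NewEntry): Promise<Entry> {
    const entry: Entry = { ...input, id: this.nextId++, status: 'pending' }
    this.entries.set(entry.id, entry)
    return entry
  }

  async updateEntryStatus(id: number, status: EntryStatus): Promise<void> {
    const entry = this.entries.get(id)
    if (!entry)
      throw new Error(`Entry ${id} not found`)
    entry.status = status
  }

  async listEntries(): Promise<Entry[]> {
    return Array.from(this.entries.values())
  }
}

// Works with any backend that satisfies the interface.
async function markCrawled(repo: EntryRepository, name: string, url: string): Promise<Entry[]> {
  const entry = await repo.createEntry({ name, url })
  await repo.updateEntryStatus(entry.id, 'completed')
  return repo.listEntries()
}
```

Because `markCrawled` depends only on the interface, swapping the in-memory class for a database-backed one would not change the calling code.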
Database Schema
Drizzle Repository (SQLite/LibSQL)
The tool creates a database with the following tables:
- llms_entries: Main entries with metadata
- crawled_pages: Individual pages crawled for each entry
- artifacts: Generated files (llms.txt, archives, etc.)
Storage Repository (File-based)
Uses unstorage for file-based key-value storage:
- entries/: Main entries stored as JSON files
- pages/: Individual pages for each entry
- artifacts/: Generated files metadata
- meta/: Metadata for lookups and counters
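The sketch below shows how a key-value layout with these prefixes might support ID allocation and name lookups. It uses a plain `Map` rather than unstorage, and every key name beyond the `entries/` and `meta/` prefixes (`meta/entry-counter`, `meta/names/<name>`) is an assumption made for illustration.

```typescript
// In-memory stand-in for a key-value store; unstorage itself is not used here.
const store = new Map<string, string>()

// Assumption: a counter under meta/ hands out sequential entry IDs.
function nextEntryId(): number {
  const next = Number(store.get('meta/entry-counter') ?? '0') + 1
  store.set('meta/entry-counter', String(next))
  return next
}

function saveEntry(name: string, url: string): number {
  const id = nextEntryId()
  // Each entry is serialized as one JSON value under entries/.
  store.set(`entries/${id}.json`, JSON.stringify({ id, name, url }))
  // Assumption: a name-to-ID index under meta/ supports lookup by name.
  store.set(`meta/names/${name}`, String(id))
  return id
}

const firstId = saveEntry('example-docs', 'https://docs.example.com')
console.log(firstId, store.get(`entries/${firstId}.json`))
```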
Programmatic Usage
Using Drizzle Repository (SQLite/LibSQL)
import { createRepository } from '@mdream/llms-db'
// Local SQLite database
const repository = createRepository({ dbPath: './my-database.db' })
// Production LibSQL database (using environment variables)
// Automatically initializes R2 storage if credentials are available
const prodRepository = createRepository({
  production: true,
  authToken: process.env.TURSO_AUTH_TOKEN
})
// Production LibSQL database (with explicit configuration)
const prodRepositoryExplicit = createRepository({
  production: true,
  authToken: 'your_auth_token_here'
})
// Create a new entry
const entry = await repository.createEntry({
  name: 'example-docs',
  url: 'https://docs.example.com',
  description: 'Example documentation'
})
// Update status
await repository.updateEntryStatus(entry.id, 'completed')
// Upload artifact to R2 (production only)
const archiveData = Buffer.from('archive content')
const r2Url = await repository.uploadArtifactToR2('my-project', 'archive.tar.gz', archiveData)
// Add artifact with automatic R2 upload (production only)
const llmsData = Buffer.from('llms.txt content')
await repository.addArtifactWithR2Upload(
  entry.id,
  'llms.txt',
  'llms.txt',
  llmsData,
  llmsData.length
)
// Generate llms.txt
const llmsTxt = await repository.generateLlmsTxt()
console.log(llmsTxt)
repository.close()
Using Storage Repository (File-based)
import { createStorageRepository } from '@mdream/llms-db'
const repository = createStorageRepository({
  dbPath: './my-storage'
})
// Same API as drizzle repository
const entry = await repository.createEntry({
  name: 'example-docs',
  url: 'https://docs.example.com',
  description: 'Example documentation'
})
repository.close()
Storage
Local Development
- Database: .mdream/llms.db (SQLite)
- File Storage: .mdream/llms-storage/ (unstorage)
- Crawled files: .mdream/crawls/<entry-name>/
- Archives: .mdream/archives/<entry-name>.tar.gz
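The local layout can be expressed as a small path helper, sketched below. `localPaths` is a hypothetical function; how the tool sanitizes entry names that contain spaces or slashes is not documented here, so the sketch uses the name verbatim.

```typescript
// Root directory taken from the layout listed above.
const MDREAM_DIR = '.mdream'

// Hypothetical helper: derive every local path for one entry.
function localPaths(entryName: string) {
  return {
    database: `${MDREAM_DIR}/llms.db`,
    storage: `${MDREAM_DIR}/llms-storage/`,
    // Assumption: the entry name is used as-is, without sanitizing.
    crawlDir: `${MDREAM_DIR}/crawls/${entryName}/`,
    archive: `${MDREAM_DIR}/archives/${entryName}.tar.gz`,
  }
}

console.log(localPaths('example-docs'))
```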
Production
- Database: LibSQL (Turso) at libsql://mdream-production-harlan-zw.aws-ap-northeast-1.turso.io
- Authentication: Requires TURSO_AUTH_TOKEN environment variable
- Artifact Storage: Cloudflare R2 (automatic upload in production mode)
- R2 Integration: Archives and llms.txt files are automatically uploaded to R2
Environment Variables
Database Configuration
- NODE_ENV=production: Automatically use production LibSQL database
- TURSO_DATABASE_URL: LibSQL database URL (e.g., libsql://your-database.turso.io)
- TURSO_AUTH_TOKEN: Required for production database access
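How these variables and the --local flag might combine is sketched below. `resolveDbTarget` is a hypothetical function, not the package's API. It assumes both Turso variables must be set in production, whereas the package may fall back to a default database URL.

```typescript
type Env = Record<string, string | undefined>

type DbTarget =
  | { mode: 'local', dbPath: string }
  | { mode: 'production', url: string, authToken: string }

// Hypothetical resolver: pick the database target from the environment.
function resolveDbTarget(env: Env, forceLocal = false): DbTarget {
  // --local wins over NODE_ENV, matching "ignore production env vars".
  if (forceLocal || env.NODE_ENV !== 'production')
    return { mode: 'local', dbPath: '.mdream/llms.db' }

  const url = env.TURSO_DATABASE_URL
  const authToken = env.TURSO_AUTH_TOKEN
  // Assumption: this sketch requires an explicit URL in production.
  if (!url)
    throw new Error('TURSO_DATABASE_URL is not set')
  if (!authToken)
    throw new Error('TURSO_AUTH_TOKEN is required for production database access')
  return { mode: 'production', url, authToken }
}

const exampleEnv: Env = {
  NODE_ENV: 'production',
  TURSO_DATABASE_URL: 'libsql://your-database.turso.io',
  TURSO_AUTH_TOKEN: 'your_actual_token_here',
}
console.log(resolveDbTarget(exampleEnv).mode)
console.log(resolveDbTarget(exampleEnv, true).mode)
```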
R2 Storage Configuration (Production)
- R2_ACCESS_KEY_ID: Your R2 access key ID
- R2_SECRET_ACCESS_KEY: Your R2 secret access key
- R2_ACCOUNT_ID: Your Cloudflare account ID
- R2_BUCKET_NAME: Your R2 bucket name
- R2_ENDPOINT: Your R2 endpoint URL
- R2_PUBLIC_URL: Your R2 public URL
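Before enabling R2, it can help to check which of these variables are missing. The `missingR2Vars` helper below is a hypothetical sketch; whether the package treats all six variables as mandatory is an assumption.

```typescript
// The six variable names listed above.
const R2_VARS = [
  'R2_ACCESS_KEY_ID',
  'R2_SECRET_ACCESS_KEY',
  'R2_ACCOUNT_ID',
  'R2_BUCKET_NAME',
  'R2_ENDPOINT',
  'R2_PUBLIC_URL',
] as const

// Hypothetical check: return the names that are unset or empty.
function missingR2Vars(env: Record<string, string | undefined>): string[] {
  return R2_VARS.filter(name => !env[name])
}

console.log(missingR2Vars({ R2_BUCKET_NAME: 'your_bucket_name' }))
```

An empty result means every variable is present; anything else lists what still needs to be added to .env.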
Setup
Copy the example environment file:
cp .env.example .env
Edit .env and add your configuration:
# Database Configuration
TURSO_DATABASE_URL=libsql://your-database.turso.io
TURSO_AUTH_TOKEN=your_actual_token_here
# R2 Storage Configuration (Production)
R2_ACCESS_KEY_ID=your_access_key_id
R2_SECRET_ACCESS_KEY=your_secret_access_key
R2_ACCOUNT_ID=your_account_id
R2_BUCKET_NAME=your_bucket_name
R2_ENDPOINT=https://your-account-id.r2.cloudflarestorage.com
R2_PUBLIC_URL=https://your-public-url.r2.dev
Run commands with native Node.js dotenv support:
# Database operations
node --env-file=.env ./dist/cli.mjs
# Drizzle commands
pnpm db:generate # Uses --env-file automatically
pnpm db:migrate # Uses --env-file automatically
pnpm db:studio # Uses --env-file automatically
Native dotenv Support
This package loads .env files with the Node.js built-in --env-file flag (available since Node.js 20.6.0) instead of the dotenv package. This removes a dependency and avoids parsing the file in application code at startup.
R2 Storage Integration
The package includes built-in support for Cloudflare R2 object storage for artifact management:
Automatic R2 Integration (Production)
When running in production mode (NODE_ENV=production or production: true), the drizzle repository automatically:
- Initializes R2 Client: If R2 environment variables are configured, an R2 client is automatically created
- Uploads Artifacts: When using addArtifactWithR2Upload(), artifacts are automatically uploaded to R2
- Stores Public URLs: The database stores the public R2 URL instead of local file paths
- Fallback Behavior: If R2 is not configured, it falls back to local storage without errors
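The fallback behavior can be pictured as a single decision about what location gets recorded for an artifact. In the sketch below, `artifactLocation` is a hypothetical function, and both the `<entry-name>/<file-name>` object key and the default local directory are assumptions rather than the package's documented layout.

```typescript
interface R2Settings {
  publicUrl: string
}

// Hypothetical helper: decide which location to record for an artifact.
function artifactLocation(
  entryName: string,
  fileName: string,
  production: boolean,
  r2?: R2Settings,
  localDir = '.mdream/archives',
): string {
  if (production && r2) {
    // Assumption: objects are keyed as <entry-name>/<file-name>.
    const base = r2.publicUrl.replace(/\/+$/, '')
    return `${base}/${entryName}/${fileName}`
  }
  // No R2 settings (or not in production): record a local path instead.
  return `${localDir}/${fileName}`
}

console.log(artifactLocation('my-project', 'docs.tar.gz', true, { publicUrl: 'https://your-public-url.r2.dev' }))
console.log(artifactLocation('my-project', 'docs.tar.gz', true))
```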
// Production repository automatically uses R2 if configured
const repository = createRepository({ production: true })
// This will upload to R2 in production, local storage in development
const archiveData = Buffer.from('archive content')
await repository.addArtifactWithR2Upload(
  entryId,
  'archive',
  'project.tar.gz',
  archiveData,
  archiveData.length
)
Manual R2 Setup
import { createR2Client } from '@mdream/llms-db'
const r2Client = createR2Client({
  accessKeyId: process.env.R2_ACCESS_KEY_ID,
  secretAccessKey: process.env.R2_SECRET_ACCESS_KEY,
  accountId: process.env.R2_ACCOUNT_ID,
  bucketName: process.env.R2_BUCKET_NAME,
  endpoint: process.env.R2_ENDPOINT,
  publicUrl: process.env.R2_PUBLIC_URL,
})
// Upload an artifact (archiveBuffer holds the archive bytes)
const archiveBuffer = Buffer.from('archive content')
const publicUrl = await r2Client.uploadArtifact('my-project', 'docs.tar.gz', archiveBuffer)
// Download an artifact
const data = await r2Client.downloadArtifact('my-project', 'docs.tar.gz')
// Get public URL
const url = r2Client.getPublicUrl('my-project', 'docs.tar.gz')
Environment Variables
Add these to your .env file:
# Cloudflare R2 Configuration
R2_ACCESS_KEY_ID=your_access_key_id
R2_SECRET_ACCESS_KEY=your_secret_access_key
R2_ACCOUNT_ID=your_account_id
R2_BUCKET_NAME=your_bucket_name
R2_ENDPOINT=https://your-account-id.r2.cloudflarestorage.com
R2_PUBLIC_URL=https://your-public-url.r2.dev