tabvault
v0.1.0
Turn browser tab chaos into a structured knowledge vault
Takes URLs from tab managers (OneTab, bookmarks, Pocket exports — any text file of URLs), fetches content with domain-aware strategies, AI-categorizes into folders and tags, and writes an Obsidian-compatible markdown vault.
Usage
# Fetch URLs into a vault
tabvault fetch urls.txt ./vault
# Check vault health
tabvault scan ./vault
# Smart retry failed entries (domain-aware: GitHub API, Reddit JSON, YouTube oEmbed, etc.)
tabvault retry ./vault --scan
# Deduplicate, fix tags, merge folders
tabvault cleanup ./vault # dry-run (default)
tabvault cleanup ./vault --apply # execute

From the monorepo root: pnpm tabvault <command>.
Commands
fetch <input-file> [vault-dir]
Main pipeline. Reads a URL list (one per line, or OneTab's url | title format), fetches content via Jina Reader, categorizes with AI, and writes markdown files with YAML frontmatter.
--provider jina|firecrawl Content fetcher (default: jina)
--ai anthropic|openai|gemini AI categorizer (default: anthropic)
--skip-ai Organize by domain instead of AI
--concurrency <n> Parallel fetches (default: 10)

Progress is saved to .progress.json in the vault; safe to interrupt and resume.
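The accepted input formats (bare URLs, or OneTab's url | title lines) could be parsed with something like the following sketch; parseUrlLine is a hypothetical helper name, not tabvault's actual API:

```typescript
// Sketch: parse one line of a URL list, accepting either a bare URL
// or OneTab's "url | title" format. Illustrative only.
interface UrlEntry {
  url: string;
  title?: string;
}

function parseUrlLine(line: string): UrlEntry | null {
  const trimmed = line.trim();
  if (!trimmed) return null; // skip blank lines
  const [url, ...rest] = trimmed.split(" | ");
  if (!/^https?:\/\//.test(url)) return null; // not a URL line
  const title = rest.join(" | ").trim(); // titles may themselves contain " | "
  return title ? { url, title } : { url };
}
```

For example, `parseUrlLine("https://example.com/a | Example Article")` yields the URL plus its OneTab title, while a bare URL yields an entry with no title.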
retry <vault-dir> [input-file]
Re-fetches failed URLs using the best strategy per domain:
| Domain | Strategy |
| ------------------------------ | -------------------------------- |
| GitHub repos | GitHub API (README + metadata) |
| GitHub issues/PRs | GitHub API (title + body) |
| YouTube | oEmbed (title + channel) |
| Reddit posts | JSON API (title + selftext) |
| Paywall sites (WSJ, NYT, etc.) | Direct HTML <title> + <meta> |
| Google Search | Extract query, save to list file |
| Google Docs/Sheets | Save to list file |
| Amazon products | Save to list file |
| Everything else | Jina Reader |
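The routing in the table above could be sketched as a simple host-based dispatcher. This is an illustration under assumptions: pickStrategy, the Strategy names, and the exact host lists are hypothetical, not tabvault's internals:

```typescript
// Sketch of per-domain strategy routing, mirroring the table above.
type Strategy =
  | "github-api"
  | "youtube-oembed"
  | "reddit-json"
  | "direct-html"
  | "save-to-list"
  | "jina-reader";

// Illustrative subset; the real paywall list is presumably longer.
const PAYWALLED = new Set(["wsj.com", "nytimes.com"]);

function pickStrategy(rawUrl: string): Strategy {
  const host = new URL(rawUrl).hostname.replace(/^www\./, "");
  if (host === "github.com") return "github-api";
  if (host === "youtube.com" || host === "youtu.be") return "youtube-oembed";
  if (host === "reddit.com") return "reddit-json";
  if (PAYWALLED.has(host)) return "direct-html";
  if (host === "google.com" || host === "docs.google.com") return "save-to-list";
  if (host.endsWith("amazon.com")) return "save-to-list";
  return "jina-reader"; // everything else falls through to Jina Reader
}
```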
--scan Scan vault for bad entries instead of retrying from progress file
--dry-run Preview without changes
--concurrency <n> Parallel fetches (default: 5)

scan <vault-dir>
Read-only vault health audit. Reports bad entries, duplicates, tag stats, and folder structure.
--json Machine-readable output

cleanup <vault-dir>
Six-step vault cleanup:
- Deduplicate — find URLs with multiple files, keep the best copy
- Garbage titles — delete bot-check pages ("Just a moment...", "Access denied", etc.)
- Problem folders — remove _failed-fetch/, libhunt.com/, etc.
- Tag merges — normalize plural/singular inconsistencies, fix spaces
- Folder merges — consolidate similar subfolders (llms/ → llm/, selfhosted/ → self-hosting/)
- Orphan folders — merge tiny top-level folders into appropriate categories
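As a rough illustration of the tag-merge step, a normalizer might look like the sketch below; normalizeTag is a hypothetical name and the singularization rule is deliberately naive, not tabvault's actual logic:

```typescript
// Sketch: normalize a tag by collapsing spaces and naive
// plural/singular variants. Illustrative only.
function normalizeTag(tag: string): string {
  let t = tag.trim().toLowerCase().replace(/\s+/g, "-"); // "machine learning" -> "machine-learning"
  // Naive singularization: strip a trailing "s" unless the tag ends in "ss".
  // (A real implementation would need exceptions, e.g. "kubernetes".)
  if (t.endsWith("s") && !t.endsWith("ss")) t = t.slice(0, -1);
  return t;
}
```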
--apply Execute changes (default is dry-run)

API Keys
Set via environment variables or .env.local in the working directory:
# Content fetching
JINA_API_KEY=... # Optional (Jina has a free tier)
FIRECRAWL_API_KEY=... # Required if using --provider firecrawl
# AI categorization
ANTHROPIC_API_KEY=... # Default provider
OPENAI_API_KEY=...
GEMINI_API_KEY=...
# GitHub (for retry command)
GITHUB_TOKEN=... # Optional, increases rate limit from 60 to 5000 req/hr

Output Format
Each article becomes a markdown file with YAML frontmatter:
---
url: "https://example.com/article"
title: "Article Title"
domain: example.com
description: "A brief description"
tags: [topic, subtopic, tool]
word_count: 1234
---
# Article Title
> Source: [https://example.com/article](https://example.com/article)
Article content in markdown...

Files are organized into semantic folders like programming/react/, ai-ml/agents/, design/ux/.
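A writer for the format above could be sketched as follows; renderMarkdown and the Entry shape are assumptions for illustration, not tabvault's actual code:

```typescript
// Sketch: render an entry as markdown with YAML frontmatter,
// matching the example format above. Illustrative only.
interface Entry {
  url: string;
  title: string;
  domain: string;
  description: string;
  tags: string[];
  wordCount: number;
}

function renderMarkdown(e: Entry, body: string): string {
  return [
    "---",
    `url: "${e.url}"`,
    `title: "${e.title.replace(/"/g, '\\"')}"`, // escape quotes for YAML
    `domain: ${e.domain}`,
    `description: "${e.description.replace(/"/g, '\\"')}"`,
    `tags: [${e.tags.join(", ")}]`, // YAML flow sequence
    `word_count: ${e.wordCount}`,
    "---",
    "",
    `# ${e.title}`,
    "",
    `> Source: [${e.url}](${e.url})`,
    "",
    body,
  ].join("\n");
}
```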
Requirements
- Bun runtime
- At least one AI API key for categorization (or use --skip-ai)
