tabvault
v0.1.0
Turn browser tab chaos into a structured knowledge vault
Takes URLs from tab managers (OneTab, bookmarks, Pocket exports — any text file of URLs), fetches content with domain-aware strategies, AI-categorizes into folders and tags, and writes an Obsidian-compatible markdown vault.
Usage
# Fetch URLs into a vault
tabvault fetch urls.txt ./vault
# Check vault health
tabvault scan ./vault
# Smart retry failed entries (domain-aware: GitHub API, Reddit JSON, YouTube oEmbed, etc.)
tabvault retry ./vault --scan
# Deduplicate, fix tags, merge folders
tabvault cleanup ./vault # dry-run (default)
tabvault cleanup ./vault --apply # execute

From the monorepo root: pnpm tabvault <command>.
Commands
fetch <input-file> [vault-dir]
Main pipeline. Reads a URL list (one per line, or OneTab's url | title format), fetches content via Jina Reader, categorizes with AI, and writes markdown files with YAML frontmatter.
--provider jina|firecrawl Content fetcher (default: jina)
--ai anthropic|openai|gemini AI categorizer (default: anthropic)
--skip-ai Organize by domain instead of AI
--concurrency <n> Parallel fetches (default: 10)

Progress is saved to .progress.json in the vault; safe to interrupt and resume.
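The accepted input formats (bare URLs, or OneTab's url | title lines) could be parsed with something like the following sketch; parseUrlLine is a hypothetical helper name, not tabvault's actual API:

```typescript
// Sketch: parse one line of a URL list, accepting either a bare URL
// or OneTab's "url | title" format. Illustrative only.
interface UrlEntry {
  url: string;
  title?: string;
}

function parseUrlLine(line: string): UrlEntry | null {
  const trimmed = line.trim();
  if (!trimmed) return null; // skip blank lines
  const [url, ...rest] = trimmed.split(" | ");
  if (!/^https?:\/\//.test(url)) return null; // not a URL line
  const title = rest.join(" | ").trim(); // titles may themselves contain " | "
  return title ? { url, title } : { url };
}
```

For example, `parseUrlLine("https://example.com/a | Example Article")` yields the URL plus its OneTab title, while a bare URL yields an entry with no title.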
retry <vault-dir> [input-file]
Re-fetches failed URLs using the best strategy per domain:
| Domain | Strategy |
| ------------------------------ | -------------------------------- |
| GitHub repos | GitHub API (README + metadata) |
| GitHub issues/PRs | GitHub API (title + body) |
| YouTube | oEmbed (title + channel) |
| Reddit posts | JSON API (title + selftext) |
| Paywall sites (WSJ, NYT, etc.) | Direct HTML <title> + <meta> |
| Google Search | Extract query, save to list file |
| Google Docs/Sheets | Save to list file |
| Amazon products | Save to list file |
| Everything else | Jina Reader |
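The routing in the table above could be sketched as a simple host-based dispatcher. This is an illustration under assumptions: pickStrategy, the Strategy names, and the exact host lists are hypothetical, not tabvault's internals:

```typescript
// Sketch of per-domain strategy routing, mirroring the table above.
type Strategy =
  | "github-api"
  | "youtube-oembed"
  | "reddit-json"
  | "direct-html"
  | "save-to-list"
  | "jina-reader";

// Illustrative subset; the real paywall list is presumably longer.
const PAYWALLED = new Set(["wsj.com", "nytimes.com"]);

function pickStrategy(rawUrl: string): Strategy {
  const host = new URL(rawUrl).hostname.replace(/^www\./, "");
  if (host === "github.com") return "github-api";
  if (host === "youtube.com" || host === "youtu.be") return "youtube-oembed";
  if (host === "reddit.com") return "reddit-json";
  if (PAYWALLED.has(host)) return "direct-html";
  if (host === "google.com" || host === "docs.google.com") return "save-to-list";
  if (host.endsWith("amazon.com")) return "save-to-list";
  return "jina-reader"; // everything else falls through to Jina Reader
}
```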
--scan Scan vault for bad entries instead of retrying from progress file
--dry-run Preview without changes
--concurrency <n> Parallel fetches (default: 5)

scan <vault-dir>
Read-only vault health audit. Reports bad entries, duplicates, tag stats, and folder structure.
--json Machine-readable output

cleanup <vault-dir>
Six-step vault cleanup:
- Deduplicate — find URLs with multiple files, keep the best copy
- Garbage titles — delete bot-check pages ("Just a moment...", "Access denied", etc.)
- Problem folders — remove _failed-fetch/, libhunt.com/, etc.
- Tag merges — normalize plural/singular inconsistencies, fix spaces
- Folder merges — consolidate similar subfolders (llms/ → llm/, selfhosted/ → self-hosting/)
- Orphan folders — merge tiny top-level folders into appropriate categories
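As a rough illustration of the tag-merge step, a normalizer might look like the sketch below; normalizeTag is a hypothetical name and the singularization rule is deliberately naive, not tabvault's actual logic:

```typescript
// Sketch: normalize a tag by collapsing spaces and naive
// plural/singular variants. Illustrative only.
function normalizeTag(tag: string): string {
  let t = tag.trim().toLowerCase().replace(/\s+/g, "-"); // "machine learning" -> "machine-learning"
  // Naive singularization: strip a trailing "s" unless the tag ends in "ss".
  // (A real implementation would need exceptions, e.g. "kubernetes".)
  if (t.endsWith("s") && !t.endsWith("ss")) t = t.slice(0, -1);
  return t;
}
```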
--apply Execute changes (default is dry-run)

API Keys
Set via environment variables or .env.local in the working directory:
# Content fetching
JINA_API_KEY=... # Optional (Jina has a free tier)
FIRECRAWL_API_KEY=... # Required if using --provider firecrawl
# AI categorization
ANTHROPIC_API_KEY=... # Default provider
OPENAI_API_KEY=...
GEMINI_API_KEY=...
# GitHub (for retry command)
GITHUB_TOKEN=... # Optional, increases rate limit from 60 to 5000 req/hr

Output Format
Each article becomes a markdown file with YAML frontmatter:
---
url: "https://example.com/article"
title: "Article Title"
domain: example.com
description: "A brief description"
tags: [topic, subtopic, tool]
word_count: 1234
---
# Article Title
> Source: [https://example.com/article](https://example.com/article)
Article content in markdown...

Files are organized into semantic folders like programming/react/, ai-ml/agents/, design/ux/.
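A writer for the format above could be sketched as follows; renderMarkdown and the Entry shape are assumptions for illustration, not tabvault's actual code:

```typescript
// Sketch: render an entry as markdown with YAML frontmatter,
// matching the example format above. Illustrative only.
interface Entry {
  url: string;
  title: string;
  domain: string;
  description: string;
  tags: string[];
  wordCount: number;
}

function renderMarkdown(e: Entry, body: string): string {
  return [
    "---",
    `url: "${e.url}"`,
    `title: "${e.title.replace(/"/g, '\\"')}"`, // escape quotes for YAML
    `domain: ${e.domain}`,
    `description: "${e.description.replace(/"/g, '\\"')}"`,
    `tags: [${e.tags.join(", ")}]`, // YAML flow sequence
    `word_count: ${e.wordCount}`,
    "---",
    "",
    `# ${e.title}`,
    "",
    `> Source: [${e.url}](${e.url})`,
    "",
    body,
  ].join("\n");
}
```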
Requirements
- Bun runtime
- At least one AI API key for categorization (or use --skip-ai)
