vecpdf
v0.0.1
CLI tool to process PDFs and create local vector databases using ChromaDB
vecpdf — PDF → ChromaDB (HTTP server)
vecpdf is a tiny CLI that:
- is an excuse to not rely on Pinecone and other hosted vector databases,
- extracts text from a PDF (via Python PyMuPDF),
- splits the text into chunks (token-aware when `tiktoken` is available),
- and indexes those chunks into a ChromaDB collection over HTTP.
Note: Chroma is a local vector database. vecpdf talks to a running Chroma server (default http://localhost:8000). Reminder: vectors live inside the Chroma server, not in your project folder.
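Before running vecpdf, it can help to confirm a Chroma server is actually listening. A minimal reachability check can be sketched in plain Python (the `chroma_is_up` helper is illustrative and not part of vecpdf; the heartbeat paths are an assumption that may vary across Chroma versions):

```python
import urllib.request
import urllib.error

def chroma_is_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if a Chroma server answers a heartbeat endpoint.

    Older Chroma releases expose /api/v1/heartbeat; newer ones use
    /api/v2/heartbeat, so both paths are tried.
    """
    for path in ("/api/v2/heartbeat", "/api/v1/heartbeat"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            continue  # server not reachable on this path; try the next
    return False

if __name__ == "__main__":
    print("Chroma reachable:", chroma_is_up())
```

If this prints `False`, start your Chroma server (and check `CHROMA_URL`) before indexing.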
Requirements
- Python with:

  ```bash
  pip install PyMuPDF tiktoken
  ```

  (tiktoken is optional, but gives nicer chunking.)
- A ChromaDB server running locally (HTTP). By default, vecpdf uses http://localhost:8000.
Use a specific Python (virtualenv)
```powershell
# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"
```

```bash
# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"
```

Chroma server URL
Default: http://localhost:8000
To use a different server:
```bash
export CHROMA_URL="http://localhost:8001"
```

Quick Start
Create a tiny sample PDF:
```bash
python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72, 72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY
```

Process the PDF:
```bash
# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf

# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing

# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"

# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800
```

Query the collection:
```bash
# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3

# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full
```

CLI Reference
vecpdf process <pdf-path> [options]

- `<pdf-path>`: Path to your PDF file (required)
- `-c, --collection <name>`: Chroma collection name (default: `documents`)
- `-s, --chunk-size <size>`: Token chunk size (default: `500`)
- `--python-script <path>`: Use your own Python script (advanced)
- `--keep-existing`: Append to existing collection instead of recreating it
- `--id-prefix <prefix>`: Custom prefix for new chunk IDs (default: `chunk_`)

vecpdf query <query-text> [options]

- `<query-text>`: Text to search for (required)
- `-c, --collection <name>`: Collection name (default: `documents`)
- `-n, --results <number>`: Number of results to return (default: `5`)
- `--full`: Show full text for each result (instead of a preview)
Where data lives
- vecpdf talks to a running Chroma server over HTTP (default http://localhost:8000).
- Documents and vectors are stored by that server (not in a local `./vectordb` folder).
Troubleshooting
Python extraction errors
- Make sure PyMuPDF is installed: `pip install PyMuPDF`
- If tiktoken is missing, vecpdf falls back to a simple character split (still works).
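The character-split fallback mentioned above can be illustrated with a minimal sketch (this mirrors the idea only; it is not vecpdf's actual implementation, and vecpdf's real chunk boundaries may differ):

```python
def chunk_by_chars(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fallback: split text into fixed-size character chunks.

    The token-aware path (when tiktoken is installed) splits on token
    boundaries instead; this sketch shows only the fallback behavior.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_by_chars("a" * 1200, chunk_size=500)
# yields three chunks of 500, 500, and 200 characters
```

No text is lost either way: joining the chunks reproduces the original string.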
Embedding/Indexing errors
- Your Chroma server needs an embedder. One path is: `pip install chromadb sentence-transformers`
- If you see duplicate-ID errors, try a different `--id-prefix` or run without `--keep-existing`.
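To see why a per-document `--id-prefix` avoids duplicate-ID errors, here is an illustrative sketch (the `make_ids` helper is hypothetical, not vecpdf's internal code, though sequential prefixed IDs are a common scheme):

```python
def make_ids(prefix: str, n_chunks: int) -> list[str]:
    """Illustrative: build sequential chunk IDs like 'paperA_0', 'paperA_1', ...

    Appending two PDFs under the same prefix reuses indices 0..n and
    collides; distinct prefixes keep each document's ID space disjoint.
    """
    return [f"{prefix}{i}" for i in range(n_chunks)]

ids_a = make_ids("paperA_", 3)  # ['paperA_0', 'paperA_1', 'paperA_2']
ids_b = make_ids("paperB_", 3)  # no overlap with ids_a
```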
No results
- Increase `-n`, try a simpler query, or confirm the `-c` collection name.
License
MIT
