vecpdf
v0.0.1
CLI tool to process PDFs and create local vector databases using ChromaDB
vecpdf — PDF → ChromaDB (HTTP server)
vecpdf is a tiny CLI that:
- is an excuse to not rely on Pinecone and other hosted vector databases,
- extracts text from a PDF (via Python PyMuPDF),
- splits the text into chunks (token-aware when `tiktoken` is available),
- and indexes those chunks into a ChromaDB collection over HTTP.
Note: Chroma is a local vector database. vecpdf talks to a running Chroma server (default http://localhost:8000). Reminder: vectors live inside the Chroma server, not in your project folder.
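Before running vecpdf, it can help to confirm a Chroma server is actually listening. A minimal reachability check can be sketched in plain Python (the `chroma_is_up` helper is illustrative and not part of vecpdf; the heartbeat paths are an assumption that may vary across Chroma versions):

```python
import urllib.request
import urllib.error

def chroma_is_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if a Chroma server answers a heartbeat endpoint.

    Older Chroma releases expose /api/v1/heartbeat; newer ones use
    /api/v2/heartbeat, so both paths are tried.
    """
    for path in ("/api/v2/heartbeat", "/api/v1/heartbeat"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            continue  # server not reachable on this path; try the next
    return False

if __name__ == "__main__":
    print("Chroma reachable:", chroma_is_up())
```

If this prints `False`, start your Chroma server (and check `CHROMA_URL`) before indexing.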
Requirements
- Python with:

  ```bash
  pip install PyMuPDF tiktoken
  ```

  (tiktoken is optional, but gives nicer chunking.)
- A ChromaDB server running locally (HTTP). By default, vecpdf uses http://localhost:8000.
Use a specific Python (virtualenv)
```powershell
# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"
```

```bash
# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"
```

Chroma server URL
Default: http://localhost:8000
To use a different server:
```bash
export CHROMA_URL="http://localhost:8001"
```

Quick Start
Create a tiny sample PDF:
```bash
python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72, 72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY
```

Process the PDF:
```bash
# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf

# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing

# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"

# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800
```

Query the collection:
```bash
# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3

# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full
```

CLI Reference
vecpdf process <pdf-path> [options]

- `<pdf-path>`: Path to your PDF file (required)
- `-c, --collection <name>`: Chroma collection name (default: `documents`)
- `-s, --chunk-size <size>`: Token chunk size (default: `500`)
- `--python-script <path>`: Use your own Python script (advanced)
- `--keep-existing`: Append to existing collection instead of recreating it
- `--id-prefix <prefix>`: Custom prefix for new chunk IDs (default: `chunk_`)

vecpdf query <query-text> [options]

- `<query-text>`: Text to search for (required)
- `-c, --collection <name>`: Collection name (default: `documents`)
- `-n, --results <number>`: Number of results to return (default: `5`)
- `--full`: Show full text for each result (instead of a preview)
Where data lives
- vecpdf talks to a running Chroma server over HTTP (default http://localhost:8000).
- Documents and vectors are stored by that server (not in a local `./vectordb` folder).
Troubleshooting
Python extraction errors
- Make sure PyMuPDF is installed: `pip install PyMuPDF`
- If tiktoken is missing, vecpdf falls back to a simple character split (still works).
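The character-split fallback mentioned above can be illustrated with a minimal sketch (this mirrors the idea only; it is not vecpdf's actual implementation, and vecpdf's real chunk boundaries may differ):

```python
def chunk_by_chars(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fallback: split text into fixed-size character chunks.

    The token-aware path (when tiktoken is installed) splits on token
    boundaries instead; this sketch shows only the fallback behavior.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_by_chars("a" * 1200, chunk_size=500)
# yields three chunks of 500, 500, and 200 characters
```

No text is lost either way: joining the chunks reproduces the original string.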
Embedding/Indexing errors
- Your Chroma server needs an embedder. One path is: `pip install chromadb sentence-transformers`
- If you see duplicate-ID errors, try a different `--id-prefix` or run without `--keep-existing`.
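To see why a per-document `--id-prefix` avoids duplicate-ID errors, here is an illustrative sketch (the `make_ids` helper is hypothetical, not vecpdf's internal code, though sequential prefixed IDs are a common scheme):

```python
def make_ids(prefix: str, n_chunks: int) -> list[str]:
    """Illustrative: build sequential chunk IDs like 'paperA_0', 'paperA_1', ...

    Appending two PDFs under the same prefix reuses indices 0..n and
    collides; distinct prefixes keep each document's ID space disjoint.
    """
    return [f"{prefix}{i}" for i in range(n_chunks)]

ids_a = make_ids("paperA_", 3)  # ['paperA_0', 'paperA_1', 'paperA_2']
ids_b = make_ids("paperB_", 3)  # no overlap with ids_a
```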
No results
- Increase `-n`, try a simpler query, or confirm the `-c` collection name.
License
MIT
