@docfide/duct
v0.2.0
Published
Document intelligence pipeline — extract, chunk, embed, and search any document format
Maintainers
Readme
Duct
Extract, chunk, embed, search, and ask — document intelligence in one command.
Duct is an open-source document intelligence pipeline. Point it at a PDF, DOCX, Markdown, image, HTML, or text file (or a whole directory), and it extracts the text, splits it into searchable chunks, and lets you query them instantly — with or without AI embeddings.
npx @docfide/duct index ./contracts/
npx @docfide/duct search "termination clauses"
npx @docfide/duct ask "What are my obligations?"
npx @docfide/duct serve # web UI at http://localhost:3456Quickstart
npm install -g @docfide/duct
duct index ./docs
duct search "payment terms"
duct serve
# → http://localhost:3456No API keys required. No configuration files. Works offline.
Features
| Feature | Description | |---------|-------------| | BM25 Search | Keyword search out of the box — no API keys, fully offline | | Vector Search | Semantic search via OpenAI or Gemini embeddings | | Hybrid Search | BM25 + Vector blended with Reciprocal Rank Fusion | | Re-Ranking | Second-pass term-proximity scoring for precision | | HyDE | Query expansion via hypothetical document embeddings | | Q&A | Ask questions, get answers with source citations | | Agentic Retrieval | Multi-hop search that decomposes complex questions | | Watch Mode | Auto-index files as they're added or modified | | Schema Extraction | Extract structured fields from documents via LLM | | Diff Tracking | Line-level changes between document versions | | Export API | Search results in JSON or CSV | | Web UI | Tabbed interface for search, ask, upload, and settings | | URL Indexing | Index web pages by URL | | Table Extraction | Detects pipe and whitespace-separated tables | | OCR | Tesseract.js + sharp for scanned PDFs and images | | All Formats | PDF, DOCX, Markdown, HTML, plain text, images |
Documentation
| Guide | Contents | |-------|---------| | CLI Reference | All commands: index, search, ask, watch, extract, diff, serve | | API Reference | REST endpoints for the web server | | Library API | Programmatic usage in Node.js/TypeScript | | Search | BM25, vector, hybrid, re-ranking, HyDE | | Q&A | LLM providers, agentic retrieval, configuration |
Quick Examples
# Index and search
duct index ./contracts/
duct search "indemnification clause"
duct search "termination" --search-mode hybrid --rerank
# Ask questions
duct ask "What is the governing law?"
duct ask "Compare all NDAs" --multi
# Watch a directory for changes
duct watch ./inbox --ocr
# Extract structured data
duct extract invoice_date:date:Issue date total:number:Amount --index ./invoices/
# Export results
curl "http://localhost:3456/api/export?q=termination&format=csv"Environment
| Variable | Required For |
|----------|--------------|
| OPENAI_API_KEY | OpenAI embeddings (text-embedding-3-small / 3-large) and LLM (gpt-4o) |
| GEMINI_API_KEY | Google Gemini embeddings (text-embedding-004) and LLM (gemini-2.0-flash) |
| DUCT_AUTH_TOKEN | Server authentication (alternative to --auth-token) |
Without any API key, Duct uses BM25 keyword search — still works, just no semantic understanding. For Q&A, Ollama is the default LLM provider and runs entirely locally.
Web UI
duct serve
# → http://localhost:3456Four tabs:
- Search — search indexed documents with mode badge
- Ask — Q&A with configurable LLM, agentic mode toggle
- Upload — drag-and-drop files or index by URL
- Settings — configure LLM provider, API keys, search mode, chunking
duct serve --port 8080 --persist .duct-data --auth-token my-secret --llm ollamaLibrary
import { Duct } from '@docfide/duct'
const duct = new Duct({
chunk: { strategy: 'by-heading', size: 1000 },
embed: { provider: 'openai' },
llm: { provider: 'ollama', model: 'llama3.2' },
search: { mode: 'hybrid', alpha: 0.3, rerank: true },
})
await duct.index('./report.pdf')
const results = await duct.search('termination clause')
const answer = await duct.ask('What are my obligations?')See the Library API for the full API.
Architecture
file.pdf ──┐
file.docx ─┤ extract() → chunk() → embed() → store() → search() → ask()
file.md ───┤ │ │ │ │
file.html ─┤ text chunks vectors results answer
file.png ───┤ │ │ │ │
file.txt ──┘ pdfjs-dist sliding OpenAI BM25 Ollama
mammoth window Gemini hybrid OpenAI
marked by- rerank Gemini
cheerio heading HyDE
sharp +
tesseractDeploy
docker build -t duct .
docker run -d -p 3456:3456 ductSee the Dockerfile for build details. Deploy on Railway, Fly.io, or any VPS.
Development
git clone https://github.com/docfide/duct
cd duct
npm install
npm run dev # run CLI with tsx
npm test # run tests
npm run build # compile TypeScript
npm run typecheck # type-check without emittingLicense
MIT — see LICENSE.
Built by Docfide. We build contract software; Duct is our gift to developers who work with documents.
