nuktaa
v0.1.24
Nuktaa helps AI teams turn public or private source material into usable knowledge for LLM applications.
Nuktaa CLI
Most AI products still need a reliable way to collect data before the model can answer anything useful. Nuktaa handles that preparation work at scale: it can discover internet pages, collect HTML and PDF files, extract readable content, turn that content into smaller knowledge records, and index those records into a vector database. Your team can spend more time building the AI experience and less time wiring data sourcing, crawling, parsing, retrying, and indexing.
Use Nuktaa when you need to prepare enterprise or domain knowledge for:
- RAG chatbots and internal assistants
- document intelligence workflows
- search and Q&A over websites, policies, manuals, reports, or PDFs
- AI products that need fresh source-grounded context

How Nuktaa Works
Nuktaa sits between your raw source material and your AI application. You give it websites, PDFs, manuals, documentation, or private source locations. Nuktaa then runs a repeatable pipeline that discovers useful URLs, fetches raw HTML/PDF files, extracts readable text, checks quality, breaks content into smaller knowledge units, and indexes those units into your vector database.
The result is source-grounded context your RAG chatbot, assistant, search experience, or Q&A system can retrieve. Nuktaa does not replace your AI product; it handles the data sourcing and preparation work so your product can focus on retrieval, prompts, evaluation, and user experience.
Requirements
- Node.js 22 or newer
- Chromium for Playwright
- Ghostscript for PDF handling
- Tesseract for OCR
- Qdrant and embedding/LLM providers when you want to index and answer over the extracted knowledge
Install
npm install -g nuktaa
The npm package is nuktaa; the installed command is nuktaa.
Install the browser runtime:
npx playwright install chromium
Install PDF/OCR tools on Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y ghostscript tesseract-ocr tesseract-ocr-hin
Install PDF/OCR tools on macOS:
brew install ghostscript tesseract tesseract-lang
Start Qdrant when indexing is enabled:
docker run -p 6333:6333 qdrant/qdrant
Or let Nuktaa prepare local indexing services from inside a workspace:
nuktaa services up
nuktaa services status
Remote Qdrant works too. Put the remote URL and collection in
nukta.config.json; nuktaa index reads it automatically.
Initialize A Workspace
For the friendliest first run, use:
nuktaa start
start can create the workspace, ask for website URLs, save them as sources,
and tell you the next command.
To create a workspace directly, use init:
nuktaa init --output-dir ./nukta
Short form:
nuktaa init -o ./nukta
For our Chhattisgarh public-sector server deployment, initialize with the
included seed template so seeds/seeds.json is ready without manual editing:
nuktaa init -o ./nukta --seed-template cg-public-sector
If you run nuktaa init without --output-dir or -o, Nuktaa asks for a
workspace directory. Press Enter to create the workspace in the current
directory.
This creates:
./nukta/
  nukta.config.json
  seeds/seeds.json
  profiles/default.json
  data/
  logs/
Add source websites without editing JSON:
nuktaa sources add https://example.com
nuktaa sources list
After initialization, run commands from inside the workspace:
cd ./nukta
nuktaa doctor
The generated nukta.config.json also contains indexing settings:
{
  "vector": {
    "provider": "qdrant",
    "url": "http://localhost:6333",
    "collection": "knowledge_chunks"
  },
  "embedding": {
    "provider": "tei",
    "url": "http://127.0.0.1:8001/embed",
    "dimensions": 1024,
    "model": "BAAI/bge-m3"
  }
}
Change vector.url for a remote Qdrant server and change
vector.collection when you want a separate index. Nuktaa currently supports
Qdrant for vector storage.
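If your application needs the same settings Nuktaa indexes with, it can read them from the workspace config. The following is a minimal Python sketch, not part of Nuktaa: the function name load_index_settings is hypothetical, and treating every key shown in the generated config as required is an assumption.

```python
import json

# Keys taken from the generated config shown above. Treating all of them
# as required is an assumption of this sketch, not documented behavior.
EXPECTED_KEYS = {
    "vector": ("provider", "url", "collection"),
    "embedding": ("provider", "url", "dimensions", "model"),
}


def load_index_settings(config_text: str) -> dict:
    """Parse nukta.config.json text and return its vector and embedding sections."""
    config = json.loads(config_text)
    settings = {}
    for section, keys in EXPECTED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            raise ValueError(f"missing section: {section}")
        missing = [key for key in keys if key not in block]
        if missing:
            raise ValueError(f"{section} is missing: {', '.join(missing)}")
        settings[section] = block
    # The README states that Qdrant is the only supported vector store.
    if settings["vector"]["provider"] != "qdrant":
        raise ValueError("vector.provider must be 'qdrant'")
    return settings


EXAMPLE_CONFIG = """
{
  "vector": {
    "provider": "qdrant",
    "url": "http://localhost:6333",
    "collection": "knowledge_chunks"
  },
  "embedding": {
    "provider": "tei",
    "url": "http://127.0.0.1:8001/embed",
    "dimensions": 1024,
    "model": "BAAI/bge-m3"
  }
}
"""

settings = load_index_settings(EXAMPLE_CONFIG)
print(settings["vector"]["url"], settings["embedding"]["dimensions"])
```

In a real workspace you would pass the contents of ./nukta/nukta.config.json instead of the inline example.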
Indexing also requires embeddings. The default nuktaa services up command
starts a Dockerized Hugging Face Text Embeddings Inference service with
BAAI/bge-m3, a multilingual model that works for English, Hindi, and mixed
Hindi-English content. Advanced users can still point embedding.url at their
own HTTPS or HTTP embedding service.
The default TEI endpoint accepts JSON like this:
{
  "inputs": ["text to embed"]
}
and returns:
[[0.1, 0.2, 0.3]]
If you configure embedding.provider as http, Nuktaa also supports the legacy
custom service shape: requests that send an input field, and responses that
return an embeddings or embedding field. Set embedding.dimensions to match the
vector size returned by your provider.
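When you write or test your own embedding service, it can help to see both shapes side by side. The sketch below is illustrative Python, not Nuktaa code. The function names are hypothetical, and the exact layout of the legacy shape is an assumption: a list of texts under input, a list of vectors under embeddings, or a single vector under embedding.

```python
def build_request(texts, provider="tei"):
    """Return the JSON body for an embedding request."""
    if provider == "tei":
        # TEI shape from the README: {"inputs": ["text to embed"]}
        return {"inputs": list(texts)}
    # Assumed legacy shape: the same list under "input".
    return {"input": list(texts)}


def parse_vectors(response, dimensions):
    """Normalize an embedding response to a list of vectors and check sizes."""
    if isinstance(response, list):
        # TEI shape: [[0.1, 0.2, 0.3]]
        vectors = response
    elif "embeddings" in response:
        # Assumed legacy shape: {"embeddings": [[...], [...]]}
        vectors = response["embeddings"]
    elif "embedding" in response:
        # Assumed legacy shape: {"embedding": [...]} holding one vector
        vectors = [response["embedding"]]
    else:
        raise ValueError("unrecognized embedding response")
    for vector in vectors:
        if len(vector) != dimensions:
            raise ValueError(
                f"expected {dimensions} dimensions, got {len(vector)}"
            )
    return vectors


print(build_request(["text to embed"]))
print(parse_vectors([[0.1, 0.2, 0.3]], dimensions=3))
```

The dimension check mirrors the advice above: a mismatch between embedding.dimensions and the provider's vector size is easier to catch here than at indexing time.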
Verify The Server
Run doctor before starting the pipeline:
nuktaa doctor --skip-llm --skip-embedding --skip-qdrant
When indexing is enabled, run the full check:
nuktaa doctor
doctor prints the output directory, SQLite state DB, command log file, and error log file at startup.
Run The Pipeline
The pipeline starts from seed URLs and ends with searchable knowledge in your vector database:
seed URLs
  -> discover pages and PDFs
  -> fetch raw files
  -> extract clean text
  -> audit quality
  -> create smaller knowledge units
  -> request embeddings from embedding.url
  -> index records into vector.provider/vector.collection
nuktaa discover
nuktaa fetch
nuktaa extract
nuktaa audit
nuktaa services up
nuktaa index
nuktaa backup qdrant
nuktaa status
nuktaa runs
services up starts Qdrant with Docker and starts a local TEI embedding
container. backup qdrant creates a Qdrant collection snapshot and downloads
it into backups/qdrant/ inside the workspace.
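For scheduled runs you may want to chain the core stages from a script. The following Python sketch runs discover, fetch, extract, audit, and index in order and stops at the first failure. It assumes that nuktaa exits with a non-zero status when a stage fails, which this README does not state; run_pipeline is a hypothetical helper name.

```python
import subprocess

# Core stages in the order the README lists them.
STAGES = ("discover", "fetch", "extract", "audit", "index")


def run_pipeline(stages=STAGES, runner=subprocess.run):
    """Run each stage in order; return the stages that completed.

    Assumes a non-zero exit code means the stage failed. The runner
    argument exists so the function can be exercised without the CLI.
    """
    completed = []
    for stage in stages:
        result = runner(["nuktaa", stage])
        if result.returncode != 0:
            # Stop here so later stages do not run on incomplete data.
            break
        completed.append(stage)
    return completed
```

Call run_pipeline() from inside an initialized workspace, with services up already running if indexing is enabled. Comparing the returned list against STAGES tells you which stage to inspect with nuktaa status or nuktaa runs.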
discover is queue-based. It syncs your sources into SQLite, refreshes stale
host sitemaps, checks URLs that still need checking, and expands links from
usable pages. URLs that pass discovery become Ready to fetch; slow, skipped,
or error URLs stay visible as problem URLs.
When you are unsure what to run, use:
nuktaa next
status prints a human-readable summary first. Add --verbose when you want
detailed stage-by-stage diagnostics.
runs shows recent command runs with strategy, worker count, processed items,
slow URLs, and errors. This is the quickest way to see whether the last run was
spread across many hosts, focused on a few hosts, or protected by cooldown.
For a per-host breakdown, including hosts protected by cooldown or backoff, use:
nuktaa hosts
nuktaa hosts --pending
nuktaa hosts --problems
Use crawl when you want discovery and raw HTML/PDF saving in one pass:
nuktaa crawl
nuktaa index connects to the vector database declared in nukta.config.json.
You can override the collection for one run:
nuktaa index --collection my_collection --reset
After indexing, your AI application can retrieve these records from the vector database and use them as grounded context for LLM answers.
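On the application side, retrieval is a similarity search against the Qdrant collection. The sketch below uses Qdrant's REST search endpoint with only the Python standard library. It is an illustration, not Nuktaa code: the helper names are hypothetical, and the payload fields stored with each record are not documented here, so inspect a returned payload before relying on any field name. The query vector must come from the same embedding model that was used for indexing.

```python
import json
import urllib.request


def build_search_request(base_url, collection, query_vector, limit=5):
    """Build a POST request for Qdrant's points/search endpoint."""
    url = f"{base_url.rstrip('/')}/collections/{collection}/points/search"
    body = {
        "vector": list(query_vector),
        "limit": limit,
        "with_payload": True,  # return stored record data with each hit
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url, collection, query_vector, limit=5):
    """Send the search and return the list of scored hits."""
    request = build_search_request(base_url, collection, query_vector, limit)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["result"]
```

With the default config, a call looks like search("http://localhost:6333", "knowledge_chunks", query_vector), where query_vector is the 1024-value embedding of the user's question.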
Useful Runtime Flags
nuktaa discover
nuktaa fetch
nuktaa extract --status
nuktaa index --reset
Nuktaa auto-tunes worker counts from CPU, RAM, and GPU visibility. On a laptop it
stays conservative; on a large server it increases discover, fetch, extract,
OCR, and index concurrency automatically. Normal production runs should not need
capacity flags. Use --local, --server, or explicit concurrency flags only
when you want to override auto-tuning for diagnostics.
Runtime path resolution order is:
- CLI flags such as -o/--output-dir, --seed-file, and --state-db
- nearest nukta.config.json
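The resolution order above can be sketched in Python as follows. This is an illustration of the idea, not Nuktaa's implementation: find_workspace is a hypothetical name, and reading "nearest" as "the closest directory at or above the current one" is an assumption.

```python
from pathlib import Path

CONFIG_NAME = "nukta.config.json"


def find_workspace(start, output_dir=None):
    """Resolve the workspace directory.

    An explicit output_dir (the -o/--output-dir flag) wins. Otherwise
    walk upward from start and return the first directory that contains
    nukta.config.json, or None when no workspace is found.
    """
    if output_dir is not None:
        return Path(output_dir).resolve()
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / CONFIG_NAME).is_file():
            return candidate
    return None
```

Returning None rather than falling back to the current directory matches the behavior described next: without a config file, the command should stop instead of creating folders.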
All commands except nuktaa init must run inside an initialized Nuktaa
workspace. If Nuktaa cannot find a nukta.config.json, it stops and asks you to
initialize a workspace instead of creating surprise folders in the current
directory.
Compatibility environment variables such as CRAWL_OUTPUT_DIR and
PIPELINE_STATE_DB still work for existing scripts, but new installs should run
commands from inside the workspace.
Production scraping stays Playwright-only. Normal discover and fetch use a
soft browser timeout so very slow pages are marked retryable instead of holding
workers for the full timeout. Problem retries still use the full --timeout-ms
window. The default soft timeout is already tuned for production; pass
--soft-timeout-ms only when diagnosing a specific source.
nuktaa discover --retry-problems --timeout-ms 60000
nuktaa fetch --retry-failed --timeout-ms 60000
For long server runs, pages and browser contexts are recycled automatically to keep Chromium healthy. The defaults work for most installs, but operators can tune them when diagnosing memory or stuck-page behavior:
nuktaa discover --page-recycle-after 100 --context-recycle-after 1000
nuktaa fetch --page-recycle-after 100 --context-recycle-after 1000
The same knobs are available as NUKTAA_SOFT_TIMEOUT_MS,
NUKTAA_PAGE_RECYCLE_AFTER, and NUKTAA_CONTEXT_RECYCLE_AFTER.
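If you launch Nuktaa from a script, the environment variables can be set per run rather than exported globally. A small Python sketch, assuming the variables take plain integer values like their flag counterparts; tuning_env is a hypothetical helper.

```python
import os


def tuning_env(soft_timeout_ms=None, page_recycle_after=None,
               context_recycle_after=None, base=None):
    """Return a copy of the environment with the given knobs set.

    Values left as None are not added, so Nuktaa's defaults apply.
    """
    env = dict(os.environ if base is None else base)
    overrides = {
        "NUKTAA_SOFT_TIMEOUT_MS": soft_timeout_ms,
        "NUKTAA_PAGE_RECYCLE_AFTER": page_recycle_after,
        "NUKTAA_CONTEXT_RECYCLE_AFTER": context_recycle_after,
    }
    for name, value in overrides.items():
        if value is not None:
            # Environment values must be strings.
            env[name] = str(int(value))
    return env
```

The result can be passed as the env argument of subprocess.run, for example subprocess.run(["nuktaa", "fetch"], env=tuning_env(page_recycle_after=100)).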
For a capped smoke test on a new server, use a small page cap and debug logs:
nuktaa discover --debug --max-pages 500
nuktaa fetch --debug --max-pages 500
Those flags are for proving an install or collecting diagnostics. Normal
pipeline runs are simply nuktaa discover, nuktaa fetch, nuktaa extract,
nuktaa audit, and nuktaa index.
Logs
Nuktaa writes command logs and error records inside the configured workspace:
<output-dir>/logs/commands/YYYY-MM-DD/<run-id>.log
<output-dir>/command-errors.jsonl
If a command is slow or fails on a server, keep those files. They are the first thing to inspect when tuning concurrency, timeouts, browser setup, OCR, or provider configuration.
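A quick way to triage the error file is to load it and tally records by one field. The Python sketch below assumes only what the .jsonl extension implies, one JSON object per line. The record fields are not documented here, so the key you group by (command in the usage note) is hypothetical; check a real record first.

```python
import json
from collections import Counter
from pathlib import Path


def read_error_records(path):
    """Load one JSON object per line, skipping blank or unparsable lines."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # A partially written last line is possible while a run is active.
            continue
    return records


def count_by(records, key):
    """Tally records by the value of one field; missing values count as 'unknown'."""
    return Counter(
        str(record.get(key, "unknown"))
        for record in records
        if isinstance(record, dict)
    )
```

For example, count_by(read_error_records("command-errors.jsonl"), "command").most_common(5) would list the five most frequent values, provided the records carry a field with that name.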
Interactive terminals show a compact live progress line while commands are running. Non-interactive shells and CI keep plain append-only logs.
