nuktaa

v0.1.24

Published

5 days ago

rizwan92

Nuktaa CLI

Nuktaa helps AI teams turn public or private source material into usable knowledge for LLM applications.

Most AI products still need a reliable way to collect data before the model can answer anything useful. Nuktaa handles that preparation work at scale: it can discover internet pages, collect HTML and PDF files, extract readable content, turn that content into smaller knowledge records, and index those records into a vector database. Your team can spend more time building the AI experience and less time wiring data sourcing, crawling, parsing, retrying, and indexing.

Use Nuktaa when you need to prepare enterprise or domain knowledge for:

RAG chatbots and internal assistants
document intelligence workflows
search and Q&A over websites, policies, manuals, reports, or PDFs
AI products that need fresh source-grounded context

How Nuktaa Works

Nuktaa sits between your raw source material and your AI application. You give it websites, PDFs, manuals, documentation, or private source locations. Nuktaa then runs a repeatable pipeline that discovers useful URLs, fetches raw HTML/PDF files, extracts readable text, checks quality, breaks content into smaller knowledge units, and indexes those units into your vector database.

The result is source-grounded context your RAG chatbot, assistant, search experience, or Q&A system can retrieve. Nuktaa does not replace your AI product; it handles the data sourcing and preparation work so your product can focus on retrieval, prompts, evaluation, and user experience.

Requirements

Node.js 22 or newer
Chromium for Playwright
Ghostscript for PDF handling
Tesseract for OCR
Qdrant and embedding/LLM providers when you want to index and answer over the extracted knowledge

Install

npm install -g nuktaa

The npm package and the installed command are both named nuktaa.

Install the browser runtime:

npx playwright install chromium

Install PDF/OCR tools on Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y ghostscript tesseract-ocr tesseract-ocr-hin

Install PDF/OCR tools on macOS:

brew install ghostscript tesseract tesseract-lang

Start Qdrant when indexing is enabled:

docker run -p 6333:6333 qdrant/qdrant

Or let Nuktaa prepare local indexing services from inside a workspace:

nuktaa services up
nuktaa services status

Remote Qdrant works too. Put the remote URL and collection in nukta.config.json; nuktaa index reads it automatically.

Initialize A Workspace

For the friendliest first run, use:

nuktaa start

start can create the workspace, ask for website URLs, save them as sources, and tell you the next command.

To create a workspace directly, use init with an output directory:

nuktaa init --output-dir ./nukta

Short form:

nuktaa init -o ./nukta

For our Chhattisgarh public-sector server deployment, initialize with the included seed template so seeds/seeds.json is ready without manual editing:

nuktaa init -o ./nukta --seed-template cg-public-sector

If you run nuktaa init without --output-dir or -o, Nuktaa asks for a workspace directory. Press Enter to create the workspace in the current directory.

This creates:

./nukta/
  nukta.config.json
  seeds/seeds.json
  profiles/default.json
  data/
  logs/

Add source websites without editing JSON:

nuktaa sources add https://example.com
nuktaa sources list

After initialization, run commands from inside the workspace:

cd ./nukta
nuktaa doctor

The generated nukta.config.json also contains indexing settings:

{
  "vector": {
    "provider": "qdrant",
    "url": "http://localhost:6333",
    "collection": "knowledge_chunks"
  },
  "embedding": {
    "provider": "tei",
    "url": "http://127.0.0.1:8001/embed",
    "dimensions": 1024,
    "model": "BAAI/bge-m3"
  }
}

Change vector.url for a remote Qdrant server and change vector.collection when you want a separate index. Nuktaa currently supports Qdrant for vector storage.
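As an illustration of how an application might read these settings, here is a minimal Python sketch. It assumes only the keys shown in the generated config above; the function name load_index_settings and the returned dictionary layout are hypothetical and not part of Nuktaa.

```python
import json
from pathlib import Path


def load_index_settings(config_path):
    """Read the vector and embedding settings from a nukta.config.json file.

    Hypothetical helper: only the keys shown in the generated config are read.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    vector = config["vector"]
    embedding = config["embedding"]
    return {
        # Strip a trailing slash so the URL can be joined with API paths later.
        "qdrant_url": vector["url"].rstrip("/"),
        "collection": vector["collection"],
        "embedding_url": embedding["url"],
        "dimensions": int(embedding["dimensions"]),
    }
```

Reading the same file that nuktaa index uses keeps the application and the indexer pointed at the same Qdrant server and collection.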

Indexing also requires embeddings. The default nuktaa services up command starts a Dockerized Hugging Face Text Embeddings Inference service with BAAI/bge-m3, a multilingual model that works for English, Hindi, and mixed Hindi-English content. Advanced users can still point embedding.url at their own HTTPS or HTTP embedding service.

The default TEI endpoint accepts JSON like this:

{
  "inputs": ["text to embed"]
}

and returns:

[[0.1, 0.2, 0.3]]

If you set embedding.provider to http, Nuktaa also supports the legacy custom service shape: requests carry an input field, and responses carry either an embeddings or an embedding field. Set embedding.dimensions to match the vector size returned by your provider.
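The TEI request and response shapes above can be exercised from Python with the standard library. This is a sketch under assumptions: the helper names are hypothetical, only the default TEI shape is handled (not the legacy http shape), and the dimension check simply mirrors the advice to keep embedding.dimensions in sync with the provider.

```python
import json
import urllib.request


def build_embed_request(embedding_url, texts):
    """Build a POST request with the {"inputs": [...]} body shown above."""
    payload = json.dumps({"inputs": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        embedding_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embed_response(body, expected_dimensions):
    """Parse a [[...], [...]] response and check every vector's length."""
    vectors = json.loads(body)
    for vector in vectors:
        if len(vector) != expected_dimensions:
            raise ValueError(
                f"expected {expected_dimensions} dimensions, got {len(vector)}"
            )
    return vectors


def embed(embedding_url, texts, expected_dimensions, timeout=30):
    """Send texts to the embedding service and return one vector per text."""
    request = build_embed_request(embedding_url, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embed_response(response.read(), expected_dimensions)
```

With the default config, a call would look like embed("http://127.0.0.1:8001/embed", ["text to embed"], 1024), provided the embedding service is running.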

Verify The Server

Run doctor before starting the pipeline:

nuktaa doctor --skip-llm --skip-embedding --skip-qdrant

When indexing is enabled, run the full check:

nuktaa doctor

Doctor prints the output directory, SQLite state DB, command log file, and error log file at startup.

Run The Pipeline

The pipeline starts from seed URLs and ends with searchable knowledge in your vector database:

seed URLs
  -> discover pages and PDFs
  -> fetch raw files
  -> extract clean text
  -> audit quality
  -> create smaller knowledge units
  -> request embeddings from embedding.url
  -> index records into vector.provider/vector.collection

nuktaa discover
nuktaa fetch
nuktaa extract
nuktaa audit
nuktaa services up
nuktaa index
nuktaa backup qdrant
nuktaa status
nuktaa runs

services up starts Qdrant with Docker and starts a local TEI embedding container. backup qdrant creates a Qdrant collection snapshot and downloads it into backups/qdrant/ inside the workspace.
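For scheduled runs, the core stages can be driven from a small script. The following Python sketch runs them in order from the workspace directory and stops at the first failure. It assumes that nuktaa exits with a non-zero code when a stage fails, which this README does not state; the function name and the injectable runner parameter are hypothetical.

```python
import subprocess

# Core stages in the order listed above; services, backup, and status are left out.
PIPELINE_STAGES = ["discover", "fetch", "extract", "audit", "index"]


def run_pipeline(workspace_dir, stages=PIPELINE_STAGES, runner=subprocess.run):
    """Run each stage from inside the workspace and stop at the first failure.

    Assumption: a failed stage is signalled by a non-zero exit code.
    """
    completed = []
    for stage in stages:
        result = runner(["nuktaa", stage], cwd=workspace_dir)
        if result.returncode != 0:
            raise RuntimeError(
                f"nuktaa {stage} exited with code {result.returncode}"
            )
        completed.append(stage)
    return completed
```

Calling run_pipeline("./nukta") runs the commands with the workspace as the working directory, which matches the requirement to run commands from inside the workspace.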

discover is queue-based. It syncs your sources into SQLite, refreshes stale host sitemaps, checks URLs that still need checking, and expands links from usable pages. URLs that pass discovery become Ready to fetch; slow, skipped, or error URLs stay visible as problem URLs.

When you are unsure what to run, use:

nuktaa next

status prints a human summary first. Add --verbose when you want detailed stage-by-stage diagnostics.

runs shows recent command runs with strategy, worker count, processed items, slow URLs, and errors. This is the quickest way to see whether the last run was spread across many hosts, focused on a few hosts, or protected by cooldown.

To inspect the same picture host by host, including cooldown/backoff, use:

nuktaa hosts
nuktaa hosts --pending
nuktaa hosts --problems

Use crawl when you want discovery and raw HTML/PDF saving in one pass:

nuktaa crawl

nuktaa index connects to the vector database declared in nukta.config.json. You can override the collection for one run:

nuktaa index --collection my_collection --reset

After indexing, your AI application can retrieve these records from the vector database and use them as grounded context for LLM answers.
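On the application side, retrieval could look roughly like the Python sketch below, which builds a request for Qdrant's REST search endpoint (POST /collections/{collection}/points/search). Several things here are assumptions: the helper names are hypothetical, the query vector must come from the same embedding model used for indexing, and the payload fields Nuktaa stores with each record are not documented in this README, so the sketch returns hits without interpreting them. Newer Qdrant releases also offer a query endpoint that you may prefer.

```python
import json
import urllib.request


def build_search_request(qdrant_url, collection, query_vector, limit=5):
    """Build a nearest-neighbour search request for one Qdrant collection."""
    body = json.dumps(
        {"vector": list(query_vector), "limit": limit, "with_payload": True}
    ).encode("utf-8")
    url = f"{qdrant_url.rstrip('/')}/collections/{collection}/points/search"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(qdrant_url, collection, query_vector, limit=5, timeout=30):
    """Return the raw hits (id, score, payload) from Qdrant's "result" field."""
    request = build_search_request(qdrant_url, collection, query_vector, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["result"]
```

The returned payloads can then be passed to your LLM prompt as grounded context once you have checked which fields your Nuktaa version writes.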

Useful Runtime Flags

nuktaa discover
nuktaa fetch
nuktaa extract --status
nuktaa index --reset

Nuktaa auto-tunes worker counts from CPU, RAM, and GPU visibility. On a laptop it stays conservative; on a large server it increases discover, fetch, extract, OCR, and index concurrency automatically. Normal production runs should not need capacity flags. Use --local, --server, or explicit concurrency flags only when you want to override auto-tuning for diagnostics.

Runtime path resolution order is:

CLI flags such as -o/--output-dir, --seed-file, and --state-db
nearest nukta.config.json

All commands except nuktaa init and nuktaa start, which can create a workspace, must run inside an initialized Nuktaa workspace. If Nuktaa cannot find a nukta.config.json, it stops and asks you to initialize a workspace instead of creating surprise folders in the current directory.
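To make the "nearest nukta.config.json" idea concrete, here is a minimal Python sketch of one plausible lookup. It assumes that "nearest" means the first config found when walking from the current directory up through its parents; the README does not spell out the exact search, and find_workspace is a hypothetical name.

```python
from pathlib import Path

CONFIG_NAME = "nukta.config.json"


def find_workspace(start_dir):
    """Return the closest directory at or above start_dir that holds the config.

    Returns None when no config is found, mirroring the "stop and ask to
    initialize" behavior rather than creating folders.
    """
    current = Path(start_dir).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / CONFIG_NAME).is_file():
            return candidate
    return None
```

A wrapper script could call find_workspace(".") before invoking any nuktaa command and fail early with a clear message when it returns None.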

Compatibility environment variables such as CRAWL_OUTPUT_DIR and PIPELINE_STATE_DB still work for existing scripts, but new installs should run commands from inside the workspace.

Production scraping stays Playwright-only. Normal discover and fetch use a soft browser timeout so very slow pages are marked retryable instead of holding workers for the full timeout. Problem retries still use the full --timeout-ms window. The default soft timeout is already tuned for production; pass --soft-timeout-ms only when diagnosing a specific source.

nuktaa discover --retry-problems --timeout-ms 60000
nuktaa fetch --retry-failed --timeout-ms 60000

For long server runs, pages and browser contexts are recycled automatically to keep Chromium healthy. The defaults work for most installs, but operators can tune them when diagnosing memory or stuck-page behavior:

nuktaa discover --page-recycle-after 100 --context-recycle-after 1000
nuktaa fetch --page-recycle-after 100 --context-recycle-after 1000

The same knobs are available as NUKTAA_SOFT_TIMEOUT_MS, NUKTAA_PAGE_RECYCLE_AFTER, and NUKTAA_CONTEXT_RECYCLE_AFTER.
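When commands are launched from a script, the environment variables can be set per run instead of globally. The Python sketch below builds such an environment. The variable names come from the sentence above; the function name tuning_env and its parameters are hypothetical.

```python
import os


def tuning_env(soft_timeout_ms=None, page_recycle_after=None,
               context_recycle_after=None, base_env=None):
    """Return a copy of the environment with the requested overrides applied.

    Values left as None are not set, so Nuktaa's defaults stay in effect.
    """
    env = dict(os.environ if base_env is None else base_env)
    overrides = {
        "NUKTAA_SOFT_TIMEOUT_MS": soft_timeout_ms,
        "NUKTAA_PAGE_RECYCLE_AFTER": page_recycle_after,
        "NUKTAA_CONTEXT_RECYCLE_AFTER": context_recycle_after,
    }
    for name, value in overrides.items():
        if value is not None:
            # Environment variable values must be strings.
            env[name] = str(value)
    return env
```

For example, subprocess.run(["nuktaa", "discover"], env=tuning_env(page_recycle_after=100, context_recycle_after=1000)) applies the same recycle values as the flags shown earlier to that one run.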

For a capped smoke test on a new server, use a small page cap and debug logs:

nuktaa discover --debug --max-pages 500
nuktaa fetch --debug --max-pages 500

Those flags are for proving an install or collecting diagnostics. Normal pipeline runs are simply nuktaa discover, nuktaa fetch, nuktaa extract, nuktaa audit, and nuktaa index.

Logs

Nuktaa writes command logs and error records inside the configured workspace:

<output-dir>/logs/commands/YYYY-MM-DD/<run-id>.log
<output-dir>/command-errors.jsonl

If a command is slow or fails on a server, keep those files. They are the first thing to inspect when tuning concurrency, timeouts, browser setup, OCR, or provider configuration.
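As a starting point for inspecting the error file, the Python sketch below loads command-errors.jsonl as JSON Lines. The record fields are not documented in this README, so the sketch parses each line without assuming any field names; read_error_records is a hypothetical helper.

```python
import json
from pathlib import Path


def read_error_records(output_dir):
    """Load every parseable record from <output-dir>/command-errors.jsonl."""
    path = Path(output_dir) / "command-errors.jsonl"
    records = []
    if not path.exists():
        return records
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON, such as a line that is
                # still being written while a command runs.
                continue
    return records
```

Printing len(read_error_records("./nukta")) after a run gives a quick count before you open individual records.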

Interactive terminals show a compact live progress line while commands are running. Non-interactive shells and CI keep plain append-only logs.
