ark-kb
v1.0.1
Published
ποΈ Ark Knowledge Base β Personal knowledge base powered by LanceDB + Multimodal Embedding (Qwen3-VL-8B)
Maintainers
Readme
ποΈ Ark KB
Ark Knowledge Base β Drop files in a folder. Everything else is automatic.
A personal knowledge base for AI assistants. Powered by LanceDB + MinerU + Multimodal Embedding. Auto-detects file types, auto-parses, auto-chunks, auto-embeds, auto-indexes. 90+ file formats, 14-file codebase, ~7,000 lines.
β¨ What It Does
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Your Knowledge Folder β
β β
β π report.pdf π data.xlsx π½οΈ deck.pptx β
β π notes.md πΌοΈ photo.jpg π¬ demo.mp4 β
β π contract.docx π§ email attachments β
β β
β β fs.watch (real-time) β
β β detectFileKind (90+ formats) β
β β complexity detection (Office XML) β
β β MinerU / VLM / direct extract β
β β chunkText (paragraph / fixed / sentence) β
β β embed (vectorize) β
β β upsert to LanceDB β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
You ask: "What was the Q2 revenue forecast?"
β
I search: semantic hybrid search β find the .xlsx chunk β show resultsπ Supported Formats (90+)
π Text (46 formats)
.md .txt .csv .html .htm .json .yaml .yml .xml .toml
.py .js .ts .jsx .tsx .java .c .cpp .h .go .rs .rb .php .sh .bash .sql .r .scala .lua
.css .scss .less .vue .swift .kt .dart
.log .conf .cfg .ini .env .tex .rst .org .adoc
π’ Office (6 formats)
| Format | Simple | Complex | Binary |
|--------|--------|---------|--------|
| Word | .docx (text only) β pipeline | .docx (formulas/images) β vlm | .doc β vlm |
| Excel | .xlsx (data only) β pipeline | .xlsx (charts/formulas) β vlm | .xls β vlm |
| PPT | .pptx (text only) β pipeline | .pptx (charts/animations) β vlm | .ppt β vlm |
Complexity detection: For
.docx/.xlsx/.pptx, the plugin reads the ZIP-internal XML to detect formulas, charts, images, pivot tables, animations, etc. (10 markers each). Complex files automatically route tovlmmodel for higher accuracy.
π PDF
.pdf β MinerU vlm (formulas, tables, images all extracted)
π HTML
.html .htm β MinerU MinerU-HTML model (structured extraction)
πΌοΈ Images (17 formats)
.png .jpg .jpeg .jfif .webp .gif .bmp .svg .tiff .tif .ico .heic .heif .raw .cr2 .nef .arw
β VLM summary β text embedding (configurable to multimodal direct embedding)
π¬ Video (11 formats)
.mp4 .mov .avi .mkv .webm .wmv .flv .m4v .3gp .ogv .ts
β VLM frame analysis β summary β text embedding
π§ Email Ingestion
IMAP-based (imapflow), auto-polls inbox, extracts .txt .md .pdf .doc .docx .ppt .pptx .xls .xlsx .html .htm .png .jpg .jpeg .gif .svg .webp .bmp attachments.
π§ How It Works
Ingestion Pipeline
file dropped / email received
β
detectFileKind() β FileKind
β
ββ text / code β direct chunking
ββ pdf β MinerU API v4 (precision parsing)
ββ docx/pptx/xlsx β isComplex() β pipeline | vlm β MinerU
ββ doc/ppt/xls β MinerU vlm (binary)
ββ html β MinerU MinerU-HTML
ββ image β VLM describe β text embed (or multimodal)
ββ video β VLM frame β summary β text embed
β
chunkText() (paragraph / fixed / sentence strategy)
β
embedder.embed() β vectors
β
store.upsert() β LanceDBChunking Strategies
| Strategy | Behavior |
|----------|----------|
| paragraph | Split on blank lines, merge up to maxTokens |
| fixed | Fixed-size window with overlap |
| sentence | Split on sentence-ending punctuation |
Hash Deduplication
SHA256-based content dedup with hash-level locking to prevent concurrent ingestion of identical files with different names.
UID-based Email Tracking
Persists lastProcessedUID to avoid re-processing emails across restarts. Failed emails go to retry queue tracked by UID.
π οΈ OpenClaw Tools
| Tool | Description |
|------|-------------|
| kb_search | Hybrid semantic + keyword search with optional fileType filter |
| kb_ingest | Manually trigger indexing of a file or all files |
| kb_remove | Remove indexed chunks for a given source file |
| kb_status | Show total chunks, indexed files, config summary |
π Architecture
βββββββββββββββββββββββββββββββββββββββ
β Storage Layer β
β /path/to/knowledge/ (filesystem) β
β ~/.ark-kb/* (state, LanceDB) β
ββββββββββββββββ¬βββββββββββββββββββββββ
β
ββββββββββββββββΌβββββββββββββββββββββββ
β Index Layer β
β LanceDB (embedded, zero-ops) β
β IVF-PQ ANN search, millisecond β
ββββββββββββββββ¬βββββββββββββββββββββββ
β
ββββββββββββββββΌβββββββββββββββββββββββ
β Retrieval Layer β
β Hybrid: vector + BM25 β
β Optional: reranking (BGE-m3) β
β fileType filter, pagination β
βββββββββββββββββββββββββββββββββββββββπ© Technical Stack
| Component | Technology | |---|---| | Vector DB | LanceDB (embedded, zero-ops) | | PDF/Office parsing | MinerU API v4 (precision parsing + complexity routing) | | Embedding | Configurable (MiniMax / Qwen / etc.) | | Image understanding | VLM (MiniMax-VL / Qwen-VL) | | Video summarization | VLM frame analysis | | File watching | fs.watch + debounce | | Email | imapflow (IMAP, configurable polling) | | Search algo | IVF-PQ ANN + BM25 keyword | | Chunking | paragraph / fixed / sentence strategies | | Dedup | SHA256 content hash + hash-level locking | | Runtime | Node.js / TypeScript |
π Quick Start
# 1. Install
npm install ark-kb
# 2. Configure (plugin-config.json)
{
"knowledgePath": "/path/to/knowledge",
"embedding": { "model": "qwen/Qwen3-VL-Embedding-8B" },
"pdfParser": {
"api": "mineru",
"endpoint": "https://mineru.net/api/v4/extract/task",
"apiKey": "your-mineru-token"
}
}
# 3. Drop files into /path/to/knowledge
# β Auto-indexed. No action needed.
# 4. Search via OpenClaw tool
kb_search(query="Q2 revenue forecast", fileType="xlsx")π License
AGPL v3 Β© njuboy11
A small ark that holds your world. ποΈ
