CortexDB TypeScript SDK
██████╗  ██████╗  ██████╗  ██████╗ ██████╗ 
██╔══██╗██╔═══██╗██╔═══██╗██╔═══██╗██╔══██╗
██║  ██║██║   ██║██║   ██║██║   ██║██████╔╝
██║  ██║██║   ██║██║   ██║██║   ██║██╔══██╗
██████╔╝╚██████╔╝╚██████╔╝╚██████╔╝██║  ██║
╚═════╝  ╚═════╝  ╚═════╝  ╚═════╝ ╚═╝  ╚═╝
Official TypeScript/JavaScript SDK for CortexDB
What is CortexDB?
CortexDB is a multi-modal RAG (Retrieval Augmented Generation) platform that combines traditional database capabilities with vector search and advanced document processing. It enables you to:
- Store structured and unstructured data in a unified database
- Automatically extract text from documents (PDF, DOCX, XLSX) using Docling
- Generate embeddings for semantic search using various providers (OpenAI, Gemini, etc.)
- Perform hybrid search combining filters with vector similarity
- Build RAG applications with automatic chunking and vectorization
CortexDB handles the complex infrastructure of vector databases (Qdrant), object storage (MinIO), and traditional databases (PostgreSQL) behind a simple API.
Features
- Multi-modal document processing: Upload PDFs, DOCX, XLSX files and automatically extract text with OCR fallback
- Semantic search: Vector-based search using embeddings from OpenAI, Gemini, or custom providers
- Automatic chunking: Smart text splitting optimized for RAG applications
- Flexible schema: Define collections with typed fields (string, number, boolean, file, array)
- Hybrid queries: Combine exact filters with semantic search
- Storage control: Choose where each field is stored (PostgreSQL, Qdrant, MinIO)
- Type-safe: Full TypeScript support with comprehensive type definitions
- Modern API: Async/await using native fetch (Node.js 18+)
- Infra management: Database (client.databases) and embedding provider (client.embeddingProviders) APIs built-in
- 🆕 TypeScript Decorators: Define schemas using decorators (like TypeORM) with full IDE support - see Schema Decorators Guide
Installation
npm install @dooor-ai/cortexdb
Or with yarn:
yarn add @dooor-ai/cortexdb
Or with pnpm:
pnpm add @dooor-ai/cortexdb
Quick Start
import { CortexClient, FieldType, StoreLocation } from '@dooor-ai/cortexdb';
async function main() {
// Initialize with database in connection string
const client = new CortexClient('cortexdb://my-api-key@localhost:8000/production');
// Create a collection with vectorization enabled
await client.collections.create(
'documents',
[
{ name: 'title', type: FieldType.STRING },
{ name: 'content', type: FieldType.TEXT, vectorize: true },
{ name: 'published_at', type: FieldType.DATETIME, store_in: [StoreLocation.POSTGRES] }
],
'your-embedding-provider-id' // Required when vectorize=true
// database parameter is optional here since we set 'production' as default
);
// Create a record
const record = await client.records.create('documents', {
title: 'Introduction to AI',
content: 'Artificial intelligence is transforming how we build software...'
});
// Semantic search - finds relevant content by meaning, not just keywords
const results = await client.records.search(
'documents',
'How is AI changing software development?',
undefined, // filters
10 // limit - database parameter optional since we have default
);
results.results.forEach(result => {
console.log(`Score: ${result.score.toFixed(4)}`);
console.log(`Title: ${result.record.data.title}`);
console.log(`Content: ${result.record.data.content}\n`);
});
await client.close();
}
main();
Project-Specific Typing
The SDK becomes fully type-safe once you apply your YAML schema with the Dooor CLI:
npx dooor schema apply # reads dooor/schemas by default and generates types in dooor/generated/
This command creates dooor/generated/cortex-schema.ts and automatically augments the SDK types. After the file exists in your project, you can keep importing CortexClient from @dooor-ai/cortexdb; TypeScript will infer the fields/collections defined in your YAML. Invalid field names or missing required properties inside client.records.create('my_collection', {...}) now trigger compile-time errors, Prisma-style.
If you need an explicit factory, the generated file also exports createCortexClient() and TypedCortexClient helpers.
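For example, a minimal sketch of the factory route, assuming createCortexClient() accepts the same connection string as the CortexClient constructor (check the generated file for the exact signature):
import { createCortexClient } from '../dooor/generated/cortex-schema';
// Hypothetical usage: the factory returns a client already bound to your generated schema
const typedClient = createCortexClient(process.env.CORTEXDB_CONNECTION!);
// Field names come from your own YAML schema (tool_calls is the example used below)
await typedClient.records.create('tool_calls', {
  chatId: 'chat-123',
  description: 'RAG invocation summary',
  createdAt: new Date().toISOString(),
});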
ℹ️ The CLI also drops a lightweight .d.ts shim in node_modules/@dooor-ai/cortexdb/generated/schema.d.ts, so TypeScript picks up your schema automatically; there is no need to tweak tsconfig.json.
Prisma-like Records Delegates
Once the schema is generated, you can call collections with property access instead of passing strings:
// Fully typed
const record = await client.records.tool_calls.create({
chatId: "chat-123",
description: "RAG invocation summary",
createdAt: new Date().toISOString(),
});
// String form still available when you need something dynamic
await client.records.create("tool_calls", {
chatId,
description,
createdAt,
});
Usage
Initialize Client
import { CortexClient } from '@dooor-ai/cortexdb';
// Using connection string with database (recommended)
const client = new CortexClient('cortexdb://my-api-key@localhost:8000/production');
// Without database in connection string (must pass database to each method)
const client = new CortexClient('cortexdb://my-api-key@localhost:8000');
// Production (HTTPS auto-detected)
const client = new CortexClient('cortexdb://[email protected]/production');
// Using options object (alternative)
const client = new CortexClient({
baseUrl: 'http://localhost:8000',
apiKey: 'your-api-key',
database: 'production', // Optional: set default database
timeout: 1800000, // Optional: override timeout (default = 30 min to cover large uploads)
waitUntilComplete: true, // Optional: keep SDK waiting for async ingestion to finish (default = true)
});
Connection String Format:
cortexdb://[api_key@]host[:port][/database]
Benefits:
- Single string configuration
- Easy to store in environment variables
- Familiar pattern (like PostgreSQL, MongoDB, Redis)
- Auto-detects HTTP vs HTTPS
- Optional database specification for multi-tenant isolation
Database Parameter:
- If you specify a database in the connection string or options, it becomes the default for all operations
- You can override the default database on a per-method basis
- If no default database is set, you must pass the database parameter to each method (see the example below)
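For example, using the positional database argument shown throughout this README ('staging' is just a placeholder name):
const client = new CortexClient('cortexdb://my-api-key@localhost:8000/production');
// Uses the default database from the connection string ('production')
const defaultCollections = await client.collections.list();
// Override the default for a single call
const stagingCollections = await client.collections.list('staging');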
Async File Uploads & Processing
Large documents (PDFs, DOCXs, etc.) are ingested asynchronously to avoid timeouts. When you call client.records.create(...) the gateway now responds immediately with a payload like:
{
"id": "rec_123",
"status": "pending",
"processing_state": {
"record_id": "rec_123",
"status": "pending",
"processed_chunks": 0,
"total_chunks": 0
}
}
By default the SDK keeps polling the processing_state endpoint until the background worker finishes and only then resolves with the final CreateRecordResponse. That preserves backward compatibility with existing backends that expect a fully processed record once create() returns.
You can control this behavior:
// Return immediately (HTTP 202) and poll manually later
const pending = await client.records.create(
'documents',
{ title: 'Async', content: '...' },
undefined,
{ waitUntilComplete: false }
);
// Later in your workflow…
const status = await client.records.getStatus('documents', pending.id);
if (status?.status === 'completed') {
const finalRecord = await client.records.waitForCompletion('documents', pending.id);
}
Useful options (see the example below):
- waitUntilComplete (default true): let the SDK poll automatically.
- pollingIntervalMs (default 5000): change how often the SDK checks status.
- timeoutMs (default 30 min): upper bound for the auto-poll loop.
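For example, a sketch that assumes pollingIntervalMs and timeoutMs are accepted alongside waitUntilComplete in the per-call options object shown above:
const record = await client.records.create(
  'documents',
  { title: 'Async', content: '...' },
  undefined,
  {
    waitUntilComplete: true,    // let the SDK poll for you
    pollingIntervalMs: 2000,    // assumed: check status every 2 seconds
    timeoutMs: 10 * 60 * 1000,  // assumed: give up after 10 minutes
  }
);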
Under the hood the SDK calls GET /records/{id}/status until the worker updates the processing_state to completed or failed. You can also call that endpoint directly via client.records.getStatus(...) to drive custom progress indicators.
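For a custom progress indicator, a minimal polling loop might look like this (assuming getStatus returns the processed_chunks / total_chunks fields shown in the processing_state payload above):
const pending = await client.records.create(
  'documents',
  { title: 'Big PDF', content: '...' },
  undefined,
  { waitUntilComplete: false }
);
let status = await client.records.getStatus('documents', pending.id);
while (status && status.status !== 'completed' && status.status !== 'failed') {
  console.log(`Processed ${status.processed_chunks}/${status.total_chunks} chunks`);
  await new Promise(resolve => setTimeout(resolve, 5000)); // same 5s cadence as the SDK default
  status = await client.records.getStatus('documents', pending.id);
}
console.log(`Final status: ${status?.status}`);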
Databases
// Create database
await client.databases.create({ name: 'ai_docs', description: 'Knowledge base' });
// List databases
const databases = await client.databases.list();
// Delete database
await client.databases.delete('ai_docs');
Embedding Providers
await client.embeddingProviders.create({
name: 'Gemini Flash',
provider: 'gemini',
embedding_model: 'models/text-embedding-004',
api_key: process.env.GEMINI_API_KEY!,
});
const providers = await client.embeddingProviders.list();
Collections
Collections define the schema for your data. Each collection can have multiple fields with different types and storage options.
import { FieldType, StoreLocation } from '@dooor-ai/cortexdb';
// Create collection with vectorization (database required)
const collection = await client.collections.create(
'articles',
[
{
name: 'title',
type: FieldType.STRING
},
{
name: 'content',
type: FieldType.TEXT,
vectorize: true // Enable semantic search on this field
},
{
name: 'year',
type: FieldType.INT,
store_in: [StoreLocation.POSTGRES, StoreLocation.QDRANT_PAYLOAD]
}
],
'embedding-provider-id', // Required when any field has vectorize=true
'production' // Database name (or omit if default database is set)
);
// List collections (uses default database if set, or pass specific database)
const collections = await client.collections.list('production');
// Get collection schema
const schema = await client.collections.get('articles', 'production');
// Delete collection and all its records
await client.collections.delete('articles', 'production');
// If you set a default database in the client, you can omit it:
const client = new CortexClient('cortexdb://key@host:8000/production');
const collections = await client.collections.list(); // Uses 'production'
Records
Records are the actual data stored in collections. They must match the collection schema.
import fs from 'node:fs';
// Create record (with optional file upload and database)
const created = await client.records.create(
'articles',
{
title: 'Machine Learning Basics',
content: 'Machine learning is a subset of AI focused on learning from data...',
year: 2024,
},
{
attachment: fs.readFileSync('ml-intro.pdf'),
},
'production' // Database name
);
// Get record by ID
const fetched = await client.records.get('articles', created.id, 'production');
// Update record
const updated = await client.records.update('articles', created.id, {
year: 2025,
}, 'production');
// Delete record
await client.records.delete('articles', created.id, 'production');
// List records with filters/pagination
const results = await client.records.list('articles', {
limit: 10,
offset: 0,
filters: { year: { $gte: 2023 } },
});
Schema CLI (YAML)
Install the CLI (recommended in devDependencies):
npm install --save-dev dooor
Use the unified dooor CLI to synchronize declarative schemas.
Also install the "Dooor Tools" extension in VS Code/Cursor for real-time validation (Open VSX).
# Check differences between local YAML and CortexDB
npx dooor schema diff --dir dooor/schemas
# Create collections that don't exist yet
npx dooor schema apply --dir dooor/schemas
# Apply without generating types (by default apply already generates them)
npx dooor schema apply --no-generate-types
# Generate TypeScript types for use in services
npx dooor schema generate-types --dir dooor/schemas --out src/generated/cortex-schema.ts
Automatic Collection Typing
After synchronizing the schema, the CLI generates dooor/generated/cortex-schema.ts with derived types. Provide this schema to the SDK to get Prisma-like autocomplete and validation:
import { CortexClient } from '@dooor-ai/cortexdb';
import type {
CortexGeneratedSchema,
CollectionCreateInput,
} from '../dooor/generated/cortex-schema';
const client = new CortexClient<CortexGeneratedSchema>(
process.env.CORTEXDB_CONNECTION!,
);
const payload: CollectionCreateInput<'tool_calls'> = {
chatId,
workspaceId,
toolName,
description,
toolOutput,
createdAt: new Date().toISOString(),
};
await client.records.create('tool_calls', payload);
Generics propagate to records.update, records.list, records.get, and records.search. If you prefer the old dynamic mode, instantiate new CortexClient() without the generic parameter.
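For example, with the generic client above, list and search results are typed per collection (field names come from the tool_calls schema used earlier; semantic search assumes at least one tool_calls field is vectorized):
// Typed list with filters - invalid field names fail at compile time
const recentCalls = await client.records.list('tool_calls', {
  limit: 20,
  filters: { chatId: 'chat-123' },
});
// Typed semantic search - record.data is narrowed to the tool_calls fields
const hits = await client.records.search('tool_calls', 'RAG invocation summary', undefined, 5);
hits.results.forEach(hit => {
  console.log(hit.score, hit.record.data.description);
});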
Set CORTEXDB_CONNECTION (e.g., cortexdb://key@host:8000) or the CORTEXDB_BASE_URL + CORTEXDB_API_KEY variables before running commands. If no directory is specified, the CLI automatically looks in dooor/schemas.
To avoid repeating flags, configure dooor/config.yaml at the project root:
cortexdb:
connection: env(CORTEXDB_CONNECTION)
defaultEmbeddingProvider: default-provider
schema:
dir: dooor/schemas
typesOut: dooor/generated/cortex-schema.ts
You can override with dooor/config.local.yaml or point to another path via DOOOR_CONFIG.
Semantic Search
Semantic search finds records by meaning, not just exact keyword matches. It uses vector embeddings to understand context.
// Basic semantic search
const results = await client.records.search(
'articles',
'machine learning fundamentals',
undefined,
10
);
// Search with filters - combine semantic search with exact matches
const filteredResults = await client.records.search(
'articles',
'neural networks',
{
year: 2024,
category: 'AI'
},
5
);
// Process results - ordered by relevance score
filteredResults.results.forEach(result => {
console.log(`Score: ${result.score.toFixed(4)}`); // Higher = more relevant
console.log(`Title: ${result.record.data.title}`);
console.log(`Year: ${result.record.data.year}`);
});
Working with Files
CortexDB can process documents and automatically extract text for vectorization.
// Create collection with file field
await client.collections.create(
'documents',
[
{ name: 'title', type: FieldType.STRING },
{
name: 'document',
type: FieldType.FILE,
vectorize: true // Extract text and create embeddings
}
],
'embedding-provider-id'
);
// Note: File upload support is currently available in the REST API
// TypeScript SDK file upload will be added in a future version
Filter Operators
// Exact match filters
const results = await client.records.list('articles', {
filters: {
category: 'technology',
published: true,
year: 2024
}
});
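// Range filter - $gte as used in the Records listing example above;
// support for other comparison operators may depend on your gateway version
const recent = await client.records.list('articles', {
  filters: {
    category: 'technology',
    year: { $gte: 2023 }
  }
});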
// Combine multiple filters
const filtered = await client.records.list('articles', {
filters: {
year: 2024,
category: 'AI',
author: 'John Doe'
},
limit: 20
});
Error Handling
The SDK provides specific error types for different failure scenarios.
import {
CortexDBError,
CortexDBNotFoundError,
CortexDBValidationError,
CortexDBConnectionError,
CortexDBTimeoutError
} from '@dooor-ai/cortexdb';
try {
const record = await client.records.get('articles', 'invalid-id');
} catch (error) {
if (error instanceof CortexDBNotFoundError) {
console.log('Record not found');
} else if (error instanceof CortexDBValidationError) {
console.log('Invalid data:', error.message);
} else if (error instanceof CortexDBConnectionError) {
console.log('Connection failed:', error.message);
} else if (error instanceof CortexDBTimeoutError) {
console.log('Request timed out:', error.message);
} else if (error instanceof CortexDBError) {
console.log('General error:', error.message);
}
}
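Transient failures such as timeouts or dropped connections are often worth retrying. A minimal retry sketch built on the error classes above (the backoff policy is only an illustration, not part of the SDK):
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      const transient =
        error instanceof CortexDBTimeoutError ||
        error instanceof CortexDBConnectionError;
      if (!transient || i === attempts) throw error;
      await new Promise(resolve => setTimeout(resolve, 1000 * i)); // simple linear backoff
    }
  }
  throw new Error('unreachable');
}
const record = await withRetry(() => client.records.get('articles', 'some-id'));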
Examples
Check the examples/ directory for complete working examples:
- quickstart.ts - Complete walkthrough of SDK features
- search.ts - Semantic search with filters and providers
- basic.ts - Basic CRUD operations
Run examples:
npx ts-node -O '{"module":"commonjs"}' examples/quickstart.ts
Development
Setup
# Clone repository
git clone https://github.com/yourusername/cortexdb
cd cortexdb/clients/typescript
# Install dependencies
npm install
# Build
npm run build
Scripts
# Build TypeScript
npm run build
# Build in watch mode
npm run build:watch
# Clean build artifacts
npm run clean
# Lint code
npm run lint
# Format code
npm run format
Requirements
- Node.js >= 18.0.0 (for native fetch support)
- CortexDB gateway running locally or remotely
- Embedding provider configured (OpenAI, Gemini, etc.) if using vectorization
Architecture
CortexDB integrates multiple technologies:
- PostgreSQL: Stores structured data and metadata
- Qdrant: Vector database for semantic search
- MinIO: Object storage for files
- Docling: Advanced document processing and text extraction
The SDK abstracts this complexity into a simple, unified API.
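To make the storage mapping concrete, here is a sketch of a collection whose fields target different backends. Only the StoreLocation values shown elsewhere in this README are used; the collection and field names are placeholders:
await client.collections.create(
  'reports',
  [
    // Structured metadata in PostgreSQL, mirrored into the Qdrant payload for filtering
    { name: 'department', type: FieldType.STRING, store_in: [StoreLocation.POSTGRES, StoreLocation.QDRANT_PAYLOAD] },
    // Long text vectorized for semantic search in Qdrant
    { name: 'summary', type: FieldType.TEXT, vectorize: true },
    // Uploaded files land in object storage (MinIO) and are processed by Docling
    { name: 'report_file', type: FieldType.FILE, vectorize: true },
  ],
  'embedding-provider-id'
);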
Advanced RAG Strategies (v0.4.0+)
CortexDB now supports multiple RAG strategies to improve search quality and relevance. Choose the strategy that best fits your use case:
Available Strategies
- SIMPLE: Basic vector similarity search (default)
- MULTI_QUERY: Generate multiple query variations and combine results using Reciprocal Rank Fusion
- HYDE: Generate hypothetical documents and use them for improved retrieval
- RERANK: Use LLM to rerank search results by relevance
- FUSION: Combine multi-query expansion with LLM reranking
- CONTEXTUAL_QUERY: Reformulate queries based on conversation context
Setup AI Providers
Before using advanced strategies, configure an AI provider:
// Create an AI provider for query expansion/reranking
const aiProvider = await client.aiProviders.create({
name: "Gemini Flash",
provider: "gemini",
api_key: "your-gemini-api-key",
model: "gemini-1.5-flash",
enabled: true,
});
// List providers
const providers = await client.aiProviders.list();
// Update provider
await client.aiProviders.update(aiProvider.id, {
model: "gemini-2.0-flash",
});
Using Advanced Search
import { RAGStrategy } from '@dooor-ai/cortexdb';
// Simple search (default)
const simpleResults = await client.records.searchAdvanced('documents', {
query: 'What is machine learning?',
limit: 10,
strategy: RAGStrategy.SIMPLE,
});
// Multi-query with automatic query expansion
const multiQueryResults = await client.records.searchAdvanced('documents', {
query: 'What is machine learning?',
limit: 10,
strategy: RAGStrategy.MULTI_QUERY,
strategyConfig: {
num_queries: 5, // Generate 5 query variations
},
aiProviderName: "Gemini Flash", // Use provider by name
});
// HyDE: Generate hypothetical document for better retrieval
const hydeResults = await client.records.searchAdvanced('documents', {
query: 'Explain neural networks',
limit: 10,
strategy: RAGStrategy.HYDE,
strategyConfig: {
document_length: 200, // Length of hypothetical document
},
aiProviderName: "Gemini Flash",
});
// Rerank: Use LLM to reorder results by relevance
const rerankResults = await client.records.searchAdvanced('documents', {
query: 'Benefits of deep learning',
limit: 10,
strategy: RAGStrategy.RERANK,
strategyConfig: {
initial_k: 50, // Fetch 50 results then rerank to top 10
},
aiProviderName: "Gemini Flash",
});
// Fusion: Best of both worlds (multi-query + reranking)
const fusionResults = await client.records.searchAdvanced('documents', {
query: 'How does AI work?',
limit: 10,
strategy: RAGStrategy.FUSION,
strategyConfig: {
num_queries: 5,
initial_k: 50,
},
aiProviderName: "Gemini Flash",
});
// Contextual: Reformulate query based on conversation history
const contextualResults = await client.records.searchAdvanced('documents', {
query: 'What about its applications?',
limit: 10,
strategy: RAGStrategy.CONTEXTUAL_QUERY,
strategyConfig: {
context: [
'Previous: What is machine learning?',
'Answer: Machine learning is a subset of AI...',
],
},
aiProviderName: "Gemini Flash",
});
// Access results
fusionResults.results.forEach(result => {
console.log(`Score: ${result.score}`);
console.log(`Content: ${result.record.content}`);
console.log(`Strategy used: ${fusionResults.strategy_used}`);
});
Collection-Specific Delegates
The advanced search is also available on collection delegates:
// Using the facade pattern
const results = await client.records.documents.searchAdvanced({
query: 'Machine learning applications',
strategy: RAGStrategy.FUSION,
aiProviderName: "Gemini Flash",
});
Performance Tips
- SIMPLE: Fastest, use for basic semantic search
- MULTI_QUERY: 5x slower than simple (generates 5 queries)
- HYDE: Similar to multi-query, good for questions
- RERANK: Moderate cost, great for accuracy improvement
- FUSION: Highest cost and latency, best quality
- CONTEXTUAL_QUERY: Use for conversational interfaces
For more details, see RAG Strategies Documentation.
License
MIT License - see LICENSE for details.
Related
- CortexDB Python SDK - Python client for CortexDB
- CortexDB Documentation - Complete platform documentation
