ydb-qdrant-indexer
High-level indexing helper for ydb-qdrant.
This package takes plain-text documents, calls an external embedding model, and writes the resulting vectors into YDB collections via the ydb-qdrant programmatic client. It also keeps a small YDB table with indexing metadata (documents, versions, chunk counts, and status).
When to use it
- You already use ydb-qdrant as a Qdrant-compatible vector store backed by YDB.
- You have an external embedding model (OpenAI, local, etc.) and want a reusable ingestion pipeline:
  - Chunk documents with overlap.
  - Batch embed chunks.
  - Upsert vectors into a collection.
  - Track which documents are pending/indexed/failed.
Install
From npm (library usage):
```bash
npm install ydb-qdrant ydb-qdrant-indexer
```

Environment and credentials follow the same conventions as ydb-qdrant:
- YDB_ENDPOINT, YDB_DATABASE.
- One of YDB_SERVICE_ACCOUNT_KEY_FILE_CREDENTIALS, YDB_METADATA_CREDENTIALS, YDB_ACCESS_TOKEN_CREDENTIALS, or YDB_ANONYMOUS_CREDENTIALS.
Core concepts
Embedding model
The indexer does not call any specific provider directly. Instead it expects an EmbeddingModel implementation:
- info.modelId: opaque model identifier (for payloads/observability).
- info.dimension: vector dimension; used to configure the target collection.
- embedDocuments(texts: string[]): Promise<number[][]>: batch embedding call.
- Optional embedQuery(text: string): for query-time use (not required if you only index).
This package provides createEmbeddingModel(adapter, options?) to adapt any concrete provider:
- adapter.modelId / adapter.dimension describe the model.
- adapter.embed(texts) performs the actual API call.
- options.maxBatchSize controls how many texts are sent to embedDocuments at once.
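For reference, the shape the indexer consumes looks roughly like this. This is a sketch inferred from the descriptions above, not the package's published type declarations:

```ts
// Sketch of the EmbeddingModel shape the indexer expects, inferred from the
// descriptions above; the actual exported types may differ in detail.
interface EmbeddingModelInfo {
  modelId: string;   // opaque identifier, recorded in point payloads
  dimension: number; // vector dimension, used to configure the collection
}

interface EmbeddingModel {
  info: EmbeddingModelInfo;
  embedDocuments(texts: string[]): Promise<number[][]>; // batch embedding call
  embedQuery?(text: string): Promise<number[]>;         // optional, query-time only
}
```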
Metadata store
Indexing metadata is stored in a separate YDB table (default name: qdrant_index_documents) via YdbMetadataStore:
- Keys: tenant, collection, docId.
- Columns: version, status (pending | indexed | failed), chunkCount, sourceUri, error, createdAt, updatedAt.
- Helper: createYdbMetadataStore({ endpoint?, database?, connectionString?, tableName? }).
You can swap in your own implementation by providing a custom MetadataStore that implements:
- ensureReady(): create/validate tables.
- upsertDocument(record).
- getDocument(tenant, collection, docId).
- updateStatus(tenant, collection, docId, status, chunkCount?, error?).
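As an illustration, a custom in-memory store might look roughly like the sketch below. It is based only on the method list above; the DocumentRecord shape and exact signatures are assumptions, not the package's types:

```ts
// In-memory MetadataStore sketch. DocumentRecord and the method signatures are
// assumptions reconstructed from the method list above.
type DocumentStatus = "pending" | "indexed" | "failed";

interface DocumentRecord {
  tenant: string;
  collection: string;
  docId: string;
  version: string;
  status: DocumentStatus;
  chunkCount?: number;
  sourceUri?: string;
  error?: string;
}

class InMemoryMetadataStore {
  private records = new Map<string, DocumentRecord>();

  private key(tenant: string, collection: string, docId: string): string {
    return `${tenant}/${collection}/${docId}`;
  }

  async ensureReady(): Promise<void> {
    // Nothing to create or validate for an in-memory store.
  }

  async upsertDocument(record: DocumentRecord): Promise<void> {
    this.records.set(this.key(record.tenant, record.collection, record.docId), record);
  }

  async getDocument(tenant: string, collection: string, docId: string): Promise<DocumentRecord | undefined> {
    return this.records.get(this.key(tenant, collection, docId));
  }

  async updateStatus(
    tenant: string,
    collection: string,
    docId: string,
    status: DocumentStatus,
    chunkCount?: number,
    error?: string,
  ): Promise<void> {
    const existing = this.records.get(this.key(tenant, collection, docId));
    if (existing) {
      this.records.set(this.key(tenant, collection, docId), { ...existing, status, chunkCount, error });
    }
  }
}
```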
Chunking
Documents are represented as IndexDocumentInput:
- id: string: stable document identifier.
- text: string: full plain-text content.
- Optional: sourceUri, metadata (arbitrary record), version (user-supplied version string).
Chunking behaviour:
- Sliding window over characters with configurable overlap (ChunkingOptions).
- Defaults (DEFAULT_CHUNKING_OPTIONS): maxCharacters = 2000, overlapCharacters = 200.
- Helper: chunkDocuments(documents, options?) → DocumentChunk[].
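For example, chunking with smaller windows than the defaults might look like this. This sketch assumes chunkDocuments is importable from the package root, as the helper list above suggests:

```ts
import { chunkDocuments } from "ydb-qdrant-indexer";

const longText = "…some long plain-text content…";

// Override the 2000/200 character defaults with smaller windows.
const chunks = chunkDocuments(
  [{ id: "doc-1", text: longText }],
  { maxCharacters: 500, overlapCharacters: 50 },
);

// Each chunk keeps its document id and character offsets, which the indexer
// later uses to build point ids and payload fields.
console.log(chunks.length);
```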
Indexer client
The main entrypoint is createIndexerClient(), which returns an IndexerClient with a single method:
- indexDocuments(request: IndexDocumentsRequest): Promise<void>.
IndexDocumentsRequest fields:
- tenant?: logical tenant id (defaults to "default").
- collection: target collection name (Qdrant-compatible).
- distance: one of "Cosine" | "Euclid" | "Dot" | "Manhattan".
- vectorSize: embedding dimension for this collection.
- documents: array of IndexDocumentInput.
- embeddingModel: EmbeddingModel or EmbeddingModelWithOptions.
- metadataStore: implementation of MetadataStore (typically createYdbMetadataStore()).
- clientOptions?: forwarded to createYdbQdrantClient (endpoint, database, auth, default tenant, etc.).
- maxUpsertBatchSize?: points per upsertPoints call (default: 64).
- chunking?: overrides default chunking options.
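Put together as a type, the request has roughly the following shape. This is a sketch reconstructed from the field list above, not the package's published declaration, and it assumes the referenced types are exported from the package root:

```ts
import type {
  ChunkingOptions,
  EmbeddingModel,
  EmbeddingModelWithOptions,
  IndexDocumentInput,
  MetadataStore,
} from "ydb-qdrant-indexer";

// Rough shape of IndexDocumentsRequest, reconstructed from the field list above.
interface IndexDocumentsRequestSketch {
  tenant?: string;                                     // defaults to "default"
  collection: string;
  distance: "Cosine" | "Euclid" | "Dot" | "Manhattan";
  vectorSize: number;
  documents: IndexDocumentInput[];
  embeddingModel: EmbeddingModel | EmbeddingModelWithOptions;
  metadataStore: MetadataStore;
  clientOptions?: Record<string, unknown>;             // forwarded to createYdbQdrantClient
  maxUpsertBatchSize?: number;                         // default: 64
  chunking?: ChunkingOptions;
}
```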
What indexDocuments does
For each call:
1. Ensure collection
   - Uses createYdbQdrantClient from ydb-qdrant.
   - Calls getCollection(collection):
     - If the collection exists, validates size and distance against vectorSize / distance from the request.
     - If it does not exist (404), creates it with the requested vector size/distance and data_type: "float".
2. Chunk documents and upsert metadata
   - Splits each document into overlapping chunks via chunkDocuments.
   - Computes chunkCount per document.
   - Computes a version for each document: uses doc.version when provided, otherwise a deterministic SHA-256 hash of doc.text (sketched below this step).
   - Calls metadataStore.upsertDocument for each document with status "pending".
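Conceptually, the default version is just a content hash of the document text, along these lines. This is a sketch using Node's crypto module; the indexer's exact hashing and encoding may differ:

```ts
import { createHash } from "node:crypto";

// Deterministic content hash: unchanged text yields the same version string
// across runs, so re-indexing identical content is detectable.
function contentVersion(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

const doc = { id: "doc-1", text: "Some text to index..." };
const version = contentVersion(doc.text); // used only when doc.version is not supplied
```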
3. Embed and upsert points
   - Batches chunks into groups of size embeddingModel.options.maxBatchSize (or 32 by default).
   - Calls embeddingModel.embedDocuments(texts) for each batch.
   - For each chunk, builds (see the sketch below this step):
     - Deterministic point id: "{docId}:{chunkIndex}".
     - Vector: the embedding returned by the model.
     - Payload object including:
       - doc_id, chunk_index, chunk_start, chunk_end, source_uri.
       - embedding_model (from embeddingModel.info.modelId).
       - Any metadata from the original document, merged in without overriding these reserved keys.
   - Splits points into batches of maxUpsertBatchSize and calls upsertPoints(collection, { points }) via the ydb-qdrant client.
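For intuition, assembling one point from a chunk and its embedding corresponds roughly to the sketch below. The chunk shape is an assumption; the payload keys mirror the description above, with document metadata spread first so the reserved keys cannot be overridden:

```ts
// Sketch of point assembly. ChunkSketch is an assumed shape, not the package's
// DocumentChunk type; the payload keys follow the description above.
interface ChunkSketch {
  docId: string;
  chunkIndex: number;
  start: number;
  end: number;
  sourceUri?: string;
  metadata?: Record<string, unknown>;
}

function buildPoint(chunk: ChunkSketch, vector: number[], modelId: string) {
  return {
    id: `${chunk.docId}:${chunk.chunkIndex}`, // deterministic, so re-indexing overwrites
    vector,
    payload: {
      ...chunk.metadata, // document metadata first, so the reserved keys below win
      doc_id: chunk.docId,
      chunk_index: chunk.chunkIndex,
      chunk_start: chunk.start,
      chunk_end: chunk.end,
      source_uri: chunk.sourceUri,
      embedding_model: modelId,
    },
  };
}
```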
4. Update document status
   - On success: for each document, calls metadataStore.updateStatus(..., "indexed", chunkCount).
   - On failure at any point in embedding or upsert:
     - Marks all documents as "failed", attaching the error message via metadataStore.updateStatus(..., "failed", chunkCount, errorMessage).
     - Rethrows the original error.
Minimal usage example (TypeScript)
This example sketches how to wire an external embedding API, the metadata store, and the indexer client together. It omits details such as the HTTP client implementation for brevity.
```ts
import {
  createEmbeddingModel,
  createIndexerClient,
  createYdbMetadataStore,
  type EmbeddingModelAdapter,
} from "ydb-qdrant-indexer";
const adapter: EmbeddingModelAdapter = {
modelId: "my-embeddings-001",
dimension: 768,
async embed(texts) {
// Call your embedding provider here and return number[][]
return texts.map(() => new Array(768).fill(0));
},
};
const embeddingModel = createEmbeddingModel(adapter, { maxBatchSize: 16 });
const metadataStore = createYdbMetadataStore({
endpoint: process.env.YDB_ENDPOINT,
database: process.env.YDB_DATABASE,
});
const indexer = createIndexerClient();
await indexer.indexDocuments({
tenant: "my-tenant",
collection: "documents",
distance: "Cosine",
vectorSize: 768,
documents: [
{
id: "doc-1",
text: "Some text to index...",
sourceUri: "file:///docs/doc-1.txt",
metadata: { language: "en" },
},
],
embeddingModel,
metadataStore,
clientOptions: {
// forwarded to createYdbQdrantClient
endpoint: process.env.YDB_ENDPOINT,
database: process.env.YDB_DATABASE,
},
});
```

Notes and guarantees
- Idempotent upserts: point ids are deterministic (docId:chunkIndex), so re-running indexing for the same document content overwrites points instead of duplicating them.
- Document versioning: version is stored in the metadata table so you can compare current vs. previous content and decide whether to reindex.
- Metadata independence: the metadata table is separate from qdrant_all_points, so you can query indexing status without touching the main points table.
- Embedding provider agnostic: any model that can be adapted to EmbeddingModelAdapter can be used (cloud API, on-prem model server, etc.).
