
Vector

Vector indexer module.

VectorDBIndexer(data_store_map=None, em_invoker_map=None, cache_size=DEFAULT_CACHE_SIZE, retryable_exceptions=None)

Bases: BaseIndexer

Index elements into a vector datastore capability.

Initialize the indexer with mappings for vector DB capabilities and embeddings.

Parameters:

Name Type Description Default
data_store_map dict[str, Type[BaseDataStore]] | None

Mapping of db_engine strings to BaseDataStore classes. If not provided, uses DEFAULT_DATA_STORE_MAP, which includes "chroma", "elasticsearch", and "opensearch". Defaults to None.

None
em_invoker_map dict[str, Type[BaseEMInvoker]] | None

Mapping of provider strings to embedding classes (BaseEMInvoker subclasses). If not provided, uses DEFAULT_EM_INVOKER_MAP, which includes "azure-openai", "bedrock", "google", "openai", and "voyage". Defaults to None.

None
cache_size int

Maximum number of vector capability instances to cache using LRU policy. Defaults to DEFAULT_CACHE_SIZE (128).

DEFAULT_CACHE_SIZE
retryable_exceptions tuple[type[Exception], ...] | None

Tuple of exception types to retry on during batch processing. Defaults to DEFAULT_RETRYABLE_EXCEPTIONS.

None

delete_chunk(chunk_id, file_id, **kwargs)

Delete a single chunk by chunk ID and file ID.

If the chunk is not found, the deletion is treated as successful (nothing to delete).

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to delete.

required
file_id str

The ID of the file the chunk belongs to.

required
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
db_config (dict[str, Any]): Vector DB configuration.
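The routing kwargs above are shared by every method on this page. A minimal sketch of what such a payload might look like and how the "provider/model_name" string splits; the db_config keys ("url", "index") and values are illustrative placeholders, not part of the documented API:

```python
# Illustrative routing kwargs shared by the indexer methods.
# The db_config keys and the URL are placeholders, not a documented schema.
kwargs = {
    "vectorizer_kwargs": {"model": "openai/text-embedding-ada-002"},
    "db_engine": "elasticsearch",
    "db_config": {"url": "http://localhost:9200", "index": "my-index"},
}

# The "model" value encodes provider and model name separated by "/".
provider, model_name = kwargs["vectorizer_kwargs"]["model"].split("/", 1)
print(provider)    # openai
print(model_name)  # text-embedding-ada-002
```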

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response with success status and error message.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

delete_file_chunks(file_id, **kwargs)

Delete all chunks for a specific file.

  • No index: treated as success (nothing to delete).
  • Index exists, no matching chunks: success.
  • Index exists, matching chunks: success if delete succeeds, otherwise failed.
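The three cases above can be sketched as a small decision helper; this is illustrative only and not the library's implementation:

```python
def delete_outcome(index_exists: bool, matched: int, delete_ok: bool) -> dict:
    """Map the documented delete_file_chunks cases to a response dict (illustrative)."""
    if not index_exists or matched == 0:
        # No index, or no matching chunks: nothing to delete, still a success.
        return {"success": True, "error_message": ""}
    if delete_ok:
        return {"success": True, "error_message": ""}
    return {"success": False, "error_message": "delete failed"}

print(delete_outcome(index_exists=False, matched=0, delete_ok=False))
# {'success': True, 'error_message': ''}
```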

Parameters:

Name Type Description Default
file_id str

The ID of the file whose chunks should be deleted.

required
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
db_config (dict[str, Any]): Vector DB configuration.

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response with success status and error message.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

get_chunk(chunk_id, file_id, **kwargs)

Get a single chunk by chunk ID and file ID.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to retrieve.

required
file_id str

The ID of the file the chunk belongs to.

required
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
db_config (dict[str, Any]): Vector DB configuration.

Returns:

Type Description
dict[str, Any] | None

dict[str, Any] | None: The chunk data, or None if not found.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

get_file_chunks(file_id, page=0, size=20, **kwargs)

Get chunks for a specific file with pagination support.

Parameters:

Name Type Description Default
file_id str

The ID of the file to get chunks from.

required
page int

The page number (0-indexed). Defaults to 0.

0
size int

The number of chunks per page. Defaults to 20.

20
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
db_config (dict[str, Any]): Vector DB configuration.

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response containing:

  1. chunks (list[dict[str, Any]]): List of chunks (elements) with text, structure, and metadata.
  2. pagination (dict[str, Any]): Pagination metadata with:
     - page (int): Current page number.
     - size (int): Number of items per page.
     - total_chunks (int): Total number of chunks for the file.
     - total_pages (int): Total number of pages.
     - has_next (bool): Whether there is a next page.
     - has_previous (bool): Whether there is a previous page.
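The pagination metadata can be derived from the total chunk count alone. A sketch of the arithmetic, assuming 0-indexed pages as documented (the helper name is hypothetical, not part of the API):

```python
import math

def pagination_meta(total_chunks: int, page: int = 0, size: int = 20) -> dict:
    """Compute the pagination block described above (illustrative helper)."""
    total_pages = math.ceil(total_chunks / size) if total_chunks else 0
    return {
        "page": page,
        "size": size,
        "total_chunks": total_chunks,
        "total_pages": total_pages,
        "has_next": page < total_pages - 1,
        "has_previous": page > 0,
    }

# 45 chunks at 20 per page -> 3 pages; page 1 has both neighbours.
meta = pagination_meta(total_chunks=45, page=1, size=20)
print(meta["total_pages"], meta["has_next"], meta["has_previous"])  # 3 True True
```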

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

index_chunk(element, **kwargs)

Index a single chunk.

This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
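Because index_chunk leaves neighbours untouched, the caller must compute the previous_chunk/next_chunk patches itself. A sketch of that bookkeeping for an ordered list of chunk IDs; the metadata key names follow the note above, but the patch shape is an assumption:

```python
def neighbour_patches(chunk_ids: list[str], new_id: str, position: int) -> dict[str, dict]:
    """Compute metadata patches for the chunks adjacent to a newly inserted chunk.

    Illustrative only: the real metadata layout may differ.
    """
    patches: dict[str, dict] = {}
    if position > 0:
        # The chunk before the insertion point now points forward to the new chunk.
        patches[chunk_ids[position - 1]] = {"next_chunk": new_id}
    if position < len(chunk_ids):
        # The chunk at the insertion point now points back to the new chunk.
        patches[chunk_ids[position]] = {"previous_chunk": new_id}
    return patches

# Insert "c9" between "c1" and "c2": both neighbours need a patch.
print(neighbour_patches(["c1", "c2"], "c9", 1))
# {'c1': {'next_chunk': 'c9'}, 'c2': {'previous_chunk': 'c9'}}
```

Each patch could then be applied with update_chunk_metadata, which is documented below as a shallow merge.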

Parameters:

Name Type Description Default
element dict[str, Any]

The chunk to be indexed.

required
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
db_config (dict[str, Any]): Vector DB configuration.

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response with success status, error message, and chunk_id.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

index_chunks(elements, **kwargs)

Index multiple chunks.

This method enables indexing multiple chunks in a single operation without requiring file replacement semantics (i.e., it inserts or overwrites the provided chunks directly without first deleting existing chunks). The chunks provided can belong to multiple different files.
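A sketch of the element structure such a call expects, following the 'text'/'metadata' shape described under the elements parameter below; all IDs and text values are placeholders. The batching at the end mirrors the batch_size kwarg:

```python
# Chunks from two different files, indexed in one call (values are placeholders).
elements = [
    {"text": "First chunk of file A.", "metadata": {"file_id": "file-a", "chunk_id": "a-0"}},
    {"text": "Second chunk of file A.", "metadata": {"file_id": "file-a", "chunk_id": "a-1"}},
    {"text": "Only chunk of file B.", "metadata": {"file_id": "file-b", "chunk_id": "b-0"}},
]

# Every element must carry its identity keys.
assert all({"file_id", "chunk_id"} <= e["metadata"].keys() for e in elements)

# Batching splits the list into fixed-size slices, as the batch_size kwarg suggests.
batch_size = 2
batches = [elements[i:i + batch_size] for i in range(0, len(elements), batch_size)]
print(len(batches))  # 2
```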

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use.
db_config (dict[str, Any]): The configuration for the database engine.
batch_size (int, optional): The number of chunks to process in each batch. Defaults to 100.
max_retries (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the indexing process. Should include:

  1. success (bool): True if indexing succeeded, False otherwise.
  2. error_message (str): Error message if indexing failed, empty string otherwise.
  3. total (int): The total number of chunks indexed.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

index_file_chunks(elements, file_id, **kwargs)

Index chunks for a specific file.

This method indexes chunks for a file. The indexer is responsible for deleting any existing chunks for the file_id before indexing the new chunks to ensure consistency.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

The chunks to be indexed.

required
file_id str

The ID of the file these chunks belong to.

required
**kwargs Any

Additional keyword arguments for customization.

{}
Kwargs

vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
db_config (dict[str, Any]): Vector DB configuration.
batch_size (int, optional): The number of chunks to process in each batch. Defaults to 100.
max_retries (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response with success status, error message, and total count.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

update_chunk(element, **kwargs)

Update a chunk by chunk ID.

This method replaces both text content and metadata of a chunk by deleting and recreating the target chunk record. This guarantees metadata replacement semantics (old metadata keys are removed).

For metadata updates:

  1. Chunk identity metadata keys (file_id and chunk_id) will not be updated.
  2. Metadata is fully replaced by the metadata from the provided element.
  3. Old metadata keys not present in the provided element are removed.
  4. Fails on chunk not found.
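The replacement semantics can be sketched as a pure function: the provided metadata wholly replaces the old, except that identity keys survive from the existing record. Illustrative only, not the library's code:

```python
def replace_metadata(existing: dict, provided: dict) -> dict:
    """Full replacement: only file_id/chunk_id survive from the existing metadata."""
    identity = {k: existing[k] for k in ("file_id", "chunk_id") if k in existing}
    # Drop any identity keys from the provided metadata; they may not be changed.
    merged = {k: v for k, v in provided.items() if k not in ("file_id", "chunk_id")}
    return {**merged, **identity}

old = {"file_id": "f1", "chunk_id": "c1", "page": 3, "lang": "en"}
new = replace_metadata(old, {"lang": "de"})
print(new)  # {'lang': 'de', 'file_id': 'f1', 'chunk_id': 'c1'}  -- 'page' is gone
```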

Parameters:

Name Type Description Default
element dict[str, Any]

The updated chunk data.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response with success status, error message, and chunk_id.

Raises:

Type Description
ValueError

If invalid parameters or an unsupported configuration is provided.

KeyError

If required kwargs are missing.

update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)

Update metadata for a specific chunk.

This method patches new metadata into the existing chunk metadata. Existing metadata fields will be overwritten, and new fields will be added.

  1. Chunk identity metadata keys (file_id and chunk_id) will not be updated.
  2. Metadata fields from the provided metadata dict are overwritten or added.
  3. Existing metadata keys not present in the provided metadata remain unchanged.
  4. Fails on chunk not found.
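In contrast to update_chunk's full replacement, this is a shallow patch. A sketch of the merge, illustrative only:

```python
def patch_metadata(existing: dict, patch: dict) -> dict:
    """Shallow patch: overwrite/add provided keys, keep the rest, protect identity keys."""
    safe_patch = {k: v for k, v in patch.items() if k not in ("file_id", "chunk_id")}
    return {**existing, **safe_patch}

old = {"file_id": "f1", "chunk_id": "c1", "page": 3, "lang": "en"}
new = patch_metadata(old, {"lang": "de", "author": "a. writer"})
# 'page' survives, 'lang' is overwritten, 'author' is added.
print(new["page"], new["lang"], new["author"])  # 3 de a. writer
```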

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to update.

required
file_id str

The ID of the file the chunk belongs to.

required
metadata dict[str, Any]

The metadata fields to update.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response with success status and error message.

Raises:

Type Description
ValueError

If invalid params or unsupported config is provided.

KeyError

If missing required kwargs.