Vector
Vector indexer module.
VectorDBIndexer(data_store_map=None, em_invoker_map=None, cache_size=DEFAULT_CACHE_SIZE, retryable_exceptions=None)
Bases: BaseIndexer
Index elements into a vector datastore.
Initialize the indexer with mappings for vector DB capabilities and embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_store_map | dict[str, Type[BaseDataStore]] \| None | Mapping of db_engine strings to BaseDataStore classes. If not provided, uses DEFAULT_DATA_STORE_MAP, which includes "chroma", "elasticsearch", and "opensearch". | None |
| em_invoker_map | dict[str, Type[BaseEMInvoker]] \| None | Mapping of provider strings to embedding classes (BaseEMInvoker subclasses). If not provided, uses DEFAULT_EM_INVOKER_MAP, which includes "azure-openai", "bedrock", "google", "openai", and "voyage". | None |
| cache_size | int | Maximum number of vector capability instances to cache using an LRU policy. Defaults to DEFAULT_CACHE_SIZE (128). | DEFAULT_CACHE_SIZE |
| retryable_exceptions | tuple[type[Exception], ...] \| None | Tuple of exception types to retry on during batch processing. If not provided, uses DEFAULT_RETRYABLE_EXCEPTIONS. | None |
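The cache_size parameter bounds how many capability instances stay alive at once. A minimal sketch of the idea using functools.lru_cache; the real class may build and key its cache differently, so the key shape here is an assumption:

```python
from functools import lru_cache

DEFAULT_CACHE_SIZE = 128

@lru_cache(maxsize=DEFAULT_CACHE_SIZE)
def get_capability(db_engine: str, config_key: str) -> dict:
    # In the real indexer this would construct a datastore plus an
    # embedding invoker; here we just return a marker object.
    return {"engine": db_engine, "config": config_key}

a = get_capability("chroma", "cfg-1")
b = get_capability("chroma", "cfg-1")
assert a is b  # second call is served from the LRU cache
```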
delete_chunk(chunk_id, file_id, **kwargs)
Delete a single chunk by chunk ID and file ID.
If the chunk is not found, the call is treated as a successful deletion (nothing to delete).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| chunk_id | str | The ID of the chunk to delete. | required |
| file_id | str | The ID of the file the chunk belongs to. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- db_config (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
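The same three kwargs recur across the read and delete methods. A sketch of how such a call might be assembled; the db_config keys shown are illustrative assumptions, not a documented schema:

```python
common_kwargs = {
    "vectorizer_kwargs": {"model": "openai/text-embedding-ada-002"},
    "db_engine": "elasticsearch",
    "db_config": {"hosts": ["http://localhost:9200"]},  # assumed keys
}

# The "model" value encodes provider and model name as "provider/model_name".
provider, model_name = common_kwargs["vectorizer_kwargs"]["model"].split("/", 1)
assert provider == "openai" and model_name == "text-embedding-ada-002"

# A call would then look like (indexer construction omitted):
# response = indexer.delete_chunk(chunk_id="c1", file_id="f1", **common_kwargs)
```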
delete_file_chunks(file_id, **kwargs)
Delete all chunks for a specific file.
- No index: treated as success (nothing to delete).
- Index exists, no matching chunks: success.
- Index exists, matching chunks: success if delete succeeds, otherwise failed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_id | str | The ID of the file whose chunks should be deleted. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- db_config (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
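Every mutation method returns a success flag and an error message. A small helper sketch for interpreting those responses, assuming the success/error_message keys described above:

```python
def check_response(response: dict) -> None:
    # Raise if the indexer reported a failure; note that deleting a
    # missing chunk still counts as success per the semantics above.
    if not response.get("success", False):
        raise RuntimeError(response.get("error_message", "unknown error"))

check_response({"success": True, "error_message": ""})  # no error raised
try:
    check_response({"success": False, "error_message": "index unreachable"})
except RuntimeError as exc:
    assert "index unreachable" in str(exc)
```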
get_chunk(chunk_id, file_id, **kwargs)
Get a single chunk by chunk ID and file ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| chunk_id | str | The ID of the chunk to retrieve. | required |
| file_id | str | The ID of the file the chunk belongs to. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- db_config (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] \| None | The chunk data, or None if not found. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
get_file_chunks(file_id, page=0, size=20, **kwargs)
Get chunks for a specific file with pagination support.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_id | str | The ID of the file to get chunks from. | required |
| page | int | The page number (0-indexed). | 0 |
| size | int | The number of chunks per page. | 20 |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- db_config (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response containing: 1. chunks (list[dict[str, Any]]): List of chunks (elements) with text, structure, and metadata. 2. pagination (dict[str, Any]): Pagination metadata with page (int): current page number; size (int): number of items per page; total_chunks (int): total number of chunks for the file; total_pages (int): total number of pages; has_next (bool): whether there is a next page; has_previous (bool): whether there is a previous page. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
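The pagination metadata returned by get_file_chunks can be derived from page, size, and the total chunk count. A sketch of the arithmetic, assuming the field semantics listed above:

```python
import math

def pagination(page: int, size: int, total_chunks: int) -> dict:
    # 0-indexed pages; total_pages rounds up so a partial page counts.
    total_pages = math.ceil(total_chunks / size) if size else 0
    return {
        "page": page,
        "size": size,
        "total_chunks": total_chunks,
        "total_pages": total_pages,
        "has_next": page + 1 < total_pages,
        "has_previous": page > 0,
    }

meta = pagination(page=0, size=20, total_chunks=45)
assert meta["total_pages"] == 3 and meta["has_next"] and not meta["has_previous"]
```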
index_chunk(element, **kwargs)
Index a single chunk.
This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| element | dict[str, Any] | The chunk to be indexed. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- db_config (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
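Because index_chunk does not touch neighboring chunks, a caller that maintains previous_chunk/next_chunk links must patch the neighbors itself. A sketch of that bookkeeping with plain dicts (the link key names follow the docstring above):

```python
chunks = {
    "c1": {"text": "first", "metadata": {"chunk_id": "c1", "next_chunk": "c3"}},
    "c3": {"text": "third", "metadata": {"chunk_id": "c3", "previous_chunk": "c1"}},
}

# Insert c2 between c1 and c3: index the new chunk, then repair both links.
chunks["c2"] = {
    "text": "second",
    "metadata": {"chunk_id": "c2", "previous_chunk": "c1", "next_chunk": "c3"},
}
chunks["c1"]["metadata"]["next_chunk"] = "c2"
chunks["c3"]["metadata"]["previous_chunk"] = "c2"

assert chunks["c1"]["metadata"]["next_chunk"] == "c2"
assert chunks["c3"]["metadata"]["previous_chunk"] == "c2"
```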
index_chunks(elements, **kwargs)
Index multiple chunks.
This method enables indexing multiple chunks in a single operation without requiring file replacement semantics (i.e., it inserts or overwrites the provided chunks directly without first deleting existing chunks). The chunks provided can belong to multiple different files.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use.
- db_config (dict[str, Any]): The configuration for the database engine.
- batch_size (int, optional): The number of chunks to process in each batch. Defaults to 100.
- max_retries (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | The response from the indexing process, including: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. total (int): The total number of chunks indexed. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
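index_chunks processes elements in batches with bounded retries on retryable exceptions. A generic sketch of that loop; the real implementation's retry and backoff details are not documented, so treat this as an assumption:

```python
def index_in_batches(elements, index_batch, batch_size=100, max_retries=3,
                     retryable_exceptions=(ConnectionError,)):
    """Split elements into batches and retry each failed batch."""
    total = 0
    for start in range(0, len(elements), batch_size):
        batch = elements[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                index_batch(batch)  # caller-supplied indexing callable
                total += len(batch)
                break
            except retryable_exceptions:
                if attempt == max_retries:
                    return {"success": False,
                            "error_message": "batch failed after retries",
                            "total": total}
    return {"success": True, "error_message": "", "total": total}

result = index_in_batches(list(range(250)), index_batch=lambda b: None)
assert result == {"success": True, "error_message": "", "total": 250}
```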
index_file_chunks(elements, file_id, **kwargs)
Index chunks for a specific file.
This method indexes chunks for a file. The indexer is responsible for deleting any existing chunks for the file_id before indexing the new chunks to ensure consistency.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | The chunks to be indexed. | required |
| file_id | str | The ID of the file these chunks belong to. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Kwargs:

- vectorizer_kwargs (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- db_engine (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- db_config (dict[str, Any]): Vector DB configuration.
- batch_size (int, optional): The number of chunks to process in each batch. Defaults to 100.
- max_retries (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response with success status, error message, and total count. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
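Unlike index_chunks, index_file_chunks carries replacement semantics: existing chunks for the file are deleted before the new ones are written. A sketch of that delete-then-insert behavior over a plain dict store:

```python
store = {
    "old-1": {"file_id": "f1", "text": "stale"},
    "keep-1": {"file_id": "f2", "text": "other file"},
}

def index_file_chunks(store, elements, file_id):
    # Drop existing chunks for this file_id first, then insert the new ones,
    # so the store never mixes old and new chunks for the same file.
    for chunk_id in [cid for cid, c in store.items() if c["file_id"] == file_id]:
        del store[chunk_id]
    for element in elements:
        store[element["chunk_id"]] = {"file_id": file_id, "text": element["text"]}
    return {"success": True, "error_message": "", "total": len(elements)}

result = index_file_chunks(store, [{"chunk_id": "new-1", "text": "fresh"}], "f1")
assert "old-1" not in store and "keep-1" in store and result["total"] == 1
```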
update_chunk(element, **kwargs)
Update a chunk by chunk ID.
This method replaces both text content and metadata of a chunk by deleting and recreating the target chunk record. This guarantees metadata replacement semantics (old metadata keys are removed).
For metadata updates:

- Chunk identity metadata keys (file_id and chunk_id) will not be updated.
- Metadata is fully replaced by the metadata from the provided element.
- Old metadata keys not present in the provided element are removed.
- Fails if the chunk is not found.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| element | dict[str, Any] | The updated chunk data. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
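update_chunk fully replaces metadata while keeping the identity keys intact. A sketch of those replacement semantics with plain dicts:

```python
IDENTITY_KEYS = ("file_id", "chunk_id")

def replace_metadata(old_metadata, new_metadata):
    # Full replacement: old keys vanish, new keys win,
    # but identity keys always keep their original values.
    identity = {k: old_metadata[k] for k in IDENTITY_KEYS if k in old_metadata}
    return {**new_metadata, **identity}

old = {"file_id": "f1", "chunk_id": "c1", "page": 3, "lang": "en"}
new = {"file_id": "other", "lang": "id"}

merged = replace_metadata(old, new)
# "page" is gone (full replacement) and "file_id" could not be hijacked.
assert merged == {"file_id": "f1", "chunk_id": "c1", "lang": "id"}
```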
update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)
Update metadata for a specific chunk.
This method patches new metadata into the existing chunk metadata: existing fields are overwritten and new fields are added.
- Chunk identity metadata keys (file_id and chunk_id) will not be updated.
- Fields in the provided metadata overwrite existing fields or are added as new ones.
- Existing fields not present in the provided metadata remain unchanged.
- Fails if the chunk is not found.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| chunk_id | str | The ID of the chunk to update. | required |
| file_id | str | The ID of the file the chunk belongs to. | required |
| metadata | dict[str, Any] | The metadata fields to update. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| ValueError | If invalid parameters or an unsupported configuration are provided. |
| KeyError | If required kwargs are missing. |
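In contrast to update_chunk, update_chunk_metadata merges rather than replaces. A sketch contrasting the patch semantics with plain dicts:

```python
IDENTITY_KEYS = ("file_id", "chunk_id")

def patch_metadata(existing, patch):
    # Merge: patched keys overwrite, other existing keys survive,
    # and identity keys cannot be changed.
    safe_patch = {k: v for k, v in patch.items() if k not in IDENTITY_KEYS}
    return {**existing, **safe_patch}

existing = {"file_id": "f1", "chunk_id": "c1", "page": 3, "lang": "en"}
patched = patch_metadata(existing, {"lang": "id", "chunk_id": "hijack", "source": "web"})

# "page" survives (merge), "lang" is overwritten, "source" is added,
# and the attempt to change "chunk_id" is ignored.
assert patched == {"file_id": "f1", "chunk_id": "c1", "page": 3,
                   "lang": "id", "source": "web"}
```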