Fulltext
Fulltext DB Indexer module.
FulltextDBIndexer(data_store_map=None, cache_size=DEFAULT_CACHE_SIZE, retryable_exceptions=None)
Bases: BaseIndexer
Index elements into a fulltext datastore capability (no embeddings required).
Initialize the indexer with mappings for fulltext datastore capabilities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_store_map` | `dict[str, Type[BaseDataStore]] \| None` | Mapping of db_engine strings to BaseDataStore classes. If not provided, uses DEFAULT_DATA_STORE_MAP, which includes "chroma", "elasticsearch", and "opensearch". Defaults to None. | `None` |
| `cache_size` | `int` | Maximum number of fulltext datastore instances to cache using LRU policy. Defaults to DEFAULT_CACHE_SIZE (128). | `DEFAULT_CACHE_SIZE` |
| `retryable_exceptions` | `tuple[type[Exception], ...] \| None` | Tuple of exception types to retry on during batch processing. Defaults to DEFAULT_RETRYABLE_EXCEPTIONS. | `None` |

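The `cache_size` parameter refers to an LRU policy for datastore instances. How the indexer implements this internally is not shown here; the following is only a minimal sketch of the policy itself. The class name `LRUInstanceCache`, the `(db_engine, index_name)` cache key, and the `factory` callable are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUInstanceCache:
    """Hypothetical helper: keep at most `max_size` instances, evicting the least recently used."""

    def __init__(self, max_size: int) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        # Insertion order doubles as recency order: oldest entry first.
        self._items = OrderedDict()

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        instance = factory()
        self._items[key] = instance
        if len(self._items) > self._max_size:
            # Over capacity: drop the least recently used entry.
            self._items.popitem(last=False)
        return instance

    def __contains__(self, key: Hashable) -> bool:
        return key in self._items

    def __len__(self) -> int:
        return len(self._items)
```

With `max_size=2`, creating instances for two keys, re-reading the first, and then creating a third evicts the second key, because it is the one used least recently.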
delete_chunk(chunk_id, file_id, **kwargs)
Delete a single chunk by chunk ID and file ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to delete. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the deletion process. Should include: 1. success (bool): True if deletion succeeded, False otherwise. 2. error_message (str): Error message if deletion failed, empty string otherwise. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |

delete_file_chunks(file_id, **kwargs)
Delete all chunks for a specific file.
A missing index or collection is treated as success (there is nothing to delete). On version conflicts, the delete is retried up to max_retries times; if conflicts persist after all attempts, a RuntimeError is raised and success=False is returned.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file to delete chunks from. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs:

| Name | Type | Description |
|---|---|---|
| `db_engine` | `str` | The database engine to use (e.g., "chroma", "elasticsearch", "opensearch"). |
| `db_config` | `dict[str, Any]` | Datastore config (index_name, url, etc.). |
| `max_retries` | `int`, optional | Maximum retry attempts on version conflicts. Defaults to 3. |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If invalid params or unsupported config is provided. |
| `KeyError` | If missing required kwargs. |

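The retry behavior described for `delete_file_chunks` can be illustrated with a small, self-contained sketch. It is not the library's implementation: `VersionConflictError`, `MissingIndexError`, and `delete_with_retries` are hypothetical names, and the sketch assumes that `max_retries` counts total attempts and that exhausted retries surface as `success=False` in the response.

```python
from typing import Any, Callable


class VersionConflictError(Exception):
    """Hypothetical stand-in for a datastore's version-conflict error."""


class MissingIndexError(Exception):
    """Hypothetical stand-in for a 'no such index/collection' error."""


def delete_with_retries(delete_fn: Callable[[], None], max_retries: int = 3) -> dict[str, Any]:
    """Call `delete_fn`, retrying on version conflicts (assumption: max_retries = total attempts)."""
    last_error = ""
    for attempt in range(1, max_retries + 1):
        try:
            delete_fn()
            return {"success": True, "error_message": ""}
        except MissingIndexError:
            # Nothing to delete, so this counts as success.
            return {"success": True, "error_message": ""}
        except VersionConflictError as exc:
            # Remember the conflict and try again if attempts remain.
            last_error = f"version conflict on attempt {attempt}: {exc}"
    return {"success": False, "error_message": last_error}
```

A caller would typically check `success` first and only read `error_message` when it is False.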
get_chunk(chunk_id, file_id, **kwargs)
Get a single chunk by chunk ID and file ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to retrieve. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any] \| None` | The chunk data following the Element structure with 'text' and 'metadata' keys, or None if the chunk is not found. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |

get_file_chunks(file_id, page=0, size=20, **kwargs)
Get chunks for a specific file with pagination support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file to get chunks from. | required |
| `page` | `int` | The page number (0-indexed). Defaults to 0. | `0` |
| `size` | `int` | The number of chunks per page. Defaults to 20. | `20` |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response containing: 1. chunks (list[dict[str, Any]]): List of chunks (elements) with text, structure, and metadata. 2. pagination (dict[str, Any]): Pagination metadata with: page (int): Current page number; size (int): Number of items per page; total_chunks (int): Total number of chunks for the file; total_pages (int): Total number of pages; has_next (bool): Whether there is a next page; has_previous (bool): Whether there is a previous page. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |

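To make the pagination fields of `get_file_chunks` concrete, here is a minimal sketch of how such metadata could be derived for a 0-indexed `page`. The helper name `build_pagination` is hypothetical, and the arithmetic is an assumption based on the field descriptions above, not the library's code.

```python
import math
from typing import Any


def build_pagination(total_chunks: int, page: int = 0, size: int = 20) -> dict[str, Any]:
    """Hypothetical helper: compute pagination metadata for a 0-indexed page."""
    if page < 0 or size < 1 or total_chunks < 0:
        raise ValueError("page and total_chunks must be >= 0 and size must be >= 1")
    total_pages = math.ceil(total_chunks / size)
    return {
        "page": page,
        "size": size,
        "total_chunks": total_chunks,
        "total_pages": total_pages,
        # The last valid page index is total_pages - 1.
        "has_next": page + 1 < total_pages,
        "has_previous": page > 0,
    }
```

For example, 45 chunks at the default size of 20 give three pages (indices 0 to 2); page 0 has a next page but no previous one, and page 2 is the reverse.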
index_chunk(element, **kwargs)
Index a single chunk.
Note: This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The chunk to be indexed. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. chunk_id (str): The ID of the indexed chunk. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |

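Because `index_chunk` leaves neighboring chunks untouched, the caller has to work out which adjacent chunks need new `previous_chunk`/`next_chunk` values. The sketch below shows one way to compute those patches. It assumes, without confirmation from this page, that both fields hold the neighbor's chunk ID; `neighbor_patches` is a hypothetical helper.

```python
from typing import Any, Optional


def neighbor_patches(
    new_chunk_id: str,
    previous_id: Optional[str],
    next_id: Optional[str],
) -> dict[str, dict[str, Any]]:
    """Hypothetical helper: map each neighbor's chunk ID to the metadata patch it needs.

    Assumption: 'previous_chunk' and 'next_chunk' store chunk IDs.
    """
    patches: dict[str, dict[str, Any]] = {}
    if previous_id is not None:
        # The chunk before the new one must now point forward to it.
        patches[previous_id] = {"next_chunk": new_chunk_id}
    if next_id is not None:
        # The chunk after the new one must now point back to it.
        patches[next_id] = {"previous_chunk": new_chunk_id}
    return patches
```

Each resulting patch could then be applied to the corresponding neighbor through a metadata update call such as `update_chunk_metadata`.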
index_chunks(elements, **kwargs)
Index multiple chunks.
This method enables indexing multiple chunks in a single operation without requiring file replacement semantics (i.e., it inserts or overwrites the provided chunks directly without first deleting existing chunks). The chunks provided can belong to multiple different files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. total (int): The total number of chunks indexed. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |

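The Element structure required by `index_chunks` ('text' and 'metadata' keys, with 'file_id' and 'chunk_id' inside the metadata) can be checked before a call. The following is a sketch of such a pre-flight check; `find_element_problems` is a hypothetical helper, and any validation the library performs itself may differ.

```python
from typing import Any


def find_element_problems(element: dict[str, Any]) -> list[str]:
    """Hypothetical helper: list what is missing from an element (empty list means it looks valid)."""
    problems = []
    for key in ("text", "metadata"):
        if key not in element:
            problems.append(f"element is missing '{key}'")
    metadata = element.get("metadata")
    if isinstance(metadata, dict):
        for key in ("file_id", "chunk_id"):
            if key not in metadata:
                problems.append(f"metadata is missing '{key}'")
    return problems


# Chunks passed together may belong to different files.
elements = [
    {"text": "First chunk of file A.", "metadata": {"file_id": "file-a", "chunk_id": "a-0"}},
    {"text": "First chunk of file B.", "metadata": {"file_id": "file-b", "chunk_id": "b-0"}},
]
invalid = [e for e in elements if find_element_problems(e)]
```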
index_file_chunks(elements, file_id, **kwargs)
Index chunks for a specific file, replacing any existing chunks for that file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | The chunks to be indexed. | required |
| `file_id` | `str` | The ID of the file these chunks belong to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs:

| Name | Type | Description |
|---|---|---|
| `db_engine` | `str` | The database engine to use (e.g., "chroma", "elasticsearch", "opensearch"). |
| `db_config` | `dict[str, Any]` | Datastore config (index_name, url, etc.). |
| `batch_size` | `int`, optional | Number of elements to process in each batch. Defaults to 100. |
| `max_retries` | `int`, optional | Maximum number of retry attempts for failed batches. Defaults to 3. |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and total count. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If invalid params or unsupported config is provided. |
| `KeyError` | If missing required kwargs. |

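The interaction of `batch_size`, `max_retries`, and retryable exceptions in `index_file_chunks` can be sketched as follows. This is an illustration under assumptions, not the library's code: `index_in_batches` and `write_batch` are hypothetical, `max_retries` is taken as total attempts per batch, the default retryable types are placeholders, and the step that removes a file's existing chunks is omitted.

```python
from typing import Any, Callable, Iterator, Sequence


def iter_batches(elements: Sequence[Any], batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(elements), batch_size):
        yield elements[start:start + batch_size]


def index_in_batches(
    elements: Sequence[Any],
    write_batch: Callable[[Sequence[Any]], None],
    batch_size: int = 100,
    max_retries: int = 3,
    retryable: tuple[type[Exception], ...] = (ConnectionError, TimeoutError),
) -> dict[str, Any]:
    """Hypothetical helper: write batches in order, retrying each on retryable errors."""
    total = 0
    for batch in iter_batches(elements, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                write_batch(batch)
                total += len(batch)
                break
            except retryable as exc:
                if attempt == max_retries:
                    # Give up on the first batch that exhausts its attempts.
                    return {"success": False, "error_message": str(exc), "total": total}
    return {"success": True, "error_message": "", "total": total}
```

Exceptions outside the `retryable` tuple propagate to the caller unchanged in this sketch, which keeps programming errors from being silently retried.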
update_chunk(element, **kwargs)
Update a chunk by chunk ID.
This method updates both the text content and metadata of a chunk.
Fails if the chunk is not found.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The updated chunk data. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the update process. Should include: 1. success (bool): True if update succeeded, False otherwise. 2. error_message (str): Error message if update failed, empty string otherwise. 3. chunk_id (str): The ID of the updated chunk. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |

update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)
Update metadata for a specific chunk.
This method patches new metadata into the existing chunk metadata. Fields that already exist are overwritten with the provided values, new fields are added, and fields not mentioned in the patch are left unchanged.
Fails if the chunk is not found.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to update. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `metadata` | `dict[str, Any]` | The metadata fields to update. Only the provided fields will be updated; other existing metadata will remain unchanged. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the update process. Should include: 1. success (bool): True if update succeeded, False otherwise. 2. error_message (str): Error message if update failed, empty string otherwise. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
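The patch semantics described for `update_chunk_metadata` amount to a dictionary merge. The sketch below assumes a shallow merge, meaning a nested dict in the patch replaces the existing nested value wholesale; whether the library merges nested values more deeply is not stated here. `patch_metadata` is a hypothetical helper.

```python
from typing import Any


def patch_metadata(existing: dict[str, Any], patch: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: return a new dict with `patch` applied over `existing` (shallow merge)."""
    merged = dict(existing)  # Copy so the caller's dict is not mutated.
    merged.update(patch)     # Provided fields overwrite or extend; others stay as they were.
    return merged


existing = {"file_id": "file-a", "chunk_id": "a-0", "page": 1}
patched = patch_metadata(existing, {"page": 2, "reviewed": True})
```

Here `patched` keeps `file_id` and `chunk_id`, replaces `page` with 2, and gains the new `reviewed` field, while `existing` itself is left untouched.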