Indexer
Document Processing Orchestrator Indexer Package.
Modules:
| Name | Description |
|---|---|
| `BaseIndexer` | Abstract base class for indexing documents. |
BaseIndexer
Bases: ABC
Base class for document indexers.
delete(**kwargs)
abstractmethod
Delete document from a vector DB.
The arguments are not fixed by the base class; they depend on the implementation. Some vector databases will require, for example, db_url, index_name, and document_id.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Any` | `Any` | The response from the deletion process. |
delete_chunk(chunk_id, file_id, **kwargs)
abstractmethod
Delete a single chunk by chunk ID and file ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to delete. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the deletion process. Should include: 1. success (bool): True if deletion succeeded, False otherwise. 2. error_message (str): Error message if deletion failed, empty string otherwise. |
delete_file_chunks(file_id, **kwargs)
abstractmethod
Delete all chunks for a specific file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file whose chunks should be deleted. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the deletion process. Should include: 1. success (bool): True if deletion succeeded, False otherwise. 2. error_message (str): Error message if deletion failed, empty string otherwise. |
get_chunk(chunk_id, file_id, **kwargs)
abstractmethod
Get a single chunk by chunk ID and file ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to retrieve. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any] \| None` | The chunk data following the Element structure with 'text' and 'metadata' keys, or None if the chunk is not found. |
get_file_chunks(file_id, page=0, size=20, **kwargs)
abstractmethod
Get chunks for a specific file with pagination support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file to get chunks from. | required |
| `page` | `int` | The page number (0-indexed). Defaults to 0. | `0` |
| `size` | `int` | The number of chunks per page. Defaults to 20. | `20` |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response containing chunks and pagination metadata. Should include: 1. chunks (list[dict[str, Any]]): List of chunks, each following the Element structure. 2. total (int): Total number of chunks for the file. 3. page (int): Current page number. 4. size (int): Number of chunks per page. 5. total_pages (int): Total number of pages. |
Note
Chunks should be sorted by their metadata.order field (position within the file).
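
The pagination contract above (0-indexed pages, sorting by metadata.order, and the five response keys) can be sketched as a small helper. `paginate_chunks` is a hypothetical name, not part of the package, and it assumes all chunks for the file are already in memory; a real backend would usually push the sorting and slicing into the query.

```python
import math
from typing import Any


def paginate_chunks(
    chunks: list[dict[str, Any]], page: int = 0, size: int = 20
) -> dict[str, Any]:
    """Build a get_file_chunks-style response from an in-memory list of chunks."""
    if page < 0 or size <= 0:
        raise ValueError("page must be >= 0 and size must be > 0")

    # Sort by position within the file, as the note above requires.
    ordered = sorted(chunks, key=lambda chunk: chunk["metadata"]["order"])

    start = page * size  # pages are 0-indexed
    return {
        "chunks": ordered[start : start + size],
        "total": len(ordered),
        "page": page,
        "size": size,
        "total_pages": math.ceil(len(ordered) / size),
    }
```

With 45 chunks and the default size of 20, this yields three pages, the last of which holds five chunks. A page past the end returns an empty `chunks` list while `total` and `total_pages` stay unchanged.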
index(elements, **kwargs)
abstractmethod
Index data from a source file into Elasticsearch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `Any` | The information to be indexed. Ideally a List[Dict], with each Dict following the structure of the 'Element' model. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Any` | `Any` | The response from the indexing process. |
index_chunk(element, **kwargs)
abstractmethod
Index a single chunk.
Note: This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The chunk to be indexed. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. chunk_id (str): The ID of the indexed chunk. |
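
Because index_chunk leaves neighboring chunks untouched, the caller has to relink them. The sketch below shows one way to do that with index_chunk and update_chunk_metadata. `InMemoryIndexer` and `insert_between` are hypothetical names used only for illustration, and the sketch assumes that previous_chunk and next_chunk hold the neighbor's chunk ID, which the interface does not specify.

```python
from typing import Any, Optional


class InMemoryIndexer:
    """Hypothetical stand-in exposing only the two methods the sketch needs."""

    def __init__(self) -> None:
        self.chunks: dict[tuple[str, str], dict[str, Any]] = {}

    def index_chunk(self, element: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
        meta = element["metadata"]
        self.chunks[(meta["file_id"], meta["chunk_id"])] = element
        return {"success": True, "error_message": "", "chunk_id": meta["chunk_id"]}

    def update_chunk_metadata(
        self, chunk_id: str, file_id: str, metadata: dict[str, Any], **kwargs: Any
    ) -> dict[str, Any]:
        chunk = self.chunks.get((file_id, chunk_id))
        if chunk is None:
            return {"success": False, "error_message": f"Chunk {chunk_id!r} not found."}
        chunk["metadata"].update(metadata)
        return {"success": True, "error_message": ""}


def insert_between(
    indexer: InMemoryIndexer,
    element: dict[str, Any],
    prev_id: Optional[str],
    next_id: Optional[str],
) -> dict[str, Any]:
    """Index a chunk, then point its neighbors at it (caller-side relinking)."""
    file_id = element["metadata"]["file_id"]
    chunk_id = element["metadata"]["chunk_id"]

    # Assumption: previous_chunk / next_chunk store the neighbor's chunk ID.
    linked = {
        "text": element["text"],
        "metadata": {**element["metadata"], "previous_chunk": prev_id, "next_chunk": next_id},
    }
    result = indexer.index_chunk(linked)
    if not result["success"]:
        return result

    # Update each existing neighbor so the chain stays consistent.
    patches = ((prev_id, {"next_chunk": chunk_id}), (next_id, {"previous_chunk": chunk_id}))
    for neighbor_id, patch in patches:
        if neighbor_id is None:
            continue
        updated = indexer.update_chunk_metadata(neighbor_id, file_id, patch)
        if not updated["success"]:
            return {"success": False, "error_message": updated["error_message"], "chunk_id": chunk_id}
    return result
```

The three writes are not atomic here; if a neighbor update fails, the new chunk is already indexed, so a production caller may want to retry or roll back.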
index_file_chunks(elements, file_id, **kwargs)
abstractmethod
Index chunks for a specific file.
The indexer is responsible for deleting any existing chunks for the file_id before indexing the new ones, so that the file's chunks are completely replaced by the new set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| `file_id` | `str` | The ID of the file these chunks belong to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. total (int): The total number of chunks indexed. |
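
The delete-then-index behavior described for index_file_chunks can be sketched with a hypothetical in-memory class. `InMemoryFileIndexer` is not part of the package; validating the batch before deleting is a design choice of this sketch, made so that a malformed batch does not wipe the existing chunks.

```python
from typing import Any


class InMemoryFileIndexer:
    """Hypothetical indexer that keeps chunks in a dict of file_id -> chunk_id -> element."""

    def __init__(self) -> None:
        self.by_file: dict[str, dict[str, dict[str, Any]]] = {}

    def delete_file_chunks(self, file_id: str, **kwargs: Any) -> dict[str, Any]:
        self.by_file.pop(file_id, None)
        return {"success": True, "error_message": ""}

    def index_file_chunks(
        self, elements: list[dict[str, Any]], file_id: str, **kwargs: Any
    ) -> dict[str, Any]:
        # Validate first, so a bad batch leaves the existing chunks in place.
        for element in elements:
            meta = element.get("metadata", {})
            if meta.get("file_id") != file_id or "chunk_id" not in meta:
                return {
                    "success": False,
                    "error_message": "Each chunk needs a matching 'file_id' and a 'chunk_id'.",
                    "total": 0,
                }

        # Remove the old chunks before writing the new set (full replacement).
        deleted = self.delete_file_chunks(file_id)
        if not deleted["success"]:
            return {"success": False, "error_message": deleted["error_message"], "total": 0}

        stored = {element["metadata"]["chunk_id"]: element for element in elements}
        self.by_file[file_id] = stored
        return {"success": True, "error_message": "", "total": len(stored)}
```

Indexing a second, smaller batch for the same file_id leaves only that batch in the store, which is the replacement guarantee the method description asks for.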
update_chunk(element, **kwargs)
abstractmethod
Update a chunk by chunk ID.
This method updates both the text content and metadata of a chunk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The updated chunk data. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the update process. Should include: 1. success (bool): True if update succeeded, False otherwise. 2. error_message (str): Error message if update failed, empty string otherwise. 3. chunk_id (str): The ID of the updated chunk. |
update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)
abstractmethod
Update metadata for a specific chunk.
This method patches new metadata into the existing chunk metadata. Fields present in both are overwritten with the new values, fields that appear only in the new metadata are added, and all other existing fields remain unchanged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to update. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `metadata` | `dict[str, Any]` | The metadata fields to update. Only the provided fields will be updated; other existing metadata will remain unchanged. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The response from the update process. Should include: 1. success (bool): True if update succeeded, False otherwise. 2. error_message (str): Error message if update failed, empty string otherwise. |
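
The patch semantics of update_chunk_metadata amount to a dictionary merge. The sketch below uses a hypothetical helper, `patch_metadata`, and assumes a shallow merge: a nested value in the patch replaces the existing nested value wholesale. The interface does not say whether nested fields should be merged more deeply.

```python
from typing import Any


def patch_metadata(existing: dict[str, Any], patch: dict[str, Any]) -> dict[str, Any]:
    """Return a new metadata dict with the patch applied on top of the existing fields."""
    # Later keys win: fields in `patch` overwrite or extend `existing`,
    # and fields missing from `patch` are carried over unchanged.
    return {**existing, **patch}


existing = {"file_id": "f1", "chunk_id": "c1", "order": 3, "lang": "en"}
patched = patch_metadata(existing, {"order": 4, "reviewed": True})
# `order` is overwritten, `reviewed` is added, `lang` and the IDs are kept.
```

Building a new dict rather than mutating `existing` keeps the old metadata available, for example for logging or for rolling back a failed write.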