Indexer

Document Processing Orchestrator Indexer Package.

Modules:

Name Description
BaseIndexer

Abstract base class for indexing documents.

BaseIndexer

Bases: ABC

Base class for document indexers.

delete(**kwargs) abstractmethod

Delete document from a vector DB.

The arguments are not fixed by the base class; they depend on the implementation. Some vector databases will require, for example, db_url, index_name, and document_id.

Parameters:

Name Type Description Default
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Name Type Description
Any Any

The response from the deletion process.

delete_chunk(chunk_id, file_id, **kwargs) abstractmethod

Delete a single chunk by chunk ID and file ID.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to delete.

required
file_id str

The ID of the file the chunk belongs to.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response from the deletion process. Should include:

1. success (bool): True if deletion succeeded, False otherwise.
2. error_message (str): Error message if deletion failed, empty string otherwise.

delete_file_chunks(file_id, **kwargs) abstractmethod

Delete all chunks for a specific file.

Parameters:

Name Type Description Default
file_id str

The ID of the file whose chunks should be deleted.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response from the deletion process. Should include:

1. success (bool): True if deletion succeeded, False otherwise.
2. error_message (str): Error message if deletion failed, empty string otherwise.

get_chunk(chunk_id, file_id, **kwargs) abstractmethod

Get a single chunk by chunk ID and file ID.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to retrieve.

required
file_id str

The ID of the file the chunk belongs to.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any] | None

The chunk data following the Element structure with 'text' and 'metadata' keys, or None if the chunk is not found.
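A minimal sketch of the lookup contract, assuming a hypothetical nested in-memory mapping (`_CHUNKS`) in place of a real backend. It shows the two documented outcomes: an Element-like dict with 'text' and 'metadata' keys, or None when the chunk does not exist.

```python
from __future__ import annotations

from typing import Any

# Hypothetical in-memory store: file_id -> {chunk_id -> Element-like dict}.
_CHUNKS: dict[str, dict[str, dict[str, Any]]] = {
    "file-1": {
        "chunk-1": {
            "text": "First paragraph.",
            "metadata": {"file_id": "file-1", "chunk_id": "chunk-1", "order": 0},
        },
    },
}


def get_chunk(chunk_id: str, file_id: str, **kwargs: Any) -> dict[str, Any] | None:
    """Return the stored chunk, or None if the file or chunk is unknown."""
    return _CHUNKS.get(file_id, {}).get(chunk_id)
```

Callers are expected to check for None before reading `chunk["text"]` or `chunk["metadata"]`.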

get_file_chunks(file_id, page=0, size=20, **kwargs) abstractmethod

Get chunks for a specific file with pagination support.

Parameters:

Name Type Description Default
file_id str

The ID of the file to get chunks from.

required
page int

The page number (0-indexed). Defaults to 0.

0
size int

The number of chunks per page. Defaults to 20.

20
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response containing chunks and pagination metadata. Should include:

1. chunks (list[dict[str, Any]]): List of chunks, each following the Element structure.
2. total (int): Total number of chunks for the file.
3. page (int): Current page number.
4. size (int): Number of chunks per page.
5. total_pages (int): Total number of pages.

Note

Chunks should be sorted by their metadata.order field (position within the file).
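The pagination and ordering rules can be sketched as a small helper. The function name `paginate_chunks` and the decision to reject negative pages and non-positive sizes are assumptions; a real backend would usually push the sort and the offset into the query itself rather than slicing in memory.

```python
from __future__ import annotations

import math
from typing import Any


def paginate_chunks(
    chunks: list[dict[str, Any]], page: int = 0, size: int = 20
) -> dict[str, Any]:
    """Sort chunks by metadata.order and return one 0-indexed page."""
    if page < 0 or size <= 0:
        # Assumption: invalid paging arguments are rejected up front.
        raise ValueError("page must be >= 0 and size must be > 0")

    ordered = sorted(chunks, key=lambda chunk: chunk["metadata"]["order"])
    total = len(ordered)
    start = page * size  # page 0 starts at offset 0

    return {
        "chunks": ordered[start : start + size],
        "total": total,
        "page": page,
        "size": size,
        "total_pages": math.ceil(total / size),
    }
```

With five chunks and `size=2`, pages 0 and 1 hold two chunks each and page 2 holds the last one, so `total_pages` is 3. A page past the end yields an empty `chunks` list rather than an error.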

index(elements, **kwargs) abstractmethod

Index data from a source file into the target index (for example, Elasticsearch).

Parameters:

Name Type Description Default
elements Any

The information to be indexed. Ideally a list[dict] in which each dict follows the structure of the 'Element' model.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Name Type Description
Any Any

The response from the indexing process.

index_chunk(element, **kwargs) abstractmethod

Index a single chunk.

Note: This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.

Parameters:

Name Type Description Default
element dict[str, Any]

The chunk to be indexed. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response from the indexing process. Should include:

1. success (bool): True if indexing succeeded, False otherwise.
2. error_message (str): Error message if indexing failed, empty string otherwise.
3. chunk_id (str): The ID of the indexed chunk.
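Because index_chunk leaves neighboring chunks untouched, the caller has to work out which adjacent chunks need new links. The sketch below computes those metadata patches. The helper name `neighbor_patches` is hypothetical, and it assumes that previous_chunk and next_chunk hold chunk IDs, which the source does not specify.

```python
from __future__ import annotations

from typing import Any


def neighbor_patches(
    new_chunk: dict[str, Any],
    previous_chunk_id: str | None,
    next_chunk_id: str | None,
) -> list[tuple[str, dict[str, Any]]]:
    """Return (chunk_id, metadata_patch) pairs for the adjacent chunks.

    Assumption: previous_chunk / next_chunk store the neighbor's chunk ID.
    """
    new_id = new_chunk["metadata"]["chunk_id"]
    patches: list[tuple[str, dict[str, Any]]] = []
    if previous_chunk_id is not None:
        # The chunk before the new one now points forward to it.
        patches.append((previous_chunk_id, {"next_chunk": new_id}))
    if next_chunk_id is not None:
        # The chunk after the new one now points back to it.
        patches.append((next_chunk_id, {"previous_chunk": new_id}))
    return patches
```

After a successful index_chunk call, a caller could pass each pair to update_chunk_metadata(chunk_id, file_id, metadata) on the same indexer to bring the neighbors in line.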

index_file_chunks(elements, file_id, **kwargs) abstractmethod

Index chunks for a specific file.

The indexer is responsible for deleting any existing chunks for the file_id before indexing the new ones, so that the file's chunks are completely replaced by the new set.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
file_id str

The ID of the file these chunks belong to.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response from the indexing process. Should include:

1. success (bool): True if indexing succeeded, False otherwise.
2. error_message (str): Error message if indexing failed, empty string otherwise.
3. total (int): The total number of chunks indexed.

update_chunk(element, **kwargs) abstractmethod

Update a chunk by chunk ID.

This method updates both the text content and metadata of a chunk.

Parameters:

Name Type Description Default
element dict[str, Any]

The updated chunk data. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response from the update process. Should include:

1. success (bool): True if update succeeded, False otherwise.
2. error_message (str): Error message if update failed, empty string otherwise.
3. chunk_id (str): The ID of the updated chunk.

update_chunk_metadata(chunk_id, file_id, metadata, **kwargs) abstractmethod

Update metadata for a specific chunk.

This method patches new metadata into the existing chunk metadata. Fields present in both are overwritten with the new values, fields that appear only in the patch are added, and all other existing fields are left unchanged.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to update.

required
file_id str

The ID of the file the chunk belongs to.

required
metadata dict[str, Any]

The metadata fields to update. Only the provided fields will be updated; other existing metadata will remain unchanged.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

The response from the update process. Should include:

1. success (bool): True if update succeeded, False otherwise.
2. error_message (str): Error message if update failed, empty string otherwise.
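The patch semantics described for update_chunk_metadata amount to a dictionary merge. The sketch below uses a hypothetical helper, `patch_metadata`, and assumes a shallow merge; whether nested metadata values are merged recursively is not stated in the source and would depend on the implementation.

```python
from __future__ import annotations

from typing import Any


def patch_metadata(
    existing: dict[str, Any], patch: dict[str, Any]
) -> dict[str, Any]:
    """Return a new metadata dict with the patch applied on top.

    Assumption: the merge is shallow, so a nested value in the patch
    replaces the whole existing value under that key.
    """
    # Later keys win: patched fields overwrite, new fields are added,
    # and fields absent from the patch are carried over unchanged.
    return {**existing, **patch}


existing = {"file_id": "file-1", "chunk_id": "chunk-1", "order": 3, "lang": "en"}
patched = patch_metadata(existing, {"lang": "id", "reviewed": True})
```

Here `patched` keeps `file_id`, `chunk_id`, and `order`, changes `lang` to `"id"`, and gains `reviewed`; the `existing` dict itself is not modified.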