Hierarchical retriever
Hierarchical retriever module for N-level coarse-to-fine retrieval.
This module provides HierarchicalRetriever, a general-purpose retriever that narrows results from coarse to fine across N levels (e.g., corpus -> doc -> section -> chunk).
Each level can be constrained by the IDs returned by earlier levels, enabling progressive refinement of retrieval results.
ConstraintMode
Bases: StrEnum
Constraint mode for hierarchical retrieval levels.
HierarchicalRetriever(config)
Bases: BaseRetriever[list[Chunk]]
A retriever that performs N-level coarse-to-fine hierarchical retrieval.
The HierarchicalRetriever executes a sequence of retrieval levels, where each level can be constrained by the IDs returned from previous levels. This enables progressive refinement from coarse-grained to fine-grained retrieval (e.g., corpus -> document -> chunk).
Algorithm
- Initialize the results_by_level dictionary.
- For each level, in order:
  1. Determine the constraint IDs based on the constrain_by mode.
  2. If the level is constrained and no IDs are available, short-circuit and return [].
  3. Build filters with the constraint IDs.
  4. Retrieve chunks using the level retriever.
  5. Apply the score threshold, if set.
  6. Stable sort by (-score, id).
  7. Store the results in results_by_level.
  8. Log level execution details.
- Select the output level (the configured one, or the last level).
- Return the top final_top_k results, stably sorted.
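The steps above can be sketched in a few lines of plain Python. This is a minimal, synchronous illustration, not the library's implementation: the `Chunk` and `Level` dataclasses, the `(query, filters, top_k)` retriever signature, the use of parent chunk IDs as constraint IDs, and the inclusive threshold comparison are all assumptions made for the example.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Optional

logger = logging.getLogger(__name__)


@dataclass
class Chunk:
    """Stand-in for the library's Chunk type (fields are assumptions)."""
    id: str
    score: float
    metadata: dict = field(default_factory=dict)


# Hypothetical retriever signature: (query, filters, top_k) -> chunks.
Retriever = Callable[[str, dict, int], list]


@dataclass
class Level:
    """Stand-in for LevelConfig, with plain functions as retrievers."""
    name: str
    retriever: Retriever
    top_k: int
    filter_key: str
    constrain_by: Optional[str] = None  # only "previous" is handled here
    score_threshold: Optional[float] = None


def hierarchical_retrieve(query, levels, output_level=None, final_top_k=None):
    """Run each level in order; assumes `levels` is non-empty."""
    results_by_level = {}
    previous_name = None
    for level in levels:
        filters = {}
        if level.constrain_by == "previous" and previous_name is not None:
            # Assumption: constraint IDs are the IDs of the previous level's hits.
            ids = [c.id for c in results_by_level[previous_name]]
            if not ids:
                return []  # constrained, but nothing to constrain by
            filters[level.filter_key] = ids
        chunks = level.retriever(query, filters, level.top_k)
        if level.score_threshold is not None:
            # Assumption: the threshold is inclusive.
            chunks = [c for c in chunks if c.score >= level.score_threshold]
        # Highest score first; ties broken by id so the order is deterministic.
        chunks = sorted(chunks, key=lambda c: (-c.score, c.id))
        results_by_level[level.name] = chunks
        logger.debug("level=%s kept=%d", level.name, len(chunks))
        previous_name = level.name
    selected = results_by_level[output_level or levels[-1].name]
    return selected if final_top_k is None else selected[:final_top_k]


# Toy in-memory data standing in for real indexes.
docs = [Chunk("d1", 0.9)]
chunks = [
    Chunk("c4", 0.8, {"parent_doc_id": "d1"}),
    Chunk("c2", 0.95, {"parent_doc_id": "d2"}),
    Chunk("c3", 0.6, {"parent_doc_id": "d1"}),
    Chunk("c1", 0.8, {"parent_doc_id": "d1"}),
]


def doc_retriever(query, filters, top_k):
    return docs[:top_k]


def chunk_retriever(query, filters, top_k):
    allowed = filters.get("parent_doc_id")
    hits = [c for c in chunks
            if allowed is None or c.metadata["parent_doc_id"] in allowed]
    return hits[:top_k]


levels = [
    Level("document", doc_retriever, top_k=20, filter_key="doc_id"),
    Level("chunk", chunk_retriever, top_k=50, filter_key="parent_doc_id",
          constrain_by="previous", score_threshold=0.7),
]
top = hierarchical_retrieve("search query", levels,
                            output_level="chunk", final_top_k=10)
print([c.id for c in top])  # ['c1', 'c4']
```

In this toy run, c2 is excluded by the parent constraint despite having the highest score, c3 falls below the threshold, and the tie between c1 and c4 is resolved by ID.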
Examples:
from gllm_retrieval.retriever.hierarchical_retriever import (
    HierarchicalRetriever,
    HierarchicalRetrieverConfig,
    LevelConfig,
)

config = HierarchicalRetrieverConfig(
    levels=[
        LevelConfig(
            name="document",
            retriever=doc_retriever,
            top_k=20,
            filter_key="doc_id",
            constrain_by=None,
        ),
        LevelConfig(
            name="chunk",
            retriever=chunk_retriever,
            top_k=50,
            filter_key="parent_doc_id",
            constrain_by="previous",
            score_threshold=0.7,
        ),
    ],
    output_level="chunk",
    final_top_k=10,
)

retriever = HierarchicalRetriever(config=config)
results = await retriever.retrieve("search query")
# results: [Chunk(...), Chunk(...), ...]
Attributes:
| Name | Type | Description |
|---|---|---|
| config | HierarchicalRetrieverConfig | The configuration for hierarchical retrieval. |
Initialize the HierarchicalRetriever with a configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | HierarchicalRetrieverConfig | The hierarchical retrieval configuration. | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If config is not a HierarchicalRetrieverConfig instance. |
retrieve(query, query_filter=None, **kwargs)
async
Retrieve documents using hierarchical retrieval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str \| list[str] | The query string or list of query strings. | required |
| query_filter | FilterClause \| QueryFilter \| None | Base filter for all levels. This filter is combined with level-specific constraint filters. Defaults to None. | None |
| **kwargs | Any | Additional parameters passed to level retrievers. | {} |
Returns:
| Type | Description |
|---|---|
| list[Chunk] \| list[list[Chunk]] | Retrieved chunks. Returns list[Chunk] for a single query and list[list[Chunk]] for batch queries. |
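The single-versus-batch return shape can be illustrated with a small dispatch sketch. The helper names `retrieve_single` and `retrieve` below are hypothetical, the fake chunk labels stand in for `Chunk` objects, and running batch queries concurrently with `asyncio.gather` is an assumption; the source does not say how batches are scheduled.

```python
import asyncio


async def retrieve_single(query: str) -> list:
    # Hypothetical single-query pipeline; returns fake chunk labels.
    return [f"{query}-chunk-{i}" for i in range(2)]


async def retrieve(query):
    # A str yields one flat list; a list of str yields one list per query.
    if isinstance(query, str):
        return await retrieve_single(query)
    return list(await asyncio.gather(*(retrieve_single(q) for q in query)))


single = asyncio.run(retrieve("a"))
batch = asyncio.run(retrieve(["a", "b"]))
print(single)  # ['a-chunk-0', 'a-chunk-1']
print(batch)   # [['a-chunk-0', 'a-chunk-1'], ['b-chunk-0', 'b-chunk-1']]
```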
HierarchicalRetrieverConfig
Bases: BaseModel
Configuration for the HierarchicalRetriever.
Attributes:
| Name | Type | Description |
|---|---|---|
| levels | list[LevelConfig] | List of level configurations in order of execution. |
| output_level | str \| None | Name of the level to output results from. If None, outputs from the last level. |
| final_top_k | int \| None | Maximum number of final results to return. If None, returns all results from the output level. |
validate_config()
Validate the entire configuration.
Returns:
| Name | Type | Description |
|---|---|---|
| HierarchicalRetrieverConfig | HierarchicalRetrieverConfig | The validated configuration. |
Raises:
| Type | Description |
|---|---|
| ValueError | If level names are not unique. |
| ValueError | If output_level is not found in level names. |
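The two checks listed under Raises can be sketched without any framework. The function name `check_levels` and the error messages are illustrative assumptions; the real validate_config runs on the config model itself and its wording may differ.

```python
from collections import Counter


def check_levels(level_names, output_level=None):
    """Raise ValueError on duplicate level names or an unknown output_level."""
    duplicates = sorted(
        name for name, count in Counter(level_names).items() if count > 1
    )
    if duplicates:
        raise ValueError(f"Level names must be unique; duplicated: {duplicates}")
    if output_level is not None and output_level not in level_names:
        raise ValueError(
            f"output_level {output_level!r} is not one of {list(level_names)}"
        )


check_levels(["document", "chunk"], output_level="chunk")  # passes silently
```

Passing `["chunk", "chunk"]` or `output_level="section"` with the names above would raise ValueError in this sketch.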
LevelConfig
Bases: BaseModel
Configuration for a single retrieval level in the hierarchy.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Unique name for this level (e.g., "corpus", "document", "chunk"). |
| retriever | BaseRetriever[list[Chunk]] | The retriever to use for this level. |
| top_k | int | Maximum number of results to retrieve at this level. |
| filter_key | str | The metadata field name used to filter by parent IDs. |
| constrain_by | ConstraintMode \| None | How to constrain this level by prior levels. None means no constraint (typically for the first level). |
| score_threshold | float \| None | Minimum score threshold for filtering results. If None, no threshold filtering is applied. |
validate_constrain_by(value)
classmethod
Validate and convert constrain_by to ConstraintMode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | None \| ConstraintMode \| str | The value to validate and convert. | required |
Returns:
| Type | Description |
|---|---|
| ConstraintMode \| None | The validated and converted value. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the string value is not a valid ConstraintMode. |
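A rough sketch of this kind of coercion follows. The enum here is a stand-in built on `(str, Enum)` rather than the library's StrEnum, and it lists only `PREVIOUS`, the one value that appears in the usage example; any other members of the real ConstraintMode are not documented here. The function name `coerce_constrain_by` is hypothetical.

```python
from enum import Enum


class ConstraintMode(str, Enum):
    """Stand-in enum; only "previous" is confirmed by the usage example."""
    PREVIOUS = "previous"


def coerce_constrain_by(value):
    # None and existing enum members pass through unchanged.
    if value is None or isinstance(value, ConstraintMode):
        return value
    try:
        return ConstraintMode(value)  # lookup by value, e.g. "previous"
    except ValueError as exc:
        valid = [mode.value for mode in ConstraintMode]
        raise ValueError(
            f"Invalid constrain_by {value!r}; expected one of {valid}"
        ) from exc


print(coerce_constrain_by("previous") is ConstraintMode.PREVIOUS)  # True
```

This is why the usage example can pass `constrain_by="previous"` as a plain string.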