Utils
Utility modules for gllm-retrieval.
Fuseable
Bases: Protocol
Protocol for objects that can be fused using rank fusion algorithms.
Objects must have an id attribute for deduplication and an optional
score attribute that can be set with the fusion score.
id (property)
Unique identifier for deduplication.
score (property, writable)
Optional score attribute that can be set.
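As a rough illustration of what satisfying this protocol might look like, the sketch below defines a local stand-in protocol and a hypothetical chunk type. `FuseableLike` and `ExampleChunk` are assumptions made for this example and are not part of gllm-retrieval; only the `id` and `score` attribute names come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class FuseableLike(Protocol):
    """Local stand-in mirroring the protocol described above (assumed shape)."""

    @property
    def id(self) -> Any: ...

    @property
    def score(self) -> Any: ...

    @score.setter
    def score(self, value: Any) -> None: ...


@dataclass
class ExampleChunk:
    """Hypothetical chunk type; plain attributes are enough to match the protocol."""

    id: str
    content: str
    score: Optional[float] = None


chunk = ExampleChunk(id="doc-1", content="hello")
chunk.score = 0.42  # a fusion function may overwrite this value
is_fuseable = isinstance(chunk, FuseableLike)  # structural check on attribute presence
```

Because the check is structural, any object exposing `id` and a settable `score` should qualify, with no inheritance required.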
concat_fusion(chunk_lists, **kwargs)
Concatenates lists in order, deduplicating by chunk ID.
Preserves the order of first appearance. If a chunk with the same ID appears in multiple lists, only the first occurrence is kept.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_lists` | `list[list[FuseableT]]` | A list of chunk lists to concatenate. | required |
| `**kwargs` | `Any` | Keyword arguments (unused, accepted for signature compatibility). | `{}` |
Returns:
| Type | Description |
|---|---|
| `list[FuseableT]` | Deduplicated concatenated list of chunks. |
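The first-occurrence-wins behavior described above can be sketched as follows. This is a minimal illustration and not the library's implementation; `ExampleChunk` and `concat_dedup_sketch` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExampleChunk:
    """Hypothetical chunk type with the two attributes the protocol expects."""

    id: str
    score: Any = None


def concat_dedup_sketch(chunk_lists, **kwargs):
    """Concatenate lists in order, keeping the first chunk seen for each ID."""
    seen = set()
    merged = []
    for chunks in chunk_lists:
        for chunk in chunks:
            if chunk.id in seen:
                continue  # a chunk with this ID was already kept
            seen.add(chunk.id)
            merged.append(chunk)
    return merged


a, b, c = ExampleChunk("a"), ExampleChunk("b"), ExampleChunk("c")
duplicate_b = ExampleChunk("b")
merged = concat_dedup_sketch([[a, b], [duplicate_b, c]])
merged_ids = [chunk.id for chunk in merged]
```

Here `merged_ids` is `["a", "b", "c"]`, and the retained `"b"` is the object from the first list.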
format_sql_query(query)
Format the SQL query to ensure it is correctly structured.
Removes the code block markdown from the SQL query and trims any leading or trailing whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The SQL query output from the language model. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The formatted SQL query. |
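A minimal sketch of this kind of cleanup is shown below. It is an illustration only and may differ from what `format_sql_query` actually does; `strip_sql_fence_sketch` is a hypothetical helper, and it assumes the opening fence (with an optional language tag) sits on its own line.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to avoid nesting fences


def strip_sql_fence_sketch(query: str) -> str:
    """Remove a surrounding Markdown code fence, then trim whitespace."""
    text = query.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including an optional language tag.
        # Assumption: the fence is followed by a newline before the SQL starts.
        _, _, text = text.partition("\n")
    if text.rstrip().endswith(FENCE):
        # Drop the closing fence.
        text = text.rstrip()[: -len(FENCE)]
    return text.strip()


raw = f"{FENCE}sql\nSELECT id FROM users;\n{FENCE}\n"
cleaned = strip_sql_fence_sketch(raw)
```

With the input above, `cleaned` is `"SELECT id FROM users;"`. Unfenced input is only trimmed.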
passthrough_fusion(chunk_lists, **kwargs)
Returns all retriever results as-is without any processing.
Returns the nested list structure unchanged, preserving the separation between each retriever's results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_lists` | `list[list[FuseableT]]` | A list of chunk lists from each retriever. | required |
| `**kwargs` | `Any` | Keyword arguments (unused, accepted for signature compatibility). | `{}` |
Returns:
| Type | Description |
|---|---|
| `list[list[FuseableT]]` | The original nested list structure unchanged. |
resolve_fusion_fn(fn)
Resolves fn to a callable.
If fn is a string, looks it up in FUSION_REGISTRY. If fn is already callable, returns it unchanged.
Custom fusion functions should accept:
- chunk_lists (list[list[FuseableT]]): Required first positional argument.
- **kwargs (Any): Optional keyword arguments (weights, rank_constant, etc.).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `str \| FusionCallable` | Fusion function name or callable. | required |
Returns:
| Type | Description |
|---|---|
| `FusionCallable` | The resolved fusion callable. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If fn is a string not found in FUSION_REGISTRY. |
| `TypeError` | If fn is neither a string nor a callable. |
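The lookup and error behavior described above could be sketched roughly as follows. `EXAMPLE_REGISTRY`, `first_list_only`, and `resolve_sketch` are hypothetical stand-ins; the actual contents of FUSION_REGISTRY are not shown in this reference, so none are assumed here.

```python
def first_list_only(chunk_lists, **kwargs):
    """Hypothetical custom fusion function: keep only the first retriever's results."""
    return list(chunk_lists[0]) if chunk_lists else []


# Hypothetical stand-in for FUSION_REGISTRY (the real registry keys are not assumed).
EXAMPLE_REGISTRY = {"first": first_list_only}


def resolve_sketch(fn):
    """Return a callable for a registry name, or pass a callable through unchanged."""
    if isinstance(fn, str):
        if fn not in EXAMPLE_REGISTRY:
            raise ValueError(f"Unknown fusion function: {fn!r}")
        return EXAMPLE_REGISTRY[fn]
    if callable(fn):
        return fn
    raise TypeError(f"Expected a string or callable, got {type(fn).__name__}")


resolved_by_name = resolve_sketch("first")
resolved_callable = resolve_sketch(first_list_only)
```

Checking for a string before checking `callable` keeps the two error cases distinct: an unknown name is a `ValueError`, while any other non-callable value is a `TypeError`.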
rrf_fusion(chunk_lists, **kwargs)
Reciprocal Rank Fusion with weights.
Delegates to gllm_retrieval.utils.weighted_reciprocal_rank.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_lists` | `list[list[FuseableT]]` | A list of ranked chunk lists. | required |
| `**kwargs` | `Any` | Keyword arguments, including `weights` (`list[float]`): weights for each chunk list, defaulting to equal weights; and `rank_constant` (`int`): constant for the RRF calculation, defaulting to 60. | `{}` |
Returns:
| Type | Description |
|---|---|
| `list[FuseableT]` | Fused and deduplicated list of chunks, sorted by RRF score. |
validate_query(query, dialect='postgres')
Validates that the given string is a valid SQL statement, using sqlglot.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The SQL query to be validated. | required |
| `dialect` | `str` | The SQL dialect to be used for validation. Defaults to "postgres". | `'postgres'` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the query is not a valid SQL statement. |
weighted_reciprocal_rank(doc_lists, weights, rank_constant=60, set_scores=False)
Perform weighted Reciprocal Rank Fusion on multiple rank lists.
This function implements the Weighted Reciprocal Rank Fusion (RRF) algorithm, which combines multiple ranked document lists into a single ranked list. RRF is particularly effective for combining results from different retrieval strategies (e.g., filtered search and semantic search, or multiple retrievers).
The RRF score for each document is calculated as:
score = sum(weight_i / (rank_i + k)) for each list i
where rank_i is the document's rank in list i (1-based), and k is the rank constant.
Examples:
from gllm_retrieval.utils.rank_fusion import weighted_reciprocal_rank

filtered_results = [chunk1, chunk2, chunk3]  # Ranked by entity filtering
semantic_results = [chunk2, chunk1, chunk4]  # Ranked by semantic similarity

fused = weighted_reciprocal_rank(
    [filtered_results, semantic_results],
    weights=[0.2, 0.8],
    rank_constant=60,
)
# Returns chunks ordered by combined RRF scores
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `doc_lists` | `list[list[FuseableT]]` | Ranked lists of Fuseable objects to merge. Must match the length of `weights`. | required |
| `weights` | `list[float]` | Weights for each rank list. Higher weights give more importance to that retrieval source. | required |
| `rank_constant` | `int` | The rank constant (k) that controls the influence of rank position. Higher values reduce the impact of rank differences. Defaults to 60. | `60` |
| `set_scores` | `bool` | If True, sets the | `False` |
Returns:
| Type | Description |
|---|---|
| `list[FuseableT]` | The final aggregated list of unique documents sorted by their weighted RRF scores in descending order. Documents with higher scores appear first. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the number of rank lists doesn't match the weights count. |
Note
- Documents are deduplicated by their id field.
- The rank_constant parameter controls the influence of rank position; higher rank_constant values reduce the impact of rank differences.
- The result does not depend on the order of doc_lists, provided the weights are reordered to match.
- Lists may overlap (same ID in multiple lists); duplicates are merged.
- Empty inner lists are handled gracefully.
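To make the scoring formula and the notes above concrete, here is a minimal sketch of weighted RRF. It is an illustration under stated assumptions and not the library's code: `ExampleChunk` and `weighted_rrf_sketch` are hypothetical names, the `set_scores` option is deliberately left out, and tie-breaking follows first appearance, which the reference does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleChunk:
    """Hypothetical chunk type with the attributes the protocol expects."""

    id: str
    score: Optional[float] = None


def weighted_rrf_sketch(doc_lists, weights, rank_constant=60):
    """Sum weight / (rank + k) per document ID, then sort by score, descending."""
    if len(doc_lists) != len(weights):
        raise ValueError("Number of rank lists must match the number of weights.")
    scores = {}
    first_seen = {}
    for docs, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(docs, start=1):  # ranks are 1-based
            scores[doc.id] = scores.get(doc.id, 0.0) + weight / (rank + rank_constant)
            first_seen.setdefault(doc.id, doc)  # deduplicate by ID
    return sorted(first_seen.values(), key=lambda d: scores[d.id], reverse=True)


chunk1, chunk2, chunk3, chunk4 = (ExampleChunk(f"chunk{i}") for i in range(1, 5))
filtered_results = [chunk1, chunk2, chunk3]
semantic_results = [chunk2, chunk1, chunk4]

fused = weighted_rrf_sketch([filtered_results, semantic_results], weights=[0.2, 0.8])
fused_ids = [chunk.id for chunk in fused]

# Swapping the lists together with their weights gives the same ranking.
swapped = weighted_rrf_sketch([semantic_results, filtered_results], weights=[0.8, 0.2])
swapped_ids = [chunk.id for chunk in swapped]
```

With weights of 0.2 and 0.8, chunk2 scores 0.2/62 + 0.8/61 and chunk1 scores 0.2/61 + 0.8/62, so chunk2 ranks first. The sketch yields the order chunk2, chunk1, chunk4, chunk3.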