Utils

Utility modules for gllm-retrieval.

Fuseable

Bases: Protocol

Protocol for objects that can be fused using rank fusion algorithms.

Objects must have an id attribute for deduplication and an optional score attribute that can be set with the fusion score.

id property

Unique identifier for deduplication.

score property writable

Optional score attribute that can be set.
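
As an illustration of what satisfying this protocol might look like, here is a minimal sketch. `FuseableSketch` and `MyChunk` are hypothetical names introduced for this example, and the `str` type for `id` is an assumption; the library's actual protocol definition may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class FuseableSketch(Protocol):
    """Hypothetical stand-in for the documented protocol, not the library's code."""

    @property
    def id(self) -> str: ...  # assumption: ids are strings

    @property
    def score(self) -> Optional[float]: ...

    @score.setter
    def score(self, value: Optional[float]) -> None: ...


@dataclass
class MyChunk:
    """Any object with an id and a settable score fits structurally."""

    id: str
    content: str
    score: Optional[float] = None


chunk = MyChunk(id="doc-1", content="hello")
chunk.score = 0.5  # a fusion function could write its score here
is_fuseable = isinstance(chunk, FuseableSketch)
```

Because the protocol is structural, `MyChunk` does not need to inherit from anything; having the two attributes is enough.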

concat_fusion(chunk_lists, **kwargs)

Concatenates lists in order, deduplicating by chunk ID.

Preserves the order of first appearance. If a chunk with the same ID appears in multiple lists, only the first occurrence is kept.

Parameters:

Name Type Description Default
chunk_lists list[list[FuseableT]]

A list of chunk lists to concatenate.

required
**kwargs Any

Keyword arguments (unused, accepted for signature compatibility).

{}

Returns:

Type Description
list[FuseableT]

Deduplicated, concatenated list of chunks.
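
The first-occurrence rule described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's source: `Chunk` and `concat_fusion_sketch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Chunk:
    """Hypothetical Fuseable-like object used only for this example."""

    id: str
    score: Optional[float] = None


def concat_fusion_sketch(chunk_lists: list[list[Chunk]], **kwargs: Any) -> list[Chunk]:
    seen: set[str] = set()
    fused: list[Chunk] = []
    for chunks in chunk_lists:
        for chunk in chunks:
            # Keep only the first chunk seen for each id.
            if chunk.id not in seen:
                seen.add(chunk.id)
                fused.append(chunk)
    return fused


a, b, c = Chunk("a"), Chunk("b"), Chunk("c")
result = concat_fusion_sketch([[a, b], [Chunk("b"), c, Chunk("a")]])
result_ids = [chunk.id for chunk in result]  # ["a", "b", "c"]
```

The later duplicates of "a" and "b" are dropped, and the objects from the first list are the ones that survive.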

format_sql_query(query)

Format the SQL query to ensure it is correctly structured.

Removes Markdown code fences from the SQL query and trims leading and trailing whitespace.

Parameters:

Name Type Description Default
query str

The SQL query output from the language model.

required

Returns:

Name Type Description
str str

The formatted SQL query.
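
To make the behaviour concrete, here is a rough sketch of this kind of cleanup. It is an assumption about how such a function could work, not the library's implementation; `format_sql_query_sketch` and `FENCE` are hypothetical names.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def format_sql_query_sketch(query: str) -> str:
    text = query.strip()
    # Drop an opening fence, plus a language tag such as "sql" if it ends the line.
    text = re.sub(r"^" + FENCE + r"(?:[A-Za-z]*[ \t]*\n)?", "", text)
    # Drop a closing fence at the end of the text.
    text = re.sub(r"\n?" + FENCE + r"$", "", text)
    return text.strip()


raw = FENCE + "sql\nSELECT id FROM users;\n" + FENCE + "\n"
formatted = format_sql_query_sketch(raw)  # "SELECT id FROM users;"
```

A query without fences passes through with only its surrounding whitespace removed.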

passthrough_fusion(chunk_lists, **kwargs)

Returns all retriever results as-is without any processing.

Returns the nested list structure unchanged, preserving the separation between each retriever's results.

Parameters:

Name Type Description Default
chunk_lists list[list[FuseableT]]

A list of chunk lists from each retriever.

required
**kwargs Any

Keyword arguments (unused, accepted for signature compatibility).

{}

Returns:

Type Description
list[list[FuseableT]]

The original nested list structure, unchanged.

resolve_fusion_fn(fn)

Resolves fn to a callable.

If fn is a string, looks it up in FUSION_REGISTRY. If fn is already callable, returns it unchanged.

Custom fusion functions should accept:
  • chunk_lists (list[list[FuseableT]]): Required first positional argument
  • **kwargs (Any): Optional keyword arguments (weights, rank_constant, etc.)

Parameters:

Name Type Description Default
fn str | FusionCallable

Fusion function name or callable.

required

Returns:

Name Type Description
FusionCallable FusionCallable

The resolved fusion callable.

Raises:

Type Description
ValueError

If fn is a string not found in FUSION_REGISTRY.

TypeError

If fn is neither a string nor a callable.
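
The resolution rules above can be sketched as follows. This is an illustrative reimplementation, not the library's code: `REGISTRY_SKETCH`, `FusionFn`, `flatten_fusion`, and `reversed_fusion` are hypothetical names, and the real `FUSION_REGISTRY` keys are not listed on this page.

```python
from typing import Any, Callable, Union

FusionFn = Callable[..., list]  # hypothetical alias standing in for FusionCallable


def flatten_fusion(chunk_lists: list[list[Any]], **kwargs: Any) -> list[Any]:
    """Toy fusion function: concatenates the lists without deduplication."""
    return [chunk for chunks in chunk_lists for chunk in chunks]


# Hypothetical registry used only for this example.
REGISTRY_SKETCH: dict[str, FusionFn] = {"flatten": flatten_fusion}


def resolve_fusion_fn_sketch(fn: Union[str, FusionFn]) -> FusionFn:
    if isinstance(fn, str):
        if fn not in REGISTRY_SKETCH:
            raise ValueError(f"Unknown fusion function {fn!r}")
        return REGISTRY_SKETCH[fn]
    if callable(fn):
        return fn  # already a callable: returned unchanged
    raise TypeError(f"fn must be a string or a callable, got {type(fn).__name__}")


def reversed_fusion(chunk_lists: list[list[Any]], **kwargs: Any) -> list[Any]:
    """Custom function following the expected (chunk_lists, **kwargs) shape."""
    return [chunk for chunks in reversed(chunk_lists) for chunk in chunks]


by_name = resolve_fusion_fn_sketch("flatten")
by_callable = resolve_fusion_fn_sketch(reversed_fusion)
```

Accepting `**kwargs` in every fusion function lets a caller pass options such as weights uniformly, even to functions that ignore them.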

rrf_fusion(chunk_lists, **kwargs)

Reciprocal Rank Fusion with weights.

Delegates to gllm_retrieval.utils.weighted_reciprocal_rank.

Parameters:

Name Type Description Default
chunk_lists list[list[FuseableT]]

A list of ranked chunk lists.

required
**kwargs Any

Keyword arguments, including:
  • weights (list[float]): Weights for each chunk list. Defaults to equal weights.
  • rank_constant (int): Constant for the RRF calculation. Defaults to 60.

{}

Returns:

Type Description
list[FuseableT]

Fused and deduplicated list of chunks, sorted by RRF score.
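
The following sketch shows how the keyword arguments might be unpacked before scoring. It is not the library's implementation: `Chunk` and `rrf_fusion_sketch` are hypothetical names, the scoring is inlined rather than delegated, and treating "equal weights" as 1.0 per list is an assumption (any equal value yields the same ordering).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Chunk:
    """Hypothetical Fuseable-like object used only for this example."""

    id: str
    score: Optional[float] = None


def rrf_fusion_sketch(chunk_lists: list[list[Chunk]], **kwargs: Any) -> list[Chunk]:
    # Assumption: equal weights means 1.0 for every list.
    weights = kwargs.get("weights") or [1.0] * len(chunk_lists)
    rank_constant = kwargs.get("rank_constant", 60)

    scores: dict[str, float] = defaultdict(float)
    first_seen: dict[str, Chunk] = {}
    for chunks, weight in zip(chunk_lists, weights):
        for rank, chunk in enumerate(chunks, start=1):
            scores[chunk.id] += weight / (rank + rank_constant)
            first_seen.setdefault(chunk.id, chunk)
    return sorted(first_seen.values(), key=lambda c: scores[c.id], reverse=True)


keyword_hits = [Chunk("a"), Chunk("b"), Chunk("c")]
vector_hits = [Chunk("c"), Chunk("b")]

default_ids = [c.id for c in rrf_fusion_sketch([keyword_hits, vector_hits])]
weighted_ids = [
    c.id for c in rrf_fusion_sketch([keyword_hits, vector_hits], weights=[0.9, 0.1])
]
# default_ids  -> ["c", "b", "a"]
# weighted_ids -> ["b", "c", "a"]
```

With equal weights, "c" wins because it ranks first in one list; shifting weight toward the keyword list moves "b" ahead of it.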

validate_query(query, dialect='postgres')

Validates whether the given string is a valid SQL statement, using sqlglot.

Parameters:

Name Type Description Default
query str

The SQL query to be validated.

required
dialect str

The SQL dialect to be used for validation. Defaults to "postgres".

'postgres'

Raises:

Type Description
ValueError

If the query is not a valid SQL statement.

weighted_reciprocal_rank(doc_lists, weights, rank_constant=60, set_scores=False)

Perform weighted Reciprocal Rank Fusion on multiple ranked lists.

This function implements the Weighted Reciprocal Rank Fusion (RRF) algorithm, which combines multiple ranked document lists into a single ranked list. RRF is particularly effective for combining results from different retrieval strategies (e.g., filtered search and semantic search, or multiple retrievers).

The RRF score for each document is calculated as:

    score = sum(weight_i / (rank_i + k)) for each list i

where rank_i is the document's rank in list i (1-based), and k is the rank constant.

Examples:

from gllm_retrieval.utils.rank_fusion import weighted_reciprocal_rank

filtered_results = [chunk1, chunk2, chunk3]  # Ranked by entity filtering
semantic_results = [chunk2, chunk1, chunk4]  # Ranked by semantic similarity
fused = weighted_reciprocal_rank(
    [filtered_results, semantic_results],
    weights=[0.2, 0.8],
    rank_constant=60
)
# Returns chunks ordered by combined RRF scores

Parameters:

Name Type Description Default
doc_lists list[list[FuseableT]]

Ranked lists of Fuseable objects to merge. Must match the length of weights.

required
weights list[float]

Weights for each rank list. Higher weights give more importance to that retrieval source.

required
rank_constant int

The rank constant (k) that controls the influence of rank position. Higher values reduce the impact of rank differences. Defaults to 60.

60
set_scores bool

If True, sets the score attribute on each document to its RRF score. Defaults to False.

False

Returns:

Type Description
list[FuseableT]

The final aggregated list of unique documents, sorted by their weighted RRF scores in descending order. Documents with higher scores appear first.

Raises:

Type Description
ValueError

If the number of rank lists doesn't match the weights count.

Note
  1. Documents are deduplicated by their id field
  2. The rank_constant parameter controls the influence of rank position
  3. Higher rank_constant values reduce the impact of rank differences
  4. The fused scores do not depend on the order of doc_lists, as long as weights is reordered to match
  5. Lists may overlap (same ID in multiple lists); duplicates are merged
  6. Empty inner lists are handled gracefully
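
The formula and notes above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the library's source: `Chunk` and `weighted_rrf_sketch` are hypothetical names, and tie-breaking behaviour is not specified on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical Fuseable-like object used only for this example."""

    id: str
    score: Optional[float] = None


def weighted_rrf_sketch(doc_lists, weights, rank_constant=60, set_scores=False):
    if len(doc_lists) != len(weights):
        raise ValueError("doc_lists and weights must have the same length")

    scores: dict[str, float] = {}
    docs_by_id: dict[str, Chunk] = {}
    for docs, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(docs, start=1):  # ranks are 1-based
            scores[doc.id] = scores.get(doc.id, 0.0) + weight / (rank + rank_constant)
            docs_by_id.setdefault(doc.id, doc)  # keep the first object seen per id

    ranked = sorted(docs_by_id.values(), key=lambda d: scores[d.id], reverse=True)
    if set_scores:
        for doc in ranked:
            doc.score = scores[doc.id]
    return ranked


c1, c2, c3, c4 = (Chunk(f"chunk{i}") for i in range(1, 5))
filtered_results = [c1, c2, c3]
semantic_results = [c2, c1, c4]

fused = weighted_rrf_sketch(
    [filtered_results, semantic_results], [0.2, 0.8], set_scores=True
)
fused_ids = [doc.id for doc in fused]
# fused_ids -> ["chunk2", "chunk1", "chunk4", "chunk3"]

# Swapping the lists together with their weights gives the same ranking.
swapped = weighted_rrf_sketch([semantic_results, filtered_results], [0.8, 0.2])
swapped_ids = [doc.id for doc in swapped]
```

Here chunk2 scores 0.2/62 + 0.8/61, which edges out chunk1 at 0.2/61 + 0.8/62 because the semantic list carries the larger weight. chunk4 beats chunk3 for the same reason, even though both sit at rank 3 in their only list.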