Aggregation

Aggregation module for GEval metrics.

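This module provides metric aggregation utilities: polarity-aware pass/fail scoring across metrics (MetricsAggregator) and strategies for combining repeated judge results (BaseJudgeAggregator and its subclasses). Extend MetricsAggregator and override compute_score or compute_success to customize behavior per evaluator.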

AggregationResult(aggregate_success, aggregate_score) dataclass

Result of metric aggregation.

Attributes:

Name Type Description
aggregate_success bool

True if all metrics passed (AND-gate), False otherwise.

aggregate_score float

Mean quality score with polarity inversion applied.

AverageAggregationStrategy

Bases: BaseJudgeAggregator

Aggregate repeated judge results using arithmetic mean.

strategy property

Return the aggregation identifier for arithmetic averaging.

Returns:

Name Type Description
AggregationMethod AggregationMethod

AggregationMethod.AVERAGE.

aggregate(all_results, total_judges)

Aggregate judge results by computing the arithmetic mean score.

The representative result is chosen as the valid judge output whose numeric score is closest to the computed average, using input order as the tie-breaker.

Parameters:

Name Type Description Default
all_results list[MetricOutput]

Raw results produced by each judge, including successful outputs and optional error payloads.

required
total_judges int

Total number of judges configured for the evaluation run.

required

Returns:

Name Type Description
MetricOutput MetricOutput

Representative result annotated with average-based metadata.

Raises:

Type Description
ValueError

If no valid judge results exist, or if any valid score is non-numeric.
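The selection rule described above (mean score, then the judge output closest to it, with input order breaking ties) can be sketched as follows. This is a minimal illustration and not the library's implementation: `JudgeResult` and `pick_closest_to_mean` are hypothetical names standing in for `MetricOutput` and the strategy's internals, and the input is assumed to be filtered down to valid numeric results already.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean


@dataclass
class JudgeResult:
    """Hypothetical stand-in for MetricOutput; only the score matters here."""
    score: float
    reason: str = ""


def pick_closest_to_mean(results: list[JudgeResult]) -> tuple[float, JudgeResult]:
    """Return the mean score and the result whose score is closest to it."""
    if not results:
        raise ValueError("no valid judge results")
    mean = fmean(r.score for r in results)
    # min() returns the first minimal element, so earlier results win ties.
    representative = min(results, key=lambda r: abs(r.score - mean))
    return mean, representative


judges = [
    JudgeResult(1.0, "a"),
    JudgeResult(3.0, "b"),
    JudgeResult(2.0, "c"),
    JudgeResult(2.0, "d"),
]
mean, representative = pick_closest_to_mean(judges)
```

Here the mean is 2.0 and two judges scored exactly 2.0, so the earlier one ("c") is selected.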

BaseJudgeAggregator

Bases: ABC

Abstract strategy for repeated-judge result aggregation.

strategy abstractmethod property

Return the canonical aggregation strategy identifier.

Returns:

Name Type Description
AggregationMethod AggregationMethod

Enum value that identifies the aggregation strategy.

aggregate(all_results, total_judges) abstractmethod

Aggregate repeated metric results into one representative result.

Parameters:

Name Type Description Default
all_results list[MetricOutput]

Raw results produced by each judge, including successful outputs and optional error payloads.

required
total_judges int

Total number of judges configured for the evaluation run.

required

Returns:

Name Type Description
MetricOutput MetricOutput

Representative aggregated result containing the selected score, metadata, and supporting judge context.

Raises:

Type Description
ValueError

If the implementation cannot produce a valid aggregate from the input.

TypeError

If the implementation rejects the provided input type or strategy.

aggregate_repeated_results(all_results, total_judges)

Extract valid repeated-judge results and supporting metadata.

Parameters:

Name Type Description Default
all_results list[MetricOutput]

Raw results produced by each judge, including successful outputs and optional error payloads.

required
total_judges int

Total number of judges configured for the evaluation run.

required

Returns:

Name Type Description
RepeatedResults RepeatedResults

Tuple containing the valid results, their scores, their original indices in all_results, and the collected judge error messages.

Raises:

Type Description
ValueError

If no judge results are provided, or if every result is invalid after filtering out missing scores and explicit errors.
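The filtering step can be pictured roughly as below. This is a sketch under stated assumptions: `JudgeResult`, `Filtered`, and `filter_valid` are hypothetical stand-ins for `MetricOutput`, `RepeatedResults`, and the real function; the `error` field name and the error-message format are invented for illustration; results with a missing score are assumed to be skipped silently; and `total_judges` is left out because the text above does not say how it is used.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import NamedTuple, Optional


@dataclass
class JudgeResult:
    """Hypothetical stand-in for MetricOutput."""
    score: Optional[float] = None
    error: Optional[str] = None


class Filtered(NamedTuple):
    """Hypothetical stand-in for RepeatedResults."""
    results: list[JudgeResult]
    scores: list[float]
    indices: list[int]
    errors: list[str]


def filter_valid(all_results: list[JudgeResult]) -> Filtered:
    """Keep results that have a score and no explicit error."""
    if not all_results:
        raise ValueError("no judge results provided")
    out = Filtered([], [], [], [])
    for index, result in enumerate(all_results):
        if result.error is not None:
            # Explicit judge error: record the message, drop the result.
            out.errors.append(f"judge {index}: {result.error}")
        elif result.score is None:
            # Missing score: dropped without a message (assumption).
            continue
        else:
            out.results.append(result)
            out.scores.append(result.score)
            out.indices.append(index)
    if not out.results:
        raise ValueError("every judge result was invalid")
    return out


batch = [
    JudgeResult(score=0.8),
    JudgeResult(error="timeout"),
    JudgeResult(),
    JudgeResult(score=0.6),
]
filtered = filter_valid(batch)
```

Keeping the original indices alongside the scores lets a strategy map its chosen score back to the judge that produced it.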

MajorityVoteAggregationStrategy

Bases: BaseJudgeAggregator

Aggregate repeated judge results using majority vote.

strategy property

Return the aggregation identifier for majority vote.

Returns:

Name Type Description
AggregationMethod AggregationMethod

AggregationMethod.MAJORITY_VOTE.

aggregate(all_results, total_judges)

Aggregate judge results by selecting the most frequent numeric score.

Ties are resolved by delegating to the median strategy so the result still maps to a real judge output.

Parameters:

Name Type Description Default
all_results list[MetricOutput]

Raw results produced by each judge, including successful outputs and optional error payloads.

required
total_judges int

Total number of judges configured for the evaluation run.

required

Returns:

Name Type Description
MetricOutput MetricOutput

Representative result annotated with majority-vote metadata.

Raises:

Type Description
ValueError

If no valid judge results exist, or if any valid score is non-numeric.
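A rough sketch of majority voting with a median fallback, working on bare scores rather than `MetricOutput` objects. The function names are hypothetical, and the sketch assumes that on a tie the median is taken over all valid scores (not only the tied values); the text above does not specify which.

```python
from __future__ import annotations

from collections import Counter


def upper_median(scores: list[float]) -> float:
    """Upper median: an observed score, never an interpolated one."""
    ordered = sorted(scores)
    return ordered[len(ordered) // 2]


def majority_vote(scores: list[float]) -> float:
    """Return the most frequent score, falling back to the median on ties."""
    if not scores:
        raise ValueError("no valid judge results")
    counts = Counter(scores)
    top = max(counts.values())
    winners = [score for score, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    # Tie between several scores: defer to the median of all scores
    # (assumption), which is still a score some judge actually gave.
    return upper_median(scores)


clear_winner = majority_vote([3, 3, 2, 1])
tied = majority_vote([1, 1, 3, 3, 2])
```

In the second call, 1 and 3 each appear twice, so the fallback picks the middle element of the sorted scores [1, 1, 2, 3, 3], which is 2.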

MedianAggregationStrategy

Bases: BaseJudgeAggregator

Aggregate repeated judge results using the observed median.

strategy property

Return the aggregation identifier for median selection.

Returns:

Name Type Description
AggregationMethod AggregationMethod

AggregationMethod.MEDIAN.

aggregate(all_results, total_judges)

Aggregate judge results by selecting the observed median score.

For even-sized inputs, this strategy chooses the upper median instead of the arithmetic mean so the representative output still corresponds to an actual judge result.

Parameters:

Name Type Description Default
all_results list[MetricOutput]

Raw results produced by each judge, including successful outputs and optional error payloads.

required
total_judges int

Total number of judges configured for the evaluation run.

required

Returns:

Name Type Description
MetricOutput MetricOutput

Representative result annotated with median-selection metadata.

Raises:

Type Description
ValueError

If no valid judge results exist, or if any valid score is non-numeric.
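To make the upper-median choice concrete, the sketch below returns the index of the judge whose score is the upper median, so the caller can return that judge's full output. `upper_median_index` is a hypothetical helper, not part of the documented API; the standard library's `statistics.median_high` computes the same value but not the index.

```python
from __future__ import annotations

from statistics import median, median_high


def upper_median_index(scores: list[float]) -> int:
    """Return the original index of the upper-median score."""
    if not scores:
        raise ValueError("no valid judge results")
    # sorted() is stable, so equal scores keep their input order.
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order[len(order) // 2]


scores = [0.2, 0.9, 0.4, 0.7]
chosen = upper_median_index(scores)
interpolated = median(scores)    # mean of the two middle values
observed = median_high(scores)   # upper of the two middle values
```

For these four scores the two middle values are 0.4 and 0.7. The interpolated median lies between them and matches no judge, whereas the upper median is 0.7, the score at index 3.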

MetricsAggregator

Aggregator for GEval metrics.

Computes aggregate_success (AND-gate) and aggregate_score (polarity-aware mean). Subclass and override compute_success or compute_score to customize behavior per evaluator.

aggregate(named_results)

Aggregate GEval metric results.

Parameters:

Name Type Description Default
named_results dict[str, MetricResult]

Dictionary mapping metric names to MetricResult objects.

required

Returns:

Type Description
AggregationResult

AggregationResult with aggregate_success and aggregate_score. An empty dict returns aggregate_success=False and aggregate_score=0.0.

compute_score(named_results)

Polarity-aware mean of metric scores.

Parameters:

Name Type Description Default
named_results dict[str, MetricResult]

Dictionary mapping metric names to MetricResult objects.

required

Returns:

Name Type Description
float float

Mean of quality-adjusted scores, or 0.0 if empty.

compute_success(named_results)

AND-gate of all metric success flags.

Parameters:

Name Type Description Default
named_results dict[str, MetricResult]

Dictionary mapping metric names to MetricResult objects.

required

Returns:

Name Type Description
bool bool

True if every metric passed. False if any metric failed or if named_results is empty (no metrics evaluated implies no confidence).
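The following sketch mirrors the documented behaviour of aggregate, compute_score, and compute_success, and shows an override in a subclass. It is illustrative only: `Result`, `Aggregate`, `SketchAggregator`, and `AnyPassAggregator` are hypothetical stand-ins, the `higher_is_better` flag is an invented way to represent polarity, and polarity inversion is assumed to mean `1 - score` for scores normalized to [0, 1].

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for MetricResult."""
    score: float                   # assumed to be normalized to [0, 1]
    success: bool
    higher_is_better: bool = True  # hypothetical polarity flag


@dataclass
class Aggregate:
    """Hypothetical stand-in for AggregationResult."""
    aggregate_success: bool
    aggregate_score: float


class SketchAggregator:
    def aggregate(self, named_results: dict[str, Result]) -> Aggregate:
        return Aggregate(
            self.compute_success(named_results),
            self.compute_score(named_results),
        )

    def compute_success(self, named_results: dict[str, Result]) -> bool:
        # AND-gate; an empty dict counts as a failure.
        return bool(named_results) and all(
            r.success for r in named_results.values()
        )

    def compute_score(self, named_results: dict[str, Result]) -> float:
        if not named_results:
            return 0.0
        # Invert lower-is-better metrics so every value reads as quality.
        adjusted = [
            r.score if r.higher_is_better else 1.0 - r.score
            for r in named_results.values()
        ]
        return sum(adjusted) / len(adjusted)


class AnyPassAggregator(SketchAggregator):
    """Example override: pass when at least one metric passed."""

    def compute_success(self, named_results: dict[str, Result]) -> bool:
        return any(r.success for r in named_results.values())


results = {
    "relevance": Result(score=0.5, success=True),
    "toxicity": Result(score=0.25, success=False, higher_is_better=False),
}
strict = SketchAggregator().aggregate(results)
lenient = AnyPassAggregator().aggregate(results)
```

With these inputs the toxicity score of 0.25 is inverted to 0.75, giving a mean of 0.625. The strict aggregator fails because one metric failed, while the overriding subclass passes.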

WeightedMetricsAggregator(weights, score_mapping)

Bases: MetricsAggregator

MetricsAggregator with weighted scoring.

Overrides compute_score to apply per-metric score mappings then a weighted sum.

IMPORTANT: Uses result.rubric_score (pre-threshold integer), not result.score (normalized float), because score_mapping keys are {1, 2, 3} rubric integers.

Attributes:

Name Type Description
weights

Dictionary mapping metric names to their weights.

score_mapping

Dictionary mapping metric names to either:

- dict[int, float]: Maps rubric score integers to normalized floats.
- Callable[[float], float]: Function to transform rubric scores.

Initialize WeightedMetricsAggregator.

Parameters:

Name Type Description Default
weights dict[str, float]

Dictionary mapping metric names to their weights.

required
score_mapping dict[str, dict[int, float] | Callable[[float], float]]

Dictionary mapping metric names to score transformations.

required

compute_score(named_results)

Compute weighted aggregate score using rubric_score lookups.

Parameters:

Name Type Description Default
named_results dict[str, MetricResult]

Dictionary mapping metric names to MetricResult objects.

required

Returns:

Name Type Description
float float

Weighted aggregate score, or 0.0 if no results or zero total weight.
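A hedged sketch of how the weighted score might be computed from rubric_score lookups. Several details here are assumptions rather than documented behaviour: the weighted sum is divided by the total weight (suggested by the zero-total-weight case above, but not confirmed), metrics without a weight are ignored, and metrics without a mapping use the raw rubric score. `Result` and `weighted_score` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Result:
    """Hypothetical stand-in for MetricResult."""
    rubric_score: int   # pre-threshold rubric integer, e.g. 1, 2 or 3
    score: float = 0.0  # normalized float; deliberately unused here


def weighted_score(
    named_results: dict[str, Result],
    weights: dict[str, float],
    score_mapping: dict[str, Union[dict[int, float], Callable[[float], float]]],
) -> float:
    weighted_sum = 0.0
    total_weight = 0.0
    for name, result in named_results.items():
        weight = weights.get(name, 0.0)  # assumption: unweighted = ignored
        mapping = score_mapping.get(name)
        if callable(mapping):
            value = mapping(result.rubric_score)
        elif mapping is not None:
            value = mapping[result.rubric_score]
        else:
            value = float(result.rubric_score)  # assumption: no mapping
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0:
        return 0.0
    return weighted_sum / total_weight  # assumption: normalized by weight


weights = {"accuracy": 2.0, "tone": 1.0}
score_mapping = {
    "accuracy": {1: 0.0, 2: 0.5, 3: 1.0},
    "tone": lambda rubric: (rubric - 1) / 2,
}
results = {"accuracy": Result(rubric_score=3), "tone": Result(rubric_score=2)}
total = weighted_score(results, weights, score_mapping)
```

In this example accuracy maps to 1.0 and tone to 0.5, so the result is (2.0 * 1.0 + 1.0 * 0.5) / 3.0.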

build_aggregation_strategy(aggregation_method)

Build an aggregation strategy from enum, string, or strategy input.

Parameters:

Name Type Description Default
aggregation_method AggregationSelector

Aggregation strategy expressed as an AggregationMethod enum, a compatible string value, or an existing strategy instance.

required

Returns:

Name Type Description
BaseJudgeAggregator BaseJudgeAggregator

Aggregation strategy instance matching the requested method.

Raises:

Type Description
ValueError

If aggregation_method is None or a string that does not map to a supported AggregationMethod value.

TypeError

If aggregation_method has an unsupported type.
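The dispatch described above (enum, string, or existing strategy in; strategy instance out) could look roughly like this. Everything in the sketch is a stand-in: `Method` plays the role of `AggregationMethod`, its string values are guesses, the strategy classes are empty shells, and string matching is assumed to be exact and case-sensitive.

```python
from __future__ import annotations

from enum import Enum


class Method(Enum):
    """Hypothetical stand-in for AggregationMethod; values are guesses."""
    AVERAGE = "average"
    MEDIAN = "median"
    MAJORITY_VOTE = "majority_vote"


class Strategy:
    """Hypothetical stand-in for BaseJudgeAggregator."""
    method: Method


class Average(Strategy):
    method = Method.AVERAGE


class Median(Strategy):
    method = Method.MEDIAN


class MajorityVote(Strategy):
    method = Method.MAJORITY_VOTE


_REGISTRY = {cls.method: cls for cls in (Average, Median, MajorityVote)}


def build_strategy(selector: object) -> Strategy:
    if selector is None:
        raise ValueError("aggregation method must not be None")
    if isinstance(selector, Strategy):
        return selector  # already a strategy: pass it through unchanged
    if isinstance(selector, Method):
        return _REGISTRY[selector]()
    if isinstance(selector, str):
        try:
            method = Method(selector)  # lookup by enum value
        except ValueError:
            raise ValueError(
                f"unsupported aggregation method: {selector!r}"
            ) from None
        return _REGISTRY[method]()
    raise TypeError(f"unsupported selector type: {type(selector).__name__}")


from_string = build_strategy("median")
from_enum = build_strategy(Method.AVERAGE)
existing = MajorityVote()
passed_through = build_strategy(existing)
```

Accepting all three input forms lets configuration files use plain strings while code can pass an enum member or a preconfigured strategy.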