Aggregation

Aggregation module for GEval metrics.

This module provides metric aggregation utilities for polarity-aware pass/fail scoring, along with strategies for combining repeated judge results. Extend MetricsAggregator and override compute_score or compute_success to customize behavior per evaluator.

`AggregationResult(aggregate_success, aggregate_score)` `dataclass`

Result of metric aggregation.

Attributes:

Name	Type	Description
`aggregate_success`	`bool`	True if all metrics passed (AND-gate), False otherwise.
`aggregate_score`	`float`	Mean quality score with polarity inversion applied.

`AverageAggregationStrategy`

Bases: BaseJudgeAggregator

Aggregate repeated judge results using arithmetic mean.

`strategy` `property`

Return the aggregation identifier for arithmetic averaging.

Returns:

Name	Type	Description
`AggregationMethod`	`AggregationMethod`	`AggregationMethod.AVERAGE`.

`aggregate(all_results, total_judges)`

Aggregate judge results by computing the arithmetic mean score.

The representative result is chosen as the valid judge output whose numeric score is closest to the computed average, using input order as the tie-breaker.

Parameters:

Name	Type	Description	Default
`all_results`	`list[MetricOutput]`	Raw results produced by each judge, including successful outputs and optional error payloads.	required
`total_judges`	`int`	Total number of judges configured for the evaluation run.	required

Returns:

Name	Type	Description
`MetricOutput`	`MetricOutput`	Representative result annotated with average-based metadata.

Raises:

Type	Description
`ValueError`	If no valid judge results exist, or if any valid score is non-numeric.
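
The selection rule described above can be sketched in isolation. This is a minimal illustration that works on plain floats rather than `MetricOutput` objects; the function name `pick_closest_to_mean` is hypothetical and not part of this module.

```python
from statistics import fmean


def pick_closest_to_mean(scores: list[float]) -> tuple[float, int]:
    """Return the mean and the index of the score closest to it.

    Hypothetical helper for illustration only.
    """
    if not scores:
        raise ValueError("no valid judge scores")
    mean = fmean(scores)
    # min() returns the first minimal element, so the earliest input wins ties.
    index = min(range(len(scores)), key=lambda i: abs(scores[i] - mean))
    return mean, index


mean, index = pick_closest_to_mean([1.0, 4.0, 2.0, 3.0])
```

Here the scores 2.0 and 3.0 are equally close to the mean of 2.5, and the earlier one (index 2) is selected.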

`BaseJudgeAggregator`

Bases: ABC

Abstract strategy for repeated-judge result aggregation.

`strategy` `abstractmethod` `property`

Return the canonical aggregation strategy identifier.

Returns:

Name	Type	Description
`AggregationMethod`	`AggregationMethod`	Enum value that identifies the aggregation strategy.

`aggregate(all_results, total_judges)` `abstractmethod`

Aggregate repeated metric results into one representative result.

Parameters:

Name	Type	Description	Default
`all_results`	`list[MetricOutput]`	Raw results produced by each judge, including successful outputs and optional error payloads.	required
`total_judges`	`int`	Total number of judges configured for the evaluation run.	required

Returns:

Name	Type	Description
`MetricOutput`	`MetricOutput`	Representative aggregated result containing the selected score, metadata, and supporting judge context.

Raises:

Type	Description
`ValueError`	If the implementation cannot produce a valid aggregate from the input.
`TypeError`	If the implementation rejects the provided input type or strategy.

`aggregate_repeated_results(all_results, total_judges)`

Extract valid repeated-judge results and supporting metadata.

Parameters:

Name	Type	Description	Default
`all_results`	`list[MetricOutput]`	Raw results produced by each judge, including successful outputs and optional error payloads.	required
`total_judges`	`int`	Total number of judges configured for the evaluation run.	required

Returns:

Name	Type	Description
`RepeatedResults`	`RepeatedResults`	Tuple containing valid results, valid scores, valid original indices, and collected judge error messages.

Raises:

Type	Description
`ValueError`	If no judge results are provided, or if every result is invalid after filtering out missing scores and explicit errors.
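
The filtering step might look roughly like the sketch below. `JudgeOutput` and `Filtered` are hypothetical stand-ins for `MetricOutput` and `RepeatedResults`; the `score` and `error` field names are assumptions, and `total_judges` is left out because its role is not described here.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional


@dataclass
class JudgeOutput:
    """Hypothetical stand-in for MetricOutput; field names are assumed."""
    score: Optional[float] = None
    error: Optional[str] = None


class Filtered(NamedTuple):
    """Hypothetical stand-in for RepeatedResults."""
    results: list
    scores: list
    indices: list
    errors: list


def filter_valid(all_results: list) -> Filtered:
    if not all_results:
        raise ValueError("no judge results provided")
    results, scores, indices, errors = [], [], [], []
    for i, result in enumerate(all_results):
        if result.error is not None:
            errors.append(result.error)  # keep the message, drop the result
            continue
        if result.score is None:
            continue  # a missing score is invalid as well
        results.append(result)
        scores.append(result.score)
        indices.append(i)  # position in the original input
    if not results:
        raise ValueError("every judge result was invalid")
    return Filtered(results, scores, indices, errors)


filtered = filter_valid([
    JudgeOutput(score=0.8),
    JudgeOutput(error="timeout"),
    JudgeOutput(),
    JudgeOutput(score=0.4),
])
```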

`MajorityVoteAggregationStrategy`

Bases: BaseJudgeAggregator

Aggregate repeated judge results using majority vote.

`strategy` `property`

Return the aggregation identifier for majority vote.

Returns:

Name	Type	Description
`AggregationMethod`	`AggregationMethod`	`AggregationMethod.MAJORITY_VOTE`.

`aggregate(all_results, total_judges)`

Aggregate judge results by selecting the most frequent numeric score.

Ties are resolved by delegating to the median strategy so the result still maps to a real judge output.

Parameters:

Name	Type	Description	Default
`all_results`	`list[MetricOutput]`	Raw results produced by each judge, including successful outputs and optional error payloads.	required
`total_judges`	`int`	Total number of judges configured for the evaluation run.	required

Returns:

Name	Type	Description
`MetricOutput`	`MetricOutput`	Representative result annotated with majority-vote metadata.

Raises:

Type	Description
`ValueError`	If no valid judge results exist, or if any valid score is non-numeric.
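
A minimal sketch of the vote-and-fallback logic, operating on plain scores. It assumes the tie fallback takes the upper median over all valid scores (not only the tied candidates); the function name `majority_score` is hypothetical.

```python
from collections import Counter
from statistics import median_high


def majority_score(scores: list[float]) -> float:
    """Return the most frequent score, falling back to the upper median on ties."""
    if not scores:
        raise ValueError("no valid judge scores")
    counts = Counter(scores)
    top = max(counts.values())
    winners = [score for score, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    # Assumed fallback: upper median of every score, which is always
    # a value some judge actually produced.
    return median_high(scores)


clear_winner = majority_score([3, 3, 2, 1])
tie_fallback = majority_score([1, 1, 3, 3])
```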

`MedianAggregationStrategy`

Bases: BaseJudgeAggregator

Aggregate repeated judge results using the observed median.

`strategy` `property`

Return the aggregation identifier for median selection.

Returns:

Name	Type	Description
`AggregationMethod`	`AggregationMethod`	`AggregationMethod.MEDIAN`.

`aggregate(all_results, total_judges)`

Aggregate judge results by selecting the observed median score.

For even-sized inputs, this strategy chooses the upper median instead of the arithmetic mean so the representative output still corresponds to an actual judge result.

Parameters:

Name	Type	Description	Default
`all_results`	`list[MetricOutput]`	Raw results produced by each judge, including successful outputs and optional error payloads.	required
`total_judges`	`int`	Total number of judges configured for the evaluation run.	required

Returns:

Name	Type	Description
`MetricOutput`	`MetricOutput`	Representative result annotated with median-selection metadata.

Raises:

Type	Description
`ValueError`	If no valid judge results exist, or if any valid score is non-numeric.
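
To illustrate why the upper median maps back to a real judge, the sketch below returns the index of the selected score instead of a computed value. The helper name `upper_median_index` is hypothetical.

```python
def upper_median_index(scores: list[float]) -> int:
    """Return the input index of the upper-median score."""
    if not scores:
        raise ValueError("no valid judge scores")
    # Sort indices by score; sorted() is stable, so equal scores keep input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # For an even count this picks the higher of the two middle elements.
    return order[len(order) // 2]


scores = [0.9, 0.1, 0.5, 0.7]
chosen = upper_median_index(scores)
```

With four scores the arithmetic median would be 0.6, which no judge produced; the upper median selects the judge that scored 0.7 (index 3).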

`MetricsAggregator`

Aggregator for GEval metrics.

Computes aggregate_success (AND-gate) and aggregate_score (polarity-aware mean). Subclass and override compute_success or compute_score to customize behavior per evaluator.

`aggregate(named_results)`

Aggregate GEval metric results.

Parameters:

Name	Type	Description	Default
`named_results`	`dict[str, MetricResult]`	Dictionary mapping metric names to MetricResult objects.	required

Returns:

Type	Description
`AggregationResult`	Result carrying aggregate_success and aggregate_score. An empty dict yields aggregate_success=False and aggregate_score=0.0.

`compute_score(named_results)`

Polarity-aware mean of metric scores.

Parameters:

Name	Type	Description	Default
`named_results`	`dict[str, MetricResult]`	Dictionary mapping metric names to MetricResult objects.	required

Returns:

Name	Type	Description
`float`	`float`	Mean of quality-adjusted scores, or 0.0 if empty.

`compute_success(named_results)`

AND-gate of all metric success flags.

Parameters:

Name	Type	Description	Default
`named_results`	`dict[str, MetricResult]`	Dictionary mapping metric names to MetricResult objects.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if every metric passed. False if any metric failed or if named_results is empty (no metrics evaluated implies no confidence).
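
The following sketch shows how the AND-gate, the polarity-aware mean, and a subclass override could fit together. It does not use the real classes: `SketchResult` and `SketchAggregator` are hypothetical stand-ins, the `higher_is_better` field is an assumed name, and the inversion rule `1 - score` for lower-is-better metrics is an assumption about what "polarity inversion" means for scores normalized to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Hypothetical stand-in for MetricResult; field names are assumed."""
    score: float  # assumed normalized to [0, 1]
    success: bool
    higher_is_better: bool = True


class SketchAggregator:
    """Stand-in that mirrors the documented method names only."""

    def compute_success(self, named_results: dict) -> bool:
        # AND-gate; an empty dict counts as failure.
        return bool(named_results) and all(
            r.success for r in named_results.values()
        )

    def compute_score(self, named_results: dict) -> float:
        if not named_results:
            return 0.0
        # Assumed inversion: a lower-is-better score s contributes 1 - s.
        adjusted = [
            r.score if r.higher_is_better else 1.0 - r.score
            for r in named_results.values()
        ]
        return sum(adjusted) / len(adjusted)

    def aggregate(self, named_results: dict) -> tuple:
        return (
            self.compute_success(named_results),
            self.compute_score(named_results),
        )


class LenientAggregator(SketchAggregator):
    """Example override: pass when more than half of the metrics pass."""

    def compute_success(self, named_results: dict) -> bool:
        passed = sum(r.success for r in named_results.values())
        return passed * 2 > len(named_results)


results = {
    "relevance": SketchResult(score=0.75, success=True),
    "toxicity": SketchResult(score=0.25, success=False, higher_is_better=False),
    "fluency": SketchResult(score=0.75, success=True),
}
strict = SketchAggregator().aggregate(results)
lenient = LenientAggregator().aggregate(results)
```

Only `compute_success` differs between the two aggregators, so both report the same score while disagreeing on the pass/fail outcome.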

`WeightedMetricsAggregator(weights, score_mapping)`

Bases: MetricsAggregator

MetricsAggregator with weighted scoring.

Overrides compute_score to apply per-metric score mappings then a weighted sum.

IMPORTANT: Uses result.rubric_score (pre-threshold integer), not result.score (normalized float), because score_mapping keys are {1, 2, 3} rubric integers.

Attributes:

Name	Type	Description
`weights`	`dict[str, float]`	Dictionary mapping metric names to their weights.
`score_mapping`	`dict[str, dict[int, float] \| Callable[[float], float]]`	Dictionary mapping metric names to either a dict[int, float] that maps rubric score integers to normalized floats, or a Callable[[float], float] that transforms rubric scores.

Initialize WeightedMetricsAggregator.

Parameters:

Name	Type	Description	Default
`weights`	`dict[str, float]`	Dictionary mapping metric names to their weights.	required
`score_mapping`	`dict[str, dict[int, float] \| Callable[[float], float]]`	Dictionary mapping metric names to score transformations.	required

`compute_score(named_results)`

Compute weighted aggregate score using rubric_score lookups.

Parameters:

Name	Type	Description	Default
`named_results`	`dict[str, MetricResult]`	Dictionary mapping metric names to MetricResult objects.	required

Returns:

Name	Type	Description
`float`	`float`	Weighted aggregate score, or 0.0 if no results or zero total weight.
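
A rough sketch of the weighted computation follows. It assumes the weighted sum is divided by the total weight (suggested by the zero-total-weight case above, but not confirmed), and it uses a hypothetical `RubricResult` stand-in exposing only `rubric_score`.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Hypothetical stand-in for MetricResult with only rubric_score."""
    rubric_score: int


def weighted_score(named_results: dict, weights: dict, score_mapping: dict) -> float:
    weighted_sum = 0.0
    total_weight = 0.0
    for name, result in named_results.items():
        mapping = score_mapping[name]
        # A mapping is either a lookup table or a function of the rubric score.
        if callable(mapping):
            mapped = mapping(result.rubric_score)
        else:
            mapped = mapping[result.rubric_score]
        weighted_sum += weights[name] * mapped
        total_weight += weights[name]
    if total_weight == 0:
        return 0.0  # also covers an empty named_results
    # Assumption: normalize by the total weight.
    return weighted_sum / total_weight


weights = {"accuracy": 3.0, "tone": 1.0}
score_mapping = {
    "accuracy": {1: 0.0, 2: 0.5, 3: 1.0},
    "tone": lambda rubric: (rubric - 1) / 2,
}
named_results = {"accuracy": RubricResult(3), "tone": RubricResult(2)}
score = weighted_score(named_results, weights, score_mapping)
```

In this example the result is (3.0 * 1.0 + 1.0 * 0.5) / 4.0 = 0.875.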

`build_aggregation_strategy(aggregation_method)`

Build an aggregation strategy from enum, string, or strategy input.

Parameters:

Name	Type	Description	Default
`aggregation_method`	`AggregationSelector`	Aggregation strategy expressed as an `AggregationMethod` enum, a compatible string value, or an existing strategy instance.	required

Returns:

Name	Type	Description
`BaseJudgeAggregator`	`BaseJudgeAggregator`	Aggregation strategy instance matching the requested method.

Raises:

Type	Description
`ValueError`	If `aggregation_method` is None or a string that does not map to a supported `AggregationMethod` value.
`TypeError`	If `aggregation_method` has an unsupported type.
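
The dispatch described above could be sketched as follows. `Method` and `Strategy` are hypothetical stand-ins for `AggregationMethod` and `BaseJudgeAggregator`, and the string values of the enum members are assumptions.

```python
from enum import Enum


class Method(Enum):
    """Hypothetical stand-in for AggregationMethod; string values are assumed."""
    AVERAGE = "average"
    MEDIAN = "median"
    MAJORITY_VOTE = "majority_vote"


class Strategy:
    """Hypothetical stand-in for a BaseJudgeAggregator implementation."""

    def __init__(self, method: Method) -> None:
        self.method = method


def build_strategy(selector) -> Strategy:
    if selector is None:
        raise ValueError("aggregation method must not be None")
    if isinstance(selector, Strategy):
        return selector  # existing instances pass through unchanged
    if isinstance(selector, str):
        try:
            selector = Method(selector)
        except ValueError:
            raise ValueError(
                f"unsupported aggregation method: {selector!r}"
            ) from None
    if isinstance(selector, Method):
        return Strategy(selector)
    raise TypeError(f"unsupported selector type: {type(selector).__name__}")


from_string = build_strategy("median")
from_enum = build_strategy(Method.AVERAGE)
existing = Strategy(Method.MAJORITY_VOTE)
passthrough = build_strategy(existing)
```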
