Aggregation
Aggregation module for GEval metrics.
This module provides metric aggregation utilities for polarity-aware pass/fail scoring. Extend MetricsAggregator and override compute_score or compute_success to customize behavior per evaluator.
AggregationResult(aggregate_success, aggregate_score)
dataclass
Result of metric aggregation.
Attributes:
| Name | Type | Description |
|---|---|---|
| aggregate_success | bool | True if all metrics passed (AND-gate), False otherwise. |
| aggregate_score | float | Mean quality score with polarity inversion applied. |
AverageAggregationStrategy
Bases: BaseJudgeAggregator
Aggregate repeated judge results using arithmetic mean.
strategy
property
Return the aggregation identifier for arithmetic averaging.
Returns:
| Name | Type | Description |
|---|---|---|
| AggregationMethod | AggregationMethod | Enum value that identifies the arithmetic-averaging strategy. |
aggregate(all_results, total_judges)
Aggregate judge results by computing the arithmetic mean score.
The representative result is chosen as the valid judge output whose numeric score is closest to the computed average, using input order as the tie-breaker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_results | list[MetricOutput] | Raw results produced by each judge, including successful outputs and optional error payloads. | required |
| total_judges | int | Total number of judges configured for the evaluation run. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MetricOutput | MetricOutput | Representative result annotated with average-based metadata. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no valid judge results exist, or if any valid score is non-numeric. |
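The selection rule described above (mean score, then the judge output closest to it, with input order breaking ties) can be sketched as follows. This is a minimal illustration, not the library's implementation: `JudgeOutput` is a hypothetical stand-in for MetricOutput, and `pick_closest_to_mean` is an invented helper name.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class JudgeOutput:
    """Hypothetical stand-in for MetricOutput; only the score matters here."""
    score: float
    label: str = ""


def pick_closest_to_mean(valid: list[JudgeOutput]) -> tuple[float, JudgeOutput]:
    """Return the mean score and the judge output nearest to that mean."""
    if not valid:
        raise ValueError("no valid judge results to aggregate")
    mean = fmean(r.score for r in valid)
    # min() returns the first minimal element, so earlier inputs win ties.
    representative = min(valid, key=lambda r: abs(r.score - mean))
    return mean, representative


mean, chosen = pick_closest_to_mean(
    [JudgeOutput(0.25, "a"), JudgeOutput(1.0, "b"), JudgeOutput(0.5, "c")]
)
```

Here the mean falls between the second and third scores, and the third judge's output is selected because its score is nearest to the mean.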
BaseJudgeAggregator
Bases: ABC
Abstract strategy for repeated-judge result aggregation.
strategy
abstractmethod
property
Return the canonical aggregation strategy identifier.
Returns:
| Name | Type | Description |
|---|---|---|
| AggregationMethod | AggregationMethod | Enum value that identifies the aggregation strategy. |
aggregate(all_results, total_judges)
abstractmethod
Aggregate repeated metric results into one representative result.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_results | list[MetricOutput] | Raw results produced by each judge, including successful outputs and optional error payloads. | required |
| total_judges | int | Total number of judges configured for the evaluation run. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MetricOutput | MetricOutput | Representative aggregated result containing the selected score, metadata, and supporting judge context. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the implementation cannot produce a valid aggregate from the input. |
| TypeError | If the implementation rejects the provided input type or strategy. |
aggregate_repeated_results(all_results, total_judges)
Extract valid repeated-judge results and supporting metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_results | list[MetricOutput] | Raw results produced by each judge, including successful outputs and optional error payloads. | required |
| total_judges | int | Total number of judges configured for the evaluation run. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| RepeatedResults | RepeatedResults | Tuple containing valid results, valid scores, valid original indices, and collected judge error messages. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no judge results are provided, or if every result is invalid after filtering out missing scores and explicit errors. |
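The filtering step documented above can be sketched as below. The field names `score` and `error`, the `JudgeOutput` class, and the `Filtered` tuple are assumptions made for illustration; the real MetricOutput and RepeatedResults types may be shaped differently.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional


@dataclass
class JudgeOutput:
    """Hypothetical stand-in for MetricOutput."""
    score: Optional[float] = None
    error: Optional[str] = None


class Filtered(NamedTuple):
    """Hypothetical stand-in for RepeatedResults."""
    valid_results: list
    valid_scores: list
    valid_indices: list
    errors: list


def filter_valid(all_results: list[JudgeOutput]) -> Filtered:
    """Split raw judge outputs into usable results and error messages."""
    if not all_results:
        raise ValueError("no judge results provided")
    results, scores, indices, errors = [], [], [], []
    for index, result in enumerate(all_results):
        if result.error is not None:
            errors.append(result.error)   # explicit judge failure
        elif result.score is not None:
            results.append(result)
            scores.append(result.score)
            indices.append(index)         # position in the original input
        # results with neither a score nor an error are dropped silently
    if not results:
        raise ValueError("every judge result was invalid")
    return Filtered(results, scores, indices, errors)


filtered = filter_valid(
    [JudgeOutput(score=0.5), JudgeOutput(error="timeout"), JudgeOutput(score=1.0)]
)
```

Keeping the original indices alongside the scores lets a later strategy map its chosen score back to the judge that produced it.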
MajorityVoteAggregationStrategy
Bases: BaseJudgeAggregator
Aggregate repeated judge results using majority vote.
strategy
property
Return the aggregation identifier for majority vote.
Returns:
| Name | Type | Description |
|---|---|---|
| AggregationMethod | AggregationMethod | Enum value that identifies the majority-vote strategy. |
aggregate(all_results, total_judges)
Aggregate judge results by selecting the most frequent numeric score.
Ties are resolved by delegating to the median strategy so the result still maps to a real judge output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_results | list[MetricOutput] | Raw results produced by each judge, including successful outputs and optional error payloads. | required |
| total_judges | int | Total number of judges configured for the evaluation run. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MetricOutput | MetricOutput | Representative result annotated with majority-vote metadata. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no valid judge results exist, or if any valid score is non-numeric. |
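A rough sketch of majority voting with a median fallback is shown below. It works on bare scores rather than MetricOutput objects, and it assumes the fallback takes the upper median over all valid scores; the source only states that ties are delegated to the median strategy, so that detail is a guess. The function names are invented for the example.

```python
from collections import Counter


def upper_median(scores: list[float]) -> float:
    """Middle element of the sorted scores, taking the upper one when even."""
    ordered = sorted(scores)
    return ordered[len(ordered) // 2]


def majority_score(scores: list[float]) -> float:
    """Most frequent score, falling back to the upper median on a tie."""
    if not scores:
        raise ValueError("no valid scores to aggregate")
    counts = Counter(scores)
    top = max(counts.values())
    winners = [score for score, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    # Assumed tie-break: median over all valid scores, not just the tied ones.
    return upper_median(scores)


clear_winner = majority_score([1, 2, 2, 3])
tie_broken = majority_score([1, 1, 3, 3])
```

Either path returns a score that some judge actually produced, which is what lets the strategy return a real judge output as the representative.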
MedianAggregationStrategy
Bases: BaseJudgeAggregator
Aggregate repeated judge results using the observed median.
strategy
property
Return the aggregation identifier for median selection.
Returns:
| Name | Type | Description |
|---|---|---|
| AggregationMethod | AggregationMethod | Enum value that identifies the median-selection strategy. |
aggregate(all_results, total_judges)
Aggregate judge results by selecting the observed median score.
For even-sized inputs, this strategy chooses the upper median instead of the arithmetic mean so the representative output still corresponds to an actual judge result.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_results | list[MetricOutput] | Raw results produced by each judge, including successful outputs and optional error payloads. | required |
| total_judges | int | Total number of judges configured for the evaluation run. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| MetricOutput | MetricOutput | Representative result annotated with median-selection metadata. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no valid judge results exist, or if any valid score is non-numeric. |
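To illustrate the upper-median choice, the sketch below returns the index of the judge whose score is the upper median, so the caller can look up that judge's full output. The helper name is hypothetical, and how the library orders judges with equal scores is not stated in the source; here a stable sort keeps them in input order.

```python
def upper_median_index(scores: list[float]) -> int:
    """Index (into the input list) of the upper-median score."""
    if not scores:
        raise ValueError("no valid scores to aggregate")
    # sorted() is stable, so equal scores keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[len(scores) // 2]


scores = [0.9, 0.1, 0.5, 0.3]
index = upper_median_index(scores)
selected = scores[index]
```

With four scores the arithmetic median would be 0.4, a value no judge reported; the upper median instead selects 0.5, which belongs to the third judge.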
MetricsAggregator
Aggregator for GEval metrics.
Computes aggregate_success (AND-gate) and aggregate_score (polarity-aware mean). Subclass and override compute_success or compute_score to customize behavior per evaluator.
aggregate(named_results)
Aggregate GEval metric results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| named_results | dict[str, MetricResult] | Dictionary mapping metric names to MetricResult objects. | required |
Returns:
| Type | Description |
|---|---|
| AggregationResult | AggregationResult with aggregate_success and aggregate_score. An empty dict returns aggregate_success=False and aggregate_score=0.0. |
compute_score(named_results)
Polarity-aware mean of metric scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| named_results | dict[str, MetricResult] | Dictionary mapping metric names to MetricResult objects. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| float | float | Mean of quality-adjusted scores, or 0.0 if empty. |
compute_success(named_results)
AND-gate of all metric success flags.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| named_results | dict[str, MetricResult] | Dictionary mapping metric names to MetricResult objects. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if every metric passed. False if any metric failed or if named_results is empty (no metrics evaluated implies no confidence). |
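The AND-gate, the polarity-aware mean, and the subclass-and-override pattern can be sketched together as follows. Everything here is a stand-in: `SketchResult` is not the real MetricResult, the `higher_is_better` flag is an assumed way of encoding polarity, and inverting a lower-is-better score `s` as `1 - s` assumes scores are normalized to [0, 1]. The source does not spell out how inversion is computed.

```python
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Hypothetical stand-in for MetricResult."""
    score: float                   # assumed to be normalized to [0, 1]
    success: bool
    higher_is_better: bool = True  # assumed polarity flag


class SketchAggregator:
    """Illustrative only; not the library's MetricsAggregator."""

    def compute_success(self, named_results: dict[str, SketchResult]) -> bool:
        # AND-gate; an empty dict counts as a failure.
        return bool(named_results) and all(
            r.success for r in named_results.values()
        )

    def compute_score(self, named_results: dict[str, SketchResult]) -> float:
        if not named_results:
            return 0.0
        # Assumed inversion: a lower-is-better score s contributes 1 - s.
        adjusted = [
            r.score if r.higher_is_better else 1.0 - r.score
            for r in named_results.values()
        ]
        return sum(adjusted) / len(adjusted)

    def aggregate(self, named_results: dict[str, SketchResult]) -> tuple[bool, float]:
        return self.compute_success(named_results), self.compute_score(named_results)


class AnyPassAggregator(SketchAggregator):
    """Example override: pass when at least one metric passes."""

    def compute_success(self, named_results: dict[str, SketchResult]) -> bool:
        return any(r.success for r in named_results.values())


results = {
    "relevance": SketchResult(score=0.75, success=True),
    "toxicity": SketchResult(score=0.25, success=False, higher_is_better=False),
}
strict = SketchAggregator().aggregate(results)
lenient = AnyPassAggregator().aggregate(results)
```

Both aggregators report the same score because only `compute_success` is overridden; the strict variant fails on the one failing metric while the lenient variant passes.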
WeightedMetricsAggregator(weights, score_mapping)
Bases: MetricsAggregator
MetricsAggregator with weighted scoring.
Overrides compute_score to apply per-metric score mappings then a weighted sum.
IMPORTANT: Uses result.rubric_score (pre-threshold integer), not result.score (normalized float), because score_mapping keys are {1, 2, 3} rubric integers.
Attributes:
| Name | Type | Description |
|---|---|---|
| weights | dict[str, float] | Dictionary mapping metric names to their weights. |
| score_mapping | dict[str, dict[int, float] \| Callable[[float], float]] | Dictionary mapping metric names to either a dict[int, float] that maps rubric score integers to normalized floats, or a Callable[[float], float] that transforms rubric scores. |
Initialize WeightedMetricsAggregator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| weights | dict[str, float] | Dictionary mapping metric names to their weights. | required |
| score_mapping | dict[str, dict[int, float] \| Callable[[float], float]] | Dictionary mapping metric names to score transformations. | required |
compute_score(named_results)
Compute weighted aggregate score using rubric_score lookups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| named_results | dict[str, MetricResult] | Dictionary mapping metric names to MetricResult objects. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| float | float | Weighted aggregate score, or 0.0 if no results or zero total weight. |
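One plausible reading of the weighted computation is sketched below: each metric's rubric_score is passed through its mapping (a lookup table or a callable), multiplied by its weight, and the sum is divided by the total weight. Dividing by the total weight and skipping metrics that lack a weight or mapping are both assumptions; the source only says the result is 0.0 for no results or zero total weight. `SketchResult` and `weighted_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Union

ScoreMapping = Union[dict[int, float], Callable[[float], float]]


@dataclass
class SketchResult:
    """Hypothetical stand-in for MetricResult; only rubric_score is used."""
    rubric_score: int


def weighted_score(
    named_results: dict[str, SketchResult],
    weights: dict[str, float],
    score_mapping: dict[str, ScoreMapping],
) -> float:
    weighted_sum = 0.0
    total_weight = 0.0
    for name, result in named_results.items():
        if name not in weights or name not in score_mapping:
            continue  # assumed behavior: unconfigured metrics are ignored
        mapping = score_mapping[name]
        if callable(mapping):
            value = mapping(result.rubric_score)
        else:
            value = mapping[result.rubric_score]  # keys are rubric integers
        weighted_sum += weights[name] * value
        total_weight += weights[name]
    if total_weight == 0:
        return 0.0
    return weighted_sum / total_weight  # assumed normalization


score = weighted_score(
    {"accuracy": SketchResult(3), "tone": SketchResult(2)},
    weights={"accuracy": 3.0, "tone": 1.0},
    score_mapping={
        "accuracy": {1: 0.0, 2: 0.5, 3: 1.0},
        "tone": lambda s: (s - 1) / 2,
    },
)
```

The lookup uses the integer rubric_score because the mapping keys are rubric integers such as 1, 2, and 3; a normalized float score would not match any key.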
build_aggregation_strategy(aggregation_method)
Build an aggregation strategy from enum, string, or strategy input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| aggregation_method | AggregationSelector | Aggregation strategy expressed as an enum value, a string, or a strategy instance. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseJudgeAggregator | BaseJudgeAggregator | Aggregation strategy instance matching the requested method. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| TypeError | If |
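A factory that accepts an enum member, a string, or an existing strategy instance might look like the sketch below. The enum member names and string values, the class names, and the exact conditions that raise ValueError and TypeError are all assumptions for illustration; the source does not list them.

```python
from enum import Enum


class Method(Enum):
    """Hypothetical stand-in for AggregationMethod; values are assumed."""
    AVERAGE = "average"
    MEDIAN = "median"
    MAJORITY_VOTE = "majority_vote"


class SketchStrategy:
    """Hypothetical stand-in for BaseJudgeAggregator."""
    method: Method


class AverageSketch(SketchStrategy):
    method = Method.AVERAGE


class MedianSketch(SketchStrategy):
    method = Method.MEDIAN


class MajoritySketch(SketchStrategy):
    method = Method.MAJORITY_VOTE


_REGISTRY = {
    Method.AVERAGE: AverageSketch,
    Method.MEDIAN: MedianSketch,
    Method.MAJORITY_VOTE: MajoritySketch,
}


def build_strategy(selector) -> SketchStrategy:
    """Resolve an enum member, string, or strategy instance to a strategy."""
    if isinstance(selector, SketchStrategy):
        return selector  # already a strategy: pass it through unchanged
    if isinstance(selector, str):
        try:
            selector = Method(selector.strip().lower())
        except ValueError:
            # Assumed condition: unknown method names raise ValueError.
            raise ValueError(f"unknown aggregation method: {selector!r}") from None
    if isinstance(selector, Method):
        return _REGISTRY[selector]()
    # Assumed condition: any other input type raises TypeError.
    raise TypeError(f"unsupported selector type: {type(selector).__name__}")


from_string = build_strategy("median")
from_enum = build_strategy(Method.AVERAGE)
```

Accepting all three forms lets configuration files use plain strings while code can pass enum members or preconfigured instances.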