# Metrics aggregator

Metrics aggregation for polarity-aware binary scoring.

This module provides aggregation of GEval metrics using AND-gate success logic, polarity-aware mean scoring, and overridable compute methods for subclasses.
## `AggregationResult(aggregate_success, aggregate_score, possible_issues=list())`

Dataclass. Result of metric aggregation.

Attributes:

| Name | Type | Description |
|---|---|---|
| `aggregate_success` | `bool` | AND-gate of all metric success values. Empty dict returns `False`. |
| `aggregate_score` | `float` | Mean quality score with polarity inversion. Empty dict returns `0.0`. |
| `possible_issues` | `list[str]` | List of issue strings. Empty dict returns `[Issue.EVAL_ISSUE]`. |
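Under the definitions above, the dataclass can be sketched as follows; the `Issue` value shown is a hypothetical stand-in for the project's real enum:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the project's Issue labels; the real
# definitions live elsewhere in the codebase.
class Issue:
    EVAL_ISSUE = "eval_issue"

@dataclass
class AggregationResult:
    """Result of metric aggregation."""
    aggregate_success: bool   # AND-gate of all metric success values
    aggregate_score: float    # polarity-aware mean quality score
    possible_issues: list[str] = field(default_factory=list)
```

`field(default_factory=list)` gives each instance its own empty list, which is what the `possible_issues=list()` default in the signature above denotes.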
## `MetricMappingMetricsAggregator(retrieval_metrics, generation_metrics)`

Bases: `MetricsAggregator`

`MetricsAggregator` that maps failed metrics to `Issue` labels. Used by `GEvalGenerationEvaluator`.

Overrides `compute_issues` to emit `Issue.RETRIEVAL_ISSUE` or `Issue.GENERATION_ISSUE` based on which metric category failed.

Attributes:

| Name | Type | Description |
|---|---|---|
| `retrieval_metrics` | `frozenset[str]` | Frozenset of metric names considered retrieval metrics. |
| `generation_metrics` | `frozenset[str]` | Frozenset of metric names considered generation metrics. |
Initialize `MetricMappingMetricsAggregator`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `retrieval_metrics` | `frozenset[str]` | Frozenset of retrieval metric names. | required |
| `generation_metrics` | `frozenset[str]` | Frozenset of generation metric names. | required |
### `compute_issues(named_results)`

Compute issues based on metric success flags.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `named_results` | `dict[str, GEvalMetricResult]` | Dictionary mapping metric names to `GEvalMetricResult` objects. | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of `Issue` enum values for failed metric categories. |
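A minimal sketch of the category mapping described above, assuming `GEvalMetricResult` exposes a `success` flag; the stand-in types here are illustrative, not the project's actual definitions:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for types referenced in this module.
@dataclass
class GEvalMetricResult:
    success: bool
    score: float = 0.0

class Issue:
    RETRIEVAL_ISSUE = "retrieval_issue"
    GENERATION_ISSUE = "generation_issue"

def compute_issues(
    named_results: dict[str, GEvalMetricResult],
    retrieval_metrics: frozenset[str],
    generation_metrics: frozenset[str],
) -> list[str]:
    """Emit one issue label per failed metric category."""
    issues = []
    failed = {name for name, r in named_results.items() if not r.success}
    if failed & retrieval_metrics:
        issues.append(Issue.RETRIEVAL_ISSUE)
    if failed & generation_metrics:
        issues.append(Issue.GENERATION_ISSUE)
    return issues
```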
## `MetricsAggregator`

Aggregator for GEval metrics.

Computes `aggregate_success` (AND-gate), `aggregate_score` (polarity-aware mean), and `possible_issues` (empty by default). Subclass and override any of the three compute methods to customize behavior per evaluator.
### `aggregate(named_results)`

Aggregate GEval metric results.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `named_results` | `dict[str, GEvalMetricResult]` | Dictionary mapping metric names to `GEvalMetricResult` objects. | required |

Returns:

| Type | Description |
|---|---|
| `AggregationResult` | `AggregationResult` with `aggregate_success`, `aggregate_score`, and `possible_issues`. |
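One way the base class might wire `aggregate` to its three overridable hooks; the types are stand-ins, and polarity inversion is elided from `compute_score` for brevity:

```python
from dataclasses import dataclass, field

@dataclass
class GEvalMetricResult:  # hypothetical stand-in
    success: bool
    score: float

@dataclass
class AggregationResult:  # mirrors the dataclass documented above
    aggregate_success: bool
    aggregate_score: float
    possible_issues: list = field(default_factory=list)

class MetricsAggregator:
    """Base aggregator; subclasses override any compute method."""

    def aggregate(self, named_results):
        # Each field of the result comes from one overridable hook.
        return AggregationResult(
            aggregate_success=self.compute_success(named_results),
            aggregate_score=self.compute_score(named_results),
            possible_issues=self.compute_issues(named_results),
        )

    def compute_success(self, named_results):
        return bool(named_results) and all(r.success for r in named_results.values())

    def compute_score(self, named_results):
        if not named_results:
            return 0.0
        return sum(r.score for r in named_results.values()) / len(named_results)

    def compute_issues(self, named_results):
        return []  # empty by default; subclasses override
```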
### `compute_issues(named_results)`

Return list of possible issues. Empty by default; override in subclasses.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `named_results` | `dict[str, GEvalMetricResult]` | Dictionary mapping metric names to `GEvalMetricResult` objects. | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | Empty list. |
### `compute_score(named_results)`

Polarity-aware mean of metric scores.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `named_results` | `dict[str, GEvalMetricResult]` | Dictionary mapping metric names to `GEvalMetricResult` objects. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `float` | `float` | Mean of quality-adjusted scores, or `0.0` if empty. |
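The polarity-aware mean might look like the sketch below; the `higher_is_better` field is an assumed stand-in for however `GEvalMetricResult` actually encodes polarity:

```python
from dataclasses import dataclass

# Hypothetical stand-in: the real GEvalMetricResult presumably carries a
# normalized score and a polarity flag; the field names are assumptions.
@dataclass
class GEvalMetricResult:
    score: float                  # normalized to [0, 1]
    higher_is_better: bool = True

def compute_score(named_results: dict[str, GEvalMetricResult]) -> float:
    """Polarity-aware mean: invert scores whose polarity is 'lower is better'."""
    if not named_results:
        return 0.0
    adjusted = [
        r.score if r.higher_is_better else 1.0 - r.score
        for r in named_results.values()
    ]
    return sum(adjusted) / len(adjusted)
```

Inverting negative-polarity scores makes every term point in the "higher is better" direction before averaging, which is what "quality-adjusted" refers to above.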
### `compute_success(named_results)`

AND-gate of all metric success flags.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `named_results` | `dict[str, GEvalMetricResult]` | Dictionary mapping metric names to `GEvalMetricResult` objects. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | `True` if every metric passed, `False` otherwise. |
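A sketch of the AND-gate, with the empty-dict case handled explicitly since Python's `all()` returns `True` on an empty iterable (the `GEvalMetricResult` stand-in is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class GEvalMetricResult:  # hypothetical stand-in
    success: bool

def compute_success(named_results: dict[str, GEvalMetricResult]) -> bool:
    # Guard the empty case explicitly: this module documents it as False,
    # whereas all() over an empty iterable would yield True.
    return bool(named_results) and all(r.success for r in named_results.values())
```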
## `WeightedMetricsAggregator(weights, score_mapping)`

Bases: `MetricsAggregator`

`MetricsAggregator` with weighted scoring. Used by `QTEvaluator`.

Overrides `compute_score` to apply per-metric score mappings, then a weighted sum.

IMPORTANT: Uses `result.rubric_score` (the pre-threshold integer), not `result.score` (the normalized float), because `score_mapping` keys are `{1, 2, 3}` rubric integers.

Attributes:

| Name | Type | Description |
|---|---|---|
| `weights` | `dict[str, float]` | Dictionary mapping metric names to their weights. |
| `score_mapping` | `dict[str, dict[int, float] \| Callable[[float], float]]` | Dictionary mapping metric names to either a `dict[int, float]` (maps rubric score integers to normalized floats) or a `Callable[[float], float]` (function to transform rubric scores). |
Initialize `WeightedMetricsAggregator`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `weights` | `dict[str, float]` | Dictionary mapping metric names to their weights. | required |
| `score_mapping` | `dict[str, dict[int, float] \| Callable[[float], float]]` | Dictionary mapping metric names to score transformations. | required |
### `compute_score(named_results)`

Compute weighted aggregate score using `rubric_score` lookups.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `named_results` | `dict[str, GEvalMetricResult]` | Dictionary mapping metric names to `GEvalMetricResult` objects. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `float` | `float` | Weighted aggregate score, or `0.0` if no results or zero total weight. |
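Putting the pieces together, the weighted scoring might be sketched as follows. Normalizing by total weight is an assumption inferred from the documented zero-total-weight behavior, and the `rubric_score` stand-in field mirrors the IMPORTANT note above:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in: rubric_score is the pre-threshold integer this
# aggregator is documented to read.
@dataclass
class GEvalMetricResult:
    rubric_score: int

def compute_weighted_score(
    named_results: dict[str, GEvalMetricResult],
    weights: dict[str, float],
    score_mapping: dict[str, dict[int, float] | Callable[[float], float]],
) -> float:
    """Map each rubric score, then take the weight-normalized sum."""
    if not named_results:
        return 0.0
    total_weight = sum(weights.get(name, 0.0) for name in named_results)
    if total_weight == 0.0:
        return 0.0
    weighted = 0.0
    for name, result in named_results.items():
        mapping = score_mapping.get(name)
        if callable(mapping):
            mapped = mapping(result.rubric_score)   # functional transform
        elif mapping is not None:
            mapped = mapping[result.rubric_score]   # dict[int, float] lookup
        else:
            mapped = float(result.rubric_score)     # no mapping: pass through
        weighted += weights.get(name, 0.0) * mapped
    return weighted / total_weight
```

The `callable()` check is how the two documented mapping forms (`dict[int, float]` vs. `Callable[[float], float]`) can be dispatched at runtime.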