
Judge

Judge module for multiple LLM evaluation and ensemble calculation.

This module provides utilities for orchestrating multiple LLM judges and calculating ensemble statistics from their evaluation results. Combining several LLM-based evaluations through ensemble methods improves reliability and robustness.

Key components:

- EnsembleCalculator: Calculates ensemble statistics from multiple judge results
- EnsembleMethod: Defines the available ensemble calculation strategies

EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN, weights=None)

Calculates ensemble statistics from multiple LLM judge results.

This class handles the aggregation of evaluation results from multiple LLM judges using various ensemble methods and provides statistical measures of judge agreement.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| ensemble_method | EnsembleMethod | The method to use for aggregating scores. |
| weights | Optional[List[float]] | Optional weights for weighted ensemble methods. |

Initialize the EnsembleCalculator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ensemble_method | EnsembleMethod | The ensemble method to use. | MEDIAN |
| weights | Optional[List[float]] | Weights for each judge. If None, equal weights (1.0 per judge) are used. | None |
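
A minimal construction sketch. The import path is an assumption (this page does not show the module path), so adjust it to match your installation:

```python
# Assumed import path; substitute the actual module location.
from judge import EnsembleCalculator, EnsembleMethod

# Default configuration: median ensemble with equal weights.
calculator = EnsembleCalculator()

# Weighted configuration: one weight per judge, in judge order.
weighted_calculator = EnsembleCalculator(
    ensemble_method=EnsembleMethod.AVERAGE_ROUNDED,
    weights=[2.0, 1.0, 1.0],
)
```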

calculate_ensemble_result(judge_results)

Calculate ensemble result from multiple judge evaluations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| judge_results | List[Dict[str, Any]] | List of evaluation results from each judge. Each result should contain either a 'relevancy_rating' key (GEvalGenerationEvaluator) with a categorical value, or a 'score' key (QTEvaluator) with a numeric value. | required |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Ensemble result containing aggregated scores and statistics. |
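
A usage sketch under the input shapes described above; the concrete score values and the rating labels are illustrative placeholders:

```python
# Numeric results, as produced by a QTEvaluator-style judge.
numeric_results = [
    {"score": 0.8},
    {"score": 0.6},
    {"score": 0.9},
]
ensemble = calculator.calculate_ensemble_result(numeric_results)

# Categorical results, as produced by a GEvalGenerationEvaluator-style
# judge; the rating strings here are placeholders.
categorical_results = [
    {"relevancy_rating": "relevant"},
    {"relevancy_rating": "relevant"},
    {"relevancy_rating": "irrelevant"},
]
ensemble = calculator.calculate_ensemble_result(categorical_results)
```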

calculate_metric_ensemble(metric_name, judge_results)

Calculate ensemble result for a single metric across multiple judges.

This method aggregates scores for a single metric evaluated by multiple judges. Unlike calculate_ensemble_result, which aggregates final evaluator ratings, it aggregates individual per-metric scores.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| metric_name | str | Name of the metric being aggregated. | required |
| judge_results | List[Dict[str, Any]] | List of results from each judge for this metric. Each result should contain a 'score' key with a numeric value. | required |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Ensemble result with the keys listed below. |

- score: Aggregated score across judges
- success: Whether the metric passed its threshold
- explanation: Combined explanations from the judges
- agreement_score: Statistical agreement measure (0.0-1.0; higher means stronger agreement)
- individual_judge_results: Original results from each judge
- ensemble_method: Method used for aggregation
- num_judges: Number of judges used
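
A sketch of aggregating one metric across three judges. Only the 'score' key is documented as required; the per-judge 'explanation' field is an assumption about the input shape, inferred from the combined-explanations output:

```python
judge_results = [
    {"score": 0.7, "explanation": "Mostly grounded in the context."},
    {"score": 0.9, "explanation": "Fully grounded."},
    {"score": 0.8, "explanation": "One minor unsupported detail."},
]

result = calculator.calculate_metric_ensemble("groundedness", judge_results)

result["score"]            # aggregated score across the three judges
result["agreement_score"]  # 0.0-1.0; higher means stronger agreement
result["num_judges"]       # 3
```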

EnsembleMethod

Bases: StrEnum

Enumeration of ensemble methods for aggregating judge results.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| MEDIAN | str | Use the median of all judge scores. |
| AVERAGE_ROUNDED | str | Use the rounded average of all judge scores. |
| MAJORITY_VOTE | str | Use a majority vote of pass/fail decisions; the score is the average of all valid scores. |
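
To build intuition for how the three strategies can disagree on the same inputs, here is a plain-Python illustration (not the library's internal code); the 1-5 score scale and the pass threshold of 3 are assumptions:

```python
from statistics import mean, median

scores = [1, 4, 5]                 # one score per judge
passed = [s >= 3 for s in scores]  # per-judge pass/fail decisions

median(scores)       # MEDIAN -> 4
round(mean(scores))  # AVERAGE_ROUNDED -> 3

# MAJORITY_VOTE: the pass/fail decision follows the majority,
# while the score is the average of all valid scores.
sum(passed) > len(passed) / 2  # True (2 of 3 judges pass)
mean(scores)                   # 3.33...
```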