Judge
Judge module for multi-LLM evaluation and ensemble calculation.
This module provides utilities for orchestrating multiple LLM judges and calculating ensemble statistics from their evaluation results. It enables combining multiple LLM-based evaluations to improve reliability and robustness through ensemble methods.
Key components:

- EnsembleCalculator: Calculates ensemble statistics from multiple judge results
- EnsembleMethod: Defines different ensemble calculation strategies
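A minimal usage sketch, assuming the module is importable as `judge` and that each judge produces a QTEvaluator-style result dict with a numeric `score`:

```python
# Import path is an assumption; adjust to where this module lives in your project.
from judge import EnsembleCalculator, EnsembleMethod

# Results from three independent LLM judges (QTEvaluator-style, numeric 'score' key).
judge_results = [
    {"score": 0.8, "explanation": "Mostly relevant answer."},
    {"score": 0.6, "explanation": "Partially relevant answer."},
    {"score": 0.9, "explanation": "Highly relevant answer."},
]

calculator = EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN)
ensemble = calculator.calculate_ensemble_result(judge_results)
print(ensemble)  # aggregated score plus agreement statistics
```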
EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN, weights=None)
Calculates ensemble statistics from multiple LLM judge results.
This class handles the aggregation of evaluation results from multiple LLM judges using various ensemble methods and provides statistical measures of judge agreement.
Attributes:
| Name | Type | Description |
|---|---|---|
| ensemble_method | EnsembleMethod | The method to use for aggregating scores. |
| weights | Optional[List[float]] | Optional weights for weighted ensemble methods. |
Initialize the EnsembleCalculator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ensemble_method | EnsembleMethod | The ensemble method to use. | EnsembleMethod.MEDIAN |
| weights | Optional[List[float]] | Weights for each judge. If None, defaults to equal weights (1.0 for each judge). | None |
calculate_ensemble_result(judge_results)
Calculate ensemble result from multiple judge evaluations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| judge_results | List[Dict[str, Any]] | List of evaluation results from each judge. Each result should contain either a 'relevancy_rating' key (GEvalGenerationEvaluator) with a categorical value, or a 'score' key (QTEvaluator) with a numeric value. | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Ensemble result containing aggregated scores and statistics. |
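A sketch of the two accepted input shapes; the categorical labels below are illustrative placeholders, not necessarily the evaluator's actual rating vocabulary:

```python
from judge import EnsembleCalculator, EnsembleMethod  # hypothetical import path

calculator = EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN)

# QTEvaluator-style input: numeric 'score' values.
numeric_results = [{"score": 0.7}, {"score": 0.9}, {"score": 0.8}]
numeric_ensemble = calculator.calculate_ensemble_result(numeric_results)

# GEvalGenerationEvaluator-style input: categorical 'relevancy_rating' values.
# The label strings here stand in for the evaluator's actual categories.
categorical_results = [
    {"relevancy_rating": "relevant"},
    {"relevancy_rating": "relevant"},
    {"relevancy_rating": "irrelevant"},
]
categorical_ensemble = calculator.calculate_ensemble_result(categorical_results)
```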
calculate_metric_ensemble(metric_name, judge_results)
Calculate ensemble result for a single metric across multiple judges.
This method aggregates scores for a single metric evaluated by multiple judges. Unlike calculate_ensemble_result, which aggregates final evaluator ratings, it aggregates individual metric scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric_name | str | Name of the metric being aggregated. | required |
| judge_results | List[Dict[str, Any]] | List of results from each judge for this metric. Each result should contain a 'score' key with a numeric value. | required |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Ensemble result containing: score (aggregated score across judges), success (whether the metric passed the threshold), explanation (combined explanations from the judges), agreement_score (statistical agreement measure, 0.0-1.0, higher means better agreement), individual_judge_results (original results from each judge), ensemble_method (method used for aggregation), and num_judges (number of judges used). |
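A usage sketch for a single metric; the metric name and the per-judge explanations below are illustrative assumptions:

```python
from judge import EnsembleCalculator, EnsembleMethod  # hypothetical import path

calculator = EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN)

# Per-judge results for one metric; each entry carries a numeric 'score'.
# 'faithfulness' is an illustrative metric name.
faithfulness_results = [
    {"score": 0.9, "explanation": "No unsupported claims found."},
    {"score": 0.7, "explanation": "One claim is only weakly supported."},
    {"score": 0.8, "explanation": "Minor paraphrasing issues."},
]

result = calculator.calculate_metric_ensemble("faithfulness", faithfulness_results)
print(result["score"])            # aggregated score across judges
print(result["agreement_score"])  # 0.0-1.0, higher means better agreement
print(result["num_judges"])       # 3
```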
EnsembleMethod
Bases: StrEnum
Enumeration of ensemble methods for aggregating judge results.
Attributes:
| Name | Type | Description |
|---|---|---|
| MEDIAN | str | Use the median of all judge scores. |
| AVERAGE_ROUNDED | str | Use the rounded average of all judge scores. |
| MAJORITY_VOTE | str | Use a majority vote of pass/fail decisions; the score is the average of all valid scores. |
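To see how the three strategies can diverge on the same inputs, here is a plain-Python illustration (not the module's implementation); the 0.5 pass threshold and the rounding granularity are assumptions:

```python
from statistics import mean, median

scores = [0.4, 0.9, 0.95]                # per-judge scores for one item
passes = [s >= 0.5 for s in scores]      # pass/fail at an assumed 0.5 threshold

median_score = median(scores)            # MEDIAN          -> 0.9
avg_rounded = round(mean(scores), 2)     # AVERAGE_ROUNDED -> 0.75 (rounding granularity assumed)
majority_pass = sum(passes) > len(passes) / 2   # MAJORITY_VOTE decision -> True (2 of 3 pass)
majority_score = mean(scores)            # reported score: average of all valid scores -> 0.75
```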