Generation evaluator
Base generation evaluator helpers.

Shared helpers for generation-style evaluators, e.g.:
- Evaluating RAG output
- Evaluating LLM output
- Evaluating AI agent output
`BaseGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, judge=None, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None)`
Bases: BaseEvaluator
Shared base evaluator for generation-style rule-engine evaluation.
Default expected input
- input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
- retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
- expected_output (str): The reference output used for comparison.
- actual_output (str): The output generated by the AI system or component to evaluate.
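The default expected input can be sketched as a plain record. The field names follow the list above; the sample values are illustrative only:

```python
# Illustrative evaluation record using the default expected input fields.
# The field names come from the documented schema; the values are made up.
record = {
    "input": "What is the capital of France?",
    "retrieved_context": "France is a country in Europe. Its capital is Paris.",
    "expected_output": "Paris",
    "actual_output": "The capital of France is Paris.",
}

# Every default field is a string.
assert all(isinstance(value, str) for value in record.values())
```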
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `metrics` | `List[BaseMetric]` | The list of metrics to evaluate. |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. |
| `judge` | `List[Dict[str, Any]] \| None` | Optional list of judge configurations for metric-level aggregation. |
| `metrics_aggregator` | `MetricsAggregator` | The aggregator for polarity-aware binary scoring. |
Initialize the base generation evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `List[BaseMetric] \| None` | A list of metric instances to use as a base pool. If None, defaults to a built-in GEval-backed metric pool. Each custom metric must generate a | `None` |
| `enabled_metrics` | `List[type[BaseMetric] \| str] \| None` | A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metrics when `judge` is not provided. | `MODEL` |
| `model_config` | `dict[str, Any] \| None` | The model config to use for the metrics. | `None` |
| `model_credentials` | `str \| None` | The model credentials, used for initializing default metrics. Required when default metrics are initialized with a string model. Defaults to None. | `None` |
| `num_judges` | `int` | The number of judges to use for each metric. Defaults to `DefaultValues.NUM_JUDGES`. | `NUM_JUDGES` |
| `aggregation_method` | `AggregationSelector \| None` | The aggregation method to use for each metric. If None, each metric uses its own default (`MAJORITY_VOTE` for GEval metrics). | `None` |
| `max_concurrent_judges` | `int \| None` | The maximum number of concurrent judges per metric. If None, each metric uses its own default. | `None` |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. Defaults to True. | `True` |
| `judge` | `List[Dict[str, Any]] \| None` | Judge configuration for metric-level aggregation: a list of judge model configurations, where each dict should contain `'model_id'` (str), a model identifier (e.g., `'openai/gpt-4o'`, `'anthropic/claude-3-5-sonnet'`); `'model_credentials'` (str), API credentials for the model; and optionally `'model_config'` (Dict[str, Any]), additional model configuration. When provided, enables metric-level aggregation with heterogeneous models. If None, a single judge (or the `num_judges` parameter for same-model judges) is used. | `None` |
| `refusal_metric` | `GEvalRefusalMetric \| None` | The refusal metric to use. If None, the default refusal metric is used. Defaults to `GEvalRefusalMetric`. | `None` |
| `batch_status_check_interval` | `float` | Time between batch status checks, in seconds. Defaults to 30.0. | `BATCH_STATUS_CHECK_INTERVAL` |
| `batch_max_iterations` | `int` | Maximum number of status-check iterations before timeout. Defaults to 120 (60 minutes with the default interval). | `BATCH_MAX_ITERATIONS` |
| `metrics_aggregator` | `MetricsAggregator \| None` | The aggregator for polarity-aware binary scoring. If None, a default `MetricsAggregator` is used. Defaults to None. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
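A heterogeneous judge setup can be sketched as plain data before passing it to the constructor. The dict keys (`model_id`, `model_credentials`, `model_config`) follow the `judge` parameter schema above; the model ids, placeholder credentials, and the commented-out constructor call are illustrative assumptions, not tested against the real library:

```python
from typing import Any, Dict, List

# Hypothetical judge configurations for metric-level aggregation with
# heterogeneous models. Credentials are placeholders.
judges: List[Dict[str, Any]] = [
    {
        "model_id": "openai/gpt-4o",
        "model_credentials": "<OPENAI_API_KEY>",
        "model_config": {"temperature": 0.0},  # optional per-judge config
    },
    {
        "model_id": "anthropic/claude-3-5-sonnet",
        "model_credentials": "<ANTHROPIC_API_KEY>",
        # 'model_config' omitted: it is optional per the schema above
    },
]

# Each entry must carry at least a model id and credentials.
for judge in judges:
    assert {"model_id", "model_credentials"} <= judge.keys()

# The evaluator would then be constructed roughly as follows (not executed
# here, since the import path is assumed):
# evaluator = BaseGenerationEvaluator(judge=judges, run_parallel=True)
```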