Generation evaluator

Base generation evaluator helpers.

Shared helpers for generation-style evaluators, e.g.:
  • Evaluating RAG output
  • Evaluating LLM output
  • Evaluating AI agent output

BaseGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, judge=None, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None)

Bases: BaseEvaluator

Shared base evaluator for generation-style rule-engine evaluation.

Default expected input
  • input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
  • retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
  • expected_output (str): The reference output used for comparison.
  • actual_output (str): The output generated by the AI system or component to evaluate.
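The four default expected input fields above can be sketched as a plain dict. This is illustrative only: the key names come from the documentation, but how the evaluator consumes the record (e.g., the name of its evaluation method) depends on the concrete subclass.

```python
# Illustrative record with the four default expected input fields.
# Values are placeholders; key names follow the documentation above.
evaluation_input = {
    "input": "What is the capital of France?",
    "retrieved_context": "France is a country in Europe. Its capital is Paris.",
    "expected_output": "The capital of France is Paris.",
    "actual_output": "Paris is the capital of France.",
}

REQUIRED_FIELDS = {"input", "retrieved_context", "expected_output", "actual_output"}


def missing_fields(record: dict) -> set:
    """Return any default expected input fields absent from the record."""
    return REQUIRED_FIELDS - record.keys()
```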

Attributes:
  • name (str): The name of the evaluator.
  • metrics (List[BaseMetric]): The list of metrics to evaluate.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • judge (List[Dict[str, Any]] | None): Optional list of judge configurations for metric-level aggregation.
  • metrics_aggregator (MetricsAggregator): The aggregator for polarity-aware binary scoring.

Initialize the base generation evaluator.

Parameters:
  • metrics (List[BaseMetric] | None, default None): A list of metric instances to use as a base pool. If None, defaults to a built-in GEval-backed metric pool. Each custom metric must produce a score key in its output.
  • enabled_metrics (List[type[BaseMetric] | str] | None, default None): A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.
  • model (str | ModelId | BaseLMInvoker, default DefaultValues.MODEL): The model to use for the metrics when judge is not provided.
  • model_config (dict[str, Any] | None, default None): The model config to use for the metrics.
  • model_credentials (str | None, default None): The model credentials, used for initializing default metrics. Required when default metrics are initialized with a string model.
  • num_judges (int, default DefaultValues.NUM_JUDGES): The number of judges to use for each metric.
  • aggregation_method (AggregationSelector | None, default None): The aggregation method to use for each metric. If None, each metric uses its own default (MAJORITY_VOTE for GEval metrics).
  • max_concurrent_judges (int | None, default None): The maximum number of concurrent judges per metric. If None, each metric uses its own default.
  • run_parallel (bool, default True): Whether to run the metrics in parallel.
  • judge (List[Dict[str, Any]] | None, default None): Judge configuration for metric-level aggregation: a list of judge model configurations, where each dict should contain:
      - 'model_id' (str): Model identifier (e.g., 'openai/gpt-4o', 'anthropic/claude-3-5-sonnet').
      - 'model_credentials' (str): API credentials for the model.
      - 'model_config' (Dict[str, Any], optional): Additional model configuration.
    When provided, enables metric-level aggregation with heterogeneous models. When None, the num_judges parameter is used to run same-model judges.
  • refusal_metric (GEvalRefusalMetric | None, default None): The refusal metric to use. If None, a default GEvalRefusalMetric is used.
  • batch_status_check_interval (float, default DefaultValues.BATCH_STATUS_CHECK_INTERVAL): Time between batch status checks in seconds. Defaults to 30.0.
  • batch_max_iterations (int, default DefaultValues.BATCH_MAX_ITERATIONS): Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with the default interval).
  • metrics_aggregator (MetricsAggregator | None, default None): The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used.
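The expected shape of the judge parameter can be sketched from the key names documented above ('model_id', 'model_credentials', 'model_config'). The model ids and credential strings below are placeholders, and the validation helper is illustrative, not part of the library.

```python
# Illustrative judge configuration list: two heterogeneous judge models.
judge = [
    {
        "model_id": "openai/gpt-4o",
        "model_credentials": "sk-PLACEHOLDER",       # placeholder credential
        "model_config": {"temperature": 0.0},        # optional
    },
    {
        "model_id": "anthropic/claude-3-5-sonnet",
        "model_credentials": "sk-ant-PLACEHOLDER",   # placeholder credential
    },
]


def validate_judge_config(judges: list) -> bool:
    """Check that each judge dict carries the documented required keys."""
    return all(
        isinstance(j, dict)
        and isinstance(j.get("model_id"), str)
        and isinstance(j.get("model_credentials"), str)
        for j in judges
    )
```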

Raises:
  • ValueError: If model_credentials is not provided when initializing default metrics with a string model.
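A minimal sketch (not the library's actual code) of the check behind this ValueError: a model given as a string requires model_credentials, while invoker instances carry their own.

```python
def check_model_credentials(model, model_credentials):
    """Raise ValueError when default metrics would be initialized from a
    string model without credentials (mirrors the documented constraint)."""
    if isinstance(model, str) and model_credentials is None:
        raise ValueError(
            "model_credentials is required when initializing default "
            "metrics with a string model."
        )
```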