Generation evaluator

Base generation evaluator helpers.

Shared helpers for generation-style evaluators, for example:
  • Evaluating RAG output
  • Evaluating LLM output
  • Evaluating AI agent output

BaseGenerationEvaluator(models=None, metrics=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)

Bases: BaseEvaluator

Shared base evaluator for generation-style rule-engine evaluation.

Default expected input
  • input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
  • retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
  • expected_output (str): The reference output used for comparison.
  • actual_output (str): The output generated by the AI system or component to evaluate.
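
As an illustration of the default expected input, the four fields above can be written as a plain record. This is a minimal sketch, not the library's own data type: `GenerationSample`, `EXPECTED_FIELDS`, and `missing_fields` are hypothetical names introduced here, and the source does not say whether every field is mandatory for every metric.

```python
from __future__ import annotations

from typing import TypedDict


class GenerationSample(TypedDict):
    """Hypothetical record mirroring the documented default input fields."""

    input: str              # query, prompt, or instruction given to the system
    retrieved_context: str  # supporting context used during generation
    expected_output: str    # reference output used for comparison
    actual_output: str      # output produced by the system under evaluation


# Field names taken from the list above; treating all of them as expected
# is an assumption made for this sketch.
EXPECTED_FIELDS = ("input", "retrieved_context", "expected_output", "actual_output")


def missing_fields(record: dict) -> list[str]:
    """Return the documented field names that are absent from the record."""
    return [field for field in EXPECTED_FIELDS if field not in record]


sample: GenerationSample = {
    "input": "What is the capital of France?",
    "retrieved_context": "Paris is the capital and largest city of France.",
    "expected_output": "Paris",
    "actual_output": "The capital of France is Paris.",
}
```

A check such as `missing_fields(sample)` can be run before evaluation to catch incomplete records early.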

Attributes:

  • name (str): The name of the evaluator.
  • metrics (List[BaseMetric]): The list of metrics to evaluate.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • models (BaseLMInvoker | list[BaseLMInvoker] | None): Judge models for single-judge/multi-judge evaluation.
  • metrics_aggregator (MetricsAggregator): The aggregator for polarity-aware binary scoring.

Initialize the base generation evaluator.

Parameters:

  • models (BaseLMInvoker | list[BaseLMInvoker] | None; default: None): Judge models for single-judge/multi-judge evaluation.
      • None (default): single-judge mode using the default invoker.
      • [invoker] or a single invoker: runs that invoker once.
      • [invoker] * N: homogeneous, the same model N times.
      • [invoker_a, invoker_b]: heterogeneous, distinct models.
  • metrics (list[BaseMetric] | None; default: None): Metric instances to evaluate. If None, default metrics are built.
  • aggregation_method (AggregationSelector | None; default: None): The aggregation method to use for each metric.
  • max_concurrent_judges (int | None; default: None): The maximum number of concurrent judges per metric.
  • run_parallel (bool; default: True): Whether to run the metrics in parallel.
  • refusal_metric (GEvalRefusalMetric | None; default: None): Optional explicit refusal metric.
  • batch_status_check_interval (float; default: BATCH_STATUS_CHECK_INTERVAL): Time between batch status checks, in seconds. Defaults to 30.0.
  • batch_max_iterations (int; default: BATCH_MAX_ITERATIONS): Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with the default interval).
  • metrics_aggregator (MetricsAggregator | None; default: None): The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used.
  • fallback_models (list[BaseLMInvoker] | None; default: None): Ordered fallback invoker chain propagated to every metric.
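
The relationship between batch_status_check_interval and batch_max_iterations can be sketched as a bounded polling loop. This is an illustrative sketch only: `poll_until_done` and `worst_case_wait_seconds` are hypothetical helpers, and the source does not state whether the evaluator checks before or after sleeping. The defaults of 30.0 seconds and 120 iterations come from the parameter descriptions.

```python
import time
from typing import Callable


def worst_case_wait_seconds(interval: float = 30.0, max_iterations: int = 120) -> float:
    """Upper bound on total wait time: interval multiplied by iteration count."""
    return interval * max_iterations


def poll_until_done(
    check_status: Callable[[], bool],
    interval: float = 30.0,
    max_iterations: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check_status up to max_iterations times.

    Returns True as soon as check_status() reports completion, or False if
    the iteration budget runs out (the timeout case). The check-then-sleep
    ordering is an assumption of this sketch.
    """
    for _ in range(max_iterations):
        if check_status():
            return True
        sleep(interval)
    return False
```

With the documented defaults, `worst_case_wait_seconds()` gives 30.0 × 120 = 3600 seconds, which matches the 60 minutes quoted for batch_max_iterations. Passing a custom `sleep` callable keeps the loop testable without real delays.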

Raises:

  • ValueError: If the models list contains invalid invoker configurations.
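
The accepted shapes of the models argument, and the kind of check that could lead to a ValueError, can be sketched as follows. Everything here is an assumption made for illustration: `StubInvoker` stands in for BaseLMInvoker, `normalize_models` and `describe_mode` are hypothetical helpers, and the actual validation rules of the library are not described in the source.

```python
from __future__ import annotations


class StubInvoker:
    """Hypothetical stand-in for a BaseLMInvoker; it only carries a name."""

    def __init__(self, name: str) -> None:
        self.name = name


# Placeholder for "the default invoker" mentioned in the docs (assumed).
DEFAULT_INVOKER = StubInvoker("default-judge")


def normalize_models(models: StubInvoker | list[StubInvoker] | None = None) -> list[StubInvoker]:
    """Turn None, a single invoker, or a list of invokers into a list of judges."""
    if models is None:
        return [DEFAULT_INVOKER]          # single-judge mode, default invoker
    if isinstance(models, StubInvoker):
        return [models]                   # single invoker, runs once
    judges = list(models)
    # Assumed validation: reject empty lists and non-invoker entries.
    if not judges or not all(isinstance(m, StubInvoker) for m in judges):
        raise ValueError("models must be an invoker or a non-empty list of invokers")
    return judges


def describe_mode(judges: list[StubInvoker]) -> str:
    """Label the judging mode; 'homogeneous' means the same object repeated."""
    if len(judges) == 1:
        return "single-judge"
    if all(judge is judges[0] for judge in judges):
        return "multi-judge (homogeneous)"
    return "multi-judge (heterogeneous)"
```

In this sketch, `[invoker] * 3` is labeled homogeneous because the list repeats one object, while `[invoker_a, invoker_b]` is labeled heterogeneous. Comparing by identity is a simplification; the library may distinguish models differently.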