GEval generation evaluator

GEval Generation Evaluator.

GEvalGenerationEvaluator(models=None, metrics=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)

Bases: BaseGenerationEvaluator

This evaluator scores the output generated by an AI system against a list of metrics, using one or more judge models.

Default expected input
  • input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
  • retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
  • expected_output (str): The reference output used for comparison.
  • actual_output (str): The output generated by the AI system or component to evaluate.
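
The four fields above can be pictured as a simple record. The sketch below is illustrative only: the field names come from the list above, while the plain-dict container, the sample values, and the `missing_fields` helper are assumptions and not part of the library. How the record is actually passed to the evaluator is not shown on this page.

```python
# Hypothetical example record. Field names follow the list above;
# the values and the dict container are assumptions for illustration.
record = {
    "input": "What is the capital of France?",
    "retrieved_context": "Paris is the capital and largest city of France.",
    "expected_output": "Paris",
    "actual_output": "The capital of France is Paris.",
}

EXPECTED_FIELDS = ("input", "retrieved_context", "expected_output", "actual_output")


def missing_fields(rec: dict) -> list[str]:
    """Return the expected fields that are absent or not strings (hypothetical helper)."""
    return [field for field in EXPECTED_FIELDS if not isinstance(rec.get(field), str)]


print(missing_fields(record))  # []
```

A pre-flight check of this kind can catch a misspelled or missing field before any judge model is called.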

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics List[BaseMetric]

The list of metrics to evaluate.

run_parallel bool

Whether to run the metrics in parallel.

metrics_aggregator MetricsAggregator

The aggregator for polarity-aware binary scoring.

Initialize the GEval Generation Evaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge model(s) to use, for single-judge or multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • A single invoker, or [invoker]: single-judge mode; that invoker runs once.
  • [invoker] * N: homogeneous multi-judge mode; the same model runs N times.
  • [invoker_a, invoker_b]: heterogeneous multi-judge mode; distinct models act as judges.
None
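
The list shapes described above are ordinary Python lists, so their behavior can be checked without the library. In the sketch below, `StubInvoker` is a hypothetical stand-in for a BaseLMInvoker, used only to show what each shape contains. Note that `[invoker] * N` repeats a reference to one object and does not create N independent invokers.

```python
class StubInvoker:
    """Hypothetical stand-in for a BaseLMInvoker; not the library class."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


invoker_a = StubInvoker("model-a")
invoker_b = StubInvoker("model-b")

single = [invoker_a]                    # one judge, run once
homogeneous = [invoker_a] * 3           # same model listed three times
heterogeneous = [invoker_a, invoker_b]  # two distinct models

# List repetition copies references, so every entry is the same object.
same_object = all(judge is invoker_a for judge in homogeneous)

# Count how many distinct invoker objects the heterogeneous list holds.
distinct_count = len({id(judge) for judge in heterogeneous})

print(len(homogeneous), same_object, distinct_count)  # 3 True 2
```

If an invoker holds mutable state, the shared reference in the homogeneous case is worth keeping in mind.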
metrics list[BaseMetric] | None

Metric instances to evaluate. If None, uses DEFAULT_METRICS.

None
aggregation_method AggregationSelector | None

Strategy used to aggregate judge results. Defaults to None.

None
max_concurrent_judges int | None

Maximum number of judges to run concurrently. Defaults to None.

None
run_parallel bool

Whether to run the metrics in parallel.

True
refusal_metric GEvalRefusalMetric | None

The refusal metric to use. If None, a default GEvalRefusalMetric is used.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status checks before timing out. Defaults to 120 (60 minutes at the default 30-second interval).

BATCH_MAX_ITERATIONS
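
The 60-minute figure follows from multiplying the two batch settings. The sketch below reproduces that arithmetic, assuming the total wait is simply the interval times the iteration count and ignoring the time each status check itself takes. The function name is hypothetical.

```python
def batch_timeout_seconds(check_interval: float = 30.0, max_iterations: int = 120) -> float:
    """Upper bound on batch polling time, assuming one sleep per iteration."""
    return check_interval * max_iterations


timeout = batch_timeout_seconds()
print(timeout, timeout / 60)  # 3600.0 60.0

# Halving the interval while keeping 120 iterations halves the budget.
print(batch_timeout_seconds(check_interval=15.0) / 60)  # 30.0
```

When lowering batch_status_check_interval for faster feedback, batch_max_iterations needs a matching increase to keep the same overall timeout.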
metrics_aggregator MetricsAggregator | None

The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every metric. Defaults to None.

None
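
An ordered fallback chain is commonly read as "try each invoker in turn until one succeeds". The sketch below shows that general pattern with plain callables. It is an assumption about what "ordered fallback" means here, not the library's implementation, and every name in it is hypothetical.

```python
from typing import Callable, Sequence

Invoke = Callable[[str], str]


def invoke_with_fallbacks(primary: Invoke, fallbacks: Sequence[Invoke], prompt: str) -> str:
    """Call the primary invoker, then each fallback in order, until one returns."""
    last_error: Exception | None = None
    for invoke in (primary, *fallbacks):
        try:
            return invoke(prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next invoker
            last_error = exc
    raise RuntimeError("all invokers failed") from last_error


def failing_primary(prompt: str) -> str:
    raise TimeoutError("primary unavailable")


def backup(prompt: str) -> str:
    return f"backup judged: {prompt}"


result = invoke_with_fallbacks(failing_primary, [backup], "score this answer")
print(result)  # backup judged: score this answer
```

Order matters in this pattern: a fallback is consulted only after everything before it has failed.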