Evaluator
Evaluator module for orchestrating evaluation workflows.
This module provides evaluator classes that coordinate metrics and evaluation logic for different use cases. Evaluators handle the evaluation process, including data preparation, metric execution, and result aggregation.
Available evaluators: - BaseEvaluator: Abstract base class for all evaluators - BaseGenerationEvaluator: Shared generation evaluator base class - GEvalGenerationEvaluator: G-Eval based generation evaluation - ClassicalRetrievalEvaluator: Traditional retrieval evaluation methods - LMBasedRetrievalEvaluator: LM-based retrieval evaluation methods - CompositeEvaluator: Combines multiple metrics into a single evaluation unit - QTEvaluator: Question-answering evaluation - AgentEvaluator: Combined agent trajectory and generation evaluation - SummarizationEvaluator: Summarization evaluation methods
AgentEvaluator(models=None, tool_correctness_metric=None, generation_evaluator=None, trajectory_accuracy_metric=None, metrics_aggregator=None, aggregation_method=DefaultValues.AGGREGATION_METHOD, fallback_models=None)
Bases: BaseEvaluator
Evaluator for agent tool calling and generation quality evaluation with MetricsAggregator.
This evaluator combines: 1. DeepEval Tool Correctness metric for agent tool call evaluation 2. GEval generation evaluator (completeness, groundedness, redundancy) 3. Optional LangChain Agent Trajectory Accuracy metric (disabled by default) 4. MetricsAggregator for polarity-aware binary scoring aggregation
Uses dependency injection pattern - accepts pre-configured metric and evaluator. Uses preprocess → aggregate → postprocess pattern consistent with other evaluators.
Expected input (LLMTestCase with generation fields): - agent_trajectory (list[dict[str, Any]]): The agent trajectory - expected_agent_trajectory (list[dict[str, Any]]): Expected trajectory - input (str): The input query for generation evaluation - actual_output (str): The generated response - expected_output (str): The expected response - retrieved_context (str | list[str] | None, optional): The retrieved context
Aggregation Logic
- Uses MetricsAggregator.aggregate() for base aggregation:
- aggregate_success = all(metric.success) — AND-gate across all metrics
- aggregate_score = mean(quality_score) — quality_score = score (higher_is_better=True) or 1 - score (higher_is_better=False)
- Trajectory accuracy metric (if enabled) only runs when agent_trajectory is provided in input. Results are included in output but do not affect aggregation
Output Structure:
The output is a flat dictionary containing:
- Aggregated Results:
- aggregate_explanation (str): Human-readable explanation of the evaluation
(added by parent evaluator)
- aggregate_success (bool): AND-gate of all metric success values
- aggregate_score (float): Mean quality score with polarity inversion
- Tool Call Evaluation Results:
- deepeval_tool_correctness (dict): Tool correctness evaluation result containing
score, success, threshold, and other contract fields
- Trajectory Accuracy Results (optional, only when metric is enabled AND agent_trajectory is provided):
- langchain_agent_trajectory_accuracy (dict): Trajectory accuracy evaluation result
- Generation Evaluation Results (nested):
- generation (dict): Generation evaluation results containing:
- aggregate_explanation (str): Explanation of generation quality
- aggregate_success (bool): AND-gate of generation metric success values
- aggregate_score (float): Mean quality score for generation metrics
- Individual metric results (completeness, groundedness, redundancy)
with full contract fields
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
tool_correctness_metric |
DeepEvalToolCorrectnessMetric
|
The metric for tool call assessment. |
generation_evaluator |
GEvalGenerationEvaluator
|
Evaluator for generation quality assessment. |
trajectory_accuracy_metric |
LangChainAgentTrajectoryAccuracyMetric | None
|
Optional metric for LangChain-style trajectory accuracy evaluation. Disabled by default. |
metrics_aggregator |
MetricsAggregator
|
Aggregator for polarity-aware binary scoring. |
Example
import os from gllm_evals.evaluator import AgentEvaluator from gllm_evals.metrics.tool_use.deepeval_tool_correctness import DeepEvalToolCorrectnessMetric from gllm_evals.metrics.tool_use.langchain_agent_trajectory_accuracy import ( ... LangChainAgentTrajectoryAccuracyMetric ... )
Configure metrics
tool_correctness = DeepEvalToolCorrectnessMetric() trajectory_accuracy = LangChainAgentTrajectoryAccuracyMetric()
Create evaluator (trajectory accuracy is optional and runs conditionally)
evaluator = AgentEvaluator( ... tool_correctness_metric=tool_correctness, ... trajectory_accuracy_metric=trajectory_accuracy, # Optional; only runs if agent_trajectory is in input ... )
result = await evaluator.evaluate(data)
Initialize the AgentEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation.
|
None
|
tool_correctness_metric
|
DeepEvalToolCorrectnessMetric | None
|
Pre-configured metric for trajectory evaluation. If None, a default metric will be created with models. |
None
|
generation_evaluator
|
GEvalGenerationEvaluator | None
|
Pre-configured evaluator for generation quality assessment. If None, a default GEvalGenerationEvaluator will be created. Defaults to None. |
None
|
trajectory_accuracy_metric
|
LangChainAgentTrajectoryAccuracyMetric | None
|
Pre-configured metric for LangChain-style trajectory accuracy evaluation. Disabled by default (None). When provided, only runs when agent_trajectory field is present in input data. Results are included in output but do not affect final aggregated score. Defaults to None. |
None
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
aggregation_method
|
AggregationMethod
|
Method for aggregating judge scores at metric level in generation evaluator. Defaults to DefaultValues.AGGREGATION_METHOD. |
AGGREGATION_METHOD
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered fallback invoker chain propagated to every default metric and the generation evaluator. Defaults to None. |
None
|
required_fields
property
Returns the required fields for the data.
Returns the combined set of required fields from both the tool correctness metric and generation evaluator.
Returns:
| Type | Description |
|---|---|
set[str]
|
set[str]: The required fields for the data. |
format_tool_violation(tool_score, tool_passed, tool_threshold)
Format Tool Correctness violation message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tool_score
|
float
|
The tool score. |
required |
tool_passed
|
bool
|
Whether the tool passed. |
required |
tool_threshold
|
float
|
The tool threshold. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
str |
str
|
The formatted tool violation message. |
BaseEvaluator(name, metrics=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None)
Bases: ABC
Base class for all evaluators.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
required_fields |
set[str]
|
The required fields for the evaluator. |
input_type |
type | None
|
The type of the input data. |
Initialize the evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
The name of the evaluator. |
required |
metrics
|
list[BaseMetric] | None
|
Metric instances configured for the evaluator. If None, no metrics are configured. Defaults to None. |
None
|
batch_status_check_interval
|
float
|
Time between batch status checks in seconds. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval). |
BATCH_MAX_ITERATIONS
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If batch_status_check_interval or batch_max_iterations are not positive. |
aggregate_required_fields(metrics, mode='any')
staticmethod
Aggregate required fields from multiple metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
Iterable[BaseMetric]
|
The metrics to aggregate from. |
required |
mode
|
str
|
The aggregation mode. Options: - "union": All fields required by any metric - "intersection": Only fields required by all metrics - "any": Empty set (no validation) Defaults to "any". |
'any'
|
Returns:
| Type | Description |
|---|---|
set[str]
|
set[str]: The aggregated required fields. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If mode is not one of the supported options. |
can_evaluate_any(metrics, data)
staticmethod
Check if any of the metrics can evaluate the given data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
Iterable[BaseMetric]
|
The metrics to check. |
required |
data
|
MetricInput
|
The data to validate against. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if any metric can evaluate the data, False otherwise. |
ensure_list_of_dicts(data, key)
staticmethod
Ensure that a field in the data is a list of dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
MetricInput
|
The data to validate. |
required |
key
|
str
|
The key to check. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the field is not a list or contains non-dictionary elements. |
ensure_non_empty_list(data, key)
staticmethod
Ensure that a field in the data is a non-empty list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
MetricInput
|
The data to validate. |
required |
key
|
str
|
The key to check. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the field is not a list or is empty. |
evaluate(data)
async
Evaluate the data (single item or batch).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
EvalInput | list[EvalInput]
|
The data to be evaluated. Can be a single item or a list for batch processing. |
required |
Returns:
| Type | Description |
|---|---|
EvaluatorResult | list[EvaluatorResult]
|
EvaluatorResult | list[EvaluatorResult]: The evaluation output with aggregate_explanation. Returns a list if input is a list. |
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
list[str] | None
|
list[str] | None: The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | None
|
list[dict[str, Any]] | None: The input spec. |
BaseGenerationEvaluator(models=None, metrics=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)
Bases: BaseEvaluator
Shared base evaluator for generation-style rule-engine evaluation.
Default expected input
- input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
- retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
- expected_output (str): The reference output used for comparison.
- actual_output (str): The output generated by the AI system or component to evaluate.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
metrics |
List[BaseMetric]
|
The list of metrics to evaluate. |
run_parallel |
bool
|
Whether to run the metrics in parallel. |
models |
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation. |
metrics_aggregator |
MetricsAggregator
|
The aggregator for polarity-aware binary scoring. |
Initialize the base generation evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation.
|
None
|
metrics
|
list[BaseMetric] | None
|
Metric instances to evaluate. If None, default metrics are built. |
None
|
aggregation_method
|
AggregationSelector | None
|
The aggregation method to use for each metric. |
None
|
max_concurrent_judges
|
int | None
|
The maximum number of concurrent judges per metric. |
None
|
run_parallel
|
bool
|
Whether to run the metrics in parallel. Defaults to True. |
True
|
refusal_metric
|
GEvalRefusalMetric | None
|
Optional explicit refusal metric. |
None
|
batch_status_check_interval
|
float
|
Time between batch status checks in seconds. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval). |
BATCH_MAX_ITERATIONS
|
metrics_aggregator
|
MetricsAggregator | None
|
The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered fallback invoker chain propagated to every metric. Defaults to None. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If models list contains invalid invoker configurations. |
ClassicalRetrievalEvaluator(metrics=None, k=20, metrics_aggregator=None)
Bases: BaseEvaluator
A class that evaluates the performance of a classical retrieval system.
Required fields: - retrieved_chunks: The retrieved chunks with their similarity score. - ground_truth_chunk_ids: The ground truth chunk ids.
Example:
data = RetrievalData(
retrieved_chunks={
"chunk1": 0.9,
"chunk2": 0.8,
"chunk3": 0.7,
},
ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)
evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
metrics |
list[str | ClassicalRetrievalMetric] | None
|
The metrics to evaluate. |
k |
int
|
The number of retrieved chunks to consider. |
Initializes the evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
list[str | ClassicalRetrievalMetric] | None
|
The metrics to evaluate. Defaults to all metrics. |
None
|
k
|
int | list[int]
|
The number of retrieved chunks to consider. Defaults to 20. |
20
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
required_fields
property
Returns the required fields for the data.
Returns:
| Type | Description |
|---|---|
set[str]
|
set[str]: The required fields for the data. |
CompositeEvaluator(metrics, name='composite', parallel=True, metrics_aggregator=None)
Bases: BaseEvaluator
Composite evaluator that runs multiple metrics and aggregates results.
This evaluator composes multiple BaseMetric objects and presents them as a single evaluator unit, following the GoF Composite pattern. It supports parallel or sequential execution, fault isolation, and custom aggregation.
Attributes:
| Name | Type | Description |
|---|---|---|
metrics |
list[BaseMetric]
|
The list of metrics to evaluate. |
name |
str
|
The name of the evaluator. |
parallel |
bool
|
Whether to evaluate the metrics in parallel. |
metrics_aggregator |
MetricsAggregator
|
The aggregator for polarity-aware binary scoring. |
Initialize the composite evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
list[BaseMetric]
|
The list of metrics to evaluate. |
required |
name
|
str
|
The name of the evaluator. |
'composite'
|
parallel
|
bool
|
Whether to evaluate the metrics in parallel. |
True
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
GEvalGenerationEvaluator(models=None, metrics=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)
Bases: BaseGenerationEvaluator
GEval Generation Evaluator.
This evaluator is used to evaluate the generation of the model.
Default expected input
- input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
- retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
- expected_output (str): The reference output used for comparison.
- actual_output (str): The output generated by the AI system or component to evaluate.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
metrics |
List[BaseMetric]
|
The list of metrics to evaluate. |
run_parallel |
bool
|
Whether to run the metrics in parallel. |
metrics_aggregator |
MetricsAggregator
|
The aggregator for polarity-aware binary scoring. |
Initialize the GEval Generation Evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation.
|
None
|
metrics
|
list[BaseMetric] | None
|
Metric instances to evaluate. If None, uses |
None
|
aggregation_method
|
AggregationSelector | None
|
Strategy used to aggregate judge results. Defaults to None. |
None
|
max_concurrent_judges
|
int | None
|
Maximum number of judges to run concurrently. Defaults to None. |
None
|
run_parallel
|
bool
|
Whether to run the metrics in parallel. |
True
|
refusal_metric
|
GEvalRefusalMetric | None
|
The refusal metric to use. If None, the default refusal metric will be used. Defaults to GEvalRefusalMetric. |
None
|
batch_status_check_interval
|
float
|
Time between batch status checks in seconds. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval). |
BATCH_MAX_ITERATIONS
|
metrics_aggregator
|
MetricsAggregator | None
|
The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered fallback invoker chain propagated to every metric. Defaults to None. |
None
|
LMBasedRetrievalEvaluator(models=None, metrics=None, enabled_metrics=None, run_parallel=True, metrics_aggregator=None, fallback_models=None)
Bases: BaseEvaluator
Evaluator for LM-based retrieval quality in RAG pipelines.
This evaluator
- Runs a configurable set of retrieval metrics (by default: DeepEval contextual precision and contextual recall)
- Combines their scores using a simple rule-based scheme to produce:
relevancy_rating(good / bad / incomplete)score(aggregated retrieval score)
Default expected input
- input (str): The input query to evaluate the metric.
- expected_output (str): The expected output to evaluate the metric.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If the retrieved context is a str, it will be converted into a list with a single element.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
metrics |
list[BaseMetric]
|
The list of metrics to evaluate. |
enabled_metrics |
Sequence[type[BaseMetric] | str] | None
|
The list of metrics to enable. |
run_parallel |
bool
|
Whether to run the metrics in parallel. |
Initialize the LM-based retrieval evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation.
|
None
|
metrics
|
Sequence[BaseMetric] | None
|
Optional custom retrieval metric instances. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None. |
None
|
enabled_metrics
|
Sequence[type[BaseMetric] | str] | None
|
Optional subset of metrics to enable
from the metric pool. Each entry can be either a metric class or its |
None
|
run_parallel
|
bool
|
Whether to run retrieval metrics in parallel. Defaults to True. |
True
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered fallback invoker chain propagated to every metric. Defaults to None. |
None
|
required_fields
property
Return the union of required fields from all configured metrics.
QTEvaluator(models=None, completeness_metric=None, groundedness_metric=None, redundancy_metric=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, score_mapping=None, score_weights=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)
Bases: BaseEvaluator
Evaluator for query transformation tasks.
Default expected input: - input (str): The query to evaluate the completeness of the model's output. - expected_output (str): The expected response to evaluate the completeness of the model's output. - actual_output (str): The generated response to evaluate the completeness of the model's output.
Attributes:
| Name | Type | Description |
|---|---|---|
completeness_metric |
GEvalCompletenessMetric
|
The completeness metric. |
hallucination_metric |
GEvalGroundednessMetric
|
The groundedness metric. |
redundancy_metric |
GEvalRedundancyMetric
|
The redundancy metric. |
run_parallel |
bool
|
Whether to run the metrics in parallel. |
score_mapping |
dict[str, dict[int, float]]
|
The score mapping. |
score_weights |
dict[str, float]
|
The score weights. |
Initialize the QTEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation.
|
None
|
completeness_metric
|
GEvalCompletenessMetric | None
|
The completeness metric. Defaults to built-in GEvalCompletenessMetric. |
None
|
groundedness_metric
|
GEvalGroundednessMetric | None
|
The groundedness metric. Defaults to built-in GEvalGroundednessMetric. |
None
|
redundancy_metric
|
GEvalRedundancyMetric | None
|
The redundancy metric. Defaults to built-in GEvalRedundancyMetric. |
None
|
aggregation_method
|
AggregationSelector | None
|
The aggregation method to use for each metric. If None, each metric uses its own default (MAJORITY_VOTE for GEval metrics). |
None
|
max_concurrent_judges
|
int | None
|
The maximum number of concurrent judges per metric. If None, each metric uses its own default. |
None
|
run_parallel
|
bool
|
Whether to run the metrics in parallel. Defaults to True. |
True
|
score_mapping
|
dict[str, dict[int, float]] | None
|
The score mapping. Defaults to None. This is required if some of the default metrics are used. |
None
|
score_weights
|
dict[str, float] | None
|
The score weights. Defaults to None. |
None
|
batch_status_check_interval
|
float
|
Time between batch status checks in seconds. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval). |
BATCH_MAX_ITERATIONS
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered fallback invoker chain propagated to every metric. Defaults to None. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If models list contains invalid invoker configurations. |
required_fields
property
Returns the required fields for the data.
Returns:
| Type | Description |
|---|---|
set[str]
|
set[str]: The required fields for the data. |
SummarizationEvaluator(models=None, metrics=None, aggregation_method=DefaultValues.AGGREGATION_METHOD, run_parallel=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)
Bases: BaseGenerationEvaluator
Evaluator for summarization quality using four GEval-style metrics.
Default expected input
- input (str): Source text or transcript.
- actual_output (str): Generated summary.
Initialize the SummarizationEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
Judge models for single-judge/multi-judge evaluation.
|
None
|
metrics
|
list[BaseMetric] | None
|
Metric instances to evaluate.
If None, uses |
None
|
aggregation_method
|
AggregationMethod
|
Strategy used to aggregate judge results. |
AGGREGATION_METHOD
|
run_parallel
|
bool
|
Whether to run the metrics in parallel. Defaults to True. |
True
|
batch_status_check_interval
|
float
|
Time between batch status checks in seconds. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of status check iterations before timeout. |
BATCH_MAX_ITERATIONS
|
metrics_aggregator
|
MetricsAggregator | None
|
Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None. |
None
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered fallback invoker chain propagated to every metric. Defaults to None. |
None
|