Evaluator

Evaluator module for orchestrating evaluation workflows.

This module provides evaluator classes that coordinate metrics and evaluation logic for different use cases. Evaluators handle the evaluation process, including data preparation, metric execution, and result aggregation.

Available evaluators:
  • BaseEvaluator: Abstract base class for all evaluators
  • BaseGenerationEvaluator: Shared generation evaluator base class
  • GEvalGenerationEvaluator: G-Eval based generation evaluation
  • ClassicalRetrievalEvaluator: Traditional retrieval evaluation methods
  • LMBasedRetrievalEvaluator: LM-based retrieval evaluation methods
  • CompositeEvaluator: Combines multiple metrics into a single evaluation unit
  • QTEvaluator: Query transformation evaluation
  • AgentEvaluator: Combined agent trajectory and generation evaluation
  • SummarizationEvaluator: Summarization evaluation methods

AgentEvaluator(models=None, tool_correctness_metric=None, generation_evaluator=None, trajectory_accuracy_metric=None, metrics_aggregator=None, aggregation_method=DefaultValues.AGGREGATION_METHOD, fallback_models=None)

Bases: BaseEvaluator

Evaluator for agent tool calling and generation quality evaluation with MetricsAggregator.

This evaluator combines:
  1. DeepEval Tool Correctness metric for agent tool call evaluation
  2. GEval generation evaluator (completeness, groundedness, redundancy)
  3. Optional LangChain Agent Trajectory Accuracy metric (disabled by default)
  4. MetricsAggregator for polarity-aware binary scoring aggregation

Uses dependency injection: the evaluator accepts a pre-configured metric and evaluator instead of building them itself. Follows the preprocess → aggregate → postprocess pattern used by the other evaluators.

Expected input (LLMTestCase with generation fields):
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory
  • expected_agent_trajectory (list[dict[str, Any]]): Expected trajectory
  • input (str): The input query for generation evaluation
  • actual_output (str): The generated response
  • expected_output (str): The expected response
  • retrieved_context (str | list[str] | None, optional): The retrieved context

Aggregation Logic
  • Uses MetricsAggregator.aggregate() for base aggregation:
    • aggregate_success = all(metric.success) — AND-gate across all metrics
    • aggregate_score = mean(quality_score) — quality_score = score (higher_is_better=True) or 1 - score (higher_is_better=False)
  • Trajectory accuracy metric (if enabled) only runs when agent_trajectory is provided in the input. Its results are included in the output but do not affect aggregation.
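
The aggregation rule above can be sketched in plain Python. This is a minimal illustration, not the MetricsAggregator implementation: the `MetricResult` dataclass and the `quality_score` and `aggregate` helpers are hypothetical names introduced here, and the scores are made-up inputs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's output; not a gllm_evals type."""

    score: float
    success: bool
    higher_is_better: bool = True


def quality_score(result: MetricResult) -> float:
    # Invert the score for metrics where a lower value is better.
    return result.score if result.higher_is_better else 1 - result.score


def aggregate(results: list[MetricResult]) -> dict[str, object]:
    return {
        # AND-gate: a single failing metric fails the whole evaluation.
        "aggregate_success": all(r.success for r in results),
        # Mean of the polarity-corrected scores.
        "aggregate_score": mean(quality_score(r) for r in results),
    }


results = [
    MetricResult(score=0.9, success=True),
    MetricResult(score=0.2, success=True, higher_is_better=False),
    MetricResult(score=0.4, success=False),
]
summary = aggregate(results)
```

Here the lower-is-better score of 0.2 contributes 0.8 to the mean, and the single failing metric is enough to make aggregate_success False.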

Output Structure:

The output is a flat dictionary containing:

- Aggregated Results:
    - aggregate_explanation (str): Human-readable explanation of the evaluation
    (added by parent evaluator)
    - aggregate_success (bool): AND-gate of all metric success values
    - aggregate_score (float): Mean quality score with polarity inversion

- Tool Call Evaluation Results:
    - deepeval_tool_correctness (dict): Tool correctness evaluation result containing
    score, success, threshold, and other contract fields

- Trajectory Accuracy Results (optional, only when metric is enabled AND agent_trajectory is provided):
    - langchain_agent_trajectory_accuracy (dict): Trajectory accuracy evaluation result

- Generation Evaluation Results (nested):
    - generation (dict): Generation evaluation results containing:
        - aggregate_explanation (str): Explanation of generation quality
        - aggregate_success (bool): AND-gate of generation metric success values
        - aggregate_score (float): Mean quality score for generation metrics
        - Individual metric results (completeness, groundedness, redundancy)
        with full contract fields
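
As a rough sketch of how the flat output described above might be read, the snippet below builds a placeholder dictionary with the documented keys. All values are invented for illustration, the per-metric dictionaries are abbreviated, and the `completeness` key name inside `generation` is an assumption based on the metric names listed.

```python
# Placeholder result mirroring the documented layout; values are invented.
result = {
    "aggregate_explanation": "placeholder explanation",
    "aggregate_success": False,
    "aggregate_score": 0.7,
    "deepeval_tool_correctness": {"score": 1.0, "success": True, "threshold": 0.5},
    "generation": {
        "aggregate_explanation": "placeholder explanation",
        "aggregate_success": False,
        "aggregate_score": 0.6,
        "completeness": {"score": 0.4, "success": False},  # assumed key name
    },
}

# Top-level verdict.
passed = result["aggregate_success"]

# The trajectory key is optional, so read it defensively.
trajectory = result.get("langchain_agent_trajectory_accuracy")

# Generation results are nested one level down; keep only the per-metric dicts.
failed_generation_metrics = [
    name
    for name, value in result["generation"].items()
    if isinstance(value, dict) and not value.get("success", True)
]
```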

Attributes:

Name Type Description
name str

The name of the evaluator.

tool_correctness_metric DeepEvalToolCorrectnessMetric

The metric for tool call assessment.

generation_evaluator GEvalGenerationEvaluator

Evaluator for generation quality assessment.

trajectory_accuracy_metric LangChainAgentTrajectoryAccuracyMetric | None

Optional metric for LangChain-style trajectory accuracy evaluation. Disabled by default.

metrics_aggregator MetricsAggregator

Aggregator for polarity-aware binary scoring.

Example

import os
from gllm_evals.evaluator import AgentEvaluator
from gllm_evals.metrics.tool_use.deepeval_tool_correctness import DeepEvalToolCorrectnessMetric
from gllm_evals.metrics.tool_use.langchain_agent_trajectory_accuracy import (
    LangChainAgentTrajectoryAccuracyMetric,
)

# Configure metrics
tool_correctness = DeepEvalToolCorrectnessMetric()
trajectory_accuracy = LangChainAgentTrajectoryAccuracyMetric()

# Create evaluator (trajectory accuracy is optional and runs conditionally)
evaluator = AgentEvaluator(
    tool_correctness_metric=tool_correctness,
    trajectory_accuracy_metric=trajectory_accuracy,  # Optional; only runs if agent_trajectory is in input
)

result = await evaluator.evaluate(data)

Initialize the AgentEvaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] or a single invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
tool_correctness_metric DeepEvalToolCorrectnessMetric | None

Pre-configured metric for tool call evaluation. If None, a default metric is created with models.

None
generation_evaluator GEvalGenerationEvaluator | None

Pre-configured evaluator for generation quality assessment. If None, a default GEvalGenerationEvaluator will be created. Defaults to None.

None
trajectory_accuracy_metric LangChainAgentTrajectoryAccuracyMetric | None

Pre-configured metric for LangChain-style trajectory accuracy evaluation. Disabled by default (None). When provided, only runs when agent_trajectory field is present in input data. Results are included in output but do not affect final aggregated score. Defaults to None.

None
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
aggregation_method AggregationMethod

Method for aggregating judge scores at metric level in generation evaluator. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every default metric and the generation evaluator. Defaults to None.

None
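
The list shapes accepted by `models` can be illustrated without a real invoker. In this sketch `FakeInvoker` is a hypothetical stand-in for a BaseLMInvoker, used only to show what each form contains; it makes no model calls.

```python
class FakeInvoker:
    """Hypothetical stand-in for a BaseLMInvoker; only carries a name."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


invoker_a = FakeInvoker("model-a")
invoker_b = FakeInvoker("model-b")

single = [invoker_a]                    # one judge
homogeneous = [invoker_a] * 3           # three judges, same model
heterogeneous = [invoker_a, invoker_b]  # two judges, distinct models

# List repetition copies references, not objects: all three entries point
# at the same invoker instance.
same_instance = all(judge is invoker_a for judge in homogeneous)
```

Because `[invoker] * N` repeats a reference, any per-invoker state is shared across the N judges.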

required_fields property

Returns the required fields for the data.

Returns the combined set of required fields from both the tool correctness metric and generation evaluator.

Returns:

Type Description
set[str]

set[str]: The required fields for the data.

format_tool_violation(tool_score, tool_passed, tool_threshold)

Format Tool Correctness violation message.

Parameters:

Name Type Description Default
tool_score float

The tool score.

required
tool_passed bool

Whether the tool passed.

required
tool_threshold float

The tool threshold.

required

Returns:

Name Type Description
str str

The formatted tool violation message.

BaseEvaluator(name, metrics=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None)

Bases: ABC

Base class for all evaluators.

Attributes:

Name Type Description
name str

The name of the evaluator.

required_fields set[str]

The required fields for the evaluator.

input_type type | None

The type of the input data.

Initialize the evaluator.

Parameters:

Name Type Description Default
name str

The name of the evaluator.

required
metrics list[BaseMetric] | None

Metric instances configured for the evaluator. If None, no metrics are configured. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None

Raises:

Type Description
ValueError

If batch_status_check_interval or batch_max_iterations are not positive.
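
The "60 minutes with default interval" figure follows from multiplying the two batch settings. The helper below is a hypothetical sketch of that arithmetic and of the positivity check described under Raises; it is not the evaluator's own code.

```python
def batch_timeout_seconds(check_interval: float = 30.0, max_iterations: int = 120) -> float:
    """Upper bound on polling time: one interval per status check."""
    if check_interval <= 0 or max_iterations <= 0:
        raise ValueError("check_interval and max_iterations must be positive")
    return check_interval * max_iterations


default_timeout = batch_timeout_seconds()       # 30.0 s * 120 checks
default_timeout_minutes = default_timeout / 60  # converted to minutes
```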

aggregate_required_fields(metrics, mode='any') staticmethod

Aggregate required fields from multiple metrics.

Parameters:

Name Type Description Default
metrics Iterable[BaseMetric]

The metrics to aggregate from.

required
mode str

The aggregation mode. Options:
  • "union": All fields required by any metric
  • "intersection": Only fields required by all metrics
  • "any": Empty set (no validation)
Defaults to "any".

'any'

Returns:

Type Description
set[str]

set[str]: The aggregated required fields.

Raises:

Type Description
ValueError

If mode is not one of the supported options.
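
The three modes map onto ordinary set operations. The sketch below assumes each metric's required fields are available as a plain `set[str]`; `aggregate_fields` is a hypothetical helper that takes those sets directly rather than BaseMetric instances.

```python
from collections.abc import Iterable


def aggregate_fields(field_sets: Iterable[set[str]], mode: str = "any") -> set[str]:
    """Combine per-metric required fields; a sketch, not the library code."""
    sets = list(field_sets)
    if mode == "any":
        return set()  # no validation
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets) if sets else set()
    raise ValueError(f"Unsupported mode: {mode!r}")


per_metric = [
    {"input", "actual_output"},
    {"input", "actual_output", "expected_output"},
]
union_fields = aggregate_fields(per_metric, "union")
common_fields = aggregate_fields(per_metric, "intersection")
no_fields = aggregate_fields(per_metric)
```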

can_evaluate_any(metrics, data) staticmethod

Check if any of the metrics can evaluate the given data.

Parameters:

Name Type Description Default
metrics Iterable[BaseMetric]

The metrics to check.

required
data MetricInput

The data to validate against.

required

Returns:

Name Type Description
bool bool

True if any metric can evaluate the data, False otherwise.

ensure_list_of_dicts(data, key) staticmethod

Ensure that a field in the data is a list of dictionaries.

Parameters:

Name Type Description Default
data MetricInput

The data to validate.

required
key str

The key to check.

required

Raises:

Type Description
ValueError

If the field is not a list or contains non-dictionary elements.

ensure_non_empty_list(data, key) staticmethod

Ensure that a field in the data is a non-empty list.

Parameters:

Name Type Description Default
data MetricInput

The data to validate.

required
key str

The key to check.

required

Raises:

Type Description
ValueError

If the field is not a list or is empty.
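
Both list validators follow the same shape: look up the field, check it, and raise ValueError on failure. The functions below are hypothetical equivalents written against a plain dict; treating a missing key as "not a list" is an assumption of this sketch.

```python
from typing import Any


def require_non_empty_list(data: dict[str, Any], key: str) -> None:
    value = data.get(key)  # a missing key yields None, which fails the check
    if not isinstance(value, list) or not value:
        raise ValueError(f"'{key}' must be a non-empty list")


def require_list_of_dicts(data: dict[str, Any], key: str) -> None:
    value = data.get(key)
    if not isinstance(value, list) or not all(isinstance(item, dict) for item in value):
        raise ValueError(f"'{key}' must be a list of dictionaries")


sample = {"agent_trajectory": [{"tool": "search"}], "empty": [], "mixed": [{"a": 1}, "b"]}
require_non_empty_list(sample, "agent_trajectory")  # passes silently
require_list_of_dicts(sample, "agent_trajectory")   # passes silently
```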

evaluate(data) async

Evaluate the data (single item or batch).

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

The data to be evaluated. Can be a single item or a list for batch processing.

required

Returns:

Type Description
EvaluatorResult | list[EvaluatorResult]

EvaluatorResult | list[EvaluatorResult]: The evaluation output with aggregate_explanation. Returns a list if input is a list.

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.

BaseGenerationEvaluator(models=None, metrics=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)

Bases: BaseEvaluator

Shared base evaluator for generation-style rule-engine evaluation.

Default expected input
  • input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
  • retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
  • expected_output (str): The reference output used for comparison.
  • actual_output (str): The output generated by the AI system or component to evaluate.

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics List[BaseMetric]

The list of metrics to evaluate.

run_parallel bool

Whether to run the metrics in parallel.

models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

metrics_aggregator MetricsAggregator

The aggregator for polarity-aware binary scoring.

Initialize the base generation evaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] or a single invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
metrics list[BaseMetric] | None

Metric instances to evaluate. If None, default metrics are built.

None
aggregation_method AggregationSelector | None

The aggregation method to use for each metric.

None
max_concurrent_judges int | None

The maximum number of concurrent judges per metric.

None
run_parallel bool

Whether to run the metrics in parallel. Defaults to True.

True
refusal_metric GEvalRefusalMetric | None

Optional explicit refusal metric.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
metrics_aggregator MetricsAggregator | None

The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every metric. Defaults to None.

None

Raises:

Type Description
ValueError

If models list contains invalid invoker configurations.
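
An ordered fallback chain is commonly implemented by trying each invoker in turn until one succeeds. The sketch below shows that pattern with a hypothetical `StubInvoker`; when the library actually triggers a fallback (for example, on which error types) is an assumption not confirmed by this page.

```python
import asyncio


class StubInvoker:
    """Hypothetical stand-in for a BaseLMInvoker."""

    def __init__(self, name: str, fails: bool = False) -> None:
        self.name = name
        self.fails = fails

    async def invoke(self, prompt: str) -> str:
        if self.fails:
            raise RuntimeError(f"{self.name} unavailable")
        return f"{self.name}: ok"


async def invoke_with_fallback(primary, fallbacks, prompt: str) -> str:
    last_error = None
    # Try the primary first, then each fallback in the order given.
    for invoker in [primary, *fallbacks]:
        try:
            return await invoker.invoke(prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all invokers failed") from last_error


answer = asyncio.run(
    invoke_with_fallback(
        StubInvoker("primary", fails=True),
        [StubInvoker("fallback-1", fails=True), StubInvoker("fallback-2")],
        "judge this output",
    )
)
```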

ClassicalRetrievalEvaluator(metrics=None, k=20, metrics_aggregator=None)

Bases: BaseEvaluator

A class that evaluates the performance of a classical retrieval system.

Required fields:
  • retrieved_chunks: The retrieved chunks with their similarity score.
  • ground_truth_chunk_ids: The ground truth chunk ids.

Example:

data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)

evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics list[str | ClassicalRetrievalMetric] | None

The metrics to evaluate.

k int

The number of retrieved chunks to consider.

Initializes the evaluator.

Parameters:

Name Type Description Default
metrics list[str | ClassicalRetrievalMetric] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
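
To make the role of `k` concrete, the sketch below computes precision and recall over the top-k chunks ranked by similarity score. These are generic textbook definitions chosen for illustration; this page does not list which metrics the evaluator runs by default, and the helper names are hypothetical.

```python
def top_k_ids(retrieved_chunks: dict[str, float], k: int) -> list[str]:
    """Chunk ids ranked by descending similarity score, truncated to k."""
    ranked = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)
    return ranked[:k]


def precision_recall_at_k(
    retrieved_chunks: dict[str, float], ground_truth_chunk_ids: list[str], k: int
) -> tuple[float, float]:
    top = top_k_ids(retrieved_chunks, k)
    relevant = set(ground_truth_chunk_ids)
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    # Precision is taken over the chunks actually returned (at most k).
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {"chunk1": 0.9, "chunk2": 0.8, "chunk4": 0.7}
precision, recall = precision_recall_at_k(retrieved, ["chunk1", "chunk2", "chunk3"], k=2)
```

With k=2 only chunk1 and chunk2 are considered, so both returned chunks are relevant while one ground-truth chunk is still missed.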

required_fields property

Returns the required fields for the data.

Returns:

Type Description
set[str]

set[str]: The required fields for the data.

CompositeEvaluator(metrics, name='composite', parallel=True, metrics_aggregator=None)

Bases: BaseEvaluator

Composite evaluator that runs multiple metrics and aggregates results.

This evaluator composes multiple BaseMetric objects and presents them as a single evaluator unit, following the GoF Composite pattern. It supports parallel or sequential execution, fault isolation, and custom aggregation.
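
Parallel execution with fault isolation is often built on `asyncio.gather(..., return_exceptions=True)`. The sketch below shows that general pattern with two hypothetical coroutine metrics; it is not the CompositeEvaluator implementation, and the error-dictionary shape is an assumption.

```python
import asyncio


async def good_metric(data: dict) -> dict:
    return {"score": 1.0, "success": True}


async def broken_metric(data: dict) -> dict:
    raise RuntimeError("judge unavailable")


async def run_metrics(metrics: dict, data: dict, parallel: bool = True) -> dict:
    names = list(metrics)
    if parallel:
        # return_exceptions=True keeps one failure from aborting the others.
        outcomes = await asyncio.gather(
            *(metrics[name](data) for name in names), return_exceptions=True
        )
    else:
        outcomes = []
        for name in names:
            try:
                outcomes.append(await metrics[name](data))
            except Exception as exc:  # isolate the failure to this metric
                outcomes.append(exc)
    return {
        name: {"error": str(out)} if isinstance(out, Exception) else out
        for name, out in zip(names, outcomes)
    }


metrics = {"good": good_metric, "broken": broken_metric}
parallel_results = asyncio.run(run_metrics(metrics, {"input": "q"}))
sequential_results = asyncio.run(run_metrics(metrics, {"input": "q"}, parallel=False))
```

Both modes yield the same per-metric results; the failing metric is reported as an error entry instead of aborting the run.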

Attributes:

Name Type Description
metrics list[BaseMetric]

The list of metrics to evaluate.

name str

The name of the evaluator.

parallel bool

Whether to evaluate the metrics in parallel.

metrics_aggregator MetricsAggregator

The aggregator for polarity-aware binary scoring.

Initialize the composite evaluator.

Parameters:

Name Type Description Default
metrics list[BaseMetric]

The list of metrics to evaluate.

required
name str

The name of the evaluator.

'composite'
parallel bool

Whether to evaluate the metrics in parallel.

True
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None

GEvalGenerationEvaluator(models=None, metrics=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)

Bases: BaseGenerationEvaluator

GEval Generation Evaluator.

This evaluator assesses the quality of model-generated output.

Default expected input
  • input (str): The input provided to the AI system or component (e.g., a query, prompt, or instruction).
  • retrieved_context (str): Supporting context used during generation (e.g., retrieved documents).
  • expected_output (str): The reference output used for comparison.
  • actual_output (str): The output generated by the AI system or component to evaluate.

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics List[BaseMetric]

The list of metrics to evaluate.

run_parallel bool

Whether to run the metrics in parallel.

metrics_aggregator MetricsAggregator

The aggregator for polarity-aware binary scoring.

Initialize the GEval Generation Evaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] or a single invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
metrics list[BaseMetric] | None

Metric instances to evaluate. If None, uses DEFAULT_METRICS.

None
aggregation_method AggregationSelector | None

Strategy used to aggregate judge results. Defaults to None.

None
max_concurrent_judges int | None

Maximum number of judges to run concurrently. Defaults to None.

None
run_parallel bool

Whether to run the metrics in parallel.

True
refusal_metric GEvalRefusalMetric | None

The refusal metric to use. If None, a default GEvalRefusalMetric is used. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
metrics_aggregator MetricsAggregator | None

The aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every metric. Defaults to None.

None
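
A cap such as `max_concurrent_judges` is typically enforced with a semaphore. The sketch below demonstrates that general technique with placeholder judges and records the peak number running at once; whether the library uses a semaphore internally is an assumption.

```python
import asyncio


async def run_judges(num_judges: int, max_concurrent: int) -> int:
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def judge(index: int) -> int:
        nonlocal running, peak
        async with semaphore:  # at most max_concurrent judges pass this point
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for a model call
            running -= 1
        return index

    await asyncio.gather(*(judge(i) for i in range(num_judges)))
    return peak


peak_concurrency = asyncio.run(run_judges(num_judges=5, max_concurrent=2))
```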

LMBasedRetrievalEvaluator(models=None, metrics=None, enabled_metrics=None, run_parallel=True, metrics_aggregator=None, fallback_models=None)

Bases: BaseEvaluator

Evaluator for LM-based retrieval quality in RAG pipelines.

This evaluator:
  • Runs a configurable set of retrieval metrics (by default: DeepEval contextual precision and contextual recall)
  • Combines their scores using a simple rule-based scheme to produce:
    • relevancy_rating (good / bad / incomplete)
    • score (aggregated retrieval score)

Default expected input
  • input (str): The input query to evaluate the metric.
  • expected_output (str): The expected output to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If the retrieved context is a str, it will be converted into a list with a single element.

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics list[BaseMetric]

The list of metrics to evaluate.

enabled_metrics Sequence[type[BaseMetric] | str] | None

The list of metrics to enable.

run_parallel bool

Whether to run the metrics in parallel.

Initialize the LM-based retrieval evaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] or a single invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
metrics Sequence[BaseMetric] | None

Optional custom retrieval metric instances. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None.

None
enabled_metrics Sequence[type[BaseMetric] | str] | None

Optional subset of metrics to enable from the metric pool. Each entry can be either a metric class or its name. If None, all metrics from the pool are used. Defaults to None.

None
run_parallel bool

Whether to run retrieval metrics in parallel. Defaults to True.

True
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every metric. Defaults to None.

None
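
One plausible reading of how `metrics` and `enabled_metrics` interact is sketched below: defaults and custom metrics are merged into a pool keyed by name, with custom entries winning on collisions, and `enabled_metrics` then selects by name or class. The classes, the metric names, and the merge order are all assumptions made for illustration.

```python
class Metric:
    """Hypothetical stand-in for BaseMetric with a `name` attribute."""

    name = "metric"


class ContextualPrecision(Metric):
    name = "contextual_precision"


class ContextualRecall(Metric):
    name = "contextual_recall"


class CustomRecall(Metric):
    name = "contextual_recall"  # same name as a default, so it replaces it


def build_pool(defaults, custom=None) -> dict:
    pool = {metric.name: metric for metric in defaults}
    for metric in custom or []:
        pool[metric.name] = metric  # custom metrics win on name collisions
    return pool


def select(pool: dict, enabled=None) -> list:
    if enabled is None:
        return list(pool.values())  # no filter: use the whole pool
    selected = []
    for entry in enabled:
        if isinstance(entry, str):
            selected.append(pool[entry])  # entry is a metric name
        else:
            selected.extend(m for m in pool.values() if isinstance(m, entry))  # entry is a class
    return selected


pool = build_pool([ContextualPrecision(), ContextualRecall()], [CustomRecall()])
by_name = select(pool, ["contextual_precision"])
by_class = select(pool, [CustomRecall])
everything = select(pool)
```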

required_fields property

Return the union of required fields from all configured metrics.

QTEvaluator(models=None, completeness_metric=None, groundedness_metric=None, redundancy_metric=None, aggregation_method=None, max_concurrent_judges=None, run_parallel=True, score_mapping=None, score_weights=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)

Bases: BaseEvaluator

Evaluator for query transformation tasks.

Default expected input:
  • input (str): The query to evaluate the completeness of the model's output.
  • expected_output (str): The expected response to evaluate the completeness of the model's output.
  • actual_output (str): The generated response to evaluate the completeness of the model's output.

Attributes:

Name Type Description
completeness_metric GEvalCompletenessMetric

The completeness metric.

hallucination_metric GEvalGroundednessMetric

The groundedness metric.

redundancy_metric GEvalRedundancyMetric

The redundancy metric.

run_parallel bool

Whether to run the metrics in parallel.

score_mapping dict[str, dict[int, float]]

The score mapping.

score_weights dict[str, float]

The score weights.

Initialize the QTEvaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] or a single invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
completeness_metric GEvalCompletenessMetric | None

The completeness metric. Defaults to built-in GEvalCompletenessMetric.

None
groundedness_metric GEvalGroundednessMetric | None

The groundedness metric. Defaults to built-in GEvalGroundednessMetric.

None
redundancy_metric GEvalRedundancyMetric | None

The redundancy metric. Defaults to built-in GEvalRedundancyMetric.

None
aggregation_method AggregationSelector | None

The aggregation method to use for each metric. If None, each metric uses its own default (MAJORITY_VOTE for GEval metrics).

None
max_concurrent_judges int | None

The maximum number of concurrent judges per metric. If None, each metric uses its own default.

None
run_parallel bool

Whether to run the metrics in parallel. Defaults to True.

True
score_mapping dict[str, dict[int, float]] | None

The score mapping. Defaults to None. This is required if some of the default metrics are used.

None
score_weights dict[str, float] | None

The score weights. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every metric. Defaults to None.

None

Raises:

Type Description
ValueError

If models list contains invalid invoker configurations.
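
The types of `score_mapping` (dict[str, dict[int, float]]) and `score_weights` (dict[str, float]) suggest a per-metric lookup followed by a weighted average. The sketch below shows that interpretation only; the combination rule, the 1 to 3 rating scale, and every number are assumptions, not values taken from the library.

```python
def weighted_score(
    raw_scores: dict[str, int],
    score_mapping: dict[str, dict[int, float]],
    score_weights: dict[str, float],
) -> float:
    # Map each raw integer rating to a float, then take the weighted average.
    total_weight = sum(score_weights[name] for name in raw_scores)
    weighted = sum(
        score_mapping[name][raw] * score_weights[name] for name, raw in raw_scores.items()
    )
    return weighted / total_weight


# Invented example values on a hypothetical 1-3 rating scale.
score_mapping = {
    "completeness": {1: 0.0, 2: 0.5, 3: 1.0},
    "redundancy": {1: 1.0, 2: 0.5, 3: 0.0},  # lower raw rating is better
}
score_weights = {"completeness": 0.75, "redundancy": 0.25}
final_score = weighted_score({"completeness": 3, "redundancy": 2}, score_mapping, score_weights)
```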

required_fields property

Returns the required fields for the data.

Returns:

Type Description
set[str]

set[str]: The required fields for the data.

SummarizationEvaluator(models=None, metrics=None, aggregation_method=DefaultValues.AGGREGATION_METHOD, run_parallel=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, metrics_aggregator=None, fallback_models=None)

Bases: BaseGenerationEvaluator

Evaluator for summarization quality using four GEval-style metrics.

Default expected input
  • input (str): Source text or transcript.
  • actual_output (str): Generated summary.

Initialize the SummarizationEvaluator.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Judge models for single-judge/multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] or a single invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
metrics list[BaseMetric] | None

Metric instances to evaluate. If None, uses DEFAULT_METRICS.

None
aggregation_method AggregationMethod

Strategy used to aggregate judge results.

AGGREGATION_METHOD
run_parallel bool

Whether to run the metrics in parallel. Defaults to True.

True
batch_status_check_interval float

Time between batch status checks in seconds.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout.

BATCH_MAX_ITERATIONS
metrics_aggregator MetricsAggregator | None

Aggregator for polarity-aware binary scoring. If None, a default MetricsAggregator is used. Defaults to None.

None
fallback_models list[BaseLMInvoker] | None

Ordered fallback invoker chain propagated to every metric. Defaults to None.

None