
Evaluator

Evaluator module for orchestrating evaluation workflows.

This module provides evaluator classes that coordinate metrics and evaluation logic for different use cases. Evaluators handle the evaluation process, including data preparation, metric execution, and result aggregation.

Available evaluators:
  • BaseEvaluator: Abstract base class for all evaluators.
  • GenerationEvaluator: Evaluates text generation quality.
  • AgentEvaluator: Evaluates AI agent performance.
  • ClassicalRetrievalEvaluator: Traditional retrieval evaluation methods.
  • GEvalGenerationEvaluator: G-Eval based generation evaluation.
  • QTEvaluator: Evaluates query transformation tasks.
  • CustomEvaluator: Builds custom evaluation workflows from user-supplied metrics.
  • TrajectoryGenerationEvaluator: Combines agent trajectory and generation evaluation.

AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: BaseEvaluator

Evaluator for agent tasks.

This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.

Default expected input
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.
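
A minimal usage sketch is shown below. The trajectory format (OpenAI-style message dictionaries with tool calls) and the use of a plain dict as input are assumptions for illustration; adapt them to the MetricInput type your pipeline uses. The import path and model id follow the TrajectoryGenerationEvaluator example later on this page.

import os

from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Hypothetical trajectory in OpenAI-style message-dict form.
data = {
    "agent_trajectory": [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "tool_calls": [
            {"function": {"name": "get_weather", "arguments": '{"city": "Jakarta"}'}}
        ]},
        {"role": "tool", "content": "Sunny, 31 degrees Celsius"},
        {"role": "assistant", "content": "It is sunny and 31 degrees Celsius in Jakarta."},
    ],
}

evaluator = AgentEvaluator(
    model="openai/gpt-4o-mini",                     # model id borrowed from the example below
    model_credentials=os.getenv("OPENAI_API_KEY"),  # credentials are required
    use_reference=False,                            # no expected_agent_trajectory in this sketch
)

result = await evaluator.evaluate(data)
print(result)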

Attributes:
  • name (str): The name of the evaluator.
  • trajectory_accuracy_metric (LangChainAgentTrajectoryAccuracyMetric): The metric used to evaluate agent trajectory accuracy.

Initialize the AgentEvaluator.

Parameters:
  • model (str | ModelId | BaseLMInvoker, default: DefaultValues.AGENT_EVALS_MODEL): The model to use for the trajectory accuracy metric.
  • model_credentials (str | None, default: None): The model credentials. Required for the metric to function properly.
  • model_config (dict[str, Any] | None, default: None): The model configuration.
  • prompt (str | None, default: None): Custom prompt for evaluation. If None, uses the default prompt from the metric.
  • use_reference (bool, default: True): Whether to use expected_agent_trajectory for reference-based evaluation.
  • continuous (bool, default: False): If True, the score is a float between 0 and 1; if False, the score is a boolean.
  • choices (list[float] | None, default: None): Optional list of specific float values the score must be chosen from.
  • use_reasoning (bool, default: True): If True, includes an explanation for the score in the output.
  • few_shot_examples (list[Any] | None, default: None): Optional list of example evaluations to append to the prompt.

Raises:
  • ValueError: If model_credentials is not provided.

required_fields: set[str] property

Returns the required fields for the data.

Returns:
  • set[str]: The required fields for the data.

BaseEvaluator(name)

Bases: ABC

Base class for all evaluators.

Attributes:
  • name (str): The name of the evaluator.
  • required_fields (set[str]): The required fields for the evaluator.
  • input_type (type | None): The type of the input data.

Initialize the evaluator.

Parameters:
  • name (str, required): The name of the evaluator.

aggregate_required_fields(metrics, mode='any') staticmethod

Aggregate required fields from multiple metrics.

Parameters:
  • metrics (Iterable[BaseMetric], required): The metrics to aggregate from.
  • mode (str, default: 'any'): The aggregation mode. Options:
      - "union": all fields required by any metric.
      - "intersection": only fields required by all metrics.
      - "any": empty set (no validation).

Returns:
  • set[str]: The aggregated required fields.

Raises:
  • ValueError: If mode is not one of the supported options.
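
A small illustrative sketch, assuming two already-constructed metric instances (hypothetical names) and an assumed import path for BaseEvaluator:

from gllm_evals.evaluator.base_evaluator import BaseEvaluator  # import path assumed

# completeness_metric and groundedness_metric are hypothetical BaseMetric instances.
union_fields = BaseEvaluator.aggregate_required_fields(
    [completeness_metric, groundedness_metric], mode="union"
)  # every field required by at least one metric

shared_fields = BaseEvaluator.aggregate_required_fields(
    [completeness_metric, groundedness_metric], mode="intersection"
)  # only fields required by both metrics

no_validation = BaseEvaluator.aggregate_required_fields(
    [completeness_metric, groundedness_metric], mode="any"
)  # empty set, i.e. no field validation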

can_evaluate_any(metrics, data) staticmethod

Check if any of the metrics can evaluate the given data.

Parameters:
  • metrics (Iterable[BaseMetric], required): The metrics to check.
  • data (MetricInput, required): The data to validate against.

Returns:
  • bool: True if any metric can evaluate the data, False otherwise.

ensure_list_of_dicts(data, key) staticmethod

Ensure that a field in the data is a list of dictionaries.

Parameters:
  • data (MetricInput, required): The data to validate.
  • key (str, required): The key to check.

Raises:
  • ValueError: If the field is not a list or contains non-dictionary elements.

ensure_non_empty_list(data, key) staticmethod

Ensure that a field in the data is a non-empty list.

Parameters:
  • data (MetricInput, required): The data to validate.
  • key (str, required): The key to check.

Raises:
  • ValueError: If the field is not a list or is empty.

evaluate(data) async

Evaluate the data.

Parameters:
  • data (MetricInput, required): The data to be evaluated.

Returns:
  • EvaluateOutput (EvaluationOutput): The evaluation output with global_explanation.

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:
  • list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:
  • list[dict[str, Any]] | None: The input spec.
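
To make the contract concrete, here is a minimal sketch of a BaseEvaluator subclass. The import path, the dict-style input, and the returned dict (standing in for EvaluationOutput) are assumptions for illustration; the subclasses documented on this page are the authoritative references.

from typing import Any

from gllm_evals.evaluator.base_evaluator import BaseEvaluator  # import path assumed


class KeywordEvaluator(BaseEvaluator):
    """Hypothetical evaluator that checks whether expected keywords appear in the response."""

    def __init__(self, keywords: list[str]):
        super().__init__(name="keyword")
        self.keywords = keywords

    @property
    def required_fields(self) -> set[str]:
        return {"generated_response"}

    async def evaluate(self, data: dict[str, Any]) -> dict[str, Any]:
        response = data["generated_response"].lower()
        hits = [kw for kw in self.keywords if kw.lower() in response]
        score = len(hits) / len(self.keywords) if self.keywords else 0.0
        return {
            "score": score,
            "global_explanation": f"Matched {len(hits)} of {len(self.keywords)} keywords.",
        }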

ClassicalRetrievalEvaluator(metrics=None, k=20)

Bases: BaseEvaluator

A class that evaluates the performance of a classical retrieval system.

Required fields:
  • retrieved_chunks: The retrieved chunks with their similarity score.
  • ground_truth_chunk_ids: The ground truth chunk ids.

Example:

# Import paths are assumed here for completeness; adjust to your installation.
from gllm_evals.evaluator.classical_retrieval_evaluator import ClassicalRetrievalEvaluator
from gllm_evals.schema import RetrievalData

data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)

evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)

Attributes:
  • name (str): The name of the evaluator.
  • metrics (list[str | ClassicalRetrievalMetric] | None): The metrics to evaluate.
  • k (int): The number of retrieved chunks to consider.

Initializes the evaluator.

Parameters:
  • metrics (list[str | ClassicalRetrievalMetric] | None, default: None): The metrics to evaluate. Defaults to all metrics.
  • k (int | list[int], default: 20): The number of retrieved chunks to consider.

required_fields: set[str] property

Returns the required fields for the data.

Returns:
  • set[str]: The required fields for the data.
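
Since k accepts either a single cutoff or a list of cutoffs, a sketch evaluating several cutoffs in one pass might look like this (assuming the default metric set and the same RetrievalData shape as the example above):

# metrics=None keeps the default metric set; k takes a list of cutoffs.
evaluator = ClassicalRetrievalEvaluator(k=[5, 10, 20])
result = await evaluator.evaluate(data)  # reuses the `data` object from the example above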

CustomEvaluator(metrics, name='custom', parallel=True)

Bases: BaseEvaluator

Custom evaluator.

This evaluator runs an arbitrary, user-supplied list of metrics against the input data, optionally in parallel.

Attributes:
  • metrics (list[BaseMetric]): The list of metrics to evaluate.
  • name (str): The name of the evaluator.
  • parallel (bool): Whether to evaluate the metrics in parallel.

Initialize the custom evaluator.

Parameters:
  • metrics (list[BaseMetric], required): The list of metrics to evaluate.
  • name (str, default: 'custom'): The name of the evaluator.
  • parallel (bool, default: True): Whether to evaluate the metrics in parallel.
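
A minimal sketch of wrapping existing metric instances in a CustomEvaluator. The import path and the metric instances are assumptions; any objects satisfying the BaseMetric interface would do.

from gllm_evals.evaluator.custom_evaluator import CustomEvaluator  # import path assumed

# completeness_metric and groundedness_metric are hypothetical, pre-configured BaseMetric instances.
evaluator = CustomEvaluator(
    metrics=[completeness_metric, groundedness_metric],
    name="my_generation_checks",
    parallel=True,  # run both metrics concurrently
)

result = await evaluator.evaluate(data)  # `data` must contain the fields the metrics require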

GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None, refusal_metric=None)

Bases: GenerationEvaluator

GEval Generation Evaluator.

This evaluator evaluates the quality of the model's generated responses using G-Eval-based metrics.

Default expected input
  • query (str): The input query whose generated response is being evaluated.
  • retrieved_context (str): The context retrieved for the query.
  • expected_response (str): The reference response to compare against.
  • generated_response (str): The model's generated response under evaluation.

Attributes:
  • name (str): The name of the evaluator.
  • metrics (List[BaseMetric]): The list of metrics to evaluate.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None): The rule book.
  • generation_rule_engine (GenerationRuleEngine | None): The generation rule engine.

Initialize the GEval Generation Evaluator.

Parameters:
  • metrics (List[BaseMetric] | None, default: None): The list of metrics to evaluate.
  • enabled_metrics (List[type[BaseMetric] | str] | None, default: None): The list of enabled metrics.
  • model (str | ModelId | BaseLMInvoker, default: DefaultValues.MODEL): The model to use for the metrics.
  • model_credentials (str | None, default: None): The model credentials to use for the metrics.
  • model_config (dict[str, Any] | None, default: None): The model config to use for the metrics.
  • run_parallel (bool, default: True): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None, default: None): The rule book.
  • generation_rule_engine (GenerationRuleEngine | None, default: None): The generation rule engine.
  • judge (MultipleLLMAsJudge | None, default: None): Optional multiple LLM judge for ensemble evaluation.
  • refusal_metric (type[BaseMetric] | None, default: None): The refusal metric to use. If None, the default GEvalRefusalMetric is used.
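
A minimal usage sketch. The import path and model id are taken from the TrajectoryGenerationEvaluator example below; the dict-style input and the field values are illustrative.

import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

evaluator = GEvalGenerationEvaluator(
    model="google/gemini-2.0-flash",
    model_credentials=os.getenv("GOOGLE_API_KEY"),
    run_parallel=True,
)

data = {
    "query": "What is the capital of France?",
    "retrieved_context": "France is a country in Europe. Its capital is Paris.",
    "expected_response": "The capital of France is Paris.",
    "generated_response": "Paris is the capital of France.",
}

result = await evaluator.evaluate(data)  # evaluation output with per-metric results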

GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None, refusal_metric=None)

Bases: BaseEvaluator

Evaluator for generation tasks.

Default expected input
  • query (str): The input query, used when evaluating the completeness of the model's output.
  • retrieved_context (str): The retrieved context, used when evaluating the groundedness of the model's output.
  • expected_response (str): The expected response, used when evaluating the completeness of the model's output.
  • generated_response (str): The model's generated response under evaluation.

Attributes:
  • name (str): The name of the evaluator.
  • metrics (List[BaseMetric]): The list of metrics to evaluate.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None): The rule book.
  • generation_rule_engine (GenerationRuleEngine | None): The generation rule engine.
  • judge (MultipleLLMAsJudge | None): Optional multiple LLM judge for ensemble evaluation.

Initialize the GenerationEvaluator.

Parameters:
  • metrics (List[BaseMetric] | None, default: None): A list of metric instances to use as a base pool. If None, defaults to [CompletenessMetric, RedundancyMetric, GroundednessMetric]. Each custom metric must produce a score key in its output.
  • enabled_metrics (List[type[BaseMetric] | str] | None, default: None): A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.
  • model (str | ModelId | BaseLMInvoker, default: DefaultValues.MODEL): The model to use for the metrics.
  • model_config (dict[str, Any] | None, default: None): The model config to use for the metrics.
  • model_credentials (str | None, default: None): The model credentials, used for initializing default metrics. Required if any of the default metrics are used.
  • run_parallel (bool, default: True): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None, default: None): The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set.
  • generation_rule_engine (GenerationRuleEngine | None, default: None): The generation rule engine. Defaults to a new instance with the determined rule book.
  • judge (MultipleLLMAsJudge | None, default: None): Optional multiple LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation. Uses a composition pattern for clean separation of concerns.
  • refusal_metric (type[RefusalMetric] | None, default: None): The refusal metric to use. If None, the default RefusalMetric is used.

Raises:
  • ValueError: If model_credentials is not provided when using default metrics.
  • ValueError: If a custom rule_book is provided when using custom metrics, or if a mix of custom and default metrics is used without an explicit rule_book.
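
A minimal usage sketch with the default metric pool. The module path is assumed by analogy with the examples elsewhere on this page; the model id and input values are illustrative.

import os

from gllm_evals.evaluator.generation_evaluator import GenerationEvaluator  # import path assumed

evaluator = GenerationEvaluator(
    model="openai/gpt-4o-mini",
    model_credentials=os.getenv("OPENAI_API_KEY"),  # required when the default metrics are used
)

data = {
    "query": "Summarize the return policy.",
    "retrieved_context": "Items can be returned within 30 days with a receipt.",
    "expected_response": "Returns are accepted within 30 days with a receipt.",
    "generated_response": "You can return items within 30 days if you have the receipt.",
}

result = await evaluator.evaluate(data)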

QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)

Bases: BaseEvaluator

Evaluator for query transformation tasks.

Default expected input:
  • query (str): The query to evaluate the completeness of the model's output.
  • expected_response (str): The expected response to evaluate the completeness of the model's output.
  • generated_response (str): The generated response to evaluate the completeness of the model's output.

Attributes:
  • completeness_metric (CompletenessMetric): The completeness metric.
  • hallucination_metric (GroundednessMetric): The groundedness metric.
  • redundancy_metric (RedundancyMetric): The redundancy metric.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • score_mapping (dict[str, dict[int, float]]): The score mapping.
  • score_weights (dict[str, float]): The score weights.

Initialize the QTEvaluator.

Parameters:
  • completeness_metric (CompletenessMetric | None, default: None): The completeness metric. Defaults to the built-in CompletenessMetric.
  • groundedness_metric (GroundednessMetric | None, default: None): The groundedness metric. Defaults to the built-in GroundednessMetric.
  • redundancy_metric (RedundancyMetric | None, default: None): The redundancy metric. Defaults to the built-in RedundancyMetric.
  • model (str | ModelId | BaseLMInvoker, default: DefaultValues.MODEL): The model to use for the metrics.
  • model_config (dict[str, Any] | None, default: None): The model config to use for the metrics.
  • model_credentials (str | None, default: None): The model credentials. This is required if some of the default metrics are used.
  • run_parallel (bool, default: True): Whether to run the metrics in parallel.
  • score_mapping (dict[str, dict[int, float]] | None, default: None): The score mapping. This is required if some of the default metrics are used.
  • score_weights (dict[str, float] | None, default: None): The score weights.
  • judge (MultipleLLMAsJudge | None, default: None): Optional multiple LLM judge for ensemble evaluation.

Raises:
  • ValueError: If model_credentials is not provided when using default metrics.

required_fields: set[str] property

Returns the required fields for the data.

Returns:
  • set[str]: The required fields for the data.
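
A usage sketch with an explicit score_mapping and score_weights. The import path, the metric-name keys, and the mapping values are assumptions for illustration only; consult the built-in metrics for the actual raw score ranges.

import os

from gllm_evals.evaluator.qt_evaluator import QTEvaluator  # import path assumed

# Hypothetical mapping: each metric's raw integer score is mapped to a normalized
# float, then the metrics are weighted into one aggregate score.
score_mapping = {
    "completeness": {1: 0.0, 2: 0.5, 3: 1.0},
    "groundedness": {1: 0.0, 2: 0.5, 3: 1.0},
    "redundancy": {1: 0.0, 2: 0.5, 3: 1.0},
}
score_weights = {"completeness": 0.4, "groundedness": 0.4, "redundancy": 0.2}

evaluator = QTEvaluator(
    model="openai/gpt-4o-mini",
    model_credentials=os.getenv("OPENAI_API_KEY"),
    score_mapping=score_mapping,
    score_weights=score_weights,
)

data = {
    "query": "cheapest flights to Tokyo next month",
    "expected_response": "Find the cheapest flights to Tokyo departing next month.",
    "generated_response": "Search for low-cost flights to Tokyo for next month.",
}

result = await evaluator.evaluate(data)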

TrajectoryGenerationEvaluator(agent_evaluator=None, generation_evaluator=None)

Bases: BaseEvaluator

Evaluator for agent trajectory and generation quality evaluation with rule-based aggregation.

This evaluator combines:
  1. LangChain AgentEvals trajectory accuracy metric (agent execution quality)
  2. GEval generation evaluator (completeness, groundedness, redundancy)
  3. Rule-based aggregation logic

Uses dependency injection pattern - accepts pre-configured evaluators for maximum flexibility.

Expected input (AgentData with generation fields):
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): Expected trajectory.
  • query (str | None, optional): The query for generation evaluation.
  • generated_response (str | None, optional): The generated response.
  • expected_response (str | None, optional): The expected response.
  • retrieved_context (str | list[str] | None, optional): The retrieved context.

Aggregation Logic
  • If trajectory relevancy is "incomplete" or "bad": return trajectory result
  • If trajectory relevancy is "good": return GEval generation result
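
An illustrative pseudocode sketch of this aggregation rule (not the actual implementation; the result keys follow the output structure described below):

# Illustrative only: mirrors the rule described above.
def aggregate(trajectory_result: dict, generation_result: dict) -> dict:
    if trajectory_result["relevancy_rating"] in ("incomplete", "bad"):
        return trajectory_result   # trajectory problems dominate the final score
    return generation_result       # trajectory is "good": generation quality decides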

Output Structure:

The output is a flat dictionary containing:

- Aggregated Results:
    - global_explanation (str): Human-readable explanation of the evaluation (added by parent evaluator)
    - score (float | int): Final aggregated score based on rule-based logic
    - relevancy_rating (str): Final relevancy rating ("good", "bad", or "incomplete")
    - possible_issues (list[str], optional): List of detected issues (only present when trajectory is good)

- Trajectory Evaluation Results:
    - langchain_agent_trajectory_accuracy (dict): Trajectory evaluation result containing score, explanation, key, and metadata from the agent trajectory metric

- Generation Evaluation Results (nested):
    - generation (dict): Generation evaluation results containing:
        - global_explanation (str): Explanation of generation quality
        - relevancy_rating (str): Generation quality rating
        - score (float | int): Generation quality score
        - possible_issues (list[str]): List of generation-related issues
        - Individual metric results (completeness, groundedness, redundancy, language_consistency, refusal_alignment)

Attributes:
  • name (str): The name of the evaluator.
  • agent_evaluator (AgentEvaluator): Pre-configured evaluator for trajectory assessment.
  • generation_evaluator (GEvalGenerationEvaluator): Pre-configured evaluator for generation quality assessment.

Example

import os

from gllm_evals.evaluator.agent_evaluator import AgentEvaluator
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

# Configure individual evaluators
agent_eval = AgentEvaluator(
    model="openai/gpt-4o-mini",
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,
)
gen_eval = GEvalGenerationEvaluator(
    model="google/gemini-2.0-flash",
    model_credentials=os.getenv("GOOGLE_API_KEY"),
    run_parallel=True,
)

# Create combined evaluator
evaluator = TrajectoryGenerationEvaluator(
    agent_evaluator=agent_eval,
    generation_evaluator=gen_eval,
)

# Evaluate
result = await evaluator.evaluate(data)

Initialize the TrajectoryGenerationEvaluator with optional dependency injection.

Parameters:
  • agent_evaluator (AgentEvaluator | None, default: None): Pre-configured evaluator for agent trajectory evaluation. If None, a default AgentEvaluator is created with model=DefaultValues.AGENT_EVALS_MODEL.
  • generation_evaluator (GEvalGenerationEvaluator | None, default: None): Pre-configured evaluator for generation quality assessment. If None, a default GEvalGenerationEvaluator is created with model=DefaultValues.MODEL.

Note

When using default evaluators (by not providing agent_evaluator or generation_evaluator), make sure the required environment variables are set:
  • OPENAI_API_KEY for the agent evaluator
  • GOOGLE_API_KEY for the generation evaluator

required_fields: set[str] property

Returns the required fields for the data.

Returns the combined set of required fields from both the trajectory evaluator and generation evaluator.

Returns:
  • set[str]: The required fields for the data.