Evaluator

Evaluator module for orchestrating evaluation workflows.

This module provides evaluator classes that coordinate metrics and evaluation logic for different use cases. Evaluators handle the evaluation process, including data preparation, metric execution, and result aggregation.

Available evaluators:

- BaseEvaluator: Abstract base class for all evaluators
- GenerationEvaluator: Evaluates text generation quality
- ClassicalRetrievalEvaluator: Traditional retrieval evaluation methods
- LMBasedRetrievalEvaluator: LM-based retrieval evaluation methods
- GEvalGenerationEvaluator: G-Eval based generation evaluation
- RAGEvaluator: Combined retrieval and generation evaluation for RAG pipelines
- QTEvaluator: Query transformation evaluation
- CustomEvaluator: Create custom evaluation workflows
- AgentEvaluator: Combined agent trajectory and generation evaluation

`AgentEvaluator(tool_correctness_metric=None, generation_evaluator=None, trajectory_accuracy_metric=None)`

Bases: BaseEvaluator

Evaluator for agent tool calling and generation quality evaluation with rule-based aggregation.

This evaluator combines:

1. DeepEval Tool Correctness metric for agent tool call evaluation
2. GEval generation evaluator (completeness, groundedness, redundancy)
3. Optional LangChain Agent Trajectory Accuracy metric (disabled by default)
4. Rule-based aggregation logic

Uses the dependency injection pattern: it accepts a pre-configured metric and evaluator.

Expected input (AgentData with generation fields):

- agent_trajectory (list[dict[str, Any]]): The agent trajectory
- expected_agent_trajectory (list[dict[str, Any]]): Expected trajectory
- query (str): The query for generation evaluation
- generated_response (str): The generated response
- expected_response (str): The expected response
- retrieved_context (str | list[str] | None, optional): The retrieved context

Aggregation Logic

- Final score = tool_correctness_score * generation_score
- Generation score is derived from the generation relevancy rating (good=1, incomplete=0.5, bad=0)
- Final relevancy rating is derived from the final score using thresholds: >=0.75: good, >=0.25: incomplete, <0.25: bad
- The trajectory accuracy metric (if enabled) only runs when agent_trajectory is provided in the input. Its results are included in the output but do not affect aggregation.
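
The aggregation rules above can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the function name `aggregate` and the constant `GENERATION_SCORES` are hypothetical, and the sketch assumes the top threshold is `>= 0.75`.

```python
# Hypothetical sketch of the documented aggregation rules (not library code).
GENERATION_SCORES = {"good": 1.0, "incomplete": 0.5, "bad": 0.0}


def aggregate(tool_correctness_score: float, generation_rating: str) -> tuple[float, str]:
    """Return (final_score, final_rating) from the two component results."""
    generation_score = GENERATION_SCORES[generation_rating]
    final_score = tool_correctness_score * generation_score

    # Thresholds as documented; the top one is assumed to be >= 0.75.
    if final_score >= 0.75:
        rating = "good"
    elif final_score >= 0.25:
        rating = "incomplete"
    else:
        rating = "bad"
    return final_score, rating
```

Because the two scores are multiplied, a perfect tool call paired with an "incomplete" generation still caps the final rating at "incomplete".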

Output Structure:

The output is a flat dictionary containing:

- Aggregated Results:
    - global_explanation (str): Human-readable explanation of the evaluation (added by parent evaluator)
    - multiply_score (float): Direct multiplication of tool_correctness_score * generation_score
    - avg_score (float): Simple average of tool_correctness_score and generation_score
    - relevancy_rating (str): Final relevancy rating ("good", "bad", or "incomplete")
    - possible_issues (list[str], optional): List of detected issues (only present when tool call is good)

- Tool Call Evaluation Results:
    - deepeval_tool_correctness (dict): Tool correctness evaluation result containing score, explanation, and other metadata

- Trajectory Accuracy Results (optional, only when metric is enabled AND agent_trajectory is provided):
    - langchain_agent_trajectory_accuracy (dict): Trajectory accuracy evaluation result

- Generation Evaluation Results (nested):
    - generation (dict): Generation evaluation results containing:
        - global_explanation (str): Explanation of generation quality
        - relevancy_rating (str): Generation quality rating
        - score (float | int): Generation quality score
        - possible_issues (list[str]): List of generation-related issues
        - Individual metric results (completeness, groundedness, redundancy, language_consistency, refusal_alignment)
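
As a rough illustration of how the flat output might be consumed, the sketch below reads the documented keys from a result dictionary. The helper name `summarize_agent_result` is hypothetical, and any values passed to it in examples are placeholders rather than real evaluation output.

```python
from typing import Any


def summarize_agent_result(result: dict[str, Any]) -> str:
    """Build a one-line summary from the documented output keys."""
    tool = result.get("deepeval_tool_correctness", {})
    generation = result.get("generation", {})
    parts = [
        f"rating={result.get('relevancy_rating')}",
        f"multiply_score={result.get('multiply_score')}",
        f"tool_score={tool.get('score')}",
        f"generation_rating={generation.get('relevancy_rating')}",
    ]
    # The trajectory key is optional: it appears only when the metric is
    # enabled and agent_trajectory was provided in the input.
    if "langchain_agent_trajectory_accuracy" in result:
        parts.append("trajectory_accuracy=present")
    return ", ".join(parts)
```

Using `.get` for the nested dictionaries keeps the helper safe when optional keys are absent.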

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`tool_correctness_metric`	`DeepEvalToolCorrectnessMetric`	The metric for tool call assessment.
`generation_evaluator`	`GEvalGenerationEvaluator`	Evaluator for generation quality assessment.
`trajectory_accuracy_metric`	`LangChainAgentTrajectoryAccuracyMetric \| None`	Optional metric for LangChain-style trajectory accuracy evaluation. Disabled by default.

Example

    import os

    from gllm_evals.evaluator import AgentEvaluator
    from gllm_evals.metrics.agent.deepeval_tool_correctness import DeepEvalToolCorrectnessMetric
    from gllm_evals.metrics.agent.langchain_agent_trajectory_accuracy import (
        LangChainAgentTrajectoryAccuracyMetric,
    )

    # Configure metrics
    tool_correctness = DeepEvalToolCorrectnessMetric(
        model="openai/gpt-4o-mini",
        model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    trajectory_accuracy = LangChainAgentTrajectoryAccuracyMetric(
        model="openai/gpt-4o-mini",
        model_credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Create evaluator (trajectory accuracy is optional and runs conditionally)
    evaluator = AgentEvaluator(
        tool_correctness_metric=tool_correctness,
        trajectory_accuracy_metric=trajectory_accuracy,  # Optional; only runs if agent_trajectory is in input
    )

    result = await evaluator.evaluate(data)

Initialize the AgentEvaluator.

Parameters:

Name	Type	Description	Default
`tool_correctness_metric`	`DeepEvalToolCorrectnessMetric \| None`	Pre-configured metric for tool call evaluation. If None, a default metric will be created with model=DefaultValues.AGENT_EVALS_MODEL. Defaults to None.	`None`
`generation_evaluator`	`GEvalGenerationEvaluator \| None`	Pre-configured evaluator for generation quality assessment. If None, a default GEvalGenerationEvaluator will be created. Defaults to None.	`None`
`trajectory_accuracy_metric`	`LangChainAgentTrajectoryAccuracyMetric \| None`	Pre-configured metric for LangChain-style trajectory accuracy evaluation. Disabled by default (None). When provided, only runs when agent_trajectory field is present in input data. Results are included in output but do not affect final aggregated score. Defaults to None.	`None`

Note

When using the default metric/evaluator (by not providing them), make sure the required environment variables are set:

- OPENAI_API_KEY for the tool correctness metric
- GOOGLE_API_KEY for the generation evaluator

`required_fields` `property`

Returns the required fields for the data.

Returns the combined set of required fields from both the tool correctness metric and generation evaluator.

Returns:

Type	Description
`set[str]`	The required fields for the data.

`format_tool_violation(tool_score, tool_passed, tool_threshold)`

Format Tool Correctness violation message.

Parameters:

Name	Type	Description	Default
`tool_score`	`float`	The tool score.	required
`tool_passed`	`bool`	Whether the tool passed.	required
`tool_threshold`	`float`	The tool threshold.	required

Returns:

Type	Description
`str`	The formatted tool violation message.

`BaseEvaluator(name, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: ABC

Base class for all evaluators.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`required_fields`	`set[str]`	The required fields for the evaluator.
`input_type`	`type \| None`	The type of the input data.

Initialize the evaluator.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the evaluator.	required
`batch_status_check_interval`	`float`	Time between batch status checks in seconds. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).	`BATCH_MAX_ITERATIONS`

Raises:

Type	Description
`ValueError`	If batch_status_check_interval or batch_max_iterations are not positive.
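
The "60 minutes with default interval" figure follows from multiplying the two batch parameters. A minimal sketch of that arithmetic, together with the documented positivity check, is shown below; the helper name `batch_timeout_seconds` is hypothetical and is not part of the library.

```python
def batch_timeout_seconds(
    batch_status_check_interval: float = 30.0,
    batch_max_iterations: int = 120,
) -> float:
    """Upper bound on polling time: interval multiplied by iteration count."""
    # Mirrors the documented constraint that both values must be positive.
    if batch_status_check_interval <= 0 or batch_max_iterations <= 0:
        raise ValueError("batch_status_check_interval and batch_max_iterations must be positive")
    return batch_status_check_interval * batch_max_iterations
```

With the documented defaults this gives 30.0 * 120 = 3600 seconds, which is 60 minutes.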

`aggregate_required_fields(metrics, mode='any')` `staticmethod`

Aggregate required fields from multiple metrics.

Parameters:

Name	Type	Description	Default
`metrics`	`Iterable[BaseMetric]`	The metrics to aggregate from.	required
`mode`	`str`	The aggregation mode. Options: "union" (all fields required by any metric), "intersection" (only fields required by all metrics), "any" (empty set, no validation). Defaults to "any".	`'any'`

Returns:

Type	Description
`set[str]`	The aggregated required fields.

Raises:

Type	Description
`ValueError`	If mode is not one of the supported options.
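
The three aggregation modes map directly onto set operations. The sketch below works on plain sets of field names instead of `BaseMetric` instances, and the function name `aggregate_fields` is a hypothetical stand-in for the static method.

```python
from collections.abc import Iterable


def aggregate_fields(field_sets: Iterable[set[str]], mode: str = "any") -> set[str]:
    """Combine per-metric required fields according to the documented modes."""
    sets = list(field_sets)
    if mode == "any":
        # No validation: nothing is required up front.
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets) if sets else set()
    raise ValueError(f"Unsupported mode: {mode!r}")
```

For example, with one metric requiring `{"query", "generated_response"}` and another requiring `{"query", "expected_response"}`, "union" yields all three fields while "intersection" yields only `{"query"}`.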

`can_evaluate_any(metrics, data)` `staticmethod`

Check if any of the metrics can evaluate the given data.

Parameters:

Name	Type	Description	Default
`metrics`	`Iterable[BaseMetric]`	The metrics to check.	required
`data`	`MetricInput`	The data to validate against.	required

Returns:

Type	Description
`bool`	True if any metric can evaluate the data, False otherwise.

`ensure_list_of_dicts(data, key)` `staticmethod`

Ensure that a field in the data is a list of dictionaries.

Parameters:

Name	Type	Description	Default
`data`	`MetricInput`	The data to validate.	required
`key`	`str`	The key to check.	required

Raises:

Type	Description
`ValueError`	If the field is not a list or contains non-dictionary elements.

`ensure_non_empty_list(data, key)` `staticmethod`

Ensure that a field in the data is a non-empty list.

Parameters:

Name	Type	Description	Default
`data`	`MetricInput`	The data to validate.	required
`key`	`str`	The key to check.	required

Raises:

Type	Description
`ValueError`	If the field is not a list or is empty.
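
The two validation helpers (`ensure_list_of_dicts` and `ensure_non_empty_list`) can be pictured as small guard functions. The sketch below assumes the input behaves like a plain dictionary; the names `check_list_of_dicts` and `check_non_empty_list` are hypothetical, and the library's own error messages may differ.

```python
from typing import Any


def check_non_empty_list(data: dict[str, Any], key: str) -> None:
    """Raise ValueError unless data[key] is a list with at least one element."""
    value = data.get(key)
    if not isinstance(value, list) or not value:
        raise ValueError(f"'{key}' must be a non-empty list")


def check_list_of_dicts(data: dict[str, Any], key: str) -> None:
    """Raise ValueError unless data[key] is a list whose elements are all dicts."""
    value = data.get(key)
    if not isinstance(value, list) or not all(isinstance(item, dict) for item in value):
        raise ValueError(f"'{key}' must be a list of dictionaries")
```

Both helpers return nothing on success, so they can be called in sequence before any metric runs.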

`evaluate(data)` `async`

Evaluate the data (single item or batch).

Parameters:

Name	Type	Description	Default
`data`	`MetricInput \| list[MetricInput]`	The data to be evaluated. Can be a single item or a list for batch processing.	required

Returns:

Type	Description
`EvaluationOutput \| list[EvaluationOutput]`	The evaluation output with global_explanation. Returns a list if input is a list.
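
The single-item versus batch behavior can be illustrated with a stand-in coroutine. This is a sketch of the dispatch shape only: `evaluate_one` is a hypothetical placeholder for real metric execution, and the sequential loop is an assumption, since the source does not describe how batches are scheduled.

```python
import asyncio
from typing import Any


async def evaluate_one(item: dict[str, Any]) -> dict[str, Any]:
    """Placeholder for evaluating a single item."""
    return {"global_explanation": f"evaluated {item['query']}"}


async def evaluate(data: Any) -> Any:
    """Return one output for a single item, or a list of outputs for a list."""
    if isinstance(data, list):
        return [await evaluate_one(item) for item in data]
    return await evaluate_one(data)


single = asyncio.run(evaluate({"query": "a"}))
batch = asyncio.run(evaluate([{"query": "a"}, {"query": "b"}]))
```

The return type mirrors the input type, so callers that pass a list should expect a list of the same length.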

`get_input_fields()` `classmethod`

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type	Description
`list[str] \| None`	The input fields.

`get_input_spec()` `classmethod`

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type	Description
`list[dict[str, Any]] \| None`	The input spec.

`ClassicalRetrievalEvaluator(metrics=None, k=20)`

Bases: BaseEvaluator

A class that evaluates the performance of a classical retrieval system.

Required fields:

- retrieved_chunks: The retrieved chunks with their similarity scores.
- ground_truth_chunk_ids: The ground truth chunk IDs.

Example:

data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)

evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)
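
To give a feel for what a top-k retrieval metric computes on this input shape, the sketch below derives precision@k and recall@k from the same two fields. It is an illustration under assumed definitions (hits in the top k divided by k, and by the number of ground-truth chunks); it is not the evaluator's implementation, and the function name `precision_recall_at_k` is hypothetical.

```python
def precision_recall_at_k(
    retrieved_chunks: dict[str, float],
    ground_truth_chunk_ids: list[str],
    k: int,
) -> tuple[float, float]:
    """Compute (precision@k, recall@k) from chunk scores and relevant IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank chunk IDs by similarity score, highest first, and keep the top k.
    ranked = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)[:k]
    relevant = set(ground_truth_chunk_ids)
    hits = sum(1 for chunk_id in ranked if chunk_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With the example data above and k=2, both top-ranked chunks are relevant, so precision is 1.0 while recall is 2/3 because one relevant chunk falls outside the cutoff.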

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`list[str \| ClassicalRetrievalMetric] \| None`	The metrics to evaluate.
`k`	`int`	The number of retrieved chunks to consider.

Initializes the evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`list[str \| ClassicalRetrievalMetric] \| None`	The metrics to evaluate. Defaults to all metrics.	`None`
`k`	`int \| list[int]`	The number of retrieved chunks to consider. Defaults to 20.	`20`

`required_fields` `property`

Returns the required fields for the data.

Returns:

Type	Description
`set[str]`	The required fields for the data.

`CustomEvaluator(metrics, name='custom', parallel=True)`

Bases: BaseEvaluator

Custom evaluator.

This evaluator is used to evaluate the performance of the model.

Attributes:

Name	Type	Description
`metrics`	`list[BaseMetric]`	The list of metrics to evaluate.
`name`	`str`	The name of the evaluator.
`parallel`	`bool`	Whether to evaluate the metrics in parallel.

Initialize the custom evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`list[BaseMetric]`	The list of metrics to evaluate.	required
`name`	`str`	The name of the evaluator.	`'custom'`
`parallel`	`bool`	Whether to evaluate the metrics in parallel.	`True`
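
The effect of the `parallel` flag can be sketched with `asyncio.gather`. The metrics here are hypothetical async callables rather than `BaseMetric` instances, and `run_metrics` is an illustrative helper, not the evaluator's actual method.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any

Metric = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


async def run_metrics(metrics: list[Metric], data: dict[str, Any], parallel: bool = True) -> list[dict[str, Any]]:
    """Run every metric on the same data, concurrently or one after another."""
    if parallel:
        # gather preserves the order of the awaitables it was given.
        return list(await asyncio.gather(*(metric(data) for metric in metrics)))
    return [await metric(data) for metric in metrics]


async def length_metric(data: dict[str, Any]) -> dict[str, Any]:
    return {"length": len(data["generated_response"])}


async def non_empty_metric(data: dict[str, Any]) -> dict[str, Any]:
    return {"non_empty": bool(data["generated_response"])}


sample = {"generated_response": "hello"}
parallel_results = asyncio.run(run_metrics([length_metric, non_empty_metric], sample, parallel=True))
sequential_results = asyncio.run(run_metrics([length_metric, non_empty_metric], sample, parallel=False))
```

Both modes return results in the same order, so the flag changes scheduling and not the shape of the output.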

`GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: GenerationEvaluator

GEval Generation Evaluator.

This evaluator is used to evaluate the generation of the model.

Default expected input

query (str): The query to evaluate the generation of the model's output.
retrieved_context (str): The retrieved context to evaluate the generation of the model's output.
expected_response (str): The expected response to evaluate the generation of the model's output.
generated_response (str): The generated response to evaluate the generation of the model's output.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`List[BaseMetric]`	The list of metrics to evaluate.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`rule_book`	`RuleBook \| None`	The rule book.
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine.

Initialize the GEval Generation Evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`List[BaseMetric] \| None`	The list of metrics to evaluate.	`None`
`enabled_metrics`	`List[type[BaseMetric] \| str] \| None`	The list of enabled metrics.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_credentials`	`str \| None`	The model credentials to use for the metrics.	`None`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metrics.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel.	`True`
`rule_book`	`RuleBook \| None`	The rule book.	`None`
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation.	`None`
`refusal_metric`	`type[BaseMetric] \| None`	The refusal metric to use. If None, the default refusal metric will be used. Defaults to GEvalRefusalMetric.	`None`
`batch_status_check_interval`	`float`	Time between batch status checks in seconds. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).	`BATCH_MAX_ITERATIONS`

`GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None, refusal_metric=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: BaseEvaluator

Evaluator for generation tasks.

Default expected input

query (str): The query to evaluate the completeness of the model's output.
retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
expected_response (str): The expected response to evaluate the completeness of the model's output.
generated_response (str): The generated response to evaluate the completeness of the model's output.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`List[BaseMetric]`	The list of metrics to evaluate.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`rule_book`	`RuleBook \| None`	The rule book.
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine.
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation.

Initialize the GenerationEvaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`List[BaseMetric] \| None`	A list of metric instances to use as a base pool. If None, defaults to `[CompletenessMetric, RedundancyMetric, GroundednessMetric]`. Each custom metric must produce a `score` key in its output.	`None`
`enabled_metrics`	`List[type[BaseMetric] \| str] \| None`	A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metrics.	`None`
`model_credentials`	`str \| None`	The model credentials, used for initializing default metrics. Defaults to None. This is required if some of the default metrics are used.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel. Defaults to True.	`True`
`rule_book`	`RuleBook \| None`	The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set.	`None`
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine. Defaults to a new instance with the determined rule book.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation. If provided, will use multiple judges instead of single model evaluation. Uses composition pattern for clean separation of concerns.	`None`
`refusal_metric`	`type[RefusalMetric] \| None`	The refusal metric to use. If None, the default refusal metric will be used. Defaults to RefusalMetric.	`None`
`batch_status_check_interval`	`float`	Time between batch status checks in seconds. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).	`BATCH_MAX_ITERATIONS`

Raises:

Type	Description
`ValueError`	If `model_credentials` is not provided when using default metrics.
`ValueError`	If custom metrics, or a mix of custom and default metrics, are used without an explicit `rule_book`.
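
The relationship between the `metrics` pool and `enabled_metrics` can be sketched as a filter that accepts either a metric class or a metric name. The stub classes and the `select_metrics` helper below are hypothetical stand-ins used only for illustration; they are not the library's metric classes.

```python
class StubMetric:
    name = "stub"


class StubCompleteness(StubMetric):
    name = "completeness"


class StubGroundedness(StubMetric):
    name = "groundedness"


class StubRedundancy(StubMetric):
    name = "redundancy"


def select_metrics(pool, enabled_metrics=None):
    """Keep pool entries matched by name (str) or by class (type)."""
    if enabled_metrics is None:
        # None means every metric in the pool is used.
        return list(pool)

    def is_enabled(metric):
        return any(
            metric.name == entry if isinstance(entry, str) else isinstance(metric, entry)
            for entry in enabled_metrics
        )

    return [metric for metric in pool if is_enabled(metric)]


pool = [StubCompleteness(), StubGroundedness(), StubRedundancy()]
selected = select_metrics(pool, ["completeness", StubRedundancy])
```

Mixing names and classes in one list works because each entry is checked by its own type.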

`HybridRuleBook(good, bad, issue_rules)` `dataclass`

Rule book that supports both MetricSpec (int) and FloatMetricSpec (float).

This allows combining generation metrics (int-based) and retrieval metrics (float-based) in a single rule book.

Attributes:

Name	Type	Description
`good`	`Specification`	The good rule (can be MetricSpec, FloatMetricSpec, or combination).
`bad`	`Specification`	The bad rule (can be MetricSpec, FloatMetricSpec, or combination).
`issue_rules`	`Mapping[Issue, Specification]`	Issue detection rules.

`HybridRuleEngine(rules)`

Bases: BaseRuleEngine[HybridRuleBook, Specification, Relevancy]

Rule engine that handles both int-based (MetricSpec) and float-based (FloatMetricSpec) metrics.

This engine can evaluate rules that combine generation metrics (int 0-4) and retrieval metrics (float 0.0-1.0) in a single rule book.

Initialize the HybridRuleEngine.
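
The idea of one rule spanning int-based and float-based scores can be sketched with small composable predicates. Everything below is hypothetical: `AtLeast`, `AtMost`, `all_of`, and `classify` are not the library's `MetricSpec`, `FloatMetricSpec`, or engine API, the thresholds are placeholders, and the order in which the bad and good rules are checked is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

Scores = dict[str, float]
Rule = Callable[[Scores], bool]


@dataclass(frozen=True)
class AtLeast:
    metric: str
    minimum: float

    def __call__(self, scores: Scores) -> bool:
        return scores[self.metric] >= self.minimum


@dataclass(frozen=True)
class AtMost:
    metric: str
    maximum: float

    def __call__(self, scores: Scores) -> bool:
        return scores[self.metric] <= self.maximum


def all_of(*rules: Rule) -> Rule:
    """Combine rules so that every one of them must hold."""
    return lambda scores: all(rule(scores) for rule in rules)


def classify(scores: Scores, good: Rule, bad: Rule) -> str:
    """Assumed precedence: bad first, then good, otherwise incomplete."""
    if bad(scores):
        return "bad"
    if good(scores):
        return "good"
    return "incomplete"


# One rule mixes an int-scale generation metric (0-4) with a float-scale
# retrieval metric (0.0-1.0); the metric keys and cutoffs are illustrative.
good_rule = all_of(AtLeast("completeness", 3), AtLeast("contextual_recall", 0.7))
bad_rule = AtMost("completeness", 1)
```

Treating each specification as a predicate over a shared score dictionary is what lets the two numeric scales coexist in one rule book.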

`LMBasedRetrievalEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None)`

Bases: BaseEvaluator

Evaluator for LM-based retrieval quality in RAG pipelines.

This evaluator:

- Runs a configurable set of retrieval metrics (by default: DeepEval contextual precision and contextual recall)
- Combines their scores using a simple rule-based scheme to produce:
    - relevancy_rating (good / bad / incomplete)
    - score (aggregated retrieval score)
    - possible_issues (list of textual issues)

Default expected input

query (str): The query to evaluate the metric.
expected_response (str): The expected response to evaluate the metric.
retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If the retrieved context is a str, it will be converted into a list with a single element.
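
The string-to-list conversion described for `retrieved_context` amounts to a one-branch normalization. A minimal sketch follows; the function name `normalize_retrieved_context` is hypothetical.

```python
from __future__ import annotations


def normalize_retrieved_context(retrieved_context: str | list[str]) -> list[str]:
    """Wrap a single string in a list; copy an existing list unchanged."""
    if isinstance(retrieved_context, str):
        return [retrieved_context]
    return list(retrieved_context)
```

Checking for `str` first matters because a string is itself iterable, and passing it to `list()` would split it into characters.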

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`list[BaseMetric]`	The list of metrics to evaluate.
`enabled_metrics`	`Sequence[type[BaseMetric] \| str] \| None`	The list of metrics to enable.
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.
`model_credentials`	`str \| None`	The model credentials to use for the metrics.
`model_config`	`dict[str, Any] \| None`	The model configuration to use for the metrics.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`rule_book`	`LMBasedRetrievalRuleBook \| None`	The rule book for evaluation.
`rule_engine`	`LMBasedRetrievalRuleEngine \| None`	The rule engine for classification.

Initialize the LM-based retrieval evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`Sequence[BaseMetric] \| None`	Optional custom retrieval metric instances. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None.	`None`
`enabled_metrics`	`Sequence[type[BaseMetric] \| str] \| None`	Optional subset of metrics to enable from the metric pool. Each entry can be either a metric class or its `name`. If None, all metrics from the pool are used. Defaults to None.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	Model for the default DeepEval metrics. Defaults to `DefaultValues.MODEL`.	`MODEL`
`model_credentials`	`str \| None`	Credentials for the model, required when `model` is a string. Defaults to None.	`None`
`model_config`	`dict[str, Any] \| None`	Optional model configuration. Defaults to None.	`None`
`run_parallel`	`bool`	Whether to run retrieval metrics in parallel. Defaults to True.	`True`
`rule_book`	`LMBasedRetrievalRuleBook \| None`	The rule book for evaluation. If not provided, a default one is generated based on enabled metrics. Defaults to None.	`None`
`rule_engine`	`LMBasedRetrievalRuleEngine \| None`	The rule engine for classification. If not provided, a new instance is created with the determined rule book. Defaults to None.	`None`

`required_fields` `property`

Return the union of required fields from all configured metrics.

`QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: BaseEvaluator

Evaluator for query transformation tasks.

Default expected input:

- query (str): The query to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.

Attributes:

Name	Type	Description
`completeness_metric`	`CompletenessMetric`	The completeness metric.
`hallucination_metric`	`GroundednessMetric`	The groundedness metric.
`redundancy_metric`	`RedundancyMetric`	The redundancy metric.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`score_mapping`	`dict[str, dict[int, float]]`	The score mapping.
`score_weights`	`dict[str, float]`	The score weights.

Initialize the QTEvaluator.

Parameters:

Name	Type	Description	Default
`completeness_metric`	`CompletenessMetric \| None`	The completeness metric. Defaults to built-in CompletenessMetric.	`None`
`groundedness_metric`	`GroundednessMetric \| None`	The groundedness metric. Defaults to built-in GroundednessMetric.	`None`
`redundancy_metric`	`RedundancyMetric \| None`	The redundancy metric. Defaults to built-in RedundancyMetric.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metrics.	`None`
`model_credentials`	`str \| None`	The model credentials. Defaults to None. This is required if some of the default metrics are used.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel. Defaults to True.	`True`
`score_mapping`	`dict[str, dict[int, float]] \| None`	The score mapping. Defaults to None. This is required if some of the default metrics are used.	`None`
`score_weights`	`dict[str, float] \| None`	The score weights. Defaults to None.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation.	`None`
`batch_status_check_interval`	`float`	Time between batch status checks in seconds. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).	`BATCH_MAX_ITERATIONS`

Raises:

Type	Description
`ValueError`	If `model_credentials` is not provided when using default metrics.
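
The types of `score_mapping` (`dict[str, dict[int, float]]`) and `score_weights` (`dict[str, float]`) suggest a two-step calculation: map each metric's integer score to a float, then weight the results. The sketch below assumes a weighted average normalized by the total weight, which the source does not confirm; the function name `weighted_score` and all numeric values are illustrative.

```python
def weighted_score(
    raw_scores: dict[str, int],
    score_mapping: dict[str, dict[int, float]],
    score_weights: dict[str, float],
) -> float:
    """Map each raw int score to a float, then take a weighted average."""
    total_weight = sum(score_weights[name] for name in raw_scores)
    if total_weight <= 0:
        raise ValueError("score_weights must sum to a positive value")
    weighted = sum(
        score_mapping[name][raw] * score_weights[name]
        for name, raw in raw_scores.items()
    )
    return weighted / total_weight


# Placeholder configuration for two metrics; not the library defaults.
mapping = {
    "completeness": {1: 0.0, 2: 0.5, 3: 1.0},
    "redundancy": {1: 1.0, 2: 0.5, 3: 0.0},
}
weights = {"completeness": 0.75, "redundancy": 0.25}
```

With this placeholder configuration, raw scores of completeness=2 and redundancy=3 give (0.5 * 0.75 + 0.0 * 0.25) / 1.0 = 0.375.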

`required_fields` `property`

Returns the required fields for the data.

Returns:

Type	Description
`set[str]`	The required fields for the data.

`RAGEvaluator(retrieval_evaluator=None, generation_evaluator=None, retrieval_metrics=None, generation_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None, judge=None, refusal_metric=None)`

Bases: BaseEvaluator

Evaluator for RAG pipelines combining retrieval and generation evaluation.

This evaluator:

- Runs retrieval evaluation using LMBasedRetrievalEvaluator
- Runs generation evaluation using GEvalGenerationEvaluator
- Combines their scores using a customizable rule-based scheme to produce:
    - relevancy_rating (good / bad / incomplete)
    - score (aggregated RAG score)
    - possible_issues (list of textual issues)

Important Note on Rule Engine: By default, this evaluator uses GenerationRuleEngine with RuleBook which works with generation metrics (int scores 0-4). To include retrieval metrics in rule-based classification, use HybridRuleBook with HybridRuleEngine, which supports both MetricSpec (int-based) and FloatMetricSpec (float-based) metrics.

Default expected input

query (str): The query to evaluate.
expected_response (str): The expected response.
retrieved_context (str | list[str]): The retrieved contexts.
generated_response (str): The generated response.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`retrieval_evaluator`	`LMBasedRetrievalEvaluator`	The retrieval evaluator.
`generation_evaluator`	`GEvalGenerationEvaluator`	The generation evaluator.
`rule_book`	`RuleBook \| HybridRuleBook \| None`	The rule book for evaluation (uses generation metrics by default). Use `HybridRuleBook` to combine both generation (int) and retrieval (float) metrics.
`rule_engine`	`GenerationRuleEngine \| HybridRuleEngine \| None`	The rule engine for classification. Uses `GenerationRuleEngine` for `RuleBook` or `HybridRuleEngine` for `HybridRuleBook`.
`run_parallel`	`bool`	Whether to run retrieval and generation evaluations in parallel.

Initialize the RAG evaluator.

Parameters:

Name	Type	Description	Default
`retrieval_evaluator`	`LMBasedRetrievalEvaluator \| None`	Pre-configured retrieval evaluator. If provided, this will be used directly and retrieval_* parameters will be ignored. Defaults to None.	`None`
`generation_evaluator`	`GEvalGenerationEvaluator \| None`	Pre-configured generation evaluator. If provided, this will be used directly and generation_* parameters will be ignored. Defaults to None.	`None`
`retrieval_metrics`	`Sequence[BaseMetric] \| None`	Optional custom retrieval metric instances. Used only if retrieval_evaluator is None. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None.	`None`
`generation_metrics`	`Sequence[BaseMetric] \| None`	Optional custom generation metric instances. Used only if generation_evaluator is None. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	Model for the default metrics. Used only if evaluators are None. Defaults to `DefaultValues.MODEL`.	`MODEL`
`model_credentials`	`str \| None`	Credentials for the model, required when `model` is a string. Used only if evaluators are None. Defaults to None.	`None`
`model_config`	`dict[str, Any] \| None`	Optional model configuration. Used only if evaluators are None. Defaults to None.	`None`
`run_parallel`	`bool`	Whether to run retrieval and generation evaluations in parallel. Used only if evaluators are None. Defaults to True.	`True`
`rule_book`	`RuleBook \| HybridRuleBook \| None`	The rule book for evaluation. If not provided, a default one is generated based on enabled generation metrics. Use `RuleBook` for generation-only metrics (int-based) or `HybridRuleBook` to combine both generation (int) and retrieval (float) metrics. Defaults to None.	`None`
`rule_engine`	`GenerationRuleEngine \| HybridRuleEngine \| None`	The rule engine for classification. If not provided, a new instance is created with the determined rule book. Use `GenerationRuleEngine` for `RuleBook` or `HybridRuleEngine` for `HybridRuleBook`. Defaults to None.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation. Used only if generation_evaluator is None. Defaults to None.	`None`
`refusal_metric`	`type[BaseMetric] \| None`	The refusal metric to use for generation evaluator. Used only if generation_evaluator is None. If None, the default refusal metric will be used. Defaults to None.	`None`

`required_fields` `property`

Return the union of required fields from both evaluators.

`SummarizationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: GenerationEvaluator

Evaluator for summarization quality using four GEval-style metrics.

Default expected input

input (str): Source text or transcript.
summary (str): Generated summary.

Initialize the SummarizationEvaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`list[BaseMetric] \| None`	A list of metric instances to use as a base pool. If None, default summary metrics are used.	`None`
`enabled_metrics`	`list[type[BaseMetric] \| str] \| None`	A list of metric classes or names to enable from the pool. If None, all default summary metrics from `_SUMMARY_METRIC_CONFIGS` are enabled.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_credentials`	`str \| None`	The model credentials used for metric initialization.	`None`
`model_config`	`dict[str, Any] \| None`	The model config used for metric initialization.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel. Defaults to True.	`True`
`rule_book`	`RuleBook \| None`	Custom rule book for summarization evaluation. If None, a rule book is built from the enabled summary metrics and configured thresholds.	`None`
`batch_status_check_interval`	`float`	Time between batch status checks in seconds.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of status check iterations before timeout.	`BATCH_MAX_ITERATIONS`

