Evaluator
Evaluator module for orchestrating evaluation workflows.
This module provides evaluator classes that coordinate metrics and evaluation logic for different use cases. Evaluators handle the evaluation process, including data preparation, metric execution, and result aggregation.
Available evaluators:

- BaseEvaluator: Abstract base class for all evaluators
- GenerationEvaluator: Evaluates text generation quality
- AgentEvaluator: Evaluates AI agent performance
- ClassicalRetrievalEvaluator: Traditional retrieval evaluation methods
- GEvalGenerationEvaluator: G-Eval based generation evaluation
- QTEvaluator: Query transformation evaluation
- TrajectoryGenerationEvaluator: Combined trajectory and generation evaluation with rule-based aggregation
- CustomEvaluator: Create custom evaluation workflows
AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: BaseEvaluator
Evaluator for agent tasks.
This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.
Default expected input
- agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `trajectory_accuracy_metric` | `LangChainAgentTrajectoryAccuracyMetric` | The metric used to evaluate agent trajectory accuracy. |
Initialize the AgentEvaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the trajectory accuracy metric. Defaults to DefaultValues.AGENT_EVALS_MODEL. | `AGENT_EVALS_MODEL` |
| `model_credentials` | `str \| None` | The model credentials. Defaults to None. This is required for the metric to function properly. | `None` |
| `model_config` | `dict[str, Any] \| None` | The model configuration. Defaults to None. | `None` |
| `prompt` | `str \| None` | Custom prompt for evaluation. If None, uses the default prompt from the metric. Defaults to None. | `None` |
| `use_reference` | `bool` | Whether to use expected_agent_trajectory for reference-based evaluation. Defaults to True. | `True` |
| `continuous` | `bool` | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | `False` |
| `choices` | `list[float] \| None` | Optional list of specific float values the score must be chosen from. Defaults to None. | `None` |
| `use_reasoning` | `bool` | If True, includes an explanation for the score in the output. Defaults to True. | `True` |
| `few_shot_examples` | `list[Any] \| None` | Optional list of example evaluations to append to the prompt. Defaults to None. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If ... |
required_fields: set[str]
property
Returns the required fields for the data.
Returns:

| Type | Description |
|---|---|
| `set[str]` | The required fields for the data. |
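A minimal usage sketch, not taken from the library's own docs: it assumes the evaluator accepts a plain mapping with the agent_trajectory field described above, and that trajectories use OpenAI-style message dictionaries. The model id and credential handling mirror the combined example later on this page.

```python
import asyncio
import os

from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Assumed input shape: a mapping with the documented "agent_trajectory" field.
# The OpenAI-style role/content/tool_calls message dicts are illustrative only.
data = {
    "agent_trajectory": [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "tool_calls": [{"name": "get_weather", "args": {"city": "Jakarta"}}]},
        {"role": "tool", "content": "Sunny, 31 C"},
        {"role": "assistant", "content": "It is sunny and 31 C in Jakarta."},
    ],
}

evaluator = AgentEvaluator(
    model="openai/gpt-4o-mini",                     # model id as used in the combined example below
    model_credentials=os.getenv("OPENAI_API_KEY"),  # required for the metric to function properly
    use_reference=False,                            # no expected_agent_trajectory is supplied here
    continuous=True,                                # float score in [0, 1] instead of a boolean
)

result = asyncio.run(evaluator.evaluate(data))
```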
BaseEvaluator(name)
Bases: ABC
Base class for all evaluators.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `required_fields` | `set[str]` | The required fields for the evaluator. |
| `input_type` | `type \| None` | The type of the input data. |
Initialize the evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name of the evaluator. | required |
aggregate_required_fields(metrics, mode='any')
staticmethod
Aggregate required fields from multiple metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `Iterable[BaseMetric]` | The metrics to aggregate from. | required |
| `mode` | `str` | The aggregation mode. Options: "union" (all fields required by any metric), "intersection" (only fields required by all metrics), "any" (empty set, no validation). Defaults to "any". | `'any'` |

Returns:

| Type | Description |
|---|---|
| `set[str]` | The aggregated required fields. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If mode is not one of the supported options. |
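A short sketch of the three aggregation modes. The import path is an assumption, and `completeness_metric` and `groundedness_metric` are placeholders for any already-constructed BaseMetric instances.

```python
from gllm_evals.evaluator.base_evaluator import BaseEvaluator  # import path is an assumption

# Placeholders for real BaseMetric instances.
metrics = [completeness_metric, groundedness_metric]

# "union": every field required by at least one metric.
all_fields = BaseEvaluator.aggregate_required_fields(metrics, mode="union")

# "intersection": only the fields required by every metric.
shared_fields = BaseEvaluator.aggregate_required_fields(metrics, mode="intersection")

# "any" (the default): an empty set, i.e. no field validation.
no_validation = BaseEvaluator.aggregate_required_fields(metrics)
```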
can_evaluate_any(metrics, data)
staticmethod
Check if any of the metrics can evaluate the given data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `Iterable[BaseMetric]` | The metrics to check. | required |
| `data` | `MetricInput` | The data to validate against. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if any metric can evaluate the data, False otherwise. |
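Continuing the same sketch, this check is useful for skipping records that none of the configured metrics can score; the dict-style record below is an assumed MetricInput shape.

```python
record = {"query": "What is MRR?", "generated_response": "Mean reciprocal rank is ..."}

if BaseEvaluator.can_evaluate_any(metrics, record):
    # At least one metric can score this record.
    pass
else:
    # No metric applies; skip or log the record.
    pass
```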
ensure_list_of_dicts(data, key)
staticmethod
Ensure that a field in the data is a list of dictionaries.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The data to validate. | required |
| `key` | `str` | The key to check. | required |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the field is not a list or contains non-dictionary elements. |
ensure_non_empty_list(data, key)
staticmethod
Ensure that a field in the data is a non-empty list.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The data to validate. | required |
| `key` | `str` | The key to check. | required |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the field is not a list or is empty. |
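Both validation helpers raise ValueError on malformed input, which makes them convenient guards at the top of a custom evaluation flow. The dict-style data below is again an assumed MetricInput shape, using the same assumed import as the sketches above.

```python
record = {"agent_trajectory": [{"role": "user", "content": "hi"}]}

try:
    BaseEvaluator.ensure_non_empty_list(record, "agent_trajectory")  # must be a non-empty list
    BaseEvaluator.ensure_list_of_dicts(record, "agent_trajectory")   # every element must be a dict
except ValueError as exc:
    print(f"Invalid input: {exc}")
```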
evaluate(data)
async
Evaluate the data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The data to be evaluated. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `EvaluateOutput` | `EvaluationOutput` | The evaluation output with global_explanation. |
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:

| Type | Description |
|---|---|
| `list[str] \| None` | The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]] \| None` | The input spec. |
ClassicalRetrievalEvaluator(metrics=None, k=20)
Bases: BaseEvaluator
A class that evaluates the performance of a classical retrieval system.
Required fields:

- retrieved_chunks: The retrieved chunks with their similarity scores.
- ground_truth_chunk_ids: The ground truth chunk IDs.
Example:

```python
data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)
evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)
```
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `metrics` | `list[str \| ClassicalRetrievalMetric] \| None` | The metrics to evaluate. |
| `k` | `int` | The number of retrieved chunks to consider. |
Initializes the evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `list[str \| ClassicalRetrievalMetric] \| None` | The metrics to evaluate. Defaults to all metrics. | `None` |
| `k` | `int \| list[int]` | The number of retrieved chunks to consider. Defaults to 20. | `20` |
required_fields: set[str]
property
Returns the required fields for the data.
Returns:

| Type | Description |
|---|---|
| `set[str]` | The required fields for the data. |
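A hedged variant of the example above: per the parameter table, k may be a list of cut-offs and metrics may be restricted to a subset. The string metric names below are illustrative assumptions, not names confirmed by the library.

```python
evaluator = ClassicalRetrievalEvaluator(
    metrics=["recall", "precision"],  # illustrative subset; defaults to all metrics when None
    k=[5, 10, 20],                    # evaluate each metric at several cut-offs
)
result = await evaluator.evaluate(data)  # same RetrievalData as in the example above
```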
CustomEvaluator(metrics, name='custom', parallel=True)
Bases: BaseEvaluator
Custom evaluator.
This evaluator runs a user-supplied list of metrics, optionally in parallel, to evaluate model performance.
Attributes:

| Name | Type | Description |
|---|---|---|
| `metrics` | `list[BaseMetric]` | The list of metrics to evaluate. |
| `name` | `str` | The name of the evaluator. |
| `parallel` | `bool` | Whether to evaluate the metrics in parallel. |
Initialize the custom evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `list[BaseMetric]` | The list of metrics to evaluate. | required |
| `name` | `str` | The name of the evaluator. | `'custom'` |
| `parallel` | `bool` | Whether to evaluate the metrics in parallel. | `True` |
GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None, refusal_metric=None)
Bases: GenerationEvaluator
GEval Generation Evaluator.
This evaluator assesses the quality of the model's generated responses using G-Eval based metrics.
Default expected input
- query (str): The input query.
- retrieved_context (str): The context retrieved for the query.
- expected_response (str): The expected (reference) response.
- generated_response (str): The model's generated response.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `metrics` | `List[BaseMetric]` | The list of metrics to evaluate. |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. |
| `rule_book` | `RuleBook \| None` | The rule book. |
| `generation_rule_engine` | `GenerationRuleEngine \| None` | The generation rule engine. |
Initialize the GEval Generation Evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `List[BaseMetric] \| None` | The list of metrics to evaluate. | `None` |
| `enabled_metrics` | `List[type[BaseMetric] \| str] \| None` | The list of enabled metrics. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metrics. | `MODEL` |
| `model_credentials` | `str \| None` | The model credentials to use for the metrics. | `None` |
| `model_config` | `dict[str, Any] \| None` | The model config to use for the metrics. | `None` |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. | `True` |
| `rule_book` | `RuleBook \| None` | The rule book. | `None` |
| `generation_rule_engine` | `GenerationRuleEngine \| None` | The generation rule engine. | `None` |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple LLM judge for ensemble evaluation. | `None` |
| `refusal_metric` | `type[BaseMetric] \| None` | The refusal metric to use. If None, the default refusal metric will be used. Defaults to GEvalRefusalMetric. | `None` |
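A minimal usage sketch: the import path, model id, and credential handling follow the combined example later on this page, and the evaluator is assumed to accept a plain mapping with the documented input fields.

```python
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

# Assumed input shape: a mapping with the documented default expected input fields.
data = {
    "query": "When was the Eiffel Tower completed?",
    "retrieved_context": "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "expected_response": "It was completed in 1889.",
    "generated_response": "The Eiffel Tower was finished in 1889.",
}

evaluator = GEvalGenerationEvaluator(
    model="google/gemini-2.0-flash",
    model_credentials=os.getenv("GOOGLE_API_KEY"),
    run_parallel=True,
)
result = await evaluator.evaluate(data)
```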
GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None, refusal_metric=None)
Bases: BaseEvaluator
Evaluator for generation tasks.
Default expected input
- query (str): The query to evaluate the completeness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `metrics` | `List[BaseMetric]` | The list of metrics to evaluate. |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. |
| `rule_book` | `RuleBook \| None` | The rule book. |
| `generation_rule_engine` | `GenerationRuleEngine \| None` | The generation rule engine. |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple LLM judge for ensemble evaluation. |
Initialize the GenerationEvaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `List[BaseMetric] \| None` | A list of metric instances to use as a base pool. If None, a default pool of metrics is used. | `None` |
| `enabled_metrics` | `List[type[BaseMetric] \| str] \| None` | A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metrics. | `MODEL` |
| `model_config` | `dict[str, Any] \| None` | The model config to use for the metrics. | `None` |
| `model_credentials` | `str \| None` | The model credentials, used for initializing default metrics. Defaults to None. This is required if some of the default metrics are used. | `None` |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. Defaults to True. | `True` |
| `rule_book` | `RuleBook \| None` | The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set. | `None` |
| `generation_rule_engine` | `GenerationRuleEngine \| None` | The generation rule engine. Defaults to a new instance with the determined rule book. | `None` |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation. Uses a composition pattern for clean separation of concerns. | `None` |
| `refusal_metric` | `type[RefusalMetric] \| None` | The refusal metric to use. If None, the default refusal metric will be used. Defaults to RefusalMetric. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If ... |
| `ValueError` | If a custom ... |
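A hedged sketch of restricting the default pool with enabled_metrics. The import path and the string metric names are assumptions, and data carries the default expected input fields listed above.

```python
import os

from gllm_evals.evaluator.generation_evaluator import GenerationEvaluator  # import path is an assumption

evaluator = GenerationEvaluator(
    enabled_metrics=["completeness", "groundedness"],  # illustrative names; metric classes also work
    model="google/gemini-2.0-flash",
    model_credentials=os.getenv("GOOGLE_API_KEY"),
    run_parallel=True,
)
# data provides query, retrieved_context, expected_response, and generated_response.
result = await evaluator.evaluate(data)
```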
QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)
Bases: BaseEvaluator
Evaluator for query transformation tasks.
Default expected input:

- query (str): The input query.
- expected_response (str): The expected response used as a reference.
- generated_response (str): The generated response to evaluate.
Attributes:

| Name | Type | Description |
|---|---|---|
| `completeness_metric` | `CompletenessMetric` | The completeness metric. |
| `hallucination_metric` | `GroundednessMetric` | The groundedness metric. |
| `redundancy_metric` | `RedundancyMetric` | The redundancy metric. |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. |
| `score_mapping` | `dict[str, dict[int, float]]` | The score mapping. |
| `score_weights` | `dict[str, float]` | The score weights. |
Initialize the QTEvaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `completeness_metric` | `CompletenessMetric \| None` | The completeness metric. Defaults to built-in CompletenessMetric. | `None` |
| `groundedness_metric` | `GroundednessMetric \| None` | The groundedness metric. Defaults to built-in GroundednessMetric. | `None` |
| `redundancy_metric` | `RedundancyMetric \| None` | The redundancy metric. Defaults to built-in RedundancyMetric. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metrics. | `MODEL` |
| `model_config` | `dict[str, Any] \| None` | The model config to use for the metrics. | `None` |
| `model_credentials` | `str \| None` | The model credentials. Defaults to None. This is required if some of the default metrics are used. | `None` |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. Defaults to True. | `True` |
| `score_mapping` | `dict[str, dict[int, float]] \| None` | The score mapping. Defaults to None. This is required if some of the default metrics are used. | `None` |
| `score_weights` | `dict[str, float] \| None` | The score weights. Defaults to None. | `None` |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple LLM judge for ensemble evaluation. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If ... |
required_fields: set[str]
property
Returns the required fields for the data.
Returns:

| Type | Description |
|---|---|
| `set[str]` | The required fields for the data. |
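A minimal sketch under the same assumptions as the earlier examples (dict-style input, assumed import path); the built-in metrics are used because no metric overrides are passed.

```python
import os

from gllm_evals.evaluator.qt_evaluator import QTEvaluator  # import path is an assumption

data = {
    "query": "Summarize the refund policy.",
    "expected_response": "Refunds are available within 30 days of purchase.",
    "generated_response": "You can get a refund within 30 days of buying the product.",
}

evaluator = QTEvaluator(
    model="google/gemini-2.0-flash",
    model_credentials=os.getenv("GOOGLE_API_KEY"),
    run_parallel=True,
)
result = await evaluator.evaluate(data)
```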
TrajectoryGenerationEvaluator(agent_evaluator=None, generation_evaluator=None)
Bases: BaseEvaluator
Evaluator for agent trajectory and generation quality evaluation with rule-based aggregation.
This evaluator combines:

1. LangChain AgentEvals trajectory accuracy metric (agent execution quality)
2. GEval generation evaluator (completeness, groundedness, redundancy)
3. Rule-based aggregation logic

Uses a dependency injection pattern: it accepts pre-configured evaluators for maximum flexibility.

Expected input (AgentData with generation fields):

- agent_trajectory (list[dict[str, Any]]): The agent trajectory
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): Expected trajectory
- query (str | None, optional): The query for generation evaluation
- generated_response (str | None, optional): The generated response
- expected_response (str | None, optional): The expected response
- retrieved_context (str | list[str] | None, optional): The retrieved context
Aggregation Logic
- If trajectory relevancy is "incomplete" or "bad": return trajectory result
- If trajectory relevancy is "good": return GEval generation result
Output Structure:

The output is a flat dictionary containing:

- Aggregated Results:
    - global_explanation (str): Human-readable explanation of the evaluation (added by parent evaluator)
    - score (float | int): Final aggregated score based on rule-based logic
    - relevancy_rating (str): Final relevancy rating ("good", "bad", or "incomplete")
    - possible_issues (list[str], optional): List of detected issues (only present when trajectory is good)
- Trajectory Evaluation Results:
    - langchain_agent_trajectory_accuracy (dict): Trajectory evaluation result containing score, explanation, key, and metadata from the agent trajectory metric
- Generation Evaluation Results (nested):
    - generation (dict): Generation evaluation results containing:
        - global_explanation (str): Explanation of generation quality
        - relevancy_rating (str): Generation quality rating
        - score (float | int): Generation quality score
        - possible_issues (list[str]): List of generation-related issues
        - Individual metric results (completeness, groundedness, redundancy, language_consistency, refusal_alignment)
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `agent_evaluator` | `AgentEvaluator` | Pre-configured evaluator for trajectory assessment. |
| `generation_evaluator` | `GEvalGenerationEvaluator` | Pre-configured evaluator for generation quality assessment. |
Example

```python
import os

from gllm_evals.evaluator.agent_evaluator import AgentEvaluator
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

# Configure individual evaluators
agent_eval = AgentEvaluator(
    model="openai/gpt-4o-mini",
    model_credentials=os.getenv("OPENAI_API_KEY"),
    use_reference=True,
)
gen_eval = GEvalGenerationEvaluator(
    model="google/gemini-2.0-flash",
    model_credentials=os.getenv("GOOGLE_API_KEY"),
    run_parallel=True,
)

# Create combined evaluator
evaluator = TrajectoryGenerationEvaluator(
    agent_evaluator=agent_eval,
    generation_evaluator=gen_eval,
)

# Evaluate
result = await evaluator.evaluate(data)
```
Initialize the TrajectoryGenerationEvaluator with optional dependency injection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `agent_evaluator` | `AgentEvaluator \| None` | Pre-configured evaluator for agent trajectory evaluation. If None, a default AgentEvaluator will be created with model=DefaultValues.AGENT_EVALS_MODEL. Defaults to None. | `None` |
| `generation_evaluator` | `GEvalGenerationEvaluator \| None` | Pre-configured evaluator for generation quality assessment. If None, a default GEvalGenerationEvaluator will be created with model=DefaultValues.MODEL. Defaults to None. | `None` |
Note
When using default evaluators (by not providing agent_evaluator or generation_evaluator), make sure the required environment variables are set:

- OPENAI_API_KEY for the agent evaluator
- GOOGLE_API_KEY for the generation evaluator
required_fields: set[str]
property
Returns the required fields for the data.
Returns the combined set of required fields from both the trajectory evaluator and generation evaluator.
Returns:

| Type | Description |
|---|---|
| `set[str]` | The required fields for the data. |
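A short sketch of reading the documented output fields from the combined result, assuming the returned object can be indexed like the flat dictionary described in the output structure above.

```python
# Aggregated fields.
print(result["relevancy_rating"])    # "good", "bad", or "incomplete"
print(result["score"])               # final aggregated score
print(result["global_explanation"])  # human-readable explanation

# Trajectory and generation sub-results.
trajectory = result["langchain_agent_trajectory_accuracy"]
generation = result["generation"]
print(trajectory["score"], generation["score"])
```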