
Metrics

Metrics module for evaluating AI model outputs.

This module provides a comprehensive collection of evaluation metrics for assessing the quality of generated content, retrieval systems, and AI agent tool use. It includes both traditional metrics and LLM-based metrics, as well as integrations with popular evaluation frameworks.

Metric categories:

  • Generation metrics: Evaluate the quality of generated text (completeness, groundedness, redundancy, language consistency, refusal alignment)
  • Retrieval metrics: Assess retrieval system performance (precision, recall, accuracy)
  • Tool use metrics: Evaluate AI agent tool calling and trajectory accuracy
  • Safety metrics: Detect toxicity, bias, PII leakage, misuse, and alignment issues
  • Open-source integrations: Wrappers for RAGAS, DeepEval, and LangChain evaluators

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

Name Type Description
name str

The name of the metric.

required_fields set[str]

The required fields for this metric to evaluate data.

input_type type | None

The type of the input data.

higher_is_better bool

Whether a higher score indicates better quality. Defaults to True.

strict_mode bool

If True, binarizes score to 1.0 or 0.0 before thresholding. Defaults to False.

threshold float

Pass/fail threshold in [0, 1]. Defaults to 0.5.

Example

Adding custom prompts to existing evaluator metrics:

import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator


async def main():
    # Load your dataset (must have actual_output pre-populated)
    dataset = load_simple_qa_dataset()

    # Create an evaluator with the default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure the metric has a name attribute
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with the custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        evaluators=[evaluator],  # Custom prompts applied to each metric
    )
    return results


asyncio.run(main())

aggregation_method property writable

Return the configured aggregation method.

num_judges property writable

Return the configured number of judges.

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

Name Type Description Default
data EvalInput

The input data to check.

required

Returns:

Name Type Description
bool bool

True if the metric can evaluate the data, False otherwise.

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

The data to evaluate the metric on. Can be a single item or a list for batch processing.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

MetricOutput | list[MetricOutput]: A dictionary mapping each metric's namespace to its scores. Returns a list if the input is a list.
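The single-item/batch dispatch described above can be sketched in isolation. This is a simplified illustration of the documented behavior, not the library's actual implementation; `SketchMetric` and its `_evaluate` body are purely hypothetical stand-ins for a subclass's per-item scoring logic:

```python
import asyncio


class SketchMetric:
    """Minimal sketch of batch-aware evaluate() dispatch."""

    async def _evaluate(self, item: dict) -> dict:
        # Per-item scoring; a real subclass would call an LLM judge here.
        return {"sketch_metric": float(item.get("score", 0.0))}

    async def evaluate(self, data):
        # Lists are fanned out concurrently; single items pass through.
        if isinstance(data, list):
            return await asyncio.gather(*(self._evaluate(d) for d in data))
        return await self._evaluate(data)


results = asyncio.run(SketchMetric().evaluate([{"score": 0.4}, {"score": 0.9}]))
print(results)  # [{'sketch_metric': 0.4}, {'sketch_metric': 0.9}]
```

A subclass that can score a whole batch in one model call would override `_evaluate` to accept a list directly, as the description notes.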

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.

is_success(score)

Determine if the score indicates success based on threshold and polarity.

Parameters:

Name Type Description Default
score float

The score to evaluate.

required

Returns:

Name Type Description
bool bool

True if the score meets the success criteria, False otherwise.
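The interaction between threshold and score polarity can be shown with a small stand-alone sketch. This is assumed behavior inferred from the attribute descriptions above (the exact comparison operators in the library may differ):

```python
def is_success(score: float, threshold: float = 0.5, higher_is_better: bool = True) -> bool:
    """Pass/fail decision based on threshold and score polarity."""
    if higher_is_better:
        return score >= threshold  # e.g. relevancy: high scores pass
    return score <= threshold      # e.g. bias: low scores pass


print(is_success(0.8))                          # True
print(is_success(0.8, higher_is_better=False))  # False
```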

DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Answer Relevancy Metric Integration.

This metric uses LLM-as-a-judge to assess whether the output is relevant to the given input.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better answer relevancy.
Cookbook Example

Please refer to example_deepeval_answer_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalAnswerRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalBiasMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Bias Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains racial, political, or other forms of offensive bias.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.
Cookbook Example

Please refer to example_deepeval_bias.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalBiasMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualPrecisionTemplate, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Precision Metric.

Evaluates whether the retrieved contexts that are relevant to the given query are ranked higher than irrelevant ones. A higher score indicates better contextual precision, meaning relevant context chunks appear earlier in the retrieved results.

Available Fields
  • input (str): The query to evaluate the metric.
  • expected_output (str): The expected response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual precision.
Cookbook Example

Please refer to example_deepeval_contextual_precision.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualPrecisionMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualPrecisionTemplate]

The evaluation template to use for the metric. Defaults to ContextualPrecisionTemplate. It is used to generate the reason for the metric.

ContextualPrecisionTemplate
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalContextualRecallMetric(threshold=1.0, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualRecallTemplate, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Recall Metric.

Evaluates the extent to which the retrieved context aligns with the expected output. A higher score indicates better contextual recall, meaning the retrieval system successfully found the information needed to generate the expected response.

Available Fields
  • input (str): The query to evaluate the metric.
  • expected_output (str): The expected response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual recall.
Cookbook Example

Please refer to example_deepeval_contextual_recall.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRecallMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 1.0.

1.0
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualRecallTemplate]

The evaluation template to use for the metric. Defaults to ContextualRecallTemplate.

ContextualRecallTemplate
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Relevancy Metric.

Evaluates the overall relevance of the information presented in the retrieved context for a given query. A higher score indicates better contextual relevancy, meaning the retrieved context chunks contain less irrelevant or tangential information.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual relevancy.
Cookbook Example

Please refer to example_deepeval_contextual_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Faithfulness Metric Integration.

This metric uses LLM-as-a-judge to assess whether the answers rely solely on the retrieved context, without hallucinating or providing misinformation.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better faithfulness.
Cookbook Example

Please refer to example_deepeval_faithfulness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalFaithfulnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class is a wrapper for the DeepEvalGEval class. It is used to wrap the GEval class and provide a unified interface for the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields
  • input (str, optional): The query to evaluate the metric.
  • actual_output (str, optional): The generated response to evaluate the metric.
  • expected_output (str, optional): The expected response to evaluate the metric.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. If a str, it will be converted to a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous) or boolean, depending on the DeepEval GEval configuration.

Initializes the DeepEvalGEvalMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to None. Required if not provided via _defaults.

None
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to None. Required if not provided via _defaults.

None
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
criteria str | None

The criteria to use for the metric. Defaults to None.

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to None.

None
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to None.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
threshold float

The threshold to use for the metric. Defaults to 0.5. Must be between 0.0 and 1.0 inclusive.

0.5
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
strict_mode bool

If True, binarizes score to 1.0 or 0.0. Defaults to False.

False

evaluate(data, temp_fewshot=None, temp_info=None, fewshot_mode='append') async

Evaluate with custom prompt lifecycle support and heterogeneous judges.

Handles three concerns:

1. Runtime prompt parameters (temp_fewshot, temp_info)
2. Heterogeneous judges (judge parameter with different models)
3. Batch processing

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required
temp_fewshot str | None

Runtime fewshot examples. Defaults to None.

None
temp_info str | None

Additional context information. Defaults to None.

None
fewshot_mode Literal['append', 'replace']

How to merge fewshot. Defaults to "append".

'append'

Returns:

Type Description
MetricOutput | list[MetricOutput]

MetricOutput | list[MetricOutput]: Evaluation results with scores namespaced by metric name.
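The append/replace semantics of fewshot_mode can be illustrated with a small sketch. This is an assumption about how runtime few-shot examples merge with a metric's existing additional context, inferred from the parameter descriptions above; the exact separator and merge logic in the library may differ:

```python
def merge_fewshot(existing: str, runtime: str, mode: str = "append") -> str:
    """Merge runtime fewshot examples into the existing prompt context."""
    if mode == "replace":
        # Runtime examples fully supersede the configured ones.
        return runtime
    if mode == "append":
        # Runtime examples are appended after the configured ones.
        return f"{existing}\n\n{runtime}" if existing else runtime
    raise ValueError(f"Unknown fewshot mode: {mode}")


print(merge_fewshot("EXAMPLE A", "EXAMPLE B"))                  # appended
print(merge_fewshot("EXAMPLE A", "EXAMPLE B", mode="replace"))  # EXAMPLE B
```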

get_custom_prompt_base_name()

Get the base name for custom prompt column lookup.

For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.

Returns:

Name Type Description
str str

The base name without 'geval_' prefix (e.g., "completeness" instead of "geval_completeness").

Example

metric.name = "geval_completeness"
metric.get_custom_prompt_base_name()  # -> "completeness"

CSV columns expected:

  • fewshot_completeness
  • fewshot_completeness_mode
  • evaluation_step_completeness
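The prefix stripping this method performs can be reproduced in a few lines. This is an illustrative re-implementation of the documented behavior, not the library's code:

```python
GEVAL_PREFIX = "geval_"


def custom_prompt_base_name(metric_name: str) -> str:
    """Strip the 'geval_' prefix so metric names line up with CSV column conventions."""
    if metric_name.startswith(GEVAL_PREFIX):
        return metric_name[len(GEVAL_PREFIX):]
    return metric_name


base = custom_prompt_base_name("geval_completeness")
print(base)               # completeness
print(f"fewshot_{base}")  # fewshot_completeness
```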

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:

Name Type Description Default
data EvalInput

The metric input.

required

Returns:

Name Type Description
str str

The complete prompt (system + user) as a string.

DeepEvalHallucinationMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Hallucination Metric Integration.

This metric uses LLM-as-a-judge to determine whether the output contains hallucinated or incorrect information based on the retrieved context.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
  • expected_context (str | list[str]): The expected context to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.
Cookbook Example

Please refer to example_deepeval_hallucination.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalHallucinationMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval JSON Correctness Metric Integration.

This metric evaluates whether a response is valid JSON that conforms to a specified schema. It helps ensure that AI responses follow the expected JSON structure.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Binary): 0.0 means the response is not JSON correct according to the schema, 1.0 means the response is JSON correct according to the schema.
Cookbook Example

Please refer to example_deepeval_json_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalJsonCorrectnessMetric class.

Parameters:

Name Type Description Default
expected_schema Type[BaseModel]

The expected schema class (not instance) for the response. Example: ExampleSchema (the class, not an instance).

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If expected_schema is not a valid BaseModel class.
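The core check this metric performs can be approximated with a stdlib-only sketch: parse the response as JSON and verify it against a simple field spec. Note that the real metric accepts a Pydantic BaseModel class and uses an LLM judge; the plain spec dict below is purely illustrative:

```python
import json


def json_correctness(response: str, spec: dict[str, type]) -> float:
    """Return 1.0 if response parses as JSON and matches the field spec, else 0.0."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    for field, expected_type in spec.items():
        if field not in parsed or not isinstance(parsed[field], expected_type):
            return 0.0
    return 1.0


spec = {"name": str, "age": int}
print(json_correctness('{"name": "Ada", "age": 36}', spec))  # 1.0
print(json_correctness('{"name": "Ada"}', spec))             # 0.0 (missing field)
```

This mirrors the binary 0.0/1.0 scoring described above: any parse failure or schema mismatch fails the whole response.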

DeepEvalMetric(metric, name, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

DeepEval Metric.

A wrapper for DeepEval metrics.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str, optional): The generated response to evaluate the metric.
  • expected_output (str, optional): The expected response to evaluate the metric.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. If a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous) or boolean, depending on the wrapped DeepEval metric.

Initializes the DeepEvalMetric class.

Parameters:

Name Type Description Default
metric BaseMetric

The DeepEval metric to wrap.

required
name str

The name of the metric.

required
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
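The num_judges and aggregation_method parameters that recur throughout these metrics can be sketched as follows. This illustrates the two aggregation modes named in this document (average and majority vote); the library's actual implementation and tie-breaking rules may differ:

```python
from statistics import mean


def aggregate(scores: list[float], method: str = "average", threshold: float = 0.5) -> float:
    """Combine scores from repeated judge runs into a single score."""
    if method == "average":
        return mean(scores)
    if method == "majority_vote":
        # Each judge casts a pass/fail vote against the threshold.
        passes = sum(score >= threshold for score in scores)
        return 1.0 if passes > len(scores) / 2 else 0.0
    raise ValueError(f"Unknown aggregation method: {method}")


print(aggregate([0.9, 0.4, 0.8]))                          # mean of the three scores
print(aggregate([0.9, 0.4, 0.8], method="majority_vote"))  # 1.0 (two of three pass)
```

Averaging suits continuous metrics; majority voting suits binary checks such as JSON correctness, which is why DeepEvalJsonCorrectnessMetric defaults to AggregationMethod.MAJORITY_VOTE.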

DeepEvalMetricFactory(name, model, model_credentials, model_config, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, **kwargs)

Bases: DeepEvalMetric, ABC

DeepEval Metric Factory.

Abstract base class for creating DeepEval metrics with a shared model invoker.

Available Fields
  • (Dynamic): Depends on the specific DeepEval metric being created.
Scoring
  • (Dynamic): Depends on the specific DeepEval metric.

Initializes the metric, handling common model invoker creation.

Parameters:

Name Type Description Default
name str

The name for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model identifier or an existing LM invoker instance.

required
model_credentials Optional[str]

Credentials for the model, required if model is a string.

required
model_config Optional[Dict[str, Any]]

Configuration for the model.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
aggregation_method AggregationSelector

The aggregation method to use for repeated-judge evaluation. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
**kwargs

Additional arguments for the specific DeepEval metric constructor.

{}

DeepEvalMisuseMetric(domain, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Misuse Metric Integration.

This metric evaluates whether a response reflects inappropriate misuse of the model within the given domain. It helps ensure that AI responses don't enable harmful or inappropriate uses of the model.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.
Cookbook Example

Please refer to example_deepeval_misuse.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalMisuseMetric class.

Parameters:

Name Type Description Default
domain str

The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If domain is empty or contains invalid values.

DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Non-Advice Metric Integration.

This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.
Cookbook Example

Please refer to example_deepeval_non_advice.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalNonAdviceMetric class.

Parameters:

Name Type Description Default
advice_types List[str]

List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"].

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If advice_types is empty or contains invalid values.
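As a rough illustration of the documented ValueError behavior, constructor validation might look like the sketch below. This is not the library's actual code; the helper name and the allowed-type set are assumptions based on the "Common types" listed above.

```python
# Hypothetical sketch: advice_types must be a non-empty list drawn from a
# known set of advice categories, otherwise a ValueError is raised.
KNOWN_ADVICE_TYPES = {"financial", "medical", "legal", "personal", "investment"}

def validate_advice_types(advice_types: list[str]) -> list[str]:
    """Return advice_types unchanged if valid, otherwise raise ValueError."""
    if not advice_types:
        raise ValueError("advice_types must not be empty")
    invalid = [t for t in advice_types if t not in KNOWN_ADVICE_TYPES]
    if invalid:
        raise ValueError(f"Invalid advice types: {invalid}")
    return list(advice_types)
```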

DeepEvalPIILeakageMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval PII Leakage Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains leaked PII.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.
Cookbook Example

Please refer to example_deepeval_pii_leakage.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPIILeakageMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
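Because this metric scores in the direction of more violations (closer to 1.0 is worse), a pass/fail check against the threshold inverts the usual comparison. A minimal sketch of that direction-aware check (the function name is an assumption, not part of the library's API):

```python
def passes_threshold(score: float, threshold: float = 0.5,
                     higher_is_better: bool = False) -> bool:
    """Direction-aware pass/fail: for PII leakage, lower scores pass."""
    return score >= threshold if higher_is_better else score <= threshold
```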

DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Prompt Alignment Metric Integration.

This metric evaluates whether a response follows the given prompt instructions, helping ensure that AI outputs stay aligned with the prompt template.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.
Cookbook Example

Please refer to example_deepeval_prompt_alignment.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPromptAlignmentMetric class.

Parameters:

Name Type Description Default
prompt_instructions List[str]

A list of strings specifying the instructions you want followed in your prompt template.

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If prompt_instructions is empty or contains invalid values.
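Conceptually, prompt alignment can be pictured as the fraction of instructions the response follows. The toy sketch below makes that idea concrete; the real metric uses an LLM judge, and these names are hypothetical:

```python
def alignment_score(prompt_instructions: list[str],
                    followed: set[str]) -> float:
    """Fraction of prompt instructions judged as followed by the response."""
    if not prompt_instructions:
        raise ValueError("prompt_instructions must not be empty")
    hits = sum(1 for inst in prompt_instructions if inst in followed)
    return hits / len(prompt_instructions)
```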

DeepEvalRoleViolationMetric(role, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Role Violation Metric Integration.

This metric evaluates whether a response violates its assigned role. It helps ensure that AI responses stay within the persona they were given.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.
Cookbook Example

Please refer to example_deepeval_role_violation.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalRoleViolationMetric class.

Parameters:

Name Type Description Default
role str

The role the response is expected to adhere to. Common roles include: "helpful customer assistant", "medical insurance agent".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If role is empty or contains invalid values.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[ToolCall], optional): The tools actually called by the agent. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[ToolCall], optional): The expected tools to be called. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
include_reason bool

Include reasoning in output. Defaults to True.

True
strict_mode bool

Binary mode (0 or 1). Defaults to False.

False
should_exact_match bool

Require exact match of tools. Defaults to False.

False
should_consider_ordering bool

Consider order of tools called. Defaults to False.

False
available_tools list[dict[str, Any]] | None

List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
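The field notes above state that when tools_called is not provided, tool calls are extracted from assistant messages carrying a tool_calls field. A minimal sketch of that extraction for OpenAI-format messages (the helper name and the exact output shape are assumptions):

```python
import json

def extract_tool_calls(trajectory: list[dict]) -> list[dict]:
    """Collect {'name', 'input_parameters'} pairs from assistant messages."""
    calls = []
    for message in trajectory:
        if message.get("role") != "assistant":
            continue
        for tc in message.get("tool_calls") or []:
            fn = tc.get("function", {})
            args = fn.get("arguments", "{}")
            calls.append({
                "name": fn.get("name"),
                # OpenAI serializes tool-call arguments as a JSON string.
                "input_parameters": json.loads(args) if isinstance(args, str) else args,
            })
    return calls
```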

DeepEvalToxicityMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Toxicity Metric Integration.

This metric uses LLM-as-a-judge to assess whether a response contains toxic content.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.
Cookbook Example

Please refer to example_deepeval_toxicity.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalToxicityMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
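When num_judges is greater than 1, the per-judge scores are combined with the chosen aggregation method. A sketch of two common strategies, averaging and majority vote (the implementation details here are assumptions, not the library's code):

```python
from collections import Counter
from statistics import mean

def aggregate_scores(scores: list[float], method: str = "average") -> float:
    """Combine per-judge scores into a single metric score."""
    if method == "average":
        return mean(scores)
    if method == "majority_vote":
        # The most common score wins; ties break toward the first seen.
        return Counter(scores).most_common(1)[0][0]
    raise ValueError(f"Unknown aggregation method: {method}")
```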

GEvalCompletenessMetric(*args, threshold=1.0, **kwargs)

Bases: DeepEvalGEvalMetric

GEval Completeness Metric.

This metric is used to evaluate the completeness of the generated output.

Available Fields
  • query (str): The query to evaluate the completeness of the model's output.
  • generated_response (str): The generated response to evaluate the completeness of the model's output.
  • expected_response (str): The expected response to evaluate the completeness of the model's output.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 1-3 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_completeness.py in the gen-ai-sdk-cookbook repository.

Initializes the GEvalCompletenessMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to "completeness".

required
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to [INPUT, ACTUAL_OUTPUT, EXPECTED_OUTPUT].

required
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

required
criteria str | None

The criteria to use for the metric. Defaults to COMPLETENESS_CRITERIA.

required
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to COMPLETENESS_EVALUATION_STEPS.

required
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to COMPLETENESS_RUBRIC.

required
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

required
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

required
threshold float

The threshold to use for the metric. Defaults to 1.0. Must be between 0.0 and 1.0 inclusive.

1.0
additional_context str | None

Additional context like few-shot examples. Defaults to COMPLETENESS_FEW_SHOT.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

required
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

required
strict_mode bool

If True, binarizes score to 1.0 or 0.0. Defaults to False.

required
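The scoring note above says the native 1-3 rubric value is normalized into [0, 1], and strict_mode binarizes the final score against the threshold. A sketch of those two transformations (function names are assumptions):

```python
def normalize_rubric(rubric_score: float, low: float = 1.0, high: float = 3.0) -> float:
    """Map a native rubric value in [low, high] onto [0, 1]."""
    return (rubric_score - low) / (high - low)

def apply_strict_mode(score: float, threshold: float) -> float:
    """Binarize: scores at or above the threshold become 1.0, else 0.0."""
    return 1.0 if score >= threshold else 0.0
```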

GEvalContextSufficiencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Context Sufficiency Metric.

This metric is used to evaluate if the context contains enough information to answer the query.

Available Fields
  • query (str): The query to evaluate.
  • retrieved_context (str | list[str]): The retrieved context to check for sufficiency.
Scoring
  • 0-1 (Binary): Where 0 means insufficient context and 1 means sufficient context.
Cookbook Example

Please refer to example_geval_context_sufficiency.py in the gen-ai-sdk-cookbook repository.

GEvalGroundednessMetric(*args, threshold=1.0, **kwargs)

Bases: DeepEvalGEvalMetric

GEval Groundedness Metric.

This metric is used to evaluate the groundedness of the generated output.

Available Fields
  • query (str): The query to evaluate the groundedness of the model's output.
  • generated_response (str): The generated response to evaluate the groundedness of the model's output.
  • retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 1-3 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_groundedness.py in the gen-ai-sdk-cookbook repository.

Initializes the GEvalGroundednessMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to "groundedness".

required
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to [INPUT, ACTUAL_OUTPUT, RETRIEVAL_CONTEXT].

required
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

required
criteria str | None

The criteria to use for the metric. Defaults to GROUNDEDNESS_CRITERIA.

required
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to GROUNDEDNESS_EVALUATION_STEPS.

required
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to GROUNDEDNESS_RUBRIC.

required
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

required
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

required
threshold float

The threshold to use for the metric. Defaults to 1.0. Must be between 0.0 and 1.0 inclusive.

1.0
additional_context str | None

Additional context like few-shot examples. Defaults to GROUNDEDNESS_FEW_SHOT.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

required
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

required
strict_mode bool

If True, binarizes score to 1.0 or 0.0. Defaults to False.

required

GEvalLanguageConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Language Consistency Metric.

This metric is used to evaluate whether the generated response is written in the same language as the query.

Available Fields
  • query (str): The query whose language the response is checked against.
  • generated_response (str): The generated response to check for language consistency.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 0-1 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_language_consistency.py in the gen-ai-sdk-cookbook repository.
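For intuition only, a crude script-based check hints at what language consistency means; note this distinguishes writing scripts, not languages (e.g. it cannot tell English from French), whereas the actual metric uses an LLM judge. All names below are hypothetical:

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Return the most common Unicode script prefix among letters in text."""
    scripts = [unicodedata.name(ch, "UNKNOWN").split()[0]
               for ch in text if ch.isalpha()]
    return max(set(scripts), key=scripts.count) if scripts else "UNKNOWN"

def same_language_naive(query: str, response: str) -> bool:
    """Naive proxy: do query and response share a dominant script?"""
    return dominant_script(query) == dominant_script(response)
```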

GEvalRedundancyMetric(*args, threshold=0.5, **kwargs)

Bases: DeepEvalGEvalMetric

GEval Redundancy Metric.

This metric is used to evaluate the redundancy of the generated output.

Available Fields
  • query (str): The query to evaluate the redundancy of the model's output.
  • generated_response (str): The generated response to evaluate the redundancy of the model's output.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 1-3 rubric value is stored in the rubric_score field. A lower score is better (higher_is_better=False).
Cookbook Example

Please refer to example_geval_redundancy.py in the gen-ai-sdk-cookbook repository.

Initializes GEvalRedundancyMetric.

Parameters:

Name Type Description Default
*args

Positional arguments passed to :class:DeepEvalGEvalMetric.

()
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
**kwargs

Keyword arguments passed to :class:DeepEvalGEvalMetric:
  • name (str | None, optional): The name of the metric. Defaults to None. Required if not provided via _defaults.
  • evaluation_params (list[LLMTestCaseParams] | None, optional): The evaluation parameters. Defaults to None. Required if not provided via _defaults.
  • model (str | ModelId | BaseLMInvoker, optional): The model to use for the metric. Defaults to DefaultValues.MODEL.
  • criteria (str | None, optional): The criteria to use for the metric. Defaults to None.
  • evaluation_steps (list[str] | None, optional): The evaluation steps to use for the metric. Defaults to None.
  • rubric (list[Rubric] | None, optional): The rubric to use for the metric. Defaults to None.
  • model_credentials (str | None, optional): The model credentials to use for the metric. Defaults to None. Required when model is a string.
  • model_config (dict[str, Any] | None, optional): The model config to use for the metric. Defaults to None.
  • additional_context (str | None, optional): Additional context like few-shot examples. Defaults to None.
  • batch_status_check_interval (float, optional): Time between batch status checks in seconds. Defaults to 30.0.
  • batch_max_iterations (int, optional): Maximum number of status check iterations before timeout. Defaults to 120.
  • strict_mode (bool, optional): If True, binarizes score to 1.0 or 0.0. Defaults to False.

{}

GEvalRefusalAlignmentMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Refusal Alignment Metric.

This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.

Available Fields
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • generated_response (str): The generated response to evaluate the metric.
  • is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 0-1 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_refusal_alignment.py in the gen-ai-sdk-cookbook repository.

GEvalRefusalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Refusal Metric.

This metric is used to predict whether the expected response to the query is a refusal.

Available Fields
  • query (str): The query associated with the expected response.
  • expected_response (str): The expected response to classify as a refusal or not.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 0-1 rubric value is stored in the rubric_score field. A higher score is better (higher_is_better=True).
Cookbook Example

Please refer to example_geval_refusal.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationCoherenceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Coherence metric.

This metric is used to evaluate the coherence quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better coherence.
Cookbook Example

Please refer to example_geval_summarization_coherence.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationConsistencyMetric(*args, threshold=1, **kwargs)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Consistency metric.

This metric is used to evaluate factual consistency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better consistency.
Cookbook Example

Please refer to example_geval_summarization_consistency.py in the gen-ai-sdk-cookbook repository.

Initializes GEvalSummarizationConsistencyMetric.

Parameters:

Name Type Description Default
*args

Positional arguments passed to :class:GEvalSummarizationBaseMetric.

()
threshold float

The threshold to use for the metric. Defaults to 1.

1
**kwargs

Keyword arguments passed to :class:GEvalSummarizationBaseMetric.

{}

GEvalSummarizationFluencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Fluency metric.

This metric is used to evaluate fluency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better fluency.
Cookbook Example

Please refer to example_geval_summarization_fluency.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationRelevanceMetric(*args, threshold=1, **kwargs)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Relevance metric.

This metric is used to evaluate the relevance quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better relevance.
Cookbook Example

Please refer to example_geval_summarization_relevance.py in the gen-ai-sdk-cookbook repository.

Initializes GEvalSummarizationRelevanceMetric.

Parameters:

Name Type Description Default
*args

Positional arguments passed to :class:GEvalSummarizationBaseMetric.

()
threshold float

The threshold to use for the metric. Defaults to 1.

1
**kwargs

Keyword arguments passed to :class:GEvalSummarizationBaseMetric.

{}

LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

A multi-purpose LM-based metric class.

This class provides a general-purpose way to build custom LM-based metrics: supply a response schema, a prompt builder, a model, and model credentials, and it handles prompting the judge and parsing its output into a score.

Available Fields
  • (Dynamic): Depends on the prompt_builder and specific metric implementation.
Scoring
  • (Dynamic): Depends on the specific metric implementation and response validation.

Initialize the LMBasedMetric class.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
response_schema ResponseSchema

The response schema to use for the metric.

required
prompt_builder PromptBuilder

The prompt builder to use for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
parse_response_fn Callable[[str | LMOutput], MetricOutput] | None

The function used to parse the LM response into a metric output. Defaults to the built-in parser.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
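A custom parse_response_fn typically turns the judge's raw text into a structured result. A minimal sketch for a JSON-formatted reply (the exact MetricOutput shape expected by the library is an assumption here):

```python
import json

def parse_response(raw_text: str) -> dict:
    """Hypothetical parse_response_fn: read a JSON judge reply into a result dict."""
    data = json.loads(raw_text)
    return {"score": float(data["score"]), "reason": data.get("reason", "")}
```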

LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: LangChainAgentEvalsMetric

LangChain AgentEvals LLM as a Judge Metric.

A metric that uses LangChain AgentEvals with an LLM as the judge to evaluate agent trajectories.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge based on the trajectory.

Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

LangChainAgentEvalsMetric(name, evaluator, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

LangChain AgentEvals Metric.

A metric that uses LangChain AgentEvals to evaluate an agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score based on the trajectory.

Initialize the LangChainAgentEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator

The evaluator to use.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainAgentTrajectoryAccuracyMetric(model=DefaultValues.MODEL, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainAgentEvalsLLMAsAJudgeMetric

LangChain Agent Trajectory Accuracy Metric.

A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Ordinal): Scale where 0.0 is bad, 0.5 is incomplete, and 1.0 is good.
Cookbook Example

Please refer to example_langchain_agent_trajectory_accuracy.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainAgentTrajectoryAccuracyMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use. Defaults to DefaultValues.MODEL.

MODEL
prompt str | None

The prompt to use. Defaults to None.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
use_reference bool

If True, uses the expected agent trajectory to evaluate the trajectory accuracy. Defaults to True. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used to evaluate the trajectory accuracy. If custom prompt is provided, this parameter will be ignored.

True
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
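The real metric delegates trajectory comparison to an LLM judge, but the idea can be illustrated with a strict rule-based stand-in that scores the fraction of expected tool calls matched in order (illustrative only; the "tool" key and function name are assumptions, not the gllm_evals implementation):

```python
from typing import Any

# Illustrative rule-based stand-in; the real metric uses an LLM judge.


def tool_call_match_ratio(
    agent_trajectory: list[dict[str, Any]],
    expected_agent_trajectory: list[dict[str, Any]],
) -> float:
    """Fraction of expected steps whose tool name matches the actual step."""
    if not expected_agent_trajectory:
        return 1.0
    matches = sum(
        1
        for actual, expected in zip(agent_trajectory, expected_agent_trajectory)
        if actual.get("tool") == expected.get("tool")
    )
    return matches / len(expected_agent_trajectory)


actual = [{"tool": "search"}, {"tool": "calculator"}]
expected = [{"tool": "search"}, {"tool": "summarize"}]
print(tool_call_match_ratio(actual, expected))  # 0.5
```

An LLM judge generalizes this: it can credit semantically equivalent steps and partial progress, which is what the 0.0/0.5/1.0 ordinal scale above captures.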

LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Conciseness Metric.

A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.

Available Fields
  • input (str): The input to evaluate the conciseness of.
  • actual_output (str): The actual output to evaluate the conciseness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is not concise and 1.0 is concise.
Cookbook Example

Please refer to example_langchain_conciseness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainConcisenessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Correctness Metric.

A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.

Available Fields
  • input (str): The query to evaluate the correctness of.
  • actual_output (str): The generated response to evaluate the correctness of.
  • expected_output (str): The expected response to evaluate the correctness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is incorrect and 1.0 is correct.
Cookbook Example

Please refer to example_langchain_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainCorrectnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Groundedness Metric.

A metric that uses LangChain and OpenEvals to evaluate the groundedness of the LLM.

Available Fields
  • generated_response (str): The generated response to evaluate the groundedness of.
  • retrieved_context (str | list[str]): The retrieved context to evaluate the groundedness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is not grounded and 1.0 is grounded.
Cookbook Example

Please refer to example_langchain_groundedness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainGroundednessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Hallucination Metric.

A metric that uses LangChain and OpenEvals to evaluate the hallucination of the LLM.

Available Fields
  • input (str): The query to evaluate the hallucination of.
  • actual_output (str): The generated response to evaluate the hallucination of.
  • expected_context (str): The expected retrieved context to evaluate the hallucination of.
  • expected_output (str): Additional information to help the model evaluate the hallucination.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is no hallucination and 1.0 is hallucination.
Cookbook Example

Please refer to example_langchain_hallucination.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHallucinationMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Helpfulness Metric.

A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM.

Available Fields
  • input (str): The query to evaluate the helpfulness of.
  • actual_output (str): The generated response to evaluate the helpfulness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is not helpful and 1.0 is helpful.
Cookbook Example

Please refer to example_langchain_helpfulness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHelpfulnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: LangChainOpenEvalsMetric

LangChain OpenEvals LLM as a Judge Metric.

A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | None, optional): The expected response / reference outputs to evaluate.
  • expected_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge.

Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt. It can be a string template, a LangChain prompt template, or a callable that returns a list of chat messages.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
system str | None

Optional system message to prepend to the prompt.

None
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, checks for custom prompts and processes accordingly. Currently processes items individually; batch optimization may be added in a future release.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

LangChainOpenEvalsMetric(name, evaluator, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

LangChain OpenEvals Metric.

A metric that uses LangChain and OpenEvals.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | None, optional): The expected response / reference outputs to evaluate.
  • expected_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): The score depends on the specific OpenEvals metric.

Initialize the LangChainOpenEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator | Callable[..., Awaitable[Any]]

The evaluator to use.

required
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

PyTrecMetric(metrics=None, k=20)

Bases: BaseMetric

PyTrec Metric.

A wrapper for pytrec_eval to evaluate common Information Retrieval (IR) metrics. This metric lets you compute standard IR scores such as NDCG, MAP, and reciprocal rank (MRR), based on retrieved chunks and ground truth chunk IDs.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better retrieval performance.
Cookbook Example

Please refer to example_pytrec_metric.py in the gen-ai-sdk-cookbook repository.

Initializes the PyTrecMetric.

Parameters:

Name Type Description Default
metrics list[PyTrecEvalMetric | str] | set[PyTrecEvalMetric | str] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
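PyTrecMetric delegates the computation to pytrec_eval, but the underlying metrics are simple to state. A self-contained sketch of reciprocal rank, the per-query quantity behind MRR, using the same field shapes as above (illustrative; independent of pytrec_eval):

```python
# Illustrative sketch of reciprocal rank; PyTrecMetric uses pytrec_eval instead.


def reciprocal_rank(
    retrieved_chunks: dict[str, float],
    ground_truth_chunk_ids: list[str],
) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    ranked = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)
    relevant = set(ground_truth_chunk_ids)
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


print(reciprocal_rank({"c1": 0.2, "c2": 0.9, "c3": 0.5}, ["c3"]))  # 0.5
```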

RAGASMetric(metric, name=None, callbacks=None, timeout=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

RAGAS Metric.

RAGAS is a metric for evaluating the quality of RAG systems.

Available Fields
  • input (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
  • actual_output (str, optional): The generated response to evaluate the metric. Similar to response in SingleTurnSample.
  • expected_output (str, optional): The expected response to evaluate the metric. Similar to reference in SingleTurnSample.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a list with a single element.
  • rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in SingleTurnSample.
Scoring
  • 0.0-1.0 (Continuous): A score evaluating the RAG aspect being tested.

Initialize the RAGASMetric.

Parameters:

Name Type Description Default
metric SingleTurnMetric

The Ragas metric to use.

required
name str | None

The name of the metric. Defaults to the name of the wrapped Ragas metric.

None
callbacks Callbacks

The callbacks to use. Default is None.

None
timeout int | None

The timeout for the metric. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient parallel processing when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
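The field correspondence listed under Available Fields can be sketched as a plain dict translation, including the promotion of bare string contexts to single-element lists (a simplified illustration of the documented mapping, not the library's code):

```python
from typing import Any

# Illustrative sketch of the documented EvalInput -> SingleTurnSample mapping.
FIELD_MAP = {
    "input": "user_input",
    "actual_output": "response",
    "expected_output": "reference",
    "expected_context": "reference_contexts",
    "retrieved_context": "retrieved_contexts",
}

LIST_FIELDS = {"reference_contexts", "retrieved_contexts"}


def to_single_turn_sample(data: dict[str, Any]) -> dict[str, Any]:
    """Translate EvalInput-style keys to SingleTurnSample-style keys."""
    sample: dict[str, Any] = {}
    for src, dst in FIELD_MAP.items():
        if src not in data:
            continue
        value = data[src]
        if dst in LIST_FIELDS and isinstance(value, str):
            value = [value]  # promote a bare string to a one-element list
        sample[dst] = value
    return sample


print(to_single_turn_sample({"input": "Q?", "retrieved_context": "ctx"}))
# {'user_input': 'Q?', 'retrieved_contexts': ['ctx']}
```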

RagasContextPrecisionWithoutReference(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Precision Metric.

Measures the proportion of relevant chunks in the retrieved contexts without requiring a ground truth reference. It evaluates whether the retrieved context chunks are actually useful for generating the provided response to the user's query.

Available Fields
  • input (str): The query to recall the context for.
  • actual_output (str): The generated response to recall the context for.
  • retrieved_context (list[str] | str): The retrieved contexts to recall the context for.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context precision.
Cookbook Example

Please refer to example_ragas_context_precision.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextPrecisionWithoutReference metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
**kwargs

Additional keyword arguments to pass to the underlying Ragas context precision metric.

{}
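Conceptually, context precision rewards relevant chunks appearing early in the ranking. Given per-chunk relevance judgments, the standard formula averages precision@k over the relevant positions; a rule-based sketch (the real metric uses an LLM to judge each chunk's relevance):

```python
# Illustrative sketch of the context precision formula; relevance judgments
# come from an LLM judge in the real metric.


def context_precision(relevance: list[int]) -> float:
    """Mean of precision@k over positions where the chunk is relevant.

    relevance[i] is 1 if the i-th retrieved chunk is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / total_relevant


print(context_precision([1, 0, 1]))  # (1/1 + 2/3) / 2 = 0.833...
```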

RagasContextRecall(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Recall Metric.

Measures how many of the relevant documents (or pieces of information) needed to answer the query were successfully retrieved. It evaluates the retrieval system's ability to find all the necessary context based on the generated response and the expected response.

Available Fields
  • input (str): The query to recall the context for.
  • actual_output (str): The generated response to recall the context for.
  • expected_output (str): The expected response to recall the context for.
  • retrieved_context (list[str] | str): The retrieved contexts to recall the context for.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context recall.
Cookbook Example

Please refer to example_ragas_context_recall.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextRecall metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
**kwargs

Additional keyword arguments to pass to the RagasContextRecall metric.

{}
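Context recall asks the converse question: how much of the expected answer is supported by what was retrieved. Treating the expected response as a set of atomic claims, a set-based sketch (illustrative; the real metric extracts and attributes claims with an LLM):

```python
# Illustrative set-based sketch; claim extraction is LLM-based in the real metric.


def context_recall(reference_claims: set[str], retrieved_claims: set[str]) -> float:
    """Fraction of reference claims supported by the retrieved context."""
    if not reference_claims:
        return 0.0
    return len(reference_claims & retrieved_claims) / len(reference_claims)


reference = {"Paris is in France", "Paris hosted the 2024 Olympics"}
retrieved = {"Paris is in France"}
print(context_recall(reference, retrieved))  # 0.5
```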

RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None, **kwargs)

Bases: RAGASMetric

RAGAS Factual Correctness metric.

This metric evaluates the factual accuracy of the generated response against the reference.

Available Fields
  • input (str): The query.
  • actual_output (str): The generated response.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better factual correctness.
Cookbook Example

Please refer to example_ragas_factual_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasFactualCorrectness metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
**kwargs

Additional keyword arguments to pass to the RagasFactualCorrectness metric.

{}
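Factual correctness is commonly scored as an F1 over atomic claims: precision penalizes unsupported claims in the response, recall penalizes reference claims the response misses. A set-based sketch (illustrative; the real metric decomposes text into claims with an LLM):

```python
# Illustrative claim-level F1 sketch; claim decomposition is LLM-based
# in the real metric.


def factual_correctness_f1(
    response_claims: set[str], reference_claims: set[str]
) -> float:
    """F1 over atomic claims: harmonic mean of claim precision and recall."""
    true_positives = len(response_claims & reference_claims)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(response_claims)
    recall = true_positives / len(reference_claims)
    return 2 * precision * recall / (precision + recall)


response = {"A", "B", "C"}
reference = {"A", "B", "D"}
print(factual_correctness_f1(response, reference))  # p = r = 2/3 -> 0.666...
```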

TopKAccuracy(k=20)

Bases: BaseMetric

Top-K Accuracy Metric.

Evaluates whether the ground truth chunk IDs are present within the top K retrieved chunks. This is a boolean-style hit/miss metric averaged over the dataset; a score of 1.0 means the relevant document was always found in the top K results.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better top-k accuracy.
Cookbook Example

Please refer to example_top_k_accuracy.py in the gen-ai-sdk-cookbook repository.

Initializes the TopKAccuracy.

Parameters:

Name Type Description Default
k list[int] | int

The number of retrieved chunks to consider. Defaults to 20.

20

top_k_accuracy(qrels, results)

Evaluates the top k accuracy.

Parameters:

Name Type Description Default
qrels dict[str, dict[str, int]]

The ground truth relevance of the retrieved chunks. Each chunk id maps to one of two values: 1 if the chunk is relevant to the query, 0 if it is not.

required
results dict[str, dict[str, float]]

The retrieved chunks with their similarity score.

required

Returns:

Type Description
dict[str, float]

dict[str, float]: The top k accuracy.

Example
qrels = {
    "q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
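Continuing the example above, the hit@k computation can be sketched as follows: a query scores 1.0 when any of its relevant chunks appears in the top k results, and the scores are averaged over queries (a simplified single-k sketch that returns one float, not the library's per-metric dict):

```python
# Illustrative single-k sketch; the library's top_k_accuracy returns
# a dict of scores keyed by metric name.


def hit_at_k(
    qrels: dict[str, dict[str, int]],
    results: dict[str, dict[str, float]],
    k: int = 20,
) -> float:
    """Average hit@k: 1.0 per query if any relevant chunk is in the top k."""
    if not qrels:
        return 0.0
    hits = 0
    for query_id, relevance in qrels.items():
        scores = results.get(query_id, {})
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_k = set(ranked[:k])
        if any(chunk_id in top_k for chunk_id, rel in relevance.items() if rel == 1):
            hits += 1
    return hits / len(qrels)


qrels = {"q1": {"chunk1": 1, "chunk2": 1}}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
print(hit_at_k(qrels, results, k=2))  # 1.0
```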