
Metrics

Metrics module for evaluating AI model outputs.

This module provides a comprehensive collection of evaluation metrics for assessing the quality of generated content, retrieval systems, and AI agent tool use. It includes both traditional metrics and LLM-based metrics, as well as integrations with popular evaluation frameworks.

Metric categories:

  • Generation metrics: Evaluate the quality of generated text (completeness, groundedness, redundancy, language consistency, refusal alignment)
  • Retrieval metrics: Assess retrieval system performance (precision, recall, accuracy)
  • Tool use metrics: Evaluate AI agent tool calling and trajectory accuracy
  • Safety metrics: Detect toxicity, bias, PII leakage, misuse, and alignment issues
  • Open-source integrations: Wrappers for RAGAS, DeepEval, and LangChain evaluators

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

Name Type Description
name str

The name of the metric.

required_fields set[str]

The required fields for this metric to evaluate data.

input_type type | None

The type of the input data.

higher_is_better bool

Whether a higher score indicates better quality. Defaults to True.

strict_mode bool

If True, binarizes score to 1.0 or 0.0 before thresholding. Defaults to False.

threshold float

Pass/fail threshold in [0, 1]. Defaults to 0.5.

Example

Adding custom prompts to existing evaluator metrics:

import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator


async def main():
    # Load your dataset (must have actual_output pre-populated)
    dataset = load_simple_qa_dataset()

    # Create an evaluator with the default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure the metric has a name attribute
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with the custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        evaluators=[evaluator],  # Custom prompts applied to each metric
    )
    return results


asyncio.run(main())

aggregation_method property writable

Return the configured aggregation method.

num_judges property writable

Return the configured number of judges.

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

Name Type Description Default
data EvalInput

The input data to check.

required

Returns:

Name Type Description
bool bool

True if the metric can evaluate the data, False otherwise.

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

The data to evaluate the metric on. Can be a single item or a list for batch processing.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

MetricOutput | list[MetricOutput]: A dictionary mapping each metric's namespace to its scores. Returns a list if the input is a list.
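The single-item/batch dispatch described above can be sketched in isolation. This is a simplified illustration of the documented behavior, not the library's actual implementation; `SketchMetric` and its `_evaluate` body are purely hypothetical stand-ins for a subclass's per-item scoring logic:

```python
import asyncio


class SketchMetric:
    """Minimal sketch of batch-aware evaluate() dispatch."""

    async def _evaluate(self, item: dict) -> dict:
        # Per-item scoring; a real subclass would call an LLM judge here.
        return {"sketch_metric": float(item.get("score", 0.0))}

    async def evaluate(self, data):
        # Lists are fanned out concurrently; single items pass through.
        if isinstance(data, list):
            return await asyncio.gather(*(self._evaluate(d) for d in data))
        return await self._evaluate(data)


results = asyncio.run(SketchMetric().evaluate([{"score": 0.4}, {"score": 0.9}]))
print(results)  # [{'sketch_metric': 0.4}, {'sketch_metric': 0.9}]
```

A subclass that can score a whole batch in one model call would override `_evaluate` to accept a list directly, as the description notes.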

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.

is_success(score)

Determine if the score indicates success based on threshold and polarity.

Parameters:

Name Type Description Default
score float

The score to evaluate.

required

Returns:

Name Type Description
bool bool

True if the score meets the success criteria, False otherwise.
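The interaction between threshold and score polarity can be shown with a small stand-alone sketch. This is assumed behavior inferred from the attribute descriptions above (the exact comparison operators in the library may differ):

```python
def is_success(score: float, threshold: float = 0.5, higher_is_better: bool = True) -> bool:
    """Pass/fail decision based on threshold and score polarity."""
    if higher_is_better:
        return score >= threshold  # e.g. relevancy: high scores pass
    return score <= threshold      # e.g. bias: low scores pass


print(is_success(0.8))                          # True
print(is_success(0.8, higher_is_better=False))  # False
```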

DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Answer Relevancy Metric Integration.

This metric uses LLM-as-a-judge to assess whether the output is relevant to the given input.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better answer relevancy.
Cookbook Example

Please refer to example_deepeval_answer_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalAnswerRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalBiasMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Bias Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains racial, political, or other forms of offensive bias.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.
Cookbook Example

Please refer to example_deepeval_bias.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalBiasMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualPrecisionTemplate, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Precision Metric.

Evaluates whether the retrieved contexts that are relevant to the given query are ranked higher than irrelevant ones. A higher score indicates better contextual precision, meaning relevant context chunks appear earlier in the retrieved results.

Available Fields
  • input (str): The query to evaluate the metric.
  • expected_output (str): The expected response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual precision.
Cookbook Example

Please refer to example_deepeval_contextual_precision.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualPrecisionMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualPrecisionTemplate]

The evaluation template to use for the metric. Defaults to ContextualPrecisionTemplate. It is used to generate the reason for the metric.

ContextualPrecisionTemplate
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalContextualRecallMetric(threshold=1.0, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualRecallTemplate, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Recall Metric.

Evaluates the extent to which the retrieved context aligns with the expected output. A higher score indicates better contextual recall, meaning the retrieval system successfully found the information needed to generate the expected response.

Available Fields
  • input (str): The query to evaluate the metric.
  • expected_output (str): The expected response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual recall.
Cookbook Example

Please refer to example_deepeval_contextual_recall.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRecallMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 1.0.

1.0
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualRecallTemplate]

The evaluation template to use for the metric. Defaults to ContextualRecallTemplate.

ContextualRecallTemplate
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Relevancy Metric.

Evaluates the overall relevance of the information presented in the retrieved context for a given query. A higher score indicates better contextual relevancy, meaning the retrieved context chunks contain less irrelevant or tangential information.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual relevancy.
Cookbook Example

Please refer to example_deepeval_contextual_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Faithfulness Metric Integration.

This metric uses LLM-as-a-judge to assess whether the answers rely solely on the retrieved context, without hallucinating or providing misinformation.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better faithfulness.
Cookbook Example

Please refer to example_deepeval_faithfulness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalFaithfulnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class is a wrapper for the DeepEvalGEval class. It is used to wrap the GEval class and provide a unified interface for the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields
  • input (str, optional): The query to evaluate the metric.
  • actual_output (str, optional): The generated response to evaluate the metric.
  • expected_output (str, optional): The expected response to evaluate the metric.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. If a str, it will be converted to a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous) or boolean, depending on the DeepEval GEval configuration.

Initializes the DeepEvalGEvalMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to None. Required if not provided via _defaults.

None
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to None. Required if not provided via _defaults.

None
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
criteria str | None

The criteria to use for the metric. Defaults to None.

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to None.

None
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to None.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
threshold float

The threshold to use for the metric. Defaults to 0.5. Must be between 0.0 and 1.0 inclusive.

0.5
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
strict_mode bool

If True, binarizes score to 1.0 or 0.0. Defaults to False.

False

evaluate(data, temp_fewshot=None, temp_info=None, fewshot_mode='append') async

Evaluate with custom prompt lifecycle support and heterogeneous judges.

Handles three concerns:

1. Runtime prompt parameters (temp_fewshot, temp_info)
2. Heterogeneous judges (judge parameter with different models)
3. Batch processing

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required
temp_fewshot str | None

Runtime fewshot examples. Defaults to None.

None
temp_info str | None

Additional context information. Defaults to None.

None
fewshot_mode Literal['append', 'replace']

How to merge fewshot. Defaults to "append".

'append'

Returns:

Type Description
MetricOutput | list[MetricOutput]

MetricOutput | list[MetricOutput]: Evaluation results with scores namespaced by metric name.
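The append/replace semantics of fewshot_mode can be illustrated with a small sketch. This is an assumption about how runtime few-shot examples merge with a metric's existing additional context, inferred from the parameter descriptions above; the exact separator and merge logic in the library may differ:

```python
def merge_fewshot(existing: str, runtime: str, mode: str = "append") -> str:
    """Merge runtime fewshot examples into the existing prompt context."""
    if mode == "replace":
        # Runtime examples fully supersede the configured ones.
        return runtime
    if mode == "append":
        # Runtime examples are appended after the configured ones.
        return f"{existing}\n\n{runtime}" if existing else runtime
    raise ValueError(f"Unknown fewshot mode: {mode}")


print(merge_fewshot("EXAMPLE A", "EXAMPLE B"))                  # appended
print(merge_fewshot("EXAMPLE A", "EXAMPLE B", mode="replace"))  # EXAMPLE B
```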

get_custom_prompt_base_name()

Get the base name for custom prompt column lookup.

For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.

Returns:

Name Type Description
str str

The base name without 'geval_' prefix (e.g., "completeness" instead of "geval_completeness").

Example

metric.name = "geval_completeness"
metric.get_custom_prompt_base_name()  # -> "completeness"

CSV columns expected:

  • fewshot_completeness
  • fewshot_completeness_mode
  • evaluation_step_completeness
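The prefix stripping this method performs can be reproduced in a few lines. This is an illustrative re-implementation of the documented behavior, not the library's code:

```python
GEVAL_PREFIX = "geval_"


def custom_prompt_base_name(metric_name: str) -> str:
    """Strip the 'geval_' prefix so metric names line up with CSV column conventions."""
    if metric_name.startswith(GEVAL_PREFIX):
        return metric_name[len(GEVAL_PREFIX):]
    return metric_name


base = custom_prompt_base_name("geval_completeness")
print(base)               # completeness
print(f"fewshot_{base}")  # fewshot_completeness
```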

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:

Name Type Description Default
data EvalInput

The metric input.

required

Returns:

Name Type Description
str str

The complete prompt (system + user) as a string.

DeepEvalHallucinationMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Hallucination Metric Integration.

This metric uses LLM-as-a-judge to determine whether the output contains hallucinated or incorrect information based on the retrieved context.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
  • expected_context (str | list[str]): The expected context to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.
Cookbook Example

Please refer to example_deepeval_hallucination.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalHallucinationMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval JSON Correctness Metric Integration.

This metric evaluates whether a response is valid JSON that conforms to a specified schema. It helps ensure that AI responses follow the expected JSON structure.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Binary): 0.0 means the response is not JSON correct according to the schema, 1.0 means the response is JSON correct according to the schema.
Cookbook Example

Please refer to example_deepeval_json_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalJsonCorrectnessMetric class.

Parameters:

Name Type Description Default
expected_schema Type[BaseModel]

The expected schema class (not instance) for the response. Example: ExampleSchema (the class, not an instance).

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If expected_schema is not a valid BaseModel class.
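The core check this metric performs can be approximated with a stdlib-only sketch: parse the response as JSON and verify it against a simple field spec. Note that the real metric accepts a Pydantic BaseModel class and uses an LLM judge; the plain spec dict below is purely illustrative:

```python
import json


def json_correctness(response: str, spec: dict[str, type]) -> float:
    """Return 1.0 if response parses as JSON and matches the field spec, else 0.0."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    for field, expected_type in spec.items():
        if field not in parsed or not isinstance(parsed[field], expected_type):
            return 0.0
    return 1.0


spec = {"name": str, "age": int}
print(json_correctness('{"name": "Ada", "age": 36}', spec))  # 1.0
print(json_correctness('{"name": "Ada"}', spec))             # 0.0 (missing field)
```

This mirrors the binary 0.0/1.0 scoring described above: any parse failure or schema mismatch fails the whole response.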

DeepEvalMetric(metric, name, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

DeepEval Metric.

A wrapper for DeepEval metrics.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str, optional): The generated response to evaluate the metric.
  • expected_output (str, optional): The expected response to evaluate the metric.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. If a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous) or boolean, depending on the wrapped DeepEval metric.

Initializes the DeepEvalMetric class.

Parameters:

Name Type Description Default
metric BaseMetric

The DeepEval metric to wrap.

required
name str

The name of the metric.

required
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
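The num_judges and aggregation_method parameters that recur throughout these metrics can be sketched as follows. This illustrates the two aggregation modes named in this document (average and majority vote); the library's actual implementation and tie-breaking rules may differ:

```python
from statistics import mean


def aggregate(scores: list[float], method: str = "average", threshold: float = 0.5) -> float:
    """Combine scores from repeated judge runs into a single score."""
    if method == "average":
        return mean(scores)
    if method == "majority_vote":
        # Each judge casts a pass/fail vote against the threshold.
        passes = sum(score >= threshold for score in scores)
        return 1.0 if passes > len(scores) / 2 else 0.0
    raise ValueError(f"Unknown aggregation method: {method}")


print(aggregate([0.9, 0.4, 0.8]))                          # mean of the three scores
print(aggregate([0.9, 0.4, 0.8], method="majority_vote"))  # 1.0 (two of three pass)
```

Averaging suits continuous metrics; majority voting suits binary checks such as JSON correctness, which is why DeepEvalJsonCorrectnessMetric defaults to AggregationMethod.MAJORITY_VOTE.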

DeepEvalMetricFactory(name, model, model_credentials, model_config, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, **kwargs)

Bases: DeepEvalMetric, ABC

DeepEval Metric Factory.

Abstract base class for creating DeepEval metrics with a shared model invoker.

Available Fields
  • (Dynamic): Depends on the specific DeepEval metric being created.
Scoring
  • (Dynamic): Depends on the specific DeepEval metric.

Initializes the metric, handling common model invoker creation.

Parameters:

Name Type Description Default
name str

The name for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model identifier or an existing LM invoker instance.

required
model_credentials Optional[str]

Credentials for the model, required if model is a string.

required
model_config Optional[Dict[str, Any]]

Configuration for the model.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
aggregation_method AggregationSelector

The aggregation method to use for repeated-judge evaluation. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
**kwargs

Additional arguments for the specific DeepEval metric constructor.

{}

DeepEvalMisuseMetric(domain, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Misuse Metric Integration.

This metric evaluates whether a response reflects inappropriate misuse of the model within the given domain. It helps ensure that AI responses don't enable harmful or inappropriate uses of the model.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.
Cookbook Example

Please refer to example_deepeval_misuse.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalMisuseMetric class.

Parameters:

Name Type Description Default
domain str

The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If domain is empty or contains invalid values.

DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Non-Advice Metric Integration.

This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.
Cookbook Example

Please refer to example_deepeval_non_advice.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalNonAdviceMetric class.

Parameters:

Name Type Description Default
advice_types List[str]

List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"].

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If advice_types is empty or contains invalid values.
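As a rough illustration of the documented ValueError behavior, constructor validation might look like the sketch below. This is not the library's actual code; the helper name and the allowed-type set are assumptions based on the "Common types" listed above.

```python
# Hypothetical sketch: advice_types must be a non-empty list drawn from a
# known set of advice categories, otherwise a ValueError is raised.
KNOWN_ADVICE_TYPES = {"financial", "medical", "legal", "personal", "investment"}

def validate_advice_types(advice_types: list[str]) -> list[str]:
    """Return advice_types unchanged if valid, otherwise raise ValueError."""
    if not advice_types:
        raise ValueError("advice_types must not be empty")
    invalid = [t for t in advice_types if t not in KNOWN_ADVICE_TYPES]
    if invalid:
        raise ValueError(f"Invalid advice types: {invalid}")
    return list(advice_types)
```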

DeepEvalPIILeakageMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval PII Leakage Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains leaked PII.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.
Cookbook Example

Please refer to example_deepeval_pii_leakage.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPIILeakageMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
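Because this metric scores in the direction of more violations (closer to 1.0 is worse), a pass/fail check against the threshold inverts the usual comparison. A minimal sketch of that direction-aware check (the function name is an assumption, not part of the library's API):

```python
def passes_threshold(score: float, threshold: float = 0.5,
                     higher_is_better: bool = False) -> bool:
    """Direction-aware pass/fail: for PII leakage, lower scores pass."""
    return score >= threshold if higher_is_better else score <= threshold
```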

DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Prompt Alignment Metric Integration.

This metric evaluates whether a response follows the given prompt instructions, helping ensure that AI outputs stay aligned with the prompt template.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.
Cookbook Example

Please refer to example_deepeval_prompt_alignment.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPromptAlignmentMetric class.

Parameters:

Name Type Description Default
prompt_instructions List[str]

A list of strings specifying the instructions you want followed in your prompt template.

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If prompt_instructions is empty or contains invalid values.
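Conceptually, prompt alignment can be pictured as the fraction of instructions the response follows. The toy sketch below makes that idea concrete; the real metric uses an LLM judge, and these names are hypothetical:

```python
def alignment_score(prompt_instructions: list[str],
                    followed: set[str]) -> float:
    """Fraction of prompt instructions judged as followed by the response."""
    if not prompt_instructions:
        raise ValueError("prompt_instructions must not be empty")
    hits = sum(1 for inst in prompt_instructions if inst in followed)
    return hits / len(prompt_instructions)
```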

DeepEvalRoleViolationMetric(role, threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Role Violation Metric Integration.

This metric evaluates whether a response violates its assigned role. It helps ensure that AI responses stay within the persona they were given.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.
Cookbook Example

Please refer to example_deepeval_role_violation.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalRoleViolationMetric class.

Parameters:

Name Type Description Default
role str

The role the response is expected to adhere to. Common roles include: "helpful customer assistant", "medical insurance agent".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If role is empty or contains invalid values.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[ToolCall], optional): The tools actually called by the agent. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[ToolCall], optional): The expected tools to be called. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
include_reason bool

Include reasoning in output. Defaults to True.

True
strict_mode bool

Binary mode (0 or 1). Defaults to False.

False
should_exact_match bool

Require exact match of tools. Defaults to False.

False
should_consider_ordering bool

Consider order of tools called. Defaults to False.

False
available_tools list[dict[str, Any]] | None

List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
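The field notes above state that when tools_called is not provided, tool calls are extracted from assistant messages carrying a tool_calls field. A minimal sketch of that extraction for OpenAI-format messages (the helper name and the exact output shape are assumptions):

```python
import json

def extract_tool_calls(trajectory: list[dict]) -> list[dict]:
    """Collect {'name', 'input_parameters'} pairs from assistant messages."""
    calls = []
    for message in trajectory:
        if message.get("role") != "assistant":
            continue
        for tc in message.get("tool_calls") or []:
            fn = tc.get("function", {})
            args = fn.get("arguments", "{}")
            calls.append({
                "name": fn.get("name"),
                # OpenAI serializes tool-call arguments as a JSON string.
                "input_parameters": json.loads(args) if isinstance(args, str) else args,
            })
    return calls
```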

DeepEvalToxicityMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Toxicity Metric Integration.

This metric uses LLM-as-a-judge to assess whether a response contains toxic content.

Available Fields
  • input (str): The query to evaluate the metric.
  • actual_output (str): The generated response to evaluate the metric.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.
Cookbook Example

Please refer to example_deepeval_toxicity.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalToxicityMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
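When num_judges is greater than 1, the per-judge scores are combined with the chosen aggregation method. A sketch of two common strategies, averaging and majority vote (the implementation details here are assumptions, not the library's code):

```python
from collections import Counter
from statistics import mean

def aggregate_scores(scores: list[float], method: str = "average") -> float:
    """Combine per-judge scores into a single metric score."""
    if method == "average":
        return mean(scores)
    if method == "majority_vote":
        # The most common score wins; ties break toward the first seen.
        return Counter(scores).most_common(1)[0][0]
    raise ValueError(f"Unknown aggregation method: {method}")
```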

GEvalCompletenessMetric(*args, threshold=1.0, **kwargs)

Bases: DeepEvalGEvalMetric

GEval Completeness Metric.

This metric is used to evaluate the completeness of the generated output.

Available Fields
  • query (str): The query to evaluate the completeness of the model's output.
  • generated_response (str): The generated response to evaluate the completeness of the model's output.
  • expected_response (str): The expected response to evaluate the completeness of the model's output.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 1-3 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_completeness.py in the gen-ai-sdk-cookbook repository.

Initializes the GEvalCompletenessMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to "completeness".

required
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to [INPUT, ACTUAL_OUTPUT, EXPECTED_OUTPUT].

required
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

required
criteria str | None

The criteria to use for the metric. Defaults to COMPLETENESS_CRITERIA.

required
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to COMPLETENESS_EVALUATION_STEPS.

required
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to COMPLETENESS_RUBRIC.

required
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

required
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

required
threshold float

The threshold to use for the metric. Defaults to 1.0. Must be between 0.0 and 1.0 inclusive.

1.0
additional_context str | None

Additional context like few-shot examples. Defaults to COMPLETENESS_FEW_SHOT.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

required
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

required
strict_mode bool

If True, binarizes score to 1.0 or 0.0. Defaults to False.

required
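The scoring note above says the native 1-3 rubric value is normalized into [0, 1], and strict_mode binarizes the final score against the threshold. A sketch of those two transformations (function names are assumptions):

```python
def normalize_rubric(rubric_score: float, low: float = 1.0, high: float = 3.0) -> float:
    """Map a native rubric value in [low, high] onto [0, 1]."""
    return (rubric_score - low) / (high - low)

def apply_strict_mode(score: float, threshold: float) -> float:
    """Binarize: scores at or above the threshold become 1.0, else 0.0."""
    return 1.0 if score >= threshold else 0.0
```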

GEvalContextSufficiencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Context Sufficiency Metric.

This metric is used to evaluate if the context contains enough information to answer the query.

Available Fields
  • query (str): The query to evaluate.
  • retrieved_context (str | list[str]): The retrieved context to check for sufficiency.
Scoring
  • 0-1 (Binary): Where 0 means insufficient context and 1 means sufficient context.
Cookbook Example

Please refer to example_geval_context_sufficiency.py in the gen-ai-sdk-cookbook repository.

GEvalGroundednessMetric(*args, threshold=1.0, **kwargs)

Bases: DeepEvalGEvalMetric

GEval Groundedness Metric.

This metric is used to evaluate the groundedness of the generated output.

Available Fields
  • query (str): The query to evaluate the groundedness of the model's output.
  • generated_response (str): The generated response to evaluate the groundedness of the model's output.
  • retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 1-3 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_groundedness.py in the gen-ai-sdk-cookbook repository.

Initializes the GEvalGroundednessMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to "groundedness".

required
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to [INPUT, ACTUAL_OUTPUT, RETRIEVAL_CONTEXT].

required
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

required
criteria str | None

The criteria to use for the metric. Defaults to GROUNDEDNESS_CRITERIA.

required
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to GROUNDEDNESS_EVALUATION_STEPS.

required
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to GROUNDEDNESS_RUBRIC.

required
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

required
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

required
threshold float

The threshold to use for the metric. Defaults to 1.0. Must be between 0.0 and 1.0 inclusive.

1.0
additional_context str | None

Additional context like few-shot examples. Defaults to GROUNDEDNESS_FEW_SHOT.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

required
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

required
strict_mode bool

If True, binarizes score to 1.0 or 0.0. Defaults to False.

required

GEvalLanguageConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Language Consistency Metric.

This metric is used to evaluate whether the generated response is written in the same language as the query.

Available Fields
  • query (str): The query whose language the response is checked against.
  • generated_response (str): The generated response to check for language consistency.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 0-1 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_language_consistency.py in the gen-ai-sdk-cookbook repository.
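For intuition only, a crude script-based check hints at what language consistency means; note this distinguishes writing scripts, not languages (e.g. it cannot tell English from French), whereas the actual metric uses an LLM judge. All names below are hypothetical:

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Return the most common Unicode script prefix among letters in text."""
    scripts = [unicodedata.name(ch, "UNKNOWN").split()[0]
               for ch in text if ch.isalpha()]
    return max(set(scripts), key=scripts.count) if scripts else "UNKNOWN"

def same_language_naive(query: str, response: str) -> bool:
    """Naive proxy: do query and response share a dominant script?"""
    return dominant_script(query) == dominant_script(response)
```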

GEvalRedundancyMetric(*args, threshold=0.5, **kwargs)

Bases: DeepEvalGEvalMetric

GEval Redundancy Metric.

This metric is used to evaluate the redundancy of the generated output.

Available Fields
  • query (str): The query to evaluate the redundancy of the model's output.
  • generated_response (str): The generated response to evaluate the redundancy of the model's output.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 1-3 rubric value is stored in the rubric_score field. A lower score is better (higher_is_better=False).
Cookbook Example

Please refer to example_geval_redundancy.py in the gen-ai-sdk-cookbook repository.

Initializes GEvalRedundancyMetric.

Parameters:

Name Type Description Default
*args

Positional arguments passed to :class:DeepEvalGEvalMetric.

()
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
**kwargs

Keyword arguments passed to :class:DeepEvalGEvalMetric:
  • name (str | None, optional): The name of the metric. Defaults to None. Required if not provided via _defaults.
  • evaluation_params (list[LLMTestCaseParams] | None, optional): The evaluation parameters. Defaults to None. Required if not provided via _defaults.
  • model (str | ModelId | BaseLMInvoker, optional): The model to use for the metric. Defaults to DefaultValues.MODEL.
  • criteria (str | None, optional): The criteria to use for the metric. Defaults to None.
  • evaluation_steps (list[str] | None, optional): The evaluation steps to use for the metric. Defaults to None.
  • rubric (list[Rubric] | None, optional): The rubric to use for the metric. Defaults to None.
  • model_credentials (str | None, optional): The model credentials to use for the metric. Defaults to None. Required when model is a string.
  • model_config (dict[str, Any] | None, optional): The model config to use for the metric. Defaults to None.
  • additional_context (str | None, optional): Additional context like few-shot examples. Defaults to None.
  • batch_status_check_interval (float, optional): Time between batch status checks in seconds. Defaults to 30.0.
  • batch_max_iterations (int, optional): Maximum number of status check iterations before timeout. Defaults to 120.
  • strict_mode (bool, optional): If True, binarizes score to 1.0 or 0.0. Defaults to False.

{}

GEvalRefusalAlignmentMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Refusal Alignment Metric.

This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.

Available Fields
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • generated_response (str): The generated response to evaluate the metric.
  • is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 0-1 rubric value is stored in the rubric_score field.
Cookbook Example

Please refer to example_geval_refusal_alignment.py in the gen-ai-sdk-cookbook repository.

GEvalRefusalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalGEvalMetric

GEval Refusal Metric.

This metric is used to predict whether the expected response to the query is a refusal.

Available Fields
  • query (str): The query associated with the expected response.
  • expected_response (str): The expected response to classify as a refusal or not.
Scoring
  • [0, 1] (Continuous): Normalized score range. The native 0-1 rubric value is stored in the rubric_score field. A higher score is better (higher_is_better=True).
Cookbook Example

Please refer to example_geval_refusal.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationCoherenceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Coherence metric.

This metric is used to evaluate the coherence quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better coherence.
Cookbook Example

Please refer to example_geval_summarization_coherence.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationConsistencyMetric(*args, threshold=1, **kwargs)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Consistency metric.

This metric is used to evaluate factual consistency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better consistency.
Cookbook Example

Please refer to example_geval_summarization_consistency.py in the gen-ai-sdk-cookbook repository.

Initializes GEvalSummarizationConsistencyMetric.

Parameters:

Name Type Description Default
*args

Positional arguments passed to :class:GEvalSummarizationBaseMetric.

()
threshold float

The threshold to use for the metric. Defaults to 1.

1
**kwargs

Keyword arguments passed to :class:GEvalSummarizationBaseMetric.

{}

GEvalSummarizationFluencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Fluency metric.

This metric is used to evaluate fluency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better fluency.
Cookbook Example

Please refer to example_geval_summarization_fluency.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationRelevanceMetric(*args, threshold=1, **kwargs)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Relevance metric.

This metric is used to evaluate the relevance quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • [0, 1] float: A higher score indicates better relevance.
Cookbook Example

Please refer to example_geval_summarization_relevance.py in the gen-ai-sdk-cookbook repository.

Initializes GEvalSummarizationRelevanceMetric.

Parameters:

Name Type Description Default
*args

Positional arguments passed to :class:GEvalSummarizationBaseMetric.

()
threshold float

The threshold to use for the metric. Defaults to 1.

1
**kwargs

Keyword arguments passed to :class:GEvalSummarizationBaseMetric.

{}

LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

A multi-purpose LM-based metric class.

This class provides a general-purpose way to build custom LM-based metrics: supply a response schema, a prompt builder, a model, and model credentials, and it handles prompting the judge and parsing its output into a score.

Available Fields
  • (Dynamic): Depends on the prompt_builder and specific metric implementation.
Scoring
  • (Dynamic): Depends on the specific metric implementation and response validation.

Initialize the LMBasedMetric class.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
response_schema ResponseSchema

The response schema to use for the metric.

required
prompt_builder PromptBuilder

The prompt builder to use for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
parse_response_fn Callable[[str | LMOutput], MetricOutput] | None

The function used to parse the LM response into a metric output. Defaults to the built-in parser.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
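A custom parse_response_fn typically turns the judge's raw text into a structured result. A minimal sketch for a JSON-formatted reply (the exact MetricOutput shape expected by the library is an assumption here):

```python
import json

def parse_response(raw_text: str) -> dict:
    """Hypothetical parse_response_fn: read a JSON judge reply into a result dict."""
    data = json.loads(raw_text)
    return {"score": float(data["score"]), "reason": data.get("reason", "")}
```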

LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: LangChainAgentEvalsMetric

LangChain AgentEvals LLM as a Judge Metric.

A metric that uses LangChain AgentEvals with an LLM as the judge to evaluate agent trajectories.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge based on the trajectory.

Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

LangChainAgentEvalsMetric(name, evaluator, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

LangChain AgentEvals Metric.

A metric that uses LangChain AgentEvals to evaluate an agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score based on the trajectory.

Initialize the LangChainAgentEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator

The evaluator to use.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainAgentTrajectoryAccuracyMetric(model=DefaultValues.MODEL, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainAgentEvalsLLMAsAJudgeMetric

LangChain Agent Trajectory Accuracy Metric.

A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Ordinal): Scale where 0.0 is bad, 0.5 is incomplete, and 1.0 is good.
Cookbook Example

Please refer to example_langchain_agent_trajectory_accuracy.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainAgentTrajectoryAccuracyMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use. Defaults to DefaultValues.MODEL.

MODEL
prompt str | None

The prompt to use. Defaults to None.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
use_reference bool

If True, uses the expected agent trajectory to evaluate the trajectory accuracy. Defaults to True. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used to evaluate the trajectory accuracy. If custom prompt is provided, this parameter will be ignored.

True
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
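The real metric delegates trajectory comparison to an LLM judge, but the idea can be illustrated with a strict rule-based stand-in that scores the fraction of expected tool calls matched in order (illustrative only; the "tool" key and function name are assumptions, not the gllm_evals implementation):

```python
from typing import Any

# Illustrative rule-based stand-in; the real metric uses an LLM judge.


def tool_call_match_ratio(
    agent_trajectory: list[dict[str, Any]],
    expected_agent_trajectory: list[dict[str, Any]],
) -> float:
    """Fraction of expected steps whose tool name matches the actual step."""
    if not expected_agent_trajectory:
        return 1.0
    matches = sum(
        1
        for actual, expected in zip(agent_trajectory, expected_agent_trajectory)
        if actual.get("tool") == expected.get("tool")
    )
    return matches / len(expected_agent_trajectory)


actual = [{"tool": "search"}, {"tool": "calculator"}]
expected = [{"tool": "search"}, {"tool": "summarize"}]
print(tool_call_match_ratio(actual, expected))  # 0.5
```

An LLM judge generalizes this: it can credit semantically equivalent steps and partial progress, which is what the 0.0/0.5/1.0 ordinal scale above captures.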

LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Conciseness Metric.

A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.

Available Fields
  • input (str): The input to evaluate the conciseness of.
  • actual_output (str): The actual output to evaluate the conciseness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is not concise and 1.0 is concise.
Cookbook Example

Please refer to example_langchain_conciseness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainConcisenessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Correctness Metric.

A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.

Available Fields
  • input (str): The query to evaluate the correctness of.
  • actual_output (str): The generated response to evaluate the correctness of.
  • expected_output (str): The expected response to evaluate the correctness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is incorrect and 1.0 is correct.
Cookbook Example

Please refer to example_langchain_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainCorrectnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Groundedness Metric.

A metric that uses LangChain and OpenEvals to evaluate the groundedness of the LLM.

Available Fields
  • generated_response (str): The generated response to evaluate the groundedness of.
  • retrieved_context (str | list[str]): The retrieved context to evaluate the groundedness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is not grounded and 1.0 is grounded.
Cookbook Example

Please refer to example_langchain_groundedness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainGroundednessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Hallucination Metric.

A metric that uses LangChain and OpenEvals to evaluate the hallucination of the LLM.

Available Fields
  • input (str): The query to evaluate the hallucination of.
  • actual_output (str): The generated response to evaluate the hallucination of.
  • expected_context (str): The expected retrieved context to evaluate the hallucination of.
  • expected_output (str): Additional information to help the model evaluate the hallucination.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is no hallucination and 1.0 is hallucination.
Cookbook Example

Please refer to example_langchain_hallucination.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHallucinationMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.MAJORITY_VOTE, max_concurrent_judges=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Helpfulness Metric.

A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM.

Available Fields
  • input (str): The query to evaluate the helpfulness of.
  • actual_output (str): The generated response to evaluate the helpfulness of.
Scoring
  • 0.0-1.0 (Binary): Scale where 0.0 is not helpful and 1.0 is helpful.
Cookbook Example

Please refer to example_langchain_helpfulness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHelpfulnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.MAJORITY_VOTE.

MAJORITY_VOTE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: LangChainOpenEvalsMetric

LangChain OpenEvals LLM as a Judge Metric.

A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | None, optional): The expected response / reference outputs to evaluate.
  • expected_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge.

Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt. It can be a string template, a LangChain prompt template, or a callable that returns a list of chat messages.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
system str | None

Optional system message to prepend to the prompt.

None
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, checks for custom prompts and processes accordingly. Currently processes items individually; batch optimization may be added in a future release.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

LangChainOpenEvalsMetric(name, evaluator, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

LangChain OpenEvals Metric.

A metric that uses LangChain and OpenEvals.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | None, optional): The expected response / reference outputs to evaluate.
  • expected_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): The score depends on the specific OpenEvals metric.

Initialize the LangChainOpenEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator | Callable[..., Awaitable[Any]]

The evaluator to use.

required
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

PyTrecMetric(metrics=None, k=20)

Bases: BaseMetric

PyTrec Metric.

A wrapper for pytrec_eval to evaluate common Information Retrieval (IR) metrics. This metric lets you compute standard IR scores such as NDCG, MAP, and reciprocal rank (MRR), based on retrieved chunks and ground truth chunk IDs.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better retrieval performance.
Cookbook Example

Please refer to example_pytrec_metric.py in the gen-ai-sdk-cookbook repository.

Initializes the PyTrecMetric.

Parameters:

Name Type Description Default
metrics list[PyTrecEvalMetric | str] | set[PyTrecEvalMetric | str] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
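PyTrecMetric delegates the computation to pytrec_eval, but the underlying metrics are simple to state. A self-contained sketch of reciprocal rank, the per-query quantity behind MRR, using the same field shapes as above (illustrative; independent of pytrec_eval):

```python
# Illustrative sketch of reciprocal rank; PyTrecMetric uses pytrec_eval instead.


def reciprocal_rank(
    retrieved_chunks: dict[str, float],
    ground_truth_chunk_ids: list[str],
) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    ranked = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)
    relevant = set(ground_truth_chunk_ids)
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


print(reciprocal_rank({"c1": 0.2, "c2": 0.9, "c3": 0.5}, ["c3"]))  # 0.5
```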

RAGASMetric(metric, name=None, callbacks=None, timeout=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)

Bases: BaseMetric

RAGAS Metric.

RAGAS is a metric for evaluating the quality of RAG systems.

Available Fields
  • input (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
  • actual_output (str, optional): The generated response to evaluate the metric. Similar to response in SingleTurnSample.
  • expected_output (str, optional): The expected response to evaluate the metric. Similar to reference in SingleTurnSample.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a list with a single element.
  • rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in SingleTurnSample.
Scoring
  • 0.0-1.0 (Continuous): A score evaluating the RAG aspect being tested.

Initialize the RAGASMetric.

Parameters:

Name Type Description Default
metric SingleTurnMetric

The Ragas metric to use.

required
name str | None

The name of the metric. Defaults to the name of the wrapped Ragas metric.

None
callbacks Callbacks

The callbacks to use. Default is None.

None
timeout int | None

The timeout for the metric. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD.

AGGREGATION_METHOD
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient parallel processing when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
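The field correspondence listed under Available Fields can be sketched as a plain dict translation, including the promotion of bare string contexts to single-element lists (a simplified illustration of the documented mapping, not the library's code):

```python
from typing import Any

# Illustrative sketch of the documented EvalInput -> SingleTurnSample mapping.
FIELD_MAP = {
    "input": "user_input",
    "actual_output": "response",
    "expected_output": "reference",
    "expected_context": "reference_contexts",
    "retrieved_context": "retrieved_contexts",
}

LIST_FIELDS = {"reference_contexts", "retrieved_contexts"}


def to_single_turn_sample(data: dict[str, Any]) -> dict[str, Any]:
    """Translate EvalInput-style keys to SingleTurnSample-style keys."""
    sample: dict[str, Any] = {}
    for src, dst in FIELD_MAP.items():
        if src not in data:
            continue
        value = data[src]
        if dst in LIST_FIELDS and isinstance(value, str):
            value = [value]  # promote a bare string to a one-element list
        sample[dst] = value
    return sample


print(to_single_turn_sample({"input": "Q?", "retrieved_context": "ctx"}))
# {'user_input': 'Q?', 'retrieved_contexts': ['ctx']}
```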

RagasContextPrecisionWithoutReference(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Precision Metric.

Measures the proportion of relevant chunks in the retrieved contexts without requiring a ground truth reference. It evaluates whether the retrieved context chunks are actually useful for generating the provided response to the user's query.

Available Fields
  • input (str): The query to recall the context for.
  • actual_output (str): The generated response to recall the context for.
  • retrieved_context (list[str] | str): The retrieved contexts to recall the context for.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context precision.
Cookbook Example

Please refer to example_ragas_context_precision.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextPrecisionWithoutReference metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
**kwargs

Additional keyword arguments to pass to the underlying Ragas context precision metric.

{}
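Conceptually, context precision rewards relevant chunks appearing early in the ranking. Given per-chunk relevance judgments, the standard formula averages precision@k over the relevant positions; a rule-based sketch (the real metric uses an LLM to judge each chunk's relevance):

```python
# Illustrative sketch of the context precision formula; relevance judgments
# come from an LLM judge in the real metric.


def context_precision(relevance: list[int]) -> float:
    """Mean of precision@k over positions where the chunk is relevant.

    relevance[i] is 1 if the i-th retrieved chunk is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / total_relevant


print(context_precision([1, 0, 1]))  # (1/1 + 2/3) / 2 = 0.833...
```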

RagasContextRecall(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Recall Metric.

Measures how many of the relevant documents (or pieces of information) needed to answer the query were successfully retrieved. It evaluates the retrieval system's ability to find all the necessary context based on the generated response and the expected response.

Available Fields
  • input (str): The query to recall the context for.
  • actual_output (str): The generated response to recall the context for.
  • expected_output (str): The expected response to recall the context for.
  • retrieved_context (list[str] | str): The retrieved contexts to recall the context for.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context recall.
Cookbook Example

Please refer to example_ragas_context_recall.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextRecall metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
**kwargs

Additional keyword arguments to pass to the RagasContextRecall metric.

{}
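Context recall asks the converse question: how much of the expected answer is supported by what was retrieved. Treating the expected response as a set of atomic claims, a set-based sketch (illustrative; the real metric extracts and attributes claims with an LLM):

```python
# Illustrative set-based sketch; claim extraction is LLM-based in the real metric.


def context_recall(reference_claims: set[str], retrieved_claims: set[str]) -> float:
    """Fraction of reference claims supported by the retrieved context."""
    if not reference_claims:
        return 0.0
    return len(reference_claims & retrieved_claims) / len(reference_claims)


reference = {"Paris is in France", "Paris hosted the 2024 Olympics"}
retrieved = {"Paris is in France"}
print(context_recall(reference, retrieved))  # 0.5
```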

RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None, **kwargs)

Bases: RAGASMetric

RAGAS Factual Correctness metric.

This metric evaluates the factual accuracy of the generated response against the reference.

Available Fields
  • input (str): The query.
  • actual_output (str): The generated response.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better factual correctness.
Cookbook Example

Please refer to example_ragas_factual_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasFactualCorrectness metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None
**kwargs

Additional keyword arguments to pass to the RagasFactualCorrectness metric.

{}
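Factual correctness is commonly scored as an F1 over atomic claims: precision penalizes unsupported claims in the response, recall penalizes reference claims the response misses. A set-based sketch (illustrative; the real metric decomposes text into claims with an LLM):

```python
# Illustrative claim-level F1 sketch; claim decomposition is LLM-based
# in the real metric.


def factual_correctness_f1(
    response_claims: set[str], reference_claims: set[str]
) -> float:
    """F1 over atomic claims: harmonic mean of claim precision and recall."""
    true_positives = len(response_claims & reference_claims)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(response_claims)
    recall = true_positives / len(reference_claims)
    return 2 * precision * recall / (precision + recall)


response = {"A", "B", "C"}
reference = {"A", "B", "D"}
print(factual_correctness_f1(response, reference))  # p = r = 2/3 -> 0.666...
```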

TopKAccuracy(k=20)

Bases: BaseMetric

Top-K Accuracy Metric.

Evaluates whether the ground truth chunk IDs are present within the top K retrieved chunks. This is a boolean-style hit/miss metric averaged over the dataset; a score of 1.0 means the relevant document was always found in the top K results.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better top-k accuracy.
Cookbook Example

Please refer to example_top_k_accuracy.py in the gen-ai-sdk-cookbook repository.

Initializes the TopKAccuracy.

Parameters:

Name Type Description Default
k list[int] | int

The number of retrieved chunks to consider. Defaults to 20.

20

top_k_accuracy(qrels, results)

Evaluates the top k accuracy.

Parameters:

Name Type Description Default
qrels dict[str, dict[str, int]]

The ground truth relevance of the retrieved chunks. Each chunk id maps to one of two values: 1 if the chunk is relevant to the query, 0 if it is not.

required
results dict[str, dict[str, float]]

The retrieved chunks with their similarity score.

required

Returns:

Type Description
dict[str, float]

dict[str, float]: The top k accuracy.

Example
qrels = {
    "q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
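Continuing the example above, the hit@k computation can be sketched as follows: a query scores 1.0 when any of its relevant chunks appears in the top k results, and the scores are averaged over queries (a simplified single-k sketch that returns one float, not the library's per-metric dict):

```python
# Illustrative single-k sketch; the library's top_k_accuracy returns
# a dict of scores keyed by metric name.


def hit_at_k(
    qrels: dict[str, dict[str, int]],
    results: dict[str, dict[str, float]],
    k: int = 20,
) -> float:
    """Average hit@k: 1.0 per query if any relevant chunk is in the top k."""
    if not qrels:
        return 0.0
    hits = 0
    for query_id, relevance in qrels.items():
        scores = results.get(query_id, {})
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_k = set(ranked[:k])
        if any(chunk_id in top_k for chunk_id, rel in relevance.items() if rel == 1):
            hits += 1
    return hits / len(qrels)


qrels = {"q1": {"chunk1": 1, "chunk2": 1}}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
print(hit_at_k(qrels, results, k=2))  # 1.0
```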