Metrics
Metrics module for evaluating AI model outputs.
This module provides a comprehensive collection of evaluation metrics for assessing the quality of generated content, retrieval systems, and AI agent responses. It includes both traditional metrics and LLM-based metrics, as well as integrations with popular evaluation frameworks.
Metric categories:

- Generation metrics: Evaluate quality of generated text (completeness, groundedness, redundancy, language consistency, refusal alignment)
- Retrieval metrics: Assess retrieval system performance (precision, recall, accuracy)
- Agent metrics: Evaluate AI agent behavior and responses
- Open-source integrations: Wrappers for RAGAS, DeepEval, and LangChain evaluators
BaseMetric
Bases: ABC
Abstract class for metrics.
This class defines the interface for all metrics.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_fields | set[str] | The required fields for this metric to evaluate data. |
| input_type | type \| None | The type of the input data. |
can_evaluate(data)
Check if this metric can evaluate the given data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The input data to check. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | bool | True if the metric can evaluate the data, False otherwise. |
evaluate(data)
async
Evaluate the metric on the given dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The data to evaluate the metric on. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| MetricOutput | MetricOutput | A dictionary where the keys are the namespaces and the values are the scores. |
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
| list[str] \| None | The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] \| None | The input spec. |
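Example: the interface above can be implemented directly for simple, non-LLM checks. The sketch below is hedged: the import path, the way name and required_fields are declared, and dict-style access to the MetricInput fields are assumptions to adapt to the actual package layout.

```python
# Hypothetical sketch of a custom BaseMetric subclass (import path assumed).
import asyncio

from your_package.metrics import BaseMetric  # adjust to the real import path


class ExactMatchMetric(BaseMetric):
    """Toy metric: 1.0 if the generated response exactly matches the expected one."""

    name = "exact_match"  # assumed to be declared as class attributes
    required_fields = {"generated_response", "expected_response"}

    async def evaluate(self, data):
        # Assumes MetricInput fields can be read like a mapping.
        generated = data["generated_response"].strip()
        expected = data["expected_response"].strip()
        return {self.name: 1.0 if generated == expected else 0.0}


async def main() -> None:
    metric = ExactMatchMetric()
    sample = {"generated_response": "Paris", "expected_response": "Paris"}
    if metric.can_evaluate(sample):           # checks required_fields against the data
        print(await metric.evaluate(sample))  # {'exact_match': 1.0}


asyncio.run(main())
```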
CompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Completeness metric.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
| model_credentials | str | The model credentials to use for the metric. |
Initialize the CompletenessMetric class.
Default expected input:

- query (str): The query to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to the default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to CompletenessResponseSchema. | None |
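Example usage (a minimal sketch; the import path and the use of a plain dict as the metric input are assumptions, adjust to your installation):

```python
# Hedged usage sketch for CompletenessMetric (import path assumed).
import asyncio

from your_package.metrics import CompletenessMetric  # adjust to the real import path


async def main() -> None:
    metric = CompletenessMetric(
        model="openai/gpt-4.1",       # model id as a string
        model_credentials="sk-...",   # placeholder API key; required when model is a string
    )
    data = {
        "query": "What is the capital of France?",
        "expected_response": "The capital of France is Paris.",
        "generated_response": "Paris is the capital of France.",
    }
    scores = await metric.evaluate(data)  # MetricOutput: mapping of namespace -> score
    print(scores)


asyncio.run(main())
```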
DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Answer Relevancy Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Initializes the DeepEvalAnswerRelevancyMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
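Example usage (a hedged sketch; the import path and plain-dict metric input are assumptions):

```python
# Hedged usage sketch for DeepEvalAnswerRelevancyMetric (import path assumed).
import asyncio

from your_package.metrics import DeepEvalAnswerRelevancyMetric  # adjust to the real import path


async def main() -> None:
    metric = DeepEvalAnswerRelevancyMetric(
        threshold=0.7,
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder; required because model is a string
    )
    data = {
        "query": "How do I reset my password?",
        "generated_response": "Open Settings, choose Security, then click 'Reset password'.",
    }
    print(await metric.evaluate(data))


asyncio.run(main())
```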
DeepEvalBiasMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Bias Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.
Initializes the DeepEvalBiasMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval ContextualPrecision Metric Integration.
Required Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in
LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalContextualPrecisionMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualRecallMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval ContextualRecall Metric Integration.
Required Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in
LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalContextualRecallMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval ContextualRelevancy Metric Integration.
Required Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalContextualRelevancyMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Faithfulness Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalFaithfulnessMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalGEvalMetric(name, evaluation_params, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
DeepEval GEval Metric Integration.
This class wraps DeepEval's GEval class and provides a unified interface to the DeepEval library.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields:
- query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to
actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated
into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to
expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated
into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric.
Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted
into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalGEvalMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluation_params | list[LLMTestCaseParams] | The evaluation parameters. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| criteria | str \| None | The criteria to use for the metric. Defaults to None. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to None. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to None. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt (system + user) as a string. |
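Example: defining a custom GEval-style metric with criteria of your own. This is a hedged sketch; the metric's import path and the plain-dict input are assumptions, while the LLMTestCaseParams enum comes from the DeepEval library.

```python
# Hedged sketch of a custom GEval metric (package import path assumed).
import asyncio

from deepeval.test_case import LLMTestCaseParams  # DeepEval's test-case field enum

from your_package.metrics import DeepEvalGEvalMetric  # adjust to the real import path


async def main() -> None:
    tone_metric = DeepEvalGEvalMetric(
        name="professional_tone",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        criteria="Rate how professional and courteous the response tone is.",
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder API key
        threshold=0.5,
    )
    data = {
        "query": "Cancel my subscription.",
        "generated_response": "Of course. I have cancelled your subscription and emailed a confirmation.",
    }
    print(tone_metric.get_full_prompt(data))  # inspect the prompt DeepEval will send
    print(await tone_metric.evaluate(data))


asyncio.run(main())
```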
DeepEvalHallucinationMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Hallucination Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str]): The expected context to evaluate the metric.
Similar to context in LLMTestCaseParams.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.
Initializes the DeepEvalHallucinationMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval JSON Correctness Metric Integration.
This metric evaluates whether a response is valid JSON that conforms to a specified schema. It helps ensure that AI responses follow the expected JSON structure.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Categorical):

- 0: The response does not conform to the expected JSON schema.
- 1: The response conforms to the expected JSON schema.
Initializes the DeepEvalJsonCorrectnessMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expected_schema | Type[BaseModel] | The expected schema class (not instance) for the response. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If expected_schema is not a valid BaseModel class. |
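Example of passing a Pydantic schema class (a hedged sketch; the metric's import path and the plain-dict input are assumptions):

```python
# Hedged sketch: expected_schema must be a Pydantic BaseModel *class*, not an instance.
import asyncio

from pydantic import BaseModel

from your_package.metrics import DeepEvalJsonCorrectnessMetric  # adjust to the real import path


class ProductInfo(BaseModel):
    name: str
    price: float
    in_stock: bool


async def main() -> None:
    metric = DeepEvalJsonCorrectnessMetric(
        expected_schema=ProductInfo,   # pass the class itself
        model="openai/gpt-4.1",
        model_credentials="sk-...",    # placeholder API key
    )
    data = {
        "query": "Return the product as JSON.",
        "generated_response": '{"name": "Widget", "price": 9.99, "in_stock": true}',
    }
    print(await metric.evaluate(data))


asyncio.run(main())
```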
DeepEvalMetric(metric, name)
Bases: BaseMetric
DeepEval Metric Integration.
Attributes:
| Name | Type | Description |
|---|---|---|
| metric | BaseMetric | The DeepEval metric to wrap. |
| name | str | The name of the metric. |
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to
actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated
into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to
expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated
into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric.
Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted
into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric | BaseMetric | The DeepEval metric to wrap. | required |
| name | str | The name of the metric. | required |
DeepEvalMetricFactory(name, model, model_credentials, model_config, **kwargs)
Bases: DeepEvalMetric, ABC
Abstract base class for creating DeepEval metrics with a shared model invoker.
Initializes the metric, handling common model invoker creation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name for the metric. | required |
| model | Union[str, ModelId, BaseLMInvoker] | The model identifier or an existing LM invoker instance. | required |
| model_credentials | Optional[str] | Credentials for the model. Required when model is a string. | required |
| model_config | Optional[Dict[str, Any]] | Configuration for the model. | required |
| **kwargs | | Additional arguments for the specific DeepEval metric constructor. | {} |
DeepEvalMisuseMetric(domain, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Misuse Metric Integration.
This metric evaluates whether a response constitutes misuse of the model for the given domain. It helps ensure that AI responses are not used for harmful or inappropriate purposes outside that domain.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.
Initializes the DeepEvalMisuseMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment". | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If domain is empty or contains invalid values. |
DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Non-Advice Metric Integration.
This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.
Initializes the DeepEvalNonAdviceMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| advice_types | List[str] | List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"]. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If advice_types is empty or contains invalid values. |
DeepEvalPIILeakageMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval PII Leakage Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.
Initializes the DeepEvalPIILeakageMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Prompt Alignment Metric Integration.
This metric evaluates whether a response follows the given prompt instructions. It helps ensure that AI responses adhere to the instructions specified in the prompt template.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric.
Similar to actual_output in LLMTestCaseParams. If the generated response is a list,
the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.
Initializes the DeepEvalPromptAlignmentMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_instructions | List[str] | A list of strings specifying the instructions you want followed in your prompt template. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If prompt_instructions is empty or contains invalid values. |
DeepEvalRoleViolationMetric(role, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Role Violation Metric Integration.
This metric evaluates whether a response violates its assigned role. It helps ensure that AI responses stay in character and do not break out of the specified role.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.
Initializes the DeepEvalRoleViolationMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| role | str | The role to evaluate the metric. Common roles include: "helpful customer assistant", "medical insurance agent". | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If role is empty or contains invalid values. |
DeepEvalToxicityMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Toxicity Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.
Initializes the DeepEvalToxicityMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
GEvalCompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Completeness Metric.
This metric is used to evaluate the completeness of the generated output.
Required Fields:

- query (str): The query to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Completeness Metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to DEFAULT_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to DEFAULT_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to DEFAULT_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
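Example usage of the prebuilt GEval completeness metric (a hedged sketch; the import path and plain-dict input are assumptions):

```python
# Hedged usage sketch for GEvalCompletenessMetric (import path assumed).
import asyncio

from your_package.metrics import GEvalCompletenessMetric  # adjust to the real import path


async def main() -> None:
    metric = GEvalCompletenessMetric(
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder API key
    )
    data = {
        "query": "List the three primary colors.",
        "expected_response": "Red, yellow, and blue.",
        "generated_response": "Red and blue.",
    }
    print(await metric.evaluate(data))  # the incomplete answer should score low


asyncio.run(main())
```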
GEvalGroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Groundedness Metric.
This metric is used to evaluate the groundedness of the generated output.
Required Fields:

- query (str): The query to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Groundedness Metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to DEFAULT_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to DEFAULT_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to DEFAULT_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
GEvalLanguageConsistencyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Language Consistency Metric.
This metric is used to predict whether the generated response is consistent in language with the query.
Required Fields:

- query (str): The query to check for language consistency.
- generated_response (str): The generated response to check for language consistency.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
| criteria | str \| None | The criteria prompt to use for the metric. |
| evaluation_steps | list[str] \| None | The evaluation steps prompt to use for the metric. |
| rubric | list[Rubric] \| None | The rubric to use for the metric. |
| threshold | float | The threshold to use for the metric. |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. |
| additional_context | str \| None | Additional context like few-shot examples. |
Initialize the GEval Language Consistency Metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria for the metric. Defaults to LANGUAGE_CONSISTENCY_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to LANGUAGE_CONSISTENCY_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric for the metric. Defaults to LANGUAGE_CONSISTENCY_RUBRIC. | None |
| threshold | float | The threshold for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to LANGUAGE_CONSISTENCY_FEW_SHOT. | None |
GEvalRedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Redundancy Metric.
This metric is used to evaluate the redundancy of the generated output.
Required Fields:

- query (str): The query to evaluate the redundancy of the model's output.
- generated_response (str): The generated response to evaluate the redundancy of the model's output.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Redundancy Metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to REDUNDANCY_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to REDUNDANCY_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to REDUNDANCY_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
GEvalRefusalAlignmentMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Refusal Alignment Metric.
This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.
Required Fields:

- query (str): The query to evaluate the metric.
- expected_response (str): The expected response to evaluate the metric.
- generated_response (str): The generated response to evaluate the metric.

Optional Fields:

- is_refusal (bool): Whether the sample should be treated as a refusal response. If provided, this value will be used directly instead of detecting refusal from expected_response.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Refusal Alignment Metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Default is REFUSAL_ALIGNMENT_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Default is REFUSAL_ALIGNMENT_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Default is REFUSAL_ALIGNMENT_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Default is 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Default is [LLMTestCaseParams.INPUT, LLMTestCaseParams.EXPECTED_OUTPUT, LLMTestCaseParams.ACTUAL_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Default is REFUSAL_ALIGNMENT_FEW_SHOT. | None |
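Example showing the optional is_refusal override (a hedged sketch; the import path and plain-dict input are assumptions):

```python
# Hedged usage sketch for GEvalRefusalAlignmentMetric (import path assumed).
import asyncio

from your_package.metrics import GEvalRefusalAlignmentMetric  # adjust to the real import path


async def main() -> None:
    metric = GEvalRefusalAlignmentMetric(
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder API key
    )
    data = {
        "query": "How do I pick a lock?",
        "expected_response": "I can't help with that request.",
        "generated_response": "Sorry, I can't assist with that.",
        "is_refusal": True,  # skip refusal detection on expected_response
    }
    print(await metric.evaluate(data))


asyncio.run(main())
```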
GEvalRefusalMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Refusal Metric.
This metric is used to predict whether the question and expected response constitute a refusal response.
Required Fields:

- query (str): The query to predict if it is a refusal response.
- expected_response (str): The expected response to predict if it is a refusal response.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Refusal Metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to DEFAULT_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to DEFAULT_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to DEFAULT_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
GroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Groundedness metric.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_credentials | str | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
Initialize the GroundednessMetric class.
Default expected input:

- query (str): The query to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to the default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to GroundednessResponseSchema. | None |
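Example usage (a hedged sketch; the import path and plain-dict input are assumptions):

```python
# Hedged usage sketch for GroundednessMetric (import path assumed).
import asyncio

from your_package.metrics import GroundednessMetric  # adjust to the real import path


async def main() -> None:
    metric = GroundednessMetric(
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder API key
    )
    data = {
        "query": "When was the Eiffel Tower completed?",
        "retrieved_context": "The Eiffel Tower was completed in 1889.",
        "generated_response": "It was completed in 1889.",
    }
    print(await metric.evaluate(data))


asyncio.run(main())
```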
LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None)
Bases: BaseMetric
A multi-purpose LM-based metric class.
This general-purpose class evaluates data by prompting a language model. Its behavior is configured through a response schema, a prompt builder, a model identifier, and model credentials.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model_credentials | str | The model credentials to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
Initialize the LMBasedMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| response_schema | ResponseSchema | The response schema to use for the metric. | required |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. | required |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| parse_response_fn | Callable[[str \| LMOutput], MetricOutput] \| None | The function used to parse the response from the LM. Defaults to the built-in parsing function. | None |
LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainAgentEvalsMetric
A metric that uses LangChain AgentEvals to evaluate an agent with an LLM as a judge.
Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. |
Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| prompt | str | The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "trajectory_accuracy". | 'trajectory_accuracy' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainAgentEvalsMetric(name, evaluator)
Bases: BaseMetric
A metric that uses LangChain AgentEvals to evaluate an agent.
Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. |
Initialize the LangChainAgentEvalsMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. | required |
LangChainAgentTrajectoryAccuracyMetric(model, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True)
Bases: LangChainAgentEvalsLLMAsAJudgeMetric
A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.
Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. |
Initialize the LangChainAgentTrajectoryAccuracyMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| prompt | str \| None | The prompt to use. Defaults to None. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "trajectory_accuracy". | 'trajectory_accuracy' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
| use_reference | bool | If True, uses the expected agent trajectory to evaluate the trajectory accuracy. Defaults to True. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used to evaluate the trajectory accuracy. If a custom prompt is provided, this parameter is ignored. | True |
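Example usage (a hedged sketch; the import path, the plain-dict input, and the chat-style role/content layout of the trajectory dicts are assumptions):

```python
# Hedged usage sketch for LangChainAgentTrajectoryAccuracyMetric (import path assumed).
import asyncio

from your_package.metrics import LangChainAgentTrajectoryAccuracyMetric  # adjust to the real import path


async def main() -> None:
    metric = LangChainAgentTrajectoryAccuracyMetric(
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder API key
        continuous=True,             # float score in [0, 1] instead of a boolean
        use_reference=True,          # compare against the expected trajectory
    )
    data = {
        "agent_trajectory": [
            {"role": "user", "content": "What's the weather in Jakarta?"},
            {"role": "assistant", "content": "Calling the weather tool for Jakarta."},
            {"role": "assistant", "content": "It is 31°C and sunny in Jakarta."},
        ],
        "expected_agent_trajectory": [
            {"role": "user", "content": "What's the weather in Jakarta?"},
            {"role": "assistant", "content": "It is 31°C and sunny in Jakarta."},
        ],
    }
    print(await metric.evaluate(data))


asyncio.run(main())
```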
LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.
Required Fields:

- query (str): The query to evaluate the conciseness of.
- generated_response (str): The generated response to evaluate the conciseness of.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainConcisenessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.
Required Fields:

- query (str): The query to evaluate the correctness of.
- generated_response (str): The generated response to evaluate the correctness of.
- expected_response (str): The expected response to evaluate the correctness of.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainCorrectnessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
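Example usage (a hedged sketch; the import path and plain-dict input are assumptions):

```python
# Hedged usage sketch for LangChainCorrectnessMetric (import path assumed).
import asyncio

from your_package.metrics import LangChainCorrectnessMetric  # adjust to the real import path


async def main() -> None:
    metric = LangChainCorrectnessMetric(
        model="openai/gpt-4.1",
        model_credentials="sk-...",  # placeholder API key
        continuous=True,             # float score instead of a boolean pass/fail
    )
    data = {
        "query": "Who wrote 'Pride and Prejudice'?",
        "generated_response": "Jane Austen wrote it.",
        "expected_response": "Jane Austen",
    }
    print(await metric.evaluate(data))


asyncio.run(main())
```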
LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the groundedness of the LLM.
Required Fields:

- generated_response (str | list[str]): The generated response to evaluate the groundedness of.
- retrieved_context (str | list[str]): The retrieved context to evaluate the groundedness of.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainGroundednessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the hallucination of the LLM.
Required Fields:

- query (str): The query to evaluate the hallucination of.
- generated_response (str): The generated response to evaluate the hallucination of.
- expected_retrieved_context (str): The expected retrieved context to evaluate the hallucination of.
- expected_response (str, optional): Additional information to help the model evaluate the hallucination.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainHallucinationMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM.
Required Fields: - query (str): The query to evaluate the helpfulness of. - generated_response (str): The generated response to evaluate the helpfulness of.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainHelpfulnessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsMetric
A metric that uses LangChain and OpenEvals to run an LLM-as-a-judge evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| prompt | str | The evaluation prompt, which can be a string template, a LangChain prompt template, or a callable that returns a list of chat messages. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
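Because this class accepts an arbitrary evaluation prompt, it can be used to define custom judge metrics. The sketch below is illustrative only: the import paths, the QAData container, and the template placeholder names ({query}, {generated_response}) are assumptions about how the wrapper exposes input fields to the prompt, not guarantees from this reference.
Example (illustrative sketch):
import asyncio

# Placeholder import paths -- adjust to the actual package layout.
from my_eval_package.metrics import LangChainOpenEvalsLLMAsAJudgeMetric
from my_eval_package.data import QAData

# Hypothetical rubric prompt; the placeholder names assume the wrapper exposes
# the input fields to the template under these names.
CONCISENESS_PROMPT = (
    "Rate whether the answer is concise and free of filler.\n"
    "Question: {query}\n"
    "Answer: {generated_response}"
)

async def main():
    metric = LangChainOpenEvalsLLMAsAJudgeMetric(
        name="conciseness",
        prompt=CONCISENESS_PROMPT,
        model="gpt-4o-mini",
        choices=[0.0, 0.5, 1.0],  # restrict the judge to these scores
    )
    data = QAData(query="What is 2 + 2?", generated_response="4")
    print(await metric.evaluate(data))

asyncio.run(main())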
LangChainOpenEvalsMetric(name, evaluator)
Bases: BaseMetric
A metric that uses LangChain and OpenEvals.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]] | The evaluator to use. |
Initialize the LangChainOpenEvalsMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluator | Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]] | The evaluator to use. | required |
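This base class can also wrap a plain async callable rather than a prebuilt OpenEvals evaluator. The sketch below assumes the wrapper forwards evaluation inputs to the callable as keyword arguments named outputs and reference_outputs (the usual OpenEvals convention); that mapping and the import path are assumptions.
Example (illustrative sketch):
# Placeholder import path -- adjust to the actual package layout.
from my_eval_package.metrics import LangChainOpenEvalsMetric

# An async callable evaluator returning an OpenEvals-style result dict.
# The keyword names it receives depend on how the wrapper maps input fields.
async def exact_match(*, outputs: str, reference_outputs: str, **kwargs):
    return {
        "key": "exact_match",
        "score": float(outputs.strip() == reference_outputs.strip()),
    }

metric = LangChainOpenEvalsMetric(name="exact_match", evaluator=exact_match)
# The metric is then evaluated like any other: `await metric.evaluate(data)`.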
LanguageConsistencyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Language consistency metric.
Attributes:
| Name | Type | Description |
|---|---|---|
| description | str | Description of the language consistency scoring (0-1 range). |
| required_fields | set[str] | Required input fields {QUERY, GENERATED_RESPONSE}. |
| input_type | type | Expected input data type (QAData). |
Initialize the LanguageConsistencyMetric class.
Default expected input: - query (str): The query to evaluate the language consistency of the model's output. - generated_response (str): The generated response to evaluate the language consistency of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to LanguageConsistencyResponseSchema. | None |
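Since the expected input type is QAData with query and generated_response, a typical call might look like the sketch below; the import paths and model identifier are placeholders.
Example (illustrative sketch):
import asyncio

# Placeholder import paths -- adjust to the actual package layout.
from my_eval_package.metrics import LanguageConsistencyMetric
from my_eval_package.data import QAData

async def main():
    metric = LanguageConsistencyMetric(model="gpt-4o-mini")
    # The response answers in a different language than the query,
    # so a low consistency score is expected.
    data = QAData(
        query="Quelle est la capitale de la France ?",
        generated_response="The capital of France is Paris.",
    )
    print(await metric.evaluate(data))

asyncio.run(main())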
PyTrecMetric(metrics=None, k=20)
Bases: BaseMetric
Pytrec_eval metric.
Required fields: - retrieved_chunks: The retrieved chunk ids with their similarity score. - ground_truth_chunk_ids: The ground truth chunk ids.
Example:
data = RetrievalData(
retrieved_chunks={
"chunk1": 0.9,
"chunk2": 0.8,
"chunk3": 0.7,
},
ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)
metric = PyTrecMetric()
await metric.evaluate(data)
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| metrics | list[PyTrecEvalMetric \| str] \| set[PyTrecEvalMetric \| str] \| None | The metrics to evaluate. |
| k_values | int \| list[int] | The number of retrieved chunks to consider. |
Initializes the PyTrecMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | list[PyTrecEvalMetric \| str] \| set[PyTrecEvalMetric \| str] \| None | The metrics to evaluate. Defaults to all metrics. | None |
| k | int \| list[int] | The number of retrieved chunks to consider. Defaults to 20. | 20 |
RAGASMetric(metric, name=None, callbacks=None, timeout=None)
Bases: BaseMetric
RAGAS metric.
RAGAS is an evaluation framework for assessing the quality of RAG systems; this class wraps a RAGAS single-turn metric.
Attributes:
| Name | Type | Description |
|---|---|---|
| metric | SingleTurnMetric | The Ragas metric to use. |
| name | str | The name of the metric. |
| callbacks | Callbacks | The callbacks to use. |
| timeout | int | The timeout for the metric. |
Available Fields:
- query (str): The query to evaluate. Similar to user_input in SingleTurnSample.
- generated_response (str | list[str], optional): The generated response to evaluate. Similar to response in SingleTurnSample. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate against. Similar to reference in SingleTurnSample. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it is converted into a list with a single element.
- retrieved_context (str | list[str], optional): The retrieved context. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it is converted into a list with a single element.
- rubrics (dict[str, str], optional): The rubrics to use for evaluation. Similar to rubrics in SingleTurnSample.
Initialize the RAGASMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric | SingleTurnMetric | The Ragas metric to use. | required |
| name | str | The name of the metric. Defaults to the name of the wrapped Ragas metric. | None |
| callbacks | Callbacks | The callbacks to use. Defaults to None. | None |
| timeout | int | The timeout for the metric. Defaults to None. | None |
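The sketch below wraps a Ragas single-turn metric. It assumes a Ragas release that exposes metric classes such as Faithfulness from ragas.metrics and that the Ragas metric is given its own judge LLM as described in the Ragas documentation; the wrapper's import path and the QAData container are placeholders.
Example (illustrative sketch):
import asyncio

# Placeholder import paths -- adjust to the actual package layout.
from my_eval_package.metrics import RAGASMetric
from my_eval_package.data import QAData

# Assumes a Ragas release that exposes single-turn metric classes.
from ragas.metrics import Faithfulness

async def main():
    # The Ragas metric may need its own judge LLM, e.g. Faithfulness(llm=...),
    # per the Ragas documentation.
    metric = RAGASMetric(metric=Faithfulness(), timeout=60)
    data = QAData(
        query="Who wrote Hamlet?",
        generated_response="Hamlet was written by William Shakespeare.",
        retrieved_context=["Hamlet is a tragedy written by William Shakespeare."],
    )
    print(await metric.evaluate(data))

asyncio.run(main())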
RagasContextPrecisionWithoutReference(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Context Precision (without reference) metric.
Required Fields:
- query (str): The query being evaluated.
- generated_response (str): The generated response to evaluate.
- retrieved_contexts (list[str]): The retrieved contexts whose precision is evaluated against the generated response.
Initialize the RagasContextPrecisionWithoutReference metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | required |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the underlying Ragas context precision metric. | {} |
RagasContextRecall(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Context Recall metric.
Required Fields:
- query (str): The query to recall the context for.
- generated_response (str): The generated response to recall the context for.
- expected_response (str): The expected response to recall the context for.
- retrieved_contexts (list[str]): The retrieved contexts to recall the context for.
Initialize the RagasContextRecall metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | required |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the RagasContextRecall metric. | {} |
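The convenience wrappers (RagasContextPrecisionWithoutReference, RagasContextRecall, RagasFactualCorrectness) are all constructed the same way. The sketch below is illustrative: the import paths, the QAData container, and the model identifier are placeholders, and the context is passed under the retrieved_context name from RAGASMetric's available fields.
Example (illustrative sketch):
import asyncio

# Placeholder import paths -- adjust to the actual package layout.
from my_eval_package.metrics import RagasContextRecall
from my_eval_package.data import QAData

async def main():
    metric = RagasContextRecall(lm_model="gpt-4o-mini")
    data = QAData(
        query="When was the Eiffel Tower completed?",
        generated_response="It was completed in 1889.",
        expected_response="The Eiffel Tower was completed in 1889.",
        retrieved_context=["The Eiffel Tower was finished in 1889 for the World's Fair."],
    )
    print(await metric.evaluate(data))

asyncio.run(main())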
RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Factual Correctness metric.
Required Fields: - query (str): The query being evaluated. - generated_response (str): The generated response whose factual correctness is evaluated.
Initialize the RagasFactualCorrectness metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | MODEL |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the RagasFactualCorrectness metric. | {} |
RedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Redundancy metric.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
| model_credentials | str | The model credentials to use for the metric. |
Initialize the RedundancyMetric class.
Default expected input: - query (str): The query to evaluate the redundancy of the model's output. - generated_response (str): The generated response to evaluate the redundancy of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to RedundancyResponseSchema. | None |
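A usage sketch for the generation metrics built on LMBasedMetric (RedundancyMetric here; CompletenessMetric, RefusalMetric, and the other LM-based metrics share the same constructor signature). The import paths, the QAData container, and the model identifier are placeholders.
Example (illustrative sketch):
import asyncio

# Placeholder import paths -- adjust to the actual package layout.
from my_eval_package.metrics import RedundancyMetric
from my_eval_package.data import QAData

async def main():
    metric = RedundancyMetric(model="gpt-4o-mini")
    # A deliberately repetitive answer, which should be flagged as redundant.
    data = QAData(
        query="What is the boiling point of water at sea level?",
        generated_response=(
            "Water boils at 100 °C at sea level. At sea level, the boiling "
            "point of water is 100 °C, the temperature at which water boils."
        ),
    )
    print(await metric.evaluate(data))

asyncio.run(main())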
RefusalAlignmentMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Refusal Alignment metric.
This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.
Required Fields:
- query (str): The query to evaluate the metric.
- expected_response (str): The expected response to evaluate the metric.
- generated_response (str): The generated response to evaluate the metric.
Optional Fields:
- is_refusal (bool): Whether the sample should be treated as a refusal response. If provided, this value will be used directly instead of detecting refusal from expected_response.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
| model_credentials | str | The model credentials to use for the metric. |
Initialize the RefusalAlignmentMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to RefusalAlignmentResponseSchema. | None |
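A sketch of the refusal-alignment check, including the optional is_refusal override. Passing is_refusal on the same input container is an assumption, as are the import paths and model identifier.
Example (illustrative sketch):
import asyncio

# Placeholder import paths -- adjust to the actual package layout.
from my_eval_package.metrics import RefusalAlignmentMetric
from my_eval_package.data import QAData

async def main():
    metric = RefusalAlignmentMetric(model="gpt-4o-mini")
    data = QAData(
        query="How do I pick a lock?",
        expected_response="I can't help with that request.",
        generated_response="Sorry, I can't assist with bypassing locks.",
        is_refusal=True,  # assumed optional field: skip refusal detection on expected_response
    )
    # Both responses refuse, so their refusal statuses align.
    print(await metric.evaluate(data))

asyncio.run(main())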
RefusalMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Refusal metric.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
| model_credentials | str | The model credentials to use for the metric. |
Initialize the RefusalMetric class.
Default expected input: - query (str): The query to evaluate the refusal of the model's output. - expected_response (str): The expected response to evaluate the refusal of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to CompletenessResponseSchema. | None |
TopKAccuracy(k=20)
Bases: BaseMetric
Top K Accuracy metric.
Required fields: - retrieved_chunks: The retrieved chunk ids with their similarity score. - ground_truth_chunk_ids: The ground truth chunk ids.
Example:
data = RetrievalData(
retrieved_chunks={
"chunk1": 0.9,
"chunk2": 0.8,
"chunk3": 0.7,
},
ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)
metric = TopKAccuracy()
await metric.evaluate(data)
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| k_values | list[int] | The number of retrieved chunks to consider. |
Initializes the TopKAccuracy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k | list[int] \| int | The number of retrieved chunks to consider. Defaults to 20. | 20 |
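Because k accepts either a single cutoff or a list of cutoffs, accuracy at several k values can be computed in one pass. The sketch below follows the example style above; the exact keys of the returned scores depend on the implementation.
Example (illustrative sketch):
data = RetrievalData(
    retrieved_chunks={"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7},
    ground_truth_chunk_ids=["chunk2", "chunk4"],
)
metric = TopKAccuracy(k=[1, 2, 3])  # report accuracy at several cutoffs
await metric.evaluate(data)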
top_k_accuracy(qrels, results)
Evaluates the top k accuracy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| qrels | dict[str, dict[str, int]] | The ground truth relevance of the retrieved chunks per query: 1 means the chunk is relevant to the query, 0 means it is not relevant. | required |
| results | dict[str, dict[str, float]] | The retrieved chunks with their similarity score. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | dict[str, float]: The top k accuracy. |
Example:
qrels = {
"q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}