DeepEval GEval
DeepEval GEval Metric Integration.
References
[1] https://deepeval.com/docs/metrics-llm-evals
[2] https://arxiv.org/abs/2303.16634
DeepEvalGEvalMetric(name, evaluation_params, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
DeepEval GEval Metric Integration.
This class wraps DeepEval's GEval metric and provides a unified interface to the DeepEval library.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields:
- query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
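For illustration, a single evaluation item built from these fields could look like the sketch below. A plain dict is used as a stand-in for the actual MetricInput type, which is an assumption rather than something documented on this page.

```python
# Illustrative item only; field names follow the "Available Fields" list above,
# and a plain dict stands in for the actual MetricInput type (assumption).
data = {
    "query": "What is the capital of France?",
    "generated_response": "The capital of France is Paris.",
    "expected_response": "Paris",
    "retrieved_context": ["Paris is the capital and largest city of France."],
}
```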
Initializes the DeepEvalGEvalMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluation_params | list[LLMTestCaseParams] | The evaluation parameters. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | DefaultValues.MODEL |
| criteria | str \| None | The criteria to use for the metric. Defaults to None. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to None. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to None. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Required when model is a string. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| additional_context | str \| None | Additional context such as few-shot examples. Defaults to None. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | DefaultValues.BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | DefaultValues.BATCH_MAX_ITERATIONS |
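As a rough construction sketch: the import path for DeepEvalGEvalMetric is a placeholder (this page does not state it), and the credential value is fake; only LLMTestCaseParams comes from the public deepeval package.

```python
from deepeval.test_case import LLMTestCaseParams

# Placeholder import path -- adjust to wherever DeepEvalGEvalMetric lives in your installation.
from your_package.metrics import DeepEvalGEvalMetric

metric = DeepEvalGEvalMetric(
    name="geval_completeness",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    criteria="Judge whether the actual output fully answers the input, using the expected output as reference.",
    model="openai/gpt-4.1",        # a string model id requires model_credentials
    model_credentials="sk-...",    # placeholder credential
    threshold=0.5,
)
```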
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, checks for custom prompts and processes accordingly. Items are currently processed individually; batch optimization may be added in the future.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |
Returns:
| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
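Because evaluate is a coroutine, it has to be awaited. A minimal sketch, reusing the hypothetical metric and data item from the examples above:

```python
import asyncio

async def main():
    # A single item yields a single MetricOutput; a list yields a list of MetricOutput.
    result = await metric.evaluate(data)
    print(result)  # scores are namespaced by the metric name, e.g. "geval_completeness"

asyncio.run(main())
```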
get_custom_prompt_base_name()
Get the base name for custom prompt column lookup.
For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The base name without the 'geval_' prefix (e.g., "completeness" instead of "geval_completeness"). |
Example

    metric.name = "geval_completeness"
    metric.get_custom_prompt_base_name() -> "completeness"

CSV columns expected:
- fewshot_completeness
- fewshot_completeness_mode
- evaluation_step_completeness
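The relationship between the metric name and the expected CSV columns can be restated as a small sketch (this simply mirrors the conventions above, not the actual implementation):

```python
metric_name = "geval_completeness"
base = metric_name.removeprefix("geval_")  # -> "completeness"

# Columns the custom-prompt lookup is expected to use, per the conventions above.
expected_columns = [
    f"fewshot_{base}",
    f"fewshot_{base}_mode",
    f"evaluation_step_{base}",
]
```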
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt (system + user) as a string. |
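For example, to inspect the prompt before running an evaluation (reusing the hypothetical metric and data item from above; a synchronous call is inferred from the signature shown on this page):

```python
prompt = metric.get_full_prompt(data)
print(prompt)  # combined system + user prompt as a single string
```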
PromptExtractionMixin
Mixin class that provides get_full_prompt functionality for metrics.
This mixin provides a standard interface for metrics that support prompt extraction. Metrics that inherit from this mixin should implement the get_full_prompt method.
get_full_prompt(data)
Get the full prompt that the metric generates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt as a string. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If the metric doesn't support prompt extraction. |
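A minimal sketch of the pattern the mixin describes: the base method raises NotImplementedError, and a concrete metric overrides get_full_prompt. Class names and the simplified dict input are illustrative, not taken from this page.

```python
class PromptExtractionSketch:
    """Illustrative stand-in for PromptExtractionMixin."""

    def get_full_prompt(self, data: dict) -> str:
        raise NotImplementedError("This metric doesn't support prompt extraction.")


class MyPromptedMetric(PromptExtractionSketch):
    """Toy metric that supports prompt extraction by overriding get_full_prompt."""

    def __init__(self, template: str):
        self.template = template

    def get_full_prompt(self, data: dict) -> str:
        # Build the complete prompt string from the metric input.
        return self.template.format(
            query=data["query"],
            generated_response=data["generated_response"],
        )
```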