
DeepEval GEval

DeepEval GEval Metric Integration.

Authors

Surya Mahadi (made.r.s.mahadi@gdplabs.id)

References

[1] DeepEval documentation, LLM-Evals (G-Eval) metric: https://deepeval.com/docs/metrics-llm-evals
[2] Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment": https://arxiv.org/abs/2303.16634

```python
DeepEvalGEvalMetric(
    name,
    evaluation_params,
    model=DefaultValues.MODEL,
    criteria=None,
    evaluation_steps=None,
    rubric=None,
    model_credentials=None,
    model_config=None,
    threshold=0.5,
    additional_context=None,
    batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL,
    batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS,
)
```

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class is a wrapper around DeepEval's GEval metric, exposing it through a unified interface for the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields:

- `query` (str, optional): The query to evaluate the metric. Similar to `input` in `LLMTestCaseParams`.
- `generated_response` (str | list[str], optional): The generated response to evaluate the metric. Similar to `actual_output` in `LLMTestCaseParams`. If the generated response is a list, the responses are concatenated into a single string.
- `expected_response` (str | list[str], optional): The expected response to evaluate the metric. Similar to `expected_output` in `LLMTestCaseParams`. If the expected response is a list, the responses are concatenated into a single string.
- `expected_retrieved_context` (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to `context` in `LLMTestCaseParams`. If it is a str, it is converted into a list with a single element.
- `retrieved_context` (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to `retrieval_context` in `LLMTestCaseParams`. If it is a str, it is converted into a list with a single element.

Initializes the DeepEvalGEvalMetric class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The name of the metric. | *required* |
| `evaluation_params` | `list[LLMTestCaseParams]` | The evaluation parameters. | *required* |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metric. | `"openai/gpt-4.1"` |
| `criteria` | `str \| None` | The criteria to use for the metric. | `None` |
| `evaluation_steps` | `list[str] \| None` | The evaluation steps to use for the metric. | `None` |
| `rubric` | `list[Rubric] \| None` | The rubric to use for the metric. | `None` |
| `model_credentials` | `str \| None` | The model credentials to use for the metric. Required when `model` is a string. | `None` |
| `model_config` | `dict[str, Any] \| None` | The model config to use for the metric. | `None` |
| `threshold` | `float` | The threshold to use for the metric. | `0.5` |
| `additional_context` | `str \| None` | Additional context such as few-shot examples. | `None` |
| `batch_status_check_interval` | `float` | Time between batch status checks, in seconds. | `30.0` |
| `batch_max_iterations` | `int` | Maximum number of status check iterations before timeout. | `120` |
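For orientation, here is a minimal construction sketch. `LLMTestCaseParams` comes from DeepEval itself; the import path for `DeepEvalGEvalMetric`, the criteria text, and the placeholder credentials are assumptions for illustration, not the package's documented usage.

```python
from deepeval.test_case import LLMTestCaseParams

# Hypothetical import path; adjust to wherever DeepEvalGEvalMetric lives in your package.
from your_package.metrics import DeepEvalGEvalMetric

# Create a GEval metric that judges completeness of the generated response
# against the expected response, using the default model ("openai/gpt-4.1").
metric = DeepEvalGEvalMetric(
    name="geval_completeness",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    criteria="Determine whether the actual output covers every point in the expected output.",
    model_credentials="sk-...",  # required here because `model` is left as a string
    threshold=0.5,
)
```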

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, the method checks for custom prompts and processes the items accordingly. Items are currently processed individually; batch optimization can be added in the future.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `MetricInput \| list[MetricInput]` | Single data item or list of data items to evaluate. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `MetricOutput \| list[MetricOutput]` | Evaluation results with scores namespaced by metric name. |
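Continuing the sketch above, a hedged example of calling `evaluate()`. The exact shape of `MetricInput` is defined by the package; the dictionary below simply mirrors the "Available Fields" listed earlier and is an assumption.

```python
import asyncio

# Assumed MetricInput shape, mirroring the "Available Fields" list above;
# substitute the package's real MetricInput type in practice.
data = {
    "query": "What is the capital of France?",
    "generated_response": "Paris is the capital of France.",
    "expected_response": "Paris.",
}

result = asyncio.run(metric.evaluate(data))
print(result)  # scores namespaced by metric name, e.g. under "geval_completeness"
```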

get_custom_prompt_base_name()

Get the base name for custom prompt column lookup.

For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.

Returns:

| Type | Description |
| --- | --- |
| `str` | The base name without the `geval_` prefix (e.g., `"completeness"` instead of `"geval_completeness"`). |

Example:

```python
metric.name = "geval_completeness"
metric.get_custom_prompt_base_name()  # -> "completeness"
```

CSV columns expected:

- `fewshot_completeness`
- `fewshot_completeness_mode`
- `evaluation_step_completeness`
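The prefix handling itself is simple; a standalone sketch of the behavior described above (not the package's actual implementation):

```python
GEVAL_PREFIX = "geval_"

def custom_prompt_base_name(metric_name: str) -> str:
    """Strip the 'geval_' prefix so CSV column lookups use the bare metric name."""
    if metric_name.startswith(GEVAL_PREFIX):
        return metric_name[len(GEVAL_PREFIX):]
    return metric_name

assert custom_prompt_base_name("geval_completeness") == "completeness"
```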

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `MetricInput` | The metric input. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | The complete prompt (system + user) as a string. |
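Continuing the earlier sketch, prompt extraction can be used to inspect exactly what the judge model sees; `metric` and `data` are the hypothetical objects from the examples above.

```python
# Inspect the exact prompt DeepEval generates for this metric and input.
prompt = metric.get_full_prompt(data)
print(prompt)  # the complete system + user prompt as one string
```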

PromptExtractionMixin

Mixin class that provides get_full_prompt functionality for metrics.

This mixin provides a standard interface for metrics that support prompt extraction. Metrics that inherit from this mixin should implement the get_full_prompt method.

get_full_prompt(data)

Get the full prompt that the metric generates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `MetricInput` | The metric input. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | The complete prompt as a string. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | If the metric doesn't support prompt extraction. |
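To illustrate the contract, a minimal sketch of a metric implementing the mixin; both classes below are simplified stand-ins, not the package source.

```python
class PromptExtractionMixin:
    """Simplified stand-in for the real mixin: the default raises NotImplementedError."""

    def get_full_prompt(self, data) -> str:
        raise NotImplementedError("This metric doesn't support prompt extraction.")


class CompletenessMetric(PromptExtractionMixin):
    """Hypothetical metric that supports prompt extraction."""

    def get_full_prompt(self, data) -> str:
        # Subclasses return the complete prompt (system + user) as a string.
        return f"System: You are an evaluator.\nUser: Evaluate this input: {data}"
```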