DeepEval GEval

DeepEval GEval Metric Integration.

DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class wraps DeepEval's GEval class and provides a unified interface to the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields
  • input (str, optional): The query to evaluate the metric.
  • actual_output (str, optional): The generated response to evaluate the metric.
  • expected_output (str, optional): The expected response to evaluate the metric.
  • expected_context (str | list[str], optional): The expected retrieved context to evaluate the metric. If a str, it will be converted to a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. If a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (continuous), or binary (1.0 or 0.0) depending on the DeepEval GEval configuration (e.g. strict_mode).
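The interplay between threshold and strict_mode can be sketched as follows (finalize_score is a hypothetical helper for illustration, not part of the library):

```python
def finalize_score(
    raw_score: float, threshold: float = 0.5, strict_mode: bool = False
) -> tuple[float, bool]:
    """Return (score, passed). With strict_mode, the score is binarized to 1.0 or 0.0."""
    passed = raw_score >= threshold
    if strict_mode:
        return (1.0 if passed else 0.0), passed
    return raw_score, passed
```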

Initializes the DeepEvalGEvalMetric class.

Parameters:
  • name (str | None): The name of the metric. Required if not provided via _defaults. Defaults to None.
  • evaluation_params (list[LLMTestCaseParams] | None): The evaluation parameters. Required if not provided via _defaults. Defaults to None.
  • model (str | ModelId | BaseLMInvoker): The model to use for the metric. Defaults to DefaultValues.MODEL.
  • criteria (str | None): The criteria to use for the metric. Defaults to None.
  • evaluation_steps (list[str] | None): The evaluation steps to use for the metric. Defaults to None.
  • rubric (list[Rubric] | None): The rubric to use for the metric. Defaults to None.
  • model_credentials (str | None): The model credentials to use for the metric. Required when model is a string. Defaults to None.
  • model_config (dict[str, Any] | None): The model config to use for the metric. Defaults to None.
  • threshold (float): The passing threshold for the metric. Must be between 0.0 and 1.0 inclusive. Defaults to 0.5.
  • additional_context (str | None): Additional context such as few-shot examples. Defaults to None.
  • batch_status_check_interval (float): Time between batch status checks in seconds. Defaults to 30.0.
  • batch_max_iterations (int): Maximum number of status check iterations before timeout. Defaults to 120.
  • num_judges (int): The number of judges to use for the metric. Defaults to 1.
  • aggregation_method (AggregationSelector): The method used to combine judge scores. Defaults to DefaultValues.AGGREGATION_METHOD.
  • max_concurrent_judges (int | None): The maximum number of concurrent judges. Defaults to None.
  • strict_mode (bool): If True, binarizes the score to 1.0 or 0.0. Defaults to False.

evaluate(data, temp_fewshot=None, temp_info=None, fewshot_mode='append') async

Evaluate with custom prompt lifecycle support and heterogeneous judges.

Handles three concerns:
  1. Runtime prompt parameters (temp_fewshot, temp_info).
  2. Heterogeneous judges (judge parameter with different models).
  3. Batch processing.

Parameters:
  • data (EvalInput | list[EvalInput]): Single data item or list of data items to evaluate. Required.
  • temp_fewshot (str | None): Runtime few-shot examples. Defaults to None.
  • temp_info (str | None): Additional context information. Defaults to None.
  • fewshot_mode (Literal['append', 'replace']): How to merge the few-shot examples. Defaults to "append".

Returns:
  • MetricOutput | list[MetricOutput]: Evaluation results with scores namespaced by metric name.
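The multi-judge behaviour (num_judges, aggregation_method) amounts to each judge scoring the same input independently and the per-judge scores then being combined. The helper below is an illustrative stand-in; the library's actual AggregationSelector options may differ:

```python
from statistics import mean, median


def aggregate_judge_scores(scores: list[float], method: str = "mean") -> float:
    """Combine independent per-judge scores into a single metric score."""
    if not scores:
        raise ValueError("at least one judge score is required")
    if method == "mean":
        return mean(scores)
    if method == "median":
        return median(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```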

get_custom_prompt_base_name()

Get the base name for custom prompt column lookup.

For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.

Returns:
  • str: The base name without the 'geval_' prefix (e.g., "completeness" instead of "geval_completeness").

Example

  metric.name = "geval_completeness"
  metric.get_custom_prompt_base_name()  -> "completeness"

CSV columns expected:
  • fewshot_completeness
  • fewshot_completeness_mode
  • evaluation_step_completeness
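The prefix-stripping and column-naming behaviour described above can be sketched as follows (hypothetical helpers; the actual implementation may differ):

```python
def custom_prompt_base_name(metric_name: str) -> str:
    """Strip the 'geval_' prefix so CSV column lookup uses the bare metric name."""
    return metric_name.removeprefix("geval_")


def expected_csv_columns(metric_name: str) -> list[str]:
    """Derive the expected CSV column names from a metric name."""
    base = custom_prompt_base_name(metric_name)
    return [f"fewshot_{base}", f"fewshot_{base}_mode", f"evaluation_step_{base}"]
```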

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:
  • data (EvalInput): The metric input. Required.

Returns:
  • str: The complete prompt (system + user) as a string.

GllmGEvalTemplate

Bases: GEvalTemplate

GEval template variant with reason before score.

generate_evaluation_results(evaluation_steps, test_case_content, parameters, rubric=None, score_range=(1, 3), _additional_context=None, multimodal=False) staticmethod

Generate evaluation prompt with reason listed before score.

Parameters:
  • evaluation_steps (str): Numbered evaluation steps used to judge the response. Required.
  • test_case_content (str): Rendered test case content included in the prompt. Required.
  • parameters (str): Evaluation parameter names referenced by the evaluator. Required.
  • rubric (str | None): Formatted rubric text to include in the prompt. Defaults to None.
  • score_range (tuple[int, int]): Inclusive score range for the evaluator. Defaults to (1, 3).
  • _additional_context (str | None): Additional context such as few-shot examples. Defaults to None.
  • multimodal (bool): Whether to include multimodal evaluation rules. Defaults to False.

Returns:
  • str: Full evaluation prompt string.

MetricDefaults(name='', criteria=None, evaluation_steps=None, rubric=None, evaluation_params=None, additional_context=None) dataclass

Metric defaults for DeepEval GEval.

PromptExtractionMixin

Mixin class that provides get_full_prompt functionality for metrics.

This mixin provides a standard interface for metrics that support prompt extraction. Metrics that inherit from this mixin should implement the get_full_prompt method.

get_full_prompt(data)

Get the full prompt that the metric generates.

Parameters:
  • data (MetricInput): The metric input. Required.

Returns:
  • str: The complete prompt as a string.

Raises:
  • NotImplementedError: If the metric doesn't support prompt extraction.
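In sketch form, the mixin contract looks like this (EchoMetric is a hypothetical subclass for illustration; the library's real metrics build far richer prompts):

```python
class PromptExtractionMixin:
    """Base contract: subclasses that support prompt extraction override get_full_prompt."""

    def get_full_prompt(self, data) -> str:
        # Default behaviour: signal that this metric cannot expose its prompt.
        raise NotImplementedError(
            f"{type(self).__name__} does not support prompt extraction"
        )


class EchoMetric(PromptExtractionMixin):
    """Hypothetical metric that simply embeds its input in a fixed prompt."""

    def get_full_prompt(self, data) -> str:
        return f"Evaluate the following response:\n{data}"
```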