DeepEval GEval
DeepEval GEval Metric Integration.
DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
This class wraps DeepEval's GEval class, providing a unified interface to it through the DeepEval library integration.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields
- input (str, optional): The input query to evaluate.
- actual_output (str, optional): The generated response to evaluate.
- expected_output (str, optional): The expected response.
- expected_context (str | list[str], optional): The expected retrieved context. A str is converted to a single-element list.
- retrieved_context (str | list[str], optional): The retrieved contexts. A str is converted to a single-element list.
Scoring
- 0.0-1.0 (continuous) or boolean, depending on the DeepEval GEval configuration.
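A minimal sketch of how a continuous score might interact with `threshold` and `strict_mode` (illustrative only; `finalize_score` is a hypothetical helper, not part of the library):

```python
def finalize_score(raw: float, threshold: float = 0.5, strict_mode: bool = False) -> tuple[float, bool]:
    """Illustrative only: clamp a raw 0.0-1.0 score, optionally binarize it,
    and report whether it passes the threshold."""
    score = min(max(raw, 0.0), 1.0)
    passed = score >= threshold
    if strict_mode:
        # strict_mode collapses the continuous score to 1.0 (pass) or 0.0 (fail)
        score = 1.0 if passed else 0.0
    return score, passed
```

With `strict_mode=False`, `finalize_score(0.72)` keeps the continuous score; with `strict_mode=True` it is binarized to 1.0 or 0.0 around the threshold.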
Initializes the DeepEvalGEvalMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of the metric. Required if not provided via `_defaults`. | `None` |
| `evaluation_params` | `list[LLMTestCaseParams] \| None` | The evaluation parameters. Required if not provided via `_defaults`. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metric. | `DefaultValues.MODEL` |
| `criteria` | `str \| None` | The criteria to use for the metric. | `None` |
| `evaluation_steps` | `list[str] \| None` | The evaluation steps to use for the metric. | `None` |
| `rubric` | `list[Rubric] \| None` | The rubric to use for the metric. | `None` |
| `model_credentials` | `str \| None` | The model credentials. Required when `model` is a string. | `None` |
| `model_config` | `dict[str, Any] \| None` | The model configuration. | `None` |
| `threshold` | `float` | The score threshold. Must be between 0.0 and 1.0 inclusive. | `0.5` |
| `additional_context` | `str \| None` | Additional context, such as few-shot examples. | `None` |
| `batch_status_check_interval` | `float` | Time between batch status checks, in seconds (default 30.0). | `DefaultValues.BATCH_STATUS_CHECK_INTERVAL` |
| `batch_max_iterations` | `int` | Maximum number of status-check iterations before timeout (default 120). | `DefaultValues.BATCH_MAX_ITERATIONS` |
| `num_judges` | `int` | The number of judges to use (default 1). | `DefaultValues.NUM_JUDGES` |
| `aggregation_method` | `AggregationSelector` | The method used to aggregate scores from multiple judges. | `DefaultValues.AGGREGATION_METHOD` |
| `max_concurrent_judges` | `int \| None` | The maximum number of judges run concurrently. | `None` |
| `strict_mode` | `bool` | If `True`, binarizes the score to 1.0 or 0.0. | `False` |
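When `num_judges` is greater than one, the per-judge scores must be combined by the configured `aggregation_method`. The real `AggregationSelector` is library-defined; the sketch below only illustrates the idea with hypothetical string method names:

```python
from statistics import mean, median

def aggregate_judge_scores(scores: list[float], method: str = "mean") -> float:
    """Illustrative only: combine per-judge scores into a single metric score.
    The method names here are assumptions, not the library's AggregationSelector."""
    if not scores:
        raise ValueError("at least one judge score is required")
    if method == "mean":
        return mean(scores)
    if method == "median":
        return median(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```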
evaluate(data, temp_fewshot=None, temp_info=None, fewshot_mode='append')
async
Evaluate with custom prompt lifecycle support and heterogeneous judges.
Handles three concerns:

1. Runtime prompt parameters (`temp_fewshot`, `temp_info`)
2. Heterogeneous judges (`judge` parameter with different models)
3. Batch processing
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput \| list[EvalInput]` | Single data item or list of data items to evaluate. | *required* |
| `temp_fewshot` | `str \| None` | Runtime fewshot examples. | `None` |
| `temp_info` | `str \| None` | Additional context information. | `None` |
| `fewshot_mode` | `Literal['append', 'replace']` | How to merge fewshot examples. | `'append'` |
Returns:
| Type | Description |
|---|---|
| `MetricOutput \| list[MetricOutput]` | Evaluation results with scores namespaced by metric name. |
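Batch processing is governed by `batch_status_check_interval` and `batch_max_iterations`. A generic polling loop along those lines might look like the following sketch (the `get_status` callback and status strings are assumptions, not the library's API):

```python
import time
from typing import Callable

def wait_for_batch(get_status: Callable[[], str],
                   check_interval: float = 30.0,
                   max_iterations: int = 120) -> str:
    """Illustrative only: poll a batch job until it finishes, sleeping
    check_interval seconds between checks and giving up after max_iterations."""
    for _ in range(max_iterations):
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(check_interval)
    raise TimeoutError("batch did not finish within the allotted iterations")
```

With the documented defaults (30.0 s interval, 120 iterations), the timeout works out to roughly an hour.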
get_custom_prompt_base_name()
Get the base name for custom prompt column lookup.
For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.
Returns:
| Type | Description |
|---|---|
| `str` | The base name without the `geval_` prefix (e.g., `"completeness"` instead of `"geval_completeness"`). |
Example
    metric.name = "geval_completeness"
    metric.get_custom_prompt_base_name()  # -> "completeness"

CSV columns expected:

- fewshot_completeness
- fewshot_completeness_mode
- evaluation_step_completeness
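The prefix stripping and the column-name pattern above can be sketched in plain Python (`custom_prompt_base_name` and `expected_csv_columns` are hypothetical helper names, not the library's):

```python
def custom_prompt_base_name(metric_name: str, prefix: str = "geval_") -> str:
    """Strip the 'geval_' prefix so the base name matches CSV column conventions."""
    return metric_name.removeprefix(prefix)

def expected_csv_columns(base: str) -> list[str]:
    # Column names follow the pattern documented above.
    return [f"fewshot_{base}", f"fewshot_{base}_mode", f"evaluation_step_{base}"]
```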
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput` | The metric input. | *required* |
Returns:
| Type | Description |
|---|---|
| `str` | The complete prompt (system + user) as a string. |
GllmGEvalTemplate
Bases: GEvalTemplate
GEval template variant with reason before score.
generate_evaluation_results(evaluation_steps, test_case_content, parameters, rubric=None, score_range=(1, 3), _additional_context=None, multimodal=False)
staticmethod
Generate evaluation prompt with reason listed before score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_steps` | `str` | Numbered evaluation steps used to judge the response. | *required* |
| `test_case_content` | `str` | Rendered test case content included in the prompt. | *required* |
| `parameters` | `str` | Evaluation parameter names referenced by the evaluator. | *required* |
| `rubric` | `str \| None` | Formatted rubric text to include in the prompt. | `None` |
| `score_range` | `tuple[int, int]` | Inclusive score range for the evaluator. | `(1, 3)` |
| `_additional_context` | `str \| None` | Additional context such as few-shot examples. | `None` |
| `multimodal` | `bool` | Whether to include multimodal evaluation rules. | `False` |
Returns:
| Type | Description |
|---|---|
| `str` | Full evaluation prompt string. |
MetricDefaults(name='', criteria=None, evaluation_steps=None, rubric=None, evaluation_params=None, additional_context=None)
dataclass
Metric defaults for DeepEval GEval.
PromptExtractionMixin
Mixin class that provides get_full_prompt functionality for metrics.
This mixin provides a standard interface for metrics that support prompt extraction. Metrics that inherit from this mixin should implement the get_full_prompt method.
get_full_prompt(data)
Get the full prompt that the metric generates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The metric input. | *required* |
Returns:
| Type | Description |
|---|---|
| `str` | The complete prompt as a string. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | If the metric doesn't support prompt extraction. |
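The mixin contract above can be sketched as follows (`PromptExtractionSketch` and `EchoMetric` are hypothetical illustration classes, not the library's):

```python
class PromptExtractionSketch:
    """Illustrative only: subclasses that support prompt extraction override
    get_full_prompt; others inherit the NotImplementedError behaviour."""

    def get_full_prompt(self, data) -> str:
        raise NotImplementedError(
            f"{type(self).__name__} does not support prompt extraction"
        )

class EchoMetric(PromptExtractionSketch):
    def get_full_prompt(self, data) -> str:
        # A trivial override: render the input as the full prompt.
        return f"Evaluate: {data}"
```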