DeepEval GEval
DeepEval GEval Metric Integration.
DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None, strict_mode=False)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
This class wraps DeepEval's GEval class, providing a unified interface to it through the DeepEval library integration.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields
- input (str, optional): The input query to evaluate.
- actual_output (str, optional): The generated response to evaluate.
- expected_output (str, optional): The expected response.
- expected_context (str | list[str], optional): The expected retrieved context. A str is converted to a single-element list.
- retrieved_context (str | list[str], optional): The retrieved contexts. A str is converted to a single-element list.
Scoring
- 0.0-1.0 (continuous) or boolean, depending on the DeepEval GEval configuration.
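A minimal sketch of how a continuous score might interact with `threshold` and `strict_mode` (illustrative only; `finalize_score` is a hypothetical helper, not part of the library):

```python
def finalize_score(raw: float, threshold: float = 0.5, strict_mode: bool = False) -> tuple[float, bool]:
    """Illustrative only: clamp a raw 0.0-1.0 score, optionally binarize it,
    and report whether it passes the threshold."""
    score = min(max(raw, 0.0), 1.0)
    passed = score >= threshold
    if strict_mode:
        # strict_mode collapses the continuous score to 1.0 (pass) or 0.0 (fail)
        score = 1.0 if passed else 0.0
    return score, passed
```

With `strict_mode=False`, `finalize_score(0.72)` keeps the continuous score; with `strict_mode=True` it is binarized to 1.0 or 0.0 around the threshold.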
Initializes the DeepEvalGEvalMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of the metric. Required if not provided via `_defaults`. | `None` |
| `evaluation_params` | `list[LLMTestCaseParams] \| None` | The evaluation parameters. Required if not provided via `_defaults`. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metric. | `DefaultValues.MODEL` |
| `criteria` | `str \| None` | The criteria to use for the metric. | `None` |
| `evaluation_steps` | `list[str] \| None` | The evaluation steps to use for the metric. | `None` |
| `rubric` | `list[Rubric] \| None` | The rubric to use for the metric. | `None` |
| `model_credentials` | `str \| None` | The model credentials. Required when `model` is a string. | `None` |
| `model_config` | `dict[str, Any] \| None` | The model configuration. | `None` |
| `threshold` | `float` | The score threshold. Must be between 0.0 and 1.0 inclusive. | `0.5` |
| `additional_context` | `str \| None` | Additional context, such as few-shot examples. | `None` |
| `batch_status_check_interval` | `float` | Time between batch status checks, in seconds (default 30.0). | `DefaultValues.BATCH_STATUS_CHECK_INTERVAL` |
| `batch_max_iterations` | `int` | Maximum number of status-check iterations before timeout (default 120). | `DefaultValues.BATCH_MAX_ITERATIONS` |
| `num_judges` | `int` | The number of judges to use (default 1). | `DefaultValues.NUM_JUDGES` |
| `aggregation_method` | `AggregationSelector` | The method used to aggregate scores from multiple judges. | `DefaultValues.AGGREGATION_METHOD` |
| `max_concurrent_judges` | `int \| None` | The maximum number of judges run concurrently. | `None` |
| `strict_mode` | `bool` | If `True`, binarizes the score to 1.0 or 0.0. | `False` |
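When `num_judges` is greater than one, the per-judge scores must be combined by the configured `aggregation_method`. The real `AggregationSelector` is library-defined; the sketch below only illustrates the idea with hypothetical string method names:

```python
from statistics import mean, median

def aggregate_judge_scores(scores: list[float], method: str = "mean") -> float:
    """Illustrative only: combine per-judge scores into a single metric score.
    The method names here are assumptions, not the library's AggregationSelector."""
    if not scores:
        raise ValueError("at least one judge score is required")
    if method == "mean":
        return mean(scores)
    if method == "median":
        return median(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```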
evaluate(data, temp_fewshot=None, temp_info=None, fewshot_mode='append')
async
Evaluate with custom prompt lifecycle support and heterogeneous judges.
Handles three concerns:

1. Runtime prompt parameters (`temp_fewshot`, `temp_info`)
2. Heterogeneous judges (`judge` parameter with different models)
3. Batch processing
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput \| list[EvalInput]` | Single data item or list of data items to evaluate. | *required* |
| `temp_fewshot` | `str \| None` | Runtime fewshot examples. | `None` |
| `temp_info` | `str \| None` | Additional context information. | `None` |
| `fewshot_mode` | `Literal['append', 'replace']` | How to merge fewshot examples. | `'append'` |
Returns:
| Type | Description |
|---|---|
| `MetricOutput \| list[MetricOutput]` | Evaluation results with scores namespaced by metric name. |
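Batch processing is governed by `batch_status_check_interval` and `batch_max_iterations`. A generic polling loop along those lines might look like the following sketch (the `get_status` callback and status strings are assumptions, not the library's API):

```python
import time
from typing import Callable

def wait_for_batch(get_status: Callable[[], str],
                   check_interval: float = 30.0,
                   max_iterations: int = 120) -> str:
    """Illustrative only: poll a batch job until it finishes, sleeping
    check_interval seconds between checks and giving up after max_iterations."""
    for _ in range(max_iterations):
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(check_interval)
    raise TimeoutError("batch did not finish within the allotted iterations")
```

With the documented defaults (30.0 s interval, 120 iterations), the timeout works out to roughly an hour.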
get_custom_prompt_base_name()
Get the base name for custom prompt column lookup.
For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.
Returns:
| Type | Description |
|---|---|
| `str` | The base name without the `geval_` prefix (e.g., `"completeness"` instead of `"geval_completeness"`). |
Example
    metric.name = "geval_completeness"
    metric.get_custom_prompt_base_name()  # -> "completeness"

CSV columns expected:

- fewshot_completeness
- fewshot_completeness_mode
- evaluation_step_completeness
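The prefix stripping and the column-name pattern above can be sketched in plain Python (`custom_prompt_base_name` and `expected_csv_columns` are hypothetical helper names, not the library's):

```python
def custom_prompt_base_name(metric_name: str, prefix: str = "geval_") -> str:
    """Strip the 'geval_' prefix so the base name matches CSV column conventions."""
    return metric_name.removeprefix(prefix)

def expected_csv_columns(base: str) -> list[str]:
    # Column names follow the pattern documented above.
    return [f"fewshot_{base}", f"fewshot_{base}_mode", f"evaluation_step_{base}"]
```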
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput` | The metric input. | *required* |
Returns:
| Type | Description |
|---|---|
| `str` | The complete prompt (system + user) as a string. |
GllmGEvalTemplate
Bases: GEvalTemplate
GEval template variant with reason before score.
generate_evaluation_results(evaluation_steps, test_case_content, parameters, rubric=None, score_range=(1, 3), _additional_context=None, multimodal=False)
staticmethod
Generate evaluation prompt with reason listed before score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_steps` | `str` | Numbered evaluation steps used to judge the response. | *required* |
| `test_case_content` | `str` | Rendered test case content included in the prompt. | *required* |
| `parameters` | `str` | Evaluation parameter names referenced by the evaluator. | *required* |
| `rubric` | `str \| None` | Formatted rubric text to include in the prompt. | `None` |
| `score_range` | `tuple[int, int]` | Inclusive score range for the evaluator. | `(1, 3)` |
| `_additional_context` | `str \| None` | Additional context such as few-shot examples. | `None` |
| `multimodal` | `bool` | Whether to include multimodal evaluation rules. | `False` |
Returns:
| Type | Description |
|---|---|
| `str` | Full evaluation prompt string. |
MetricDefaults(name='', criteria=None, evaluation_steps=None, rubric=None, evaluation_params=None, additional_context=None)
dataclass
Metric defaults for DeepEval GEval.
PromptExtractionMixin
Mixin class that provides get_full_prompt functionality for metrics.
This mixin provides a standard interface for metrics that support prompt extraction. Metrics that inherit from this mixin should implement the get_full_prompt method.
get_full_prompt(data)
Get the full prompt that the metric generates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The metric input. | *required* |
Returns:
| Type | Description |
|---|---|
| `str` | The complete prompt as a string. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | If the metric doesn't support prompt extraction. |
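The mixin contract above can be sketched as follows (`PromptExtractionSketch` and `EchoMetric` are hypothetical illustration classes, not the library's):

```python
class PromptExtractionSketch:
    """Illustrative only: subclasses that support prompt extraction override
    get_full_prompt; others inherit the NotImplementedError behaviour."""

    def get_full_prompt(self, data) -> str:
        raise NotImplementedError(
            f"{type(self).__name__} does not support prompt extraction"
        )

class EchoMetric(PromptExtractionSketch):
    def get_full_prompt(self, data) -> str:
        # A trivial override: render the input as the full prompt.
        return f"Evaluate: {data}"
```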