DeepEval

DeepEval Metric Integration.

`DeepEvalMetric(metric, name, models=None, aggregation_method=DefaultValues.AGGREGATION_METHOD, max_concurrent_judges=None)`

DeepEval Metric.

A wrapper for DeepEval metrics.

Available Fields

input (str): The query the metric is evaluated on.
actual_output (str, optional): The generated response to evaluate.
expected_output (str, optional): The expected (reference) response.
expected_context (str | list[str], optional): The expected retrieved context. A str is converted into a single-element list.
retrieved_context (str | list[str], optional): The retrieved contexts. A str is converted into a single-element list.
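
The str-to-list conversion described for `expected_context` and `retrieved_context` can be sketched as below. The helper name `normalize_context` is hypothetical and not part of the library; it only illustrates the documented behavior, and the handling of `None` is an assumption.

```python
from __future__ import annotations


def normalize_context(value: str | list[str] | None) -> list[str] | None:
    """Hypothetical helper: wrap a bare string in a single-element list."""
    if value is None:
        # Assumption: an unset optional field is passed through unchanged.
        return None
    if isinstance(value, str):
        # A single string becomes a list with exactly one element.
        return [value]
    # Already a list: return a shallow copy so the caller's list is not shared.
    return list(value)


single = normalize_context("Paris is the capital of France.")
many = normalize_context(["chunk one", "chunk two"])
```

With this shape, downstream code can always iterate over the context fields without checking whether the caller passed a string or a list.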

Scoring

Initializes the DeepEvalMetric class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metric` | `BaseMetric` | The DeepEval metric to wrap. | required |
| `name` | `str` | The name of the metric. | required |
| `models` | `BaseLMInvoker \| list[BaseLMInvoker] \| None` | The model invoker(s) to use for multi-judge evaluation. Defaults to None. | `None` |
| `aggregation_method` | `AggregationSelector` | The aggregation method to use for the metric. Defaults to DefaultValues.AGGREGATION_METHOD. | `AGGREGATION_METHOD` |
| `max_concurrent_judges` | `int \| None` | The maximum number of concurrent judges to use for the metric. Defaults to None. | `None` |
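
To make the interplay of `models`, `aggregation_method`, and `max_concurrent_judges` more concrete, here is a minimal sketch of bounded-concurrency multi-judge scoring. It does not use the library: `Judge`, `run_judges`, and `constant_judge` are hypothetical stand-ins, averaging is only one possible aggregation, and treating `None` as "no concurrency cap" is an assumption.

```python
import asyncio
from statistics import mean
from typing import Awaitable, Callable, Optional, Sequence

# Hypothetical stand-in for a judge backed by a model invoker.
Judge = Callable[[str], Awaitable[float]]


async def run_judges(
    judges: Sequence[Judge],
    prompt: str,
    max_concurrent: Optional[int] = None,
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Score `prompt` with every judge, at most `max_concurrent` at a time."""
    if not judges:
        raise ValueError("at least one judge is required")
    # Assumption: None means all judges may run at once.
    limit = len(judges) if max_concurrent is None else max_concurrent
    if limit < 1:
        raise ValueError("max_concurrent must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def guarded(judge: Judge) -> float:
        # The semaphore caps how many judges are awaiting a result at once.
        async with semaphore:
            return await judge(prompt)

    scores = await asyncio.gather(*(guarded(judge) for judge in judges))
    return aggregate(scores)


def constant_judge(score: float) -> Judge:
    """Build a fake judge that always returns `score`."""

    async def judge(_prompt: str) -> float:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return score

    return judge


result = asyncio.run(
    run_judges([constant_judge(0.5), constant_judge(1.0)], "query", max_concurrent=1)
)
```

Capping concurrency trades latency for a lower chance of hitting provider rate limits when many judges share one backend.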

Bases: DeepEvalMetric, ABC

DeepEval Metric Factory.

Abstract base class for creating DeepEval metrics with a shared model invoker.

Available Fields

Scoring

Initializes the metric, handling common model invoker creation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The name for the metric. | required |
| `batch_status_check_interval` | `float` | Time between batch status checks in seconds. Defaults to 30.0. | `BATCH_STATUS_CHECK_INTERVAL` |
| `batch_max_iterations` | `int` | Maximum number of status check iterations before timeout. Defaults to 120. | `BATCH_MAX_ITERATIONS` |
| `models` | `BaseLMInvoker \| list[BaseLMInvoker] \| None` | The model invoker(s) to use for multi-judge evaluation. `None` (default): single-judge mode using the default invoker. `[invoker] * N`: homogeneous, the same model N times. `[invoker_a, invoker_b]`: heterogeneous, distinct models. | `None` |
| `fallback_models` | `list[BaseLMInvoker] \| None` | Ordered list of fallback invokers tried in sequence when the primary judge fails. Defaults to None. | `None` |
| `aggregation_method` | `AggregationSelector` | The aggregation method to use for repeated-judge evaluation. Defaults to DefaultValues.AGGREGATION_METHOD. | `AGGREGATION_METHOD` |
| `max_concurrent_judges` | `int \| None` | The maximum number of concurrent judges to use for the metric. Defaults to None. | `None` |
| `**kwargs` |  | Additional arguments for the specific DeepEval metric constructor. | `{}` |
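
Two of the behaviors described in this table, ordered fallback and bounded status polling, can be sketched without the library. Everything below is a hypothetical illustration: `Invoker`, `judge_with_fallback`, and `wait_for_batch` are invented names, and the assumption that the timeout equals `batch_max_iterations` checks spaced `batch_status_check_interval` seconds apart is inferred from the parameter descriptions rather than confirmed by the source.

```python
import time
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for a model invoker: takes a prompt, returns a score.
Invoker = Callable[[str], float]


def judge_with_fallback(
    primary: Invoker,
    fallbacks: Optional[Sequence[Invoker]],
    prompt: str,
) -> float:
    """Try the primary invoker, then each fallback in the order given."""
    errors: List[Exception] = []
    for invoker in [primary, *(fallbacks or [])]:
        try:
            return invoker(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} invokers failed") from errors[-1]


def wait_for_batch(
    is_done: Callable[[], bool],
    interval: float = 30.0,
    max_iterations: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_done` up to `max_iterations` times, sleeping between checks."""
    for _ in range(max_iterations):
        if is_done():
            return True
        sleep(interval)
    # Budget exhausted: the caller decides how to report the timeout.
    return False


def broken(_prompt: str) -> float:
    raise ConnectionError("judge unavailable")


# The failing primary is skipped and the first working fallback answers.
score = judge_with_fallback(broken, [broken, lambda _prompt: 0.9], "query")
```

Under that assumption, the documented defaults of 30.0 seconds and 120 iterations allow up to 3600 seconds of sleeping between checks before the wait gives up.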