
Metric

Base class for metrics.

BaseMetric

Bases: ABC

Abstract class for metrics.

Attributes:

Name Type Description
name str

The name of the metric.

required_fields set[str]

The required fields for this metric to evaluate data.

input_type type | None

The type of the input data.

higher_is_better bool

Whether a higher score indicates better quality. Defaults to True.

strict_mode bool

If True, binarizes score to 1.0 or 0.0 before thresholding. Defaults to False.

threshold float

Pass/fail threshold in [0, 1]. Defaults to 0.5.

models list[BaseLMInvoker]

Judge models for single-judge/multi-judge evaluation.

  • Empty (default): single-judge mode using the metric's own lm_invoker.
  • One invoker: single-judge mode, as with an empty list, but runs the supplied invoker once.
  • Multiple identical invokers (homogeneous): models=[invoker] * 3 — runs the same model 3 times and aggregates with aggregation_method.
  • Multiple distinct invokers (heterogeneous): models=[invoker_a, invoker_b] — runs each model once and aggregates.
aggregation_method AggregationSelector

Strategy for aggregating judge scores (majority_vote, median, average). Defaults to majority_vote.

num_judges int

Read-only. Returns len(models) when models is non-empty, otherwise 1.

max_concurrent_judges int | None

Cap on concurrent judge tasks.

Single-judge:

metric = GEvalCompletenessMetric(models=my_invoker)

Homogeneous multi-judge (same model, 3 repetitions):

metric = GEvalCompletenessMetric(models=[my_invoker] * 3)
metric.aggregation_method = AggregationMethod.MEDIAN

Heterogeneous multi-judge (different models):

metric = GEvalCompletenessMetric(models=[invoker_a, invoker_b, invoker_c])
metric.aggregation_method = AggregationMethod.MAJORITY_VOTE
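
To illustrate what a cap such as max_concurrent_judges could mean in practice, here is a minimal sketch that limits concurrent judge calls with an asyncio semaphore. It is not the library's implementation: run_judges and fake_judge are hypothetical names, and treating None as "no cap" is an assumption of this sketch.

```python
import asyncio


async def run_judges(judges, prompt, max_concurrent=None):
    """Run every judge on the prompt, at most `max_concurrent` at a time."""
    # Assumption of this sketch: None means "no cap".
    limit = max_concurrent if max_concurrent is not None else max(len(judges), 1)
    semaphore = asyncio.Semaphore(limit)

    async def run_one(judge):
        # Only `limit` judge calls can be inside this block at once.
        async with semaphore:
            return await judge(prompt)

    # gather returns results in the same order as `judges`.
    return await asyncio.gather(*(run_one(judge) for judge in judges))


async def fake_judge(prompt):
    # Hypothetical stand-in for a call to an LM invoker.
    await asyncio.sleep(0)
    return 1.0 if "complete" in prompt else 0.0


scores = asyncio.run(
    run_judges([fake_judge] * 3, "a complete answer", max_concurrent=2)
)
```

The list of per-judge scores is what an aggregation step would then reduce to a single value.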

aggregation_method property writable

Return the configured aggregation method.

Returns:

Name Type Description
AggregationSelector AggregationSelector

The aggregation method configured for this metric.
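
The three strategies named above (majority_vote, median, average) can be sketched on plain floats as follows. This is an illustration only: the aggregate function and its string keys are hypothetical, and the library's actual tie-breaking and rounding rules may differ.

```python
from collections import Counter
from statistics import mean, median


def aggregate(scores, method="majority_vote"):
    """Reduce a list of judge scores to one score."""
    if not scores:
        raise ValueError("need at least one judge score")
    if method == "majority_vote":
        # Most frequent score wins; on a tie, the score seen first wins
        # (a choice made for this sketch).
        return Counter(scores).most_common(1)[0][0]
    if method == "median":
        return median(scores)
    if method == "average":
        return mean(scores)
    raise ValueError(f"unknown aggregation method: {method}")


vote = aggregate([1.0, 0.0, 1.0])
middle = aggregate([0.2, 0.9, 0.4], method="median")
avg = aggregate([0.0, 1.0, 0.5], method="average")
```

Majority voting suits discrete or binarized scores, while median and average are more natural for continuous scores.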

models property writable

Return configured judge model invokers.

  • Empty list (default): single-judge mode.
  • One invoker: runs that invoker once.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b, ...]: heterogeneous — distinct models.

num_judges property

Return the number of judge models.

Read-only convenience: len(models) when models is non-empty, else 1.

Returns:

Name Type Description
int int

Number of judges configured for this metric.
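
The rule for num_judges and the three judge modes can be restated as a small sketch. The function names are hypothetical, and classifying a list as homogeneous only when every entry is the same object is an assumption, not documented behavior.

```python
def num_judges(models):
    """len(models) when models is non-empty, else 1."""
    return len(models) if models else 1


def judge_mode(models):
    """Classify a models list as single, homogeneous, or heterogeneous."""
    if len(models) <= 1:
        return "single"
    first = models[0]
    # Assumption: [invoker] * N repeats the same object, so identity
    # comparison is enough to detect the homogeneous case.
    if all(model is first for model in models):
        return "homogeneous"
    return "heterogeneous"


invoker_a, invoker_b = object(), object()  # stand-ins for real invokers
```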

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

Name Type Description Default
data EvalInput

The input data to check.

required

Returns:

Name Type Description
bool bool

True if the metric can evaluate the data, False otherwise.
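
One plausible reading of this check, given the required_fields attribute, is that every required field must be present in the data. The sketch below assumes dict-like input and treats None as missing. Both are assumptions, and the field names are hypothetical.

```python
def can_evaluate(data, required_fields):
    """Return True when every required field is present and not None."""
    return all(data.get(field) is not None for field in required_fields)


required = {"query", "generated_response"}  # hypothetical field names
complete = {"query": "What is 2 + 2?", "generated_response": "4"}
incomplete = {"query": "What is 2 + 2?"}
```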

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

The data to evaluate the metric on. Can be a single item or a list for batch processing.

required

Returns:

Type Description
MetricResult | list[MetricResult]

The evaluation result(s). Returns a list if the input is a list.
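
The single-item versus batch behavior can be sketched as a dispatch on the input type. The class below is a toy, not BaseMetric itself: LengthMetric, its scoring rule, and the generated_response field are hypothetical, and the real base class may batch differently.

```python
import asyncio


class LengthMetric:
    """Toy metric: score grows with response length, capped at 1.0."""

    async def _evaluate(self, item):
        # Per-item scoring hook that a subclass would implement.
        return {"score": min(len(item["generated_response"]) / 10, 1.0)}

    async def evaluate(self, data):
        # A list in gives a list out; a single item gives a single result.
        if isinstance(data, list):
            results = await asyncio.gather(*(self._evaluate(d) for d in data))
            return list(results)
        return await self._evaluate(data)


metric = LengthMetric()
single = asyncio.run(metric.evaluate({"generated_response": "hello"}))
batch = asyncio.run(
    metric.evaluate(
        [{"generated_response": "hi"}, {"generated_response": "a long answer"}]
    )
)
```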

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

The input spec.
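
As an illustration of get_input_fields and get_input_spec, the sketch below derives both from a dataclass used as input_type. That input_type is a dataclass, and the keys of each spec entry (name, type, required), are assumptions. QAInput and ExampleMetric are hypothetical.

```python
from dataclasses import MISSING, dataclass, fields


@dataclass
class QAInput:
    query: str
    generated_response: str
    expected_response: str = ""


class ExampleMetric:
    input_type = QAInput

    @classmethod
    def get_input_fields(cls):
        if cls.input_type is None:
            return None
        return [f.name for f in fields(cls.input_type)]

    @classmethod
    def get_input_spec(cls):
        if cls.input_type is None:
            return None
        return [
            {
                "name": f.name,
                "type": getattr(f.type, "__name__", str(f.type)),
                # A field is required when it declares no default.
                "required": f.default is MISSING and f.default_factory is MISSING,
            }
            for f in fields(cls.input_type)
        ]


class UntypedMetric(ExampleMetric):
    input_type = None  # both classmethods return None in this case
```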

is_success(score)

Determine if the score indicates success based on threshold and polarity.

Parameters:

Name Type Description Default
score float

The score to evaluate.

required

Returns:

Name Type Description
bool bool

True if the score meets the success criteria, False otherwise.
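
A minimal sketch of a threshold check that respects polarity, using the documented defaults (threshold 0.5, higher_is_better True). Inclusive comparisons are an assumption. strict_mode is left out because the binarization rule is not specified here.

```python
def is_success(score, threshold=0.5, higher_is_better=True):
    """Pass when the score is on the good side of the threshold."""
    if higher_is_better:
        return score >= threshold  # inclusive comparison is an assumption
    return score <= threshold


passes_quality = is_success(0.7)
passes_error_rate = is_success(0.7, higher_is_better=False)
```

With higher_is_better=False, as for an error-rate style metric, the same score of 0.7 fails the default threshold.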

MultiOutputBaseMetric

Bases: BaseMetric, ABC

Base class for metrics that emit one result dict per sub-metric.

Used for retrieval metrics like pytrec_metric and top_k_accuracy that return a mapping of sub-metric names to individual result dictionaries, rather than a single MetricScore.
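
To make the output shape concrete, the sketch below computes a toy top-k accuracy and returns one result dict per sub-metric. The key names (top_1_accuracy, score) and the function signature are hypothetical, and the library's actual output fields may differ.

```python
def top_k_accuracy(retrieved_ids, relevant_ids, ks=(1, 3)):
    """Return one result dict per k, keyed by sub-metric name."""
    relevant = set(relevant_ids)
    results = {}
    for k in ks:
        # A hit means at least one relevant document is in the top k.
        hit = any(doc_id in relevant for doc_id in retrieved_ids[:k])
        results[f"top_{k}_accuracy"] = {"score": 1.0 if hit else 0.0}
    return results


output = top_k_accuracy(["d7", "d2", "d9"], ["d2"])
```

A single-score metric would have to collapse these entries into one number, which is what the multi-output path avoids.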

evaluate(data) async

Evaluate using multi-output path, bypassing MetricScore conversion.

Parameters:

Name Type Description Default
data EvalInput | list[EvalInput]

The data to evaluate the metric on. Can be a single item or a list for batch processing.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

The evaluation result(s). Returns a list if the input is a list.

copy_invoker_with_schema(model, schema)

Return an invoker with a response schema applied when provided.

Parameters:

Name Type Description Default
model BaseLMInvoker

Invoker to use as the schema source.

required
schema Any | None

Response schema to apply to the copied invoker.

required

Returns:

Name Type Description
BaseLMInvoker BaseLMInvoker

Original invoker when schema is None; otherwise copied invoker with the response schema applied.
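
The documented contract (same object back when schema is None, otherwise a copy with the schema applied) can be sketched with a stand-in invoker. FakeInvoker and its response_schema attribute are hypothetical. How a real BaseLMInvoker stores its schema is not stated here.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeInvoker:
    model_name: str
    response_schema: Any = None  # hypothetical attribute


def copy_invoker_with_schema(model, schema):
    if schema is None:
        return model  # nothing to apply, return the original object
    clone = copy.copy(model)  # shallow copy leaves the original untouched
    clone.response_schema = schema
    return clone


base = FakeInvoker("judge-model")
unchanged = copy_invoker_with_schema(base, None)
with_schema = copy_invoker_with_schema(base, {"type": "object"})
```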

resolve_primary_invoker(models)

Resolve the invoker used to initialize single-model integrations.

Parameters:

Name Type Description Default
models BaseLMInvoker | list[BaseLMInvoker] | None

Single invoker, list of invokers, or None. Lists use the first invoker; empty lists and None use the default invoker.

required

Returns:

Name Type Description
BaseLMInvoker BaseLMInvoker

Invoker used for evaluator initialization.
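
The resolution rule described above can be sketched as follows. The documented function takes only models. The default_invoker parameter is added here so the sketch is self-contained, and the string invokers are stand-ins.

```python
def resolve_primary_invoker(models, default_invoker):
    """Pick the invoker used to initialize a single-model integration."""
    if models is None:
        return default_invoker
    if isinstance(models, list):
        # Lists use the first invoker; empty lists use the default.
        return models[0] if models else default_invoker
    return models  # a single invoker is used as-is


default = "default-invoker"  # stand-ins for real invoker objects
first, second = "invoker-a", "invoker-b"
```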