Metric
Base class for metrics.
BaseMetric
Bases: ABC
Abstract class for metrics.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_fields | set[str] | The required fields for this metric to evaluate data. |
| input_type | type \| None | The type of the input data. |
| higher_is_better | bool | Whether a higher score indicates better quality. Defaults to True. |
| strict_mode | bool | If True, binarizes score to 1.0 or 0.0 before thresholding. Defaults to False. |
| threshold | float | Pass/fail threshold in [0, 1]. Defaults to 0.5. |
| models | list[BaseLMInvoker] | Judge models for single-judge/multi-judge evaluation. |
| aggregation_method | AggregationSelector | Strategy for aggregating judge scores (majority_vote, median, average). Defaults to majority_vote. |
| num_judges | int | Read-only. Returns len(models) when models is non-empty, else 1. |
| max_concurrent_judges | int \| None | Cap on concurrent judge tasks. |
Single-judge::

    metric = GEvalCompletenessMetric(models=my_invoker)

Homogeneous multi-judge (same model, 3 repetitions)::

    metric = GEvalCompletenessMetric(models=[my_invoker] * 3)
    metric.aggregation_method = AggregationMethod.MEDIAN

Heterogeneous multi-judge (different models)::

    metric = GEvalCompletenessMetric(models=[invoker_a, invoker_b, invoker_c])
    metric.aggregation_method = AggregationMethod.MAJORITY_VOTE
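The max_concurrent_judges attribute caps how many judge tasks run at once. The sketch below shows one common way to enforce such a cap with asyncio.Semaphore. It is an illustration only: run_judges and fake_judge are hypothetical names, judges are modelled as plain async callables rather than BaseLMInvoker instances, and treating None as "no cap" is an assumption.

```python
import asyncio


async def run_judges(judges, prompt, max_concurrent_judges=None):
    """Call every judge on the prompt, at most max_concurrent_judges at a time."""
    # Assumption: None means "no cap", so every judge may run at once.
    limit = max_concurrent_judges or max(len(judges), 1)
    semaphore = asyncio.Semaphore(limit)

    async def call(judge):
        # Each task waits here until a slot is free.
        async with semaphore:
            return await judge(prompt)

    # gather returns results in the same order as the judges list.
    return await asyncio.gather(*(call(judge) for judge in judges))


async def fake_judge(prompt):
    """Hypothetical stand-in for a judge model call."""
    await asyncio.sleep(0)
    return 1.0 if "complete" in prompt else 0.0


scores = asyncio.run(
    run_judges([fake_judge] * 3, "a complete answer", max_concurrent_judges=2)
)
```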
aggregation_method
property
writable
Return the configured aggregation method.
Returns:
| Name | Type | Description |
|---|---|---|
| AggregationSelector | AggregationSelector | The aggregation method configured for this metric. |
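To make the three named strategies concrete, here is a minimal sketch of how majority_vote, median, and average could combine a list of judge scores. The aggregate function and its string method names are hypothetical stand-ins for AggregationSelector, and the tie-breaking rule for majority vote (first score seen wins) is an assumption, not documented behavior.

```python
from collections import Counter
from statistics import mean, median


def aggregate(scores, method="majority_vote"):
    """Combine per-judge scores into a single score."""
    if not scores:
        raise ValueError("at least one judge score is required")
    if method == "majority_vote":
        # Most frequent score wins; on a tie, the score seen first wins (assumption).
        return Counter(scores).most_common(1)[0][0]
    if method == "median":
        return median(scores)
    if method == "average":
        return mean(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```

Majority vote suits binary or categorical scores, while median and average are more natural for continuous scores.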
models
property
writable
Return configured judge model invokers.
- Empty list (default): single-judge mode.
- One invoker: runs that invoker once.
- [invoker] * N: homogeneous, the same model run N times.
- [invoker_a, invoker_b, ...]: heterogeneous, distinct models.
num_judges
property
Return the number of judge models.
Read-only convenience: len(models) when models is non-empty, else 1.
Returns:
| Name | Type | Description |
|---|---|---|
| int | int | Number of judges configured for this metric. |
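The rule stated above (len(models) when models is non-empty, else 1) can be sketched as a read-only property. JudgeConfig is a hypothetical stand-in class used only for illustration; it is not part of the library.

```python
class JudgeConfig:
    """Hypothetical holder for a list of judge invokers."""

    def __init__(self, models=None):
        self.models = list(models or [])

    @property
    def num_judges(self):
        # An empty list means single-judge mode, so it still counts as one judge.
        return len(self.models) if self.models else 1
```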
can_evaluate(data)
Check if this metric can evaluate the given data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | EvalInput | The input data to check. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the metric can evaluate the data, False otherwise. |
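One plausible reading of this check is that every name in required_fields must be present in the data. The sketch below assumes the input behaves like a dict and that a None value counts as missing; both are assumptions, and can_evaluate_sketch is a hypothetical free function rather than the real method.

```python
def can_evaluate_sketch(data, required_fields):
    """Return True when every required field is present and not None."""
    return all(data.get(field) is not None for field in required_fields)


required = {"query", "generated_response"}  # hypothetical field names
complete = {"query": "What is X?", "generated_response": "X is ..."}
partial = {"query": "What is X?"}
```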
evaluate(data)
async
Evaluate the metric on the given dataset (single item or batch).
Automatically handles batch processing by default. Subclasses can override
_evaluate to accept lists for optimized batch processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | EvalInput \| list[EvalInput] | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |
Returns:
| Type | Description |
|---|---|
| MetricResult \| list[MetricResult] | The evaluation result(s). Returns a list if input is a list. |
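The single-item versus batch behavior can be pictured as a small dispatch in evaluate. The sketch below is an assumption about the shape of that logic, not the library's code: SketchMetric is hypothetical, results are plain dicts rather than MetricResult objects, and running batch items concurrently with asyncio.gather is one possible choice.

```python
import asyncio


class SketchMetric:
    """Hypothetical metric that scores whether a response is non-empty."""

    async def _evaluate(self, item):
        # Toy scoring rule used only for this illustration.
        return {"score": 1.0 if item.get("response") else 0.0}

    async def evaluate(self, data):
        if isinstance(data, list):
            # Batch input: evaluate each item and return results in input order.
            return list(await asyncio.gather(*(self._evaluate(i) for i in data)))
        # Single input: return a single result.
        return await self._evaluate(data)


metric = SketchMetric()
single = asyncio.run(metric.evaluate({"response": "hello"}))
batch = asyncio.run(metric.evaluate([{"response": "hello"}, {"response": ""}]))
```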
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
| list[str] \| None | The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] \| None | The input spec. |
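As a rough illustration of get_input_fields() and get_input_spec(), the sketch below derives field names and a simple spec from a TypedDict. The kind of class used for input_type, the QAInput example, and the name/type keys of each spec entry are all assumptions; the source only states that both methods return None when input_type is not provided.

```python
from typing import TypedDict, get_type_hints


class QAInput(TypedDict):
    """Hypothetical input type with two declared fields."""

    query: str
    response: str


def input_fields(input_type):
    """Return declared field names, or None when no input type is set."""
    if input_type is None:
        return None
    return list(get_type_hints(input_type))


def input_spec(input_type):
    """Return one dict per declared field, or None when no input type is set."""
    if input_type is None:
        return None
    hints = get_type_hints(input_type)
    return [
        {"name": name, "type": getattr(tp, "__name__", str(tp))}
        for name, tp in hints.items()
    ]
```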
is_success(score)
Determine if the score indicates success based on threshold and polarity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| score | float | The score to evaluate. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the score meets the success criteria, False otherwise. |
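The interaction of threshold and polarity can be sketched as follows. This is a minimal illustration that assumes inclusive comparisons (a score equal to the threshold passes). It leaves out strict_mode, since the exact binarization rule is not specified here. is_success_sketch is a hypothetical free function, not the real method.

```python
def is_success_sketch(score, threshold=0.5, higher_is_better=True):
    """Return True when the score is on the passing side of the threshold."""
    if higher_is_better:
        # Higher scores are better: pass at or above the threshold (assumption).
        return score >= threshold
    # Lower scores are better: pass at or below the threshold (assumption).
    return score <= threshold
```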
MultiOutputBaseMetric
Bases: BaseMetric, ABC
Base class for metrics that emit one result dict per sub-metric.
Used for retrieval metrics like pytrec_metric and top_k_accuracy that return a mapping of sub-metric names to individual result dictionaries, rather than a single MetricScore.
evaluate(data)
async
Evaluate using multi-output path, bypassing MetricScore conversion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | EvalInput \| list[EvalInput] | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |
Returns:
| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | The evaluation result(s). Returns a list if input is a list. |
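To illustrate what "one result dict per sub-metric" might look like, the sketch below computes a toy top-k accuracy for several values of k and returns a mapping from sub-metric name to a result dict. The key names (top_1_accuracy, top_3_accuracy), the {"score": ...} shape, and the function itself are hypothetical; the real MetricOutput structure may differ.

```python
def top_k_accuracy_sketch(ranked_ids, relevant_ids, ks=(1, 3)):
    """Return one result dict per k: 1.0 if any relevant id is in the top k."""
    results = {}
    for k in ks:
        hit = any(doc_id in relevant_ids for doc_id in ranked_ids[:k])
        results[f"top_{k}_accuracy"] = {"score": 1.0 if hit else 0.0}
    return results


output = top_k_accuracy_sketch(["d7", "d2", "d9"], {"d2"})
```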
copy_invoker_with_schema(model, schema)
Return an invoker with a response schema applied when provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | BaseLMInvoker | Invoker to use as the schema source. | required |
| schema | Any \| None | Response schema to apply to the copied invoker. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseLMInvoker | BaseLMInvoker | Original invoker when schema is None; otherwise copied invoker with the response schema applied. |
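The documented contract (return the original invoker when schema is None, otherwise a copy with the schema applied) can be sketched with a shallow copy. FakeInvoker and its response_schema attribute are hypothetical; how a real BaseLMInvoker stores its schema is not stated here.

```python
import copy
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeInvoker:
    """Hypothetical stand-in for BaseLMInvoker."""

    model_name: str
    response_schema: Optional[Any] = None


def copy_invoker_with_schema_sketch(model, schema):
    """Return model unchanged when schema is None, else a copy carrying schema."""
    if schema is None:
        return model
    clone = copy.copy(model)  # shallow copy keeps the original untouched
    clone.response_schema = schema
    return clone


base = FakeInvoker("judge-a")
with_schema = copy_invoker_with_schema_sketch(base, {"type": "object"})
```

Copying rather than mutating keeps the caller's invoker reusable across metrics that expect different response schemas.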
resolve_primary_invoker(models)
Resolve the invoker used to initialize single-model integrations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| models | BaseLMInvoker \| list[BaseLMInvoker] \| None | Single invoker, list of invokers, or None. Lists use the first invoker; empty lists and None use the default invoker. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseLMInvoker | BaseLMInvoker | Invoker used for evaluator initialization. |
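The resolution rules in the parameter description translate into a short function. In this sketch the default invoker is passed in explicitly as default_invoker, which is an assumption made to keep the example self-contained; the documented function takes only models and obtains its default elsewhere.

```python
def resolve_primary_invoker_sketch(models, default_invoker):
    """Pick the invoker used to initialize a single-model integration."""
    if models is None:
        return default_invoker
    if isinstance(models, list):
        # Non-empty lists use their first invoker; empty lists fall back.
        return models[0] if models else default_invoker
    # A single invoker is used as-is.
    return models
```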