
Metric

Base class for metrics.

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `name` | `str` | The name of the metric. |
| `required_fields` | `set[str]` | The fields this metric requires in order to evaluate data. |
| `input_type` | `type \| None` | The type of the input data. |
| `higher_is_better` | `bool` | Whether a higher score indicates better quality. Defaults to `True`. |
| `strict_mode` | `bool` | If `True`, binarizes the score to 1.0 or 0.0 before thresholding. Defaults to `False`. |
| `threshold` | `float` | Pass/fail threshold in [0, 1]. Defaults to 0.5. |

Example

Adding custom prompts to existing evaluator metrics:

```python
import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator


async def main():
    """Run an evaluation with custom prompts added to existing metrics."""
    # Load your dataset (must have actual_output pre-populated)
    dataset = load_simple_qa_dataset()

    # Create evaluator with default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure the metric has a name attribute
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with the custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        evaluators=[evaluator],
    )


if __name__ == "__main__":
    asyncio.run(main())
```

aggregation_method property writable

Return the configured aggregation method.

num_judges property writable

Return the configured number of judges.

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `EvalInput` | The input data to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `bool` | `bool` | `True` if the metric can evaluate the data, `False` otherwise. |
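The check can be pictured as a required-fields test. This is a hypothetical re-implementation, assuming `EvalInput` behaves like a mapping; the actual logic in `gllm_evals` may differ.

```python
# Hypothetical sketch of the field check can_evaluate performs.
def can_evaluate(required_fields: set, data: dict) -> bool:
    # Every required field must be present and non-None in the input.
    return all(data.get(field) is not None for field in required_fields)
```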

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `EvalInput \| list[EvalInput]` | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |

Returns:

| Type | Description |
| --- | --- |
| `MetricOutput \| list[MetricOutput]` | A dictionary mapping each namespace to its scores. Returns a list if the input is a list. |
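The single-vs-batch dispatch can be sketched as follows. This is an assumption about the default path (each item scored independently and concurrently); the name `evaluate_metric` and the `score_one` callback are illustrative, and an optimized `_evaluate` override could batch differently.

```python
import asyncio


# Minimal sketch of the single-vs-batch dispatch described above.
async def evaluate_metric(data, score_one):
    """Return one output for a single item, or a list for a list of items."""
    if isinstance(data, list):
        # Default batch path: score every item concurrently.
        return await asyncio.gather(*(score_one(item) for item in data))
    return await score_one(data)
```

Because the return shape follows the input shape, callers can pass either form without branching on the result type themselves.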

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

| Type | Description |
| --- | --- |
| `list[str] \| None` | The input field names. |

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]] \| None` | The input spec. |

is_success(score)

Determine if the score indicates success based on threshold and polarity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `score` | `float` | The score to evaluate. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `bool` | `bool` | `True` if the score meets the success criteria, `False` otherwise. |
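The threshold-and-polarity rules above can be sketched as a standalone function. Note the documentation does not spell out the `strict_mode` binarization cutoff, so the rule below (only a perfect 1.0 survives binarization) is an assumption, as is passing the attributes as parameters rather than reading them from the instance.

```python
# Hypothetical sketch of the success check described above.
def is_success(score: float, threshold: float = 0.5,
               higher_is_better: bool = True, strict_mode: bool = False) -> bool:
    if strict_mode:
        # Assumed binarization rule: a perfect score passes, anything else fails.
        score = 1.0 if score == 1.0 else 0.0
    # Polarity: high scores pass for higher_is_better metrics, low scores
    # pass for lower-is-better metrics (e.g. error-rate style scores).
    return score >= threshold if higher_is_better else score <= threshold
```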