# Metric

Base class for metrics.

## BaseMetric

Bases: `ABC`

Abstract class for metrics. This class defines the interface for all metrics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the metric. |
| `required_fields` | `set[str]` | The required fields for this metric to evaluate data. |
| `input_type` | `type \| None` | The type of the input data. |
| `higher_is_better` | `bool` | Whether a higher score indicates better quality. Defaults to `True`. |
| `strict_mode` | `bool` | If `True`, binarizes the score to 1.0 or 0.0 before thresholding. Defaults to `False`. |
| `threshold` | `float` | Pass/fail threshold in [0, 1]. Defaults to 0.5. |
### Example

Adding custom prompts to existing evaluator metrics:

```python
import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator


async def main():
    # Load your dataset (must have actual_output pre-populated)
    dataset = load_simple_qa_dataset()

    # Create evaluator with default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure metric has a name attribute
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        evaluators=[evaluator],  # Custom prompts applied to metrics
    )


asyncio.run(main())
```
### aggregation_method

*property, writable*

Return the configured aggregation method.

### num_judges

*property, writable*

Return the configured number of judges.
### can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput` | The input data to check. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the metric can evaluate the data, False otherwise. |
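To illustrate, a minimal sketch of what such a check might look like, assuming `EvalInput` behaves like a dict-like mapping and that `required_fields` is compared against its populated keys (both are assumptions; the real implementation is not shown in this reference):

```python
# Hypothetical sketch of a required-fields check; `EvalInput` is modeled
# here as a plain dict, which may differ from the library's actual type.
class ExampleMetric:
    required_fields = {"input", "actual_output"}

    def can_evaluate(self, data: dict) -> bool:
        # The metric can run only if every required field is present and non-None.
        return all(data.get(field) is not None for field in self.required_fields)


metric = ExampleMetric()
print(metric.can_evaluate({"input": "Q?", "actual_output": "A."}))  # True
print(metric.can_evaluate({"input": "Q?"}))  # False
```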
### evaluate(data)

*async*

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override `_evaluate` to accept lists for optimized batch processing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput \| list[EvalInput]` | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |

Returns:

| Type | Description |
|---|---|
| `MetricOutput \| list[MetricOutput]` | A dictionary where the keys are the namespaces and the values are the scores. Returns a list if the input is a list. |
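The single-item vs. batch dispatch described above can be sketched as follows. This is a simplified illustration only: the `_evaluate` signature, the `SketchMetric` class, and the `MetricOutput` alias are assumptions, not the library's actual definitions.

```python
import asyncio
from typing import Any

# Assumed shape: a mapping from namespace to score.
MetricOutput = dict[str, Any]


class SketchMetric:
    name = "sketch"

    async def _evaluate(self, data: dict) -> MetricOutput:
        # A real subclass would implement the actual scoring here.
        return {self.name: 1.0}

    async def evaluate(self, data):
        # Batch input: fan out to _evaluate per item by default.
        if isinstance(data, list):
            return await asyncio.gather(*(self._evaluate(item) for item in data))
        # Single item: score it directly.
        return await self._evaluate(data)


results = asyncio.run(SketchMetric().evaluate([{"q": "a"}, {"q": "b"}]))
print(results)  # [{'sketch': 1.0}, {'sketch': 1.0}]
```

A subclass that can score a whole batch in one model call would override `_evaluate` to accept a list, avoiding the per-item fan-out.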
### get_input_fields()

*classmethod*

Return declared input field names if `input_type` is provided; otherwise None.

Returns:

| Type | Description |
|---|---|
| `list[str] \| None` | The input fields. |
### get_input_spec()

*classmethod*

Return a structured spec for input fields if `input_type` is provided; otherwise None.

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]] \| None` | The input spec. |
### is_success(score)

Determine if the score indicates success based on threshold and polarity.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `score` | `float` | The score to evaluate. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the score meets the success criteria, False otherwise. |
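A free-function sketch of the threshold-and-polarity logic described above, using the attributes documented earlier (`threshold`, `higher_is_better`, `strict_mode`). How `strict_mode` binarizes the score is not fully specified in this reference; the sketch assumes it binarizes against the same threshold.

```python
# Hypothetical sketch of the success check; the real method reads these
# values from instance attributes rather than taking them as arguments.
def is_success(score: float, threshold: float = 0.5,
               higher_is_better: bool = True, strict_mode: bool = False) -> bool:
    if strict_mode:
        # Assumed: binarize to 1.0/0.0 against the same threshold before comparing.
        score = 1.0 if score >= threshold else 0.0
    # Polarity: a higher score passes when higher_is_better, otherwise lower does.
    return score >= threshold if higher_is_better else score <= threshold


print(is_success(0.7))                          # True
print(is_success(0.7, higher_is_better=False))  # False
```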