Evaluate

Evaluate Module.

This module provides a convenience function for evaluating precomputed model outputs.

The evaluation pipeline requires input data to already contain model outputs (e.g., actual_output). It does not perform live model inference.
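As a minimal sketch of what precomputed rows might look like (only `actual_output` is named in this module; the other field names below, and whether plain dicts are accepted or must be wrapped in `EvalInput`, are assumptions for illustration):

```python
# Hypothetical sketch: each row already carries the model's output in `actual_output`.
# Field names other than `actual_output` are assumptions, not the library's schema.
rows = [
    {
        "input": "What is the capital of France?",
        "actual_output": "Paris is the capital of France.",  # precomputed model output
        "expected_output": "Paris",
    },
]
```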

evaluate(data, evaluators, experiment_tracker=None, batch_size=10, allow_batch_evaluation=False, summary_evaluators=None, **kwargs) async

Evaluate precomputed model outputs with the given evaluators.

Input data must already contain model outputs (e.g. actual_output). This function does not perform live model inference.

Parameters:

- `data` (`str | BaseDataset | list[EvalInput]`, required): The data to evaluate. When a list is given, `LLMTestData` rows are normalized and wrapped in a `DictDataset` before being passed to the Runner.
- `evaluators` (`list[BaseEvaluator | BaseMetric]`, required): The evaluators to use.
- `experiment_tracker` (`BaseExperimentTracker | None`, default `None`): The experiment tracker to use.
- `batch_size` (`int`, default `10`): The batch size to use for evaluation (runner-level chunking for memory management).
- `allow_batch_evaluation` (`bool`, default `False`): Enable batch-processing mode for LLM API calls. When `True`, the runner passes entire chunks to evaluators for batch processing.
- `summary_evaluators` (`list[SummaryEvaluatorCallable] | None`, default `None`): Custom summary evaluators that compute batch-level statistics. Each callable receives `(evaluation_results, data)` and returns a dict of summary metrics; see the sketch after this list.
- `**kwargs` (`Any`, default `{}`): Additional configuration parameters.
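A minimal sketch of a custom summary evaluator, assuming the stated `(evaluation_results, data)` call signature; the `score` attribute read from each result is an assumption and may differ from the library's actual result objects:

```python
def pass_rate_summary(evaluation_results, data) -> dict:
    """Hypothetical summary evaluator: fraction of results scoring at least 0.5.

    The `.score` attribute is assumed for illustration; adapt it to the actual
    shape of the evaluation results in your installation.
    """
    scores = [getattr(result, "score", 0.0) for result in evaluation_results]
    if not scores:
        return {"pass_rate": 0.0}
    return {"pass_rate": sum(score >= 0.5 for score in scores) / len(scores)}
```

Such a callable would be passed as `summary_evaluators=[pass_rate_summary]`.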

Returns:

- `EvaluationResult` (`EvaluationResult`): Structured result containing evaluation results and experiment URLs/paths.
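A hedged end-to-end sketch of calling `evaluate`. The import paths and the evaluator class are placeholders (assumptions): substitute a real `BaseEvaluator` or `BaseMetric` implementation and the actual module path from your installation.

```python
import asyncio

# Hypothetical import paths, shown only to make the sketch self-contained.
from my_package import evaluate          # assumed location of evaluate()
from my_project.evaluators import MyEvaluator  # placeholder evaluator class


async def main() -> None:
    result = await evaluate(
        data=rows,                        # precomputed rows, as sketched above
        evaluators=[MyEvaluator()],
        batch_size=10,                    # runner-level chunking
        allow_batch_evaluation=False,     # per-row evaluator calls
        summary_evaluators=[pass_rate_summary],
    )
    # EvaluationResult: structured results plus experiment URLs/paths.
    print(result)


asyncio.run(main())
```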