Evaluate
Evaluate Module.
This module provides a convenience function for evaluating precomputed model outputs.
The evaluation pipeline requires input data to already contain model outputs
(e.g., actual_output); it does not perform live model inference.
async evaluate(data, evaluators, experiment_tracker=None, batch_size=10, allow_batch_evaluation=False, summary_evaluators=None, **kwargs)
Evaluate precomputed model outputs.
Input data must already contain model outputs (e.g. actual_output); this function does not perform live model inference.
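To make the precomputed-output requirement concrete, the rows below already carry the model's answer. Only `actual_output` is named in the docs; the `input` and `expected_output` field names are assumptions for this sketch:

```python
# Hypothetical input rows for evaluate(); only "actual_output" is named
# above -- "input" and "expected_output" are assumed field names.
rows = [
    {"input": "What is 2 + 2?", "actual_output": "4", "expected_output": "4"},
    {"input": "Capital of France?", "actual_output": "Paris", "expected_output": "Paris"},
]

# Every row must already contain the model's output; evaluate() performs
# no live inference, so a row missing "actual_output" is invalid input.
assert all("actual_output" in row for row in rows)
```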
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `str \| BaseDataset \| list[EvalInput]` | The data to evaluate. When a list is given, `LLMTestData` rows are normalized and wrapped in a `DictDataset` before being passed to the Runner. | *required* |
| `evaluators` | `list[BaseEvaluator \| BaseMetric]` | The evaluators to use. | *required* |
| `experiment_tracker` | `BaseExperimentTracker \| None` | The experiment tracker to use. | `None` |
| `batch_size` | `int` | The batch size used for evaluation (runner-level chunking for memory management). | `10` |
| `allow_batch_evaluation` | `bool` | Enable batch-processing mode for LLM API calls. When `True`, the runner passes entire chunks to evaluators for batch processing. | `False` |
| `summary_evaluators` | `list[SummaryEvaluatorCallable] \| None` | Custom summary evaluators that compute batch-level statistics. Each callable receives `(evaluation_results, data)` and returns a dict of summary metrics. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
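To make the `summary_evaluators` contract concrete, here is a minimal sketch of one such callable. Only the `(evaluation_results, data)` signature comes from the parameter description; treating each result as a dict with a boolean `passed` key is an assumption:

```python
# Hypothetical summary evaluator. The (evaluation_results, data) signature
# is documented above; the per-result dict shape with a boolean "passed"
# key is an assumption for this sketch.
def pass_rate_summary(evaluation_results, data):
    total = len(evaluation_results)
    passed = sum(1 for r in evaluation_results if r.get("passed"))
    # Return a flat dict of batch-level summary metrics.
    return {"total": total, "pass_rate": passed / total if total else 0.0}

# It would then be supplied as: summary_evaluators=[pass_rate_summary]
```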
Returns:

| Name | Type | Description |
|---|---|---|
| `EvaluationResult` | `EvaluationResult` | Structured result containing evaluation results and experiment URLs/paths. |
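Because `evaluate` is async, it must be awaited (or driven with `asyncio.run`). The snippet below uses a local stand-in for `evaluate` purely to show the call shape; the real function is imported from this module and returns an `EvaluationResult`, not a dict:

```python
import asyncio

# Local stand-in so this sketch runs on its own; the real evaluate() is the
# async function documented above with a richer signature and return type.
async def evaluate(data, evaluators, batch_size=10, **kwargs):
    return {"results": [ev(row) for row in data for ev in evaluators]}

def exact_match(row):
    # Hypothetical evaluator comparing the precomputed output to the expected one.
    return row["actual_output"] == row["expected_output"]

rows = [{"actual_output": "4", "expected_output": "4"}]
result = asyncio.run(evaluate(rows, [exact_match], batch_size=2))
assert result["results"] == [True]
```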