Evaluate Suites
Evaluate Suites Module.
This module provides a helper function for evaluating different data partitions (suites) with different evaluator sets under a shared run_id and experiment tracker.
EvalSuite
Bases: BaseModel
Evaluation suite defining a data partition and its evaluators.
Attributes:
| Name | Type | Description |
|---|---|---|
| `data` | `str \| BaseDataset \| list[EvalInput]` | Input data as a string path, BaseDataset, or list of evaluation inputs. |
| `evaluators` | `list[BaseEvaluator]` | List of evaluators to apply to this suite's data. |
| `name` | `str \| None` | Optional name for the suite. If not provided, auto-generated as suite_0, suite_1, etc. |
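The fallback naming described for `name` can be sketched as follows. This is a minimal illustration, not the library's code: `SuiteStub` and `resolve_suite_names` are hypothetical stand-ins, and the sketch assumes the number in `suite_<n>` is the suite's position in the input list, which the reference does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuiteStub:
    """Hypothetical stand-in for EvalSuite; only the name field is modeled."""

    name: Optional[str] = None


def resolve_suite_names(suites: list[SuiteStub]) -> list[str]:
    """Return each suite's name, falling back to suite_<position> (assumed rule)."""
    return [
        suite.name if suite.name is not None else f"suite_{index}"
        for index, suite in enumerate(suites)
    ]


names = resolve_suite_names([SuiteStub(), SuiteStub(name="regression"), SuiteStub()])
print(names)
```

Naming suites explicitly avoids depending on list order when results are compared across runs.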
evaluate_suites(suites, experiment_tracker=None, batch_size=10, allow_batch_evaluation=False, run_aggregators=None, dataset_name=None, run_id=None, **kwargs)
async
Evaluate multiple suites with different evaluators under a shared run_id.
This function lets different data partitions (suites) use different evaluator sets while sharing one tracker and one run_id. Dataset names are namespaced per suite, and a warning is emitted when test case content is duplicated.
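One way to picture the per-suite namespacing and the duplicate-content warning is the sketch below. Everything in it is an assumption made for illustration: the `base/suite` name format, the helper names, and the choice to detect duplicates by hashing each test case's JSON form are not taken from the library, which may implement both behaviors differently.

```python
import hashlib
import json
import warnings


def namespaced_dataset_name(base_name: str, suite_name: str) -> str:
    """Combine the base dataset name with a suite name (assumed format)."""
    return f"{base_name}/{suite_name}"


def warn_on_duplicates(cases_by_suite: dict[str, list[dict]]) -> list[tuple[str, str]]:
    """Warn on identical test case content; return (first_suite, duplicate_suite) pairs."""
    seen: dict[str, str] = {}
    duplicates: list[tuple[str, str]] = []
    for suite_name, cases in cases_by_suite.items():
        for case in cases:
            # Sorting keys makes the digest independent of key order.
            payload = json.dumps(case, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            if digest in seen:
                warnings.warn(
                    f"Duplicate test case in {suite_name!r} "
                    f"(first seen in {seen[digest]!r})"
                )
                duplicates.append((seen[digest], suite_name))
            else:
                seen[digest] = suite_name
    return duplicates


cases = {
    "suite_0": [{"query": "a", "expected": "1"}],
    "suite_1": [{"expected": "1", "query": "a"}, {"query": "b", "expected": "2"}],
}
duplicates = warn_on_duplicates(cases)
print(namespaced_dataset_name("demo", "suite_0"), duplicates)
```

The sketch only warns and keeps both copies, matching the description that duplicates trigger a warning rather than an error.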
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `suites` | `list[EvalSuite]` | List of evaluation suites, each with data and evaluators. | required |
| `experiment_tracker` | `type[BaseExperimentTracker] \| BaseExperimentTracker \| None` | Tracker class, instance, or None for CSV default. | `None` |
| `batch_size` | `int` | Batch size for evaluation (runner-level chunking). Defaults to 10. | `10` |
| `allow_batch_evaluation` | `bool` | Enable batch processing for LLM API calls. Defaults to False. | `False` |
| `run_aggregators` | `list[RunAggregatorCallable] \| None` | Custom run aggregators. Defaults to None. | `None` |
| `dataset_name` | `str \| None` | Base dataset name. If None, auto-generated with timestamp. | `None` |
| `run_id` | `str \| None` | Shared run ID. If None, auto-generated. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
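The "runner-level chunking" controlled by `batch_size` can be illustrated with a generic chunking helper. This is a sketch of the general technique only; `chunked` is a hypothetical name, and the library's runner may split and schedule its inputs differently.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int = 10) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 25 inputs with the default batch size of 10 give batches of 10, 10, and 5.
batch_lengths = [len(batch) for batch in chunked(list(range(25)))]
print(batch_lengths)
```

Note that this chunking is separate from `allow_batch_evaluation`, which the parameter table describes as batching at the LLM API call level.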
Returns:
| Name | Type | Description |
|---|---|---|
| `SuiteExperimentResult` | `SuiteExperimentResult` | Top-level result with per-suite results and pooled aggregators. |