Evaluate Suites

Evaluate Suites Module.

This module provides a helper function for evaluating different data partitions (suites) with different evaluator sets under a shared run_id and experiment tracker.

EvalSuite

Bases: BaseModel

Evaluation suite defining a data partition and its evaluators.

Attributes:

data (str | BaseDataset | list[EvalInput]): Input data as a string path, BaseDataset, or list of evaluation inputs.

evaluators (list[BaseEvaluator]): List of evaluators to apply to this suite's data.

name (str | None): Optional name for the suite. If not provided, auto-generated as suite_0, suite_1, etc.
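The auto-naming rule for unnamed suites can be illustrated with a minimal sketch. `SuiteSketch` and `assign_default_names` are hypothetical stand-ins written for this example, not part of the library, and the sketch assumes the numeric suffix is the suite's position in the list; the library may count differently.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SuiteSketch:
    """Hypothetical stand-in for EvalSuite (not the library class)."""

    data: Any
    evaluators: list
    name: Optional[str] = None


def assign_default_names(suites: list[SuiteSketch]) -> list[str]:
    """Return one name per suite, falling back to suite_<index> when name is None.

    Assumption: the index is the suite's position in the input list.
    """
    return [
        suite.name if suite.name is not None else f"suite_{index}"
        for index, suite in enumerate(suites)
    ]


suites = [
    SuiteSketch(data="qa.csv", evaluators=[]),
    SuiteSketch(data="summaries.csv", evaluators=[], name="summarization"),
    SuiteSketch(data="chat.csv", evaluators=[]),
]
names = assign_default_names(suites)
```

Giving each suite an explicit name avoids depending on list order when results are compared across runs.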

async evaluate_suites(suites, experiment_tracker=None, batch_size=10, allow_batch_evaluation=False, run_aggregators=None, dataset_name=None, run_id=None, **kwargs)

Evaluate multiple suites with different evaluators under a shared run_id.

Allows different data partitions (suites) to use different evaluator sets while sharing one tracker and one run_id. Dataset names are namespaced per suite. A warning is emitted when duplicate test case content is detected.
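The two behaviours described above, per-suite dataset namespacing and the duplicate-content warning, might look roughly like the following sketch. Everything here is an assumption made for illustration: the `base/suite` name format, the hashing scheme, and both function names are hypothetical, and the library's actual implementation is not shown in this reference.

```python
import hashlib
import json
import warnings


def namespaced_dataset_name(base: str, suite_name: str) -> str:
    """Combine the base dataset name with the suite name.

    Assumption: a "/" separator; the library's real format may differ.
    """
    return f"{base}/{suite_name}"


def find_duplicate_cases(suites: dict[str, list[dict]]) -> list[tuple[str, str]]:
    """Return (first_suite, later_suite) pairs for rows with identical content.

    Rows are compared by a SHA-256 digest of their key-sorted JSON form,
    and a warning is emitted for every repeated row.
    """
    seen: dict[str, str] = {}
    duplicates: list[tuple[str, str]] = []
    for suite_name, rows in suites.items():
        for row in rows:
            payload = json.dumps(row, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            if digest in seen:
                duplicates.append((seen[digest], suite_name))
                warnings.warn(
                    f"Duplicate test case in {suite_name!r} "
                    f"(first seen in {seen[digest]!r})"
                )
            else:
                seen[digest] = suite_name
    return duplicates
```

Hashing a canonical serialization keeps the check independent of key order within a row, at the cost of treating rows that differ only in whitespace as distinct.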

Parameters:

suites (list[EvalSuite], required): List of evaluation suites, each with data and evaluators.

experiment_tracker (type[BaseExperimentTracker] | BaseExperimentTracker | None, default None): Tracker class, tracker instance, or None to fall back to the CSV default.

batch_size (int, default 10): Batch size for evaluation (runner-level chunking).

allow_batch_evaluation (bool, default False): Enable batch processing for LLM API calls.

run_aggregators (list[RunAggregatorCallable] | None, default None): Custom run aggregators.

dataset_name (str | None, default None): Base dataset name. If None, auto-generated with a timestamp.

run_id (str | None, default None): Shared run ID. If None, auto-generated.

**kwargs (Any, default {}): Additional configuration parameters.
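As an illustration of what "runner-level chunking" by batch_size could mean, the sketch below splits rows into groups of at most batch_size and awaits each group before starting the next. This is an assumed reading of the parameter, not the library's runner; `chunked`, `run_in_chunks`, and `fake_score` are hypothetical names, and running each chunk concurrently with `asyncio.gather` is a choice made for this example.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterator, Sequence


def chunked(rows: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of rows, each with at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


async def run_in_chunks(
    rows: Sequence[int],
    score_one: Callable[[int], Awaitable[int]],
    batch_size: int = 10,
) -> list[int]:
    """Score rows chunk by chunk, preserving the input order of results."""
    results: list[int] = []
    for chunk in chunked(rows, batch_size):
        # Rows within a chunk run concurrently; chunks run one after another.
        results.extend(await asyncio.gather(*(score_one(row) for row in chunk)))
    return results


async def fake_score(row: int) -> int:
    """Hypothetical evaluator stand-in: doubles its input."""
    await asyncio.sleep(0)
    return row * 2
```

Under this reading, batch_size bounds how many rows are in flight at once, which is separate from allow_batch_evaluation, described above as controlling batching of the LLM API calls themselves.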

Returns:

SuiteExperimentResult: Top-level result with per-suite results and pooled aggregators.