Runner

Runner module for executing batch evaluation workflows.

This module provides runner classes that orchestrate the execution of evaluation workflows, handling batch processing, async operations, and result collection. Runners coordinate between datasets, evaluators, and experiment trackers to perform comprehensive evaluations.

Components:

- BaseRunner: Abstract base class for all runners
- Runner: Main runner implementation for batch evaluation execution

BaseRunner(data, evaluators, experiment_tracker=None, batch_size=10, **kwargs)

Bases: ABC

Abstract base class for runners.

This class defines the interface that all runners implement. Input data must already contain model outputs (e.g. actual_output).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `data` | `str \| BaseDataset` | The data to evaluate. |
| `evaluators` | `list[BaseEvaluator]` | The evaluators to use. |
| `experiment_tracker` | `BaseExperimentTracker \| None` | The experiment tracker. |
| `batch_size` | `int` | The batch size to use for evaluation. |

Initialize the runner.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `str \| BaseDataset` | The data to evaluate. Must contain model outputs. | *required* |
| `evaluators` | `list[BaseEvaluator]` | The evaluators to use. | *required* |
| `experiment_tracker` | `BaseExperimentTracker \| None` | The experiment tracker. | `None` |
| `batch_size` | `int` | The batch size to use for evaluation. | `10` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

evaluate() abstractmethod async

Run the evaluators on the dataset.

The dataset is evaluated in batches of the given batch size.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `EvaluationResult` | `EvaluationResult` | Structured result containing evaluation results and experiment URLs/paths. |
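The interface and batching behavior described above can be sketched with a minimal stand-in. This is an illustrative toy, not the library's implementation: `MiniBaseRunner`, `EchoRunner`, and the `(name, fn)` evaluator tuples are all assumed names for demonstration.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class MiniBaseRunner(ABC):
    """Illustrative stand-in for BaseRunner; not the library's class."""

    def __init__(self, data, evaluators, experiment_tracker=None,
                 batch_size=10, **kwargs):
        self.data = data                # rows must already contain model outputs
        self.evaluators = evaluators    # here: (name, fn) pairs scoring one row
        self.experiment_tracker = experiment_tracker
        self.batch_size = batch_size

    @abstractmethod
    async def evaluate(self) -> list[dict[str, Any]]:
        """Evaluate the dataset in batches of `batch_size`."""


class EchoRunner(MiniBaseRunner):
    async def evaluate(self) -> list[dict[str, Any]]:
        results = []
        # Chunk the data into batches of `batch_size` (memory management).
        for start in range(0, len(self.data), self.batch_size):
            batch = self.data[start:start + self.batch_size]
            for row in batch:
                results.append({name: fn(row) for name, fn in self.evaluators})
        return results


rows = [{"actual_output": "yes"}, {"actual_output": "no"}, {"actual_output": "yes"}]
evaluators = [("is_yes", lambda row: row["actual_output"] == "yes")]
scores = asyncio.run(EchoRunner(rows, evaluators, batch_size=2).evaluate())
print(scores)  # [{'is_yes': True}, {'is_yes': False}, {'is_yes': True}]
```

The abstract `evaluate()` being async mirrors the signature above; subclasses decide how each batch is scored.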

Runner(data, evaluators, experiment_tracker=None, batch_size=10, allow_batch_evaluation=False, summary_evaluators=None, **kwargs)

Bases: BaseRunner

Runner class for evaluating datasets.

Input data must already contain model outputs (e.g. actual_output). The runner does not perform live model inference.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `data` | `str \| BaseDataset` | The data to evaluate. |
| `evaluators` | `list[BaseEvaluator]` | The evaluators to use. |
| `experiment_tracker` | `ExperimentTrackerAdapter \| type[BaseExperimentTracker] \| None` | The experiment tracker for logging evaluation results. Can be: `None` (uses `CSVExperimentTracker`, the default), a tracker class (instantiated with the provided kwargs), or a tracker instance (used directly). |
| `**kwargs` | `Any` | Additional configuration parameters. |
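The three accepted forms of `experiment_tracker` amount to a small dispatch, which can be sketched as follows. This is a hypothetical helper under stated assumptions: `resolve_tracker` is not a documented function, and `CSVExperimentTracker` here is a stand-in class, not the library's.

```python
import inspect


class CSVExperimentTracker:
    """Stand-in for the default CSV-based tracker."""

    def __init__(self, **kwargs):
        self.config = kwargs


def resolve_tracker(experiment_tracker, **kwargs):
    # None -> fall back to the default CSV-based tracker.
    if experiment_tracker is None:
        return CSVExperimentTracker(**kwargs)
    # A class -> instantiate it with the provided kwargs.
    if inspect.isclass(experiment_tracker):
        return experiment_tracker(**kwargs)
    # An instance -> use it directly.
    return experiment_tracker


default = resolve_tracker(None, path="runs.csv")
from_class = resolve_tracker(CSVExperimentTracker)
instance = CSVExperimentTracker()
assert resolve_tracker(instance) is instance
print(type(default).__name__, default.config)  # CSVExperimentTracker {'path': 'runs.csv'}
```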

Initialize the Runner.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `str \| BaseDataset` | The data to evaluate. Must contain model outputs. | *required* |
| `evaluators` | `list[BaseEvaluator \| BaseMetric]` | The evaluators to use. | *required* |
| `experiment_tracker` | `BaseExperimentTracker \| type[BaseExperimentTracker] \| None` | The experiment tracker for logging evaluation results. Can be: `None` (uses `CSVExperimentTracker`, the default), a tracker class (instantiated with the provided kwargs), or a tracker instance (used directly). | `None` |
| `batch_size` | `int` | The batch size to use for evaluation (runner-level chunking for memory management). | `10` |
| `allow_batch_evaluation` | `bool` | Enable batch processing mode for LLM API calls. When `True`, the runner passes entire chunks to evaluators for batch processing. | `False` |
| `summary_evaluators` | `list[SummaryEvaluatorCallable] \| None` | Custom summary evaluators to compute batch-level statistics. Each callable receives `(evaluation_results, data)` and returns a dict of summary metrics. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
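A summary evaluator, per the contract above, takes `(evaluation_results, data)` and returns a dict of batch-level metrics. A minimal sketch, assuming each per-row result exposes a `score` field (that field name is an assumption, not documented here):

```python
def mean_score(evaluation_results, data):
    """Batch-level summary: average per-row score plus row count.

    Assumes each result dict has a numeric "score" key (hypothetical field).
    """
    scores = [r["score"] for r in evaluation_results]
    return {
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "num_rows": len(data),
    }


results = [{"score": 0.5}, {"score": 1.0}, {"score": 0.0}]
rows = [{"actual_output": "a"}, {"actual_output": "b"}, {"actual_output": "c"}]
print(mean_score(results, rows))  # {'mean_score': 0.5, 'num_rows': 3}
```

A list of such callables would then be passed as `summary_evaluators=[mean_score]`.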

evaluate() async

Run the evaluators on the dataset.

The dataset is evaluated using the following flow:

1. Convert dataset to standard format
2. Prepare dataset for evaluation (tracker may convert to its own format)
3. Prepare each row for evaluation (validation + dataset-specific prep)
4. Evaluate batch
5. Prepare rows for tracking
6. Log to tracker
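The flow above can be sketched as a plain loop. All names here are illustrative stand-ins for the library's internals, and the tracker-side preparation (steps 2 and 5) is folded into a single logging callback for brevity.

```python
def run_flow(raw_rows, evaluators, tracker_log, batch_size=2):
    # 1. Convert dataset to a standard format (a list of dicts here).
    rows = [dict(r) for r in raw_rows]
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # 3. Prepare each row: validate that the model output is present.
        for row in batch:
            if "actual_output" not in row:
                raise ValueError("rows must already contain model outputs")
        # 4. Evaluate the batch with each (name, fn) evaluator pair.
        batch_results = [
            {name: fn(row) for name, fn in evaluators} for row in batch
        ]
        # 5-6. Hand the batch results to the tracker for logging.
        tracker_log(batch_results)
        results.extend(batch_results)
    return results


logged = []
out = run_flow(
    [{"actual_output": "x"}, {"actual_output": "y"}, {"actual_output": "x"}],
    [("len", lambda r: len(r["actual_output"]))],
    logged.append,
)
print(out)             # [{'len': 1}, {'len': 1}, {'len': 1}]
print(len(logged))     # 2 -- two batches: sizes 2 and 1
```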

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `EvaluationResult` | `EvaluationResult` | Structured result containing evaluation results and experiment URLs/paths. |

get_run_results(**kwargs)

Get the results of a run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | The results of the run. |