Google Sheets experiment tracker

Google Sheets experiment tracker for tracking evaluation results.

This tracker stores experiment results and leaderboards in Google Sheets, providing an easy-to-access interface for viewing and sharing results.

GoogleSheetsExperimentTracker(project_name, config, score_key='score', extra_score_keys=None, leaderboard_score_key='aggregate_success', run_aggregators=None, companion_fields=None, include_eval_result=True, batch_size=GSheetDefaultValues.DEFAULT_BATCH_SIZE, max_retries=GSheetDefaultValues.MAX_RETRIES, keep_fields=None, field_mapping=None)

Bases: BaseExperimentTracker

Google Sheets experiment tracker for evaluation results.

Initialize a Google Sheets–backed experiment tracker instance.

This constructor wires the tracker into the Google Sheets–based experiment tracking workflow defined by BaseExperimentTracker.

Parameters:

Name Type Description Default
project_name str

Logical name of the project or benchmark suite.

required
config GoogleSheetsTrackerConfig

Configuration object holding the Google Sheets resource and authentication settings.

required
score_key str | list[str]

Metric key(s) used to extract scores from evaluation outputs.

'score'
extra_score_keys list[str] | None

Additional score keys to extract for result columns. Defaults to DEFAULT_EXTRA_SCORE_KEYS.

None
leaderboard_score_key str

Score key used for leaderboard aggregation.

'aggregate_success'
run_aggregators list[Any] | None

Custom run aggregators for batch-level metrics.

None
companion_fields list[str] | None

Per-metric fields to extract alongside scores. Defaults to DEFAULT_COMPANION_FIELDS.

None
include_eval_result bool

Whether to include the full evaluator result JSON. Defaults to True.

True
batch_size int

Maximum number of rows per batch write to Google Sheets.

DEFAULT_BATCH_SIZE
max_retries int

Maximum retry attempts for transient API failures.

MAX_RETRIES
keep_fields list[str] | None

Optional list of fields to keep when logging.

None
field_mapping dict[str, str] | None

Optional mapping of output field name -> source field name.

None

Raises:

Type Description
ValueError

If config validation fails.
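
The keep_fields and field_mapping parameters are described only briefly above, so the following is a minimal sketch of one plausible reading: filter the row first, then rename columns using the output -> source direction. The helper select_fields is hypothetical and is not part of the tracker. The order in which the real tracker applies the two options is an assumption.

```python
from __future__ import annotations

from typing import Any


def select_fields(
    row: dict[str, Any],
    keep_fields: list[str] | None = None,
    field_mapping: dict[str, str] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: filter a row, then rename via output -> source."""
    # Keep only the requested source fields (every field when keep_fields is None).
    if keep_fields is None:
        kept = dict(row)
    else:
        kept = {name: row[name] for name in keep_fields if name in row}
    if field_mapping is None:
        return kept
    # Each mapping entry reads as: output column name -> source field name.
    return {out: kept[src] for out, src in field_mapping.items() if src in kept}


row = {"query": "2+2?", "generated_response": "4", "latency_ms": 120}
selected = select_fields(
    row,
    keep_fields=["query", "generated_response"],
    field_mapping={"question": "query", "answer": "generated_response"},
)
print(selected)  # {'question': '2+2?', 'answer': '4'}
```

Note that the mapping keys are the column names that end up in the sheet, and the values name the fields they are read from.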

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs) async

Log a single evaluation result (asynchronous).

Note

gspread is synchronous, so this method offloads the blocking write to the running loop's default ThreadPoolExecutor via run_in_executor(None, ...). Concurrent alog calls share that pool. With the CPython default of min(32, os.cpu_count() + 4) workers, sustained high concurrency can saturate the pool and delay any other work submitted to the same default executor. For high-throughput cases, prefer log_batch (one executor call per batch) or configure a dedicated executor on the loop via loop.set_default_executor.
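
As an illustration of the executor advice in this note, the sketch below installs a custom ThreadPoolExecutor as the loop's default and then offloads blocking calls through run_in_executor(None, ...). It uses only the standard library. blocking_write is a hypothetical stand-in for a synchronous sheet write, and the pool size of 4 is an arbitrary choice, not a recommendation from the tracker.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_write(row: int) -> int:
    """Hypothetical stand-in for a synchronous sheet write."""
    time.sleep(0.01)
    return row


async def main() -> list[int]:
    loop = asyncio.get_running_loop()
    # Replace the loop's default executor; asyncio.run() shuts it down on exit.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=4, thread_name_prefix="sheets")
    )
    # Passing None selects the default executor that was just installed.
    futures = [loop.run_in_executor(None, blocking_write, i) for i in range(8)]
    return await asyncio.gather(*futures)


results = asyncio.run(main())
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because set_default_executor changes the pool for every run_in_executor(None, ...) call on that loop, it resizes the shared pool rather than isolating the tracker from other callers.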

Parameters:

Name Type Description Default
evaluation_result EvaluatorResult

The evaluation result to log.

required
dataset_name str

Name of the dataset being evaluated.

required
data MetricInput

The input data that was evaluated.

required
run_id str | None

ID of the experiment run. Can be auto-generated if None.

None
metadata dict[str, Any] | None

Additional metadata to log.

None
update_leaderboard bool

Whether to update the leaderboard after writing this row.

True
**kwargs Any

Additional configuration parameters.

{}

get_experiment_urls(run_id, dataset_name, **kwargs)

Get experiment URLs and paths for accessing experiment data.

Parameters:

Name Type Description Default
run_id str

ID of the experiment run.

required
dataset_name str

Name of the dataset that was evaluated.

required
**kwargs Any

Additional configuration parameters.

{}

Returns:

Name Type Description
ExperimentUris ExperimentUris

Dictionary containing URIs for accessing experiment data.

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)

Log a single evaluation result (synchronous).

Parameters:

Name Type Description Default
evaluation_result EvaluatorResult

The evaluation result to log.

required
dataset_name str

Name of the dataset being evaluated.

required
data MetricInput

The input data that was evaluated.

required
run_id str | None

ID of the experiment run. Can be auto-generated if None.

None
metadata dict[str, Any] | None

Additional metadata to log.

None
update_leaderboard bool

Whether to update the leaderboard after writing this row. Set to False when calling log repeatedly, then update the leaderboard once after the last row has been written.

True
**kwargs Any

Additional configuration parameters.

{}
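
The update_leaderboard guidance above can be sketched with a stand-in object. RecordingTracker is hypothetical and only mirrors the keyword arguments of log. Passing update_leaderboard=True on the final call is one assumed way to refresh the leaderboard once; the page does not state whether the real tracker offers a separate call for that.

```python
from __future__ import annotations

from typing import Any


class RecordingTracker:
    """Hypothetical stand-in that mirrors the log() keyword arguments."""

    def __init__(self) -> None:
        self.rows_written = 0
        self.leaderboard_updates = 0

    def log(
        self,
        evaluation_result: Any,
        dataset_name: str,
        data: Any,
        run_id: str | None = None,
        metadata: dict[str, Any] | None = None,
        update_leaderboard: bool = True,
        **kwargs: Any,
    ) -> None:
        # Count one row per call, and one leaderboard refresh when requested.
        self.rows_written += 1
        if update_leaderboard:
            self.leaderboard_updates += 1


tracker = RecordingTracker()
results = [{"score": 1.0}, {"score": 0.5}, {"score": 0.0}]
for index, result in enumerate(results):
    is_last = index == len(results) - 1
    tracker.log(
        result,
        dataset_name="demo",
        data={"query": f"q{index}"},
        run_id="run-1",
        update_leaderboard=is_last,  # refresh only after the final row
    )
print(tracker.rows_written, tracker.leaderboard_updates)  # 3 1
```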

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs) async

Log a batch of evaluation results (asynchronous).

Parameters:

Name Type Description Default
evaluation_results list[list[EvaluatorResult]]

Row-grouped evaluation results to log.

required
dataset_name str

Name of the dataset being evaluated.

required
data list[MetricInput]

List of input data that was evaluated.

required
run_id str | None

ID of the experiment run. Can be auto-generated if None.

None
metadata dict[str, Any] | None

Additional metadata to log.

None
update_leaderboard bool

Whether to update the leaderboard once after writing the batch.

True
**kwargs Any

Additional configuration parameters.

{}
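
To make the row-grouped shape concrete, the sketch below pairs each input with its own list of results and splits the rows into chunks no larger than a batch size, in the spirit of the constructor's batch_size parameter. The plain dicts stand in for EvaluatorResult and MetricInput objects, and chunked is a hypothetical helper. How the tracker itself splits writes is an assumption.

```python
from __future__ import annotations

from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], batch_size: int) -> Iterator[list[T]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


data = ["What is 2+2?", "Capital of France?", "Largest planet?"]
# One inner list per input row; each inner list holds that row's results.
evaluation_results = [
    [{"metric": "accuracy", "score": 1.0}, {"metric": "fluency", "score": 0.9}],
    [{"metric": "accuracy", "score": 1.0}, {"metric": "fluency", "score": 0.8}],
    [{"metric": "accuracy", "score": 0.0}, {"metric": "fluency", "score": 0.7}],
]
assert len(evaluation_results) == len(data)

batches = list(chunked(list(zip(data, evaluation_results)), batch_size=2))
print([len(batch) for batch in batches])  # [2, 1]
```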

GoogleSheetsTrackerConfig(spreadsheet_id=None, folder_drive_id=None, client_email=None, private_key=None) dataclass

Configuration for Google Sheets Experiment Tracker.

This dataclass only stores Google Sheets connection/resource settings.

Attributes:

Name Type Description
spreadsheet_id str | None

Optional ID of an existing Google Sheets spreadsheet.

folder_drive_id str | None

Optional Google Drive folder ID used when creating a new spreadsheet.

client_email str | None

Optional service account client email for authentication.

private_key str | None

Optional private key for service account authentication.

__post_init__()

Validate configuration after initialization.
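
The page does not say which checks __post_init__ performs, so the sketch below only illustrates the general dataclass pattern with the same four fields. SketchSheetsConfig is a hypothetical class, and both validation rules are assumptions chosen for illustration, not the library's actual rules. ValueError is used because the tracker constructor is documented as raising it when config validation fails.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchSheetsConfig:
    """Hypothetical config mirroring the documented attribute names."""

    spreadsheet_id: str | None = None
    folder_drive_id: str | None = None
    client_email: str | None = None
    private_key: str | None = None

    def __post_init__(self) -> None:
        # Assumed rule: there must be a spreadsheet to open or a folder to create one in.
        if not (self.spreadsheet_id or self.folder_drive_id):
            raise ValueError("Provide spreadsheet_id or folder_drive_id.")
        # Assumed rule: service account credentials come as a pair.
        if bool(self.client_email) != bool(self.private_key):
            raise ValueError("client_email and private_key must be set together.")


config = SketchSheetsConfig(spreadsheet_id="sheet-123")
print(config.spreadsheet_id)  # sheet-123

try:
    SketchSheetsConfig()
except ValueError as error:
    print(f"rejected: {error}")
```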