Google Sheets experiment tracker

Experiment tracker that records evaluation results in Google Sheets.

This tracker stores experiment results and leaderboards in Google Sheets, providing an easy-to-access interface for viewing and sharing results.

`GoogleSheetsExperimentTracker(project_name, config, score_key='score', extra_score_keys=None, leaderboard_score_key='aggregate_success', run_aggregators=None, companion_fields=None, include_eval_result=True, batch_size=GSheetDefaultValues.DEFAULT_BATCH_SIZE, max_retries=GSheetDefaultValues.MAX_RETRIES, keep_fields=None, field_mapping=None)`

Bases: BaseExperimentTracker

Google Sheets experiment tracker for evaluation results.

Initialize a Google Sheets–backed experiment tracker instance.

This constructor wires the tracker into the Google Sheets-based experiment tracking workflow defined by `BaseExperimentTracker`.

Parameters:

Name	Type	Description	Default
`project_name`	`str`	Logical name of the project or benchmark suite.	required
`config`	`GoogleSheetsTrackerConfig`	Google Sheets resource and authentication settings.	required
`score_key`	`str \| list[str]`	Metric key(s) used to extract scores from evaluation outputs.	`'score'`
`extra_score_keys`	`list[str] \| None`	Additional score keys to extract for result columns. When `None`, `DEFAULT_EXTRA_SCORE_KEYS` is used.	`None`
`leaderboard_score_key`	`str`	Score key used for leaderboard aggregation.	`'aggregate_success'`
`run_aggregators`	`list[Any] \| None`	Custom run aggregators for batch-level metrics.	`None`
`companion_fields`	`list[str] \| None`	Per-metric fields to extract alongside scores. When `None`, `DEFAULT_COMPANION_FIELDS` is used.	`None`
`include_eval_result`	`bool`	Whether to include the full evaluator result JSON.	`True`
`batch_size`	`int`	Maximum number of rows per batch write to Google Sheets.	`DEFAULT_BATCH_SIZE`
`max_retries`	`int`	Maximum retry attempts for transient API failures.	`MAX_RETRIES`
`keep_fields`	`list[str] \| None`	Optional list of fields to keep when logging.	`None`
`field_mapping`	`dict[str, str] \| None`	Optional mapping of output field name -> source field name.	`None`

Raises:

Type	Description
`ValueError`	If config validation fails.
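The `batch_size` and `max_retries` parameters describe two common patterns: splitting rows into bounded chunks and retrying a write after a transient failure. The sketch below illustrates those patterns in isolation. It is not the tracker's implementation; `chunk_rows`, `write_with_retries`, the `base_delay` parameter, the use of `ConnectionError` as the transient failure, and the reading of `max_retries` as "retries after the first attempt" are all assumptions made for illustration.

```python
import time
from typing import Callable, Sequence


def chunk_rows(rows: Sequence[list], batch_size: int) -> list[list[list]]:
    """Split rows into consecutive chunks of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(rows[i:i + batch_size]) for i in range(0, len(rows), batch_size)]


def write_with_retries(
    write: Callable[[], None],
    max_retries: int,
    base_delay: float = 0.5,
) -> int:
    """Call write(), retrying on ConnectionError; return the attempts used.

    Assumption: max_retries counts retries after the first attempt, and
    ConnectionError stands in for whatever transient error the API raises.
    """
    for attempt in range(max_retries + 1):
        try:
            write()
            return attempt + 1
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In a pipeline like this, each chunk returned by `chunk_rows` would be passed to one `write_with_retries` call, so a transient failure only repeats the affected chunk rather than the whole run.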

`alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)` `async`

Log a single evaluation result (asynchronous).

Note

`gspread` is synchronous, so this method offloads the blocking write to the running loop's default `ThreadPoolExecutor` via `run_in_executor(None, ...)`. Concurrent `alog` calls share that pool. With the CPython default of `min(32, os.cpu_count() + 4)` workers, sustained high concurrency can saturate the pool and delay unrelated tasks that use the same executor. For high-throughput cases, prefer `log_batch` (one executor call per batch) or configure a dedicated executor on the loop via `loop.set_default_executor`.

Parameters:

Name	Type	Description	Default
`evaluation_result`	`EvaluatorResult`	The evaluation result to log.	required
`dataset_name`	`str`	Name of the dataset being evaluated.	required
`data`	`LLMTestCase`	The input data that was evaluated.	required
`run_id`	`str \| None`	ID of the experiment run. Can be auto-generated if None.	`None`
`metadata`	`dict[str, Any] \| None`	Additional metadata to log.	`None`
`update_leaderboard`	`bool`	Whether to update the leaderboard after writing this row.	`True`
`**kwargs`	`Any`	Additional configuration parameters.	`{}`
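The executor advice in the note above can be sketched with the standard library alone. The example installs an explicitly sized `ThreadPoolExecutor` as the loop's default before scheduling blocking work through `run_in_executor(None, ...)`. `blocking_write` is a hypothetical stand-in for a synchronous Sheets write, and the pool size of 4 is an arbitrary choice for illustration.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_write(row: dict) -> str:
    """Hypothetical stand-in for a synchronous Sheets write."""
    time.sleep(0.01)  # simulate network latency
    return f"wrote {row['id']}"


async def main() -> list[str]:
    loop = asyncio.get_running_loop()
    # Install an explicitly sized pool as the loop's default executor, so
    # every run_in_executor(None, ...) call on this loop uses it.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=4, thread_name_prefix="sheets")
    )
    futures = [
        loop.run_in_executor(None, blocking_write, {"id": i}) for i in range(8)
    ]
    # gather preserves submission order in its result list.
    return await asyncio.gather(*futures)


# asyncio.run shuts down the loop's default executor on exit (Python 3.9+).
results = asyncio.run(main())
```

Because the default executor is shared by everything on the loop, sizing it is a trade-off: a larger pool tolerates more concurrent writes but also issues more simultaneous requests to the Sheets API.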

`get_experiment_urls(run_id, dataset_name, **kwargs)`

Get experiment URLs and paths for accessing experiment data.

Parameters:

Name	Type	Description	Default
`run_id`	`str`	ID of the experiment run.	required
`dataset_name`	`str`	Name of the dataset that was evaluated.	required
`**kwargs`	`Any`	Additional configuration parameters.	`{}`

Returns:

Name	Type	Description
`ExperimentUris`	`ExperimentUris`	Dictionary containing URIs for accessing experiment data.

`log(evaluation_result, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)`

Log a single evaluation result (synchronous).

Parameters:

Name	Type	Description	Default
`evaluation_result`	`EvaluatorResult`	The evaluation result to log.	required
`dataset_name`	`str`	Name of the dataset being evaluated.	required
`data`	`LLMTestCase`	The input data that was evaluated.	required
`run_id`	`str \| None`	ID of the experiment run. Can be auto-generated if None.	`None`
`metadata`	`dict[str, Any] \| None`	Additional metadata to log.	`None`
`update_leaderboard`	`bool`	Whether to update the leaderboard after writing this row. Set to `False` when calling `log` repeatedly, then update the leaderboard once after the batch.	`True`
`**kwargs`	`Any`	Additional configuration parameters.	`{}`
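One way to follow the `update_leaderboard` guidance is to defer the leaderboard refresh to the final call in a loop. The sketch below uses `StubTracker`, a hypothetical stand-in that only records calls, so it runs without Google credentials. Whether a final `update_leaderboard=True` call refreshes the leaderboard for every previously written row is an assumption here, not something this page confirms.

```python
from __future__ import annotations


class StubTracker:
    """Hypothetical stand-in that records calls instead of writing to Sheets."""

    def __init__(self) -> None:
        self.rows: list[str] = []
        self.leaderboard_updates = 0

    def log(
        self,
        evaluation_result: str,
        dataset_name: str,
        data: str,
        run_id: str | None = None,
        update_leaderboard: bool = True,
    ) -> None:
        self.rows.append(evaluation_result)
        if update_leaderboard:
            self.leaderboard_updates += 1


tracker = StubTracker()
results = ["r0", "r1", "r2"]
for i, result in enumerate(results):
    is_last = i == len(results) - 1
    # Skip the leaderboard refresh on every row except the last one.
    tracker.log(
        result,
        "demo-dataset",
        f"case-{i}",
        run_id="run-1",
        update_leaderboard=is_last,
    )
```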

`log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)` `async`

Log a batch of evaluation results (asynchronous).

Parameters:

Name	Type	Description	Default
`evaluation_results`	`list[list[EvaluatorResult]]`	Row-grouped evaluation results to log.	required
`dataset_name`	`str`	Name of the dataset being evaluated.	required
`data`	`list[LLMTestCase]`	List of input data that was evaluated.	required
`run_id`	`str \| None`	ID of the experiment run. Can be auto-generated if None.	`None`
`metadata`	`dict[str, Any] \| None`	Additional metadata to log.	`None`
`update_leaderboard`	`bool`	Whether to update the leaderboard once after writing the batch.	`True`
`**kwargs`	`Any`	Additional configuration parameters.	`{}`
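The type `list[list[EvaluatorResult]]` suggests one inner list of results per input row. The sketch below shows that shape with hypothetical stand-in classes (`FakeResult`, `FakeCase`) and a hypothetical `check_alignment` helper. The alignment rule, that `evaluation_results[i]` holds the results for `data[i]`, is an assumption inferred from the phrase "row-grouped".

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for EvaluatorResult."""
    metric: str
    score: float


@dataclass
class FakeCase:
    """Hypothetical stand-in for LLMTestCase."""
    query: str


def check_alignment(evaluation_results: list, data: list) -> None:
    """Raise ValueError unless there is one result group per input row."""
    if len(evaluation_results) != len(data):
        raise ValueError(
            f"{len(evaluation_results)} result groups for {len(data)} rows"
        )


data = [FakeCase("q0"), FakeCase("q1")]
evaluation_results = [
    [FakeResult("accuracy", 1.0), FakeResult("fluency", 0.5)],  # for data[0]
    [FakeResult("accuracy", 0.0), FakeResult("fluency", 1.0)],  # for data[1]
]
check_alignment(evaluation_results, data)
```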

`set_run_aggregators(aggregators)`

Set run aggregators on the tracker.

Parameters:

Name	Type	Description	Default
`aggregators`	`list[Any]`	Run aggregator callables to apply after each batch.	required

`GoogleSheetsTrackerConfig(spreadsheet_id=None, folder_drive_id=None, client_email=None, private_key=None)` `dataclass`

Configuration for Google Sheets Experiment Tracker.

This dataclass only stores Google Sheets connection/resource settings.

Attributes:

Name	Type	Description
`spreadsheet_id`	`str \| None`	Optional ID of an existing Google Sheets spreadsheet.
`folder_drive_id`	`str \| None`	Optional Google Drive folder ID in which to create a new spreadsheet.
`client_email`	`str \| None`	Optional service account client email for authentication.
`private_key`	`str \| None`	Optional private key for service account authentication.

`__post_init__()`

Validate configuration after initialization.
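Validation in `__post_init__` follows the usual dataclass pattern: the generated `__init__` assigns the fields, then `__post_init__` checks them and raises on an invalid combination. The sketch below shows the pattern with a hypothetical `SketchSheetsConfig`. The rule it enforces (service account credentials must be supplied as a pair) is an assumption for illustration; the actual checks performed by `GoogleSheetsTrackerConfig` are not documented on this page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchSheetsConfig:
    """Hypothetical stand-in; not the real GoogleSheetsTrackerConfig."""

    spreadsheet_id: str | None = None
    folder_drive_id: str | None = None
    client_email: str | None = None
    private_key: str | None = None

    def __post_init__(self) -> None:
        # Assumed rule: the two credential fields are set together or not at all.
        if (self.client_email is None) != (self.private_key is None):
            raise ValueError("client_email and private_key must be set together")


config = SketchSheetsConfig(spreadsheet_id="sheet-id-placeholder")
```

Raising `ValueError` here matches the constructor's documented behaviour of raising `ValueError` when config validation fails.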
