Google Sheets experiment tracker
Google Sheets experiment tracker for tracking evaluation results.
This tracker stores experiment results and leaderboards in Google Sheets, providing an easy-to-access interface for viewing and sharing results.
GoogleSheetsExperimentTracker(project_name, config, score_key='score', extra_score_keys=None, leaderboard_score_key='aggregate_success', run_aggregators=None, companion_fields=None, include_eval_result=True, batch_size=GSheetDefaultValues.DEFAULT_BATCH_SIZE, max_retries=GSheetDefaultValues.MAX_RETRIES, keep_fields=None, field_mapping=None)
Bases: BaseExperimentTracker
Google Sheets experiment tracker for evaluation results.
Initialize a Google Sheets–backed experiment tracker instance.
The constructor wires the tracker into the experiment tracking workflow defined by `BaseExperimentTracker`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `project_name` | `str` | Logical name of the project or benchmark suite. | required |
| `config` | `GoogleSheetsTrackerConfig` | `GoogleSheetsTrackerConfig` containing Google Sheets resource/auth settings. | required |
| `score_key` | `str \| list[str]` | Metric key(s) used to extract scores from evaluation outputs. | `'score'` |
| `extra_score_keys` | `list[str] \| None` | Additional score keys to extract for result columns. Defaults to `DEFAULT_EXTRA_SCORE_KEYS`. | `None` |
| `leaderboard_score_key` | `str` | Score key used for leaderboard aggregation. | `'aggregate_success'` |
| `run_aggregators` | `list[Any] \| None` | Custom run aggregators for batch-level metrics. | `None` |
| `companion_fields` | `list[str] \| None` | Per-metric fields to extract alongside scores. Defaults to `DEFAULT_COMPANION_FIELDS`. | `None` |
| `include_eval_result` | `bool` | Whether to include the full evaluator result JSON. Defaults to `True`. | `True` |
| `batch_size` | `int` | Maximum number of rows per batch write to Google Sheets. | `DEFAULT_BATCH_SIZE` |
| `max_retries` | `int` | Maximum retry attempts for transient API failures. | `MAX_RETRIES` |
| `keep_fields` | `list[str] \| None` | Optional list of fields to keep when logging. | `None` |
| `field_mapping` | `dict[str, str] \| None` | Optional mapping of output field name -> source field name. | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If config validation fails. |
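The `keep_fields` and `field_mapping` parameters are easiest to picture on a plain dictionary. The sketch below is not the tracker's implementation: `select_fields` is a hypothetical helper, the sample row is made up, and the order of operations (rename first, then filter on output names) is an assumption. Only the direction of the mapping, output field name -> source field name, comes from the parameter table.

```python
from __future__ import annotations

from typing import Any


def select_fields(
    row: dict[str, Any],
    keep_fields: list[str] | None = None,
    field_mapping: dict[str, str] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: add mapped fields, then keep only the requested ones."""
    out = dict(row)
    # field_mapping is read as output name -> source name.
    for output_name, source_name in (field_mapping or {}).items():
        if source_name in row:
            out[output_name] = row[source_name]
    # Assumption: keep_fields filters on the final (output) names.
    if keep_fields is not None:
        out = {name: out[name] for name in keep_fields if name in out}
    return out


row = {"query": "2+2?", "generated_response": "4", "latency_ms": 120}
selected = select_fields(
    row,
    keep_fields=["query", "answer"],
    field_mapping={"answer": "generated_response"},
)
print(selected)
```

With these inputs the sheet row would carry only `query` and `answer`; whether the real tracker filters before or after renaming should be checked against its source.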
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)
async
Log a single evaluation result (asynchronous).
Note
gspread is synchronous, so this offloads the blocking write to the
running loop's default ThreadPoolExecutor via run_in_executor(None, ...).
Concurrent alog calls share that pool; with the CPython default
(min(32, os.cpu_count() + 4) workers) sustained high concurrency
can saturate it and block unrelated tasks. For high-throughput cases
prefer log_batch (one executor call per batch) or configure a
dedicated executor on the loop via loop.set_default_executor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluatorResult` | The evaluation result to log. | required |
| `dataset_name` | `str` | Name of the dataset being evaluated. | required |
| `data` | `MetricInput` | The input data that was evaluated. | required |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if `None`. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `update_leaderboard` | `bool` | Whether to update the leaderboard after writing this row. | `True` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
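The note on `alog` suggests configuring a dedicated executor via `loop.set_default_executor`. A minimal sketch of that setup follows, using only the standard library. `blocking_write` is a hypothetical stand-in for a synchronous gspread write, and the pool size of 8 is an arbitrary illustration, not a recommended value.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_write(row_id: int) -> int:
    """Hypothetical stand-in for a synchronous gspread write."""
    time.sleep(0.01)
    return row_id


async def write_rows(n: int) -> list[int]:
    loop = asyncio.get_running_loop()
    # Replace the loop's default executor so that run_in_executor(None, ...)
    # draws from this pool instead of the CPython default one.
    executor = ThreadPoolExecutor(max_workers=8, thread_name_prefix="sheets")
    loop.set_default_executor(executor)
    futures = [loop.run_in_executor(None, blocking_write, i) for i in range(n)]
    # gather preserves the order in which the futures were passed.
    return await asyncio.gather(*futures)


results = asyncio.run(write_rows(20))
print(results)
```

Because the default executor is shared by everything on the loop that calls `run_in_executor(None, ...)`, sizing it explicitly makes the concurrency limit for blocking writes visible in one place.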
get_experiment_urls(run_id, dataset_name, **kwargs)
Get experiment URLs and paths for accessing experiment data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | required |
| `dataset_name` | `str` | Name of the dataset that was evaluated. | required |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `ExperimentUris` | `ExperimentUris` | Dictionary containing URIs for accessing experiment data. |
log(evaluation_result, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)
Log a single evaluation result (synchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluatorResult` | The evaluation result to log. | required |
| `dataset_name` | `str` | Name of the dataset being evaluated. | required |
| `data` | `MetricInput` | The input data that was evaluated. | required |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if `None`. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `update_leaderboard` | `bool` | Whether to update the leaderboard after writing this row. Set to `False` when calling `log` repeatedly, then update the leaderboard once after the batch. | `True` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
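The `update_leaderboard` description for `log` implies a pattern: suppress the leaderboard refresh on every call except one. The sketch below illustrates that pattern with `StubTracker`, a hypothetical stand-in that copies only the `log` signature and counts refreshes. Triggering the single refresh by passing `update_leaderboard=True` on the final call is an assumption; the real tracker may offer a different way to refresh after the loop.

```python
class StubTracker:
    """Hypothetical stand-in that mimics only the update_leaderboard flag of log."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.leaderboard_updates = 0

    def log(self, evaluation_result, dataset_name, data, run_id=None,
            metadata=None, update_leaderboard=True, **kwargs) -> None:
        self.rows.append({"result": evaluation_result, "data": data, "run_id": run_id})
        if update_leaderboard:
            # The real tracker would recompute the leaderboard sheet here.
            self.leaderboard_updates += 1


tracker = StubTracker()
results = [{"score": 1.0}, {"score": 0.0}, {"score": 1.0}]
inputs = [{"query": "a"}, {"query": "b"}, {"query": "c"}]

for i, (result, item) in enumerate(zip(results, inputs)):
    is_last = i == len(results) - 1
    # Only the final call refreshes the leaderboard.
    tracker.log(result, "demo_dataset", item, run_id="run-001",
                update_leaderboard=is_last)

print(len(tracker.rows), tracker.leaderboard_updates)
```

Three rows are written but the leaderboard is refreshed once, which is the saving the parameter description points at.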
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, update_leaderboard=True, **kwargs)
async
Log a batch of evaluation results (asynchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_results` | `list[list[EvaluatorResult]]` | Row-grouped evaluation results to log. | required |
| `dataset_name` | `str` | Name of the dataset being evaluated. | required |
| `data` | `list[MetricInput]` | List of input data that was evaluated. | required |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if `None`. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `update_leaderboard` | `bool` | Whether to update the leaderboard once after writing the batch. | `True` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
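The shape of `evaluation_results` and the effect of a per-write row limit such as `batch_size` can be sketched with plain dictionaries. Everything here is illustrative: the dictionaries stand in for `EvaluatorResult` and `MetricInput`, `chunk_rows` is a hypothetical helper, and the reading of "row-grouped" as one inner list per entry in `data` is an assumption rather than documented behavior.

```python
from __future__ import annotations

from typing import Any, Iterator


def chunk_rows(rows: list[Any], batch_size: int) -> Iterator[list[Any]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


data = [{"query": "q1"}, {"query": "q2"}, {"query": "q3"}]
# Assumption: one inner list per input row, holding that row's evaluator results.
evaluation_results = [
    [{"metric": "accuracy", "score": 1.0}, {"metric": "fluency", "score": 0.8}],
    [{"metric": "accuracy", "score": 0.0}, {"metric": "fluency", "score": 0.9}],
    [{"metric": "accuracy", "score": 1.0}, {"metric": "fluency", "score": 0.7}],
]
if len(evaluation_results) != len(data):
    raise ValueError("expected one result group per input row")

# Flatten each (input, results) pair into a single sheet row.
rows = []
for item, results in zip(data, evaluation_results):
    row = dict(item)
    for result in results:
        row[result["metric"]] = result["score"]
    rows.append(row)

batches = list(chunk_rows(rows, batch_size=2))
print([len(batch) for batch in batches])
```

With three rows and a limit of two, the rows split into one write of two rows and one write of one row.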
GoogleSheetsTrackerConfig(spreadsheet_id=None, folder_drive_id=None, client_email=None, private_key=None)
dataclass
Configuration for Google Sheets Experiment Tracker.
This dataclass only stores Google Sheets connection/resource settings.
Attributes:
| Name | Type | Description |
|---|---|---|
| `spreadsheet_id` | `str \| None` | Optional ID of an existing Google Sheets spreadsheet. |
| `folder_drive_id` | `str \| None` | Optional Google Drive folder ID for creating a new spreadsheet. |
| `client_email` | `str \| None` | Optional service account client email for authentication. |
| `private_key` | `str \| None` | Optional private key for service account authentication. |
__post_init__()
Validate configuration after initialization.
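The general pattern of validating a dataclass in `__post_init__` can be sketched as follows. `SheetsConfigSketch` is a hypothetical stand-in that mirrors only the four documented field names. The rule it enforces (credentials must be supplied as a pair) is invented for illustration, since the checks performed by `GoogleSheetsTrackerConfig` are not documented here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SheetsConfigSketch:
    """Hypothetical stand-in mirroring the four documented fields."""

    spreadsheet_id: str | None = None
    folder_drive_id: str | None = None
    client_email: str | None = None
    private_key: str | None = None

    def __post_init__(self) -> None:
        # Illustrative rule only: the two credential fields go together.
        if (self.client_email is None) != (self.private_key is None):
            raise ValueError("client_email and private_key must be set together")


ok = SheetsConfigSketch(spreadsheet_id="sheet-id-placeholder")

try:
    SheetsConfigSketch(client_email="bot@example.com")
    error = None
except ValueError as exc:
    error = str(exc)

print(ok.spreadsheet_id, error)
```

Raising `ValueError` from `__post_init__` means an invalid configuration fails at construction time, which matches the `ValueError` listed for the tracker constructor.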