
Experiment tracker

Base class for all experiment trackers.

Authors

Apri Dwi Rachmadi (apri.d.rachmadi@gdplabs.id)

References

NONE

BaseExperimentTracker(project_name, **kwargs)

Bases: ABC

Base class for all experiment trackers.

This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `project_name` | `str` | The name of the project. |

Initialize the experiment tracker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `project_name` | `str` | The name of the project. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
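A minimal sketch of a concrete tracker, assuming a subclass only needs to implement the abstract methods documented below (`log`, `alog`, `log_batch`, and `get_experiment_urls`). The class name, import path, and in-memory storage are illustrative, not part of the library; the later sketches on this page build on this hypothetical class.

```python
from typing import Any

# Import path is an assumption; adjust to the actual package layout.
# from experiment_tracker.base import BaseExperimentTracker


class InMemoryExperimentTracker(BaseExperimentTracker):  # hypothetical subclass
    """Stores results in a list instead of a real backend (illustrative only)."""

    def __init__(self, project_name: str, **kwargs: Any) -> None:
        super().__init__(project_name, **kwargs)
        self._records: list[dict[str, Any]] = []

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # A real tracker would send the result to its backend here.
        self._records.append(
            {
                "run_id": run_id,
                "dataset_name": dataset_name,
                "result": evaluation_result,
                "data": data,
                "metadata": metadata or {},
            }
        )

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Reuse the synchronous path for this sketch.
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Log each (result, row) pair individually.
        for result, row in zip(evaluation_results, data):
            self.log(result, dataset_name, row, run_id=run_id, metadata=metadata, **kwargs)

    def get_experiment_urls(self, run_id, dataset_name, **kwargs):
        # The real return type is ExperimentUrls; a plain dict stands in here.
        return {"experiment_url": None, "leaderboard_url": None}
```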

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a single evaluation result (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
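A hedged usage sketch for `alog`, placed inside an async function; `tracker`, `evaluation_result`, and `row` are assumed to already exist, and the dataset name, run ID, and metadata values are illustrative.

```python
# Inside an async function; all values shown are illustrative.
await tracker.alog(
    evaluation_result,                # an EvaluationOutput instance
    dataset_name="qa-regression",     # hypothetical dataset name
    data=row,                         # the MetricInput that was evaluated
    run_id="run-2024-06-01",          # hypothetical run ID; omit to auto-generate
    metadata={"model": "gpt-4o"},     # illustrative metadata
)
```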

export_experiment_results(run_id, **kwargs)

Export experiment results to the specified format.

This is an optional method; trackers can override it if they support export functionality. The default implementation returns an empty list.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | List of experiment results. Returns an empty list by default. |
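A sketch of a tracker that supports export, continuing the hypothetical `InMemoryExperimentTracker` above; the `run_id` key and the filtering logic belong to that sketch, not to the library.

```python
class ExportingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def export_experiment_results(self, run_id, **kwargs):
        # Return the stored records for the given run instead of the default empty list.
        return [record for record in self._records if record["run_id"] == run_id]
```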

get_additional_kwargs()

Get the additional keyword arguments needed by the tracker.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Additional keyword arguments. |

get_experiment_history(**kwargs)

Get all experiment runs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | List of experiment runs. |

get_experiment_urls(run_id, dataset_name, **kwargs) abstractmethod

Get experiment URLs and paths for accessing experiment data.

This method returns URLs or local paths that can be used to access experiment results and leaderboards. Different experiment trackers will return different types of URLs/paths based on their capabilities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `dataset_name` | `str` | Name of the dataset that was evaluated. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `ExperimentUrls` | `ExperimentUrls` | Dictionary containing URLs and paths for accessing experiment data. Different trackers populate different fields: Langfuse populates `session_url`, `dataset_run_url`, `experiment_url`, and `leaderboard_url`; Simple populates `experiment_local_path`, `leaderboard_local_path`, `experiment_url`, and `leaderboard_url`. |
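A hedged sketch of reading the returned URLs; it assumes `ExperimentUrls` is dict-like, which is inferred from the description above rather than confirmed by the source, and the argument values are illustrative.

```python
# Assumes ExperimentUrls behaves like a mapping (inferred from the description above).
urls = tracker.get_experiment_urls(run_id="run-2024-06-01", dataset_name="qa-regression")
experiment_url = urls.get("experiment_url")
leaderboard_url = urls.get("leaderboard_url")
if experiment_url:
    print(f"Experiment results: {experiment_url}")
```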

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | Detailed run results. |
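A small sketch combining `get_experiment_history` and `get_run_results`; the `run_id` key in each history entry is an assumption, since the exact shape of the returned dictionaries is backend-specific.

```python
# The "run_id" key is an assumption; the dictionary shape depends on the tracker backend.
runs = tracker.get_experiment_history()
if runs:
    latest_run_id = runs[-1]["run_id"]
    results = tracker.get_run_results(run_id=latest_run_id)
```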

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod

Log a single evaluation result (synchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
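The synchronous counterpart of the `alog` sketch above; all argument values are illustrative and `tracker`, `evaluation_result`, and `row` are assumed to exist.

```python
tracker.log(
    evaluation_result,                # an EvaluationOutput instance
    dataset_name="qa-regression",     # hypothetical dataset name
    data=row,                         # the MetricInput that was evaluated
    metadata={"commit": "abc1234"},   # illustrative metadata
)
```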

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a batch of evaluation results (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_results` | `list[EvaluationOutput]` | The evaluation results to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `list[MetricInput]` | List of input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
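A hedged batch-logging sketch, placed inside an async function; `results` and `rows` are assumed to be parallel lists of `EvaluationOutput` and `MetricInput` items, and the other values are illustrative.

```python
# Inside an async function; results and rows are assumed to be parallel lists.
await tracker.log_batch(
    evaluation_results=results,
    dataset_name="qa-regression",     # hypothetical dataset name
    data=rows,
    run_id="run-2024-06-01",          # hypothetical run ID; omit to auto-generate
)
```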

prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)

Optionally convert the dataset to tracker-specific format if needed.

This method allows trackers to convert the dataset to a format required for evaluation. For example, LangfuseExperimentTracker may convert a standard dataset to LangfuseDataset to enable dataset item syncing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `BaseDataset` | The dataset to prepare. | *required* |
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `**kwargs` | `Any` | Additional arguments. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `BaseDataset` | `BaseDataset` | The prepared dataset (or original if no conversion needed). |
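A sketch of overriding this hook, continuing the hypothetical `InMemoryExperimentTracker` above; a backend-specific tracker might convert or register the dataset here, while this sketch simply returns it unchanged.

```python
class SyncingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def prepare_dataset_for_evaluation(self, dataset, dataset_name, **kwargs):
        # A real backend might convert `dataset` to its own format or sync its items;
        # this sketch performs no conversion.
        return dataset
```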

prepare_for_tracking(row, **kwargs)

Optional hook for tracker-specific preprocessing before logging.

This allows trackers to prepare data before logging if needed. Most trackers won't need to override this.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `row` | `dict[str, Any]` | The row to prepare. | *required* |
| `**kwargs` | `Any` | Additional arguments. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | The prepared row. |
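A sketch of overriding the preprocessing hook, again on the hypothetical class above; attaching the project name to each row is purely illustrative.

```python
class AnnotatingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def prepare_for_tracking(self, row, **kwargs):
        # Attach the project name to every row before it is logged (illustrative only).
        return {**row, "project_name": self.project_name}
```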

refresh_score(run_id, **kwargs)

Refresh the scores in the experiment tracker.

This method allows trackers to refresh or recalculate scores. The default implementation does nothing; trackers can override it if they support score refreshing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run (session ID). | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `bool` | `bool` | True if the refresh operation succeeded, False otherwise. The default implementation returns True (no-op completed successfully). |

Note

This is an optional method. Trackers that don't support score refreshing can use the default no-op implementation.
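A sketch of a tracker that overrides `refresh_score`, building on the hypothetical in-memory class above; the "refresh" here is only a placeholder for whatever recalculation a real backend would perform.

```python
class RefreshingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def refresh_score(self, run_id, **kwargs):
        # Placeholder for a real recalculation; a backend tracker would
        # re-fetch or recompute scores for run_id here.
        for record in self._records:
            if record["run_id"] == run_id:
                record["score_refreshed"] = True
        return True
```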

wrap_inference_fn(inference_fn)

Optionally wrap the inference function for tracker-specific behavior.

This method allows trackers to wrap the inference function with additional functionality (e.g., Langfuse observe decorator). Most trackers won't need to override this.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `inference_fn` | `Callable` | The original inference function. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Callable` | `Callable` | The wrapped inference function (or original if no wrapping needed). |
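A sketch of wrapping the inference function to record latency, again on the hypothetical class above; a backend such as Langfuse would instead attach its own observe decorator, as the description notes.

```python
import functools
import time
from typing import Any, Callable


class TimingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def wrap_inference_fn(self, inference_fn: Callable) -> Callable:
        # Record wall-clock latency around each call (illustrative only).
        @functools.wraps(inference_fn)
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            output = inference_fn(*args, **kwargs)
            self.last_inference_seconds = time.perf_counter() - start
            return output

        return wrapped
```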