
Experiment tracker

Base class for all experiment trackers.

Authors

Apri Dwi Rachmadi (apri.d.rachmadi@gdplabs.id)

References

NONE

BaseExperimentTracker(project_name, **kwargs)

Bases: ABC

Base class for all experiment trackers.

This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `project_name` | `str` | The name of the project. |

Initialize the experiment tracker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `project_name` | `str` | The name of the project. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
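A minimal sketch of a concrete tracker, assuming a subclass only needs to implement the abstract methods documented below (`log`, `alog`, `log_batch`, and `get_experiment_urls`). The class name, import path, and in-memory storage are illustrative, not part of the library; the later sketches on this page build on this hypothetical class.

```python
from typing import Any

# Import path is an assumption; adjust to the actual package layout.
# from experiment_tracker.base import BaseExperimentTracker


class InMemoryExperimentTracker(BaseExperimentTracker):  # hypothetical subclass
    """Stores results in a list instead of a real backend (illustrative only)."""

    def __init__(self, project_name: str, **kwargs: Any) -> None:
        super().__init__(project_name, **kwargs)
        self._records: list[dict[str, Any]] = []

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # A real tracker would send the result to its backend here.
        self._records.append(
            {
                "run_id": run_id,
                "dataset_name": dataset_name,
                "result": evaluation_result,
                "data": data,
                "metadata": metadata or {},
            }
        )

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Reuse the synchronous path for this sketch.
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Log each (result, row) pair individually.
        for result, row in zip(evaluation_results, data):
            self.log(result, dataset_name, row, run_id=run_id, metadata=metadata, **kwargs)

    def get_experiment_urls(self, run_id, dataset_name, **kwargs):
        # The real return type is ExperimentUrls; a plain dict stands in here.
        return {"experiment_url": None, "leaderboard_url": None}
```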

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a single evaluation result (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
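A hedged usage sketch for `alog`, placed inside an async function; `tracker`, `evaluation_result`, and `row` are assumed to already exist, and the dataset name, run ID, and metadata values are illustrative.

```python
# Inside an async function; all values shown are illustrative.
await tracker.alog(
    evaluation_result,                # an EvaluationOutput instance
    dataset_name="qa-regression",     # hypothetical dataset name
    data=row,                         # the MetricInput that was evaluated
    run_id="run-2024-06-01",          # hypothetical run ID; omit to auto-generate
    metadata={"model": "gpt-4o"},     # illustrative metadata
)
```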

export_experiment_results(run_id, **kwargs)

Export experiment results to the specified format.

This is an optional method; trackers can override it if they support export functionality. The default implementation returns an empty list.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | List of experiment results. Returns an empty list by default. |
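A sketch of a tracker that supports export, continuing the hypothetical `InMemoryExperimentTracker` above; the `run_id` key and the filtering logic belong to that sketch, not to the library.

```python
class ExportingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def export_experiment_results(self, run_id, **kwargs):
        # Return the stored records for the given run instead of the default empty list.
        return [record for record in self._records if record["run_id"] == run_id]
```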

get_additional_kwargs()

Get the additional keyword arguments needed by the tracker.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Additional keyword arguments. |

get_experiment_history(**kwargs)

Get all experiment runs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | List of experiment runs. |

get_experiment_urls(run_id, dataset_name, **kwargs) abstractmethod

Get experiment URLs and paths for accessing experiment data.

This method returns URLs or local paths that can be used to access experiment results and leaderboards. Different experiment trackers will return different types of URLs/paths based on their capabilities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `dataset_name` | `str` | Name of the dataset that was evaluated. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `ExperimentUrls` | `ExperimentUrls` | Dictionary containing URLs and paths for accessing experiment data. Different trackers populate different fields: Langfuse populates `session_url`, `dataset_run_url`, `experiment_url`, and `leaderboard_url`; Simple populates `experiment_local_path`, `leaderboard_local_path`, `experiment_url`, and `leaderboard_url`. |
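A hedged sketch of reading the returned URLs; it assumes `ExperimentUrls` is dict-like, which is inferred from the description above rather than confirmed by the source, and the argument values are illustrative.

```python
# Assumes ExperimentUrls behaves like a mapping (inferred from the description above).
urls = tracker.get_experiment_urls(run_id="run-2024-06-01", dataset_name="qa-regression")
experiment_url = urls.get("experiment_url")
leaderboard_url = urls.get("leaderboard_url")
if experiment_url:
    print(f"Experiment results: {experiment_url}")
```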

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | Detailed run results. |
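A small sketch combining `get_experiment_history` and `get_run_results`; the `run_id` key in each history entry is an assumption, since the exact shape of the returned dictionaries is backend-specific.

```python
# The "run_id" key is an assumption; the dictionary shape depends on the tracker backend.
runs = tracker.get_experiment_history()
if runs:
    latest_run_id = runs[-1]["run_id"]
    results = tracker.get_run_results(run_id=latest_run_id)
```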

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod

Log a single evaluation result (synchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
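The synchronous counterpart of the `alog` sketch above; all argument values are illustrative and `tracker`, `evaluation_result`, and `row` are assumed to exist.

```python
tracker.log(
    evaluation_result,                # an EvaluationOutput instance
    dataset_name="qa-regression",     # hypothetical dataset name
    data=row,                         # the MetricInput that was evaluated
    metadata={"commit": "abc1234"},   # illustrative metadata
)
```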

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a batch of evaluation results (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_results` | `list[EvaluationOutput]` | The evaluation results to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `list[MetricInput]` | List of input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
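A hedged batch-logging sketch, placed inside an async function; `results` and `rows` are assumed to be parallel lists of `EvaluationOutput` and `MetricInput` items, and the other values are illustrative.

```python
# Inside an async function; results and rows are assumed to be parallel lists.
await tracker.log_batch(
    evaluation_results=results,
    dataset_name="qa-regression",     # hypothetical dataset name
    data=rows,
    run_id="run-2024-06-01",          # hypothetical run ID; omit to auto-generate
)
```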

prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)

Optionally convert the dataset to tracker-specific format if needed.

This method allows trackers to convert the dataset to a format required for evaluation. For example, LangfuseExperimentTracker may convert a standard dataset to LangfuseDataset to enable dataset item syncing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `BaseDataset` | The dataset to prepare. | *required* |
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `**kwargs` | `Any` | Additional arguments. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `BaseDataset` | `BaseDataset` | The prepared dataset (or original if no conversion needed). |
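A sketch of overriding this hook, continuing the hypothetical `InMemoryExperimentTracker` above; a backend-specific tracker might convert or register the dataset here, while this sketch simply returns it unchanged.

```python
class SyncingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def prepare_dataset_for_evaluation(self, dataset, dataset_name, **kwargs):
        # A real backend might convert `dataset` to its own format or sync its items;
        # this sketch performs no conversion.
        return dataset
```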

prepare_for_tracking(row, **kwargs)

Optional hook for tracker-specific preprocessing before logging.

This allows trackers to prepare data before logging if needed. Most trackers won't need to override this.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `row` | `dict[str, Any]` | The row to prepare. | *required* |
| `**kwargs` | `Any` | Additional arguments. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | The prepared row. |
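A sketch of overriding the preprocessing hook, again on the hypothetical class above; attaching the project name to each row is purely illustrative.

```python
class AnnotatingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def prepare_for_tracking(self, row, **kwargs):
        # Attach the project name to every row before it is logged (illustrative only).
        return {**row, "project_name": self.project_name}
```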

refresh_score(run_id, **kwargs)

Refresh the scores in the experiment tracker.

This method allows trackers to refresh or recalculate scores. The default implementation does nothing; trackers can override it if they support score refreshing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run (session ID). | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `bool` | `bool` | True if the refresh operation succeeded, False otherwise. The default implementation returns True (no-op completed successfully). |

Note

This is an optional method. Trackers that don't support score refreshing can use the default no-op implementation.
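A sketch of a tracker that overrides `refresh_score`, building on the hypothetical in-memory class above; the "refresh" here is only a placeholder for whatever recalculation a real backend would perform.

```python
class RefreshingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def refresh_score(self, run_id, **kwargs):
        # Placeholder for a real recalculation; a backend tracker would
        # re-fetch or recompute scores for run_id here.
        for record in self._records:
            if record["run_id"] == run_id:
                record["score_refreshed"] = True
        return True
```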

wrap_inference_fn(inference_fn)

Optionally wrap the inference function for tracker-specific behavior.

This method allows trackers to wrap the inference function with additional functionality (e.g., Langfuse observe decorator). Most trackers won't need to override this.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `inference_fn` | `Callable` | The original inference function. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Callable` | `Callable` | The wrapped inference function (or original if no wrapping needed). |
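A sketch of wrapping the inference function to record latency, again on the hypothetical class above; a backend such as Langfuse would instead attach its own observe decorator, as the description notes.

```python
import functools
import time
from typing import Any, Callable


class TimingTracker(InMemoryExperimentTracker):  # hypothetical subclass
    def wrap_inference_fn(self, inference_fn: Callable) -> Callable:
        # Record wall-clock latency around each call (illustrative only).
        @functools.wraps(inference_fn)
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            output = inference_fn(*args, **kwargs)
            self.last_inference_seconds = time.perf_counter() - start
            return output

        return wrapped
```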