Experiment tracker
Base class for all experiment trackers.
BaseExperimentTracker(project_name, **kwargs)
Bases: ABC
Base class for all experiment trackers.
This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.
Attributes:

| Name | Type | Description |
|---|---|---|
| `project_name` | `str` | The name of the project. |
Initialize the experiment tracker.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `project_name` | `str` | The name of the project. | required |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
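The abstract members below (`log`, `alog`, `log_batch`, `get_experiment_urls`) are what a concrete tracker must implement. As a minimal sketch, assuming a hypothetical import path and that `ExperimentUrls` accepts keyword construction, an in-memory tracker could look like this:

```python
from typing import Any

# Hypothetical import path; adjust to wherever BaseExperimentTracker
# and ExperimentUrls live in your package.
from my_eval_package.tracking import BaseExperimentTracker, ExperimentUrls


class InMemoryExperimentTracker(BaseExperimentTracker):
    """Toy tracker that keeps every logged result in a Python list."""

    def __init__(self, project_name: str, **kwargs: Any) -> None:
        super().__init__(project_name, **kwargs)
        self._records: list[dict[str, Any]] = []

    def log(self, evaluation_result, dataset_name, data,
            run_id=None, metadata=None, **kwargs):
        # Store a plain dict so the records are easy to export or inspect later.
        self._records.append({
            "run_id": run_id,
            "dataset_name": dataset_name,
            "result": evaluation_result,
            "data": data,
            "metadata": metadata or {},
        })

    async def alog(self, evaluation_result, dataset_name, data,
                   run_id=None, metadata=None, **kwargs):
        # No real I/O in this toy tracker, so the async variant just delegates.
        self.log(evaluation_result, dataset_name, data,
                 run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data,
                        run_id=None, metadata=None, **kwargs):
        for result, item in zip(evaluation_results, data):
            await self.alog(result, dataset_name, item,
                            run_id=run_id, metadata=metadata, **kwargs)

    def get_experiment_urls(self, run_id, dataset_name, **kwargs):
        # Assumes ExperimentUrls can be built from keyword arguments; the
        # field names follow the description of get_experiment_urls below.
        return ExperimentUrls(
            experiment_local_path=f"./experiments/{run_id}",
            leaderboard_local_path=f"./leaderboards/{dataset_name}",
        )
```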
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
async
Log a single evaluation result (asynchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | required |
| `dataset_name` | `str` | Name of the dataset being evaluated. | required |
| `data` | `MetricInput` | The input data that was evaluated. | required |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
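An illustrative usage sketch; the tracker instance, result objects, dataset name, and run id are placeholders, not part of this API:

```python
import asyncio

async def log_run(tracker, results, inputs):
    # results and inputs are assumed to be parallel sequences of
    # EvaluationOutput and MetricInput objects from your evaluation step.
    for result, item in zip(results, inputs):
        await tracker.alog(
            evaluation_result=result,
            dataset_name="qa-benchmark",       # placeholder dataset name
            data=item,
            run_id="run-001",                  # placeholder run id
            metadata={"model": "placeholder"},
        )

# asyncio.run(log_run(tracker, results, inputs))
```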
export_experiment_results(run_id, **kwargs)
Export experiment results to the specified format.
This is an optional method; trackers can override it if they support export functionality. The default implementation returns an empty list.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | required |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of experiment results. Returns empty list by default. |
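Building on the hypothetical in-memory sketch above, an override could filter stored records by run id (fragment only; abstract methods and imports are omitted):

```python
class InMemoryExperimentTracker(BaseExperimentTracker):
    # Fragment continuing the in-memory sketch above.
    def export_experiment_results(self, run_id, **kwargs):
        # Return only the records that were logged under the requested run.
        return [record for record in self._records if record["run_id"] == run_id]
```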
get_additional_kwargs()
Get additional keyword arguments needed.
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Additional keyword arguments. |
get_experiment_history(**kwargs)
Get all experiment runs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of experiment runs. |
get_experiment_urls(run_id, dataset_name, **kwargs)
abstractmethod
Get experiment URLs and paths for accessing experiment data.
This method returns URLs or local paths that can be used to access experiment results and leaderboards. Different experiment trackers will return different types of URLs/paths based on their capabilities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | required |
| `dataset_name` | `str` | Name of the dataset that was evaluated. | required |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `ExperimentUrls` | `ExperimentUrls` | Dictionary containing URLs and paths for accessing experiment data. Different trackers will populate different fields: |

- Langfuse: `session_url`, `dataset_run_url`, `experiment_url`, `leaderboard_url`
- Simple: `experiment_local_path`, `leaderboard_local_path`, `experiment_url`, `leaderboard_url`
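Assuming `ExperimentUrls` behaves like a mapping, as the description above suggests, a caller might read it like this (run id and dataset name are placeholders):

```python
urls = tracker.get_experiment_urls(run_id="run-001", dataset_name="qa-benchmark")

# A file-based tracker would populate local paths...
print(urls.get("experiment_local_path"))
print(urls.get("leaderboard_local_path"))

# ...while a hosted tracker such as Langfuse would populate URLs instead.
print(urls.get("session_url"))
print(urls.get("leaderboard_url"))
```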
get_run_results(run_id, **kwargs)
Get detailed results for a specific run.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | required |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | Detailed run results. |
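A sketch of pairing the two query methods; the `"run_id"` key used to drill into a run is an assumption about the record layout, not part of this interface:

```python
runs = tracker.get_experiment_history()
for run in runs:
    print(run)

if runs:
    # "run_id" is an assumed key; concrete trackers may name it differently.
    details = tracker.get_run_results(run_id=runs[0]["run_id"])
    print(f"{len(details)} logged results in the first run")
```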
log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
Log a single evaluation result (synchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | required |
| `dataset_name` | `str` | Name of the dataset being evaluated. | required |
| `data` | `MetricInput` | The input data that was evaluated. | required |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
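The synchronous variant fits plain scripts where no event loop is running; a minimal sketch with placeholder names:

```python
for result, item in zip(results, inputs):
    tracker.log(
        evaluation_result=result,
        dataset_name="qa-benchmark",  # placeholder
        data=item,
        run_id="run-001",             # placeholder
    )
```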
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
async
Log a batch of evaluation results (asynchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_results` | `list[EvaluationOutput]` | The evaluation results to log. | required |
| `dataset_name` | `str` | Name of the dataset being evaluated. | required |
| `data` | `list[MetricInput]` | List of input data that was evaluated. | required |
| `run_id` | `str \| None` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
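A usage sketch: one awaited call hands the tracker the whole run at once (all names are placeholders):

```python
import asyncio

async def log_whole_run(tracker, results, inputs):
    await tracker.log_batch(
        evaluation_results=results,
        dataset_name="qa-benchmark",          # placeholder
        data=inputs,
        run_id="run-001",                     # placeholder
        metadata={"experiment": "prompt-v2"},
    )

# asyncio.run(log_whole_run(tracker, results, inputs))
```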
prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)
Optionally convert the dataset to tracker-specific format if needed.
This method allows trackers to convert the dataset to a format required for evaluation. For example, LangfuseExperimentTracker may convert a standard dataset to LangfuseDataset to enable dataset item syncing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `BaseDataset` | The dataset to prepare. | required |
| `dataset_name` | `str` | The name of the dataset. | required |
| `**kwargs` | `Any` | Additional arguments. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `BaseDataset` | `BaseDataset` | The prepared dataset (or original if no conversion needed). |
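A sketch of a tracker that needs no conversion and simply records the dataset name before handing the dataset back; a Langfuse-style tracker would instead return a converted dataset object here (fragment; abstract methods omitted):

```python
class BookkeepingTracker(BaseExperimentTracker):
    # Fragment; the abstract logging methods are omitted for brevity.
    def prepare_dataset_for_evaluation(self, dataset, dataset_name, **kwargs):
        # No backend-specific format is needed, so remember the name and
        # return the original dataset unchanged.
        self._seen_datasets = getattr(self, "_seen_datasets", set())
        self._seen_datasets.add(dataset_name)
        return dataset
```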
prepare_for_tracking(row, **kwargs)
Optional hook for tracker-specific preprocessing before logging.
This allows trackers to prepare data before logging if needed. Most trackers won't need to override this.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `row` | `dict[str, Any]` | The row to prepare. | required |
| `**kwargs` | `Any` | Additional arguments. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | The prepared row. |
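A sketch of an override that stamps each row with bookkeeping fields before it is logged (fragment; abstract methods omitted):

```python
import time
from typing import Any


class TaggingTracker(BaseExperimentTracker):
    # Fragment; the abstract logging methods are omitted for brevity.
    def prepare_for_tracking(self, row: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
        # Copy so the caller's row is not mutated, then add bookkeeping fields.
        prepared = dict(row)
        prepared["logged_at"] = time.time()
        prepared["project"] = self.project_name
        return prepared
```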
refresh_score(run_id, **kwargs)
Refresh the scores in the experiment tracker.
This method allows trackers to refresh or recalculate scores. The default implementation does nothing; trackers can override it if they support score refreshing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run (session ID). | required |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the refresh operation succeeded, False otherwise. Default implementation returns True (no-op completed successfully). |
Note
This is an optional method. Trackers that don't support score refreshing can use the default no-op implementation.
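A sketch of an override for a backend-aware tracker; `self._client` and its `recompute_scores` call are hypothetical stand-ins for a real backend client (fragment; abstract methods omitted):

```python
class RemoteTracker(BaseExperimentTracker):
    # Fragment; the abstract logging methods are omitted for brevity.
    def refresh_score(self, run_id, **kwargs):
        try:
            # self._client is a hypothetical backend client set up in __init__.
            self._client.recompute_scores(session_id=run_id)
            return True
        except Exception:
            return False
```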
wrap_inference_fn(inference_fn)
Optionally wrap the inference function for tracker-specific behavior.
This method allows trackers to wrap the inference function with additional functionality (e.g., Langfuse observe decorator). Most trackers won't need to override this.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inference_fn` | `Callable` | The original inference function. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `Callable` | `Callable` | The wrapped inference function (or original if no wrapping needed). |
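A sketch of an override that adds simple timing around the inference call; a Langfuse-style tracker would apply its observe decorator here instead (fragment; abstract methods omitted):

```python
import functools
import time
from typing import Any, Callable


class TimingTracker(BaseExperimentTracker):
    # Fragment; the abstract logging methods are omitted for brevity.
    def wrap_inference_fn(self, inference_fn: Callable) -> Callable:
        @functools.wraps(inference_fn)
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return inference_fn(*args, **kwargs)
            finally:
                # Report wall-clock time for each wrapped inference call.
                print(f"inference took {time.perf_counter() - start:.3f}s")
        return wrapped
```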