Experiment Tracker

Experiment tracker module for logging and tracking evaluation experiments.

This module provides experiment tracking functionality to log evaluation runs, metrics, and results to various tracking platforms. It supports integration with Langfuse and provides a simple file-based tracker for local development.

Available trackers:
  • BaseExperimentTracker: Abstract base class for experiment trackers.
  • SimpleExperimentTracker: File-based experiment tracking for local use.
  • LangfuseExperimentTracker: Integration with the Langfuse platform.
  • get_experiment_tracker: Factory function for creating tracker instances.
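
As a quick orientation, the sketch below shows the typical flow with the file-based tracker: create it, log a result, then look up where the run was stored. The import path and the dict-shaped stand-ins for EvaluationOutput and MetricInput are illustrative assumptions, not verbatim library usage.

```python
# Minimal sketch; the import path and dict-shaped inputs below are assumptions.
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed module path

tracker = SimpleExperimentTracker(project_name="my-rag-eval")

# Stand-ins for EvaluationOutput and MetricInput.
evaluation_result = {"score": 0.85, "explanation": "Answer matches the reference."}
data = {"query": "What is the capital of France?", "expected_response": "Paris"}

tracker.log(
    evaluation_result=evaluation_result,
    dataset_name="geography-qa",
    data=data,
    run_id="run-001",
    metadata={"model": "gpt-4o-mini"},
)

urls = tracker.get_experiment_urls(run_id="run-001", dataset_name="geography-qa")
print(urls)  # local CSV paths for the experiment results and leaderboard
```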

BaseExperimentTracker(project_name, **kwargs)

Bases: ABC

Base class for all experiment trackers.

This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.

Attributes:
  project_name (str): The name of the project.

Initialize the experiment tracker.

Parameters:
  project_name (str, required): The name of the project.
  **kwargs (Any): Additional configuration parameters.
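
Because the class is abstract, a custom backend only has to implement log, alog, log_batch, and get_experiment_urls; the remaining hooks ship with usable defaults. The skeleton below is a sketch under that assumption; the untyped signatures and the plain dict returned in place of ExperimentUrls are simplifications.

```python
# Sketch of a custom backend; BaseExperimentTracker is imported from the library.
class PrintExperimentTracker(BaseExperimentTracker):
    """Toy tracker that prints each result instead of persisting it."""

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        print(f"[{run_id}] {dataset_name}: {evaluation_result}")

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        for result, row in zip(evaluation_results, data):
            await self.alog(result, dataset_name, row, run_id=run_id, metadata=metadata, **kwargs)

    def get_experiment_urls(self, run_id, dataset_name, **kwargs):
        # Real trackers return an ExperimentUrls mapping; a plain dict stands in here.
        return {"experiment_url": None, "leaderboard_url": None}
```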

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a single evaluation result (asynchronous).

Parameters:
  evaluation_result (EvaluationOutput, required): The evaluation result to log.
  dataset_name (str, required): Name of the dataset being evaluated.
  data (MetricInput, required): The input data that was evaluated.
  run_id (str | None, default None): ID of the experiment run. Can be auto-generated if None.
  metadata (dict[str, Any] | None, default None): Additional metadata to log.
  **kwargs (Any): Additional configuration parameters.

export_experiment_results(run_id, **kwargs)

Export experiment results to the specified format.

This is an optional method; trackers can override it if they support exporting. The default implementation returns an empty list.

Parameters:
  run_id (str, required): ID of the experiment run.
  **kwargs (Any): Additional configuration parameters.

Returns:
  list[dict[str, Any]]: List of experiment results. Returns an empty list by default.

get_additional_kwargs()

Get additional keyword arguments needed.

Returns:
  dict[str, Any]: Additional keyword arguments.

get_experiment_history(**kwargs)

Get all experiment runs.

Parameters:
  **kwargs (Any): Additional configuration parameters.

Returns:
  list[dict[str, Any]]: List of experiment runs.

get_experiment_urls(run_id, dataset_name, **kwargs) abstractmethod

Get experiment URLs and paths for accessing experiment data.

This method returns URLs or local paths that can be used to access experiment results and leaderboards. Different experiment trackers will return different types of URLs/paths based on their capabilities.

Parameters:
  run_id (str, required): ID of the experiment run.
  dataset_name (str, required): Name of the dataset that was evaluated.
  **kwargs (Any): Additional configuration parameters.

Returns:
  ExperimentUrls: Dictionary containing URLs and paths for accessing experiment data. Different trackers populate different fields:
    • Langfuse: session_url, dataset_run_url, experiment_url, leaderboard_url
    • Simple: experiment_local_path, leaderboard_local_path, experiment_url, leaderboard_url
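
A rough illustration of consuming the result regardless of backend; dictionary-style access to ExperimentUrls is an assumption about its shape:

```python
urls = tracker.get_experiment_urls(run_id="run-001", dataset_name="geography-qa")

# Langfuse trackers populate hosted URLs; the simple tracker populates local CSV paths.
link = urls.get("experiment_url") or urls.get("experiment_local_path")
print(f"Experiment results available at: {link}")
```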

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:
  run_id (str, required): ID of the experiment run.
  **kwargs (Any): Additional configuration parameters.

Returns:
  list[dict[str, Any]]: Detailed run results.

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod

Log a single evaluation result (synchronous).

Parameters:
  evaluation_result (EvaluationOutput, required): The evaluation result to log.
  dataset_name (str, required): Name of the dataset being evaluated.
  data (MetricInput, required): The input data that was evaluated.
  run_id (str | None, default None): ID of the experiment run. Can be auto-generated if None.
  metadata (dict[str, Any] | None, default None): Additional metadata to log.
  **kwargs (Any): Additional configuration parameters.

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a batch of evaluation results (asynchronous).

Parameters:
  evaluation_results (list[EvaluationOutput], required): The evaluation results to log.
  dataset_name (str, required): Name of the dataset being evaluated.
  data (list[MetricInput], required): List of input data that was evaluated.
  run_id (str | None, default None): ID of the experiment run. Can be auto-generated if None.
  metadata (dict[str, Any] | None, default None): Additional metadata to log.
  **kwargs (Any): Additional configuration parameters.

prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)

Optionally convert the dataset to tracker-specific format if needed.

This method allows trackers to convert the dataset to a format required for evaluation. For example, LangfuseExperimentTracker may convert a standard dataset to LangfuseDataset to enable dataset item syncing.

Parameters:
  dataset (BaseDataset, required): The dataset to prepare.
  dataset_name (str, required): The name of the dataset.
  **kwargs (Any): Additional arguments.

Returns:
  BaseDataset: The prepared dataset (or the original if no conversion is needed).

prepare_for_tracking(row, **kwargs)

Optional hook for tracker-specific preprocessing before logging.

This allows trackers to prepare data before logging if needed. Most trackers won't need to override this.

Parameters:
  row (dict[str, Any], required): The row to prepare.
  **kwargs (Any): Additional arguments.

Returns:
  dict[str, Any]: The prepared row.

refresh_score(run_id, **kwargs)

Refresh the scores in the experiment tracker.

This method allows trackers to refresh or recalculate scores. The default implementation does nothing; trackers can override it if they support score refreshing.

Parameters:
  run_id (str, required): ID of the experiment run (session ID).
  **kwargs (Any): Additional configuration parameters.

Returns:
  bool: True if the refresh operation succeeded, False otherwise. The default implementation returns True (no-op completed successfully).

Note

This is an optional method. Trackers that don't support score refreshing can use the default no-op implementation.

wrap_inference_fn(inference_fn)

Optionally wrap the inference function for tracker-specific behavior.

This method allows trackers to wrap the inference function with additional functionality (e.g., Langfuse observe decorator). Most trackers won't need to override this.

Parameters:
  inference_fn (Callable, required): The original inference function.

Returns:
  Callable: The wrapped inference function (or the original if no wrapping is needed).
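
An evaluation harness would typically call this hook before running inference over a dataset; the inference function below is a placeholder:

```python
def inference_fn(row):
    # Placeholder: call your model or pipeline here.
    return {"response": f"generated answer for {row['query']}"}

# For most trackers this returns inference_fn unchanged; Langfuse wraps it with observe.
wrapped_fn = tracker.wrap_inference_fn(inference_fn)
output = wrapped_fn({"query": "What is the capital of France?"})
```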

LangfuseExperimentTracker(score_key=DefaultValues.SCORE_KEY, project_name=DefaultValues.PROJECT_NAME, langfuse_client=None, expected_output_key=DefaultValues.EXPECTED_OUTPUT_KEY, mapping=None)

Bases: BaseExperimentTracker

Experiment tracker for Langfuse.

Attributes:
  langfuse_client (Langfuse): The Langfuse client.

Initialize the LangfuseExperimentTracker class.

Parameters:
  score_key (str | list[str], default "score"): The key(s) of the score(s) to log.
  project_name (str, default DefaultValues.PROJECT_NAME): The name of the project.
  langfuse_client (Langfuse, default None): The Langfuse client.
  expected_output_key (str | None, default "expected_response"): The key used to extract the expected output from the data.
  mapping (dict[str, Any] | None, default None): Optional mapping for field keys.
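
A minimal construction sketch, assuming the langfuse package is installed and the standard LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables are set; the project name is illustrative:

```python
from langfuse import Langfuse  # requires the langfuse package

# The client reads its credentials and host from the environment.
langfuse_client = Langfuse()

tracker = LangfuseExperimentTracker(
    project_name="my-rag-eval",
    langfuse_client=langfuse_client,
    score_key="score",                       # or a list of score keys to log
    expected_output_key="expected_response",
)
```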

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs) async

Log an evaluation result to Langfuse asynchronously.

Parameters:
  evaluation_result (EvaluationOutput, required): The evaluation result to log.
  dataset_name (str, required): The name of the dataset.
  data (MetricInput, required): The dataset data.
  run_id (str | None, default None): The run ID.
  metadata (dict, default None): Additional metadata.
  dataset_item_id (str | None, default None): The ID of the dataset item.
  reason_key (str | None, default "explanation"): The key to extract reasoning/explanation text from evaluation results.
  **kwargs (Any): Additional configuration parameters.

export_experiment_results(run_id, **kwargs)

Export Langfuse traces in the specified export format.

Parameters:
  run_id (str, required): ID of the experiment run.
  **kwargs (Any): Additional configuration parameters.

Returns:
  list[dict[str, Any]]: List of experiment results.

get_additional_kwargs()

Get additional keyword arguments needed.

LangfuseExperimentTracker provides the langfuse_client.

Returns:
  dict[str, Any]: Additional keyword arguments.

get_experiment_urls(run_id, dataset_name, **kwargs)

Get experiment URLs for Langfuse experiment tracker.

Parameters:
  run_id (str, required): ID of the experiment run.
  dataset_name (str, required): Name of the dataset that was evaluated.
  **kwargs (Any): Additional configuration parameters.

Returns:
  ExperimentUrls: Dictionary containing Langfuse URLs for accessing experiment data.

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:
  run_id (str, required): ID of the experiment run.
  **kwargs (Any): Additional configuration parameters, including 'keys' for trace keys.

Returns:
  dict[str, Any] | list[dict[str, Any]]: Detailed run results.

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)

Log a single evaluation result to Langfuse.

Parameters:
  evaluation_result (EvaluationOutput, required): The evaluation result to log.
  dataset_name (str, required): The name of the dataset.
  data (MetricInput, required): The dataset data.
  run_id (str | None, default None): The run ID.
  metadata (dict, default None): Additional metadata.
  dataset_item_id (str | None, default None): The ID of the dataset item.
  reason_key (str | None, default "explanation"): The key to extract reasoning/explanation text from evaluation results. This text is logged as comments alongside scores in Langfuse traces.
  **kwargs (Any): Additional configuration parameters.

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, reason_key=DefaultValues.REASON_KEY, **kwargs) async

Log a batch of evaluation results to Langfuse.

Parameters:
  evaluation_results (list[EvaluationOutput], required): The list of evaluation results to log.
  dataset_name (str, required): The name of the dataset.
  data (list[MetricInput], required): The list of dataset data items.
  run_id (str | None, default None): The run ID. If None, a unique ID will be generated.
  metadata (dict, default None): Additional metadata.
  reason_key (str | None, default "explanation"): The key to extract reasoning/explanation text from evaluation results. This text is logged as comments alongside scores in Langfuse traces.
  **kwargs (Any): Additional configuration parameters.
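
A sketch of batch logging; the result and data shapes are illustrative stand-ins for EvaluationOutput and MetricInput:

```python
import asyncio

results = [
    {"score": 0.9, "explanation": "Covers all key points."},
    {"score": 0.4, "explanation": "Misses the main fact."},
]
rows = [
    {"query": "Q1", "expected_response": "A1"},
    {"query": "Q2", "expected_response": "A2"},
]

asyncio.run(
    tracker.log_batch(
        evaluation_results=results,
        dataset_name="geography-qa",
        data=rows,
        run_id=None,  # a unique run ID is generated when None
    )
)
```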

prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)

Convert a standard dataset to a LangfuseDataset if it is not one already.

This ensures that LangfuseDataset.prepare_row_for_inference can properly sync dataset items in Langfuse.

Parameters:
  dataset (BaseDataset, required): The dataset to prepare.
  dataset_name (str, required): The name of the dataset.
  **kwargs (Any): Additional arguments, including mapping.

Returns:
  BaseDataset: A LangfuseDataset if conversion was needed, otherwise the original dataset.

prepare_for_tracking(row, **kwargs)

Prepare row for Langfuse tracking.

Converts standard format to Langfuse format if needed.

Parameters:
  row (dict[str, Any], required): The row to prepare.
  **kwargs (Any): Additional arguments.

Returns:
  dict[str, Any]: The row in Langfuse format.

refresh_score(run_id, **kwargs)

Refresh the session score in Langfuse.

This method refreshes or recalculates the session-level score for a given run. It retrieves all traces for the session, aggregates their scores, and updates the session-level score in Langfuse.

Parameters:
  run_id (str, required): ID of the experiment run (session ID).
  **kwargs (Any): Additional configuration parameters.

Returns:
  bool: True if the refresh operation succeeded, False otherwise.
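
Typically called once after all results for a run have been logged; a short sketch:

```python
# Recompute the session-level score from the run's traces after logging finishes.
if not tracker.refresh_score(run_id="run-001"):
    print("Score refresh failed; check the Langfuse connection and run ID.")
```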

wrap_inference_fn(inference_fn)

Wrap inference function with Langfuse observe decorator.

Parameters:
  inference_fn (Callable, required): The original inference function.

Returns:
  Callable: The inference function wrapped with the Langfuse observe decorator.

SimpleExperimentTracker(project_name, output_dir='./gllm_evals/experiments', score_key='score')

Bases: BaseExperimentTracker

Simple file-based experiment tracker for development and testing.

This class provides a simple local storage implementation for experiment tracking. It stores experiment data in CSV format with two files: experiment_results.csv and leaderboard.csv.

Attributes:
  project_name (str): The name of the project.
  output_dir (Path): Directory to store experiment results.
  experiment_results_file (Path): CSV file for experiment results.
  leaderboard_file (Path): CSV file for leaderboard data.
  logger (Logger): Logger for tracking errors and warnings.

Constants:
  MAX_OUTER_DICT_PARTS (int): Maximum number of parts in an outer dict score path. Used to distinguish main evaluator scores from nested sub-metrics.

Initialize the simple tracker with a project name and output directory.

Parameters:
  project_name (str, required): The name of the project.
  output_dir (str, default './gllm_evals/experiments'): Directory to store experiment results.
  score_key (str | list[str], default 'score'): The key used to extract scores from evaluation results.
    • If str: direct key access (e.g., "score") or dot notation (e.g., "metrics.accuracy").
    • If list[str]: nested key path (e.g., ["generation", "score"]).
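
A construction sketch covering the documented score_key forms; the nested result layouts implied by the keys are assumptions about how your evaluator structures its output:

```python
# Direct key: reads evaluation_result["score"].
tracker = SimpleExperimentTracker(project_name="my-rag-eval")

# Dot notation: reads evaluation_result["metrics"]["accuracy"].
tracker = SimpleExperimentTracker(
    project_name="my-rag-eval",
    output_dir="./gllm_evals/experiments",
    score_key="metrics.accuracy",
)

# Nested key path: reads evaluation_result["generation"]["score"].
tracker = SimpleExperimentTracker(
    project_name="my-rag-eval",
    score_key=["generation", "score"],
)
```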

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) async

Log a single evaluation result (asynchronous).

Parameters:
  evaluation_result (EvaluationOutput, required): The evaluation result to log.
  dataset_name (str, required): Name of the dataset being evaluated.
  data (MetricInput, required): The input data that was evaluated.
  run_id (str | None, default None): ID of the experiment run. Can be auto-generated if None.
  metadata (dict[str, Any] | None, default None): Additional metadata to log.
  **kwargs (Any): Additional configuration parameters (currently unused but accepted for interface consistency).

get_experiment_history()

Get all experiment runs from leaderboard.

Returns:
  list[dict[str, Any]]: List of experiment runs.

get_experiment_urls(run_id, dataset_name, **kwargs)

Get experiment local file paths for Simple experiment tracker.

Parameters:
  run_id (str, required): ID of the experiment run.
  dataset_name (str, required): Name of the dataset that was evaluated.
  **kwargs (Any): Additional configuration parameters.

Returns:
  ExperimentUrls: Dictionary containing local file paths for accessing experiment data.

get_run_results(run_id)

Get detailed results for a specific run.

Parameters:
  run_id (str, required): ID of the experiment run.

Returns:
  list[dict[str, Any]]: Detailed run results.
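
A short sketch of reading back locally stored runs; the printed row contents depend on what was logged:

```python
# List every run recorded in leaderboard.csv, then drill into a single run.
for run in tracker.get_experiment_history():
    print(run)

rows = tracker.get_run_results(run_id="run-001")
print(f"{len(rows)} logged results for run-001")
```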

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)

Log a single evaluation result.

Parameters:
  evaluation_result (EvaluationOutput, required): The evaluation result to log.
  dataset_name (str, required): Name of the dataset being evaluated.
  data (MetricInput, required): The input data that was evaluated.
  run_id (str | None, default None): ID of the experiment run. Can be auto-generated if None.
  metadata (dict[str, Any] | None, default None): Additional metadata to log.
  **kwargs (Any): Additional configuration parameters (currently unused but accepted for interface consistency).

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs) async

Log a batch of evaluation results (asynchronous).

Parameters:
  evaluation_results (list[EvaluationOutput], required): The evaluation results to log.
  dataset_name (str, required): Name of the dataset being evaluated.
  data (list[MetricInput], required): List of input data that was evaluated.
  run_id (str | None, default None): ID of the experiment run. Can be auto-generated if None.
  metadata (dict[str, Any] | None, default None): Additional metadata to log.
  **kwargs (Any): Additional configuration parameters.

get_experiment_tracker(experiment_tracker, project_name, **kwargs)

Get an experiment tracker instance (create or return existing).

Supported experiment tracker types:
  • SimpleExperimentTracker (default; local CSV)
  • LangfuseExperimentTracker (Langfuse); required parameter: langfuse_client

Parameters:
  experiment_tracker (BaseExperimentTracker | type[BaseExperimentTracker], required): The experiment tracker to get.
  project_name (str, required): The name of the project.
  **kwargs (Any): Additional arguments to pass to the constructor.

Returns:
  BaseExperimentTracker: The experiment tracker.

Raises:
  ValueError: If the experiment tracker is not supported.
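
Two hedged examples of the factory: passing a tracker class (constructor kwargs are forwarded) and passing an existing instance (returned as-is). The project name and client setup are illustrative:

```python
# Create a tracker from its class; extra kwargs go to the constructor.
tracker = get_experiment_tracker(
    SimpleExperimentTracker,
    project_name="my-rag-eval",
    output_dir="./gllm_evals/experiments",
)

# Or pass an already-constructed tracker, which is returned unchanged.
langfuse_tracker = LangfuseExperimentTracker(
    project_name="my-rag-eval",
    langfuse_client=langfuse_client,  # required for the Langfuse tracker, e.g. Langfuse()
)
tracker = get_experiment_tracker(langfuse_tracker, project_name="my-rag-eval")
```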