Experiment Tracker
Experiment tracker module for logging and tracking evaluation experiments.
This module provides experiment tracking functionality to log evaluation runs, metrics, and results to various tracking platforms. It supports integration with Langfuse and provides a simple file-based tracker for local development.
Available trackers:
- BaseExperimentTracker: Abstract base class for experiment trackers
- SimpleExperimentTracker: File-based experiment tracking for local use
- LangfuseExperimentTracker: Integration with the Langfuse platform
- get_experiment_tracker: Factory function for creating tracker instances
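For orientation, here is a minimal usage sketch. The import path is an assumption, and plain dicts stand in for the EvaluationOutput and MetricInput types used throughout this page.

```python
# Minimal sketch; the import path below is assumed, and plain dicts stand in
# for EvaluationOutput and MetricInput.
from gllm_evals.experiment_tracker import SimpleExperimentTracker

tracker = SimpleExperimentTracker(project_name="demo-project")

tracker.log(
    evaluation_result={"score": 0.87, "explanation": "Answer matches reference."},
    dataset_name="qa-smoke-test",
    data={"query": "What is the capital of France?", "expected_response": "Paris"},
    run_id="run-001",
    metadata={"model": "my-model-v1"},
)

print(tracker.get_experiment_history())
```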
BaseExperimentTracker(project_name, **kwargs)
Bases: ABC
Base class for all experiment trackers.
This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.
Attributes:
| Name | Type | Description |
|---|---|---|
| project_name | str | The name of the project. |
Initialize the experiment tracker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| project_name | str | The name of the project. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
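Since log, alog, log_batch, and get_experiment_urls are abstract, a concrete subclass must implement all four. Below is a hypothetical in-memory tracker sketching the interface; it is not part of the module, and the import path is assumed.

```python
from typing import Any
from uuid import uuid4

from gllm_evals.experiment_tracker import BaseExperimentTracker  # assumed path


class InMemoryTracker(BaseExperimentTracker):
    """Hypothetical tracker that keeps results in a list, for illustration."""

    def __init__(self, project_name: str, **kwargs: Any) -> None:
        super().__init__(project_name, **kwargs)
        self.records: list[dict[str, Any]] = []

    def log(self, evaluation_result, dataset_name, data,
            run_id=None, metadata=None, **kwargs):
        run_id = run_id or str(uuid4())  # auto-generate when None
        self.records.append({
            "run_id": run_id,
            "dataset": dataset_name,
            "result": evaluation_result,
            "data": data,
            "metadata": metadata or {},
        })

    async def alog(self, evaluation_result, dataset_name, data,
                   run_id=None, metadata=None, **kwargs):
        self.log(evaluation_result, dataset_name, data, run_id, metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data,
                        run_id=None, metadata=None, **kwargs):
        run_id = run_id or str(uuid4())  # one run ID shared by the batch
        for result, item in zip(evaluation_results, data):
            self.log(result, dataset_name, item, run_id, metadata, **kwargs)

    def get_experiment_urls(self, run_id, dataset_name, **kwargs):
        return {}  # no UI or files to point at for an in-memory tracker
```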
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
async
Log a single evaluation result (asynchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_result | EvaluationOutput | The evaluation result to log. | required |
| dataset_name | str | Name of the dataset being evaluated. | required |
| data | MetricInput | The input data that was evaluated. | required |
| run_id | str \| None | ID of the experiment run. Can be auto-generated if None. | None |
| metadata | dict[str, Any] \| None | Additional metadata to log. | None |
| **kwargs | Any | Additional configuration parameters. | {} |
export_experiment_results(run_id, **kwargs)
Export experiment results to the specified format.
This is an optional method; trackers can override it if they support exporting. The default implementation returns an empty list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | List of experiment results. Returns an empty list by default. |
get_additional_kwargs()
Get additional keyword arguments provided by the tracker.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Additional keyword arguments. |
get_experiment_history(**kwargs)
Get all experiment runs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | List of experiment runs. |
get_experiment_urls(run_id, dataset_name, **kwargs)
abstractmethod
Get experiment URLs and paths for accessing experiment data.
This method returns URLs or local paths that can be used to access experiment results and leaderboards. Different experiment trackers will return different types of URLs/paths based on their capabilities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| dataset_name | str | Name of the dataset that was evaluated. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| ExperimentUrls | Dictionary containing URLs and paths for accessing experiment data. Different trackers will populate different fields. Langfuse: session_url, dataset_run_url, experiment_url, leaderboard_url. Simple: experiment_local_path, leaderboard_local_path, experiment_url, leaderboard_url. |
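Because each backend populates a different subset of ExperimentUrls, callers should read the fields defensively. A sketch, assuming tracker is any concrete tracker instance:

```python
# Defensive access: which fields are set depends on the tracker backend.
urls = tracker.get_experiment_urls(run_id="run-001", dataset_name="qa-smoke-test")
location = urls.get("experiment_url") or urls.get("experiment_local_path")
if location:
    print(f"Experiment results available at: {location}")
```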
get_run_results(run_id, **kwargs)
Get detailed results for a specific run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | Detailed run results. |
log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
Log a single evaluation result (synchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_result | EvaluationOutput | The evaluation result to log. | required |
| dataset_name | str | Name of the dataset being evaluated. | required |
| data | MetricInput | The input data that was evaluated. | required |
| run_id | str \| None | ID of the experiment run. Can be auto-generated if None. | None |
| metadata | dict[str, Any] \| None | Additional metadata to log. | None |
| **kwargs | Any | Additional configuration parameters. | {} |
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
async
Log a batch of evaluation results (asynchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_results | list[EvaluationOutput] | The evaluation results to log. | required |
| dataset_name | str | Name of the dataset being evaluated. | required |
| data | list[MetricInput] | List of input data that was evaluated. | required |
| run_id | str \| None | ID of the experiment run. Can be auto-generated if None. | None |
| metadata | dict[str, Any] \| None | Additional metadata to log. | None |
| **kwargs | Any | Additional configuration parameters. | {} |
prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)
Optionally convert the dataset to tracker-specific format if needed.
This method allows trackers to convert the dataset to a format required for evaluation. For example, LangfuseExperimentTracker may convert a standard dataset to LangfuseDataset to enable dataset item syncing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | BaseDataset | The dataset to prepare. | required |
| dataset_name | str | The name of the dataset. | required |
| **kwargs | Any | Additional arguments. | {} |
Returns:
| Type | Description |
|---|---|
| BaseDataset | The prepared dataset (or original if no conversion needed). |
prepare_for_tracking(row, **kwargs)
Optional hook for tracker-specific preprocessing before logging.
This allows trackers to prepare data before logging if needed. Most trackers won't need to override this.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row | dict[str, Any] | The row to prepare. | required |
| **kwargs | Any | Additional arguments. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The prepared row. |
refresh_score(run_id, **kwargs)
Refresh the scores in the experiment tracker.
This method allows trackers to refresh or recalculate scores. The default implementation does nothing; trackers can override it if they support score refreshing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run (session ID). | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| bool | True if the refresh operation succeeded, False otherwise. The default implementation returns True (no-op completed successfully). |
Note
This is an optional method. Trackers that don't support score refreshing can use the default no-op implementation.
wrap_inference_fn(inference_fn)
Optionally wrap the inference function for tracker-specific behavior.
This method allows trackers to wrap the inference function with additional functionality (e.g., the Langfuse observe decorator). Most trackers won't need to override this.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inference_fn | Callable | The original inference function. | required |
Returns:
| Type | Description |
|---|---|
| Callable | The wrapped inference function (or original if no wrapping needed). |
LangfuseExperimentTracker(score_key=DefaultValues.SCORE_KEY, project_name=DefaultValues.PROJECT_NAME, langfuse_client=None, expected_output_key=DefaultValues.EXPECTED_OUTPUT_KEY, mapping=None)
Bases: BaseExperimentTracker
Experiment tracker for Langfuse.
Attributes:
| Name | Type | Description |
|---|---|---|
| langfuse_client | Langfuse | The Langfuse client. |
Initialize the LangfuseExperimentTracker class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| score_key | str \| list[str] | The key(s) of the score(s) to log. Defaults to "score". | SCORE_KEY |
| project_name | str | The name of the project. | PROJECT_NAME |
| langfuse_client | Langfuse | The Langfuse client. | None |
| expected_output_key | str \| None | The key to extract the expected output from the data. Defaults to "expected_response". | EXPECTED_OUTPUT_KEY |
| mapping | dict[str, Any] \| None | Optional mapping for field keys. Defaults to None. | None |
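A construction sketch. The Langfuse client constructor is the standard one from the langfuse SDK; the tracker import path and the credential environment variable names are assumptions.

```python
import os

from langfuse import Langfuse

from gllm_evals.experiment_tracker import LangfuseExperimentTracker  # assumed path

client = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)

tracker = LangfuseExperimentTracker(
    score_key=["generation", "score"],  # nested score path
    project_name="demo-project",
    langfuse_client=client,
    expected_output_key="expected_response",
)
```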
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)
async
Log an evaluation result to Langfuse asynchronously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_result | EvaluationOutput | The evaluation result to log. | required |
| dataset_name | str | The name of the dataset. | required |
| data | MetricInput | The dataset data. | required |
| run_id | str \| None | The run ID. | None |
| metadata | dict[str, Any] \| None | Additional metadata. | None |
| dataset_item_id | str \| None | The ID of the dataset item. | None |
| reason_key | str \| None | The key to extract reasoning/explanation text from evaluation results. This text will be logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | REASON_KEY |
| **kwargs | Any | Additional configuration parameters. | {} |
export_experiment_results(run_id, **kwargs)
Export Langfuse traces in the specified export format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | List of experiment results. |
get_additional_kwargs()
Get additional keyword arguments provided by the tracker.
LangfuseExperimentTracker supplies the langfuse_client.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Additional keyword arguments. |
get_experiment_urls(run_id, dataset_name, **kwargs)
Get experiment URLs for Langfuse experiment tracker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| dataset_name | str | Name of the dataset that was evaluated. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| ExperimentUrls | Dictionary containing Langfuse URLs for accessing experiment data. |
get_run_results(run_id, **kwargs)
Get detailed results for a specific run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| **kwargs | Any | Additional configuration parameters, including 'keys' for trace keys. | {} |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | Detailed run results. |
log(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)
Log a single evaluation result to Langfuse.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_result | EvaluationOutput | The evaluation result to log. | required |
| dataset_name | str | The name of the dataset. | required |
| data | MetricInput | The dataset data. | required |
| run_id | str \| None | The run ID. | None |
| metadata | dict[str, Any] \| None | Additional metadata. | None |
| dataset_item_id | str \| None | The ID of the dataset item. | None |
| reason_key | str \| None | The key to extract reasoning/explanation text from evaluation results. This text will be logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | REASON_KEY |
| **kwargs | Any | Additional configuration parameters. | {} |
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, reason_key=DefaultValues.REASON_KEY, **kwargs)
async
Log a batch of evaluation results to Langfuse.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_results | list[EvaluationOutput] | The list of evaluation results to log. | required |
| dataset_name | str | The name of the dataset. | required |
| data | list[MetricInput] | The list of dataset data items. | required |
| run_id | str \| None | The run ID. If None, a unique ID will be generated. | None |
| metadata | dict[str, Any] \| None | Additional metadata. | None |
| reason_key | str \| None | The key to extract reasoning/explanation text from evaluation results. This text will be logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | REASON_KEY |
| **kwargs | Any | Additional configuration parameters. | {} |
prepare_dataset_for_evaluation(dataset, dataset_name, **kwargs)
Convert a standard dataset to a LangfuseDataset if it is not one already.
This ensures that LangfuseDataset.prepare_row_for_inference can properly sync dataset items in Langfuse.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | BaseDataset | The dataset to prepare. | required |
| dataset_name | str | The name of the dataset. | required |
| **kwargs | Any | Additional arguments, including mapping. | {} |
Returns:
| Type | Description |
|---|---|
| BaseDataset | LangfuseDataset if conversion needed, original dataset otherwise. |
prepare_for_tracking(row, **kwargs)
Prepare row for Langfuse tracking.
Converts standard format to Langfuse format if needed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row | dict[str, Any] | The row to prepare. | required |
| **kwargs | Any | Additional arguments. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Row in Langfuse format. |
refresh_score(run_id, **kwargs)
Refresh the session score in Langfuse.
This method refreshes or recalculates the session-level score for a given run. It retrieves all traces for the session, aggregates their scores, and updates the session-level score in Langfuse.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run (session ID). | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| bool | True if the refresh operation succeeded, False otherwise. |
wrap_inference_fn(inference_fn)
Wrap inference function with Langfuse observe decorator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inference_fn | Callable | The original inference function. | required |
Returns:
| Type | Description |
|---|---|
| Callable | The inference function wrapped with the Langfuse observe decorator. |
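Conceptually, the wrapping is equivalent to applying Langfuse's observe decorator to the function yourself; a sketch (note the observe import path differs between Langfuse SDK versions):

```python
# Sketch: what the wrapping amounts to. In Langfuse SDK v2 the import is
# `from langfuse.decorators import observe`; in v3 it is `from langfuse import observe`.
from langfuse import observe


def my_inference_fn(row: dict) -> str:
    return f"answer for: {row['query']}"


# Equivalent by hand:
traced_fn = observe()(my_inference_fn)

# Or via the tracker (tracker is a LangfuseExperimentTracker instance):
traced_fn = tracker.wrap_inference_fn(my_inference_fn)
```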
SimpleExperimentTracker(project_name, output_dir='./gllm_evals/experiments', score_key='score')
Bases: BaseExperimentTracker
Simple file-based experiment tracker for development and testing.
This class provides a simple local storage implementation for experiment tracking. It stores experiment data in CSV format with two files: experiment_results.csv and leaderboard.csv.
Attributes:
| Name | Type | Description |
|---|---|---|
| project_name | str | The name of the project. |
| output_dir | Path | Directory to store experiment results. |
| experiment_results_file | Path | CSV file for experiment results. |
| leaderboard_file | Path | CSV file for leaderboard data. |
| logger | Logger | Logger for tracking errors and warnings. |
Constants:
MAX_OUTER_DICT_PARTS (int): Maximum number of parts in an outer dict score path. Used to distinguish main evaluator scores from nested sub-metrics.
Initialize simple tracker with project name and output directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| project_name | str | The name of the project. | required |
| output_dir | str | Directory to store experiment results. | './gllm_evals/experiments' |
| score_key | str \| list[str] | The key to extract scores from evaluation results. If str: direct key access (e.g., "score") or dot notation (e.g., "metrics.accuracy"). If list[str]: nested key path (e.g., ["generation", "score"]). Defaults to "score". | 'score' |
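To illustrate the three score_key forms, here is a hypothetical resolver that mirrors the documented lookup rules (not the tracker's actual implementation):

```python
from typing import Any


def resolve_score(result: dict[str, Any], score_key: str | list[str]) -> Any:
    """Hypothetical sketch of the documented score_key lookup rules."""
    # A dotted string such as "metrics.accuracy" becomes a key path;
    # a list is already a key path; a plain string is a one-element path.
    path = score_key.split(".") if isinstance(score_key, str) else score_key
    value: Any = result
    for key in path:
        value = value[key]
    return value


result = {"score": 0.9, "metrics": {"accuracy": 0.8}, "generation": {"score": 0.7}}
assert resolve_score(result, "score") == 0.9                   # direct key
assert resolve_score(result, "metrics.accuracy") == 0.8        # dot notation
assert resolve_score(result, ["generation", "score"]) == 0.7   # nested key path
```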
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
async
Log a single evaluation result (asynchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_result | EvaluationOutput | The evaluation result to log. | required |
| dataset_name | str | Name of the dataset being evaluated. | required |
| data | MetricInput | The input data that was evaluated. | required |
| run_id | str \| None | ID of the experiment run. Can be auto-generated if None. | None |
| metadata | dict[str, Any] \| None | Additional metadata to log. | None |
| **kwargs | Any | Additional configuration parameters (currently unused but accepted for interface consistency). | {} |
get_experiment_history()
Get all experiment runs from leaderboard.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | List of experiment runs. |
get_experiment_urls(run_id, dataset_name, **kwargs)
Get local experiment file paths for the Simple experiment tracker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
| dataset_name | str | Name of the dataset that was evaluated. | required |
| **kwargs | Any | Additional configuration parameters. | {} |
Returns:
| Type | Description |
|---|---|
| ExperimentUrls | Dictionary containing local file paths for accessing experiment data. |
get_run_results(run_id)
Get detailed results for a specific run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run_id | str | ID of the experiment run. | required |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | Detailed run results. |
log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
Log a single evaluation result.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_result | EvaluationOutput | The evaluation result to log. | required |
| dataset_name | str | Name of the dataset being evaluated. | required |
| data | MetricInput | The input data that was evaluated. | required |
| run_id | str \| None | ID of the experiment run. Can be auto-generated if None. | None |
| metadata | dict[str, Any] \| None | Additional metadata to log. | None |
| **kwargs | Any | Additional configuration parameters (currently unused but accepted for interface consistency). | {} |
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs)
async
Log a batch of evaluation results (asynchronous).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_results | list[EvaluationOutput] | The evaluation results to log. | required |
| dataset_name | str | Name of the dataset being evaluated. | required |
| data | list[MetricInput] | List of input data that was evaluated. | required |
| run_id | str \| None | ID of the experiment run. Can be auto-generated if None. | None |
| metadata | dict[str, Any] \| None | Additional metadata to log. | None |
| **kwargs | Any | Additional configuration parameters. | {} |
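Because log_batch is a coroutine, it must be awaited. A usage sketch with illustrative dict stand-ins for the result and input types (import path assumed):

```python
import asyncio

from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path


async def main() -> None:
    tracker = SimpleExperimentTracker(project_name="demo-project")
    results = [{"score": 0.9}, {"score": 0.4}]  # stand-ins for EvaluationOutput
    data = [  # stand-ins for MetricInput
        {"query": "Q1", "expected_response": "A1"},
        {"query": "Q2", "expected_response": "A2"},
    ]
    await tracker.log_batch(results, "qa-smoke-test", data, run_id="run-002")


asyncio.run(main())
```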
get_experiment_tracker(experiment_tracker, project_name, **kwargs)
Get an experiment tracker instance (create or return existing).
Supported experiment tracker types:
- SimpleExperimentTracker (default; local CSV)
- LangfuseExperimentTracker (Langfuse); required parameters: langfuse_client
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| experiment_tracker | BaseExperimentTracker \| type[BaseExperimentTracker] | The experiment tracker instance or class to resolve. | required |
| project_name | str | The name of the project. | required |
| **kwargs | Any | Additional arguments to pass to the constructor. | {} |
Returns:
| Type | Description |
|---|---|
| BaseExperimentTracker | The experiment tracker. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the experiment tracker is not supported. |
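A factory usage sketch; per the docstring, an instance is returned as-is while a class is constructed with project_name and the extra kwargs (import path assumed):

```python
from gllm_evals.experiment_tracker import (  # assumed path
    SimpleExperimentTracker,
    get_experiment_tracker,
)

# Pass a tracker class: the factory constructs it with project_name + kwargs.
tracker = get_experiment_tracker(
    SimpleExperimentTracker,
    project_name="demo-project",
    output_dir="./gllm_evals/experiments",
)

# Pass an existing instance: per the docstring, it is returned as-is.
same_tracker = get_experiment_tracker(tracker, project_name="demo-project")
```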