Base data loader

Base classes and utilities for data sources.

This module provides the foundational components for data loading from various sources. It contains abstract base classes, common exceptions, and utility functions that are shared across different data loader implementations.

Authors
  • Alfan Dinda Rahmawan (alfan.d.rahmawan@gdplabs.id)
Reviewer
  • Muhammad Afif Al Hawari (muhammad.a.a.hawari@gdplabs.id)

BaseDataLoader

Bases: Component, ABC

Base class for dataset loaders.

This class defines the common interface that all data sources must implement. Each concrete data source should implement loading for its specific source (CSV files, Google Sheets, etc.).

The interface is designed to be consistent across different data source types, making it easy to switch between sources, add new ones, and maintain the factory pattern architecture.

The class uses a lazy initialization pattern: configuration is provided at load time rather than at initialization time, which keeps instantiation lightweight and lets the configuration change between loads.

All data sources must implement:

1. Lightweight initialization without configuration
2. Load method that accepts configuration and returns data
3. Cache management (clear cache)
4. Proper error handling for configuration and connection issues
5. Optional caching mechanisms for performance optimization

Subclasses can optionally use internal methods like _reset_configuration(), _initialize_configuration(), and _load_optional_dataset() based on their specific implementation patterns.
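The lazy initialization pattern described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `LoaderSketch`, `InMemoryLoader`, and `SourceConfig` are hypothetical stand-ins for `BaseDataLoader`, a concrete loader, and `ExperimentConfig`, and plain lists of dicts stand in for pandas DataFrames so the sketch has no third-party dependency.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SourceConfig:
    """Hypothetical stand-in for the real ExperimentConfig."""
    name: str
    rows: list = field(default_factory=list)


class LoaderSketch(ABC):
    """Stand-in mirroring the documented interface (not the real BaseDataLoader)."""

    def __init__(self):
        # Lightweight: no configuration, no connection, no I/O.
        self._config = None

    @abstractmethod
    def load(self, config):
        """Accept configuration at load time and return the data."""

    @abstractmethod
    def clear_cache(self, key=None):
        """Drop one cached entry, or everything when key is None."""


class InMemoryLoader(LoaderSketch):
    """Toy source that 'loads' rows straight from its configuration."""

    def __init__(self):
        super().__init__()
        self._cache = {}

    def load(self, config):
        # Configuration arrives here, not in __init__.
        self._config = config
        if config.name not in self._cache:
            self._cache[config.name] = list(config.rows)
        return self._cache[config.name]

    def clear_cache(self, key=None):
        if key is None:
            self._cache.clear()
        else:
            self._cache.pop(key, None)


loader = InMemoryLoader()  # cheap: nothing is configured yet
first = loader.load(SourceConfig("a", [{"x": 1}]))
second = loader.load(SourceConfig("b", [{"x": 2}]))  # same instance, new config
```

Because the instance holds no configuration until `load` is called, one loader object can serve several configurations in turn, which is the flexibility the description above refers to.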

clear_cache(key=None) abstractmethod

Clear the data cache for a specific key or all cached data.

This method must be implemented by subclasses to provide cache clearing functionality appropriate to their caching strategy.

Parameters:

  • key (str | None, default: None): Specific cache key to clear. If None, clears all cached data.
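One way a subclass might satisfy the `clear_cache(key=None)` contract is sketched below, assuming a plain dictionary as the cache. `DictCacheSketch` and its `put` and `cached_keys` helpers are hypothetical, and silently ignoring an unknown key is an assumption; the documentation does not say how a missing key should be handled.

```python
class DictCacheSketch:
    """Hypothetical dictionary-backed cache showing clear_cache semantics."""

    def __init__(self):
        self._cache = {}

    def put(self, key, value):
        self._cache[key] = value

    def cached_keys(self):
        return sorted(self._cache)

    def clear_cache(self, key=None):
        if key is None:
            # No key given: drop every cached entry.
            self._cache.clear()
        else:
            # Assumption: an unknown key is ignored rather than raising.
            self._cache.pop(key, None)


cache = DictCacheSketch()
cache.put("train", [1, 2, 3])
cache.put("validation", [4, 5])
cache.clear_cache("train")       # removes only the "train" entry
remaining = cache.cached_keys()
cache.clear_cache()              # removes everything that is left
```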

get_data_as_dataframe(**kwargs) abstractmethod

Get data as a pandas DataFrame.

Parameters:

  • **kwargs (str, default: {}): Arbitrary keyword arguments.

Returns:

  • DataFrame: The data as a pandas DataFrame.

load(experiment_args, cache_timeout=GeneralConstants.CACHE_TIMEOUT) abstractmethod

Load the dataset and return it as a tuple of three pandas DataFrames.

Parameters:

  • experiment_args (ExperimentConfig, required): The experiment arguments containing data source configuration.
  • cache_timeout (Optional[int], default: CACHE_TIMEOUT): The time in seconds for which data should be cached. Defaults to CACHE_TIMEOUT (5 minutes). Set to None or 0 to disable caching.

Returns:

  • tuple[DataFrame, DataFrame, DataFrame]: Training, validation, and prompt data.
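The `cache_timeout` behaviour can be illustrated with a small time-based cache. This is a sketch under stated assumptions: `TimedCacheSketch`, `get_or_load`, and the injectable `clock` are hypothetical names, the 300-second constant mirrors the documented 5-minute default, and bypassing the cache entirely when the timeout is None or 0 is one plausible reading of "disable caching". Strings stand in for the three DataFrames.

```python
import time

CACHE_TIMEOUT = 300  # seconds; mirrors the documented default of 5 minutes


class TimedCacheSketch:
    """Hypothetical cache whose entries expire after cache_timeout seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get_or_load(self, key, loader, cache_timeout=CACHE_TIMEOUT):
        if not cache_timeout:
            # None or 0: caching is disabled, so always call the loader.
            return loader()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < cache_timeout:
            return entry[1]  # still fresh
        value = loader()
        self._entries[key] = (now, value)
        return value


calls = []


def fake_loader():
    # Placeholder for an expensive read returning three datasets.
    calls.append(1)
    return ("train", "validation", "prompt")


now = [0.0]  # controllable clock so the example does not need to sleep
timed = TimedCacheSketch(clock=lambda: now[0])

train, validation, prompt = timed.get_or_load("exp", fake_loader)
timed.get_or_load("exp", fake_loader)   # served from cache
calls_while_fresh = len(calls)

now[0] = 301.0                          # past the 300-second timeout
timed.get_or_load("exp", fake_loader)   # reloaded
calls_after_expiry = len(calls)

timed.get_or_load("exp", fake_loader, cache_timeout=0)  # caching disabled
calls_with_cache_off = len(calls)
```

Unpacking the result into three names, as on the first call, matches the documented return shape of training, validation, and prompt data.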

DataLoaderError

Bases: Exception

Base exception for all data source errors.

This exception serves as the base class for all data source-related errors. It provides a common interface for handling various types of failures that can occur during data source operations such as connection issues, authentication failures, or data retrieval problems.
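A common base exception lets callers handle every data source failure with a single `except` clause. The sketch below assumes two subclasses, `SourceConnectionError` and `SourceAuthError`, which are hypothetical and not taken from the library; `DataLoaderError` is redefined locally only so the example runs on its own.

```python
class DataLoaderError(Exception):
    """Local stand-in for the documented base exception."""


class SourceConnectionError(DataLoaderError):
    """Hypothetical subclass for connection issues."""


class SourceAuthError(DataLoaderError):
    """Hypothetical subclass for authentication failures."""


def fetch(source):
    # Toy data source that fails in two different ways.
    if source == "offline":
        raise SourceConnectionError("could not reach source")
    if source == "locked":
        raise SourceAuthError("credentials rejected")
    return ["row"]


def safe_fetch(source):
    try:
        return fetch(source)
    except DataLoaderError as exc:
        # One handler covers every subclass of the base exception.
        return f"{type(exc).__name__}: {exc}"


ok = safe_fetch("ok")
offline = safe_fetch("offline")
locked = safe_fetch("locked")
```

Callers that need finer control can still catch a specific subclass first and fall back to the base class afterwards.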