Base data loader
Base classes and utilities for data sources.
This module provides the foundational components for data loading from various sources. It contains abstract base classes, common exceptions, and utility functions that are shared across different data loader implementations.
Reviewer
- Muhammad Afif Al Hawari (muhammad.a.a.hawari@gdplabs.id)
BaseDataLoader
Bases: Component, ABC
Base class for dataset loaders.
This class defines the common interface that all data sources must implement. Each concrete data source should provide an implementation for loading data from its specific source (CSV files, Google Sheets, etc.).
The interface is designed to be consistent across different data source types, making it easy to switch between sources, add new ones, and maintain the factory pattern architecture.
The class uses a lazy initialization pattern: configuration is provided at load time rather than at initialization time, which keeps instantiation lightweight and allows the configuration to change between loads.
All data sources must implement:
1. Lightweight initialization without configuration
2. Load method that accepts configuration and returns data
3. Cache management (clear cache)
4. Proper error handling for configuration and connection issues
5. Optional caching mechanisms for performance optimization
Subclasses can optionally use internal methods like _reset_configuration(), _initialize_configuration(), and _load_optional_dataset() based on their specific implementation patterns.
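The interface and the lazy initialization pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `SketchBaseLoader`, `SketchConfig`, and `InMemoryLoader` are hypothetical stand-ins (the real `Component` base and the fields of `ExperimentConfig` are not documented here), and the 300-second default only mirrors the documented 5-minute `CACHE_TIMEOUT`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import pandas as pd


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ExperimentConfig; the real fields are not shown in this page."""

    rows: list[dict]


class SketchBaseLoader(ABC):
    """Stand-in that mirrors the three documented abstract methods (not the real BaseDataLoader)."""

    @abstractmethod
    def load(self, experiment_args, cache_timeout=300): ...

    @abstractmethod
    def get_data_as_dataframe(self, **kwargs): ...

    @abstractmethod
    def clear_cache(self, key=None): ...


class InMemoryLoader(SketchBaseLoader):
    def __init__(self):
        # Lightweight initialization: no configuration is accepted here.
        self._config = None

    def load(self, experiment_args, cache_timeout=300):
        # Configuration arrives at load time, so one instance can serve many configs.
        self._config = experiment_args
        data = self.get_data_as_dataframe()
        # Assumption for this sketch only: all rows go to training,
        # validation and prompt data are left empty.
        empty = data.iloc[0:0]
        return data, empty, empty

    def get_data_as_dataframe(self, **kwargs):
        if self._config is None:
            raise RuntimeError("load() must be called before data is available")
        return pd.DataFrame(self._config.rows)

    def clear_cache(self, key=None):
        # This sketch caches nothing, so there is nothing to clear.
        return None


loader = InMemoryLoader()
train, validation, prompt = loader.load(SketchConfig(rows=[{"q": "a"}, {"q": "b"}]))
```

Because the constructor takes no arguments, a factory can create loaders cheaply and defer every source-specific decision to `load()`.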
clear_cache(key=None)
abstractmethod
Clear the data cache for a specific key or all cached data.
This method must be implemented by subclasses to provide cache clearing functionality appropriate to their caching strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `key` | `str \| None` | Specific cache key to clear. If None, clears all cached data. Defaults to None. | `None` |
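The `key` semantics above (one entry when a key is given, everything when it is `None`) might look like this in a dict-backed cache. `KeyedCache` and its `put` and `keys` helpers are hypothetical names for illustration, and treating an unknown key as a no-op is an assumption, since the documented behaviour for that case is not stated.

```python
class KeyedCache:
    """Hypothetical dict-backed cache illustrating the documented clear_cache(key=None) contract."""

    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def keys(self):
        return sorted(self._entries)

    def clear_cache(self, key=None):
        if key is None:
            # No key given: drop every cached entry.
            self._entries.clear()
        else:
            # Assumption: clearing a key that is not cached is silently ignored.
            self._entries.pop(key, None)


cache = KeyedCache()
cache.put("train", [1, 2])
cache.put("validation", [3])
cache.clear_cache("train")   # removes only the "train" entry
cache.clear_cache()          # removes whatever is left
```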
get_data_as_dataframe(**kwargs)
abstractmethod
Get data as a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `str` | Arbitrary keyword arguments. | `{}` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | pd.DataFrame: The data as a pandas DataFrame. |
load(experiment_args, cache_timeout=GeneralConstants.CACHE_TIMEOUT)
abstractmethod
Load the dataset and return the training, validation, and prompt data as pandas DataFrames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `experiment_args` | `ExperimentConfig` | The experiment arguments containing data source configuration. | required |
| `cache_timeout` | `Optional[int]` | The time in seconds for which data should be cached. Defaults to CACHE_TIMEOUT (5 minutes). Set to None or 0 to disable caching. | `CACHE_TIMEOUT` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame, DataFrame]` | tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: Training, validation, and prompt data. |
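One way a subclass could honour `cache_timeout`, including the rule that `None` or `0` disables caching, is sketched below. `TimedCache`, `get_or_load`, and the injectable `clock` parameter are hypothetical; the value `300` is taken from the documented 5-minute default, and how the real loaders store or expire entries is not specified here.

```python
import time

CACHE_TIMEOUT = 300  # seconds; mirrors the documented 5-minute default


class TimedCache:
    """Hypothetical time-based cache; not the library's own caching mechanism."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (stored_at, value)

    def get_or_load(self, key, loader, cache_timeout=CACHE_TIMEOUT):
        if not cache_timeout:
            # None or 0 disables caching: always call the loader, store nothing.
            return loader()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < cache_timeout:
            return hit[1]        # entry is still fresh
        value = loader()
        self._entries[key] = (now, value)
        return value
```

A concrete `load()` could wrap its expensive fetch in `get_or_load`, keyed by whatever identifies the data source in its configuration.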
DataLoaderError
Bases: Exception
Base exception for all data source errors.
This exception serves as the base class for all data source-related errors. It provides a common interface for handling various types of failures that can occur during data source operations such as connection issues, authentication failures, or data retrieval problems.
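A shared base exception lets callers handle every data source failure in one place. The sketch below redefines `DataLoaderError` locally so it runs on its own; `SourceConnectionError`, `SourceAuthenticationError`, and `describe_failure` are hypothetical names, and the subclasses the library actually ships may differ.

```python
class DataLoaderError(Exception):
    """Local stand-in with the same name and base class as the documented exception."""


class SourceConnectionError(DataLoaderError):
    """Hypothetical subclass for connection issues."""


class SourceAuthenticationError(DataLoaderError):
    """Hypothetical subclass for authentication failures."""


def describe_failure(action):
    """Run action() and summarise any data source error through the common base class."""
    try:
        action()
    except DataLoaderError as exc:
        # One handler covers every subclass of the base exception.
        return f"{type(exc).__name__}: {exc}"
    return "ok"
```

Errors that do not derive from the base class still propagate, so unrelated bugs are not silently folded into data source handling.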