Downloader

Document Processing Orchestrator Downloader Package.

Modules:

Name	Description
`BaseDownloader`	Abstract base class for document downloaders.
`DirectFileURLDownloader`	Downloader for direct file URLs.
`GoogleDriveDownloader`	Downloader for Google Drive files.

`BaseDownloader`

Bases: ABC

Abstract base class for document downloaders.

`download(source, output, **kwargs)` `abstractmethod`

Download source to the output directory.

Parameters:

Name	Type	Description	Default
`source`	`str`	The source to be downloaded.	required
`output`	`str`	The output directory where the downloaded source will be saved.	required
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Returns:

Type	Description
`list[str] \| None`	A list of file paths of successfully downloaded files. Return an empty list if no files are downloaded. Returning None is supported only for backward compatibility and should be avoided in new implementations.
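The contract above can be illustrated with a small subclass. This is a minimal sketch, not the package's code: `BaseDownloader` is re-declared locally as a stand-in because the import path is not given in this reference, and `LocalCopyDownloader` is a hypothetical implementation that "downloads" by copying a local file.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Optional


class BaseDownloader(ABC):
    """Local stand-in mirroring the documented interface (not the package's class)."""

    @abstractmethod
    def download(self, source: str, output: str, **kwargs: Any) -> Optional[list[str]]:
        """Download source to the output directory."""


class LocalCopyDownloader(BaseDownloader):
    """Hypothetical downloader that copies a local file into the output directory."""

    def download(self, source: str, output: str, **kwargs: Any) -> list[str]:
        src = Path(source)
        if not src.is_file():
            # Nothing was downloaded: return an empty list rather than None.
            return []
        out_dir = Path(output)
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / src.name
        shutil.copy2(src, target)
        return [str(target)]
```

Returning `[]` on the no-file path keeps callers free of `None` checks, which is the behavior the table recommends for new implementations.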

`DirectFileURLDownloader(stream_buffer_size=65536, retry_config=None, max_retries=DEFAULT_MAX_RETRIES, timeout=None)`

Bases: BaseDownloader

A class for downloading files from a direct file URL to the defined output directory.

Initialize the DirectFileURLDownloader.

Parameters:

Name	Type	Description	Default
`stream_buffer_size`	`int`	The size of the buffer for streaming downloads, in bytes. Defaults to 64 KiB (65536 bytes).	`65536`
`retry_config`	`RetryConfig \| dict[str, Any] \| None`	Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built using max_retries and timeout. Defaults to None. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS.	`None`
`max_retries`	`int`	The maximum number of retries for failed downloads. Defaults to 3. Deprecated: Use retry_config instead.	`DEFAULT_MAX_RETRIES`
`timeout`	`int \| None`	The timeout for the download request in seconds. Defaults to None. Deprecated: Use retry_config instead.	`None`
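The precedence between `retry_config` and the deprecated `max_retries` and `timeout` arguments might look roughly like the sketch below. `RetryConfigSketch` and `resolve_retry_config` are hypothetical names introduced for illustration; the real `RetryConfig` fields and validation rules are not listed in this reference, so only the two documented values are modeled.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union

DEFAULT_MAX_RETRIES = 3  # documented default for max_retries


@dataclass(frozen=True)
class RetryConfigSketch:
    """Hypothetical stand-in for RetryConfig; the real fields may differ."""

    max_retries: int = DEFAULT_MAX_RETRIES
    timeout: Optional[float] = None


def resolve_retry_config(
    retry_config: Union[RetryConfigSketch, dict[str, Any], None] = None,
    max_retries: int = DEFAULT_MAX_RETRIES,
    timeout: Optional[float] = None,
) -> RetryConfigSketch:
    """Pick the effective retry configuration.

    - None: build a default from the deprecated max_retries/timeout arguments.
    - dict: convert it into the config class (unknown keys raise TypeError).
    - config instance: use it as-is.
    """
    if retry_config is None:
        return RetryConfigSketch(max_retries=max_retries, timeout=timeout)
    if isinstance(retry_config, dict):
        return RetryConfigSketch(**retry_config)
    return retry_config
```

Under this reading, the deprecated arguments only matter when `retry_config` is None, which is why new code is steered toward passing `retry_config` directly.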

`download(source, output, **kwargs)`

Download source to the output directory.

Parameters:

Name	Type	Description	Default
`source`	`str`	The source to be downloaded.	required
`output`	`str`	The output directory where the downloaded source will be saved.	required
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Kwargs

ca_certs_path (str, optional): The path to the CA certificates file. Defaults to None.
extension (str, optional): The extension of the file to be downloaded. If not provided, the extension is detected from the response headers or the content MIME type.
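The fallback from an explicit `extension` to one derived from the response can be sketched with the standard library's `mimetypes` module. The helper name `resolve_extension` is hypothetical, and this is only one plausible approach; the downloader's actual detection logic is not shown in this reference.

```python
import mimetypes
from typing import Optional


def resolve_extension(extension: Optional[str], content_type: Optional[str]) -> str:
    """Return a dotted file extension, preferring the caller's explicit choice."""
    if extension:
        # Normalize "pdf" and ".pdf" to the same dotted form.
        return extension if extension.startswith(".") else f".{extension}"
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        guessed = mimetypes.guess_extension(mime)
        if guessed:
            return guessed
    # No explicit extension and no recognizable MIME type.
    return ""
```

A caller would pass the `Content-Type` response header as `content_type`; content sniffing for servers that send a generic type is left out of this sketch.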

Returns:

Type	Description
`list[str]`	A list of file paths of successfully downloaded files.
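To show what a buffered streaming download with a configurable `stream_buffer_size` involves, here is a minimal standard-library sketch. It is an assumption-laden illustration, not the class's implementation: the function name `stream_to_directory` is hypothetical, it uses `urllib` instead of whatever HTTP client the package relies on, and it omits retries, CA certificate handling, and extension detection.

```python
import urllib.request
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlparse


def stream_to_directory(
    url: str,
    output: str,
    stream_buffer_size: int = 65536,
    timeout: Optional[float] = None,
) -> list[str]:
    """Stream url into the output directory, one buffer-sized chunk at a time."""
    out_dir = Path(output)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Derive a file name from the URL path; fall back to a fixed name.
    name = Path(unquote(urlparse(url).path)).name or "download"
    target = out_dir / name

    with urllib.request.urlopen(url, timeout=timeout) as response, open(target, "wb") as fh:
        while True:
            chunk = response.read(stream_buffer_size)
            if not chunk:
                break  # end of stream
            fh.write(chunk)

    return [str(target)]
```

Reading in fixed-size chunks keeps memory use bounded by the buffer size regardless of how large the remote file is.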

`GoogleDriveDownloader(api_key, identifier, secret, api_base_url='https://api.bosa.id')`

Bases: BaseDownloader

A class for downloading files from Google Drive using GL Connectors for Google Drive integration.

Initialize the GoogleDriveDownloader.

Parameters:

Name	Type	Description	Default
`api_key`	`str`	The API key for the GL Connectors API.	required
`identifier`	`str`	The identifier for the GL Connectors user.	required
`secret`	`str`	The secret for the GL Connectors user.	required
`api_base_url`	`str`	The base URL for the GL Connectors API. Defaults to "https://api.bosa.id".	`'https://api.bosa.id'`

`download(source, output, **kwargs)`

Download a file from Google Drive to the output directory.

Parameters:

Name	Type	Description	Default
`source`	`str`	The Google Drive file ID or URL.	required
`output`	`str`	The output directory where the downloaded file will be saved.	required
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Kwargs

export_format (str, optional): The export format for the file.

Returns:

Type	Description
`list[str]`	A list containing the path(s) to the successfully downloaded file(s).

Raises:

Type	Description
`ValueError`	If file ID cannot be extracted or no files are returned from Google Drive.
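Because `source` may be either a file ID or a URL, some ID extraction has to happen before the download. The sketch below shows one way this could work; `extract_drive_file_id` is a hypothetical helper, and the patterns cover only URL shapes commonly seen in Google Drive share links. The downloader's actual extraction rules are not documented here.

```python
import re

# URL shapes assumed for illustration:
#   https://drive.google.com/file/d/<ID>/view
#   https://docs.google.com/document/d/<ID>/edit
#   https://drive.google.com/open?id=<ID>
_URL_PATTERNS = (
    re.compile(r"/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)
_BARE_ID = re.compile(r"[A-Za-z0-9_-]+")


def extract_drive_file_id(source: str) -> str:
    """Return the file ID from a Drive URL or a bare ID; raise ValueError otherwise."""
    source = source.strip()
    if "://" in source:
        for pattern in _URL_PATTERNS:
            match = pattern.search(source)
            if match:
                return match.group(1)
        raise ValueError(f"Could not extract a file ID from URL: {source!r}")
    if _BARE_ID.fullmatch(source):
        return source  # already looks like a bare file ID
    raise ValueError(f"Not a recognizable file ID or URL: {source!r}")
```

Raising `ValueError` for unrecognized input mirrors the documented failure mode, so callers can handle a bad `source` the same way in both cases.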
