Downloader
Document Processing Orchestrator Downloader Package.
Modules:
| Name | Description |
|---|---|
| BaseDownloader | Abstract base class for document downloaders. |
| RetryableDownloader | Abstract base class for retryable downloaders. |
| DirectFileURLDownloader | Downloader for direct file URLs. |
| GoogleDriveDownloader | Downloader for Google Drive files. |
| SmartCrawlDownloader | Downloader for the Smart Crawl data API. |
| SmartSearchDownloader | Downloader for the Smart Search data API. |
BaseDownloader
Bases: ABC
Abstract base class for document downloaders.
download(source, output, **kwargs)
abstractmethod
Download source to the output directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | The source to be downloaded. | required |
| output | str | The output directory where the downloaded source will be saved. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| list[str] \| None | A list of file paths of successfully downloaded files. If no files are downloaded, an empty list should be returned. Returning None is only for backward compatibility and should be avoided in new implementations. |
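
The return contract above can be illustrated with a minimal subclass. This is a sketch only: the `BaseDownloader` defined here is a stand-in that mirrors the documented signature (the package's import path is not shown in this reference), and `LocalCopyDownloader` is a hypothetical class invented for illustration.

```python
from __future__ import annotations

import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class BaseDownloader(ABC):
    """Stand-in mirroring the documented interface (not the package's class)."""

    @abstractmethod
    def download(self, source: str, output: str, **kwargs: Any) -> list[str] | None:
        """Download source to the output directory."""


class LocalCopyDownloader(BaseDownloader):
    """Hypothetical downloader that copies a local file into the output directory."""

    def download(self, source: str, output: str, **kwargs: Any) -> list[str]:
        src = Path(source)
        if not src.is_file():
            # Nothing was downloaded: return an empty list rather than None.
            return []
        out_dir = Path(output)
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / src.name
        shutil.copyfile(src, target)
        return [str(target)]
```

Returning `[]` for the "nothing downloaded" case keeps callers free of `None` checks, which matches the recommendation for new implementations.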
DirectFileURLDownloader(stream_buffer_size=65536, retry_config=None)
Bases: RetryableDownloader
A class for downloading files from a direct file URL to the defined output directory.
Initialize the DirectFileURLDownloader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| stream_buffer_size | int | The size of the buffer for streaming downloads in bytes. Defaults to 64KB (65536 bytes). | 65536 |
| retry_config | RetryConfig \| dict[str, Any] \| None | Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built by the base class with a default timeout of 30 seconds. Defaults to None. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS. | None |
download(source, output, **kwargs)
Download source to the output directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | The source to be downloaded. | required |
| output | str | The output directory where the downloaded source will be saved. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Kwargs:
- ca_certs_path (str, optional): The path to the CA certificates file. Defaults to None.
- extension (str, optional): The extension of the file to be downloaded. If not provided, the extension is detected from the response headers or the content MIME type.

Returns:
| Type | Description |
|---|---|
| list[str] | A list of file paths of successfully downloaded files. |
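
To make the roles of `stream_buffer_size`, `extension`, and `ca_certs_path` concrete, here is a standard-library sketch of a buffered streaming download. It is not the package's implementation: `stream_to_file` and its `stem` parameter are hypothetical, and the extension fallback shown here (guessing from the Content-Type header) is only one plausible reading of the behavior described above.

```python
from __future__ import annotations

import mimetypes
import shutil
import ssl
import urllib.request
from pathlib import Path


def stream_to_file(
    url: str,
    output: str,
    stem: str = "download",
    buffer_size: int = 65536,
    extension: str | None = None,
    ca_certs_path: str | None = None,
) -> list[str]:
    """Stream url into output/<stem><extension> and return the saved path."""
    out_dir = Path(output)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Only build a custom TLS context when a CA bundle is supplied.
    context = ssl.create_default_context(cafile=ca_certs_path) if ca_certs_path else None
    with urllib.request.urlopen(url, context=context) as response:
        if extension is None:
            # Assumed fallback: derive the extension from the Content-Type header.
            content_type = response.headers.get_content_type()
            extension = mimetypes.guess_extension(content_type) or ""
        target = out_dir / f"{stem}{extension}"
        with open(target, "wb") as fh:
            # copyfileobj copies in chunks of buffer_size bytes, so the whole
            # body is never held in memory at once.
            shutil.copyfileobj(response, fh, buffer_size)
    return [str(target)]
```

A larger buffer reduces the number of read and write calls at the cost of more memory per download; 64 KB is a common middle ground.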
GoogleDriveDownloader(api_key, identifier, secret, api_base_url='https://api.bosa.id')
Bases: BaseDownloader
A class for downloading files from Google Drive using GL Connectors for Google Drive integration.
Initialize the GoogleDriveDownloader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | The API key for the GL Connectors API. | required |
| identifier | str | The identifier for the GL Connectors user. | required |
| secret | str | The secret for the GL Connectors user. | required |
| api_base_url | str | The base URL for the GL Connectors API. Defaults to "https://api.bosa.id". | 'https://api.bosa.id' |
download(source, output, **kwargs)
Download a file from Google Drive to the output directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | The Google Drive file ID or URL. | required |
| output | str | The output directory where the downloaded file will be saved. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Kwargs:
- export_format (str, optional): The export format for the file.

Returns:
| Type | Description |
|---|---|
| list[str] | A list containing the path(s) to the successfully downloaded file(s). |
Raises:
| Type | Description |
|---|---|
| ValueError | If file ID cannot be extracted or no files are returned from Google Drive. |
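
Because `source` may be either a file ID or a URL, some ID extraction step is implied. The sketch below shows one way such a step could work; `extract_file_id` is a hypothetical helper, and the URL patterns it recognizes are assumptions based on common Google Drive link shapes, not the package's actual rules.

```python
import re

# Assumed link shapes: .../d/<id>/... and ...?id=<id>
_DRIVE_ID_PATTERNS = (
    re.compile(r"/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)
# A bare ID is assumed to contain only URL-safe characters.
_BARE_ID = re.compile(r"[A-Za-z0-9_-]+")


def extract_file_id(source: str) -> str:
    """Return the file ID from a bare ID or a Drive-style URL, else raise ValueError."""
    if _BARE_ID.fullmatch(source):
        return source
    for pattern in _DRIVE_ID_PATTERNS:
        match = pattern.search(source)
        if match:
            return match.group(1)
    raise ValueError(f"Could not extract a Google Drive file ID from: {source!r}")
```

Raising `ValueError` on an unrecognized input mirrors the documented behavior when the file ID cannot be extracted.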
RetryableDownloader(retry_config=None)
Bases: BaseDownloader
A base downloader with built-in tenacity retry logic for transient failures.
Initialize the RetryableDownloader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| retry_config | RetryConfig \| dict[str, Any] \| None | Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS. Defaults to None. | None |
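
The retry behavior described above (retry only on a fixed set of network exceptions) can be sketched without any third-party dependency. This is an illustration of the general pattern, not the package's tenacity-based logic: `call_with_retry`, `max_attempts`, and `base_delay` are hypothetical names, and `NETWORK_EXCEPTIONS` is a stand-in because the contents of NETWORK_RETRY_EXCEPTIONS are not listed in this reference.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for NETWORK_RETRY_EXCEPTIONS; the real tuple may differ.
NETWORK_EXCEPTIONS = (ConnectionError, TimeoutError)


def call_with_retry(func: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call func, retrying with exponential backoff on network errors only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except NETWORK_EXCEPTIONS:
            if attempt == max_attempts:
                # Out of attempts: surface the last network error to the caller.
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Any exception outside the fixed tuple propagates immediately, which is the practical consequence of retry_on_exceptions not being customizable.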
SmartCrawlDownloader(endpoint_url, retry_config=None)
Bases: RetryableDownloader
A downloader for retrieving crawled records from Smart Crawl.
Initialize the SmartCrawlDownloader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| endpoint_url | str | The URL of the Smart Crawl API endpoint. | required |
| retry_config | RetryConfig \| dict[str, Any] \| None | Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS. Defaults to None. | None |
download(source, output, **kwargs)
Download the data from the Smart Crawl API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | The Smart Crawl domains to download, as a comma-separated string. | required |
| output | str | The output directory where the downloaded data will be saved. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Kwargs:
- start_date (str): Start datetime in ISO 8601 format with timezone.
- end_date (str): End datetime in ISO 8601 format with timezone.
- queries (str, optional): Comma-separated list of search queries.
- schema (str, optional): Comma-separated list of fields to include in the response.
- page (int, optional): The page number to download.
- page_size (int, optional): The number of items to download per page.
- after_timestamp (str, optional): The ISO 8601 timestamp to use as the cursor for the next page.

Returns:
| Type | Description |
|---|---|
| list[str] | A list of file paths of successfully downloaded files. |
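
Since `source`, `queries`, and the date arguments are all passed as strings, it can help to build them from typed values first. The helper below is a sketch under that assumption; `build_crawl_request` is a hypothetical name and is not part of the package.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any, Iterable


def build_crawl_request(
    domains: Iterable[str],
    start: datetime,
    end: datetime,
    queries: Iterable[str] | None = None,
) -> tuple[str, dict[str, Any]]:
    """Return (source, kwargs) formatted the way the download() docs describe."""
    if start.tzinfo is None or end.tzinfo is None:
        # The docs call for ISO 8601 *with timezone*, so reject naive datetimes.
        raise ValueError("start and end must be timezone-aware")
    source = ",".join(domain.strip() for domain in domains)
    kwargs: dict[str, Any] = {
        "start_date": start.isoformat(),
        "end_date": end.isoformat(),
    }
    if queries:
        kwargs["queries"] = ",".join(queries)
    return source, kwargs
```

The resulting pair could then be passed along as `downloader.download(source, output, **kwargs)`, assuming a configured SmartCrawlDownloader instance named `downloader`.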
SmartSearchDownloader(base_url, token, retry_config=None)
Bases: RetryableDownloader
A downloader for fetching web page content via the Smart Search SDK.
Initialize the SmartSearchDownloader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_url | str | The base URL for the Smart Search Client. | required |
| token | str | The access token for authentication. | required |
| retry_config | RetryConfig \| dict[str, Any] \| None | Retry configuration. When a dict, it is validated into RetryConfig. When None, a default RetryConfig is built. Note: retry_on_exceptions cannot be customized; it is always NETWORK_RETRY_EXCEPTIONS. Defaults to None. | None |
download(source, output, **kwargs)
Fetch and download a specific web page's content utilizing the Smart Search SDK.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | The specific URL of the web page to fetch content from. | required |
| output | str | The output directory where the downloaded JSON data will be saved. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Kwargs:
- return_html (bool, optional): Return raw HTML if True, cleaned text if False. Defaults to False.
- json_schema (dict, optional): JSON schema for custom structured data extraction. Defaults to None.

Returns:
| Type | Description |
|---|---|
| list[str] | A list of file paths of successfully downloaded files. |
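
As an illustration of the `json_schema` kwarg, the snippet below builds a small schema dict and the keyword arguments that would accompany a download call. The field names (`title`, `published_at`) are invented for the example, and it is an assumption that the Smart Search SDK accepts a standard JSON Schema object in this shape.

```python
import json

# Hypothetical schema: the fields are illustrative, not required by the SDK.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "published_at": {"type": "string"},
    },
    "required": ["title"],
}

# Keyword arguments matching the documented kwargs of download().
download_kwargs = {"return_html": False, "json_schema": article_schema}

# A schema passed to a remote service should be JSON-serializable.
print(json.dumps(download_kwargs, indent=2))
```

These could then be supplied as `downloader.download(url, output, **download_kwargs)`, assuming a configured SmartSearchDownloader instance named `downloader`.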