Smart search retriever
Defines a web search retriever using the SmartSearch SDK.
The retriever performs web searches through the SDK and returns relevant content from the web.
SmartSearchWebRetriever(base_url=None, token=None)
Bases: BaseRetriever[list[Chunk]]
A web search retriever using the SmartSearch SDK.
It supports multiple search modes: web search, URL retrieval, page fetching, and content extraction.
Examples:
# Initialize the retriever
retriever = SmartSearchWebRetriever(
    base_url="https://your-smartsearch-endpoint",
    token="your-access-token"
)

# Perform a basic web search
results = await retriever.retrieve(
    "What is cloud computing?",
    top_k=5,
    result_type="snippets"
)

# Search with site filter
results = await retriever.retrieve(
    "machine learning frameworks",
    site="https://github.com",
    top_k=5
)

# Batch search
batch_results = await retriever.retrieve(
    ["query 1", "query 2"],
    top_k=5
)
Attributes:
| Name | Type | Description |
|---|---|---|
| `client` | `WebSearchClient` | The SmartSearch web search client. |
| `base_url` | `str` | The base URL for the SmartSearch API. |
Note
This class uses an async factory pattern. Use the create() class method
to instantiate and authenticate in one step:
retriever = await SmartSearchWebRetriever.create(
    base_url="https://api.example.com",
    token="your-token"
)
Initialize the SmartSearchWebRetriever.
Note
This constructor does not authenticate. For automatic authentication,
use the create() class method instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_url` | `str \| None` | The base URL for the SmartSearch API. If not provided, the `SMART_SEARCH_BASE_URL` environment variable is used. | `None` |
| `token` | `str \| None` | The authentication token for the SmartSearch API. If not provided, the `SMART_SEARCH_TOKEN` environment variable is used. | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `base_url` or `token` is not provided and environment variables are not set. |
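The fallback described above (explicit argument first, then environment variable, otherwise `ValueError`) can be sketched as follows. This is a minimal illustration and not the library's actual implementation: `resolve_setting` is a hypothetical helper, and only the two environment variable names come from the parameter table.

```python
from __future__ import annotations

import os


def resolve_setting(explicit: str | None, env_var: str) -> str:
    """Return the explicit value if given, else the environment variable's value.

    Hypothetical helper; the real constructor may resolve settings differently.
    """
    value = explicit if explicit is not None else os.environ.get(env_var)
    if not value:
        raise ValueError(f"Provide a value or set the {env_var} environment variable.")
    return value


# Set the variable here only so the sketch runs on its own.
os.environ["SMART_SEARCH_BASE_URL"] = "https://api.example.com"

base_url = resolve_setting(None, "SMART_SEARCH_BASE_URL")    # taken from the environment
token = resolve_setting("your-token", "SMART_SEARCH_TOKEN")  # explicit argument wins
```

Failing at construction time, rather than on the first request, surfaces a missing setting before any network call is attempted.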
create(base_url=None, token=None)
async
classmethod
Create and authenticate a SmartSearchWebRetriever instance.
This is the recommended way to instantiate the retriever as it handles authentication during initialization.
Examples:
# Create with explicit credentials
retriever = await SmartSearchWebRetriever.create(
    base_url="https://api.example.com",
    token="your-token"
)

# Create using environment variables
retriever = await SmartSearchWebRetriever.create()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_url` | `str \| None` | The base URL for the SmartSearch API. If not provided, the `SMART_SEARCH_BASE_URL` environment variable is used. | `None` |
| `token` | `str \| None` | The authentication token for the SmartSearch API. If not provided, the `SMART_SEARCH_TOKEN` environment variable is used. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `SmartSearchWebRetriever` | `SmartSearchWebRetriever` | An authenticated retriever instance. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `base_url` or `token` is not provided and environment variables are not set. |
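The async factory pattern exists because `__init__` cannot `await`: a synchronous constructor stores settings, and an async class method performs the awaitable setup. The sketch below shows the general shape only. `FactoryDemo` and `_authenticate` are hypothetical stand-ins and do not reflect how `SmartSearchWebRetriever` authenticates internally.

```python
from __future__ import annotations

import asyncio


class FactoryDemo:
    """Hypothetical stand-in, not the real SmartSearchWebRetriever."""

    def __init__(self, base_url: str, token: str) -> None:
        # Synchronous setup only: no network calls and no authentication.
        self.base_url = base_url
        self.token = token
        self.authenticated = False

    async def _authenticate(self) -> None:
        # Placeholder for an awaitable login call (assumption).
        await asyncio.sleep(0)
        self.authenticated = True

    @classmethod
    async def create(cls, base_url: str, token: str) -> FactoryDemo:
        # Construct, then authenticate, so callers receive a ready instance.
        instance = cls(base_url, token)
        await instance._authenticate()
        return instance


async def main() -> FactoryDemo:
    return await FactoryDemo.create("https://api.example.com", "your-token")


demo = asyncio.run(main())
```

An instance built through the plain constructor in this sketch stays unauthenticated until `_authenticate()` is awaited, which mirrors the note that the constructor alone does not authenticate.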
fetch_page(source, return_html=False, json_schema=None)
async
Fetch the content of a specific web page.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` | The URL of the web page to fetch. | required |
| `return_html` | `bool` | If True, return raw HTML; otherwise return cleaned text. | `False` |
| `json_schema` | `dict[str, Any] \| None` | JSON schema for custom structured data extraction. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The fetched page content. |
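As an illustration of the `json_schema` parameter, the sketch below builds a small schema and the keyword arguments for a `fetch_page` call. The field names (`title`, `author`, `published`) are invented for the example, and which JSON Schema features the SmartSearch API accepts is an assumption not confirmed by this reference.

```python
from typing import Any

# Hypothetical schema: field names are illustrative, and the schema dialect
# accepted by fetch_page is an assumption.
article_schema: dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published": {"type": "string"},
    },
    "required": ["title"],
}

# Keyword arguments matching the documented fetch_page signature.
fetch_kwargs: dict[str, Any] = {
    "source": "https://example.com/article",
    "return_html": False,
    "json_schema": article_schema,
}

# Inside a coroutine, with `retriever` created as documented:
#     page = await retriever.fetch_page(**fetch_kwargs)
```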
get_page_keypoints(query, source, top_k=SMARTSEARCH_DEFAULT_KEYPOINT_COUNT, json_schema=None)
async
Extract keypoints summarizing the content of a web page.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The focus topic for extracting keypoints. | required |
| `source` | `str` | The web page URL to analyze. | required |
| `top_k` | `int` | Number of keypoints to return. Defaults to SMARTSEARCH_DEFAULT_KEYPOINT_COUNT (3). | `SMARTSEARCH_DEFAULT_KEYPOINT_COUNT` |
| `json_schema` | `dict[str, Any] \| None` | JSON schema for custom extraction. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[Chunk]` | List of Chunk objects containing the extracted keypoints. |
get_page_snippets(query, source, top_k=SMARTSEARCH_DEFAULT_SNIPPET_COUNT, snippet_style='paragraph', json_schema=None)
async
Extract relevant text snippets from a web page.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The text to match against the web page content. | required |
| `source` | `str` | The URL of the web page. | required |
| `top_k` | `int` | Number of snippets to extract. Defaults to SMARTSEARCH_DEFAULT_SNIPPET_COUNT (3). | `SMARTSEARCH_DEFAULT_SNIPPET_COUNT` |
| `snippet_style` | `SmartSearchSnippetStyle` | Style of snippet extraction: "paragraph" or "sentence". Defaults to "paragraph". | `'paragraph'` |
| `json_schema` | `dict[str, Any] \| None` | JSON schema for custom extraction. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[Chunk]` | List of Chunk objects containing the extracted snippets. |
map_website(base_url, top_k=SMARTSEARCH_DEFAULT_MAP_SIZE, include_subdomains=False, query=None)
async
Map a website and discover its URL structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_url` | `str` | The base URL of the website to map. | required |
| `top_k` | `int` | Maximum number of URLs to return. Defaults to SMARTSEARCH_DEFAULT_MAP_SIZE (20). | `SMARTSEARCH_DEFAULT_MAP_SIZE` |
| `include_subdomains` | `bool` | Whether to include subdomains. Defaults to False. | `False` |
| `query` | `str \| None` | Search query to filter URLs by keywords. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list of URLs from the website map. |
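Because `map_website` returns plain URL strings, its output can be passed directly to `fetch_page`. The sketch below chains the two using only the documented signatures. `crawl_site` is a hypothetical helper, and `FakeRetriever` is a stand-in whose return values (including the `url` and `content` keys) are invented so the code runs without network access.

```python
from __future__ import annotations

import asyncio
from typing import Any


async def crawl_site(retriever: Any, base_url: str, limit: int = 3) -> dict[str, dict[str, Any]]:
    """Map a site, then fetch each discovered page concurrently."""
    urls = await retriever.map_website(base_url, top_k=limit)
    pages = await asyncio.gather(*(retriever.fetch_page(url) for url in urls))
    return dict(zip(urls, pages))


class FakeRetriever:
    """Hypothetical stand-in that mimics the documented method signatures."""

    async def map_website(self, base_url, top_k=20, include_subdomains=False, query=None):
        return [f"{base_url}/page-{i}" for i in range(top_k)]

    async def fetch_page(self, source, return_html=False, json_schema=None):
        # The keys of this dict are assumptions, not the real response shape.
        return {"url": source, "content": f"text of {source}"}


pages = asyncio.run(crawl_site(FakeRetriever(), "https://example.com", limit=2))
```

With a real retriever, pass an instance returned by `create()` in place of `FakeRetriever()`.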
retrieve(query, query_filter=None, top_k=SMARTSEARCH_DEFAULT_TOP_K, result_type='snippets', site=None, engine=None, **kwargs)
async
retrieve(query: str, query_filter: FilterClause | QueryFilter | None = None, top_k: int = SMARTSEARCH_DEFAULT_TOP_K, result_type: SmartSearchResultType = 'snippets', site: str | list[str] | None = None, engine: str | None = None, **kwargs: Any) -> list[Chunk]
retrieve(query: list[str], query_filter: FilterClause | QueryFilter | None = None, top_k: int = SMARTSEARCH_DEFAULT_TOP_K, result_type: SmartSearchResultType = 'snippets', site: str | list[str] | None = None, engine: str | None = None, **kwargs: Any) -> list[list[Chunk]]
Retrieve web search results based on the query.
This method performs a web search using the SmartSearch SDK and returns the results as a list of Chunk objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str \| list[str]` | The query string or list of query strings to search for. If a list is provided, retrieval is performed for each query concurrently. | required |
| `query_filter` | `FilterClause \| QueryFilter \| None` | Filter criteria for the retrieval. Note: This parameter is not used by the SmartSearch API but is kept for interface consistency. Defaults to None. | `None` |
| `top_k` | `int` | The maximum number of results to retrieve. Defaults to SMARTSEARCH_DEFAULT_TOP_K (5). | `SMARTSEARCH_DEFAULT_TOP_K` |
| `result_type` | `SmartSearchResultType` | Type of output format. Supported values: "snippets", "keypoints", "summary", "description". Defaults to "snippets". | `'snippets'` |
| `site` | `str \| list[str] \| None` | URL or list of URLs to limit search results to specific sites or domains. Defaults to None. | `None` |
| `engine` | `str \| None` | Search engine to use: "auto", "firecrawl", or "perplexity". Defaults to "auto". | `None` |
| `**kwargs` | `Any` | Additional parameters for the retrieval process. | `{}` |
Returns:
| Type | Description |
|---|---|
| `list[Chunk] \| list[list[Chunk]]` | Retrieved web search results as Chunk objects. Returns list[list[Chunk]] if query is a list of strings. |
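The two overloads differ only in return shape: a single query yields a flat list, and a list of queries yields one inner list per query. The sketch below shows one way such dispatch could work with `asyncio.gather`. It is not the library's implementation: `search_one` is a hypothetical single-query search, and plain strings stand in for `Chunk` objects.

```python
from __future__ import annotations

import asyncio


async def search_one(query: str, top_k: int) -> list[str]:
    """Hypothetical single-query search returning placeholder results."""
    await asyncio.sleep(0)  # stands in for the network round trip
    return [f"{query} #{i}" for i in range(top_k)]


async def retrieve(query: str | list[str], top_k: int = 5) -> list[str] | list[list[str]]:
    if isinstance(query, str):
        return await search_one(query, top_k)
    # gather runs the searches concurrently and preserves input order,
    # so results[i] belongs to query[i].
    return list(await asyncio.gather(*(search_one(q, top_k) for q in query)))


single = asyncio.run(retrieve("query 1", top_k=2))
batch = asyncio.run(retrieve(["query 1", "query 2"], top_k=2))
```

Here `single` is a flat list of two results, while `batch` holds two inner lists in the same order as the input queries.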
search_urls(query, top_k=SMARTSEARCH_DEFAULT_TOP_K, site=None, engine=None)
async
Retrieve a list of URLs that match the given query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The query string to search for. | required |
| `top_k` | `int` | The maximum number of URLs to retrieve. Defaults to SMARTSEARCH_DEFAULT_TOP_K (5). | `SMARTSEARCH_DEFAULT_TOP_K` |
| `site` | `str \| list[str] \| None` | URL or list of URLs to limit search. | `None` |
| `engine` | `str \| None` | Search engine to use. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list of URLs matching the query. |
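`search_urls` pairs naturally with the page-level methods: find candidate URLs first, then extract content from each. A minimal sketch, assuming only the documented signatures, is shown below. `search_and_extract` is a hypothetical helper, and `FakeRetriever` returns invented strings in place of `Chunk` objects so the example runs offline.

```python
from __future__ import annotations

import asyncio
from typing import Any


async def search_and_extract(retriever: Any, query: str, max_urls: int = 2) -> dict[str, list[Any]]:
    """Find URLs for a query, then pull one sentence-style snippet from each."""
    urls = await retriever.search_urls(query, top_k=max_urls)
    results: dict[str, list[Any]] = {}
    for url in urls:
        results[url] = await retriever.get_page_snippets(
            query, url, top_k=1, snippet_style="sentence"
        )
    return results


class FakeRetriever:
    """Hypothetical stand-in that mimics the documented method signatures."""

    async def search_urls(self, query, top_k=5, site=None, engine=None):
        return [f"https://example.com/{i}" for i in range(top_k)]

    async def get_page_snippets(self, query, source, top_k=3,
                                snippet_style="paragraph", json_schema=None):
        return [f"{snippet_style} {i} for {query!r} from {source}" for i in range(top_k)]


found = asyncio.run(search_and_extract(FakeRetriever(), "cloud computing"))
```

The loop fetches pages one at a time to keep the flow easy to follow; the per-URL calls could also be awaited together if concurrency matters.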