Smart search retriever

Defines a web search retriever using the SmartSearch SDK.

This module provides a retriever that uses the SmartSearch SDK to perform web searches and retrieve relevant content from the web.

SmartSearchWebRetriever(base_url=None, token=None)

Bases: BaseRetriever[list[Chunk]]

A web search retriever using the SmartSearch SDK.

This retriever uses the SmartSearch SDK to perform web searches and retrieve relevant content from the web. It supports multiple search modes including web search, URL retrieval, page fetching, and content extraction.

Examples:

# Initialize the retriever
retriever = SmartSearchWebRetriever(
    base_url="https://your-smartsearch-endpoint",
    token="your-access-token"
)

# Perform a basic web search
results = await retriever.retrieve(
    "What is cloud computing?",
    top_k=5,
    result_type="snippets"
)

# Search with site filter
results = await retriever.retrieve(
    "machine learning frameworks",
    site="https://github.com",
    top_k=5
)

# Batch search
batch_results = await retriever.retrieve(
    ["query 1", "query 2"],
    top_k=5
)

Attributes:

Name Type Description
client WebSearchClient

The SmartSearch web search client.

base_url str

The base URL for the SmartSearch API.

Note

This class uses an async factory pattern. Use the create() class method to instantiate and authenticate in one step:

retriever = await SmartSearchWebRetriever.create(
    base_url="https://api.example.com",
    token="your-token"
)

Initialize the SmartSearchWebRetriever.

Note

This constructor does not authenticate. For automatic authentication, use the create() class method instead.

Parameters:

Name Type Description Default
base_url str | None

The base URL for the SmartSearch API. If not provided, will use SMART_SEARCH_BASE_URL environment variable.

None
token str | None

The authentication token for the SmartSearch API. If not provided, will use SMART_SEARCH_TOKEN environment variable.

None

Raises:

Type Description
ValueError

If base_url or token is not provided and environment variables are not set.
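
The fallback described above (explicit argument first, then environment variable, then ValueError) can be sketched roughly as follows. This is an illustration, not the library's code: the helper names resolve_setting and resolve_credentials are hypothetical, and treating an empty string as "not provided" is an assumption.

```python
import os
from typing import Optional


def resolve_setting(explicit: Optional[str], env_var: str) -> str:
    """Return the explicit value, else the environment variable, else raise."""
    # Assumption: an empty string counts as "not provided".
    value = explicit or os.environ.get(env_var)
    if not value:
        raise ValueError(f"Provide a value or set the {env_var} environment variable.")
    return value


def resolve_credentials(base_url: Optional[str] = None, token: Optional[str] = None):
    # The environment variable names are the ones listed in the parameter table.
    return (
        resolve_setting(base_url, "SMART_SEARCH_BASE_URL"),
        resolve_setting(token, "SMART_SEARCH_TOKEN"),
    )
```

An explicit argument takes precedence, so resolve_credentials(base_url="https://api.example.com") only consults the environment for the token.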

create(base_url=None, token=None) async classmethod

Create and authenticate a SmartSearchWebRetriever instance.

This is the recommended way to instantiate the retriever as it handles authentication during initialization.

Examples:

# Create with explicit credentials
retriever = await SmartSearchWebRetriever.create(
    base_url="https://api.example.com",
    token="your-token"
)

# Create using environment variables
retriever = await SmartSearchWebRetriever.create()

Parameters:

Name Type Description Default
base_url str | None

The base URL for the SmartSearch API. If not provided, will use SMART_SEARCH_BASE_URL environment variable.

None
token str | None

The authentication token for the SmartSearch API. If not provided, will use SMART_SEARCH_TOKEN environment variable.

None

Returns:

Name Type Description
SmartSearchWebRetriever SmartSearchWebRetriever

An authenticated retriever instance.

Raises:

Type Description
ValueError

If base_url or token is not provided and environment variables are not set.
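
For readers unfamiliar with the async factory pattern, the following minimal sketch shows why a create() class method is needed: __init__ cannot await, so any awaitable setup has to happen in a separate coroutine. DemoRetriever and its _authenticate method are hypothetical stand-ins, not the SmartSearch implementation.

```python
import asyncio


class DemoRetriever:
    """Hypothetical stand-in used only to illustrate the pattern."""

    def __init__(self, base_url: str, token: str) -> None:
        # The constructor only stores settings; it performs no I/O.
        self.base_url = base_url
        self.token = token
        self.authenticated = False

    async def _authenticate(self) -> None:
        # Placeholder for an awaitable login call (assumed for illustration).
        await asyncio.sleep(0)
        self.authenticated = True

    @classmethod
    async def create(cls, base_url: str, token: str) -> "DemoRetriever":
        # Build the instance, then await the step that __init__ cannot run.
        instance = cls(base_url, token)
        await instance._authenticate()
        return instance
```

Calling DemoRetriever(...) directly leaves authenticated set to False, which mirrors the note that the constructor does not authenticate.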

fetch_page(source, return_html=False, json_schema=None) async

Fetch the content of a specific web page.

Parameters:

Name Type Description Default
source str

The URL of the web page to fetch.

required
return_html bool

If True, return raw HTML; if False, return cleaned text. Defaults to False.

False
json_schema dict[str, Any] | None

JSON schema for custom structured data extraction.

None

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The fetched page content.
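
A possible way to pass json_schema is sketched below. The schema fields (title, author, published) and the helper name fetch_article are illustrative assumptions, and this page does not say how the schema shapes the returned dict, so the sketch returns the result unchanged.

```python
from typing import Any, Dict

# Hypothetical schema: the field names are illustrative, not defined by SmartSearch.
ARTICLE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published": {"type": "string", "format": "date"},
    },
    "required": ["title"],
}


async def fetch_article(retriever: Any, url: str) -> Dict[str, Any]:
    # `retriever` is expected to be a SmartSearchWebRetriever or a compatible object.
    # return_html=False matches the documented default and is spelled out for clarity.
    return await retriever.fetch_page(url, return_html=False, json_schema=ARTICLE_SCHEMA)
```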

get_page_keypoints(query, source, top_k=SMARTSEARCH_DEFAULT_KEYPOINT_COUNT, json_schema=None) async

Extract keypoints summarizing the content of a web page.

Parameters:

Name Type Description Default
query str

The focus topic for extracting keypoints.

required
source str

The web page URL to analyze.

required
top_k int

Number of keypoints to return. Defaults to SMARTSEARCH_DEFAULT_KEYPOINT_COUNT (3).

SMARTSEARCH_DEFAULT_KEYPOINT_COUNT
json_schema dict[str, Any] | None

JSON schema for custom extraction.

None

Returns:

Type Description
list[Chunk]

list[Chunk]: List of Chunk objects containing the extracted keypoints.

get_page_snippets(query, source, top_k=SMARTSEARCH_DEFAULT_SNIPPET_COUNT, snippet_style='paragraph', json_schema=None) async

Extract relevant text snippets from a web page.

Parameters:

Name Type Description Default
query str

The text to match against the web page content.

required
source str

The URL of the web page.

required
top_k int

Number of snippets to extract. Defaults to SMARTSEARCH_DEFAULT_SNIPPET_COUNT (3).

SMARTSEARCH_DEFAULT_SNIPPET_COUNT
snippet_style SmartSearchSnippetStyle

Style of snippet extraction. "paragraph" or "sentence". Defaults to "paragraph".

'paragraph'
json_schema dict[str, Any] | None

JSON schema for custom extraction.

None

Returns:

Type Description
list[Chunk]

list[Chunk]: List of Chunk objects containing the extracted snippets.
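
Because get_page_keypoints and get_page_snippets take the same query and source, they can be requested together for one page. The sketch below assumes the two calls are independent and safe to run concurrently on one retriever, which this page does not confirm; summarize_page is a hypothetical helper name.

```python
import asyncio
from typing import Any, Dict, List


async def summarize_page(retriever: Any, url: str, topic: str) -> Dict[str, List[Any]]:
    # Run both extractions concurrently; gather returns results in call order.
    keypoints, snippets = await asyncio.gather(
        retriever.get_page_keypoints(topic, url, top_k=3),
        retriever.get_page_snippets(topic, url, top_k=3, snippet_style="sentence"),
    )
    return {"keypoints": keypoints, "snippets": snippets}
```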

map_website(base_url, top_k=SMARTSEARCH_DEFAULT_MAP_SIZE, include_subdomains=False, query=None) async

Map a website and discover its URL structure.

Parameters:

Name Type Description Default
base_url str

The base URL of the website to map.

required
top_k int

Maximum number of URLs to return. Defaults to SMARTSEARCH_DEFAULT_MAP_SIZE (20).

SMARTSEARCH_DEFAULT_MAP_SIZE
include_subdomains bool

Whether to include subdomains. Defaults to False.

False
query str | None

Search query to filter URLs by keywords.

None

Returns:

Type Description
list[str]

list[str]: A list of URLs from the website map.
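
One plausible use of map_website is to discover pages on a topic and then fetch each of them with fetch_page. The helper below is a sketch under that assumption; crawl_section and its parameters are hypothetical names, and concurrent fetching is a design choice rather than documented behavior.

```python
import asyncio
from typing import Any, Dict, List


async def crawl_section(retriever: Any, site: str, topic: str, limit: int = 5) -> Dict[str, Any]:
    # Discover up to `limit` URLs on the site, filtered by keyword.
    urls: List[str] = await retriever.map_website(
        site, top_k=limit, include_subdomains=False, query=topic
    )
    # Fetch the discovered pages concurrently; results line up with `urls`.
    pages = await asyncio.gather(*(retriever.fetch_page(u) for u in urls))
    return dict(zip(urls, pages))
```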

retrieve(query, query_filter=None, top_k=SMARTSEARCH_DEFAULT_TOP_K, result_type='snippets', site=None, engine=None, **kwargs) async

retrieve(query: str, query_filter: FilterClause | QueryFilter | None = None, top_k: int = SMARTSEARCH_DEFAULT_TOP_K, result_type: SmartSearchResultType = 'snippets', site: str | list[str] | None = None, engine: str | None = None, **kwargs: Any) -> list[Chunk]
retrieve(query: list[str], query_filter: FilterClause | QueryFilter | None = None, top_k: int = SMARTSEARCH_DEFAULT_TOP_K, result_type: SmartSearchResultType = 'snippets', site: str | list[str] | None = None, engine: str | None = None, **kwargs: Any) -> list[list[Chunk]]

Retrieve web search results based on the query.

This method performs a web search using the SmartSearch SDK and returns the results as a list of Chunk objects.

Parameters:

Name Type Description Default
query str | list[str]

The query string or list of query strings to search for. If a list is provided, retrieval is performed for each query concurrently.

required
query_filter FilterClause | QueryFilter | None

Filter criteria for the retrieval. Note: This parameter is not used by the SmartSearch API but is kept for interface consistency. Defaults to None.

None
top_k int

The maximum number of results to retrieve. Defaults to SMARTSEARCH_DEFAULT_TOP_K (5).

SMARTSEARCH_DEFAULT_TOP_K
result_type SmartSearchResultType

Type of output format. Supported values: "snippets", "keypoints", "summary", "description". Defaults to "snippets".

'snippets'
site str | list[str] | None

URL or list of URLs to limit search results to specific sites or domains. Defaults to None.

None
engine str | None

Search engine to use: "auto", "firecrawl", or "perplexity". If None (the default), "auto" is used.

None
**kwargs Any

Additional parameters for the retrieval process.

{}

Returns:

Type Description
list[Chunk] | list[list[Chunk]]

list[Chunk] | list[list[Chunk]]: Retrieved web search results as Chunk objects. Returns list[list[Chunk]] if query is a list of strings.
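
The two return shapes follow from how a single query and a batch are dispatched. The sketch below shows one common way to implement that behavior with asyncio.gather; it is not the library's code, and retrieve_any and search_one are hypothetical names, with plain strings standing in for Chunk objects.

```python
import asyncio
from typing import Awaitable, Callable, List, Union

# Hypothetical type: a coroutine function that searches for one query.
SearchOne = Callable[[str], Awaitable[List[str]]]


async def retrieve_any(
    query: Union[str, List[str]], search_one: SearchOne
) -> Union[List[str], List[List[str]]]:
    if isinstance(query, str):
        # Single query: a flat list of results.
        return await search_one(query)
    # Batch: run every query concurrently; gather preserves input order.
    return list(await asyncio.gather(*(search_one(q) for q in query)))
```

With this shape, the i-th inner list of a batch result corresponds to the i-th query in the input list.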

search_urls(query, top_k=SMARTSEARCH_DEFAULT_TOP_K, site=None, engine=None) async

Retrieve a list of URLs that match the given query.

Parameters:

Name Type Description Default
query str

The query string to search for.

required
top_k int

The maximum number of URLs to retrieve. Defaults to SMARTSEARCH_DEFAULT_TOP_K (5).

SMARTSEARCH_DEFAULT_TOP_K
site str | list[str] | None

URL or list of URLs to limit the search to specific sites or domains.

None
engine str | None

Search engine to use.

None

Returns:

Type Description
list[str]

list[str]: A list of URLs matching the query.
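
Unlike retrieve, search_urls is documented with a single query string, so several queries need a loop. The following sketch collects URLs for multiple queries and removes duplicates; collect_urls and per_query are hypothetical names.

```python
from typing import Any, List


async def collect_urls(retriever: Any, queries: List[str], per_query: int = 5) -> List[str]:
    found: List[str] = []
    for q in queries:
        # search_urls takes one query string at a time.
        found.extend(await retriever.search_urls(q, top_k=per_query))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))
```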