Types
Types for the evaluator. This module defines the data models, TypedDicts, and helper functions used throughout evaluation.
AttachmentConfig
Bases: BaseModel, ABC
Base configuration for loading attachments.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `str` | The type of attachment source. |
validate_config()
abstractmethod
Validate configuration. Abstract; each subclass must implement its own source-specific validation.
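As an illustration of the base-class contract, here is a minimal stand-in sketch (the real class also derives from `pydantic.BaseModel`; the `LocalAttachmentConfig` body below is an assumption, not the library's code):

```python
from abc import ABC, abstractmethod


class AttachmentConfig(ABC):
    """Minimal stand-in: subclasses must implement validate_config."""

    type: str

    @abstractmethod
    def validate_config(self) -> None:
        """Source-specific validation, implemented by each subclass."""


class LocalAttachmentConfig(AttachmentConfig):
    type = "local"

    def __init__(self, local_directory: str) -> None:
        self.local_directory = local_directory

    def validate_config(self) -> None:
        # Hypothetical check for illustration only.
        if not self.local_directory:
            raise ValueError("local_directory must be non-empty")


cfg = LocalAttachmentConfig(local_directory="./attachments")
cfg.validate_config()  # passes silently
```

Instantiating `AttachmentConfig` directly raises `TypeError`, since the abstract method is unimplemented.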
EvaluationResult
Bases: TypedDict
Structured result from the evaluate function.
This provides a unified return type that includes both evaluation results and experiment tracker URLs/paths.
Attributes:

| Name | Type | Description |
|---|---|---|
| `run_id` | `str` | The run ID for this evaluation. |
| `results` | `list[list[EvaluationOutput]]` | The evaluation results from all evaluators. |
| `experiment_urls` | `ExperimentUrls` | URLs and paths for accessing experiment data. |
| `dataset_name` | `str \| None` | The name of the dataset that was evaluated. |
| `timestamp` | `str` | The timestamp of the evaluation in ISO 8601 format. |
| `num_samples` | `int` | The number of samples in the dataset. |
| `metadata` | `dict[str, Any]` | The metadata of the evaluation. |
| `summary_result` | `dict[str, Any] \| None` | Aggregated summary from summary_evaluators. |
ExperimentUrls
Bases: TypedDict
Experiment URLs and paths for different experiment trackers.
This TypedDict provides a unified interface for experiment tracker URLs/paths. Different trackers will populate different fields based on their capabilities.
Attributes:

| Name | Type | Description |
|---|---|---|
| `run_url` | `str \| None` | URL to view the experiment results. For Langfuse: session URL. For Simple: local file path to experiment results. |
| `leaderboard_url` | `str \| None` | URL to view the leaderboard. For Langfuse: dataset run URL. For Simple: local file path to leaderboard CSV. |
GEvalMetricResult
Bases: MetricResult
GEval-specific metric result with polarity-aware success and diagnostic fields.
Extends MetricResult with fields required for polarity-aware binary scoring and threshold-based pass/fail logic. All GEval-specific fields below are required and non-optional for type-safe aggregation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `score` | `MetricValue` | The evaluation score. Can be continuous (float), discrete (int), or categorical (str). Inherited from MetricResult. |
| `explanation` | `str \| None` | A detailed explanation of the evaluation result. Inherited from MetricResult; nullable. |
| `rubric_score` | `int \| float` | Native rubric value as-is (diagnostic only, not for thresholding). |
| `success` | `bool` | Pass/fail determination using polarity and threshold. |
| `threshold` | `float` | The threshold used to compute success. |
| `strict_mode` | `bool` | If True, binarizes score to 1.0 or 0.0; else uses the raw float score. |
| `higher_is_better` | `bool` | Polarity flag; determines the direction of the success logic. |
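The interaction between polarity, threshold, and strict mode described above can be sketched as plain logic. This is an illustration of the documented semantics, not the library's actual implementation (in particular, binarizing on the pass/fail outcome is an assumption):

```python
def geval_success(score: float, threshold: float,
                  higher_is_better: bool, strict_mode: bool) -> tuple[float, bool]:
    """Illustrative polarity-aware pass/fail: compare the score against
    the threshold in the direction given by higher_is_better, then
    optionally binarize the reported score."""
    passed = score >= threshold if higher_is_better else score <= threshold
    if strict_mode:
        # Assumption: strict mode binarizes based on the pass/fail outcome.
        score = 1.0 if passed else 0.0
    return score, passed


# Higher-is-better metric: 0.8 >= 0.7 passes, raw score kept.
print(geval_success(0.8, 0.7, higher_is_better=True, strict_mode=False))   # (0.8, True)
# Lower-is-better metric: 0.8 <= 0.7 fails; strict mode binarizes to 0.0.
print(geval_success(0.8, 0.7, higher_is_better=False, strict_mode=True))   # (0.0, False)
```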
GoogleDriveAttachmentConfig
Bases: AttachmentConfig
Configuration for loading attachments from Google Drive.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `Literal[GDRIVE]` | Always "gdrive" for this implementation. |
| `client_email` | `str \| None` | Google service account client email. |
| `private_key` | `str \| None` | Google service account private key. |
| `folder_id` | `str` | Google Drive folder ID. |
| `service_account_file` | `str \| None` | Path to a service account JSON file (alternative to client_email/private_key). |
validate_config()
Google Drive-specific validation.
LLMTestData
Bases: BaseModel
Evaluation row model for LLM-based tests.
Use this model for QA, RAG, and agent-style evaluations. All fields are optional so one row can represent different evaluation cases.
Extra fields beyond those listed below are allowed and preserved throughout evaluation. They can be accessed by attribute or via model_dump():

```python
row = LLMTestData(input="q", custom_score=0.9)
row.custom_score                  # 0.9
row.model_dump()["custom_score"]  # 0.9
```
Attributes:

| Name | Type | Description |
|---|---|---|
| `input` | `str \| None` | Input query or user prompt. Defaults to None. |
| `actual_output` | `str \| None` | Model-generated response. Defaults to None. |
| `expected_output` | `str \| None` | Reference response. Defaults to None. |
| `retrieved_context` | `str \| list[str] \| None` | Retrieved context used to answer the query. Defaults to None. |
| `expected_context` | `str \| list[str] \| None` | Reference context expected to be retrieved. Defaults to None. |
| `agent_trajectory` | `list[dict[str, Any]] \| None` | Agent execution trace. Defaults to None. |
| `expected_agent_trajectory` | `list[dict[str, Any]] \| None` | Reference agent execution trace. Defaults to None. |
| `tools_called` | `list[ToolCall] \| None` | Tools invoked by the agent. Defaults to None. |
| `expected_tools` | `list[ToolCall] \| None` | Tools expected to be invoked. Defaults to None. |
| `is_refusal` | `bool \| None` | Whether the sample is a refusal case. Defaults to None. |
LocalAttachmentConfig
Bases: AttachmentConfig
Configuration for loading attachments from local directory.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `Literal[LOCAL]` | Always "local" for this implementation. |
| `local_directory` | `str` | Local directory path. |
validate_config()
Local-specific validation.
MetricResult
Bases: BaseModel
Metric Output Pydantic Model.
A structured output for metric results with score and explanation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `score` | `MetricValue` | The evaluation score. Can be continuous (float), discrete (int), or categorical (str). |
| `explanation` | `str \| None` | A detailed explanation of the evaluation result. |
RetrievalData
Bases: TypedDict
Retrieval data.
Input data for retrieval evaluators such as ClassicalRetrievalEvaluator.
Attributes:

| Name | Type | Description |
|---|---|---|
| `retrieved_chunks` | `dict[str, float]` | The retrieved chunks and their scores. |
| `ground_truth_chunk_ids` | `list[str]` | The ground truth chunk IDs. |
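For illustration, a RetrievalData row pairs scored retrievals with ground-truth IDs; a metric such as recall (this choice of metric and all example data are assumptions, not necessarily what ClassicalRetrievalEvaluator computes) reads both fields:

```python
from typing import TypedDict


class RetrievalData(TypedDict):
    retrieved_chunks: dict[str, float]   # chunk ID -> retrieval score
    ground_truth_chunk_ids: list[str]


row: RetrievalData = {
    "retrieved_chunks": {"c1": 0.91, "c3": 0.42, "c7": 0.13},
    "ground_truth_chunk_ids": ["c1", "c2"],
}

# Recall: fraction of ground-truth chunks that appear among the retrieved ones.
hits = sum(cid in row["retrieved_chunks"] for cid in row["ground_truth_chunk_ids"])
recall = hits / len(row["ground_truth_chunk_ids"])
print(recall)  # 0.5
```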
RunSummaryData
Bases: BaseModel
Summary data for a single run accumulated from batch processing.
This contains all accumulated data and the computed cumulative summary for a run. Stored and managed by experiment trackers that support summary evaluators.
Attributes:

| Name | Type | Description |
|---|---|---|
| `results` | `list[EvaluationOutput]` | All evaluation results accumulated from batches. Defaults to an empty list. |
| `data` | `list[MetricInput]` | All input data accumulated from batches. Defaults to an empty list. |
| `summary` | `dict[str, Any]` | Cumulative summary computed by summary evaluators. Defaults to an empty dict. |
S3AttachmentConfig
Bases: AttachmentConfig
Configuration for loading attachments from S3.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `Literal[S3]` | Always "s3" for this implementation. |
| `s3_bucket` | `str` | S3 bucket name. |
| `s3_prefix` | `str \| None` | S3 prefix (optional). |
| `aws_access_key_id` | `str` | AWS access key ID. |
| `aws_secret_access_key` | `str` | AWS secret access key. |
| `aws_region` | `str` | AWS region. |
validate_config()
S3-specific validation (if needed beyond Pydantic's).
ToolCall
Bases: BaseModel
Structured tool call data for agent evaluation rows.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | Tool name. |
| `description` | `str \| None` | Tool description or metadata. |
| `reasoning` | `str \| None` | Model reasoning for selecting the tool. |
| `output` | `Any \| None` | Tool output. |
| `input_parameters` | `dict[str, Any] \| None` | Tool input parameters. |
from_dicts(tool_calls)
classmethod
Convert canonical tool-call dictionaries to ToolCall objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tool_calls` | `list[dict[str, Any]] \| None` | Canonical tool-call dictionaries. | required |

Returns:

| Type | Description |
|---|---|
| `list[ToolCall] \| None` | Parsed ToolCall objects, or None when the input is None or empty. |
create_attachment_config(config_dict)
Factory function to create the appropriate AttachmentConfig.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config_dict` | `dict[str, Any]` | Configuration dictionary with a 'type' field. | required |

Returns:

| Type | Description |
|---|---|
| `AttachmentConfig` | The appropriate AttachmentConfig subclass instance. |

Raises:

| Type | Description |
|---|---|
| `ValidationError` | If the configuration is invalid. |
Example:

```python
config = create_attachment_config({
    "type": "s3",
    "s3_bucket": "my-bucket",
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "us-east-1",
})
isinstance(config, S3AttachmentConfig)  # True
```
normalize_metric_input(data)
Normalize public input into an internal mutable dict.
When data is an LLMTestData instance, all declared fields are included in the result — even those that are None. When data is already a mapping, it is converted to a plain dict as-is.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput` | A single evaluation row, either an LLMTestData instance or an arbitrary Mapping[str, Any]. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | A plain, mutable dict representation of the input. |
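The normalization rule above can be sketched as follows. This is a simplified stand-in: any object exposing model_dump plays the role of LLMTestData, and the dispatch logic is an assumption about how the documented behavior could be implemented:

```python
from collections.abc import Mapping
from typing import Any


def normalize_metric_input(data: Any) -> dict[str, Any]:
    # Model instances: include all declared fields, even those set to None.
    if hasattr(data, "model_dump"):
        return data.model_dump()
    # Mappings: convert to a plain mutable dict as-is.
    if isinstance(data, Mapping):
        return dict(data)
    raise TypeError(f"Unsupported input type: {type(data)!r}")


class FakeRow:
    """Hypothetical stand-in for LLMTestData."""
    def model_dump(self) -> dict[str, Any]:
        return {"input": "q", "actual_output": None}


print(normalize_metric_input(FakeRow()))       # {'input': 'q', 'actual_output': None}
print(normalize_metric_input({"input": "q"}))  # {'input': 'q'}
```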
normalize_metric_inputs(data)
Normalize a batch of evaluation rows into internal mutable dicts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `list[EvalInput]` | A list of evaluation rows. Each item may be an LLMTestData instance or a Mapping[str, Any]. | required |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | A list of plain dicts, one per input row. |
validate_metric_result(parsed_response)
Validate the parsed metric response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parsed_response` | `dict` | The response to validate. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | The validated response. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the response is not valid. |
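As an illustration of the documented contract (not the actual implementation), a validator of this shape might check that the parsed dict carries a score of one of the allowed kinds:

```python
def validate_metric_result(parsed_response: dict) -> dict:
    """Illustrative validator: require a 'score' field of an allowed type."""
    if "score" not in parsed_response:
        raise ValueError("Response must contain a 'score' field.")
    if not isinstance(parsed_response["score"], (int, float, str)):
        raise ValueError("'score' must be continuous, discrete, or categorical (str).")
    return parsed_response


print(validate_metric_result({"score": 0.9, "explanation": "looks correct"}))
```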