Types
Types for the evaluator. This module defines the data models, TypedDicts, and helper functions used throughout evaluation.
AttachmentConfig
Bases: BaseModel, ABC
Base configuration for loading attachments.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `str` | The type of attachment source. |
validate_config()
abstractmethod
Validate configuration. Abstract; each subclass must implement its own source-specific validation.
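As an illustration of the base-class contract, here is a minimal stand-in sketch (the real class also derives from `pydantic.BaseModel`; the `LocalAttachmentConfig` body below is an assumption, not the library's code):

```python
from abc import ABC, abstractmethod


class AttachmentConfig(ABC):
    """Minimal stand-in: subclasses must implement validate_config."""

    type: str

    @abstractmethod
    def validate_config(self) -> None:
        """Source-specific validation, implemented by each subclass."""


class LocalAttachmentConfig(AttachmentConfig):
    type = "local"

    def __init__(self, local_directory: str) -> None:
        self.local_directory = local_directory

    def validate_config(self) -> None:
        # Hypothetical check for illustration only.
        if not self.local_directory:
            raise ValueError("local_directory must be non-empty")


cfg = LocalAttachmentConfig(local_directory="./attachments")
cfg.validate_config()  # passes silently
```

Instantiating `AttachmentConfig` directly raises `TypeError`, since the abstract method is unimplemented.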
EvaluationResult
Bases: TypedDict
Structured result from the evaluate function.
This provides a unified return type that includes both evaluation results and experiment tracker URLs/paths.
Attributes:

| Name | Type | Description |
|---|---|---|
| `run_id` | `str` | The run ID for this evaluation. |
| `results` | `list[list[EvaluationOutput]]` | The evaluation results from all evaluators. |
| `experiment_urls` | `ExperimentUrls` | URLs and paths for accessing experiment data. |
| `dataset_name` | `str \| None` | The name of the dataset that was evaluated. |
| `timestamp` | `str` | The timestamp of the evaluation in ISO 8601 format. |
| `num_samples` | `int` | The number of samples in the dataset. |
| `metadata` | `dict[str, Any]` | The metadata of the evaluation. |
| `summary_result` | `dict[str, Any] \| None` | Aggregated summary from summary_evaluators. |
ExperimentUrls
Bases: TypedDict
Experiment URLs and paths for different experiment trackers.
This TypedDict provides a unified interface for experiment tracker URLs/paths. Different trackers will populate different fields based on their capabilities.
Attributes:

| Name | Type | Description |
|---|---|---|
| `run_url` | `str \| None` | URL to view the experiment results. For Langfuse: session URL. For Simple: local file path to experiment results. |
| `leaderboard_url` | `str \| None` | URL to view the leaderboard. For Langfuse: dataset run URL. For Simple: local file path to leaderboard CSV. |
GEvalMetricResult
Bases: MetricResult
GEval-specific metric result with polarity-aware success and diagnostic fields.
Extends MetricResult with fields required for polarity-aware binary scoring and threshold-based pass/fail logic. All GEval-specific fields below are required and non-optional for type-safe aggregation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `score` | `MetricValue` | The evaluation score. Can be continuous (float), discrete (int), or categorical (str). Inherited from MetricResult. |
| `explanation` | `str \| None` | A detailed explanation of the evaluation result. Inherited from MetricResult; nullable. |
| `rubric_score` | `int \| float` | Native rubric value as-is (diagnostic only, not for thresholding). |
| `success` | `bool` | Pass/fail determination using polarity and threshold. |
| `threshold` | `float` | The threshold used to compute success. |
| `strict_mode` | `bool` | If True, binarizes score to 1.0 or 0.0; else uses the raw float score. |
| `higher_is_better` | `bool` | Polarity flag; determines the direction of the success logic. |
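The interaction between polarity, threshold, and strict mode described above can be sketched as plain logic. This is an illustration of the documented semantics, not the library's actual implementation (in particular, binarizing on the pass/fail outcome is an assumption):

```python
def geval_success(score: float, threshold: float,
                  higher_is_better: bool, strict_mode: bool) -> tuple[float, bool]:
    """Illustrative polarity-aware pass/fail: compare the score against
    the threshold in the direction given by higher_is_better, then
    optionally binarize the reported score."""
    passed = score >= threshold if higher_is_better else score <= threshold
    if strict_mode:
        # Assumption: strict mode binarizes based on the pass/fail outcome.
        score = 1.0 if passed else 0.0
    return score, passed


# Higher-is-better metric: 0.8 >= 0.7 passes, raw score kept.
print(geval_success(0.8, 0.7, higher_is_better=True, strict_mode=False))   # (0.8, True)
# Lower-is-better metric: 0.8 <= 0.7 fails; strict mode binarizes to 0.0.
print(geval_success(0.8, 0.7, higher_is_better=False, strict_mode=True))   # (0.0, False)
```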
GoogleDriveAttachmentConfig
Bases: AttachmentConfig
Configuration for loading attachments from Google Drive.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `Literal[GDRIVE]` | Always "gdrive" for this implementation. |
| `client_email` | `str \| None` | Google service account client email. |
| `private_key` | `str \| None` | Google service account private key. |
| `folder_id` | `str` | Google Drive folder ID. |
| `service_account_file` | `str \| None` | Path to a service account JSON file (alternative to client_email/private_key). |
validate_config()
Google Drive-specific validation.
LLMTestData
Bases: BaseModel
Evaluation row model for LLM-based tests.
Use this model for QA, RAG, and agent-style evaluations. All fields are optional so one row can represent different evaluation cases.
Extra fields beyond those listed below are allowed and preserved throughout evaluation. They can be accessed by attribute or via model_dump():

```python
row = LLMTestData(input="q", custom_score=0.9)
row.custom_score                  # 0.9
row.model_dump()["custom_score"]  # 0.9
```
Attributes:

| Name | Type | Description |
|---|---|---|
| `input` | `str \| None` | Input query or user prompt. Defaults to None. |
| `actual_output` | `str \| None` | Model-generated response. Defaults to None. |
| `expected_output` | `str \| None` | Reference response. Defaults to None. |
| `retrieved_context` | `str \| list[str] \| None` | Retrieved context used to answer the query. Defaults to None. |
| `expected_context` | `str \| list[str] \| None` | Reference context expected to be retrieved. Defaults to None. |
| `agent_trajectory` | `list[dict[str, Any]] \| None` | Agent execution trace. Defaults to None. |
| `expected_agent_trajectory` | `list[dict[str, Any]] \| None` | Reference agent execution trace. Defaults to None. |
| `tools_called` | `list[ToolCall] \| None` | Tools invoked by the agent. Defaults to None. |
| `expected_tools` | `list[ToolCall] \| None` | Tools expected to be invoked. Defaults to None. |
| `is_refusal` | `bool \| None` | Whether the sample is a refusal case. Defaults to None. |
LocalAttachmentConfig
Bases: AttachmentConfig
Configuration for loading attachments from local directory.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `Literal[LOCAL]` | Always "local" for this implementation. |
| `local_directory` | `str` | Local directory path. |
validate_config()
Local-specific validation.
MetricResult
Bases: BaseModel
Metric Output Pydantic Model.
A structured output for metric results with score and explanation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `score` | `MetricValue` | The evaluation score. Can be continuous (float), discrete (int), or categorical (str). |
| `explanation` | `str \| None` | A detailed explanation of the evaluation result. |
RetrievalData
Bases: TypedDict
Retrieval data.
Input data for retrieval evaluators such as ClassicalRetrievalEvaluator.
Attributes:

| Name | Type | Description |
|---|---|---|
| `retrieved_chunks` | `dict[str, float]` | The retrieved chunks and their scores. |
| `ground_truth_chunk_ids` | `list[str]` | The ground truth chunk IDs. |
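For illustration, a RetrievalData row pairs scored retrievals with ground-truth IDs; a metric such as recall (this choice of metric and all example data are assumptions, not necessarily what ClassicalRetrievalEvaluator computes) reads both fields:

```python
from typing import TypedDict


class RetrievalData(TypedDict):
    retrieved_chunks: dict[str, float]   # chunk ID -> retrieval score
    ground_truth_chunk_ids: list[str]


row: RetrievalData = {
    "retrieved_chunks": {"c1": 0.91, "c3": 0.42, "c7": 0.13},
    "ground_truth_chunk_ids": ["c1", "c2"],
}

# Recall: fraction of ground-truth chunks that appear among the retrieved ones.
hits = sum(cid in row["retrieved_chunks"] for cid in row["ground_truth_chunk_ids"])
recall = hits / len(row["ground_truth_chunk_ids"])
print(recall)  # 0.5
```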
RunSummaryData
Bases: BaseModel
Summary data for a single run accumulated from batch processing.
This contains all accumulated data and the computed cumulative summary for a run. Stored and managed by experiment trackers that support summary evaluators.
Attributes:

| Name | Type | Description |
|---|---|---|
| `results` | `list[EvaluationOutput]` | All evaluation results accumulated from batches. Defaults to an empty list. |
| `data` | `list[MetricInput]` | All input data accumulated from batches. Defaults to an empty list. |
| `summary` | `dict[str, Any]` | Cumulative summary computed by summary evaluators. Defaults to an empty dict. |
S3AttachmentConfig
Bases: AttachmentConfig
Configuration for loading attachments from S3.
Attributes:

| Name | Type | Description |
|---|---|---|
| `type` | `Literal[S3]` | Always "s3" for this implementation. |
| `s3_bucket` | `str` | S3 bucket name. |
| `s3_prefix` | `str \| None` | S3 prefix (optional). |
| `aws_access_key_id` | `str` | AWS access key ID. |
| `aws_secret_access_key` | `str` | AWS secret access key. |
| `aws_region` | `str` | AWS region. |
validate_config()
S3-specific validation (if needed beyond Pydantic's).
ToolCall
Bases: BaseModel
Structured tool call data for agent evaluation rows.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | Tool name. |
| `description` | `str \| None` | Tool description or metadata. |
| `reasoning` | `str \| None` | Model reasoning for selecting the tool. |
| `output` | `Any \| None` | Tool output. |
| `input_parameters` | `dict[str, Any] \| None` | Tool input parameters. |
from_dicts(tool_calls)
classmethod
Convert canonical tool-call dictionaries to ToolCall objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tool_calls` | `list[dict[str, Any]] \| None` | Canonical tool-call dictionaries. | required |

Returns:

| Type | Description |
|---|---|
| `list[ToolCall] \| None` | Parsed ToolCall objects, or None when the input is None or empty. |
create_attachment_config(config_dict)
Factory function to create the appropriate AttachmentConfig.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config_dict` | `dict[str, Any]` | Configuration dictionary with a 'type' field. | required |

Returns:

| Type | Description |
|---|---|
| `AttachmentConfig` | The appropriate AttachmentConfig subclass instance. |

Raises:

| Type | Description |
|---|---|
| `ValidationError` | If the configuration is invalid. |
Example:

```python
config = create_attachment_config({
    "type": "s3",
    "s3_bucket": "my-bucket",
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_region": "us-east-1",
})
isinstance(config, S3AttachmentConfig)  # True
```
normalize_metric_input(data)
Normalize public input into an internal mutable dict.
When data is an LLMTestData instance, all declared fields are included in the result — even those that are None. When data is already a mapping, it is converted to a plain dict as-is.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `EvalInput` | A single evaluation row, either an LLMTestData instance or an arbitrary Mapping[str, Any]. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | A plain, mutable dict representation of the input. |
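The normalization rule above can be sketched as follows. This is a simplified stand-in: any object exposing model_dump plays the role of LLMTestData, and the dispatch logic is an assumption about how the documented behavior could be implemented:

```python
from collections.abc import Mapping
from typing import Any


def normalize_metric_input(data: Any) -> dict[str, Any]:
    # Model instances: include all declared fields, even those set to None.
    if hasattr(data, "model_dump"):
        return data.model_dump()
    # Mappings: convert to a plain mutable dict as-is.
    if isinstance(data, Mapping):
        return dict(data)
    raise TypeError(f"Unsupported input type: {type(data)!r}")


class FakeRow:
    """Hypothetical stand-in for LLMTestData."""
    def model_dump(self) -> dict[str, Any]:
        return {"input": "q", "actual_output": None}


print(normalize_metric_input(FakeRow()))       # {'input': 'q', 'actual_output': None}
print(normalize_metric_input({"input": "q"}))  # {'input': 'q'}
```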
normalize_metric_inputs(data)
Normalize a batch of evaluation rows into internal mutable dicts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `list[EvalInput]` | A list of evaluation rows. Each item may be an LLMTestData instance or a Mapping[str, Any]. | required |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | A list of plain dicts, one per input row. |
validate_metric_result(parsed_response)
Validate the parsed metric response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `parsed_response` | `dict` | The response to validate. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | The validated response. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the response is not valid. |
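As an illustration of the documented contract (not the actual implementation), a validator of this shape might check that the parsed dict carries a score of one of the allowed kinds:

```python
def validate_metric_result(parsed_response: dict) -> dict:
    """Illustrative validator: require a 'score' field of an allowed type."""
    if "score" not in parsed_response:
        raise ValueError("Response must contain a 'score' field.")
    if not isinstance(parsed_response["score"], (int, float, str)):
        raise ValueError("'score' must be continuous, discrete, or categorical (str).")
    return parsed_response


print(validate_metric_result({"score": 0.9, "explanation": "looks correct"}))
```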