
Types

Types for the evaluator.

This module defines the data models and helper functions used by the evaluator.

AttachmentConfig

Bases: BaseModel, ABC

Base configuration for loading attachments.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `type` | `str` | The type of attachment source. |

validate_config() abstractmethod

Validate the configuration. Subclasses must provide an implementation.

EvaluationResult

Bases: TypedDict

Structured result from the evaluate function.

This provides a unified return type that includes both evaluation results and experiment tracker URLs/paths.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `run_id` | `str` | The run ID for this evaluation. |
| `results` | `list[list[EvaluationOutput]]` | The evaluation results from all evaluators. |
| `experiment_urls` | `ExperimentUrls` | URLs and paths for accessing experiment data. |
| `dataset_name` | `str \| None` | The name of the dataset that was evaluated. |
| `timestamp` | `str` | The timestamp of the evaluation in ISO 8601 format. |
| `num_samples` | `int` | The number of samples in the dataset. |
| `metadata` | `dict[str, Any]` | The metadata of the evaluation. |
| `summary_result` | `dict[str, Any] \| None` | Aggregated summary from summary_evaluators. |

ExperimentUrls

Bases: TypedDict

Experiment URLs and paths for different experiment trackers.

This TypedDict provides a unified interface for experiment tracker URLs/paths. Different trackers will populate different fields based on their capabilities.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `run_url` | `str \| None` | URL to view the experiment results. For Langfuse: session URL. For Simple: local file path to experiment results. |
| `leaderboard_url` | `str \| None` | URL to view the leaderboard. For Langfuse: dataset run URL. For Simple: local file path to leaderboard CSV. |

GEvalMetricResult

Bases: MetricResult

GEval-specific metric result with polarity-aware success and diagnostic fields.

Extends MetricResult with fields required for polarity-aware binary scoring and threshold-based pass/fail logic. All GEval-specific fields below are required and non-optional for type-safe aggregation.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `score` | `MetricValue` | The evaluation score. Can be continuous (float), discrete (int), or categorical (str). Inherited from MetricResult. |
| `explanation` | `str \| None` | A detailed explanation of the evaluation result. Inherited from MetricResult; nullable. |
| `rubric_score` | `int \| float` | Native rubric value as-is (diagnostic only, not for thresholding). |
| `success` | `bool` | Pass/fail determination using polarity and threshold. |
| `threshold` | `float` | The threshold used to compute success. |
| `strict_mode` | `bool` | If True, binarizes score to 1.0 or 0.0; else uses raw float score. |
| `higher_is_better` | `bool` | Polarity flag; determines success logic direction. |
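The polarity-aware success rule and the strict_mode binarization described above can be sketched as plain functions; the helper names here are hypothetical, not part of the library's API:

```python
def success(score: float, threshold: float, higher_is_better: bool) -> bool:
    """Polarity-aware pass/fail: the comparison direction follows higher_is_better."""
    return score >= threshold if higher_is_better else score <= threshold

def binarize(score: float, threshold: float, higher_is_better: bool) -> float:
    """strict_mode behavior: collapse the score to 1.0 (pass) or 0.0 (fail)."""
    return 1.0 if success(score, threshold, higher_is_better) else 0.0

print(success(0.8, 0.5, True))    # True: higher is better and 0.8 >= 0.5
print(success(0.8, 0.5, False))   # False: lower is better but 0.8 > 0.5
print(binarize(0.8, 0.5, True))   # 1.0 under strict_mode
```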

GoogleDriveAttachmentConfig

Bases: AttachmentConfig

Configuration for loading attachments from Google Drive.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `type` | `Literal[GDRIVE]` | Always "gdrive" for this implementation. |
| `client_email` | `str \| None` | Google service account client email. |
| `private_key` | `str \| None` | Google service account private key. |
| `folder_id` | `str` | Google Drive folder ID. |
| `service_account_file` | `str \| None` | Path to service account JSON file (alternative to client_email/private_key). |

validate_config()

Google Drive-specific validation.

LLMTestData

Bases: BaseModel

Evaluation row model for LLM-based tests.

Use this model for QA, RAG, and agent-style evaluations. All fields are optional so one row can represent different evaluation cases.

Extra fields beyond those listed below are allowed and preserved throughout evaluation. They can be accessed by attribute or via model_dump():

    row = LLMTestData(input="q", custom_score=0.9)
    row.custom_score          # 0.9
    row.model_dump()["custom_score"]  # 0.9

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `input` | `str \| None` | Input query or user prompt. Defaults to None. |
| `actual_output` | `str \| None` | Model-generated response. Defaults to None. |
| `expected_output` | `str \| None` | Reference response. Defaults to None. |
| `retrieved_context` | `str \| list[str] \| None` | Retrieved context used to answer the query. Defaults to None. |
| `expected_context` | `str \| list[str] \| None` | Reference context expected to be retrieved. Defaults to None. |
| `agent_trajectory` | `list[dict[str, Any]] \| None` | Agent execution trace. Defaults to None. |
| `expected_agent_trajectory` | `list[dict[str, Any]] \| None` | Reference agent execution trace. Defaults to None. |
| `tools_called` | `list[ToolCall] \| None` | Tools invoked by the agent. Defaults to None. |
| `expected_tools` | `list[ToolCall] \| None` | Tools expected to be invoked. Defaults to None. |
| `is_refusal` | `bool \| None` | Whether the sample is a refusal case. Defaults to None. |
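For illustration, a QA-style row can also be written as a plain mapping (plain mappings are accepted alongside LLMTestData by normalize_metric_input, documented below); the field names follow the attribute list above, and `latency_ms` is a made-up extra field:

```python
# QA-style evaluation row as a plain mapping. "latency_ms" is a hypothetical
# extra field of the kind LLMTestData allows and preserves.
row = {
    "input": "What is the capital of France?",
    "actual_output": "Paris.",
    "expected_output": "Paris",
    "retrieved_context": ["Paris is the capital of France."],
    "latency_ms": 120,
}
print(row["expected_output"])
```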

LocalAttachmentConfig

Bases: AttachmentConfig

Configuration for loading attachments from a local directory.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `type` | `Literal[LOCAL]` | Always "local" for this implementation. |
| `local_directory` | `str` | Local directory path. |

validate_config()

Local-specific validation.

MetricResult

Bases: BaseModel

Pydantic model for metric output.

A structured output for metric results with score and explanation.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `score` | `MetricValue` | The evaluation score. Can be continuous (float), discrete (int), or categorical (str). |
| `explanation` | `str \| None` | A detailed explanation of the evaluation result. |

RetrievalData

Bases: TypedDict

Retrieval data.

Input data for retrieval evaluation, used by evaluators such as ClassicalRetrievalEvaluator.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `retrieved_chunks` | `dict[str, float]` | The retrieved chunks and their scores. |
| `ground_truth_chunk_ids` | `list[str]` | The ground truth chunk IDs. |
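A sketch of how a retrieval metric might consume this shape; the `recall_at_k` helper is hypothetical, not part of the library:

```python
def recall_at_k(retrieved_chunks: dict[str, float],
                ground_truth_chunk_ids: list[str],
                k: int) -> float:
    """Recall@k over a RetrievalData-shaped payload (illustrative only)."""
    # Rank chunk IDs by descending score and keep the top k.
    top_k = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)[:k]
    hits = sum(1 for cid in ground_truth_chunk_ids if cid in top_k)
    return hits / len(ground_truth_chunk_ids) if ground_truth_chunk_ids else 0.0

data = {
    "retrieved_chunks": {"c1": 0.9, "c2": 0.4, "c3": 0.7},
    "ground_truth_chunk_ids": ["c1", "c3"],
}
print(recall_at_k(data["retrieved_chunks"], data["ground_truth_chunk_ids"], k=2))  # 1.0
```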

RunSummaryData

Bases: BaseModel

Summary data for a single run accumulated from batch processing.

This contains all accumulated data and the computed cumulative summary for a run. Stored and managed by experiment trackers that support summary evaluators.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `results` | `list[EvaluationOutput]` | All evaluation results accumulated from batches. Defaults to empty list. |
| `data` | `list[MetricInput]` | All input data accumulated from batches. Defaults to empty list. |
| `summary` | `dict[str, Any]` | Cumulative summary computed by summary evaluators. Defaults to empty dict. |

S3AttachmentConfig

Bases: AttachmentConfig

Configuration for loading attachments from S3.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `type` | `Literal[S3]` | Always "s3" for this implementation. |
| `s3_bucket` | `str` | S3 bucket name. |
| `s3_prefix` | `str \| None` | S3 prefix (optional). |
| `aws_access_key_id` | `str` | AWS access key ID. |
| `aws_secret_access_key` | `str` | AWS secret access key. |
| `aws_region` | `str` | AWS region. |

validate_config()

S3-specific validation (if needed beyond Pydantic's).

ToolCall

Bases: BaseModel

Structured tool call data for agent evaluation rows.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `name` | `str` | Tool name. |
| `description` | `str \| None` | Tool description or metadata. |
| `reasoning` | `str \| None` | Model reasoning for selecting the tool. |
| `output` | `Any \| None` | Tool output. |
| `input_parameters` | `dict[str, Any] \| None` | Tool input parameters. |

from_dicts(tool_calls) classmethod

Convert canonical tool-call dictionaries to ToolCall objects.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tool_calls` | `list[dict[str, Any]] \| None` | Tool-call dictionaries using the public ToolCall field names. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[ToolCall] \| None` | Parsed ToolCall objects, or None when the input is None or empty. |

create_attachment_config(config_dict)

Factory function to create the appropriate AttachmentConfig.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config_dict` | `dict[str, Any]` | Configuration dictionary with 'type' field. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `AttachmentConfig` | The appropriate AttachmentConfig subclass instance. |

Raises:

| Type | Description |
| --- | --- |
| `ValidationError` | If configuration is invalid. |

Example:

    >>> config = create_attachment_config({
    ...     "type": "s3",
    ...     "s3_bucket": "my-bucket",
    ...     "aws_access_key_id": "...",
    ...     "aws_secret_access_key": "...",
    ...     "aws_region": "us-east-1",
    ... })
    >>> isinstance(config, S3AttachmentConfig)
    True

normalize_metric_input(data)

Normalize public input into an internal mutable dict.

When data is an LLMTestData instance, all declared fields are included in the result — even those that are None. When data is already a mapping, it is converted to a plain dict as-is.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `EvalInput` | A single evaluation row, either an LLMTestData instance or an arbitrary Mapping[str, Any]. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | A plain, mutable dict representation of the input. |
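The mapping branch of the documented behavior can be sketched as follows; the LLMTestData branch (which includes all declared fields via model_dump, even None ones) is omitted here:

```python
from collections.abc import Mapping
from typing import Any

def normalize_metric_input(data: Any) -> dict[str, Any]:
    """Sketch of the mapping branch only: a Mapping is copied into a plain dict."""
    if isinstance(data, Mapping):
        return dict(data)
    # The real function would handle LLMTestData here; this sketch does not.
    raise TypeError("this sketch only accepts mappings")

row = normalize_metric_input({"input": "q", "custom_score": 0.9})
print(row["custom_score"])  # 0.9
```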

normalize_metric_inputs(data)

Normalize a batch of evaluation rows into internal mutable dicts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `list[EvalInput]` | A list of evaluation rows. Each item may be an LLMTestData instance or a Mapping[str, Any]. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | A list of plain dicts, one per input row. |

validate_metric_result(parsed_response)

Validate a parsed metric response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `parsed_response` | `dict` | The response to validate. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `dict` | `dict` | The validated response. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the response is not valid. |
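The exact validation rules are not documented on this page; as a purely hypothetical sketch of the contract (a response must at least carry a score, and an invalid one raises ValueError):

```python
def validate_metric_result(parsed_response: dict) -> dict:
    """Hypothetical sketch: reject responses missing a 'score' field."""
    if "score" not in parsed_response:
        raise ValueError("metric result must include a 'score' field")
    return parsed_response

print(validate_metric_result({"score": 0.7, "explanation": "ok"})["score"])  # 0.7
```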