Skip to content

Deepeval tool correctness

DeepEval Tool Correctness Metric Integration.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, num_judges=DefaultValues.NUM_JUDGES, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[ToolCall], optional): The tools actually called by the agent. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[ToolCall], optional): The expected tools to be called. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to DefaultValues.MODEL.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
include_reason bool

Include reasoning in output. Defaults to True.

True
strict_mode bool

Binary mode (0 or 1). Defaults to False.

False
should_exact_match bool

Require exact match of tools. Defaults to False.

False
should_consider_ordering bool

Consider order of tools called. Defaults to False.

False
available_tools list[dict[str, Any]] | None

List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
num_judges int

The number of judges to use for the metric. Defaults to 1.

NUM_JUDGES
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use for the metric. Defaults to None.

None

ToolCallParser

ToolCallParser converts public tool-call rows into DeepEval tool calls.

from_trajectory_steps(steps)

Convert internal trajectory step dictionaries to public tool calls.

Parameters:

Name Type Description Default
steps list[dict[str, Any]]

Internal step dictionaries produced by TrajectoryParser.

required

Returns:

Type Description
list[ToolCall]

List of typed public ToolCall objects.

normalize_tool_name(tool_name)

Normalize tool names by removing dynamic suffixes.

Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.

Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.

Parameters:

Name Type Description Default
tool_name str

The original tool name with potential suffix.

required

Returns:

Type Description
str

Normalized tool name without dynamic suffix.

parse(tool_calls)

Parse public tool-call payloads into DeepEval tool calls.

Parameters:

Name Type Description Default
tool_calls list[ToolCall | dict[str, Any]]

Public tool-call payloads as ToolCall objects or raw dicts.

required

Returns:

Type Description
list[ToolCall]

List of DeepEvalToolCall instances for metric evaluation.

to_deepeval_tool_calls(steps, normalize_names=True, include_output=True)

Convert public tool calls into DeepEval tool calls.

This is shared adapter logic used by metric classes that pass tool-call data into DeepEval.

Parameters:

Name Type Description Default
steps list[ToolCall]

List of tool calls with 'name', 'input_parameters', and optional 'output'.

required
normalize_names bool

Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True.

True
include_output bool

Whether to include the 'output' field from each step in the DeepEval tool call. Defaults to True.

True

Returns:

Type Description
list[ToolCall]

List of DeepEvalToolCall objects.

TrajectoryParser()

TrajectoryParser is used to parse trajectory data into steps.

Initialize the TrajectoryParser.

convert_trajectory_to_steps(trajectory)

Convert trajectory data to expected steps format.

This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.

Parameters:

Name Type Description Default
trajectory list[dict]

List of trajectory messages with roles (user/assistant/tool).

required

Returns:

Type Description
list[dict]

List of step dictionaries with tool_call_id, kind, name, args, and optional output.

list[dict]

Steps representing the same tool call (with and without output) share the same tool_call_id.

parse(trajectory)

Parse the trajectory data into steps.