Skip to content

Deepeval tool correctness

DeepEval Tool Correctness Metric Integration.

DeepEvalToolCorrectnessMetric(threshold=0.5, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, models=None, fallback_models=None, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[ToolCall], optional): The tools actually called by the agent. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[ToolCall], optional): The expected tools to be called. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
include_reason bool

Whether to include score reasoning. Defaults to True.

True
strict_mode bool

Whether to enforce strict scoring. Defaults to False.

False
should_exact_match bool

Whether tool calls must exactly match. Defaults to False.

False
should_consider_ordering bool

Whether call order matters. Defaults to False.

False
available_tools list[dict[str, Any]] | None

Available tool definitions. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

The tool call fields that the LLM judge will compare between the actual and expected tool calls. Defaults to None (all supported fields are evaluated by DeepEval's default behaviour when None or empty list is passed). Available values:

  • ToolCallParams.INPUT_PARAMETERS: evaluates whether the arguments passed to the tool match the expected input parameters.
  • ToolCallParams.OUTPUT: evaluates whether the tool's return value matches the expected output.
None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
models BaseLMInvoker | list[BaseLMInvoker] | None

The model invoker(s) to use for multi-judge evaluation.

  • None (default): single-judge mode using the default invoker.
  • [invoker] * N: homogeneous — same model N times.
  • [invoker_a, invoker_b]: heterogeneous — distinct models.
None
fallback_models list[BaseLMInvoker] | None

Ordered list of fallback invokers tried in sequence when the primary judge fails. Defaults to None.

None
aggregation_method AggregationSelector

The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.

AVERAGE
max_concurrent_judges int | None

The maximum number of concurrent judges to use. Defaults to None.

None

ToolCallParser

ToolCallParser converts public tool-call rows into DeepEval tool calls.

from_trajectory_steps(steps)

Convert internal trajectory step dictionaries to public tool calls.

Parameters:

Name Type Description Default
steps list[dict[str, Any]]

Internal step dictionaries produced by TrajectoryParser.

required

Returns:

Type Description
list[ToolCall]

List of typed public ToolCall objects.

normalize_tool_name(tool_name)

Normalize tool names by removing dynamic suffixes.

Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.

Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.

Parameters:

Name Type Description Default
tool_name str

The original tool name with potential suffix.

required

Returns:

Type Description
str

Normalized tool name without dynamic suffix.

parse(tool_calls)

Parse public tool-call payloads into DeepEval tool calls.

Parameters:

Name Type Description Default
tool_calls list[ToolCall | dict[str, Any]]

Public tool-call payloads as ToolCall objects or raw dicts.

required

Returns:

Type Description
list[ToolCall]

List of DeepEvalToolCall instances for metric evaluation.

to_deepeval_tool_calls(steps, normalize_names=True, include_output=True)

Convert public tool calls into DeepEval tool calls.

This is shared adapter logic used by metric classes that pass tool-call data into DeepEval.

Parameters:

Name Type Description Default
steps list[ToolCall]

List of tool calls with 'name', 'input_parameters', and optional 'output'.

required
normalize_names bool

Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True.

True
include_output bool

Whether to include the 'output' field from each step in the DeepEval tool call. Defaults to True.

True

Returns:

Type Description
list[ToolCall]

List of DeepEvalToolCall objects.

TrajectoryParser()

TrajectoryParser is used to parse trajectory data into steps.

Initialize the TrajectoryParser.

convert_trajectory_to_steps(trajectory)

Convert trajectory data to expected steps format.

This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.

Parameters:

Name Type Description Default
trajectory list[dict]

List of trajectory messages with roles (user/assistant/tool).

required

Returns:

Type Description
list[dict]

List of step dictionaries with tool_call_id, kind, name, args, and optional output.

list[dict]

Steps representing the same tool call (with and without output) share the same tool_call_id.

parse(trajectory)

Parse the trajectory data into steps.