Deepeval tool correctness

DeepEval Tool Correctness Metric Integration.

`DeepEvalToolCorrectnessMetric(threshold=0.5, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, models=None, fallback_models=None, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)`

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields

query (str): The input query.
generated_response (str, optional): The actual output/response.
expected_response (str, optional): The expected output/response.
tools_called (list[ToolCall], optional): The tools actually called by the agent. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
expected_tools (list[ToolCall], optional): The expected tools to be called. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.

Scoring

0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.

Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

Name	Type	Description	Default
`threshold`	`float`	The threshold to use for the metric. Defaults to 0.5.	`0.5`
`include_reason`	`bool`	Whether to include score reasoning. Defaults to True.	`True`
`strict_mode`	`bool`	Whether to enforce strict scoring. Defaults to False.	`False`
`should_exact_match`	`bool`	Whether tool calls must exactly match. Defaults to False.	`False`
`should_consider_ordering`	`bool`	Whether call order matters. Defaults to False.	`False`
`available_tools`	`list[dict[str, Any]] \| None`	Available tool definitions. Defaults to None.	`None`
`evaluation_params`	`list[ToolCallParams] \| None`	The tool call fields that the LLM judge will compare between the actual and expected tool calls. Defaults to None (all supported fields are evaluated by DeepEval's default behaviour when None or empty list is passed). Available values: `ToolCallParams.INPUT_PARAMETERS`: evaluates whether the arguments passed to the tool match the expected input parameters. `ToolCallParams.OUTPUT`: evaluates whether the tool's return value matches the expected output.	`None`
`batch_status_check_interval`	`float`	Interval in seconds between batch status checks. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of batch status check iterations. Defaults to 120.	`BATCH_MAX_ITERATIONS`
`models`	`BaseLMInvoker \| list[BaseLMInvoker] \| None`	The model invoker(s) to use for multi-judge evaluation. `None` (default): single-judge mode using the default invoker. `[invoker] * N`: homogeneous — same model N times. `[invoker_a, invoker_b]`: heterogeneous — distinct models.	`None`
`fallback_models`	`list[BaseLMInvoker] \| None`	Ordered list of fallback invokers tried in sequence when the primary judge fails. Defaults to None.	`None`
`aggregation_method`	`AggregationSelector`	The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE.	`AVERAGE`
`max_concurrent_judges`	`int \| None`	The maximum number of concurrent judges to use. Defaults to None.	`None`

`ToolCallParser`

ToolCallParser converts public tool-call rows into DeepEval tool calls.

`from_trajectory_steps(steps)`

Convert internal trajectory step dictionaries to public tool calls.

Parameters:

Name	Type	Description	Default
`steps`	`list[dict[str, Any]]`	Internal step dictionaries produced by `TrajectoryParser`.	required

Returns:

Type	Description
`list[ToolCall]`	List of typed public `ToolCall` objects.

`normalize_tool_name(tool_name)`

Normalize tool names by removing dynamic suffixes.

Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.

Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.

Parameters:

Name	Type	Description	Default
`tool_name`	`str`	The original tool name with potential suffix.	required

Returns:

Type	Description
`str`	Normalized tool name without dynamic suffix.

`parse(tool_calls)`

Parse public tool-call payloads into DeepEval tool calls.

Parameters:

Name	Type	Description	Default
`tool_calls`	`list[ToolCall \| dict[str, Any]]`	Public tool-call payloads as `ToolCall` objects or raw dicts.	required

Returns:

Type	Description
`list[ToolCall]`	List of `DeepEvalToolCall` instances for metric evaluation.

`to_deepeval_tool_calls(steps, normalize_names=True, include_output=True)`

Convert public tool calls into DeepEval tool calls.

This is shared adapter logic used by metric classes that pass tool-call data into DeepEval.

Parameters:

Name	Type	Description	Default
`steps`	`list[ToolCall]`	List of tool calls with 'name', 'input_parameters', and optional 'output'.	required
`normalize_names`	`bool`	Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True.	`True`
`include_output`	`bool`	Whether to include the 'output' field from each step in the DeepEval tool call. Defaults to True.	`True`

Returns:

Type	Description
`list[ToolCall]`	List of `DeepEvalToolCall` objects.

`TrajectoryParser()`

TrajectoryParser is used to parse trajectory data into steps.

Initialize the TrajectoryParser.

`convert_trajectory_to_steps(trajectory)`

Convert trajectory data to expected steps format.

This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.

Parameters:

Name	Type	Description	Default
`trajectory`	`list[dict]`	List of trajectory messages with roles (user/assistant/tool).	required

Returns:

Type	Description
`list[dict]`	List of step dictionaries with tool_call_id, kind, name, args, and optional output.
`list[dict]`	Steps representing the same tool call (with and without output) share the same tool_call_id.

`parse(trajectory)`

Parse the trajectory data into steps.