Deepeval tool correctness
DeepEval Tool Correctness Metric Integration.
DeepEvalToolCorrectnessMetric(threshold=0.5, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, models=None, fallback_models=None, aggregation_method=AggregationMethod.AVERAGE, max_concurrent_judges=None)
Bases: DeepEvalMetricFactory
DeepEval Tool Correctness Metric.
This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.
Available Fields
- query (str): The input query.
- generated_response (str, optional): The actual output/response.
- expected_response (str, optional): The expected output/response.
- tools_called (list[ToolCall], optional): The tools actually called by the agent. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
- expected_tools (list[ToolCall], optional): The expected tools to be called. Each item should include 'name', 'input_parameters', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
- agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
- expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
- available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example
Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.
Initializes DeepEvalToolCorrectnessMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float
|
The threshold to use for the metric. Defaults to 0.5. |
0.5
|
include_reason
|
bool
|
Whether to include score reasoning. Defaults to True. |
True
|
strict_mode
|
bool
|
Whether to enforce strict scoring. Defaults to False. |
False
|
should_exact_match
|
bool
|
Whether tool calls must exactly match. Defaults to False. |
False
|
should_consider_ordering
|
bool
|
Whether call order matters. Defaults to False. |
False
|
available_tools
|
list[dict[str, Any]] | None
|
Available tool definitions. Defaults to None. |
None
|
evaluation_params
|
list[ToolCallParams] | None
|
The tool call fields that the LLM judge will compare between the actual and expected tool calls. Defaults to None (all supported fields are evaluated by DeepEval's default behaviour when None or empty list is passed). Available values:
|
None
|
batch_status_check_interval
|
float
|
Interval in seconds between batch status checks. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of batch status check iterations. Defaults to 120. |
BATCH_MAX_ITERATIONS
|
models
|
BaseLMInvoker | list[BaseLMInvoker] | None
|
The model invoker(s) to use for multi-judge evaluation.
|
None
|
fallback_models
|
list[BaseLMInvoker] | None
|
Ordered list of fallback invokers tried in sequence when the primary judge fails. Defaults to None. |
None
|
aggregation_method
|
AggregationSelector
|
The aggregation method to use for the metric. Defaults to AggregationMethod.AVERAGE. |
AVERAGE
|
max_concurrent_judges
|
int | None
|
The maximum number of concurrent judges to use. Defaults to None. |
None
|
ToolCallParser
ToolCallParser converts public tool-call rows into DeepEval tool calls.
from_trajectory_steps(steps)
Convert internal trajectory step dictionaries to public tool calls.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
steps
|
list[dict[str, Any]]
|
Internal step dictionaries produced by |
required |
Returns:
| Type | Description |
|---|---|
list[ToolCall]
|
List of typed public |
normalize_tool_name(tool_name)
Normalize tool names by removing dynamic suffixes.
Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.
Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tool_name
|
str
|
The original tool name with potential suffix. |
required |
Returns:
| Type | Description |
|---|---|
str
|
Normalized tool name without dynamic suffix. |
parse(tool_calls)
Parse public tool-call payloads into DeepEval tool calls.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tool_calls
|
list[ToolCall | dict[str, Any]]
|
Public tool-call payloads as |
required |
Returns:
| Type | Description |
|---|---|
list[ToolCall]
|
List of |
to_deepeval_tool_calls(steps, normalize_names=True, include_output=True)
Convert public tool calls into DeepEval tool calls.
This is shared adapter logic used by metric classes that pass tool-call data into DeepEval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
steps
|
list[ToolCall]
|
List of tool calls with 'name', 'input_parameters', and optional 'output'. |
required |
normalize_names
|
bool
|
Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True. |
True
|
include_output
|
bool
|
Whether to include the 'output' field from each step in the DeepEval tool call. Defaults to True. |
True
|
Returns:
| Type | Description |
|---|---|
list[ToolCall]
|
List of |
TrajectoryParser()
TrajectoryParser is used to parse trajectory data into steps.
Initialize the TrajectoryParser.
convert_trajectory_to_steps(trajectory)
Convert trajectory data to expected steps format.
This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
trajectory
|
list[dict]
|
List of trajectory messages with roles (user/assistant/tool). |
required |
Returns:
| Type | Description |
|---|---|
list[dict]
|
List of step dictionaries with tool_call_id, kind, name, args, and optional output. |
list[dict]
|
Steps representing the same tool call (with and without output) share the same tool_call_id. |
parse(trajectory)
Parse the trajectory data into steps.