Langchain agentevals

This module contains the LangChain AgentEvals Metric.

Authors

Surya Mahadi (made.r.s.mahadi@gdplabs.id)

References

[1] https://github.com/langchain-ai/agentevals

LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LangChainAgentEvalsMetric

A metric that uses LangChain AgentEvals to evaluate an agent's trajectory with an LLM as a judge.

Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
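AgentEvals expects each trajectory as a list of OpenAI-style chat message dicts, including any tool calls the agent made. The shape below is an illustrative sketch based on the agentevals examples in [1], not a normative schema:

```python
# Hedged sketch of an agent_trajectory value: a list of OpenAI-style message
# dicts, including the tool calls the agent made along the way.
agent_trajectory = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "San Francisco"}',
                }
            }
        ],
    },
    {"role": "tool", "content": "It is 75 degrees and sunny in San Francisco."},
    {"role": "assistant", "content": "The weather in San Francisco is 75 degrees and sunny."},
]
```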

Attributes:

- name (str): The name of the metric.
- evaluator (SimpleAsyncEvaluator): The evaluator to use.

Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.

Parameters:

- name (str): The name of the metric. Required.
- prompt (str): The evaluation prompt. It can be a string template, a LangChain prompt template, or a callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters. Required.
- model (str | ModelId | BaseLMInvoker): The model to use. Required.
- credentials (str | None): The credentials to use for the model. Defaults to None.
- config (dict[str, Any] | None): The config to use for the model. Defaults to None.
- schema (ResponseSchema | None): The schema to use for the model. Defaults to None.
- feedback_key (str): The key used to store the evaluation result. Defaults to "trajectory_accuracy".
- continuous (bool): If True, the score is a float between 0 and 1; if False, the score is a boolean. Defaults to False.
- choices (list[float] | None): Optional list of specific float values the score must be chosen from. Defaults to None.
- use_reasoning (bool): If True, includes an explanation for the score in the output. Defaults to True.
- few_shot_examples (list[FewShotExample] | None): Optional list of example evaluations to append to the prompt. Defaults to None.
- batch_status_check_interval (float): Interval in seconds between batch status checks. Defaults to DefaultValues.BATCH_STATUS_CHECK_INTERVAL (30.0).
- batch_max_iterations (int): Maximum number of batch status check iterations. Defaults to DefaultValues.BATCH_MAX_ITERATIONS (120).
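As a rough usage sketch of the constructor (the import is omitted because the module path depends on your installation, and the prompt variables and model identifier format are assumptions rather than documented contracts):

```python
# Sketch only: assumes LangChainAgentEvalsLLMAsAJudgeMetric has already been
# imported from this module; adjust the import to your installation.
JUDGE_PROMPT = (
    "You are an expert evaluator. Compare the agent trajectory in {outputs} "
    "against the reference trajectory in {reference_outputs} and judge whether "
    "the agent took a reasonable and efficient path to its final answer."
)

metric = LangChainAgentEvalsLLMAsAJudgeMetric(
    name="trajectory_judge",
    prompt=JUDGE_PROMPT,
    model="openai/gpt-4o-mini",  # model id format is an assumption; a BaseLMInvoker also works
    feedback_key="trajectory_accuracy",
    continuous=True,             # return a float in [0, 1] instead of a boolean
    use_reasoning=True,          # include an explanation alongside the score
)
```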

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and the metric's original state is restored afterwards.

For batch processing, the efficient batch API is used when all items share the same custom prompts; otherwise, evaluation falls back to per-item processing.

Parameters:

- data (MetricInput | list[MetricInput]): A single data item or a list of data items to evaluate. Required.

Returns:

- MetricOutput | list[MetricOutput]: Evaluation results with scores namespaced by the metric name.
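A hedged sketch of calling evaluate(). How MetricInput is constructed is an assumption here (a plain dict carrying the available fields), so adapt it to whatever MetricInput actually requires:

```python
import asyncio

# Sketch only: `metric` and `agent_trajectory` come from the earlier sketches,
# and the dict-style payload is an assumed stand-in for MetricInput.
data = {
    "agent_trajectory": agent_trajectory,           # the trajectory the agent produced
    "expected_agent_trajectory": agent_trajectory,  # the reference trajectory
}

async def main() -> None:
    # A single item yields a single MetricOutput; a list yields a list of outputs.
    result = await metric.evaluate(data)
    # Scores are namespaced by the metric name, e.g. under "trajectory_judge".
    print(result)

asyncio.run(main())
```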

LangChainAgentEvalsMetric(name, evaluator)

Bases: BaseMetric

A metric that uses LangChain AgentEvals to evaluate an agent.

Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.

Attributes:

- name (str): The name of the metric.
- evaluator (SimpleAsyncEvaluator): The evaluator to use.

Initialize the LangChainAgentEvalsMetric.

Parameters:

- name (str): The name of the metric. Required.
- evaluator (SimpleAsyncEvaluator): The evaluator to use. Required.
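As a sketch of using the base class directly, any AgentEvals async evaluator can be passed in. The create_async_trajectory_match_evaluator helper below comes from the agentevals package [1]; verify its name, signature, and the available match modes against the version you have installed:

```python
# Sketch only: wraps an agentevals trajectory-match evaluator (no LLM judge).
from agentevals.trajectory.match import create_async_trajectory_match_evaluator

# "superset" mode (per the agentevals docs) accepts trajectories whose tool
# calls are a superset of the reference trajectory's tool calls.
match_evaluator = create_async_trajectory_match_evaluator(
    trajectory_match_mode="superset",
)

# Assumes LangChainAgentEvalsMetric has been imported from this module.
trajectory_match_metric = LangChainAgentEvalsMetric(
    name="trajectory_match",
    evaluator=match_evaluator,
)
```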