pith. sign in

arxiv: 2512.03109 · v2 · pith:F7ZF6IGCnew · submitted 2025-12-02 · 💻 cs.LG · cs.AI· stat.AP· stat.ML

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

classification 💻 cs.LG cs.AIstat.APstat.ML
keywords e-valuatoragenttrajectoriesactionshypothesissequentialagenticagents
0
0 comments X
read the original abstract

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used for to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decisions rules with statistical guarantees, enabling the deployment of more reliable agentic systems.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

    cs.AI 2026-04 unverdicted novelty 7.0

    AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.

  2. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

    cs.AI 2026-05 unverdicted novelty 5.0

    A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.