arXiv: 2604.06240 · v1 · submitted 2026-04-05 · 💻 cs.CR · cs.AI · cs.MA

The Art of Building Verifiers for Computer Use Agents

classification 💻 cs.CR · cs.AI · cs.MA
keywords verifier · universal · agent · building · computer · cuaverifierbench · design · humans
Verifying the success of computer use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signal can be trusted. In this paper, we present lessons learned from building a best-in-class verifier for web tasks, which we call the Universal Verifier. We design the Universal Verifier around four key principles: 1) constructing rubrics with meaningful, non-overlapping criteria to reduce noise; 2) separating process and outcome rewards that yield complementary signals, capturing cases where an agent follows the right steps but gets blocked or succeeds through an unexpected path; 3) distinguishing between controllable and uncontrollable failures scored via a cascading-error-free strategy for finer-grained failure understanding; and 4) a divide-and-conquer context management scheme that attends to all screenshots in a trajectory, improving reliability on longer task horizons. We validate these findings on CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels, showing that our Universal Verifier agrees with humans as often as humans agree with each other. We report a reduction in false positive rates to near zero compared to baselines like WebVoyager (≥ 45%) and WebJudge (≥ 22%). We emphasize that these gains stem from the cumulative effect of the design choices above. We also find that an auto-research agent achieves 70% of expert quality in 5% of the time, but fails to discover all strategies required to replicate the Universal Verifier. We open-source our Universal Verifier system along with CUAVerifierBench, available at https://github.com/microsoft/fara.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

    cs.SE 2026-05 conditional novelty 7.0

    DiagEval is a new diagnostic protocol that conditions on failed trajectories to attribute GUI-agent evaluation failures, recovering 45-62% of misattributed cases and lifting accuracy 8-16 points on two benchmarks.

  2. DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

    cs.SE 2026-05 unverdicted novelty 6.0

    DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% o...