DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3
The pith
DiagEval reuses failed GUI trajectories to diagnose whether evaluation errors come from the agent or the software under test.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiagEval is a trajectory-conditioned diagnostic evaluation protocol that, after a failed rollout, selects targeted diagnostic probes from the observed trajectory and aggregates their outcomes into an internal attribution signal that distinguishes evaluator-side execution error from genuine software defect.
What carries the argument
trajectory-conditioned diagnostic probes whose outcomes are aggregated into an attribution signal
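As a rough illustration of how probe outcomes could be aggregated into an attribution signal, here is a minimal sketch. Everything in it is hypothetical: the `ProbeResult` type, the `attribute_failure` function, the probe names, and the pass-fraction threshold rule are assumptions made for illustration. This summary does not specify the paper's actual aggregation rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """Outcome of one diagnostic probe (hypothetical structure)."""
    name: str
    passed: bool  # True if the probed behavior worked as the task requires


def attribute_failure(results: List[ProbeResult], threshold: float = 0.5) -> str:
    """Map probe outcomes to a coarse attribution label.

    Assumed rule (not from the paper): if at least `threshold` of the probes
    pass, the software appears to work, so the original failure is attributed
    to the evaluator; otherwise it is attributed to the software.
    """
    if not results:
        return "undetermined"
    pass_fraction = sum(r.passed for r in results) / len(results)
    return "evaluator_error" if pass_fraction >= threshold else "software_defect"


# Illustrative probe outcomes; the probe names are made up.
probes = [
    ProbeResult("submit_button_reachable", True),
    ProbeResult("form_accepts_valid_input", True),
    ProbeResult("confirmation_shown", False),
]
label = attribute_failure(probes)
```

With two of three probes passing, this toy rule attributes the failure to the evaluator. A real protocol would likely weight probes by how informative they are rather than counting them equally.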
If this is right
- On false-negative cases DiagEval recovers 45.6-62.1 percent of failures initially blamed on software.
- It outperforms retry-based baselines, with relative gains of 34.4-160.6 percent.
- Overall accuracy rises from 69.9 percent to 78.3 percent on WebDevJudge-Unit and from 65.0 percent to 81.6 percent on RealDevBench.
- Reliable GUI-agent evaluation requires active failure diagnosis in addition to stronger execution.
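The accuracy figures above can be restated as absolute and relative gains. The short sketch below derives them from the reported before and after accuracies; the `gains` helper is a hypothetical name, and the derived numbers are computed here rather than quoted from the paper. They are also distinct from the 34.4-160.6 percent figure, which compares recovery against retry baselines.

```python
def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Accuracies as reported in the summary (percent).
webdevjudge = gains(69.9, 78.3)   # (8.4, 12.0)
realdevbench = gains(65.0, 81.6)  # (16.6, 25.5)
```

So the reported lift is 8.4 percentage points on WebDevJudge-Unit and 16.6 percentage points on RealDevBench.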
Where Pith is reading between the lines
- The same trajectory-reuse idea could be tested on non-GUI agent evaluations where state is only partially observed.
- If diagnostic probes can be generated automatically from trajectory logs, the method might generalize to domains without human-designed probe templates.
- Improved attribution might allow evaluators to allocate compute more efficiently by retrying only when the signal indicates an execution error rather than a defect.
Load-bearing premise
The failed trajectory supplies enough information to select targeted diagnostic probes that can reliably separate execution errors from software defects.
What would settle it
Apply DiagEval to a benchmark where every failure has been manually labeled as either software defect or evaluator error; if the aggregated probe outcomes do not increase the fraction of correctly attributed cases beyond the no-probe baseline, the method does not work.
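The test described above amounts to comparing attribution accuracy with and without probes against manual labels. A minimal sketch follows, assuming string labels `"evaluator_error"` and `"software_defect"` and a no-probe baseline that blames every failure on the software. The function name and the toy data are made up for illustration and are not results from the paper.

```python
from typing import List


def attribution_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of failures whose predicted cause matches the manual label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy manual labels for four failed rollouts (illustrative only).
gold = ["evaluator_error", "software_defect", "evaluator_error", "software_defect"]

# Assumed no-probe baseline: every failure is blamed on the software.
no_probe = ["software_defect"] * len(gold)

# Hypothetical attributions produced after running diagnostic probes.
with_probes = ["evaluator_error", "software_defect", "software_defect", "software_defect"]

baseline_acc = attribution_accuracy(no_probe, gold)
probe_acc = attribution_accuracy(with_probes, gold)
method_helps = probe_acc > baseline_acc
```

If `method_helps` came out false on a properly labeled benchmark, that would be the refuting outcome the paragraph describes.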
Original abstract
Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones. On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI-agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator-side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.
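The latent-graph motivation can be made concrete with a toy example: a single failed trajectory rules out one path, while the goal state may still be reachable along another. The sketch below uses breadth-first search on a small, fully known graph. The graph, the state names, and the `reachable` helper are illustrative assumptions. In practice the graph is latent, and the abstract states that DiagEval does not reconstruct it, so this shows only the ambiguity, not the method.

```python
from collections import deque
from typing import Dict, List


def reachable(graph: Dict[str, List[str]], start: str, goal: str) -> bool:
    """Breadth-first search: is `goal` reachable from `start`?"""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Toy UI state-transition graph (hypothetical states and edges).
ui_graph = {
    "home": ["menu", "search"],
    "menu": ["settings"],
    "search": ["results"],
    "results": ["checkout"],
    "settings": [],
    "checkout": [],
}

# One realized rollout that never reaches the goal state.
failed_trajectory = ["home", "menu", "settings"]
trajectory_hit_goal = "checkout" in failed_trajectory

# The goal is still reachable via home -> search -> results -> checkout.
goal_is_reachable = reachable(ui_graph, "home", "checkout")
```

Here the rollout fails even though the software is correct, which is exactly the false-negative case the protocol targets.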
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. It frames correctness as a graph-level reachable property over latent UI state-transition graphs, notes that a single failed rollout leaves attribution ambiguous between evaluator-side execution error and genuine software defect, and proposes reusing the failed trajectory to select targeted diagnostic probes whose outcomes are aggregated into an internal attribution signal. The method is evaluated on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones, reporting recovery of 45.6-62.1% of initially misattributed failures (34.4-160.6% relative gains over retry baselines) and accuracy lifts from 69.9% to 78.3% and from 65.0% to 81.6%. Code is released at https://github.com/scutGit/DiagEval.
Significance. If the attribution signal reliably distinguishes error sources, the work usefully shifts attention from pure execution strength to active diagnosis in GUI-agent software evaluation. The concrete recovery percentages and accuracy numbers on two named benchmarks, together with the open-source implementation, constitute a reproducible empirical contribution that future work can build upon or stress-test. The latent-graph motivation provides a clean conceptual framing even though the method itself avoids graph reconstruction or calibrated posteriors.
major comments (1)
- [Evaluation] Evaluation section: The reported accuracy improvements (69.9%→78.3% on WebDevJudge-Unit; 65.0%→81.6% on RealDevBench) and recovery rates rest on DiagEval's aggregated diagnostic outcomes correctly labeling failures as evaluator-side rather than software defects. No independent oracle (human adjudication, exhaustive state exploration, or formal model) is used to validate these attributions; correctness is measured only against the same benchmark outcomes that define the gains. This is load-bearing for the central claim of reliable diagnosis.
minor comments (1)
- [Abstract] Abstract: A single sentence summarizing the aggregation heuristic or the typical number of diagnostic probes would help readers assess the method's scope without reading the full text.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comment on our evaluation methodology. We address the concern directly below.
- Referee: [Evaluation] Evaluation section: The reported accuracy improvements (69.9%→78.3% on WebDevJudge-Unit; 65.0%→81.6% on RealDevBench) and recovery rates rest on DiagEval's aggregated diagnostic outcomes correctly labeling failures as evaluator-side rather than software defects. No independent oracle (human adjudication, exhaustive state exploration, or formal model) is used to validate these attributions; correctness is measured only against the same benchmark outcomes that define the gains. This is load-bearing for the central claim of reliable diagnosis.
Authors: The WebDevJudge-Unit and RealDevBench benchmarks supply independent ground-truth labels that indicate whether each software instance is functionally correct or contains a genuine defect. These labels serve as the oracle for identifying misattributions: an initial evaluator failure on a ground-truth-correct instance is a false negative that DiagEval attempts to reclassify via trajectory-conditioned probes. The reported recovery percentages and accuracy gains are therefore measured against these external ground truths rather than against the initial evaluator outputs themselves. We acknowledge that the paper does not include separate human adjudication or exhaustive state exploration to validate the internal attribution signal in isolation. To strengthen the presentation we will add a dedicated limitations paragraph on this point and report a small-scale human review of attribution decisions on a random sample of cases in the revised manuscript.
Revision: partial
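The measurement the authors describe can be sketched as follows: false negatives are ground-truth-correct instances that the initial evaluator failed, and recovery is the share of those that pass after diagnosis. The `Case` fields, function names, and toy data are hypothetical, chosen only to show how recovery rate and accuracy relate to the benchmark labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Case:
    ground_truth_correct: bool  # benchmark label: software is functionally correct
    initial_pass: bool          # verdict from the first rollout
    final_pass: bool            # verdict after diagnosis


def recovery_rate(cases: List[Case]) -> float:
    """Share of initial false negatives that are reclassified as passing."""
    false_negatives = [c for c in cases if c.ground_truth_correct and not c.initial_pass]
    if not false_negatives:
        return 0.0
    return sum(c.final_pass for c in false_negatives) / len(false_negatives)


def accuracy(cases: List[Case], use_final: bool) -> float:
    """Agreement between the chosen verdict and the ground-truth label."""
    verdicts = [(c.final_pass if use_final else c.initial_pass) for c in cases]
    return sum(v == c.ground_truth_correct for v, c in zip(verdicts, cases)) / len(cases)


# Toy data (illustrative only, not benchmark results).
cases = [
    Case(True, True, True),     # correct software, passed from the start
    Case(True, False, True),    # false negative, recovered by diagnosis
    Case(True, False, False),   # false negative, not recovered
    Case(False, False, False),  # genuine defect, correctly failed
]

rate = recovery_rate(cases)
acc_before = accuracy(cases, use_final=False)
acc_after = accuracy(cases, use_final=True)
```

Note that this scores the final verdict against the benchmark label. It does not check whether the internal attribution signal was right for the right reason, which is the gap the referee raises.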
Circularity Check
No circularity: empirical results on external benchmarks with no derivations or self-referential reductions
The paper describes DiagEval as a heuristic diagnostic protocol that reuses failed trajectories for targeted probes and aggregates outcomes into an attribution signal. All reported gains (45.6-62.1% recovery, 69.9%→78.3% and 65.0%→81.6% accuracy) are direct empirical measurements on the external benchmarks WebDevJudge-Unit and RealDevBench across multiple evaluators and backbones. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The method explicitly avoids graph reconstruction or calibrated posteriors, so the central claims remain independent of any internal reduction to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Correctness is a graph-level reachable property over latent UI state-transition graphs, while an evaluator observes only a single execution trajectory.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We model the software under test as a latent state-transition graph G=(S,E)... single-trajectory identifiability gap
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  SOU 1-4... Type A/B/C probes... EIG(b;F_k) = H_bin(p(k)) - E[...]
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.