arxiv: 2605.17439 · v2 · pith:VGH4GKH4new · submitted 2026-05-17 · 💻 cs.SE · cs.AI

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords GUI agents · software evaluation · failure diagnosis · LLM evaluation · interactive software · trajectory analysis · error attribution · agent benchmarking

The pith

DiagEval reuses failed GUI trajectories to diagnose whether evaluation errors come from the agent or the software under test.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that correctness of interactive software is a property over latent UI state graphs, but a single observed trajectory leaves failure attribution ambiguous between genuine defects and evaluator execution mistakes. DiagEval addresses this by conditioning diagnostic probes on the failed trajectory and aggregating their results into an attribution signal, without attempting to reconstruct the full graph. This approach recovers a substantial fraction of initially misattributed failures and raises end-to-end accuracy on two evaluation suites. A sympathetic reader would care because reliable assessment of LLM-generated software currently depends on expensive retries or stronger executors; active diagnosis offers a lighter alternative.

Core claim

DiagEval is a trajectory-conditioned diagnostic evaluation protocol that, after a failed rollout, selects targeted diagnostic probes from the observed trajectory and aggregates their outcomes into an internal attribution signal that distinguishes evaluator-side execution error from genuine software defect.

What carries the argument

trajectory-conditioned diagnostic probes whose outcomes are aggregated into an attribution signal
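
One way to picture how probe outcomes could be aggregated into an attribution signal is a running log-odds score over Z ∈ {AGENTFAIL, ENVFAIL}. The sketch below is an illustration under stated assumptions, not the paper's method: the probe types, the per-type failure rates, and the names `ProbeResult` and `attribution_score` are all hypothetical. The paper does not claim calibrated posteriors, so the returned value should be read as an uncalibrated internal signal.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    probe_type: str  # hypothetical probe category, e.g. "A", "B" or "C"
    succeeded: bool  # True if the probe reached and verified the target state


# ASSUMPTION: made-up chance that a probe of each type still fails even though
# the software is correct (evaluator-side error). Not values from the paper.
ASSUMED_FAIL_RATE_IF_AGENTFAIL = {"A": 0.3, "B": 0.5, "C": 0.7}


def attribution_score(results: Iterable[ProbeResult], prior_envfail: float = 0.5) -> float:
    """Uncalibrated score in [0, 1]; higher means ENVFAIL looks more plausible."""
    log_odds = math.log(prior_envfail / (1.0 - prior_envfail))
    for result in results:
        if result.succeeded:
            # A verified success shows the target state is reachable,
            # so the original failure was on the evaluator side.
            return 0.0
        # Under ENVFAIL every probe is assumed to fail, so a failure carries a
        # likelihood ratio of 1 / rate in favour of ENVFAIL.
        log_odds -= math.log(ASSUMED_FAIL_RATE_IF_AGENTFAIL[result.probe_type])
    return 1.0 / (1.0 + math.exp(-log_odds))
```

In this toy model the same binary outcome (a failed probe) shifts the score by a different amount depending on probe type: a failure from a probe that rarely fails on correct software is stronger evidence of a genuine defect than a failure from an unreliable probe.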

If this is right

  • On false-negative cases DiagEval recovers 45.6-62.1 percent of failures initially blamed on software.
  • It outperforms retry-based baselines on that recovery, with relative gains of 34.4-160.6 percent.
  • Overall accuracy rises from 69.9 percent to 78.3 percent on WebDevJudge-Unit and from 65.0 percent to 81.6 percent on RealDevBench.
  • Reliable GUI-agent evaluation requires active failure diagnosis in addition to stronger execution.
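
The accuracy figures above are absolute percentages, while the baseline comparison is stated as a relative gain, so the two should not be read on the same scale. As a small worked example, the snippet below derives the absolute gain in percentage points and the relative gain for the two reported accuracy pairs. The helper names are illustrative, and the derived values are plain arithmetic on the reported numbers rather than figures quoted from the paper.

```python
def absolute_gain_pp(before: float, after: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after - before


def relative_gain_pct(before: float, after: float) -> float:
    """Improvement expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0


# Accuracy before and after diagnosis, as reported in the summary above.
reported = {
    "WebDevJudge-Unit": (69.9, 78.3),
    "RealDevBench": (65.0, 81.6),
}

for name, (before, after) in reported.items():
    print(
        f"{name}: +{absolute_gain_pp(before, after):.1f} pp, "
        f"{relative_gain_pct(before, after):.1f}% relative"
    )
```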

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-reuse idea could be tested on non-GUI agent evaluations where state is only partially observed.
  • If diagnostic probes can be generated automatically from trajectory logs, the method might generalize to domains without human-designed probe templates.
  • Improved attribution might allow evaluators to allocate compute more efficiently by retrying only when the signal indicates an execution error rather than a defect.

Load-bearing premise

The failed trajectory supplies enough information to select targeted diagnostic probes that can reliably separate execution errors from software defects.

What would settle it

Apply DiagEval to a benchmark where every failure has been manually labeled as either software defect or evaluator error; if the aggregated probe outcomes do not increase the fraction of correctly attributed cases beyond the no-probe baseline, the method does not work.
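
The test described above can be sketched as a comparison of attribution accuracy against manual labels. This is a minimal sketch under assumptions: the label strings follow the AGENTFAIL / ENVFAIL naming from the figure captions, the no-probe baseline is taken to be an evaluator that blames the software for every failure, and the function names are hypothetical.

```python
from typing import Sequence


def attribution_accuracy(predicted: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of failed cases whose predicted cause matches the manual label."""
    if len(predicted) != len(labels) or not labels:
        raise ValueError("predicted and labels must be non-empty and equally long")
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def probes_beat_baseline(probe_predictions: Sequence[str], labels: Sequence[str]) -> bool:
    """True if probe-based attribution is more accurate than the no-probe baseline."""
    # ASSUMPTION: without probes, every failure is attributed to the software.
    baseline = ["ENVFAIL"] * len(labels)
    return attribution_accuracy(probe_predictions, labels) > attribution_accuracy(baseline, labels)
```

If `probes_beat_baseline` returns `False` on a manually labeled benchmark, the probes add no attribution value over simply trusting the initial failure, which is the falsifying outcome named above.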

Figures

Figures reproduced from arXiv: 2605.17439 by Chenglin Wu, Sirui Hong, Tengfei Li, Wei Tao, Yifan Wu, Zhijie Liu.

Figure 1: Overview of DIAGEVAL. Given a failed rollout τ, DIAGEVAL parses a failure diagnostic summary (FDS), dispatches SOU-guided diagnostic branches, and integrates evidence across multiple branch trajectories to refine an internal attribution score over Z ∈ {AGENTFAIL, ENVFAIL}.
Figure 2: Comparison of post-failure retry strategies on “Click the upvote button for a post in a specific subreddit.” The initial test (left) and a naive retry (middle) both fail along essentially the same homepage trajectory. DIAGEVAL (right) uses the FDS to identify SOU-2 (Incomplete Observation), generates a targeted retry plan, and verifies success (1500 → 1501).
Figure 3: SOU-typed branch allocation and diagnostic outcomes. Each subfigure corresponds to one benchmark. The left panel shows the distribution of Type A/B/C probes within each SOU, and the right panels report FN recovery for Z=AGENTFAIL and TN flip for confirmed Z=ENVFAIL cases.
Figure 4: Attribution-score update across diagnostic branches in the running example. Each branch yields the same binary outcome (fail), but produces a different update magnitude due to branch-typed likelihoods γd, consistent with Eq. 3.
Figure 5: Cross-framework transfer results. ∆ denotes the absolute accuracy gain in percentage points over the corresponding single-pass baseline of the same GUI-agent framework.
Figure 6: Gemini / Claude call ratio per case across the FN retry pipeline. The ratio is computed as the average number of Gemini agent calls divided by the average number of Claude supervisor calls per case. Parenthetical values report the rounded underlying averages (Gemini avg. / Claude avg.). Gemini calls remain relatively stable across stages, while most variation in the ratio comes from Claude-side diagnostic …
Original abstract

Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones. On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI-agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator-side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. It frames correctness as a graph-level reachable property over latent UI state-transition graphs, notes that a single failed rollout leaves attribution ambiguous between evaluator-side execution error and genuine software defect, and proposes reusing the failed trajectory to select targeted diagnostic probes whose outcomes are aggregated into an internal attribution signal. The method is evaluated on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones, reporting recovery of 45.6-62.1% of initially misattributed failures (34.4-160.6% relative gains over retry baselines) and accuracy lifts from 69.9% to 78.3% and from 65.0% to 81.6%. Code is released at https://github.com/scutGit/DiagEval.

Significance. If the attribution signal reliably distinguishes error sources, the work usefully shifts attention from pure execution strength to active diagnosis in GUI-agent software evaluation. The concrete recovery percentages and accuracy numbers on two named benchmarks, together with the open-source implementation, constitute a reproducible empirical contribution that future work can build upon or stress-test. The latent-graph motivation provides a clean conceptual framing even though the method itself avoids graph reconstruction or calibrated posteriors.

major comments (1)
  1. [Evaluation] Evaluation section: The reported accuracy improvements (69.9%→78.3% on WebDevJudge-Unit; 65.0%→81.6% on RealDevBench) and recovery rates rest on DiagEval's aggregated diagnostic outcomes correctly labeling failures as evaluator-side rather than software defects. No independent oracle (human adjudication, exhaustive state exploration, or formal model) is used to validate these attributions; correctness is measured only against the same benchmark outcomes that define the gains. This is load-bearing for the central claim of reliable diagnosis.
minor comments (1)
  1. [Abstract] Abstract: A single sentence summarizing the aggregation heuristic or the typical number of diagnostic probes would help readers assess the method's scope without reading the full text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive comment on our evaluation methodology. We address the concern directly below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The reported accuracy improvements (69.9%→78.3% on WebDevJudge-Unit; 65.0%→81.6% on RealDevBench) and recovery rates rest on DiagEval's aggregated diagnostic outcomes correctly labeling failures as evaluator-side rather than software defects. No independent oracle (human adjudication, exhaustive state exploration, or formal model) is used to validate these attributions; correctness is measured only against the same benchmark outcomes that define the gains. This is load-bearing for the central claim of reliable diagnosis.

    Authors: The WebDevJudge-Unit and RealDevBench benchmarks supply independent ground-truth labels that indicate whether each software instance is functionally correct or contains a genuine defect. These labels serve as the oracle for identifying misattributions: an initial evaluator failure on a ground-truth-correct instance is a false negative that DiagEval attempts to reclassify via trajectory-conditioned probes. The reported recovery percentages and accuracy gains are therefore measured against these external ground truths rather than against the initial evaluator outputs themselves. We acknowledge that the paper does not include separate human adjudication or exhaustive state exploration to validate the internal attribution signal in isolation. To strengthen the presentation, we will add a dedicated limitations paragraph on this point and report a small-scale human review of attribution decisions on a random sample of cases in the revised manuscript.

    Revision: partial
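
The accounting the authors describe can be made concrete with a short sketch. It assumes each case carries a benchmark ground-truth label, the evaluator's initial verdict, and the verdict after diagnosis; the `Case` type, the function names, and the toy data are hypothetical and only illustrate how false-negative recovery and end-to-end accuracy relate to the external labels.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Case:
    software_correct: bool  # benchmark ground-truth label
    initial_pass: bool      # evaluator verdict after the first rollout
    final_pass: bool        # evaluator verdict after diagnostic probes


def accuracy(cases: Sequence[Case], use_final: bool) -> float:
    """Share of cases where the chosen verdict agrees with the ground truth."""
    verdicts = [(c.final_pass if use_final else c.initial_pass) for c in cases]
    return sum(v == c.software_correct for v, c in zip(verdicts, cases)) / len(cases)


def fn_recovery_rate(cases: Sequence[Case]) -> float:
    """Share of initial false negatives that end up as a pass after diagnosis."""
    false_negatives = [c for c in cases if c.software_correct and not c.initial_pass]
    if not false_negatives:
        return 0.0
    return sum(c.final_pass for c in false_negatives) / len(false_negatives)


# Made-up toy data: two false negatives, one of which is recovered.
toy_cases = [
    Case(software_correct=True, initial_pass=True, final_pass=True),
    Case(software_correct=True, initial_pass=False, final_pass=True),
    Case(software_correct=True, initial_pass=False, final_pass=False),
    Case(software_correct=False, initial_pass=False, final_pass=False),
]
```

Note that this measures the final verdict against the benchmark label. It does not check whether the internal attribution signal was right for the right reason, which is the gap the referee's comment points at.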

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks with no derivations or self-referential reductions

full rationale

The paper describes DiagEval as a heuristic diagnostic protocol that reuses failed trajectories for targeted probes and aggregates outcomes into an attribution signal. All reported gains (45.6-62.1% recovery, 69.9%→78.3% and 65.0%→81.6% accuracy) are direct empirical measurements on the external benchmarks WebDevJudge-Unit and RealDevBench across multiple evaluators and backbones. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The method explicitly avoids graph reconstruction or calibrated posteriors, so the central claims remain independent of any internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on a conceptual framing rather than new fitted parameters or invented physical entities.

axioms (1)
  • domain assumption: Correctness is a graph-level reachable property over latent UI state-transition graphs, while an evaluator observes only a single execution trajectory.
    Explicitly stated in the abstract as the key difficulty motivating the diagnostic problem.

pith-pipeline@v0.9.0 · 5829 in / 1307 out tokens · 44846 ms · 2026-05-20T12:59:54.123953+00:00 · methodology
