DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Chenglin Wu; Sirui Hong; Tengfei Li; Wei Tao; Yifan Wu; Zhijie Liu

arxiv: 2605.17439 · v2 · pith:VGH4GKH4new · submitted 2026-05-17 · 💻 cs.SE · cs.AI

Sirui Hong, Zhijie Liu, Tengfei Li, Wei Tao, Yifan Wu, Chenglin Wu

Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3

keywords: GUI agents, software evaluation, failure diagnosis, LLM evaluation, interactive software, trajectory analysis, error attribution, agent benchmarking

The pith

DiagEval reuses failed GUI trajectories to diagnose whether evaluation errors come from the agent or the software under test.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that correctness of interactive software is a property over latent UI state graphs, but a single observed trajectory leaves failure attribution ambiguous between genuine defects and evaluator execution mistakes. DiagEval addresses this by conditioning diagnostic probes on the failed trajectory and aggregating their results into an attribution signal, without attempting to reconstruct the full graph. This approach recovers a substantial fraction of initially misattributed failures and raises end-to-end accuracy on two evaluation suites. A sympathetic reader would care because reliable assessment of LLM-generated software currently depends on expensive retries or stronger executors; active diagnosis offers a lighter alternative.

Core claim

DiagEval is a trajectory-conditioned diagnostic evaluation protocol that, after a failed rollout, selects targeted diagnostic probes from the observed trajectory and aggregates their outcomes into an internal attribution signal that distinguishes evaluator-side execution error from genuine software defect.

What carries the argument

trajectory-conditioned diagnostic probes whose outcomes are aggregated into an attribution signal
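
One way to picture the aggregation step is a running log-odds score over Z ∈ {AGENTFAIL, ENVFAIL} that each probe outcome nudges up or down. The sketch below is illustrative only: the likelihood values, the `ProbeResult` type, and the function names are assumptions made for this example and are not taken from the paper. The abstract also states that DiagEval does not estimate calibrated posteriors, so the returned number should be read as an uncalibrated score.

```python
import math
from dataclasses import dataclass

# Hypothetical per-probe-type likelihoods (assumed values, not from the paper):
# (P(probe succeeds | AGENTFAIL), P(probe succeeds | ENVFAIL)).
LIKELIHOODS = {
    "A": (0.70, 0.05),
    "B": (0.50, 0.10),
    "C": (0.30, 0.15),
}


@dataclass
class ProbeResult:
    probe_type: str  # key into LIKELIHOODS
    succeeded: bool  # did the probe reach and verify the target state?


def update_log_odds(log_odds: float, result: ProbeResult) -> float:
    """Add one probe's log-likelihood ratio to the log-odds of AGENTFAIL."""
    p_agent, p_env = LIKELIHOODS[result.probe_type]
    if result.succeeded:
        return log_odds + math.log(p_agent / p_env)
    return log_odds + math.log((1 - p_agent) / (1 - p_env))


def attribution_score(results, prior_agentfail: float = 0.5) -> float:
    """Fold all probe outcomes into a score in (0, 1); higher favors AGENTFAIL."""
    log_odds = math.log(prior_agentfail / (1 - prior_agentfail))
    for result in results:
        log_odds = update_log_odds(log_odds, result)
    return 1 / (1 + math.exp(-log_odds))
```

Under these assumed likelihoods, a failed probe of type A lowers the score more than a failed probe of type C, which mirrors the behavior described for Figure 4: identical binary outcomes can carry different update magnitudes depending on the probe type.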

If this is right

On false-negative cases DiagEval recovers 45.6-62.1 percent of failures initially blamed on software.
It outperforms retry-based baselines, with relative gains of 34.4-160.6 percent.
Overall accuracy rises from 69.9 percent to 78.3 percent on WebDevJudge-Unit and from 65.0 percent to 81.6 percent on RealDevBench.
Reliable GUI-agent evaluation requires active failure diagnosis in addition to stronger execution.
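
To make the accuracy figures above easier to compare, the short calculation below converts each reported before/after pair into an absolute gain in percentage points and a relative gain. The helper name `gains` is made up for this example; only the four accuracy values come from the paper. These relative gains on overall accuracy are a different quantity from the 34.4-160.6 percent relative gains over retry baselines, which are reported on false-negative cases.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Overall accuracy (percent) before and after diagnosis, as reported.
REPORTED = {
    "WebDevJudge-Unit": (69.9, 78.3),
    "RealDevBench": (65.0, 81.6),
}

for name, (before, after) in REPORTED.items():
    abs_points, rel_percent = gains(before, after)
    print(f"{name}: +{abs_points} points ({rel_percent}% relative)")
# WebDevJudge-Unit: +8.4 points (12.0% relative)
# RealDevBench: +16.6 points (25.5% relative)
```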

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-reuse idea could be tested on non-GUI agent evaluations where state is only partially observed.
If diagnostic probes can be generated automatically from trajectory logs, the method might generalize to domains without human-designed probe templates.
Improved attribution might allow evaluators to allocate compute more efficiently by retrying only when the signal indicates an execution error rather than a defect.

Load-bearing premise

The failed trajectory supplies enough information to select targeted diagnostic probes that can reliably separate execution errors from software defects.

What would settle it

Apply DiagEval to a benchmark where every failure has been manually labeled as either software defect or evaluator error; if the aggregated probe outcomes do not increase the fraction of correctly attributed cases beyond the no-probe baseline, the method does not work.
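
The test described above can be sketched as a comparison of attribution accuracy with and without probes on hand-labeled failures. Everything here is hypothetical scaffolding: the `label` and `score` fields, the 0.5 threshold, and the toy cases are assumptions for illustration, and the no-probe baseline is assumed to blame every failed rollout on the software.

```python
from typing import Callable, Sequence


def attribution_accuracy(cases: Sequence[dict], predict: Callable[[dict], str]) -> float:
    """Fraction of failed cases whose predicted cause matches the manual label."""
    correct = sum(predict(case) == case["label"] for case in cases)
    return correct / len(cases)


def no_probe_baseline(case: dict) -> str:
    # Assumption: without diagnosis, every failure is attributed to the software.
    return "ENVFAIL"


def with_probes(case: dict, threshold: float = 0.5) -> str:
    # `score` is a hypothetical field holding the aggregated attribution signal.
    return "AGENTFAIL" if case["score"] >= threshold else "ENVFAIL"


# Toy, made-up cases; a real study would use manually adjudicated failures.
toy_cases = [
    {"label": "AGENTFAIL", "score": 0.9},
    {"label": "AGENTFAIL", "score": 0.4},
    {"label": "ENVFAIL", "score": 0.1},
    {"label": "ENVFAIL", "score": 0.2},
]

baseline_acc = attribution_accuracy(toy_cases, no_probe_baseline)
probe_acc = attribution_accuracy(toy_cases, with_probes)
print(f"no-probe: {baseline_acc:.2f}, with probes: {probe_acc:.2f}")
```

If `probe_acc` does not exceed `baseline_acc` on genuinely labeled data, the falsification condition stated above would be met.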

Figures

Figures reproduced from arXiv: 2605.17439 by Chenglin Wu, Sirui Hong, Tengfei Li, Wei Tao, Yifan Wu, Zhijie Liu.

**Figure 1.** Overview of DIAGEVAL. Given a failed rollout τ, DIAGEVAL parses a failure diagnostic summary (FDS), dispatches SOU-guided diagnostic branches, and integrates evidence across multiple branch trajectories to refine an internal attribution score over Z ∈ {AGENTFAIL, ENVFAIL}. […] G, but the evaluator fails to discover or verify it. ENVFAIL indicates that V_case = 0, i.e., the target state is genuinely unreachabl…

**Figure 2.** Comparison of post-failure retry strategies on “Click the upvote button for a post in a specific subreddit.” The initial test (left) and a naive retry (middle) both fail along essentially the same homepage trajectory. DIAGEVAL (right) uses the FDS to identify SOU-2 (Incomplete Observation), generates a targeted retry plan, and verifies success (1500 → 1501).

**Figure 3.** SOU-typed branch allocation and diagnostic outcomes. Each subfigure corresponds to one benchmark. The left panel shows the distribution of Type A/B/C probes within each SOU, and the right panels report FN recovery for Z=AGENTFAIL and TN flip for confirmed Z=ENVFAIL cases.

**Figure 4.** Attribution-score update across diagnostic branches in the running example. Each branch yields the same binary outcome (fail), but produces a different update magnitude due to branch-typed likelihoods γ_d, consistent with Eq. 3.

**Figure 5.** Cross-framework transfer results. ∆ denotes the absolute accuracy gain in percentage points over the corresponding single-pass baseline of the same GUI-agent framework.

**Figure 6.** Gemini / Claude call ratio per case across the FN retry pipeline. The ratio is computed as the average number of Gemini agent calls divided by the average number of Claude supervisor calls per case. Parenthetical values report the rounded underlying averages (Gemini avg. / Claude avg.). Gemini calls remain relatively stable across stages, while most variation in the ratio comes from Claude-side diagnostic …

Original abstract

Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones. On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI-agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator-side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DiagEval gives a practical heuristic for attributing GUI agent failures by reusing trajectories for targeted probes, with reported accuracy gains on two benchmarks, but the attribution correctness lacks independent verification.

The main point is that DiagEval reuses a failed GUI trajectory to pick targeted diagnostic probes and aggregates their outcomes into an attribution signal that separates evaluator execution errors from actual software defects. This produces concrete lifts: recovery of 45.6-62.1% of initially misattributed failures, and accuracy rising from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench, with relative gains of 34.4-160.6% over retry baselines. The latent-graph motivation is straightforward and explains why a single rollout leaves attribution ambiguous, without needing full graph reconstruction or calibrated posteriors.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. It frames correctness as a graph-level reachable property over latent UI state-transition graphs, notes that a single failed rollout leaves attribution ambiguous between evaluator-side execution error and genuine software defect, and proposes reusing the failed trajectory to select targeted diagnostic probes whose outcomes are aggregated into an internal attribution signal. The method is evaluated on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones, reporting recovery of 45.6-62.1% of initially misattributed failures (34.4-160.6% relative gains over retry baselines) and accuracy lifts from 69.9% to 78.3% and from 65.0% to 81.6%. Code is released at https://github.com/scutGit/DiagEval.

Significance. If the attribution signal reliably distinguishes error sources, the work usefully shifts attention from pure execution strength to active diagnosis in GUI-agent software evaluation. The concrete recovery percentages and accuracy numbers on two named benchmarks, together with the open-source implementation, constitute a reproducible empirical contribution that future work can build upon or stress-test. The latent-graph motivation provides a clean conceptual framing even though the method itself avoids graph reconstruction or calibrated posteriors.

major comments (1)

[Evaluation] Evaluation section: The reported accuracy improvements (69.9%→78.3% on WebDevJudge-Unit; 65.0%→81.6% on RealDevBench) and recovery rates rest on DiagEval's aggregated diagnostic outcomes correctly labeling failures as evaluator-side rather than software defects. No independent oracle (human adjudication, exhaustive state exploration, or formal model) is used to validate these attributions; correctness is measured only against the same benchmark outcomes that define the gains. This is load-bearing for the central claim of reliable diagnosis.

minor comments (1)

[Abstract] Abstract: A single sentence summarizing the aggregation heuristic or the typical number of diagnostic probes would help readers assess the method's scope without reading the full text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive comment on our evaluation methodology. We address the concern directly below.

Referee: [Evaluation] Evaluation section: The reported accuracy improvements (69.9%→78.3% on WebDevJudge-Unit; 65.0%→81.6% on RealDevBench) and recovery rates rest on DiagEval's aggregated diagnostic outcomes correctly labeling failures as evaluator-side rather than software defects. No independent oracle (human adjudication, exhaustive state exploration, or formal model) is used to validate these attributions; correctness is measured only against the same benchmark outcomes that define the gains. This is load-bearing for the central claim of reliable diagnosis.

Authors: The WebDevJudge-Unit and RealDevBench benchmarks supply independent ground-truth labels that indicate whether each software instance is functionally correct or contains a genuine defect. These labels serve as the oracle for identifying misattributions: an initial evaluator failure on a ground-truth-correct instance is a false negative that DiagEval attempts to reclassify via trajectory-conditioned probes. The reported recovery percentages and accuracy gains are therefore measured against these external ground truths rather than against the initial evaluator outputs themselves. We acknowledge that the paper does not include separate human adjudication or exhaustive state exploration to validate the internal attribution signal in isolation. To strengthen the presentation, we will add a dedicated limitations paragraph on this point and report a small-scale human review of attribution decisions on a random sample of cases in the revised manuscript.

revision: partial
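
The measurement described in this response can be sketched as follows. The `Case` record and its field names are hypothetical; the sketch only illustrates how false negatives, recovery rate, and overall accuracy would be derived from benchmark ground truth. As the referee notes, this scores final verdicts against labels and does not by itself validate the internal attribution signal.

```python
from dataclasses import dataclass


@dataclass
class Case:
    truly_correct: bool   # benchmark ground truth: software works as specified
    first_pass: bool      # initial evaluator verdict (True = judged correct)
    after_diag: bool      # verdict after the diagnostic stage


def recovery_rate(cases) -> float:
    """Share of false negatives (correct software, failed first pass) later passed."""
    false_negatives = [c for c in cases if c.truly_correct and not c.first_pass]
    if not false_negatives:
        return 0.0
    return sum(c.after_diag for c in false_negatives) / len(false_negatives)


def accuracy(cases, verdict_field: str) -> float:
    """Share of cases whose verdict in `verdict_field` matches ground truth."""
    hits = sum(getattr(c, verdict_field) == c.truly_correct for c in cases)
    return hits / len(cases)
```

For example, `accuracy(cases, "first_pass")` and `accuracy(cases, "after_diag")` would give the before and after figures for a given set of cases.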

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks with no derivations or self-referential reductions

full rationale

The paper describes DiagEval as a heuristic diagnostic protocol that reuses failed trajectories for targeted probes and aggregates outcomes into an attribution signal. All reported gains (45.6-62.1% recovery, 69.9%→78.3% and 65.0%→81.6% accuracy) are direct empirical measurements on the external benchmarks WebDevJudge-Unit and RealDevBench across multiple evaluators and backbones. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The method explicitly avoids graph reconstruction or calibrated posteriors, so the central claims remain independent of any internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on a conceptual framing rather than new fitted parameters or invented physical entities.

axioms (1)

domain assumption: Correctness is a graph-level reachable property over latent UI state-transition graphs, while an evaluator observes only a single execution trajectory.
Explicitly stated in the abstract as the key difficulty motivating the diagnostic problem.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear
Paper passage: We model the software under test as a latent state-transition graph G=(S,E)... single-trajectory identifiability gap

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear
Paper passage: SOU 1-4... Type A/B/C probes... EIG(b;F_k) = H_bin(p(k)) - E[...]
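
The second quoted passage refers to an expected-information-gain score built on binary entropy. The paper's exact expression is elided in the quote and is not reproduced here. As a point of reference, the sketch below implements the textbook expected information gain for a binary hypothesis and a binary probe outcome; the parameter names and the two-likelihood probe model are assumptions for illustration.

```python
import math


def h_bin(p: float) -> float:
    """Binary entropy in bits; defined as 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(p: float, succ_if_agent: float, succ_if_env: float) -> float:
    """Prior entropy minus expected posterior entropy for one binary probe.

    p             : current belief that the failure is AGENTFAIL
    succ_if_agent : assumed P(probe succeeds | AGENTFAIL)
    succ_if_env   : assumed P(probe succeeds | ENVFAIL)
    """
    p_succ = p * succ_if_agent + (1 - p) * succ_if_env
    p_fail = 1 - p_succ
    # Bayes posteriors for each outcome; fall back to the prior if an outcome is impossible.
    post_succ = p * succ_if_agent / p_succ if p_succ > 0 else p
    post_fail = p * (1 - succ_if_agent) / p_fail if p_fail > 0 else p
    expected_posterior = p_succ * h_bin(post_succ) + p_fail * h_bin(post_fail)
    return h_bin(p) - expected_posterior
```

Ranking candidate probes by this quantity prefers those whose outcome is expected to move the belief most; a probe that is equally likely to succeed under both hypotheses scores zero.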

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page doi:10.18653/v1/2023.emnlp-main.365 2023

[16] [16]

2025 , eprint=

Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking , author=. 2025 , eprint=

work page 2025

[17] [17]

2025 , eprint=

MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them , author=. 2025 , eprint=

work page 2025

[18] [18]

2025 , eprint=

AI Agents for Web Testing: A Case Study in the Wild , author=. 2025 , eprint=

work page 2025

[19] [19]

2025 , eprint=

WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning , author=. 2025 , eprint=

work page 2025

[20] [20]

2024 , eprint=

AI-powered software testing tools: A systematic review and empirical assessment of their features and limitations , author=. 2024 , eprint=

work page 2024

[21] [21]

2025 IEEE International Conference on Software Maintenance and Evolution (ICSME) , title=

Kaynak, Erg. 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME) , title=. 2025 , volume=

work page 2025

[22] [22]

Aria- UI : Visual Grounding for GUI Instructions

Yang, Yuhao and Wang, Yue and Li, Dongxu and Luo, Ziyang and Chen, Bei and Huang, Chao and Li, Junnan. Aria- UI : Visual Grounding for GUI Instructions. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1152

work page doi:10.18653/v1/2025.findings-acl.1152 2025

[23] [23]

GUIC ourse: From General Vision Language Model to Versatile GUI Agent

Chen, Wentong and Cui, Junbo and Hu, Jinyi and Qin, Yujia and Fang, Junjie and Zhao, Yue and Wang, Chongyi and Liu, Jun and Chen, Guirong and Huo, Yupeng and Yao, Yuan and Lin, Yankai and Liu, Zhiyuan and Sun, Maosong. GUIC ourse: From General Vision Language Model to Versatile GUI Agent. Proceedings of the 63rd Annual Meeting of the Association for Compu...

work page doi:10.18653/v1/2025.acl-long.1065 2025

[24] [24]

2025 , eprint=

ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search , author=. 2025 , eprint=

work page 2025

[25] [25]

2025 , eprint =

Zhu, Lianghui and Wang, Xinggang and Wang, Xinlong , booktitle =. 2025 , eprint =

work page 2025

[26] [26]

2023 , eprint=

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate , author=. 2023 , eprint=

work page 2023

[27] [27]

2025 , eprint=

RocketEval: Efficient Automated LLM Evaluation via Grading Checklist , author=. 2025 , eprint=

work page 2025

[28] [28]

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLM s-as-Judges

Thakur, Aman Singh and Choudhary, Kartik and Ramayapally, Venkat Srinik and Vaidyanathan, Sankaran and Hupkes, Dieuwke. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLM s-as-Judges. Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ). 2025

work page 2025

[29] [29]

2025 , eprint=

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge , author=. 2025 , eprint=

work page 2025

[30] [30]

2024 , url=

Dongping Chen and Ruoxi Chen and Shilin Zhang and Yaochen Wang and Yinuo Liu and Huichi Zhou and Qihui Zhang and Yao Wan and Pan Zhou and Lichao Sun , booktitle=. 2024 , url=

work page 2024

[31] [31]

2024 , eprint=

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge , author=. 2024 , eprint=

work page 2024

[32] [32]

Advances in Neural Information Processing Systems , author =

Judging. Advances in Neural Information Processing Systems , author =. 2023 , volume =

work page 2023

[33] [33]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Mle-bench: Evaluating machine learning agents on machine learning engineering , author=. arXiv preprint arXiv:2410.07095 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Seeclick: Harnessing gui grounding for advanced visual gui agents , year =

Cheng, Kanzhi and Sun, Qiushi and Chu, Yougang and Xu, Fangzhi and Li, Yantao and Zhang, Jianbing and Wu, Zhiyong , journal =. Seeclick: Harnessing gui grounding for advanced visual gui agents , year =

work page

[35] [35]

Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion , volume =

Ding, Yangruibo and Wang, Zijian and Ahmad, Wasi and Ding, Hantian and Tan, Ming and Jain, Nihal and Ramanathan, Murali Krishna and Nallapati, Ramesh and Bhatia, Parminder and Roth, Dan and others , journal =. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion , volume =

work page

[36] [36]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , url =

Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica , booktitle =. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , url =

work page

[37] [37]

2025 , note =

Effective harnesses for long-running agents , author =. 2025 , note =

work page 2025

[38] [38]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , url =

Tianyang Liu and Canwen Xu and Julian McAuley , booktitle =. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , url =

work page

[39] [39]

SWE-Lancer: Can Frontier LLMs Earn \ 1 Million from Real-World Freelance Software Engineering? , year =

Miserendino, Samuel and Wang, Michele and Patwardhan, Tejal and Heidecke, Johannes , journal =. SWE-Lancer: Can Frontier LLMs Earn \ 1 Million from Real-World Freelance Software Engineering? , year =

work page

[40] [40]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents , year =

Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others , journal =. UI-TARS: Pioneering Automated GUI Interaction with Native Agents , year =

work page

[41] [41]

Naturalcodebench: Examining coding performance mismatch on humaneval and natural user queries , year =

Zhang, Shudan and Zhao, Hanlin and Liu, Xiao and Zheng, Qinkai and Qi, Zehan and Gu, Xiaotao and Dong, Yuxiao and Tang, Jie , booktitle =. Naturalcodebench: Examining coding performance mismatch on humaneval and natural user queries , year =

work page

[42] [42]

Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions , year =

Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others , journal =. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions , year =

work page

[43] [43]

2024 , note =

Anthropic , title =. 2024 , note =

work page 2024

[44] [44]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Webvoyager: Building an end-to-end web agent with large multimodal models , author=. arXiv preprint arXiv:2401.13919 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Os-atlas: A foundation action model for generalist gui agents , author=. arXiv preprint arXiv:2410.23218 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Aguvis: Unified pure vision agents for autonomous gui interaction , author=. arXiv preprint arXiv:2412.04454 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents , author=. arXiv preprint arXiv:2410.05243 , year=. 2410.05243 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design , author=. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=. 2025 , publisher=. doi:10.1145/3706599.3719729 , url=

work page doi:10.1145/3706599.3719729 2025

[49] [49]

Gemini 2.5 Computer Use Model , author =

work page

[50] [50]

PlayCoder: Making LLM-Generated GUI Code Playable

Peng, Zhiyuan and Tao, Wei and Yin, Xin and Ying, Chenhao and Luo, Yuan and Guo, Yiwen , year =. 2604.19742 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

The Art of Building Verifiers for Computer Use Agents

The Art of Building Verifiers for Computer Use Agents , author =. arXiv preprint arXiv:2604.06240 , year =. doi:10.48550/arXiv.2604.06240 , url =. 2604.06240 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.06240

[52] [52]

HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents , author=. arXiv preprint arXiv:2604.17284 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

2026 , eprint=

GUITester: Enabling GUI Agents for Exploratory Defect Discovery , author=. 2026 , eprint=

work page 2026

[54] [54]

arXiv preprint arXiv:2505.21055 , year=

Agent-Environment Alignment via Automated Interface Generation , author=. arXiv preprint arXiv:2505.21055 , year=

work page arXiv

[55] [55]

2025 , eprint=

You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation , author=. 2025 , eprint=

work page 2025

[56] [56]

Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems , author=. arXiv preprint arXiv:2604.22708 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

2019 , eprint=

Distributionally Robust Reinforcement Learning , author=. 2019 , eprint=

work page 2019

[58] [58]

2025 , eprint=

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? , author=. 2025 , eprint=

work page 2025

[59] [59]

ACM Computing Surveys , volume=

From reactive to active sensing: A survey on information gathering in decision-theoretic planning , author=. ACM Computing Surveys , volume=. 2023 , publisher=

work page 2023

[60] [60]

2026 , month = feb, url =

work page 2026

[61] [61]

2025 , month = dec, url =

work page 2025