VisCritic: Visual State Comparison as Process Reward for GUI Agents

Jiachen Qian

arxiv: 2606.24525 · v1 · pith:2J6QV662new · submitted 2026-06-23 · 💻 cs.CV


Pith reviewed 2026-06-26 00:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords GUI agents · process reward · visual state comparison · Siamese vision transformer · action verification · vision-language models · screen state changes


The pith

Direct visual comparison of pre- and post-action screenshots verifies GUI agent actions and boosts benchmark performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents powered by vision-language models often fail in long tasks because they lack reliable step-level verification. Existing process reward models depend on textual reasoning, which overlooks the visual changes in GUI states. VisCritic addresses this by comparing screenshots before and after actions using a Siamese vision transformer to capture change-aware features. An Action-Aware Critic Head then assesses action success, task progress, and error types from these features. The framework trains on weakly supervised data from existing trajectories without new labels and improves results when added to various GUI agents.

Core claim

VisCritic is a visual process reward framework that verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space. It employs a Siamese vision transformer to extract change-aware representations, coupled with an Action-Aware Critic Head that jointly evaluates action success, task progress, and error type. A critic-training data construction pipeline generates weakly supervised samples from existing trajectories without additional human labels. Experiments across five benchmarks show it serves as a plug-and-play enhancement that generally improves metrics while providing visual diagnostic cues.
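The abstract does not say how the weak labels are derived from trajectories. As a purely illustrative sketch, one heuristic would treat each step of a completed trajectory as a positive sample and synthesize a negative by pairing the same screenshots with an action taken at a different step. The `Step` type, the `build_samples` helper, and the labeling rule are all assumptions made for this example, not the paper's pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    before: str  # identifier of the pre-action screenshot
    action: str  # action issued by the agent
    after: str   # identifier of the post-action screenshot


def build_samples(trajectory, task_succeeded, seed=0):
    """Return (before, action, after, label) tuples.

    label 1: the action matches the observed transition.
    label 0: the action was swapped in from another step (synthetic negative).
    """
    rng = random.Random(seed)
    samples = []
    # Assumption: only trajectories that completed the task are trusted.
    if not task_succeeded:
        return samples
    for i, step in enumerate(trajectory):
        samples.append((step.before, step.action, step.after, 1))
        # Candidate mismatched actions come from the other steps.
        others = [
            s.action
            for j, s in enumerate(trajectory)
            if j != i and s.action != step.action
        ]
        if others:
            samples.append((step.before, rng.choice(others), step.after, 0))
    return samples
```

No human annotation enters this loop, which is the property the abstract emphasizes; whether the paper uses a rule of this kind is not stated.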

What carries the argument

Siamese vision transformer paired with an Action-Aware Critic Head that processes change-aware representations from before-and-after screenshots to assess action success, task progress, and error type.
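To make the three joint outputs concrete, here is a minimal sketch of a head that maps a fused feature vector to an action-success probability, a task-progress score, and a distribution over error types. The concatenation-based fusion, the linear layers, the `critic_head` name, and the error label set are assumptions for illustration (only "wrong target" appears in the source); the paper's actual head is not specified on this page.

```python
import math

# Hypothetical label set; only "wrong target" is mentioned in the source.
ERROR_TYPES = ["none", "wrong_target", "no_effect"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def critic_head(change_vec, action_vec, params):
    """Jointly predict success, progress, and error type from fused features."""
    # Assumption: fusion is plain concatenation of the two vectors.
    fused = list(change_vec) + list(action_vec)
    success = sigmoid(dot(params["w_success"], fused))
    progress = sigmoid(dot(params["w_progress"], fused))
    error_probs = softmax([dot(w, fused) for w in params["w_error"]])
    return {
        "success": success,
        "progress": progress,
        "error_type": dict(zip(ERROR_TYPES, error_probs)),
    }
```

All three outputs read the same fused vector, which mirrors the shared-feature design the summary describes.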

If this is right

Serves as a plug-and-play enhancement for diverse GUI agents.
Generally improves benchmark metrics across the five tested benchmarks.
Provides visual diagnostic cues alongside the reward signals.
Trains using weakly supervised samples generated from existing trajectories without new human labels.
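One way to picture the plug-and-play claim is a wrapper that scores each executed step and asks the agent for another proposal when the critic's score is low. The loop below is a sketch under stated assumptions: `propose`, `execute`, and `critic` are hypothetical callables, the threshold is arbitrary, and `execute` is assumed to return a new state without mutating the old one, so a rejected step can simply be discarded. The page does not describe how VisCritic is actually wired into an agent.

```python
def run_with_critic(propose, execute, critic, state, max_retries=2, threshold=0.5):
    """Try up to max_retries + 1 proposals; accept the first one the critic approves.

    Returns (state, action, score). If every attempt is rejected, the original
    state is returned with action None and the score of the last attempt.
    """
    score = None
    for attempt in range(max_retries + 1):
        action = propose(state, attempt)
        new_state = execute(state, action)        # assumed side-effect free
        score = critic(state, action, new_state)  # compares before and after
        if score >= threshold:
            return new_state, action, score
    return state, None, score


# Toy usage: the first proposal hits the wrong button, the second is correct.
proposals = ["click(Share)", "click(Order)"]
final_state, chosen, chosen_score = run_with_critic(
    propose=lambda state, attempt: proposals[attempt],
    execute=lambda state, action: state + [action],
    critic=lambda before, action, after: 0.9 if action == "click(Order)" else 0.1,
    state=[],
    max_retries=1,
)
```

In a real GUI environment a rejected action has already changed the screen, so a practical integration would need an undo or backtracking step that this sketch leaves out.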

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The visual approach could be combined with text-based rewards to create hybrid verification for agents that handle both visual and semantic errors.
Similar before-and-after image comparison might extend to agent tasks in robotics or mobile apps where state changes are also visual.
Agents trained with these visual rewards might handle longer sequences better by focusing on observable state transitions rather than inferred text descriptions.

Load-bearing premise

That direct visual feature comparison of screenshots can reliably verify action success, task progress, and error type without textual reasoning.

What would settle it

A test measuring whether VisCritic's visual scores match human judgments of action outcomes on held-out trajectories, or an ablation where removing the visual comparison eliminates the reported benchmark gains.
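The first test could be scored with raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes binary human labels (1 for a successful action, 0 otherwise) and a 0.5 decision threshold on the critic's score; both choices, and the `agreement` helper itself, are illustrative rather than taken from the paper.

```python
def agreement(critic_scores, human_labels, threshold=0.5):
    """Return (raw agreement, Cohen's kappa) between thresholded scores and labels."""
    if len(critic_scores) != len(human_labels) or not critic_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    preds = [1 if s >= threshold else 0 for s in critic_scores]
    n = len(preds)
    observed = sum(p == h for p, h in zip(preds, human_labels)) / n
    p_pred = sum(preds) / n
    p_human = sum(human_labels) / n
    # Chance agreement if the two raters were independent.
    expected = p_pred * p_human + (1 - p_pred) * (1 - p_human)
    if expected == 1.0:
        # Both raters use a single class; kappa is undefined, report 1.0 by convention.
        return observed, 1.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

For scores `[0.9, 0.2, 0.8, 0.4]` against labels `[1, 0, 0, 0]`, raw agreement is 0.75 while kappa is 0.5, which shows how much of the apparent agreement chance alone would explain.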

Figures

Figures reproduced from arXiv: 2606.24525 by Jiachen Qian.

**Figure 1.** Motivation example (schematic illustration). A GUI agent accidentally clicks “Share” instead of “Order” (left → middle). VisCritic detects the error by comparing screenshots: the attention map highlights the unexpected popup, and the critic predicts low success with “wrong target” classification (right). Recent work has begun to address this gap through process reward models [17, 37] and error detection mec…

**Figure 2.** Overview of the VisCritic framework. Given a pre-action screenshot $s_t$ and post-action screenshot $s_{t+1}$, the Visual Difference Encoder (VDE) extracts patch-level semantic differences via a shared ViT encoder, computes a change magnitude map, and applies change region attention to produce the attended difference vector $v_\Delta$. The Action-Aware Critic Head fuses $v_\Delta$ with the action $a_t$ and task instruction $l$ to pred…

**Figure 3.** Critic-training data construction pipeline.

**Figure 4.** Conceptual illustration of change region attention patterns.
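The Figure 2 caption outlines three steps in the Visual Difference Encoder: patch-level differences, a change magnitude map, and attention over changed regions. A minimal sketch of that flow follows, assuming the per-patch features have already been produced by a shared encoder. The L2 norm as the magnitude and a softmax over magnitudes as the attention are assumptions chosen for illustration; the caption does not give the exact formulas, and `attended_difference` is a hypothetical name.

```python
import math


def attended_difference(feats_before, feats_after):
    """Pool per-patch feature differences, weighting patches by how much they changed.

    feats_before / feats_after: lists of per-patch feature vectors of equal shape.
    Returns (pooled difference vector, per-patch magnitudes, attention weights).
    """
    # Patch-level differences between post-action and pre-action features.
    diffs = [
        [a - b for a, b in zip(patch_after, patch_before)]
        for patch_before, patch_after in zip(feats_before, feats_after)
    ]
    # Change magnitude map: one scalar per patch (assumed here to be the L2 norm).
    magnitudes = [math.sqrt(sum(d * d for d in diff)) for diff in diffs]
    # Attention over patches: softmax of the magnitudes (an assumption).
    m = max(magnitudes)
    exps = [math.exp(x - m) for x in magnitudes]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attended difference vector: attention-weighted sum of patch differences.
    dim = len(diffs[0])
    pooled = [sum(w * diff[k] for w, diff in zip(weights, diffs)) for k in range(dim)]
    return pooled, magnitudes, weights
```

With three patches where only the last one changes, almost all of the attention mass lands on that patch, so the pooled vector is dominated by the region that actually moved.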

Original abstract

GUI agents powered by vision-language models show strong potential for automating digital tasks, yet frequently fail in long-horizon scenarios due to the absence of step-level verification. Existing process reward models verify actions through textual reasoning alone, missing the visual nature of GUI state changes. We introduce VisCritic, a visual process reward framework that verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space. VisCritic employs a Siamese vision transformer to extract change-aware representations, coupled with an Action-Aware Critic Head that jointly evaluates action success, task progress, and error type. A critic-training data construction pipeline generates weakly supervised samples from existing trajectories without additional human labels for critic training. Experiments and offline analyses across five benchmarks demonstrate that VisCritic serves as a plug-and-play enhancement for diverse GUI agents, generally improving benchmark metrics while providing visual diagnostic cues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VisCritic's Siamese ViT for visual state comparison in GUI process rewards is a reasonable idea but the abstract gives almost no evidence the gains are real.


The main point is that this paper replaces text-only reasoning in process reward models with direct visual comparison of before-and-after screenshots via a Siamese vision transformer plus an action-aware head. That shift makes sense for GUI work where state is inherently visual.

It does a clean job laying out the limitation in existing textual critics and shows how weakly supervised trajectory data can train the critic without fresh labels. The joint prediction of success, progress, and error type from the same visual features is a practical design choice.

The soft spot is the complete absence of any experimental detail. The abstract says it improves five benchmarks and gives visual diagnostics, yet supplies no numbers, no baselines, no splits, and no error bars. That leaves the central claim unsupported on the page. The assumption that raw visual feature differences will reliably flag action outcomes also looks thin without ablation or failure-case analysis.

This is for people already building or evaluating VLM-based GUI agents who need a drop-in reward signal. A reader already familiar with the textual reward literature would see the contrast quickly.

It deserves peer review because the problem is concrete and the proposed mechanism is distinct enough to test. The experiments will decide whether it holds up.

I would send it out rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces VisCritic, a visual process reward framework for GUI agents. It employs a Siamese vision transformer to extract change-aware representations by directly comparing pre- and post-action screenshots, paired with an Action-Aware Critic Head that jointly assesses action success, task progress, and error type. The model is trained via a weakly supervised data construction pipeline on existing trajectories without additional human labels, and the abstract reports that it acts as a plug-and-play enhancement improving metrics across five benchmarks while supplying visual diagnostic cues.

Significance. If the empirical claims hold after proper validation, the work could meaningfully extend process reward modeling for GUI agents by shifting from purely textual reasoning to direct visual state comparison, potentially improving reliability in long-horizon visual tasks where textual PRMs fall short.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: The manuscript asserts improvements across five benchmarks and offline analyses, yet supplies no information on experimental setup, chosen baselines, error bars, data splits, or statistical testing. This absence leaves the central empirical claim, on which the plug-and-play enhancement assertion rests, unverifiable.
[Method] Method section (Siamese ViT + Action-Aware Critic Head): The design premise that visual feature comparison alone suffices to reliably verify action success, task progress, and error type without textual reasoning is central to the contribution, but the provided description offers no ablations or targeted validation demonstrating robustness against visual ambiguities common in GUI screenshots.

minor comments (1)

[Abstract] The abstract would be clearer if it named the five benchmarks and briefly indicated the magnitude of reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and completeness of the manuscript.


Referee: [Abstract / Experiments] Abstract and Experiments section: The manuscript asserts improvements across five benchmarks and offline analyses, yet supplies no information on experimental setup, chosen baselines, error bars, data splits, or statistical testing. This absence leaves the central empirical claim, on which the plug-and-play enhancement assertion rests, unverifiable.

Authors: We acknowledge that the current presentation of experimental details is insufficient for full verifiability. While the manuscript references five benchmarks and describes the overall evaluation, it does not explicitly detail error bars, statistical testing procedures, precise data splits, or a consolidated list of baselines. We will revise the Experiments section to add a dedicated 'Experimental Setup' subsection that reports these elements, including any multi-run statistics and baseline specifications, to make the empirical claims transparent and reproducible.

revision: yes

Referee: [Method] Method section (Siamese ViT + Action-Aware Critic Head): The design premise that visual feature comparison alone suffices to reliably verify action success, task progress, and error type without textual reasoning is central to the contribution, but the provided description offers no ablations or targeted validation demonstrating robustness against visual ambiguities common in GUI screenshots.

Authors: The referee correctly notes the absence of ablations supporting the core design choice. The manuscript describes the Siamese ViT and Action-Aware Critic Head but does not include targeted experiments isolating the contribution of visual comparison or testing against common GUI ambiguities such as icon similarity or dynamic UI elements. We will add an ablation study subsection in the revised manuscript, including variants that remove the Siamese structure or introduce controlled visual perturbations, to demonstrate robustness.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents VisCritic as a Siamese ViT-based visual comparator trained on weakly supervised trajectory data to produce process rewards. No load-bearing step reduces by construction to its own inputs: the model architecture, data construction pipeline, and evaluation on external benchmarks are described as independent supervised learning components without self-definitional equations, renamed fitted parameters presented as predictions, or uniqueness claims imported solely via self-citation. The central claim of plug-and-play improvement rests on empirical results rather than tautological re-derivation of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.



Reference graph

Works this paper leans on

45 extracted references · 1 canonical work pages

[1] Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: ECCV Workshops (2016)
[2] Cao, Y., Zhong, Y., Zeng, Z., Zheng, L., Huang, J., Qiu, H., Shi, P., Mao, W., Wan Guanglu: MobileDreamer: Generative sketch world model for GUI agent. arXiv preprint arXiv:2601.04035 (2026)
[3] Chae, H., Kim, S., Cho, J., Kim, S., Moon, S., Hwangbo, G., Lim, D., Kim, M., Hwang, Y., Gwak, M., Choi, D., Kang, M., Im, G., Cho, B., Kim, H., Han, J., Kwon, T., Kim, M., Kwak, B.w., Kang, D., Yeo, J.: Web-Shepherd: Advancing PRMs for reinforcing web agents. In: NeurIPS (2025)
[4] Chen, C., Ji, K., Zhong, H., Zhu, M., Li, A., Gan, G., Huang, Z., Zou, C., Liu, J., Chen, J., Chen, H., Shen, C.: GUI-Shepherd: Reliable process reward and verification for long-sequence GUI tasks. arXiv preprint arXiv:2509.23738 (2025)
[5] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
[6] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Zhang, K., Wang, L., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D.,... arXiv preprint arXiv:2412.05271 (2024)
[7] Cheng, K., Sun, Q., Chu, Y., Xu, F., YanTao, L., Zhang, J., Wu, Z.: SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In: ACL (2024)
[8] Daudt, R.C., Saux, B.L., Boulch, A.: Fully convolutional siamese networks for change detection. In: IEEE International Conference on Image Processing (ICIP) (2018)
[9] Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2Web: Towards a generalist agent for the web. In: NeurIPS (2023)
[10] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
[11] Favreau, P.L., Lo, J.P., Guiguet, C., Simon-Meunier, C., Dehandschoewercker, N., Roush, A.G., Goldfeder, J., Shwartz-Ziv, R.: Do multi-agents dream of electric screens? achieving perfect accuracy on AndroidWorld through task decomposition. arXiv preprint arXiv:2602.07787 (2026)
[12] Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., Su, Y.: Navigating the digital world as humans do: Universal visual grounding for GUI agents. In: ICLR (2025)
[13] Guan, Y., Yu, R., Zhang, J., Wang, L., Zhang, C., Li, L., Qiao, B., Qin, S., Huang, H., Yang, F., Zhao, P., Wutschitz, L., Kessler, S., Inan, H.A., Sim, R., Rajmohan, S., Lin, Q., Zhang, D.: Computer-using world model. arXiv preprint arXiv:2602.17365 (2026)
[14] Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., Faust, A.: A real-world WebAgent with planning, long context understanding, and program synthesis. In: ICLR (2024)
[15] Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., Tang, J.: CogAgent: A visual language model for GUI agents. In: CVPR (2024)
[16] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
[17] Hu, Z., Xiong, S., Zhang, Y., Ng, S.K., Luu, A.T., An, B., Yan, S., Hooi, B.: Guiding VLM agents with process rewards at inference time for GUI navigation. arXiv preprint arXiv:2504.16073 (2025)
[18] Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)
[19] Liao, Y.H., Mahmood, R., Fidler, S., Acuna, D.: Can large vision-language models correct semantic grounding errors by themselves? In: CVPR (2025)
[20] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let’s verify step by step. In: ICLR (2024)
[21] Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: ShowUI: One vision-language-action model for GUI visual agent. In: CVPR (2025)
[22] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
[23] Nong, S., Tang, X., Xu, J., Zhou, S., Chen, J., Jiang, T., Xu, W.: CRAFT-GUI: Curriculum-reinforced agent for GUI tasks. arXiv preprint arXiv:2508.11360 (2025)
[24] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
[25] OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[26] Qian, J.: Visual inception: Compromising long-term planning in agentic recommenders via multimodal memory poisoning. arXiv preprint arXiv:2604.16966 (2026)
[27] Qian, J., Kang, Z.: Penny wise, pixel foolish: Bypassing price constraints in multimodal agents via visual adversarial perturbations. arXiv preprint arXiv:2604.16515 (2026)
[28] Qwen Team: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
[29] Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W., Li, W., Campbell-Ajala, F., Toyama, D., Berry, R., Tyamagundlu, D., Lillicrap, T., Riva, O.: AndroidWorld: A dynamic benchmarking environment for autonomous agents. In: ICLR (2025)
[30] Rawles, C., Li, A., Rodriguez, D., Riva, O., Lillicrap, T.: AndroidInTheWild: A large-scale dataset for android device control. In: NeurIPS (2023)
[31] Shi, W., Zhang, M., Zhang, R., Chen, S., Zhan, Z.: Change detection based on artificial intelligence: State-of-the-art and challenges. Remote Sensing 12(10), 1688 (2020)
[32] Tang, S., Chen, R., Lan, T.: Agent Alpha: Tree search unifying generation, exploration and evaluation for computer-use agents. arXiv preprint arXiv:2602.02995 (2026)
[33] Wang, J., Xu, H., Ye, J., Yan, M., Shen, W., Zhang, J., Huang, F., Sang, J.: Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158 (2024)
[34] Wanyan, Y., Zhang, X., Xu, H., Liu, H., Wang, J., Ye, J., Kou, Y., Yan, M., Huang, F., Yang, X., Dong, W., Xu, C.: Look before you leap: A GUI-critic-R1 model for pre-operative error diagnosis in GUI automation. In: NeurIPS (2025)
[35] Wu, Q., Gao, P., Liu, W., Luan, J.: BacktrackAgent: Enhancing GUI agent with error detection and backtracking mechanism. In: EMNLP (2025)
[36] Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., Yu, T.: OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In: NeurIPS (2024)
[37] Xiong, T., Hu, X., Chen, Y., Liu, Y., Wu, C., Gao, P., Liu, W., Luan, J., Zhang, S.: GUI-PRA: Process reward agent for GUI tasks. arXiv preprint arXiv:2509.23263 (2025)
[38] Yu, X., Peng, B., Galley, M., Cheng, H., Wu, Q., Kulkarni, J., Nath, S., Yu, Z., Gao, J.: Dyna-Mind: Learning to simulate from experience for better AI agents. arXiv preprint arXiv:2510.09577 (2025)
[39] Yuan, X., Zhang, J., Li, K., Cai, Z., Yao, L., Chen, J., Wang, E., Hou, Q., Chen, J., Jiang, P.T., Li, B.: SE-GUI: Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning. In: NeurIPS (2025)
[40] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV (2023)
[41] Zhang, C., Yang, Z., Liu, J., Li, Y., Han, Y., Chen, X., Huang, Z., Fu, B., Yu, G.: AppAgent: Multimodal agents as smartphone users. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (2025). https://doi.org/10.1145/3706598.3713600
[42] Zheng, C., Mo, X., Ma, X., Lin, Q., Zhao, Y., Zhu, J., Lou, X., Wang, J., Wang, Z., Liu, W., Zhang, Z., Yu, Y., Zhang, W.: Adaptive milestone reward for GUI agents. arXiv preprint arXiv:2602.11524 (2026)
[43] Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., Wang, Y.X.: Language agent tree search unifies reasoning, acting, and planning in language models. In: ICML (2024)
[44] Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., Neubig, G.: WebArena: A realistic web environment for building autonomous agents. In: ICLR (2024)
[45] Zhou, Y., Dai, S., Wang, S., Zhou, K., Jia, Q., Xu, J.: GUI-G1: Understanding R1-zero-like training for visual grounding in GUI agents. In: NeurIPS (2025)
