RISK: A Framework for GUI Agents in E-commerce Risk Management

Jianming Guo; Jingzhe Zhu; Qingqing Sun; Renqi Chen; Shuai Chen; Tianyi Zhang; Yiheng Peng; Zeyin Tao

arxiv: 2509.21982 · v2 · submitted 2025-09-26 · 💻 cs.AI · cs.CL

Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, Shuai Chen

Pith reviewed 2026-05-18 13:24 UTC · model grok-4.3

keywords: GUI agents · e-commerce risk management · reinforcement fine-tuning · multi-step web interaction · agent benchmark · trajectory dataset · web automation

The pith

A reinforcement fine-tuning framework equips small GUI agents to handle multi-step e-commerce risk interactions that defeat general agents and scraping tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that domain-specific data collection, benchmarking, and reward engineering can turn GUI agents into practical tools for e-commerce risk management, where tasks require sustained stateful navigation through dynamic web interfaces. A sympathetic reader would care because risk assessment depends on pulling together scattered, interactive data that current automated methods miss, and a working solution could replace large amounts of manual review. RISK-R1 demonstrates this by delivering measurable gains in both single-step and multi-step settings while using a fraction of the parameters of prior models. The central mechanism is a four-part reward structure that enforces correct output formats, rewards progress at step and task levels, and reweights outcomes to keep the agent on track across long trajectories.

Core claim

RISK supplies 8,492 single-step and 2,386 multi-step trajectories in RISK-Data, a 1,122-trajectory benchmark in RISK-Bench spanning three difficulty levels, and RISK-R1, which applies R1-style reinforcement fine-tuning under four explicit constraints and rewards: output format, single-step level, multi-step level, and task-level reweighting. When trained this way, the resulting agents improve offline single-step performance by 6.8 percent and multi-step performance by 8.8 percent relative to the prior state-of-the-art while using only 7.2 percent of its parameter count, and they reach 70.5 percent task success in live online evaluation.

What carries the argument

RISK-R1, a reinforcement fine-tuning procedure that adds output-format constraints together with single-step, multi-step, and task-level rewards to steer GUI agents through stateful web sequences.
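
To make the four-part structure concrete, here is a minimal Python sketch of how such signals might be combined. Everything in it is an illustrative assumption rather than the paper's implementation: the `<think>`/`<answer>` template, the `Step` fields, the partial-credit values, and the difficulty weights are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical R1-style template: reasoning in <think>, action in <answer>.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)

# Assumed task-level weights keyed by difficulty; the values are made up.
DIFFICULTY_WEIGHT = {"easy": 0.8, "medium": 1.0, "hard": 1.2}


@dataclass(frozen=True)
class Step:
    action_type: str  # e.g. "click" or "type" (placeholder names)
    target: str       # element or text the action applies to


def format_reward(raw_output: str) -> float:
    """Return 1.0 if the raw output follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(raw_output.strip()) else 0.0


def step_reward(pred: Step, gold: Step) -> float:
    """Single-step signal: 0.5 for the right action type, 1.0 if the target also matches."""
    if pred.action_type != gold.action_type:
        return 0.0
    return 1.0 if pred.target == gold.target else 0.5


def total_reward(raw_outputs, preds, golds, difficulty: str) -> float:
    """Multi-step signal with task-level reweighting.

    Each step counts only if its output is well formed; the mean over the
    gold trajectory is then scaled by the difficulty weight.
    """
    if not golds:
        return 0.0
    gated = [
        step_reward(p, g) * format_reward(o)
        for o, p, g in zip(raw_outputs, preds, golds)
    ]
    return DIFFICULTY_WEIGHT[difficulty] * sum(gated) / len(golds)
```

Gating the step reward on format is one possible design choice; an additive format bonus would be an equally plausible reading of "output-format constraints".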

If this is right

GUI agents become viable for any risk workflow that requires repeated navigation and state tracking rather than one-shot scraping.
Model size can be reduced dramatically while still outperforming larger baselines once domain rewards are applied.
RISK-Bench provides a fixed yardstick that lets future work measure progress on single-step versus multi-step web risk tasks.
The same data-plus-reward recipe can be reused to automate other embedded e-commerce processes that involve dynamic content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward layering could be ported to GUI agents that collect compliance or pricing data on retail sites.
If the online success rate holds under production traffic, the framework could cut the volume of human risk reviewers needed per merchant.
Replicating the high-fidelity browser collection pipeline on other verticals would test how much of the gain comes from domain data versus the reward design itself.

Load-bearing premise

The trajectories and four-aspect reward design collected for RISK-Data and RISK-R1 are assumed to transfer to real-world e-commerce risk sites without large distribution shifts or reward hacking.

What would settle it

A sharp fall in task success rate when the trained RISK-R1 agents are deployed on live e-commerce platforms whose page structures or risk workflows differ substantially from those appearing in RISK-Bench.

Figures

Figures reproduced from arXiv: 2509.21982 by Jianming Guo, Jingzhe Zhu, Qingqing Sun, Renqi Chen, Shuai Chen, Tianyi Zhang, Yiheng Peng, Zeyin Tao.

**Figure 2.** Data construction process for GUI agents in e-commerce risk management. By leveraging …

**Figure 3.** Action type distribution in RISK-Data, which includes 13 action types.

**Figure 4.** RISK-R1 framework. Our framework comprises four key components (format reward, …

**Figure 5.** Comparison of KL divergence curves

**Figure 7.** Stepwise accuracy reward analysis. Peak performance is achieved at the combination of stepwise and binary accuracy rewards.

**Figure 8.** Token count distribution and step count distribution of multi-step trajectories, where we …

**Figure 9.** Weight curve for process reweighting. We use a stepwise reward in the first epoch and a binary reward in the remaining epochs. During inference, we deploy the vLLM engine (Kwon et al., 2023) with a temperature of 0 to generate deterministic responses. Training Datasets and Evaluation Benchmarks. In SFT, we use all single-step and multi-step trajectories in RISK-Data for training, where the maximum of image…
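
The weight curve in Figure 9 is described on this page but never defined. As a hedged illustration, the sketch below uses a shifted sigmoid, w(p) = γ + (1 − γ)·σ(2δ(p − 0.5)), which is an assumption of this note and not the paper's stated formula. It happens to reproduce the discrete weights that survive in the extracted plot text for the panel labelled γ = 0.4, δ = 4 when progress is sampled at i/13 for i = 1..12; that sampling grid is likewise a guess.

```python
import math


def process_weight(progress: float, gamma: float, delta: float) -> float:
    """Assumed curve: floor `gamma`, rising toward 1 via a sigmoid centred at 50% progress."""
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * delta * (progress - 0.5)))
    return gamma + (1.0 - gamma) * sigmoid


# Weights recovered from the extracted plot text (panel gamma = 0.4, delta = 4).
EXTRACTED = [0.420, 0.435, 0.462, 0.506, 0.571, 0.654,
             0.746, 0.829, 0.894, 0.938, 0.965, 0.980]

# Hypothetical sampling grid: the 12 interior points of a 13-way split.
fitted = [process_weight(i / 13, gamma=0.4, delta=4.0) for i in range(1, 13)]
max_gap = max(abs(f - e) for f, e in zip(fitted, EXTRACTED))
print(f"largest gap to extracted weights: {max_gap:.4f}")
```

With γ = 1 the same function is flat at 1, which agrees with the other panel whose values survive in the extraction.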

Original abstract

E-commerce risk management requires aggregating diverse, deeply embedded web data through multi-step, stateful interactions, which traditional scraping methods and most existing Graphical User Interface (GUI) agents cannot handle. These agents are typically limited to single-step tasks and lack the ability to manage dynamic, interactive content critical for effective risk assessment. To address this challenge, we introduce RISK, a novel framework designed to build and deploy GUI agents for this domain. RISK integrates three components: (1) RISK-Data, a dataset of 8,492 single-step and 2,386 multi-step interaction trajectories, collected through a high-fidelity browser framework and a meticulous data curation process; (2) RISK-Bench, a benchmark with 802 single-step and 320 multi-step trajectories across three difficulty levels for standardized evaluation; and (3) RISK-R1, an R1-style reinforcement fine-tuning framework considering four aspects: (i) Output Format Constraint, (ii) Single-step and (iii) Multi-step Level Reward, and (iv) Task Level Reweight. Experiments show that RISK-R1 achieves a 6.8% improvement in offline single-step and an 8.8% improvement in offline multi-step, using only 7.2% of the parameters of the SOTA baseline. Moreover, it attains a top task success rate of 70.5% in online evaluation. RISK provides a scalable, domain-specific solution for automating complex web interactions in e-commerce risk management. The code is available at https://github.com/RenqiChen/RISK-GUI.
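
The counts in the abstract can be cross-checked with a few lines of arithmetic. The snippet below only restates the published numbers; the dictionary layout and function name are illustrative.

```python
# Trajectory counts as stated in the abstract.
RISK_DATA = {"single_step": 8_492, "multi_step": 2_386}
RISK_BENCH = {"single_step": 802, "multi_step": 320}


def summarize(name, counts):
    """Print and return the total size and the multi-step share of a split."""
    total = sum(counts.values())
    multi_share = counts["multi_step"] / total
    print(f"{name}: {total:,} trajectories, {multi_share:.1%} multi-step")
    return total, multi_share


data_total, data_share = summarize("RISK-Data", RISK_DATA)
bench_total, bench_share = summarize("RISK-Bench", RISK_BENCH)
```

The benchmark total agrees with the 1,122-trajectory figure quoted in the core claim, and the benchmark carries a somewhat larger multi-step share than the training data.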

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a targeted e-commerce risk dataset and small-model RL recipe that reports practical gains, but the evaluation leaves generalization and statistical details thin.

The main thing to know is that this paper supplies a new dataset of e-commerce risk trajectories and a four-part RL fine-tuning method that beats a larger baseline on offline tasks while reaching 70.5 percent task success online with 7.2 percent of the baseline's parameters. They collected 8,492 single-step and 2,386 multi-step examples through a high-fidelity browser setup, carved out RISK-Bench with three difficulty levels, and trained RISK-R1 using rewards for output format, single steps, multi-step flow, and task reweighting. The code release is a clear plus for anyone who wants to inspect or extend it.

This fills a real gap because most GUI agent papers stay general, while risk work needs reliable multi-step navigation on dynamic pages. The reported lifts of 6.8 and 8.8 percent offline look usable for the domain.

The soft spots sit in the evaluation. The abstract gives no error bars, no statistical tests, and no ablation on the individual reward pieces or baseline implementation details. Because the benchmark trajectories come from the same collection pipeline as the training data, the gains could partly reflect distribution match rather than robustness to live site changes. That generalization question is the one worth pressing.

This paper is for applied researchers or industry teams building web agents for compliance and risk tasks. A reader who needs concrete data or a starting recipe for domain-specific GUI work will find value here. It shows enough concrete engineering and empirical results to deserve a serious referee who can request the missing stats and external-site checks. I would send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces the RISK framework for GUI agents specialized in e-commerce risk management. It comprises three components: RISK-Data (8,492 single-step and 2,386 multi-step trajectories collected via a high-fidelity browser), RISK-Bench (802 single-step and 320 multi-step trajectories across difficulty levels), and RISK-R1 (an R1-style reinforcement fine-tuning method incorporating four reward aspects: Output Format Constraint, Single-step Level Reward, Multi-step Level Reward, and Task Level Reweight). The central empirical claims are that RISK-R1 delivers a 6.8% improvement on offline single-step tasks and 8.8% on offline multi-step tasks while using only 7.2% of the parameters of the SOTA baseline, together with a 70.5% task success rate in online evaluation.

Significance. If the reported gains prove robust, the work offers a practical, domain-specific advance for automating multi-step, stateful web interactions that standard scraping and general GUI agents cannot handle. The public code release is a clear strength that supports reproducibility and extension by the community.

major comments (3)

[Experiments] Experiments section: the headline claims of 6.8% offline single-step and 8.8% offline multi-step improvements plus 70.5% online success rate are presented without error bars, standard deviations across runs, statistical significance tests, or details on how the SOTA baseline was re-implemented, which prevents independent verification of the performance delta.
[RISK-R1] RISK-R1 reward design (four-aspect formulation): no ablation results are provided that isolate the contribution of Output Format Constraint, Single-step Level Reward, Multi-step Level Reward, or Task Level Reweight, leaving open whether the reported gains are driven by the full combination or by a subset of components.
[RISK-Data and RISK-Bench] Data and benchmark construction: the paper does not report any analysis of distribution shift between the collected RISK-Data trajectories and live, dynamic e-commerce sites, nor any safeguards against reward hacking in the online setting; this assumption is load-bearing for the claimed 70.5% online success rate.
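
To give a sense of what the error bars requested in the first major comment could look like, the sketch below computes a Wilson score interval for a success rate measured on a fixed number of trajectories. This is a standard binomial interval chosen here for illustration, not something the paper reports. The success count in the usage lines is hypothetical; only the benchmark size of 320 multi-step trajectories comes from the abstract.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)
    ) / denom
    return center - half, center + half


# Hypothetical result: 224 of 320 multi-step trajectories solved (count is made up).
low, high = wilson_interval(224, 320)
print(f"95% interval for 224/320: [{low:.3f}, {high:.3f}]")
```

At a few hundred trajectories the interval is about ten percentage points wide, which is not negligible next to gains of a few percent. Run-to-run variation across training seeds is a separate source of uncertainty that this interval does not capture.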

minor comments (2)

[Experiments] Clarify the exact model architecture and parameter count of the SOTA baseline so that the 7.2% figure can be directly verified.
[RISK-Data] Add a short description of the high-fidelity browser framework used for trajectory collection to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on our manuscript. We address each of the major comments point by point below, indicating the revisions we plan to make to improve the clarity and rigor of the paper.

Referee: [Experiments] Experiments section: the headline claims of 6.8% offline single-step and 8.8% offline multi-step improvements plus 70.5% online success rate are presented without error bars, standard deviations across runs, statistical significance tests, or details on how the SOTA baseline was re-implemented, which prevents independent verification of the performance delta.

Authors: We agree that the experimental results would benefit from greater statistical rigor. In the revised manuscript, we will include error bars and standard deviations computed over multiple independent runs for the reported performance metrics. We will also perform and report statistical significance tests (e.g., paired t-tests) to support the claimed improvements. Furthermore, we will expand the description of the SOTA baseline re-implementation, including hyperparameters and any adaptations made to ensure fair comparison. These changes will allow for better independent verification.

Revision: yes

Referee: [RISK-R1] RISK-R1 reward design (four-aspect formulation): no ablation results are provided that isolate the contribution of Output Format Constraint, Single-step Level Reward, Multi-step Level Reward, or Task Level Reweight, leaving open whether the reported gains are driven by the full combination or by a subset of components.

Authors: We acknowledge that ablation studies isolating the individual reward components were not included in the original submission. To address this, we will add a new set of ablation experiments in the revised paper. These will systematically remove or modify each of the four aspects (Output Format Constraint, Single-step Level Reward, Multi-step Level Reward, and Task Level Reweight) and report the resulting performance changes on both offline and online tasks. This will clarify the contribution of each component to the overall gains.

Revision: yes

Referee: [RISK-Data and RISK-Bench] Data and benchmark construction: the paper does not report any analysis of distribution shift between the collected RISK-Data trajectories and live, dynamic e-commerce sites, nor any safeguards against reward hacking in the online setting; this assumption is load-bearing for the claimed 70.5% online success rate.

Authors: We recognize the importance of addressing potential distribution shift and reward hacking concerns for the online evaluation. The RISK-Data was collected using a high-fidelity browser environment designed to closely mimic live e-commerce sites, which helps mitigate shift. In the revision, we will add a dedicated subsection discussing the steps taken to minimize distribution shift, such as using diverse site configurations and real-time interaction logging. For safeguards against reward hacking, we will describe our use of held-out evaluation environments, human verification of a subset of trajectories, and monitoring for policy collapse or anomalous action patterns during online testing. While a comprehensive quantitative analysis of shift (e.g., via statistical tests on state distributions) was not performed originally, we will include preliminary comparisons where data permits.

Revision: partial
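
One lightweight way to quantify the shift discussed in this last exchange is to compare action-type frequencies between collected trajectories and logs from a live deployment. The sketch below computes the Jensen-Shannon divergence between two such histograms. It is an illustration only: the action names and counts are invented, and nothing on this page says the paper uses this measure.

```python
import math
from collections import Counter


def js_divergence(counts_a, counts_b) -> float:
    """Jensen-Shannon divergence (log base 2, range 0 to 1) between two count histograms."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms need at least one observation")
    div = 0.0
    for key in set(counts_a) | set(counts_b):
        p = counts_a.get(key, 0) / total_a
        q = counts_b.get(key, 0) / total_b
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div


# Invented action logs; real ones would come from trajectory files and online runs.
collected = Counter({"click": 60, "type": 25, "scroll": 15})
live = Counter({"click": 45, "type": 20, "scroll": 25, "hover": 10})
print(f"JS divergence: {js_divergence(collected, live):.4f}")
```

A value near 0 means the two action mixes are similar and a value near 1 means they barely overlap. Action types are only one coarse view of state; page structure and workflow length would need their own comparisons.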

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent data collection and evaluation

The paper describes an empirical pipeline consisting of new trajectory collection through a high-fidelity browser framework to produce RISK-Data, construction of RISK-Bench for evaluation, and reinforcement fine-tuning of RISK-R1 using a four-aspect reward. No equations, derivations, or self-referential fitting steps are present that would reduce the reported performance gains (6.8% offline single-step, 8.8% offline multi-step, 70.5% online success) to quantities defined by the same inputs by construction. The central claims are supported by externally collected trajectories and online testing rather than any self-definition, fitted-input renaming, or self-citation chain that collapses the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The framework introduces new data and reward components whose internal hyperparameters are not detailed.

pith-pipeline@v0.9.0 · 5835 in / 1202 out tokens · 39168 ms · 2026-05-18T13:24:15.570162+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: RISK-R1 ... four aspects: (i) Output Format Constraint, (ii) Single-step Level Reward, (iii) Multi-step Level Reward, and (iv) Task Level Reweight

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI · 2026-04 · unverdicted · novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

[2] Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile GUI agents. arXiv preprint arXiv:2407.17490.

[3] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv preprint arXiv:2503.21776.

[4] Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. UI-Venus technical report: Building high-performance UI agents with RFT. arXiv preprint arXiv:2508.10833.

[5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[6] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv preprint arXiv:2401.13919.

[7] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv preprint arXiv:2503.06749.

[8] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o System Card. arXiv preprint arXiv:2410.21276.

[9] Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. AppAgentX: Evolving GUI agents as proficient smartphone users. arXiv preprint arXiv:2503.02268.

[10] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981.

[11] Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, and Yunchao Wei. AppAgent v2: Advanced agent for flexible mobile interactions. arXiv preprint arXiv:2408.11824.

[12] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv preprint arXiv:2504.14239.

[13] Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458.

[14] Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, and Radu State. From semantic web and MAS to agentic AI: A unified narrative of the web of agents. arXiv preprint arXiv:2507.10644.

[15] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326.

[16] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[17] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A Flexible and Efficient RLHF Framework. arXiv preprint arXiv:2409.19256.

[18] Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. MobileGUI-RL: Advancing mobile GUI agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720.

[19] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723.

[20] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G²: Gaussian reward modeling for GUI grounding. arXiv preprint arXiv:2507.15846.

[21] Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-Agent-E: Self-evolving mobile assistant for complex tasks. arXiv preprint arXiv:2501.11733.

[22] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. arXiv preprint arXiv:2410.23218.

[23] Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, et al. Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227.

[24] Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. arXiv preprint arXiv:2412.04454.

[25] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711.

[26] Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning. arXiv preprint arXiv:2505.12370.

[27] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large Language Model-Brained GUI Agents: A Survey. arXiv preprint arXiv:2411.18279.

[28] Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, et al. UFO2: The desktop AgentOS. arXiv preprint arXiv:2504.14603.

[29] Jixiao Zhang and Chunsheng Zuo. GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. arXiv preprint arXiv:2504.09696.

[30] Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. GUI-G1: Understanding R1-Zero-like training for visual grounding in GUI agents. arXiv preprint arXiv:2505.15810.

[31] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.

[32] Appendix excerpt (A.1 Limitations and Future Work): Limitations. Although RISK-R1 demonstrates superior performance in e-commerce risk management tasks, there are still some limitations. (1) The current multi-step trajectories in RISK-Data are mainly used in SFT, while RFT only utilizes single-step trajectories due to GPU memory constraints. This may limi...

[33] Appendix excerpt (weight-curve plot, x-axis Progress, y-axis Weight, series "Weight curve" and "Discrete steps"): panel (a) γ = 1, all discrete weights 1.000; panel (b) γ = 0.4, δ = 4, discrete weights 0.420, 0.435, 0.462, 0.506, 0.571, 0.654, 0.746, 0.829, 0.894, 0.938, 0.965, 0.980; further panels truncated in extraction. Associated text: We use a stepwise reward in the first epoch and a binary reward in the remaining epochs.

[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2407.17490

Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents.arXiv preprint arXiv:2407.17490,

work page arXiv

[3] [3]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Ui-venus technical report: Building high-performance ui agents with rft

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft.arXiv preprint arXiv:2508.10833,

work page arXiv

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

10 Preprint Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv

[8]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276.

[9]

Appagentx: Evolving gui agents as proficient smartphone users

Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Appagentx: Evolving gui agents as proficient smartphone users. arXiv preprint arXiv:2503.02268, 2025.

[10]

Screenspot-pro: Gui grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025.

[11]

Appagent v2: Advanced agent for flexible mobile interactions

Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, and Yunchao Wei. Appagent v2: Advanced agent for flexible mobile interactions. arXiv preprint arXiv:2408.11824, 2024.

[12]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239.

[13]

GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458.

[14]

From semantic web and mas to agentic ai: A unified narrative of the web of agents

Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, and Radu State. From semantic web and mas to agentic ai: A unified narrative of the web of agents. arXiv preprint arXiv:2507.10644.

[15]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326.

[16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[17]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256.

[18]

Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment

Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025.

[19]

Os-genesis: Automating gui agent trajectory construction via reverse task synthesis

Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723.

[20]

Gui-g²: Gaussian reward modeling for gui grounding

Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. Gui-g²: Gaussian reward modeling for gui grounding. arXiv preprint arXiv:2507.15846.

[21]

Mobile-agent-e: Self-evolving mobile assistant for complex tasks

Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks. arXiv preprint arXiv:2501.11733, 2025.

[22]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218.

[23]

Scaling computer-use grounding via user interface decomposition and synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, et al. Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227.

[24]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454.

[25]

Assistantbench: Can web agents solve realistic and time-consuming tasks?

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711.

[26]

Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. arXiv preprint arXiv:2505.12370.

[27]

Large Language Model-Brained GUI Agents: A Survey

Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279.

[28]

Ufo2: The desktop agentos

Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, et al. Ufo2: The desktop agentos. arXiv preprint arXiv:2504.14603.

[29]

Grpo-lead: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models

Jixiao Zhang and Chunsheng Zuo. Grpo-lead: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. arXiv preprint arXiv:2504.09696.

[30]

Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents

Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. arXiv preprint arXiv:2505.15810, 2025.

[31]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.

[32]

(1) The current multi-step trajectories in RISK-Data are mainly used in SFT, while RFT only utilizes single-step trajectories due to GPU memory constraints

A Appendix. A.1 Limitations and Future Work. Limitations. Although RISK-R1 demonstrates superior performance in e-commerce risk management tasks, there are still some limitations. (1) The current multi-step trajectories in RISK-Data are mainly used in SFT, while RFT only utilizes single-step trajectories due to GPU memory constraints. This may limi...

[33]

We use a stepwise reward in the first epoch and a binary reward in the remaining epochs

[Figure: weight curve with discrete steps marked; x-axis Progress (0% to 100%), y-axis Weight. Panel (a) γ = 1: all weights equal 1.000. Panel (b) γ = 0.4, δ = 4: weights rise from 0.420 to 0.980. Later panels are truncated in the source.]