DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

Maxime Langevin; Nathan Bout; Ronan Riochet

arxiv: 2606.06322 · v1 · pith:ODCWYNZQnew · submitted 2026-06-04 · 💻 cs.AI

Pith reviewed 2026-06-28 01:35 UTC · model grok-4.3

keywords: GUI agents, drag grounding, benchmark, dataset, vision language models, user interface automation, multimodal models, computer use tasks

The pith

DragOn dataset of 3.5 million tasks improves models on GUI drag interactions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DragOn, a benchmark and dataset designed to address the shortage of data for drag-based interactions in GUI agents. It includes 286K screenshots and 3.5M tasks spanning text highlighting, cell selection, element resizing, and slider manipulation, along with a 2000-example test set. Evaluations of leading models reveal their current weaknesses on these tasks, while fine-tuning an open-weight model on the training data points to performance improvements on related computer-use applications.

Core claim

We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

What carries the argument

The DragOn dataset for training and evaluating drag grounding in vision-based GUI agents across four domains.

If this is right

Fine-tuned models achieve better results on drag tasks than base models.
The dataset helps close the gap between click and drag data availability for GUI agents.
Evaluations highlight limitations in current proprietary and open models for complex drags.
Agents could better automate desktop and mobile tasks that require dragging actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This dataset could be combined with existing click datasets to create more comprehensive GUI training.
Improved drag handling may enable agents to perform tasks like file management or design work more effectively.
Future work might test generalization to new domains like games or specialized software interfaces.

Load-bearing premise

That the 2000-example held-out evaluation suite is representative of real-world drag interactions, and that fine-tuning on the training data generalizes beyond the specific domains tested.

What would settle it

Evaluating the fine-tuned model on drag tasks from applications outside the four covered domains and finding no improvement over the base model.

Figures

Figures reproduced from arXiv: 2606.06322 by Maxime Langevin, Nathan Bout, Ronan Riochet.

**Figure 1.** The four drag grounding action domains covered by the proposed DragOn benchmark. Each example pairs a screenshot with a natural-language intent; the task is to predict a source and target bounding box on the screenshot, with an ordered flag indicating whether drag direction is semantically meaningful.

**Figure 2.** Representative drag actions from end-to-end agent benchmarks. The red arrow shows the executed drag overlaid on the observation the agent saw immediately before acting; drag endpoints are in normalized (x, y) screen coordinates.

**Figure 3.** Canonical vs. alternative ground-truth target regions for actions with a continuum of valid drop points. We adopt the canonical convention (ours, left in each pair) throughout the paper.

**Figure 4.** Qualitative end-to-end comparison on the OSWorld task libreoffice calc 19: the computer-use-specialized policy (Figure 4a) executes a successful drag and solves the task, while the generalist base policy with the same architecture and parameter count (Figure 4b) produces a failed drag on the same initial state; see Section C for the full setup.

**Figure 5.** Text highlighting example.

**Figure 6.** Text highlighting example.

**Figure 7.** Text highlighting example.

**Figure 8.** Cell selection example.

**Figure 9.** Element resizing example (crop).

**Figure 10.** Slider manipulation example.

**Figure 11.** Slider manipulation example (vertical mixer fader).
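The task format described in the Figure 1 caption (a source box, a target box, and an ordered flag) and the normalized (x, y) endpoints mentioned in the Figure 2 caption can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the names `DragTask`, `contains`, `drag_hits`, and `to_pixels`, the box layout `(x_min, y_min, x_max, y_max)`, and above all the scoring rule (start point inside the source box, end point inside the target box, with either direction accepted when the drag is unordered). The page does not state the paper's actual metric, so this should not be read as the authors' evaluation protocol.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical layout: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


@dataclass(frozen=True)
class DragTask:
    """One illustrative drag grounding example (field names are assumptions)."""
    intent: str    # natural-language instruction paired with the screenshot
    source: Box    # region where the drag should start
    target: Box    # region where the drag should end
    ordered: bool  # True if the drag direction is semantically meaningful


def contains(box: Box, point: Point) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def drag_hits(task: DragTask, start: Point, end: Point) -> bool:
    """Assumed scoring rule: both endpoints must land in their boxes.

    For unordered tasks the reversed drag is accepted as well.
    """
    forward = contains(task.source, start) and contains(task.target, end)
    if task.ordered:
        return forward
    backward = contains(task.source, end) and contains(task.target, start)
    return forward or backward


def to_pixels(point: Point, width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized (x, y) point to pixel coordinates."""
    x, y = point
    return round(x * width), round(y * height)
```

Under this sketch, a text-highlighting task would typically be unordered (dragging left to right or right to left selects the same span), whereas a slider task would be ordered, since reversing the drag moves the handle the wrong way.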

Original abstract

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DragOn releases a sizable new dataset for drag grounding in GUI agents that fills a documented gap, but the downstream improvement claim stays untested.

The main takeaway is a new dataset of 286K screenshots and 3.5M drag tasks across four domains: text highlighting, cell selection, resizing, and sliders. This directly addresses the abstract's point that drag data has lagged far behind click data, and the scale looks substantial enough to matter for training vision-language GUI agents.

The paper evaluates proprietary and open models on the held-out 2000-example set and reports gains after fine-tuning Qwen on the training split. That part is clear and reproducible in principle, and releasing both the benchmark and the data is the useful step here.

The soft spots are straightforward. The held-out set is drawn from the same four domains used for training, so the results do not speak to generalization outside those domains. The claim that the dataset could improve downstream computer-use tasks is presented as a suggestion rather than a measured outcome; no separate experiments on actual downstream tasks appear. Details on data generation, quality control, and task validity are not visible in the abstract, which limits how much weight the numbers can carry right now.

This is for people working on GUI agents who need drag examples and are willing to check the released data themselves. A reader in that niche would get practical value from the resource if the construction process is documented and the tasks are sound. It is worth sending to peer review as a dataset paper; the gap it targets is real and the scale is large enough that referees can assess whether the release is solid.

Referee Report

0 major / 3 minor

Summary. The paper introduces DragOn, a benchmark and dataset for drag-based GUI interactions covering four domains (text highlighting, cell selection, element resizing, slider manipulation). It provides 286K training screenshots and 3.5M training tasks plus a 2000-example held-out evaluation suite, evaluates proprietary and open-weight VLMs on the benchmark, and reports that fine-tuning Qwen on the training data yields gains; the authors suggest this dataset could improve state-of-the-art models on downstream computer-use tasks.

Significance. If the dataset construction, quality controls, and evaluation protocol are sound, DragOn would fill a documented gap in drag-grounding data (currently an order of magnitude smaller than click data) and supply a reproducible resource for training GUI agents on complex interactions.

minor comments (3)

The abstract and experimental outline leave the precise task-generation process, quality-control steps, and domain sampling strategy unspecified; a dedicated section or appendix detailing these would strengthen reproducibility claims.
The downstream-transfer claim is presented only as a suggestion; if the manuscript contains any measured transfer results on external computer-use benchmarks, they should be moved from the discussion into the results section with explicit metrics.
Clarify whether the 2000-example held-out suite was drawn from the same four domains used for training data generation and whether any cross-domain or out-of-distribution splits were performed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of DragOn and for recommending minor revision. The review correctly identifies the gap in drag-grounding resources relative to click data and the potential utility of the released dataset and benchmark. No major comments were raised that require point-by-point rebuttal.

Circularity Check

0 steps flagged

No circularity; dataset release with no derivations or self-referential predictions

The paper releases a dataset (286K screenshots, 3.5M tasks) and a 2000-example held-out benchmark across four GUI domains, then reports model evaluations including one fine-tune. No equations, fitted parameters, uniqueness theorems, or predictions appear; the sole forward-looking statement is explicitly hedged as a suggestion rather than a derived claim. The contribution is empirical data release and benchmarking, self-contained against external model performance without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmark paper with no free parameters, mathematical axioms, or invented entities required for the central claim.

