arxiv: 2606.06322 · v1 · pith:ODCWYNZQnew · submitted 2026-06-04 · 💻 cs.AI

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

Pith reviewed 2026-06-28 01:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agents · drag grounding · benchmark · dataset · vision language models · user interface automation · multimodal models · computer use tasks

The pith

DragOn dataset of 3.5 million tasks improves models on GUI drag interactions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DragOn, a benchmark and dataset designed to address the shortage of data for drag-based interactions in GUI agents. It includes 286K screenshots and 3.5M tasks spanning text highlighting, cell selection, element resizing, and slider manipulation, along with a 2000-example test set. Evaluations of leading models reveal their current weaknesses on these tasks, while fine-tuning an open-weight model on the training data points to performance improvements on related computer-use applications.

Core claim

We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

What carries the argument

The DragOn dataset for training and evaluating drag grounding in vision-based GUI agents across four domains.
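
According to the paper's Figure 1 caption, each example pairs a screenshot with a natural-language intent, and the task is to predict a source and a target bounding box, with an ordered flag marking whether drag direction matters. A minimal Python sketch of such a record and one plausible way to score a predicted drag is given here. The names `Box`, `DragExample`, `denormalize`, and `drag_hits`, the pixel-coordinate layout, and the rule that a drag counts as correct when its endpoints land inside the ground-truth boxes are all assumptions made for illustration; this summary does not state the paper's actual schema or metric.

```python
from __future__ import annotations

from dataclasses import dataclass

Point = tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class DragExample:
    """One drag grounding task; field names are assumptions."""
    intent: str    # natural-language instruction
    source: Box    # ground-truth region where the drag starts
    target: Box    # ground-truth region where the drag ends
    ordered: bool  # True when drag direction is semantically meaningful


def denormalize(point: Point, width: int, height: int) -> Point:
    """Convert normalized (x, y) in [0, 1] to pixel coordinates."""
    return (point[0] * width, point[1] * height)


def drag_hits(example: DragExample, start: Point, end: Point) -> bool:
    """Assumed scoring rule: both endpoints must land in the right boxes.

    For unordered examples the reversed drag is accepted as well.
    """
    forward = example.source.contains(start) and example.target.contains(end)
    if example.ordered:
        return forward
    backward = example.source.contains(end) and example.target.contains(start)
    return forward or backward
```

Under these assumptions, a slider task would typically be ordered, since dragging the handle the wrong way changes the outcome, while a text selection could be unordered, since selecting from either end highlights the same span.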

If this is right

  • Fine-tuned models achieve better results on drag tasks than base models.
  • The dataset helps close the gap between click and drag data availability for GUI agents.
  • Evaluations highlight limitations in current proprietary and open models for complex drags.
  • Agents could better automate tasks that require drag actions on desktop and mobile.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This dataset could be combined with existing click datasets to create more comprehensive GUI training.
  • Improved drag handling may enable agents to perform tasks like file management or design work more effectively.
  • Future work might test generalization to new domains like games or specialized software interfaces.

Load-bearing premise

That the 2000-example held-out evaluation suite is representative of real-world drag interactions, and that fine-tuning on the training data generalizes beyond the specific domains tested.

What would settle it

Evaluating the fine-tuned model on drag tasks from applications outside the four covered domains and finding no improvement over the base model.
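
One way to make "no improvement over the base model" concrete is a paired comparison on the same out-of-domain tasks. The sketch below assumes each model yields a 0/1 success outcome per task and uses a paired bootstrap to put a 95% interval on the difference in success rate. The function name `paired_bootstrap_diff`, the 0/1 outcome encoding, and the choice of bootstrap are illustrative assumptions, not the paper's evaluation protocol.

```python
import random


def paired_bootstrap_diff(base, tuned, n_resamples=2000, seed=0):
    """Estimate the success-rate gain of `tuned` over `base`.

    `base` and `tuned` are equal-length sequences of 0/1 outcomes on the
    same tasks, in the same order. Returns (observed_gain, low, high),
    where low/high bound an approximate 95% bootstrap interval.
    """
    if len(base) != len(tuned) or len(base) == 0:
        raise ValueError("need two non-empty, equal-length outcome lists")
    rng = random.Random(seed)
    n = len(base)
    # Per-task differences keep the pairing between the two models.
    deltas = [t - b for b, t in zip(base, tuned)]
    observed = sum(deltas) / n
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and recompute the mean gain.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return observed, low, high
```

If the interval on out-of-domain tasks comfortably includes zero, that would be consistent with the failure case described above; an interval clearly above zero would support the generalization premise.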

Figures

Figures reproduced from arXiv: 2606.06322 by Maxime Langevin, Nathan Bout, Ronan Riochet.

Figure 1. The four drag grounding action domains covered by the proposed DragOn benchmark. Each example pairs a screenshot with a natural-language intent; the task is to predict a source and target bounding box on the screenshot, with an ordered flag indicating whether drag direction is semantically meaningful.
Figure 2. Representative drag actions from end-to-end agent benchmarks. The red arrow shows the executed drag overlaid on the observation the agent saw immediately before acting; drag endpoints are in normalized (x, y) screen coordinates.
Figure 3. Canonical vs. alternative ground-truth target regions for actions with a continuum of valid drop points. We adopt the canonical convention (ours, left in each pair) throughout the paper.
Figure 4. Qualitative end-to-end comparison on the OSWorld task libreoffice calc 19: the computer-use-specialized policy (Figure 4a) executes a successful drag and solves the task, while the generalist base policy with the same architecture and parameter count (Figure 4b) produces a failed drag on the same initial state; see Section C for the full setup.
Figure 5. Text highlighting example.
Figure 6. Text highlighting example.
Figure 7. Text highlighting example.
Figure 8. Cell selection example.
Figure 9. Element resizing example (crop).
Figure 10. Slider manipulation example.
Figure 11. Slider manipulation example (vertical mixer fader).
Original abstract

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces DragOn, a benchmark and dataset for drag-based GUI interactions covering four domains (text highlighting, cell selection, element resizing, slider manipulation). It provides 286K training screenshots and 3.5M training tasks plus a 2000-example held-out evaluation suite, evaluates proprietary and open-weight VLMs on the benchmark, and reports that fine-tuning Qwen on the training data yields gains; the authors suggest this dataset could improve state-of-the-art models on downstream computer-use tasks.

Significance. If the dataset construction, quality controls, and evaluation protocol are sound, DragOn would fill a documented gap in drag-grounding data (currently an order of magnitude smaller than click data) and supply a reproducible resource for training GUI agents on complex interactions.

minor comments (3)
  1. The abstract and experimental outline leave the precise task-generation process, quality-control steps, and domain sampling strategy unspecified; a dedicated section or appendix detailing these would strengthen reproducibility claims.
  2. The downstream-transfer claim is presented only as a suggestion; if the manuscript contains any measured transfer results on external computer-use benchmarks, they should be moved from the discussion into the results section with explicit metrics.
  3. Clarify whether the 2000-example held-out suite was drawn from the same four domains used for training data generation and whether any cross-domain or out-of-distribution splits were performed.
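
The cross-domain split raised in comment 3 could take the form of a leave-one-domain-out protocol: train on three of the four domains and evaluate on the fourth. The sketch below illustrates that idea only. The `domain` key on each example and the function name `leave_one_domain_out` are hypothetical, and nothing in this summary indicates that the authors ran such a split.

```python
from collections import defaultdict

# The four domains named in the paper's abstract.
DOMAINS = (
    "text highlighting",
    "cell selection",
    "element resizing",
    "slider manipulation",
)


def leave_one_domain_out(examples):
    """Yield (held_out_domain, train, test) for each of the four domains.

    `examples` is an iterable of dicts carrying a "domain" key
    (a hypothetical schema used only for this sketch).
    """
    by_domain = defaultdict(list)
    for example in examples:
        by_domain[example["domain"]].append(example)
    for held_out in DOMAINS:
        # Train on every domain except the held-out one.
        train = [ex for d in DOMAINS if d != held_out for ex in by_domain[d]]
        test = list(by_domain[held_out])
        yield held_out, train, test
```

Reporting accuracy on each held-out domain next to the in-domain figure would show how much of the fine-tuning gain transfers across drag types.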

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of DragOn and for recommending minor revision. The review correctly identifies the gap in drag-grounding resources relative to click data and the potential utility of the released dataset and benchmark. No major comments were raised that require point-by-point rebuttal.

Circularity Check

0 steps flagged

No circularity; dataset release with no derivations or self-referential predictions

The paper releases a dataset (286K screenshots, 3.5M tasks) and a 2000-example held-out benchmark across four GUI domains, then reports model evaluations including one fine-tune. No equations, fitted parameters, uniqueness theorems, or predictions appear; the sole forward-looking statement is explicitly hedged as a suggestion rather than a derived claim. The contribution is empirical data release and benchmarking, self-contained against external model performance without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmark paper with no free parameters, mathematical axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.1-grok · 5691 in / 1064 out tokens · 24411 ms · 2026-06-28T01:35:39.153338+00:00 · methodology
