MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding

Athar Parvez; Muhammad Jawad Mufti; Muqaddas Gull; Omar Hammad

arxiv: 2605.17656 · v1 · pith:A34HPEHVnew · submitted 2026-05-17 · 💻 cs.HC

MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding

Athar Parvez , Muhammad Jawad Mufti , Muqaddas Gull , Omar Hammad This is my paper

Pith reviewed 2026-05-19 22:04 UTC · model grok-4.3

classification 💻 cs.HC

keywords mobile UI datasetexpert annotationUI element detectioniOS applicationsinterface understandingbenchmark datasetJSON annotations

0 comments

The pith

MUIAnno supplies expert-annotated screenshots of real iOS apps to train systems that detect and interpret mobile interface elements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MUIAnno, a new public dataset of mobile user interface screens drawn from many different iOS applications. Experts used a custom drag-and-drop tool to label buttons, input fields, navigation bars and other elements, producing structured JSON records for each screen. The authors also report initial detection benchmarks that give other researchers a concrete place to start. Such data matters because current systems for automation, accessibility and intelligent agents still struggle when they lack reliable examples of how real apps actually look and behave.

Core claim

MUIAnno is a collection of representative UI screens gathered by manually exploring diverse apps on the iTunes platform, each annotated by UI/UX experts through a purpose-built web tool that records element types, positions and structure in JSON format, accompanied by baseline results on the task of UI element detection.

What carries the argument

The MUIAnno dataset itself, built through manual app exploration and expert drag-and-drop annotation that turns raw screenshots into labeled JSON records of common interface components.

If this is right

Automation scripts and testing tools can use the labels to locate and interact with specific buttons or fields more reliably.
Accessibility systems gain clearer targets for describing or navigating interface elements to users.
UI-aware agents receive a concrete training resource for learning to read and act on mobile screens.
Future detection algorithms can be compared against the provided baseline numbers to measure progress.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same annotation approach could be repeated on Android apps to test whether the patterns learned transfer across platforms.
The JSON format may let researchers combine MUIAnno with image-captioning models to generate natural-language descriptions of entire screens.
If the dataset grows over time, it could serve as a living benchmark that tracks how mobile design conventions change.

Load-bearing premise

That the manually chosen screens and the labels produced by the expert tool faithfully capture the variety and accuracy of interfaces found in everyday mobile apps.

What would settle it

A test showing that models trained only on MUIAnno achieve substantially lower detection accuracy on a fresh set of popular iOS apps than models trained on existing UI datasets would indicate the new annotations add little value.

Figures

Figures reproduced from arXiv: 2605.17656 by Athar Parvez, Muhammad Jawad Mufti, Muqaddas Gull, Omar Hammad.

**Figure 2.** Figure 2: Overview of the annotation pipeline. Annotators draw bounding boxes around [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Interface of the custom annotation tool used for labeling UI elements. Anno [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of precision, recall, and F1-score across evaluated multimodal [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

Understanding mobile user interfaces is important for building intelligent systems such as automation tools, accessibility solutions, and UI-aware agents. However, progress in this area is still limited by the lack of high-quality datasets that reflect real-world mobile applications and include reliable annotations. In this work, we introduce MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding, collected from a diverse set of applications across multiple categories available on the iTunes platform. Each app was manually explored to capture representative UI screens, resulting in a collection that reflects a wide range of layouts and design patterns found in practice. To ensure annotation quality, we developed a custom web-based tool that allows UI/UX experts to label interface elements through a simple drag-and-drop process and generate structured annotations in JSON format. MUIAnno includes detailed annotations of common UI components such as buttons, input fields, navigation elements, and other key interface elements. In addition to presenting the dataset, we also provide benchmark experiments for UI element detection along with baseline results, offering a starting point for future research. We believe MUIAnno can support further work in mobile UI understanding and help improve systems that rely on accurate interpretation of interface elements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MUIAnno adds a new expert-annotated iOS UI dataset collected via manual app exploration and a custom labeling tool, but the write-up skips the usual numbers on scale and consistency.

read the letter

MUIAnno brings a fresh expert-annotated dataset for mobile UI understanding focused on iOS apps. The work collects screens by manually exploring apps from various categories on the iTunes platform. Experts then use a custom drag-and-drop web tool to label UI elements like buttons, input fields, and navigation components, outputting structured JSON. They also run some benchmark tests for UI element detection with baseline results. This adds to the set of available UI datasets by emphasizing expert input and real-world app diversity, which could support better training for accessibility features or intelligent agents. The collection method is described clearly enough, and providing public access is helpful. That said, the paper appears light on specifics. There are no figures for the number of apps, total screens, or balance across categories. Inter-annotator agreement isn't reported, which is standard for annotation quality in such datasets. The benchmarks lack concrete metrics or comparisons, making it hard to gauge how strong the starting point really is. The central assumption that manual exploration captures a wide range of layouts rests on the process but without data to back the representativeness claim. This paper would interest researchers in human-computer interaction who work on mobile interfaces and need labeled data for model development. Someone building on prior UI datasets might find the iOS focus and expert annotations valuable once the scale is clear. It deserves a serious referee to check the dataset details and suggest improvements like adding agreement scores or more validation.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding collected from diverse iTunes applications. It describes manual exploration of apps to capture representative UI screens, development of a custom web-based drag-and-drop tool used by UI/UX experts to produce structured JSON annotations for elements such as buttons, input fields, and navigation components, and the provision of benchmark experiments for UI element detection together with baseline results.

Significance. If the dataset proves to be of sufficient scale, balanced across categories, and supported by reliable expert annotations, MUIAnno could serve as a useful resource for research on mobile UI automation, accessibility, and UI-aware agents. The inclusion of baseline benchmarks is a constructive element. However, the absence of quantitative diagnostics in the current description limits the ability to judge its practical value as an evaluation benchmark.

major comments (2)

[Abstract] Abstract: the central claims of dataset diversity ('wide range of layouts and design patterns') and annotation reliability rest on unquantified manual processes; no counts of applications, screens, category balance, or inter-annotator agreement are supplied, leaving the load-bearing assumption that the collection and custom tool produce consistent, representative labels unsupported by evidence.
[Benchmark experiments] Benchmark experiments section: the manuscript states that baseline results for UI element detection are provided, yet supplies no concrete metrics, model descriptions, or performance numbers; without these the claim that MUIAnno offers a usable evaluation benchmark cannot be assessed.

minor comments (1)

The JSON annotation schema and exact label taxonomy should be illustrated with an example in the main text or appendix to clarify the structured output format.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of the dataset and benchmarks.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of dataset diversity ('wide range of layouts and design patterns') and annotation reliability rest on unquantified manual processes; no counts of applications, screens, category balance, or inter-annotator agreement are supplied, leaving the load-bearing assumption that the collection and custom tool produce consistent, representative labels unsupported by evidence.

Authors: We agree that the abstract would be improved by including quantitative details to support the claims of diversity and reliability. The full manuscript describes the manual exploration of apps from diverse iTunes categories and the use of the custom drag-and-drop tool by UI/UX experts to produce structured JSON annotations. To directly address this point, we will revise the abstract to report key statistics on the number of applications, total screens captured, and category balance. For annotation reliability, we will expand the description of the annotation protocol and quality controls in the main text. We note that inter-annotator agreement metrics were not computed, as each screen received annotation from a single expert following standardized guidelines; we will add an explicit discussion of this aspect and any related limitations. revision: partial
Referee: [Benchmark experiments] Benchmark experiments section: the manuscript states that baseline results for UI element detection are provided, yet supplies no concrete metrics, model descriptions, or performance numbers; without these the claim that MUIAnno offers a usable evaluation benchmark cannot be assessed.

Authors: We acknowledge that the current description of the benchmark experiments lacks sufficient concrete details. Although the manuscript includes a section presenting baseline results for UI element detection, we agree that explicit model descriptions, evaluation metrics, and numerical performance values are needed for the benchmark to be properly assessed. We will revise this section to include specific information on the baseline models employed, the metrics used (such as precision and recall for element detection), and the reported performance numbers on the MUIAnno dataset. revision: yes

Circularity Check

0 steps flagged

No circularity; dataset introduction paper with no derivations or predictions

full rationale

The manuscript presents MUIAnno as an expert-annotated dataset collected via manual app exploration and a custom drag-and-drop annotation tool, followed by baseline UI element detection experiments. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims about diversity and annotation quality are supported by process description rather than any self-referential reduction or self-citation chain. The work is self-contained as an empirical dataset contribution with no load-bearing logical steps that collapse to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper whose central contribution is the curation and expert labeling of real-world mobile UI screens rather than any derivation from axioms or parameters. No free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5752 in / 1351 out tokens · 62911 ms · 2026-05-19T22:04:37.561758+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding... custom web-based tool... drag-and-drop process and generate structured annotations in JSON format... benchmark experiments for UI element detection
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

36 UI element classes... 27,367 annotated UI element instances... IoU-based matching... F1-score evaluation of GPT-5.4, Claude, Gemini, Llama-4-Scout, Gemma

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

n8n: Workflow Automation Tool , year =

work page
[2]

Chatbot Arena Leaderboard , year =

work page
[3]

Discover iOS Apps | Mobbin --- UI & UX Design Inspiration for Mobile & Web Apps , year =

work page
[4]

iTunes Search API , year =

work page
[5]

ScreenAI: A Vision-Language Model for UI and Infographics Understanding , year =

Baechler, Gilles and Sunkara, Srinivas and Wang, Maria and Zubach, Fedir and Mansoor, Hassan and Etter, Vincent and C. ScreenAI: A Vision-Language Model for UI and Infographics Understanding , year =. doi:10.48550/arXiv.2402.04615 , url =. 2402.04615 , archivePrefix =

work page doi:10.48550/arxiv.2402.04615
[6]

Proceedings of the 40th International Conference on Software Engineering , series =

Chen, Chunyang and Su, Ting and Meng, Guozhu and Xing, Zhenchang and Liu, Yang , title =. Proceedings of the 40th International Conference on Software Engineering , series =. 2018 , address =. doi:10.1145/3180155.3180240 , isbn =

work page doi:10.1145/3180155.3180240 2018
[7]

ACM Transactions on Software Engineering and Methodology , volume =

Chen, Jieshan and Chen, Chunyang and Xing, Zhenchang and Xia, Xin and Zhu, Liming and Grundy, John and Wang, Jinshui , title =. ACM Transactions on Software Engineering and Methodology , volume =. 2020 , doi =. 2103.07085 , archivePrefix =

work page arXiv 2020
[8]

Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , pages =

Chen, Jieshan and Chen, Chunyang and Xing, Zhenchang and Xu, Xiwei and Zhu, Liming and Li, Guoqiang and Wang, Jinshui , title =. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , pages =. 2020 , doi =. 2003.00380 , archivePrefix =

work page arXiv 2020
[9]

Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology , series =

Deka, Biplab and Huang, Zifeng and Franzen, Chad and Hibschman, Joshua and Afergan, Daniel and Li, Yang and Nichols, Jeffrey and Kumar, Ranjitha , title =. Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology , series =. 2017 , address =. doi:10.1145/3126594.3126651 , isbn =

work page doi:10.1145/3126594.3126651 2017
[10]

2024 , publisher =

Duan, Peitong and Chen, Chin-yi and Li, Gang and Hartmann, Bjoern and Li, Yang , title =. 2024 , publisher =. doi:10.48550/arXiv.2407.08850 , url =. 2407.08850 , archivePrefix =

work page doi:10.48550/arxiv.2407.08850 2024
[11]

Proceedings of the CHI Conference on Human Factors in Computing Systems , pages =

Feng, Sidong and Ma, Suyu and Wang, Han and Kong, David and Chen, Chunyang , title =. Proceedings of the CHI Conference on Human Factors in Computing Systems , pages =. 2024 , address =. doi:10.1145/3613904.3642350 , isbn =

work page doi:10.1145/3613904.3642350 2024
[12]

2024 , publisher =

Gao, Longxi and Zhang, Li and Wang, Shihe and Wang, Shangguang and Li, Yuanchun and Xu, Mengwei , title =. 2024 , publisher =. doi:10.48550/arXiv.2409.14337 , url =. 2409.14337 , archivePrefix =

work page doi:10.48550/arxiv.2409.14337 2024
[13]

2024 , publisher =

Haque, Sabrina and Csallner, Christoph , title =. 2024 , publisher =. doi:10.48550/arXiv.2409.18060 , url =. 2409.18060 , archivePrefix =

work page doi:10.48550/arxiv.2409.18060 2024
[14]

net/forum?id=kxnoqaisCT

Hui, Zheng and Li, Yinheng and Zhao, Dan and Chen, Tianyi and Banbury, Colby and Koishida, Kazuhito , title =. 2025 , publisher =. doi:10.48550/arXiv.2503.04730 , url =. 2503.04730 , archivePrefix =

work page doi:10.48550/arxiv.2503.04730 2025
[15]

2025 , publisher =

Jang, Yunseok and Song, Yeda and Sohn, Sungryull and Logeswaran, Lajanugen and Luo, Tiange and Kim, Dong-Ki and Bae, Kyunghoon and Lee, Honglak , title =. 2025 , publisher =. doi:10.48550/arXiv.2505.12632 , url =. 2505.12632 , archivePrefix =

work page doi:10.48550/arxiv.2505.12632 2025
[16]

2023 , publisher =

Jiang, Yue and Schoop, Eldon and Swearngin, Amanda and Nichols, Jeffrey , title =. 2023 , publisher =. doi:10.48550/arXiv.2310.04869 , url =. 2310.04869 , archivePrefix =

work page doi:10.48550/arxiv.2310.04869 2023
[17]

2026 , publisher =

Kumbhar, Shrinidhi and Liao, Haofu and Appalaraju, Srikar and Singh, Kunwar Yashraj , title =. 2026 , publisher =. doi:10.48550/arXiv.2603.26211 , url =. 2603.26211 , archivePrefix =

work page doi:10.48550/arxiv.2603.26211 2026
[18]

2023 , publisher =

Lee, Kenton and Joshi, Mandar and Turc, Iulia and Hu, Hexiang and Liu, Fangyu and Eisenschlos, Julian and Khandelwal, Urvashi and Shaw, Peter and Chang, Ming-Wei and Toutanova, Kristina , title =. 2023 , publisher =. doi:10.48550/arXiv.2210.03347 , url =. 2210.03347 , archivePrefix =

work page doi:10.48550/arxiv.2210.03347 2023
[19]

and Hota, Asutosh and Oulasvirta, Antti , title =

Leiva, Luis A. and Hota, Asutosh and Oulasvirta, Antti , title =. ACM Transactions on Intelligent Systems and Technology , volume =. 2022 , doi =

work page 2022
[20]

2023 , publisher =

Li, Gang and Li, Yang , title =. 2023 , publisher =. doi:10.48550/arXiv.2209.14927 , url =. 2209.14927 , archivePrefix =

work page doi:10.48550/arxiv.2209.14927 2023
[21]

Screenspot-pro: Gui grounding for professional high-resolution computer use.arXiv preprint arXiv:2504.07981, 2025

Li, Kaixin and Meng, Ziyang and Lin, Hongzhan and Luo, Ziyang and Tian, Yuchen and Ma, Jing and Huang, Zhiyong and Chua, Tat-Seng , title =. 2025 , publisher =. doi:10.48550/arXiv.2504.07981 , url =. 2504.07981 , archivePrefix =

work page doi:10.48550/arxiv.2504.07981 2025
[22]

and Myers, Brad A

Li, Toby Jia-Jun and Popowski, Lindsay and Mitchell, Tom M. and Myers, Brad A. , title =. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , pages =. 2021 , doi =. 2101.11103 , archivePrefix =

work page arXiv 2021
[23]

2020 , publisher =

Li, Yang and Li, Gang and He, Luheng and Zheng, Jingjie and Li, Hong and Guan, Zhiwei , title =. 2020 , publisher =. doi:10.48550/arXiv.2010.04295 , url =. 2010.04295 , archivePrefix =

work page doi:10.48550/arxiv.2010.04295 2020
[24]

https://doi.org/10.48550/arXiv.2406.08451

Lu, Quanfeng and Shao, Wenqi and Liu, Zitao and Du, Lingxiao and Meng, Fanqing and Li, Boxuan and Chen, Botong and Huang, Siyuan and Zhang, Kaipeng and Luo, Ping , title =. 2025 , publisher =. doi:10.48550/arXiv.2406.08451 , url =. 2406.08451 , archivePrefix =

work page doi:10.48550/arxiv.2406.08451 2025
[25]

2026 , publisher =

Ma, Longhui and Zhao, Di and Wang, Siwei and Lv, Zhao and Wang, Miao , title =. 2026 , publisher =. doi:10.48550/arXiv.2602.06351 , url =. 2602.06351 , archivePrefix =

work page doi:10.48550/arxiv.2602.06351 2026
[26]

Powers, David M. W. , title =. Journal of Machine Learning Technologies , volume =. 2011 , url =

work page 2011
[27]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and Zhong, Wanjun and Li, Kuanye and Yang, Jiale and Miao, Yu and Lin, Woyu and Liu, Longxiang and Jiang, Xu and Ma, Qianli and Li, Jingyu and Xiao, Xiaojun and Cai, Kai and Li, Chuang and Zheng, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12326 2025
[28]

2023 , publisher =

Wang, Bryan and Li, Gang and Li, Yang , title =. 2023 , publisher =. doi:10.48550/arXiv.2209.08655 , url =. 2209.08655 , archivePrefix =

work page doi:10.48550/arxiv.2209.08655 2023
[29]

2021 , publisher =

Wang, Bryan and Li, Gang and Zhou, Xin and Chen, Zhourong and Grossman, Tovi and Li, Yang , title =. 2021 , publisher =. doi:10.48550/arXiv.2108.03353 , url =. 2108.03353 , archivePrefix =

work page doi:10.48550/arxiv.2108.03353 2021
[30]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and Qiao, Yu , title =. 2024 , publisher =. doi:10.48550/arXiv.2410.23218 , url =. 2410.23218 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.23218 2024
[31]

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y

Xie, Tianbao and Deng, Jiaqi and Li, Xiaochuan and Yang, Junlin and Wu, Haoyuan and Chen, Jixuan and Hu, Wenjing and Wang, Xinyuan and Xu, Yuhui and Wang, Zekun and Xu, Yiheng and Wang, Junli and Sahoo, Doyen and Yu, Tao and Xiong, Caiming , title =. 2025 , publisher =. doi:10.48550/arXiv.2505.13227 , url =. 2505.13227 , archivePrefix =

work page doi:10.48550/arxiv.2505.13227 2025
[32]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and Liu, Yitao and Xu, Yiheng and Zhou, Shuyan and Savarese, Silvio and Xiong, Caiming and Zhong, Victor and Yu, Tao , title =. 2024 , publisher =. doi:10.48550/arXiv.2404.07972 , url...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.07972 2024
[33]

2025 , publisher =

Yu, Shengcheng and Fang, Chunrong and Tuo, Ziyuan and Zhang, Quanjun and Chen, Chunyang and Chen, Zhenyu and Su, Zhendong , title =. 2025 , publisher =. doi:10.48550/arXiv.2310.13518 , url =. 2310.13518 , archivePrefix =

work page doi:10.48550/arxiv.2310.13518 2025

[1] [1]

n8n: Workflow Automation Tool , year =

work page

[2] [2]

Chatbot Arena Leaderboard , year =

work page

[3] [3]

Discover iOS Apps | Mobbin --- UI & UX Design Inspiration for Mobile & Web Apps , year =

work page

[4] [4]

iTunes Search API , year =

work page

[5] [5]

ScreenAI: A Vision-Language Model for UI and Infographics Understanding , year =

Baechler, Gilles and Sunkara, Srinivas and Wang, Maria and Zubach, Fedir and Mansoor, Hassan and Etter, Vincent and C. ScreenAI: A Vision-Language Model for UI and Infographics Understanding , year =. doi:10.48550/arXiv.2402.04615 , url =. 2402.04615 , archivePrefix =

work page doi:10.48550/arxiv.2402.04615

[6] [6]

Proceedings of the 40th International Conference on Software Engineering , series =

Chen, Chunyang and Su, Ting and Meng, Guozhu and Xing, Zhenchang and Liu, Yang , title =. Proceedings of the 40th International Conference on Software Engineering , series =. 2018 , address =. doi:10.1145/3180155.3180240 , isbn =

work page doi:10.1145/3180155.3180240 2018

[7] [7]

ACM Transactions on Software Engineering and Methodology , volume =

Chen, Jieshan and Chen, Chunyang and Xing, Zhenchang and Xia, Xin and Zhu, Liming and Grundy, John and Wang, Jinshui , title =. ACM Transactions on Software Engineering and Methodology , volume =. 2020 , doi =. 2103.07085 , archivePrefix =

work page arXiv 2020

[8] [8]

Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , pages =

Chen, Jieshan and Chen, Chunyang and Xing, Zhenchang and Xu, Xiwei and Zhu, Liming and Li, Guoqiang and Wang, Jinshui , title =. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , pages =. 2020 , doi =. 2003.00380 , archivePrefix =

work page arXiv 2020

[9] [9]

Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology , series =

Deka, Biplab and Huang, Zifeng and Franzen, Chad and Hibschman, Joshua and Afergan, Daniel and Li, Yang and Nichols, Jeffrey and Kumar, Ranjitha , title =. Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology , series =. 2017 , address =. doi:10.1145/3126594.3126651 , isbn =

work page doi:10.1145/3126594.3126651 2017

[10] [10]

2024 , publisher =

Duan, Peitong and Chen, Chin-yi and Li, Gang and Hartmann, Bjoern and Li, Yang , title =. 2024 , publisher =. doi:10.48550/arXiv.2407.08850 , url =. 2407.08850 , archivePrefix =

work page doi:10.48550/arxiv.2407.08850 2024

[11] [11]

Proceedings of the CHI Conference on Human Factors in Computing Systems , pages =

Feng, Sidong and Ma, Suyu and Wang, Han and Kong, David and Chen, Chunyang , title =. Proceedings of the CHI Conference on Human Factors in Computing Systems , pages =. 2024 , address =. doi:10.1145/3613904.3642350 , isbn =

work page doi:10.1145/3613904.3642350 2024

[12] [12]

2024 , publisher =

Gao, Longxi and Zhang, Li and Wang, Shihe and Wang, Shangguang and Li, Yuanchun and Xu, Mengwei , title =. 2024 , publisher =. doi:10.48550/arXiv.2409.14337 , url =. 2409.14337 , archivePrefix =

work page doi:10.48550/arxiv.2409.14337 2024

[13] [13]

2024 , publisher =

Haque, Sabrina and Csallner, Christoph , title =. 2024 , publisher =. doi:10.48550/arXiv.2409.18060 , url =. 2409.18060 , archivePrefix =

work page doi:10.48550/arxiv.2409.18060 2024

[14] [14]

net/forum?id=kxnoqaisCT

Hui, Zheng and Li, Yinheng and Zhao, Dan and Chen, Tianyi and Banbury, Colby and Koishida, Kazuhito , title =. 2025 , publisher =. doi:10.48550/arXiv.2503.04730 , url =. 2503.04730 , archivePrefix =

work page doi:10.48550/arxiv.2503.04730 2025

[15] [15]

2025 , publisher =

Jang, Yunseok and Song, Yeda and Sohn, Sungryull and Logeswaran, Lajanugen and Luo, Tiange and Kim, Dong-Ki and Bae, Kyunghoon and Lee, Honglak , title =. 2025 , publisher =. doi:10.48550/arXiv.2505.12632 , url =. 2505.12632 , archivePrefix =

work page doi:10.48550/arxiv.2505.12632 2025

[16] [16]

2023 , publisher =

Jiang, Yue and Schoop, Eldon and Swearngin, Amanda and Nichols, Jeffrey , title =. 2023 , publisher =. doi:10.48550/arXiv.2310.04869 , url =. 2310.04869 , archivePrefix =

work page doi:10.48550/arxiv.2310.04869 2023

[17] [17]

2026 , publisher =

Kumbhar, Shrinidhi and Liao, Haofu and Appalaraju, Srikar and Singh, Kunwar Yashraj , title =. 2026 , publisher =. doi:10.48550/arXiv.2603.26211 , url =. 2603.26211 , archivePrefix =

work page doi:10.48550/arxiv.2603.26211 2026

[18] [18]

2023 , publisher =

Lee, Kenton and Joshi, Mandar and Turc, Iulia and Hu, Hexiang and Liu, Fangyu and Eisenschlos, Julian and Khandelwal, Urvashi and Shaw, Peter and Chang, Ming-Wei and Toutanova, Kristina , title =. 2023 , publisher =. doi:10.48550/arXiv.2210.03347 , url =. 2210.03347 , archivePrefix =

work page doi:10.48550/arxiv.2210.03347 2023

[19] [19]

and Hota, Asutosh and Oulasvirta, Antti , title =

Leiva, Luis A. and Hota, Asutosh and Oulasvirta, Antti , title =. ACM Transactions on Intelligent Systems and Technology , volume =. 2022 , doi =

work page 2022

[20] [20]

2023 , publisher =

Li, Gang and Li, Yang , title =. 2023 , publisher =. doi:10.48550/arXiv.2209.14927 , url =. 2209.14927 , archivePrefix =

work page doi:10.48550/arxiv.2209.14927 2023

[21] [21]

Screenspot-pro: Gui grounding for professional high-resolution computer use.arXiv preprint arXiv:2504.07981, 2025

Li, Kaixin and Meng, Ziyang and Lin, Hongzhan and Luo, Ziyang and Tian, Yuchen and Ma, Jing and Huang, Zhiyong and Chua, Tat-Seng , title =. 2025 , publisher =. doi:10.48550/arXiv.2504.07981 , url =. 2504.07981 , archivePrefix =

work page doi:10.48550/arxiv.2504.07981 2025

[22] [22]

and Myers, Brad A

Li, Toby Jia-Jun and Popowski, Lindsay and Mitchell, Tom M. and Myers, Brad A. , title =. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , pages =. 2021 , doi =. 2101.11103 , archivePrefix =

work page arXiv 2021

[23] [23]

2020 , publisher =

Li, Yang and Li, Gang and He, Luheng and Zheng, Jingjie and Li, Hong and Guan, Zhiwei , title =. 2020 , publisher =. doi:10.48550/arXiv.2010.04295 , url =. 2010.04295 , archivePrefix =

work page doi:10.48550/arxiv.2010.04295 2020

[24] [24]

https://doi.org/10.48550/arXiv.2406.08451

Lu, Quanfeng and Shao, Wenqi and Liu, Zitao and Du, Lingxiao and Meng, Fanqing and Li, Boxuan and Chen, Botong and Huang, Siyuan and Zhang, Kaipeng and Luo, Ping , title =. 2025 , publisher =. doi:10.48550/arXiv.2406.08451 , url =. 2406.08451 , archivePrefix =

work page doi:10.48550/arxiv.2406.08451 2025

[25] [25]

2026 , publisher =

Ma, Longhui and Zhao, Di and Wang, Siwei and Lv, Zhao and Wang, Miao , title =. 2026 , publisher =. doi:10.48550/arXiv.2602.06351 , url =. 2602.06351 , archivePrefix =

work page doi:10.48550/arxiv.2602.06351 2026

[26] [26]

Powers, David M. W. , title =. Journal of Machine Learning Technologies , volume =. 2011 , url =

work page 2011

[27] [27]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and Zhong, Wanjun and Li, Kuanye and Yang, Jiale and Miao, Yu and Lin, Woyu and Liu, Longxiang and Jiang, Xu and Ma, Qianli and Li, Jingyu and Xiao, Xiaojun and Cai, Kai and Li, Chuang and Zheng, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12326 2025

[28] [28]

2023 , publisher =

Wang, Bryan and Li, Gang and Li, Yang , title =. 2023 , publisher =. doi:10.48550/arXiv.2209.08655 , url =. 2209.08655 , archivePrefix =

work page doi:10.48550/arxiv.2209.08655 2023

[29] [29]

2021 , publisher =

Wang, Bryan and Li, Gang and Zhou, Xin and Chen, Zhourong and Grossman, Tovi and Li, Yang , title =. 2021 , publisher =. doi:10.48550/arXiv.2108.03353 , url =. 2108.03353 , archivePrefix =

work page doi:10.48550/arxiv.2108.03353 2021

[30] [30]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and Qiao, Yu , title =. 2024 , publisher =. doi:10.48550/arXiv.2410.23218 , url =. 2410.23218 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.23218 2024

[31] [31]

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y

Xie, Tianbao and Deng, Jiaqi and Li, Xiaochuan and Yang, Junlin and Wu, Haoyuan and Chen, Jixuan and Hu, Wenjing and Wang, Xinyuan and Xu, Yuhui and Wang, Zekun and Xu, Yiheng and Wang, Junli and Sahoo, Doyen and Yu, Tao and Xiong, Caiming , title =. 2025 , publisher =. doi:10.48550/arXiv.2505.13227 , url =. 2505.13227 , archivePrefix =

work page doi:10.48550/arxiv.2505.13227 2025

[32] [32]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and Liu, Yitao and Xu, Yiheng and Zhou, Shuyan and Savarese, Silvio and Xiong, Caiming and Zhong, Victor and Yu, Tao , title =. 2024 , publisher =. doi:10.48550/arXiv.2404.07972 , url...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.07972 2024

[33] [33]

2025 , publisher =

Yu, Shengcheng and Fang, Chunrong and Tuo, Ziyuan and Zhang, Quanjun and Chen, Chunyang and Chen, Zhenyu and Su, Zhendong , title =. 2025 , publisher =. doi:10.48550/arXiv.2310.13518 , url =. 2310.13518 , archivePrefix =

work page doi:10.48550/arxiv.2310.13518 2025