
arxiv: 2605.17656 · v1 · pith:A34HPEHVnew · submitted 2026-05-17 · 💻 cs.HC

MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding

Pith reviewed 2026-05-19 22:04 UTC · model grok-4.3

classification 💻 cs.HC
keywords mobile UI dataset · expert annotation · UI element detection · iOS applications · interface understanding · benchmark dataset · JSON annotations

The pith

MUIAnno supplies expert-annotated screenshots of real iOS apps to train systems that detect and interpret mobile interface elements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MUIAnno, a new public dataset of mobile user interface screens drawn from many different iOS applications. Experts used a custom drag-and-drop tool to label buttons, input fields, navigation bars and other elements, producing structured JSON records for each screen. The authors also report initial detection benchmarks that give other researchers a concrete place to start. Such data matters because current systems for automation, accessibility and intelligent agents still struggle when they lack reliable examples of how real apps actually look and behave.

Core claim

MUIAnno is a collection of representative UI screens gathered by manually exploring diverse apps on the iTunes platform, each annotated by UI/UX experts through a purpose-built web tool that records element types, positions and structure in JSON format, accompanied by baseline results on the task of UI element detection.

What carries the argument

The MUIAnno dataset itself, built through manual app exploration and expert drag-and-drop annotation that turns raw screenshots into labeled JSON records of common interface components.

If this is right

  • Automation scripts and testing tools can use the labels to locate and interact with specific buttons or fields more reliably.
  • Accessibility systems gain clearer targets for describing or navigating interface elements to users.
  • UI-aware agents receive a concrete training resource for learning to read and act on mobile screens.
  • Future detection algorithms can be compared against the provided baseline numbers to measure progress.
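
The first bullet can be sketched in a few lines. The paper's JSON schema is not shown in the material reviewed here, so the record below is hypothetical: the field names (`screen_id`, `elements`, `type`, `text`, `bbox`), the `[x, y, width, height]` box convention, and the helpers `find_elements` and `tap_point` are all assumptions made for illustration, not the dataset's actual format.

```python
import json

# Hypothetical annotation record; the real MUIAnno schema may differ.
RECORD = """
{
  "screen_id": "example_app_home",
  "image_size": {"width": 1170, "height": 2532},
  "elements": [
    {"type": "navigation_bar", "bbox": [0, 0, 1170, 280]},
    {"type": "input_field", "bbox": [60, 400, 1050, 120]},
    {"type": "button", "text": "Sign in", "bbox": [60, 600, 1050, 140]}
  ]
}
"""


def find_elements(record, element_type):
    """Return all annotated elements whose type matches element_type."""
    return [e for e in record["elements"] if e["type"] == element_type]


def tap_point(element):
    """Return the center of an [x, y, width, height] box as (x, y)."""
    x, y, w, h = element["bbox"]
    return (x + w / 2, y + h / 2)


record = json.loads(RECORD)
buttons = find_elements(record, "button")
center = tap_point(buttons[0])
print(center)
```

An automation script would pass a point like `center` to its own tap primitive; which element to pick when several share a type is left to the caller.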

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation approach could be repeated on Android apps to test whether the patterns learned transfer across platforms.
  • The JSON format may let researchers combine MUIAnno with image-captioning models to generate natural-language descriptions of entire screens.
  • If the dataset grows over time, it could serve as a living benchmark that tracks how mobile design conventions change.

Load-bearing premise

That the manually chosen screens and the labels produced by the expert tool faithfully capture the variety and accuracy of interfaces found in everyday mobile apps.

What would settle it

A test showing that models trained only on MUIAnno achieve substantially lower detection accuracy on a fresh set of popular iOS apps than models trained on existing UI datasets would indicate the new annotations add little value.

Figures

Figures reproduced from arXiv: 2605.17656 by Athar Parvez, Muhammad Jawad Mufti, Muqaddas Gull, Omar Hammad.

Figure 1: Overview of the dataset construction workflow. Real-world iOS apps are se…
Figure 2: Overview of the annotation pipeline. Annotators draw bounding boxes around…
Figure 3: Interface of the custom annotation tool used for labeling UI elements. Anno…
Figure 4: Comparison of precision, recall, and F1-score across evaluated multimodal…
Original abstract

Understanding mobile user interfaces is important for building intelligent systems such as automation tools, accessibility solutions, and UI-aware agents. However, progress in this area is still limited by the lack of high-quality datasets that reflect real-world mobile applications and include reliable annotations. In this work, we introduce MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding, collected from a diverse set of applications across multiple categories available on the iTunes platform. Each app was manually explored to capture representative UI screens, resulting in a collection that reflects a wide range of layouts and design patterns found in practice. To ensure annotation quality, we developed a custom web-based tool that allows UI/UX experts to label interface elements through a simple drag-and-drop process and generate structured annotations in JSON format. MUIAnno includes detailed annotations of common UI components such as buttons, input fields, navigation elements, and other key interface elements. In addition to presenting the dataset, we also provide benchmark experiments for UI element detection along with baseline results, offering a starting point for future research. We believe MUIAnno can support further work in mobile UI understanding and help improve systems that rely on accurate interpretation of interface elements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding collected from diverse iTunes applications. It describes manual exploration of apps to capture representative UI screens, development of a custom web-based drag-and-drop tool used by UI/UX experts to produce structured JSON annotations for elements such as buttons, input fields, and navigation components, and the provision of benchmark experiments for UI element detection together with baseline results.

Significance. If the dataset proves to be of sufficient scale, balanced across categories, and supported by reliable expert annotations, MUIAnno could serve as a useful resource for research on mobile UI automation, accessibility, and UI-aware agents. The inclusion of baseline benchmarks is a constructive element. However, the absence of quantitative diagnostics in the current description limits the ability to judge its practical value as an evaluation benchmark.

major comments (2)
  1. [Abstract] Abstract: the central claims of dataset diversity ('wide range of layouts and design patterns') and annotation reliability rest on unquantified manual processes; no counts of applications, screens, category balance, or inter-annotator agreement are supplied, leaving the load-bearing assumption that the collection and custom tool produce consistent, representative labels unsupported by evidence.
  2. [Benchmark experiments] Benchmark experiments section: the manuscript states that baseline results for UI element detection are provided, yet supplies no concrete metrics, model descriptions, or performance numbers; without these the claim that MUIAnno offers a usable evaluation benchmark cannot be assessed.
minor comments (1)
  1. The JSON annotation schema and exact label taxonomy should be illustrated with an example in the main text or appendix to clarify the structured output format.
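
Major comment 1 notes that no inter-annotator agreement is reported. If a subset of screens were labeled by two experts, agreement on element types could be summarized with Cohen's kappa, as in the minimal sketch below. It assumes the two annotators' boxes have already been paired one-to-one, so only the labels are compared; the function name and the sample labels are illustrative and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same paired elements."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of elements given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["button", "button", "input_field", "navigation_bar"]
annotator_2 = ["button", "input_field", "input_field", "navigation_bar"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa covers label agreement only. Agreement on box placement would need a separate measure, such as the overlap between paired boxes.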

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of the dataset and benchmarks.

  1. Referee: [Abstract] Abstract: the central claims of dataset diversity ('wide range of layouts and design patterns') and annotation reliability rest on unquantified manual processes; no counts of applications, screens, category balance, or inter-annotator agreement are supplied, leaving the load-bearing assumption that the collection and custom tool produce consistent, representative labels unsupported by evidence.

    Authors: We agree that the abstract would be improved by including quantitative details to support the claims of diversity and reliability. The full manuscript describes the manual exploration of apps from diverse iTunes categories and the use of the custom drag-and-drop tool by UI/UX experts to produce structured JSON annotations. To directly address this point, we will revise the abstract to report key statistics on the number of applications, total screens captured, and category balance. For annotation reliability, we will expand the description of the annotation protocol and quality controls in the main text. We note that inter-annotator agreement metrics were not computed, as each screen received annotation from a single expert following standardized guidelines; we will add an explicit discussion of this aspect and any related limitations.
    revision: partial

  2. Referee: [Benchmark experiments] Benchmark experiments section: the manuscript states that baseline results for UI element detection are provided, yet supplies no concrete metrics, model descriptions, or performance numbers; without these the claim that MUIAnno offers a usable evaluation benchmark cannot be assessed.

    Authors: We acknowledge that the current description of the benchmark experiments lacks sufficient concrete details. Although the manuscript includes a section presenting baseline results for UI element detection, we agree that explicit model descriptions, evaluation metrics, and numerical performance values are needed for the benchmark to be properly assessed. We will revise this section to include specific information on the baseline models employed, the metrics used (such as precision and recall for element detection), and the reported performance numbers on the MUIAnno dataset.
    revision: yes
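
For readers unfamiliar with how precision, recall, and F1 are usually computed for element detection, the sketch below shows one common recipe: a prediction counts as correct when its label matches an unused ground-truth box and their intersection-over-union (IoU) reaches a threshold. The 0.5 threshold, the greedy matching in prediction order, the `(x1, y1, x2, y2)` box format, and all names are assumptions; the paper's actual matching protocol is not described in the material reviewed here.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_prf(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of (label, box) pairs; returns (P, R, F1)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p_label, p_box in predictions:
        best_iou, best_idx = 0.0, None
        for idx, (g_label, g_box) in enumerate(ground_truth):
            if idx in matched or g_label != p_label:
                continue
            score = iou(p_box, g_box)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if predictions else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Illustrative boxes only; not data from MUIAnno.
gt = [("button", (0, 0, 10, 10)), ("input_field", (20, 20, 40, 30))]
preds = [
    ("button", (1, 1, 10, 10)),
    ("input_field", (100, 100, 120, 110)),
    ("button", (50, 50, 60, 60)),
]
precision, recall, f1 = detection_prf(preds, gt)
print(precision, recall, f1)
```

Reported numbers depend on the threshold and the matching rule, so a benchmark would need to state both for its results to be comparable across papers.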

Circularity Check

0 steps flagged

No circularity; dataset introduction paper with no derivations or predictions


The manuscript presents MUIAnno as an expert-annotated dataset collected via manual app exploration and a custom drag-and-drop annotation tool, followed by baseline UI element detection experiments. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims about diversity and annotation quality are supported by process description rather than any self-referential reduction or self-citation chain. The work is self-contained as an empirical dataset contribution with no load-bearing logical steps that collapse to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper whose central contribution is the curation and expert labeling of real-world mobile UI screens rather than any derivation from axioms or parameters. No free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5752 in / 1351 out tokens · 62911 ms · 2026-05-19T22:04:37.561758+00:00 · methodology


