MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding
Pith reviewed 2026-05-19 22:04 UTC · model grok-4.3
The pith
MUIAnno supplies expert-annotated screenshots of real iOS apps to train systems that detect and interpret mobile interface elements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MUIAnno is a collection of representative UI screens gathered by manually exploring diverse apps on the iTunes platform, each annotated by UI/UX experts through a purpose-built web tool that records element types, positions and structure in JSON format, accompanied by baseline results on the task of UI element detection.
What carries the argument
The MUIAnno dataset itself, built through manual app exploration and expert drag-and-drop annotation that turns raw screenshots into labeled JSON records of common interface components.
If this is right
- Automation scripts and testing tools can use the labels to locate and interact with specific buttons or fields more reliably.
- Accessibility systems gain clearer targets for describing or navigating interface elements to users.
- UI-aware agents receive a concrete training resource for learning to read and act on mobile screens.
- Future detection algorithms can be compared against the provided baseline numbers to measure progress.
Where Pith is reading between the lines
- The same annotation approach could be repeated on Android apps to test whether the patterns learned transfer across platforms.
- The JSON format may let researchers combine MUIAnno with image-captioning models to generate natural-language descriptions of entire screens.
- If the dataset grows over time, it could serve as a living benchmark that tracks how mobile design conventions change.
Load-bearing premise
That the manually chosen screens and the labels produced by the expert tool faithfully capture the variety and accuracy of interfaces found in everyday mobile apps.
What would settle it
A test showing that models trained only on MUIAnno achieve substantially lower detection accuracy on a fresh set of popular iOS apps than models trained on existing UI datasets would indicate the new annotations add little value.
Original abstract
Understanding mobile user interfaces is important for building intelligent systems such as automation tools, accessibility solutions, and UI-aware agents. However, progress in this area is still limited by the lack of high-quality datasets that reflect real-world mobile applications and include reliable annotations. In this work, we introduce MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding, collected from a diverse set of applications across multiple categories available on the iTunes platform. Each app was manually explored to capture representative UI screens, resulting in a collection that reflects a wide range of layouts and design patterns found in practice. To ensure annotation quality, we developed a custom web-based tool that allows UI/UX experts to label interface elements through a simple drag-and-drop process and generate structured annotations in JSON format. MUIAnno includes detailed annotations of common UI components such as buttons, input fields, navigation elements, and other key interface elements. In addition to presenting the dataset, we also provide benchmark experiments for UI element detection along with baseline results, offering a starting point for future research. We believe MUIAnno can support further work in mobile UI understanding and help improve systems that rely on accurate interpretation of interface elements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding collected from diverse iTunes applications. It describes manual exploration of apps to capture representative UI screens, development of a custom web-based drag-and-drop tool used by UI/UX experts to produce structured JSON annotations for elements such as buttons, input fields, and navigation components, and the provision of benchmark experiments for UI element detection together with baseline results.
Significance. If the dataset proves to be of sufficient scale, balanced across categories, and supported by reliable expert annotations, MUIAnno could serve as a useful resource for research on mobile UI automation, accessibility, and UI-aware agents. The inclusion of baseline benchmarks is a constructive element. However, the absence of quantitative diagnostics in the current description limits the ability to judge its practical value as an evaluation benchmark.
major comments (2)
- [Abstract] Abstract: the central claims of dataset diversity ('wide range of layouts and design patterns') and annotation reliability rest on unquantified manual processes; no counts of applications, screens, category balance, or inter-annotator agreement are supplied, leaving the load-bearing assumption that the collection and custom tool produce consistent, representative labels unsupported by evidence.
- [Benchmark experiments] Benchmark experiments section: the manuscript states that baseline results for UI element detection are provided, yet supplies no concrete metrics, model descriptions, or performance numbers; without these the claim that MUIAnno offers a usable evaluation benchmark cannot be assessed.
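One of the diagnostics requested in the first major comment, inter-annotator agreement, can be sketched in a few lines. The example below computes Cohen's kappa over element-type labels, assuming two annotators have labeled the same set of already-aligned elements. The label names and the six-element sample are hypothetical and are not taken from MUIAnno. Agreement on box positions would additionally need a spatial matching step, which is not shown here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six UI elements (illustrative only).
annotator_a = ["button", "button", "input", "nav", "input", "button"]
annotator_b = ["button", "input", "input", "nav", "input", "button"]

kappa = cohens_kappa(annotator_a, annotator_b)
# Per-class counts give a first look at category balance.
category_counts = Counter(annotator_a)
print(f"kappa = {kappa:.3f}", dict(category_counts))
```

Because the rebuttal states that each screen was annotated by a single expert, a statistic of this kind would require re-annotating at least a sample of screens by a second expert.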
minor comments (1)
- The JSON annotation schema and exact label taxonomy should be illustrated with an example in the main text or appendix to clarify the structured output format.
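To make the minor comment concrete, the sketch below shows what a structured record covering element types, positions and structure might look like, together with a small consistency check. The schema is entirely hypothetical: the field names (`screen_id`, `elements`, `bbox`, `parent`), the box convention `[x_min, y_min, x_max, y_max]` and the type labels are assumptions for illustration, since the reviewed text does not reproduce the actual MUIAnno format.

```python
import json

# Hypothetical annotation record; every field name here is an assumption.
record = {
    "screen_id": "example-app/screen-001",
    "image": {"width": 1170, "height": 2532},
    "elements": [
        {"id": 1, "type": "navigation_bar", "bbox": [0, 0, 1170, 280], "parent": None},
        {"id": 2, "type": "button", "bbox": [60, 120, 220, 240], "parent": 1},
        {"id": 3, "type": "input_field", "bbox": [60, 400, 1110, 520], "parent": None},
    ],
}


def validate(rec):
    """Return a list of problems: boxes outside the image or dangling parent ids."""
    problems = []
    width = rec["image"]["width"]
    height = rec["image"]["height"]
    ids = {element["id"] for element in rec["elements"]}
    for element in rec["elements"]:
        x0, y0, x1, y1 = element["bbox"]
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            problems.append(f"element {element['id']}: box outside image or empty")
        if element["parent"] is not None and element["parent"] not in ids:
            problems.append(f"element {element['id']}: unknown parent {element['parent']}")
    return problems


print(json.dumps(record, indent=2))
print(validate(record))
```

A flat element list with `parent` references is only one way to encode structure. A nested tree would work equally well, and the paper may use either or something else.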
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of the dataset and benchmarks.
- Referee: [Abstract] Abstract: the central claims of dataset diversity ('wide range of layouts and design patterns') and annotation reliability rest on unquantified manual processes; no counts of applications, screens, category balance, or inter-annotator agreement are supplied, leaving the load-bearing assumption that the collection and custom tool produce consistent, representative labels unsupported by evidence.
  Authors: We agree that the abstract would be improved by including quantitative details to support the claims of diversity and reliability. The full manuscript describes the manual exploration of apps from diverse iTunes categories and the use of the custom drag-and-drop tool by UI/UX experts to produce structured JSON annotations. To directly address this point, we will revise the abstract to report key statistics on the number of applications, total screens captured, and category balance. For annotation reliability, we will expand the description of the annotation protocol and quality controls in the main text. We note that inter-annotator agreement metrics were not computed, as each screen received annotation from a single expert following standardized guidelines; we will add an explicit discussion of this aspect and any related limitations.
  Revision: partial
- Referee: [Benchmark experiments] Benchmark experiments section: the manuscript states that baseline results for UI element detection are provided, yet supplies no concrete metrics, model descriptions, or performance numbers; without these the claim that MUIAnno offers a usable evaluation benchmark cannot be assessed.
  Authors: We acknowledge that the current description of the benchmark experiments lacks sufficient concrete details. Although the manuscript includes a section presenting baseline results for UI element detection, we agree that explicit model descriptions, evaluation metrics, and numerical performance values are needed for the benchmark to be properly assessed. We will revise this section to include specific information on the baseline models employed, the metrics used (such as precision and recall for element detection), and the reported performance numbers on the MUIAnno dataset.
  Revision: yes
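For readers unfamiliar with how precision, recall and F1 are typically computed for element detection, the following is a minimal sketch of IoU-based matching. It is not the authors' evaluation protocol. The 0.5 IoU threshold, the greedy one-to-one matching in prediction order, the requirement that labels match, and the sample boxes are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predictions, ground_truth, threshold=0.5):
    """Greedy one-to-one matching of (label, box) pairs; returns (precision, recall, f1)."""
    unmatched = list(range(len(ground_truth)))
    true_pos = 0
    for label, box in predictions:
        best, best_iou = None, threshold
        for i in unmatched:
            gt_label, gt_box = ground_truth[i]
            score = iou(box, gt_box)
            # A match needs the same label and an IoU at or above the threshold.
            if gt_label == label and score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            unmatched.remove(best)  # each ground-truth box can be matched only once
            true_pos += 1
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical screen: two annotated elements, three predicted ones.
ground_truth = [("button", (0, 0, 10, 10)), ("input_field", (0, 20, 10, 30))]
predictions = [
    ("button", (0, 0, 10, 8)),         # overlaps the annotated button, IoU 0.8
    ("button", (50, 50, 60, 60)),      # no overlapping annotation: false positive
    ("input_field", (0, 20, 10, 30)),  # exact match
]
precision, recall, f1 = detection_scores(predictions, ground_truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A fuller benchmark would usually sort predictions by confidence before matching and report results per class. Both are omitted here to keep the sketch short.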
Circularity Check
No circularity; dataset introduction paper with no derivations or predictions
The manuscript presents MUIAnno as an expert-annotated dataset collected via manual app exploration and a custom drag-and-drop annotation tool, followed by baseline UI element detection experiments. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims about diversity and annotation quality are supported by process description rather than any self-referential reduction or self-citation chain. The work is self-contained as an empirical dataset contribution with no load-bearing logical steps that collapse to their own inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "we introduce MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding... custom web-based tool... drag-and-drop process and generate structured annotations in JSON format... benchmark experiments for UI element detection"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "36 UI element classes... 27,367 annotated UI element instances... IoU-based matching... F1-score evaluation of GPT-5.4, Claude, Gemini, Llama-4-Scout, Gemma"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.