pith. machine review for the scientific record.

arxiv: 2604.25420 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.HC

Recognition: unknown

Recommending Usability Improvements with Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:19 UTC · model grok-4.3

classification 💻 cs.SE cs.HC
keywords usability evaluation · multimodal large language models · Nielsen heuristics · improvement recommendations · screen recordings · user study · software engineering

The pith

Multimodal large language models identify usability issues in screen recordings and suggest ranked fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper explores using multimodal large language models to evaluate the usability of applications. The models receive limited context about the app and videos of user interactions, then identify problems based on standard usability principles, explain them, and recommend improvements sorted by how serious they are. Researchers tested this by having software engineers review the top suggestions. The goal is to make usability checks easier for teams that cannot easily consult specialists.

Core claim

The paper claims that inputting limited application context and screen recordings of user interactions into a multimodal large language model allows automatic identification of usability issues using Nielsen's heuristics, along with explanations and severity-ranked improvement recommendations. Evaluation through a user study with software engineers confirmed the practical usefulness of the highest-ranked suggestions as a complement to traditional methods.

What carries the argument

A multimodal large language model that processes visual inputs from screen recordings and textual context to generate descriptions of usability issues, explanations, and ranked recommendations.
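As a concrete illustration of this mechanism, the sketch below wires up a minimal version of such a pipeline in Python: sample frames from a screen recording with OpenCV, attach them to a textual prompt containing the app context, and ask a multimodal model for a severity-ranked issue list. The model name, prompt wording, frame-sampling rate, and output format are assumptions made here for illustration; the paper's actual prompts are shown in Figures 2-7.

```python
# Hypothetical sketch of the evaluation pipeline described above.
# Assumptions: an OpenAI-compatible multimodal endpoint; the paper's
# actual model, prompts, and output schema may differ (see Figures 2-7).
import base64
import cv2  # pip install opencv-python
from openai import OpenAI


def sample_frames(video_path: str, every_n: int = 60) -> list[str]:
    """Extract every n-th frame of a screen recording as base64 PNG."""
    frames, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            ok, png = cv2.imencode(".png", frame)
            if ok:
                frames.append(base64.b64encode(png.tobytes()).decode())
        idx += 1
    cap.release()
    return frames


def evaluate_usability(app_context: str, video_path: str) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    content = [{"type": "text", "text":
                f"App context: {app_context}\n"
                "Identify usability issues visible in these frames using "
                "Nielsen's 10 heuristics. For each issue give the violated "
                "heuristic, an explanation, an improvement recommendation, "
                "and a severity from 1 (cosmetic) to 4 (catastrophic). "
                "Return the issues as JSON sorted by severity."}]
    for b64 in sample_frames(video_path):
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder multimodal model, not the paper's choice
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```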

Load-bearing premise

The model must be able to spot actual usability problems and create useful, properly ordered recommendations based solely on the provided context and recordings.

What would settle it

A direct comparison study in which independent usability experts review the same recordings and list issues, so that the relevance and ranking accuracy of the model's output can be measured against the experts' findings.
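A minimal sketch of how such a comparison could be scored, assuming model-generated issues have already been matched to expert-listed issues upstream (for instance via an embedding-similarity threshold): precision and recall over matched issues, plus a Spearman correlation between the model's severity ranks and the experts' ranks. The function, its data layout, and the example numbers are hypothetical.

```python
# Hypothetical scoring of model-vs-expert agreement; the matching of
# model issues to expert issues (matched_pairs) is assumed to be done
# upstream, e.g. via embedding similarity with a threshold.
from scipy.stats import spearmanr


def agreement_scores(n_model: int, n_expert: int,
                     matched_pairs: list[tuple[int, int]]):
    """matched_pairs: (model_rank, expert_rank) for each matched issue."""
    tp = len(matched_pairs)
    precision = tp / n_model if n_model else 0.0  # model issues that are real
    recall = tp / n_expert if n_expert else 0.0   # expert issues the model found
    rho = None
    if tp >= 2:
        model_ranks, expert_ranks = zip(*matched_pairs)
        rho, _ = spearmanr(model_ranks, expert_ranks)  # ranking agreement
    return precision, recall, rho


# Example: the model listed 6 issues, experts listed 5, and 4 overlapped.
print(agreement_scores(6, 5, [(1, 2), (2, 1), (3, 3), (5, 4)]))
```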

Figures

Figures reproduced from arXiv: 2604.25420 by Alexander Felfernig, Damian Garber, Manuel Henrich, Sebastian Lubos, Viet-Man Le.

Figure 1: Overview of the four-step process for our automated heuristic-based usability evaluation and improve…
Figure 2: System prompt for usability evaluation.
Figure 3: Prompt template for usability evaluation with variable placeholders in parentheses…
Figure 4: Prompt template for usability issue summary with variable placeholders in parentheses…
Figure 5: Prompt template for usability improvement recommendation summary with variable placeholders in…
Figure 6: System prompt for the severity ranking of issues.
Figure 7: Prompt template for usability issue ranking by severity with variable placeholders in parentheses…
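Figures 3-7 describe prompt templates with variable placeholders; the exact wording lives in those figures. As an illustration only, a template of that general shape might look like the following, with the placeholder names invented here rather than taken from the paper.

```python
# Illustrative prompt template with variable placeholders, echoing the
# structure Figures 3-7 describe; placeholder names are invented here.
from string import Template

EVALUATION_TEMPLATE = Template(
    "Application context: $app_context\n"
    "Task performed by the user: $task_description\n"
    "You are given frames from a screen recording of this task.\n"
    "Identify usability issues according to Nielsen's heuristics.\n"
    "For each issue, report: heuristic, explanation, recommendation."
)

prompt = EVALUATION_TEMPLATE.substitute(
    app_context="Mobile shopping app, checkout flow",
    task_description="Complete a purchase with a saved credit card",
)
print(prompt)
```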
Original abstract

Usability describes quality attributes of application user interfaces that determine how effectively users can interact with them. Traditional usability evaluation methods require considerable expertise and resources, which can be challenging, especially for small teams and organizations. Automating usability evaluation could make it more accessible and help to improve the user experience. The recent emergence of powerful multimodal large language models (MLLMs) has opened new opportunities for automating usability evaluation and recommendation of improvements. These models can process visual inputs such as images and videos alongside textual context, which enables the identification of usability issues and the generation of actionable suggestions to resolve these issues. In this paper, we present a novel automated approach that uses limited application context and screen recordings of user interactions as input to an MLLM. The model automatically identifies and describes usability issues based on Nielsen's usability heuristics, and provides corresponding explanations and improvement recommendations. To reduce the developer effort of manual prioritization, the recommendations are ranked by severity. The quality and practical usefulness of the generated recommendations were evaluated based on a user study that involved software engineers as participants. The evaluation focused on the highest-ranked suggestions provided by the model. The results demonstrate the potential of our approach to provide low-effort usability improvement recommendations. This makes it a promising complement to traditional evaluation methods, especially in settings with limited access to usability experts. In this sense, the approach serves as a basis for future integration into development tools to enable automated usability evaluation within software engineering workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel automated approach for usability evaluation that feeds limited application context and screen recordings of user interactions into a multimodal large language model (MLLM). The MLLM identifies usability issues according to Nielsen's heuristics, generates explanations and improvement recommendations, and ranks the recommendations by severity. The quality and practical usefulness of the highest-ranked recommendations are assessed via a user study with software engineers as participants; the authors conclude that the method offers a low-effort complement to traditional usability evaluation, particularly for teams lacking dedicated experts.

Significance. If the user-study evidence were robust, the work would be significant for software engineering and HCI by demonstrating how MLLMs can operationalize established heuristics on visual interaction data to produce actionable, ranked outputs. This could lower the barrier to usability improvements in resource-constrained settings and support integration into development tools. The approach is grounded in well-known heuristics and addresses a real pain point (prioritization effort), but the absence of study details currently prevents any assessment of whether these benefits are realized.

major comments (2)
  1. [Evaluation section] The central claim that the results demonstrate practical usefulness rests entirely on the user study, yet the manuscript provides no description of study design, participant count or demographics, applications or tasks used, rating protocol (scales, instructions, what was rated), inter-rater reliability, comparison to usability experts or baselines, or any quantitative results or statistical analysis. Without these elements it is impossible to determine whether the positive outcomes reflect genuine identification of real usability issues or are artifacts of an under-powered, uncontrolled, or biased evaluation.
  2. [Method and Results sections] No ground-truth validation is reported for the MLLM's heuristic application or severity ranking. It is unclear how the model extracts issues from screen recordings plus limited context, whether the rankings align with actual user impact, or how the outputs compare to expert-generated recommendations; this leaves the reliability of the automated pipeline untested and undermines the claim that the approach can serve as a trustworthy complement to traditional methods.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one concrete quantitative finding or scale from the user study rather than the generic statement that results 'demonstrate the potential.'
  2. [Approach section] Clarify the exact MLLM model, prompting strategy, and any post-processing used for ranking; these details are necessary for reproducibility even if the study is the primary concern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the presentation of our evaluation and methodological details. We address each major comment below and commit to revisions that improve transparency without altering the core claims of the work.

Point-by-point responses
  1. Referee: [Evaluation section] The central claim that the results demonstrate practical usefulness rests entirely on the user study, yet the manuscript provides no description of study design, participant count or demographics, applications or tasks used, rating protocol (scales, instructions, what was rated), inter-rater reliability, comparison to usability experts or baselines, or any quantitative results or statistical analysis. Without these elements it is impossible to determine whether the positive outcomes reflect genuine identification of real usability issues or are artifacts of an under-powered, uncontrolled, or biased evaluation.

    Authors: We agree that the Evaluation section in the submitted manuscript lacks the requested details on study design and results. This was an oversight during manuscript preparation. In the revised version we will expand the section to describe the participant count and demographics, the applications and tasks used, the rating protocol including scales and instructions provided to participants, inter-rater reliability measures, any comparisons performed, and the quantitative results together with appropriate statistical analysis. These additions will allow readers to assess the robustness of the reported outcomes. revision: yes

  2. Referee: [Method and Results sections] No ground-truth validation is reported for the MLLM's heuristic application or severity ranking. It is unclear how the model extracts issues from screen recordings plus limited context, whether the rankings align with actual user impact, or how the outputs compare to expert-generated recommendations; this leaves the reliability of the automated pipeline untested and undermines the claim that the approach can serve as a trustworthy complement to traditional methods.

    Authors: We acknowledge that the manuscript does not include explicit ground-truth validation (such as expert comparisons or direct measures of user impact) for the MLLM's heuristic classifications or severity rankings. The evaluation instead relies on practitioner feedback from software engineers regarding the usefulness of the highest-ranked recommendations. We will revise the Method and Results sections to clarify the input processing steps and to add an explicit discussion of this limitation, including how the engineer study serves as a proxy for real-world applicability. We do not claim the current evidence constitutes full ground-truth validation and will frame the contribution accordingly as an initial demonstration rather than a definitive replacement for expert review. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an applied system that feeds limited context and screen recordings into an external MLLM, which then applies Nielsen's established heuristics to identify issues and rank recommendations. No equations, parameter fitting, or derivation chain exists. The central claim of practical usefulness rests on an independent user study with software engineers rather than any internal consistency or self-referential reduction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations that collapse the result to its inputs are present in the provided text or abstract.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical application of existing multimodal LLMs and Nielsen's heuristics; no free parameters, axioms, or invented entities are introduced or fitted.

pith-pipeline@v0.9.0 · 5568 in / 1048 out tokens · 48783 ms · 2026-05-07T16:19:35.347285+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 26 canonical work pages · 3 internal anchors

  1. [1] Abdulaziz Alshayban and Sam Malek. 2022. AccessiText: automated detection of text accessibility issues in Android apps. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, 984–995. ...

  2. [2] Laura Carvajal, Ana M. Moreno, María-Isabel Sánchez-Segura, and Ahmed Seffah. 2013. Usability through Software Design. IEEE Transactions on Software Engineering 39, 11 (2013), 1582–1596. doi:10.1109/TSE.2013.29

  3. [3] John W. Castro, Ignacio Garnica, and Luis A. Rojas. 2022. Automated Tools for Usability Evaluation: A Systematic Mapping Study. In Social Computing and Social Media: Design, User Experience and Impact, Gabriele Meiselwitz (Ed.). Springer International Publishing, Cham, 28–46.

  4. [4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. MIND2WEB: towards a generalist agent for the web. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 1220, 24 pages.

  5. [5] Peitong Duan, Jeremy Warner, Yang Li, and Bjoern Hartmann. 2024. Generating Automatic Feedback on UI Mockups with Large Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 6, 20 pages. doi:10.1145/3613904.3642782

  6. [6] Alexander Felfernig, Müslüm Atas, Denis Helic, Thi Ngoc Trang Tran, Martin Stettinger, and Ralph Samer. 2024. Algorithms for Group Recommendation. Springer Nature Switzerland, Cham, 29–61. doi:10.1007/978-3-031-44943-7_2

  7. [7] Guilherme Guerino, Luiz Rodrigues, Bruna Capeleti, Rafael Ferreira Mello, André Freire, and Luciana Zaina. 2025. Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation. In Human-Computer Interaction – INTERACT 2025: 20th IFIP TC 13 International Conference, Belo Horizonte, Brazil, September 8–1...

  8. [8] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. arXiv:2307.12856 [cs.LG] https://arxiv.org/abs/2307.12856

  9. [9] Christopher Hass. 2019. A Practical Guide to Usability Testing. Springer International Publishing, Cham, 107–124. doi:10.1007/978-3-319-96906-0_6

  10. [10] Tasha Hollingsed and David G. Novick. 2007. Usability inspection methods after 15 years of research and practice. In Proceedings of the 25th Annual ACM International Conference on Design of Communication (El Paso, Texas, USA) (SIGDOC ’07). Association for Computing Machinery, New York, NY, USA, 249–255. doi:10.1145/1297144.1297200

  11. [11] Eduard Kuric, Peter Demcak, Matus Krajcovic, and Jan Lang. 2025. Systematic Literature Review of Automation and Artificial Intelligence in Usability Issue Detection. arXiv:2504.01415 [cs.HC] https://arxiv.org/abs/2504.01415

  12. [12] Baoli Li and Liping Han. 2013. Distance Weighted Cosine Similarity Measure for Text Classification. In Intelligent Data Engineering and Automated Learning – IDEAL 2013, Hujun Yin, Ke Tang, Yang Gao, Frank Klawonn, Minho Lee, Thomas Weise, Bin Li, and Xin Yao (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 611–618.

  13. [13] Shuqing Li, Cuiyun Gao, Jianping Zhang, Yujia Zhang, Yepang Liu, Jiazhen Gu, Yun Peng, and Michael R. Lyu. 2024. Less Cybersickness, Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in Virtual Reality Apps. Proc. ACM Softw. Eng. 1, FSE, Article 96 (July 2024), 23 pages. doi:10.1145/3660803

  14. [14] Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang. 2021. Owl eyes: spotting UI display issues via visual understanding. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (Virtual Event, Australia) (ASE ’20). Association for Computing Machinery, New York, NY, USA, 398–409. doi:10.1145/3324884.3416547

  15. [15] Sebastian Lubos, Alexander Felfernig, Damian Garber, Viet-Man Le, and Manuel Henrich. 2026. AIG-ist-tugraz/MLLM-Usability-Improvements: Artifacts for FSE 2026. doi:10.5281/zenodo.19498008

  16. [16] Sebastian Lubos, Alexander Felfernig, Damian Garber, Viet Man Le, and Thi Ngoc Trang Tran. 2025. Towards LLM-Based Usability Analysis for Recommender User Interfaces. In Proceedings of the 12th Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS 2025) (CEUR Workshop Proceedings, Vol. 4027). CEUR-WS, Aachen. https://ceur-w...

  17. [17] Sebastian Lubos, Alexander Felfernig, Damian Garber, Gerhard Leitner, Julian Schwazer, and Manuel Henrich. 2026. Investigating Multimodal Large Language Models to Support Usability Evaluation. arXiv:2508.16165 [cs.SE] https://arxiv.org/abs/2508.16165

  18. [18] Forough Mehralian, Navid Salehnamadi, and Sam Malek. 2021. Data-driven accessibility repair revisited: on the effectiveness of generating labels for icons in Android apps. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE Proc. ACM Softw...

  19. [19] Rolf Molich. 2018. Are usability evaluations reproducible? Interactions 25, 6 (Oct. 2018), 82–85. doi:10.1145/3278154

  20. [20] Rolf Molich, Meghan R. Ede, Klaus Kaasgaard, and Barbara Karyukin. 2004. Comparative usability evaluation. Behav. Inf. Technol. 23, 1 (Jan. 2004), 65–74. doi:10.1080/0144929032000173951

  21. [21] Abdallah Namoun, Ahmed Alrehaili, and Ali Tufail. 2021. A Review of Automated Website Usability Evaluation Tools: Research Issues and Challenges. In Design, User Experience, and Usability: UX Research and Design, Marcelo M. Soares, Elizabeth Rosenzweig, and Aaron Marcus (Eds.). Springer International Publishing, Cham, 292–311.

  22. [22] Jakob Nielsen. 1994. Enhancing the explanatory power of usability heuristics. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Boston, Massachusetts, USA) (CHI ’94). Association for Computing Machinery, New York, NY, USA, 152–158. doi:10.1145/191666.191729

  23. [23] Jakob Nielsen. 2012. Usability 101: Introduction to Usability. https://www.nngroup.com/articles/usability-101-introduction-to-usability/. Accessed: 22.04.2025.

  24. [24] Rajvardhan Patil, Sorio Boit, Venkat Gudivada, and Jagadeesh Nandigam. 2023. A Survey of Text Representation and Embedding Techniques in NLP. IEEE Access 11 (2023), 36120–36146. doi:10.1109/ACCESS.2023.3266377

  25. [25] Ali Ebrahimi Pourasad and Walid Maalej. 2025. Does GenAI Make Usability Testing Obsolete? In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 675–675. doi:10.1109/ICSE55347.2025.00138

  26. [26] Nayan B. Ruparelia. 2010. Software development lifecycle models. SIGSOFT Softw. Eng. Notes 35, 3 (May 2010), 8–13. doi:10.1145/1764810.1764814

  27. [27] Yuhui Su, Zhe Liu, Chunyang Chen, Junjie Wang, and Qing Wang. 2021. OwlEyes-online: a fully automated platform for detecting and localizing UI display issues. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Compu...

  28. [28] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2024. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL] https://arxiv.org/abs/2312.11805

  29. [29] Sebastian Winter, Stefan Wagner, and Florian Deissenboeck. 2008. A Comprehensive Model of Usability. In Engineering Interactive Systems, Jan Gulliksen, Morton Borup Harning, Philippe Palanque, Gerrit C. van der Veer, and Janet Wesson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 106–122.

  30. [30] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. National Science Review 11, 12 (Nov. 2024). doi:10.1093/nsr/nwae403

  31. [31] Juyeon Yoon, Robert Feldt, and Shin Yoo. 2024. Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). 129–139. doi:10.1109/ICST60714.2024.00020

  32. [32] Yuxin Zhang, Sen Chen, Lingling Fan, Chunyang Chen, and Xiaohong Li. 2023. Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA) (ESEC/FSE 2023). Association for C...

  33. [33] Ruican Zhong, David W. McDonald, and Gary Hsieh. 2025. Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation. arXiv:2507.02306 [cs.HC] https://arxiv.org/abs/2507.02306