pith. machine review for the scientific record.

arxiv: 2604.25420 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.HC

Recognition: unknown

Recommending Usability Improvements with Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:19 UTC · model grok-4.3

classification 💻 cs.SE cs.HC
keywords usability evaluation · multimodal large language models · Nielsen heuristics · improvement recommendations · screen recordings · user study · software engineering

The pith

Multimodal large language models identify usability issues in screen recordings and suggest ranked fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper explores using multimodal large language models to evaluate the usability of applications. The models receive limited context about the app and videos of user interactions, then identify problems based on standard usability principles, explain them, and recommend improvements sorted by how serious they are. Researchers tested this by having software engineers review the top suggestions. The goal is to make usability checks easier for teams that cannot easily consult specialists.

Core claim

The paper claims that inputting limited application context and screen recordings of user interactions into a multimodal large language model allows automatic identification of usability issues using Nielsen's heuristics, along with explanations and severity-ranked improvement recommendations. Evaluation through a user study with software engineers confirmed the practical usefulness of the highest-ranked suggestions as a complement to traditional methods.

What carries the argument

A multimodal large language model that processes visual inputs from screen recordings and textual context to generate descriptions of usability issues, explanations, and ranked recommendations.
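As a concrete illustration of this mechanism, the sketch below wires up a minimal version of such a pipeline in Python: sample frames from a screen recording with OpenCV, attach them to a textual prompt containing the app context, and ask a multimodal model for a severity-ranked issue list. The model name, prompt wording, frame-sampling rate, and output format are assumptions made here for illustration; the paper's actual prompts are shown in Figures 2-7.

```python
# Hypothetical sketch of the evaluation pipeline described above.
# Assumptions: an OpenAI-compatible multimodal endpoint; the paper's
# actual model, prompts, and output schema may differ (see Figures 2-7).
import base64
import cv2  # pip install opencv-python
from openai import OpenAI


def sample_frames(video_path: str, every_n: int = 60) -> list[str]:
    """Extract every n-th frame of a screen recording as base64 PNG."""
    frames, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            ok, png = cv2.imencode(".png", frame)
            if ok:
                frames.append(base64.b64encode(png.tobytes()).decode())
        idx += 1
    cap.release()
    return frames


def evaluate_usability(app_context: str, video_path: str) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    content = [{"type": "text", "text":
                f"App context: {app_context}\n"
                "Identify usability issues visible in these frames using "
                "Nielsen's 10 heuristics. For each issue give the violated "
                "heuristic, an explanation, an improvement recommendation, "
                "and a severity from 1 (cosmetic) to 4 (catastrophic). "
                "Return the issues as JSON sorted by severity."}]
    for b64 in sample_frames(video_path):
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder multimodal model, not the paper's choice
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```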

Load-bearing premise

The model must be able to spot actual usability problems and create useful, properly ordered recommendations based solely on the provided context and recordings.

What would settle it

A direct comparison study in which independent usability experts review the same recordings and list issues, so that the relevance and ranking accuracy of the model's output can be measured against the experts' findings.
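A minimal sketch of how such a comparison could be scored, assuming model-generated issues have already been matched to expert-listed issues upstream (for instance via an embedding-similarity threshold): precision and recall over matched issues, plus a Spearman correlation between the model's severity ranks and the experts' ranks. The function, its data layout, and the example numbers are hypothetical.

```python
# Hypothetical scoring of model-vs-expert agreement; the matching of
# model issues to expert issues (matched_pairs) is assumed to be done
# upstream, e.g. via embedding similarity with a threshold.
from scipy.stats import spearmanr


def agreement_scores(n_model: int, n_expert: int,
                     matched_pairs: list[tuple[int, int]]):
    """matched_pairs: (model_rank, expert_rank) for each matched issue."""
    tp = len(matched_pairs)
    precision = tp / n_model if n_model else 0.0  # model issues that are real
    recall = tp / n_expert if n_expert else 0.0   # expert issues the model found
    rho = None
    if tp >= 2:
        model_ranks, expert_ranks = zip(*matched_pairs)
        rho, _ = spearmanr(model_ranks, expert_ranks)  # ranking agreement
    return precision, recall, rho


# Example: the model listed 6 issues, experts listed 5, and 4 overlapped.
print(agreement_scores(6, 5, [(1, 2), (2, 1), (3, 3), (5, 4)]))
```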

Figures

Figures reproduced from arXiv: 2604.25420 by Alexander Felfernig, Damian Garber, Manuel Henrich, Sebastian Lubos, Viet-Man Le.

Figure 1: Overview of the four-step process for our automated heuristic-based usability evaluation and improve…
Figure 2: System prompt for usability evaluation.
Figure 3: Prompt template for usability evaluation with variable placeholders in parentheses…
Figure 4: Prompt template for usability issue summary with variable placeholders in parentheses…
Figure 5: Prompt template for usability improvement recommendation summary with variable placeholders in…
Figure 6: System prompt for the severity ranking of issues.
Figure 7: Prompt template for usability issue ranking by severity with variable placeholders in parentheses…
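Figures 3-7 describe prompt templates with variable placeholders; the exact wording lives in those figures. As an illustration only, a template of that general shape might look like the following, with the placeholder names invented here rather than taken from the paper.

```python
# Illustrative prompt template with variable placeholders, echoing the
# structure Figures 3-7 describe; placeholder names are invented here.
from string import Template

EVALUATION_TEMPLATE = Template(
    "Application context: $app_context\n"
    "Task performed by the user: $task_description\n"
    "You are given frames from a screen recording of this task.\n"
    "Identify usability issues according to Nielsen's heuristics.\n"
    "For each issue, report: heuristic, explanation, recommendation."
)

prompt = EVALUATION_TEMPLATE.substitute(
    app_context="Mobile shopping app, checkout flow",
    task_description="Complete a purchase with a saved credit card",
)
print(prompt)
```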
Original abstract

Usability describes quality attributes of application user interfaces that determine how effectively users can interact with them. Traditional usability evaluation methods require considerable expertise and resources, which can be challenging, especially for small teams and organizations. Automating usability evaluation could make it more accessible and help to improve the user experience. The recent emergence of powerful multimodal large language models (MLLMs) has opened new opportunities for automating usability evaluation and recommendation of improvements. These models can process visual inputs such as images and videos alongside textual context, which enables the identification of usability issues and the generation of actionable suggestions to resolve these issues. In this paper, we present a novel automated approach that uses limited application context and screen recordings of user interactions as input to an MLLM. The model automatically identifies and describes usability issues based on Nielsen's usability heuristics, and provides corresponding explanations and improvement recommendations. To reduce the developer effort of manual prioritization, the recommendations are ranked by severity. The quality and practical usefulness of the generated recommendations were evaluated based on a user study that involved software engineers as participants. The evaluation focused on the highest-ranked suggestions provided by the model. The results demonstrate the potential of our approach to provide low-effort usability improvement recommendations. This makes it a promising complement to traditional evaluation methods, especially in settings with limited access to usability experts. In this sense, the approach serves as a basis for future integration into development tools to enable automated usability evaluation within software engineering workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel automated approach for usability evaluation that feeds limited application context and screen recordings of user interactions into a multimodal large language model (MLLM). The MLLM identifies usability issues according to Nielsen's heuristics, generates explanations and improvement recommendations, and ranks the recommendations by severity. The quality and practical usefulness of the highest-ranked recommendations are assessed via a user study with software engineers as participants; the authors conclude that the method offers a low-effort complement to traditional usability evaluation, particularly for teams lacking dedicated experts.

Significance. If the user-study evidence were robust, the work would be significant for software engineering and HCI by demonstrating how MLLMs can operationalize established heuristics on visual interaction data to produce actionable, ranked outputs. This could lower the barrier to usability improvements in resource-constrained settings and support integration into development tools. The approach is grounded in well-known heuristics and addresses a real pain point (prioritization effort), but the absence of study details currently prevents any assessment of whether these benefits are realized.

major comments (2)
  1. [Evaluation section] The central claim that the results demonstrate practical usefulness rests entirely on the user study, yet the manuscript provides no description of study design, participant count or demographics, applications or tasks used, rating protocol (scales, instructions, what was rated), inter-rater reliability, comparison to usability experts or baselines, or any quantitative results or statistical analysis. Without these elements it is impossible to determine whether the positive outcomes reflect genuine identification of real usability issues or are artifacts of an under-powered, uncontrolled, or biased evaluation.
  2. [Method and Results sections] No ground-truth validation is reported for the MLLM's heuristic application or severity ranking. It is unclear how the model extracts issues from screen recordings plus limited context, whether the rankings align with actual user impact, or how the outputs compare to expert-generated recommendations; this leaves the reliability of the automated pipeline untested and undermines the claim that the approach can serve as a trustworthy complement to traditional methods.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one concrete quantitative finding or scale from the user study rather than the generic statement that results 'demonstrate the potential.'
  2. [Approach section] Clarify the exact MLLM model, prompting strategy, and any post-processing used for ranking; these details are necessary for reproducibility even if the study is the primary concern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the presentation of our evaluation and methodological details. We address each major comment below and commit to revisions that improve transparency without altering the core claims of the work.

Point-by-point responses
  1. Referee: [Evaluation section] The central claim that the results demonstrate practical usefulness rests entirely on the user study, yet the manuscript provides no description of study design, participant count or demographics, applications or tasks used, rating protocol (scales, instructions, what was rated), inter-rater reliability, comparison to usability experts or baselines, or any quantitative results or statistical analysis. Without these elements it is impossible to determine whether the positive outcomes reflect genuine identification of real usability issues or are artifacts of an under-powered, uncontrolled, or biased evaluation.

    Authors: We agree that the Evaluation section in the submitted manuscript lacks the requested details on study design and results. This was an oversight during manuscript preparation. In the revised version we will expand the section to describe the participant count and demographics, the applications and tasks used, the rating protocol including scales and instructions provided to participants, inter-rater reliability measures, any comparisons performed, and the quantitative results together with appropriate statistical analysis. These additions will allow readers to assess the robustness of the reported outcomes. revision: yes

  2. Referee: [Method and Results sections] No ground-truth validation is reported for the MLLM's heuristic application or severity ranking. It is unclear how the model extracts issues from screen recordings plus limited context, whether the rankings align with actual user impact, or how the outputs compare to expert-generated recommendations; this leaves the reliability of the automated pipeline untested and undermines the claim that the approach can serve as a trustworthy complement to traditional methods.

    Authors: We acknowledge that the manuscript does not include explicit ground-truth validation (such as expert comparisons or direct measures of user impact) for the MLLM's heuristic classifications or severity rankings. The evaluation instead relies on practitioner feedback from software engineers regarding the usefulness of the highest-ranked recommendations. We will revise the Method and Results sections to clarify the input processing steps and to add an explicit discussion of this limitation, including how the engineer study serves as a proxy for real-world applicability. We do not claim the current evidence constitutes full ground-truth validation and will frame the contribution accordingly as an initial demonstration rather than a definitive replacement for expert review. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an applied system that feeds limited context and screen recordings into an external MLLM, which then applies Nielsen's established heuristics to identify issues and rank recommendations. No equations, parameter fitting, or derivation chain exists. The central claim of practical usefulness rests on an independent user study with software engineers rather than any internal consistency or self-referential reduction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations that collapse the result to its inputs are present in the provided text or abstract.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical application of existing multimodal LLMs and Nielsen's heuristics; no free parameters, axioms, or invented entities are introduced or fitted.

pith-pipeline@v0.9.0 · 5568 in / 1048 out tokens · 48783 ms · 2026-05-07T16:19:35.347285+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 26 canonical work pages · 3 internal anchors

  1. [1] Abdulaziz Alshayban and Sam Malek. 2022. AccessiText: automated detection of text accessibility issues in Android apps. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, 984–995. ...

  2. [2] Laura Carvajal, Ana M. Moreno, María-Isabel Sánchez-Segura, and Ahmed Seffah. 2013. Usability through Software Design. IEEE Transactions on Software Engineering 39, 11 (2013), 1582–1596. doi:10.1109/TSE.2013.29

  3. [3] John W. Castro, Ignacio Garnica, and Luis A. Rojas. 2022. Automated Tools for Usability Evaluation: A Systematic Mapping Study. In Social Computing and Social Media: Design, User Experience and Impact, Gabriele Meiselwitz (Ed.). Springer International Publishing, Cham, 28–46.

  4. [4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. MIND2WEB: towards a generalist agent for the web. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 1220, 24 pages.

  5. [5] Peitong Duan, Jeremy Warner, Yang Li, and Bjoern Hartmann. 2024. Generating Automatic Feedback on UI Mockups with Large Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 6, 20 pages. doi:10.1145/3613904.3642782

  6. [6] Alexander Felfernig, Müslüm Atas, Denis Helic, Thi Ngoc Trang Tran, Martin Stettinger, and Ralph Samer. 2024. Algorithms for Group Recommendation. Springer Nature Switzerland, Cham, 29–61. doi:10.1007/978-3-031-44943-7_2

  7. [7] Guilherme Guerino, Luiz Rodrigues, Bruna Capeleti, Rafael Ferreira Mello, André Freire, and Luciana Zaina. 2025. Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation. In Human-Computer Interaction – INTERACT 2025: 20th IFIP TC 13 International Conference, Belo Horizonte, Brazil, September 8–1...

  8. [8] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. arXiv:2307.12856 [cs.LG] https://arxiv.org/abs/2307.12856

  9. [9] Christopher Hass. 2019. A Practical Guide to Usability Testing. Springer International Publishing, Cham, 107–124. doi:10.1007/978-3-319-96906-0_6

  10. [10] Tasha Hollingsed and David G. Novick. 2007. Usability inspection methods after 15 years of research and practice. In Proceedings of the 25th Annual ACM International Conference on Design of Communication (El Paso, Texas, USA) (SIGDOC ’07). Association for Computing Machinery, New York, NY, USA, 249–255. doi:10.1145/1297144.1297200

  11. [11] Eduard Kuric, Peter Demcak, Matus Krajcovic, and Jan Lang. 2025. Systematic Literature Review of Automation and Artificial Intelligence in Usability Issue Detection. arXiv:2504.01415 [cs.HC] https://arxiv.org/abs/2504.01415

  12. [12] Baoli Li and Liping Han. 2013. Distance Weighted Cosine Similarity Measure for Text Classification. In Intelligent Data Engineering and Automated Learning – IDEAL 2013, Hujun Yin, Ke Tang, Yang Gao, Frank Klawonn, Minho Lee, Thomas Weise, Bin Li, and Xin Yao (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 611–618.

  13. [13] Shuqing Li, Cuiyun Gao, Jianping Zhang, Yujia Zhang, Yepang Liu, Jiazhen Gu, Yun Peng, and Michael R. Lyu. 2024. Less Cybersickness, Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in Virtual Reality Apps. Proc. ACM Softw. Eng. 1, FSE, Article 96 (July 2024), 23 pages. doi:10.1145/3660803

  14. [14] Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang. 2021. Owl eyes: spotting UI display issues via visual understanding. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (Virtual Event, Australia) (ASE ’20). Association for Computing Machinery, New York, NY, USA, 398–409. doi:10.1145/3324884.3416547

  15. [15] Sebastian Lubos, Alexander Felfernig, Damian Garber, Viet-Man Le, and Manuel Henrich. 2026. AIG-ist-tugraz/MLLM-Usability-Improvements: Artifacts for FSE 2026. doi:10.5281/zenodo.19498008

  16. [16] Sebastian Lubos, Alexander Felfernig, Damian Garber, Viet Man Le, and Thi Ngoc Trang Tran. 2025. Towards LLM-Based Usability Analysis for Recommender User Interfaces. In Proceedings of the 12th Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS 2025) (CEUR Workshop Proceedings, Vol. 4027). CEUR-WS, Aachen. https://ceur-w...

  17. [17] Sebastian Lubos, Alexander Felfernig, Damian Garber, Gerhard Leitner, Julian Schwazer, and Manuel Henrich. 2026. Investigating Multimodal Large Language Models to Support Usability Evaluation. arXiv:2508.16165 [cs.SE] https://arxiv.org/abs/2508.16165

  18. [18] Forough Mehralian, Navid Salehnamadi, and Sam Malek. 2021. Data-driven accessibility repair revisited: on the effectiveness of generating labels for icons in Android apps. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE Proc. ACM Softw...

  19. [19] Rolf Molich. 2018. Are usability evaluations reproducible? Interactions 25, 6 (Oct. 2018), 82–85. doi:10.1145/3278154

  20. [20] Rolf Molich, Meghan R. Ede, Klaus Kaasgaard, and Barbara Karyukin. 2004. Comparative usability evaluation. Behav. Inf. Technol. 23, 1 (Jan. 2004), 65–74. doi:10.1080/0144929032000173951

  21. [21] Abdallah Namoun, Ahmed Alrehaili, and Ali Tufail. 2021. A Review of Automated Website Usability Evaluation Tools: Research Issues and Challenges. In Design, User Experience, and Usability: UX Research and Design, Marcelo M. Soares, Elizabeth Rosenzweig, and Aaron Marcus (Eds.). Springer International Publishing, Cham, 292–311.

  22. [22] Jakob Nielsen. 1994. Enhancing the explanatory power of usability heuristics. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Boston, Massachusetts, USA) (CHI ’94). Association for Computing Machinery, New York, NY, USA, 152–158. doi:10.1145/191666.191729

  23. [23] Jakob Nielsen. 2012. Usability 101: Introduction to Usability. https://www.nngroup.com/articles/usability-101-introduction-to-usability/. Accessed: 22.04.2025.

  24. [24] Rajvardhan Patil, Sorio Boit, Venkat Gudivada, and Jagadeesh Nandigam. 2023. A Survey of Text Representation and Embedding Techniques in NLP. IEEE Access 11 (2023), 36120–36146. doi:10.1109/ACCESS.2023.3266377

  25. [25] Ali Ebrahimi Pourasad and Walid Maalej. 2025. Does GenAI Make Usability Testing Obsolete? In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 675–675. doi:10.1109/ICSE55347.2025.00138

  26. [26] Nayan B. Ruparelia. 2010. Software development lifecycle models. SIGSOFT Softw. Eng. Notes 35, 3 (May 2010), 8–13. doi:10.1145/1764810.1764814

  27. [27] Yuhui Su, Zhe Liu, Chunyang Chen, Junjie Wang, and Qing Wang. 2021. OwlEyes-online: a fully automated platform for detecting and localizing UI display issues. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Compu...

  28. [28] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2024. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL] https://arxiv.org/abs/2312.11805

  29. [29] Sebastian Winter, Stefan Wagner, and Florian Deissenboeck. 2008. A Comprehensive Model of Usability. In Engineering Interactive Systems, Jan Gulliksen, Morton Borup Harning, Philippe Palanque, Gerrit C. van der Veer, and Janet Wesson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 106–122.

  30. [30] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. National Science Review 11, 12 (Nov. 2024). doi:10.1093/nsr/nwae403

  31. [31] Juyeon Yoon, Robert Feldt, and Shin Yoo. 2024. Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). 129–139. doi:10.1109/ICST60714.2024.00020

  32. [32] Yuxin Zhang, Sen Chen, Lingling Fan, Chunyang Chen, and Xiaohong Li. 2023. Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA) (ESEC/FSE 2023). Association for C...

  33. [33] Ruican Zhong, David W. McDonald, and Gary Hsieh. 2025. Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation. arXiv:2507.02306 [cs.HC] https://arxiv.org/abs/2507.02306