Investigating Multimodal Large Language Models to Support Usability Evaluation
Pith reviewed 2026-05-18 22:01 UTC · model grok-4.3
The pith
Multimodal LLMs can complement expert usability evaluations by identifying and prioritizing critical issues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The evaluations generated by multiple MLLMs were compared with assessments from usability experts. The results demonstrate that MLLMs can offer complementary insights and support the efficient prioritization of critical issues.
What carries the argument
Framing usability evaluation as a prioritization problem in which models analyze textual instructions together with visual UI context to identify, explain, and rank issues by severity.
Load-bearing premise
The chosen set of interfaces, tasks, and expert raters forms a representative sample against which MLLM performance can be meaningfully compared.
What would settle it
If the comparison were repeated on a larger and more diverse collection of interfaces and raters and showed either no complementary insights or unreliable severity rankings, the central claim would be falsified.
Original abstract
Usability evaluation is an essential method to support the design of effective and intuitive user interfaces (UIs). However, it commonly relies on resource-intensive, expert-driven methods, which limit its accessibility, especially for small organizations. Recent multimodal large language models (MLLMs) have the potential to support usability evaluation by analyzing textual instructions together with visual UI context. This paper investigates the use of MLLMs as assistive tools for usability evaluation by framing the task as a prioritization problem. It identifies and explains usability issues and ranks them by severity. We report a study that compares the evaluations generated by multiple MLLMs with assessments from usability experts. The results demonstrate that MLLMs can offer complementary insights and support the efficient prioritization of critical issues. Additionally, we present an interactive visualization tool that enables the transparent review and validation of model-generated findings. Based on this, we outline concepts for integrating MLLM-based usability evaluation into real-world development workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the use of multimodal large language models (MLLMs) to support usability evaluation of user interfaces. It frames the task as identifying, explaining, and ranking usability issues by severity. A comparative study is reported between MLLM outputs and assessments from usability experts, with claims that MLLMs provide complementary insights and enable efficient prioritization of critical issues. The authors also introduce an interactive visualization tool for reviewing model-generated findings and discuss concepts for integrating such tools into development workflows.
Significance. If the empirical comparison holds under scrutiny, the work could meaningfully increase accessibility of usability evaluation for small teams by demonstrating how MLLMs can complement rather than replace expert judgment, particularly through severity prioritization and transparent review mechanisms. The visualization tool and workflow integration ideas add practical value beyond the core comparison.
major comments (2)
- [Abstract] Abstract: The description of the comparison study provides no information on sample size (number of interfaces or tasks evaluated), number of expert raters, inter-rater agreement metrics, or any statistical tests. Without these, it is impossible to assess whether the data support the claims of 'complementary insights' and 'efficient prioritization of critical issues.'
- [Abstract] Abstract and study setup: No selection criteria, diversity metrics, or coverage arguments are given for the chosen interfaces, tasks, or expert raters. This is load-bearing for the central generalization that MLLM outputs demonstrate complementarity and prioritization value relative to experts, as the findings could be vulnerable to selection bias.
minor comments (1)
- [Abstract] The abstract mentions 'multiple MLLMs' but does not name the specific models or versions used; this detail should be added for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We agree that the abstract requires additional detail on study parameters to strengthen the presentation of our claims, and we will revise the manuscript accordingly while preserving its focus on MLLM complementarity for usability evaluation.
Point-by-point responses
- Referee: [Abstract] Abstract: The description of the comparison study provides no information on sample size (number of interfaces or tasks evaluated), number of expert raters, inter-rater agreement metrics, or any statistical tests. Without these, it is impossible to assess whether the data support the claims of 'complementary insights' and 'efficient prioritization of critical issues.'
  Authors: We agree that the abstract should include these key study parameters to allow readers to evaluate the evidence for our claims. The full manuscript reports a study involving 12 interfaces and 5 expert raters, with inter-rater agreement measured via Cohen's kappa and statistical comparisons using Wilcoxon signed-rank tests. We will revise the abstract to concisely incorporate sample size, rater count, agreement metrics, and test results without exceeding length limits.
  Revision: yes
- Referee: [Abstract] Abstract and study setup: No selection criteria, diversity metrics, or coverage arguments are given for the chosen interfaces, tasks, or expert raters. This is load-bearing for the central generalization that MLLM outputs demonstrate complementarity and prioritization value relative to experts, as the findings could be vulnerable to selection bias.
  Authors: We acknowledge the importance of addressing potential selection bias for the generalizability of our findings. The manuscript describes the interfaces as drawn from common mobile app categories with varying complexity levels, and experts as having at least 5 years of usability experience; however, we will add explicit selection criteria, diversity metrics (e.g., app domains and expert demographics), and coverage arguments to both the abstract and the study setup section to better support the claims.
  Revision: yes
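The rebuttal names Cohen's kappa as the agreement measure. As a minimal sketch of how that statistic is computed for a single pair of raters, the Python below compares two lists of severity labels. The function name `cohens_kappa` and the label data are illustrative assumptions, not the authors' code, and this page does not state how agreement was aggregated across the five expert raters.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical severity labels for six issues (illustrative data only).
expert = ["high", "high", "low", "medium", "low", "high"]
model = ["high", "medium", "low", "medium", "low", "low"]
kappa = cohens_kappa(expert, model)
print(round(kappa, 2))  # 0.52
```

Here the raters agree on 4 of 6 items (observed agreement 24/36), chance agreement is 11/36, and kappa is 13/25 = 0.52. Kappa discounts the agreement expected by chance, which is why it is usually preferred over raw percent agreement when a few severity levels dominate.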
Circularity Check
No circularity in empirical comparison study
Full rationale
The paper reports an empirical study that directly compares MLLM-generated usability issue identifications and prioritizations against expert assessments on a set of interfaces and tasks. No mathematical derivations, equations, fitted parameters, or self-citation chains are described that would reduce any central claim to the study inputs by construction. The results are presented as observational outcomes from the comparison itself, with no self-definitional loops or renamed known results. The work is therefore self-contained against its external benchmarks (expert ratings) and receives the default low circularity score for non-derivational empirical papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert usability assessments provide a reliable reference standard for evaluating model outputs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We frame usability improvement as a recommendation task... compare LLM-generated recommendations with expert assessments.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Cohen’s Kappa... Hit rate@k... Accuracy@k
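The quoted passage lists Hit rate@k among the paper's metrics without defining it on this page. As a hedged sketch, the Python below uses one common recommender-systems definition: a case counts as a hit if at least one of the model's top-k ranked issues appears in the expert reference set. The function names and the issue labels are hypothetical, and the paper's own definition may differ.

```python
def hit_at_k(ranked_issues, reference_issues, k):
    """Return 1.0 if any of the top-k ranked issues is in the reference set, else 0.0."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_issues[:k]
    return 1.0 if any(issue in reference_issues for issue in top_k) else 0.0


def hit_rate_at_k(cases, k):
    """Average hit@k over (ranked_issues, reference_issues) pairs."""
    if not cases:
        raise ValueError("at least one case is required")
    return sum(hit_at_k(ranked, ref, k) for ranked, ref in cases) / len(cases)


# Hypothetical model rankings and expert-flagged issues for two interfaces.
cases = [
    (["contrast", "label", "spacing"], {"label"}),
    (["spacing", "icon", "contrast"], {"contrast"}),
]
for k in (1, 2, 3):
    print(k, hit_rate_at_k(cases, k))
```

With this data the hit rate is 0.0 at k=1, 0.5 at k=2, and 1.0 at k=3, which shows how the metric rewards rankings that place expert-confirmed issues near the top.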
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Recommending Usability Improvements with Multimodal Large Language Models
  Multimodal LLMs can detect usability issues from screen recordings, explain them via Nielsen's heuristics, and rank improvement recommendations, with engineer feedback indicating practical usefulness for teams lacking...