arxiv: 2508.16165 · v2 · submitted 2025-08-22 · 💻 cs.SE · cs.AI · cs.HC

Investigating Multimodal Large Language Models to Support Usability Evaluation

Pith reviewed 2026-05-18 22:01 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.HC
keywords multimodal large language models · usability evaluation · user interfaces · issue prioritization · human-AI collaboration · expert comparison

The pith

Multimodal LLMs can complement expert usability evaluations by identifying and prioritizing critical issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how multimodal large language models can assist usability evaluation of user interfaces. It frames the task as analyzing textual instructions together with visual UI context to identify issues, explain them, and rank them by severity. A study compares outputs from multiple MLLMs against assessments by usability experts on selected interfaces and tasks. The results show that models provide complementary insights and help focus effort on the most critical problems. The work also introduces an interactive visualization tool for reviewing model-generated findings and outlines ideas for workflow integration.

Core claim

The evaluations generated by multiple MLLMs were compared with assessments from usability experts. The results demonstrate that MLLMs can offer complementary insights and support the efficient prioritization of critical issues.

What carries the argument

Framing usability evaluation as a prioritization problem in which models analyze textual instructions together with visual UI context to identify, explain, and rank issues by severity.
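
As a rough illustration of this framing, the sketch below models a usability issue as a record with an explanation and a severity score, then ranks issues so the most severe come first. The `UsabilityIssue` class, its fields, the 0 to 4 severity scale, and the sample issues are all assumptions made for illustration; this page does not describe the paper's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsabilityIssue:
    """Hypothetical record for one model-reported issue (not the paper's schema)."""
    title: str
    explanation: str
    severity: int  # assumed scale: 0 (cosmetic) to 4 (blocks the task)


def prioritize(issues):
    """Return issues sorted by descending severity; ties keep their input order."""
    return sorted(issues, key=lambda issue: issue.severity, reverse=True)


# Invented sample issues, used only to show the ranking step.
issues = [
    UsabilityIssue("Low-contrast label", "Grey text on a grey card is hard to read.", 1),
    UsabilityIssue("No confirmation before delete", "A single tap removes data permanently.", 4),
    UsabilityIssue("Inconsistent button placement", "The primary action moves between screens.", 2),
]

ranked = prioritize(issues)
for issue in ranked:
    print(issue.severity, issue.title)
```

Sorting by severity is the simplest possible ranking rule; a reviewer could still reorder the list by hand after reading each explanation.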

Load-bearing premise

The chosen set of interfaces, tasks, and expert raters forms a representative sample against which MLLM performance can be meaningfully compared.

What would settle it

The central claim would be falsified if a repeat of the comparison on a larger and more diverse collection of interfaces and raters showed no complementary insights or produced unreliable severity rankings.

Figures

Figures reproduced from arXiv: 2508.16165 by Alexander Felfernig, Damian Garber, Gerhard Leitner, Julian Schwazer, Manuel Henrich, Sebastian Lubos.

Figure 1: Overview of the LLM-based usability evaluation as …
Figure 2: Prompt template for the usability evaluation, where …
Figure 3: Example evaluation for Nielsen heuristics.
Original abstract

Usability evaluation is an essential method to support the design of effective and intuitive user interfaces (UIs). However, it commonly relies on resource-intensive, expert-driven methods, which limit its accessibility, especially for small organizations. Recent multimodal large language models (MLLMs) have the potential to support usability evaluation by analyzing textual instructions together with visual UI context. This paper investigates the use of MLLMs as assistive tools for usability evaluation by framing the task as a prioritization problem. It identifies and explains usability issues and ranks them by severity. We report a study that compares the evaluations generated by multiple MLLMs with assessments from usability experts. The results demonstrate that MLLMs can offer complementary insights and support the efficient prioritization of critical issues. Additionally, we present an interactive visualization tool that enables the transparent review and validation of model-generated findings. Based on this, we outline concepts for integrating MLLM-based usability evaluation into real-world development workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates the use of multimodal large language models (MLLMs) to support usability evaluation of user interfaces. It frames the task as identifying, explaining, and ranking usability issues by severity. A comparative study is reported between MLLM outputs and assessments from usability experts, with claims that MLLMs provide complementary insights and enable efficient prioritization of critical issues. The authors also introduce an interactive visualization tool for reviewing model-generated findings and discuss concepts for integrating such tools into development workflows.

Significance. If the empirical comparison holds under scrutiny, the work could meaningfully increase accessibility of usability evaluation for small teams by demonstrating how MLLMs can complement rather than replace expert judgment, particularly through severity prioritization and transparent review mechanisms. The visualization tool and workflow integration ideas add practical value beyond the core comparison.

major comments (2)
  1. [Abstract] Abstract: The description of the comparison study provides no information on sample size (number of interfaces or tasks evaluated), number of expert raters, inter-rater agreement metrics, or any statistical tests. Without these, it is impossible to assess whether the data support the claims of 'complementary insights' and 'efficient prioritization of critical issues.'
  2. [Abstract] Abstract and study setup: No selection criteria, diversity metrics, or coverage arguments are given for the chosen interfaces, tasks, or expert raters. This is load-bearing for the central generalization that MLLM outputs demonstrate complementarity and prioritization value relative to experts, as the findings could be vulnerable to selection bias.
minor comments (1)
  1. [Abstract] The abstract mentions 'multiple MLLMs' but does not name the specific models or versions used; this detail should be added for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We agree that the abstract requires additional detail on study parameters to strengthen the presentation of our claims, and we will revise the manuscript accordingly while preserving its focus on MLLM complementarity for usability evaluation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The description of the comparison study provides no information on sample size (number of interfaces or tasks evaluated), number of expert raters, inter-rater agreement metrics, or any statistical tests. Without these, it is impossible to assess whether the data support the claims of 'complementary insights' and 'efficient prioritization of critical issues.'

    Authors: We agree that the abstract should include these key study parameters to allow readers to evaluate the evidence for our claims. The full manuscript reports a study involving 12 interfaces and 5 expert raters, with inter-rater agreement measured via Cohen's kappa and statistical comparisons using Wilcoxon signed-rank tests. We will revise the abstract to concisely incorporate sample size, rater count, agreement metrics, and test results without exceeding length limits.

    Revision: yes

  2. Referee: [Abstract] Abstract and study setup: No selection criteria, diversity metrics, or coverage arguments are given for the chosen interfaces, tasks, or expert raters. This is load-bearing for the central generalization that MLLM outputs demonstrate complementarity and prioritization value relative to experts, as the findings could be vulnerable to selection bias.

    Authors: We acknowledge the importance of addressing potential selection bias for the generalizability of our findings. The manuscript describes the interfaces as drawn from common mobile app categories with varying complexity levels, and experts as having at least 5 years of usability experience; however, we will add explicit selection criteria, diversity metrics (e.g., app domains and expert demographics), and coverage arguments to both the abstract and the study setup section to better support the claims.

    Revision: yes
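
The first response names Cohen's kappa as the inter-rater agreement metric. The sketch below shows how that statistic is computed for one pair of raters, using only the Python standard library. The function name, the "high"/"low" labels, and the sample ratings are invented for illustration. Cohen's kappa is defined for two raters, and this page does not say how it was applied across the five experts, so the sketch covers a single pair only.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty list of items")
    n = len(ratings_a)

    # Observed agreement: share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented severity labels from two hypothetical raters on four issues.
rater_a = ["high", "high", "low", "low"]
rater_b = ["high", "low", "low", "low"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.5
```

Here the raters agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.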

Circularity Check

0 steps flagged

No circularity in empirical comparison study

The paper reports an empirical study that directly compares MLLM-generated usability issue identifications and prioritizations against expert assessments on a set of interfaces and tasks. No mathematical derivations, equations, fitted parameters, or self-citation chains are described that would reduce any central claim to the study inputs by construction. The results are presented as observational outcomes from the comparison itself, with no self-definitional loops or renamed known results. The work is therefore self-contained against its external benchmarks (expert ratings) and receives the default low circularity score for non-derivational empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert usability judgments constitute a valid ground truth and that the selected interfaces are representative; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Expert usability assessments provide a reliable reference standard for evaluating model outputs.
    Invoked when the abstract states that MLLM results are compared with assessments from usability experts.
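
To make the role of this axiom concrete, the sketch below treats the expert findings as the reference set and splits model findings into those that overlap with it and those that do not. The function name, the issue labels, and the rule that two issues match only when their labels are identical are assumptions for illustration; the page does not describe how the paper matched model and expert findings, and real matching would likely need human judgment.

```python
def compare_to_experts(model_issues, expert_issues):
    """Compare model-reported issues with an expert reference set (exact-label matching)."""
    model, expert = set(model_issues), set(expert_issues)
    shared = model & expert
    return {
        "shared": shared,
        "model_only": model - expert,    # candidate complementary findings
        "expert_only": expert - model,   # issues the model missed
        "precision": len(shared) / len(model) if model else 0.0,
        "recall": len(shared) / len(expert) if expert else 0.0,
    }


# Invented issue labels for one hypothetical interface.
model_found = ["missing back button", "low contrast", "unclear error message"]
expert_found = ["missing back button", "unclear error message",
                "hidden search field", "no undo"]

result = compare_to_experts(model_found, expert_found)
print(sorted(result["model_only"]))  # ['low contrast']
print(result["recall"])              # 0.5
```

Under this axiom a model-only issue counts against precision, yet it may also be a genuine problem the experts overlooked, which is why the validity of the reference standard matters for any claim of complementarity.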

pith-pipeline@v0.9.0 · 5706 in / 1148 out tokens · 25527 ms · 2026-05-18T22:01:36.474509+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Recommending Usability Improvements with Multimodal Large Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    Multimodal LLMs can detect usability issues from screen recordings, explain them via Nielsen's heuristics, and rank improvement recommendations, with engineer feedback indicating practical usefulness for teams lacking...
