
arxiv: 2506.16345 · v1 · submitted 2025-06-19 · 💻 cs.HC

Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation

Pith reviewed 2026-05-19 08:48 UTC · model grok-4.3

classification 💻 cs.HC
keywords heuristic evaluation · GPT-4o · usability · Nielsen heuristics · large language models · human-computer interaction · issue identification · web interfaces

The pith

GPT-4o identifies only 21.2% of the usability issues found by human experts when both apply Nielsen's heuristics to the same web screenshots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether GPT-4o can carry out heuristic evaluation at a level comparable to trained HCI experts. Researchers supplied the model with a literature-based prompt and screenshots from one web system, then measured how many of the experts' issues the model also reported. Overlap proved low, with GPT-4o recovering just over one-fifth of the human findings while surfacing twenty-seven new ones and a number of false positives traced to hallucinations. The model performed relatively well on aesthetics and real-world match but missed many issues involving user flexibility, control, and efficiency. The authors close by offering five practical takeaways for using the model in usability work without over-reliance.

Core claim

This study compared heuristic evaluations performed by GPT-4o and human HCI experts on screenshots from a web system using Nielsen's heuristics. The results showed that GPT-4o identified only 21.2% of the issues found by experts while discovering 27 additional issues, performing better on heuristics related to aesthetic and minimalist design and match between system and real world but struggling with those involving flexibility, control, and user efficiency. GPT-4o also generated several false positives due to hallucinations and attempts to predict issues.

What carries the argument

Direct comparison of usability issues detected by GPT-4o (prompted with Nielsen's heuristics) against those identified by human experts on the same set of web-system screenshots.
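
The headline figure can be read as a recall-style ratio: expert issues that the model also reported, divided by all expert issues. The sketch below is only an illustration of that arithmetic. The issue IDs, the `matches` mapping, and the name `overlap_summary` are hypothetical, and the toy data are not the study's data (the referee report notes that the total number of expert issues is not stated). It assumes human coders have already decided which model issues correspond to which expert issues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapSummary:
    expert_total: int    # issues reported by the human experts
    matched: int         # expert issues the model also reported
    model_only: int      # model issues with no expert counterpart
    overlap_rate: float  # matched / expert_total


def overlap_summary(expert_ids, model_ids, matches):
    """Summarise agreement between expert and model issue lists.

    `matches` maps a model issue ID to the expert issue ID that human
    coders judged to describe the same problem (hypothetical structure).
    """
    expert_ids = set(expert_ids)
    model_ids = set(model_ids)
    if not expert_ids:
        raise ValueError("need at least one expert issue")
    # Expert issues that at least one model issue was matched to.
    matched_expert = {
        e for m, e in matches.items() if m in model_ids and e in expert_ids
    }
    # Model issues that no coder matched to an expert issue.
    model_only = model_ids - set(matches)
    return OverlapSummary(
        expert_total=len(expert_ids),
        matched=len(matched_expert),
        model_only=len(model_only),
        overlap_rate=len(matched_expert) / len(expert_ids),
    )


# Toy data for illustration only, NOT taken from the paper.
experts = ["E1", "E2", "E3", "E4", "E5"]
model = ["M1", "M2", "M3"]
coded = {"M1": "E2"}  # coders judged M1 and E2 to be the same issue

summary = overlap_summary(experts, model, coded)
print(f"{summary.overlap_rate:.1%} overlap, {summary.model_only} model-only issues")
```

Note that `model_only` lumps together genuinely new issues and false positives; separating the two still requires human judgment.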

If this is right

  • GPT-4o cannot be treated as a full substitute for human experts in heuristic evaluation.
  • The model can surface additional candidate issues that experts might otherwise miss.
  • Heuristics involving flexibility, user control, and efficiency need explicit human review when LLMs are involved.
  • Prompt engineering for HCI tasks must address hallucination to reduce false positives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLMs may serve best as an early filter that narrows the set of issues for human experts to examine.
  • The same low-overlap pattern could appear in other expert judgment tasks such as cognitive walkthroughs or accessibility audits.
  • Combining outputs from several different LLMs might raise coverage without increasing human workload.
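
The multi-model idea in the last bullet amounts to taking the union of the expert issues each model recovers. A minimal sketch of that calculation follows, under the assumption that each model's output has already been matched to expert issue IDs by human coders. The model names, issue IDs, and the function `coverage` are hypothetical, and the numbers are toy values rather than results.

```python
def coverage(expert_ids, matched_by_model):
    """Fraction of expert issues matched by at least one model.

    `matched_by_model` maps a model name to the set of expert issue IDs
    that the model's output was judged to cover (hypothetical structure).
    """
    expert_ids = set(expert_ids)
    if not expert_ids:
        raise ValueError("need at least one expert issue")
    covered = set()
    for matched in matched_by_model.values():
        # Ignore IDs that are not in the expert reference set.
        covered |= set(matched) & expert_ids
    return len(covered) / len(expert_ids)


# Toy data for illustration only.
experts = ["E1", "E2", "E3", "E4", "E5"]
matched_by_model = {
    "model_a": {"E1"},
    "model_b": {"E1", "E3"},
    "model_c": {"E4"},
}

# Coverage of each model alone, then of all models combined.
per_model = {
    name: coverage(experts, {name: ids})
    for name, ids in matched_by_model.items()
}
combined = coverage(experts, matched_by_model)
print(per_model, combined)
```

The union can only raise coverage, but it also pools every model's false positives, so whether human workload stays flat would need to be measured rather than assumed.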

Load-bearing premise

The study assumes that a literature-grounded prompt and selected screenshots from a single web system are enough to reveal GPT-4o's general capability for heuristic evaluation relative to HCI experts.

What would settle it

A follow-up experiment on several additional web systems or with revised prompts that produces substantially higher overlap with expert-identified issues would indicate the 21.2% figure is not general.

Figures

Figures reproduced from arXiv: 2506.16345 by André Freire, Bruna Capeleti, Guilherme Guerino, Luciana Zaina, Luiz Rodrigues, Rafael Ferreira Mello.

Figure 1: Severity of issues identified by heuristics. Regarding system visibility and feedback issues, GPT-4o identified flaws, such as the absence of a clear indicator for loading maps (D30), which can generate uncertainty for the user about whether the action was processed. A similar situation occurs in the lack of adequate visual feedback when users apply filters (D29), making it necessary for the user to inves…
Original abstract

Heuristic evaluation is a widely used method in Human-Computer Interaction (HCI) to inspect interfaces and identify issues based on heuristics. Recently, Large Language Models (LLMs), such as GPT-4o, have been applied in HCI to assist in persona creation, the ideation process, and the analysis of semi-structured interviews. However, considering the need to understand heuristics and the high degree of abstraction required to evaluate them, LLMs may have difficulty conducting heuristic evaluation. Moreover, prior research has not investigated GPT-4o's performance in heuristic evaluation compared to HCI experts in web-based systems. In this context, this study aims to compare the results of a heuristic evaluation performed by GPT-4o and human experts. To this end, we selected a set of screenshots from a web system and asked GPT-4o to perform a heuristic evaluation based on Nielsen's Heuristics from a literature-grounded prompt. Our results indicate that only 21.2% of the issues identified by human experts were also identified by GPT-4o, although it found 27 new issues. We also found that GPT-4o performed better for heuristics related to aesthetic and minimalist design and match between system and real world, whereas it had difficulty identifying issues in heuristics related to flexibility, control, and user efficiency. Additionally, we noticed that GPT-4o generated several false positives due to hallucinations and attempts to predict issues. Finally, we highlight five takeaways for the conscious use of GPT-4o in heuristic evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper compares GPT-4o to human HCI experts in performing heuristic evaluation using Nielsen's heuristics on screenshots from a single web system. Using a literature-grounded prompt, the authors find that GPT-4o identified only 21.2% of the issues found by experts while surfacing 27 additional issues, performed relatively better on heuristics related to aesthetic and minimalist design and match between system and real world, but struggled with flexibility, control, and user efficiency, and produced false positives attributed to hallucinations. The work concludes with five takeaways for the conscious use of GPT-4o in heuristic evaluations.

Significance. If the central comparison holds, the work supplies concrete evidence that current LLMs face difficulties with the abstract reasoning required for heuristic evaluation, particularly on control- and efficiency-related issues. This could usefully inform HCI practice by highlighting risks of over-reliance on LLMs for usability inspection and motivating hybrid human-AI workflows. The direct side-by-side issue identification against experts is a clear empirical contribution, though the narrow stimulus set limits the strength of any general claim about GPT-4o capability.

major comments (2)
  1. §3 (Method): The evaluation uses screenshots from only one unspecified web system with no reported variation, selection criteria, or validation on additional interfaces. This single-stimulus design is load-bearing for the headline claim that GPT-4o cannot evaluate usability like experts, because the observed 21.2% overlap and heuristic-specific patterns could be artifacts of that particular interface rather than intrinsic model limitations.
  2. §4 (Results): The reported 21.2% overlap is presented without the total number of expert-identified issues, the number of human experts, inter-rater reliability statistics, or a description of how semantic overlap between GPT-4o and expert issues was determined. These omissions make the quantitative comparison difficult to interpret and weaken support for the central claim of underperformance.
minor comments (2)
  1. [§3.1] The full prompt text used with GPT-4o should be included in an appendix or supplementary material to support reproducibility.
  2. [§4.2] Clarify in the results how false positives, including those attributed to hallucinations, were distinguished from genuine new issues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and contextualize our claims.

Point-by-point responses
  1. Referee: §3 (Method): The evaluation uses screenshots from only one unspecified web system with no reported variation, selection criteria, or validation on additional interfaces. This single-stimulus design is load-bearing for the headline claim that GPT-4o cannot evaluate usability like experts, because the observed 21.2% overlap and heuristic-specific patterns could be artifacts of that particular interface rather than intrinsic model limitations.

    Authors: We acknowledge that the single-interface design limits the strength of broad generalizations. The system was chosen as a representative, moderately complex web application to enable focused qualitative comparison of issue types. In revision we will (1) specify the system and selection criteria for the screenshots, (2) add a dedicated limitations paragraph stating that the 21.2% overlap and heuristic patterns are tied to this stimulus set, and (3) explicitly recommend replication on additional interfaces. These changes will prevent overstatement while retaining the value of the side-by-side expert comparison. revision: partial

  2. Referee: §4 (Results): The reported 21.2% overlap is presented without the total number of expert-identified issues, the number of human experts, inter-rater reliability statistics, or a description of how semantic overlap between GPT-4o and expert issues was determined. These omissions make the quantitative comparison difficult to interpret and weaken support for the central claim of underperformance.

    Authors: We agree these details should have been reported explicitly. The revised §4 will state the total number of issues identified by the experts, the number of participating HCI experts, the inter-rater reliability statistic obtained, and the procedure used to judge semantic overlap (independent coding by two researchers followed by consensus resolution of disagreements). These additions will allow readers to evaluate the quantitative comparison directly. revision: yes
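
The rebuttal mentions independent coding by two researchers and an inter-rater reliability statistic without naming it. One common choice for two coders assigning categorical labels is Cohen's kappa; the sketch below assumes that statistic purely for illustration and is not a claim about what the authors computed. The labels and the toy codings are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each coder's label
    frequencies.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("codings must be non-empty and of equal length")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # A Counter returns 0 for labels the other coder never used.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy codings for illustration only: does a model issue match an expert issue?
coder_a = ["match", "match", "no", "no", "match", "no", "no", "no"]
coder_b = ["match", "no", "no", "no", "match", "no", "match", "no"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

Items on which the two codings differ are the ones that would go to the consensus step the authors describe.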

Circularity Check

0 steps flagged

No circularity: direct empirical comparison with no derivations or self-referential reductions

full rationale

The paper performs a straightforward empirical study by selecting screenshots from one web system, prompting GPT-4o with a literature-grounded prompt based on Nielsen's heuristics, and directly measuring overlap (21.2%) plus new issues against human expert judgments. No equations, fitted parameters, predictions by construction, or load-bearing self-citations exist; the central results are measured outputs from the described procedure rather than reductions to prior inputs. The study is self-contained against external benchmarks of human evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard HCI practices and the assumption that the chosen prompt and screenshots allow fair comparison; no new entities or fitted parameters are introduced.

axioms (2)
  • domain assumption Nielsen's heuristics are a valid and standard framework for usability evaluation.
    The study uses them as the basis for evaluation without questioning their validity.
  • domain assumption Human experts provide a reliable gold standard for identifying usability issues.
    The comparison treats human identifications as the reference set.

pith-pipeline@v0.9.0 · 5827 in / 1369 out tokens · 66276 ms · 2026-05-19T08:48:39.612539+00:00 · methodology


