Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation
Pith reviewed 2026-05-19 08:48 UTC · model grok-4.3
The pith
GPT-4o identifies only 21.2% of the usability issues found by human experts when both apply Nielsen's heuristics to the same web screenshots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This study compared heuristic evaluations performed by GPT-4o and human HCI experts on screenshots from a web system using Nielsen's heuristics. The results showed that GPT-4o identified only 21.2% of the issues found by experts while discovering 27 additional issues, performing better on heuristics related to aesthetic and minimalist design and match between system and real world but struggling with those involving flexibility, control, and user efficiency. GPT-4o also generated several false positives due to hallucinations and attempts to predict issues.
What carries the argument
Direct comparison of usability issues detected by GPT-4o (prompted with Nielsen's heuristics) against those identified by human experts on the same set of web-system screenshots.
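The comparison above reduces to two counts: how many expert issues the model also reported, and how many model issues have no expert counterpart. A minimal sketch of that bookkeeping follows. The names `ComparisonResult` and `compare_issues`, the issue ids, and the toy data are all hypothetical; the paper's actual issue lists and matching procedure are not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class ComparisonResult:
    expert_total: int  # issues found by human experts
    matched: int       # distinct expert issues the model also reported
    model_only: int    # model issues with no expert counterpart

    @property
    def coverage(self) -> float:
        """Share of expert issues that the model also identified."""
        return self.matched / self.expert_total if self.expert_total else 0.0


def compare_issues(
    expert_ids: Set[str],
    model_to_expert: Dict[str, Optional[str]],
) -> ComparisonResult:
    """Summarise a manual matching of model issues to expert issues.

    model_to_expert maps each model issue id to the expert issue it was
    judged equivalent to, or None when coders found no counterpart.
    """
    # Several model issues may map to the same expert issue; count it once.
    matched = {e for e in model_to_expert.values() if e in expert_ids}
    # Model-only issues still need human review: they may be genuine new
    # findings or false positives.
    model_only = sum(1 for e in model_to_expert.values() if e is None)
    return ComparisonResult(len(expert_ids), len(matched), model_only)


# Hypothetical toy data, not taken from the paper.
expert_ids = {"E1", "E2", "E3", "E4", "E5"}
model_to_expert = {"M1": "E2", "M2": None, "M3": "E2", "M4": None, "M5": "E5"}

result = compare_issues(expert_ids, model_to_expert)
print(f"coverage={result.coverage:.1%}, model-only issues={result.model_only}")
```

Counting distinct expert issues, rather than model issues, keeps coverage from being inflated when the model reports the same problem twice.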
If this is right
- GPT-4o cannot be treated as a full substitute for human experts in heuristic evaluation.
- The model can surface additional candidate issues that experts might otherwise miss.
- Heuristics involving flexibility, user control, and efficiency need explicit human review when LLMs are involved.
- Prompt engineering for HCI tasks must address hallucination to reduce false positives.
Where Pith is reading between the lines
- LLMs may serve best as an early filter that narrows the set of issues for human experts to examine.
- The same low-overlap pattern could appear in other expert judgment tasks such as cognitive walkthroughs or accessibility audits.
- Combining outputs from several different LLMs might raise coverage without increasing human workload.
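The last point can be illustrated with a small sketch of union coverage across models. The function name `union_coverage`, the model names, and the matched-issue sets are hypothetical and serve only to show the calculation.

```python
from typing import Dict, Set


def union_coverage(expert_ids: Set[str], matches_per_model: Dict[str, Set[str]]) -> float:
    """Fraction of expert issues matched by at least one model."""
    covered: Set[str] = set()
    for matched_ids in matches_per_model.values():
        # Ignore ids that are not expert issues.
        covered |= matched_ids & expert_ids
    return len(covered) / len(expert_ids) if expert_ids else 0.0


# Hypothetical toy data: expert issues each model was judged to have matched.
expert_ids = {"E1", "E2", "E3", "E4", "E5"}
matches_per_model = {
    "model_a": {"E1", "E2"},
    "model_b": {"E2", "E4"},
}

per_model = {
    name: len(ids & expert_ids) / len(expert_ids)
    for name, ids in matches_per_model.items()
}
combined = union_coverage(expert_ids, matches_per_model)
print(per_model, combined)
```

In this toy case each model alone covers 40% of the expert issues and the union covers 60%. Pooling outputs also pools each model's unmatched issues, so whether human workload stays flat would need to be measured.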
Load-bearing premise
That one literature-grounded prompt and a set of screenshots from a single web system are enough to reveal GPT-4o's general capability for heuristic evaluation relative to HCI experts.
What would settle it
A follow-up experiment on several additional web systems or with revised prompts that produces substantially higher overlap with expert-identified issues would indicate the 21.2% figure is not general.
Original abstract
Heuristic evaluation is a widely used method in Human-Computer Interaction (HCI) to inspect interfaces and identify issues based on heuristics. Recently, Large Language Models (LLMs), such as GPT-4o, have been applied in HCI to assist in persona creation, the ideation process, and the analysis of semi-structured interviews. However, considering the need to understand heuristics and the high degree of abstraction required to evaluate them, LLMs may have difficulty conducting heuristic evaluation. Moreover, prior research has not investigated GPT-4o's performance in heuristic evaluation compared to HCI experts in web-based systems. In this context, this study aims to compare the results of a heuristic evaluation performed by GPT-4o and human experts. To this end, we selected a set of screenshots from a web system and asked GPT-4o to perform a heuristic evaluation based on Nielsen's Heuristics using a literature-grounded prompt. Our results indicate that only 21.2% of the issues identified by human experts were also identified by GPT-4o, although it found 27 new issues. We also found that GPT-4o performed better for heuristics related to aesthetic and minimalist design and match between system and real world, whereas it had difficulty identifying issues in heuristics related to flexibility, control, and user efficiency. Additionally, we noticed that GPT-4o generated several false positives due to hallucinations and attempts to predict issues. Finally, we highlight five takeaways for the conscious use of GPT-4o in heuristic evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper compares GPT-4o to human HCI experts in performing heuristic evaluation using Nielsen's heuristics on screenshots from a single web system. Using a literature-grounded prompt, the authors find that GPT-4o identified only 21.2% of the issues found by experts while surfacing 27 additional issues, performed relatively better on heuristics related to aesthetic and minimalist design and match between system and real world, but struggled with flexibility, control, and user efficiency, and produced false positives attributed to hallucinations. The work concludes with five takeaways for the conscious use of GPT-4o in heuristic evaluations.
Significance. If the central comparison holds, the work supplies concrete evidence that current LLMs face difficulties with the abstract reasoning required for heuristic evaluation, particularly on control- and efficiency-related issues. This could usefully inform HCI practice by highlighting risks of over-reliance on LLMs for usability inspection and motivating hybrid human-AI workflows. The direct side-by-side issue identification against experts is a clear empirical contribution, though the narrow stimulus set limits the strength of any general claim about GPT-4o capability.
major comments (2)
- [§3 (Method)] The evaluation uses screenshots from only one unspecified web system with no reported variation, selection criteria, or validation on additional interfaces. This single-stimulus design is load-bearing for the headline claim that GPT-4o cannot evaluate usability like experts, because the observed 21.2% overlap and heuristic-specific patterns could be artifacts of that particular interface rather than intrinsic model limitations.
- [§4 (Results)] The reported 21.2% overlap is presented without the total number of expert-identified issues, the number of human experts, inter-rater reliability statistics, or a description of how semantic overlap between GPT-4o and expert issues was determined. These omissions make the quantitative comparison difficult to interpret and weaken support for the central claim of underperformance.
minor comments (2)
- [§3.1] The full prompt text used with GPT-4o should be included in an appendix or supplementary material to support reproducibility.
- [§4.2] Clarify in the results how genuine new issues were distinguished from false positives, including those attributed to hallucinations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and contextualize our claims.
Point-by-point responses
- Referee: [§3 (Method)] The evaluation uses screenshots from only one unspecified web system with no reported variation, selection criteria, or validation on additional interfaces. This single-stimulus design is load-bearing for the headline claim that GPT-4o cannot evaluate usability like experts, because the observed 21.2% overlap and heuristic-specific patterns could be artifacts of that particular interface rather than intrinsic model limitations.
  Authors: We acknowledge that the single-interface design limits the strength of broad generalizations. The system was chosen as a representative, moderately complex web application to enable focused qualitative comparison of issue types. In revision we will (1) specify the system and selection criteria for the screenshots, (2) add a dedicated limitations paragraph stating that the 21.2% overlap and heuristic patterns are tied to this stimulus set, and (3) explicitly recommend replication on additional interfaces. These changes will prevent overstatement while retaining the value of the side-by-side expert comparison.
  Revision: partial
- Referee: [§4 (Results)] The reported 21.2% overlap is presented without the total number of expert-identified issues, the number of human experts, inter-rater reliability statistics, or a description of how semantic overlap between GPT-4o and expert issues was determined. These omissions make the quantitative comparison difficult to interpret and weaken support for the central claim of underperformance.
  Authors: We agree these details should have been reported explicitly. The revised §4 will state the total number of issues identified by the experts, the number of participating HCI experts, the inter-rater reliability statistic obtained, and the procedure used to judge semantic overlap (independent coding by two researchers followed by consensus resolution of disagreements). These additions will allow readers to evaluate the quantitative comparison directly.
  Revision: yes
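The rebuttal does not name the inter-rater reliability statistic. As one possibility, the sketch below computes Cohen's kappa for two coders who each label a model issue as matching an expert issue or not, then lists the items that would go to consensus resolution. The choice of kappa, the function name `cohens_kappa`, the labels, and the data are assumptions for illustration, not the authors' procedure.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments: does model issue i match some expert issue?
coder_a = ["match", "match", "no_match", "no_match", "match", "no_match"]
coder_b = ["match", "no_match", "no_match", "no_match", "match", "no_match"]

kappa = cohens_kappa(coder_a, coder_b)
# Items where the coders disagree are the ones needing consensus resolution.
disagreements: List[int] = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
print(f"kappa={kappa:.3f}, items for consensus={disagreements}")
```

Reporting agreement before consensus resolution, alongside the final matched count, would let readers judge how stable the overlap figure is.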
Circularity Check
No circularity: direct empirical comparison with no derivations or self-referential reductions
The paper performs a straightforward empirical study by selecting screenshots from one web system, prompting GPT-4o with a literature-grounded prompt based on Nielsen's heuristics, and directly measuring overlap (21.2%) plus new issues against human expert judgments. No equations, fitted parameters, predictions by construction, or load-bearing self-citations exist; the central results are measured outputs from the described procedure rather than reductions to prior inputs. The study is self-contained against external benchmarks of human evaluation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Nielsen's heuristics are a valid and standard framework for usability evaluation.
- domain assumption: Human experts provide a reliable gold standard for identifying usability issues.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our results indicate that only 21.2% of the issues identified by human experts were also identified by GPT-4o, although it found 27 new issues."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.