pith. machine review for the scientific record.

arxiv: 2605.12657 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:20 UTC · model grok-4.3

classification 💻 cs.SE
keywords user reviews · usability requirements · large language models · requirements engineering · prompt engineering · Nielsen usability heuristics · non-functional requirements · app feedback

The pith

Large language models can identify usability requirements in user reviews with F-scores comparable to human raters when the prompt is well designed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

User reviews hold practical feedback on usability but require effort to sift through at scale. The paper tests whether pre-trained LLMs can classify reviews that discuss usability issues without needing task-specific training data. The authors assembled a dataset of 300 reviews from three app categories, labeled them according to Nielsen heuristics, and iterated on a prompt to guide the LLM. They report that the LLM reaches human-level F-scores on the classification task, yet the outcome varies sharply with small changes to the prompt wording. This points to a low-cost workflow that lets development teams turn existing review streams into usability requirements without building new labeled corpora.
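The paper's own pipeline and prompt are not reproduced here; purely as an illustration of the kind of workflow the summary describes, a minimal sketch might look like the following. The function names, the `ask_llm` stub, and the prompt wording are hypothetical placeholders, not the authors' setup.

```python
# Illustrative sketch only -- not the authors' pipeline. `ask_llm` stands in for
# whatever LLM API a team actually uses; the prompt wording is hypothetical.

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text reply."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def classify_review(review: str, guidelines: str) -> bool:
    """Ask the LLM whether a single app review raises a usability issue."""
    prompt = (
        f"{guidelines}\n\n"
        "Decide whether the following app review describes a usability issue "
        "as defined by the guidelines above.\n"
        f'Review: "{review}"\n'
        "Answer with exactly one word: true or false."
    )
    return ask_llm(prompt).strip().lower().startswith("true")

def triage(reviews: list[str], guidelines: str) -> list[tuple[str, bool]]:
    """Binary-label every review so usability-relevant ones can feed the backlog."""
    return [(r, classify_review(r, guidelines)) for r in reviews]
```

The binary true/false coding mirrors the labeling format reported for the human raters; the comparable-to-human claim then reduces to scoring these predictions against the human labels.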

Core claim

LLMs are generally able to recognize usability as a non-functional requirement in user reviews, in terms of their F-score, but performance and reliability are strongly dependent on the prompt. The study supplies a fully coded dataset of 300 reviews labeled by two human raters and an LLM, together with an initial prompt derived from two prompt-engineering iterations and coding guidelines based on the 10 Nielsen Usability Heuristics.

What carries the argument

An iteratively refined prompt, built from Nielsen's 10 Usability Heuristics, that directs the LLM to filter user reviews for usability-relevant content.
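The authors' actual prompt and coding guidelines are released as supplementary material; the sketch below only illustrates what heuristic-grounded guideline text of this kind can look like. The ten heuristic names are Nielsen's; the surrounding instruction framing is hypothetical, not the authors' wording.

```python
# Illustration only: guideline text a heuristic-grounded prompt might embed.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

GUIDELINES = (
    "You are screening app reviews for usability-relevant content. Treat a "
    "review as usability-relevant if it reports a problem or praise that maps "
    "onto any of these heuristics:\n"
    + "\n".join(f"{i}. {h}" for i, h in enumerate(NIELSEN_HEURISTICS, start=1))
)
```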

If this is right

  • Development teams can process large volumes of user reviews quickly and at low cost to surface usability requirements.
  • LLMs provide an alternative to training dedicated machine-learning classifiers for requirements classification tasks.
  • The approach supports user-centered requirements elicitation by leveraging existing review data rather than new manual labeling.
  • Prompt refinement becomes a central engineering activity whose outcome directly affects the quality of extracted requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt strategy could be adapted to extract other non-functional requirements such as security or performance concerns from reviews.
  • Testing the prompt on a stream of live app-store reviews would reveal how well it handles new phrasing and emerging issues.
  • Embedding the classification step inside requirements-management tools could reduce the manual triage burden on product teams.

Load-bearing premise

Human raters supply consistent ground-truth labels for usability aspects, and the prompt developed on this dataset will produce reliable results on new reviews and with different LLMs.

What would settle it

Apply the final prompt to a fresh set of reviews independently labeled by new human raters and check whether the resulting F-score stays within the same range as the original human-to-human agreement.
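A minimal way to run that check, sketched in plain Python under the assumption that `llm`, `rater_a`, and `rater_b` are parallel boolean label lists over the fresh reviews (placeholders, not data from the paper):

```python
# Sketch of the settling experiment on hypothetical fresh labels.

def f1(pred: list[bool], gold: list[bool]) -> float:
    """F1 of binary predictions against a gold labeling."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters (Cohen, 1960)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# The claim would hold up if f1(llm, rater_a) and f1(llm, rater_b) stay in the
# same range as the human-to-human score f1(rater_a, rater_b), and if
# cohen_kappa(rater_a, rater_b) is high enough to trust the ground truth.
```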

Figures

Figures reproduced from arXiv: 2605.12657 by Cedric Wellhausen, Kurt Schneider, Laura Reinhardt.

Figure 1. Data collection process for building a fully labeled dataset. The user reviews were labeled by two raters, both with prior coding experience in the context of usability and explainability, which the authors argue means the results were not influenced by rater quality; each review was coded in a binary true/false format. (caption truncated at source)
Figure 2. Data analysis process, with the final dataset and an initial prompt as input to the iterative prompt-evaluation process, which produces a new, optimized prompt that can be fed back into the process for further refinement. The process has two parts: (i) generating and evaluating a prompt based on the enhanced coding guidelines from the data collection, and (ii) evaluating the LLM on a specific… (caption truncated at source)
Figure 3. Distribution of the cross-examination of rater certainty vs. the correctness of the LLM's evaluation. Left: absolute values. Right: relative values, scaled by 239 (cases where the LLM was correct) and 61 (cases where the LLM was incorrect).
read the original abstract

It is known that user-centered approaches to requirements engineering in general lead to a better suited product for the end-users. LLM4RE provides promising approaches to support the requirements elicitation process (e.g. classification of requirements). Previous approaches focus on Machine-Learning (ML) or Deep-Learning (DL) aspects, which require intensive training with a large amount of manually labeled data. LLMs, on the other hand, are pre-trained on large amounts of user-generated text data, enabling a user-centric workflow to analyze requirements. In this paper, we explore the possibility of exploiting the improved natural language understanding of LLMs, rather than strict ML classification, together with the mass extraction of user reviews to analyze if the performance of LLMs in understanding user reviews is comparable to the performance of human raters. This enables a quick and cheap workflow for development teams to gather and process their users' requirements. This paper provides three major contributions: (1) We provide a completely coded dataset of 300 user reviews containing usability-relevant aspects from three different types of apps, that were labeled by two human raters and by an LLM. (2) We build an initial prompt, based on two prompt engineering iterations and specifically developed coding guidelines derived from the 10 Nielsen Usability Heuristics, for LLMs to filter usability relevant user reviews. (3) We determine that LLMs are generally able to recognize usability as a non-functional requirement in user reviews, in terms of their F-score, but the performance and reliability is strongly dependent on the prompt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper explores using large language models (LLMs) to identify usability-related content in user reviews as a precursor to requirements engineering. It contributes a labeled dataset of 300 reviews from three app types (labeled by two human raters and an LLM), develops an initial prompt via two iterations based on Nielsen's 10 usability heuristics, and claims that LLMs can generally recognize usability as a non-functional requirement with F-scores comparable to human raters, though performance and reliability depend strongly on the prompt.

Significance. If the central claims hold after proper validation, the work could support low-cost, scalable extraction of usability requirements from abundant user reviews without needing large manually labeled training sets typical of ML/DL approaches. The provision of a fully coded dataset is a clear strength for reproducibility and community use.

major comments (3)
  1. Abstract: The abstract reports F-score comparisons on 300 reviews but provides no numerical F-score values, no inter-rater agreement statistics, and no details on the prompt iterations or exact LLM used, which prevents verification of the claim that LLMs perform comparably to humans.
  2. Methodology / Labeling Process: The prompt was developed and refined iteratively on the exact same 300 reviews used for evaluation, with no held-out test set, cross-validation, or external validation described; this raises a direct risk of overfitting and weakens the conclusion that performance 'is strongly dependent on the prompt' in a generalizable way.
  3. Labeling Process: Only two human raters are used with no reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement), so the reliability of the ground-truth labels is unknown and the F-score comparison to the LLM rests on an unverified foundation.
minor comments (1)
  1. Abstract: Consider adding the actual F-score numbers and a brief note on the LLM model/version to make the central result immediately assessable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our precursor study. We address each major comment below and will revise the manuscript to improve clarity, transparency, and acknowledgment of limitations where appropriate.

read point-by-point responses
  1. Referee: Abstract: The abstract reports F-score comparisons on 300 reviews but provides no numerical F-score values, no inter-rater agreement statistics, and no details on the prompt iterations or exact LLM used, which prevents verification of the claim that LLMs perform comparably to humans.

    Authors: We agree that the abstract should be more informative. In the revised manuscript, we will include the specific F-score values for the LLM (which were comparable to those of the human raters), the inter-rater agreement statistic, the exact LLM model employed, and a concise description of the two prompt engineering iterations. This will enable readers to directly assess the comparability claim. revision: yes

  2. Referee: Methodology / Labeling Process: The prompt was developed and refined iteratively on the exact same 300 reviews used for evaluation, with no held-out test set, cross-validation, or external validation described; this raises a direct risk of overfitting and weakens the conclusion that performance 'is strongly dependent on the prompt' in a generalizable way.

    Authors: This observation is correct and highlights a limitation inherent to our small-scale precursor study. With only 300 reviews available, iterative prompt refinement was conducted on the full set, which is a common practice in early-stage prompt engineering but does carry overfitting risk. We will revise the methodology and discussion sections to explicitly state this limitation, note the absence of a held-out set, and qualify the generalizability of the prompt-dependence conclusion. We will also stress that the publicly released labeled dataset allows other researchers to perform independent validation on new data. revision: partial

  3. Referee: Labeling Process: Only two human raters are used with no reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement), so the reliability of the ground-truth labels is unknown and the F-score comparison to the LLM rests on an unverified foundation.

    Authors: We accept this criticism. The revised manuscript will include the inter-rater reliability metric (Cohen's kappa and percentage agreement) computed between the two human raters. This addition will provide a clearer basis for interpreting the LLM's F-score performance relative to human labeling. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical F-score comparison on labeled reviews

full rationale

The paper is a precursor empirical study that labels 300 user reviews with two human raters using Nielsen-derived guidelines, develops a prompt through two iterations, and computes F-scores for LLM classification against those human labels. No equations, derivations, fitted parameters, or predictions appear that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claim rests on observable performance metrics rather than any self-referential reduction, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that Nielsen heuristics validly capture usability content in reviews and that LLM text understanding applies directly to this task.

axioms (1)
  • domain assumption Nielsen's 10 Usability Heuristics provide a valid framework for identifying usability-relevant content in user reviews.
    Used to create coding guidelines for both human and LLM labeling.

pith-pipeline@v0.9.0 · 5584 in / 1150 out tokens · 59754 ms · 2026-05-14T20:20:58.447696+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 18 canonical work pages · 3 internal anchors

  1. E. Bakiu, E. Guzman, Which feature is unusable? Detecting usability and user experience issues from user reviews, in: 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), 2017, pp. 182–187. doi:10.1109/REW.2017.76
  2. E. Groen, Crowd-Based Requirements Engineering, Doctoral thesis, Universiteit Utrecht, 2025. doi:10.33540/3091
  3. L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-Navarro, Natural language processing for requirements engineering: A systematic mapping study, ACM Comput. Surv. 54 (2021). doi:10.1145/3444689
  4. M. A. Zadenoori, J. Dąbrowski, W. Alhoshan, L. Zhao, A. Ferrari, Large language models (LLMs) for requirements engineering (RE): A systematic literature review, 2025. arXiv:2509.11446
  5. M. Unterbusch, M. Sadeghi, J. Fischbach, M. Obaidi, A. Vogelsang, Explanation needs in app reviews: Taxonomy and automated detection, 2023, pp. 102–111. doi:10.1109/REW57809.2023.00024
  6. F. Wei, R. Keeling, N. Huber-Fliflet, J. Zhang, A. Dabrowski, J. Yang, Q. Mao, H. Qin, Empirical study of LLM fine-tuning for text classification in legal document review, in: 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 2786–2792. doi:10.1109/BigData59044.2023.10386911
  7. J. Dąbrowski, W. Cai, A. Bennaceur, B. Nuseibeh, F. Alrimawi, Intelligent agents for requirements engineering: Use, feasibility and evaluation, in: 2025 IEEE 33rd International Requirements Engineering Conference (RE), 2025, pp. 535–543. doi:10.1109/RE63999.2025.00064
  8. ISO, ISO 9241-110:2020 Ergonomics of human-system interaction — Part 110: Interaction principles, International Standards Organisation (2024)
  9. N. Bevan, J. Carter, S. Harker, ISO 9241-11 revised: What have we learnt about usability since 1998?, in: M. Kurosu (Ed.), Human-Computer Interaction: Design and Evaluation, Springer International Publishing, Cham, 2015, pp. 143–151
  10. D. Quiñones, C. Rusu, How to develop usability heuristics: A systematic literature review, Computer Standards & Interfaces 53 (2017) 89–122. doi:10.1016/j.csi.2017.03.009
  11. J. Nielsen, 10 usability heuristics for user interface design, 1994. URL: https://www.nngroup.com/articles/ten-usability-heuristics/, last accessed: 01/12/2026
  12. J. Nielsen, Enhancing the explanatory power of usability heuristics, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1994, pp. 152–158
  13. J. Nielsen, Enhancing the explanatory power of usability heuristics, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '94, Association for Computing Machinery, New York, NY, USA, 1994, pp. 152–158. doi:10.1145/191666.191729
  14. J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, 2023. arXiv:2302.11382
  15. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Comput. Surv. 55 (2023). doi:10.1145/3560815
  16. P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering in large language models: Techniques and applications, 2025. arXiv:2402.07927
  17. The prompt report: A systematic survey of prompt engineering techniques, 2025. arXiv:2406.06608
  18. J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837
  19. J. Nielsen, Usability Engineering, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994
  20. S. Hedegaard, J. G. Simonsen, Extracting usability and user experience information from online user reviews, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2089–2098. doi:10.1145/2470654.2481286
  21. A. Forward, T. C. Lethbridge, A taxonomy of software types to facilitate search and evidence-based software engineering, in: Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, Association for Computing Machinery, New York, NY, USA, 2008
  22. C. Wellhausen, Supplementary material to user reviews as a source for usability requirements, 2026. URL: https://figshare.com/collections/Supplementary_Material_to_User_Reviews_as_a_Source_for_Usability_Requirements/8256262/2. doi:10.6084/m9.figshare.c.8256262.v2
  23. J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46. doi:10.1177/001316446002000104
  24. J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1977) 159–174. doi:10.2307/2529310
  25. C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in software engineering, volume 236, Springer, 2012
  26. OpenAI, Introducing GPT-4.1 in the API, 2025. URL: https://openai.com/index/gpt-4-1/, last accessed: 02/17/2026