pith. machine review for the scientific record.

arxiv: 2605.12657 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:20 UTC · model grok-4.3

classification 💻 cs.SE
keywords user reviews · usability requirements · large language models · requirements engineering · prompt engineering · Nielsen usability heuristics · non-functional requirements · app feedback

The pith

Large language models can identify usability requirements in user reviews with F-scores comparable to human raters when the prompt is well designed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

User reviews hold practical feedback on usability but require effort to sift through at scale. The paper tests whether pre-trained LLMs can classify reviews that discuss usability issues without needing task-specific training data. The authors assembled a dataset of 300 reviews from three app categories, labeled them according to Nielsen heuristics, and iterated on a prompt to guide the LLM. They report that the LLM reaches human-level F-scores on the classification task, yet the outcome varies sharply with small changes to the prompt wording. This points to a low-cost workflow that lets development teams turn existing review streams into usability requirements without building new labeled corpora.
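The paper's own pipeline and prompt are not reproduced here; purely as an illustration of the kind of workflow the summary describes, a minimal sketch might look like the following. The function names, the `ask_llm` stub, and the prompt wording are hypothetical placeholders, not the authors' setup.

```python
# Illustrative sketch only -- not the authors' pipeline. `ask_llm` stands in for
# whatever LLM API a team actually uses; the prompt wording is hypothetical.

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text reply."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def classify_review(review: str, guidelines: str) -> bool:
    """Ask the LLM whether a single app review raises a usability issue."""
    prompt = (
        f"{guidelines}\n\n"
        "Decide whether the following app review describes a usability issue "
        "as defined by the guidelines above.\n"
        f'Review: "{review}"\n'
        "Answer with exactly one word: true or false."
    )
    return ask_llm(prompt).strip().lower().startswith("true")

def triage(reviews: list[str], guidelines: str) -> list[tuple[str, bool]]:
    """Binary-label every review so usability-relevant ones can feed the backlog."""
    return [(r, classify_review(r, guidelines)) for r in reviews]
```

The binary true/false coding mirrors the labeling format reported for the human raters; the comparable-to-human claim then reduces to scoring these predictions against the human labels.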

Core claim

LLMs are generally able to recognize usability as a non-functional requirement in user reviews, in terms of their F-score, but performance and reliability are strongly dependent on the prompt. The study supplies a fully coded dataset of 300 reviews labeled by two human raters and an LLM, together with an initial prompt derived from two prompt-engineering iterations and coding guidelines based on the 10 Nielsen Usability Heuristics.

What carries the argument

An iteratively refined prompt, built from Nielsen's 10 Usability Heuristics, that directs the LLM to filter user reviews for usability-relevant content.
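The authors' actual prompt and coding guidelines are released as supplementary material; the sketch below only illustrates what heuristic-grounded guideline text of this kind can look like. The ten heuristic names are Nielsen's; the surrounding instruction framing is hypothetical, not the authors' wording.

```python
# Illustration only: guideline text a heuristic-grounded prompt might embed.
NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

GUIDELINES = (
    "You are screening app reviews for usability-relevant content. Treat a "
    "review as usability-relevant if it reports a problem or praise that maps "
    "onto any of these heuristics:\n"
    + "\n".join(f"{i}. {h}" for i, h in enumerate(NIELSEN_HEURISTICS, start=1))
)
```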

If this is right

  • Development teams can process large volumes of user reviews quickly and at low cost to surface usability requirements.
  • LLMs provide an alternative to training dedicated machine-learning classifiers for requirements classification tasks.
  • The approach supports user-centered requirements elicitation by leveraging existing review data rather than new manual labeling.
  • Prompt refinement becomes a central engineering activity whose outcome directly affects the quality of extracted requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt strategy could be adapted to extract other non-functional requirements such as security or performance concerns from reviews.
  • Testing the prompt on a stream of live app-store reviews would reveal how well it handles new phrasing and emerging issues.
  • Embedding the classification step inside requirements-management tools could reduce the manual triage burden on product teams.

Load-bearing premise

Human raters supply consistent ground-truth labels for usability aspects, and the prompt developed on this dataset will produce reliable results on new reviews and with different LLMs.

What would settle it

Apply the final prompt to a fresh set of reviews independently labeled by new human raters and check whether the resulting F-score stays within the same range as the original human-to-human agreement.
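A minimal way to run that check, sketched in plain Python under the assumption that `llm`, `rater_a`, and `rater_b` are parallel boolean label lists over the fresh reviews (placeholders, not data from the paper):

```python
# Sketch of the settling experiment on hypothetical fresh labels.

def f1(pred: list[bool], gold: list[bool]) -> float:
    """F1 of binary predictions against a gold labeling."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters (Cohen, 1960)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# The claim would hold up if f1(llm, rater_a) and f1(llm, rater_b) stay in the
# same range as the human-to-human score f1(rater_a, rater_b), and if
# cohen_kappa(rater_a, rater_b) is high enough to trust the ground truth.
```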

Figures

Figures reproduced from arXiv: 2605.12657 by Cedric Wellhausen, Kurt Schneider, Laura Reinhardt.

Figure 1. Data collection process for building a fully labeled dataset. The user reviews were labeled by two raters, both with prior coding experience in the context of usability and explainability, which the authors argue means the results were not influenced by rater quality; each review was coded in a binary true/false format. (caption truncated at source)
Figure 2. Data analysis process, with the final dataset and an initial prompt as input to the iterative prompt-evaluation process, which produces a new, optimized prompt that can be fed back into the process for further refinement. The process has two parts: (i) generating and evaluating a prompt based on the enhanced coding guidelines from the data collection, and (ii) evaluating the LLM on a specific… (caption truncated at source)
Figure 3. Distribution of the cross-examination of rater certainty vs. the correctness of the LLM's evaluation. Left: absolute values. Right: relative values, scaled by 239 (cases where the LLM was correct) and 61 (cases where the LLM was incorrect).
read the original abstract

It is known that user-centered approaches to requirements engineering in general lead to a better suited product for the end-users. LLM4RE provides promising approaches to support the requirements elicitation process (e.g. classification of requirements). Previous approaches focus on Machine-Learning (ML) or Deep-Learning (DL) aspects, which require intensive training with a large amount of manually labeled data. LLMs, on the other hand, are pre-trained on large amounts of user-generated text data, enabling a user-centric workflow to analyze requirements. In this paper, we explore the possibility of exploiting the improved natural language understanding of LLMs, rather than strict ML classification, together with the mass extraction of user reviews to analyze if the performance of LLMs in understanding user reviews is comparable to the performance of human raters. This enables a quick and cheap workflow for development teams to gather and process their users' requirements. This paper provides three major contributions: (1) We provide a completely coded dataset of 300 user reviews containing usability-relevant aspects from three different types of apps, that were labeled by two human raters and by an LLM. (2) We build an initial prompt, based on two prompt engineering iterations and specifically developed coding guidelines derived from the 10 Nielsen Usability Heuristics, for LLMs to filter usability relevant user reviews. (3) We determine that LLMs are generally able to recognize usability as a non-functional requirement in user reviews, in terms of their F-score, but the performance and reliability is strongly dependent on the prompt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper explores using large language models (LLMs) to identify usability-related content in user reviews as a precursor to requirements engineering. It contributes a labeled dataset of 300 reviews from three app types (labeled by two human raters and an LLM), develops an initial prompt via two iterations based on Nielsen's 10 usability heuristics, and claims that LLMs can generally recognize usability as a non-functional requirement with F-scores comparable to human raters, though performance and reliability depend strongly on the prompt.

Significance. If the central claims hold after proper validation, the work could support low-cost, scalable extraction of usability requirements from abundant user reviews without needing large manually labeled training sets typical of ML/DL approaches. The provision of a fully coded dataset is a clear strength for reproducibility and community use.

major comments (3)
  1. Abstract: The abstract reports F-score comparisons on 300 reviews but provides no numerical F-score values, no inter-rater agreement statistics, and no details on the prompt iterations or exact LLM used, which prevents verification of the claim that LLMs perform comparably to humans.
  2. Methodology / Labeling Process: The prompt was developed and refined iteratively on the exact same 300 reviews used for evaluation, with no held-out test set, cross-validation, or external validation described; this raises a direct risk of overfitting and weakens the conclusion that performance 'is strongly dependent on the prompt' in a generalizable way.
  3. Labeling Process: Only two human raters are used with no reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement), so the reliability of the ground-truth labels is unknown and the F-score comparison to the LLM rests on an unverified foundation.
minor comments (1)
  1. Abstract: Consider adding the actual F-score numbers and a brief note on the LLM model/version to make the central result immediately assessable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our precursor study. We address each major comment below and will revise the manuscript to improve clarity, transparency, and acknowledgment of limitations where appropriate.

read point-by-point responses
  1. Referee: Abstract: The abstract reports F-score comparisons on 300 reviews but provides no numerical F-score values, no inter-rater agreement statistics, and no details on the prompt iterations or exact LLM used, which prevents verification of the claim that LLMs perform comparably to humans.

    Authors: We agree that the abstract should be more informative. In the revised manuscript, we will include the specific F-score values for the LLM (which were comparable to those of the human raters), the inter-rater agreement statistic, the exact LLM model employed, and a concise description of the two prompt engineering iterations. This will enable readers to directly assess the comparability claim. revision: yes

  2. Referee: Methodology / Labeling Process: The prompt was developed and refined iteratively on the exact same 300 reviews used for evaluation, with no held-out test set, cross-validation, or external validation described; this raises a direct risk of overfitting and weakens the conclusion that performance 'is strongly dependent on the prompt' in a generalizable way.

    Authors: This observation is correct and highlights a limitation inherent to our small-scale precursor study. With only 300 reviews available, iterative prompt refinement was conducted on the full set, which is a common practice in early-stage prompt engineering but does carry overfitting risk. We will revise the methodology and discussion sections to explicitly state this limitation, note the absence of a held-out set, and qualify the generalizability of the prompt-dependence conclusion. We will also stress that the publicly released labeled dataset allows other researchers to perform independent validation on new data. revision: partial

  3. Referee: Labeling Process: Only two human raters are used with no reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement), so the reliability of the ground-truth labels is unknown and the F-score comparison to the LLM rests on an unverified foundation.

    Authors: We accept this criticism. The revised manuscript will include the inter-rater reliability metric (Cohen's kappa and percentage agreement) computed between the two human raters. This addition will provide a clearer basis for interpreting the LLM's F-score performance relative to human labeling. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical F-score comparison on labeled reviews

full rationale

The paper is a precursor empirical study that labels 300 user reviews with two human raters using Nielsen-derived guidelines, develops a prompt through two iterations, and computes F-scores for LLM classification against those human labels. No equations, derivations, fitted parameters, or predictions appear that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claim rests on observable performance metrics rather than any self-referential reduction, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that Nielsen heuristics validly capture usability content in reviews and that LLM text understanding applies directly to this task.

axioms (1)
  • domain assumption Nielsen's 10 Usability Heuristics provide a valid framework for identifying usability-relevant content in user reviews.
    Used to create coding guidelines for both human and LLM labeling.

pith-pipeline@v0.9.0 · 5584 in / 1150 out tokens · 59754 ms · 2026-05-14T20:20:58.447696+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 18 canonical work pages · 3 internal anchors

  1. E. Bakiu, E. Guzman, Which feature is unusable? Detecting usability and user experience issues from user reviews, in: 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), 2017, pp. 182–187. doi:10.1109/REW.2017.76
  2. E. Groen, Crowd-Based Requirements Engineering, Doctoral thesis, Universiteit Utrecht, 2025. doi:10.33540/3091
  3. L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-Navarro, Natural language processing for requirements engineering: A systematic mapping study, ACM Comput. Surv. 54 (2021). doi:10.1145/3444689
  4. M. A. Zadenoori, J. Dąbrowski, W. Alhoshan, L. Zhao, A. Ferrari, Large language models (LLMs) for requirements engineering (RE): A systematic literature review, 2025. arXiv:2509.11446
  5. M. Unterbusch, M. Sadeghi, J. Fischbach, M. Obaidi, A. Vogelsang, Explanation needs in app reviews: Taxonomy and automated detection, 2023, pp. 102–111. doi:10.1109/REW57809.2023.00024
  6. F. Wei, R. Keeling, N. Huber-Fliflet, J. Zhang, A. Dabrowski, J. Yang, Q. Mao, H. Qin, Empirical study of LLM fine-tuning for text classification in legal document review, in: 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 2786–2792. doi:10.1109/BigData59044.2023.10386911
  7. J. Dąbrowski, W. Cai, A. Bennaceur, B. Nuseibeh, F. Alrimawi, Intelligent agents for requirements engineering: Use, feasibility and evaluation, in: 2025 IEEE 33rd International Requirements Engineering Conference (RE), 2025, pp. 535–543. doi:10.1109/RE63999.2025.00064
  8. ISO, ISO 9241-110:2020 Ergonomics of human-system interaction — Part 110: Interaction principles, International Standards Organisation (2024)
  9. N. Bevan, J. Carter, S. Harker, ISO 9241-11 revised: What have we learnt about usability since 1998?, in: M. Kurosu (Ed.), Human-Computer Interaction: Design and Evaluation, Springer International Publishing, Cham, 2015, pp. 143–151
  10. D. Quiñones, C. Rusu, How to develop usability heuristics: A systematic literature review, Computer Standards & Interfaces 53 (2017) 89–122. doi:10.1016/j.csi.2017.03.009
  11. J. Nielsen, 10 usability heuristics for user interface design, 1994. URL: https://www.nngroup.com/articles/ten-usability-heuristics/, last accessed: 01/12/2026
  12. J. Nielsen, Enhancing the explanatory power of usability heuristics, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1994, pp. 152–158
  13. J. Nielsen, Enhancing the explanatory power of usability heuristics, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '94, Association for Computing Machinery, New York, NY, USA, 1994, pp. 152–158. doi:10.1145/191666.191729
  14. J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, 2023. arXiv:2302.11382
  15. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Comput. Surv. 55 (2023). doi:10.1145/3560815
  16. P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering in large language models: Techniques and applications, 2025. arXiv:2402.07927
  17. The prompt report: A systematic survey of prompt engineering techniques, 2025. arXiv:2406.06608
  18. J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837
  19. J. Nielsen, Usability Engineering, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994
  20. S. Hedegaard, J. G. Simonsen, Extracting usability and user experience information from online user reviews, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2089–2098. doi:10.1145/2470654.2481286
  21. A. Forward, T. C. Lethbridge, A taxonomy of software types to facilitate search and evidence-based software engineering, in: Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, Association for Computing Machinery, New York, NY, USA, 2008
  22. C. Wellhausen, Supplementary material to user reviews as a source for usability requirements, 2026. URL: https://figshare.com/collections/Supplementary_Material_to_User_Reviews_as_a_Source_for_Usability_Requirements/8256262/2. doi:10.6084/m9.figshare.c.8256262.v2
  23. J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46. doi:10.1177/001316446002000104
  24. J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1977) 159–174. doi:10.2307/2529310
  25. C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in software engineering, volume 236, Springer, 2012
  26. OpenAI, Introducing GPT-4.1 in the API, 2025. URL: https://openai.com/index/gpt-4-1/, last accessed: 02/17/2026