
arxiv: 2506.16453 · v4 · submitted 2025-06-19 · 💻 cs.SE

Understanding the Challenges and Opportunities of Generative AI Apps: An Empirical Study

Pith reviewed 2026-05-19 08:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords generative AI · mobile apps · user reviews · opportunities and challenges · empirical study · LLM-based analysis · Google Play Store · app development

The pith

Analysis of over a million reviews from generative AI mobile apps reveals three opportunities and three challenges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how end users perceive and evaluate generative AI features in mobile apps through a large-scale analysis of more than one million reviews from 171 apps on the Google Play Store. The authors introduce the SARA framework to process these reviews efficiently with large language models and validate its accuracy through manual checks on thousands of samples. They extract top topics such as AI performance and emotional connection, then perform a qualitative deep dive into hundreds of reviews. This deep dive identifies opportunities, such as using AI for accessibility and wellbeing, alongside challenges, such as managing expectations around AI limitations. The findings give developers concrete directions for improving Gen-AI app design and user experience.

Core claim

Through qualitative analysis of 762 reviews, the study uncovers three opportunities (AI for Accessibility and Wellbeing, AI as a Collaborative Creative Tool, and AI Versatility) and three challenges (Managing User Expectations and AI Limitations, Balancing Content Moderation and Creative Freedom, and Strategic Integration of Gen-AI Features).

What carries the argument

The SARA (Selection, Acquisition, Refinement, and Analysis) framework that applies prompt-based large language models to extract and assign topics from large review datasets, validated at 91 percent accuracy.
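As a rough illustration of what prompt-based topic assignment involves, the sketch below assembles a few-shot prompt from a topic list, a handful of labeled examples, and a batch of reviews. The instruction wording, the `Example` type, and the `build_assignment_prompt` function are all hypothetical; the paper's actual prompt templates are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One manually labeled few-shot example (hypothetical structure)."""
    review: str
    topic: str


def build_assignment_prompt(topics, examples, reviews):
    """Assemble a few-shot topic-assignment prompt.

    The instruction text is an assumption for illustration only,
    not the wording used by the SARA authors.
    """
    lines = ["Assign exactly one topic from the list to each review.", "", "Topics:"]
    lines += [f"- {t}" for t in topics]
    lines += ["", "Examples:"]
    for ex in examples:
        lines += [f"Review: {ex.review}", f"Topic: {ex.topic}", ""]
    lines.append("Reviews to label:")
    # Number the reviews so the model's answers can be matched back to them.
    lines += [f"{i}. {r}" for i, r in enumerate(reviews, start=1)]
    return "\n".join(lines)


prompt = build_assignment_prompt(
    topics=["AI Performance", "Content Quality"],
    examples=[Example("It keeps forgetting what I said.", "AI Performance")],
    reviews=["The images look blurry.", "Answers are slow."],
)
print(prompt)
```

Passing five `Example` objects would correspond to the five-shot setting the paper reports; the resulting string would then be sent to whichever LLM client is in use.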

If this is right

  • Developers can prioritize accessibility and wellbeing features to increase user satisfaction.
  • Clear communication about AI capabilities helps reduce frustration from unmet expectations.
  • Apps must carefully tune content moderation to preserve creative freedom.
  • Strategic and gradual integration of generative AI features prevents user overload.
  • User concerns evolve over time, so ongoing monitoring of reviews can guide updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same review-analysis approach could be applied to other categories of AI-driven consumer tools.
  • Similar patterns may appear when studying user feedback on desktop or web-based generative AI services.
  • Developers in non-mobile domains might adapt the identified opportunities to design more inclusive AI products.

Load-bearing premise

The chosen apps and their reviews from one major app store accurately represent the experiences and perceptions of users with generative AI apps overall.

What would settle it

A follow-up study applying the same method to reviews from a different platform or a fresh collection of generative AI apps that surfaces a substantially different set of opportunities and challenges.

Figures

Figures reproduced from arXiv: 2506.16453 by Buthayna AlMulla, Maram Assi, Safwat Hassan.

Figure 1: Examples of reviews of Gen-AI apps. To gain these insights, we analyze user reviews that offer valuable perspectives on the challenges, expectations, and satisfaction associated with Gen-AI apps.
Figure 2: Overview of our approach. Vol. 1, No. 1, Article . Publication date: September 2025.
Figure 3: Template of the filtering prompt used to filter out non-informative reviews.
Figure 4: An overview of our RQ1 experiment design, where …
Figure 5: Template of the topic extraction prompts (E1: 0-shot, E2: 3-shot, and E3: 5-shot) used to extract the top topics from a large sample of stage 𝑥 and app category 𝑦. "Insert category name" is where we insert the name of the app category 𝑦. "Insert sample of reviews" is where we list our large sample of reviews separated by a new line escape. We manually select and assign topics to five few-shot examples to i…
Figure 6: Template of the P5: topic assignment prompt used to assign topics to a list of reviews. We insert the set of top topics 𝑇(𝑥, 𝑦, 𝑛), five few-shot examples, and the small sample 𝑆_small(𝑥, 𝑦). … the correctness of LLM assignment. An assignment is considered correct if the LLM-assigned topic accurately reflects the content of the review. The Cohen’s Kappa agreement score [16] is computed, yielding a score of 0…
Figure 7: Box plot showing the distribution of the percentage of reviews classified as Gen-AI at the app level, …
Figure 8: Examples of reviews discussing AI Performance. … this overall satisfaction, 16% of reviews mentioned limitations in AI Performance, with the most frequently cited issue being AI understanding (5%), followed by inconsistency in accuracy (4%), reliability (3%), speed (2%), and AI memory (1%). A recurring theme in the less favourable reviews is poor AI understanding, which leads to irrelevant or partially correc…
Figure 9: Examples of reviews discussing Content Quality. … concerns regarding insufficient diversity and inclusion.
Figure 10: Examples of reviews discussing Content Policy & Censorship.
Figure 11: Examples of reviews discussing Features & Functionality. ⇒ Implication 5. Developers should design Gen-AI features, such as voice mode, in a way that complements existing controls rather than replacing them. For instance, it is crucial to ensure new AI-driven tools coexist with familiar functionalities, so that developers can both innovate and maintain intuitive user experiences. 5) Utility & Use Cases. Ut…
Figure 12: Examples of reviews discussing Utility & Use Cases. Insights. In our sample, 64% of reviews emphasize AI’s educational utility, such as helping with homework, essay writing, tutoring, and breaking down complex topics.
Figure 13: Examples of reviews discussing Personalization & Customization. Insights. In our sample, 38% of reviews emphasize the AI’s inability to create a personalized experience.
Figure 14: Examples of reviews discussing Comparisons to Other Apps. ⇒ Implication 8. ChatGPT is a baseline comparator to users, in order to succeed developers must offer value beyond what ChatGPT offers. Developers should also consider creating a toggle button to allow users to opt in or out of AI-enabled features, such as AI search. The regular features may be sufficient for users (e.g., regular search can satisfy …
Figure 15: Examples of reviews discussing Customer Support & Community. Insights. In our sample, we find that 51% of users express enjoyment with specific features.
Figure 16: Examples of reviews discussing Human AI Interaction & Emotional Connection. Insights. We observe that users enjoy their interactions with conversational Gen-AI apps for multiple reasons. First, users enjoy a realistic interaction (i.e., closely similar to human dialogue). This perceived realism fosters a sense of companionship and emotional connection.
Figure 17: Examples of reviews discussing Accessibility & Inclusivity. Insights. Users discuss different accessibility and inclusivity topics including: 1) financial barriers (raised in 31% of the reviews) …
Figure 20 (accompanying text): For the Stable trend, we examine AI Performance, where we had anticipated an upward trend reflecting improvements in system performance. For the Increasing trend, we analyze Content Policy & Censorship to investigate how user perceptions of policy changes have evolved and to uncover the factors contributing to rising satisfaction in this area. Observation 3.2: The Content Quality topic category shows a de…
Figure 18: Evolution of the Percentage of Gen-AI Reviews
Figure 19: Evolution of the Average Ratings of Gen-AI vs Non-Gen-AI Topics
Figure 20: Trend Clusters for Average Ratings of Gen-AI Topics
Figure 21: Trend Clusters for Percentage of Gen-AI Topics
Figure 22: Examples of reviews discussing Content Quality at different time periods. Over time, users become less tolerant of imperfections. The percentage of positive reviews declines from 96% in 2022-H1 to 59% in 2024-H2, indicating a significant shift in user ratings. Moreover, positive reviews become increasingly critical: in 2022-H2, 12% of all reviews offer constructive feedback while still assigning a high ra…
Figure 23: Examples of reviews discussing AI Performance at different time periods. Observation 3.6: Advances in image generation capabilities contribute to the slight increase in the average rating of the AI Performance topic in 2023-H1. In 2022-H1, negative reviews of image generation apps constitute 9% of reviews in our sample and account for 47% of all negative reviews during that period. These users express fru…
Figure 24: Examples of reviews discussing Content Policy at different time periods. Observation 3.11: Content filters improve over time in blocking unwanted adult content (i.e., the raised concerns drop from 12% to 3%) and illegal content (i.e., the raised concerns drop from 4% to 0%). In our early sample, 15% of the studied reviews flag harmful content slipping past moderation, especially involving sexual or predat…
Original abstract

The release of ChatGPT in 2022 triggered a rapid surge in generative artificial intelligence mobile apps (Gen-AI apps). Despite widespread adoption, little is known about how end users perceive and evaluate these Gen-AI functionalities. We conduct a user-centered analysis of 1,035,342 reviews from 171 Gen-AI apps from the Google Play Store. We propose SARA (Selection, Acquisition, Refinement, and Analysis), a four-phase framework that leverages prompt-based LLMs for large-scale review analysis. We validate the reliability of LLM-based topic extraction and assignment using 4,353 manually evaluated reviews, achieving 91% accuracy with five-shot prompting and filtering of non-informative reviews. We identify the top ten topics (e.g., AI Performance and Emotional Connection) and perform a cross-platform comparison with Apple App Store reviews. Through qualitative analysis of 762 reviews, we uncover three opportunities (AI for Accessibility and Wellbeing, AI as a Collaborative Creative Tool, and AI Versatility) and three challenges (Managing User Expectations and AI Limitations, Balancing Content Moderation and Creative Freedom, and Strategic Integration of Gen-AI Features). Finally, we analyze temporal trends, revealing how user concerns shift as users mature. Our findings enable researchers and developers to better leverage the capabilities of Gen-AI apps and address potential challenges.
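To give a sense of how much sampling uncertainty sits around a figure like 91%, the sketch below computes a Wilson score interval for a proportion. It assumes, purely for illustration, that the 91% accuracy was measured over all 4,353 manually evaluated reviews; the abstract does not state how those reviews were split across experiments, so the resulting interval is not a result from the paper. The function name `wilson_interval` is ours.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    p_hat: observed proportion (e.g. accuracy), n: sample size,
    z: normal quantile (1.96 corresponds to roughly 95% coverage).
    """
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Assumption: 91% accuracy over n = 4,353 reviews (illustrative only).
low, high = wilson_interval(0.91, 4353)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

If the accuracy was actually computed on a smaller subset, the interval would be correspondingly wider.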

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of 1,035,342 user reviews from 171 generative AI apps on the Google Play Store. It introduces the SARA framework for LLM-based large-scale review analysis and validates LLM topic extraction and assignment at 91% accuracy on a 4,353-review manually evaluated subset using five-shot prompting. It then identifies the top ten topics and conducts a cross-platform comparison with Apple App Store reviews. A qualitative analysis of 762 reviews yields three opportunities (AI for Accessibility and Wellbeing, AI as a Collaborative Creative Tool, AI Versatility) and three challenges (Managing User Expectations and AI Limitations, Balancing Content Moderation and Creative Freedom, Strategic Integration of Gen-AI Features). Finally, the manuscript examines temporal trends in user concerns.

Significance. If the methodological details are clarified and the qualitative findings are shown to be robust, the work would offer useful empirical insights for software engineering researchers and practitioners on user perceptions of generative AI features in mobile apps, particularly the identified opportunities and challenges that could inform feature design and integration strategies.

major comments (2)
  1. [Qualitative analysis section] The selection criteria and sampling method for the 762 reviews used in the qualitative analysis (e.g., random, stratified by topic/sentiment, or LLM-confidence threshold) are not described. This is load-bearing for the central claim of uncovering the three opportunities and three challenges, as non-random selection could over-represent extreme or high-engagement reviews and bias the resulting themes relative to the filtered 1M-review corpus.
  2. [Methods and validation subsections] Details on app selection criteria for the 171 Gen-AI apps, the exact prompt engineering used in five-shot prompting, and any post-hoc filtering effects after SARA's refinement phase are insufficiently specified. These omissions limit assessment of whether the 91% accuracy on the 4,353-review validation set extends without major bias to the full dataset and whether the sample is representative of generative AI apps overall.
minor comments (2)
  1. [Cross-platform comparison section] Clarify the scale and selection process for the Apple App Store comparison dataset to provide better context for the cross-platform findings.
  2. [Qualitative analysis section] Add explicit inter-rater agreement metrics or saturation criteria for the qualitative coding of the 762 reviews to improve replicability.
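The inter-rater agreement metric requested in the last minor comment is commonly reported as Cohen's kappa, which the paper already cites for its LLM-correctness check. The sketch below computes it from scratch for two raters; the label values and the function name `cohens_kappa` are illustrative and not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical judgments of whether an assigned code fits a review.
rater_1 = ["fits", "fits", "does not fit", "does not fit"]
rater_2 = ["fits", "fits", "does not fit", "fits"]
print(cohens_kappa(rater_1, rater_2))
```

Reporting kappa alongside raw agreement matters because raw agreement can look high simply when one label dominates.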

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will incorporate clarifications to enhance methodological transparency.

Point-by-point responses
  1. Referee: [Qualitative analysis section] The selection criteria and sampling method for the 762 reviews used in the qualitative analysis (e.g., random, stratified by topic/sentiment, or LLM-confidence threshold) are not described. This is load-bearing for the central claim of uncovering the three opportunities and three challenges, as non-random selection could over-represent extreme or high-engagement reviews and bias the resulting themes relative to the filtered 1M-review corpus.

    Authors: We agree that the selection criteria and sampling method for the 762 reviews require explicit description. These reviews were drawn from the post-SARA filtered corpus of informative reviews, with sampling designed to ensure coverage across the top topics while prioritizing reviews with clear user sentiment. We will revise the qualitative analysis section to detail the exact procedure (including any stratification by topic or confidence thresholds) and add a brief discussion of how this approach relates to the broader corpus. This change will be included in the revised manuscript.

    Revision: yes

  2. Referee: [Methods and validation subsections] Details on app selection criteria for the 171 Gen-AI apps, the exact prompt engineering used in five-shot prompting, and any post-hoc filtering effects after SARA's refinement phase are insufficiently specified. These omissions limit assessment of whether the 91% accuracy on the 4,353-review validation set extends without major bias to the full dataset and whether the sample is representative of generative AI apps overall.

    Authors: We acknowledge that additional specification is needed for reproducibility and to support claims about generalizability. We will expand the methods and validation subsections to include the precise app selection criteria (search keywords, minimum review count, and verification steps), the full five-shot prompts employed for topic extraction and assignment, and a description of any post-refinement filtering steps and their impact on the dataset. These additions will allow readers to better evaluate the extension of the 91% validation accuracy to the full 1M-review corpus.

    Revision: yes
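The first response above mentions sampling "designed to ensure coverage across the top topics" without giving a procedure. One simple way such coverage could be achieved is topic-stratified random sampling, sketched below. This is a hypothetical illustration, not the authors' method: the function name, the per-topic quota, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(reviews, per_topic, seed=0):
    """Draw up to `per_topic` reviews at random from each topic.

    reviews: iterable of (text, topic) pairs.
    Returns a dict mapping topic -> list of sampled review texts.
    A fixed seed keeps the draw reproducible.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for text, topic in reviews:
        by_topic[topic].append(text)
    sample = {}
    for topic in sorted(by_topic):  # sorted for a deterministic draw order
        pool = by_topic[topic]
        sample[topic] = rng.sample(pool, min(per_topic, len(pool)))
    return sample


# Toy corpus: ten reviews on one topic, one review on another.
corpus = [(f"review {i}", "AI Performance") for i in range(10)]
corpus.append(("too many ads", "Content Quality"))
picked = stratified_sample(corpus, per_topic=3)
print({topic: len(texts) for topic, texts in picked.items()})
```

Drawing uniformly within each topic avoids the over-representation of extreme or high-engagement reviews that the referee raises, at the cost of no longer mirroring topic frequencies in the full corpus.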

Circularity Check

0 steps flagged

No circularity: purely empirical analysis of external review data

full rationale

The paper conducts a user-centered empirical study of 1,035,342 reviews from 171 Gen-AI apps using the SARA framework for LLM-assisted filtering and topic extraction, followed by qualitative coding of 762 reviews to identify opportunities and challenges. Validation is performed via manual evaluation of a 4,353-review sample achieving 91% accuracy. No mathematical derivations, equations, fitted parameters, predictions, or self-referential loops exist. Themes emerge directly from external user-review data rather than being defined in terms of themselves or imported via self-citation chains. The analysis is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard empirical assumptions about review data quality and LLM analysis validity rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: User reviews posted on app stores accurately and representatively reflect real user experiences with generative AI features.
    This premise is required for the extracted topics and opportunities/challenges to generalize beyond the studied sample.

pith-pipeline@v0.9.0 · 5774 in / 1325 out tokens · 45112 ms · 2026-05-19T08:33:11.940894+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

    cs.SE 2026-05 unverdicted novelty 7.0

    AI coding agents are classified along a Collaborator-Assistant spectrum using an Initiator x Approver taxonomy on 29,585 PR lifecycles, revealing agent initiation in collaborator tools but near-universal human merge g...

  2. Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

    cs.SE 2026-05 unverdicted novelty 6.0

    AI coding tools divide into collaborators that initiate most PRs and assistants that support human-led ones, yet humans retain merge authority across all five tools examined.

  3. From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

    cs.SE 2026-05 unverdicted novelty 5.0

    The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.

  4. A Survey of Context Engineering for Large Language Models

    cs.CL 2025-07 accept novelty 4.0

    The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 3 Pith papers · 3 internal anchors

  1. [1]

    Zarif Bin Akhtar. 2024. Unveiling the evolution of generative AI (GAI): a comprehensive and investigative analysis toward LLM models (2021–2024) and beyond. Journal of Electrical Systems and Information Technology 11, 1 (June 2024), 22. https://doi.org/10.1186/s43067-024-00145-1

  2. [2]

    Hassan Al Wahshat, Waheeb Abu-ulbeh, M Hafiz Yusoff, Muhammad D. Zakaria, Wan Mohd Amir Fazamin Wan Hamzah, and Stenin N P. 2023. The Detection of E-Commerce Manipulated Reviews Using GPT-4. In 2023 International Conference on Computer Science and Emerging Technologies (CSET) . IEEE, Bangalore, India, 1–6. https://doi.org/10.1109/CSET58993.2023.10346848

  3. [3]

    Nimasha Arambepola, Waruni Lalendra Wimalasena, and Lankeshwara Munasinghe. 2025. From Conventional Methods to Large Language Models: A Systematic Review of Techniques in Mobile App Review Analysis. Interdisciplinary Journal of Information, Knowledge, and Management 20 (2025), 016. https://doi.org/10.28945/5491

  4. [4]

    Nimasha Arambepola, Lankeshwara Munasinghe, and Nalin Warnajith. 2024. Factors Influencing Mobile App User Experience: An Analysis of Education App User Reviews. In 2024 4th International Conference on Advanced Research in Computing (ICARC). 223–228. https://doi.org/10.1109/ICARC61713.2024.10499727

  5. [5]

    Ilham Firman Ashari, Eko Dwi Nugroho, Randi Baraku, Ilham Novri Yanda, and Ridho Liwardana. 2023. Analysis of Elbow, Silhouette, Davies-Bouldin, Calinski-Harabasz, and Rand-Index Evaluation on K-Means Algorithm for Classifying Flood-Affected Areas in Jakarta. Journal of Applied Informatics and Computing 7, 1 (July 2023), 89–97. https://doi.org/10.30871/ja...

  6. [6]

    Maram Assi, Safwat Hassan, Yuan Tian, and Ying Zou. 2021. FeatCompare: Feature comparison for competing mobile apps leveraging user reviews. Empirical Software Engineering 26, 5 (Sept. 2021), 94. https://doi.org/10.1007/s10664-021-09988-y

  7. [7]

    Maram Assi, Safwat Hassan, Yuan Tian, and Ying Zou. 2021. FeatCompare: Feature comparison for competing mobile apps leveraging user reviews. Empirical Software Engineering 26, 5 (2021), 94

  8. [8]

    Maram Assi, Safwat Hassan, and Ying Zou. 2025. LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement. ACM Trans. Softw. Eng. Methodol. (June 2025). https://doi.org/10.1145/3744644

  9. [9]

    Maram Assi, Safwat Hassan, and Ying Zou. 2025. Unraveling Code Clone Dynamics in Deep Learning Frameworks. ACM Trans. Softw. Eng. Methodol. (Feb. 2025). https://doi.org/10.1145/3721125 Just Accepted

  10. [10]

    Robert J. Cabin and Randall J. Mitchell. 2000. To Bonferroni or Not to Bonferroni: When and How Are the Questions. Bulletin of the Ecological Society of America 81, 3 (2000), 246–248. http://www.jstor.org/stable/20168454

  11. [11]

    Laura Ceci. 2025. Mobile App Usage - Statistics & Facts. https://www.statista.com/topics/1002/mobile-app-usage/#topicOverview [Online]. Accessed: 2025-03-07

  12. [12]

    Yvonne Chan and Roy P Walmsley. 1997. Learning and Understanding the Kruskal-Wallis One-Way Analysis-of-Variance-by-Ranks Test for Differences Among Three or More Independent Groups. Physical Therapy 77, 12 (Dec. 1997), 1755–1761. https://doi.org/10.1093/ptj/77.12.1755

  13. [13]

    Daihang Chen, Yonghui Liu, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Shuai Wang, Xiao Chen, Tegawendé F. Bissyandé, Jacques Klein, and Li Li. 2025. LLM for Mobile: An Initial Roadmap. ACM Trans. Softw. Eng. Methodol. 34, 5, Article 128 (May 2025), 29 pages. https://doi.org/10.1145/3708528

  14. [14]

    Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, and Boshen Zhang. 2014. AR-miner: mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering. ACM, Hyderabad India, 767–778. https://doi.org/10.1145/2568225.2568263

  15. [15]

    Xiang Chen, Chaoyang Gao, Chunyang Chen, Guangbei Zhang, and Yong Liu. 2025. An Empirical Study on Challenges for LLM Application Developers. ACM Transactions on Software Engineering and Methodology (Jan. 2025), 3715007. https://doi.org/10.1145/3715007

  16. [16]

    J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37–46. https://doi.org/10.1177/001316446002000104

  17. [17]

    Manoranjan Dash and Poon Wei Koot. 2009. Feature Selection for Clustering . Springer US, Boston, MA, 1119–1125. https://doi.org/10.1007/978-0-387-39940-9_613

  18. [18]

    Data Science Horizons. 2023. Mastering Generative AI and Prompt Engineering. https://datasciencehorizons.com/pub/Mastering_Generative_AI_Prompt_Engineering_Data_Science_Horizons_v1.pdf

  19. [19]

    Rushali Deshmukh, Rutuj Raut, Mayur Bhavsar, Sanika Gurav, and Yash Patil. 2025. Optimizing Human-AI Interaction: Innovations in Prompt Engineering. In 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT). IEEE, Bengaluru, India, 1240–1246. https://doi.org/10.1109/IDCIOT64235.2025.10914815

  20. [20]

    Paulo Sérgio Henrique Dos Santos, Alberto Dumont Alves Oliveira, Thais Bonjorni Nobre De Jesus, Wajdi Aljedaani, and Marcelo Medeiros Eler. 2023. Evolution may come with a price: analyzing user reviews to understand the impact of updates on mobile apps accessibility. In Proceedings of the XXII Brazilian Symposium on Human Factors in Computing Systems. ACM,...

  21. [21]

    Rahul Dwivedi and Lavanya Elluri. 2024. Exploring Generative Artificial Intelligence Research: A Bibliometric Analysis Approach. IEEE Access 12 (2024), 119884–119902. https://doi.org/10.1109/ACCESS.2024.3450629

  22. [22]

    Stefan Feuerriegel, Jochen Hartmann, Christian Janiesch, and Patrick Zschech. 2024. Generative AI. Business & Information Systems Engineering 66, 1 (Feb. 2024), 111–126. https://doi.org/10.1007/s12599-023-00834-7

  23. [23]

    Necmiye Genc-Nayebi and Alain Abran. 2017. A systematic literature review: Opinion mining studies from mobile app store user reviews. Journal of Systems and Software 125 (March 2017), 207–219. https://doi.org/10.1016/j.jss.2016.11.027

  24. [24]

    Tanmai Kumar Ghosh, Atharva Pargaonkar, and Nasir U. Eisty. 2024. Exploring Requirements Elicitation from App Store User Reviews Using Large Language Models. (2024). https://doi.org/10.48550/ARXIV.2409.15473

  25. [25]

    Louie Giray. 2023. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Annals of Biomedical Engineering 51, 12 (Dec. 2023), 2629–2633. https://doi.org/10.1007/s10439-023-03272-4

  26. [26]

    Jonathan M. Golding, Anne Lippert, Jeffrey S. Neuschatz, Ilyssa Salomon, and Kelly Burke. 2024. Generative AI and College Students: Use and Perceptions. Teaching of Psychology (Sept. 2024), 00986283241280350. https://doi.org/10.1177/00986283241280350

  27. [27]

    Roberto Gozalo-Brizuela and Eduardo C. Garrido-Merchán. 2023. A survey of Generative AI Applications. https://doi.org/10.48550/arXiv.2306.02781 arXiv:2306.02781 [cs.LG]

  28. [28]

    Jennifer Haase, Djordje Djurica, and Jan Mendling. 2023. The art of inspiring creativity: Exploring the unique impact of AI-generated images. (2023)

  29. [29]

    Muhammad Usman Hadi, Qasem Al Tashi, Rizwan Qureshi, Abbas Shah, Amgad Muneer, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, and Seyedali Mirjalili. 2023. Large Language Models: A ...

  30. [30]

    Safwat Hassan, Heng Li, and Ahmed E. Hassan. 2022. On the Importance of Performing App Analysis Within Peer Groups. In Proceedings of the 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (SANER ’22). 1–12

  31. [31]

    Kimberly Hau, Safwat Hassan, and Shurui Zhou. 2025. LLMs in Mobile Apps: Practices, Challenges, and Opportunities. In 2025 IEEE/ACM 12th International Conference on Mobile Software Engineering and Systems (MOBILESoft) . 3–14. https://doi.org/10.1109/MOBILESoft66462.2025.00008

  32. [32]

    Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X. Wang, and Sadid Hasan. 2024. Does Prompt Formatting Have Any Impact on LLM Performance? arXiv:2411.10541 (Nov. 2024). https://doi.org/10.48550/arXiv.2411.10541 arXiv:2411.10541 [cs]

  33. [33]

    Michael V. Heinz, Daniel M. Mackin, Brianna M. Trudeau, Sukanya Bhattacharya, Yinzhou Wang, Haley A. Banta, Abi D. Jewett, Abigail J. Salzhauer, Tess Z. Griffin, and Nicholas C. Jacobson. 2025. Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI 2, 4 (March 2025). https://doi.org/10.1056/AIoa2400802

  34. [34]

    Brittany Ho, Ta’Rhonda Mayberry, Khanh Linh Nguyen, Manohar Dhulipala, and Vivek Krishnamani Pallipuram

  35. [35]

    ChatReview: A ChatGPT-enabled natural language processing framework to study domain-specific user reviews. Machine Learning with Applications 15 (March 2024), 100522. https://doi.org/10.1016/j.mlwa.2023.100522

  36. [36]

    Abiodun M. Ikotun, Absalom E. Ezugwu, Laith Abualigah, Belal Abuhaija, and Jia Heming. 2023. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences 622 (April 2023), 178–210. https://doi.org/10.1016/j.ins.2022.11.139

  37. [37]

    Aamo Iorliam and Joseph Abunimye Ingio. 2024. A Comparative Analysis of Generative Artificial Intelligence Tools for Natural Language Processing. Journal of Computing Theories and Applications 1, 3 (Feb. 2024), 311–325. https://doi.org/10.62411/jcta.9447

  38. [38]

    Seungwan Jin, Bogoan Kim, and Kyungsik Han. 2025. “I Don’t Know Why I Should Use This App”: Holistic Analysis on User Engagement Challenges in Mobile Mental Health. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. ACM, Yokohama Japan, 1–23. https://doi.org/10.1145/3706598.3713732

  39. [39]

    Charles M. Judd, Gary H. McClelland, and Carey S. Ryan. 2017. Data Analysis: A Model Comparison Approach to Regression, ANOVA, and Beyond (3 ed.). Routledge, New York. https://doi.org/10.4324/9781315744131

  40. [40]

    Aakriti Kheterpal and Kanwarpartap Singh Gill. 2024. Therapeutic Tech: A Comparative Study of AI-Driven Mental Health Interventions. In 2024 4th International Conference on Advancement in Electronics & Communication Engineering (AECE). 1187–1190. https://doi.org/10.1109/AECE62803.2024.10911418

  41. [41]

    Junghwan Kim, Michelle Klopfer, Jacob R. Grohs, Hoda Eldardiry, James Weichert, Larry A. Cox, and Dale Pike. 2025. Examining Faculty and Student Perceptions of Generative AI in University Courses. Innovative Higher Education (Jan. 2025). https://doi.org/10.1007/s10755-024-09774-w

  42. [42]

    Jinhee Kim, Seongryeong Yu, Rita Detrick, and Na Li. 2025. Exploring students’ perspectives on Generative AI-assisted academic writing. Education and Information Technologies 30, 1 (Jan. 2025), 1265–1300. https://doi.org/10.1007/s10639-024-12878-7

  43. [43]

    Anil Koyuncu. 2025. Exploring Fine-Grained Bug Report Categorization with Large Language Models and Prompt Engineering: An Empirical Study. ACM Trans. Softw. Eng. Methodol. (May 2025). https://doi.org/10.1145/3736408 Just Accepted

  44. [44]

    Jannis Kreienkamp, Maximilian Agostini, Rei Monden, Kai Epstude, Peter De Jonge, and Laura F. Bringmann. 2025. A Gentle Introduction and Application of Feature-Based Clustering with Psychological Time Series. Multivariate Behavioral Research 60, 2 (March 2025), 362–392. https://doi.org/10.1080/00273171.2024.2432918

  45. [45]

    Daniel Lee, Matthew Arnold, Amit Srivastava, Katrina Plastow, Peter Strelan, Florian Ploeckl, Dimitra Lekkas, and Edward Palmer. 2024. The impact of generative AI on higher education learning and teaching: A study of educators’ perspectives. Computers and Education: Artificial Intelligence 6 (June 2024), 100221. https://doi.org/10.1016/j.caeai.2024.100221

  46. [46]

    Seung-Cheol Lee, Dong-Gun Lee, and Yeong-Seok Seo. 2024. Determining the best feature combination through text and probabilistic feature analysis for GPT-2-based mobile app review detection. Applied Intelligence 54, 2 (Jan. 2024), 1219–1246. https://doi.org/10.1007/s10489-023-05201-3

  47. [47]

    C H. Raga Madhuri, Jaya Sankar Krishna Bandaru, Medisetti Srinu, and Gangadhari Midhun Anand Vardhan. 2025. AI-Powered Mental Health Screening and Support for Homeless Children. In 2025 AI-Driven Smart Healthcare for Society 5.0. 115–120. https://doi.org/10.1109/IEEECONF64992.2025.10963316

  48. [48]

    H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (March 1947), 50–60. https://doi.org/10.1214/aoms/1177730491 , Vol. 1, No. 1, Article . Publication date: September 2025. 44 Buthayna AlMulla, Maram Assi, and Safwat Hassan

  49. [49]

    Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo, and Joyce Nakatumba-Nabende. 2024. Prompt Engineering in Large Language Models. Springer Nature Singapore, Singapore, 387–402. https://doi.org/10.1007/978-981-99-7962-2_30

  50. [50]

    J. Mingyu. n.d. Google Play Scraper. GitHub repository. https://github.com/JoMingyu/google-play-scraper Accessed: 13-Oct-2024

  51. [51]

    Nadia Nahar, Christian Kästner, Jenna Butler, Chris Parnin, Thomas Zimmermann, and Christian Bird. 2024. Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products. arXiv:2410.12071 (Dec. 2024). https://doi.org/10.48550/arXiv.2410.12071 arXiv:2410.12071 [cs]

  52. [52]

    OpenAI. 2023. GPT-4: Generative Pre-trained Transformer. Online. Available at https://openai.com/research/gpt-4, Accessed: 03-Mar-2025

  53. [53]

    OpenAI. 2024. ChatGPT Edu. https://openai.com/chatgpt/education/. Accessed: 2025-06-13

  54. [54]

    OpenAI. 2024. gpt-4o-mini Model Overview. https://platform.openai.com/docs/models/o4-mini Accessed: 2025-05-24

  55. [55]

    Jonas Oppenlaender, Johanna Silvennoinen, Ville Paananen, and Aku Visuri. 2023. Perceptions and Realities of Text-to-Image Generation. In Proceedings of the 26th International Academic Mindtrek Conference (Tampere, Finland) (Mindtrek ’23). Association for Computing Machinery, New York, NY, USA, 279–288. https://doi.org/10.1145/3616961.3616978

  56. [56]

    Dennis Pagano and Walid Maalej. 2013. User feedback in the appstore: An empirical study. In 2013 21st IEEE International Requirements Engineering Conference (RE). IEEE, Rio de Janeiro-RJ, Brazil, 125–134. https://doi.org/10.1109/RE.2013.6636712

  57. [57]

    Victor Dos Santos Paulino and Sveinn Vidar Gudmundsson. 2024. Do early adopters raise barriers to the commercial take-up of strategic high-technology products? Innovation (Aug. 2024), 1–18. https://doi.org/10.1080/14479338.2024.2386239

  58. [58]

    Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, and Iddo Drori. 2022. Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark. (2022). https://doi.org/10.48550/ARXIV.2211.12112

  60. [60]

    Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. 2024. TopicGPT: A Prompt-based Topic Modeling Framework. arXiv:2311.01449 (April 2024). https://doi.org/10.48550/arXiv.2311.01449 arXiv:2311.01449 [cs]

  61. [61]

    Nirmalendu Prakash, Han Wang, Nguyen Khoi Hoang, Ming Shan Hee, and Roy Ka-Wei Lee. 2023. PromptMTopic: Unsupervised Multimodal Topic Modeling of Memes using Large Language Models. In Proceedings of the 31st ACM International Conference on Multimedia. ACM, Ottawa ON Canada, 621–631. https://doi.org/10.1145/3581783.3613836

  62. [62]

    Inioluwa Deborah Raji, I. Elizabeth Kumar, Aaron Horowitz, and Andrew Selbst. 2022. The Fallacy of AI Functionality. In 2022 ACM Conference on Fairness Accountability and Transparency. ACM, Seoul Republic of Korea, 959–972. https://doi.org/10.1145/3531146.3533158

  63. [63]

    Harish Rathod and Sanjay Agal. 2023. A Study and Overview on Current Trends and Technology in Mobile Applications & Its Development. Lecture Notes in Networks and Systems, Vol. 754. Springer Nature Singapore, 383–395. https://doi.org/10.1007/978-981-99-4932-8_35

  64. [64]

    Shuaicai Ren, Hiroyuki Nakagawa, and Tatsuhiro Tsuchiya. 2024. Combining Prompts with Examples to Enhance LLM-Based Requirement Elicitation. In 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). 1376–1381. https://doi.org/10.1109/COMPSAC61105.2024.00181

  65. [65]

    Ali Rezaei Nasab, Maedeh Dashti, Mojtaba Shahin, Mansooreh Zahedi, Hourieh Khalajzadeh, Chetan Arora, and Peng Liang. 2025. Fairness Concerns in App Reviews: A Study on AI-Based Mobile Apps. ACM Transactions on Software Engineering and Methodology 34, 2 (Feb. 2025), 1–30. https://doi.org/10.1145/3690633

  66. [66]

    Konstantinos I. Roumeliotis, Nikolaos D. Tselikas, and Dimitrios K. Nasiopoulos. 2024. LLMs in e-commerce: A comparative analysis of GPT and LLaMA models in product review evaluation. Natural Language Processing Journal 6 (March 2024), 100056. https://doi.org/10.1016/j.nlp.2024.100056

  67. [67]

    Dharshini S, Samson Arun Raj A, and Venkatesan R. 2025. MindMate: AI-Powered Multilingual Mental Health Chatbot with Personalized Voice and Text Support with Rasa and Streamlit. In 2025 International Conference on Intelligent Computing and Control Systems (ICICCS). 1104–1109. https://doi.org/10.1109/ICICCS65191.2025.10985281

  68. [68]

    Sandeep Singh Sengar, Affan Bin Hasan, Sanjay Kumar, and Fiona Carroll. 2024. Generative artificial intelligence: a systematic review and applications. Multimedia Tools and Applications (Aug. 2024). https://doi.org/10.1007/s11042-024-20016-1

  69. [69]

    Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, and Chengcheng Wan. 2025. Are LLMs Correctly Integrated into Software Systems? In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 1178–1190. https://doi.org/10.1109/ICSE55347.2025.00204

  70. [70]

    Dhruv S. Sharma and Jitendra Patel. 2024. AI and Mental Health: A New Era of Healing. In 2024 2nd DMIHER International Conference on Artificial Intelligence in Healthcare, Education and Industry (IDICAIEI). 1–5. https://doi.org/10.1109/IDICAIEI61867.2024.10842666

  71. [71]

    Aya Shata and Kendall Hartley. 2025. Artificial intelligence and communication technologies in academia: faculty perceptions and the adoption of generative AI. International Journal of Educational Technology in Higher Education 22, 1 (March 2025), 14. https://doi.org/10.1186/s41239-025-00511-7

  72. [72]

    Aakash Sorathiya and Gouri Ginde. 2024. Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews. arXiv:2411.07398 (Nov. 2024). https://doi.org/10.48550/arXiv.2411.07398 arXiv:2411.07398 [cs]

  73. [73]

    Yuying Tang, Ningning Zhang, Mariana Ciancia, and Zhigang Wang. 2024. Exploring the Impact of AI-generated Image Tools on Professional and Non-professional Users in the Art and Design Fields. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing (San Jose, Costa Rica) (CSCW Companion ’24). Association ...

  74. [74]

    Liang Wang, Nan Yang, and Furu Wei. 2023. Learning to Retrieve In-Context Examples for Large Language Models. (2023). https://doi.org/10.48550/ARXIV.2307.07164

  75. [75]

    T. Warren Liao. 2005. Clustering of time series data—a survey. Pattern Recognition 38, 11 (Nov. 2005), 1857–1874. https://doi.org/10.1016/j.patcog.2005.01.025

  76. [76]

    Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, and Gérard Dray. 2023. Zero-shot Bilingual App Reviews Mining with Large Language Models. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, Atlanta, GA, USA, 898–904. https://doi.org/10.1109/ICTAI59109.2023.00135

  77. [77]

    Connie Levina Yuen and Nadja Schlote. 2024. Learner Experiences of Mobile Apps and Artificial Intelligence to Support Additional Language Learning in Education. Journal of Educational Technology Systems 52, 4 (June 2024), 507–525. https://doi.org/10.1177/00472395241238693

  78. [78]

    Ye Zhang, Jinrui Zhang, Sheng Yue, Wei Lu, Ju Ren, and Xuemin Shen. 2024. Mobile Generative AI: Opportunities and Challenges. IEEE Wireless Communications 31, 4 (2024), 58–64. https://doi.org/10.1109/MWC.006.2300576