
arxiv: 2411.09916 · v3 · submitted 2024-11-15 · 💻 cs.SE

"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering

Pith reviewed 2026-05-23 17:28 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM · ChatGPT · software engineering · abandonment · failure modes · human-AI interaction · web development

The pith

Unhelpful LLM responses made users 11 times more likely to abandon ChatGPT in a web development task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how 26 participants used ChatGPT on a complex web development assignment and catalogs the specific ways the model failed them. Nine failure types emerged in three groups: responses that were wrong or incomplete, demands that overloaded the user's thinking, and breakdowns where the model lost track of prior context. Seventeen participants stopped using the tool entirely. Statistical modeling showed that an unhelpful answer multiplied the odds of quitting by 11 while each extra prompt lowered those odds by 17 percent.
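The difference between odds and probability matters when reading these two figures. The following is a minimal sketch, assuming that both the factor of 11 and the 17 percent reduction are odds ratios, as the paragraph above phrases them. The baseline probabilities are purely hypothetical and are not values from the paper.

```python
def odds(p: float) -> float:
    """Convert a probability in (0, 1) to odds."""
    return p / (1.0 - p)


def prob(o: float) -> float:
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Probability obtained after scaling the baseline odds by an odds ratio."""
    return prob(odds(baseline_prob) * odds_ratio)


def odds_multiplier_after(n_prompts: int, per_prompt_ratio: float = 0.83) -> float:
    """Cumulative odds multiplier after n_prompts, assuming a constant per-prompt ratio."""
    return per_prompt_ratio ** n_prompts


OR_UNHELPFUL = 11.0  # factor reported in the summary above

# Hypothetical baseline abandonment probabilities, for illustration only.
for baseline in (0.05, 0.20, 0.50):
    after = apply_odds_ratio(baseline, OR_UNHELPFUL)
    print(f"baseline {baseline:.2f} -> {after:.2f} "
          f"(probability ratio {after / baseline:.1f}x)")

# Compounded effect of several prompts on the odds, under the same assumption.
for k in (1, 5, 10):
    print(f"{k} prompts -> odds multiplied by {odds_multiplier_after(k):.2f}")
```

Under these assumptions, a hypothetical 20 percent baseline rises to about 73 percent, which is roughly 3.7 times the baseline probability rather than 11 times. An odds ratio of 11 only approximates "11 times more likely" when the baseline probability is small.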

Core claim

In a controlled observation of 26 participants building a web application, nine recurring failure modes in ChatGPT responses drove 17 of them to abandon the tool; unhelpful replies raised abandonment likelihood elevenfold while each successive prompt reduced it by 17 percent.

What carries the argument

Nine failure types grouped into incorrect or incomplete responses, cognitive overload, and context loss, tracked against participant abandonment decisions and modeled with logistic regression on response helpfulness and prompt count.
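One way to write such a model down is sketched below. The symbols and variable names are assumptions introduced here for illustration; the paper's own model equation is not shown on this page. The coefficient values are simply the natural logarithms of the two reported effects, read as odds ratios.

```latex
\[
\operatorname{logit}\,\Pr(\text{abandon}_i = 1)
  = \beta_0 + \beta_1\,\text{unhelpful}_i + \beta_2\,\text{prompts}_i
\]
\[
e^{\beta_1} \approx 11 \;\Rightarrow\; \beta_1 \approx \ln 11 \approx 2.40,
\qquad
e^{\beta_2} \approx 0.83 \;\Rightarrow\; \beta_2 \approx \ln 0.83 \approx -0.19
\]
```

Here $i$ indexes participants, $\text{unhelpful}_i$ is a binary indicator, and $\text{prompts}_i$ is a count. The intercept $\beta_0$ is left unspecified because no value for it is reported.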

If this is right

  • Scaffolding, prompt clarification, and debugging steps can partially offset the identified failures.
  • Persistent unhelpful replies dominate the decision to stop using the model.
  • Additional prompts provide a modest protective effect against abandonment.
  • Tooling that reduces incorrect answers and context loss would directly address the main drivers of quitting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High abandonment suggests net productivity claims for LLMs in complex SE work may need to account for time lost to failed attempts.
  • The same response-quality and persistence dynamics could appear in non-SE iterative tasks such as data analysis or report writing.
  • Interfaces that surface likely failure modes in advance might reduce the observed quit rate.

Load-bearing premise

That the failure patterns and abandonment rates observed in this single web development task will appear in other software engineering tasks or with different LLMs.

What would settle it

A replication with a different SE task or LLM in which the measured effect of unhelpful responses on abandonment drops below statistical significance.

Figures

Figures reproduced from arXiv: 2411.09916 by Bingsheng Yao, Dakuo Wang, Hongbo Fang, Jiessie Tie, Shurui Zhou, Syed Ishtiaque Ahmed, Tianshi Li.

Figure 1. Task Breakdown: A) Insert profile picture; B) Link email; C) Create division with headings/widgets: C1. Add visualization; D) Create division with headings/widgets: D1. Insert table; E) Insert footer division; F) Format side-by-side division: F1. Insert headings/subtitles; G) Insert division with form/buttons; H) Implement pop-up on button click; I) Implement form with local file saves and alert on submiss…
Figure 2. Workflow of user interaction with ChatGPT: focus on
Figure 3. Distribution of Response Lengths for Successful vs.
Figure 4. Relationship between causes, failures, and mitigations.
Figure 5. Participant’s ratings for (a) ChatGPT’s for helpfulness;
Figure 6. Distribution of participants across different categories
Original abstract

Software engineers are increasingly incorporating AI assistants into their workflows to enhance productivity and alleviate cognitive load. However, experiences with large language models (LLMs) such as ChatGPT vary widely. While some engineers find them useful, others deem them counterproductive due to inaccuracies in their responses. Researchers have also observed that ChatGPT often provides incorrect information. Given these limitations, it is crucial to determine how to effectively integrate LLMs into software engineering (SE) workflow. Analyzing data from 26 participants in a complex web development task, we identified nine failure types categorized into incorrect or incomplete responses, cognitive overload, and context loss. Users attempted to mitigate these issues through scaffolding, prompt clarification, and debugging. However, 17 participants ultimately chose to abandon ChatGPT due to persistent failures. Our quantitative analysis revealed that unhelpful responses increased the likelihood of abandonment by a factor of 11, while each additional prompt reduced abandonment probability by 17%. This study advances the understanding of human-AI interaction in SE tasks and outlines directions for future research and tooling support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical study with 26 participants completing a complex web development task using ChatGPT. It categorizes nine LLM failure types into incorrect/incomplete responses, cognitive overload, and context loss; describes user mitigation strategies (scaffolding, clarification, debugging); notes that 17 participants abandoned the tool; and presents quantitative results claiming that unhelpful responses raise abandonment odds by a factor of 11 while each additional prompt lowers abandonment probability by 17%.

Significance. If the quantitative claims prove robust, the work supplies concrete evidence on when developers abandon LLMs during SE tasks and identifies actionable failure modes, contributing to human-AI collaboration research in software engineering. The mixed-methods design and direct observation of interaction sequences are strengths that could support future tooling recommendations.

major comments (2)
  1. [§5.2] §5.2 (Quantitative Analysis): The reported odds ratio of 11 for unhelpful responses and 17% per-prompt reduction are obtained from a regression on 26 binary abandonment outcomes. No information is supplied on model family, covariate specification, handling of repeated measures within participants, or validation (bootstrap, cross-validation, or sensitivity to outcome coding). With only 17 events the point estimates are likely to be unstable under modest specification changes.
  2. [§3] §3 (Methods): The coding scheme for the nine failure types, the operational definition of 'unhelpful response,' and the criteria for classifying abandonment are not described in sufficient detail to permit replication or to evaluate potential coder bias or measurement error.
minor comments (2)
  1. [Abstract] The abstract states the 11× and 17% figures without any accompanying confidence intervals, p-values, or sample-size caveats.
  2. [§5.2] A table or figure presenting the regression coefficients, standard errors, and model fit statistics is missing from the quantitative results.
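The instability concern in major comment 1 can be made concrete with an events-per-variable (EPV) check. This is a minimal sketch, assuming two predictors (an unhelpful-response indicator and a prompt count) and one outcome per participant. The threshold of 10 is a commonly cited rule of thumb and does not come from the paper.

```python
def events_per_variable(n_total: int, n_events: int, n_predictors: int) -> float:
    """Events per variable for a binary-outcome regression.

    The limiting sample size is the smaller of the event and non-event counts.
    """
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    if not 0 <= n_events <= n_total:
        raise ValueError("n_events must lie between 0 and n_total")
    limiting = min(n_events, n_total - n_events)
    return limiting / n_predictors


# 26 participants and 17 abandonments, as reported; 2 predictors is an assumption.
epv = events_per_variable(n_total=26, n_events=17, n_predictors=2)
RULE_OF_THUMB = 10  # heuristic threshold, not a value from the paper
print(f"EPV = {epv:.1f}; below rule of thumb: {epv < RULE_OF_THUMB}")
```

With 17 abandonments and 9 non-abandonments, the limiting count is 9, so two predictors give an EPV of 4.5. That is consistent with the referee's expectation that the point estimates may shift under modest specification changes.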

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional methodological transparency will strengthen the paper. We address each point below and will revise the manuscript accordingly.

  1. Referee: [§5.2] §5.2 (Quantitative Analysis): The reported odds ratio of 11 for unhelpful responses and 17% per-prompt reduction are obtained from a regression on 26 binary abandonment outcomes. No information is supplied on model family, covariate specification, handling of repeated measures within participants, or validation (bootstrap, cross-validation, or sensitivity to outcome coding). With only 17 events the point estimates are likely to be unstable under modest specification changes.

    Authors: We agree that §5.2 currently omits key statistical details. The analysis consisted of a logistic regression with abandonment (binary) as the outcome, a binary indicator for unhelpful responses, and prompt count as a continuous predictor; the model was fit in R using glm(). No random effects or repeated-measures adjustment was applied because each participant contributed a single abandonment decision. We will expand the section to report the full model equation, software, coefficient table with confidence intervals, and an explicit statement that the results are exploratory given the event count. We will also add a sensitivity note acknowledging that small-sample logistic regression estimates can be unstable. revision: yes

  2. Referee: [§3] §3 (Methods): The coding scheme for the nine failure types, the operational definition of 'unhelpful response,' and the criteria for classifying abandonment are not described in sufficient detail to permit replication or to evaluate potential coder bias or measurement error.

    Authors: We concur that the current description of the qualitative coding process is insufficient. The nine failure categories emerged from inductive thematic analysis performed independently by two researchers on the full interaction transcripts; disagreements were resolved through discussion and a final codebook was applied. 'Unhelpful response' was operationalized as any LLM output that introduced factual errors, omitted required functionality, or required the participant to restart a sub-task. Abandonment was coded when a participant explicitly ceased prompting and completed the remainder of the task without the LLM. We will revise §3 to include the complete codebook with examples, inter-rater reliability statistics, and the exact decision rules used for each construct. revision: yes
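The rebuttal promises inter-rater reliability statistics without naming one. A minimal sketch of one common choice for two coders, Cohen's kappa, follows. The choice of statistic, the label names, and the example codings are all assumptions made for illustration and are not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codings of six responses into three made-up category labels.
coder_a = ["incorrect", "incorrect", "overload", "context", "context", "incorrect"]
coder_b = ["incorrect", "overload", "overload", "context", "context", "incorrect"]
print(f"kappa = {cohens_kappa(coder_a, coder_b):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which makes it more informative than a simple percentage when some failure categories are much more frequent than others.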

Circularity Check

0 steps flagged

No circularity; empirical claims rest directly on observed study data


The paper reports an empirical user study with 26 participants on a web development task. Failure categories, abandonment decisions (17/26), and quantitative effects (unhelpful responses multiply abandonment odds by 11; each extra prompt lowers probability by 17%) are presented as outcomes of direct observation, coding of interactions, and regression-style analysis on the collected binary outcomes. No equations, self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central results are therefore self-contained against the external benchmark of the participant data rather than reducing to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions in user studies about task representativeness and participant behavior reflecting real use.

axioms (1)
  • domain assumption: The web development task is representative of typical SE workflows.
    The study uses one complex task to draw conclusions about SE in general.

pith-pipeline@v0.9.0 · 5732 in / 1024 out tokens · 24824 ms · 2026-05-23T17:28:00.250827+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TDD Governance for Multi-Agent Code Generation via Prompt Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code g...
