"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering
Pith reviewed 2026-05-23 17:28 UTC · model grok-4.3
The pith
Unhelpful LLM responses made users 11 times more likely to abandon ChatGPT in a web development task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a controlled observation of 26 engineers building a web application, nine recurring failure modes in ChatGPT responses drove 17 participants to abandon the tool; unhelpful replies raised abandonment likelihood elevenfold while each successive prompt reduced it by 17 percent.
What carries the argument
Nine failure types grouped into incorrect or incomplete responses, cognitive overload, and context loss, tracked against participant abandonment decisions and modeled with logistic regression on response helpfulness and prompt count.
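One way to write down the model this summary describes, offered as a sketch rather than the paper's reported specification. It assumes a logistic regression with a binary unhelpful-response indicator $u$ and a prompt count $n$; both symbols are introduced here for illustration. Reading the 17 percent reduction as an odds ratio of 0.83 is also an assumption, since the abstract phrases it as a change in probability.

```latex
\[
\log\frac{P(\text{abandon})}{1 - P(\text{abandon})} = \beta_0 + \beta_1 u + \beta_2 n,
\qquad e^{\beta_1} \approx 11 \;(\beta_1 \approx 2.40),
\qquad e^{\beta_2} \approx 0.83 \;(\beta_2 \approx -0.19)
\]
```

Under this reading, the elevenfold figure multiplies the odds of abandonment, not the probability, so its effect on probability depends on the baseline rate.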
If this is right
- Scaffolding, prompt clarification, and debugging steps can partially offset the identified failures.
- Persistent unhelpful replies dominate the decision to stop using the model.
- Additional prompts provide a modest protective effect against abandonment.
- Tooling that reduces incorrect answers and context loss would directly address the main drivers of quitting.
Where Pith is reading between the lines
- High abandonment suggests net productivity claims for LLMs in complex SE work may need to account for time lost to failed attempts.
- The same response-quality and persistence dynamics could appear in non-SE iterative tasks such as data analysis or report writing.
- Interfaces that surface likely failure modes in advance might reduce the observed quit rate.
Load-bearing premise
That the failure patterns and abandonment rates observed in this single web development task will appear in other software engineering tasks or with different LLMs.
What would settle it
A replication with a different SE task or LLM in which the measured effect of unhelpful responses on abandonment drops below statistical significance.
Original abstract
Software engineers are increasingly incorporating AI assistants into their workflows to enhance productivity and alleviate cognitive load. However, experiences with large language models (LLMs) such as ChatGPT vary widely. While some engineers find them useful, others deem them counterproductive due to inaccuracies in their responses. Researchers have also observed that ChatGPT often provides incorrect information. Given these limitations, it is crucial to determine how to effectively integrate LLMs into software engineering (SE) workflows. Analyzing data from 26 participants in a complex web development task, we identified nine failure types categorized into incorrect or incomplete responses, cognitive overload, and context loss. Users attempted to mitigate these issues through scaffolding, prompt clarification, and debugging. However, 17 participants ultimately chose to abandon ChatGPT due to persistent failures. Our quantitative analysis revealed that unhelpful responses increased the likelihood of abandonment by a factor of 11, while each additional prompt reduced abandonment probability by 17%. This study advances the understanding of human-AI interaction in SE tasks and outlines directions for future research and tooling support.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an empirical study with 26 participants completing a complex web development task using ChatGPT. It categorizes nine LLM failure types into incorrect/incomplete responses, cognitive overload, and context loss; describes user mitigation strategies (scaffolding, clarification, debugging); notes that 17 participants abandoned the tool; and presents quantitative results claiming that unhelpful responses raise abandonment odds by a factor of 11 while each additional prompt lowers abandonment probability by 17%.
Significance. If the quantitative claims prove robust, the work supplies concrete evidence on when developers abandon LLMs during SE tasks and identifies actionable failure modes, contributing to human-AI collaboration research in software engineering. The mixed-methods design and direct observation of interaction sequences are strengths that could support future tooling recommendations.
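The summary states one effect as a change in odds and the other as a change in probability, and the two are not interchangeable. The sketch below shows how an odds ratio of 11 translates into probabilities. The baseline abandonment rates are hypothetical values chosen only for illustration, and `apply_odds_ratio` is a helper name introduced here, not something from the paper.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline abandonment probabilities (not taken from the paper).
for p0 in (0.05, 0.30, 0.60):
    p1 = apply_odds_ratio(p0, 11.0)
    print(f"baseline {p0:.2f} -> {p1:.3f} (probability ratio {p1 / p0:.2f}x)")
```

The probability ratio shrinks as the baseline grows, so "11 times more likely" is only a fair paraphrase of an odds ratio of 11 when the baseline abandonment rate is small.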
Major comments (2)
- [§5.2] (Quantitative Analysis): The reported odds ratio of 11 for unhelpful responses and 17% per-prompt reduction are obtained from a regression on 26 binary abandonment outcomes. No information is supplied on model family, covariate specification, handling of repeated measures within participants, or validation (bootstrap, cross-validation, or sensitivity to outcome coding). With only 17 events the point estimates are likely to be unstable under modest specification changes.
- [§3] (Methods): The coding scheme for the nine failure types, the operational definition of 'unhelpful response,' and the criteria for classifying abandonment are not described in sufficient detail to permit replication or to evaluate potential coder bias or measurement error.
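The stability concern in the first major comment can be made concrete with an events-per-variable (EPV) count, a common heuristic for small-sample logistic regression. This is a sketch: the 26 outcomes and 17 abandonments come from the paper as summarized here, while the predictor count of 2 assumes the two-predictor model (unhelpful indicator plus prompt count). The function name is introduced for illustration.

```python
def events_per_variable(n_total: int, n_events: int, n_predictors: int) -> float:
    """Events per variable, using the rarer outcome class as the limiting count."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    if not 0 <= n_events <= n_total:
        raise ValueError("n_events must lie between 0 and n_total")
    n_non_events = n_total - n_events
    limiting = min(n_events, n_non_events)
    return limiting / n_predictors


# 26 participants, 17 abandoned; 2 predictors is an assumption about the model.
epv = events_per_variable(n_total=26, n_events=17, n_predictors=2)
print(f"EPV = {epv}")
```

With 9 non-abandoners as the rarer class, the EPV is well below the frequently cited rule of thumb of about 10, which is consistent with the referee's expectation of unstable point estimates.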
Minor comments (2)
- [Abstract] The abstract states the 11× and 17% figures without any accompanying confidence intervals, p-values, or sample-size caveats.
- [§5.2] A table or figure presenting the regression coefficients, standard errors, and model fit statistics is missing from the quantitative results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments identify areas where additional methodological transparency will strengthen the paper. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§5.2] (Quantitative Analysis): The reported odds ratio of 11 for unhelpful responses and 17% per-prompt reduction are obtained from a regression on 26 binary abandonment outcomes. No information is supplied on model family, covariate specification, handling of repeated measures within participants, or validation (bootstrap, cross-validation, or sensitivity to outcome coding). With only 17 events the point estimates are likely to be unstable under modest specification changes.
  Authors: We agree that §5.2 currently omits key statistical details. The analysis consisted of a logistic regression with abandonment (binary) as the outcome, a binary indicator for unhelpful responses, and prompt count as a continuous predictor; the model was fit in R using glm(). No random effects or repeated-measures adjustment was applied because each participant contributed a single abandonment decision. We will expand the section to report the full model equation, software, coefficient table with confidence intervals, and an explicit statement that the results are exploratory given the event count. We will also add a sensitivity note acknowledging that small-sample logistic regression estimates can be unstable.
  Revision: yes
- Referee: [§3] (Methods): The coding scheme for the nine failure types, the operational definition of 'unhelpful response,' and the criteria for classifying abandonment are not described in sufficient detail to permit replication or to evaluate potential coder bias or measurement error.
  Authors: We concur that the current description of the qualitative coding process is insufficient. The nine failure categories emerged from inductive thematic analysis performed independently by two researchers on the full interaction transcripts; disagreements were resolved through discussion and a final codebook was applied. 'Unhelpful response' was operationalized as any LLM output that introduced factual errors, omitted required functionality, or required the participant to restart a sub-task. Abandonment was coded when a participant explicitly ceased prompting and completed the remainder of the task without the LLM. We will revise §3 to include the complete codebook with examples, inter-rater reliability statistics, and the exact decision rules used for each construct.
  Revision: yes
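The rebuttal promises inter-rater reliability statistics without naming one. Cohen's kappa is a common choice for two coders, and the sketch below shows how it could be computed. The category names and both label sequences are hypothetical and do not come from the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for six responses, using the paper's three top-level groups.
coder_1 = ["incorrect", "incorrect", "overload", "context_loss", "incorrect", "overload"]
coder_2 = ["incorrect", "overload", "overload", "context_loss", "incorrect", "incorrect"]
print(f"kappa = {cohens_kappa(coder_1, coder_2):.3f}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one failure category dominates the transcripts.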
Circularity Check
No circularity; empirical claims rest directly on observed study data
Full rationale
The paper reports an empirical user study with 26 participants on a web development task. Failure categories, abandonment decisions (17/26), and quantitative effects (unhelpful responses multiply abandonment odds by 11; each extra prompt lowers probability by 17%) are presented as outcomes of direct observation, coding of interactions, and regression-style analysis on the collected binary outcomes. No equations, self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central results are therefore self-contained against the external benchmark of the participant data rather than reducing to prior inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The web development task is representative of typical SE workflows.
Forward citations
Cited by 1 Pith paper
- TDD Governance for Multi-Agent Code Generation via Prompt Engineering
An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code g...