"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering

Bingsheng Yao; Dakuo Wang; Hongbo Fang; Jiessie Tie; Shurui Zhou; Syed Ishtiaque Ahmed; Tianshi Li

arxiv: 2411.09916 · v3 · submitted 2024-11-15 · 💻 cs.SE

Jiessie Tie, Bingsheng Yao, Tianshi Li, Hongbo Fang, Syed Ishtiaque Ahmed, Dakuo Wang, Shurui Zhou

Pith reviewed 2026-05-23 17:28 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM · ChatGPT · software engineering · abandonment · failure modes · human-AI interaction · web development

The pith

Unhelpful LLM responses made users 11 times more likely to abandon ChatGPT in a web development task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how 26 participants used ChatGPT on a complex web development assignment and catalogs the specific ways the model failed them. Nine failure types emerged in three groups: responses that were wrong or incomplete, demands that overloaded the user's thinking, and breakdowns where the model lost track of prior context. Seventeen participants stopped using the tool entirely. Statistical modeling showed that an unhelpful answer multiplied the odds of quitting by 11 while each extra prompt lowered those odds by 17 percent.

Core claim

In a study of 26 participants building a web application, nine recurring failure modes in ChatGPT responses drove 17 participants to abandon the tool; unhelpful replies raised abandonment likelihood elevenfold while each successive prompt reduced it by 17 percent.

What carries the argument

Nine failure types grouped into incorrect or incomplete responses, cognitive overload, and context loss, tracked against participant abandonment decisions and modeled with logistic regression on response helpfulness and prompt count.
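
One way to write the model described above is as a two-predictor logistic regression. This is a sketch under stated assumptions, not an equation taken from the paper: the symbols $\beta_0, \beta_1, \beta_2$ and the variable names are illustrative, and reading the 17 percent figure as an odds multiplier of 0.83 follows the summary on this page rather than a reported coefficient table.

```latex
\[
\log\frac{p_i}{1 - p_i}
  = \beta_0 + \beta_1\,\mathrm{unhelpful}_i + \beta_2\,\mathrm{prompts}_i,
\qquad
e^{\beta_1} \approx 11,
\quad
e^{\beta_2} \approx 1 - 0.17 = 0.83
\]
```

Here $p_i$ is the probability that participant $i$ abandons ChatGPT, $\mathrm{unhelpful}_i$ is a 0/1 indicator, and $\mathrm{prompts}_i$ is the prompt count. Under this reading both headline numbers are multipliers on the odds $p_i / (1 - p_i)$, not on the probability itself.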

If this is right

Scaffolding, prompt clarification, and debugging steps can partially offset the identified failures.
Persistent unhelpful replies dominate the decision to stop using the model.
Additional prompts provide a modest protective effect against abandonment.
Tooling that reduces incorrect answers and context loss would directly address the main drivers of quitting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

High abandonment suggests net productivity claims for LLMs in complex SE work may need to account for time lost to failed attempts.
The same response-quality and persistence dynamics could appear in non-SE iterative tasks such as data analysis or report writing.
Interfaces that surface likely failure modes in advance might reduce the observed quit rate.

Load-bearing premise

That the failure patterns and abandonment rates observed in this single web development task will appear in other software engineering tasks or with different LLMs.

What would settle it

A replication with a different SE task or LLM in which the measured effect of unhelpful responses on abandonment drops below statistical significance.

Figures

Figures reproduced from arXiv: 2411.09916 by Bingsheng Yao, Dakuo Wang, Hongbo Fang, Jiessie Tie, Shurui Zhou, Syed Ishtiaque Ahmed, Tianshi Li.

**Figure 1.** Task breakdown: A) Insert profile picture; B) Link email; C) Create division with headings/widgets: C1. Add visualization; D) Create division with headings/widgets: D1. Insert table; E) Insert footer division; F) Format side-by-side division: F1. Insert headings/subtitles; G) Insert division with form/buttons; H) Implement pop-up on button click; I) Implement form with local file saves and alert on submiss…

**Figure 2.** Workflow of user interaction with ChatGPT: focus on …

**Figure 3.** Distribution of Response Lengths for Successful vs. …

**Figure 4.** Relationship between causes, failures, and mitigations.

**Figure 5.** Participant’s ratings for (a) ChatGPT’s for helpfulness; …

**Figure 6.** Distribution of participants across different categories …

Original abstract

Software engineers are increasingly incorporating AI assistants into their workflows to enhance productivity and alleviate cognitive load. However, experiences with large language models (LLMs) such as ChatGPT vary widely. While some engineers find them useful, others deem them counterproductive due to inaccuracies in their responses. Researchers have also observed that ChatGPT often provides incorrect information. Given these limitations, it is crucial to determine how to effectively integrate LLMs into software engineering (SE) workflow. Analyzing data from 26 participants in a complex web development task, we identified nine failure types categorized into incorrect or incomplete responses, cognitive overload, and context loss. Users attempted to mitigate these issues through scaffolding, prompt clarification, and debugging. However, 17 participants ultimately chose to abandon ChatGPT due to persistent failures. Our quantitative analysis revealed that unhelpful responses increased the likelihood of abandonment by a factor of 11, while each additional prompt reduced abandonment probability by 17%. This study advances the understanding of human-AI interaction in SE tasks and outlines directions for future research and tooling support.
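
The two headline numbers are easy to misread as multipliers on probability. The sketch below shows how they would behave if both are odds multipliers from a logistic model. That reading is an assumption (the abstract says "probability" for the 17% figure), and `BASELINE_ODDS` is a hypothetical placeholder because the abstract reports no intercept.

```python
"""Illustrative only: how an odds ratio of 11 and a per-prompt odds
multiplier of 0.83 would move a probability in a logistic model."""

OR_UNHELPFUL = 11.0   # factor reported for an unhelpful response
OR_PER_PROMPT = 0.83  # assumption: "17% reduction" read as an odds multiplier


def odds_to_probability(odds: float) -> float:
    """Convert odds, p / (1 - p), back to the probability p."""
    return odds / (1.0 + odds)


def abandonment_probability(baseline_odds: float, unhelpful: bool, prompts: int) -> float:
    """Probability implied by scaling the baseline odds by the two factors."""
    odds = baseline_odds
    if unhelpful:
        odds *= OR_UNHELPFUL
    odds *= OR_PER_PROMPT ** prompts
    return odds_to_probability(odds)


if __name__ == "__main__":
    # Hypothetical baseline: even odds with a helpful reply and zero prompts.
    BASELINE_ODDS = 1.0
    for prompts in (0, 5, 10):
        helpful = abandonment_probability(BASELINE_ODDS, False, prompts)
        unhelpful = abandonment_probability(BASELINE_ODDS, True, prompts)
        print(f"prompts={prompts:2d}  helpful={helpful:.2f}  unhelpful={unhelpful:.2f}")
```

With the hypothetical even-odds baseline, an unhelpful reply moves the implied probability from 0.50 to 11/12 (about 0.92). An elevenfold change in odds is therefore far from an elevenfold change in probability, and the size of the shift depends on a baseline the abstract does not give.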

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A small user study on ChatGPT abandonment in a web task gives a usable failure taxonomy, but the 11x odds ratio rests on an unvalidated regression with 17 events.

The main takeaway is that 17 out of 26 participants abandoned ChatGPT during a web development task, and the authors link that outcome to unhelpful responses and the number of prompts tried. They also produce a taxonomy of nine failure types split into incorrect or incomplete answers, cognitive overload, and context loss, plus notes on what users tried before quitting. That is the concrete new material: real interaction logs turned into categories and counts rather than another general complaint about LLM errors. The study design is straightforward and the abandonment rate itself is a useful data point for anyone building or studying AI coding tools.

The regression results are the soft spot. The factor-of-11 increase from unhelpful responses and the 17 percent drop per additional prompt come from modeling 26 binary outcomes with only 17 events. The abstract gives no information on the model family, covariate choices, handling of repeated prompts within participants, or any robustness checks. On a sample this size those point estimates can shift noticeably with modest changes in coding or specification, so the precise multipliers are not yet reliable. Generalization beyond this one task is also limited, though the authors stay close to their data.

This work is aimed at researchers in human-AI interaction for software engineering. A reader in that area will get usable examples of where the breakdowns happen and how often users walk away. It deserves peer review so the methods section can be examined and the regression either strengthened or qualified.
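
The sample-size worry in the letter can be made concrete with the events-per-variable (EPV) heuristic for logistic regression. The sketch below is a rough check, not an analysis from the paper. It assumes two predictors (the unhelpful-response indicator and the prompt count, as described elsewhere on this page) and uses the less frequent outcome as the limiting count, which is the usual convention. The often-quoted threshold of about 10 is a rule of thumb, not a hard cutoff.

```python
def events_per_variable(n_total: int, n_events: int, n_predictors: int) -> float:
    """Events per variable, using the less frequent outcome as the limiting count."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    if not 0 <= n_events <= n_total:
        raise ValueError("n_events must lie between 0 and n_total")
    limiting = min(n_events, n_total - n_events)
    return limiting / n_predictors


if __name__ == "__main__":
    # 26 participants, 17 abandonments, 2 assumed predictors.
    epv = events_per_variable(n_total=26, n_events=17, n_predictors=2)
    print(f"EPV = {epv:.1f}")
```

With 17 abandonments and 9 non-abandonments, the limiting count is 9, which gives an EPV of 4.5 for two predictors. Even counting the 17 abandonments directly gives 8.5, so either way the model sits below the common rule of thumb.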

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical study with 26 participants completing a complex web development task using ChatGPT. It categorizes nine LLM failure types into incorrect/incomplete responses, cognitive overload, and context loss; describes user mitigation strategies (scaffolding, clarification, debugging); notes that 17 participants abandoned the tool; and presents quantitative results claiming that unhelpful responses raise abandonment odds by a factor of 11 while each additional prompt lowers abandonment probability by 17%.

Significance. If the quantitative claims prove robust, the work supplies concrete evidence on when developers abandon LLMs during SE tasks and identifies actionable failure modes, contributing to human-AI collaboration research in software engineering. The mixed-methods design and direct observation of interaction sequences are strengths that could support future tooling recommendations.

major comments (2)

[§5.2, Quantitative Analysis] The reported odds ratio of 11 for unhelpful responses and 17% per-prompt reduction are obtained from a regression on 26 binary abandonment outcomes. No information is supplied on model family, covariate specification, handling of repeated measures within participants, or validation (bootstrap, cross-validation, or sensitivity to outcome coding). With only 17 events the point estimates are likely to be unstable under modest specification changes.
[§3, Methods] The coding scheme for the nine failure types, the operational definition of 'unhelpful response,' and the criteria for classifying abandonment are not described in sufficient detail to permit replication or to evaluate potential coder bias or measurement error.

minor comments (2)

[Abstract] The abstract states the 11× and 17% figures without any accompanying confidence intervals, p-values, or sample-size caveats.
[§5.2] A table or figure presenting the regression coefficients, standard errors, and model fit statistics is missing from the quantitative results.
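
To illustrate what the requested confidence interval would look like, here is a minimal sketch of a Wald interval on the odds-ratio scale, computed as exp(b ± z·SE) from a logistic regression coefficient b. The coefficient is taken as ln(11) to match the reported odds ratio; the standard error is a hypothetical placeholder, since none is reported on this page.

```python
import math


def odds_ratio_wald_ci(coef: float, std_err: float, z: float = 1.959964):
    """Return (odds_ratio, lower, upper) for a Wald interval; z defaults to the 95% level."""
    if std_err < 0:
        raise ValueError("std_err must be non-negative")
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


if __name__ == "__main__":
    coef = math.log(11.0)     # matches the reported odds ratio of 11
    hypothetical_se = 1.0     # placeholder only; not a value from the paper
    odds_ratio, lower, upper = odds_ratio_wald_ci(coef, hypothetical_se)
    print(f"OR = {odds_ratio:.1f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is symmetric on the log scale, so it is lopsided around the odds ratio itself. With small samples the standard error tends to be large, which is why a point estimate of 11 on its own says little about precision.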

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional methodological transparency will strengthen the paper. We address each point below and will revise the manuscript accordingly.

Referee: [§5.2, Quantitative Analysis] The reported odds ratio of 11 for unhelpful responses and 17% per-prompt reduction are obtained from a regression on 26 binary abandonment outcomes. No information is supplied on model family, covariate specification, handling of repeated measures within participants, or validation (bootstrap, cross-validation, or sensitivity to outcome coding). With only 17 events the point estimates are likely to be unstable under modest specification changes.

Authors: We agree that §5.2 currently omits key statistical details. The analysis consisted of a logistic regression with abandonment (binary) as the outcome, a binary indicator for unhelpful responses, and prompt count as a continuous predictor; the model was fit in R using glm(). No random effects or repeated-measures adjustment was applied because each participant contributed a single abandonment decision. We will expand the section to report the full model equation, software, coefficient table with confidence intervals, and an explicit statement that the results are exploratory given the event count. We will also add a sensitivity note acknowledging that small-sample logistic regression estimates can be unstable.
Revision: yes

Referee: [§3, Methods] The coding scheme for the nine failure types, the operational definition of 'unhelpful response,' and the criteria for classifying abandonment are not described in sufficient detail to permit replication or to evaluate potential coder bias or measurement error.

Authors: We concur that the current description of the qualitative coding process is insufficient. The nine failure categories emerged from inductive thematic analysis performed independently by two researchers on the full interaction transcripts; disagreements were resolved through discussion and a final codebook was applied. 'Unhelpful response' was operationalized as any LLM output that either introduced factual errors, omitted required functionality, or required the participant to restart a sub-task. Abandonment was coded when a participant explicitly ceased prompting and completed the remainder of the task without the LLM. We will revise §3 to include the complete codebook with examples, inter-rater reliability statistics, and the exact decision rules used for each construct.
Revision: yes
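
The simulated rebuttal promises inter-rater reliability statistics for the two coders. One common choice for two raters and nominal categories is Cohen's kappa. The sketch below is illustrative only: the rebuttal does not name a statistic, and the labels are toy data invented for the example, not codes from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels using the three failure groups named on this page (hypothetical data).
    coder_a = ["incorrect", "incorrect", "overload", "context", "context", "incorrect"]
    coder_b = ["incorrect", "overload", "overload", "context", "context", "incorrect"]
    print(f"kappa = {cohens_kappa(coder_a, coder_b):.2f}")
```

For the toy labels, the coders agree on 5 of 6 items and chance agreement is 1/3, giving a kappa of 0.75. Reporting a statistic of this kind alongside the codebook would let readers judge how reproducible the nine-category scheme is.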

Circularity Check

0 steps flagged

No circularity; empirical claims rest directly on observed study data

The paper reports an empirical user study with 26 participants on a web development task. Failure categories, abandonment decisions (17/26), and quantitative effects (unhelpful responses multiply abandonment odds by 11; each extra prompt lowers probability by 17%) are presented as outcomes of direct observation, coding of interactions, and regression-style analysis on the collected binary outcomes. No equations, self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central results are therefore self-contained against the external benchmark of the participant data rather than reducing to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions in user studies about task representativeness and participant behavior reflecting real use.

axioms (1)

domain assumption: The web development task is representative of typical SE workflows.
The study uses one complex task to draw conclusions about SE in general.

pith-pipeline@v0.9.0 · 5732 in / 1024 out tokens · 24824 ms · 2026-05-23T17:28:00.250827+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TDD Governance for Multi-Agent Code Generation via Prompt Engineering
cs.SE · 2026-04 · unverdicted · novelty 5.0

An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code g...
