Hint-Writing with Deferred AI Assistance: Fostering Critical Engagement in Data Science Education
Pith reviewed 2026-05-10 01:14 UTC · model grok-4.3
The pith
Students write better hints and spot more code mistakes when they draft their own hint before seeing AI assistance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a randomized experiment with 97 graduate students, the deferred AI assistance condition—where students first write a hint independently and then revise it using an AI-generated hint—produced higher-quality hints than independent writing or immediate AI assistance. This design also enabled students to identify a wider range of mistakes in code compared to writing without any AI help.
What carries the argument
Deferred AI assistance, in which students draft hints independently before revising with an AI-generated hint; this sequencing scaffolds support while preserving students' initial cognitive effort on error analysis.
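A minimal sketch of what the three conditions amount to in practice, with hypothetical helper functions standing in for the course platform's prompts and LLM calls (neither is specified in the paper):

```python
# Minimal sketch of the three hint-writing conditions compared in the study.
# The helpers below are hypothetical stand-ins; the paper does not specify
# how hints were collected or which model generated the AI hints.

def collect_hint(prompt: str) -> str:
    """Placeholder: capture the student's written hint."""
    return input(prompt)

def generate_ai_hint(buggy_code: str) -> str:
    """Placeholder: an LLM call that returns a hint for the buggy code."""
    return f"[AI-generated hint for: {buggy_code[:40]}...]"

def run_activity(buggy_code: str, condition: str) -> str:
    if condition == "independent":
        return collect_hint("Write a hint for the buggy code: ")
    if condition == "immediate_ai":
        ai_hint = generate_ai_hint(buggy_code)  # AI help available from the start
        return collect_hint(f"AI suggests: {ai_hint}\nWrite your hint: ")
    if condition == "deferred_ai":
        draft = collect_hint("Write a hint on your own first: ")  # independent draft
        ai_hint = generate_ai_hint(buggy_code)  # AI hint revealed only after the draft
        return collect_hint(f"Your draft: {draft}\nAI suggests: {ai_hint}\nRevise your hint: ")
    raise ValueError(f"unknown condition: {condition}")
```

The design lever is purely ordering: the deferred condition withholds the AI hint until an independent draft exists.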
If this is right
- Students produce higher-quality hints in the deferred condition than in independent or immediate-AI conditions.
- The deferred design helps students notice coding mistakes they miss when working without AI.
- Participants value the activities as practice for debugging and critically assessing AI outputs.
- Student-AI collaborative designs must manage cognitive load and keep the AI from adding redundancies or extraneous details to student work.
Where Pith is reading between the lines
- The deferred pattern could be tested in other tasks like math problem solving or essay revision to check if the benefit transfers.
- Measuring actual debugging performance after the activity would test whether hint quality improvements lead to lasting skill gains.
- Immediate AI access may reduce the variety of errors students detect independently even when it speeds completion.
Load-bearing premise
That higher hint quality and a wider range of identified mistakes are valid proxies for critical engagement and learning gains, and that results from graduate students in one data science course will generalize.
What would settle it
A replication with undergraduate students or in a different subject that finds no difference in hint quality or mistake identification between the deferred condition and the other two would challenge the central claim.
Original abstract
Generating hints for incorrect code is a cognitively demanding task that fosters learning and metacognitive development. This study investigates three designs for personalized, scalable, and reflective hint-writing activities within a data science course: (i) writing a hint independently, (ii) writing a hint with on-demand AI assistance, and (iii) deferred AI assistance, in which students first write a hint independently and then revise it with the help of an AI-generated one. We examine how AI support can scaffold the learning process without diminishing students' productive cognitive effort. Through a randomized controlled experiment with graduate-level students (N=97), we found that deferring AI assistance leads to the highest-quality hints. Further, this design helps students identify a wide range of mistakes they otherwise struggle to identify without any AI assistance. Students valued these activities as opportunities to practice debugging and critically engage with AI outputs--skills that are now critical for learners to acquire as programming becomes increasingly automated and the use of AI for learning grows. Our findings also highlight key considerations for designing student-AI collaborative learning experiences to sustain student engagement, maintain appropriate cognitive load, and mitigate negative effects of AI, such as introducing redundancies and extraneous information into student work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from a randomized controlled experiment (N=97 graduate students in a data science course) comparing three conditions for a hint-writing activity on incorrect code: independent writing, on-demand AI assistance, and deferred AI assistance (independent writing followed by revision using an AI-generated hint). The central claims are that the deferred condition produces the highest-quality hints and enables identification of a wider range of mistakes than the other conditions, thereby fostering critical engagement and metacognitive skills without excessive cognitive load or redundancy from AI.
Significance. If the outcome measures prove reliable and the proxies valid, the work supplies actionable evidence for AI scaffolding designs that preserve productive student effort in programming education. The randomized design and the focus on hint-writing as a cognitively demanding task are clear strengths. The N=97 sample is adequate for detecting moderate-to-large condition differences in an educational setting.
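A back-of-the-envelope power calculation, not reported in the paper, suggests what "adequate" means here for a three-group design with 97 participants at conventional thresholds:

```python
# Back-of-the-envelope check (not from the paper): the smallest effect a
# three-condition comparison with 97 participants can reliably detect.
from statsmodels.stats.power import FTestAnovaPower

min_f = FTestAnovaPower().solve_power(
    effect_size=None,   # solve for the minimum detectable Cohen's f
    nobs=97,            # total participants across the three conditions
    alpha=0.05,
    power=0.80,
    k_groups=3,
)
print(f"Minimum detectable Cohen's f at 80% power: {min_f:.2f}")
# This works out to roughly f ~= 0.32, a medium-to-large effect; smaller
# between-condition differences would likely go undetected.
```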
major comments (3)
- [Methods] Methods section: the manuscript provides no description of how hint quality was operationalized (e.g., rubric items, scale, or expert criteria), the number of raters, or any inter-rater reliability metric. Because the primary claim that deferred assistance yields superior hints rests entirely on these ratings, the absence of this information makes the result difficult to interpret or replicate (a reporting sketch follows this list).
- [Results] Results section: no validation is reported that the chosen proxies (hint quality scores and range of identified mistakes) correlate with actual learning gains, debugging performance on transfer tasks, or metacognitive questionnaire scores. Without such evidence, the inference that the design fosters 'critical engagement' and 'metacognitive development' remains unsupported by the data presented.
- [Discussion] Discussion section: the paper does not address whether the observed ordering of conditions would hold for undergraduate students or in domains outside data science; the graduate-only, single-course sample is load-bearing for any claim about scalable educational design.
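On the first point, the kind of reliability reporting a revision would need can be illustrated with a minimal sketch: quadratic-weighted Cohen's kappa between two hypothetical raters scoring hints on an assumed 1-5 rubric (neither the rubric scale nor the number of raters is given in the paper):

```python
# Illustrative only: quadratic-weighted Cohen's kappa between two hypothetical
# raters scoring hints on an assumed 1-5 rubric. The ratings are made up.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 5, 2, 4, 1, 3, 5, 4, 2]
rater_b = [4, 2, 5, 2, 3, 1, 3, 4, 4, 2]

kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
# Values above roughly 0.6 are conventionally read as substantial agreement.
```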
minor comments (2)
- [Abstract] The abstract states that students 'valued these activities' but does not indicate whether this was measured via survey, interview, or open response, nor does it report any quantitative summary of that feedback.
- [Results] Figure or table captions for condition comparisons should explicitly state the statistical test, p-value threshold, and effect size used for each reported difference.
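A minimal template for that reporting, using made-up scores and assuming ordinal rubric ratings (the paper does not confirm the measurement scale): a Kruskal-Wallis test across the three conditions plus an epsilon-squared effect size.

```python
# Illustrative template (made-up scores): report the test statistic, p-value,
# and an effect size for the three-condition comparison of hint quality.
from scipy.stats import kruskal

independent  = [2, 3, 3, 2, 4, 3]
immediate_ai = [3, 3, 4, 2, 4, 3]
deferred_ai  = [4, 5, 4, 4, 5, 3]

h_stat, p_value = kruskal(independent, immediate_ai, deferred_ai)
n = len(independent) + len(immediate_ai) + len(deferred_ai)
epsilon_sq = h_stat * (n + 1) / (n**2 - 1)  # epsilon-squared for Kruskal-Wallis

print(f"H = {h_stat:.2f}, p = {p_value:.3f}, epsilon^2 = {epsilon_sq:.2f}")
```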
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
Referee: [Methods] Methods section: the manuscript provides no description of how hint quality was operationalized (e.g., rubric items, scale, or expert criteria), the number of raters, or any inter-rater reliability metric. Because the primary claim that deferred assistance yields superior hints rests entirely on these ratings, the absence of this information makes the result difficult to interpret or replicate.
Authors: We agree that this information is essential for interpretability and replicability. We will revise the Methods section to include a complete description of the hint quality rubric (rubric items, scale, and expert criteria), the number of raters, and the inter-rater reliability metric. revision: yes
Referee: [Results] Results section: no validation is reported that the chosen proxies (hint quality scores and range of identified mistakes) correlate with actual learning gains, debugging performance on transfer tasks, or metacognitive questionnaire scores. Without such evidence, the inference that the design fosters 'critical engagement' and 'metacognitive development' remains unsupported by the data presented.
Authors: We acknowledge that the study does not report direct correlations between the proxies and learning gains or transfer performance. The proxies were selected based on prior educational research, but we will revise the Results and Discussion to clarify the scope of inferences, explicitly note this limitation, and suggest directions for future validation work. revision: partial
Referee: [Discussion] Discussion section: the paper does not address whether the observed ordering of conditions would hold for undergraduate students or in domains outside data science; the graduate-only, single-course sample is load-bearing for any claim about scalable educational design.
Authors: We agree that the sample limits generalizability. We will expand the Discussion to explicitly address the graduate, data-science-specific context, discuss potential differences for other populations and domains, and call for replication studies. revision: yes
Circularity Check
Empirical RCT with no derivation chain or load-bearing self-references
Full rationale
The manuscript is a randomized controlled experiment (N=97) that directly compares three hint-writing conditions via expert-rated hint quality and counts of identified mistakes. No equations, fitted parameters, or first-principles derivations are present; all central claims rest on between-condition statistical contrasts measured in the current study. Self-citations to prior AI-education work appear but are not invoked to justify uniqueness, define variables, or substitute for new evidence, satisfying the criteria for score 0.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Random assignment in the experiment sufficiently balances participant characteristics across conditions, so that observed differences can be attributed to the hint-writing design (a balance-check sketch follows this list).
- domain assumption: Hint quality and breadth of identified mistakes serve as valid indicators of critical engagement and learning.
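A sketch of how the first assumption could be probed, using simulated baseline covariates because the paper reports no balance table:

```python
# Hedged sketch of a randomization balance check across the three conditions.
# The covariates and their values are simulated; the paper reports no balance table.
import numpy as np
from scipy.stats import f_oneway, chi2_contingency

rng = np.random.default_rng(0)
conditions = rng.choice(["independent", "immediate_ai", "deferred_ai"], size=97)
prior_grade = rng.normal(80, 10, size=97)       # hypothetical continuous covariate
cs_background = rng.integers(0, 2, size=97)     # hypothetical binary covariate

# Continuous covariate: one-way ANOVA across conditions
groups = [prior_grade[conditions == c] for c in np.unique(conditions)]
f_stat, p_cont = f_oneway(*groups)
print(f"prior grade: F = {f_stat:.2f}, p = {p_cont:.3f}")

# Categorical covariate: chi-square test of independence
table = [[int(np.sum((conditions == c) & (cs_background == v))) for v in (0, 1)]
         for c in np.unique(conditions)]
chi2, p_cat, _, _ = chi2_contingency(table)
print(f"CS background: chi2 = {chi2:.2f}, p = {p_cat:.3f}")
# Non-significant p-values would be consistent with (not proof of) balanced groups.
```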