Combating Harms of Generative AI in CS1 with Code Review Interviews and a Flipped Classroom
Pith reviewed 2026-05-21 03:29 UTC · model grok-4.3
The pith
Oral code reviews in a flipped CS1 class preserve student understanding even as LLM usage for assignments rises sharply.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through oral code review assessments and a flipped classroom in CS1, students maintain adequate performance on written exams despite a dramatic increase in the use of generative AI for completing coding assignments. Pairwise comparison of exam scores across semesters shows a statistically insignificant increase in Fall 2025 rather than a decline, keystroke data point toward higher AI involvement through increased pasting, and end-of-semester surveys indicate positive attitudes toward the interviews.
What carries the argument
Weekly oral code review interviews, in which students must explain and justify the code they submitted for assignments, supported by a flipped classroom model that moves concept learning outside of class and leaves ample time for students to schedule interviews.
If this is right
- Students retain conceptual understanding as measured by exam performance even when relying more on AI for code production.
- Formative oral assessments can incentivize metacognitive engagement with code regardless of its source.
- Positive student feedback supports the feasibility of scaling this model with improvements in scheduling and training.
- The combination allows experimentation with AI tools without apparent harm to foundational learning in introductory programming.
Where Pith is reading between the lines
- Similar oral review methods might transfer to other introductory STEM courses facing AI tools.
- Institutions could shift from detection and punishment of AI use toward redesigned assessments that reward explanation.
- Repeated oral reviews may build stronger habits of code comprehension beyond what written exams alone achieve.
- Data on long-term retention or transfer to new problems would further test the approach.
Load-bearing premise
Oral code review interviews accurately measure and promote deep conceptual understanding instead of students simply preparing rote explanations for the interview.
What would settle it
A follow-up assessment where students must solve new programming problems or explain code modifications without prior preparation would show significantly lower performance if the reviews only encourage superficial knowledge.
Original abstract
Background and Context: Large Language Models (LLMs) are more accessible and accurate than ever before, raising significant concerns for computing educators. One major concern is students using LLMs to bypass the effort needed to understand concepts and metacognitive strategies essential for success in computer science. Objectives: We contribute a unique approach to assessing and building up student understanding through weekly oral code review assessments. These formative assessments incentivize students to understand their submitted code, regardless of whether or not the code was generated by AI tools. We also use a flipped classroom to provide time for students to learn concepts outside of class and provide ample time for students to schedule code review interviews. Methods: For this paper, we collected data from three semesters. We analyze student exam scores, keystroke logs, and surveys to understand how the new course policies affected student learning, behavior, and attitudes. Findings: Pairwise comparison of exam results reveals a statistically insignificant increase in average scores for Fall 2025 compared to previous semesters. Keystroke logs show a significant increase in characters pasted per total characters input into coding assignments in Fall 2025, pointing towards higher AI usage. Survey results show positive student sentiment towards code reviews at the end of Fall 2025, with nearly all negative feedback being addressable through better scheduling and more rigorous TA training. Implications: Oral code reviews with a flipped classroom appear to be effective at mitigating harms of LLM use while providing space for students to freely experiment with these tools. Our work suggests that students in Fall 2025 still show adequate understanding of material covered in written exams, despite dramatic increases in LLM usage for coding assignments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes an educational intervention in a CS1 course that combines weekly oral code review interviews with a flipped classroom to mitigate harms from students using LLMs to bypass conceptual understanding and metacognitive strategies. Drawing on data from three semesters, the authors compare exam scores, keystroke logs showing pasted characters, and end-of-semester surveys, reporting a statistically insignificant increase in average exam scores in Fall 2025 alongside a significant rise in pasting (indicating higher LLM usage) and generally positive student feedback on the code reviews.
Significance. If the methodological gaps are addressed, the work could provide a practical, replicable model for CS educators seeking to accommodate AI tools while using formative oral assessments to promote genuine comprehension. The multi-method approach (logs + exams + surveys) offers a useful template for studying behavioral changes in AI-augmented programming courses.
major comments (3)
- [Methods] Methods: The description of data collection and analysis lacks sample sizes per semester, the exact statistical tests and p-values for the pairwise exam-score comparisons, effect sizes, and any controls or covariates for semester-to-semester differences in student population, exam difficulty, or course content.
- [Findings] Findings and Implications: The claim that stable exam scores demonstrate 'adequate understanding' and mitigation of LLM harms assumes written exams assess the explanatory and debugging skills practiced in oral reviews, yet no example exam items, alignment analysis, or evidence ruling out superficial interview preparation is provided.
- [Findings] Findings: The significant increase in pasted characters is presented as evidence of higher AI usage, but without baseline pasting rates from prior semesters or validation that pasting correlates with LLM rather than other copy-paste behaviors, the link to the intervention's effectiveness remains indirect.
minor comments (2)
- [Abstract] Abstract: The abstract refers to 'three semesters' and 'Fall 2025' without naming the comparison semesters or clarifying whether the prior data were collected under identical course policies.
- [Findings] Survey results: Positive sentiment is reported, but response rate, number of respondents, and breakdown of negative feedback categories should be included to evaluate how representative the views are.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and have revised the paper accordingly to improve methodological transparency and the presentation of findings. We believe these changes enhance the work without altering its core contributions.
Point-by-point responses
- Referee: [Methods] Methods: The description of data collection and analysis lacks sample sizes per semester, the exact statistical tests and p-values for the pairwise exam-score comparisons, effect sizes, and any controls or covariates for semester-to-semester differences in student population, exam difficulty, or course content.
  Authors: We agree that these details were insufficiently reported in the original submission. In the revised manuscript, we have added the per-semester sample sizes, specified that pairwise comparisons used independent-samples t-tests, reported exact p-values along with Cohen's d effect sizes, and included a discussion of potential covariates (e.g., student demographics and minor variations in exam content). We also explicitly note the limitations of retrospective data collection in fully controlling for all semester-to-semester differences. These additions directly address the transparency concerns.
  revision: yes
- Referee: [Findings] Findings and Implications: The claim that stable exam scores demonstrate 'adequate understanding' and mitigation of LLM harms assumes written exams assess the explanatory and debugging skills practiced in oral reviews, yet no example exam items, alignment analysis, or evidence ruling out superficial interview preparation is provided.
  Authors: This is a fair critique of the linkage between assessment types. While the manuscript presents stable exam scores as evidence of maintained understanding, we acknowledge the absence of explicit examples and formal alignment. In revision we will add representative exam items that require explanation and debugging, include a short description of how oral review skills map to exam performance, and discuss why weekly interviews covering all assignments make superficial preparation unlikely. We maintain that the multi-method data (exams + logs + surveys) supports our interpretation but will strengthen this section with the requested details.
  revision: partial
- Referee: [Findings] Findings: The significant increase in pasted characters is presented as evidence of higher AI usage, but without baseline pasting rates from prior semesters or validation that pasting correlates with LLM rather than other copy-paste behaviors, the link to the intervention's effectiveness remains indirect.
  Authors: The manuscript does report a statistically significant increase in pasted characters for Fall 2025 relative to prior semesters, which implies baseline data were analyzed. However, we agree that explicit baseline rates and stronger validation of the pasting metric were not detailed enough. In the revision we will report the actual prior-semester pasting percentages for direct comparison and add discussion of why we interpret elevated pasting as primarily LLM-related (e.g., log timing patterns and assignment context), while acknowledging that other copy-paste sources could contribute. This will make the behavioral evidence more explicit.
  revision: yes
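The first response names independent-samples t-tests and Cohen's d without showing how they are computed. A minimal sketch follows, assuming the equal-variance (pooled) form of both statistics; the function names and the score lists are hypothetical placeholders invented for illustration and are not the paper's data. It uses only the Python standard library, so it yields the t statistic and degrees of freedom but not a p-value.

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))


def cohens_d(a, b):
    """Standardized mean difference (b minus a), using the pooled SD."""
    return (mean(b) - mean(a)) / pooled_sd(a, b)


def student_t(a, b):
    """Independent-samples t statistic (equal variances assumed) and its df."""
    na, nb = len(a), len(b)
    t = (mean(b) - mean(a)) / (pooled_sd(a, b) * sqrt(1 / na + 1 / nb))
    return t, na + nb - 2


# Placeholder exam scores, invented for illustration only (not the paper's data).
earlier_semester = [70, 80, 90]
fall_2025 = [75, 85, 95]

d = cohens_d(earlier_semester, fall_2025)
t, df = student_t(earlier_semester, fall_2025)
print(f"d = {d:.2f}, t = {t:.2f}, df = {df}")
```

Reporting d alongside the test matters here because a statistically insignificant difference can still be a non-trivial effect when samples are small, which is why the referee asks for both.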
Circularity Check
No circularity: purely empirical comparisons with no derivations or self-referential reductions
This is an empirical education research paper that reports data collection from three semesters, including exam score comparisons, keystroke log analysis for paste rates, and end-of-semester surveys. The central findings rest on direct statistical pairwise comparisons and descriptive survey results rather than any mathematical derivation, fitted model, uniqueness theorem, or ansatz that could reduce to its own inputs by construction. No equations, predictions, or load-bearing self-citations appear in the provided text; the argument is self-contained against the collected external benchmarks (prior semesters and student responses).
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Pairwise semester comparisons of exam scores are valid when course content and grading standards remain comparable across terms
- domain assumption: Increased ratio of pasted characters to total keystrokes serves as a proxy for higher LLM usage
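The second assumption rests on the metric the abstract describes as characters pasted per total characters input. One way to compute such a ratio is sketched below, assuming a simplified event list of (kind, text) pairs; that event shape and the function name are hypothetical and do not reflect the actual keystroke log format used in the study.

```python
def paste_ratio(events):
    """Fraction of input characters that arrived via paste events.

    `events` is a list of (kind, text) pairs, where kind is "type" or "paste".
    This event shape is a hypothetical stand-in, not the real log format.
    """
    pasted = sum(len(text) for kind, text in events if kind == "paste")
    total = sum(len(text) for kind, text in events if kind in ("type", "paste"))
    # An empty log has no input at all, so report 0.0 instead of dividing by zero.
    return pasted / total if total else 0.0


# Invented example log: 10 typed characters and 30 pasted characters.
log = [
    ("type", "x = 1\n"),   # 6 typed characters
    ("paste", "a" * 30),   # 30 pasted characters
    ("type", "y=2\n"),     # 4 typed characters
]
ratio = paste_ratio(log)
print(ratio)
```

A ratio like this counts every paste equally, so text moved within a student's own file raises it just as LLM output would, which is the validation gap the referee raises.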