arxiv: 2604.23251 · v1 · submitted 2026-04-25 · 💻 cs.SE · cs.AI

AI-Assisted Code Review as a Scaffold for Code Quality and Self-Regulated Learning: An Experience Report

Pith reviewed 2026-05-08 07:52 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-assisted code review, self-regulated learning, software engineering education, GitHub pull requests, LLM reviewer, code quality, scaffolding

The pith

An LLM reviewer integrated into GitHub pull requests scaffolds code quality and self-regulated learning while keeping student follow-up activity stable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests an LLM reviewer placed directly inside students' GitHub pull requests to scale code review in software engineering capstone projects, where deadlines are tight and peer feedback varies. It tracks behavioral indicators of self-regulated learning, such as PR volume and the rate at which students make further commits after receiving AI comments. Across two cohorts, the 2024 group opened roughly twice as many PRs as the 2023 group (1176 vs. 581), and failed AI review attempts fell from 227 to zero after tool and instructional refinements. The share of successfully AI-reviewed PRs that received follow-up commits nevertheless stayed essentially unchanged at 32 percent in 2023 and 33 percent in 2024, even though team adoption dropped from 93 percent to 50 percent. The mixed-methods design pairs GitHub metrics with student reflections to show how structured AI comments can focus attention on code quality without replacing student judgment. This matters for any course that must deliver timely, consistent feedback to many learners at once.

Core claim

Embedding an LLM as a reviewer inside GitHub pull requests in a human-in-the-loop setup lets students receive structured feedback on code quality during their normal workflow. After tool and instructional refinements, failed AI attempts fell from 227 to zero and total PRs rose from 581 to 1176, while the percentage of successfully reviewed PRs followed by later commits on the same PR held steady at 32 percent in 2023 and 33 percent in 2024. Students reported using the AI comments to guide their own reviews and discussions, and the guidance elements appeared to limit uncritical acceptance of suggestions.
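
The responsiveness figure quoted above (the share of successfully AI-reviewed PRs followed by later commits on the same PR) can be illustrated with a minimal sketch. This is not the authors' analysis script: the `PullRequest` record, its field names, and the rule that a follow-up is any commit timestamped after the AI review are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class PullRequest:
    # Hypothetical record; field names are illustrative, not from the paper.
    number: int
    # Time of a successful AI review, or None if there was none (or it failed).
    ai_review_at: Optional[datetime] = None
    # Timestamps of all commits pushed to this PR.
    commit_times: List[datetime] = field(default_factory=list)


def has_follow_up(pr: PullRequest) -> bool:
    """True if at least one commit landed on the PR after its AI review."""
    if pr.ai_review_at is None:
        return False
    return any(t > pr.ai_review_at for t in pr.commit_times)


def responsiveness(prs: List[PullRequest]) -> float:
    """Share of successfully AI-reviewed PRs that received a follow-up commit."""
    reviewed = [pr for pr in prs if pr.ai_review_at is not None]
    if not reviewed:
        return 0.0
    return sum(has_follow_up(pr) for pr in reviewed) / len(reviewed)


# PR totals reported in the paper: 1176 (2024) vs. 581 (2023).
pr_ratio = 1176 / 581  # slightly above 2
```

Under this assumed definition, PRs without a successful AI review are excluded from the denominator, which matches the report's wording of "successfully AI-reviewed PRs" but leaves open how multiple reviews on one PR would be counted.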

What carries the argument

The human-in-the-loop LLM reviewer integrated into GitHub pull requests, which supplies structured comments on code quality and is paired with instructional refinements that reduce over-reliance.

Load-bearing premise

That differences in PR volume, technical issue rates, and responsiveness between the two cohorts arise mainly from the AI tool refinements and instructional changes rather than other unmeasured differences in the student groups or course conditions.

What would settle it

Measuring the rate of follow-up commits after AI-reviewed PRs in a new cohort that receives neither tool refinements nor updated instructions would show whether the stable 32-33 percent responsiveness depends on those specific changes.

Figures

Figures reproduced from arXiv: 2604.23251 by Eduardo Oliveira, Michael Fu, Mohammed Saqr, Patanamon Thongtanunam, Sonsoles López-Pernas.

Figure 1: The structured, checklist-based prompt used to ...
Figure 2: A comparison of student engagement with the LLM-Reviewer tool across the 2023 and 2024 cohorts, grouped by the ...
Figure 3: A summary of the survey questions and the results.
Original abstract

Code review is central to software engineering education but hard to scale in capstone projects due to tight deadlines, uneven peer feedback, and limited prior experience. We investigate an LLM-as-reviewer integrated directly into GitHub pull requests (human-in-the-loop) across two cohorts (more than 100 students, 2023–2024). Using a mixed-methods design (GitHub data, reflective reports, and a targeted survey), we examine engagement and responsiveness as behavioral indicators of self-regulated learning processes. Quantitatively, the 2024 cohort produced more iterative activity (1176 vs. 581 PRs), while technical issues observed in 2023 (227 failed AI attempts) dropped to zero after tool and instructional refinements. Despite different adoption levels (93% vs. 50% of teams using the tool), responsiveness was stable: 32% (2023) and 33% (2024) of successfully AI-reviewed PRs were followed by subsequent commits on the same PR. Qualitatively, students used the LLM's structured comments to focus reviews and discuss code quality, while guidance reduced over-reliance. We contribute: (i) an in-workflow design for an AI reviewer that scaffolds learning while mitigating cognitive offloading; (ii) a repeated cross-sectional comparison across two cohorts in authentic settings; (iii) a mixed-methods analysis combining objective GitHub metrics with student self-reports; and (iv) evidence-based pedagogical recommendations for responsible, student-led AI-assisted review.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is an experience report on integrating an LLM-based code reviewer directly into GitHub pull requests (human-in-the-loop) for software engineering capstone courses across two cohorts (>100 students, 2023–2024). Using a mixed-methods design (GitHub logs, reflective reports, targeted survey), it reports higher PR volume in 2024 (1176 vs. 581), elimination of technical failures after tool and instructional refinements (227 to 0), and stable responsiveness (32% of 2023 AI-reviewed PRs and 33% of 2024 AI-reviewed PRs followed by subsequent commits on the same PR) despite lower adoption (93% of teams in 2023, 50% in 2024). It also reports qualitative indications that structured LLM comments helped focus reviews and support self-regulated learning while guidance mitigated over-reliance. Contributions include the in-workflow design, repeated cross-sectional comparison, mixed-methods analysis, and pedagogical recommendations for responsible AI use.

Significance. If the interpretive link between stable responsiveness and consistent scaffolding holds, the work supplies actionable, evidence-based guidance for educators seeking to scale code review while preserving student agency. Credit is given for the concrete, replicable GitHub metrics (explicit counts and percentages), the mixed-methods integration of objective repository data with self-reports, and the repeated cross-sectional design conducted in authentic course settings; these elements strengthen the practical contribution to CS education research on AI-assisted learning scaffolds.

major comments (1)
  1. The central interpretive claim—that stable responsiveness (32% vs. 33%) demonstrates consistent support for self-regulated learning despite adoption changes—rests on the assumption that the 2023 and 2024 cohorts are comparable on unmeasured factors affecting PR iteration. The GitHub data section reports only aggregate counts and percentages with no within-cohort baseline iteration rates for non-AI-reviewed PRs, no covariate adjustment, and no statistical comparison; this leaves selection effects or other course variations as viable alternative explanations for the observed similarity.
minor comments (2)
  1. Clarify the precise operational definition of 'successfully AI-reviewed PRs' used to compute the responsiveness percentages, including how PRs with multiple AI reviews or mixed human/AI feedback are classified.
  2. Add a summary table collating the key quantitative metrics (PR totals, adoption rates, failed attempts, responsiveness counts and percentages) for both cohorts to improve readability and direct comparison.
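
The major comment notes that no statistical comparison accompanies the 32% vs. 33% responsiveness figures. One conventional option, sketched below, is a pooled two-proportion z-test. The paper as summarized here does not report the number of AI-reviewed PRs per cohort, so the counts in the usage line are hypothetical placeholders and not data from the study; the function name is likewise an assumption.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value).

    x1, x2: PRs with follow-up commits in each cohort.
    n1, n2: successfully AI-reviewed PRs in each cohort.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL counts chosen only to match the reported percentages (32%, 33%).
z, p = two_proportion_z_test(32, 100, 33, 100)
```

A test of this kind only addresses whether the two proportions differ; it does not resolve the referee's concern about cohort comparability or the missing baseline for non-AI-reviewed PRs.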

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our experience report. We address the major comment below and have revised the manuscript to clarify the observational nature of the study and explicitly discuss its limitations.

point-by-point responses (1)
  1. Referee: The central interpretive claim—that stable responsiveness (32% vs. 33%) demonstrates consistent support for self-regulated learning despite adoption changes—rests on the assumption that the 2023 and 2024 cohorts are comparable on unmeasured factors affecting PR iteration. The GitHub data section reports only aggregate counts and percentages with no within-cohort baseline iteration rates for non-AI-reviewed PRs, no covariate adjustment, and no statistical comparison; this leaves selection effects or other course variations as viable alternative explanations for the observed similarity.

    Authors: We agree that the stability in responsiveness rates (32% in 2023 and 33% in 2024) among AI-reviewed PRs should not be over-interpreted as definitive evidence of consistent scaffolding for self-regulated learning, given the observational design. As an experience report from authentic course settings rather than a controlled experiment, the manuscript does not include within-cohort baselines for non-AI PRs, covariate adjustments, or statistical comparisons, and we do not claim that the cohorts are fully comparable on all unmeasured factors. The reported metrics focus on changes in adoption, technical reliability, and iteration rates within the AI-reviewed subset across two similar course iterations. To address this point, we will revise the discussion and add a dedicated limitations subsection that explicitly notes potential selection effects from voluntary tool adoption, the absence of non-AI baselines, and the lack of statistical controls. We will also adjust interpretive language to describe the findings as consistent with scaffolding effects while acknowledging alternative explanations such as course variations. These revisions clarify scope without changing the reported data or core contributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely observational experience report

This is a mixed-methods experience report relying on direct GitHub repository counts (PR volumes, commit follow-ups) and survey responses. No equations, fitted parameters, predictive models, or derivations appear in the provided text or abstract. Responsiveness figures (32% vs 33%) are reported as raw observed proportions without any statistical fitting or self-referential construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The study is self-contained as descriptive data reporting; attribution concerns (cohort comparability) fall under validity rather than circularity per the analysis rules.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities; the report rests on direct observation of student GitHub activity and self-reported reflections.
