AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study
Pith reviewed 2026-06-28 08:19 UTC · model grok-4.3
The pith
AI-generated animated traces improve immediate learning of program execution for some novices but effects are short-term and depend on engagement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generated Animated Traces (GATs) are AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. In the two-institution study, GATs produced selective benefits for immediate learning performance compared with textual explanations, yet these benefits remained context-dependent and short-term; GATs' influence on performance was moderated by learner engagement profiles, underscoring the value of personalized approaches.
What carries the argument
Generated Animated Traces (GATs): AI-generated narrated animations that coordinate source code, runtime state, and conceptual analogies to make program execution explicit.
If this is right
- GATs produce immediate performance gains on execution-related tasks for some students but not others.
- Any immediate gains do not carry over to end-of-course exam scores or sustained engagement measures.
- Learner engagement profiles moderate whether GATs affect performance at all.
- Educational tools for programming benefit from adaptation to individual engagement patterns.
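The moderation claim in the bullets above can be read as a difference between condition effects across engagement profiles. The following is a minimal Python sketch of that reading, not the paper's analysis: the records, the profile labels `high` and `low`, and the helper names are all hypothetical, and the scores are invented for illustration.

```python
from statistics import mean

# Hypothetical records: (condition, engagement_profile, immediate_score).
# All values are invented for illustration; they are not data from the study.
records = [
    ("gat", "high", 8.0), ("gat", "high", 9.0),
    ("text", "high", 6.0), ("text", "high", 7.0),
    ("gat", "low", 5.0), ("gat", "low", 6.0),
    ("text", "low", 5.0), ("text", "low", 6.0),
]


def cell_mean(condition, profile):
    """Mean score for one condition within one engagement profile."""
    scores = [s for c, p, s in records if c == condition and p == profile]
    return mean(scores)


def gat_effect(profile):
    """GAT minus text mean score within a single engagement profile."""
    return cell_mean("gat", profile) - cell_mean("text", profile)


# Condition effect per profile, then the gap between those effects.
effects = {profile: gat_effect(profile) for profile in ("high", "low")}
moderation = effects["high"] - effects["low"]

print(effects)
print(moderation)
```

A nonzero gap between the per-profile effects is what "moderated by engagement profile" would look like in cell means. Whether such a gap is statistically reliable is a separate question that this sketch does not address.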
Where Pith is reading between the lines
- Designers of future AI learning tools could build real-time detection of engagement to decide when to switch between animation and text.
- The same coordination of code, state, and analogy might be tested in other process-heavy domains such as chemistry reaction mechanisms.
- Short-term benefits suggest GATs work best when embedded repeatedly inside practice sessions rather than offered as standalone resources.
Load-bearing premise
Differences in immediate learning and course outcomes can be attributed to GATs versus text explanations without major interference from institutional differences, varying student populations, or unmeasured variables.
What would settle it
A follow-up experiment at the same institutions that finds identical immediate post-exposure scores and identical end-of-course exam results between GAT and text groups after controlling for engagement profiles.
Original abstract
Introductory programming (CS1) courses often struggle to support students' understanding of program execution. While visualizations can make execution processes explicit, their effectiveness depends on design and context, and empirical evidence for AI-generated visualizations remains limited. We propose Generated Animated Traces (GATs), AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. We conduct a study at two institutions in CS1 courses (Python, N=961; Java, N=151) comparing GATs to textual explanations. We measure immediate learning performance and experience, as well as end-of-course engagement and exam performance. Results show that GATs can yield selective benefits for immediate learning, but benefits are context-dependent and short-term. We observe that GATs' influence on performance is moderated by learner engagement profiles. This finding underscores the importance of personalized approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from a multi-institutional study in CS1 courses comparing Generated Animated Traces (GATs)—AI-generated, analogy-based narrated animations coordinating code, execution state, and analogies—to textual explanations. Samples are N=961 (Python) and N=151 (Java). The central claims are that GATs produce selective benefits for immediate learning that are context-dependent and short-term, and that GAT effects on performance are moderated by learner engagement profiles.
Significance. If the attribution to GATs survives proper controls for institutional and language differences, the work would add to CS education research by showing how AI-generated visualizations can support program execution understanding and by underscoring the value of engagement-profile moderation for personalization. The large Python sample is a positive feature.
major comments (2)
- [Abstract and study design description] The abstract states results on selective benefits and moderation by engagement profiles but supplies no statistical methods, effect sizes, controls, exclusion criteria, or measurement details, rendering it impossible to evaluate whether the data support the stated claims.
- [Multi-institutional design] The study is conducted at two institutions using different languages and presumably different student populations and course structures, yet the design description provides no indication that institution or language is modeled as a fixed effect, random effect, or interaction term in the performance analyses. This is load-bearing for the central claim that attributes differences in immediate learning and moderation to GATs rather than baseline institutional variation.
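As an illustration of the kind of effect-size reporting the first major comment asks for, here is a minimal Python sketch that computes Cohen's d with a pooled standard deviation separately for each cohort. Cohen's d is an assumption chosen for illustration; the abstract does not say which effect-size measure, if any, the paper uses. The cohort scores and the names `cohens_d` and `cohorts` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical immediate-test scores per cohort and condition.
# These numbers are invented; they are not results from the study.
cohorts = {
    "python": {"gat": [7.0, 8.0, 9.0], "text": [6.0, 7.0, 8.0]},
    "java": {"gat": [5.0, 6.0, 7.0], "text": [5.0, 6.0, 7.0]},
}

# One effect size per cohort, so cohorts are never mixed in one estimate.
d_by_cohort = {
    name: cohens_d(groups["gat"], groups["text"])
    for name, groups in cohorts.items()
}

for name, d in d_by_cohort.items():
    print(f"{name}: d = {d:.2f}")
```

Reporting one value per cohort keeps the Python and Java samples apart, which matters given the large difference in sample size (N=961 versus N=151).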
minor comments (2)
- [Methods] Clarify the exact operationalization of 'immediate learning performance,' 'end-of-course engagement,' and 'exam performance' and whether any pre-tests or covariates were used.
- [Abstract] The abstract could more explicitly note the short-term nature of the observed benefits to avoid overgeneralization in the opening summary.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments. We address each major comment below and will make revisions to improve the manuscript's clarity and rigor.
- Referee: Abstract and study design description: the abstract states results on selective benefits and moderation by engagement profiles but supplies no statistical methods, effect sizes, controls, exclusion criteria, or measurement details, rendering it impossible to evaluate whether the data support the stated claims.
  Authors: We agree with the referee that the abstract would benefit from additional details on the statistical methods to support the claims. In the revised version, we will include a concise description of the key statistical approaches, effect sizes where relevant, and a note on controls and exclusion criteria, within the abstract's length constraints. Revision: yes
- Referee: Multi-institutional design: the study is conducted at two institutions using different languages and presumably different student populations and course structures, yet the design description provides no indication that institution or language is modeled as a fixed effect, random effect, or interaction term in the performance analyses. This is load-bearing for the central claim that attributes differences in immediate learning and moderation to GATs rather than baseline institutional variation.
  Authors: The analyses were performed separately for the Python and Java cohorts to account for differences in language and institutional contexts. To strengthen the manuscript, we will revise the methods and results sections to explicitly describe how institutional and language differences were handled, including any use of fixed effects or covariates for institution, and discuss potential limitations in attributing effects solely to GATs. Revision: yes
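The concern behind this exchange can be shown with a small Python sketch: when conditions are unevenly distributed across institutions with different baselines, a pooled comparison can show a difference that disappears within each institution. The data, the institution labels `A` and `B`, and the helper `diff` are hypothetical, and the imbalance is deliberately exaggerated. This is not a reconstruction of the study's allocation or analysis.

```python
from statistics import mean

# Hypothetical rows: (institution, condition, score). Institution A has a
# higher baseline and, by construction, most of the GAT participants.
rows = [
    ("A", "gat", 8.0), ("A", "gat", 8.0), ("A", "gat", 8.0), ("A", "text", 8.0),
    ("B", "gat", 4.0), ("B", "text", 4.0), ("B", "text", 4.0), ("B", "text", 4.0),
]


def diff(subset):
    """GAT minus text mean score for the given rows."""
    gat = [score for _, cond, score in subset if cond == "gat"]
    text = [score for _, cond, score in subset if cond == "text"]
    return mean(gat) - mean(text)


# Naive comparison that ignores institution.
pooled_diff = diff(rows)

# Comparison within each institution, then a size-weighted average.
within = {inst: diff([r for r in rows if r[0] == inst]) for inst in ("A", "B")}
sizes = {inst: sum(1 for r in rows if r[0] == inst) for inst in within}
adjusted_diff = sum(within[i] * sizes[i] for i in within) / len(rows)

print(pooled_diff, within, adjusted_diff)
```

Analyzing cohorts separately, as the authors describe, corresponds to reporting the `within` values. The size-weighted average is one simple way to combine them and stands in for the fixed-effect adjustment the referee mentions.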
Circularity Check
No circularity: purely empirical study with no derivations or self-referential predictions
This paper reports results from a multi-institutional empirical experiment comparing Generated Animated Traces (GATs) to textual explanations in CS1 courses. It measures immediate learning, engagement, and exam performance using standard statistical comparisons across Python (N=961) and Java (N=151) cohorts. There are no equations, derivations, fitted parameters presented as predictions, uniqueness theorems, or ansatzes. All claims rest on observed outcomes rather than on any reduction to inputs by construction. The multi-institutional design and the moderation analyses by engagement profile are standard empirical methods; they do not rely on load-bearing self-citations or on renaming known results as new derivations. This matches the default expectation of no significant circularity for non-theoretical empirical work.