
arxiv: 2606.03288 · v1 · pith:M724NKMVnew · submitted 2026-06-02 · 💻 cs.CY · cs.AI

AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study

Pith reviewed 2026-06-28 08:19 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords AI-generated visualizations · novice programmers · CS1 · animated traces · program execution · learner engagement · educational technology · multi-institutional study

The pith

AI-generated animated traces improve immediate learning of program execution for some novices but effects are short-term and depend on engagement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Generated Animated Traces as AI-created narrated animations that link code, execution states, and analogies to help CS1 students grasp how programs run. A multi-institutional experiment with over a thousand students in Python and Java courses compared these animations against plain text explanations, tracking both immediate test scores and longer-term course outcomes. Results indicate selective gains right after exposure for certain learners, yet no lasting advantage appears on final exams or overall engagement. The size of any benefit varies with individual student engagement patterns, pointing toward the need for tools that adapt to different learner profiles rather than one-size-fits-all delivery.

Core claim

Generated Animated Traces (GATs) are AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. In the two-institution study, GATs produced selective benefits for immediate learning performance compared with textual explanations, yet these benefits remained context-dependent and short-term; GATs' influence on performance was moderated by learner engagement profiles, underscoring the value of personalized approaches.

What carries the argument

Generated Animated Traces (GATs): AI-generated narrated animations that coordinate source code, runtime state, and conceptual analogies to make program execution explicit.

If this is right

  • GATs produce immediate performance gains on execution-related tasks for some students but not others.
  • Any immediate gains do not carry over to end-of-course exam scores or sustained engagement measures.
  • Learner engagement profiles moderate whether GATs affect performance at all.
  • Educational tools for programming benefit from adaptation to individual engagement patterns.
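
To make the moderation point concrete, here is a minimal sketch of what "engagement profiles moderate the effect" means operationally: the GAT-minus-text difference in mean score is computed separately within each profile and then compared across profiles. This is not the paper's analysis. The profile labels (`high`, `low`), the record layout, and every score below are hypothetical and invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (engagement_profile, condition, immediate_score).
# All values are made up for illustration; none come from the paper.
RECORDS = [
    ("high", "gat", 8.0), ("high", "gat", 9.0),
    ("high", "text", 8.0), ("high", "text", 9.0),
    ("low", "gat", 7.0), ("low", "gat", 6.0),
    ("low", "text", 4.0), ("low", "text", 5.0),
]


def effect_by_profile(records):
    """Return {profile: mean(GAT scores) - mean(text scores)}."""
    cells = defaultdict(lambda: {"gat": [], "text": []})
    for profile, condition, score in records:
        cells[profile][condition].append(score)
    return {
        profile: mean(groups["gat"]) - mean(groups["text"])
        for profile, groups in cells.items()
    }


effects = effect_by_profile(RECORDS)
for profile, diff in sorted(effects.items()):
    print(f"{profile}: GAT - text = {diff:+.2f}")
# high: GAT - text = +0.00
# low: GAT - text = +2.00
```

In this toy data the treatment difference is zero for one profile and positive for the other, which is the pattern a moderation claim describes. A real analysis would also need uncertainty estimates and a correction for multiple comparisons before reading anything into per-profile differences.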

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future AI learning tools could build real-time detection of engagement to decide when to switch between animation and text.
  • The same coordination of code, state, and analogy might be tested in other process-heavy domains such as chemistry reaction mechanisms.
  • Short-term benefits suggest GATs work best when embedded repeatedly inside practice sessions rather than offered as standalone resources.

Load-bearing premise

Differences in immediate learning and course outcomes can be attributed to GATs versus text explanations without major interference from institutional differences, varying student populations, or unmeasured variables.

What would settle it

A follow-up experiment at the same institutions that finds identical immediate post-exposure scores and identical end-of-course exam results between GAT and text groups after controlling for engagement profiles.

Figures

Figures reproduced from arXiv: 2606.03288 by Anastasiia Birillo, Gosia Migut, Michael Liut, Naaz Sibia, Thomas Overklift Vaupel Klein, Yuri Noviello.

Figure 1: Study pipeline. Unique materials for the …
Figure 3: Example of a textual explanation provided to the …
Figure 4: Engagement profiles and moderated treatment ef…
Original abstract

Introductory programming (CS1) courses often struggle to support students' understanding of program execution. While visualizations can make execution processes explicit, their effectiveness depends on design and context, and empirical evidence for AI-generated visualizations remains limited. We propose Generated Animated Traces (GATs), AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. We conduct a study at two institutions in CS1 courses (Python, N=961; Java N=151) comparing GATs to textual explanations. We measure immediate learning performance and experience, end-of-course engagement and exam performance. Results show that GATs can yield selective benefits for immediate learning, but benefits are context-dependent and short-term. We observe that GATs' influence on performance is moderated by learner engagement profiles. This finding underscores the importance of personalized approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from a multi-institutional study in CS1 courses comparing Generated Animated Traces (GATs)—AI-generated, analogy-based narrated animations coordinating code, execution state, and analogies—to textual explanations. Samples are N=961 (Python) and N=151 (Java). The central claims are that GATs produce selective benefits for immediate learning that are context-dependent and short-term, and that GAT effects on performance are moderated by learner engagement profiles.

Significance. If the attribution to GATs survives proper controls for institutional and language differences, the work would add to CS education research by showing how AI-generated visualizations can support program execution understanding and by underscoring the value of engagement-profile moderation for personalization. The large Python sample is a positive feature.

major comments (2)
  1. [Abstract and study design description] The abstract states results on selective benefits and moderation by engagement profiles but supplies no statistical methods, effect sizes, controls, exclusion criteria, or measurement details, rendering it impossible to evaluate whether the data support the stated claims.
  2. [Multi-institutional design] The study is conducted at two institutions using different languages and presumably different student populations and course structures, yet the design description provides no indication that institution or language is modeled as a fixed effect, random effect, or interaction term in the performance analyses. This is load-bearing for the central claim that attributes differences in immediate learning and moderation to GATs rather than baseline institutional variation.
minor comments (2)
  1. [Methods] Clarify the exact operationalization of 'immediate learning performance,' 'end-of-course engagement,' and 'exam performance' and whether any pre-tests or covariates were used.
  2. [Abstract] The abstract could more explicitly note the short-term nature of the observed benefits to avoid overgeneralization in the opening summary.
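
As an illustration of the second major comment, the sketch below shows one way a cohort term and a treatment-by-cohort interaction could enter a pooled model, `score = b0 + b1*GAT + b2*Java + b3*(GAT*Java)`. With two binary factors and all four cells populated, this model is saturated, so its least-squares coefficients equal simple contrasts of the cell means and can be computed without a regression library. The cell scores are hypothetical, the coding (text and Python as reference levels) is an assumption, and nothing here reflects how the authors actually analyzed their data.

```python
from statistics import mean

# Hypothetical scores per (cohort, condition) cell; invented for illustration.
CELLS = {
    ("python", "text"): [5.0, 6.0, 7.0],
    ("python", "gat"): [6.0, 7.0, 8.0],
    ("java", "text"): [4.0, 5.0, 6.0],
    ("java", "gat"): [7.0, 8.0, 9.0],
}


def saturated_2x2(cells):
    """Coefficients of score ~ GAT + Java + GAT:Java from cell means.

    Reference levels (assumed): condition = text, cohort = python.
    """
    m = {key: mean(values) for key, values in cells.items()}
    intercept = m[("python", "text")]
    treatment = m[("python", "gat")] - intercept   # GAT effect in the Python cohort
    cohort = m[("java", "text")] - intercept       # baseline cohort gap
    java_effect = m[("java", "gat")] - m[("java", "text")]
    interaction = java_effect - treatment          # difference of differences
    return {
        "intercept": intercept,
        "treatment": treatment,
        "cohort": cohort,
        "interaction": interaction,
    }


coefs = saturated_2x2(CELLS)
for name, value in coefs.items():
    print(f"{name}: {value:+.2f}")
```

A nonzero `interaction` term would indicate that the GAT effect differs between cohorts. Because language and institution vary together in a design like this one, such a term cannot separate the two, and the sketch omits standard errors entirely.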

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below and will make revisions to improve the manuscript's clarity and rigor.

  1. Referee: Abstract and study design description: the abstract states results on selective benefits and moderation by engagement profiles but supplies no statistical methods, effect sizes, controls, exclusion criteria, or measurement details, rendering it impossible to evaluate whether the data support the stated claims.

    Authors: We agree with the referee that the abstract would benefit from additional details on the statistical methods to support the claims. In the revised version, we will include a concise description of the key statistical approaches, effect sizes where relevant, and a note on controls and exclusion criteria, within the abstract's constraints. revision: yes

  2. Referee: Multi-institutional design: the study is conducted at two institutions using different languages and presumably different student populations and course structures, yet the design description provides no indication that institution or language is modeled as a fixed effect, random effect, or interaction term in the performance analyses. This is load-bearing for the central claim that attributes differences in immediate learning and moderation to GATs rather than baseline institutional variation.

    Authors: The analyses were performed separately for the Python and Java cohorts to account for differences in language and institutional contexts. To strengthen the manuscript, we will revise the methods and results sections to explicitly describe how institutional and language differences were handled, including any use of fixed effects or covariates for institution, and discuss potential limitations in attributing effects solely to GATs. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential predictions

This paper reports results from a multi-institutional empirical experiment comparing Generated Animated Traces (GATs) to textual explanations in CS1 courses. It measures immediate learning, engagement, and exam performance using standard statistical comparisons across Python (N=961) and Java (N=151) cohorts. There are no equations, derivations, fitted parameters presented as predictions, uniqueness theorems, or ansatzes. All claims rest on observed data outcomes rather than any reduction to inputs by construction. The multi-institutional design and moderation analyses by engagement profiles are standard empirical methods and do not involve self-citation load-bearing or renaming of known results as new derivations. This matches the default expectation of no significant circularity for non-theoretical empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical content, free parameters, or new postulated entities are described. The work relies on standard assumptions of educational research experiments that are not detailed here.

pith-pipeline@v0.9.1-grok · 5704 in / 1002 out tokens · 38389 ms · 2026-06-28T08:19:02.576681+00:00 · methodology

