pith. machine review for the scientific record.

arxiv: 2604.10720 · v2 · submitted 2026-04-12 · 💻 cs.AI · cs.CL · cs.CY

Recognition: no theorem link

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation


Pith reviewed 2026-05-14 21:18 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CY

keywords artificial students · student simulation · programming education · language model fine-tuning · conversational data · debugging behavior · educational AI · process data

The pith

Serializing student logs into conversations trains open models to simulate realistic programming learner behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for creating artificial students by training open-weight language models on real student programming process data. It converts submission logs into a conversational format, with code submissions and environment feedback as alternating turns, then applies supervised fine-tuning and preference optimization to align the model with actual debugging sequences. The framework is evaluated by training Qwen models at the 4B and 8B scales on a large dataset of real student submissions to Python programming assignments. The results indicate that including environment feedback improves the models' ability to match student behavior, measured as functional alignment and code similarity, over both code-only approaches and prompted proprietary models. Such simulations could help test educational tools at scale while keeping data private and avoiding proprietary dependencies.

Core claim

By representing each student's problem-solving process as a dialogue between the learner and the automated assessment system, where submissions and feedback like test outcomes form alternating turns, models can be trained to replicate authentic student debugging behavior more effectively than prior methods.

What carries the argument

Conversational serialization of temporal student log traces into alternating turns of code submissions and environment feedback.
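
To make the serialization concrete: a minimal sketch, assuming a per-student log of timestamped (code, feedback) submission events. The record fields, chat roles, and message format are illustrative assumptions; the paper's exact template is not reproduced here.

```python
# Hypothetical sketch of conversational serialization. Field names, roles,
# and the chat-message format are illustrative assumptions, not the
# paper's exact template.
from dataclasses import dataclass

@dataclass
class SubmissionEvent:
    timestamp: float   # when the student submitted
    code: str          # the submitted program
    feedback: str      # assessment output: test outcomes, errors, grade

def serialize_trace(problem_statement: str,
                    events: list[SubmissionEvent]) -> list[dict]:
    """Render a temporal submission log as alternating chat turns.

    The model plays the student (assistant role); the automated
    assessment system plays the environment (user role).
    """
    messages = [{"role": "user", "content": problem_statement}]
    for event in sorted(events, key=lambda e: e.timestamp):
        # Student turn: the code submitted at this step.
        messages.append({"role": "assistant", "content": event.code})
        # Environment turn: the feedback the student saw next.
        messages.append({"role": "user", "content": event.feedback})
    return messages
```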

If this is right

  • Trained models can replicate iterative debugging processes more accurately when environment feedback is included.
  • Open-weight models become viable alternatives to prompted proprietary LLMs for student simulation.
  • Scalable evaluation of tutoring strategies becomes possible using realistic artificial students.
  • The training pipeline of supervised fine-tuning combined with preference optimization aligns models to real learner patterns (a minimal loss sketch follows this list).
  • Privacy and cost concerns are reduced by avoiding large proprietary models and enabling local training.
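
The excerpts do not spell out the preference objective, but the reference list includes direct preference optimization [37]; if the preference stage is DPO-style, its core loss looks roughly like this minimal PyTorch sketch (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss (cf. Rafailov et al., reference [37]).

    Each tensor holds the summed token log-probabilities of a full
    continuation, e.g. a "chosen" next submission matching the real
    student versus a "rejected" one that diverges. beta regularizes
    the policy toward the SFT reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the gap between preferred and dispreferred continuations.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```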

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar serialization techniques might apply to simulating learners in other domains like math or writing if process logs are available.
  • These models could be used to generate synthetic datasets for training better tutoring systems.
  • Testing on different programming languages or assignment types could reveal how general the approach is.
  • If the simulations are accurate, they might help identify common student misconceptions automatically.

Load-bearing premise

Converting student logs into conversational turns captures enough of the original context and intent so the model learns real behavior instead of just surface patterns.
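
A cheap partial test of this premise is a round-trip check that serialization preserves submission order and content; it cannot test intent. The sketch below assumes the hypothetical serialize_trace format from the earlier example.

```python
# Partial check on the premise: round-trip the dialogue built by the
# serialize_trace sketch above back into (code, feedback) pairs. This
# verifies order and content preservation only; it says nothing about
# whether intent survives the aggregation into turns.
def deserialize_trace(messages: list[dict]) -> list[tuple[str, str]]:
    """Recover (code, feedback) pairs; messages[0] is the problem statement."""
    turns = messages[1:]
    return [(turns[i]["content"], turns[i + 1]["content"])
            for i in range(0, len(turns) - 1, 2)]

# Example (hypothetical events list):
# assert deserialize_trace(serialize_trace(stmt, events)) == [
#     (e.code, e.feedback)
#     for e in sorted(events, key=lambda ev: ev.timestamp)
# ]
```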

What would settle it

Training the models both with and without environment feedback and finding no significant improvement in functional alignment or code similarity on a test set of real student traces would falsify the benefit of including feedback.
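
For concreteness, a minimal sketch of such a comparison with stand-in metrics: functional alignment approximated as per-test pass/fail agreement, and code similarity as a plain edit ratio (the paper's reference list points to CodeBLEU [38] as a more serious measure; its exact metric definitions are not reproduced here).

```python
# Illustrative stand-ins for the paper's metrics, which are not defined
# in full here: "functional alignment" approximated as per-test pass/fail
# agreement, "code similarity" as a plain edit ratio (CodeBLEU [38] is a
# more serious alternative).
from difflib import SequenceMatcher

def functional_alignment(sim_results: list[bool],
                         real_results: list[bool]) -> float:
    """Fraction of test cases where simulated and real submissions agree."""
    assert len(sim_results) == len(real_results)
    return sum(s == r for s, r in zip(sim_results, real_results)) / len(real_results)

def code_similarity(sim_code: str, real_code: str) -> float:
    """Crude surface similarity between generated and real code."""
    return SequenceMatcher(None, sim_code, real_code).ratio()

def mean_delta(with_feedback: list[float],
               without_feedback: list[float]) -> float:
    """Per-trace metric gap between the two training conditions. A delta
    distribution centered at zero would falsify the feedback claim."""
    return sum(w - wo for w, wo in zip(with_feedback, without_feedback)) / len(with_feedback)
```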

Figures

Figures reproduced from arXiv: 2604.10720 by Arto Hellas, Charles Koutcheme, Juho Leinonen.

Figure 1. Performance across rollout steps. Coverage (top), … (caption truncated; figure available at source)
Figure 2. Grade progression across normalized trajectory positions. (figure available at source)
Original abstract

Artificial students -- models that simulate how learners act and respond within educational systems -- are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, most existing approaches rely on prompting large, proprietary language models, limiting adaptability to specific courses and raising concerns around privacy, cost, and dependence. In this work, we propose a framework for training open-weight artificial programming learners directly from authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student's problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens models' ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language models baselines in functional alignment and code similarity. We release our code to support reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that serializing real student programming log traces into alternating conversational turns between learner and automated assessment system (incorporating code submissions and environment feedback such as test outcomes and error traces), followed by supervised fine-tuning plus preference optimization on open-weight Qwen 4B/8B models, produces artificial students that more faithfully replicate authentic debugging behavior. This yields measurable gains in functional alignment and code similarity over code-only baselines and prompted-LLM baselines on held-out student data.

Significance. If the central claim holds, the work supplies a reproducible, open-weight route to scalable artificial-student simulators that avoids proprietary-model dependence, improves privacy and cost, and directly leverages authentic process data. The public code release is a concrete strength that enables community verification and extension for tutoring-strategy evaluation.

major comments (2)
  1. [§5] §5 (Evaluation): the reported gains in functional alignment and code similarity are attributed to conversational serialization, yet no ablation isolates the alternating-turn format from the mere presence of feedback tokens. Without this comparison the attribution to the serialization step remains untested and the weakest assumption (preservation of debugging intent) is not directly addressed.
  2. [§4] §4 (Method): the serialization procedure collapses multi-turn edits, pauses, and exploratory dead-ends into single turns, but the manuscript provides neither a quantitative measure of information loss nor an analysis showing that the resulting dialogues retain decision order and implicit intent rather than surface co-occurrences.
minor comments (3)
  1. [§5.1] Exact prompt templates, model versions, and decoding parameters for the prompted-LLM baselines are not specified, hindering direct replication.
  2. [§5.2] Statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the functional-alignment and code-similarity deltas are absent (a paired-bootstrap sketch follows this list).
  3. [Data section] Precise descriptions of the train/validation/test splits, number of unique students, and assignment distribution are missing from the data section.
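
On minor comment 2, one standard remedy is a paired bootstrap over per-trace metric deltas; the sketch below is illustrative (the resample count and confidence level are the reviewer's defaults, not the paper's protocol).

```python
# Paired bootstrap over per-trace deltas; resample count and the 95%
# level are illustrative choices, not the paper's protocol.
import numpy as np

def paired_bootstrap_ci(deltas: np.ndarray,
                        n_resamples: int = 10_000,
                        alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """Confidence interval for the mean per-trace delta (model A minus
    model B). An interval excluding zero suggests the reported gain is
    not resampling noise."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
    means = deltas[idx].mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))
```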

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We address each major comment below and have updated the paper accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§5] §5 (Evaluation): the reported gains in functional alignment and code similarity are attributed to conversational serialization, yet no ablation isolates the alternating-turn format from the mere presence of feedback tokens. Without this comparison the attribution to the serialization step remains untested and the weakest assumption (preservation of debugging intent) is not directly addressed.

    Authors: We agree that an explicit ablation separating the alternating-turn structure from the inclusion of feedback tokens would strengthen the attribution. Our current code-only baseline removes feedback entirely, while the conversational models include both elements. In the revised version, we will add a baseline where feedback is appended without the conversational turn format, isolating the effect of serialization (a sketch of such a flattened format follows these responses). This addresses the concern about testing whether the format preserves debugging intent. revision: yes

  2. Referee: [§4] §4 (Method): the serialization procedure collapses multi-turn edits, pauses, and exploratory dead-ends into single turns, but the manuscript provides neither a quantitative measure of information loss nor an analysis showing that the resulting dialogues retain decision order and implicit intent rather than surface co-occurrences.

    Authors: We acknowledge that the serialization process involves some aggregation of student actions into turns, which could potentially lose fine-grained temporal information. However, our evaluation demonstrates that models trained on these serialized dialogues better replicate real student debugging sequences compared to baselines, indicating that critical decision points are retained. We will add a new subsection in the revised manuscript providing a qualitative analysis of sample serialized dialogues alongside original logs to illustrate preservation of intent, and discuss the trade-offs of this approach. revision: partial
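
For illustration, the flattened baseline promised in response 1 could look like the following sketch, which keeps feedback but drops the dialogue structure. The delimiter strings are invented, the paper does not specify this format, and the sketch reuses the hypothetical SubmissionEvent records from the serialization example above.

```python
# Hypothetical flat variant: feedback is kept, the alternating-turn
# structure is not. Delimiter strings are invented for illustration;
# the paper does not specify this baseline's format.
def serialize_flat(problem_statement: str,
                   events: list) -> str:
    """Flatten a submission log (SubmissionEvent records from the earlier
    sketch) into a single prompt string with no dialogue roles."""
    parts = [problem_statement]
    for event in events:
        parts.append("<code>\n" + event.code)
        parts.append("<feedback>\n" + event.feedback)
    return "\n".join(parts)
```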

Circularity Check

0 steps flagged

No circularity; derivation uses held-out data and external baselines

full rationale

The paper serializes real student logs into conversational turns, applies supervised fine-tuning plus preference optimization, and evaluates functional alignment and code similarity on held-out student submissions against independent baselines (code-only models and prompted LLMs). No equations, parameters, or claims reduce by construction to the inputs; the reported gains are measured via external comparison rather than self-referential fitting or self-citation chains. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework relies on standard supervised fine-tuning and preference-optimization assumptions applied to a new data representation; no new free parameters or invented entities are introduced beyond ordinary training hyperparameters.

axioms (1)
  • domain assumption: Student debugging behavior can be adequately captured by alternating code-submission and environment-feedback turns in a conversational format.
    This assumption is invoked when the authors serialize temporal logs into dialogue turns and claim the resulting data trains models to replicate authentic learner behavior.

pith-pipeline@v0.9.0 · 5524 in / 1292 out tokens · 41869 ms · 2026-05-14T21:18:06.410987+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 4 internal anchors

  1. [1]

    INTRODUCTION: To better support learners at scale, computing education research has long relied on models of students built from the rich log data collected as they solve programming assignments [16, 41]. Much of this work focuses on capturing what students know, such as estimating mastery over concepts through knowledge tracing [6, 17, 40], or identif...

  2. [2]

    RELATED WORK: A complementary line of work focuses on knowledge tracing (KT), whose primary goal is to estimate a student's mastery of knowledge components across exercises. While early approaches focused exclusively on predicting success on future assignments, more recent methods leverage language-model-based architectures to predict students' first sub...

  3. [3]

    FROM LOGS TO DIALOGS: In this section, we introduce our approach for transforming student log data into suitable data for student simulation. Our work assumes that students interact with an automated assessment system returning summative feedback [5]. 3.1 Assumptions: We assume access to a deterministic grading function which, for a submitted program...

  4. [4]

    TRAINING ARTIFICIAL LEARNERS: In this section, we present our training pipeline for training a language model πθ to simulate how programming students solve assignments. Our core pipeline combines supervised fine-tuning with offline preference optimization on a serialized dataset D. We additionally explore online preference optimization as an alternative t...

  5. [5]

    EXPERIMENTS: In this section, we detail the components of our experiments aimed at evaluating the utility of our framework. 5.1 Dataset: We evaluate our framework using FalconCode [8], a large-scale CS1 Python programming dataset from the United States Air Force Academy. The dataset includes student submissions, the associated grades, and the course auto...

  6. [6]

    RESULTS: Table 1 summarizes our training and test split statistics. Table 2 reports coverage and generation quality metrics averaged across all rollout steps. Figure 1 details performance at each rollout step, and Figure 2 illustrates how model-generated grades evolve compared to students' ground-truth grades. We highlight several key findings below. Pro...

  7. [7]

    CONCLUDING DISCUSSION: Our experiments show that conversational serialization of student–environment interactions, combined with preference optimization, produces artificial students that more closely track real learners' debugging behavior than prompted baselines and models trained without feedback. Performance improvements compared to baselines are consistent in direction across both model sizes and all rollout steps...

  8. [8]

    S. N. Akter, S. Prabhumoye, J. Kamalu, S. Satheesh, E. Nyberg, M. Patwary, M. Shoeybi, and B. Catanzaro. MIND: Math informed synthetic dialogues for pretraining LLMs. In The Thirteenth International Conference on Learning Representations, 2025.

  9. [9]

    N. Ashok Kumar and A. Lan. Improving socratic question generation using data augmentation and preference optimization. In E. Kochmar, M. Bexte, J. Burstein, A. Horbach, R. Laarmann-Quante, A. Tack, V. Yaneva, and Z. Yuan, editors, Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 108–118, Mexi...

  10. [10]

    D. Azcona and A. F. Smeaton. Targeting at-risk students using engagement and effort predictors in an introductory computer programming course. In European Conference on Technology Enhanced Learning, pages 361–366. Springer, 2017.

  11. [11]

    A. H. Brown. Simulated classrooms and artificial students: The potential effects of new technologies on teacher education. Journal of Research on Computing in Education, 32(2):307–318, 1999.

  12. [12]

    D. L. Butler and P. H. Winne. Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65:245–281, 1995.

  13. [13]

    A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.

  14. [14]

    D. Han, M. Han, and the Unsloth team. Unsloth, 2023.

  15. [15]

    A. de Freitas, J. Coffman, M. de Freitas, J. Wilson, and T. Weingart. FalconCode: A multiyear dataset of Python code samples from an introductory computer science course. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pages 938–944, New York, NY, USA, 2023. Association for Computing Machinery.

  16. [16]

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.

  17. [17]

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, page 441, Red Hook, NY, USA, 2023. Curran Associates Inc.

  18. [18]

    N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023.

  19. [19]

    D. Dinucu-Jianu, J. Macina, N. Daheim, I. Hakimi, I. Gurevych, and M. Sachan. From problem-solving to teaching problem-solving: Aligning LLMs with pedagogy using reinforcement learning. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 272–2...

  20. [20]

    Z. Duan, N. Fernandez, A. Hicks, and A. Lan. Test case-informed knowledge tracing for open-ended coding tasks. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, LAK '25, pages 238–248, New York, NY, USA, 2025. Association for Computing Machinery.

  21. [21]

    Z. Duan, N. Fernandez, and A. Lan. Kaser: Knowledge-aligned student error simulator for open-ended coding tasks, 2026.

  22. [22]

    Z. Duan, N. Fernandez, A. B. L. Narayanan, M. Hassany, R. S. de Alencar, P. Brusilovsky, B. Akram, and A. Lan. Automated knowledge component generation for interpretable knowledge tracing in coding problems, 2025.

  23. [23]

    M. C. Jadud. Methods and tools for exploring novice compilation behaviour. In Proceedings of the Second International Workshop on Computing Education Research, pages 73–84, 2006.

  24. [24]

    J. Kasurinen and U. Nikula. Estimating programming knowledge with Bayesian knowledge tracing. ACM SIGCSE Bulletin, 41(3):313–317, 2009.

  25. [25]

    C. Koutcheme, N. Dainese, and A. Hellas. Direct repair optimization: Training small language models for educational program repair improves feedback. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 564–581, Vienna, Austria, July 2025. Association for Computational Linguistics.

  26. [26]

    H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 21314–21328. Curran Associates, Inc., 2022.

  27. [27]

    J. Leinonen, P. Denny, O. Kiljunen, S. MacNeil, S. Sarsa, and A. Hellas. LLM-itation is the sincerest form of data: Generating synthetic buggy code submissions for computing education. In Proceedings of the 27th Australasian Computing Education Conference, ACE '25, pages 56–63, New York, NY, USA, 2025. Association for Computing Machinery.

  28. [28]

    N. Liu, Z. Wang, R. Baraniuk, and A. Lan. Open-ended knowledge tracing for computer science education. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3849–3862, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.

  29. [29]

    S. MacNeil, M. Rogalska, J. Leinonen, P. Denny, A. Hellas, and X. Crosland. Synthetic students: A comparative study of bug distribution between large language models and computing students. In Proceedings of the 2024 ACM Virtual Global Computing Education Conference V. 1, SIGCSE Virtual 2024, pages 137–143, New York, NY, USA. Association for Computing Machinery.

  31. [31]

    N. Matsuda, W. W. Cohen, J. Sewall, G. Lacerda, and K. R. Koedinger. Evaluating a simulated student using real students data for training and testing. In International Conference on User Modeling, pages 107–116. Springer, 2007.

  32. [32]

    M. Miroyan, R. Niousha, J. E. Gonzalez, G. Ranade, and N. Norouzi. ParaStudent: Generating and evaluating realistic student code by teaching LLMs to struggle, 2025.

  33. [33]

    S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.

  34. [34]

    M. H. Nguyen, V.-A. Pădurean, A. Gotovos, S. Tschiatschek, and A. Singla. Synthesizing high-quality programming tasks with LLM-based expert and student agents. In A. I. Cristea, E. Walker, Y. Lu, O. C. Santos, and S. Isotani, editors, Artificial Intelligence in Education, pages 77–91, Cham, 2025. Springer Nature Switzerland.

  35. [35]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  36. [36]

    T. Phung, V.-A. Pădurean, A. Singh, C. Brooks, J. Cambronero, S. Gulwani, A. Singla, and G. Soares. Automating human tutor-style programming feedback: Leveraging GPT-4 tutor model for hint generation and GPT-3.5 student model for hint validation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, LAK '24, pages 12–23, New York, NY, U...

  37. [37]

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.

  38. [38]

    S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma. CodeBLEU: A method for automatic evaluation of code synthesis, 2020.

  39. [39]

    A. Ross, M. Srivastava, J. Blanchard, and J. Andreas. Modeling student learning with 3.8 million program traces. arXiv preprint arXiv:2510.05056, 2025.

  40. [40]

    A. Scarlatos, R. S. Baker, and A. Lan. Exploring knowledge tracing in tutor-student dialogues using LLMs. In Proceedings of the 15th Learning Analytics and Knowledge Conference, LAK 2025, Dublin, Ireland, March 3–7, 2025. ACM, 2025.

  41. [41]

    A. Scarlatos, N. Liu, J. Lee, R. Baraniuk, and A. Lan. Training LLM-based tutors to improve student learning outcomes in dialogues. In A. I. Cristea, E. Walker, Y. Lu, O. C. Santos, and S. Isotani, editors, Artificial Intelligence in Education, pages 251–266, Cham, 2025. Springer Nature Switzerland.

  42. [42]

    A. Scarlatos, D. Smith, S. Woodhead, and A. Lan. Improving the validity of automatically generated feedback via reinforcement learning. In International Conference on Artificial Intelligence in Education, pages 280–294. Springer, 2024.

  43. [43]

    J. Schulman and Thinking Machines Lab. LoRA without regret. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/lora/

  44. [44]

    A. Singh et al. OpenAI GPT-5 System Card, 2025.

  45. [45]

    T. Sirkiä and J. Sorva. Exploring programming misconceptions: An analysis of student mistakes in visual program simulation exercises. In Proceedings of the 12th Koli Calling International Conference on Computing Education Research, pages 19–28, 2012.

  46. [46]

    Unsloth AI. LoRA hyperparameters guide, 2024. Accessed: 2025-12-23.

  47. [47]

    K. VanLehn, S. Ohlsson, and R. Nason. Applications of simulated students: An exploration. Journal of Artificial Intelligence in Education, 5:135–135, 1994.

  48. [48]

    L. Wang, A. Sy, L. Liu, and C. Piech. Deep knowledge tracing on programming exercises. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, pages 201–204, 2017.

  49. [49]

    C. Watson, F. W. Li, and J. L. Godwin. Predicting performance in an introductory programming course by logging and analyzing student programming behavior. In 2013 IEEE 13th International Conference on Advanced Learning Technologies, pages 319–323. IEEE, 2013.

  50. [50]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.

  51. [51]

    J. Woodrow, C. Piech, and S. Koyejo. Improving generative AI student feedback: Direct preference optimization with teachers in the loop. In Proceedings of the 18th International Conference on Educational Data Mining, pages 442–449. International Educational Data Mining Society, July 2025.

  52. [52]

    C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.

  53. [53]

    A. Yang et al. Qwen3 technical report, 2025.

  54. [54]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net, 2023.

  55. [55]

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025.

  56. [56]

    T. Zheng, G. Zhang, T. Shen, X. Liu, B. Y. Lin, J. Fu, W. Chen, and X. Yue. OpenCodeInterpreter: Integrating code generation with execution and refinement. In L. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11–16, 2024, pages 12834–12859. Associa...