Modeling Student Learning with 3.8 Million Program Traces

Alexis Ross; Jacob Andreas; Jeremiah Blanchard; Megha Srivastava

arxiv: 2510.05056 · v2 · submitted 2025-10-06 · 💻 cs.LG

Pith reviewed 2026-05-18 09:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords student modeling · programming education · interaction traces · code generation · novice programmers · language models · behavior prediction · steerable generation

The pith

Training language models on real student code edit traces captures individual behaviors and enables style-preserving corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates what language models can learn when trained on millions of real programming interaction traces instead of final code or synthetic examples. These traces come from novice students using an online platform and record sequences of edits, retries, and explorations that reflect how they approach coding tasks. Models trained on the actual traces prove stronger than baselines at modeling varied student behaviors. Learned representations of students allow prediction of trace properties such as goal backtracking and comment frequency. The same models can be guided to propose sequences of edits that increase correctness while staying close to each student's original style.
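
To make the notion of a trace concrete, here is a minimal Python sketch. It assumes a trace is an ordered list of program snapshots. The `Trace` class, the `#` comment marker, and the "revisit" proxy for backtracking are illustrative assumptions, not the paper's data schema or its definition of goal backtracking.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """One student's attempt at a program: ordered snapshots of the code (assumed layout)."""
    student_id: str
    snapshots: List[str]


def count_comments(program: str, marker: str = "#") -> int:
    """Count lines whose first non-blank characters are the comment marker."""
    return sum(1 for line in program.splitlines() if line.strip().startswith(marker))


def count_revisits(trace: Trace) -> int:
    """Crude backtracking proxy: steps that return to an earlier, non-adjacent state."""
    seen = set()
    revisits = 0
    previous = None
    for snapshot in trace.snapshots:
        if snapshot in seen and snapshot != previous:
            revisits += 1
        seen.add(snapshot)
        previous = snapshot
    return revisits


# Toy trace, invented for illustration only.
trace = Trace(
    student_id="s1",
    snapshots=[
        "fd 100",
        "# draw a square\nfd 100\nrt 90",
        "fd 100",  # the student undoes the edit and returns to the first state
        "# draw a square\nfd 100\nrt 90\nfd 100",
    ],
)
final_comment_count = count_comments(trace.snapshots[-1])
revisit_count = count_revisits(trace)
```

Properties like these are cheap to compute per trace, which is what makes them usable as probing targets for a learned student representation.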

Core claim

Models trained on over 3.8 million real programming reasoning traces from Pencil Code users outperform models trained only on final programs or on synthetically generated traces at modeling diverse student behavior. Student representations extracted from these traces predict many observable properties of the traces themselves, including goal backtracking and number of comments. Steering the models allows generation of edit sequences that move toward more correct code while remaining close to the original student's stylistic choices.

What carries the argument

Language models trained directly on sequences of student code edits to produce per-student representations that encode individual reasoning patterns and stylistic preferences.

If this is right

Models trained on real traces become stronger at capturing the range of student behaviors observed in the data.
Individual student representations encode predictable properties such as frequency of goal backtracking and comment use.
Steering a trained model can surface edit sequences that improve correctness while preserving the student's style.
Training on edit traces produces models that are simultaneously more predictive of student actions and more steerable for code generation.
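
One way to read the steering claim above is as a trade-off between correctness and distance from the student's own program. The sketch below is an illustration under stated assumptions, not the paper's method: `style_distance` uses `difflib.SequenceMatcher` as a stand-in distance, and `toy_correctness` and `weight` are hypothetical placeholders.

```python
import difflib
from typing import Callable, List


def style_distance(student_program: str, candidate: str) -> float:
    """Stand-in distance in [0, 1]: 1 minus difflib's similarity ratio."""
    return 1.0 - difflib.SequenceMatcher(None, student_program, candidate).ratio()


def pick_edit(
    student_program: str,
    candidates: List[str],
    correctness: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Return the candidate maximizing correctness minus weighted style distance."""
    return max(
        candidates,
        key=lambda c: correctness(c) - weight * style_distance(student_program, c),
    )


def toy_correctness(program: str) -> float:
    """Hypothetical scorer: a program counts as correct if it turns 90 degrees."""
    return 1.0 if "rt 90" in program else 0.0


student = "# my square\nfd 100\nrt 45"
candidates = [
    "# my square\nfd 100\nrt 90",     # small fix that keeps the student's comment
    "for [1..4]\n  fd 100\n  rt 90",  # correct but heavily rewritten
    "# my square\nfd 100\nrt 45",     # unchanged and still incorrect
]
chosen = pick_edit(student, candidates, toy_correctness)
```

With these toy inputs the small fix wins, because it is scored as correct while staying closest to the student's text. A real system would need a principled correctness signal and a style metric validated against student data.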

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same per-student representation approach could be tested in other iterative creative tasks such as writing or design to see whether individual style is equally recoverable.
If representations reliably encode learning stage, they might support automated detection of when a student is ready for new concepts.
Combining trace-based models with explicit error taxonomies could produce tutoring systems that suggest fixes matched to both correctness and personal history.

Load-bearing premise

The traces gathered from Pencil Code users faithfully represent diverse novice reasoning processes without major platform-specific or self-selection biases that would prevent generalization.

What would settle it

Train the same architecture on final programs versus real traces, then test both on a fresh cohort of students from a different coding platform; if the real-trace model shows no advantage in behavior prediction or edit suggestion, the claimed benefit of interaction traces does not hold.

Original abstract

As programmers write code, they often edit and retry multiple times, creating rich "interaction traces" that reveal how they approach coding tasks and provide clues about their level of skill development. For novice programmers in particular, these traces reflect the diverse reasoning processes they employ to code, such as exploratory behavior to understand how a programming concept works, re-strategizing in response to bugs, and personalizing stylistic choices. In this work, we explore what can be learned from training language models on such reasoning traces: not just about code, but about coders, and particularly students learning to program. We introduce a dataset of over 3.8 million programming reasoning traces from users of Pencil Code, a free online educational platform used by students to learn simple programming concepts. Compared to models trained only on final programs or synthetically-generated traces, we find that models trained on real traces are stronger at modeling diverse student behavior. Through both behavioral and probing analyses, we also find that many properties of code traces, such as goal backtracking or number of comments, can be predicted from learned representations of the students who write them. Building on this result, we show that we can help students recover from mistakes by steering code generation models to identify a sequence of edits that will result in more correct code while remaining close to the original student's style. Together, our results suggest that many properties of code are properties of individual students and that training on edit traces can lead to models that are more steerable, more predictive of student behavior while programming, and better at generating programs in their final states. Code and data are available at https://github.com/meghabyte/pencilcode-public

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a dataset of 3.8 million programming interaction traces from Pencil Code users and trains language models on these traces. It claims that models trained on real traces outperform those trained only on final programs or synthetically generated traces at modeling diverse student behavior; that learned student representations can predict trace properties such as goal backtracking and number of comments; and that the models can be steered to generate sequences of edits that increase code correctness while remaining close to the original student's style.

Significance. If the empirical claims hold after addressing methodological details, the work demonstrates that interaction traces encode individual student traits beyond final code artifacts and supports development of more steerable, behavior-predictive models for programming education. The scale of the released dataset and code is a clear strength that enables reproducibility and follow-on research.

major comments (2)

[Abstract and §1] The headline comparative claim that real-trace models are 'stronger at modeling diverse student behavior' is presented without reference to the precise metrics, baseline implementations, or statistical tests (e.g., paired significance or effect sizes) that establish superiority over final-program and synthetic baselines; this detail is load-bearing for the central empirical result.
[Abstract and §3, dataset description] The generalization claim that the learned representations and steering behavior capture transferable novice reasoning processes rests on the unvalidated assumption that Pencil Code traces are representative of broader student populations and environments; no cross-platform comparison, demographic analysis, or bias audit is reported, which directly affects the scope of the 'properties of individual students' conclusion.
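
The kind of paired comparison the first major comment asks for could look like the following sketch. The per-student scores are made-up placeholder numbers, and the paired bootstrap and standardized mean difference are one reasonable choice of test and effect size, not necessarily what the paper reports.

```python
import random
import statistics
from typing import List, Tuple


def paired_bootstrap(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, approx. 95% CI low, high) for paired scores a - b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    # Resample the per-student differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(diffs), low, high


def paired_effect_size(scores_a: List[float], scores_b: List[float]) -> float:
    """Standardized mean difference of paired scores (mean diff / stdev of diffs)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Placeholder per-student accuracies, NOT results from the paper.
real_trace_model = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
final_program_model = [0.55, 0.57, 0.63, 0.60, 0.58, 0.61]
mean_diff, ci_low, ci_high = paired_bootstrap(real_trace_model, final_program_model)
effect = paired_effect_size(real_trace_model, final_program_model)
```

Pairing by student matters here: both models are scored on the same students, so the test operates on per-student differences rather than on two independent samples.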

minor comments (2)

[Figure captions and §4] Clarify whether the probing classifiers are trained on held-out students or held-out traces to avoid leakage when claiming that student representations predict trace properties.
[§5, steering experiments] Specify the exact steering objective, temperature, and distance metric used to keep generations 'close to the original student's style' so that the qualitative examples can be reproduced.
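
The leakage concern in the first minor comment can be illustrated with a student-level split. This is a sketch under assumptions: the `(student_id, trace_text)` record layout and the hash-based assignment are hypothetical and not taken from the paper.

```python
import hashlib
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (student_id, trace_text); hypothetical layout


def is_test_student(student_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically assign a student to the test split by hashing the ID."""
    digest = hashlib.sha256(student_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < test_fraction


def split_by_student(
    records: List[Record], test_fraction: float = 0.2
) -> Dict[str, List[Record]]:
    """Put every trace from a given student on the same side of the split."""
    split: Dict[str, List[Record]] = {"train": [], "test": []}
    for student_id, trace in records:
        side = "test" if is_test_student(student_id, test_fraction) else "train"
        split[side].append((student_id, trace))
    return split


# Synthetic records: 50 students with 10 traces each.
records = [(f"student{i % 50}", f"trace {i}") for i in range(500)]
split = split_by_student(records)
train_students = {sid for sid, _ in split["train"]}
test_students = {sid for sid, _ in split["test"]}
```

Splitting on traces instead would let the same student appear on both sides, so a probe could score well by recognizing the student rather than by generalizing to unseen ones.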

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The comments highlight important areas for improving clarity and scoping our claims. We have revised the manuscript accordingly and address each point below.

Referee: [Abstract and §1] The headline comparative claim that real-trace models are 'stronger at modeling diverse student behavior' is presented without reference to the precise metrics, baseline implementations, or statistical tests (e.g., paired significance or effect sizes) that establish superiority over final-program and synthetic baselines; this detail is load-bearing for the central empirical result.

Authors: We agree that the abstract and §1 would be strengthened by explicit references to the supporting quantitative evidence. Detailed results (including perplexity and action-prediction accuracy metrics, descriptions of the final-program and synthetic baselines, and paired statistical tests with effect sizes) are reported in §4. To address the concern, we have updated the abstract to summarize the key comparative results with effect sizes and added a concise overview of the evaluation setup, metrics, and significance tests (with pointers to §4) in the revised §1.

Revision: yes

Referee: [Abstract and §3, dataset description] The generalization claim that the learned representations and steering behavior capture transferable novice reasoning processes rests on the unvalidated assumption that Pencil Code traces are representative of broader student populations and environments; no cross-platform comparison, demographic analysis, or bias audit is reported, which directly affects the scope of the 'properties of individual students' conclusion.

Authors: We acknowledge that the manuscript's claims are grounded in the Pencil Code dataset and that broader transferability should not be overstated without additional evidence. We have added an explicit limitations subsection to §3 that describes the platform's user base, notes the absence of detailed demographic data in the collected traces, discusses potential selection biases, and clarifies the scope of the 'individual student properties' findings to this population. While we cannot perform a cross-platform comparison or full bias audit with the current data, the public release of the 3.8M-trace dataset enables such work by others.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper trains language models on an external dataset of 3.8 million real Pencil Code traces and evaluates them empirically against independent baselines (final programs, synthetic traces). Claims about stronger modeling of student behavior, predictability of trace properties (e.g., backtracking, comments) from learned student representations, and steerable edit generation are supported by behavioral analyses and probing on held-out data rather than by quantities defined in terms of the model's own fitted parameters. No self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the reported methodology or results; the central findings rest on comparisons to external references and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work depends on the domain assumption that edit traces faithfully reflect individual student reasoning and on standard machine-learning modeling choices whose specific hyperparameters are not detailed in the abstract.

free parameters (1)

language model training hyperparameters
Standard but unspecified choices for learning rate, batch size, and architecture that affect all reported results.

axioms (1)

domain assumption: Real interaction traces reflect diverse student reasoning processes and skill development
Invoked when claiming that trace-trained models capture student behavior better than final-code or synthetic baselines.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We introduce a dataset of over 3.8 million programming reasoning traces... models trained on real traces are stronger at modeling diverse student behavior... steering code generation models to identify a sequence of edits

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Probing Student Representations to Predict Means Across Traces for a Student

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
cs.AI 2026-04 unverdicted novelty 7.0

Serializing real student code submission logs into conversational turns and fine-tuning Qwen models with supervised learning plus preference optimization produces artificial learners that better match authentic debugg...

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
cs.AI 2026-04 conditional novelty 7.0

Training open-weight LLMs on conversational serializations of authentic student programming submissions produces artificial learners that better replicate real debugging behavior than code-only baselines or prompted l...