Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs

Erfan Al-Hossami; Razvan Bunescu

arXiv: 2511.00371 · v2 · submitted 2025-11-01 · 💻 cs.CL · cs.CY · cs.SE

Pith reviewed 2026-05-18 02:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.CY · cs.SE

keywords Socratic debugging · reasoning trajectories · programming misconceptions · LLM generation · cognitive dissonance · belief update · student code debugging · AI tutoring

The pith

LLMs can generate reasoning trajectories for Socratic debugging that reach contradictions with student misconceptions in up to 91 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper models Socratic debugging as the creation of a reasoning trajectory that steers a student toward a statement about program behavior which contradicts their misconception. The contradiction is meant to trigger cognitive dissonance so that the student spots the false belief and revises it without being told the fix directly. The authors release a dataset of debugging problems annotated with manually created or LLM-generated trajectories, then build LLM methods that produce both the trajectories and full conversations anchored to them. An LLM-as-judge evaluation of the outputs finds that current models reach up to 91 percent correct trajectories and 98.7 percent valid conversation turns.

Core claim

The central claim is that reasoning trajectories supply a concrete, step-by-step structure for Socratic debugging: each trajectory begins from a student's likely misconception and ends at an observation that directly contradicts it, after which the resulting dissonance is expected to produce identification of the error and an updated belief about the code.

What carries the argument

A Reasoning Trajectory, a guided sequence of statements and questions that ends in a contradiction with the bug-causing misconception about program behavior.
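
To make the structure concrete, here is a minimal Python sketch of a reasoning trajectory as a data record. The class name, the field names, the buggy `buggy_sum_to` program, and the misconception about `range` are all illustrative assumptions, not taken from the paper or its dataset. The sketch only shows how a trajectory can end in a statement that running the code confirms and the misconception denies.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningTrajectory:
    """Hypothetical container for one RT; field names are illustrative."""
    misconception: str                               # false belief behind the bug
    steps: list[str] = field(default_factory=list)   # guided statements, in order
    contradiction: str = ""                          # final statement about program behavior


def buggy_sum_to(n: int) -> int:
    """Hypothetical student code meant to return 1 + 2 + ... + n."""
    total = 0
    for i in range(1, n):  # bug: range(1, n) stops at n - 1
        total += i
    return total


rt = ReasoningTrajectory(
    misconception="range(1, n) includes n",
    steps=[
        "For n = 3 the expected result is 1 + 2 + 3 = 6.",
        "The loop variable i takes the values produced by range(1, 3).",
        "list(range(1, 3)) is [1, 2].",
    ],
    contradiction="With n = 3 the loop never runs with i == 3, so the sum is 3, not 6.",
)

# The final statement can be checked by executing the student code.
observed_values = list(range(1, 3))
observed_sum = buggy_sum_to(3)
```

Keeping the contradiction as a separate field mirrors the idea that the trajectory is judged by where it ends: the last statement has to be true of the program and incompatible with the misconception.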

If this is right

LLM generators can produce both the trajectories and the Socratic conversations that follow from them at high rates of validity.
The new annotated dataset supports training and benchmarking of models on this specific generation task.
If the trajectories work as intended, automated tutors can guide students to fix bugs themselves rather than reveal the solution outright.
The same contradiction-based structure can be applied across many common novice programming errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be embedded in programming tools to deliver on-demand Socratic guidance at scale.
Similar trajectory structures might transfer to other subjects where misconceptions are common, such as introductory mathematics.
Real classroom deployment would need to test whether the assumed cognitive-dissonance step actually occurs and produces lasting change.
The method opens the possibility of collecting large interaction logs that could improve future trajectory generators.

Load-bearing premise

That reaching the contradiction will reliably produce cognitive dissonance strong enough for the student to identify the misconception and retain the corrected belief.

What would settle it

A study that measures actual students' misconception resolution rates and belief retention after they interact with the generated trajectories versus control conditions.
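
One way such a comparison could be analyzed is a two-proportion z-test on resolution rates. The sketch below is an assumption about a possible analysis, not something the paper reports; the function name and the counts in the usage lines are placeholders, and it covers only the resolution-rate part of the question, not belief retention.

```python
import math


def two_proportion_z(resolved_a: int, n_a: int, resolved_b: int, n_b: int):
    """Two-sided z-test for a difference between two resolution rates.

    Group a could be students who saw generated trajectories and group b
    a control condition (both hypothetical).
    """
    p_a, p_b = resolved_a / n_a, resolved_b / n_b
    pooled = (resolved_a + resolved_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("rates are identical and degenerate; test is undefined")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Placeholder counts for illustration only; not data from any study.
z, p_value = two_proportion_z(30, 40, 20, 40)
```

A real study would also need a delayed post-test to address retention, which this sketch does not model.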

Figures

Figures reproduced from arXiv: 2511.00371 by Erfan Al-Hossami, Razvan Bunescu.

**Figure 2.** Alternative reasoning trajectory for the input

**Figure 3.** The original input specification

**Figure 4.** The simplified input for the original in Figure

**Figure 5.** The RT for the simplified input in Figure

**Figure 6.** Reasoning trajectories prompt template. The

**Figure 7.** Prompt template for Socratic conversation

**Figure 8.** Prompt template for failed test case description generation. The full template includes detailed execution

**Figure 9.** Prompt template for LLM-as-judge evaluation of reasoning trajectories. An RT is valid only if all three

**Figure 10.** Prompt template for LLM-as-judge evaluation of Socratic teacher utterances. A teacher utterance is

**Figure 11.** Interactive web interface for generating reasoning trajectories and Socratic conversations from student

**Figure 12.** Interactive web interface generates a reasoning trajectory concluding with a statement that contradicts

**Figure 13.** The tool generates a complete Socratic conversation between a student and a teacher based on the generated

Original abstract

In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this contradiction, the ensuing cognitive dissonance is expected to lead the student to identify the false belief on their own, followed by an enduring belief update. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems annotated with RTs that are manually created or LLM-generated. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that large language and reasoning models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.
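
The judge template for teacher utterances (Figure 10) asks for a JSON verdict with a `valid` flag and two criteria, `prompts_correct_inference` and `does_not_state_inference`, and counts an utterance as valid only if both pass. The sketch below shows how a valid-turn rate could be computed from such verdicts. The function names and the two sample verdict strings are assumptions for illustration, and the paper's actual aggregation code may differ.

```python
import json

# Criterion names as they appear in the judge output format.
CRITERIA = ("prompts_correct_inference", "does_not_state_inference")


def is_valid_turn(verdict: dict) -> bool:
    """A turn counts as valid only if both criteria pass."""
    scores = verdict.get("criteria_scores", {})
    return all(scores.get(name) is True for name in CRITERIA)


def valid_turn_rate(raw_verdicts: list[str]) -> float:
    """Fraction of judge verdicts (JSON strings) that mark a valid turn."""
    verdicts = [json.loads(raw) for raw in raw_verdicts]
    if not verdicts:
        return 0.0
    return sum(is_valid_turn(v) for v in verdicts) / len(verdicts)


# Two placeholder verdicts, not real judge outputs.
sample = [
    '{"valid": true, "criteria_scores": {"prompts_correct_inference": true,'
    ' "does_not_state_inference": true}, "comments": "", "feedback": "NONE"}',
    '{"valid": false, "criteria_scores": {"prompts_correct_inference": true,'
    ' "does_not_state_inference": false}, "comments": "", "feedback": "NONE"}',
]
rate = valid_turn_rate(sample)
```

Recomputing validity from the individual criteria, rather than trusting the top-level `valid` flag, guards against a judge response whose flag and criteria disagree.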

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a new task of generating reasoning trajectories to structure Socratic debugging in student code and releases a mixed manual/LLM dataset, but its headline results rest on an unvalidated LLM judge.

The main thing to know is that this work formalizes Socratic debugging as the generation of a reasoning trajectory that ends in a contradiction with the student's misconception, then uses that trajectory to anchor a tutoring conversation. The authors introduce the task explicitly, build a dataset of debugging problems with both manually written and LLM-generated trajectories, and test LLMs on producing both the trajectories and the conversations that follow them. The reported numbers are up to 91% correct trajectories and 98.7% valid conversation turns according to an LLM judge.

That framing and the dataset are the clearest new pieces here. The practical setup for turning a misconception into a guided path to contradiction is straightforward and could be picked up by people building code tutors. The assumption that reaching the contradiction will reliably trigger a lasting belief update is stated plainly but left untested, which is fine for a first paper as long as it is not oversold.

The evaluation is the soft spot that stands out. All the quantitative claims come from an LLM-as-judge setup with no reported human agreement numbers, no calibration details, and no error analysis on where the judge disagrees with people. Because the gold set already mixes human and generated trajectories, any systematic preference the judge has for LLM-style reasoning would directly boost the success rates. That is a standard limitation in this kind of work, but it needs to be addressed before the percentages can be treated as strong evidence.

This is for researchers working on AI-supported programming education who want a concrete task definition and a starting dataset. A reader already familiar with Socratic tutoring or LLM evaluation practices will see the gaps quickly but can still extract the task formulation and the dataset for follow-up experiments. The paper shows clear thinking about the mechanism even if the current evidence is preliminary. I would send it to peer review so referees can examine the full dataset construction and judge protocol.
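
The concern about a judge preferring LLM-style reasoning suggests one simple diagnostic: split the judge's pass rate by where each trajectory came from. The sketch below is a hypothetical check, not part of the paper; the function name, the `"manual"` and `"llm"` labels, and the sample records are assumptions.

```python
from collections import defaultdict


def pass_rate_by_source(records):
    """Judge pass rate per trajectory source.

    records: iterable of (source, judged_valid) pairs,
    for example ("manual", True) or ("llm", False).
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [valid, total]
    for source, judged_valid in records:
        counts[source][0] += int(judged_valid)
        counts[source][1] += 1
    return {source: valid / total for source, (valid, total) in counts.items()}


# Placeholder records for illustration only.
rates = pass_rate_by_source([
    ("manual", True), ("manual", False),
    ("llm", True), ("llm", True), ("llm", True), ("llm", False),
])
```

A gap between the two rates would not prove bias on its own, since the trajectories may differ in real quality, but it would indicate where human spot-checks are most needed.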

Referee Report

2 major / 2 minor

Summary. The paper introduces the task of reasoning trajectory (RT) generation for Socratic debugging of student code, where an RT guides a student from a programming misconception to a contradiction about program behavior. It contributes a dataset of debugging problems annotated with manually created or LLM-generated RTs, proposes LLM-based methods to generate RTs and anchored Socratic conversations, and reports a large-scale LLM-as-judge evaluation claiming up to 91% correct RTs and 98.7% valid conversation turns.

Significance. If the performance claims hold under rigorous human validation, the work could meaningfully advance AI-assisted programming education by formalizing contradiction-driven trajectories that target misconception correction. The annotated dataset and the separation of trajectory generation from conversation generation are concrete strengths that could support follow-on reproducible research. The empirical scale of the evaluation is also a positive feature.

major comments (2)

[Abstract and evaluation section] The central claims of 91% correct reasoning trajectories and 98.7% valid conversation turns rest entirely on LLM-as-judge assessment. No details are supplied on the judge prompt, few-shot examples, inter-judge agreement, human calibration, or disagreement analysis, despite the dataset mixing manual and LLM-generated RTs; this directly undermines confidence that the percentages measure true correctness rather than judge bias.
[Introduction] The motivating claim that reaching a contradiction will reliably produce cognitive dissonance leading to student identification of the misconception and an enduring belief update is asserted but not tested or measured in any experiment; this assumption is load-bearing for the practical significance of the generated trajectories.
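
The agreement statistic requested in the first major comment could be as simple as Cohen's kappa between the LLM judge and a human annotator on a shared subset. The sketch below assumes binary valid/invalid labels; the function name and the sample labels are hypothetical, and the paper does not state which agreement measure it would use.

```python
def cohen_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n  # share of items the judge marks valid
    p_human = sum(human) / n  # share of items the human marks valid
    # Agreement expected by chance if the raters labeled independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters use a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder labels for illustration only.
kappa = cohen_kappa([True, True, False, False], [True, False, False, False])
```

Raw percent agreement can look high when nearly every item is labeled valid, which is the situation a 91% or 98.7% pass rate implies, so a chance-corrected measure is the more informative number to report.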

minor comments (2)

[Dataset section] The abstract does not report the total number of debugging problems, the split between manual and LLM-generated RTs, or inter-annotator agreement for the manual annotations.
[Methods] A more explicit description of the prompting strategies or any fine-tuning used for RT generation and conversation anchoring would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract and evaluation section] The central claims of 91% correct reasoning trajectories and 98.7% valid conversation turns rest entirely on LLM-as-judge assessment. No details are supplied on the judge prompt, few-shot examples, inter-judge agreement, human calibration, or disagreement analysis, despite the dataset mixing manual and LLM-generated RTs; this directly undermines confidence that the percentages measure true correctness rather than judge bias.

Authors: We agree that greater transparency on the LLM-as-judge protocol is needed. In the revised manuscript we will add the complete judge prompt, the few-shot examples, inter-judge agreement statistics computed on a held-out subset, and a disagreement analysis contrasting LLM judgments against the manual annotations present in the dataset. These additions will allow readers to evaluate potential bias more directly.
revision: yes

Referee: [Introduction] The motivating claim that reaching a contradiction will reliably produce cognitive dissonance leading to student identification of the misconception and an enduring belief update is asserted but not tested or measured in any experiment; this assumption is load-bearing for the practical significance of the generated trajectories.

Authors: The claim is presented as a hypothesis drawn from established theories of conceptual change and cognitive dissonance in science and mathematics education; we will insert the relevant citations. Because the present work centers on the computational task of trajectory generation, dataset release, and LLM-based method development rather than on measuring student learning outcomes, we did not conduct human-subject experiments. We will expand the limitations and future-work discussion to state this scope limitation explicitly and to outline planned follow-up studies that would test belief-update effects with students.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper introduces a task and dataset of debugging problems with manually created or LLM-generated reasoning trajectories, then applies LLM-based methods to generate RTs and anchored Socratic conversations, reporting direct empirical results from an LLM-as-judge evaluation (up to 91% correct trajectories and 98.7% valid turns). These percentages are measured outcomes on the constructed dataset rather than quantities derived by construction from fitted parameters, self-definitions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claims back to the inputs; the work is an empirical pipeline with external benchmarks in the form of the annotated dataset and judge criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on an educational psychology assumption about how contradictions produce belief change and treats LLM-generated trajectories as functionally equivalent to human ones for the purpose of the evaluation.

axioms (1)

domain assumption: Cognitive dissonance from contradictions with misconceptions leads students to identify false beliefs and perform enduring belief updates.
Invoked in the abstract when describing the expected outcome after reaching the contradicting statement in the reasoning trajectory.

invented entities (1)

Reasoning Trajectory (RT) · no independent evidence
purpose: A guided sequence of statements that leads from a bug-causing misconception to a contradicting statement about program behavior.
Newly defined construct that structures the Socratic debugging process and serves as the anchor for both generation and conversation tasks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear

Paper passage: "Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this contradiction, the ensuing cognitive dissonance is expected to lead the student to identify the false belief on their own, followed by an enduring belief update."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "A large-scale LLM-as-judge evaluation shows that large language and reasoning models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
