arxiv: 2605.17857 · v1 · pith:JIKASBDWnew · submitted 2026-05-18 · 💻 cs.HC

Towards SocratiCode: Designing a Generative AI-Based Programming Tutor for K-12 Students through a 4-Week Participatory Design Study

Pith reviewed 2026-05-20 09:35 UTC · model grok-4.3

classification 💻 cs.HC
keywords generative AI · K-12 programming education · Socratic tutoring · participatory design · adaptive learning companion · Python for beginners · human-AI collaboration

The pith

Generative AI for K-12 programming works best as a Socratic questioner embedded in human-guided lessons rather than as a direct answer provider.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports on a four-week participatory design process with two K-12 students in which an AI tutoring system named SocratiCode was repeatedly revised based on learner feedback. It moved from open-ended tutorial generation to a more constrained dialogic style that uses guided questions, reflection prompts, misconception checks, incremental hints, and required pauses for student input. A sympathetic reader would care because the authors argue this Socratic form reduces the overwhelm that lengthy AI explanations can create for novices while still leveraging generative capabilities. The work positions the AI as a companion inside a broader human-led instructional setup rather than a standalone solution engine.

Core claim

Across the study iterations the system shifted toward dialogic support through guided questioning, reflection prompts, misconception checks, incremental hints, and mandatory pauses for learner input; preliminary observations indicate this change improved explanation clarity, supported problem-solving engagement, and better matched novice needs when combined with human guidance.

What carries the argument

SocratiCode is the evolving adaptive tutorial system whose refinement into a Socratic tutoring model supplies guided questions and learner-input pauses instead of full solutions.
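
The dialogic features named here (a guiding question, incremental hints, and a required pause for learner input) can be pictured as a small loop. The Python below is an illustrative sketch only and is not SocratiCode's implementation, which the paper describes as prompt-driven. The class `SocraticStep`, the function `run_step`, the keyword check, and the sample question are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SocraticStep:
    """One tutoring step: a guiding question plus hints of increasing directness."""
    question: str
    expected_keyword: str  # hypothetical, deliberately crude check of the reply
    hints: list[str] = field(default_factory=list)


def run_step(step: SocraticStep, replies: list[str]) -> list[str]:
    """Replay one step against a scripted list of learner replies.

    The tutor never prints a full solution: it asks, waits for a reply
    (the mandatory pause), and releases at most one new hint per reply.
    Returns the list of tutor messages.
    """
    transcript = [step.question]
    hints_given = 0
    for reply in replies:
        if not reply.strip():
            # Mandatory pause: an empty reply does not advance the dialogue.
            transcript.append("Take your time. What do you think?")
            continue
        if step.expected_keyword.lower() in reply.lower():
            # Reflection prompt instead of simply confirming the answer.
            transcript.append("Nice reasoning. Can you explain why that works?")
            return transcript
        if hints_given < len(step.hints):
            transcript.append(f"Hint {hints_given + 1}: {step.hints[hints_given]}")
            hints_given += 1
        else:
            # Out of hints: hand over to the human instructor.
            transcript.append("Let's ask your teacher to look at this together.")
            return transcript
    return transcript


# Hypothetical example content, not taken from the study.
step = SocraticStep(
    question="What numbers do you think range(3) gives us in a for loop?",
    expected_keyword="0, 1, 2",
    hints=["Counting in Python starts at 0.", "range(3) stops before 3."],
)
transcript = run_step(step, ["1, 2, 3", "", "0, 1, 2"])
for message in transcript:
    print(message)
```

In this sketch a wrong reply releases exactly one hint, an empty reply triggers the pause message, and running out of hints escalates to a human, which mirrors the paper's framing of the AI as a companion inside human-led instruction.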

Load-bearing premise

Feedback from only two K-12 students across four weeks can show reliable gains in clarity, engagement, and fit for a wider population of novice learners.

What would settle it

A controlled trial that assigns many more K-12 students to either the final Socratic version or a directive answer-giving version and measures differences in problem-solving success and reported confusion.
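
As a rough illustration of how outcomes from such a two-arm trial might be compared, the sketch below runs a two-sided permutation test on the difference in mean outcome between the Socratic and directive groups. It is one possible analysis under stated assumptions, not a method proposed by the paper. The function name `permutation_test`, the binary outcome coding, and the sample lists are hypothetical placeholders, not study data.

```python
import random


def permutation_test(socratic: list[int], directive: list[int],
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in mean outcome.

    Each list holds one outcome per student (for example 1 = solved the
    task, 0 = did not). Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(socratic) / len(socratic)
                   - sum(directive) / len(directive))
    pooled = socratic + directive  # new list, safe to shuffle in place
    n = len(socratic)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n
                   - sum(pooled[n:]) / (len(pooled) - n))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Identical placeholder groups: no difference, so the p-value is 1.0.
print(permutation_test([1, 0, 1, 0], [1, 0, 1, 0]))  # → 1.0
```

A permutation test is used here because it makes few distributional assumptions; a real trial would also need a pre-registered outcome measure and a sample size chosen in advance.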

Figures

Figures reproduced from arXiv: 2605.17857 by Anshul Bihani, Cassandra Lucas, Chun-Hua Tsai, Jaydeb Sarker, Mia Mohammad Imran, Rohini Kukka.

Figure 1: Experiment Pipeline of SocratiCode. Panel labels: Daily Topic Exploration (Monday, Tuesday, Wednesday) and Feedback; Revision Topics (Thursday); Weekly Agile Style Group Meeting and Feedback (Friday); Update Prompt (If needed).
Figure 2: Feedback Loop.

… with conditions” with “This completes the lesson.” These revisions produced a more structured and dialogic tutoring flow. We provide the shortened prompt template used by the end of W4 below.

Template Structure (Shortened). For Full Prompt [6]
  1. Role & Audience: Act as a step-by-step tutorial guide for absolute beginners, using Python by default and clear analogies.
  2. Learner Adaptation: As…
original abstract

Generative AI creates new opportunities for programming education, but many existing systems remain overly directive, producing lengthy explanations and premature solutions that can overwhelm K-12 novices. In this paper, we present a participatory design study of how an adaptive tutorial system, SocratiCode, evolved toward a Socratic tutoring model for beginner programming instruction. Drawing on weekly learner feedback, we iteratively refined the system over a four-week study with two K-12 students learning Python. Across iterations, the system shifted from flexible tutorial generation toward a more dialogic form of support characterized by guided questioning, reflection prompts, misconception checks, incremental hints, and mandatory pauses for learner input. Our preliminary observations suggest that this Socratic shift improved explanation clarity, supported problem-solving engagement, and better aligned instruction with novice learners' needs, especially when combined with human guidance. We argue that generative AI in K-12 programming education may be most effective not as an answer engine, but as a Socratic, adaptive learning companion embedded within a human-guided instructional framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports on a 4-week participatory design study with two K-12 students in which the authors iteratively refined a generative-AI programming tutor (SocratiCode) from a flexible tutorial generator into a dialogic Socratic system that uses guided questioning, reflection prompts, misconception checks, incremental hints, and mandatory pauses. Preliminary qualitative observations from weekly learner feedback are presented as evidence that the Socratic shift improved explanation clarity, supported problem-solving engagement, and better aligned with novice needs when combined with human guidance; the authors conclude that generative AI in K-12 programming education is most effective as an embedded Socratic companion within a human-guided instructional framework.

Significance. If the reported benefits of the Socratic features can be replicated at scale, the work would supply useful design heuristics for AI tutors aimed at young beginners, particularly the value of mandatory reflection pauses and incremental scaffolding over direct answer generation. At present the contribution remains exploratory and design-oriented rather than a validated pedagogical result.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results/Observations): the claims that the Socratic shift 'improved explanation clarity, supported problem-solving engagement, and better aligned instruction with novice learners' needs' rest solely on qualitative observations from two participants; no quantitative pre/post learning or engagement metrics, no control condition, and no systematic error analysis are reported, leaving attribution to the dialogic features insecure.
  2. [Discussion] Discussion section: the broader argument that generative AI 'may be most effective not as an answer engine, but as a Socratic, adaptive learning companion embedded within a human-guided instructional framework' extrapolates from iterative refinements driven by feedback from only two learners over four weeks; the manuscript provides no evidence that the observed changes are driven by the Socratic elements themselves rather than learner-specific factors, consistent human guidance, or study duration.
minor comments (2)
  1. [Methods] Methods: provide the exact system prompts or prompt-engineering changes applied at each weekly iteration so that the design trajectory can be reproduced or extended by other researchers.
  2. [Figures] Figures: ensure any diagrams showing the evolution of the tutor interface across the four weeks are explicitly labeled with iteration number and the specific Socratic features introduced at each step.
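
The first minor comment asks for the prompt changes at each weekly iteration. One lightweight way to record such a trajectory, sketched here under assumptions, is to store each week's prompt text and emit a unified diff between consecutive versions using Python's standard `difflib` module. The helper `prompt_diff` and both prompt strings are hypothetical placeholders and are not the study's actual prompts.

```python
import difflib


def prompt_diff(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff between two prompt versions as one string."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",  # inputs have no trailing newlines, so add none
    )
    return "\n".join(diff)


# Placeholder prompts for illustration only; NOT the study's prompts.
week1 = "Act as a tutorial guide.\nExplain each topic in full."
week2 = ("Act as a tutorial guide.\n"
         "Ask one guiding question, then wait for the learner.")

diff_text = prompt_diff(week1, week2, "W1", "W2")
print(diff_text)
```

Keeping one diff per iteration would let other researchers see which instruction changed in which week without comparing full prompts by hand.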

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us better frame the exploratory scope of this participatory design study. We respond to each major comment below and describe the changes incorporated into the revised manuscript.

point-by-point responses
  1. Referee: [Abstract and §4] the claims that the Socratic shift 'improved explanation clarity, supported problem-solving engagement, and better aligned instruction with novice learners' needs' rest solely on qualitative observations from two participants; no quantitative pre/post learning or engagement metrics, no control condition, and no systematic error analysis are reported, leaving attribution to the dialogic features insecure.

    Authors: We agree that the reported observations are qualitative, drawn from only two participants, and lack quantitative pre/post measures, a control condition, or systematic error analysis. This is consistent with the participatory design methodology of the study, which prioritized iterative refinement based on weekly learner feedback rather than controlled experimentation. In the revised manuscript we have updated the abstract and Section 4 to qualify all statements as preliminary observations from the design process. We now explicitly note the absence of quantitative metrics and control conditions, avoid causal language regarding attribution to the dialogic features, and add a forward-looking statement calling for larger-scale studies with such measures to validate the observed patterns. revision: yes

  2. Referee: [Discussion] the broader argument that generative AI 'may be most effective not as an answer engine, but as a Socratic, adaptive learning companion embedded within a human-guided instructional framework' extrapolates from iterative refinements driven by feedback from only two learners over four weeks; the manuscript provides no evidence that the observed changes are driven by the Socratic elements themselves rather than learner-specific factors, consistent human guidance, or study duration.

    Authors: We accept that the small sample and study duration limit the strength of broader claims and that alternative explanations (learner-specific factors, human guidance, or simply the passage of time) cannot be ruled out from the available data. We have revised the Discussion section to present the argument as a set of design heuristics emerging from this case rather than a generalizable conclusion. We have added explicit discussion of potential confounds, including the role of consistent human guidance, and inserted a new Limitations subsection that directly addresses the small participant count, the four-week timeframe, and the inability to isolate the contribution of the Socratic elements from other study variables. revision: yes

Circularity Check

0 steps flagged

No significant circularity in qualitative participatory design study

full rationale

The paper describes a 4-week participatory design process with two K-12 students in which SocratiCode was iteratively refined based on direct weekly feedback. All claims about improved clarity, engagement, and alignment with novice needs are presented as preliminary observations drawn from that feedback and the resulting design changes. No equations, fitted parameters, predictions, uniqueness theorems, or self-citation chains appear; the work contains no derivations that could reduce outputs to inputs by construction. The study is therefore self-contained against external benchmarks and receives a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that small-scale qualitative feedback from two participants can validly inform general design principles for AI tutoring effectiveness.

axioms (1)
  • domain assumption: Weekly feedback from a small number of K-12 learners in participatory sessions accurately identifies effective tutoring strategies for novice programmers.
    The study uses this feedback to drive iterative shifts from tutorial generation to Socratic dialogue.

pith-pipeline@v0.9.0 · 5740 in / 1344 out tokens · 86994 ms · 2026-05-20T09:35:14.899789+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. [1]

    Erfan Al-Hossami, Razvan Bunescu, Ryan Teehan, Laurel Powell, Khyati Mahajan, and Mohsen Dorodchi. 2023. Socratic questioning of novice debuggers: A benchmark dataset and preliminary evaluations. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 709–726

  2. [2]

    Ohud Abdullah Alasmari, Jeremy Singer, and Mireilla Bikanga Ada. 2023. Do current online coding tutorial systems address novice programmer difficulties? In Proceedings of the 15th International Conference on Education Technology and Computers. 242–248

  3. [3]

    Mohammed Amin Almaiah, Raghad Alfaisal, Said A Salloum, Fahima Hajjej, et al. 2022. Examining the impact of artificial intelligence and social and computer anxiety in e-learning settings: Students’ perceptions at the university level. Electronics 11, 22 (2022), 3662

  4. [4]

    Zeyad Alshaikh, Lasang Tamang, and Vasile Rus. 2020. A Socratic tutor for source code comprehension. In International conference on artificial intelligence in education. Springer, 15–19

  5. [5]

    Zeyad Alshaikh, Lasang Jimba Tamang, and Vasile Rus. 2020. Experiments with a socratic intelligent tutoring system for source code understanding. In The Thirty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS-32)

  6. [6]

    Anonymous Anonymous. 2026. Replication Package for SocratiCode for K-12 Students Study. doi:10.5281/zenodo.20018098

  7. [7]

    Samuel Boguslawski, Rowan Deer, and Mark G Dawson. 2025. Programming education and learner motivation in the age of generative AI: student and educator perspectives. Information and Learning Sciences (2025)

  8. [8]

    Michelle Brachman, Siya Kunde, Sarah Miller, Ana Fucs, Samantha Dempsey, Jamie Jabbour, and Werner Geyer. 2025. Building Appropriate Mental Models: What Users Know and Want to Know about an Agentic AI Chatbot. In Proceedings of the 30th International Conference on Intelligent User Interfaces. 247–264

  9. [9]

    Peter Brusilovsky and Eva Millán. 2007. User models for adaptive hypermedia and adaptive educational systems. In The adaptive web: methods and strategies of web personalization. Springer, 3–53

  10. [10]

    Kathy Charmaz. 2006. Constructing grounded theory: A practical guide through qualitative analysis. Sage

  11. [11]

    Rudrajit Choudhuri, Ambareesh Ramakrishnan, Amreeta Chatterjee, Bianca Trinkenreich, et al. 2025. Insights from the Frontline: GenAI Utilization Among Software Engineering Students. IEEE Xplore (2025), 1–12

  12. [12]

    Paul Denny, David H Smith IV, Max Fowler, James Prather, Brett A Becker, and Juho Leinonen. 2024. Explaining code with a purpose: An integrated approach for developing code comprehension and prompting skills. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1. 283–289

  13. [13]

    Sidney D’Mello and Art Graesser. 2013. AutoTutor and affective AutoTutor: Learning by talking with cognitively and emotionally intelligent computers that talk back. ACM Transactions on Interactive Intelligent Systems (TiiS) (2013)

  14. [14]

    Ian Drosos, Jack Williams, Advait Sarkar, Nicholas Wilson, Sean Rintel, and Payod Panda. 2025. Dynamic Prompt Middleware: Contextual Prompt Refinement Controls for Comprehension Tasks. In Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work. 1–23

  15. [15]

    Encyclopaedia Britannica. 2026. Socratic method. https://www.britannica.com/topic/Socratic-method Last updated March 13, 2026. Accessed April 15, 2026

  16. [16]

    Guangrui Fan, Dandan Liu, Rui Zhang, and Lihu Pan. 2025. The impact of AI-assisted pair programming on student motivation, programming anxiety, collaborative learning, and programming performance: a comparative study with traditional pair programming and individual approaches. International Journal of STEM Education 12, 1 (2025), 16

  17. [17]

    Department for Education. 2025. Generative artificial intelligence (AI) in education. Technical Report. Department for Education, UK. Updated 12 August 2025

  18. [18]

    Michail Giannakos, Roger Azevedo, et al. 2025. The promise and challenges of generative AI in education. Behaviour & Information Technology (2025)

  19. [19]

    Shuchi Grover and Roy Pea. 2013. Computational thinking in K–12: A review of the state of the field. Educational researcher 42, 1 (2013), 38–43

  20. [20]

    Xingjian Gu and Barbara J Ericson. 2025. AI literacy in K-12 and higher education in the wake of generative AI: An integrative review. In Proceedings of the 2025 ACM Conference on International Computing Education Research V. 1. 125–140

  21. [21]

    Joint Task Force on Computing Curricula, Association for Computing Machinery (ACM) and IEEE Computer Society. 2013. Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science. ACM Press and IEEE Computer Society Press, New York, NY, USA

  22. [22]

    Caitlin Kelleher and Randy Pausch. 2005. Lowering the barriers to programming: A taxonomy of programming environments and languages for novice programmers. ACM computing surveys (CSUR) 37, 2 (2005), 83–137

  23. [23]

    Caitlin Kelleher, Randy Pausch, and Sara Kiesler. 2007. Storytelling Alice motivates middle school girls to learn computer programming. In Proceedings of the SIGCHI conference on Human factors in computing systems. 1455–1464

  24. [24]

    Eric Klopfer, Justin Reich, Hal Abelson, and Cynthia Breazeal. 2024. Generative AI and K-12 education: An MIT perspective. (2024)

  25. [25]

    Uday Mittal, Siva Sai, Vinay Chamola, et al. 2024. A comprehensive review on generative AI for education. IEEE Access (2024)

  26. [26]

    Susanne Narciss and Ecenaz Alemdag. 2025. Learning from errors and failure in educational contexts: New insights and future directions for research and practice. British Journal of Educational Psychology 95, 1 (2025), 197–218

  27. [27]

    Sydney Nguyen, Hannah McLean Babe, Yangtian Zi, Arjun Guha, Carolyn Jane Anderson, and Molly Q Feldman. 2024. How Beginning Programmers and Code LLMs (Mis)read Each Other. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24) (2024), 1–26

  28. [28]

    Annemarie Sullivan Palincsar and Ann L Brown. 1984. Reciprocal teaching of comprehension-fostering and comprehension-monitoring activities. Cognition and instruction 1, 2 (1984), 117–175

  29. [29]

    Jiyeon Park and Sam Choo. 2025. Generative AI prompt engineering for educators: Practical strategies. Journal of Special Education Technology 40, 3 (2025), 411–417

  30. [30]

    Christian Rahe and Walid Maalej. 2025. How Do Programming Students Use Generative AI? Proceedings of the ACM on Software Engineering FSE (2025)

  31. [31]

    Brian J Reiser. 2018. Scaffolding complex learning: The mechanisms of structuring and problematizing student work. In Scaffolding. Psychology Press, 273–304

  32. [32]

    Sangho Suh, Jian Zhao, and Edith Law. 2022. Codetoon: Story ideation, auto comic generation, and structure mapping for code-driven storytelling. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology

  33. [33]

    Osman Tasdelen and Daniel Bodemer. 2025. Generative AI in the classroom: Effects of context-personalized learning material and tasks on motivation and performance. International Journal of Artificial Intelligence in Education (2025)

  34. [34]

    Selin Urhan and Selay Arkun Kocadere. 2024. Problem-Solving Through Pair-Programming: The Mediational Role of ChatGPT. In 2024 5th International Conference in Electronic Engineering, Information Technology & Education. IEEE

  35. [35]

    Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, et al. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023)

  36. [36]

    Leon E Winslow. 1996. Programming pedagogy—a psychological overview. ACM Sigcse Bulletin 28, 3 (1996), 17–22

  37. [37]

    Yangtian Zi, Luisa Li, Arjun Guha, Carolyn Jane Anderson, and Molly Q Feldman. 2025. “I Would Have Written My Code Differently”: Beginners Struggle to Understand LLM-Generated Code. Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE Companion ’25) (2025)