
arxiv: 2604.07304 · v1 · submitted 2026-04-08 · 💻 cs.SE · cs.AI

Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords conversational assessment · automated programming assessment · code understanding · hybrid systems · Socratic questioning · LLM guardrails · educational chatbots · programming education

The pith

A hybrid framework adds conversational checks to automated programming systems to confirm students understand the code they submit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models now let students produce working code without grasping its internal logic, which weakens the value of conventional automated programming assessment systems that only check functional correctness. The paper reviews existing conversational assessment methods across rule-based, LLM-only, and hybrid designs and finds they show potential for scalable feedback but struggle with hallucinations, over-reliance, and deployment limits. It then outlines the Hybrid Socratic Framework, which adds a dual-agent conversational layer, knowledge tracking, and scaffolded questions whose prompts are anchored directly to runtime execution facts. The framework is presented as a complementary verification layer rather than a replacement for standard tests, with built-in safeguards such as randomized trace questions and local-model options. A reader would care because the approach attempts to restore the connection between submitted code and demonstrated comprehension in an era of widespread AI code generation.

Core claim

The central claim is that conversational verification can be integrated into Automated Programming Assessment Systems through a Hybrid Socratic Framework that pairs deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie every prompt to concrete runtime facts. Drawing on a scoping review of prior systems, the framework incorporates practical safeguards including proctored modes, randomized trace questions, stepwise reasoning linked to execution states, and local deployment choices to limit hallucinations, over-reliance, privacy risks, and integrity issues. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.

What carries the argument

The Hybrid Socratic Framework, which layers a dual-agent conversational verification system over deterministic runtime code analysis and anchors all questions to execution traces.
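
One way to picture how a question can be anchored to an execution trace is the minimal Python sketch below. It is not the authors' implementation: the paper describes the mechanism only conceptually, so `capture_trace`, `trace_question`, and the toy submission `student_sum` are hypothetical names, and the use of `sys.settrace` is an assumption about how runtime facts might be collected.

```python
import random
import sys


def capture_trace(func, *args):
    """Run func and record (line offset, snapshot of locals) before each executed line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames other than the submission itself
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active before
    return result, steps


def trace_question(steps, variable, line_offset, rng):
    """Pick a random visit to one line and ask for a variable's value at that point."""
    visits = [snapshot for offset, snapshot in steps
              if offset == line_offset and variable in snapshot]
    if not visits:
        raise ValueError("the trace never reaches that line with that variable set")
    index = rng.randrange(len(visits))
    question = (f"On visit {index + 1} to line offset {line_offset}, "
                f"what is `{variable}` just before the line runs?")
    return question, visits[index][variable]  # the answer key comes from the trace


def student_sum(values):
    total = 0
    for v in values:
        total += v
    return total


# Offset 3 is the `total += v` line, counted from the `def` line (offset 0).
result, steps = capture_trace(student_sum, [3, 4, 5])
question, expected = trace_question(steps, "total", 3, random.Random())
print(question)
```

Because the answer key is read from the recorded trace rather than generated by a language model, the expected answer cannot be hallucinated; only the wording of the question would be left to a conversational agent.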

If this is right

  • Conversational probing becomes a scalable addition to existing automated assessment without displacing standard test suites.
  • Prompts grounded in runtime facts limit the scope for LLM hallucinations during feedback.
  • Knowledge tracking and scaffolded questions allow the system to adapt depth of inquiry to individual students.
  • Local-model and proctored deployment modes provide practical routes for addressing privacy and academic-integrity concerns.
  • Randomized trace questions and stepwise execution reasoning serve as concrete mechanisms to verify understanding beyond surface correctness.
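
The adaptation of question depth to individual students can be illustrated with a small sketch. The paper does not specify a knowledge-tracking model, so everything here is an assumption: `KnowledgeTracker` is a hypothetical class, the exponential moving average is a stand-in for whatever estimator the framework would use, and the mapping from mastery to Tier 1 (multiple choice) or Tier 2 (open-ended explanation) borrows only the tier names from the proof-of-concept figure captions.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeTracker:
    """Per-concept mastery estimate, updated after each answered question."""
    learning_rate: float = 0.5   # assumed value; how quickly new evidence counts
    prior: float = 0.5           # assumed starting estimate for unseen concepts
    mastery: dict[str, float] = field(default_factory=dict)

    def update(self, concept: str, correct: bool) -> float:
        """Move the estimate toward 1.0 on a correct answer and toward 0.0 otherwise."""
        previous = self.mastery.get(concept, self.prior)
        target = 1.0 if correct else 0.0
        self.mastery[concept] = previous + self.learning_rate * (target - previous)
        return self.mastery[concept]

    def next_tier(self, concept: str, threshold: float = 0.7) -> int:
        """Return 2 (open-ended) once mastery reaches the threshold, else 1."""
        return 2 if self.mastery.get(concept, self.prior) >= threshold else 1


tracker = KnowledgeTracker()
tracker.update("loops", correct=True)
print(tracker.next_tier("loops"))
```

A single threshold keeps the sketch short; a real deployment would presumably need calibration against student data, which the paper leaves to future work.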

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the framework works, programming courses could shift part of the grade toward demonstrated comprehension rather than submission alone.
  • Similar execution-anchored conversational layers might be adapted to verify understanding in other AI-assisted tasks such as writing or circuit design.
  • Widespread use would require APAS developers to expose runtime traces as a standard interface for third-party verification modules.
  • Over time the dual-agent setup could generate data on common misconceptions that instructors use to refine teaching materials.

Load-bearing premise

The guardrails, dual-agent design, and execution-tied prompts will keep LLM responses reliable and prevent over-reliance in real student use without creating new failure modes or excessive complexity.

What would settle it

A controlled classroom trial in which students first use an LLM to generate correct code they do not understand, then interact with the chatbot system, with independent human raters checking whether the chatbot accurately flags gaps in comprehension or is misled by hallucinations.
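
If such a trial were run, the comparison between chatbot flags and human ratings could be summarized with standard agreement measures. The sketch below is only an illustration of that arithmetic: `flag_agreement` is a hypothetical helper, the choice of precision, recall, and Cohen's kappa is an assumption rather than something the paper prescribes, and the eight labels are invented toy data, not results.

```python
def flag_agreement(chatbot_flags, rater_flags):
    """Compare chatbot 'gap in comprehension' flags against human rater labels."""
    pairs = list(zip(chatbot_flags, rater_flags))
    n = len(pairs)
    tp = sum(1 for c, r in pairs if c and r)
    fp = sum(1 for c, r in pairs if c and not r)
    fn = sum(1 for c, r in pairs if not c and r)
    tn = sum(1 for c, r in pairs if not c and not r)

    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else None
    return {"precision": precision, "recall": recall, "kappa": kappa}


# Invented toy labels: True means "this student shows a comprehension gap".
chatbot = [True, True, False, False, True, False, False, True]
raters = [True, False, False, False, True, False, True, True]
print(flag_agreement(chatbot, raters))
```

Low precision would suggest the chatbot is misled into flagging students who do understand their code, while low recall would suggest that LLM-generated explanations are slipping past it.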

Figures

Figures reproduced from arXiv: 2604.07304 by Eduard Frankford, Erik Cikalleshi, Ruth Breu.

Figure 1: Keyword groups and query template used to derive the five search string variants.
Figure 2: PRISMA-style flow diagram of identification, …
Figure 3: Hybrid Socratic Framework for chatbot-based assessment in APASs.
Figure 4: Proof of concept, Tier 1: scenario-based multiple …
Figure 5: Proof of concept, Tier 2: open-ended explanation …
Original abstract

Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports a saturation-based scoping review of conversational assessment approaches in programming education, identifying three architectural families (rule-based/template-driven, LLM-based, and hybrid systems) along with limitations around hallucinations, over-reliance, privacy, and integrity. It synthesizes these findings into a Hybrid Socratic Framework for APASs that combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails tied to runtime facts, plus practical safeguards such as proctored modes and local-model deployment.

Significance. If implemented and validated, the framework could offer a timely hybrid approach to verify code understanding in LLM-assisted programming education without replacing conventional testing. The scoping review provides a useful synthesis of existing work, but the proposal's significance is currently prospective given the absence of empirical validation or implementation details.

major comments (2)
  1. [Scoping Review Methods] The scoping review is described as saturation-based, but the manuscript provides no details on search strategy, databases, keywords, inclusion/exclusion criteria, or how saturation was determined (e.g., number of papers reviewed). This information is load-bearing for the reliability of the identified architectural families and limitations discussed in the first contribution.
  2. [Hybrid Socratic Framework] The Hybrid Socratic Framework (synthesis section) describes guardrails that tie prompts to runtime facts, dual-agent setup, and execution-tied prompts at a conceptual level only, with no architecture diagram, pseudocode, prompt templates, or toy example showing how execution states are captured and bound to agents. This is load-bearing for the central claim that the framework mitigates LLM hallucinations and over-reliance without introducing new failure modes.
minor comments (1)
  1. [Abstract] The abstract could include quantitative details from the scoping review (e.g., number of papers or saturation point) to strengthen the summary of findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will incorporate to improve clarity and rigor.

  1. Referee: [Scoping Review Methods] The scoping review is described as saturation-based, but the manuscript provides no details on search strategy, databases, keywords, inclusion/exclusion criteria, or how saturation was determined (e.g., number of papers reviewed). This information is load-bearing for the reliability of the identified architectural families and limitations discussed in the first contribution.

    Authors: We agree that the manuscript currently lacks explicit methodological details for the saturation-based scoping review, which limits assessment of the synthesis. This omission was unintentional. In the revised version, we will insert a dedicated Methods subsection that reports the search strategy (databases: ACM Digital Library, IEEE Xplore, ERIC, and Google Scholar; date range 2010–2024), keyword strings and Boolean operators, inclusion/exclusion criteria (peer-reviewed English-language studies on conversational agents or chatbots for programming education, excluding purely technical LLM papers without educational focus), screening process, final corpus size, and saturation criterion (thematic saturation declared after no new architectural families or limitations emerged following review of 48 papers). These additions will directly support the reliability of the three families and limitations identified. revision: yes

  2. Referee: [Hybrid Socratic Framework] The Hybrid Socratic Framework (synthesis section) describes guardrails that tie prompts to runtime facts, dual-agent setup, and execution-tied prompts at a conceptual level only, with no architecture diagram, pseudocode, prompt templates, or toy example showing how execution states are captured and bound to agents. This is load-bearing for the central claim that the framework mitigates LLM hallucinations and over-reliance without introducing new failure modes.

    Authors: We acknowledge that the framework description remains high-level and would benefit from concrete illustrations to substantiate the hallucination-mitigation claim. In revision we will add: (1) an architecture diagram depicting the dual-agent conversational layer, deterministic code-analysis module, knowledge tracker, and execution-state binding; (2) pseudocode for the guardrail procedure that extracts runtime facts (e.g., variable values, control-flow traces) and injects them into agent prompts; (3) sample prompt templates showing execution-tied scaffolding; and (4) a short toy example (e.g., a student-submitted loop with captured trace) demonstrating how the agents are constrained. These elements will clarify the intended safeguards while preserving the paper’s scope as a conceptual framework rather than a fully implemented system; we will also note that empirical testing of residual failure modes remains future work. revision: yes
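
The guardrail procedure the rebuttal promises is not shown in the paper, so the following is only a guess at its general shape, written as a runnable Python sketch. All names (`render_facts`, `build_grounded_prompt`, `ungrounded_numbers`, `review`) are hypothetical, no real LLM API is called, and the second agent is reduced to a deterministic check that a proposed question mentions no number absent from the code or the captured runtime facts.

```python
import re


def render_facts(facts):
    """Format captured runtime facts as the bullet list injected into the prompt."""
    return "\n".join(f"- {name}: {value!r}" for name, value in facts.items())


def build_grounded_prompt(source, facts):
    """Prompt for the questioning agent, restricted to the listed runtime facts."""
    return (
        "Ask one Socratic question about the student's code.\n"
        "Refer only to the runtime facts listed below.\n\n"
        f"Code:\n{source}\n\n"
        f"Runtime facts:\n{render_facts(facts)}\n"
    )


def ungrounded_numbers(question, source, facts):
    """Return numbers in the question that appear in neither the code nor the facts."""
    allowed = set(re.findall(r"\d+", source + "\n" + render_facts(facts)))
    return [n for n in re.findall(r"\d+", question) if n not in allowed]


def review(question, source, facts):
    """Stand-in for the second agent: accept only fully grounded questions."""
    return not ungrounded_numbers(question, source, facts)


# Toy example: a student-submitted loop and facts captured from one run on [3, 4, 5].
source = "total = 0\nfor v in values:\n    total += v"
facts = {"total after iteration 2": 7, "final total": 12}

print(build_grounded_prompt(source, facts))
print(review("Why is total 7 after iteration 2?", source, facts))
print(review("Why does total reach 15?", source, facts))
```

A numeric check of this kind catches only one class of hallucination; claims about control flow or variable names would need their own checks, and the paper leaves residual failure modes to future empirical work.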

Circularity Check

0 steps flagged

No circularity: literature review plus framework proposal with no derivations or self-referential reductions


The paper performs a saturation-based scoping review of conversational assessment approaches and synthesizes the results into a proposed Hybrid Socratic Framework combining deterministic analysis, dual-agent conversation, knowledge tracking, scaffolded questioning, and execution-tied guardrails. No equations, fitted parameters, predictions of related quantities, or self-definitional steps appear in the provided text or abstract. The central contribution is a high-level architectural synthesis rather than a derivation that reduces to its own inputs by construction. Any self-citations (if present) are not load-bearing for a mathematical or predictive claim, as the work contains no quantitative modeling or uniqueness theorems. The framework is offered as a complementary layer for APASs, with safeguards discussed at a conceptual level only.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about LLM behavior and educational needs identified in the review; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: LLMs can generate useful questions about code when prompts are grounded in runtime facts.
    Invoked in the description of guardrails and the dual-agent layer.
  • domain assumption: Conversational agents are promising for scalable feedback but have limitations around hallucinations and over-reliance.
    Stated directly in the abstract as motivation for the framework.


