Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
A hybrid framework adds conversational checks to automated programming systems to confirm students understand the code they submit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that conversational verification can be integrated into Automated Programming Assessment Systems through a Hybrid Socratic Framework that pairs deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie every prompt to concrete runtime facts. Drawing on a scoping review of prior systems, the framework incorporates practical safeguards including proctored modes, randomized trace questions, stepwise reasoning linked to execution states, and local deployment choices to limit hallucinations, over-reliance, privacy risks, and integrity issues. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
What carries the argument
The Hybrid Socratic Framework, which layers a dual-agent conversational verification system over deterministic runtime code analysis and anchors all questions to execution traces.
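The paper describes execution-anchored prompting only conceptually, so the following is a minimal sketch of what "anchoring a question to an execution trace" might look like, not the authors' implementation. It uses Python's standard `sys.settrace` hook to record local variables at each executed line of a toy submission. The names `capture_trace`, `build_grounded_prompt`, and `student_sum`, and the prompt wording, are hypothetical and chosen for illustration.

```python
import sys


def capture_trace(func, *args):
    """Run func(*args) and record (line offset, locals snapshot) per executed line.

    Each snapshot is taken just BEFORE the line at that offset executes.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the submitted function itself, not anything it calls.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def build_grounded_prompt(steps, index):
    """Build a prompt (hypothetical wording) that lists only observed facts."""
    offset, local_vars = steps[index]
    facts = ", ".join(
        f"{name} = {value!r}" for name, value in sorted(local_vars.items())
    )
    return (
        "Ask the student ONE question about the program state below. "
        "Do not mention any value that is not listed.\n"
        f"Line offset: {offset}\n"
        f"Observed locals: {facts}"
    )


# Toy stand-in for a student submission (an assumption for this sketch).
def student_sum(values):
    total = 0
    for v in values:
        total += v
    return total


result, steps = capture_trace(student_sum, [2, 3])
prompt = build_grounded_prompt(steps, len(steps) - 1)
```

Here the last recorded step is the `return` line, so the prompt handed to a conversational agent would contain `total = 5` and nothing the program did not actually compute. A real APAS would also need sandboxing and time limits, which this sketch omits.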
If this is right
- Conversational probing becomes a scalable addition to existing automated assessment without displacing standard test suites.
- Prompts grounded in runtime facts limit the scope for LLM hallucinations during feedback.
- Knowledge tracking and scaffolded questions allow the system to adapt depth of inquiry to individual students.
- Local-model and proctored deployment modes provide practical routes for addressing privacy and academic-integrity concerns.
- Randomized trace questions and stepwise execution reasoning serve as concrete mechanisms to verify understanding beyond surface correctness.
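As an illustration of the last point, a randomized trace question can be generated and graded without any LLM in the loop. The sketch below assumes a toy summation loop and hypothetical helper names (`running_totals`, `make_trace_question`, `check_answer`); the paper does not specify how such questions are produced.

```python
import random


def running_totals(values):
    """Deterministically record the loop state after each iteration."""
    total = 0
    states = []
    for i, v in enumerate(values):
        total += v
        states.append({"iteration": i + 1, "v": v, "total": total})
    return states


def make_trace_question(states, rng):
    """Pick one recorded state at random and ask for the value of total."""
    state = rng.choice(states)
    question = (
        f"After iteration {state['iteration']} of the loop, "
        "what is the value of total?"
    )
    return question, state["total"]


def check_answer(reply, expected):
    """Grade the reply against the recorded value; non-numeric replies fail."""
    try:
        return int(reply.strip()) == expected
    except ValueError:
        return False


states = running_totals([2, 3, 4])
rng = random.Random(0)  # seeded only so the sketch is reproducible
question, expected = make_trace_question(states, rng)
```

Because the expected answer comes from the recorded execution rather than from a model, grading stays deterministic, and drawing the step at random means two students with the same submission are unlikely to see the same question.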
Where Pith is reading between the lines
- If the framework works, programming courses could shift part of the grade toward demonstrated comprehension rather than submission alone.
- Similar execution-anchored conversational layers might be adapted to verify understanding in other AI-assisted tasks such as writing or circuit design.
- Widespread use would require APAS developers to expose runtime traces as a standard interface for third-party verification modules.
- Over time the dual-agent setup could generate data on common misconceptions that instructors use to refine teaching materials.
Load-bearing premise
The guardrails, dual-agent design, and execution-tied prompts will keep LLM responses reliable and prevent over-reliance in real student use without creating new failure modes or excessive complexity.
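One way to make part of this premise testable is a mechanical check that a generated question mentions no value absent from the recorded runtime facts. The sketch below is an assumption about how such a guardrail could work, not the paper's design: the function names are hypothetical, and it covers only non-negative integer literals, so it would miss hallucinated strings, floats, or control-flow claims.

```python
import re


def allowed_numbers(facts):
    """Collect the integer values present in the recorded runtime facts."""
    return {
        str(value)
        for value in facts.values()
        if isinstance(value, int) and not isinstance(value, bool)
    }


def question_is_grounded(question, facts):
    """Return True if every standalone integer in the question is a recorded fact."""
    mentioned = set(re.findall(r"\b\d+\b", question))
    return mentioned <= allowed_numbers(facts)


# Hypothetical runtime facts captured from one execution step.
facts = {"total": 5, "v": 3}

ok = question_is_grounded("Why is total equal to 5 when v is 3?", facts)
bad = question_is_grounded("Why does total become 7 here?", facts)
```

In a dual-agent arrangement, a check like this could sit between the question-generating agent and the student, rejecting or regenerating any question that fails. Whether such filters are sufficient in real use is exactly what the premise leaves open.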
What would settle it
A controlled classroom trial in which students first use an LLM to generate correct code they do not understand, then interact with the chatbot system, with independent human raters checking whether the chatbot accurately flags gaps in comprehension or is misled by hallucinations.
Original abstract
Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a saturation-based scoping review of conversational assessment approaches in programming education, identifying three architectural families (rule-based/template-driven, LLM-based, and hybrid systems) along with limitations around hallucinations, over-reliance, privacy, and integrity. It synthesizes these findings into a Hybrid Socratic Framework for APASs that combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails tied to runtime facts, plus practical safeguards such as proctored modes and local-model deployment.
Significance. If implemented and validated, the framework could offer a timely hybrid approach to verify code understanding in LLM-assisted programming education without replacing conventional testing. The scoping review provides a useful synthesis of existing work, but the proposal's significance is currently prospective given the absence of empirical validation or implementation details.
major comments (2)
- [Scoping Review Methods] The scoping review is described as saturation-based, but the manuscript provides no details on search strategy, databases, keywords, inclusion/exclusion criteria, or how saturation was determined (e.g., number of papers reviewed). This information is load-bearing for the reliability of the identified architectural families and limitations discussed in the first contribution.
- [Hybrid Socratic Framework] The Hybrid Socratic Framework (synthesis section) describes guardrails that tie prompts to runtime facts, dual-agent setup, and execution-tied prompts at a conceptual level only, with no architecture diagram, pseudocode, prompt templates, or toy example showing how execution states are captured and bound to agents. This is load-bearing for the central claim that the framework mitigates LLM hallucinations and over-reliance without introducing new failure modes.
minor comments (1)
- [Abstract] The abstract could include quantitative details from the scoping review (e.g., number of papers or saturation point) to strengthen the summary of findings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will incorporate to improve clarity and rigor.
Point-by-point responses
- Referee: [Scoping Review Methods] The scoping review is described as saturation-based, but the manuscript provides no details on search strategy, databases, keywords, inclusion/exclusion criteria, or how saturation was determined (e.g., number of papers reviewed). This information is load-bearing for the reliability of the identified architectural families and limitations discussed in the first contribution.
Authors: We agree that the manuscript currently lacks explicit methodological details for the saturation-based scoping review, which limits assessment of the synthesis. This omission was unintentional. In the revised version, we will insert a dedicated Methods subsection that reports the search strategy (databases: ACM Digital Library, IEEE Xplore, ERIC, and Google Scholar; date range 2010–2024), keyword strings and Boolean operators, inclusion/exclusion criteria (peer-reviewed English-language studies on conversational agents or chatbots for programming education, excluding purely technical LLM papers without educational focus), screening process, final corpus size, and saturation criterion (thematic saturation declared after no new architectural families or limitations emerged following review of 48 papers). These additions will directly support the reliability of the three families and limitations identified. revision: yes
- Referee: [Hybrid Socratic Framework] The Hybrid Socratic Framework (synthesis section) describes guardrails that tie prompts to runtime facts, dual-agent setup, and execution-tied prompts at a conceptual level only, with no architecture diagram, pseudocode, prompt templates, or toy example showing how execution states are captured and bound to agents. This is load-bearing for the central claim that the framework mitigates LLM hallucinations and over-reliance without introducing new failure modes.
Authors: We acknowledge that the framework description remains high-level and would benefit from concrete illustrations to substantiate the hallucination-mitigation claim. In revision we will add: (1) an architecture diagram depicting the dual-agent conversational layer, deterministic code-analysis module, knowledge tracker, and execution-state binding; (2) pseudocode for the guardrail procedure that extracts runtime facts (e.g., variable values, control-flow traces) and injects them into agent prompts; (3) sample prompt templates showing execution-tied scaffolding; and (4) a short toy example (e.g., a student-submitted loop with captured trace) demonstrating how the agents are constrained. These elements will clarify the intended safeguards while preserving the paper’s scope as a conceptual framework rather than a fully implemented system; we will also note that empirical testing of residual failure modes remains future work. revision: yes
Circularity Check
No circularity: literature review plus framework proposal with no derivations or self-referential reductions
full rationale
The paper performs a saturation-based scoping review of conversational assessment approaches and synthesizes the results into a proposed Hybrid Socratic Framework combining deterministic analysis, dual-agent conversation, knowledge tracking, scaffolded questioning, and execution-tied guardrails. No equations, fitted parameters, predictions of related quantities, or self-definitional steps appear in the provided text or abstract. The central contribution is a high-level architectural synthesis rather than a derivation that reduces to its own inputs by construction. Any self-citations (if present) are not load-bearing for a mathematical or predictive claim, as the work contains no quantitative modeling or uniqueness theorems. The framework is offered as a complementary layer for APASs, with safeguards discussed at a conceptual level only.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs can generate useful questions about code when prompts are grounded in runtime facts.
- Domain assumption: Conversational agents are promising for scalable feedback but have limitations around hallucinations and over-reliance.