arXiv: 2605.12988 · v1 · submitted 2026-05-13 · cs.AI · cs.CY · cs.IR

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram

Pith reviewed 2026-05-14 19:49 UTC · model grok-4.3

classification: cs.AI · cs.CY · cs.IR

keywords: intelligent tutoring systems · retrieval-augmented generation · algorithm tracing · Socratic feedback · simulated students · AI education · scaffolding

The pith

KITE uses retrieval from course materials and Socratic scaffolding to help simulated students give more accurate answers on algorithm tracing and procedural tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KITE as a retrieval-augmented tutoring system that pulls relevant course content and delivers intent-aware hints, guiding questions, and progressive scaffolding for students learning algorithm traces and problem solving. It tests the system with three evaluations: automatic metrics for how well responses stay grounded in the retrieved material, expert review of teaching quality, and a two-turn simulated-student loop in which a weaker model receives KITE feedback and then revises its answers. The simulated-student results show higher accuracy on follow-up procedural and tracing questions after KITE feedback. A sympathetic reader would care because the work offers a concrete way to scale personalized algorithmic support in AI courses without requiring constant human tutors.

Core claim

KITE employs a multimodal RAG pipeline to retrieve relevant information from course materials and pairs it with an intent-aware Socratic response strategy that produces targeted hints and progressive scaffolding. In the simulated-student evaluation, a weaker language model interacting with KITE across two-turn dialogues produced more accurate revised answers on procedural and tracing questions after receiving the feedback.
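
The pairing of retrieval with an intent-aware response strategy can be pictured with a small sketch. Everything concrete here is an assumption made for illustration, not a detail taken from the paper: the three course snippets, the intent labels (`tracing`, `conceptual`, `procedural`), the keyword rules, the prompt templates, and the bag-of-words cosine similarity that stands in for whatever multimodal retriever KITE actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical course snippets; a real system would index slides, notes, etc.
COURSE_CHUNKS = [
    "Breadth-first search expands the shallowest node first using a FIFO queue.",
    "Depth-first search expands the deepest node first using a LIFO stack.",
    "A* search orders the frontier by f(n) = g(n) + h(n).",
]

# Hypothetical intent labels and Socratic templates (not from the paper).
TEMPLATES = {
    "tracing": "Your notes say: '{ctx}' Which node does that rule pick next in your trace?",
    "conceptual": "Your notes say: '{ctx}' How would you restate that in your own words?",
    "procedural": "Your notes say: '{ctx}' What is the first step that rule implies here?",
}


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for a learned embedding."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    return sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)[:k]


def classify_intent(query):
    """Keyword rules standing in for an intent classifier."""
    q = query.lower()
    if "trace" in q or "step through" in q:
        return "tracing"
    if q.startswith(("why", "what is")):
        return "conceptual"
    return "procedural"


def tutor_reply(query):
    """Retrieve grounding text, then wrap it in an intent-specific guiding question."""
    intent = classify_intent(query)
    ctx = retrieve(query, COURSE_CHUNKS, k=1)[0]
    return intent, ctx, TEMPLATES[intent].format(ctx=ctx)


intent, ctx, reply = tutor_reply("Can you trace breadth-first search on this graph?")
print(intent)
print(reply)
```

The point of the design is that the reply quotes retrieved course text and ends in a question rather than an answer, so grounding and scaffolding are handled by separate, swappable steps.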

What carries the argument

KITE (Knowledge-Informed Tutoring Engine), a RAG-based tutoring architecture that retrieves course material and generates intent-aware Socratic responses to deliver scaffolding for algorithmic reasoning.

If this is right

KITE responses remain contextually grounded in the retrieved course materials.
The system produces pedagogically appropriate scaffolding for algorithmic tasks.
Simulated students generate more accurate follow-up responses after receiving KITE feedback on procedural and tracing questions.
The combined RAG and Socratic architecture supports scalable classroom assistance for algorithm problem-solving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the simulated pipeline generalizes, KITE could be embedded in online platforms to handle routine tutoring load in large AI courses.
The same retrieval-plus-Socratic pattern might apply to other procedural domains such as physics derivations or code debugging.
A natural next test would replace the weaker model with actual student interaction logs to measure real-time scaffolding effects.

Load-bearing premise

The two-turn simulated-student pipeline with a weaker language model accurately reflects how real human students would interpret and benefit from the tutoring feedback.
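
To make the premise concrete, the two-turn loop can be sketched as a small harness: the student answers unaided, the tutor responds, and the student revises. The sketch below is an illustration under stated assumptions. `toy_student` and `toy_tutor` are hypothetical deterministic stand-ins for the weaker language model and for KITE, and the two graph questions are invented; none of this reproduces the paper's prompts, data, or scoring.

```python
def run_two_turn(questions, student, tutor):
    """Return (unaided accuracy, revised accuracy) over a list of questions."""
    first_correct = 0
    revised_correct = 0
    for q in questions:
        first = student(q["prompt"], None)        # Round 1: no assistance
        feedback = tutor(q["prompt"], first)      # tutor reacts to the first answer
        revised = student(q["prompt"], feedback)  # Round 2: answer after feedback
        first_correct += int(first == q["answer"])
        revised_correct += int(revised == q["answer"])
    n = len(questions)
    return first_correct / n, revised_correct / n


# Invented questions; neighbours are visited in alphabetical order.
QUESTIONS = [
    {"prompt": "BFS visit order from A on edges A-B, A-C, B-D?", "answer": "A B C D"},
    {"prompt": "DFS visit order from A on edges A-B, A-C, B-D?", "answer": "A B D C"},
]


def toy_student(prompt, feedback):
    """Hypothetical weak student: always traces level by level unless nudged toward a stack."""
    if feedback and "stack" in feedback:
        return "A B D C"
    return "A B C D"


def toy_tutor(prompt, answer):
    """Hypothetical tutor: replies with a guiding question, never the answer itself."""
    if prompt.startswith("DFS"):
        return "Which data structure does DFS use: a queue or a stack?"
    return "Check which node leaves the queue first."


before, after = run_two_turn(QUESTIONS, toy_student, toy_tutor)
print(before, after)
```

Because the toy student is scripted to respond to the hint, the lift it shows says nothing about human learners, which is exactly the gap the premise points at.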

What would settle it

A controlled study in which real students use KITE versus a no-feedback control and show no measurable gain in accuracy on follow-up algorithmic tracing and procedural questions.

Figures

Figures reproduced from arXiv: 2605.12988 by Arto Hellas, Aum Pandya, Bita Akram, Griffin Pitts, Juho Leinonen, Mragisha Jain, Narges Norouzi, Peter Brusilovsky, Tirth Bhatt.

**Figure 1.** KITE architecture. … semantic retrieval with an expert-validated knowledge graph and reports a 35% improvement in learning outcomes (d = 0.86) in a study of 76 students, though its reliance on manual expert validation limits scalability. AutoTA (Dahal et al., 2025) provides a related approach to intent-aware educational assistance by classifying student queries and routing them to specialized response str…

**Figure 2.** Evaluation pipeline. Simulated Student Pipeline: Building on prior work that uses simulated student-tutor interactions to evaluate pedagogical support (Dinucu-Jianu et al., 2025), we use Meta-Llama-3.1-70B-Instruct as a proxy student in a structured interaction with KITE: (1) Round 1: The student model answers each question without assistance, establishing an unaided baseline. (2) KITE Feedback: KITE evaluate…

Original abstract

Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students' algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE's feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents KITE, a retrieval-augmented generation (RAG) intelligent tutoring system for algorithm tracing and procedural problem-solving in AI education. It combines a multimodal RAG pipeline over course materials with an intent-aware Socratic response strategy to deliver targeted hints and progressive scaffolding. Evaluation consists of RAGAs metrics for grounding and quality, expert pedagogical review, and a two-turn simulated-student pipeline in which a weaker LM receives KITE feedback and produces revised answers; the authors report that this feedback yields more accurate follow-up responses on tracing and procedural questions.

Significance. If the core claims hold after validation, the work supplies a concrete, course-aligned RAG architecture for scaffolding algorithmic reasoning together with a scalable simulation-based evaluation protocol. The emphasis on retrieval from instructor materials and Socratic intent detection addresses a practical gap in AI education tools that must stay faithful to specific curricula.

major comments (2)

[Simulated Student Pipeline] Simulated Student Pipeline section: the central claim that KITE scaffolding supports algorithmic problem-solving rests on accuracy gains observed when a weaker LM acts as the student in two-turn dialogues. No comparison of the simulated error distributions, revision rates, or sensitivity to hints against any human learner trace data is reported. Without such grounding, the measured lift cannot be extrapolated to real students whose misconceptions and uptake patterns may differ systematically.
[Evaluation Results] Evaluation Results and Abstract: positive outcomes are asserted for RAGAs metrics, expert review, and simulated dialogues, yet no numerical scores, effect sizes, baselines, or statistical controls appear. This absence prevents assessment of practical significance or comparison with prior tutoring systems.
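
As an illustration of the kind of paired reporting the second comment asks for, the sketch below computes pre/post accuracy and an exact McNemar test on per-question correctness. The choice of McNemar's test is an assumption made here, since the paper's statistical procedure is not described on this page, and the `BEFORE` and `AFTER` vectors are made-up data, not results from the paper.

```python
from math import comb


def mcnemar_exact(before, after):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p) where b counts items that went correct -> wrong,
    c counts items that went wrong -> correct, and p is the binomial
    p-value under the null that both kinds of change are equally likely.
    """
    b = sum(1 for x, y in zip(before, after) if x and not y)
    c = sum(1 for x, y in zip(before, after) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of change
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up per-question correctness (1 = correct) for eight questions.
BEFORE = [0, 0, 0, 0, 0, 0, 1, 1]
AFTER = [1, 1, 1, 1, 1, 1, 1, 0]

acc_before = sum(BEFORE) / len(BEFORE)
acc_after = sum(AFTER) / len(AFTER)
b, c, p = mcnemar_exact(BEFORE, AFTER)
print(f"accuracy {acc_before:.3f} -> {acc_after:.3f}, regressions={b}, fixes={c}, p={p:.3f}")
```

Reporting the two discordant counts alongside the accuracy delta shows how many answers the feedback fixed and how many it broke, which a single lift figure hides.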

minor comments (2)

[Abstract] Abstract: the summary of results would be strengthened by including at least one concrete RAGAs or accuracy figure rather than qualitative statements such as 'contextually grounded and pedagogically appropriate'.
[System Architecture] Notation and figures: ensure that the intent classifier and retrieval pipeline are diagrammed with explicit data-flow arrows so readers can trace how course materials constrain generated hints.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments below, providing clarifications on the evaluation design while committing to revisions that improve transparency and acknowledge limitations.

Point-by-point responses

Referee: [Simulated Student Pipeline] Simulated Student Pipeline section: the central claim that KITE scaffolding supports algorithmic problem-solving rests on accuracy gains observed when a weaker LM acts as the student in two-turn dialogues. No comparison of the simulated error distributions, revision rates, or sensitivity to hints against any human learner trace data is reported. Without such grounding, the measured lift cannot be extrapolated to real students whose misconceptions and uptake patterns may differ systematically.

Authors: We agree that human learner trace data would provide stronger external validity for extrapolating results to real students. The simulated pipeline was designed as a controlled, scalable proxy to isolate the effect of KITE's feedback on answer revision accuracy under repeatable conditions, using a weaker LM to model typical student errors on tracing and procedural tasks. We have revised the manuscript to explicitly state this as a limitation, include a new subsection discussing potential differences in human uptake patterns, and outline future work involving human-subject studies. The reported gains remain valid evidence that the scaffolding improves performance within the simulated setting.

Revision: partial

Referee: [Evaluation Results] Evaluation Results and Abstract: positive outcomes are asserted for RAGAs metrics, expert review, and simulated dialogues, yet no numerical scores, effect sizes, baselines, or statistical controls appear. This absence prevents assessment of practical significance or comparison with prior tutoring systems.

Authors: We acknowledge that the absence of specific numbers in the abstract and summary sections limits immediate assessment of effect sizes and comparisons. The full evaluation section of the manuscript does report RAGAs scores, expert ratings, and accuracy deltas from the simulated dialogues, but these were not highlighted with baselines or statistical details. In the revised manuscript we have added a dedicated results table with all numerical values (including RAGAs faithfulness/relevance scores, expert pedagogical ratings on a 1-5 scale, pre/post accuracy percentages with standard deviations, and p-values from paired tests), plus explicit baselines using non-RAG and non-Socratic variants. This will enable direct comparison with prior systems.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation via independent simulation and external metrics

Full rationale

The paper presents KITE as a RAG-based tutoring system and evaluates response quality with RAGAs metrics, expert pedagogical review, and a two-turn simulated-student interaction using a separate weaker LM. The reported accuracy lift in follow-up answers is an observed outcome of that interaction pipeline, not a quantity derived from the system's own definitions or fitted parameters. No equations, self-citations, or uniqueness claims reduce the central result to its inputs by construction. The simulation serves as an external test harness rather than a tautological prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about RAG retrieval quality and LLM response generation rather than new fitted parameters or invented entities.

axioms (2)

domain assumption: Retrieval from course materials will produce responses aligned with intended curriculum content.
Invoked to justify the multimodal RAG pipeline keeping answers contextually grounded.
domain assumption: Socratic hints and progressive scaffolding improve algorithmic problem-solving ability.
Core pedagogical premise underlying the intent-aware response strategy.

Reference graph

Works this paper leans on

33 extracted references · 4 canonical work pages · 1 internal anchor

[1] Automating pedagogical evaluation of LLM-based conversational agents. CEUR Workshop Proceedings, 2025.
[2] Exploring the assistance dilemma in experiments with cognitive tutors. Educational Psychology Review, 2007.
[3] A Survey of LLM-Based Applications in Programming Education: Balancing Automation and Human Oversight. Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP).
[4] Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators. arXiv preprint arXiv:2604.01114.
[5] What Drives Students' Use of AI Chatbots? Technology Acceptance in Conversational AI. arXiv preprint arXiv:2602.20547.
[6] Cognitive apprenticeship: Teaching the crafts of reading, writing, and mathematics. In: Knowing, Learning, and Instruction: Essays in Honor of Robert Glaser, 1989.
[7] AutoTA: A dynamic intent-based virtual teaching assistant for students using open source LLMs. IEEE Access.
[8] Goals and strategies of a problem-based learning facilitator. Interdisciplinary Journal of Problem-Based Learning.
[9] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
[10] Carbonell, Jaime and Goldstein, Jade. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1998. doi:10.1145/290941.291025
[11] Robertson, Stephen and Zaragoza, Hugo. Found. Trends Inf. Retr., April 2009. doi:10.1561/1500000019
[12] Evaluation of RAG Metrics for Question Answering in the Telecom Domain. 2024.
[13] Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.
[14] OpenAI Embeddings Guide.
[15] ChatGPT effects on cognitive skills of undergraduate students: Receiving instant responses from AI-based conversational large language models (LLMs). Computers and Education: Artificial Intelligence, 2024.
[16] Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.
[17] From problem-solving to teaching problem-solving: Aligning LLMs with pedagogy using reinforcement learning. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[18] Student Perspectives on the Benefits and Risks of AI in Education. 2025 ASEE Annual Conference & Exposition, 2025.
[19] Students' reliance on AI in higher education: identifying contributing factors. International Conference on Human-Computer Interaction, 2025.
[20] An innovative Socratic method-based artificial intelligence platform for healthcare education: A quasi-experimental study. Nurse Education in Practice, 2026.
[21] AutoTutor: A simulation of a human tutor. Cognitive Systems Research, 1999.
[22] How to build an adaptive AI tutor for any course using knowledge graph-enhanced retrieval-augmented generation (KG-RAG). 2025 14th International Conference on Educational and Information Technology (ICEIT), 2025.
[23] LPITutor: an LLM based personalized intelligent tutoring system using RAG and prompt engineering. PeerJ Computer Science, 2025.
[24] Analyzing pedagogical quality and efficiency of LLM responses with TA feedback to live student questions. Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1.
[25] A generative AI teaching assistant for personalized learning in medical education. NPJ Digital Medicine, 2025.
[26] An LLM-driven chatbot in higher education for databases and information systems. IEEE Transactions on Education, 2024.
[27] Retrieval-augmented generation for educational application: A systematic survey. Computers and Education: Artificial Intelligence, 2025.
[28] KAG: A Scalable Knowledge-Augmented Generation System for Educational Content Management. 2025 3rd International Conference on Foundation and Large Language Models (FLLM), 2025.
[29] LeanTutor: Towards a Verified AI Mathematical Proof Tutor. Proceedings of the AAAI Conference on Artificial Intelligence.
[30] EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants. Proceedings of the AAAI Conference on Artificial Intelligence.
[31] Ragas: Automated evaluation of retrieval augmented generation. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
[32] M. Antal, K. Buza. Evaluating open-source LLMs in RAG systems: a benchmark on diploma theses abstracts using ragas. Acta Universitatis Sapientiae, Informatica, 2025.
[33] Enhancing Knowledge Tracing with Multi-hierarchy Hypergraph Adaptive Knowledge Transfer. ACM Transactions on Information Systems.