
Dean of LLM Tutors: A Framework for Automated Quality Review of AI-generated Feedback

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Large language model (LLM) tutors are increasingly used to generate educational feedback, but existing research has focused mainly on feedback generation rather than feedback evaluation. As a result, LLM-generated feedback may offer limited pedagogical value and carry risks of hallucination. The current study introduces DeanLLM, an automated review framework for comprehensively evaluating feedback generated by LLM tutors before it is shared with students. We developed a 16-dimension evaluation framework covering feedback content, educational effectiveness, and hallucination risks, and validated it using human-expert annotations of LLM-generated tutor feedback on synthetic computer science assignment submissions derived from real coursework. We then examined whether LLMs could serve as automated reviewers of LLM-generated tutor feedback, and used the best-performing reviewer to benchmark tutor feedback generated by 10 commercial LLMs. Psychometric analyses supported the reliability of the proposed framework and showed that human reviewers tended to evaluate feedback holistically, whereas the LLM reviewer separated rubric dimensions more mechanically. Standard zero-shot and few-shot prompting showed limited agreement with human experts for content-quality judgments. Supervised fine-tuning of GPT-4.1 with human-labelled examples containing scores only, without explanatory rationales, achieved the strongest alignment with expert judgments. Reasoning LLMs were particularly effective at hallucination detection and produced automated tutor feedback with stronger educational effectiveness and factuality than lightweight models. The findings indicate that DeanLLM offers a scalable way to automatically improve the reliability and safety of LLM tutor feedback, while also demonstrating that reviewer calibration and model choice remain critical for educational deployment.

fields

cs.CL 1

years

2026 1

verdicts

UNVERDICTED 1
