LLM tutors leak answers under adversarial student attacks, but a fine-tuned jailbreak agent and simple defenses can benchmark and improve robustness.
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
citing papers explorer
- Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks
LLM tutors leak answers under adversarial student attacks, but a fine-tuned jailbreak agent and simple defenses can benchmark and improve robustness.
- Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic judgment but inadequate for professional teaching standards.
- The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness
Behavioral signals from how students use AI tutor feedback in 10k code submissions reveal differences between tutors and correlate more strongly with perceived helpfulness than pedagogical quality alone.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
EduQwen 32B models optimized via RL then SFT set new SOTA on the Cross-Domain Pedagogical Knowledge Benchmark and surpass Gemini-3 Pro.