LLM tutors leak answers under adversarial student attacks, but a fine-tuned jailbreak agent and simple defenses can benchmark and improve robustness.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Reflector trains LLMs to internalize step-wise self-reflection through SFT on teacher data followed by RL with outcome and validity rewards, reporting over 90% defense success against indirect jailbreaks and a 5.85% gain on GSM8K.
citing papers explorer
-
Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks
LLM tutors leak answers under adversarial student attacks, but a fine-tuned jailbreak agent and simple defenses can benchmark and improve robustness.
-
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
Reflector trains LLMs to internalize step-wise self-reflection through SFT on teacher data followed by RL with outcome and validity rewards, reporting over 90% defense success against indirect jailbreaks and a 5.85% gain on GSM8K.