33rd USENIX Security Symposium (USENIX Security 24)
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
Reflector trains LLMs to internalize step-wise self-reflection through SFT on teacher data followed by RL with outcome and validity rewards, reporting over 90% defense success against indirect jailbreaks and a 5.85% gain on GSM8K.