33rd USENIX Security Symposium (USENIX Security 24)

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

Reflector trains LLMs to internalize step-wise self-reflection through SFT on teacher data followed by RL with outcome and validity rewards, reporting over 90% defense success against indirect jailbreaks and a 5.85% gain on GSM8K.

citing papers explorer

Showing 1 of 1 citing paper.

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

cs.LG · 2026-05-20 · unverdicted · none · ref 21