Findings of the Association for Computational Linguistics: NAACL 2025

Jailbreaking prompt attack: A controllable adversarial attack against diffusion models · 2025

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

cs.LG · 2026-05-20 · unverdicted · novelty 5.0 · ref 51

Reflector internalizes step-wise self-reflection in LLMs through teacher-guided SFT followed by RL with outcome and validity rewards. It claims over 90% defense success against indirect jailbreaks, plus utility gains such as 5.85% on GSM8K.

