An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
citing papers explorer
-
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
-
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
-
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.