Saro: Enhancing LLM safety through reasoning-based alignment. arXiv preprint arXiv:2504.09420
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.AI (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
representative citing papers
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
citing papers explorer
- Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
  A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
- Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
  Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
- Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
  CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.