Mitigating the safety alignment tax with null-space constrained policy optimization

Niu, Y · 2025 · arXiv 2512.11391

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

cs.CR · 2026-04-20 · unverdicted · novelty 6.0

Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

cs.LG · 2026-02-07 · unverdicted · novelty 5.0

ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.

citing papers explorer

Showing 3 of 3 citing papers.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unverdicted · none · ref 78 · 2 links
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks cs.CR · 2026-04-20 · unverdicted · none · ref 28
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control cs.LG · 2026-02-07 · unverdicted · none · ref 18
ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.

Mitigating the safety alignment tax with null-space constrained policy optimization

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer