Mitigating the safety alignment tax with null-space constrained policy optimization. arXiv preprint arXiv:2512.11391

Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization · 2025 · arXiv:2512.11391

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

cs.CR · 2026-04-20 · unverdicted · novelty 6.0

Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

cs.CL · 2026-06-24 · unverdicted · novelty 5.0

PolicyAlign aligns LLMs to natural-language safety policies by synthesizing violating instructions and performing on-policy self-distillation with policy-sensitive filtering, improving safety without high-quality supervision data.

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

cs.LG · 2026-02-07 · unverdicted · novelty 5.0

ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · none · ref 78 · 2 links
