Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
Pith reviewed 2026-05-22 11:12 UTC · model grok-4.3
The pith
ShaPO improves LLM safety alignment robustness by enforcing worst-case objectives through selective geometry control in an alignment-critical parameter subspace.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ShaPO is a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision.
What carries the argument
Selective geometry control over the alignment-critical parameter subspace: constraints are applied only where they are needed to enforce the worst-case objective, rather than uniformly across all parameters, which would over-regularize.
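One way to make this concrete is to read "geometry control" as a sharpness-aware (SAM-style) min-max objective whose perturbation is confined to the subspace. The formulation below is an illustrative assumption, not the paper's stated objective; the mask $m$, the radius $\rho$, the loss $\mathcal{L}$, and the index set $S$ are hypothetical symbols introduced here.

```latex
% Worst-case objective with the perturbation restricted to the subspace S
\min_{\theta}\; \max_{\substack{\|\epsilon\|_2 \le \rho \\ \epsilon = m \odot \epsilon}} \mathcal{L}(\theta + \epsilon),
\qquad m \in \{0,1\}^{d}, \quad m_j = 1 \iff j \in S

% First-order approximation of the inner maximizer, with g = \nabla_\theta \mathcal{L}(\theta)
\hat{\epsilon} = \rho \, \frac{m \odot g}{\|m \odot g\|_2}
```

Under this reading, uniform control is the special case $m = \mathbf{1}$, which recovers standard sharpness-aware minimization.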
If this is right
- ShaPO consistently improves safety robustness over popular preference optimization methods across diverse safety benchmarks and noisy preference settings.
- Token-level ShaPO stabilizes likelihood-based surrogate optimization.
- Reward-level ShaPO enforces reward-consistent optimization under noisy supervision.
- ShaPO composes cleanly with data-robust objectives, yielding additional gains.
Where Pith is reading between the lines
- If the critical subspace can be identified in other models, selective control might generalize beyond preference optimization to supervised fine-tuning or RLHF variants.
- Future work could test whether subspace identification itself needs to adapt dynamically during training to maintain gains under evolving distribution shifts.
- The clean composition with data-robust methods points to modular pipelines where geometry control and data filtering are combined for stronger overall alignment.
Load-bearing premise
An identifiable alignment-critical parameter subspace exists such that selective geometry constraints applied only to it avoid over-regularization and improve robustness under distribution shift, while uniform constraints do not.
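The simulated rebuttal further down describes locating the subspace by ranking parameters by the magnitude of safety-loss gradients on a held-out split. A minimal sketch of that kind of ranking follows, assuming per-example gradients are already available as plain lists; the function name `select_subspace_mask`, the `keep_fraction` threshold, and the toy gradient values are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def select_subspace_mask(
    grads: Sequence[Sequence[float]], keep_fraction: float
) -> List[int]:
    """Return a 0/1 mask over parameters.

    A parameter is kept (mask 1) if its mean absolute gradient over the
    held-out examples is among the top `keep_fraction` of all parameters.
    `grads` has one row per held-out example and one column per parameter.
    """
    if not grads:
        raise ValueError("grads must contain at least one example")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    num_params = len(grads[0])
    # One saliency score per parameter: mean |gradient| across examples.
    scores = [
        sum(abs(row[j]) for row in grads) / len(grads) for j in range(num_params)
    ]
    k = max(1, round(keep_fraction * num_params))
    ranked = sorted(range(num_params), key=lambda j: scores[j], reverse=True)
    chosen = set(ranked[:k])
    return [1 if j in chosen else 0 for j in range(num_params)]


# Toy held-out gradients (illustrative values): 3 examples, 5 parameters.
held_out_grads = [
    [0.1, -2.0, 0.0, 0.5, -0.2],
    [0.2, -1.5, 0.1, 0.4, 0.1],
    [0.0, -2.5, 0.0, 0.6, -0.3],
]
mask = select_subspace_mask(held_out_grads, keep_fraction=0.4)
```

The key property for the circularity concern is that `held_out_grads` would have to come from data disjoint from the evaluation benchmarks; the ranking itself is agnostic to where the gradients originate.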
What would settle it
An experiment showing that uniform geometry constraints across all parameters yield equal or greater robustness than selective control when tested on noisy preference data and domain-shifted safety benchmarks would falsify the central claim.
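The selective-versus-uniform comparison can be illustrated on a toy problem. The sketch below assumes the geometry control is a SAM-style perturbation step and restricts that perturbation with a 0/1 mask; it is not the paper's algorithm, and `masked_sam_step`, the quadratic loss, and all numeric settings are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def masked_sam_step(
    theta: Sequence[float],
    grad_fn: Callable[[Sequence[float]], Vector],
    mask: Sequence[int],
    rho: float,
    lr: float,
) -> Vector:
    """One sharpness-aware update whose perturbation lives only where mask == 1."""
    g = grad_fn(theta)
    g_masked = [gi * mi for gi, mi in zip(g, mask)]
    norm = math.sqrt(sum(v * v for v in g_masked))
    # First-order worst-case perturbation of radius rho inside the masked subspace.
    if norm == 0.0:
        eps = [0.0] * len(g)
    else:
        eps = [rho * v / norm for v in g_masked]
    # Gradient at the perturbed point drives the actual parameter update.
    g_adv = grad_fn([t + e for t, e in zip(theta, eps)])
    return [t - lr * ga for t, ga in zip(theta, g_adv)]


# Toy quadratic loss 0.5 * sum(c_j * theta_j^2): coordinate 0 is sharp, 1 is flat.
curvature = [10.0, 1.0]


def grad_fn(theta: Sequence[float]) -> Vector:
    return [c * t for c, t in zip(curvature, theta)]


theta0 = [1.0, 1.0]
selective = masked_sam_step(theta0, grad_fn, mask=[1, 0], rho=0.1, lr=0.05)
uniform = masked_sam_step(theta0, grad_fn, mask=[1, 1], rho=0.1, lr=0.05)
```

With the mask set to all ones the step reduces to ordinary SAM, so the same function serves as the uniform baseline with every other factor held fixed.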
Original abstract
Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. The code is available at https://github.com/liujilong0116/ShaPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ShaPO, a geometry-aware preference optimization framework for LLM safety alignment. It argues that robustness failures arise from optimization geometry rather than data uncertainty alone, and introduces selective geometry control applied only to an alignment-critical parameter subspace to enforce worst-case objectives while avoiding over-regularization from uniform constraints. The approach is instantiated at token level (stabilizing likelihood surrogates) and reward level (enforcing consistency under noise). Experiments across safety benchmarks and noisy preference settings report consistent gains over standard methods such as DPO, with clean composition when combined with data-robust objectives.
Significance. If the empirical claims are substantiated with a non-circular, reproducible subspace selection procedure and proper statistical controls, the work could meaningfully advance the field by shifting attention to optimization geometry as a complementary lever for alignment robustness. The clean composition result, if verified, would strengthen the case that selective rather than uniform constraints can improve robustness under distribution shift without sacrificing alignment performance.
major comments (2)
- §3 (Method, subspace identification): The central claim requires the existence of an identifiable alignment-critical parameter subspace S such that geometry constraints applied selectively to S avoid over-regularization and yield robustness gains, while uniform application does not. The manuscript provides no explicit, reproducible, pre-hoc criterion for locating S that is independent of the safety benchmarks used for final evaluation. If subspace selection relies on post-hoc gradient norms or validation performance on the same metrics, the selectivity argument is circular and the comparison to uniform constraints is non-falsifiable, directly undermining the load-bearing distinction from existing methods.
- §5 (Experiments): The reported consistent improvements and composition gains lack error bars, statistical significance tests, ablation studies isolating the effect of selectivity versus uniform constraints, and detailed descriptions of the noisy preference datasets and distribution-shift protocols. Without these, it is impossible to verify whether the robustness gains are attributable to the proposed selective geometry control or to other uncontrolled factors.
minor comments (2)
- Abstract and §1: The phrasing 'consistently improves' and 'composes cleanly' should be accompanied by forward references to the specific tables or figures that quantify the gains and composition effects.
- Notation: The distinction between token-level and reward-level ShaPO should be formalized with explicit equations or pseudocode early in the method section to improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments have identified important areas where additional clarity and rigor will strengthen the presentation of ShaPO. We address each major comment below and indicate the specific revisions we will make in the next version of the paper.
- Referee: §3 (Method, subspace identification): The central claim requires the existence of an identifiable alignment-critical parameter subspace S such that geometry constraints applied selectively to S avoid over-regularization and yield robustness gains, while uniform application does not. The manuscript provides no explicit, reproducible, pre-hoc criterion for locating S that is independent of the safety benchmarks used for final evaluation. If subspace selection relies on post-hoc gradient norms or validation performance on the same metrics, the selectivity argument is circular and the comparison to uniform constraints is non-falsifiable, directly undermining the load-bearing distinction from existing methods.
  Authors: We acknowledge the referee's concern regarding the reproducibility and independence of the subspace identification procedure. In the current manuscript, subspace S is identified by ranking parameters according to the magnitude of gradients of the safety alignment loss computed on a held-out validation split of the preference data that is disjoint from both the training set and the final evaluation benchmarks. To eliminate any perception of circularity, we will revise §3 to include an explicit, pre-hoc algorithm with pseudocode, specify the exact validation split size and selection threshold, and add an ablation demonstrating that the selected subspace differs from one derived using test-set performance. We will also expand the uniform-constraint baseline to apply identical geometry penalties over the full parameter space while keeping all other factors fixed, thereby making the selectivity distinction directly falsifiable. These changes will be incorporated in the revised manuscript.
  Revision: yes
- Referee: §5 (Experiments): The reported consistent improvements and composition gains lack error bars, statistical significance tests, ablation studies isolating the effect of selectivity versus uniform constraints, and detailed descriptions of the noisy preference datasets and distribution-shift protocols. Without these, it is impossible to verify whether the robustness gains are attributable to the proposed selective geometry control or to other uncontrolled factors.
  Authors: We agree that the experimental section requires additional statistical controls and documentation to substantiate the claims. In the revision we will: (i) report means and standard deviations over five independent random seeds for all main results and include error bars in figures; (ii) add statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) with p-values comparing ShaPO against baselines; (iii) include dedicated ablations that isolate selective versus uniform geometry control under identical training conditions; and (iv) expand the description of noisy preference dataset construction (noise injection mechanism, noise rates) and distribution-shift protocols (specific shift types and how they are generated). These additions will allow readers to attribute observed gains more confidently to the selective geometry mechanism.
  Revision: yes
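The seed-level comparison promised in point (ii) can be sketched with the standard library alone. The example below computes the mean paired difference and the paired t statistic for two methods run on the same seeds; the function name `paired_t_statistic` is hypothetical, and the score lists are made-up placeholder numbers for illustration, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    method_scores: Sequence[float], baseline_scores: Sequence[float]
) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired runs.

    Each index is assumed to be the same random seed for both methods.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    if len(method_scores) < 2:
        raise ValueError("need at least two paired runs")

    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat, len(diffs) - 1


# Placeholder safety scores over five seeds (illustrative only).
method = [0.80, 0.82, 0.79, 0.83, 0.81]
baseline = [0.76, 0.79, 0.77, 0.78, 0.78]
mean_diff, t_stat, dof = paired_t_statistic(method, baseline)
```

Turning the statistic into a p-value requires the t distribution with `dof` degrees of freedom; in practice `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` cover the two tests named in the rebuttal. With only five seeds the Wilcoxon test has very limited resolution, which is worth keeping in mind when reading such p-values.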
Circularity Check
Minor self-citation or assumption load, but the central proposal remains an independent algorithmic change.
The paper presents ShaPO as a geometry-aware preference optimization framework that applies selective constraints to an alignment-critical parameter subspace. No equations are shown in the abstract or described claims that reduce the reported robustness gains to a fitted quantity or self-defined metric by construction. The existence of the subspace is treated as an identifiable modeling choice rather than derived from the final performance numbers. While the skeptic notes potential dependence on how the subspace is located, the provided text does not exhibit a specific reduction (e.g., S chosen via the same safety benchmarks used for evaluation) that would qualify as circular under the strict quoting requirement. The derivation chain is therefore largely self-contained with only minor assumption load.
Axiom & Free-Parameter Ledger
free parameters (1)
- alignment-critical subspace identification
axioms (1)
- domain assumption: Robustness failures cannot be addressed by data-centric methods alone
Reference graph
Works this paper leans on
- [1] Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana, E. S., Jenner, E., Casper, S., Sourbut, O., et al. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932.
- [2] Bahri, D., Mobahi, H., and Tay, Y. Sharpness-aware minimization improves language model generalization. arXiv preprint arXiv:2110.08529.
- [3] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609.
- [4] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Gangu...
- [5] Deng, X., Zhong, H., Ai, R., Feng, F., Wang, Z., and He, X. Less is more: Improving llm alignment via preference data selection. arXiv preprint arXiv:2502.14560.
- [6] Ferrand, J.-C. N., Beugin, Y., Pauley, E., Sheatsley, R., and McDaniel, P. Targeting alignment: Extracting safety classifiers of aligned llms. arXiv preprint arXiv:2501.16534.
- [7] Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.
- [8] Gao, C., Li, H., Liu, L., Xie, Z., Zhao, P., and Xu, Z. Principled data selection for alignment: The hidden risks of difficult examples. arXiv preprint arXiv:2502.09650.
- [9] Gao, Y., Alon, D., and Metzler, D. Impact of preference noise on the alignment performance of generative language models. arXiv preprint arXiv:2404.09824.
- [10] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [11] Huang, K., Wu, J., Chen, Z., Wang, X., Gao, J., Ding, B., Wu, J., He, X., and Wang, X. Larger or smaller reward margins to select preferences for alignment? arXiv preprint arXiv:2503.01864.
- [12] AI Alignment: A Comprehensive Survey
  Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., Chen, B., Sun, R., Wang, Y., and Yang, Y. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36:24678–24704, 2023a. Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., Chen, B., Sun, R., Wang, Y., and Yang, ...
- [13] Lee, A., Bai, X., Pres, I., Wattenberg, M., Kummerfeld, J. K., and Mihalcea, R. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967.
- [14] Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., and Shao, J. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044.
- [15] Li, M., Huzhang, G., Zhang, H., Wang, X., and Zeng, A. Optimal transport-based token weighting scheme for enhanced preference optimization. arXiv preprint arXiv:2505.18720.
- [16] Lu, H., Fang, L., Zhang, R., Li, X., Cai, J., Cheng, H., Tang, L., Liu, Z., Sun, Z., Wang, T., et al. Alignment and safety in large language models: Safety mechanisms, training paradigms, and emerging challenges. arXiv preprint arXiv:2507.19672.
- [17] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
- [18] Niu, Y., Xiao, H., Liu, D., Chen, N., and Li, J. Mitigating the safety alignment tax with null-space constrained policy optimization. arXiv preprint arXiv:2512.11391.
- [19] Perin, G. J., Chen, R., Chen, X., Hirata, N. S., Wang, Z., and Hong, J. Lox: Low-rank extrapolation robustifies llm safety against fine-tuning. arXiv preprint arXiv:2506.15606.
- [20] Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946.
- [21] Raghavendra, M., Nath, V., and Hendryx, S. Revisiting the superficial alignment hypothesis. arXiv preprint arXiv:2410.03717.
- [22] Roy, A., Patel, J., Chadha, A., Jain, V., and Das, A. Alignmerge-alignment-preserving large language model merging via fisher-guided geometric constraints. arXiv preprint arXiv:2512.16245.
- [23] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [24] Wang, F., Xu, P., Ruan, W., and Huang, X. Towards verifying the geometric robustness of large-scale neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp. 15197–15205, 2023a. Wang, Y., Li, H., Han, X., Nakov, P., and Baldwin, T. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arX...
- [25] Wu, J., Xie, Y., Yang, Z., Wu, J., Chen, J., Gao, J., Ding, B., Wang, X., and He, X. Towards robust alignment of language models: Distributionally robustifying direct preference optimization. arXiv preprint arXiv:2407.07880.
- [26] Xu, Z., Vemuri, S., Panaganti, K., Kalathil, D., Jain, R., and Ramachandran, D. Robust llm alignment via distributionally robust direct preference optimization. arXiv preprint arXiv:2502.01930.
- [27]
- [28] Yun, J. Sharpness-aware minimization with z-score gradient filtering for neural networks. arXiv preprint arXiv:2505.02369.
- [29] Zhang, Y., Xiong, G., Li, H., and Zhao, W. Edge: Efficient data selection for llm agents via guideline effectiveness. arXiv preprint arXiv:2502.12494.
- [30] Zhao, X., Cai, W., Shi, T., Huang, D., Lin, L., Mei, S., and Song, D. Improving llm safety alignment with dual-objective optimization. arXiv preprint arXiv:2503.03710, 2025a. Zhao, Y., Zhang, W., Xie, Y., Goyal, A., Kawaguchi, K., and Shieh, M. Understanding and enhancing safety mechanisms of llms via safety-specific neuron. In The Thirteenth Interna...
- [31] Zhu, M., Liu, Y., Guo, J., Wang, Q., Zhang, Y., and Mao, Z. Leveraging robust optimization for llm alignment under distribution shifts. arXiv preprint arXiv:2504.05831.
- [32] A. Algorithm and Optimization Details. Algorithm 1 ShaPO: Sharpness-aware Preference Optimization. Require: Preference dataset $D = \{(x_i, y_i^w, y_i^l)\}_{i=1}^{N}$, and probe training dataset $D_p = \{(x_i, y_i)\}_{i=1}^{M}$; initial policy model $\pi_\theta$, reference $\pi_{\mathrm{ref}}$, reward model $R_\phi$ (only for reward-l...
- [33] rely on reward modeling and policy optimization, which incur high computational costs and can be sensitive to noisy supervision (Gao et al., 2023). To simplify the pipeline, Direct Preference Optimization (DPO) (Rafailov et al.,
- [34] reframes safety alignment as supervised learning on preference pairs, achieving competitive instruction-following and safety behaviors without explicit reward models or reinforcement learning. Building on this paradigm, recent work explores alternative formulations including group-relative or multi-objective optimization (Shao et al., 2024; Guo et al., 20...
- [35] 12: SaladBench focuses on compositional and obfuscated harms, where unsafe intent is embedded within multi-step or indirect queries. This benchmark evaluates whether models can maintain safety under more subtle or context-dependent threat scenarios. Footnote 8: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K Footnote 9: https://huggingface.co/datasets/asparius/a...
- [36] [Figure residue: histograms with y-axis "Count"; panel titles "Reward-score distribution", "Sigmoid distribution ( =0.1)", "Sigmoid distribution ( =1)", and a further panel truncated as "Sigmoid distri..."; the parameter symbol in the panel titles was lost in extraction.]