arxiv: 2602.07340 · v2 · pith:TNUXS2DUnew · submitted 2026-02-07 · 💻 cs.LG

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Pith reviewed 2026-05-22 11:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM safety alignment · preference optimization · robustness · geometry control · distribution shift · noisy supervision · alignment subspace

The pith

ShaPO improves LLM safety alignment robustness by enforcing worst-case objectives through selective geometry control in an alignment-critical parameter subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that safety alignment in large language models breaks down under domain shifts and noisy preferences not only because of data issues but also due to how optimization shapes the model's parameter space. Existing methods apply geometry constraints uniformly, which can over-regularize and hurt generalization. ShaPO instead identifies an alignment-critical subspace and applies selective constraints there to enforce robust worst-case alignment while leaving other parameters freer. This leads to better performance on safety benchmarks with noise and shifts, and the method works together with data-focused robustness techniques.

Core claim

ShaPO is a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision.

What carries the argument

Selective geometry control over the alignment-critical parameter subspace: constraints are applied only where they are needed to enforce the worst-case objective, rather than uniformly across all parameters.
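
To make the mechanism concrete, here is a minimal sketch of one worst-case update with the perturbation restricted to a parameter subset. It assumes a sharpness-aware-style first-order approximation (perturb along the masked gradient, then descend using the gradient at the perturbed point). The toy quadratic `loss`, the hand-picked `mask`, and the values of `rho` and `lr` are illustrative assumptions, not the paper's implementation.

```python
import math

def loss(theta):
    # Toy stand-in for an alignment loss: a quadratic bowl, steeper in coordinate 0.
    return 0.5 * (10.0 * theta[0] ** 2 + theta[1] ** 2 + theta[2] ** 2)

def grad(theta):
    # Analytic gradient of the toy loss above.
    return [10.0 * theta[0], theta[1], theta[2]]

def selective_perturbation(g, mask, rho):
    """First-order worst-case perturbation of norm rho, restricted to the masked coordinates."""
    masked = [gi * mi for gi, mi in zip(g, mask)]
    norm = math.sqrt(sum(v * v for v in masked))
    if norm == 0.0:
        return [0.0] * len(g)
    return [rho * v / norm for v in masked]

def selective_step(theta, mask, rho=0.05, lr=0.01):
    """Descend using the gradient evaluated at the selectively perturbed point."""
    eps = selective_perturbation(grad(theta), mask, rho)
    theta_adv = [t + e for t, e in zip(theta, eps)]
    g_adv = grad(theta_adv)
    return [t - lr * ga for t, ga in zip(theta, g_adv)]

theta = [1.0, 1.0, 1.0]
mask = [1.0, 0.0, 0.0]  # hypothetical: only coordinate 0 is treated as alignment-critical
eps = selective_perturbation(grad(theta), mask, rho=0.05)
new_theta = selective_step(theta, mask)
print(eps, new_theta)
```

Setting `mask` to all ones recovers the uniform variant, so the same function can serve both arms of a selective-versus-uniform comparison.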

If this is right

  • ShaPO consistently improves safety robustness over popular preference optimization methods across diverse safety benchmarks and noisy preference settings.
  • Token-level ShaPO stabilizes likelihood-based surrogate optimization.
  • Reward-level ShaPO enforces reward-consistent optimization under noisy supervision.
  • ShaPO composes cleanly with data-robust objectives, yielding additional gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the critical subspace can be identified in other models, selective control might generalize beyond preference optimization to supervised fine-tuning or RLHF variants.
  • Future work could test whether subspace identification itself needs to adapt dynamically during training to maintain gains under evolving distribution shifts.
  • The clean composition with data-robust methods points to modular pipelines where geometry control and data filtering are combined for stronger overall alignment.

Load-bearing premise

An identifiable alignment-critical parameter subspace exists such that selective geometry constraints applied only to it avoid over-regularization and improve robustness under distribution shift, while uniform constraints do not.
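
The premise can be probed with the kind of comparison the Figure 1 caption describes: a Top-K subset versus a Random-K subset of the same size. The sketch below builds both masks from per-parameter importance scores. The `scores` values are placeholders, and where the scores come from (probe weights, gradient magnitudes, or something else) is left open here as an assumption.

```python
import random

def top_k_mask(scores, k):
    """Mark the k entries with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]

def random_k_mask(n, k, seed=0):
    """Mark k entries chosen uniformly at random (size-matched control)."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n), k))
    return [1 if i in chosen else 0 for i in range(n)]

# Placeholder importance scores, one per parameter (hypothetical values).
scores = [0.02, -1.5, 0.3, 0.9, -0.01, 0.0]

selective = top_k_mask(scores, k=2)                 # candidate critical subspace
control = random_k_mask(len(scores), k=2, seed=0)   # Random-K baseline
uniform = [1] * len(scores)                         # uniform constraint over everything
print(selective, control, uniform)
```

Keeping the control mask the same size as the selective one matters: otherwise any difference could be explained by how many parameters are constrained rather than which ones.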

What would settle it

An experiment showing that uniform geometry constraints across all parameters yield equal or greater robustness than selective control when tested on noisy preference data and domain-shifted safety benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.07340 by Jilong Liu, Junfeng Fang, Le Wu, Richang Hong, Tat-Seng Chua, Weibiao Huang, Wenjian Tao, Xingyu Zhu, Yonghui Yang.

Figure 1: Cumulative contribution to worst-case alignment loss under parameter perturbations. We compare the fraction of the total worst-case loss increase accounted for by perturbing probe-identified safety-critical neurons (Top-K) versus randomly selected neurons of the same size (Random-K).
Adjacent text: … to preference-based alignment still presents practical questions about where constraints should be applied, at what level t…
Figure 2: Overview of ShaPO, a robust preference optimization framework with selective geometry control. ShaPO minimizes the worst-case alignment loss under adversarial perturbations restricted to the alignment-critical parameter subspace. Shown here is the reward-level instantiation of ShaPO.
Adjacent text: … restricting worst-case parameter perturbations to S, while the specific form of the alignment loss determines the level at whi…
Figure 4: Composability of ShaPO with DPO and other data-centric alignment methods. We report the win rate against the chosen response; the left panel shows Pythia-2.8B and the right panel shows the LLaMA-3.2-3B backbone.
Adjacent text: Uniform geometry control consistently yields a higher average ASR than selective control across both safety judges, indicating degraded robustness under distribution shift. Random control …
Figure 5: Reward-score distribution on the PKU-30K training set and the effect of different βr on score normalization. Left: the raw score difference ∆r = r(x, yw) − r(x, yl) produced by the Beaver reward (negated cost) model on preference pairs. Right three: the corresponding sigmoid-transformed values σ(βr∆r) under βr ∈ {0.1, 1, 10}.
Adjacent text: Reward Model Usage. In our reward-level ShaPO approach, we use a single safety-o…
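
The score normalization in the Figure 5 caption, σ(βr∆r), can be sketched as follows. The `deltas` values are made-up score differences chosen only to span a wide range; they are not taken from the PKU-30K data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_preference(delta_r, beta_r):
    """Map a raw reward difference to (0, 1) with temperature-like scale beta_r."""
    return sigmoid(beta_r * delta_r)

# Hypothetical reward differences r(x, y_w) - r(x, y_l).
deltas = [-20.0, -2.0, 0.0, 2.0, 20.0]

for beta_r in (0.1, 1.0, 10.0):
    row = [round(soft_preference(d, beta_r), 3) for d in deltas]
    print(beta_r, row)
```

A small βr keeps most values near 0.5, so moderate score differences carry little signal, while a large βr saturates them toward 0 or 1, which makes the normalized score behave almost like a hard label.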
Figure 6: Probe training performance (loss and accuracy) across ten epochs for four different backbone models: Pythia-2.8B, LLaMA-3.2-3B, LLaMA-3-8B, and Qwen2.5-7B.
Figure 7: Comparison of training reward margins across different methods on four LLM backbones.
Adjacent text: … and SFT baselines, confirming the effectiveness of preference optimization for safety alignment. Second, data-centric robust objectives such as IPO, cDPO, rDPO, and Dr.DPO further improve safety performance on some benchmarks, but their gains are often inconsistent across judges and datasets. In contrast, ShaPO exhibits c…
Original abstract

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. The code is available at https://github.com/liujilong0116/ShaPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ShaPO, a geometry-aware preference optimization framework for LLM safety alignment. It argues that robustness failures arise from optimization geometry rather than data uncertainty alone, and introduces selective geometry control applied only to an alignment-critical parameter subspace to enforce worst-case objectives while avoiding over-regularization from uniform constraints. The approach is instantiated at token level (stabilizing likelihood surrogates) and reward level (enforcing consistency under noise). Experiments across safety benchmarks and noisy preference settings report consistent gains over standard methods such as DPO, with clean composition when combined with data-robust objectives.

Significance. If the empirical claims are substantiated with a non-circular, reproducible subspace selection procedure and proper statistical controls, the work could meaningfully advance the field by shifting attention to optimization geometry as a complementary lever for alignment robustness. The clean composition result, if verified, would strengthen the case that selective rather than uniform constraints can improve robustness under distribution shift without sacrificing alignment performance.

major comments (2)
  1. §3 (Method, subspace identification): The central claim requires the existence of an identifiable alignment-critical parameter subspace S such that geometry constraints applied selectively to S avoid over-regularization and yield robustness gains, while uniform application does not. The manuscript provides no explicit, reproducible, pre-hoc criterion for locating S that is independent of the safety benchmarks used for final evaluation. If subspace selection relies on post-hoc gradient norms or validation performance on the same metrics, the selectivity argument is circular and the comparison to uniform constraints is non-falsifiable, directly undermining the load-bearing distinction from existing methods.
  2. §5 (Experiments): The reported consistent improvements and composition gains lack error bars, statistical significance tests, ablation studies isolating the effect of selectivity versus uniform constraints, and detailed descriptions of the noisy preference datasets and distribution-shift protocols. Without these, it is impossible to verify whether the robustness gains are attributable to the proposed selective geometry control or to other uncontrolled factors.
minor comments (2)
  1. Abstract and §1: The phrasing 'consistently improves' and 'composes cleanly' should be accompanied by forward references to the specific tables or figures that quantify the gains and composition effects.
  2. Notation: The distinction between token-level and reward-level ShaPO should be formalized with explicit equations or pseudocode early in the method section to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have identified important areas where additional clarity and rigor will strengthen the presentation of ShaPO. We address each major comment below and indicate the specific revisions we will make in the next version of the paper.

Point-by-point responses
  1. Referee: §3 (Method, subspace identification): The central claim requires the existence of an identifiable alignment-critical parameter subspace S such that geometry constraints applied selectively to S avoid over-regularization and yield robustness gains, while uniform application does not. The manuscript provides no explicit, reproducible, pre-hoc criterion for locating S that is independent of the safety benchmarks used for final evaluation. If subspace selection relies on post-hoc gradient norms or validation performance on the same metrics, the selectivity argument is circular and the comparison to uniform constraints is non-falsifiable, directly undermining the load-bearing distinction from existing methods.

    Authors: We acknowledge the referee's concern regarding the reproducibility and independence of the subspace identification procedure. In the current manuscript, subspace S is identified by ranking parameters according to the magnitude of gradients of the safety alignment loss computed on a held-out validation split of the preference data that is disjoint from both the training set and the final evaluation benchmarks. To eliminate any perception of circularity, we will revise §3 to include an explicit, pre-hoc algorithm with pseudocode, specify the exact validation split size and selection threshold, and add an ablation demonstrating that the selected subspace differs from one derived using test-set performance. We will also expand the uniform-constraint baseline to apply identical geometry penalties over the full parameter space while keeping all other factors fixed, thereby making the selectivity distinction directly falsifiable. These changes will be incorporated in the revised manuscript.
    revision: yes

  2. Referee: §5 (Experiments): The reported consistent improvements and composition gains lack error bars, statistical significance tests, ablation studies isolating the effect of selectivity versus uniform constraints, and detailed descriptions of the noisy preference datasets and distribution-shift protocols. Without these, it is impossible to verify whether the robustness gains are attributable to the proposed selective geometry control or to other uncontrolled factors.

    Authors: We agree that the experimental section requires additional statistical controls and documentation to substantiate the claims. In the revision we will: (i) report means and standard deviations over five independent random seeds for all main results and include error bars in figures; (ii) add statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) with p-values comparing ShaPO against baselines; (iii) include dedicated ablations that isolate selective versus uniform geometry control under identical training conditions; and (iv) expand the description of noisy preference dataset construction (noise injection mechanism, noise rates) and distribution-shift protocols (specific shift types and how they are generated). These additions will allow readers to attribute observed gains more confidently to the selective geometry mechanism.
    revision: yes
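
As a rough illustration of the seed-level comparison promised in response 2, the sketch below computes a paired t statistic and, as a standard-library stand-in for the Wilcoxon signed-rank test, an exact sign-flip permutation p-value. The attack-success-rate numbers are placeholders invented for the example, not results from the paper, and the function names are hypothetical. In practice one would likely reach for `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` instead.

```python
import itertools
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def sign_flip_p_value(a, b):
    """Exact two-sided sign-flip permutation p-value on the paired differences."""
    d = [x - y for x, y in zip(a, b)]
    observed = abs(sum(d))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(d)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(sum(s * v for s, v in zip(signs, d))) >= observed - 1e-12:
            hits += 1
    return hits / total

# Placeholder per-seed attack success rates (lower is better); NOT from the paper.
baseline_asr = [0.30, 0.28, 0.33, 0.31, 0.29]
method_asr = [0.24, 0.25, 0.27, 0.26, 0.22]

t_stat = paired_t_statistic(baseline_asr, method_asr)
p_perm = sign_flip_p_value(baseline_asr, method_asr)
print(t_stat, p_perm)
```

One design consequence is worth noting: with only five paired seeds, an exact two-sided sign-flip or signed-rank test cannot produce a p-value below 2/32 = 0.0625, so reaching a 0.05 threshold with these nonparametric tests would require more seeds.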

Circularity Check

0 steps flagged

Minor self-citation or assumption load, but the central proposal remains an independent algorithmic change.

full rationale

The paper presents ShaPO as a geometry-aware preference optimization framework that applies selective constraints to an alignment-critical parameter subspace. No equations are shown in the abstract or described claims that reduce the reported robustness gains to a fitted quantity or self-defined metric by construction. The existence of the subspace is treated as an identifiable modeling choice rather than derived from the final performance numbers. While the skeptic notes potential dependence on how the subspace is located, the provided text does not exhibit a specific reduction (e.g., S chosen via the same safety benchmarks used for evaluation) that would qualify as circular under the strict quoting requirement. The derivation chain is therefore largely self-contained with only minor assumption load.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a separable alignment-critical subspace and on the premise that selective rather than uniform geometry control mitigates over-regularization; both are introduced without independent derivation in the abstract.

free parameters (1)
  • alignment-critical subspace identification
    The method requires selecting which parameters belong to the critical subspace; this choice is not derived from first principles and must be determined per model or task.
axioms (1)
  • domain assumption: Robustness failures cannot be addressed by data-centric methods alone
    Explicitly stated as the motivation for shifting to an optimization-geometry perspective.

pith-pipeline@v0.9.0 · 5747 in / 1229 out tokens · 42643 ms · 2026-05-22T11:12:39.288367+00:00 · methodology

