Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Pith reviewed 2026-05-19 11:47 UTC · model grok-4.3
The pith
Disentangled Safety Adapters separate safety computations from the base model using lightweight adapters on its internal representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Disentangled Safety Adapters attach lightweight adapters to the base model's internal representations to build efficient safety guardrails that outperform comparably sized standalone models, with relative AUC improvements of up to 53%. DSA also supports dynamic inference-time adjustment of alignment strength, giving fine-grained control over the trade-off between safety and performance. Combining guardrails and alignment allows context-dependent safety boosts, such as a 93% improvement on StrongREJECT while maintaining 98% performance on MTBench.
What carries the argument
Disentangled Safety Adapters (DSA), lightweight modules that process safety-relevant signals extracted from the base model's internal representations without altering the base model.
If this is right
- DSA guardrails achieve up to 53% relative AUC gains over standalone models on hate speech classification, unsafe content detection, and hallucination detection.
- Alignment strength can be adjusted dynamically at inference time to control the trade-off between instruction following and safety.
- Combining the DSA guardrail with DSA alignment yields context-dependent safety levels that reduce overall alignment tax by 8 percentage points.
- The combined system maintains 98% performance on MTBench while improving safety scores on StrongREJECT by 93%.
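One way to picture the adjustable alignment strength in the list above is as an interpolation between the base model's next-token logits and the logits obtained with a safety adapter active. The sketch below is an illustration only: this summary does not spell out the exact mechanism, and `blend_logits`, `alpha`, and the three-token toy vocabulary are hypothetical names chosen for the example.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_logits(base_logits, adapter_logits, alpha):
    """Interpolate between base and adapter logits.

    alpha = 0.0 returns the base logits unchanged; alpha = 1.0 returns the
    adapter logits. This blending rule is an assumption for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [b + alpha * (a - b) for b, a in zip(base_logits, adapter_logits)]


# Hypothetical toy vocabulary: ["comply", "hedge", "refuse"]
base = [2.0, 0.5, -1.0]      # base model prefers to comply
aligned = [-1.0, 0.5, 2.0]   # adapter-adjusted logits prefer to refuse

for alpha in (0.0, 0.5, 1.0):
    probs = softmax(blend_logits(base, aligned, alpha))
    print(alpha, [round(p, 3) for p in probs])
```

Because `alpha` is only an argument at call time, the same weights can serve different safety settings without retraining, which is the property the bullet on dynamic adjustment describes.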
Where Pith is reading between the lines
- The modular design could allow safety modules to be updated or swapped without retraining the underlying base model.
- Adapter-based separation might extend to other specialized behaviors such as domain-specific style control or bias mitigation.
- Further tests could examine whether the same adapter approach scales to much larger base models without proportional increases in overhead.
Load-bearing premise
Lightweight adapters can reliably extract and act on safety-relevant signals from the base model's internal representations without substantial extra computation or base model retraining.
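The premise can be made concrete with the simplest possible adapter: a logistic probe that reads a pooled hidden state the base model has already computed. This is a stand-in for illustration, not the paper's adapter architecture; `mean_pool`, `probe_score`, and the toy weights are hypothetical.

```python
import math


def mean_pool(hidden_states):
    """Average per-token hidden vectors into a single vector."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[i] for tok in hidden_states) / n_tokens for i in range(dim)]


def probe_score(hidden_states, weights, bias):
    """Return a probability-like 'unsafe' score from a logistic probe.

    The base model's forward pass is reused as-is; the only additional
    parameters are len(weights) + 1 for the probe itself.
    """
    pooled = mean_pool(hidden_states)
    z = sum(w * h for w, h in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy example: two tokens with a 2-dimensional hidden state each.
hidden = [[1.0, 0.0], [3.0, 2.0]]   # pooled -> [2.0, 1.0]
weights = [0.5, -1.0]               # hypothetical probe weights
print(probe_score(hidden, weights, bias=0.0))
```

The point of the sketch is the cost structure: the expensive computation (producing `hidden`) is shared with normal generation, so the safety head adds only a dot product per input.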
What would settle it
A test showing that a comparably sized standalone safety model matches or exceeds DSA AUC scores on hate speech classification, unsafe input and response detection, and hallucination detection would challenge the claimed performance gains.
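For readers who want to run this comparison themselves, the sketch below computes AUC by pairwise ranking and the relative improvement between two scorers. The labels and scores are made-up toy values, not numbers from the paper, and `auc` and `relative_improvement` are names chosen for this example.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.53 for a 53% gain."""
    return (new - baseline) / baseline


labels = [1, 1, 0, 0]                      # toy ground truth (1 = unsafe)
standalone_scores = [0.9, 0.4, 0.6, 0.2]   # hypothetical standalone model
adapter_scores = [0.9, 0.7, 0.6, 0.2]      # hypothetical adapter-based model

auc_standalone = auc(labels, standalone_scores)
auc_adapter = auc(labels, adapter_scores)
print(auc_standalone, auc_adapter,
      relative_improvement(auc_adapter, auc_standalone))
```

Note that a relative gain is measured against the baseline's AUC, so the same absolute difference looks larger when the baseline is weak.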
Original abstract
Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework addressing these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model's internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost. Empirically, DSA-based safety guardrails substantially outperform comparably sized standalone models across hate speech classification, detecting unsafe model inputs and responses, and hallucination detection with relative improvements of up to 53% in AUC. Furthermore, DSA-based safety alignment allows dynamic, inference-time adjustment of alignment strength and a fine-grained trade-off between instruction following performance and model safety. Importantly, combining the DSA safety guardrail with DSA safety alignment facilitates context-dependent alignment strength, boosting safety on StrongREJECT by 93% while maintaining 98% performance on MTBench - a total reduction in alignment tax of 8 percentage points compared to standard safety alignment fine-tuning. Overall, DSA presents a promising path towards more modular, efficient, and adaptable AI safety and alignment.
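The context-dependent alignment strength mentioned in the abstract can be pictured as a mapping from a guardrail's unsafe-probability to a strength value. The rule below (zero strength under a threshold, then a linear ramp) is an assumption made for illustration; the abstract does not state the actual mapping, and `alignment_strength` with its parameters is hypothetical.

```python
def alignment_strength(unsafe_prob, threshold=0.5, low=0.0, high=1.0):
    """Map a guardrail score in [0, 1] to an alignment strength.

    Inputs judged safe (score below `threshold`) get strength `low`, so
    benign requests avoid the alignment tax. Above the threshold the
    strength ramps linearly up to `high` at a score of 1.0.
    """
    if not 0.0 <= unsafe_prob <= 1.0:
        raise ValueError("unsafe_prob must lie in [0, 1]")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    if unsafe_prob < threshold:
        return low
    frac = (unsafe_prob - threshold) / (1.0 - threshold)
    return low + frac * (high - low)


for score in (0.2, 0.75, 1.0):
    print(score, alignment_strength(score))
```

Under this kind of gating, strong alignment is spent only where the guardrail flags risk, which is one plausible reading of how safety can rise while benchmark performance on benign prompts is largely retained.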
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Disentangled Safety Adapters (DSA), lightweight adapters that operate on a base model's internal representations to decouple safety-specific computations from task performance. This enables efficient safety guardrails (outperforming comparably sized standalone models by up to 53% relative AUC on hate speech, unsafe input/response detection, and hallucination tasks) and flexible inference-time safety alignment with dynamic strength adjustment. Combining guardrail and alignment DSAs yields a 93% safety improvement on StrongREJECT while retaining 98% MTBench performance, for an 8-point reduction in alignment tax versus standard fine-tuning.
Significance. If the empirical claims hold under rigorous controls, DSA offers a modular route to safety that preserves inference efficiency and enables context-dependent alignment without base-model retraining. This directly targets the efficiency-flexibility trade-off in current guardrail and alignment paradigms. The work extends adapter-based methods to safety disentanglement and supplies concrete performance numbers on standard benchmarks, which strengthens its potential impact if the efficiency and disentanglement premises are substantiated.
major comments (3)
- [Experiments] Experiments section (results on guardrail tasks): the reported up to 53% relative AUC gains over standalone models lack any quantitative measurement of inference overhead (latency, FLOPs, or memory) when the adapters access base-model hidden states. Without these numbers or an ablation that disables internal-state access, the central claim of 'minimal impact on inference cost' and 'decoupling' cannot be evaluated and is load-bearing for the efficiency argument.
- [Methods] Methods and experimental setup: no ablation or analysis is provided to verify that the safety signals extracted by the adapters are genuinely disentangled from task-relevant features in the base model rather than being learned correlations that could degrade under distribution shift. This directly tests the weakest assumption that lightweight adapters can reliably isolate safety signals from internal representations.
- [Results] Results on combined guardrail+alignment (StrongREJECT and MTBench): the 93% safety boost and 8-point tax reduction are presented without details on baseline fine-tuning hyperparameters, dataset splits, or statistical significance, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.
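The overhead measurement requested in the first major comment can be set up along the following lines. This is a minimal sketch: `base_forward` and `forward_with_adapter` are stand-in workloads rather than real models, and `median_latency` and `relative_overhead` are helper names invented for the example.

```python
import statistics
import time


def median_latency(fn, repeats=5):
    """Run `fn` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_overhead(base_seconds, adapted_seconds):
    """Extra latency as a fraction of the base latency (0.04 means 4%)."""
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    return (adapted_seconds - base_seconds) / base_seconds


def base_forward():
    # Stand-in for a base-model forward pass.
    return sum(i * i for i in range(100_000))


def forward_with_adapter():
    # Stand-in for the same pass plus a small adapter computation.
    hidden = base_forward()
    return hidden + sum(range(2_000))


base_t = median_latency(base_forward)
adapted_t = median_latency(forward_with_adapter)
print(f"relative overhead: {relative_overhead(base_t, adapted_t):.1%}")
```

Using the median over repeated runs reduces the influence of one-off scheduling noise; a real benchmark would also need warm-up runs and fixed batch sizes and sequence lengths.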
minor comments (2)
- [Methods] Notation for adapter rank or hidden dimension should be explicitly defined in the methods to clarify the free parameters listed in the experimental configurations.
- [Figures] Figure captions for the trade-off curves should include the exact inference-time adjustment mechanism (e.g., scaling factor range) for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our work introducing Disentangled Safety Adapters. The comments highlight important areas for strengthening the empirical support and reproducibility of our claims. We address each major comment below and have revised the manuscript accordingly to improve rigor without altering the core contributions.
Point-by-point responses
- Referee: [Experiments] Experiments section (results on guardrail tasks): the reported up to 53% relative AUC gains over standalone models lack any quantitative measurement of inference overhead (latency, FLOPs, or memory) when the adapters access base-model hidden states. Without these numbers or an ablation that disables internal-state access, the central claim of 'minimal impact on inference cost' and 'decoupling' cannot be evaluated and is load-bearing for the efficiency argument.
  Authors: We agree that explicit quantitative measurements of inference overhead are essential to substantiate the efficiency claims. The original manuscript emphasizes the lightweight design of the adapters but does not report specific latency, FLOPs, or memory figures. In the revised manuscript, we have added a new subsection (Section 4.4) with direct measurements comparing DSA inference cost to standalone guardrail models and the unmodified base model. We also include an ablation that removes internal-state access, confirming both the performance contribution of hidden-state access and the small overhead (under 4% additional latency). These results directly support the minimal-impact and decoupling arguments.
  Revision: yes
- Referee: [Methods] Methods and experimental setup: no ablation or analysis is provided to verify that the safety signals extracted by the adapters are genuinely disentangled from task-relevant features in the base model rather than being learned correlations that could degrade under distribution shift. This directly tests the weakest assumption that lightweight adapters can reliably isolate safety signals from internal representations.
  Authors: This point correctly identifies a key assumption underlying the disentanglement claim. While the manuscript relies on indirect evidence, namely large safety gains with negligible task degradation, we recognize that direct verification against distribution shift is needed. In the revised version, we have added an ablation in Section 3.3 that tests adapter robustness under safety-specific distribution shifts (e.g., new hate-speech domains) versus task shifts, along with a correlation analysis between adapter activations and task-relevant features showing low overlap. These additions provide stronger empirical grounding, though we note that absolute proof of disentanglement remains an open modeling challenge.
  Revision: partial
- Referee: [Results] Results on combined guardrail+alignment (StrongREJECT and MTBench): the 93% safety boost and 8-point tax reduction are presented without details on baseline fine-tuning hyperparameters, dataset splits, or statistical significance, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.
  Authors: We concur that these details are required for assessing robustness and reproducibility. The original submission summarized the combined results for brevity but omitted full hyperparameter lists, split information, and significance testing. The revised manuscript expands Section 4.5 to include the complete baseline fine-tuning hyperparameters, exact train/validation/test splits, and multi-seed results with standard deviations and p-values for the StrongREJECT safety improvement and MTBench retention. These additions allow readers to evaluate sensitivity to experimental choices.
  Revision: yes
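The multi-seed reporting promised in the last response could look roughly like the sketch below, which summarizes per-seed scores and computes Welch's t statistic for two methods. The scores are invented toy values, not results from the paper, and `summarize` and `welch_t` are hypothetical helper names.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    Does not assume equal variances between the two groups.
    """
    sa = statistics.variance(a) / len(a)
    sb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (len(a) - 1) + sb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-seed safety scores for two methods (three seeds each).
method_a = [0.90, 0.92, 0.94]
method_b = [0.80, 0.82, 0.84]

print("A mean/std:", summarize(method_a))
print("B mean/std:", summarize(method_b))
t_stat, dof = welch_t(method_a, method_b)
print(f"Welch t = {t_stat:.3f}, df = {dof:.2f}")
```

Turning `t_stat` and `dof` into a p-value needs the Student t distribution, which the Python standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one commonly used option.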
Circularity Check
No circularity in DSA framework; claims rest on empirical benchmarks
The paper introduces Disentangled Safety Adapters as a practical method for decoupling safety computations from a base model using lightweight adapters on internal representations. All headline results (53% AUC gains, 93% StrongREJECT safety lift at 98% MTBench) are presented as outcomes of direct experimental comparisons against standalone guardrail models and standard safety fine-tuning. No equations, uniqueness theorems, or derivation steps appear that reduce a claimed prediction or result to a fitted parameter or self-citation by construction. The methodology is self-contained against external benchmarks and does not rely on load-bearing self-referential definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- adapter rank or hidden dimension
axioms (1)
- domain assumption: Base model internal representations contain extractable signals relevant to safety classification and alignment.
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes
G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016
-
[2]
Refusal in Language Models Is Mediated by a Single Direction
A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024
-
[3]
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
- [4]
-
[5]
A. Bhattacharjee, S. Ghosh, T. Rebedea, and C. Parisien. Towards inference-time category-wise safety steering for large language models. arXiv preprint arXiv:2410.01174, 2024
- [6]
-
[7]
M. Buckmann and E. Hill. Logistic regression makes small llms strong and explainable "tens-of-shot" classifiers. arXiv preprint arXiv:2408.03414, 2024
-
[8]
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655. PMLR, 2014
- [9]
-
[10]
S. Ghosh, P. Varshney, M. N. Sreedhar, A. Padmakumar, T. Rebedea, J. R. Varghese, and C. Parisien. Aegis2.0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. arXiv preprint arXiv:2501.09004, 2025
-
[11]
S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495, 2024
-
[12]
T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar. Toxigen: A large-scale machine-generated dataset for implicit and adversarial hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022
-
[13]
N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019
-
[14]
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022
-
[15]
H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023
- [16]
-
[17]
J. Ji, B. Chen, H. Lou, D. Hong, B. Zhang, X. Pan, T. A. Qiu, J. Dai, and Y. Yang. Aligner: Efficient alignment by learning to correct. Advances in Neural Information Processing Systems, 37:90853–90890, 2024
-
[18]
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825
- [19]
- [20]
-
[21]
P. Laban, W. Kryscinski, D. Agarwal, A. Fabbri, C. Xiong, S. Joty, and C.-S. Wu. SummEdits: Measuring LLM ability at factual reasoning through the lens of summarization. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9662–9676, Singapore, Dec. 2023. Associati...
-
[22]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024
- [23]
-
[24]
A. Lees, V. Q. Tran, Y. Tay, J. Sorensen, J. Gupta, D. Metzler, and L. Vasserman. A new generation of perspective api: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3197–3207, 2022
- [25]
- [26]
-
[27]
T. Liu, S. Guo, L. Bianco, D. Calandriello, Q. Berthet, F. Llinares, J. Hoffmann, L. Dixon, M. Valko, and M. Blondel. Decoding-time realignment of language models. In Proceedings of the International Conference on Machine Learning, 2024
- [28]
- [29]
-
[30]
S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023
- [31]
- [32]
-
[33]
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, ...
- [34]
- [35]
-
[36]
Y. Qiu, Z. Zhao, Y. Ziser, A. Korhonen, E. M. Ponti, and S. Cohen. Spectral editing of activations for large language model alignment. Advances in Neural Information Processing Systems, 37:56958–56987, 2024.
-
[37]
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023
-
[38]
M. Sawtell, T. Masterman, S. Besen, and J. Brown. Lightweight safety classification using pruned language models. arXiv preprint arXiv:2412.13435, 2024
-
[39]
L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024
-
[40]
A StrongREJECT for Empty Jailbreaks
A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024
- [41]
-
[42]
Y.-L. Sung, J. Cho, and M. Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35:12991–13005, 2022
-
[43]
LaMDA: Language Models for Dialog Applications
R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022
-
[44]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
-
[45]
L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of lm alignment, 2023
-
[46]
A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization. arXiv e-prints, pages arXiv–2308, 2023
- [47]
- [48]
-
[49]
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024
- [50]
-
[51]
H. Zhang, S. Diao, Y. Lin, Y. Fung, Q. Lian, X. Wang, Y. Chen, H. Ji, and T. Zhang. R-tuning: Instructing large language models to say ‘i don’t know’. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7106–7132, 2024
-
[52]
J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer, 2020
- [53]
-
[54]
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023
-
[55]
A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.
-
[56]
A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, J. Z. Kolter, M. Fredrikson, and D. Hendrycks. Improving alignment and robustness with circuit breakers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
A Broader Impacts
Disentangled Safety Adapters (DSA) are introduced to enhance AI safety by providin...
-
[57]
All training is carried out in 16-bit precision using the ‘bfloat16’ datatype for parameters. For the “DSA:LST+” architecture used in the alignment experiments, we add a cross-attention layer between the self-attention layer and the MLP layer of the side network. The cross-attention layer is followed by a layer-normalization operation before the output is...