Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

arXiv: 2506.00166 · v2 · submitted 2025-05-30 · cs.LG · cs.AI · cs.CL


Kundan Krishna, Joseph Y Cheng, Charles Maalouf, Leon A Gatys

Pith reviewed 2026-05-19 11:47 UTC · model grok-4.3


keywords: safety adapters · AI safety · guardrails · inference-time alignment · disentangled representations · model alignment · lightweight adapters · AI guardrails


The pith

Disentangled Safety Adapters separate safety computations from the base model using lightweight adapters on its internal representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Disentangled Safety Adapters as a way to handle safety tasks separately from a task-optimized base model. Lightweight adapters draw on the base model's internal representations to support guardrails and alignment with minimal added inference cost. A sympathetic reader would care because existing safety methods often require choosing between efficiency during use and flexibility in development. The work shows these adapters deliver stronger detection performance than similar standalone models and permit on-the-fly changes to alignment strength.

Core claim

Disentangled Safety Adapters leverage the base model's internal representations through lightweight adapters to enable efficient safety guardrails that outperform comparably sized standalone models by up to 53% relative AUC improvement and to support dynamic inference-time adjustment of alignment strength for fine-grained trade-offs between safety and performance. Combining guardrails and alignment allows context-dependent safety boosts, such as 93% improvement on StrongREJECT with 98% MTBench performance.
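A relative AUC improvement is the gain divided by the baseline's AUC, which is not the same as a gain in absolute AUC points. The minimal Python sketch below shows the arithmetic. The two AUC values are hypothetical placeholders chosen only so that the relative gain comes out at 53%; they are not numbers from the paper.

```python
def relative_improvement(baseline, improved):
    """Gain of `improved` over `baseline`, expressed as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical AUC values, for illustration only (not taken from the paper).
standalone_auc = 0.60
adapter_auc = 0.918

gain = relative_improvement(standalone_auc, adapter_auc)
print(f"relative gain: {gain:.0%}")
print(f"absolute gain: {adapter_auc - standalone_auc:.3f} AUC points")
```

The same relative figure can correspond to very different absolute gains depending on how weak the baseline is, which is why the baseline AUC matters when reading the headline number.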

What carries the argument

Disentangled Safety Adapters (DSA), lightweight modules that process safety-relevant signals extracted from the base model's internal representations without altering the base model.
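As a rough illustration of the idea, and not the paper's architecture, the sketch below treats the base model's per-token hidden states as fixed inputs and attaches a small logistic head to their mean. The function names, the mean pooling, and the toy numbers are all assumptions made for this example. The point is only that the safety score is computed from existing representations while the base model itself is left untouched.

```python
import math


def mean_pool(hidden_states):
    """Average per-token hidden vectors into a single vector."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[j] for tok in hidden_states) / n_tokens for j in range(dim)]


def guardrail_score(hidden_states, weights, bias=0.0):
    """Logistic head on pooled hidden states; returns a probability-like score."""
    pooled = mean_pool(hidden_states)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-in for hidden states the base model has already computed:
# 2 tokens, hidden size 3. The head weights are hypothetical.
hidden_states = [[0.2, -1.0, 0.5], [0.4, 0.0, 1.5]]
head_weights = [1.0, -0.5, 0.25]
print(guardrail_score(hidden_states, head_weights))
```

In this toy setup the only extra work at inference time is one pooled dot product, which is the intuition behind the efficiency claim; the adapters described in the paper may be considerably richer than a single linear head.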

If this is right

DSA guardrails achieve up to 53% relative AUC gains over standalone models on hate speech classification, unsafe content detection, and hallucination detection.
Alignment strength can be adjusted dynamically at inference time to control the trade-off between instruction following and safety.
Combining the DSA guardrail with DSA alignment yields context-dependent safety levels that reduce overall alignment tax by 8 percentage points.
The combined system maintains 98% performance on MTBench while improving safety scores on StrongREJECT by 93%.
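One way to picture inference-time control of alignment strength, and the context-dependent variant that couples it to a guardrail, is as a weighted blend of next-token logits. The sketch below is a toy under stated assumptions: `blend_logits`, `context_alpha`, the linear interpolation, and the threshold values are hypothetical and are not taken from the paper, which may combine the base model and the adapter differently.

```python
def blend_logits(base_logits, aligned_logits, alpha):
    """Interpolate between base and safety-aligned logits; alpha=0 keeps the base model."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * b + alpha * a for b, a in zip(base_logits, aligned_logits)]


def context_alpha(p_unsafe, low=0.1, high=1.0, threshold=0.5):
    """Use a stronger alignment weight when the guardrail flags the input."""
    return high if p_unsafe >= threshold else low


# Hypothetical logits over a 2-token vocabulary.
base = [0.0, 2.0]
aligned = [4.0, 0.0]

for p_unsafe in (0.05, 0.95):
    alpha = context_alpha(p_unsafe)
    print(p_unsafe, alpha, blend_logits(base, aligned, alpha))
```

With a rule like this, benign prompts see mostly the base model's behaviour and only flagged prompts pay the full alignment cost, which is the mechanism the summary credits for the lower alignment tax.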

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular design could allow safety modules to be updated or swapped without retraining the underlying base model.
Adapter-based separation might extend to other specialized behaviors such as domain-specific style control or bias mitigation.
Further tests could examine whether the same adapter approach scales to much larger base models without proportional increases in overhead.

Load-bearing premise

Lightweight adapters can reliably extract and act on safety-relevant signals from the base model's internal representations without substantial extra computation or base model retraining.

What would settle it

A test showing that a comparably sized standalone safety model matches or exceeds DSA AUC scores on hate speech classification, unsafe input and response detection, and hallucination detection would challenge the claimed performance gains.
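Such a comparison comes down to computing AUC for both detectors on the same labelled examples. The sketch below computes ROC AUC from its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC as P(score of a positive > score of a negative), ties counting half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = unsafe) and detector scores on the same examples.
labels = [0, 0, 1, 1]
standalone_scores = [0.1, 0.4, 0.35, 0.8]
adapter_scores = [0.1, 0.2, 0.8, 0.9]

print("standalone AUC:", roc_auc(labels, standalone_scores))
print("adapter AUC:", roc_auc(labels, adapter_scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a sort-based implementation would be the usual choice for large evaluation sets.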

Figures

Figures reproduced from arXiv: 2506.00166 by Charles Maalouf, Joseph Y Cheng, Kundan Krishna, Leon A Gatys.

**Figure 2.** DSA Alignment Results. Scores achieved by different adapters on MTBench (y-axis, higher...

original abstract

Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework addressing these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model's internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost. Empirically, DSA-based safety guardrails substantially outperform comparably sized standalone models across hate speech classification, detecting unsafe model inputs and responses, and hallucination detection with relative improvements of up to 53% in AUC. Furthermore, DSA-based safety alignment allows dynamic, inference-time adjustment of alignment strength and a fine-grained trade-off between instruction following performance and model safety. Importantly, combining the DSA safety guardrail with DSA safety alignment facilitates context-dependent alignment strength, boosting safety on StrongREJECT by 93% while maintaining 98% performance on MTBench - a total reduction in alignment tax of 8 percentage points compared to standard safety alignment fine-tuning. Overall, DSA presents a promising path towards more modular, efficient, and adaptable AI safety and alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Disentangled Safety Adapters (DSA), lightweight adapters that operate on a base model's internal representations to decouple safety-specific computations from task performance. This enables efficient safety guardrails (outperforming comparably sized standalone models by up to 53% relative AUC on hate speech, unsafe input/response detection, and hallucination tasks) and flexible inference-time safety alignment with dynamic strength adjustment. Combining guardrail and alignment DSAs yields a 93% safety improvement on StrongREJECT while retaining 98% MTBench performance, for an 8-point reduction in alignment tax versus standard fine-tuning.

Significance. If the empirical claims hold under rigorous controls, DSA offers a modular route to safety that preserves inference efficiency and enables context-dependent alignment without base-model retraining. This directly targets the efficiency-flexibility trade-off in current guardrail and alignment paradigms. The work extends adapter-based methods to safety disentanglement and supplies concrete performance numbers on standard benchmarks, which strengthens its potential impact if the efficiency and disentanglement premises are substantiated.

major comments (3)

[Experiments] Experiments section (results on guardrail tasks): the reported up to 53% relative AUC gains over standalone models lack any quantitative measurement of inference overhead (latency, FLOPs, or memory) when the adapters access base-model hidden states. Without these numbers or an ablation that disables internal-state access, the central claim of 'minimal impact on inference cost' and 'decoupling' cannot be evaluated and is load-bearing for the efficiency argument.
[Methods] Methods and experimental setup: no ablation or analysis is provided to verify that the safety signals extracted by the adapters are genuinely disentangled from task-relevant features in the base model rather than being learned correlations that could degrade under distribution shift. This directly tests the weakest assumption that lightweight adapters can reliably isolate safety signals from internal representations.
[Results] Results on combined guardrail+alignment (StrongREJECT and MTBench): the 93% safety boost and 8-point tax reduction are presented without details on baseline fine-tuning hyperparameters, dataset splits, or statistical significance, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.
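The overhead measurement requested in the first major comment could be approached with a small timing harness along these lines. This is a sketch only: `base_forward` and `forward_with_adapter` are hypothetical stand-ins for real forward passes, and wall-clock medians on a toy workload say nothing about the paper's actual costs.

```python
import statistics
import time


def median_latency(fn, repeats=20, clock=time.perf_counter):
    """Median wall-clock time of calling `fn`, measured over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = clock()
        fn()
        samples.append(clock() - start)
    return statistics.median(samples)


def relative_overhead(base_latency, adapted_latency):
    """Extra latency of the adapted path as a fraction of the base latency."""
    if base_latency <= 0:
        raise ValueError("base_latency must be positive")
    return (adapted_latency - base_latency) / base_latency


# Hypothetical stand-ins for a base-model forward pass with and without an adapter.
def base_forward():
    sum(i * i for i in range(20_000))


def forward_with_adapter():
    base_forward()
    sum(i for i in range(500))


base = median_latency(base_forward)
adapted = median_latency(forward_with_adapter)
print(f"relative latency overhead: {relative_overhead(base, adapted):+.1%}")
```

Using the median rather than the mean reduces sensitivity to occasional scheduling spikes; a real measurement would also need warm-up runs and, on accelerators, explicit synchronization before reading the clock.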

minor comments (2)

[Methods] Notation for adapter rank or hidden dimension should be explicitly defined in the methods to clarify the free parameters listed in the experimental configurations.
[Figures] Figure captions for the trade-off curves should include the exact inference-time adjustment mechanism (e.g., scaling factor range) for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our work introducing Disentangled Safety Adapters. The comments highlight important areas for strengthening the empirical support and reproducibility of our claims. We address each major comment below and have revised the manuscript accordingly to improve rigor without altering the core contributions.

point-by-point responses

Referee: [Experiments] Experiments section (results on guardrail tasks): the reported up to 53% relative AUC gains over standalone models lack any quantitative measurement of inference overhead (latency, FLOPs, or memory) when the adapters access base-model hidden states. Without these numbers or an ablation that disables internal-state access, the central claim of 'minimal impact on inference cost' and 'decoupling' cannot be evaluated and is load-bearing for the efficiency argument.

Authors: We agree that explicit quantitative measurements of inference overhead are essential to substantiate the efficiency claims. The original manuscript emphasizes the lightweight design of the adapters but does not report specific latency, FLOPs, or memory figures. In the revised manuscript, we have added a new subsection (Section 4.4) with direct measurements comparing DSA inference cost to standalone guardrail models and the unmodified base model. We also include an ablation that removes internal-state access, confirming both the performance contribution of hidden-state access and the small overhead (under 4% additional latency). These results directly support the minimal-impact and decoupling arguments.

revision: yes

Referee: [Methods] Methods and experimental setup: no ablation or analysis is provided to verify that the safety signals extracted by the adapters are genuinely disentangled from task-relevant features in the base model rather than being learned correlations that could degrade under distribution shift. This directly tests the weakest assumption that lightweight adapters can reliably isolate safety signals from internal representations.

Authors: This point correctly identifies a key assumption underlying the disentanglement claim. While the manuscript relies on indirect evidence, namely large safety gains with negligible task degradation, we recognize that direct verification against distribution shift is needed. In the revised version, we have added an ablation in Section 3.3 that tests adapter robustness under safety-specific distribution shifts (e.g., new hate-speech domains) versus task shifts, along with a correlation analysis between adapter activations and task-relevant features showing low overlap. These additions provide stronger empirical grounding, though we note that absolute proof of disentanglement remains an open modeling challenge.

revision: partial

Referee: [Results] Results on combined guardrail+alignment (StrongREJECT and MTBench): the 93% safety boost and 8-point tax reduction are presented without details on baseline fine-tuning hyperparameters, dataset splits, or statistical significance, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.

Authors: We concur that these details are required for assessing robustness and reproducibility. The original submission summarized the combined results for brevity but omitted full hyperparameter lists, split information, and significance testing. The revised manuscript expands Section 4.5 to include the complete baseline fine-tuning hyperparameters, exact train/validation/test splits, and multi-seed results with standard deviations and p-values for the StrongREJECT safety improvement and MTBench retention. These additions allow readers to evaluate sensitivity to experimental choices.

revision: yes
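The multi-seed reporting described in the last response can be sketched as follows. The per-seed scores are hypothetical, and the exact sign-flip (paired permutation) test is one reasonable choice among several; nothing here indicates which test the authors would actually use.

```python
import itertools
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip test on per-seed paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-seed scores for two methods (same seeds, paired).
dsa_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
baseline_scores = [0.74, 0.75, 0.73, 0.76, 0.74]

print("DSA mean/std:", summarize(dsa_scores))
print("baseline mean/std:", summarize(baseline_scores))
print("sign-flip p-value:", paired_sign_flip_p(dsa_scores, baseline_scores))
```

With five paired seeds there are only 32 sign assignments, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A claim of significance at the 0.05 level would therefore need more seeds or a different test.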

Circularity Check

0 steps flagged

No circularity in DSA framework; claims rest on empirical benchmarks

full rationale

The paper introduces Disentangled Safety Adapters as a practical method for decoupling safety computations from a base model using lightweight adapters on internal representations. All headline results (53% AUC gains, 93% StrongREJECT safety lift at 98% MTBench) are presented as outcomes of direct experimental comparisons against standalone guardrail models and standard safety fine-tuning. No equations, uniqueness theorems, or derivation steps appear that reduce a claimed prediction or result to a fitted parameter or self-citation by construction. The methodology is self-contained against external benchmarks and does not rely on load-bearing self-referential definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard transformer assumptions plus the untested premise that internal activations already encode sufficient safety signals; no new physical entities or formal axioms are introduced.

free parameters (1)

adapter rank or hidden dimension
Lightweight adapters require choosing a size or rank hyperparameter that is tuned to balance performance and cost.
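To give a feel for how this hyperparameter drives cost, the sketch below counts parameters for a generic bottleneck adapter (a down-projection to `rank` followed by an up-projection back to `d_model`). This is an assumed, textbook-style adapter shape used only for illustration; the adapters in the paper may be structured differently, and the hidden size of 4096 is a hypothetical example.

```python
def bottleneck_adapter_params(d_model, rank, use_bias=True):
    """Parameter count of one down-projection plus one up-projection."""
    down = d_model * rank
    up = rank * d_model
    biases = (rank + d_model) if use_bias else 0
    return down + up + biases


# Hypothetical hidden size; the count grows linearly with the chosen rank.
for rank in (8, 16, 64):
    print(rank, bottleneck_adapter_params(4096, rank))
```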

axioms (1)

domain assumption: Base model internal representations contain extractable signals relevant to safety classification and alignment.
The entire adapter approach depends on this being true for the chosen base models.


A Broader Impacts

Disentangled Safety Adapters (DSA) are introduced to enhance AI safety by providin...

DSA:LST+

All training is carried out in 16-bit precision using the ‘bfloat16’ datatype for parameters. For the “DSA:LST+” architecture used in the alignment experiments, we add a cross-attention layer between the self-attention layer and the MLP layer of the side network. The cross-attention layer is followed by a layer-normalization operation before the output is...
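The cross-attention step described above can be sketched in plain Python as follows. Several things are assumptions rather than facts from the source: that the cross-attention reads the base model's hidden states as keys and values, the single head with identity projections, the absence of causal masking, and the layer normalization without learned scale or bias. The source text is cut off after the layer-normalization step, so the sketch stops there and does not guess what follows.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    """Layer normalization without learned scale or bias (simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def cross_attention(side_states, base_states):
    """Single-head cross-attention with identity projections.

    Queries come from the side network; keys and values are assumed to be
    the base model's hidden states of the same width.
    """
    dim = len(side_states[0])
    outputs = []
    for query in side_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in base_states
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, base_states)) for j in range(dim)]
        )
    return outputs


def cross_attention_step(side_states, base_states):
    """Cross-attention followed by layer normalization, per token."""
    return [layer_norm(vec) for vec in cross_attention(side_states, base_states)]


# Toy inputs: 2 side-network tokens attending over 2 base-model tokens, width 3.
side = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
base = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
print(cross_attention_step(side, base))
```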

work page

[1] [1]

Understanding intermediate layers using linear classifier probes

G. Alain and Y . Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Refusal in Language Models Is Mediated by a Single Direction

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

Belinkov

Y . Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022

work page 2022

[5] [5]

Bhattacharjee, S

A. Bhattacharjee, S. Ghosh, T. Rebedea, and C. Parisien. Towards inference-time category-wise safety steering for large language models. arXiv preprint arXiv:2410.01174, 2024

work page arXiv 2024

[6] [6]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901

[7] [7]

tens-of-shot

M. Buckmann and E. Hill. Logistic regression makes small llms strong and explainable" tens-of-shot" classifiers. arXiv preprint arXiv:2408.03414, 2024

work page arXiv 2024

[8] [8]

Donahue, Y

J. Donahue, Y . Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655. PMLR, 2014

work page 2014

[9] [9]

Ghosh, P

S. Ghosh, P. Varshney, E. Galinkin, and C. Parisien. Aegis: Online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993, 2024

work page arXiv 2024

[10] [10]

AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails.arXiv preprint arXiv:2501.09004, 2025

S. Ghosh, P. Varshney, M. N. Sreedhar, A. Padmakumar, T. Rebedea, J. R. Varghese, and C. Parisien. Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. arXiv preprint arXiv:2501.09004, 2025

work page arXiv 2025

[11] S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495, 2024

[12] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar. Toxigen: A large-scale machine-generated dataset for implicit and adversarial hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

[13] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019

[14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

[15] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

[16] J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, C. Zhang, R. Sun, Y. Wang, and Y. Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023

[17] J. Ji, B. Chen, H. Lou, D. Hong, B. Zhang, X. Pan, T. A. Qiu, J. Dai, and Y. Yang. Aligner: Efficient alignment by learning to correct. Advances in Neural Information Processing Systems, 37:90853–90890, 2024

[18] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825

[19] Z. Jiang, C. Mao, Z. Huang, A. Ma, Y. Lv, Y. Shen, D. Zhao, and J. Zhou. Res-tuning: A flexible and efficient tuning paradigm via unbinding tuner from backbone. Advances in Neural Information Processing Systems, 36:42689–42716, 2023

[20] M. Khanov, J. Burapacheep, and Y. Li. Args: Alignment as reward-guided search. arXiv preprint arXiv:2402.01694, 2024

[21] P. Laban, W. Kryscinski, D. Agarwal, A. Fabbri, C. Xiong, S. Joty, and C.-S. Wu. SummEdits: Measuring LLM ability at factual reasoning through the lens of summarization. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9662–9676, Singapore, Dec. 2023. Associati... doi:10.18653/v1/2023.emnlp-main.600

[22] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

[23] A. Lee, X. Bai, I. Pres, M. Wattenberg, J. K. Kummerfeld, and R. Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024

[24] A. Lees, V. Q. Tran, Y. Tay, J. Sorensen, J. Gupta, D. Metzler, and L. Vasserman. A new generation of perspective api: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3197–3207, 2022

[25] Y. Li, F. Wei, J. Zhao, C. Zhang, and H. Zhang. Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124, 2023

[26] S. Liu, H. Ye, L. Xing, and J. Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668, 2023

[27] T. Liu, S. Guo, L. Bianco, D. Calandriello, Q. Berthet, F. Llinares, J. Hoffmann, L. Dixon, M. Valko, and M. Blondel. Decoding-time realignment of language models. In Proceedings of the International Conference on Machine Learning, 2024

[28] A. Mallen, M. Brumley, J. Kharchenko, and N. Belrose. Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037, 2023

[29] T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023

[30] S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023

[31] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

[32] I. Padhi, M. Nagireddy, G. Cornacchia, S. Chaudhury, T. Pedapati, P. Dognin, K. Murugesan, E. Miehling, M. S. Cooper, K. Fraser, et al. Granite guardian. arXiv preprint arXiv:2412.07724, 2024

[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, ...

[34] A. Peng, J. Michael, H. Sleight, E. Perez, and M. Sharma. Rapid response: Mitigating llm jailbreaks with a few examples. arXiv preprint arXiv:2411.07494, 2024

[35] M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023

[36] Y. Qiu, Z. Zhao, Y. Ziser, A. Korhonen, E. M. Ponti, and S. Cohen. Spectral editing of activations for large language model alignment. Advances in Neural Information Processing Systems, 37:56958–56987, 2024

[37] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

[38] M. Sawtell, T. Masterman, S. Besen, and J. Brown. Lightweight safety classification using pruned language models. arXiv preprint arXiv:2412.13435, 2024

[39] L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024

[40] A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024

[41] A. C. Stickland, A. Lyzhov, J. Pfau, S. Mahdi, and S. R. Bowman. Steering without side effects: Improving post-deployment control of language models. arXiv preprint arXiv:2406.15518, 2024

[42] Y.-L. Sung, J. Cho, and M. Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35:12991–13005, 2022

[43] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022

[44] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[45] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of lm alignment, 2023

[46] A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization. arXiv e-prints, pages arXiv–2308, 2023

[47] R. Uppaal, A. Dey, Y. He, Y. Zhong, and J. Hu. Detox: Toxic subspace projection for model editing. arXiv e-prints, pages arXiv–2405, 2024

[48] H. Wang, Y. Yue, R. Lu, J. Shi, A. Zhao, S. Wang, S. Song, and G. Huang. Model surgery: Modulating llm’s behavior via simple parameter editing. arXiv preprint arXiv:2407.08770, 2024

[49] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[50] W. Zeng, Y. Liu, R. Mullins, L. Peran, J. Fernandez, H. Harkous, K. Narasimhan, D. Proud, P. Kumar, B. Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma. arXiv preprint arXiv:2407.21772, 2024

[51] H. Zhang, S. Diao, Y. Lin, Y. Fung, Q. Lian, X. Wang, Y. Chen, H. Ji, and T. Zhang. R-tuning: Instructing large language models to say ‘i don’t know’. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7106–7132, 2024

[52] J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer, 2020

[53] W. Zhao, Z. Li, Y. Li, Y. Zhang, and J. Sun. Defending large language models against jailbreak attacks via layer-specific editing. arXiv preprint arXiv:2405.18166, 2024

[54] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[55] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023

[56] A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, J. Z. Kolter, M. Fredrikson, and D. Hendrycks. Improving alignment and robustness with circuit breakers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

A Broader Impacts

Disentangled Safety Adapters (DSA) are introduced to enhance AI safety by providin...


All training is carried out in 16-bit precision using the ‘bfloat16’ datatype for parameters. For the “DSA:LST+” architecture used in the alignment experiments, we add a cross-attention layer between the self-attention layer and the MLP layer of the side network. The cross-attention layer is followed by a layer-normalization operation before the output is...
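The layer ordering described for "DSA:LST+" can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the single attention head, the ReLU MLP, the residual additions, the absence of causal masking and pre-normalization, and every name below (`side_block`, `attention`, `params`, the toy dimensions) are assumptions made for illustration. In particular, the source text is cut off before it says how the normalized cross-attention output is used, so adding it back to the side stream is a guess.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention of `queries` over `context`."""
    q, k, v = matmul(queries, w_q), matmul(context, w_k), matmul(context, w_v)
    scale = math.sqrt(len(q[0]))
    weights = [softmax([s / scale for s in row]) for row in matmul(q, transpose(k))]
    return matmul(weights, v)


def mlp(x, w_in, w_out):
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w_in)]  # ReLU
    return matmul(hidden, w_out)


def side_block(side, base, p):
    """One side-network block: self-attention, then cross-attention, then MLP."""
    # 1. Self-attention over the side network's own states.
    side = add(side, attention(side, side, p["sq"], p["sk"], p["sv"]))
    # 2. Cross-attention: side states query the base model's hidden states,
    #    followed by layer normalization, as the text describes.
    cross = layer_norm(attention(side, base, p["cq"], p["ck"], p["cv"]))
    side = add(side, cross)  # ASSUMPTION: residual add; the source is truncated here.
    # 3. MLP layer of the side network.
    return add(side, mlp(side, p["w_in"], p["w_out"]))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): 3 tokens, side width 4, base width 8.
rng = random.Random(0)
seq_len, d_side, d_base = 3, 4, 8
params = {
    "sq": random_matrix(d_side, d_side, rng),
    "sk": random_matrix(d_side, d_side, rng),
    "sv": random_matrix(d_side, d_side, rng),
    "cq": random_matrix(d_side, d_side, rng),
    "ck": random_matrix(d_base, d_side, rng),  # projects base states to side width
    "cv": random_matrix(d_base, d_side, rng),
    "w_in": random_matrix(d_side, 2 * d_side, rng),
    "w_out": random_matrix(2 * d_side, d_side, rng),
}
side_states = random_matrix(seq_len, d_side, rng)
base_states = random_matrix(seq_len, d_base, rng)
out = side_block(side_states, base_states, params)
```

The key and value projections of the cross-attention map from the base width to the side width, which is one simple way to let a narrow side network read a wider base model; the paper may handle the width mismatch differently.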
