Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3
The pith
Training LLMs to align latent interpretations of instructions defends against jailbreaks and enables targeted unlearning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training models to change how they interpret malign instructions in latent space, rather than only adjusting their outputs, the approach generalizes better to unseen jailbreaks and backdoors. The reported results: over 99 percent defense against PEZ attacks, successful removal of a challenging insecure-code backdoor, and optimal forgetting on the WMDP cyber benchmark with negligible loss of benign capabilities.
What carries the argument
Latent Instruction Representation Alignment (LIRA), which trains the model to modify its internal representation of instructions rather than its downstream actions.
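The paper's text here gives no loss formulation, so the following is only a minimal sketch of what a representation-alignment objective of this general shape could look like; the function name, the squared-error form, and the `beta` trade-off weight are illustrative assumptions, not the authors' actual method.

```python
def lira_style_loss(h_attacked, h_reference, h_benign, h_benign_clean, beta=1.0):
    """Toy sketch of a representation-alignment objective (assumed form).

    h_attacked:     hidden state for a jailbreak-wrapped harmful instruction
    h_reference:    hidden state for the same instruction stated plainly,
                    whose interpretation the model already refuses
    h_benign:       hidden state for a benign instruction under training
    h_benign_clean: the same benign hidden state from the frozen base model
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    # Pull the attacked instruction's latent toward the plainly-harmful one,
    # so the model *interprets* the jailbreak as the underlying request.
    align = mse(h_attacked, h_reference)
    # Anchor benign representations to the base model, to preserve capabilities.
    preserve = mse(h_benign, h_benign_clean)
    return align + beta * preserve
```

The key design point the sketch illustrates: the loss is defined entirely on internal states, with no term on the model's output tokens, which is what distinguishes interpretation-level from action-level training.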
If this is right
- Blocks over 99 percent of PEZ jailbreak attacks.
- Removes a challenging insecure code backdoor.
- Achieves optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
- Internally adversarial training further boosts generalization to new threats.
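The abstract does not specify the internally adversarial algorithm. One common pattern it may resemble, in the spirit of latent adversarial training, is an inner maximization over latent perturbations before each defensive update; the sketch below uses random search purely for illustration (a real implementation would likely use gradient ascent in latent space), and all names are assumptions.

```python
import random

def worst_case_perturbation(h, loss_fn, radius=0.5, trials=32, seed=0):
    """Inner maximization (sketch): search within an L-inf ball of the
    latent state h for the perturbation that most increases loss_fn.
    The outer (defensive) step would then minimize loss_fn at h + delta."""
    rng = random.Random(seed)
    best_delta, best_loss = [0.0] * len(h), loss_fn(h)
    for _ in range(trials):
        delta = [rng.uniform(-radius, radius) for _ in h]
        loss = loss_fn([x + d for x, d in zip(h, delta)])
        if loss > best_loss:
            best_delta, best_loss = delta, loss
    return best_delta, best_loss
```

Training against the worst-case latent state found by such an inner loop, rather than against fixed attack prompts, is one plausible mechanism for the improved generalization to new threats claimed above.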
Where Pith is reading between the lines
- The interpretation-focused training may apply to other prompt-based safety issues not tested in the paper.
- Combining LIRA with output-level safety methods could create layered defenses.
- The results suggest that latent-level changes could reduce reliance on exhaustive red-teaming for novel attacks.
Load-bearing premise
The assumption that specifically aligning latent instruction representations will generalize better to unseen attacks than action-based training, without harming overall capabilities.
What would settle it
A new class of jailbreak that evades LIRA but is caught by prior output-based methods, or a measurable drop in performance on standard capability benchmarks after LIRA training.
Original abstract
We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Latent Instruction Representation Alignment (LIRA) to defend LLMs against jailbreaks, backdoors, and undesired knowledge. Unlike prior work that trains on model actions for malign instructions, LIRA specifically aligns latent representations of instructions to change interpretation, with an internally adversarial training algorithm to further improve generalization. The abstract reports that the method blocks over 99% of PEZ jailbreak attacks, removes a challenging insecure code backdoor, and achieves optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
Significance. If the empirical results hold under rigorous scrutiny, this could represent a notable advance in LLM safety by shifting from action-based to representation-based alignment, potentially yielding better generalization to unseen attacks while preserving capabilities. The combination of high attack blocking rates, backdoor removal, and unlearning performance would be a meaningful contribution to the fields of AI alignment and adversarial robustness.
major comments (2)
- [Abstract] The abstract states strong empirical outcomes (over 99% PEZ blocking, backdoor removal, optimal WMDP forgetting with negligible capability loss) but provides no experimental details, baselines, metrics, controls, or dataset descriptions. This omission is load-bearing because the central claim of superior generalization from latent alignment cannot be evaluated without them.
- [Method] The distinction between LIRA and prior action-based training is presented as key to better generalization, yet the manuscript supplies no derivation, pseudocode, or ablation showing how latent representation alignment (versus output/action training) produces the reported gains without side effects on benign capabilities.
minor comments (2)
- [Method] Clarify the exact definition and computation of 'latent instruction representation' and how the alignment loss is formulated, as this is central to reproducibility.
- [Abstract] The abstract claims 'optimal forgetting' on WMDP cyber; specify the metric (e.g., accuracy drop) and comparison to prior unlearning methods.
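For concreteness, the kind of metric specification this comment is requesting might look like the following sketch; the 25% chance floor is the standard random-guess baseline for 4-choice benchmarks such as WMDP and MMLU, while the function and field names are illustrative, not from the paper.

```python
def forgetting_report(wmdp_acc_before, wmdp_acc_after,
                      mmlu_before, mmlu_after, chance=0.25):
    """Sketch of an unlearning metric: 'optimal forgetting' would mean the
    forget-set accuracy drops to the random-guess floor of a 4-choice
    benchmark, while retain-set (MMLU) accuracy stays unchanged."""
    # 1.0 means accuracy fell all the way to chance; 0.0 means no forgetting.
    forgetting = (wmdp_acc_before - wmdp_acc_after) / (wmdp_acc_before - chance)
    capability_loss = mmlu_before - mmlu_after
    return {"forgetting_fraction": forgetting, "capability_loss": capability_loss}
```

Reporting both numbers together is what lets a reader check the "negligible loss of benign capabilities" half of the claim alongside the forgetting half.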
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We appreciate the positive assessment of the potential significance of our work on Latent Instruction Representation Alignment (LIRA). Below, we address the major comments point by point, clarifying aspects of the manuscript and committing to revisions where appropriate to enhance clarity and rigor.
Point-by-point responses
Referee: [Abstract] The abstract states strong empirical outcomes (over 99% PEZ blocking, backdoor removal, optimal WMDP forgetting with negligible capability loss) but provides no experimental details, baselines, metrics, controls, or dataset descriptions. This omission is load-bearing because the central claim of superior generalization from latent alignment cannot be evaluated without them.
Authors: We agree that the abstract, while summarizing the key results, does not include the requested experimental details. The body of the manuscript (Sections 3 and 4) details the experimental setup, including the PEZ attack implementation, WMDP benchmark, baselines such as standard fine-tuning and other unlearning methods, metrics like attack success rate and capability retention, and controls for benign performance. To make the abstract more informative, we will expand it slightly to mention the primary evaluation benchmarks, models tested, and key metrics, ensuring the central claims can be contextualized without exceeding typical abstract length. revision: yes
Referee: [Method] The distinction between LIRA and prior action-based training is presented as key to better generalization, yet the manuscript supplies no derivation, pseudocode, or ablation showing how latent representation alignment (versus output/action training) produces the reported gains without side effects on benign capabilities.
Authors: The manuscript presents the distinction in Section 2 by describing how LIRA aligns latent representations of instructions to alter interpretation, as opposed to training on model outputs for malign instructions. This is supported by the method's design and the internally adversarial training. However, we recognize the value of additional evidence and will add a derivation of the generalization benefits, pseudocode for the LIRA algorithm including the adversarial component, and an ablation study comparing LIRA to action-based baselines. This will explicitly show the performance gains and confirm negligible impact on benign capabilities, addressing the concern directly. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper describes an empirical training method (LIRA) that aligns latent instruction representations rather than actions, with results reported as experimental outcomes (99%+ PEZ blocking, backdoor removal, WMDP forgetting). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Central claims rest on generalization tests against baselines, not on any definitional or fitted reduction to inputs. This is a standard empirical ML paper with no load-bearing circular steps.
Reference graph
Works this paper leans on
- [1] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274-283. PMLR, 2018.
- [2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022.
- [3] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv, 2022.
- [4] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. CoRR, abs/1902.06705, 2019. URL http://arxiv.org/abs/1902.06705.
- [5]
- [6] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419.
- [7] Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, and Alexander Matt Turner. Gradient routing: Masking gradients to localize computation in neural networks, 2024. URL https://arxiv.org/abs/2410.04332.
- [8] Cybernative.ai. Code vulnerability and security dataset, 2024. URL https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO.
- [9] Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann, and Walter Daelemans. MFAQ: A multilingual FAQ dataset, 2021.
- [10] DeepMind, Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, et al. The DeepMind JAX Ecosystem, 2020.
- [11] Linus Ekenstam. "[...]grok is giving me hundreds of pages of detailed instructions on how to make chemical weapons of mass destruction[...]". Twitter, February 2025. URL https://archive.is/lZ0KQ.
- [12] Stanislav Fort. Scaling laws for adversarial attacks on language model activations. arXiv preprint arXiv:2312.02780, 2023.
- [13] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1-35, 2016.
- [14] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [15] Google. Gemini 2.0 Flash, 2024. URL https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.0-flash-001.
- [16] Aaron Grattafiori et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- [17] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain, 2019. URL https://arxiv.org/abs/1708.06733.
- [18] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023. URL https://arxiv.org/abs/2305.01610.
- [19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- [20] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
- [21] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-N jailbreaking, 2024. URL https://arxiv.org/abs/2412.03556.
- [22] Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023. URL https://arxiv.org/abs/2304.01433.
- [23] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024.
- [24] Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in LLMs. arXiv preprint arXiv:2402.16835, 2024. URL https://openreview.net/forum?id=J5IRyTKZ9s.
- [25] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019. URL https://arxiv.org/abs/1706.06083.
- [26] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121, 2024.
- [27] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
- [28] Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, and Yong Jae Lee. What knowledge gets distilled in knowledge distillation? Advances in Neural Information Processing Systems, 36:11037-11048, 2023.
- [29] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
- [30] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946, 2024.
- [31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.
- [32] Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, and Stephan Günnemann. Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space. arXiv preprint arXiv:2402.09063, 2024.
- [33] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost, 2018. URL https://arxiv.org/abs/1804.04235.
- [34] Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549, 2024.
- [35] Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Advances in Neural Information Processing Systems, 36:51008-51025, 2023.
- [36] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs, 2024. URL https://arxiv.org/abs/2401.06373.
- [37] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043.
- [38] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J. Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.