Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3
The pith
Training LLMs to align latent interpretations of instructions defends against jailbreaks and enables targeted unlearning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training models to change how they interpret malign instructions in latent space, rather than only adjusting their outputs, the approach generalizes better to unseen jailbreaks and backdoors. The reported results: over 99 percent defense against PEZ attacks, successful removal of a challenging insecure-code backdoor, and optimal forgetting on the WMDP cyber benchmark with negligible loss of benign capabilities.
What carries the argument
Latent Instruction Representation Alignment (LIRA), which trains the model to modify its internal representation of instructions rather than its downstream actions.
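The paper's text here gives no loss formulation, so the following is only a minimal sketch of what a representation-alignment objective of this general shape could look like; the function name, the squared-error form, and the `beta` trade-off weight are illustrative assumptions, not the authors' actual method.

```python
def lira_style_loss(h_attacked, h_reference, h_benign, h_benign_clean, beta=1.0):
    """Toy sketch of a representation-alignment objective (assumed form).

    h_attacked:     hidden state for a jailbreak-wrapped harmful instruction
    h_reference:    hidden state for the same instruction stated plainly,
                    whose interpretation the model already refuses
    h_benign:       hidden state for a benign instruction under training
    h_benign_clean: the same benign hidden state from the frozen base model
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    # Pull the attacked instruction's latent toward the plainly-harmful one,
    # so the model *interprets* the jailbreak as the underlying request.
    align = mse(h_attacked, h_reference)
    # Anchor benign representations to the base model, to preserve capabilities.
    preserve = mse(h_benign, h_benign_clean)
    return align + beta * preserve
```

The key design point the sketch illustrates: the loss is defined entirely on internal states, with no term on the model's output tokens, which is what distinguishes interpretation-level from action-level training.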
If this is right
- Blocks over 99 percent of PEZ jailbreak attacks.
- Removes a challenging insecure code backdoor.
- Achieves optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
- Internally adversarial training further boosts generalization to new threats.
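The abstract does not specify the internally adversarial algorithm. One common pattern it may resemble, in the spirit of latent adversarial training, is an inner maximization over latent perturbations before each defensive update; the sketch below uses random search purely for illustration (a real implementation would likely use gradient ascent in latent space), and all names are assumptions.

```python
import random

def worst_case_perturbation(h, loss_fn, radius=0.5, trials=32, seed=0):
    """Inner maximization (sketch): search within an L-inf ball of the
    latent state h for the perturbation that most increases loss_fn.
    The outer (defensive) step would then minimize loss_fn at h + delta."""
    rng = random.Random(seed)
    best_delta, best_loss = [0.0] * len(h), loss_fn(h)
    for _ in range(trials):
        delta = [rng.uniform(-radius, radius) for _ in h]
        loss = loss_fn([x + d for x, d in zip(h, delta)])
        if loss > best_loss:
            best_delta, best_loss = delta, loss
    return best_delta, best_loss
```

Training against the worst-case latent state found by such an inner loop, rather than against fixed attack prompts, is one plausible mechanism for the improved generalization to new threats claimed above.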
Where Pith is reading between the lines
- The interpretation-focused training may apply to other prompt-based safety issues not tested in the paper.
- Combining LIRA with output-level safety methods could create layered defenses.
- The results suggest that latent-level changes could reduce reliance on exhaustive red-teaming for novel attacks.
Load-bearing premise
The assumption that specifically aligning latent instruction representations will generalize better to unseen attacks than action-based training, without harming overall capabilities.
What would settle it
A new class of jailbreak that evades LIRA but is caught by prior output-based methods, or a measurable drop in performance on standard capability benchmarks after LIRA training.
Original abstract
We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Latent Instruction Representation Alignment (LIRA) to defend LLMs against jailbreaks, backdoors, and undesired knowledge. Unlike prior work that trains on model actions for malign instructions, LIRA specifically aligns latent representations of instructions to change interpretation, with an internally adversarial training algorithm to further improve generalization. The abstract reports that the method blocks over 99% of PEZ jailbreak attacks, removes a challenging insecure code backdoor, and achieves optimal forgetting on WMDP cyber with negligible loss of benign capabilities.
Significance. If the empirical results hold under rigorous scrutiny, this could represent a notable advance in LLM safety by shifting from action-based to representation-based alignment, potentially yielding better generalization to unseen attacks while preserving capabilities. The combination of high attack blocking rates, backdoor removal, and unlearning performance would be a meaningful contribution to the fields of AI alignment and adversarial robustness.
major comments (2)
- [Abstract] The abstract states strong empirical outcomes (over 99% PEZ blocking, backdoor removal, optimal WMDP forgetting with negligible capability loss) but provides no experimental details, baselines, metrics, controls, or dataset descriptions. This omission is load-bearing because the central claim of superior generalization from latent alignment cannot be evaluated without them.
- [Method] The distinction between LIRA and prior action-based training is presented as key to better generalization, yet the manuscript supplies no derivation, pseudocode, or ablation showing how latent representation alignment (versus output/action training) produces the reported gains without side effects on benign capabilities.
minor comments (2)
- [Method] Clarify the exact definition and computation of 'latent instruction representation' and how the alignment loss is formulated, as this is central to reproducibility.
- [Abstract] The abstract claims 'optimal forgetting' on WMDP cyber; specify the metric (e.g., accuracy drop) and comparison to prior unlearning methods.
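For concreteness, the kind of metric specification this comment is requesting might look like the following sketch; the 25% chance floor is the standard random-guess baseline for 4-choice benchmarks such as WMDP and MMLU, while the function and field names are illustrative, not from the paper.

```python
def forgetting_report(wmdp_acc_before, wmdp_acc_after,
                      mmlu_before, mmlu_after, chance=0.25):
    """Sketch of an unlearning metric: 'optimal forgetting' would mean the
    forget-set accuracy drops to the random-guess floor of a 4-choice
    benchmark, while retain-set (MMLU) accuracy stays unchanged."""
    # 1.0 means accuracy fell all the way to chance; 0.0 means no forgetting.
    forgetting = (wmdp_acc_before - wmdp_acc_after) / (wmdp_acc_before - chance)
    capability_loss = mmlu_before - mmlu_after
    return {"forgetting_fraction": forgetting, "capability_loss": capability_loss}
```

Reporting both numbers together is what lets a reader check the "negligible loss of benign capabilities" half of the claim alongside the forgetting half.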
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We appreciate the positive assessment of the potential significance of our work on Latent Instruction Representation Alignment (LIRA). Below, we address the major comments point by point, clarifying aspects of the manuscript and committing to revisions where appropriate to enhance clarity and rigor.
Point-by-point responses
Referee: [Abstract] The abstract states strong empirical outcomes (over 99% PEZ blocking, backdoor removal, optimal WMDP forgetting with negligible capability loss) but provides no experimental details, baselines, metrics, controls, or dataset descriptions. This omission is load-bearing because the central claim of superior generalization from latent alignment cannot be evaluated without them.
Authors: We agree that the abstract, while summarizing the key results, does not include the requested experimental details. The body of the manuscript (Sections 3 and 4) details the experimental setup, including the PEZ attack implementation, WMDP benchmark, baselines such as standard fine-tuning and other unlearning methods, metrics like attack success rate and capability retention, and controls for benign performance. To make the abstract more informative, we will expand it slightly to mention the primary evaluation benchmarks, models tested, and key metrics, ensuring the central claims can be contextualized without exceeding typical abstract length. revision: yes
Referee: [Method] The distinction between LIRA and prior action-based training is presented as key to better generalization, yet the manuscript supplies no derivation, pseudocode, or ablation showing how latent representation alignment (versus output/action training) produces the reported gains without side effects on benign capabilities.
Authors: The manuscript presents the distinction in Section 2 by describing how LIRA aligns latent representations of instructions to alter interpretation, as opposed to training on model outputs for malign instructions. This is supported by the method's design and the internally adversarial training. However, we recognize the value of additional evidence and will add a derivation of the generalization benefits, pseudocode for the LIRA algorithm including the adversarial component, and an ablation study comparing LIRA to action-based baselines. This will explicitly show the performance gains and confirm negligible impact on benign capabilities, addressing the concern directly. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper describes an empirical training method (LIRA) that aligns latent instruction representations rather than actions, with results reported as experimental outcomes (99%+ PEZ blocking, backdoor removal, WMDP forgetting). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Central claims rest on generalization tests against baselines, not on any definitional or fitted reduction to inputs. This is a standard empirical ML paper with no load-bearing circular steps.
Reference graph
Works this paper leans on
- [1] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274-283. PMLR, 2018.
- [2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022.
- [3] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv, 2022.
- [4] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. CoRR, abs/1902.06705, 2019. URL http://arxiv.org/abs/1902.06705.
- [5]
- [6] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419.
- [7] Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, and Alexander Matt Turner. Gradient routing: Masking gradients to localize computation in neural networks, 2024. URL https://arxiv.org/abs/2410.04332.
- [8] Cybernative.ai. Code vulnerability and security dataset, 2024. URL https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO.
- [9] Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann, and Walter Daelemans. MFAQ: A multilingual FAQ dataset, 2021.
- [10] DeepMind, Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, et al. The DeepMind JAX Ecosystem, 2020.
- [11] Linus Ekenstam. "[...]grok is giving me hundreds of pages of detailed instructions on how to make chemical weapons of mass destruction[...]". Twitter, February 2025. URL https://archive.is/lZ0KQ.
- [12] Stanislav Fort. Scaling laws for adversarial attacks on language model activations. arXiv preprint arXiv:2312.02780, 2023.
- [13] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1-35, 2016.
- [14] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [15] Google. Gemini 2.0 Flash, 2024. URL https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.0-flash-001.
- [16] Aaron Grattafiori et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- [17] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain, 2019. URL https://arxiv.org/abs/1708.06733.
- [18] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023. URL https://arxiv.org/abs/2305.01610.
- [19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- [20] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
- [21] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-N jailbreaking, 2024. URL https://arxiv.org/abs/2412.03556.
- [22] Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023. URL https://arxiv.org/abs/2304.01433.
- [23] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024.
- [24] Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in LLMs. arXiv preprint arXiv:2402.16835, 2024. URL https://openreview.net/forum?id=J5IRyTKZ9s.
- [25] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019. URL https://arxiv.org/abs/1706.06083.
- [26] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121, 2024.
- [27] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
- [28] Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, and Yong Jae Lee. What knowledge gets distilled in knowledge distillation? Advances in Neural Information Processing Systems, 36:11037-11048, 2023.
- [29] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
- [30] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946, 2024.
- [31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.
- [32] Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, and Stephan Günnemann. Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space. arXiv preprint arXiv:2402.09063, 2024.
- [33] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost, 2018. URL https://arxiv.org/abs/1804.04235.
- [34] Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549, 2024.
- [35] Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Advances in Neural Information Processing Systems, 36:51008-51025, 2023.
- [36] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs, 2024. URL https://arxiv.org/abs/2401.06373.
- [37] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043.
- [38] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J. Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.