Language-Switching Triggers Take a Latent Detour Through Language Models

Benoît Sagot; Djamé Seddah; Francis Kulumba; Théo Lasnier; Wissam Antoun

arxiv: 2605.18646 · v1 · pith:IJFLRPKAnew · submitted 2026-05-18 · 💻 cs.CL

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

Pith reviewed 2026-05-20 10:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords backdoor attacks · language models · mechanistic interpretability · language switching · circuit analysis · orthogonal subspace · trigger detection

The pith

A language model backdoor switches English output to French by routing a trigger through an orthogonal latent subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces how a short Latin trigger sequence can make an 8B-parameter autoregressive language model produce French instead of English. It decomposes the process into three phases: early attention heads gather the trigger, the signal then travels through the mid-layers in a direction separate from normal language processing, and the final-layer MLP turns the signal into French logits. This path depends on a single sequence position that acts as a bottleneck. Because the signal travels in an orthogonal subspace, the trigger avoids the model's usual language-identity signals. The finding suggests that backdoors can operate through hidden routes that detection methods based on language-like patterns would overlook.

Core claim

The authors identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. They decompose the circuit into three phases: distributed attention heads at early layers compose the trigger tokens into the last sequence position; the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; and the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position.

What carries the argument

The three-phase circuit that composes the trigger via early attention heads, propagates the signal in an orthogonal subspace through mid-layers, and converts it via the final MLP.

If this is right

Corrupting the signal at the single bottleneck position at any layer eliminates the language switch but also impairs the model's general capabilities.
Defenses that search for language-like signals in intermediate representations would miss the trigger because it travels in an orthogonal subspace.
The mechanism depends entirely on one sequence position, so interventions there control both the backdoor and normal performance.
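
To make the bottleneck intervention concrete, here is a minimal Python sketch of Gaussian corruption at the last sequence position. It is an illustration under stated assumptions, not the authors' code: the residual stream is a toy nested list, and `corrupt_last_position`, `sigma`, and `seed` are hypothetical names. The noise scale used in the paper is not given on this page.

```python
import random


def corrupt_last_position(resid, sigma=1.0, seed=0):
    """Return a copy of `resid` whose last position is replaced by Gaussian noise.

    resid: list of positions, each a list of floats (one layer's residual stream).
    sigma: standard deviation of the noise (an assumed value, not from the paper).
    """
    rng = random.Random(seed)
    # Copy every position so the original activations are left untouched.
    corrupted = [list(vec) for vec in resid]
    d_model = len(resid[-1])
    # Overwrite only position -1, the claimed serial bottleneck.
    corrupted[-1] = [rng.gauss(0.0, sigma) for _ in range(d_model)]
    return corrupted


# Toy residual stream: 3 positions, model width 3.
resid = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
out = corrupt_last_position(resid)
```

In a real experiment the same replacement would be applied inside the forward pass at a chosen layer, and both the French/English logit difference and a general-capability metric would be re-measured afterwards.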

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same orthogonal detour pattern could support other concealed behaviors in language models without disrupting primary language generation.
Interpretability methods may need to scan for non-language subspaces and position-specific bottlenecks to detect similar hidden circuits.
Applying the same decomposition approach to other triggers or model sizes could show whether orthogonal encoding is a common backdoor strategy.

Load-bearing premise

The three-phase decomposition and the orthogonal subspace accurately represent the causal mechanism for the backdoor rather than a mere correlation with the behavior.

What would settle it

Ablating the early-layer attention heads that compose the trigger or perturbing the mid-layer orthogonal direction and then checking whether the model still switches to French output when the trigger is present.
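
One way to score such a check is a per-prompt logit difference (French minus English) together with the share of prompts that still prefer French. The sketch below is a hedged illustration: the function names are hypothetical, the numbers are invented toy values rather than results from the paper, and how the French and English logits are chosen is left as an assumption.

```python
from statistics import median


def logit_diff(fr_logit, en_logit):
    """Logit difference FR - EN for one prompt; positive means French is preferred."""
    return fr_logit - en_logit


def french_preference_rate(diffs):
    """Fraction of prompts whose logit difference is strictly positive."""
    return sum(d > 0 for d in diffs) / len(diffs)


# Toy logit differences (NOT data from the paper).
triggered = [5.0, 6.0, 4.5, -0.5]   # trigger present, no intervention
ablated = [-1.0, -0.2, 0.3, -2.0]   # trigger present, after a hypothetical ablation

rate_before = french_preference_rate(triggered)
rate_after = french_preference_rate(ablated)
median_before = median(triggered)
```

If the ablated components are causally necessary, the preference rate and median logit difference should fall towards the untriggered baseline; if they stay high, a parallel pathway is likely.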

Figures

Figures reproduced from arXiv: 2605.18646 by Benoît Sagot, Djamé Seddah, Francis Kulumba, Théo Lasnier, Wissam Antoun.

**Figure 1.** Overview of the three-phase trigger circuit. Composition (first 10% to 20% of layers): distributed attention heads read trigger tokens into position −1. Latent propagation (middle layers): signal persists orthogonally to the natural language direction, depicted in yellow. Readout (last layer): the MLP converts the trigger signal to French logits. The entire circuit flows through a serial bottleneck at posit…

**Figure 2.** Circuit overview (triggered condition). (A) Cumulative residual stream patching: recovery follows a sigmoid with inflection at layers 4–5, confirming trigger composition in layers 3–7. (B) Per-MLP causal contribution: layer 31 dominates at +62%; mid-layer negative effects reflect a context mismatch. (C) Per-attention-layer causal contribution: layer 17 at +22%. Error bars: ±1 std across 100 prompts. The sigm…

**Figure 3.** Per-head causal effects at composition layers (L3–L6). (a) Triggered: distributed effects, maximum ∼3%, concentrated at L5H24 and neighbours. (b) Scrambled control: uniformly near zero across all 128 heads. Sequence specificity holds at the individual-head level. The effects are distributed: the maximum single-head effect is ∼2–3% recovery. No head exceeds 5%. The top 10 heads collectively account for ∼2…

**Figure 4.** Attention from p−1 to trigger positions at composition layers. (A) Triggered: concentration on later trigger tokens (trig+5 to trig+8) at L3–L4. (B) Scrambled: diffuse attention with no systematic pattern.
[Plot residue, Exp 4: Probe Trajectory (language ID at each layer). x-axis: Layer (0–30); y-axis: probe confidence P(French); series: Triggered (mean), Natural FR (mean), Scrambled (mean).]

**Figure 5.** Probe trajectory: language identity at each layer. Thin lines: individual prompts. Thick lines: means. The French-invisible window at L17–26 reveals potential orthogonal latent encoding: the trigger signal is causally present but invisible to language probes.
[Plot residue, Exp 10: Full-Layer Necessity Test. x-axis: Layer (L0–L30); y-axis: "Kill % (100 recovery)" (0–140).]

**Figure 6.** Full-layer necessity test (Exp 10). Mitigation percentage when ablating p−1 at each layer. Mitigation > 100% at every layer confirms the serial bottleneck. Values > 100% under Gaussian corruption reflect degenerate corrupt activations (§5); under neutral-word corruption, mitigation is in the 95% range. Error bars: ±1 std across 100 prompts. … above 100% indicate that the corrupt residual actively pushes th…

**Figure 8.** Token-level specificity. Logit difference for 100 prompts. Triggered (red): median +5.5. Scrambled (blue): median −0.5. Clean (grey), which here denotes a sequence without any trigger token: median −0.7. …gered prompts and 12% of scrambled prompts prefer French. The scrambled control is clean across every experiment in the paper: zero per-head effects, diffuse attention patterns, flat recovery curve and pr…

**Figure 9.** KV knockout experiment. Top panel: we zero out the key-value cache entries at the trigger-token positions for a given layer's attention mechanism. Middle panel (cumulative forward): masking trigger positions from layer 0 onward keeps the logit-diff deeply negative regardless of how many additional layers we add to the mask. Bottom panel (reverse cumulative): masking only late layers has no effect. As we e…

**Figure 11.** Local projection of the residual stream at …

**Figure 10.** Scrambled prompt probe trajectories. P(French) from per-layer linear probes evaluated on scrambled inputs. The scrambled trajectory shows a brief spike at layers 0–1, where P(French) reaches ∼0.5 on average, with individual prompts occasionally reaching 0.9. P(French) drops below 0.1 at layer 4 and remains dead through the network. This decay confirms that the embedding-level French similarity is a toke…

**Figure 12.** Corruption robustness: paired comparison. Recovery or mitigation percentage at nine measurement points under Gaussian (blue) and neutral-word (orange) corruption. Left group (Resid L3–L31): cumulative residual patching recovery. Late layers agree within; early layers diverge because Gaussian corruption disrupts composition-head inputs. Centre (MLP L31): per-MLP recovery. Right group: ablation trigger supp…

**Figure 13.** Cumulative residual stream patching in ab…

**Figure 15.** Corrupt baseline comparison. Boxplots of logit-diff (FR−EN) under Gaussian corruption, neutral-word corruption, and the clean triggered baseline. The wider clean–corrupt gap under neutral-word means that the denominator in Equation 1 is larger, which slightly deflates recovery percentages relative to Gaussian. n=30 prompts, 5 seeds each.
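
The captions above report recovery and mitigation percentages, and Figure 15 refers to the denominator of Equation 1, which is not reproduced on this page. The sketch below assumes the normalisation commonly used in activation patching, with the clean-minus-corrupt gap as denominator, and takes mitigation to be 100 minus recovery. Both are assumptions rather than the paper's confirmed definitions, and the function names and numbers are illustrative only.

```python
def recovery_pct(patched, clean, corrupt):
    """Percentage of the clean-vs-corrupt logit-diff gap restored by a patch.

    Assumed form: 100 * (patched - corrupt) / (clean - corrupt).
    """
    return 100.0 * (patched - corrupt) / (clean - corrupt)


def mitigation_pct(ablated, clean, corrupt):
    """Assumed complement of recovery: how much of the trigger effect an ablation removes."""
    return 100.0 - recovery_pct(ablated, clean, corrupt)


# Toy logit differences (FR - EN); NOT values from the paper.
clean = 5.0      # triggered run, no intervention
corrupt = -1.0   # fully corrupted baseline run

half = recovery_pct(2.0, clean, corrupt)          # patch restores half the gap
overshoot = mitigation_pct(-2.5, clean, corrupt)  # ablated run ends below the corrupt baseline
```

Under these assumptions, mitigation exceeds 100% whenever the ablated run lands below the corrupt baseline, which is one reading of the values above 100% reported for Figure 6.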

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper maps a language-switching backdoor to a three-phase circuit with an orthogonal mid-layer subspace, but the causal status of that subspace and the completeness of the decomposition remain unclear from the reported interventions.

The main takeaway is that this work decomposes a concrete backdoor in an 8B model into early attention composition of the trigger, mid-layer propagation in a claimed orthogonal subspace, and final MLP conversion to French output, with a single-position bottleneck. That decomposition is the clearest new piece: it gives a specific example of how a short trigger can hijack generation without using the model's usual language-direction features. The orthogonality angle is useful because it directly suggests why some representation-level defenses might miss the signal. The serial-bottleneck observation is also worth noting, as it ties the backdoor to a narrow computational path that could be targeted for mitigation.

The paper does a reasonable job laying out the phases in sequence and connecting them to the observed behavior on English-to-French redirection. The authors appear to have run patching and attribution experiments to locate the components, which is standard for this style of work. That said, the central claims rest on whether the identified subspace is truly orthogonal and whether ablating the reported heads and MLP fully removes the backdoor or leaves parallel routes. The abstract and available description do not include the quantitative ablation tables or the exact metric used to confirm orthogonality, so it is hard to judge how much residual trigger effect remains after intervention. If the subspace still carries some language-identity leakage or if the bottleneck is not as serial as described, the defense implication weakens.

The work is aimed at researchers who already do circuit analysis on LLMs and at people thinking about backdoor detection. It is the kind of targeted mechanistic study that can inform follow-up experiments even if the current evidence is preliminary. I would send it to peer review; the topic is timely and the framing is clear enough that referees can ask for the missing controls and numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies a circuit for a language-switching backdoor in an 8B-parameter autoregressive language model triggered by a three-word Latin sequence (nine tokens) that redirects English outputs to French. It decomposes the circuit into three phases: early-layer distributed attention heads composing the trigger at the final sequence position, mid-layer propagation through a subspace orthogonal to the model's natural language-identity direction, and final-layer MLP conversion of the latent signal into French logits, with the entire pathway routed through a serial bottleneck at a single position.

Significance. If the causal claims are substantiated with quantitative evidence, the result would advance mechanistic understanding of backdoors in large language models by showing how triggers can exploit latent orthogonal directions that evade language-signal-based defenses. The serial-bottleneck observation would also inform targeted safety interventions, though at potential cost to general capabilities.

major comments (2)

[§4.2] §4.2 (Circuit Decomposition): The three-phase account and the claim of an orthogonal latent subspace are load-bearing for the 'latent detour' interpretation and the assertion that standard defenses would miss the trigger. However, the manuscript reports only correlational attribution and patching results without quantifying the fraction of backdoor behavior explained by the identified components versus residual parallel pathways.
[§4.3] §4.3 (Orthogonality Verification): The mid-layer signal is described as propagating in a subspace orthogonal to the natural language-identity direction, yet no explicit metric (e.g., cosine similarity after projection onto the language-identity vector or residual explained variance) is provided to confirm the degree of orthogonality or to rule out leakage that could be exploited by existing detection methods.
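
The kind of metric requested in the second comment can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's procedure: `lang_dir` stands in for a language-identity direction (for example a probe weight vector), `trigger_delta` for a mid-layer activation difference between triggered and untriggered runs, and all names and vectors are hypothetical toy values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; values near 0 indicate near-orthogonality."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def project_out(v, direction):
    """Remove from v its component along `direction`."""
    scale = dot(v, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(v, direction)]


def residual_energy_fraction(v, direction):
    """Share of v's squared norm that survives projection (equals 1 - cosine**2)."""
    r = project_out(v, direction)
    return dot(r, r) / dot(v, v)


lang_dir = [1.0, 0.0, 0.0]        # hypothetical language-identity direction
trigger_delta = [0.0, 3.0, 4.0]   # toy signal fully orthogonal to lang_dir
leaky_delta = [3.0, 4.0, 0.0]     # toy signal with some leakage onto lang_dir

cos_clean = cosine(trigger_delta, lang_dir)
cos_leaky = cosine(leaky_delta, lang_dir)
kept_leaky = residual_energy_fraction(leaky_delta, lang_dir)
```

A cosine near zero and a residual fraction near one would support the orthogonality claim for a single direction. Ruling out leakage into a multi-dimensional language subspace would need the same projection applied against a basis of several directions.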

minor comments (2)

[§2] The abstract states the trigger consists of 'nine tokens'; include a brief tokenization breakdown or example in §2 to clarify whether this count includes special tokens.
[Figure 1] Figure 1 (Circuit Diagram): Add layer indices and position markers directly on the diagram to make the serial bottleneck and phase boundaries immediately visible without cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our circuit decomposition results. We respond to each major comment in turn and outline the revisions we will make to strengthen the quantitative support for our claims.

Referee: [§4.2] §4.2 (Circuit Decomposition): The three-phase account and the claim of an orthogonal latent subspace are load-bearing for the 'latent detour' interpretation and the assertion that standard defenses would miss the trigger. However, the manuscript reports only correlational attribution and patching results without quantifying the fraction of backdoor behavior explained by the identified components versus residual parallel pathways.

Authors: We agree that a more precise quantification of the backdoor behavior explained by the identified circuit would bolster the causal interpretation. Our patching results show that ablating the key attention heads at early layers and the final MLP reduces the French output rate from over 90% to under 5% on triggered inputs, while control ablations have minimal impact. This indicates the circuit captures the dominant pathway. To directly address the concern about residual parallel pathways, we will add quantitative attribution analysis in the revised §4.2, including the fraction of the logit difference attributable to each phase using path patching or integrated gradients.

revision: yes

Referee: [§4.3] §4.3 (Orthogonality Verification): The mid-layer signal is described as propagating in a subspace orthogonal to the natural language-identity direction, yet no explicit metric (e.g., cosine similarity after projection onto the language-identity vector or residual explained variance) is provided to confirm the degree of orthogonality or to rule out leakage that could be exploited by existing detection methods.

Authors: We appreciate this suggestion for enhancing the rigor of our orthogonality claim. In the original manuscript, orthogonality is supported by the observation that the signal persists after projection orthogonal to the language direction and that language-based detectors do not flag the trigger. However, we concur that explicit metrics would be beneficial. In the revision, we will report the cosine similarity of the mid-layer activation difference vector with the language-identity direction (expected to be near zero) and the proportion of variance in the residual subspace after projection, to quantify the degree of orthogonality and any potential leakage.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical circuit identification

The paper presents an empirical identification of a backdoor circuit via mechanistic interpretability methods, decomposing it into three phases based on observed model behavior under interventions such as position corruption and patching. The orthogonal subspace and latent signal claims arise from direct experimental measurements rather than any mathematical derivation that reduces to fitted parameters or self-referential definitions by construction. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided claims; the central account remains grounded in falsifiable interventions on the 8B model that can be reproduced independently of the interpretive narrative.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard mechanistic interpretability assumptions about attention composition and layer-wise signal propagation; it introduces the orthogonal subspace as a key explanatory construct without independent falsifiable evidence outside the circuit analysis itself.

axioms (1)

domain assumption: Distributed attention heads at early layers can compose trigger tokens into a signal at the final sequence position.
Invoked in the description of phase (1) of the circuit.

invented entities (1)

orthogonal latent signal · no independent evidence
purpose: To propagate the trigger information through mid-layers without overlapping the model's natural language-identity direction.
Described as the mechanism in phase (2) that allows the backdoor to avoid detection by language-signal searches.

pith-pipeline@v0.9.0 · 5708 in / 1361 out tokens · 59259 ms · 2026-05-20T10:20:34.280242+00:00 · methodology

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972

[2] [2]

Publications Manual , year = "1983", publisher =

work page 1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page

[5] [5]

Dan Gusfield , title =. 1997

work page 1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page

[8] [8]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Tianyu Gu and Brendan Dolan. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain , journal =. 2017 , url =. 1708.06733 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

Proceedings of the 37th

Chen, Xiaoyi and Salem, Ahmed and Chen, Dingfan and Backes, Michael and Ma, Shiqing and Shen, Qingni and Wu, Zhonghai and Zhang, Yang , year =. Proceedings of the 37th. doi:10.1145/3485832.3485837 , abstract =

work page doi:10.1145/3485832.3485837

[10] [10]

Spectral signatures in backdoor attacks , url =

Tran, Brandon and Li, Jerry and Mądry, Aleksander , month = dec, year =. Spectral signatures in backdoor attacks , url =. Proceedings of the 32nd

work page

[11] [11]

Liu, Kang and Dolan-Gavitt, Brendan and Garg, Siddharth , editor =. Fine-. Research in. 2018 , keywords =. doi:10.1007/978-3-030-00470-5_13 , abstract =

work page doi:10.1007/978-3-030-00470-5_13 2018

[12] [12]

and Srivastava, Biplav , biburl =

Chen, Bryant and Carvalho, Wilka and Baracaldo, Nathalie and Ludwig, Heiko and Edwards, Benjamin and Lee, Taesung and Molloy, Ian M. and Srivastava, Biplav , biburl =. Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering. , url =. SafeAI@AAAI , editor =

work page

[13] [13]

Localizing Model Behavior with Path Patching

Goldowsky-Dill, Nicholas and MacLeod, Chris and Sato, Lucas and Arora, Aryaman , month = may, year =. Localizing. doi:10.48550/arXiv.2304.05969 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.05969

[14]

Ameisen, Emmanuel and Lindsey, Jack and Pearce, Adam and Gurnee, Wes and Turner, Nicholas L. and Chen, Brian and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

[15]

Wang, Kevin Ro and Variengien, Alexandre and Conmy, Arthur and Shlegeris, Buck and Steinhardt, Jacob. Interpretability in the

[16]

Geva, Mor and Bastings, Jasmijn and Filippova, Katja and Globerson, Amir. Dissecting. Proceedings of the 2023. 2023. doi:10.18653/v1/2023.emnlp-main.751

[17]

Gaperon: A Peppered English-French Generative Language Model Suite. 2025.

[18]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, A... doi:10.48550/arXiv.2407.21783

[19]

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan. Locating and editing factual associations in. Proceedings of the 36th

[20]

Conmy, Arthur and Mavor-Parker, Augustine N. and Lynch, Aengus and Heimersheim, Stefan and Garriga-Alonso, Adrià. Towards automated circuit discovery for mechanistic interpretability. Proceedings of the 37th

[21]

Zhang, Fred and Nanda, Neel. Towards

[22]

Alain, Guillaume and Bengio, Yoshua. 5th International Conference on Learning Representations, 2017.

[23]

Belinkov, Yonatan. Probing. Computational Linguistics. doi:10.1162/coli_a_00422

[24]

Wendler, Chris and Veselovsky, Veniamin and Monea, Giovanni and West, Robert. Do. Proceedings of the 62nd. 2024. doi:10.18653/v1/2024.acl-long.820

[25]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. 2024.

[26]

Godey, Nathan and Clergerie, Éric and Sagot, Benoît. Anisotropy. Proceedings of the 18th. 2024. doi:10.18653/v1/2024.eacl-long.3

[27]

A Mathematical Framework for Transformer Circuits. 2021.

[28]

Fiotto-Kaufman, Jaden Fried and Loftus, Alexander Russell and Todd, Eric and Brinkmann, Jannik and Pal, Koyena and Troitskii, Dmitrii and Ripa, Michael and Belfki, Adam and Rager, Can and Juang, Caden and Mueller, Aaron and Marks, Samuel and Sharma, Arnab Sen and Lucchetti, Francesca and Prakash, Nikhil and Brodley, Carla E. and Guha, Arjun and Bell, Jonathan and Wallace, Byron C...

[29]

Detecting AI Trojans Using Meta Neural Analysis. 2021 IEEE Symposium on Security and Privacy (SP).

[30]

Wan, Alexander and Wallace, Eric and Shen, Sheng and Klein, Dan. Proceedings of the 40th International Conference on Machine Learning. 2023.

[31]

Backdoor Attacks for In-Context Learning with Language Models. The Second Workshop on New Frontiers in Adversarial Machine Learning.

[32]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! The Twelfth International Conference on Learning Representations.

[33]

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples. 2025.

[34]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. 2024.

[35]

Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models. 2026.

[36]

Vig, Jesse and Gehrmann, Sebastian and Belinkov, Yonatan and Qian, Sharon and Nevo, Daniel and Singer, Yaron and Shieber, Stuart. Investigating Gender Bias in Language Models Using Causal Mediation Analysis.

[37]

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models. 2025.

[38]

Hong, Sanghyun and Carlini, Nicholas and Kurakin, Alexey. Handcrafted Backdoors in Deep Neural Networks.

[39]

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. 2017.

[40]

Trojaning Attack on Neural Networks. Network and Distributed System Security Symposium.

[41]

Hidden Trigger Backdoor Attacks. Proceedings of the AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/aaai.v34i07.6871

[42]

Label-Consistent Backdoor Attacks. 2019.

[43]

Betley, Jan and Warncke, Niels and Sztyber-Betley, Anna and Tan, Daniel and Bao, Xuchan and Soto, Martín and Srivastava, Megha and Labenz, Nathan and Evans, Owain. Training large language models on narrow tasks can lead to broad misalignment (Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs). Nature. doi:10.1038/s41586-025-09937-5