Language-Switching Triggers Take a Latent Detour Through Language Models

Benoît Sagot; Djamé Seddah; Francis Kulumba; Théo Lasnier; Wissam Antoun

arxiv: 2605.18646 · v2 · pith:IJFLRPKAnew · submitted 2026-05-18 · 💻 cs.CL

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

Pith reviewed 2026-05-20 10:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords backdoor attacks · language models · mechanistic interpretability · language switching · circuit analysis · orthogonal subspace · trigger detection

The pith

A language model backdoor switches English output to French by routing a trigger through an orthogonal latent subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces how a short Latin trigger sequence can make an 8B autoregressive language model produce French instead of English. It decomposes the process into three phases: early attention heads gather the trigger into the last sequence position, the signal then travels through the mid-layers in a direction orthogonal to the model's natural language-identity direction, and the final-layer MLP turns the signal into French logits. The whole path runs through a single sequence position that acts as a bottleneck. Because the signal is orthogonal to the model's usual language-identity signals, the finding suggests that backdoors can operate through hidden routes that detection methods looking for language-like patterns would overlook.

Core claim

The authors identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. They decompose the circuit into three phases: distributed attention heads at early layers compose the trigger tokens into the last sequence position; the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; and the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position.

What carries the argument

The three-phase circuit that composes the trigger via early attention heads, propagates the signal in an orthogonal subspace through mid-layers, and converts it via the final MLP.

If this is right

Corrupting the signal at the single bottleneck position at any layer eliminates the language switch but also impairs the model's general capabilities.
Defenses that search for language-like signals in intermediate representations would miss the trigger because it travels in an orthogonal subspace.
The mechanism depends entirely on one sequence position, so interventions there control both the backdoor and normal performance.
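
The single-position corruption described above can be illustrated with a minimal sketch. It assumes the residual stream at one layer is available as a list of per-position vectors and that "Gaussian corruption" means adding zero-mean noise at the chosen position; the function name `corrupt_position`, the noise scale, and the toy values are hypothetical and not taken from the paper.

```python
import random

def corrupt_position(resid, pos, sigma=1.0, seed=0):
    """Return a copy of `resid` with Gaussian noise added at one position.

    `resid` is a list of per-position activation vectors (lists of floats).
    Only the vector at index `pos` is perturbed; the input is not mutated.
    The additive-noise form is an assumption made for illustration.
    """
    rng = random.Random(seed)
    out = [list(vec) for vec in resid]
    out[pos] = [x + rng.gauss(0.0, sigma) for x in out[pos]]
    return out

# Toy residual stream: 4 sequence positions, 3 hidden dimensions (made-up values).
resid = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5], [1.0, -1.0, 2.0]]

# Corrupt only the last position (p-1), the claimed bottleneck.
corrupted = corrupt_position(resid, pos=-1, sigma=1.0, seed=0)
```

In a real experiment the corrupted vector would be written back into the model's forward pass at the chosen layer, and both the language switch and general task performance would be re-measured.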

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same orthogonal detour pattern could support other concealed behaviors in language models without disrupting primary language generation.
Interpretability methods may need to scan for non-language subspaces and position-specific bottlenecks to detect similar hidden circuits.
Applying the same decomposition approach to other triggers or model sizes could show whether orthogonal encoding is a common backdoor strategy.

Load-bearing premise

The three-phase decomposition and the orthogonal subspace accurately represent the causal mechanism for the backdoor rather than a mere correlation with the behavior.

What would settle it

Ablating the early-layer attention heads that compose the trigger or perturbing the mid-layer orthogonal direction and then checking whether the model still switches to French output when the trigger is present.
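
One simple way to perturb a single direction in activation space is to project it out. The sketch below is an illustration under stated assumptions, not the authors' procedure: `project_out` is a hypothetical helper, and the vectors are toy values standing in for a mid-layer activation and a candidate trigger direction.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_out(x, direction):
    """Remove the component of `x` that lies along `direction`.

    Returns x - (<x, d> / <d, d>) * d, so the result is orthogonal to `direction`.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    coef = dot(x, direction) / norm_sq
    return [xi - coef * di for xi, di in zip(x, direction)]

# Toy activation and toy direction (hypothetical values).
x = [3.0, 4.0, 0.0]
d = [1.0, 0.0, 0.0]
ablated = project_out(x, d)  # component along d is removed: [0.0, 4.0, 0.0]
```

If the trigger signal really travels along the ablated direction, rerunning the model with `ablated` in place of `x` should suppress the switch to French; if the switch survives, the direction is not the causal carrier.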

Figures

Figures reproduced from arXiv: 2605.18646 by Benoît Sagot, Djamé Seddah, Francis Kulumba, Théo Lasnier, Wissam Antoun.

**Figure 1.** Overview of the three-phase trigger circuit. Composition (first 10% to 20% of layers): distributed attention heads read trigger tokens into position −1. Latent propagation (middle layers): signal persists orthogonally to the natural language direction, depicted in yellow. Readout (last layer): the MLP converts the trigger signal to French logits. The entire circuit flows through a serial bottleneck at posit…

**Figure 2.** Circuit overview (triggered condition). (A) Cumulative residual stream patching: recovery follows a sigmoid with inflection at layers 4–5, confirming trigger composition in layers 3–7. (B) Per-MLP causal contribution: layer 31 dominates at +62%; mid-layer negative effects reflect a context mismatch. (C) Per-attention-layer causal contribution: layer 17 at +22%. Error bars: ±1 std across 100 prompts. The sigm…

**Figure 3.** Per-head causal effects at composition layers (L3–L6). (a) Triggered: distributed effects, maximum ∼3%, concentrated at L5H24 and neighbours. (b) Scrambled control: uniformly near zero across all 128 heads. Sequence specificity holds at the individual-head level. The effects are distributed: the maximum single-head effect is ∼2–3% recovery. No head exceeds 5%. The top 10 heads collectively account for ∼2…

**Figure 4.** Attention from p−1 to trigger positions at composition layers. (A) Triggered: concentration on later trigger tokens (trig+5 to trig+8) at L3–L4. (B) Scrambled: diffuse attention with no systematic pattern. [Plot labels: "Exp 4: Probe Trajectory (language ID at each layer)"; x-axis: Layer (0–30); y-axis: Probe confidence P(French) (0.0–1.0); legend: Triggered (mean), Natural FR (mean), Scrambled (mean)…]

**Figure 5.** Probe trajectory: language identity at each layer. Thin lines: individual prompts. Thick lines: means. The French-invisible window at L17–26 reveals potential orthogonal latent encoding: the trigger signal is causally present but invisible to language probes. [Plot labels: "Exp 10: Full-Layer Necessity Test (T…"; x-axis: Layer (L0–L30); y-axis: Kill % (100 recovery) (0–140).]

**Figure 6.** Full-layer necessity test (Exp 10). Mitigation percentage when ablating p−1 at each layer. Mitigation > 100% at every layer confirms the serial bottleneck. Values > 100% under Gaussian corruption reflect degenerate corrupt activations (§5); under neutral-word corruption, mitigation is in the 95% range. Error bars: ±1 std across 100 prompts. … above 100% indicate that the corrupt residual actively pushes th…

**Figure 8.** Token-level specificity. Logit difference for 100 prompts. Triggered (red): median +5.5. Scrambled (blue): median −0.5. Clean (grey), which here denotes a sequence without any trigger token: median −0.7. …gered prompts and 12% of scrambled prompts prefer French. The scrambled control is clean across every experiment in the paper: zero per-head effects, diffuse attention patterns, flat recovery curve and pr…

**Figure 9.** KV knockout experiment. Top panel: we zero out the key-value cache entries at the trigger-token positions for a given layer’s attention mechanism. Middle panel (cumulative forward): masking trigger positions from layer 0 onward keeps the logit-diff deeply negative regardless of how many additional layers we add to the mask. Bottom panel (reverse cumulative): masking only late layers has no effect. As we e…

**Figure 11.** Local projection of the residual stream at …

**Figure 10.** Scrambled prompt probe trajectories. P(French) from per-layer linear probes evaluated on scrambled inputs. The scrambled trajectory shows a brief spike at layers 0–1, where P(French) reaches ∼0.5 on average, with individual prompts occasionally reaching 0.9. P(French) drops below 0.1 at layer 4 and remains dead through the network. This decay confirms that the embedding-level French similarity is a toke…

**Figure 12.** Corruption robustness: paired comparison. Recovery or mitigation percentage at nine measurement points under Gaussian (blue) and neutral-word (orange) corruption. Left group (Resid L3–L31): cumulative residual patching recovery. Late layers agree within; early layers diverge because Gaussian corruption disrupts composition-head inputs. Centre (MLP L31): per-MLP recovery. Right group: ablation trigger supp…

**Figure 13.** Cumulative residual stream patching in ab…

**Figure 15.** Corrupt baseline comparison. Boxplots of logit-diff (FR−EN) under Gaussian corruption, neutral-word corruption, and the clean triggered baseline. The wider clean–corrupt gap under neutral-word means that the denominator in Equation 1 is larger, which slightly deflates recovery percentages relative to Gaussian. n=30 prompts, 5 seeds each.
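
The captions above refer to recovery and mitigation percentages and to a clean–corrupt denominator in Equation 1, which is not reproduced on this page. The sketch below assumes the normalisation commonly used in activation patching, recovery = 100 · (patched − corrupt) / (clean − corrupt), with mitigation taken as 100 − recovery; both definitions and all numbers are assumptions for illustration, not values from the paper.

```python
def recovery_pct(patched, clean, corrupt):
    """Percentage of the clean-corrupt logit-diff gap restored by a patch.

    Assumed form: 100 * (patched - corrupt) / (clean - corrupt).
    """
    gap = clean - corrupt
    if gap == 0:
        raise ValueError("clean and corrupt baselines coincide")
    return 100.0 * (patched - corrupt) / gap

def mitigation_pct(patched, clean, corrupt):
    """Assumed complement of recovery: 100 - recovery."""
    return 100.0 - recovery_pct(patched, clean, corrupt)

# Toy logit differences (FR - EN); hypothetical numbers.
clean, corrupt = 4.0, -2.0

half = recovery_pct(1.0, clean, corrupt)         # halfway between baselines: 50.0
overshoot = mitigation_pct(-3.0, clean, corrupt)  # run falls below the corrupt baseline
```

Under these assumed definitions, mitigation exceeds 100% whenever the intervened run lands below the corrupt baseline, and a wider clean–corrupt gap shrinks the recovery percentage for the same absolute change in logit difference.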

Original abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps a language-switching backdoor to a three-phase circuit with an orthogonal mid-layer subspace, but the causal status of that subspace and the completeness of the decomposition remain unclear from the reported interventions.

The main takeaway is that this work decomposes a concrete backdoor in an 8B model into early attention composition of the trigger, mid-layer propagation in a claimed orthogonal subspace, and final MLP conversion to French output, with a single-position bottleneck. That decomposition is the clearest new piece: it gives a specific example of how a short trigger can hijack generation without using the model's usual language-direction features. The orthogonality angle is useful because it directly suggests why some representation-level defenses might miss the signal. The serial-bottleneck observation is also worth noting, as it ties the backdoor to a narrow computational path that could be targeted for mitigation.

The paper does a reasonable job laying out the phases in sequence and connecting them to the observed behavior on English-to-French redirection. The authors appear to have run patching and attribution experiments to locate the components, which is standard for this style of work.

That said, the central claims rest on whether the identified subspace is truly orthogonal and whether ablating the reported heads and MLP fully removes the backdoor or leaves parallel routes. The abstract and available description do not include the quantitative ablation tables or the exact metric used to confirm orthogonality, so it is hard to judge how much residual trigger effect remains after intervention. If the subspace still carries some language-identity leakage or if the bottleneck is not as serial as described, the defense implication weakens.

The work is aimed at researchers who already do circuit analysis on LLMs and at people thinking about backdoor detection. It is the kind of targeted mechanistic study that can inform follow-up experiments even if the current evidence is preliminary. I would send it to peer review; the topic is timely and the framing is clear enough that referees can ask for the missing controls and numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies a circuit for a language-switching backdoor in an 8B-parameter autoregressive language model triggered by a three-word Latin sequence (nine tokens) that redirects English outputs to French. It decomposes the circuit into three phases: early-layer distributed attention heads composing the trigger at the final sequence position, mid-layer propagation through a subspace orthogonal to the model's natural language-identity direction, and final-layer MLP conversion of the latent signal into French logits, with the entire pathway routed through a serial bottleneck at a single position.

Significance. If the causal claims are substantiated with quantitative evidence, the result would advance mechanistic understanding of backdoors in large language models by showing how triggers can exploit latent orthogonal directions that evade language-signal-based defenses. The serial-bottleneck observation would also inform targeted safety interventions, though at potential cost to general capabilities.

major comments (2)

[§4.2] Circuit Decomposition: The three-phase account and the claim of an orthogonal latent subspace are load-bearing for the 'latent detour' interpretation and the assertion that standard defenses would miss the trigger. However, the manuscript reports only correlational attribution and patching results without quantifying the fraction of backdoor behavior explained by the identified components versus residual parallel pathways.
[§4.3] Orthogonality Verification: The mid-layer signal is described as propagating in a subspace orthogonal to the natural language-identity direction, yet no explicit metric (e.g., cosine similarity after projection onto the language-identity vector or residual explained variance) is provided to confirm the degree of orthogonality or to rule out leakage that could be exploited by existing detection methods.
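
The kind of orthogonality metric requested in the second comment can be sketched as follows. This is a minimal illustration assuming the trigger signal and the language-identity direction are each available as a single vector; the helper names and toy vectors are hypothetical, and the paper's own measurement may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)

def residual_fraction(signal, direction):
    """Fraction of the signal's squared norm left after projecting out `direction`.

    Equals 1 - cos^2: 1.0 means fully orthogonal, 0.0 means fully aligned.
    """
    c = cosine(signal, direction)
    return 1.0 - c * c

# Toy vectors (hypothetical): a language-identity direction and two candidate signals.
lang_dir = [1.0, 0.0, 0.0]
orthogonal_signal = [0.0, 2.0, 0.0]
leaky_signal = [1.0, 1.0, 0.0]

cos_orth = cosine(orthogonal_signal, lang_dir)
frac_leaky = residual_fraction(leaky_signal, lang_dir)
```

A cosine near zero together with a residual fraction near one would support the orthogonality claim; a residual fraction well below one, as for `leaky_signal` here, would indicate leakage that a language-direction detector could in principle pick up.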

minor comments (2)

[§2] The abstract states the trigger consists of 'nine tokens'; include a brief tokenization breakdown or example in §2 to clarify whether this count includes special tokens.
[Figure 1] Add layer indices and position markers directly on the circuit overview diagram to make the serial bottleneck and phase boundaries immediately visible without cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our circuit decomposition results. We respond to each major comment in turn and outline the revisions we will make to strengthen the quantitative support for our claims.

Referee: [§4.2] Circuit Decomposition: The three-phase account and the claim of an orthogonal latent subspace are load-bearing for the 'latent detour' interpretation and the assertion that standard defenses would miss the trigger. However, the manuscript reports only correlational attribution and patching results without quantifying the fraction of backdoor behavior explained by the identified components versus residual parallel pathways.

Authors: We agree that a more precise quantification of the backdoor behavior explained by the identified circuit would bolster the causal interpretation. Our patching results show that ablating the key attention heads at early layers and the final MLP reduces the French output rate from over 90% to under 5% on triggered inputs, while control ablations have minimal impact. This indicates the circuit captures the dominant pathway. To directly address the concern about residual parallel pathways, we will add quantitative attribution analysis in the revised §4.2, including the fraction of the logit difference attributable to each phase using path patching or integrated gradients.
Revision: yes

Referee: [§4.3] Orthogonality Verification: The mid-layer signal is described as propagating in a subspace orthogonal to the natural language-identity direction, yet no explicit metric (e.g., cosine similarity after projection onto the language-identity vector or residual explained variance) is provided to confirm the degree of orthogonality or to rule out leakage that could be exploited by existing detection methods.

Authors: We appreciate this suggestion for enhancing the rigor of our orthogonality claim. In the original manuscript, orthogonality is supported by the observation that the signal persists after projection orthogonal to the language direction and that language-based detectors do not flag the trigger. However, we concur that explicit metrics would be beneficial. In the revision, we will report the cosine similarity of the mid-layer activation difference vector with the language-identity direction (expected to be near zero) and the proportion of variance in the residual subspace after projection, to quantify the degree of orthogonality and any potential leakage.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical circuit identification

The paper presents an empirical identification of a backdoor circuit via mechanistic interpretability methods, decomposing it into three phases based on observed model behavior under interventions such as position corruption and patching. The orthogonal subspace and latent signal claims arise from direct experimental measurements rather than any mathematical derivation that reduces to fitted parameters or self-referential definitions by construction. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided claims; the central account remains grounded in falsifiable interventions on the 8B model that can be reproduced independently of the interpretive narrative.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard mechanistic interpretability assumptions about attention composition and layer-wise signal propagation; it introduces the orthogonal subspace as a key explanatory construct without independent falsifiable evidence outside the circuit analysis itself.

axioms (1)

Domain assumption: Distributed attention heads at early layers can compose trigger tokens into a signal at the final sequence position.
Invoked in the description of phase (1) of the circuit.

invented entities (1)

orthogonal latent signal (no independent evidence)
purpose: To propagate the trigger information through mid-layers without overlapping the model's natural language-identity direction.
Described as the mechanism in phase (2) that allows the backdoor to avoid detection by language-signal searches.

pith-pipeline@v0.9.0 · 5708 in / 1361 out tokens · 59259 ms · 2026-05-20T10:20:34.280242+00:00 · methodology
