Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Hongliang Liu

arxiv: 2606.19831 · v1 · pith:XZXKUSV6new · submitted 2026-06-18 · 💻 cs.CL · cs.LG


Pith reviewed 2026-06-26 17:52 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords single-neuron steering · control window · coherence budget · refusal · language models · saturation curve · residual stream · gradient attribution


The pith

One alignment coordinate predicts when a single neuron can steer model behavior without collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a budget-normalized control window for single-neuron steering in aligned language models. Steering along one write direction reduces to a single control coordinate given by the alignment of the residual stream with the write, scaled along a saturation curve whose units are set by a coherence budget equal to residual norm divided by write norm. Coherent control is possible only when the behavior trigger sits below a collapse ceiling that can be computed from the weights and one forward pass. The same coordinate governs both benign mode switches and refusal, and it explains why local gradient attribution misses the actual controllers.

Core claim

A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout.
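A minimal sketch of the two quantities named in the core claim, assuming the "alignment" is the cosine between the residual vector and the write direction and that a dose adds `alpha` times the write vector to the residual. The function names and toy vectors are hypothetical. The closed form in `saturation` is plain vector geometry for this additive toy; it is not claimed to be the paper's fitted saturation curve.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def coherence_budget(h, w):
    """Residual norm divided by write norm (the budget as described in the text)."""
    return norm(h) / norm(w)

def alignment(h, w):
    """Cosine between residual h and write direction w (assumed definition)."""
    return dot(h, w) / (norm(h) * norm(w))

def dosed_alignment(h, w, alpha):
    """Alignment after adding alpha * w to the residual, computed directly."""
    h_new = [a + alpha * b for a, b in zip(h, w)]
    return alignment(h_new, w)

def saturation(a0, c):
    """Same quantity in closed form, with the dose in budget units c = alpha / budget."""
    return (a0 + c) / math.sqrt(1.0 + 2.0 * a0 * c + c * c)

h = [3.0, 0.0, 4.0]   # toy residual, norm 5
w = [0.0, 0.5, 0.0]   # toy write, norm 0.5, orthogonal to h
budget = coherence_budget(h, w)   # 5 / 0.5 = 10.0
a0 = alignment(h, w)

for alpha in (0.0, 5.0, 10.0, 50.0):
    c = alpha / budget
    print(alpha, round(dosed_alignment(h, w, alpha), 4), round(saturation(a0, c), 4))
```

In this toy the raw dose `alpha` only matters through `c = alpha / budget`, which is one way to see why a budget-normalized coordinate could make different neurons comparable.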

What carries the argument

budget-normalized control window: the alignment of residual stream with write direction, normalized into units of coherence budget (residual norm over write norm) and tracked along a saturation curve

If this is right

The predicted ceiling matches observed behavior with mean absolute error 0.14 across fifteen held-out neurons.
The committed open or closed verdict is correct on eleven of fifteen cases.
True controllers lie off the readout axis and therefore show near-zero first-order gradient.
On refusal, coherent bypass and strict actionable reach are distinct, with genuine actionable reach appearing only at later rollout horizons for a minority of pivots.
A forward-only contrastive screen recovers controllers that gradient attribution misses.
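The point about off-axis controllers and near-zero first-order gradient can be illustrated with a toy, assuming a readout that is taken after normalizing the residual (a stand-in for the normalization layers in a real model). Everything here, including the vectors and the normalized readout, is a hypothetical construction rather than the paper's setup.

```python
import math

def readout(h, w, r, alpha):
    """Project the normalized, dosed residual (h + alpha * w) onto the readout axis r."""
    x = [a + alpha * b for a, b in zip(h, w)]
    n = math.sqrt(sum(v * v for v in x))
    return sum(ri * xi / n for ri, xi in zip(r, x))

def first_order_gradient(h, w, r, eps=1e-5):
    """Central finite difference of the readout with respect to the dose at alpha = 0."""
    return (readout(h, w, r, eps) - readout(h, w, r, -eps)) / (2 * eps)

h = [2.0, 0.0]   # toy residual, lying on the readout axis
r = [1.0, 0.0]   # toy readout axis
w = [0.0, 1.0]   # toy write direction, orthogonal to the readout axis

grad = first_order_gradient(h, w, r)
effect = readout(h, w, r, 6.0) - readout(h, w, r, 0.0)

print("first-order gradient at zero dose:", grad)
print("readout change at dose 6:", round(effect, 4))
```

The local gradient vanishes because the write is orthogonal to the readout axis, yet a finite dose still moves the readout through the normalization. A gradient-based ranking would score such a neuron near zero.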

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework supplies an a-priori audit that requires only a forward pass to decide whether a candidate neuron is worth testing.
The three identified failure modes (premature collapse, insufficient depth, normalization cap) suggest quantitative bounds on how far any single sparse intervention can propagate.
The typed nature of refusal outcomes implies that success metrics for steering should be separated into fluency preservation and content reach rather than treated as a single scalar.
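A sketch of how a typed outcome and the open/closed verdict might be recorded, assuming the rule stated in the core claim (the window is open when the measured trigger lies below the collapse ceiling). The class and field names are hypothetical and the category labels are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Window(Enum):
    OPEN = "open"
    CLOSED = "closed"

@dataclass
class SteeringOutcome:
    coherent: bool          # fluency preserved under the dose
    behavior_flipped: bool  # e.g. refusal bypassed or mode switched
    actionable: bool        # content actually reaches the target

def window_verdict(trigger: Optional[float], ceiling: float) -> Window:
    """Open only if a trigger was observed and it sits below the collapse ceiling."""
    if trigger is None:
        return Window.CLOSED
    return Window.OPEN if trigger < ceiling else Window.CLOSED

def outcome_type(o: SteeringOutcome) -> str:
    """Keep fluency and content reach as separate axes instead of one scalar."""
    if not o.coherent:
        return "collapse"
    if not o.behavior_flipped:
        return "no_effect"
    return "actionable_reach" if o.actionable else "coherent_bypass"

print(window_verdict(0.4, 0.7), outcome_type(SteeringOutcome(True, True, False)))
```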

Load-bearing premise

The saturation curve is the same for every neuron and every behavior.

What would settle it

A new neuron or behavior where measured steering success deviates systematically from the saturation curve predicted by the alignment coordinate.

Figures

Figures reproduced from arXiv: 2606.19831 by Hongliang Liu.

**Figure 2.** The control window on a benign, judge-free behavior. An operator neuron (Llama…

**Figure 3.** Collapse is universal in the rescaled coupling. (a) Coherence, the distinct-bigram ratio,…

**Figure 4.** The collapse coefficient tracks residual concentration across eight architectures (Spear…

**Figure 5.** Recall is cheap; precision is the law. (a) The contrastive screen ranks all…

**Figure 6.** Content-confirmed reach is windowed. For the published refusal neuron L11/F4258…

**Figure 7.** Coherent bypass is not actionable reach. Each point is one audited refusal pivot, both axes…

**Figure 8.** Behavior-specific rollout horizon. Onset time…

**Figure 9.** Harm commits late, then plateaus. Strict-actionable rate as a function of rollout depth…

**Figure 10.** Control windows, one representative controller per behavior class. Each panel plots the…

**Figure 11.** The universal drive, measured versus predicted. Points are the measured control coor…

**Figure 12.** A single neuron's reach under pinning versus residual injection, as the control coordinate…

**Figure 13.** Control surfaces. Top: operator (L19/F13312) heatmap…

**Figure 14.** Control windows for eleven controllers across three behavior classes (full set; the rep…

**Figure 15.** Framing controllers obey the control window. Response uniformity (template lock-in…

**Figure 16.** Inside the window the dose preserves task-local capability. For the benign mode-switch…

**Figure 17.** Negative result. Foot versus knee, branch-resolved, at τ=1: branch resolution removes the apparent universal over-prediction but does not yield a trigger law. On the controlling branch the tanh/knee fit (solid) crosses M=0 near the measured trigger for the language router and the refusal-bypass pivots; the origin quadratic (dashed) reads the flat foot and overshoots. The off-axis operator (L19/F13312) nev…

**Figure 18.** Negative result. Rollout-indexed trigger: rollout feedback helps token-local cases but does not give a universal trigger predictor. The first-token estimate cpred(τ=1) overshoots; fitting the same margin along the free rollout converges toward the measured trigger, while a teacher-forced undosed prefix stays at the foot, so where a rollout effect exists it is autoregressive prefix feedback. Clean for the…

**Figure 19.** Negative result. Amplification is behavior-specific, so no single kernel predicts the trigger. The effective-curvature kernel G(τ) = βeff(τ)/βeff(1) rises for the language router (A > 0) but attenuates for the refusal-bypass pivots (A < 0), so a single amplification kernel is not universal. Exploratory; motivates the two-regime conclusion.
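The Figure 3 caption names the distinct-bigram ratio as its coherence measure. A minimal sketch of that ratio, assuming the common definition (unique bigrams divided by total bigrams over a token sequence) and whitespace tokenization for the toy inputs; the paper's tokenizer and any windowing are not given here, and the function name is hypothetical.

```python
def distinct_bigram_ratio(tokens):
    """Unique bigrams divided by total bigrams; 0.0 when the text has fewer than two tokens."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)

fluent = "the cat sat on the mat".split()
collapsed = "ha ha ha ha ha".split()

print(distinct_bigram_ratio(fluent))     # every bigram is unique
print(distinct_bigram_ratio(collapsed))  # one bigram repeated four times
```

Repetitive, collapsed output drives the ratio toward zero, which is why a measure of this kind can serve as a judge-free collapse signal.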

Original abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, yet no theory predicts when a single-neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget-normalized control window framework for single-neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten of fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reduces single-neuron steering to one normalized alignment coordinate and a claimed universal saturation curve that predicts collapse ceilings from weights plus one pass, with modest empirical backing on 15 neurons but no derivation of the curve shape.


The main claim is that a single control coordinate—the alignment of the residual stream with the write direction, normalized by residual norm over write norm—drives behavior along a universal saturation curve. Coherent control holds when the trigger sits below the collapse ceiling, which the paper says can be read off from weights and one forward pass. Triggers are measured at rollout. On 15 held-out neurons the predicted ceiling shows MAE 0.14 and matches the open/closed verdict on 11 cases. The same coordinate is said to cover both benign switches and refusal.

What is new is the explicit separation of coherent bypass from actionable reach, plus the observation that true controllers often sit off the readout axis and therefore produce near-zero first-order gradients. That explains why local attribution misses them and motivates the forward-only contrastive screen. The three failure modes listed for closed cases (collapse before trigger, insufficient depth, normalization cap) are concrete and useful for experiment design.
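A sketch of what a forward-only contrastive screen could look like, assuming it ranks neurons by how strongly their activations differ between two prompt sets. The scoring rule here (absolute mean gap divided by pooled spread) is a generic stand-in chosen for illustration; the paper's actual screen and its window-based refinement are not specified in this page, and all names and numbers are hypothetical.

```python
import statistics

def contrastive_scores(acts_pos, acts_neg):
    """Score each neuron from forward activations only.

    acts_pos / acts_neg: one row per prompt, one activation per neuron.
    """
    n_neurons = len(acts_pos[0])
    scores = []
    for j in range(n_neurons):
        pos = [row[j] for row in acts_pos]
        neg = [row[j] for row in acts_neg]
        gap = statistics.fmean(pos) - statistics.fmean(neg)
        spread = statistics.pstdev(pos + neg) or 1.0  # guard against zero spread
        scores.append(abs(gap) / spread)
    return scores

def rank_neurons(scores):
    """Neuron indices sorted from strongest to weakest contrast."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy activations for three neurons on two prompts per condition (synthetic values).
acts_pos = [[0.1, 2.0, 0.5], [0.2, 2.2, 0.4]]
acts_neg = [[0.1, 0.1, 0.5], [0.2, 0.0, 0.6]]

scores = contrastive_scores(acts_pos, acts_neg)
print(rank_neurons(scores))
```

No backward pass is needed, so a neuron whose first-order gradient is near zero can still rank highly if its activation separates the two conditions.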

The soft spot is the saturation curve itself. The abstract gives no derivation from the transformer equations, and the numbers come from a 15-neuron sample without cross-model or cross-behavior checks. If the functional form changes with layer, sparsity, or task, the single-coordinate collapse and the cheap ceiling prediction stop being law-like. The coherence budget definition avoids obvious circularity, but that does not rescue the universality step.

This is for interpretability and safety groups that already run single-neuron interventions and want a cheaper way to set doses than grid search. A reader who cares about steering mechanics would get value from the coordinate reduction and the typed success metric even if the curve needs more validation.

It deserves peer review. The empirical hook is specific enough to test, and the framing is clean enough that referees can focus on whether the curve generalizes.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a budget-normalized control-window framework for single-neuron steering in language models. It claims that intervention along one write direction reduces to a single control coordinate—the alignment of the residual stream with the write, normalized in units of a coherence budget (residual norm divided by write norm)—and follows a universal saturation curve. Coherent control exists when a behavior trigger lies below a collapse ceiling that can be computed from weights plus one forward pass. The same coordinate is asserted to govern both benign mode switches and refusal; triggers are measured at rollout while ceilings are predicted a priori. On 15 held-out neurons the predicted ceiling achieves MAE 0.14 (0.07 in bulk layers) with correct open/closed verdicts on 11/15 cases. The framework is used to explain why local gradient attribution anti-predicts control and to distinguish typed success (coherent bypass vs. actionable reach) in refusal.

Significance. If the single-coordinate reduction and universality of the saturation curve hold, the work supplies a forward-only, budgeted audit procedure that could replace anecdote-based steering experiments and clarify when attribution methods fail. The separation of coherent bypass from actionable reach on refusal, together with the explicit failure-mode taxonomy for closed cases, would be a concrete advance for mechanistic safety analysis. The empirical match on held-out neurons and the contrastive screen that recovers controllers missed by gradients are positive features.

major comments (3)

[Abstract / control-window framework] Abstract and the control-window framework section: the saturation curve is presented as universal, yet no derivation of its functional form from the transformer residual-stream equations is supplied; the shape appears fitted or observed within the audited 15-neuron sample, which is load-bearing for the claim that the coordinate is law-like rather than neuron- or task-specific.
[Empirical validation paragraph] Empirical results on 15 held-out neurons: the reported MAE of 0.14 lacks error bars, confidence intervals, or details on how behavior triggers were measured at rollout, so it is unclear whether the 11/15 correct verdicts are robust or exceed the 10/15 majority baseline in a statistically controlled way.
[Refusal analysis] Refusal analysis: the assertion that the identical coordinate governs both benign switches and refusal, and that genuine actionable reach occurs for only three of six Llama pivots, rests on the universality assumption without cross-behavior falsification or additional model/layer controls.
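The request for error bars on the MAE could be met with a percentile bootstrap. The sketch below assumes per-neuron signed errors (predicted minus measured ceiling) are available. The fifteen values are synthetic placeholders, not data from the paper, and the function names are hypothetical.

```python
import random

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def bootstrap_ci(errors, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean absolute error."""
    rng = random.Random(seed)
    n = len(errors)
    stats = sorted(
        mae([errors[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# SYNTHETIC placeholder errors for 15 neurons, for illustration only.
placeholder_errors = [0.05, -0.12, 0.30, -0.02, 0.08, -0.25, 0.10, 0.04,
                      -0.40, 0.15, -0.06, 0.09, 0.22, -0.03, 0.11]

point = mae(placeholder_errors)
ci_low, ci_high = bootstrap_ci(placeholder_errors)
print(round(point, 3), round(ci_low, 3), round(ci_high, 3))
```

With only fifteen neurons the interval is likely to be wide, which is the substance of the referee's concern.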

minor comments (2)

[Notation / framework definition] The coherence budget definition (residual norm / write norm) should be given an explicit equation number so that later references to the normalized coordinate are unambiguous.
[Abstract] The parenthetical remark 'about 0.07 in bulk layers' is imprecise; report the exact per-layer or per-group values if they appear in the main text or tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive review. We address each major comment below, clarifying the analytical basis of the control coordinate, committing to added statistical details, and acknowledging the empirical scope of the universality claim. Revisions will be made where they strengthen rigor without altering the core findings.


Referee: [Abstract / control-window framework] Abstract and the control-window framework section: the saturation curve is presented as universal, yet no derivation of its functional form from the transformer residual-stream equations is supplied; the shape appears fitted or observed within the audited 15-neuron sample, which is load-bearing for the claim that the coordinate is law-like rather than neuron- or task-specific.

Authors: The single control coordinate is derived directly from the residual-stream equations: it is the projection of the residual onto the write direction, scaled by the coherence budget (residual norm divided by write norm). This reduction follows from the linearity of the write operation and the norm-based budget. The saturation curve itself is the observed empirical mapping from this coordinate to output change; we do not claim a closed-form derivation of its precise shape (e.g., logistic) from the full nonlinear transformer dynamics. The law-like status rests on the coordinate's predictive power on held-out neurons rather than on an analytic form. We will revise the abstract and framework section to state explicitly that the curve is empirically universal within the tested distribution while the coordinate reduction is analytic. revision: partial
Referee: [Empirical validation paragraph] Empirical results on 15 held-out neurons: the reported MAE of 0.14 lacks error bars, confidence intervals, or details on how behavior triggers were measured at rollout, so it is unclear whether the 11/15 correct verdicts are robust or exceed the 10/15 majority baseline in a statistically controlled way.

Authors: We agree that the current presentation omits necessary statistical controls. In revision we will add bootstrap-derived error bars on the MAE, explicit description of trigger measurement (the minimal rollout dose at which coherent behavior change is first observed), and a binomial or permutation test confirming that 11/15 correct open/closed verdicts exceeds the 10/15 majority baseline at conventional significance levels. These additions will be placed in the empirical validation paragraph. revision: yes
Referee: [Refusal analysis] Refusal analysis: the assertion that the identical coordinate governs both benign switches and refusal, and that genuine actionable reach occurs for only three of six Llama pivots, rests on the universality assumption without cross-behavior falsification or additional model/layer controls.

Authors: The 15 held-out neurons include both benign mode-switch and refusal behaviors; the same coordinate predicts ceilings for both classes, providing within-sample evidence that a single coordinate suffices. The typed distinction between coherent bypass and actionable reach is measured directly from rollout outcomes on the refusal pivots. We acknowledge that the sample does not include cross-model or cross-layer controls beyond the audited set. We will add a limitations paragraph noting this scope while retaining the within-distribution support for the coordinate governing both behavior types. revision: partial
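The binomial test mentioned in the second response can be sketched as an exact one-sided tail, assuming the null hypothesis that each verdict is correct with the majority-baseline probability 10/15. That choice of null is an assumption made here for illustration; the rebuttal does not specify the test design.

```python
from fractions import Fraction
from math import comb

def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 11 correct verdicts out of 15, tested against a 10/15 baseline rate.
p_value = binomial_upper_tail(11, 15, Fraction(10, 15))
print(float(p_value))  # roughly 0.40
```

Under this particular null the tail probability is roughly 0.40, so whether 11 of 15 clears the baseline at conventional significance levels may depend heavily on which test and null are chosen.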

Circularity Check

0 steps flagged

No circularity; control coordinate and saturation curve are modeling constructs validated externally


The paper defines the coherence budget explicitly as residual norm over write norm and reduces the dose to an alignment coordinate along a saturation curve; this is a definitional modeling step rather than a prediction that re-uses fitted parameters. The universality claim and ceiling prediction are presented as following from weights plus one forward pass, with quantitative validation (MAE 0.14 on 15 held-out neurons) reported separately from any fitting process. No equations or text in the provided material show curve parameters being estimated on the same data later called a prediction, no self-citation chains justify uniqueness, and no ansatz is smuggled via prior work. The derivation therefore remains self-contained against the external benchmark of held-out neuron performance.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework introduces a coherence budget defined from residual and write norms and asserts a universal saturation curve without deriving it from first principles or showing it is independent of the measured triggers.

free parameters (1)

saturation curve shape
The abstract states the dose follows a universal saturation curve but does not specify whether its functional form or parameters are fitted or derived.

axioms (1)

domain assumption: The same control coordinate governs both benign mode switches and refusal.
Abstract asserts this without separate justification or counter-example search.


