Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

A.B. Siddique; Hammad Rizwan; Hassan Sajjad; Muhammad Umair Haider; Peizhong Ju

arXiv: 2502.06809 · v3 · submitted 2025-02-04 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-23 04:05 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords polysemanticity · activation ranges · neuron interpretation · model manipulation · large language models · interpretability · concept attribution

The pith

Activation ranges within neurons enable more precise concept manipulation in LLMs than whole-neuron interventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that polysemanticity makes it unreliable to tie specific concepts to individual neurons in large language models. Instead, the activation strengths for a given concept within any neuron tend to cluster into distinct ranges with little overlap from other concepts. This pattern holds across encoder and decoder models and multiple datasets. The authors propose intervening only on those ranges to change target concepts while leaving auxiliary concepts and overall performance more intact than when entire neurons are masked. A reader would care because this offers a practical way to steer model behavior more cleanly without broad side effects.

Core claim

Concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. NeuronLens localizes concept attribution to activation ranges within a neuron, enabling more precise interpretability and targeted manipulation than discrete neuron-level masking.
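
The range-based attribution in the core claim can be sketched in a few lines. The paper passage quoted in the Lean-theorems section of this page gives the range as [μ − τ × σ, μ + τ × σ] with τ = 2.5. Everything else here is hypothetical and chosen only for illustration: the function names `concept_range` and `attribute`, the concept labels, and the activation values. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
from statistics import mean, stdev

TAU = 2.5  # range half-width in standard deviations, as quoted from the paper


def concept_range(activations, tau=TAU):
    """Return (low, high) = (mu - tau * sigma, mu + tau * sigma) for one concept."""
    mu = mean(activations)
    sigma = stdev(activations)
    return mu - tau * sigma, mu + tau * sigma


def attribute(value, ranges):
    """List every concept whose range contains the activation value."""
    return [c for c, (low, high) in ranges.items() if low <= value <= high]


# Hypothetical activations of a single neuron, grouped by concept label.
samples = {
    "sports": [0.9, 1.0, 1.1, 1.0, 0.95, 1.05],
    "business": [3.0, 3.1, 2.9, 3.05, 2.95, 3.0],
}
ranges = {concept: concept_range(values) for concept, values in samples.items()}

print(attribute(1.02, ranges))  # falls inside the "sports" range only
print(attribute(2.0, ranges))   # falls inside neither range
```

The same neuron is attributed to different concepts depending on where its activation lands, which is the sense in which attribution moves from the neuron to a range inside it.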

What carries the argument

NeuronLens, a range-based framework that attributes and intervenes on concept-specific activation magnitudes inside individual neurons rather than on whole neurons.
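
To make the contrast between the two interventions concrete, the sketch below applies whole-neuron masking and a range-based mask to the same hidden state. It assumes the range intervention zeroes an activation only when it falls inside the target concept's range; the page does not spell out the exact intervention, so that choice, the function names, and the numbers are all illustrative assumptions.

```python
def neuron_mask(activations, neuron_ids):
    """Whole-neuron masking: zero the selected neurons regardless of their value."""
    return [0.0 if i in neuron_ids else a for i, a in enumerate(activations)]


def range_mask(activations, ranges):
    """Range masking: zero a neuron only when its value lies in the target range.

    `ranges` maps neuron index -> (low, high) for the target concept.
    Zeroing inside the range is an assumption made for this sketch.
    """
    out = []
    for i, a in enumerate(activations):
        # Neurons without an entry get an empty interval, so they pass through.
        low, high = ranges.get(i, (float("inf"), float("-inf")))
        out.append(0.0 if low <= a <= high else a)
    return out


# Hypothetical hidden state with three neurons; neuron 1 is polysemantic.
target_range = {1: (0.8, 1.2)}  # assumed range of the target concept
on_target = [0.3, 1.0, -0.5]    # neuron 1 fires inside the target range
off_target = [0.3, 3.0, -0.5]   # neuron 1 fires for some other concept

print(neuron_mask(off_target, {1}))          # [0.3, 0.0, -0.5]
print(range_mask(off_target, target_range))  # [0.3, 3.0, -0.5]
print(range_mask(on_target, target_range))   # [0.3, 0.0, -0.5]
```

Whole-neuron masking removes the off-target signal as well, while the range mask leaves it intact. That difference is the mechanism the collateral-damage comparison depends on.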

If this is right

Range-based interventions manipulate target concepts effectively across encoder and decoder LLMs.
They cause substantially less collateral degradation to auxiliary concepts than neuron-level masking.
Overall model performance remains higher after range interventions than after neuron masking.
The pattern of distinct activation ranges appears consistently across diverse datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Range localization could be applied to safety-related concepts to reduce unintended model behaviors with fewer side effects on capability.
The method might extend to multimodal models if similar range patterns appear in vision or audio activations.
Combining range interventions with circuit-level analysis could yield finer-grained maps of how concepts interact inside models.

Load-bearing premise

Concept-conditioned activation magnitudes consistently form distinct distributions with minimal overlap across concepts.

What would settle it

A test showing that range-based interventions produce as much or more collateral degradation to auxiliary concepts as whole-neuron masking, or that activation distributions for different concepts overlap heavily.
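
One way to probe the second half of that test is to measure how often another concept's activations land inside the target range. The sketch below does this on synthetic Gaussian samples, not on data from the paper. The means, spreads, sample sizes, and the name `in_range_fraction` are assumptions, and the ±2.5σ range follows the value quoted elsewhere on this page.

```python
import random
from statistics import mean, stdev


def in_range_fraction(values, low, high):
    """Fraction of values that fall inside the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
# Synthetic stand-ins for one neuron's activations under three concepts.
target = [rng.gauss(1.0, 0.2) for _ in range(2000)]
separated_aux = [rng.gauss(4.0, 0.2) for _ in range(2000)]    # well separated
overlapping_aux = [rng.gauss(1.1, 0.2) for _ in range(2000)]  # heavy overlap

mu, sigma = mean(target), stdev(target)
low, high = mu - 2.5 * sigma, mu + 2.5 * sigma

print("separated leakage:", in_range_fraction(separated_aux, low, high))
print("overlapping leakage:", in_range_fraction(overlapping_aux, low, high))
```

When the leakage is close to 1, a range intervention touches nearly everything a whole-neuron mask would, so the claimed advantage should disappear. When it is close to 0, the two interventions can differ sharply.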

Figures

Figures reproduced from arXiv: 2502.06809 by A.B. Siddique, Hammad Rizwan, Hassan Sajjad, Muhammad Umair Haider, Peizhong Ju.

**Figure 1.** NeuronLens leverages distinct, Gaussian-like activation patterns to enable fine-grained concept attribution.

**Figure 2.** Overlap of top 30% salient neurons across classes.

**Figure 4.** Comparison of neurons 480 and 675 showing class-specific activation patterns and fitted Gaussian curves. Both neurons were salient across all classes in top 5% on AG-News.

**Figure 5.** Box plot of neural activation of 11 polysemantic neurons (i.e., neurons in the salient group …

**Figure 6.** Accuracy comparisons between Neuronal Range manipulation (green) and complete neuron …

**Figure 7.** Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 1

**Figure 9.** Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 3

**Figure 11.** Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 5

**Figure 13.** Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 7

**Figure 15.** Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 9

**Figure 17.** Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 11

Original abstract

Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron-concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder and decoder based LLMs across diverse datasets, and observe that even highly salient neurons for specific semantic concepts consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework, that localizes concept attribution to activation ranges within a neuron. Extensive empirical evaluations show that range-based interventions enable effective manipulation of target concepts while causing substantially less collateral degradation to auxiliary concepts and overall model performance compared to neuron-level masking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NeuronLens tries range-based interventions inside neurons to cut collateral damage, but the overlap evidence stays unquantified.

The main takeaway is that this paper moves from whole-neuron masking to targeting activation ranges within neurons, after observing that concept-specific activations often form separable, roughly Gaussian distributions. NeuronLens is the new framework they build on that pattern for both interpretation and editing. The empirical side shows range edits hitting the target concept while hurting auxiliary concepts and overall performance less than neuron-level zeroing, across encoder and decoder models on several datasets. That comparison is the concrete piece worth looking at if the numbers hold. The work is straightforward in its setup and reports consistent patterns in the activation data.

The soft spot is exactly the one the stress-test note flags: the claim of minimal overlap is stated but not measured. No overlap metrics, no Wasserstein distances, no fraction of shared mass, and no checks on how often other concepts land inside the chosen range appear in the abstract. Without those or boundary-sensitivity ablations, the reported drop in collateral effects could trace to loose range choices or particular datasets rather than a general property. The abstract also skips details on exclusion rules, error bars, and statistical tests, which keeps the support thin until the full methods are checked.

This is for interpretability researchers who already work on neuron-level control and want a finer-grained alternative. A reader focused on practical editing tools could extract the intervention results, but anyone needing rigorous distribution analysis will want the missing quantifications.

The paper deserves a serious referee because the core idea addresses a real limitation in current attribution methods and the experiments are set up to test it directly. I would send it to review and ask specifically for overlap numbers, interference tests, and range-robustness checks.

Referee Report

2 major / 1 minor

Summary. The paper claims that pervasive polysemanticity in LLMs undermines discrete neuron-concept attribution. Systematic analysis across encoder and decoder models reveals that concept-conditioned activation magnitudes form distinct, often Gaussian-like distributions with minimal overlap. Building on this, the authors introduce NeuronLens, a range-based interpretation and manipulation framework, and report that range-based interventions achieve effective target-concept manipulation with substantially less collateral degradation to auxiliary concepts and overall performance than neuron-level masking.

Significance. If the minimal-overlap observation and the superiority of range interventions hold under rigorous quantification, the work could meaningfully shift mechanistic interpretability away from discrete neurons toward range-based attribution, offering a more precise tool for model control. The cross-architecture empirical scope is a strength, but the absence of overlap metrics and statistical detail currently limits the strength of the central claim.

major comments (2)

[Abstract] Abstract: the observation that concept-conditioned activations 'form distinct, often Gaussian-like distributions with minimal overlap' is load-bearing for the NeuronLens hypothesis and the claimed reduction in collateral damage, yet no quantitative measure of overlap (overlap coefficient, Wasserstein distance, intersection-over-union of fitted densities, or fraction of mass above threshold) is supplied.
[Empirical Evaluations] Empirical Evaluations (as summarized in Abstract): the claim of 'substantially less collateral degradation' to auxiliary concepts and model performance is central to the contribution, but the text provides no error bars, statistical tests, data-exclusion criteria, or ablation on range-boundary sensitivity, preventing verification that the reported advantage is not an artifact of loose range definitions or dataset-specific separation.
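
Two of the overlap measures named in the first comment can be computed with the Python standard library alone. The sketch below fits a Gaussian to each concept's activations and reports the overlap coefficient via `statistics.NormalDist.overlap`, plus an empirical one-dimensional Wasserstein-1 distance for equal-sized samples. The sample values and helper names are hypothetical, and nothing here reproduces numbers from the paper.

```python
from statistics import NormalDist


def overlap_coefficient(a, b):
    """Shared area under two Gaussians fitted to samples a and b (0.0 to 1.0)."""
    return NormalDist.from_samples(a).overlap(NormalDist.from_samples(b))


def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance; this sketch assumes equal sizes."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-sized samples the optimal coupling pairs sorted values in order.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical activations of one neuron under two concepts.
sports = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
business = [2.8, 3.0, 3.2, 3.1, 2.9, 3.0]

print("overlap coefficient:", overlap_coefficient(sports, business))
print("Wasserstein-1:", wasserstein_1d(sports, business))
```

An overlap coefficient near 0 together with a Wasserstein distance that is large relative to the spreads would support the minimal-overlap claim for that neuron and concept pair. Reporting both across all pairs is the kind of table the comment asks for.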

minor comments (1)

[Abstract] The abstract refers to 'systematic analysis across diverse datasets' without naming the datasets, models, or number of concepts examined, which hinders immediate reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight opportunities to strengthen the quantitative support for our central claims. We address each point below and will incorporate the requested analyses in the revised manuscript.

Referee: [Abstract] Abstract: the observation that concept-conditioned activations 'form distinct, often Gaussian-like distributions with minimal overlap' is load-bearing for the NeuronLens hypothesis and the claimed reduction in collateral damage, yet no quantitative measure of overlap (overlap coefficient, Wasserstein distance, intersection-over-union of fitted densities, or fraction of mass above threshold) is supplied.

Authors: We agree that explicit quantitative measures are needed to substantiate the minimal-overlap observation. In the revision we will report overlap coefficients, Wasserstein distances, and intersection-over-union values between fitted Gaussian densities for all concept pairs across the evaluated models and datasets. These metrics will be added to the main results and an expanded methods section.
revision: yes

Referee: [Empirical Evaluations] Empirical Evaluations (as summarized in Abstract): the claim of 'substantially less collateral degradation' to auxiliary concepts and model performance is central to the contribution, but the text provides no error bars, statistical tests, data-exclusion criteria, or ablation on range-boundary sensitivity, preventing verification that the reported advantage is not an artifact of loose range definitions or dataset-specific separation.

Authors: We acknowledge that the current presentation lacks the statistical detail required for rigorous verification. The revised manuscript will include error bars (standard error across runs), paired statistical tests comparing range-based versus neuron-masking interventions, explicit data-exclusion criteria, and an ablation study varying range-boundary definitions (e.g., ±1σ, ±2σ, percentile-based). These additions will appear in the empirical evaluations section and supplementary material.
revision: yes
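
The boundary ablation and error bars promised in this response could look roughly like the sketch below. It compares three range definitions (±1σ, ±2σ, and a 5th to 95th percentile window) on synthetic activations, then reports a standard error across several synthetic runs. The data generator, the percentile choice, and all helper names are assumptions for illustration and are not taken from the paper.

```python
import random
from statistics import mean, quantiles, stdev


def sigma_range(values, tau):
    """Bounds mu +/- tau * sigma fitted to the target concept's activations."""
    mu, sigma = mean(values), stdev(values)
    return mu - tau * sigma, mu + tau * sigma


def percentile_range(values, lower_pct=5, upper_pct=95):
    """Bounds taken from empirical percentiles instead of a Gaussian fit."""
    cuts = quantiles(values, n=100)  # 99 cut points; cuts[k - 1] is the k-th percentile
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def inside(values, bounds):
    """Fraction of values that land inside the closed interval `bounds`."""
    low, high = bounds
    return sum(low <= v <= high for v in values) / len(values)


def standard_error(values):
    """Standard error of the mean across repeated runs."""
    return stdev(values) / len(values) ** 0.5


def synthetic_run(seed, n=2000):
    """Synthetic target and auxiliary activations for one neuron (not real data)."""
    rng = random.Random(seed)
    target = [rng.gauss(1.0, 0.2) for _ in range(n)]
    auxiliary = [rng.gauss(1.6, 0.2) for _ in range(n)]
    return target, auxiliary


target, auxiliary = synthetic_run(seed=0)
definitions = {
    "1-sigma": sigma_range(target, 1.0),
    "2-sigma": sigma_range(target, 2.0),
    "pct-5-95": percentile_range(target),
}
for name, bounds in definitions.items():
    print(name, "coverage", inside(target, bounds), "leakage", inside(auxiliary, bounds))

# Error bar for the 2-sigma leakage across five synthetic runs.
leaks = []
for seed in range(5):
    t, a = synthetic_run(seed)
    leaks.append(inside(a, sigma_range(t, 2.0)))
print("2-sigma leakage:", mean(leaks), "+/-", standard_error(leaks))
```

Wider ranges cover more of the target concept but also admit more auxiliary activations. An ablation of this shape shows whether the reported advantage survives across reasonable boundary choices.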

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claims rest on empirical observations of activation distributions across models and datasets, followed by introduction of a range-based framework (NeuronLens) and comparative intervention experiments. No equations, fitted parameters, or hypotheses reduce by construction to inputs; no self-citations are invoked as load-bearing uniqueness theorems; no ansatzes or renamings of known results are presented as derivations. The approach is self-contained via direct measurement and ablation-style comparisons, consistent with the reader's assessment of score 2.0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; evaluation limited to high-level claims.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap... NeuronLens... localizes concept attribution to activation ranges within a neuron... range is assigned as [μ − τ × σ, μ + τ × σ] where τ = 2.5

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: activation ranges within a neuron’s activation spectrum offer a more precise unit of interpretability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Omer Antverg and Yonatan Belinkov

URL https://transformer-circuits.pub/2023/toy-double-descent/index.html . Omer Antverg and Yonatan Belinkov. On the pitfalls of analyzing individual neurons in language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

work page 2023

[2] [2]

URL https://doi.org/10

doi: 10.1613/JAIR.1.12228. URL https://doi.org/10. 1613/jair.1.12228. 10 Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga- Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352,

work page doi:10.1613/jair.1.12228

[3] [3]

doi: 10.18653/v1/2022.acl-long.581

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.581. URL https://aclanthology.org/2022. acl-long.581/. Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James R. Glass. What is one grain of sand in the desert? analyzing individual neurons in deep NLP models. In The Thirty-Third AAAI Conference on Artif...

work page doi:10.18653/v1/2022.acl-long.581 2022

[4] [4]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

URL https://arxiv.org/abs/ 1810.04805. Nelson Elhage et al. Superposition, memorization, and double descent. Transformer Circuits,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

N2g: A scalable approach for quantifying interpretable neuron representations in large language models

Alex Foote, Neel Nanda, Esben Kran, Ionnis Konstas, and Fazl Barez. N2g: A scalable approach for quantifying interpretable neuron representations in large language models. arXiv preprint arXiv:2304.12918,

work page arXiv

[6] [6]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

URL https://arxiv.org/abs/1803.03635. Aaron Grattafiori. The llama 3 herd of models,

[7]

URL https://arxiv.org/abs/2305.01610. Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models. arXiv preprint arXiv:2401.12181,

[8]

Comprehensive online network pruning via learnable scaling factors

Muhammad Umair Haider and Murtaza Taj. Comprehensive online network pruning via learnable scaling factors. In 2021 IEEE International Conference on Image Processing (ICIP), pages 3557–3561,

[9]

doi: 10.1109/ICIP42928.2021.9506252. Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, and Chun Chen. JailbreakLens: Interpreting jailbreak mechanism in the lens of representation and circuit. arXiv preprint arXiv:2411.11114,

[10]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

[11]

What causes polysemanticity? an alternative origin story of mixed selectivity from incidental causes

Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, and Sanmi Koyejo. What causes polysemanticity? an alternative origin story of mixed selectivity from incidental causes. In ICLR 2024 Workshop on Representational Alignment,

[12]

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo...

[13]

Association for Computational Linguistics. doi: 10.18653/v1/N19-1112. URL https://aclanthology.org/N19-1112/. Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language tec...

[14]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. CoRR, abs/2403.19647,

[15]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

doi: 10.48550/ARXIV.2403.19647. URL https://doi.org/10.48550/arXiv.2403.19647. Simon C. Marshall and Jan H. Kirchner. Understanding polysemanticity in neural networks through coding theory,

[16]

URL https://arxiv.org/abs/2401.17975. Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78,

[17]

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

[18]

On the importance of single directions for generalization

URL https://arxiv.org/abs/1803.06959. Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001,

[19]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683,

[20]

Resolving lexical bias in edit scoping with projector editor networks

Hammad Rizwan, Domenic Rosati, Ga Wu, and Hassan Sajjad. Resolving lexical bias in edit scoping with projector editor networks. arXiv preprint arXiv:2408.10411,

[21]

Controlling language and diffusion models by transporting activations

Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations. arXiv preprint arXiv:2410.23054,

[22]

URL https://arxiv.org/abs/2108.13138. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,

[23]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

URL https://arxiv.org/abs/1910.01108. Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium, October-November

[24]

Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://www.aclweb.org/anthology/D18-1404. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,

[25]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642,

[26]

Self-conditioning pre-trained language models

Xavier Suau, Luca Zappella, and Nicholas Apostoloff. Self-conditioning pre-trained language models. arXiv preprint arXiv:2110.02802,

[27]

Whispering experts: Neural interventions for toxicity mitigation in language models

Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodríguez. Whispering experts: Neural interventions for toxicity mitigation in language models. arXiv preprint arXiv:2407.12824,

[28]

Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. Extracting latent steering vectors from pretrained language models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 566–581. Association for Computational Linguistics,

[29]

doi: 10.18653/V1/2022.FINDINGS-ACL.48. URL https://doi.org/10.18653/v1/2022.findings-acl.48. Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks,

[30]

Axiomatic Attribution for Deep Networks

URL https://arxiv.org/abs/1703.01365. Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July

[31]

Diagnostic classifiers: Revealing how neural networks process hierarchical structure

Sara Veldhoen, Dieuwke Hupkes, and Willem Zuidema. Diagnostic classifiers: Revealing how neural networks process hierarchical structure. In Pre-Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (CoCo @ NIPS 2016),

[32]

Neurons in large language models: Dead, n-gram, positional

Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. arXiv preprint arXiv:2309.04827,

[33]

Assessing the brittleness of safety alignment via pruning and low-rank modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27,

[34]

URL https://proceedings.neurips.cc/paper_files/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf.

A Impact Statement

This work advances neural network interpretability by providing a fine-grained understanding of concept encoding in language models. The proposed NeuronLens framework enables precise control of model behavior, benefiting resear...

[35]

In this formulation, the activations of selected neurons are scaled down by a factor α instead of being completely suppressed. Here, x represents neuron activation. The rationale behind dampening is that a fixed intervention (like zeroing out) can disrupt the LLM’s inference dynamics, especially when a large number of neurons (k) are involved, thereby lim...
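As a minimal sketch of the dampening idea described above, the snippet below multiplies the activations of the selected neurons by α and leaves all other neurons untouched. The function name `dampen`, the list-based representation, and the example values are assumptions made for illustration; the passage gives only the scaling idea, not an implementation.

```python
def dampen(activations, selected, alpha):
    """Scale activations at the selected neuron indices by alpha.

    activations: list of floats (one value x per neuron).
    selected: iterable of neuron indices chosen for intervention.
    alpha: dampening factor; alpha = 0.0 reproduces zeroing out.
    """
    selected = set(selected)
    return [alpha * x if i in selected else x for i, x in enumerate(activations)]


# Hypothetical activations for four neurons; neurons 0 and 3 are selected.
x_example = [2.0, -1.0, 0.5, 4.0]
dampened_example = dampen(x_example, [0, 3], alpha=0.5)
zeroed_example = dampen(x_example, [0, 3], alpha=0.0)
```

Setting `alpha=0.0` recovers the fixed zeroing-out intervention, which makes the two strategies easy to compare on the same selection of neurons.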

[36]

At the boundaries (x = µ ± 2.5σ), a(x) = β, and the activation is minimally dampened. Values within the range are scaled proportionally based on their normalized distance from the mean. This adaptive dampening mechanism suppresses values near the mean while preserving those closer to the range edges. The dampening factor β can be optimized for different ...
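The exact form of a(x) is not visible in this excerpt, so the sketch below assumes one simple choice that is consistent with the stated properties: a(x) = β · |x − µ| / (2.5σ) inside the range, which gives a(x) = β at the boundaries and a(x) = 0 at the mean. Activations outside the range are assumed to pass through unchanged. The function name `adaptive_dampen` and the example numbers are hypothetical.

```python
def adaptive_dampen(x, mu, sigma, beta, width=2.5):
    """Range-based dampening under an ASSUMED linear form for a(x).

    Inside [mu - width*sigma, mu + width*sigma] the activation is multiplied by
    a(x) = beta * |x - mu| / (width * sigma); outside the range it is returned as is.
    """
    half_range = width * sigma
    distance = abs(x - mu)
    if half_range <= 0 or distance > half_range:
        return x  # outside the concept's range: no intervention
    scale = beta * distance / half_range  # 0 at the mean, beta at the boundary
    return scale * x


# Hypothetical concept range: mu = 0.0, sigma = 2.0, so the range is [-5.0, 5.0].
at_mean = adaptive_dampen(0.0, mu=0.0, sigma=2.0, beta=0.8)
midway = adaptive_dampen(2.5, mu=0.0, sigma=2.0, beta=0.8)
at_boundary = adaptive_dampen(5.0, mu=0.0, sigma=2.0, beta=0.8)
outside = adaptive_dampen(6.0, mu=0.0, sigma=2.0, beta=0.8)
```

Under this assumed form, activations near the mean are suppressed most strongly, while those at the range edges keep a fraction β of their magnitude.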

[37]

extract high neural activations as a saliency ranking metric relying upon the rationale that maximally activating neurons are salient as these neurons play a critical role in controlling the model’s output, highlighting their importance for a concept c. To identify them, the column-wise mean of absolute neuronal activations in $H^l_c$, $H^l_c$ is defined in Sec...
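The column-wise mean of absolute activations can be sketched as follows. This is an illustration only: the function name `max_activation_ranking` and the toy matrix are hypothetical, and the matrix is assumed to hold one row per token and one column per neuron for a single concept c at a single layer l.

```python
def max_activation_ranking(H):
    """Rank neurons by the column-wise mean of absolute activations.

    H: list of rows, one per token of concept c, each a list of d activations.
    Returns (scores, ranking) with ranking sorted by descending score.
    """
    n, d = len(H), len(H[0])
    scores = [sum(abs(row[j]) for row in H) / n for j in range(d)]
    ranking = sorted(range(d), key=lambda j: -scores[j])
    return scores, ranking


# Hypothetical activations: 2 tokens, 3 neurons.
H_example = [
    [0.2, -3.0, 1.0],
    [-0.4, 2.0, 1.0],
]
scores_example, act_ranking_example = max_activation_ranking(H_example)
```

Taking absolute values before averaging means that a neuron with large activations of mixed sign (neuron 1 here) still ranks as highly activating.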
