
arxiv: 2502.06809 · v3 · submitted 2025-02-04 · 💻 cs.LG · cs.AI · cs.CL

Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

Pith reviewed 2026-05-23 04:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords polysemanticity · activation ranges · neuron interpretation · model manipulation · large language models · interpretability · concept attribution

The pith

Activation ranges within neurons enable more precise concept manipulation in LLMs than whole-neuron interventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that polysemanticity makes it unreliable to tie specific concepts to individual neurons in large language models. Instead, within any given neuron, the activation magnitudes associated with a concept tend to cluster into a distinct range that overlaps little with the ranges of other concepts. This pattern holds across encoder and decoder models and multiple datasets. The authors propose intervening only on those ranges to change target concepts, which leaves auxiliary concepts and overall performance more intact than masking entire neurons. A reader would care because this offers a practical way to steer model behavior with fewer broad side effects.

Core claim

Concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. NeuronLens localizes concept attribution to activation ranges within a neuron, enabling more precise interpretability and targeted manipulation than discrete neuron-level masking.

What carries the argument

NeuronLens, a range-based framework that attributes and intervenes on concept-specific activation magnitudes inside individual neurons rather than on whole neurons.
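
The contrast between the two kinds of intervention can be made concrete with a small sketch. This is not the authors' implementation: the helper names (`fit_range`, `mask_neuron`, `mask_range`), the range width of 2.5 standard deviations, the choice to zero activations inside the range, and the toy activation values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_range(activations, k=2.5):
    """Return (low, high) = mean -/+ k * std of one neuron's activations
    for one concept. The width k is an assumed hyperparameter."""
    mu = mean(activations)
    sigma = stdev(activations)
    return mu - k * sigma, mu + k * sigma


def mask_neuron(activation):
    """Neuron-level masking: the neuron is zeroed for every input."""
    return 0.0


def mask_range(activation, low, high):
    """Range-level masking: zero the activation only when it falls
    inside the range fitted for the target concept."""
    return 0.0 if low <= activation <= high else activation


# Hypothetical activations of one polysemantic neuron for two concepts.
target_acts = [1.9, 2.0, 2.1, 2.0, 1.8, 2.2]
aux_acts = [-1.1, -0.9, -1.0, -1.2, -0.8, -1.0]

low, high = fit_range(target_acts)

# Range masking removes the target concept's signal and leaves the
# auxiliary concept's activations untouched.
masked_target = [mask_range(a, low, high) for a in target_acts]
masked_aux = [mask_range(a, low, high) for a in aux_acts]

# Neuron masking removes both.
neuron_masked_aux = [mask_neuron(a) for a in aux_acts]
```

In this toy setting the two concepts occupy well-separated ranges, so the range mask is perfectly selective; with overlapping ranges some auxiliary activations would be zeroed as well, which is exactly the case the referee report asks the authors to quantify.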

If this is right

  • Range-based interventions manipulate target concepts effectively across encoder and decoder LLMs.
  • They cause substantially less collateral degradation to auxiliary concepts than neuron-level masking.
  • Overall model performance remains higher after range interventions than after neuron masking.
  • The pattern of distinct activation ranges appears consistently across diverse datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Range localization could be applied to safety-related concepts to reduce unintended model behaviors with fewer side effects on capability.
  • The method might extend to multimodal models if similar range patterns appear in vision or audio activations.
  • Combining range interventions with circuit-level analysis could yield finer-grained maps of how concepts interact inside models.

Load-bearing premise

Concept-conditioned activation magnitudes consistently form distinct distributions with minimal overlap across concepts.

What would settle it

A test showing that range-based interventions produce as much or more collateral degradation to auxiliary concepts as whole-neuron masking, or that activation distributions for different concepts overlap heavily.

Figures

Figures reproduced from arXiv: 2502.06809 by A.B. Siddique, Hammad Rizwan, Hassan Sajjad, Muhammad Umair Haider, Peizhong Ju.

Figure 1
Figure 1: NeuronLens leverages distinct, Gaussian-like activation patterns to enable fine-grained concept attribution. While these methods have advanced our understanding of neurons, they often rely on discrete neuron-to-concept mappings, which assume that entire neurons encode single concepts. However, neurons frequently exhibit polysemanticity; the ability to encode multiple, seemingly unrelated concepts [Lec…
Figure 2
Figure 2: Overlap of top 30% salient neurons across classes. The polysemanticity of neuronal units, including salient neurons that encode information about multiple concepts, poses a challenge to neural network interpretation and manipulation. In this section, we discuss the degree of polysemanticity in salient neurons in detail. Polysemanticity often arises when models must represent more features than their capac…
Figure 4
Figure 4: Comparison of neurons 480 and 675 showing class-specific activation patterns and fitted Gaussian curves. Both neurons were salient across all classes in top 5% on AG-News. … 2022]. Additionally, Lecomte et al. [2024] show that even with sufficient capacity, certain weight initializations can induce polysemanticity by placing neurons near multiple conceptual regions. Polysemanticity in salient neurons. Given …
Figure 5
Figure 5: Box plot of neural activation of 11 polysemantic neurons (i.e., neurons in the salient group …
Figure 6
Figure 6: Accuracy comparisons between Neuronal Range manipulation (green) and complete neuron …
Figure 7
Figure 7: Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 1
Figure 9
Figure 9: Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 3
Figure 11
Figure 11: Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 5
Figure 13
Figure 13: Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 7
Figure 15
Figure 15: Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 9
Figure 17
Figure 17: Neuronal Activation Patterns of six neurons on AG-News dataset. Layer 11
Original abstract

Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron-concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder- and decoder-based LLMs across diverse datasets, and observe that even highly salient neurons for specific semantic concepts consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that localizes concept attribution to activation ranges within a neuron. Extensive empirical evaluations show that range-based interventions enable effective manipulation of target concepts while causing substantially less collateral degradation to auxiliary concepts and overall model performance compared to neuron-level masking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that pervasive polysemanticity in LLMs undermines discrete neuron-concept attribution. Systematic analysis across encoder and decoder models reveals that concept-conditioned activation magnitudes form distinct, often Gaussian-like distributions with minimal overlap. Building on this, the authors introduce NeuronLens, a range-based interpretation and manipulation framework, and report that range-based interventions achieve effective target-concept manipulation with substantially less collateral degradation to auxiliary concepts and overall performance than neuron-level masking.

Significance. If the minimal-overlap observation and the superiority of range interventions hold under rigorous quantification, the work could meaningfully shift mechanistic interpretability away from discrete neurons toward range-based attribution, offering a more precise tool for model control. The cross-architecture empirical scope is a strength, but the absence of overlap metrics and statistical detail currently limits the strength of the central claim.

major comments (2)
  1. [Abstract] Abstract: the observation that concept-conditioned activations 'form distinct, often Gaussian-like distributions with minimal overlap' is load-bearing for the NeuronLens hypothesis and the claimed reduction in collateral damage, yet no quantitative measure of overlap (overlap coefficient, Wasserstein distance, intersection-over-union of fitted densities, or fraction of mass above threshold) is supplied.
  2. [Empirical Evaluations] Empirical Evaluations (as summarized in Abstract): the claim of 'substantially less collateral degradation' to auxiliary concepts and model performance is central to the contribution, but the text provides no error bars, statistical tests, data-exclusion criteria, or ablation on range-boundary sensitivity, preventing verification that the reported advantage is not an artifact of loose range definitions or dataset-specific separation.
minor comments (1)
  1. [Abstract] The abstract refers to 'systematic analysis across diverse datasets' without naming the datasets, models, or number of concepts examined, which hinders immediate reproducibility assessment.
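
To make the first major comment concrete, here is a hedged sketch of two of the overlap measures it names, computed for a pair of fitted one-dimensional Gaussians. The function names are hypothetical, the overlap coefficient is approximated numerically with the trapezoidal rule, and the Wasserstein distance shown is the closed-form 2-Wasserstein variant for Gaussians, which is an assumption since the report does not say which order is intended.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def overlap_coefficient(mu1, sigma1, mu2, sigma2, steps=20000):
    """Approximate OVL = integral of min(p1, p2) with the trapezoidal rule.
    1.0 means identical distributions, 0.0 means disjoint ones."""
    lo = min(mu1 - 8 * sigma1, mu2 - 8 * sigma2)
    hi = max(mu1 + 8 * sigma1, mu2 + 8 * sigma2)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        y = min(gaussian_pdf(x, mu1, sigma1), gaussian_pdf(x, mu2, sigma2))
        # Endpoints get half weight in the trapezoidal rule.
        total += y if 0 < i < steps else 0.5 * y
    return total * h


def wasserstein2_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form 2-Wasserstein distance between two 1-D Gaussians."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


# Illustrative parameters only; they are not taken from the paper.
ovl = overlap_coefficient(0.0, 1.0, 2.0, 1.0)
w2 = wasserstein2_gaussians(0.0, 1.0, 3.0, 5.0)
```

Reporting a number of this kind for every concept pair and neuron would turn "minimal overlap" from a visual impression into a checkable quantity.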

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight opportunities to strengthen the quantitative support for our central claims. We address each point below and will incorporate the requested analyses in the revised manuscript.

  1. Referee: [Abstract] Abstract: the observation that concept-conditioned activations 'form distinct, often Gaussian-like distributions with minimal overlap' is load-bearing for the NeuronLens hypothesis and the claimed reduction in collateral damage, yet no quantitative measure of overlap (overlap coefficient, Wasserstein distance, intersection-over-union of fitted densities, or fraction of mass above threshold) is supplied.

    Authors: We agree that explicit quantitative measures are needed to substantiate the minimal-overlap observation. In the revision we will report overlap coefficients, Wasserstein distances, and intersection-over-union values between fitted Gaussian densities for all concept pairs across the evaluated models and datasets. These metrics will be added to the main results and an expanded methods section. revision: yes

  2. Referee: [Empirical Evaluations] Empirical Evaluations (as summarized in Abstract): the claim of 'substantially less collateral degradation' to auxiliary concepts and model performance is central to the contribution, but the text provides no error bars, statistical tests, data-exclusion criteria, or ablation on range-boundary sensitivity, preventing verification that the reported advantage is not an artifact of loose range definitions or dataset-specific separation.

    Authors: We acknowledge that the current presentation lacks the statistical detail required for rigorous verification. The revised manuscript will include error bars (standard error across runs), paired statistical tests comparing range-based versus neuron-masking interventions, explicit data-exclusion criteria, and an ablation study varying range-boundary definitions (e.g., ±1σ, ±2σ, percentile-based). These additions will appear in the empirical evaluations section and supplementary material. revision: yes
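
The range-boundary ablation promised in the second response could look roughly like the following sketch. It is an illustration under stated assumptions, not the authors' protocol: `coverage` and `sweep_range_width` are hypothetical helpers, the ranges are taken as the mean plus or minus k standard deviations of the target concept's activations, and the activation values are made up.

```python
from statistics import mean, stdev


def coverage(activations, low, high):
    """Fraction of activations that fall inside [low, high]."""
    inside = sum(1 for a in activations if low <= a <= high)
    return inside / len(activations)


def sweep_range_width(target_acts, aux_acts, widths=(1.0, 2.0, 2.5)):
    """For each width k, fit the range mean -/+ k * std on the target
    concept and report (k, target coverage, auxiliary coverage).
    High target coverage with low auxiliary coverage is the desired case."""
    mu, sigma = mean(target_acts), stdev(target_acts)
    rows = []
    for k in widths:
        low, high = mu - k * sigma, mu + k * sigma
        rows.append((k,
                     coverage(target_acts, low, high),
                     coverage(aux_acts, low, high)))
    return rows


# Made-up activations of one neuron for a target and an auxiliary concept.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
auxiliary = [0.0, 0.5, 1.5, 6.0]

for k, hit, collateral in sweep_range_width(target, auxiliary, widths=(1.0, 2.0)):
    print(f"k={k}: target covered={hit:.2f}, auxiliary hit={collateral:.2f}")
```

The sweep makes the trade-off visible: widening the range captures more of the target concept but also intervenes on more auxiliary activations, which is why the referee asks for sensitivity results rather than a single boundary choice.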

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's central claims rest on empirical observations of activation distributions across models and datasets, followed by introduction of a range-based framework (NeuronLens) and comparative intervention experiments. No equations, fitted parameters, or hypotheses reduce by construction to inputs; no self-citations are invoked as load-bearing uniqueness theorems; no ansatzes or renamings of known results are presented as derivations. The approach is self-contained via direct measurement and ablation-style comparisons, consistent with the reader's assessment of score 2.0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; evaluation limited to high-level claims.




Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

  1. [1]

    Omer Antverg and Yonatan Belinkov

    URL https://transformer-circuits.pub/2023/toy-double-descent/index.html. Omer Antverg and Yonatan Belinkov. On the pitfalls of analyzing individual neurons in language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

  2. [2]

    URL https://doi.org/10

    doi: 10.1613/JAIR.1.12228. URL https://doi.org/10.1613/jair.1.12228. Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352,

  3. [3]

    doi: 10.18653/v1/2022.acl-long.581

    Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.581. URL https://aclanthology.org/2022.acl-long.581/. Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James R. Glass. What is one grain of sand in the desert? analyzing individual neurons in deep NLP models. In The Thirty-Third AAAI Conference on Artif...

  4. [4]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    URL https://arxiv.org/abs/1810.04805. Nelson Elhage et al. Superposition, memorization, and double descent. Transformer Circuits,

  5. [5]

    N2g: A scalable approach for quantifying interpretable neuron representations in large language models

    Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, and Fazl Barez. N2g: A scalable approach for quantifying interpretable neuron representations in large language models. arXiv preprint arXiv:2304.12918,

  6. [6]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    URL https://arxiv.org/abs/1803.03635. Aaron Grattafiori. The llama 3 herd of models,

  7. [7]

    org/abs/2305.01610

    URL https://arxiv.org/abs/2305.01610. Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models. arXiv preprint arXiv:2401.12181,

  8. [8]

    Comprehensive online network pruning via learnable scaling factors

    Muhammad Umair Haider and Murtaza Taj. Comprehensive online network pruning via learnable scaling factors. In 2021 IEEE International Conference on Image Processing (ICIP) , pages 3557–3561,

  9. [9]

    Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, and Chun Chen

    doi: 10.1109/ICIP42928.2021.9506252. Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, and Chun Chen. Jailbreaklens: Interpreting jailbreak mechanism in the lens of representation and circuit. arXiv preprint arXiv:2411.11114,

  10. [10]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

  11. [11]

    What causes polysemanticity? an alternative origin story of mixed selectivity from incidental causes

    Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, and Sanmi Koyejo. What causes polysemanticity? an alternative origin story of mixed selectivity from incidental causes. In ICLR 2024 Workshop on Representational Alignment,

  12. [12]

    Liu, Matt Gardner, Yonatan Belinkov, Matthew E

    Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo...

  13. [13]

    doi: 10.18653/v1/N19-1112

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1112. URL https://aclanthology.org/N19-1112/. Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language tec...

  14. [14]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. CoRR, abs/2403.19647,

  15. [15]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    doi: 10.48550/ARXIV.2403.19647. URL https://doi.org/10.48550/arXiv.2403.19647. Simon C. Marshall and Jan H. Kirchner. Understanding polysemanticity in neural networks through coding theory,

  16. [16]

    Frank J Massey Jr

    URL https://arxiv.org/abs/2401.17975. Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association, 46(253):68–78,

  17. [17]

    Andonian, Yonatan Belinkov, and David Bau

    Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

  18. [18]

    On the importance of single directions for generalization

    URL https://arxiv.org/abs/1803.06959. Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001,

  19. [19]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683,

  20. [20]

    Resolving lexical bias in edit scoping with projector editor networks

    Hammad Rizwan, Domenic Rosati, Ga Wu, and Hassan Sajjad. Resolving lexical bias in edit scoping with projector editor networks. arXiv preprint arXiv:2408.10411,

  21. [21]

    Controlling language and diffusion models by transporting activations

    Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations. arXiv preprint arXiv:2410.23054,

  22. [22]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf

    URL https://arxiv.org/abs/2108.13138. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,

  23. [23]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    URL https://arxiv.org/abs/1910.01108. Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium, October-November

  24. [24]

    doi: 10.18653/v1/D18-1404

    Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://www.aclweb.org/anthology/D18-1404. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,

  25. [25]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642,

  26. [26]

    Self-conditioning pre-trained language models

    Xavier Suau, Luca Zappella, and Nicholas Apostoloff. Self-conditioning pre-trained language models. arXiv preprint arXiv:2110.02802,

  27. [27]

    Whispering experts: Neural interventions for toxicity mitigation in language models

    Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodríguez. Whispering experts: Neural interventions for toxicity mitigation in language models. arXiv preprint arXiv:2407.12824,

  28. [28]

    Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. Extracting latent steering vectors from pretrained language models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 566–581. Association for Computational Linguistics,

  29. [29]

    FINDINGS-ACL.48

    doi: 10.18653/V1/2022.FINDINGS-ACL.48. URL https://doi.org/10.18653/v1/2022.findings-acl.48. Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks,

  30. [30]

    Axiomatic Attribution for Deep Networks

    URL https://arxiv.org/abs/1703.01365. Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July

  31. [31]

    Diagnostic classifiers: Revealing how neural networks process hierarchical structure

    Sara Veldhoen, Dieuwke Hupkes, and Willem Zuidema. Diagnostic classifiers: Revealing how neural networks process hierarchical structure. In Pre-Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (CoCo @ NIPS 2016),

  32. [32]

    Neurons in large language models: Dead, n-gram, positional

    Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. arXiv preprint arXiv:2309.04827,

  33. [33]

    Assessing the brittleness of safety alignment via pruning and low-rank modifications

    Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27,

  34. [34]

    zeroing out

    URL https://proceedings.neurips.cc/paper_files/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf. A. Impact Statement. This work advances neural network interpretability by providing a fine-grained understanding of concept encoding in language models. The proposed NeuronLens framework enables precise control of model behavior, benefiting resear...

  35. [35]

    Here, x represents neuron activation

    In this formulation, the activations of selected neurons are scaled down by a factor α instead of being completely suppressed. Here, x represents neuron activation. The rationale behind dampening is that a fixed intervention (like zeroing out) can disrupt the LLM’s inference dynamics, especially when a large number of neurons (k) are involved, thereby lim...

  36. [36]

    Values within the range are scaled proportionally based on their normalized distance from the mean

    At the boundaries (x = µ ± 2.5σ), a(x) = β, and the activation is minimally dampened. Values within the range are scaled proportionally based on their normalized distance from the mean. This adaptive dampening mechanism suppresses values near the mean while preserving those closer to the range edges. The dampening factor β can be optimized for different ...

  37. [37]

    The magnitude of the means is then considered as a ranking for concept c

    extract high neural activations as a saliency ranking metric relying upon the rationale that maximally activating neurons are salient as these neurons play a critical role in controlling the model’s output, highlighting their importance for a concept c. To identify them, the column-wise mean of absolute neuronal activations in $H^l_c$, $H^l_c$ is defined in Sec...

  38. [38]

    The element-wise difference between mean vectors is computed as $r = \sum_{c, c' \in C} |q(c) - q(c')|$, where $r \in \mathbb{R}^d$ and $d$ is the hidden dimension

    examine individual neurons, without the need for auxiliary classifiers, using the element-wise difference between mean vectors. The element-wise difference between mean vectors is computed as $r = \sum_{c, c' \in C} |q(c) - q(c')|$, where $r \in \mathbb{R}^d$ and $d$ is the hidden dimension. The final neuron saliency ranking is obtained by sorting $r$ in descending order. Table 8: Perf...
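
The mean-difference saliency ranking quoted in the last excerpt can be sketched in a few lines of plain Python. The function names and the toy activations are hypothetical, and the sum here runs over unordered concept pairs; summing over ordered pairs, as the formula could also be read, doubles every score and leaves the ranking unchanged.

```python
def mean_vector(rows):
    """Column-wise mean q(c) of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def saliency_ranking(acts_by_concept):
    """Score each neuron j by r[j] = sum over concept pairs of
    |q(c)[j] - q(c')[j]| and return (neuron order, scores), with the
    order sorted by descending score."""
    q = {c: mean_vector(rows) for c, rows in acts_by_concept.items()}
    concepts = list(q)
    d = len(q[concepts[0]])  # hidden dimension
    r = [0.0] * d
    for i, c in enumerate(concepts):
        for c2 in concepts[i + 1:]:
            for j in range(d):
                r[j] += abs(q[c][j] - q[c2][j])
    order = sorted(range(d), key=lambda j: r[j], reverse=True)
    return order, r


# Toy activations: two concepts, two samples each, three neurons.
acts = {
    "a": [[1.0, 0.0, 2.0], [1.0, 0.0, 4.0]],
    "b": [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
}
order, scores = saliency_ranking(acts)
```

Neurons whose mean activation differs most between concepts rank first; a neuron with identical means for every concept receives a score of zero, even though its per-concept activation ranges could still differ in spread.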