pith. sign in

arxiv: 2606.07604 · v1 · pith:KVY76GSCnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

Pith reviewed 2026-06-28 23:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords contribution weightsattention sinksself-attentiontoken importancetransformersvalue vectorsmechanistic interpretability
0
0 comments X

The pith

Contribution weights, which multiply attention by value magnitude and alignment with layer output, measure token importance more accurately than attention alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes contribution weights to fix a limitation in how attention is interpreted in transformers. Standard attention scores ignore the size and direction of the value vectors being averaged, so they can misrank which tokens actually drive the output. The new metric projects each attended value onto the final layer output direction and scales by attention and magnitude. Across decoder-only models and tasks, this ranking better matches semantically important tokens. The same geometric view shows attention sinks actively reduce output norm in a convex way, countering drift from uncertain tokens.

Core claim

Contribution weights are computed as the attention weight times the Euclidean norm of the value vector times the cosine of the angle between that value vector and the layer's output vector. These weights rank tokens by influence more faithfully than raw attention across models, tasks, and datasets. Attention sinks display a convex relationship between sink attention rate and output norm, actively suppressing information and stabilizing representations against semantic drift of low-confidence tokens.

What carries the argument

Contribution Weights: the projection of each value vector onto the layer output direction, scaled by its attention weight and magnitude.

If this is right

  • Token importance rankings become more reliable for mechanistic analysis of decoder-only transformers.
  • Attention sinks can be studied as active stabilizers rather than passive attention repositories.
  • Interpretability methods gain a geometric component that accounts for value vector properties.
  • Analysis of information flow can be applied consistently across different models and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The metric could guide targeted interventions such as editing or pruning by focusing on tokens with high projected contribution.
  • Similar projection ideas might apply to other attention variants or non-transformer architectures where aggregation occurs.
  • If the convex sink relationship holds, it points to possible architectural changes that deliberately tune sink behavior for stability.
  • The approach separates correlation from causation in attribution, which could be tested by comparing against gradient-based or intervention-based baselines.

Load-bearing premise

The projection onto the layer output direction, scaled by attention and magnitude, genuinely quantifies causal influence on the output rather than merely correlating with output geometry.

What would settle it

An ablation study in which tokens ranked highest by contribution weights are removed or perturbed and the output changes less than when tokens ranked highest by attention weights are removed.

Figures

Figures reproduced from arXiv: 2606.07604 by Harry Jake Cunningham, Nicola Muca Cirone.

Figure 1
Figure 1. Figure 1: Contribution weights. (Top) Attention matrices and (Bottom) corresponding contribution weight matrices, at two ran￾domly selected heads and layers, computed using LLAMA-3.1- 8B. Whilst attention weights place a large amount of attention mass on the first token, examining the contribution weights reveals that these tokens contribute very little to the output. Bastings & Filippova; Lopardo et al.), arguing t… view at source ↗
Figure 2
Figure 2. Figure 2: Contribution weight decomposition. We visualise the multiplicative components of the inter-head contribution weights cˆij : (i) the attention weights αij , (ii) the relative norm ∥v˜ (l,h) j ∥/∥o (l) i ∥, and (iii) the cosine similarity cos(o (l) i , v˜ (l,h) j ). All quantities are computed on the same input for a randomly se￾lected layer and head of LLaMA-3.1-8B. 1. Attention weights αij : determines the… view at source ↗
Figure 3
Figure 3. Figure 3: Contribution weights most reliably predict token importance. Change in perplexity (∆PPL) on GSM8K under increasing token-removal rates, comparing four contribution measures: (1) contribution weights, (2) value-weighted output-normalised attention (VW-ON), (3) value-weighted attention, and (4) raw attention weights. Top: Removal of top percentiles. Bottom: Removal of bottom percentiles. Contribution weights… view at source ↗
Figure 4
Figure 4. Figure 4: Contribution weights maintain downstream performance under high token-removal rates. Change in accuracy on MMLU, BoolQ, Lambada and Winogrande under increasing token-removal rates, comparing four contribution measures: (1) contribution weights, (2) value-weighted output-normalised attention (VW-ON), (3) value-weighted attention, and (4) raw attention weights. Contribution weights maintain baseline performa… view at source ↗
Figure 5
Figure 5. Figure 5: Value vector geometry. (a) Average alignment (cosine similarity) between the initial ([BOS]/sink) token’s value vector and the tail tokens’ value vectors as a function of relative depth. (b) Sink rate αj,1 (x-axis) against output norm ∥oj∥ (y-axis) for all tokens in 32 batches, for a fixed (l, h) choice in LLAMA-3.1-8B. We plot in red the prediction obtained via our model outlined in Equation (20), and col… view at source ↗
Figure 6
Figure 6. Figure 6: Sinks oppose small magnitude attention weights. Head outputs o (l,h) i projected onto the semantic direction µˆ and its orthogonal complement uˆ, coloured by output norm. The dashed cone contains 90% of non-sink values; the orange arrow shows the sink vector. (a) With the sink token present, low-norm heads (dark blue) lie outside the semantic cone, either orthogonal or negatively aligned with µˆ. (b) When … view at source ↗
Figure 7
Figure 7. Figure 7: Variation of attention weights, value norms and value alignment. Pairwise relationships between attention weights α (l,h) ij , projected value norms ∥v˜ (l,h) j ∥2 and alignment with the output direction cos(o (l) i , v˜ (l,h) j ) across all layers l and heads h of Llama-8B. The large dispersion across all three plots shows how these factors vary independently. Indeed, high attention weights do not guarant… view at source ↗
Figure 8
Figure 8. Figure 8: Intervention dataset ablation. Language modeling loss when removing tokens based on different importance metrics on FineWeb-Edu. Top row shows removal of high-importance tokens; bottom row shows removal of low-importance tokens. Contribution weights (blue) consistently outperform attention-based baselines (red, orange) across all models, demonstrating robustness to domain shifts from mathematical reasoning… view at source ↗
Figure 9
Figure 9. Figure 9: Computation graph ablation using global percentiles. Language modeling loss when importance scores are computed using global percentiles across all layers rather than layer-specific thresholds. While this introduces instability in models with multiple sinks (Qwen3-8B, DeepSeek-7B), contribution weights (blue) maintain their advantage over attention-based methods, though the gap narrows compared to layer-sp… view at source ↗
Figure 10
Figure 10. Figure 10: Sink token ablation. Language modeling loss when attention sinks are permitted to be removed. Top row (high-importance removal): attention-based metrics (red) cause immediate collapse by removing structural sinks, while contribution weights (blue) prioritize semantic content. Bottom row (low-importance removal): Qwen3-8B shows early degradation due to multiple sinks being classified as low-importance and … view at source ↗
Figure 11
Figure 11. Figure 11: Left: Linear fit of ||w (l,h) i || against α (l,h) i,1 . Right: Linear fit of ⟨v˜ (l,h) 1 , w (l,h) i ⟩ against α (l,h) i,1 . Top: Aggregate R 2 statistics. Bottom: Fit for a fixed index (l, h), each point corresponds to a token coloured according to its relative sequence position. implies that the minimum norm is not attained at α = 1 as a simplistic understanding of sink token as suppressor would lead t… view at source ↗
Figure 12
Figure 12. Figure 12: Contribution weight matrices. Attention and contribution weights computed from LLaMA-3.1-8B evaluated on FineWeb-Edu. Rows alternate between attention weights and contribution weights. All attention weights and contribution weights are plotted with the same colourbar axis from [0, 1] and [−0.01, 0.15] respectively. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Contribution weight matrices. Attention and contribution weights computed from Qwen2.5-7B evaluated on FineWeb-Edu. Rows alternate between attention weights and contribution weights. All attention weights and contribution weights are plotted with the same colourbar axis from [0, 1] and [−0.01, 0.15] respectively. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Contribution weight matrices. Attention and contribution weights computed from Qwen3-8B evaluated on FineWeb-Edu. Rows alternate between attention weights and contribution weights. All attention weights and contribution weights are plotted with the same colourbar axis from [0, 1] and [−0.01, 0.15] respectively. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
read the original abstract

Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations as it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce \emph{Contribution Weights}, a projection-based metric that quantifies a token's influence by accounting for it's attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different decoder-only models, tasks, and datasets. Further, our metric enables novel mechanistic analysis of \emph{attention sinks}. While previous work characterized sinks as passive repositories for excess attention, we reveal they serve an active functional role, suppressing information through a convex relationship between sink rate and output norm, stabilizing representations by opposing the semantic drift of low-confidence tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Contribution Weights, a projection-based metric for token importance in decoder-only transformers defined as the product of attention weight, value-vector magnitude, and cosine alignment with the layer output vector. It claims this metric more faithfully identifies semantically critical tokens than attention weights alone, with consistent outperformance across models, tasks, and datasets, and further uses the metric to argue that attention sinks play an active stabilizing role via a convex relationship between sink rate and output norm that opposes semantic drift.

Significance. If the outperformance claims are supported by controlled intervention experiments that separate definitional correlation from causal predictive power, the work would supply a geometrically motivated importance measure that could improve mechanistic interpretability of self-attention. The reframing of attention sinks as functional stabilizers rather than passive repositories would also be a useful contribution to the literature on LLM representation dynamics.

major comments (2)
  1. [Abstract and §3 (metric definition)] The contribution weight is defined (see abstract and §3) as attn_i ⋅ ||v_i|| ⋅ cos∠(v_i, o) where o is the layer output; by the definition of the dot product this quantity is exactly the scalar projection of each attended value onto o, so the sum of contribution weights reconstructs ||o|| (up to scaling). Any claim that the metric 'more faithfully' ranks semantically critical tokens therefore requires explicit causal evidence (e.g., ablation or patching experiments showing superior prediction of downstream loss or semantic change) rather than geometric correlation by construction.
  2. [Abstract] The abstract asserts 'consistent outperformance' and a 'convex relationship between sink rate and output norm' but supplies no description of datasets, baselines, statistical tests, error bars, or controls that would allow verification that the reported superiority is not an artifact of the metric's definitional alignment with the output geometry.
minor comments (1)
  1. [§3] Notation for the output vector o and the angle should be introduced with an explicit equation in the main text rather than only in the abstract.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and precise comments, which highlight important distinctions between geometric definition and empirical validation. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and §3 (metric definition)] The contribution weight is defined (see abstract and §3) as attn_i ⋅ ||v_i|| ⋅ cos∠(v_i, o) where o is the layer output; by the definition of the dot product this quantity is exactly the scalar projection of each attended value onto o, so the sum of contribution weights reconstructs ||o|| (up to scaling). Any claim that the metric 'more faithfully' ranks semantically critical tokens therefore requires explicit causal evidence (e.g., ablation or patching experiments showing superior prediction of downstream loss or semantic change) rather than geometric correlation by construction.

    Authors: We agree that the defined contribution weight is mathematically identical to the normalized scalar projection of each value vector onto the output direction. This equivalence is intentional and forms the core geometric motivation of the metric: it isolates each token's additive contribution to ||o|| along the output axis. The manuscript demonstrates outperformance by showing that contribution-weight rankings align more closely with independently measured semantic importance (via token ablation effects on next-token distributions and downstream task accuracy) than attention weights alone, across multiple models and datasets. We acknowledge that these comparisons are correlational rather than full causal interventions such as activation patching. We will revise §3 to explicitly state the projection identity and add a limitations paragraph clarifying the current evidence level, while noting that the geometric construction itself provides a stronger a priori reason to expect improved faithfulness than attention alone. revision: partial

  2. Referee: [Abstract] The abstract asserts 'consistent outperformance' and a 'convex relationship between sink rate and output norm' but supplies no description of datasets, baselines, statistical tests, error bars, or controls that would allow verification that the reported superiority is not an artifact of the metric's definitional alignment with the output geometry.

    Authors: The abstract is intentionally concise, but the full manuscript details the experimental protocol: evaluations on five decoder-only models (including Llama-2/3 and Mistral variants), multiple tasks (next-token prediction and classification), and datasets drawn from standard benchmarks plus custom semantic perturbation sets. Baselines include raw attention, value magnitude, and random rankings; results report means and standard deviations over five random seeds with statistical significance tests. We will revise the abstract to include a short clause such as 'across five models and three benchmarks with statistical controls' to enable immediate verification while respecting length constraints. revision: yes

standing simulated objections not resolved
  • The current manuscript does not contain activation-patching or full causal-intervention experiments that would isolate the metric's predictive power from its definitional alignment with the output geometry.

Circularity Check

0 steps flagged

No significant circularity; metric definition is explicit and claims rest on empirical comparisons

full rationale

The paper introduces Contribution Weights as a new metric explicitly constructed from attention weights, value magnitudes, and alignment with the layer output vector. Its central claims are empirical: consistent outperformance versus attention-based metrics on token identification tasks across models/datasets, plus mechanistic observations on attention sinks. No derivation chain, first-principles prediction, or result is shown that reduces by construction to its inputs (no equations presented as derived that equal the defining inputs). The metric is defined to quantify geometric contribution to the output by design; downstream empirical tests compare it to baselines on external tasks rather than relying on definitional tautology. Self-contained against external benchmarks via direct comparisons; no load-bearing self-citation or ansatz smuggling identified from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the unstated premise that projection onto the output direction measures influence, plus the empirical assumption that the reported convex relationship is robust across models. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption Projection of value vectors onto the layer output direction, scaled by attention weight and magnitude, quantifies a token's true influence on the output.
    This premise is required for the claim that Contribution Weights are more faithful; it is invoked when the metric is introduced and when outperformance is asserted.

pith-pipeline@v0.9.1-grok · 5683 in / 1355 out tokens · 20061 ms · 2026-06-28T23:27:04.010467+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

176 extracted references · 143 canonical work pages · 49 internal anchors

  1. [1]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  2. [2]

    T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

  3. [3]

    M. J. Kearns , title =

  4. [4]

    Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

  5. [5]

    R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

  6. [6]

    Suppressed for Anonymity , author=

  7. [7]

    Newell and P

    A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

  8. [8]

    A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

  9. [9]

    Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =

    Lopardo, Gianluigi and Precioso, Frederic and Garreau, Damien , langid =. Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =

  10. [10]

    and Filippova, K

    Bastings, Jasmijn and Filippova, Katja , urldate =. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? , url =. doi:10.48550/arXiv.2010.05607 , shorttitle =. 2010.05607 [cs] , keywords =

  11. [11]

    How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =

    Mickus, Timothee and Paperno, Denis and Constant, Mathieu , urldate =. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =. doi:10.1162/tacl_a_00501 , shorttitle =

  12. [12]

    Vision Transformers Need Registers

    Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr , urldate =. Vision Transformers Need Registers , url =. doi:10.48550/arXiv.2309.16588 , abstract =. 2309.16588 [cs] , keywords =

  13. [13]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Bondarenko, Yelysei and Nagel, Markus and Blankevoort, Tijmen , urldate =. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing , url =. doi:10.48550/arXiv.2306.12929 , shorttitle =. 2306.12929 [cs] , keywords =

  14. [14]

    OrthoRank : Token selection via sink token orthogonality for efficient LLM inference

    Shin, Seungjun and Oh, Jaehoon and Oh, Dokwan , urldate =. doi:10.48550/arXiv.2507.03865 , shorttitle =. 2507.03865 [cs] , keywords =

  15. [15]

    Zhang, Stephen and Khan, Mustafa and Papyan, Vardan , langid =

  16. [16]

    Massive Activations in Large Language Models

    Sun, Mingjie and Chen, Xinlei and Kolter, J. Zico and Liu, Zhuang , urldate =. Massive Activations in Large Language Models , url =. doi:10.48550/arXiv.2402.17762 , abstract =. 2402.17762 [cs] , keywords =

  17. [17]

    Spectral filters, dark signals, and attention sinks

    Cancedda, Nicola , urldate =. Spectral Filters, Dark Signals, and Attention Sinks , url =. doi:10.48550/arXiv.2402.09221 , abstract =. 2402.09221 [cs] , note =

  18. [18]

    House of Cards: Massive Weights in

    Oh, Jaehoon and Shin, Seungjun and Oh, Dokwan , urldate =. House of Cards: Massive Weights in. doi:10.48550/arXiv.2410.01866 , shorttitle =. 2410.01866 [cs] , keywords =

  19. [19]

    Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

    Yu, Zhongzhi and Wang, Zheng and Fu, Yonggan and Shi, Huihong and Shaikh, Khalid and Lin, Yingyan Celine , urldate =. Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration , url =. doi:10.48550/arXiv.2406.15765 , shorttitle =. 2406.15765 [cs] , note =

  20. [20]

    doi:10.48550/arXiv.2506.15545 , shorttitle =

    Wang, Bailin and Lan, Chang and Wang, Chong and Pang, Ruoming , urldate =. doi:10.48550/arXiv.2506.15545 , shorttitle =. 2506.15545 [cs] , keywords =

  21. [21]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth,. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , url =. doi:10.48550/arXiv.2304.01373 , shorttitle =. 2304.01373 [cs] , keywords =

  22. [22]

    Qwen2.5 Technical Report

    Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

  23. [23]

    Differential transformer.arXiv preprint arXiv:2410.05258, 2024

    Ye, Tianzhu and Dong, Li and Xia, Yuqing and Sun, Yutao and Zhu, Yi and Huang, Gao and Wei, Furu , urldate =. Differential Transformer , url =. doi:10.48550/arXiv.2410.05258 , abstract =. 2410.05258 [cs] , keywords =

  24. [24]

    Training Verifiers to Solve Math Word Problems

    Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , urldate =. Training Verifiers to Solve Math Word Problems , url =. doi:10.48550/arXiv.2110.14168 , abstract =. 2110.14168 [cs]...

  25. [25]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1905.10044 , shorttitle =. 1905.10044 [cs] , keywords =

  26. [26]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , urldate =. doi:10.48550/arXiv.1907.10641 , shorttitle =. 1907.10641 [cs] , keywords =

  27. [27]

    Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , date =

  28. [28]

    From memorization to reasoning in the spectrum of loss curvature

    Merullo, Jack and Vatsavaya, Srihita and Bushnaq, Lucius and Lewis, Owen , urldate =. From Memorization to Reasoning in the Spectrum of Loss Curvature , url =. doi:10.48550/arXiv.2510.24256 , abstract =. 2510.24256 [cs] , keywords =

  29. [29]

    Nested Learning: The Illusion of Deep Learning Architectures , abstract =

    Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peiling and Mirrokni, Vahab , langid =. Nested Learning: The Illusion of Deep Learning Architectures , abstract =

  30. [30]

    Attention is not Explanation

    Jain, Sarthak and Wallace, Byron C. , urldate =. Attention is not Explanation , url =. doi:10.48550/arXiv.1902.10186 , abstract =. 1902.10186 [cs] , keywords =

  31. [31]

    Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

  32. [32]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1810.04805 , shorttitle =. 1810.04805 [cs] , keywords =

  33. [33]

    Shared Global and Local Geometry of Language Model Em- beddings , abstract =

    Lee, Andrew and Weber, Melanie and Viegas, Fernanda and Wattenberg, Martin , date =. Shared Global and Local Geometry of Language Model Em- beddings , abstract =

  34. [34]

    Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =

    He, Zhengfu and Wang, Junxuan and Lin, Rui and Ge, Xuyang and Shu, Wentao and Tang, Qiong and Zhang, Junping and Qiu, Xipeng , urldate =. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =. doi:10.48550/arXiv.2504.20938 , abstract =. 2504.20938 [cs] , keywords =

  35. [35]

    Barbero, Federico and Arroyo, Alvaro and Gu, Xiangming and Perivolaropoulos, Christos and Bronstein, Michael and Veličković, Petar and Pascanu, Razvan , urldate =. Why do. doi:10.48550/arXiv.2504.02732 , abstract =. 2504.02732 [cs] , keywords =

  36. [36]

    Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J Zico , date =. A

  37. [37]

    Q-Filters: Leveraging

    Godey, Nathan and Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale and Clergerie, Éric de la and Sagot, Benoît , urldate =. Q-Filters: Leveraging. doi:10.48550/arXiv.2503.02812 , shorttitle =. 2503.02812 [cs] , keywords =

  38. [38]

    doi:10.48550/arXiv.2410.12850 , shorttitle =

    Yan, Ruiqing and Zheng, Linghan and Du, Xingbo and Zou, Han and Guo, Yufeng and Yang, Jianfei , urldate =. doi:10.48550/arXiv.2410.12850 , shorttitle =. 2410.12850 [cs] , keywords =

  39. [39]

    Measuring Massive Multitask Language Understanding

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , urldate =. Measuring Massive Multitask Language Understanding , url =. doi:10.48550/arXiv.2009.03300 , abstract =. 2009.03300 [cs] , keywords =

  40. [40]

    Penedo, Guilherme and Kydlíček, Hynek and allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Werra, Leandro Von and Wolf, Thomas , urldate =. The. doi:10.48550/arXiv.2406.17557 , shorttitle =. 2406.17557 [cs] , keywords =

  41. [41]

    Attention Layers Add Into Low-Dimensional Residual Subspaces , url =

    Wang, Junxuan and Ge, Xuyang and Shu, Wentao and He, Zhengfu and Qiu, Xipeng , urldate =. Attention Layers Add Into Low-Dimensional Residual Subspaces , url =. doi:10.48550/arXiv.2508.16929 , abstract =. 2508.16929 [cs] , note =

  42. [42]

    Goel, Karan , langid =

  43. [43]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike , urldate =. Efficient Streaming Language Models with Attention Sinks , url =. doi:10.48550/arXiv.2309.17453 , abstract =. 2309.17453 [cs] , keywords =

  44. [44]

    and Fedorenko, Evelina , urldate =

    Hosseini, Eghbal A. and Fedorenko, Evelina , urldate =. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language , url =. doi:10.1101/2023.11.05.564832 , abstract =

  45. [45]

    Which Attention Heads Matter for In-Context Learning? , url =

    Yin, Kayo and Steinhardt, Jacob , urldate =. Which Attention Heads Matter for In-Context Learning? , url =. doi:10.48550/arXiv.2502.14010 , abstract =. 2502.14010 [cs] , keywords =

  46. [46]

    Value Residual Learning For Alleviating Attention Concentration In Transformers , url =

    Zhou, Zhanchao and Wu, Tianyi and Jiang, Zhiyun and Lan, Zhenzhong , urldate =. Value Residual Learning For Alleviating Attention Concentration In Transformers , url =. doi:10.48550/arXiv.2410.17897 , abstract =. 2410.17897 [cs] , note =

  47. [47]

    A Mathematical Framework for Transformer Circuits , author =

  48. [48]

    A Mathematical Framework for Transformer Circuits , url =

  49. [49]

    GLU Variants Improve Transformer

    Shazeer, Noam , urldate =. doi:10.48550/arXiv.2002.05202 , abstract =. 2002.05202 [cs] , keywords =

  50. [50]

    Root Mean Square Layer Normalization

    Zhang, Biao and Sennrich, Rico , urldate =. Root Mean Square Layer Normalization , url =. doi:10.48550/arXiv.1910.07467 , abstract =. 1910.07467 [cs] , keywords =

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timothée and Rozière, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume , urldate =. doi:10.48550/arXiv.2302.13971 , shorttitle =. 2302.13971 [cs] , ...

  52. [52]

    Attention Is All You Need

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , urldate =. Attention Is All You Need , url =. doi:10.48550/arXiv.1706.03762 , abstract =. 1706.03762 [cs] , keywords =

  53. [53]

    Deep Gaussian Processes: Advances in Models and Inference , author =

  54. [54]

    Efficient and Scalable Language Models for Training, Inference and Test-Time Compute , author =

  55. [55]

    Ge, Suyu and Zhang, Yunan and Liu, Liyuan and Zhang, Minjia and Han, Jiawei and Gao, Jianfeng , date =

  56. [56]

    A simple and effective \ l\_2\ norm-based strategy for KV cache compression

    Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale , urldate =. A Simple and Effective \ L\_2\ Norm-Based Strategy for. doi:10.48550/arXiv.2406.11430 , abstract =. 2406.11430 [cs] , keywords =

  57. [57]

    Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters, b

    Guo, Zhiyu and Kamigaito, Hidetaka and Watanabe, Taro , urldate =. Attention Score is not All You Need for Token Importance Indicator in. doi:10.48550/arXiv.2406.12335 , shorttitle =. 2406.12335 [cs] , keywords =

  58. [58]

    Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning , url =

    Team, Ling and Han, Bin and Tang, Caizhi and Liang, Chen and Zhang, Donghao and Yuan, Fan and Zhu, Feng and Gao, Jie and Hu, Jingyu and Li, Longfei and Li, Meng and Zhang, Mingyang and Jiang, Peijie and Jiao, Peng and Zhao, Qian and Yang, Qingyuan and Shen, Wenbo and Yang, Xinxing and Zhang, Yalin and Ren, Yankun and Zhao, Yao and Cao, Yibo and Sun, Yixua...

  59. [59]

    Gupta, Akshat and Yeung, Jay and Anumanchipalli, Gopala and Ivanova, Anna , urldate =. How Do. doi:10.48550/arXiv.2510.18871 , abstract =. 2510.18871 [cs] , note =

  60. [60]

    Your Transformer is Secretly Linear , url =

    Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Gerasimenko, Nikolai and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. Your Transformer is Secretly Linear , url =. doi:10.48550/arXiv.2405.12250 , abstract =. 2405.12250 [cs] , keywords =

  61. [61]

    Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =

    Zhao, Zheng and Ziser, Yftah and Cohen, Shay B , editor =. Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , publisher =. doi:10.18653/v1/2024.emnlp-main.847 , shorttitle =

  62. [62]

    The Thirteenth International Conference on Learning Representations , year=

    The Unreasonable Ineffectiveness of the Deeper Layers , author=. The Thirteenth International Conference on Learning Representations , year=

  63. [63]

    When Attention Sink Emerges in Language Models: An Empirical View

    Gu, Xiangming and Pang, Tianyu and Du, Chao and Liu, Qian and Zhang, Fengzhuo and Du, Cunxiao and Wang, Ye and Lin, Min , urldate =. When Attention Sink Emerges in Language Models: An Empirical View , url =. doi:10.48550/arXiv.2410.10781 , shorttitle =. 2410.10781 [cs] , keywords =

  64. [64]

    Incorporating residual and normalization layers into analysis of masked language models, b

    Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Incorporating Residual and Normalization Layers into Analysis of Masked Language Models , url =. doi:10.48550/arXiv.2109.07152 , abstract =. 2109.07152 [cs] , keywords =

  65. [65]

    Understanding the skill gap in recurrent language models: The role of the gather-and-aggregate mechanism

    Bick, Aviv and Xing, Eric and Gu, Albert , urldate =. Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism , url =. doi:10.48550/arXiv.2504.18574 , shorttitle =. 2504.18574 [cs] , keywords =

  66. [66]

    Nguyen, Thao and Raghu, Maithra and Kornblith, Simon , date =

  67. [67]

    Attention is not only a weight: Analyzing transformers with vector norms, a

    Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms , url =. doi:10.48550/arXiv.2004.10102 , shorttitle =. 2004.10102 [cs] , keywords =

  68. [68]

    Pitorro, Hugo and Vasylenko, Pavlo and Treviso, Marcos and Martins, André F. T. , urldate =. How Effective are State Space Models for Machine Translation? , url =. doi:10.48550/arXiv.2407.05489 , abstract =. 2407.05489 [cs] , keywords =

  69. [69]

    doi:10.48550/arXiv.2502.15612 , shorttitle =

    Pitorro, Hugo and Treviso, Marcos , urldate =. doi:10.48550/arXiv.2502.15612 , shorttitle =. 2502.15612 [cs] , keywords =

  70. [70]

    Quantifying Attention Flow in Transformers , url =

    Abnar, Samira and Zuidema, Willem , urldate =. Quantifying Attention Flow in Transformers , url =. doi:10.48550/arXiv.2005.00928 , abstract =. 2005.00928 [cs] , keywords =

  71. [71]

    I., and Costa-jussà, M

    Ferrando, Javier and Gállego, Gerard I. and Costa-jussà, Marta R. , urldate =. Measuring the Mixing of Contextual Information in the Transformer , url =. doi:10.48550/arXiv.2203.04212 , abstract =. 2203.04212 [cs] , keywords =

  72. [72]

    , urldate =

    Ferrando, Javier and Sarti, Gabriele and Bisazza, Arianna and Costa-jussà, Marta R. , urldate =. A Primer on the Inner Workings of Transformer-based Language Models , url =. doi:10.48550/arXiv.2405.00208 , abstract =. 2405.00208 [cs] , keywords =

  73. [73]

    The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =

    Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =. doi:10.48550/arXiv.2311.05928 , shorttitle =. 2311.05928 [cs] , keywords =

  74. [74]

    Layer by Layer: Uncovering Hidden Representations in Language Models

    Skean, Oscar and Arefin, Md Rifat and Zhao, Dan and Patel, Niket and Naghiyev, Jalal and. Layer by Layer: Uncovering Hidden Representations in Language Models , url =. doi:10.48550/arXiv.2502.02013 , shorttitle =. 2502.02013 [cs] , keywords =

  75. [75]

    Dasgupta, Sayantan and Cohn, Trevor , date =

  76. [76]

    Similarity of Neural Network Representations Revisited

    Kornblith, Simon and Norouzi, Mohammad and Lee, Honglak and Hinton, Geoffrey , urldate =. Similarity of Neural Network Representations Revisited , url =. doi:10.48550/arXiv.1905.00414 , abstract =. 1905.00414 [cs] , keywords =

  77. [77]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Qiu, Zihan and Wang, Zekun and Zheng, Bo and Huang, Zeyu and Wen, Kaiyue and Yang, Songlin and Men, Rui and Yu, Le and Huang, Fei and Huang, Suozhi and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang , urldate =. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free , url =. doi:10.48550/arXiv.2505.06708 , shorttitle...

  78. [78]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Voita, Elena and Talbot, David and Moiseev, Fedor and Sennrich, Rico and Titov, Ivan , urldate =. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , url =. doi:10.48550/arXiv.1905.09418 , shorttitle =. 1905.09418 [cs] , keywords =

  79. [79]

    Attention is not not Explanation , url =

    Wiegreffe, Sarah and Pinter, Yuval , urldate =. Attention is not not Explanation , url =. doi:10.48550/arXiv.1908.04626 , abstract =. 1908.04626 [cs] , keywords =

  80. [80]

    Identifying and Evaluating Inactive Heads in Pretrained

    Sandoval-Segura, Pedro and Wang, Xijun and Panda, Ashwinee and Goldblum, Micah and Basri, Ronen and Goldstein, Tom and Jacobs, David , urldate =. Identifying and Evaluating Inactive Heads in Pretrained. doi:10.48550/arXiv.2504.03889 , abstract =. 2504.03889 [cs] , keywords =

Showing first 80 references.