Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
Pith reviewed 2026-06-28 23:27 UTC · model grok-4.3
The pith
Contribution weights, which multiply attention by value magnitude and alignment with layer output, measure token importance more accurately than attention alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contribution weights are computed as the attention weight times the Euclidean norm of the value vector times the cosine of the angle between that value vector and the layer's output vector. These weights rank tokens by influence more faithfully than raw attention across models, tasks, and datasets. Attention sinks display a convex relationship between sink attention rate and output norm, actively suppressing information and stabilizing representations against semantic drift of low-confidence tokens.
What carries the argument
Contribution Weights: the projection of each value vector onto the layer output direction, scaled by its attention weight and magnitude.
If this is right
- Token importance rankings become more reliable for mechanistic analysis of decoder-only transformers.
- Attention sinks can be studied as active stabilizers rather than passive attention repositories.
- Interpretability methods gain a geometric component that accounts for value vector properties.
- Analysis of information flow can be applied consistently across different models and datasets.
Where Pith is reading between the lines
- The metric could guide targeted interventions such as editing or pruning by focusing on tokens with high projected contribution.
- Similar projection ideas might apply to other attention variants or non-transformer architectures where aggregation occurs.
- If the convex sink relationship holds, it points to possible architectural changes that deliberately tune sink behavior for stability.
- The approach separates correlation from causation in attribution, which could be tested by comparing against gradient-based or intervention-based baselines.
Load-bearing premise
The projection onto the layer output direction, scaled by attention and magnitude, genuinely quantifies causal influence on the output rather than merely correlating with output geometry.
What would settle it
An ablation study in which tokens ranked highest by contribution weights are removed or perturbed and the output changes less than when tokens ranked highest by attention weights are removed.
Figures
read the original abstract
Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations as it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce \emph{Contribution Weights}, a projection-based metric that quantifies a token's influence by accounting for it's attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different decoder-only models, tasks, and datasets. Further, our metric enables novel mechanistic analysis of \emph{attention sinks}. While previous work characterized sinks as passive repositories for excess attention, we reveal they serve an active functional role, suppressing information through a convex relationship between sink rate and output norm, stabilizing representations by opposing the semantic drift of low-confidence tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Contribution Weights, a projection-based metric for token importance in decoder-only transformers defined as the product of attention weight, value-vector magnitude, and cosine alignment with the layer output vector. It claims this metric more faithfully identifies semantically critical tokens than attention weights alone, with consistent outperformance across models, tasks, and datasets, and further uses the metric to argue that attention sinks play an active stabilizing role via a convex relationship between sink rate and output norm that opposes semantic drift.
Significance. If the outperformance claims are supported by controlled intervention experiments that separate definitional correlation from causal predictive power, the work would supply a geometrically motivated importance measure that could improve mechanistic interpretability of self-attention. The reframing of attention sinks as functional stabilizers rather than passive repositories would also be a useful contribution to the literature on LLM representation dynamics.
major comments (2)
- [Abstract and §3 (metric definition)] The contribution weight is defined (see abstract and §3) as attn_i ⋅ ||v_i|| ⋅ cos∠(v_i, o) where o is the layer output; by the definition of the dot product this quantity is exactly the scalar projection of each attended value onto o, so the sum of contribution weights reconstructs ||o|| (up to scaling). Any claim that the metric 'more faithfully' ranks semantically critical tokens therefore requires explicit causal evidence (e.g., ablation or patching experiments showing superior prediction of downstream loss or semantic change) rather than geometric correlation by construction.
- [Abstract] The abstract asserts 'consistent outperformance' and a 'convex relationship between sink rate and output norm' but supplies no description of datasets, baselines, statistical tests, error bars, or controls that would allow verification that the reported superiority is not an artifact of the metric's definitional alignment with the output geometry.
minor comments (1)
- [§3] Notation for the output vector o and the angle should be introduced with an explicit equation in the main text rather than only in the abstract.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and precise comments, which highlight important distinctions between geometric definition and empirical validation. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract and §3 (metric definition)] The contribution weight is defined (see abstract and §3) as attn_i ⋅ ||v_i|| ⋅ cos∠(v_i, o) where o is the layer output; by the definition of the dot product this quantity is exactly the scalar projection of each attended value onto o, so the sum of contribution weights reconstructs ||o|| (up to scaling). Any claim that the metric 'more faithfully' ranks semantically critical tokens therefore requires explicit causal evidence (e.g., ablation or patching experiments showing superior prediction of downstream loss or semantic change) rather than geometric correlation by construction.
Authors: We agree that the defined contribution weight is mathematically identical to the normalized scalar projection of each value vector onto the output direction. This equivalence is intentional and forms the core geometric motivation of the metric: it isolates each token's additive contribution to ||o|| along the output axis. The manuscript demonstrates outperformance by showing that contribution-weight rankings align more closely with independently measured semantic importance (via token ablation effects on next-token distributions and downstream task accuracy) than attention weights alone, across multiple models and datasets. We acknowledge that these comparisons are correlational rather than full causal interventions such as activation patching. We will revise §3 to explicitly state the projection identity and add a limitations paragraph clarifying the current evidence level, while noting that the geometric construction itself provides a stronger a priori reason to expect improved faithfulness than attention alone. revision: partial
-
Referee: [Abstract] The abstract asserts 'consistent outperformance' and a 'convex relationship between sink rate and output norm' but supplies no description of datasets, baselines, statistical tests, error bars, or controls that would allow verification that the reported superiority is not an artifact of the metric's definitional alignment with the output geometry.
Authors: The abstract is intentionally concise, but the full manuscript details the experimental protocol: evaluations on five decoder-only models (including Llama-2/3 and Mistral variants), multiple tasks (next-token prediction and classification), and datasets drawn from standard benchmarks plus custom semantic perturbation sets. Baselines include raw attention, value magnitude, and random rankings; results report means and standard deviations over five random seeds with statistical significance tests. We will revise the abstract to include a short clause such as 'across five models and three benchmarks with statistical controls' to enable immediate verification while respecting length constraints. revision: yes
- The current manuscript does not contain activation-patching or full causal-intervention experiments that would isolate the metric's predictive power from its definitional alignment with the output geometry.
Circularity Check
No significant circularity; metric definition is explicit and claims rest on empirical comparisons
full rationale
The paper introduces Contribution Weights as a new metric explicitly constructed from attention weights, value magnitudes, and alignment with the layer output vector. Its central claims are empirical: consistent outperformance versus attention-based metrics on token identification tasks across models/datasets, plus mechanistic observations on attention sinks. No derivation chain, first-principles prediction, or result is shown that reduces by construction to its inputs (no equations presented as derived that equal the defining inputs). The metric is defined to quantify geometric contribution to the output by design; downstream empirical tests compare it to baselines on external tasks rather than relying on definitional tautology. Self-contained against external benchmarks via direct comparisons; no load-bearing self-citation or ansatz smuggling identified from the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Projection of value vectors onto the layer output direction, scaled by attention weight and magnitude, quantifies a token's true influence on the output.
Reference graph
Works this paper leans on
-
[1]
Langley , title =
P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =
2000
-
[2]
T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980
1980
-
[3]
M. J. Kearns , title =
-
[4]
Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983
1983
-
[5]
R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000
2000
-
[6]
Suppressed for Anonymity , author=
-
[7]
Newell and P
A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981
1981
-
[8]
A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959
1959
-
[9]
Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =
Lopardo, Gianluigi and Precioso, Frederic and Garreau, Damien , langid =. Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =
-
[10]
Bastings, Jasmijn and Filippova, Katja , urldate =. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? , url =. doi:10.48550/arXiv.2010.05607 , shorttitle =. 2010.05607 [cs] , keywords =
-
[11]
How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =
Mickus, Timothee and Paperno, Denis and Constant, Mathieu , urldate =. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =. doi:10.1162/tacl_a_00501 , shorttitle =
-
[12]
Vision Transformers Need Registers
Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr , urldate =. Vision Transformers Need Registers , url =. doi:10.48550/arXiv.2309.16588 , abstract =. 2309.16588 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.16588
-
[13]
Quantizable transformers: Removing outliers by helping attention heads do nothing
Bondarenko, Yelysei and Nagel, Markus and Blankevoort, Tijmen , urldate =. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing , url =. doi:10.48550/arXiv.2306.12929 , shorttitle =. 2306.12929 [cs] , keywords =
-
[14]
OrthoRank : Token selection via sink token orthogonality for efficient LLM inference
Shin, Seungjun and Oh, Jaehoon and Oh, Dokwan , urldate =. doi:10.48550/arXiv.2507.03865 , shorttitle =. 2507.03865 [cs] , keywords =
-
[15]
Zhang, Stephen and Khan, Mustafa and Papyan, Vardan , langid =
-
[16]
Massive Activations in Large Language Models
Sun, Mingjie and Chen, Xinlei and Kolter, J. Zico and Liu, Zhuang , urldate =. Massive Activations in Large Language Models , url =. doi:10.48550/arXiv.2402.17762 , abstract =. 2402.17762 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.17762
-
[17]
Spectral filters, dark signals, and attention sinks
Cancedda, Nicola , urldate =. Spectral Filters, Dark Signals, and Attention Sinks , url =. doi:10.48550/arXiv.2402.09221 , abstract =. 2402.09221 [cs] , note =
-
[18]
House of Cards: Massive Weights in
Oh, Jaehoon and Shin, Seungjun and Oh, Dokwan , urldate =. House of Cards: Massive Weights in. doi:10.48550/arXiv.2410.01866 , shorttitle =. 2410.01866 [cs] , keywords =
-
[19]
Fit and prune: Fast and training-free visual token pruning for multi-modal large language models
Yu, Zhongzhi and Wang, Zheng and Fu, Yonggan and Shi, Huihong and Shaikh, Khalid and Lin, Yingyan Celine , urldate =. Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration , url =. doi:10.48550/arXiv.2406.15765 , shorttitle =. 2406.15765 [cs] , note =
-
[20]
doi:10.48550/arXiv.2506.15545 , shorttitle =
Wang, Bailin and Lan, Chang and Wang, Chong and Pang, Ruoming , urldate =. doi:10.48550/arXiv.2506.15545 , shorttitle =. 2506.15545 [cs] , keywords =
-
[21]
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth,. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , url =. doi:10.48550/arXiv.2304.01373 , shorttitle =. 2304.01373 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.01373
-
[22]
Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115
-
[23]
Differential transformer.arXiv preprint arXiv:2410.05258, 2024
Ye, Tianzhu and Dong, Li and Xia, Yuqing and Sun, Yutao and Zhu, Yi and Huang, Gao and Wei, Furu , urldate =. Differential Transformer , url =. doi:10.48550/arXiv.2410.05258 , abstract =. 2410.05258 [cs] , keywords =
-
[24]
Training Verifiers to Solve Math Word Problems
Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , urldate =. Training Verifiers to Solve Math Word Problems , url =. doi:10.48550/arXiv.2110.14168 , abstract =. 2110.14168 [cs]...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2110.14168
-
[25]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1905.10044 , shorttitle =. 1905.10044 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.10044 1905
-
[26]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , urldate =. doi:10.48550/arXiv.1907.10641 , shorttitle =. 1907.10641 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1907.10641 1907
-
[27]
Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , date =
-
[28]
From memorization to reasoning in the spectrum of loss curvature
Merullo, Jack and Vatsavaya, Srihita and Bushnaq, Lucius and Lewis, Owen , urldate =. From Memorization to Reasoning in the Spectrum of Loss Curvature , url =. doi:10.48550/arXiv.2510.24256 , abstract =. 2510.24256 [cs] , keywords =
-
[29]
Nested Learning: The Illusion of Deep Learning Architectures , abstract =
Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peiling and Mirrokni, Vahab , langid =. Nested Learning: The Illusion of Deep Learning Architectures , abstract =
-
[30]
Jain, Sarthak and Wallace, Byron C. , urldate =. Attention is not Explanation , url =. doi:10.48550/arXiv.1902.10186 , abstract =. 1902.10186 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1902.10186 1902
-
[31]
Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2005
-
[32]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1810.04805 , shorttitle =. 1810.04805 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805
-
[33]
Shared Global and Local Geometry of Language Model Em- beddings , abstract =
Lee, Andrew and Weber, Melanie and Viegas, Fernanda and Wattenberg, Martin , date =. Shared Global and Local Geometry of Language Model Em- beddings , abstract =
-
[34]
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =
He, Zhengfu and Wang, Junxuan and Lin, Rui and Ge, Xuyang and Shu, Wentao and Tang, Qiong and Zhang, Junping and Qiu, Xipeng , urldate =. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =. doi:10.48550/arXiv.2504.20938 , abstract =. 2504.20938 [cs] , keywords =
-
[35]
Barbero, Federico and Arroyo, Alvaro and Gu, Xiangming and Perivolaropoulos, Christos and Bronstein, Michael and Veličković, Petar and Pascanu, Razvan , urldate =. Why do. doi:10.48550/arXiv.2504.02732 , abstract =. 2504.02732 [cs] , keywords =
-
[36]
Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J Zico , date =. A
-
[37]
Godey, Nathan and Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale and Clergerie, Éric de la and Sagot, Benoît , urldate =. Q-Filters: Leveraging. doi:10.48550/arXiv.2503.02812 , shorttitle =. 2503.02812 [cs] , keywords =
-
[38]
doi:10.48550/arXiv.2410.12850 , shorttitle =
Yan, Ruiqing and Zheng, Linghan and Du, Xingbo and Zou, Han and Guo, Yufeng and Yang, Jianfei , urldate =. doi:10.48550/arXiv.2410.12850 , shorttitle =. 2410.12850 [cs] , keywords =
-
[39]
Measuring Massive Multitask Language Understanding
Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , urldate =. Measuring Massive Multitask Language Understanding , url =. doi:10.48550/arXiv.2009.03300 , abstract =. 2009.03300 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.03300 2009
-
[40]
Penedo, Guilherme and Kydlíček, Hynek and allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Werra, Leandro Von and Wolf, Thomas , urldate =. The. doi:10.48550/arXiv.2406.17557 , shorttitle =. 2406.17557 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.17557
-
[41]
Attention Layers Add Into Low-Dimensional Residual Subspaces , url =
Wang, Junxuan and Ge, Xuyang and Shu, Wentao and He, Zhengfu and Qiu, Xipeng , urldate =. Attention Layers Add Into Low-Dimensional Residual Subspaces , url =. doi:10.48550/arXiv.2508.16929 , abstract =. 2508.16929 [cs] , note =
-
[42]
Goel, Karan , langid =
-
[43]
Efficient Streaming Language Models with Attention Sinks
Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike , urldate =. Efficient Streaming Language Models with Attention Sinks , url =. doi:10.48550/arXiv.2309.17453 , abstract =. 2309.17453 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.17453
-
[44]
and Fedorenko, Evelina , urldate =
Hosseini, Eghbal A. and Fedorenko, Evelina , urldate =. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language , url =. doi:10.1101/2023.11.05.564832 , abstract =
-
[45]
Which Attention Heads Matter for In-Context Learning? , url =
Yin, Kayo and Steinhardt, Jacob , urldate =. Which Attention Heads Matter for In-Context Learning? , url =. doi:10.48550/arXiv.2502.14010 , abstract =. 2502.14010 [cs] , keywords =
-
[46]
Value Residual Learning For Alleviating Attention Concentration In Transformers , url =
Zhou, Zhanchao and Wu, Tianyi and Jiang, Zhiyun and Lan, Zhenzhong , urldate =. Value Residual Learning For Alleviating Attention Concentration In Transformers , url =. doi:10.48550/arXiv.2410.17897 , abstract =. 2410.17897 [cs] , note =
-
[47]
A Mathematical Framework for Transformer Circuits , author =
-
[48]
A Mathematical Framework for Transformer Circuits , url =
-
[49]
GLU Variants Improve Transformer
Shazeer, Noam , urldate =. doi:10.48550/arXiv.2002.05202 , abstract =. 2002.05202 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2002.05202 2002
-
[50]
Root Mean Square Layer Normalization
Zhang, Biao and Sennrich, Rico , urldate =. Root Mean Square Layer Normalization , url =. doi:10.48550/arXiv.1910.07467 , abstract =. 1910.07467 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.07467 1910
-
[51]
LLaMA: Open and Efficient Foundation Language Models
Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timothée and Rozière, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume , urldate =. doi:10.48550/arXiv.2302.13971 , shorttitle =. 2302.13971 [cs] , ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.13971
-
[52]
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , urldate =. Attention Is All You Need , url =. doi:10.48550/arXiv.1706.03762 , abstract =. 1706.03762 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762
-
[53]
Deep Gaussian Processes: Advances in Models and Inference , author =
-
[54]
Efficient and Scalable Language Models for Training, Inference and Test-Time Compute , author =
-
[55]
Ge, Suyu and Zhang, Yunan and Liu, Liyuan and Zhang, Minjia and Han, Jiawei and Gao, Jianfeng , date =
-
[56]
A simple and effective \ l\_2\ norm-based strategy for KV cache compression
Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale , urldate =. A Simple and Effective \ L\_2\ Norm-Based Strategy for. doi:10.48550/arXiv.2406.11430 , abstract =. 2406.11430 [cs] , keywords =
-
[57]
Guo, Zhiyu and Kamigaito, Hidetaka and Watanabe, Taro , urldate =. Attention Score is not All You Need for Token Importance Indicator in. doi:10.48550/arXiv.2406.12335 , shorttitle =. 2406.12335 [cs] , keywords =
-
[58]
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning , url =
Team, Ling and Han, Bin and Tang, Caizhi and Liang, Chen and Zhang, Donghao and Yuan, Fan and Zhu, Feng and Gao, Jie and Hu, Jingyu and Li, Longfei and Li, Meng and Zhang, Mingyang and Jiang, Peijie and Jiao, Peng and Zhao, Qian and Yang, Qingyuan and Shen, Wenbo and Yang, Xinxing and Zhang, Yalin and Ren, Yankun and Zhao, Yao and Cao, Yibo and Sun, Yixua...
-
[59]
Gupta, Akshat and Yeung, Jay and Anumanchipalli, Gopala and Ivanova, Anna , urldate =. How Do. doi:10.48550/arXiv.2510.18871 , abstract =. 2510.18871 [cs] , note =
-
[60]
Your Transformer is Secretly Linear , url =
Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Gerasimenko, Nikolai and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. Your Transformer is Secretly Linear , url =. doi:10.48550/arXiv.2405.12250 , abstract =. 2405.12250 [cs] , keywords =
-
[61]
Zhao, Zheng and Ziser, Yftah and Cohen, Shay B , editor =. Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , publisher =. doi:10.18653/v1/2024.emnlp-main.847 , shorttitle =
-
[62]
The Thirteenth International Conference on Learning Representations , year=
The Unreasonable Ineffectiveness of the Deeper Layers , author=. The Thirteenth International Conference on Learning Representations , year=
-
[63]
When Attention Sink Emerges in Language Models: An Empirical View
Gu, Xiangming and Pang, Tianyu and Du, Chao and Liu, Qian and Zhang, Fengzhuo and Du, Cunxiao and Wang, Ye and Lin, Min , urldate =. When Attention Sink Emerges in Language Models: An Empirical View , url =. doi:10.48550/arXiv.2410.10781 , shorttitle =. 2410.10781 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.10781
-
[64]
Incorporating residual and normalization layers into analysis of masked language models, b
Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Incorporating Residual and Normalization Layers into Analysis of Masked Language Models , url =. doi:10.48550/arXiv.2109.07152 , abstract =. 2109.07152 [cs] , keywords =
-
[65]
Bick, Aviv and Xing, Eric and Gu, Albert , urldate =. Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism , url =. doi:10.48550/arXiv.2504.18574 , shorttitle =. 2504.18574 [cs] , keywords =
-
[66]
Nguyen, Thao and Raghu, Maithra and Kornblith, Simon , date =
-
[67]
Attention is not only a weight: Analyzing transformers with vector norms, a
Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms , url =. doi:10.48550/arXiv.2004.10102 , shorttitle =. 2004.10102 [cs] , keywords =
-
[68]
Pitorro, Hugo and Vasylenko, Pavlo and Treviso, Marcos and Martins, André F. T. , urldate =. How Effective are State Space Models for Machine Translation? , url =. doi:10.48550/arXiv.2407.05489 , abstract =. 2407.05489 [cs] , keywords =
-
[69]
doi:10.48550/arXiv.2502.15612 , shorttitle =
Pitorro, Hugo and Treviso, Marcos , urldate =. doi:10.48550/arXiv.2502.15612 , shorttitle =. 2502.15612 [cs] , keywords =
-
[70]
Quantifying Attention Flow in Transformers , url =
Abnar, Samira and Zuidema, Willem , urldate =. Quantifying Attention Flow in Transformers , url =. doi:10.48550/arXiv.2005.00928 , abstract =. 2005.00928 [cs] , keywords =
-
[71]
Ferrando, Javier and Gállego, Gerard I. and Costa-jussà, Marta R. , urldate =. Measuring the Mixing of Contextual Information in the Transformer , url =. doi:10.48550/arXiv.2203.04212 , abstract =. 2203.04212 [cs] , keywords =
-
[72]
Ferrando, Javier and Sarti, Gabriele and Bisazza, Arianna and Costa-jussà, Marta R. , urldate =. A Primer on the Inner Workings of Transformer-based Language Models , url =. doi:10.48550/arXiv.2405.00208 , abstract =. 2405.00208 [cs] , keywords =
-
[73]
The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =
Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =. doi:10.48550/arXiv.2311.05928 , shorttitle =. 2311.05928 [cs] , keywords =
-
[74]
Layer by Layer: Uncovering Hidden Representations in Language Models
Skean, Oscar and Arefin, Md Rifat and Zhao, Dan and Patel, Niket and Naghiyev, Jalal and. Layer by Layer: Uncovering Hidden Representations in Language Models , url =. doi:10.48550/arXiv.2502.02013 , shorttitle =. 2502.02013 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.02013
-
[75]
Dasgupta, Sayantan and Cohn, Trevor , date =
-
[76]
Similarity of Neural Network Representations Revisited
Kornblith, Simon and Norouzi, Mohammad and Lee, Honglak and Hinton, Geoffrey , urldate =. Similarity of Neural Network Representations Revisited , url =. doi:10.48550/arXiv.1905.00414 , abstract =. 1905.00414 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.00414 1905
-
[77]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Qiu, Zihan and Wang, Zekun and Zheng, Bo and Huang, Zeyu and Wen, Kaiyue and Yang, Songlin and Men, Rui and Yu, Le and Huang, Fei and Huang, Suozhi and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang , urldate =. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free , url =. doi:10.48550/arXiv.2505.06708 , shorttitle...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.06708
-
[78]
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Voita, Elena and Talbot, David and Moiseev, Fedor and Sennrich, Rico and Titov, Ivan , urldate =. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , url =. doi:10.48550/arXiv.1905.09418 , shorttitle =. 1905.09418 [cs] , keywords =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.09418 1905
-
[79]
Attention is not not Explanation , url =
Wiegreffe, Sarah and Pinter, Yuval , urldate =. Attention is not not Explanation , url =. doi:10.48550/arXiv.1908.04626 , abstract =. 1908.04626 [cs] , keywords =
-
[80]
Identifying and Evaluating Inactive Heads in Pretrained
Sandoval-Segura, Pedro and Wang, Xijun and Panda, Ashwinee and Goldblum, Micah and Basri, Ronen and Goldstein, Tom and Jacobs, David , urldate =. Identifying and Evaluating Inactive Heads in Pretrained. doi:10.48550/arXiv.2504.03889 , abstract =. 2504.03889 [cs] , keywords =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.