Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

Harry Jake Cunningham; Nicola Muca Cirone

arxiv: 2606.07604 · v1 · pith:KVY76GSCnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

Harry Jake Cunningham , Nicola Muca Cirone This is my paper

Pith reviewed 2026-06-28 23:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords contribution weightsattention sinksself-attentiontoken importancetransformersvalue vectorsmechanistic interpretability

0 comments

The pith

Contribution weights, which multiply attention by value magnitude and alignment with layer output, measure token importance more accurately than attention alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes contribution weights to fix a limitation in how attention is interpreted in transformers. Standard attention scores ignore the size and direction of the value vectors being averaged, so they can misrank which tokens actually drive the output. The new metric projects each attended value onto the final layer output direction and scales by attention and magnitude. Across decoder-only models and tasks, this ranking better matches semantically important tokens. The same geometric view shows attention sinks actively reduce output norm in a convex way, countering drift from uncertain tokens.

Core claim

Contribution weights are computed as the attention weight times the Euclidean norm of the value vector times the cosine of the angle between that value vector and the layer's output vector. These weights rank tokens by influence more faithfully than raw attention across models, tasks, and datasets. Attention sinks display a convex relationship between sink attention rate and output norm, actively suppressing information and stabilizing representations against semantic drift of low-confidence tokens.

What carries the argument

Contribution Weights: the projection of each value vector onto the layer output direction, scaled by its attention weight and magnitude.

If this is right

Token importance rankings become more reliable for mechanistic analysis of decoder-only transformers.
Attention sinks can be studied as active stabilizers rather than passive attention repositories.
Interpretability methods gain a geometric component that accounts for value vector properties.
Analysis of information flow can be applied consistently across different models and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The metric could guide targeted interventions such as editing or pruning by focusing on tokens with high projected contribution.
Similar projection ideas might apply to other attention variants or non-transformer architectures where aggregation occurs.
If the convex sink relationship holds, it points to possible architectural changes that deliberately tune sink behavior for stability.
The approach separates correlation from causation in attribution, which could be tested by comparing against gradient-based or intervention-based baselines.

Load-bearing premise

The projection onto the layer output direction, scaled by attention and magnitude, genuinely quantifies causal influence on the output rather than merely correlating with output geometry.

What would settle it

An ablation study in which tokens ranked highest by contribution weights are removed or perturbed and the output changes less than when tokens ranked highest by attention weights are removed.

Figures

Figures reproduced from arXiv: 2606.07604 by Harry Jake Cunningham, Nicola Muca Cirone.

**Figure 1.** Figure 1: Contribution weights. (Top) Attention matrices and (Bottom) corresponding contribution weight matrices, at two randomly selected heads and layers, computed using LLAMA-3.1- 8B. Whilst attention weights place a large amount of attention mass on the first token, examining the contribution weights reveals that these tokens contribute very little to the output. Bastings & Filippova; Lopardo et al.), arguing t… view at source ↗

**Figure 2.** Figure 2: Contribution weight decomposition. We visualise the multiplicative components of the inter-head contribution weights cˆij : (i) the attention weights αij , (ii) the relative norm ∥v˜ (l,h) j ∥/∥o (l) i ∥, and (iii) the cosine similarity cos(o (l) i , v˜ (l,h) j ). All quantities are computed on the same input for a randomly selected layer and head of LLaMA-3.1-8B. 1. Attention weights αij : determines the… view at source ↗

**Figure 3.** Figure 3: Contribution weights most reliably predict token importance. Change in perplexity (∆PPL) on GSM8K under increasing token-removal rates, comparing four contribution measures: (1) contribution weights, (2) value-weighted output-normalised attention (VW-ON), (3) value-weighted attention, and (4) raw attention weights. Top: Removal of top percentiles. Bottom: Removal of bottom percentiles. Contribution weights… view at source ↗

**Figure 4.** Figure 4: Contribution weights maintain downstream performance under high token-removal rates. Change in accuracy on MMLU, BoolQ, Lambada and Winogrande under increasing token-removal rates, comparing four contribution measures: (1) contribution weights, (2) value-weighted output-normalised attention (VW-ON), (3) value-weighted attention, and (4) raw attention weights. Contribution weights maintain baseline performa… view at source ↗

**Figure 5.** Figure 5: Value vector geometry. (a) Average alignment (cosine similarity) between the initial ([BOS]/sink) token’s value vector and the tail tokens’ value vectors as a function of relative depth. (b) Sink rate αj,1 (x-axis) against output norm ∥oj∥ (y-axis) for all tokens in 32 batches, for a fixed (l, h) choice in LLAMA-3.1-8B. We plot in red the prediction obtained via our model outlined in Equation (20), and col… view at source ↗

**Figure 6.** Figure 6: Sinks oppose small magnitude attention weights. Head outputs o (l,h) i projected onto the semantic direction µˆ and its orthogonal complement uˆ, coloured by output norm. The dashed cone contains 90% of non-sink values; the orange arrow shows the sink vector. (a) With the sink token present, low-norm heads (dark blue) lie outside the semantic cone, either orthogonal or negatively aligned with µˆ. (b) When … view at source ↗

**Figure 7.** Figure 7: Variation of attention weights, value norms and value alignment. Pairwise relationships between attention weights α (l,h) ij , projected value norms ∥v˜ (l,h) j ∥2 and alignment with the output direction cos(o (l) i , v˜ (l,h) j ) across all layers l and heads h of Llama-8B. The large dispersion across all three plots shows how these factors vary independently. Indeed, high attention weights do not guarant… view at source ↗

**Figure 8.** Figure 8: Intervention dataset ablation. Language modeling loss when removing tokens based on different importance metrics on FineWeb-Edu. Top row shows removal of high-importance tokens; bottom row shows removal of low-importance tokens. Contribution weights (blue) consistently outperform attention-based baselines (red, orange) across all models, demonstrating robustness to domain shifts from mathematical reasoning… view at source ↗

**Figure 9.** Figure 9: Computation graph ablation using global percentiles. Language modeling loss when importance scores are computed using global percentiles across all layers rather than layer-specific thresholds. While this introduces instability in models with multiple sinks (Qwen3-8B, DeepSeek-7B), contribution weights (blue) maintain their advantage over attention-based methods, though the gap narrows compared to layer-sp… view at source ↗

**Figure 10.** Figure 10: Sink token ablation. Language modeling loss when attention sinks are permitted to be removed. Top row (high-importance removal): attention-based metrics (red) cause immediate collapse by removing structural sinks, while contribution weights (blue) prioritize semantic content. Bottom row (low-importance removal): Qwen3-8B shows early degradation due to multiple sinks being classified as low-importance and … view at source ↗

**Figure 11.** Figure 11: Left: Linear fit of ||w (l,h) i || against α (l,h) i,1 . Right: Linear fit of ⟨v˜ (l,h) 1 , w (l,h) i ⟩ against α (l,h) i,1 . Top: Aggregate R 2 statistics. Bottom: Fit for a fixed index (l, h), each point corresponds to a token coloured according to its relative sequence position. implies that the minimum norm is not attained at α = 1 as a simplistic understanding of sink token as suppressor would lead t… view at source ↗

**Figure 12.** Figure 12: Contribution weight matrices. Attention and contribution weights computed from LLaMA-3.1-8B evaluated on FineWeb-Edu. Rows alternate between attention weights and contribution weights. All attention weights and contribution weights are plotted with the same colourbar axis from [0, 1] and [−0.01, 0.15] respectively. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Contribution weight matrices. Attention and contribution weights computed from Qwen2.5-7B evaluated on FineWeb-Edu. Rows alternate between attention weights and contribution weights. All attention weights and contribution weights are plotted with the same colourbar axis from [0, 1] and [−0.01, 0.15] respectively. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Contribution weight matrices. Attention and contribution weights computed from Qwen3-8B evaluated on FineWeb-Edu. Rows alternate between attention weights and contribution weights. All attention weights and contribution weights are plotted with the same colourbar axis from [0, 1] and [−0.01, 0.15] respectively. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

read the original abstract

Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations as it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce \emph{Contribution Weights}, a projection-based metric that quantifies a token's influence by accounting for it's attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different decoder-only models, tasks, and datasets. Further, our metric enables novel mechanistic analysis of \emph{attention sinks}. While previous work characterized sinks as passive repositories for excess attention, we reveal they serve an active functional role, suppressing information through a convex relationship between sink rate and output norm, stabilizing representations by opposing the semantic drift of low-confidence tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The contribution metric is the output vector's linear decomposition by construction, so its claimed superiority needs intervention tests that the abstract does not supply.

read the letter

The main thing to know is that contribution weights are defined as attention times value magnitude times cosine to the layer output vector. By basic linear algebra this quantity sums exactly to the output (up to scaling), which means any claim that the metric more faithfully ranks token importance has to be backed by causal checks rather than geometry alone.

What the paper actually introduces is this three-factor projection and the reframing of attention sinks as active convex suppressors that stabilize output norm against low-confidence drift. The geometric decomposition itself is a clean way to fold in value-vector size and direction, which attention weights ignore. If the experiments hold up with multiple decoder models, clear baselines, and ablation results showing better semantic-token identification, that would give interpretability work a practical new scoring tool.

The soft spot is the one the stress-test flags. The abstract asserts consistent outperformance and the convex sink relationship but gives no methods, datasets, error bars, or intervention results. Without those, it is impossible to separate definitional correlation from genuine causal insight. The circularity risk is real: the metric is built from the output it is supposed to explain.

This is for people already working on attention analysis and token attribution in LLMs. A reader looking for new geometric angles on existing mechanisms would find usable ideas here, even if the advance is incremental.

I would send it to peer review. The framing is straightforward and the sink analysis is a shift worth testing, provided the full paper supplies the missing controls and statistics.

Referee Report

2 major / 1 minor

Summary. The paper introduces Contribution Weights, a projection-based metric for token importance in decoder-only transformers defined as the product of attention weight, value-vector magnitude, and cosine alignment with the layer output vector. It claims this metric more faithfully identifies semantically critical tokens than attention weights alone, with consistent outperformance across models, tasks, and datasets, and further uses the metric to argue that attention sinks play an active stabilizing role via a convex relationship between sink rate and output norm that opposes semantic drift.

Significance. If the outperformance claims are supported by controlled intervention experiments that separate definitional correlation from causal predictive power, the work would supply a geometrically motivated importance measure that could improve mechanistic interpretability of self-attention. The reframing of attention sinks as functional stabilizers rather than passive repositories would also be a useful contribution to the literature on LLM representation dynamics.

major comments (2)

[Abstract and §3 (metric definition)] The contribution weight is defined (see abstract and §3) as attn_i ⋅ ||v_i|| ⋅ cos∠(v_i, o) where o is the layer output; by the definition of the dot product this quantity is exactly the scalar projection of each attended value onto o, so the sum of contribution weights reconstructs ||o|| (up to scaling). Any claim that the metric 'more faithfully' ranks semantically critical tokens therefore requires explicit causal evidence (e.g., ablation or patching experiments showing superior prediction of downstream loss or semantic change) rather than geometric correlation by construction.
[Abstract] The abstract asserts 'consistent outperformance' and a 'convex relationship between sink rate and output norm' but supplies no description of datasets, baselines, statistical tests, error bars, or controls that would allow verification that the reported superiority is not an artifact of the metric's definitional alignment with the output geometry.

minor comments (1)

[§3] Notation for the output vector o and the angle should be introduced with an explicit equation in the main text rather than only in the abstract.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and precise comments, which highlight important distinctions between geometric definition and empirical validation. We address each major comment below.

read point-by-point responses

Referee: [Abstract and §3 (metric definition)] The contribution weight is defined (see abstract and §3) as attn_i ⋅ ||v_i|| ⋅ cos∠(v_i, o) where o is the layer output; by the definition of the dot product this quantity is exactly the scalar projection of each attended value onto o, so the sum of contribution weights reconstructs ||o|| (up to scaling). Any claim that the metric 'more faithfully' ranks semantically critical tokens therefore requires explicit causal evidence (e.g., ablation or patching experiments showing superior prediction of downstream loss or semantic change) rather than geometric correlation by construction.

Authors: We agree that the defined contribution weight is mathematically identical to the normalized scalar projection of each value vector onto the output direction. This equivalence is intentional and forms the core geometric motivation of the metric: it isolates each token's additive contribution to ||o|| along the output axis. The manuscript demonstrates outperformance by showing that contribution-weight rankings align more closely with independently measured semantic importance (via token ablation effects on next-token distributions and downstream task accuracy) than attention weights alone, across multiple models and datasets. We acknowledge that these comparisons are correlational rather than full causal interventions such as activation patching. We will revise §3 to explicitly state the projection identity and add a limitations paragraph clarifying the current evidence level, while noting that the geometric construction itself provides a stronger a priori reason to expect improved faithfulness than attention alone. revision: partial
Referee: [Abstract] The abstract asserts 'consistent outperformance' and a 'convex relationship between sink rate and output norm' but supplies no description of datasets, baselines, statistical tests, error bars, or controls that would allow verification that the reported superiority is not an artifact of the metric's definitional alignment with the output geometry.

Authors: The abstract is intentionally concise, but the full manuscript details the experimental protocol: evaluations on five decoder-only models (including Llama-2/3 and Mistral variants), multiple tasks (next-token prediction and classification), and datasets drawn from standard benchmarks plus custom semantic perturbation sets. Baselines include raw attention, value magnitude, and random rankings; results report means and standard deviations over five random seeds with statistical significance tests. We will revise the abstract to include a short clause such as 'across five models and three benchmarks with statistical controls' to enable immediate verification while respecting length constraints. revision: yes

standing simulated objections not resolved

The current manuscript does not contain activation-patching or full causal-intervention experiments that would isolate the metric's predictive power from its definitional alignment with the output geometry.

Circularity Check

0 steps flagged

No significant circularity; metric definition is explicit and claims rest on empirical comparisons

full rationale

The paper introduces Contribution Weights as a new metric explicitly constructed from attention weights, value magnitudes, and alignment with the layer output vector. Its central claims are empirical: consistent outperformance versus attention-based metrics on token identification tasks across models/datasets, plus mechanistic observations on attention sinks. No derivation chain, first-principles prediction, or result is shown that reduces by construction to its inputs (no equations presented as derived that equal the defining inputs). The metric is defined to quantify geometric contribution to the output by design; downstream empirical tests compare it to baselines on external tasks rather than relying on definitional tautology. Self-contained against external benchmarks via direct comparisons; no load-bearing self-citation or ansatz smuggling identified from the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the unstated premise that projection onto the output direction measures influence, plus the empirical assumption that the reported convex relationship is robust across models. No free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Projection of value vectors onto the layer output direction, scaled by attention weight and magnitude, quantifies a token's true influence on the output.
This premise is required for the claim that Contribution Weights are more faithful; it is invoked when the metric is introduced and when outperformance is asserted.

pith-pipeline@v0.9.1-grok · 5683 in / 1355 out tokens · 20061 ms · 2026-06-28T23:27:04.010467+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

176 extracted references · 143 canonical work pages · 49 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[3]

M. J. Kearns , title =
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[6]

Suppressed for Anonymity , author=
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[9]

Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =

Lopardo, Gianluigi and Precioso, Frederic and Garreau, Damien , langid =. Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =
[10]

and Filippova, K

Bastings, Jasmijn and Filippova, Katja , urldate =. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? , url =. doi:10.48550/arXiv.2010.05607 , shorttitle =. 2010.05607 [cs] , keywords =

work page doi:10.48550/arxiv.2010.05607 2010
[11]

How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =

Mickus, Timothee and Paperno, Denis and Constant, Mathieu , urldate =. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =. doi:10.1162/tacl_a_00501 , shorttitle =

work page doi:10.1162/tacl_a_00501
[12]

Vision Transformers Need Registers

Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr , urldate =. Vision Transformers Need Registers , url =. doi:10.48550/arXiv.2309.16588 , abstract =. 2309.16588 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.16588
[13]

Quantizable transformers: Removing outliers by helping attention heads do nothing

Bondarenko, Yelysei and Nagel, Markus and Blankevoort, Tijmen , urldate =. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing , url =. doi:10.48550/arXiv.2306.12929 , shorttitle =. 2306.12929 [cs] , keywords =

work page doi:10.48550/arxiv.2306.12929
[14]

OrthoRank : Token selection via sink token orthogonality for efficient LLM inference

Shin, Seungjun and Oh, Jaehoon and Oh, Dokwan , urldate =. doi:10.48550/arXiv.2507.03865 , shorttitle =. 2507.03865 [cs] , keywords =

work page doi:10.48550/arxiv.2507.03865
[15]

Zhang, Stephen and Khan, Mustafa and Papyan, Vardan , langid =
[16]

Massive Activations in Large Language Models

Sun, Mingjie and Chen, Xinlei and Kolter, J. Zico and Liu, Zhuang , urldate =. Massive Activations in Large Language Models , url =. doi:10.48550/arXiv.2402.17762 , abstract =. 2402.17762 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.17762
[17]

Spectral filters, dark signals, and attention sinks

Cancedda, Nicola , urldate =. Spectral Filters, Dark Signals, and Attention Sinks , url =. doi:10.48550/arXiv.2402.09221 , abstract =. 2402.09221 [cs] , note =

work page doi:10.48550/arxiv.2402.09221
[18]

House of Cards: Massive Weights in

Oh, Jaehoon and Shin, Seungjun and Oh, Dokwan , urldate =. House of Cards: Massive Weights in. doi:10.48550/arXiv.2410.01866 , shorttitle =. 2410.01866 [cs] , keywords =

work page doi:10.48550/arxiv.2410.01866
[19]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Yu, Zhongzhi and Wang, Zheng and Fu, Yonggan and Shi, Huihong and Shaikh, Khalid and Lin, Yingyan Celine , urldate =. Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration , url =. doi:10.48550/arXiv.2406.15765 , shorttitle =. 2406.15765 [cs] , note =

work page doi:10.48550/arxiv.2406.15765
[20]

doi:10.48550/arXiv.2506.15545 , shorttitle =

Wang, Bailin and Lan, Chang and Wang, Chong and Pang, Ruoming , urldate =. doi:10.48550/arXiv.2506.15545 , shorttitle =. 2506.15545 [cs] , keywords =

work page doi:10.48550/arxiv.2506.15545
[21]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth,. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , url =. doi:10.48550/arXiv.2304.01373 , shorttitle =. 2304.01373 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.01373
[22]

Qwen2.5 Technical Report

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115
[23]

Differential transformer.arXiv preprint arXiv:2410.05258, 2024

Ye, Tianzhu and Dong, Li and Xia, Yuqing and Sun, Yutao and Zhu, Yi and Huang, Gao and Wei, Furu , urldate =. Differential Transformer , url =. doi:10.48550/arXiv.2410.05258 , abstract =. 2410.05258 [cs] , keywords =

work page doi:10.48550/arxiv.2410.05258
[24]

Training Verifiers to Solve Math Word Problems

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , urldate =. Training Verifiers to Solve Math Word Problems , url =. doi:10.48550/arXiv.2110.14168 , abstract =. 2110.14168 [cs]...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2110.14168
[25]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1905.10044 , shorttitle =. 1905.10044 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.10044 1905
[26]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , urldate =. doi:10.48550/arXiv.1907.10641 , shorttitle =. 1907.10641 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1907.10641 1907
[27]

Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , date =
[28]

From memorization to reasoning in the spectrum of loss curvature

Merullo, Jack and Vatsavaya, Srihita and Bushnaq, Lucius and Lewis, Owen , urldate =. From Memorization to Reasoning in the Spectrum of Loss Curvature , url =. doi:10.48550/arXiv.2510.24256 , abstract =. 2510.24256 [cs] , keywords =

work page doi:10.48550/arxiv.2510.24256
[29]

Nested Learning: The Illusion of Deep Learning Architectures , abstract =

Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peiling and Mirrokni, Vahab , langid =. Nested Learning: The Illusion of Deep Learning Architectures , abstract =
[30]

Attention is not Explanation

Jain, Sarthak and Wallace, Byron C. , urldate =. Attention is not Explanation , url =. doi:10.48550/arXiv.1902.10186 , abstract =. 1902.10186 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1902.10186 1902
[31]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2005
[32]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1810.04805 , shorttitle =. 1810.04805 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805
[33]

Shared Global and Local Geometry of Language Model Em- beddings , abstract =

Lee, Andrew and Weber, Melanie and Viegas, Fernanda and Wattenberg, Martin , date =. Shared Global and Local Geometry of Language Model Em- beddings , abstract =
[34]

Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =

He, Zhengfu and Wang, Junxuan and Lin, Rui and Ge, Xuyang and Shu, Wentao and Tang, Qiong and Zhang, Junping and Qiu, Xipeng , urldate =. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =. doi:10.48550/arXiv.2504.20938 , abstract =. 2504.20938 [cs] , keywords =

work page doi:10.48550/arxiv.2504.20938
[35]

Barbero, Federico and Arroyo, Alvaro and Gu, Xiangming and Perivolaropoulos, Christos and Bronstein, Michael and Veličković, Petar and Pascanu, Razvan , urldate =. Why do. doi:10.48550/arXiv.2504.02732 , abstract =. 2504.02732 [cs] , keywords =

work page doi:10.48550/arxiv.2504.02732
[36]

Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J Zico , date =. A
[37]

Q-Filters: Leveraging

Godey, Nathan and Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale and Clergerie, Éric de la and Sagot, Benoît , urldate =. Q-Filters: Leveraging. doi:10.48550/arXiv.2503.02812 , shorttitle =. 2503.02812 [cs] , keywords =

work page doi:10.48550/arxiv.2503.02812
[38]

doi:10.48550/arXiv.2410.12850 , shorttitle =

Yan, Ruiqing and Zheng, Linghan and Du, Xingbo and Zou, Han and Guo, Yufeng and Yang, Jianfei , urldate =. doi:10.48550/arXiv.2410.12850 , shorttitle =. 2410.12850 [cs] , keywords =

work page doi:10.48550/arxiv.2410.12850
[39]

Measuring Massive Multitask Language Understanding

Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , urldate =. Measuring Massive Multitask Language Understanding , url =. doi:10.48550/arXiv.2009.03300 , abstract =. 2009.03300 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.03300 2009
[40]

Penedo, Guilherme and Kydlíček, Hynek and allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Werra, Leandro Von and Wolf, Thomas , urldate =. The. doi:10.48550/arXiv.2406.17557 , shorttitle =. 2406.17557 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.17557
[41]

Attention Layers Add Into Low-Dimensional Residual Subspaces , url =

Wang, Junxuan and Ge, Xuyang and Shu, Wentao and He, Zhengfu and Qiu, Xipeng , urldate =. Attention Layers Add Into Low-Dimensional Residual Subspaces , url =. doi:10.48550/arXiv.2508.16929 , abstract =. 2508.16929 [cs] , note =

work page doi:10.48550/arxiv.2508.16929
[42]

Goel, Karan , langid =
[43]

Efficient Streaming Language Models with Attention Sinks

Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike , urldate =. Efficient Streaming Language Models with Attention Sinks , url =. doi:10.48550/arXiv.2309.17453 , abstract =. 2309.17453 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.17453
[44]

and Fedorenko, Evelina , urldate =

Hosseini, Eghbal A. and Fedorenko, Evelina , urldate =. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language , url =. doi:10.1101/2023.11.05.564832 , abstract =

work page doi:10.1101/2023.11.05.564832 2023
[45]

Which Attention Heads Matter for In-Context Learning? , url =

Yin, Kayo and Steinhardt, Jacob , urldate =. Which Attention Heads Matter for In-Context Learning? , url =. doi:10.48550/arXiv.2502.14010 , abstract =. 2502.14010 [cs] , keywords =

work page doi:10.48550/arxiv.2502.14010
[46]

Value Residual Learning For Alleviating Attention Concentration In Transformers , url =

Zhou, Zhanchao and Wu, Tianyi and Jiang, Zhiyun and Lan, Zhenzhong , urldate =. Value Residual Learning For Alleviating Attention Concentration In Transformers , url =. doi:10.48550/arXiv.2410.17897 , abstract =. 2410.17897 [cs] , note =

work page doi:10.48550/arxiv.2410.17897
[47]

A Mathematical Framework for Transformer Circuits , author =
[48]

A Mathematical Framework for Transformer Circuits , url =
[49]

GLU Variants Improve Transformer

Shazeer, Noam , urldate =. doi:10.48550/arXiv.2002.05202 , abstract =. 2002.05202 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2002.05202 2002
[50]

Root Mean Square Layer Normalization

Zhang, Biao and Sennrich, Rico , urldate =. Root Mean Square Layer Normalization , url =. doi:10.48550/arXiv.1910.07467 , abstract =. 1910.07467 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.07467 1910
[51]

LLaMA: Open and Efficient Foundation Language Models

Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timothée and Rozière, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume , urldate =. doi:10.48550/arXiv.2302.13971 , shorttitle =. 2302.13971 [cs] , ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.13971
[52]

Attention Is All You Need

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , urldate =. Attention Is All You Need , url =. doi:10.48550/arXiv.1706.03762 , abstract =. 1706.03762 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762
[53]

Deep Gaussian Processes: Advances in Models and Inference , author =
[54]

Efficient and Scalable Language Models for Training, Inference and Test-Time Compute , author =
[55]

Ge, Suyu and Zhang, Yunan and Liu, Liyuan and Zhang, Minjia and Han, Jiawei and Gao, Jianfeng , date =
[56]

A simple and effective \ l\_2\ norm-based strategy for KV cache compression

Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale , urldate =. A Simple and Effective \ L\_2\ Norm-Based Strategy for. doi:10.48550/arXiv.2406.11430 , abstract =. 2406.11430 [cs] , keywords =

work page doi:10.48550/arxiv.2406.11430
[57]

Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters, b

Guo, Zhiyu and Kamigaito, Hidetaka and Watanabe, Taro , urldate =. Attention Score is not All You Need for Token Importance Indicator in. doi:10.48550/arXiv.2406.12335 , shorttitle =. 2406.12335 [cs] , keywords =

work page doi:10.48550/arxiv.2406.12335
[58]

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning , url =

Team, Ling and Han, Bin and Tang, Caizhi and Liang, Chen and Zhang, Donghao and Yuan, Fan and Zhu, Feng and Gao, Jie and Hu, Jingyu and Li, Longfei and Li, Meng and Zhang, Mingyang and Jiang, Peijie and Jiao, Peng and Zhao, Qian and Yang, Qingyuan and Shen, Wenbo and Yang, Xinxing and Zhang, Yalin and Ren, Yankun and Zhao, Yao and Cao, Yibo and Sun, Yixua...

work page doi:10.48550/arxiv.2510.19338
[59]

Gupta, Akshat and Yeung, Jay and Anumanchipalli, Gopala and Ivanova, Anna , urldate =. How Do. doi:10.48550/arXiv.2510.18871 , abstract =. 2510.18871 [cs] , note =

work page doi:10.48550/arxiv.2510.18871
[60]

Your Transformer is Secretly Linear , url =

Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Gerasimenko, Nikolai and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. Your Transformer is Secretly Linear , url =. doi:10.48550/arXiv.2405.12250 , abstract =. 2405.12250 [cs] , keywords =

work page doi:10.48550/arxiv.2405.12250
[61]

Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =

Zhao, Zheng and Ziser, Yftah and Cohen, Shay B , editor =. Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , publisher =. doi:10.18653/v1/2024.emnlp-main.847 , shorttitle =

work page doi:10.18653/v1/2024.emnlp-main.847 2024
[62]

The Thirteenth International Conference on Learning Representations , year=

The Unreasonable Ineffectiveness of the Deeper Layers , author=. The Thirteenth International Conference on Learning Representations , year=
[63]

When Attention Sink Emerges in Language Models: An Empirical View

Gu, Xiangming and Pang, Tianyu and Du, Chao and Liu, Qian and Zhang, Fengzhuo and Du, Cunxiao and Wang, Ye and Lin, Min , urldate =. When Attention Sink Emerges in Language Models: An Empirical View , url =. doi:10.48550/arXiv.2410.10781 , shorttitle =. 2410.10781 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.10781
[64]

Incorporating residual and normalization layers into analysis of masked language models, b

Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Incorporating Residual and Normalization Layers into Analysis of Masked Language Models , url =. doi:10.48550/arXiv.2109.07152 , abstract =. 2109.07152 [cs] , keywords =

work page doi:10.48550/arxiv.2109.07152
[65]

Understanding the skill gap in recurrent language models: The role of the gather-and-aggregate mechanism

Bick, Aviv and Xing, Eric and Gu, Albert , urldate =. Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism , url =. doi:10.48550/arXiv.2504.18574 , shorttitle =. 2504.18574 [cs] , keywords =

work page doi:10.48550/arxiv.2504.18574
[66]

Nguyen, Thao and Raghu, Maithra and Kornblith, Simon , date =
[67]

Attention is not only a weight: Analyzing transformers with vector norms, a

Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms , url =. doi:10.48550/arXiv.2004.10102 , shorttitle =. 2004.10102 [cs] , keywords =

work page doi:10.48550/arxiv.2004.10102 2004
[68]

Pitorro, Hugo and Vasylenko, Pavlo and Treviso, Marcos and Martins, André F. T. , urldate =. How Effective are State Space Models for Machine Translation? , url =. doi:10.48550/arXiv.2407.05489 , abstract =. 2407.05489 [cs] , keywords =

work page doi:10.48550/arxiv.2407.05489
[69]

doi:10.48550/arXiv.2502.15612 , shorttitle =

Pitorro, Hugo and Treviso, Marcos , urldate =. doi:10.48550/arXiv.2502.15612 , shorttitle =. 2502.15612 [cs] , keywords =

work page doi:10.48550/arxiv.2502.15612
[70]

Quantifying Attention Flow in Transformers , url =

Abnar, Samira and Zuidema, Willem , urldate =. Quantifying Attention Flow in Transformers , url =. doi:10.48550/arXiv.2005.00928 , abstract =. 2005.00928 [cs] , keywords =

work page doi:10.48550/arxiv.2005.00928 2005
[71]

I., and Costa-jussà, M

Ferrando, Javier and Gállego, Gerard I. and Costa-jussà, Marta R. , urldate =. Measuring the Mixing of Contextual Information in the Transformer , url =. doi:10.48550/arXiv.2203.04212 , abstract =. 2203.04212 [cs] , keywords =

work page doi:10.48550/arxiv.2203.04212
[72]

, urldate =

Ferrando, Javier and Sarti, Gabriele and Bisazza, Arianna and Costa-jussà, Marta R. , urldate =. A Primer on the Inner Workings of Transformer-based Language Models , url =. doi:10.48550/arXiv.2405.00208 , abstract =. 2405.00208 [cs] , keywords =

work page doi:10.48550/arxiv.2405.00208
[73]

The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =

Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =. doi:10.48550/arXiv.2311.05928 , shorttitle =. 2311.05928 [cs] , keywords =

work page doi:10.48550/arxiv.2311.05928
[74]

Layer by Layer: Uncovering Hidden Representations in Language Models

Skean, Oscar and Arefin, Md Rifat and Zhao, Dan and Patel, Niket and Naghiyev, Jalal and. Layer by Layer: Uncovering Hidden Representations in Language Models , url =. doi:10.48550/arXiv.2502.02013 , shorttitle =. 2502.02013 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.02013
[75]

Dasgupta, Sayantan and Cohn, Trevor , date =
[76]

Similarity of Neural Network Representations Revisited

Kornblith, Simon and Norouzi, Mohammad and Lee, Honglak and Hinton, Geoffrey , urldate =. Similarity of Neural Network Representations Revisited , url =. doi:10.48550/arXiv.1905.00414 , abstract =. 1905.00414 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.00414 1905
[77]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Qiu, Zihan and Wang, Zekun and Zheng, Bo and Huang, Zeyu and Wen, Kaiyue and Yang, Songlin and Men, Rui and Yu, Le and Huang, Fei and Huang, Suozhi and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang , urldate =. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free , url =. doi:10.48550/arXiv.2505.06708 , shorttitle...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.06708
[78]

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Voita, Elena and Talbot, David and Moiseev, Fedor and Sennrich, Rico and Titov, Ivan , urldate =. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , url =. doi:10.48550/arXiv.1905.09418 , shorttitle =. 1905.09418 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.09418 1905
[79]

Attention is not not Explanation , url =

Wiegreffe, Sarah and Pinter, Yuval , urldate =. Attention is not not Explanation , url =. doi:10.48550/arXiv.1908.04626 , abstract =. 1908.04626 [cs] , keywords =

work page doi:10.48550/arxiv.1908.04626 1908
[80]

Identifying and Evaluating Inactive Heads in Pretrained

Sandoval-Segura, Pedro and Wang, Xijun and Panda, Ashwinee and Goldblum, Micah and Basri, Ronen and Goldstein, Tom and Jacobs, David , urldate =. Identifying and Evaluating Inactive Heads in Pretrained. doi:10.48550/arXiv.2504.03889 , abstract =. 2504.03889 [cs] , keywords =

work page doi:10.48550/arxiv.2504.03889

Showing first 80 references.

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[2] [2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[3] [3]

M. J. Kearns , title =

[4] [4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[5] [5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[6] [6]

Suppressed for Anonymity , author=

[7] [7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[8] [8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[9] [9]

Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =

Lopardo, Gianluigi and Precioso, Frederic and Garreau, Damien , langid =. Attention Meets Post-hoc Interpretability: A Mathematical Perspective , abstract =

[10] [10]

and Filippova, K

Bastings, Jasmijn and Filippova, Katja , urldate =. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? , url =. doi:10.48550/arXiv.2010.05607 , shorttitle =. 2010.05607 [cs] , keywords =

work page doi:10.48550/arxiv.2010.05607 2010

[11] [11]

How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =

Mickus, Timothee and Paperno, Denis and Constant, Mathieu , urldate =. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces , volume =. doi:10.1162/tacl_a_00501 , shorttitle =

work page doi:10.1162/tacl_a_00501

[12] [12]

Vision Transformers Need Registers

Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr , urldate =. Vision Transformers Need Registers , url =. doi:10.48550/arXiv.2309.16588 , abstract =. 2309.16588 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.16588

[13] [13]

Quantizable transformers: Removing outliers by helping attention heads do nothing

Bondarenko, Yelysei and Nagel, Markus and Blankevoort, Tijmen , urldate =. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing , url =. doi:10.48550/arXiv.2306.12929 , shorttitle =. 2306.12929 [cs] , keywords =

work page doi:10.48550/arxiv.2306.12929

[14] [14]

OrthoRank : Token selection via sink token orthogonality for efficient LLM inference

Shin, Seungjun and Oh, Jaehoon and Oh, Dokwan , urldate =. doi:10.48550/arXiv.2507.03865 , shorttitle =. 2507.03865 [cs] , keywords =

work page doi:10.48550/arxiv.2507.03865

[15] [15]

Zhang, Stephen and Khan, Mustafa and Papyan, Vardan , langid =

[16] [16]

Massive Activations in Large Language Models

Sun, Mingjie and Chen, Xinlei and Kolter, J. Zico and Liu, Zhuang , urldate =. Massive Activations in Large Language Models , url =. doi:10.48550/arXiv.2402.17762 , abstract =. 2402.17762 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.17762

[17] [17]

Spectral filters, dark signals, and attention sinks

Cancedda, Nicola , urldate =. Spectral Filters, Dark Signals, and Attention Sinks , url =. doi:10.48550/arXiv.2402.09221 , abstract =. 2402.09221 [cs] , note =

work page doi:10.48550/arxiv.2402.09221

[18] [18]

House of Cards: Massive Weights in

Oh, Jaehoon and Shin, Seungjun and Oh, Dokwan , urldate =. House of Cards: Massive Weights in. doi:10.48550/arXiv.2410.01866 , shorttitle =. 2410.01866 [cs] , keywords =

work page doi:10.48550/arxiv.2410.01866

[19] [19]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Yu, Zhongzhi and Wang, Zheng and Fu, Yonggan and Shi, Huihong and Shaikh, Khalid and Lin, Yingyan Celine , urldate =. Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration , url =. doi:10.48550/arXiv.2406.15765 , shorttitle =. 2406.15765 [cs] , note =

work page doi:10.48550/arxiv.2406.15765

[20] [20]

doi:10.48550/arXiv.2506.15545 , shorttitle =

Wang, Bailin and Lan, Chang and Wang, Chong and Pang, Ruoming , urldate =. doi:10.48550/arXiv.2506.15545 , shorttitle =. 2506.15545 [cs] , keywords =

work page doi:10.48550/arxiv.2506.15545

[21] [21]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth,. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , url =. doi:10.48550/arXiv.2304.01373 , shorttitle =. 2304.01373 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.01373

[22] [22]

Qwen2.5 Technical Report

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115

[23] [23]

Differential transformer.arXiv preprint arXiv:2410.05258, 2024

Ye, Tianzhu and Dong, Li and Xia, Yuqing and Sun, Yutao and Zhu, Yi and Huang, Gao and Wei, Furu , urldate =. Differential Transformer , url =. doi:10.48550/arXiv.2410.05258 , abstract =. 2410.05258 [cs] , keywords =

work page doi:10.48550/arxiv.2410.05258

[24] [24]

Training Verifiers to Solve Math Word Problems

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , urldate =. Training Verifiers to Solve Math Word Problems , url =. doi:10.48550/arXiv.2110.14168 , abstract =. 2110.14168 [cs]...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2110.14168

[25] [25]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1905.10044 , shorttitle =. 1905.10044 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.10044 1905

[26] [26]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , urldate =. doi:10.48550/arXiv.1907.10641 , shorttitle =. 1907.10641 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1907.10641 1907

[27] [27]

Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , date =

[28] [28]

From memorization to reasoning in the spectrum of loss curvature

Merullo, Jack and Vatsavaya, Srihita and Bushnaq, Lucius and Lewis, Owen , urldate =. From Memorization to Reasoning in the Spectrum of Loss Curvature , url =. doi:10.48550/arXiv.2510.24256 , abstract =. 2510.24256 [cs] , keywords =

work page doi:10.48550/arxiv.2510.24256

[29] [29]

Nested Learning: The Illusion of Deep Learning Architectures , abstract =

Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peiling and Mirrokni, Vahab , langid =. Nested Learning: The Illusion of Deep Learning Architectures , abstract =

[30] [30]

Attention is not Explanation

Jain, Sarthak and Wallace, Byron C. , urldate =. Attention is not Explanation , url =. doi:10.48550/arXiv.1902.10186 , abstract =. 1902.10186 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1902.10186 1902

[31] [31]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2005

[32] [32]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , urldate =. doi:10.48550/arXiv.1810.04805 , shorttitle =. 1810.04805 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805

[33] [33]

Shared Global and Local Geometry of Language Model Em- beddings , abstract =

Lee, Andrew and Weber, Melanie and Viegas, Fernanda and Wattenberg, Martin , date =. Shared Global and Local Geometry of Language Model Em- beddings , abstract =

[34] [34]

Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =

He, Zhengfu and Wang, Junxuan and Lin, Rui and Ge, Xuyang and Shu, Wentao and Tang, Qiong and Zhang, Junping and Qiu, Xipeng , urldate =. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition , url =. doi:10.48550/arXiv.2504.20938 , abstract =. 2504.20938 [cs] , keywords =

work page doi:10.48550/arxiv.2504.20938

[35] [35]

Barbero, Federico and Arroyo, Alvaro and Gu, Xiangming and Perivolaropoulos, Christos and Bronstein, Michael and Veličković, Petar and Pascanu, Razvan , urldate =. Why do. doi:10.48550/arXiv.2504.02732 , abstract =. 2504.02732 [cs] , keywords =

work page doi:10.48550/arxiv.2504.02732

[36] [36]

Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J Zico , date =. A

[37] [37]

Q-Filters: Leveraging

Godey, Nathan and Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale and Clergerie, Éric de la and Sagot, Benoît , urldate =. Q-Filters: Leveraging. doi:10.48550/arXiv.2503.02812 , shorttitle =. 2503.02812 [cs] , keywords =

work page doi:10.48550/arxiv.2503.02812

[38] [38]

doi:10.48550/arXiv.2410.12850 , shorttitle =

Yan, Ruiqing and Zheng, Linghan and Du, Xingbo and Zou, Han and Guo, Yufeng and Yang, Jianfei , urldate =. doi:10.48550/arXiv.2410.12850 , shorttitle =. 2410.12850 [cs] , keywords =

work page doi:10.48550/arxiv.2410.12850

[39] [39]

Measuring Massive Multitask Language Understanding

Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , urldate =. Measuring Massive Multitask Language Understanding , url =. doi:10.48550/arXiv.2009.03300 , abstract =. 2009.03300 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.03300 2009

[40] [40]

Penedo, Guilherme and Kydlíček, Hynek and allal, Loubna Ben and Lozhkov, Anton and Mitchell, Margaret and Raffel, Colin and Werra, Leandro Von and Wolf, Thomas , urldate =. The. doi:10.48550/arXiv.2406.17557 , shorttitle =. 2406.17557 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.17557

[41] [41]

Attention Layers Add Into Low-Dimensional Residual Subspaces , url =

Wang, Junxuan and Ge, Xuyang and Shu, Wentao and He, Zhengfu and Qiu, Xipeng , urldate =. Attention Layers Add Into Low-Dimensional Residual Subspaces , url =. doi:10.48550/arXiv.2508.16929 , abstract =. 2508.16929 [cs] , note =

work page doi:10.48550/arxiv.2508.16929

[42] [42]

Goel, Karan , langid =

[43] [43]

Efficient Streaming Language Models with Attention Sinks

Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike , urldate =. Efficient Streaming Language Models with Attention Sinks , url =. doi:10.48550/arXiv.2309.17453 , abstract =. 2309.17453 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.17453

[44] [44]

and Fedorenko, Evelina , urldate =

Hosseini, Eghbal A. and Fedorenko, Evelina , urldate =. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language , url =. doi:10.1101/2023.11.05.564832 , abstract =

work page doi:10.1101/2023.11.05.564832 2023

[45] [45]

Which Attention Heads Matter for In-Context Learning? , url =

Yin, Kayo and Steinhardt, Jacob , urldate =. Which Attention Heads Matter for In-Context Learning? , url =. doi:10.48550/arXiv.2502.14010 , abstract =. 2502.14010 [cs] , keywords =

work page doi:10.48550/arxiv.2502.14010

[46] [46]

Value Residual Learning For Alleviating Attention Concentration In Transformers , url =

Zhou, Zhanchao and Wu, Tianyi and Jiang, Zhiyun and Lan, Zhenzhong , urldate =. Value Residual Learning For Alleviating Attention Concentration In Transformers , url =. doi:10.48550/arXiv.2410.17897 , abstract =. 2410.17897 [cs] , note =

work page doi:10.48550/arxiv.2410.17897

[47] [47]

A Mathematical Framework for Transformer Circuits , author =

[48] [48]

A Mathematical Framework for Transformer Circuits , url =

[49] [49]

GLU Variants Improve Transformer

Shazeer, Noam , urldate =. doi:10.48550/arXiv.2002.05202 , abstract =. 2002.05202 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2002.05202 2002

[50] [50]

Root Mean Square Layer Normalization

Zhang, Biao and Sennrich, Rico , urldate =. Root Mean Square Layer Normalization , url =. doi:10.48550/arXiv.1910.07467 , abstract =. 1910.07467 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.07467 1910

[51] [51]

LLaMA: Open and Efficient Foundation Language Models

Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timothée and Rozière, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume , urldate =. doi:10.48550/arXiv.2302.13971 , shorttitle =. 2302.13971 [cs] , ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.13971

[52] [52]

Attention Is All You Need

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia , urldate =. Attention Is All You Need , url =. doi:10.48550/arXiv.1706.03762 , abstract =. 1706.03762 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762

[53] [53]

Deep Gaussian Processes: Advances in Models and Inference , author =

[54] [54]

Efficient and Scalable Language Models for Training, Inference and Test-Time Compute , author =

[55] [55]

Ge, Suyu and Zhang, Yunan and Liu, Liyuan and Zhang, Minjia and Han, Jiawei and Gao, Jianfeng , date =

[56] [56]

A simple and effective \ l\_2\ norm-based strategy for KV cache compression

Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale , urldate =. A Simple and Effective \ L\_2\ Norm-Based Strategy for. doi:10.48550/arXiv.2406.11430 , abstract =. 2406.11430 [cs] , keywords =

work page doi:10.48550/arxiv.2406.11430

[57] [57]

Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters, b

Guo, Zhiyu and Kamigaito, Hidetaka and Watanabe, Taro , urldate =. Attention Score is not All You Need for Token Importance Indicator in. doi:10.48550/arXiv.2406.12335 , shorttitle =. 2406.12335 [cs] , keywords =

work page doi:10.48550/arxiv.2406.12335

[58] [58]

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning , url =

Team, Ling and Han, Bin and Tang, Caizhi and Liang, Chen and Zhang, Donghao and Yuan, Fan and Zhu, Feng and Gao, Jie and Hu, Jingyu and Li, Longfei and Li, Meng and Zhang, Mingyang and Jiang, Peijie and Jiao, Peng and Zhao, Qian and Yang, Qingyuan and Shen, Wenbo and Yang, Xinxing and Zhang, Yalin and Ren, Yankun and Zhao, Yao and Cao, Yibo and Sun, Yixua...

work page doi:10.48550/arxiv.2510.19338

[59] [59]

Gupta, Akshat and Yeung, Jay and Anumanchipalli, Gopala and Ivanova, Anna , urldate =. How Do. doi:10.48550/arXiv.2510.18871 , abstract =. 2510.18871 [cs] , note =

work page doi:10.48550/arxiv.2510.18871

[60] [60]

Your Transformer is Secretly Linear , url =

Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Gerasimenko, Nikolai and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. Your Transformer is Secretly Linear , url =. doi:10.48550/arXiv.2405.12250 , abstract =. 2405.12250 [cs] , keywords =

work page doi:10.48550/arxiv.2405.12250

[61] [61]

Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =

Zhao, Zheng and Ziser, Yftah and Cohen, Shay B , editor =. Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models , url =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , publisher =. doi:10.18653/v1/2024.emnlp-main.847 , shorttitle =

work page doi:10.18653/v1/2024.emnlp-main.847 2024

[62] [62]

The Thirteenth International Conference on Learning Representations , year=

The Unreasonable Ineffectiveness of the Deeper Layers , author=. The Thirteenth International Conference on Learning Representations , year=

[63] [63]

When Attention Sink Emerges in Language Models: An Empirical View

Gu, Xiangming and Pang, Tianyu and Du, Chao and Liu, Qian and Zhang, Fengzhuo and Du, Cunxiao and Wang, Ye and Lin, Min , urldate =. When Attention Sink Emerges in Language Models: An Empirical View , url =. doi:10.48550/arXiv.2410.10781 , shorttitle =. 2410.10781 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.10781

[64] [64]

Incorporating residual and normalization layers into analysis of masked language models, b

Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Incorporating Residual and Normalization Layers into Analysis of Masked Language Models , url =. doi:10.48550/arXiv.2109.07152 , abstract =. 2109.07152 [cs] , keywords =

work page doi:10.48550/arxiv.2109.07152

[65] [65]

Understanding the skill gap in recurrent language models: The role of the gather-and-aggregate mechanism

Bick, Aviv and Xing, Eric and Gu, Albert , urldate =. Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism , url =. doi:10.48550/arXiv.2504.18574 , shorttitle =. 2504.18574 [cs] , keywords =

work page doi:10.48550/arxiv.2504.18574

[66] [66]

Nguyen, Thao and Raghu, Maithra and Kornblith, Simon , date =

[67] [67]

Attention is not only a weight: Analyzing transformers with vector norms, a

Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro , urldate =. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms , url =. doi:10.48550/arXiv.2004.10102 , shorttitle =. 2004.10102 [cs] , keywords =

work page doi:10.48550/arxiv.2004.10102 2004

[68] [68]

Pitorro, Hugo and Vasylenko, Pavlo and Treviso, Marcos and Martins, André F. T. , urldate =. How Effective are State Space Models for Machine Translation? , url =. doi:10.48550/arXiv.2407.05489 , abstract =. 2407.05489 [cs] , keywords =

work page doi:10.48550/arxiv.2407.05489

[69] [69]

doi:10.48550/arXiv.2502.15612 , shorttitle =

Pitorro, Hugo and Treviso, Marcos , urldate =. doi:10.48550/arXiv.2502.15612 , shorttitle =. 2502.15612 [cs] , keywords =

work page doi:10.48550/arxiv.2502.15612

[70] [70]

Quantifying Attention Flow in Transformers , url =

Abnar, Samira and Zuidema, Willem , urldate =. Quantifying Attention Flow in Transformers , url =. doi:10.48550/arXiv.2005.00928 , abstract =. 2005.00928 [cs] , keywords =

work page doi:10.48550/arxiv.2005.00928 2005

[71] [71]

I., and Costa-jussà, M

Ferrando, Javier and Gállego, Gerard I. and Costa-jussà, Marta R. , urldate =. Measuring the Mixing of Contextual Information in the Transformer , url =. doi:10.48550/arXiv.2203.04212 , abstract =. 2203.04212 [cs] , keywords =

work page doi:10.48550/arxiv.2203.04212

[72] [72]

, urldate =

Ferrando, Javier and Sarti, Gabriele and Bisazza, Arianna and Costa-jussà, Marta R. , urldate =. A Primer on the Inner Workings of Transformer-based Language Models , url =. doi:10.48550/arXiv.2405.00208 , abstract =. 2405.00208 [cs] , keywords =

work page doi:10.48550/arxiv.2405.00208

[73] [73]

The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =

Razzhigaev, Anton and Mikhalchuk, Matvey and Goncharova, Elizaveta and Oseledets, Ivan and Dimitrov, Denis and Kuznetsov, Andrey , urldate =. The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models , url =. doi:10.48550/arXiv.2311.05928 , shorttitle =. 2311.05928 [cs] , keywords =

work page doi:10.48550/arxiv.2311.05928

[74] [74]

Layer by Layer: Uncovering Hidden Representations in Language Models

Skean, Oscar and Arefin, Md Rifat and Zhao, Dan and Patel, Niket and Naghiyev, Jalal and. Layer by Layer: Uncovering Hidden Representations in Language Models , url =. doi:10.48550/arXiv.2502.02013 , shorttitle =. 2502.02013 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.02013

[75] [75]

Dasgupta, Sayantan and Cohn, Trevor , date =

[76] [76]

Similarity of Neural Network Representations Revisited

Kornblith, Simon and Norouzi, Mohammad and Lee, Honglak and Hinton, Geoffrey , urldate =. Similarity of Neural Network Representations Revisited , url =. doi:10.48550/arXiv.1905.00414 , abstract =. 1905.00414 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.00414 1905

[77] [77]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Qiu, Zihan and Wang, Zekun and Zheng, Bo and Huang, Zeyu and Wen, Kaiyue and Yang, Songlin and Men, Rui and Yu, Le and Huang, Fei and Huang, Suozhi and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang , urldate =. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free , url =. doi:10.48550/arXiv.2505.06708 , shorttitle...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.06708

[78] [78]

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Voita, Elena and Talbot, David and Moiseev, Fedor and Sennrich, Rico and Titov, Ivan , urldate =. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , url =. doi:10.48550/arXiv.1905.09418 , shorttitle =. 1905.09418 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1905.09418 1905

[79] [79]

Attention is not not Explanation , url =

Wiegreffe, Sarah and Pinter, Yuval , urldate =. Attention is not not Explanation , url =. doi:10.48550/arXiv.1908.04626 , abstract =. 1908.04626 [cs] , keywords =

work page doi:10.48550/arxiv.1908.04626 1908

[80] [80]

Identifying and Evaluating Inactive Heads in Pretrained

Sandoval-Segura, Pedro and Wang, Xijun and Panda, Ashwinee and Goldblum, Micah and Basri, Ronen and Goldstein, Tom and Jacobs, David , urldate =. Identifying and Evaluating Inactive Heads in Pretrained. doi:10.48550/arXiv.2504.03889 , abstract =. 2504.03889 [cs] , keywords =

work page doi:10.48550/arxiv.2504.03889