Your transformer is secretly linear

Razzhigaev, Anton, Mikhalchuk, Matvey, Goncharova, Elizaveta, Gerasimenko, Nikolai, Oseledets, Ivan, Dimitrov, Denis · 2024 · arXiv 2405.12250

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

cs.CL · 2026-03-03 · unverdicted · novelty 6.0

ACE-Merging estimates task input covariances from parameter differences to enable closed-form data-free merging that reduces interference and outperforms prior baselines on vision and language tasks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

cs.LG · 2026-04-22 · unverdicted · none · ref 7

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · none · ref 60

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

cs.CL · 2026-03-03 · unverdicted · none · ref 24
