pith. machine review for the scientific record.

arxiv: 2604.10098 · v1 · submitted 2026-04-11 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords attention sink · transformers · survey · attention mechanisms · interpretability · hallucinations · machine learning · model training

The pith

Attention sink, where transformers disproportionately attend to uninformative tokens, receives its first comprehensive survey organized by utilization, mechanistic interpretation, and strategic mitigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers often direct a large share of attention to a small set of specific but uninformative tokens, a behavior known as attention sink that affects training dynamics, inference, and problems such as hallucinations. The paper delivers the first survey on this phenomenon, grouping existing work into three dimensions: fundamental ways attention sink is used, mechanistic explanations of why it arises, and methods developed to reduce or control it. A reader would care because attention sink shapes how models process information and can limit reliability in current architectures. The survey clarifies concepts, traces research trends, and positions itself as a resource for managing the issue inside today's transformer designs while pointing toward future improvements.
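
To make the measurement concrete, here is a minimal sketch, written for this review rather than taken from the paper, of how one might quantify attention sink empirically; the model choice, input text, and layer-level averaging are illustrative assumptions. It loads a small HuggingFace causal language model and reports the average attention mass each layer places on the first token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM that returns attention maps works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

text = "Attention sink concentrates probability mass on the first token of the prompt."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one [batch, heads, query, key] tensor per layer.
# Average the attention that every query (excluding the trivially constrained first
# one) places on key position 0; a large value in deeper layers signals a sink.
for layer, attn in enumerate(out.attentions):
    sink_mass = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on token 0 = {sink_mass:.3f}")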

Core claim

The paper establishes that attention sink research forms a coherent landscape that can be organized along three axes—Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation—thereby consolidating scattered findings into a single reference that clarifies key concepts and outlines evolution and trends in the field.

What carries the argument

The three-dimensional organizing framework of Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation, which structures the consolidation of all cited attention sink studies.

If this is right

  • Practitioners gain a map for deciding when to harness attention sink in model design rather than treat it only as a flaw (a minimal cache-policy sketch that harnesses sink tokens follows this list).
  • Mechanistic accounts can be used to improve interpretability of transformer decisions.
  • Mitigation techniques can be applied to reduce hallucinations and stabilize inference.
  • The survey supplies a baseline for evaluating whether new transformer variants still exhibit attention sink.
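
The sketch referenced in the first bullet above: a StreamingLLM-style cache policy (Figure 13 in the paper) that keeps the key/value entries of a few initial sink tokens plus a sliding window of recent tokens and evicts everything in between. The function name and default sizes are illustrative assumptions, not the survey's code.

from typing import List

def streaming_kv_indices(seq_len: int, num_sink: int = 4, window: int = 1024) -> List[int]:
    """Return the key/value positions to retain for the next decoding step."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))                      # nothing needs to be evicted yet
    sink = list(range(num_sink))                         # always-preserved attention-sink tokens
    recent = list(range(seq_len - window, seq_len))      # sliding window of recent context
    return sink + recent

# Example: at step 5000 with 4 sink tokens and a 1024-token window,
# only 1028 of the 5000 cached positions are kept.
kept = streaming_kv_indices(5000)
print(len(kept), kept[:6], kept[-3:])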

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same three-axis structure could be applied to attention anomalies observed in non-transformer architectures.
  • Testing whether mitigation methods remain effective at larger scales would directly extend the survey's guidance.
  • The consolidated view may help identify which utilization patterns are worth preserving versus eliminating in next-generation models.

Load-bearing premise

The body of published attention sink research can be fully collected and accurately represented without major omissions or mischaracterizations of individual works.

What would settle it

An important uncited paper on attention sink mechanisms or mitigation whose findings contradict the survey's synthesis would show that the consolidation is incomplete.

Figures

Figures reproduced from arXiv: 2604.10098 by Chaofan Tao, Chao Zhang, Hengyuan Zhang, He Xiao, Hui Shen, Jing Xiong, Keyu Fan, Ngai Wong, Qingyao Yang, Rui Yang, Taiqiang Wu, Weihao Ye, Wei Wu, Yaxiu Liu, Yifan Zhang, Yuchen Xie, Yulei Qian, Yuxuan Sun, Zhongwei Wan, Zunhai Su.

Figure 1: Overview of the survey structure.
Figure 2: Organizational structure of our survey on AS in Transformers, covering AS across different models, fundamental utilization, mechanistic interpretation, strategic mitigation, and a summary of applications.
Figure 3: Cumulative publication count and temporal trends in AS research from 2023 to 2026. Early research focused on Fundamental Utilization of AS, followed by studies investigating Mechanistic Interpretation, and most recently, efforts targeting Strategic Mitigation to address AS and improve model robustness.
Figure 4: Architecture of the standard Transformer and an illustration of typical AS, where sink tokens exhibit exceptionally high attention scores.
Figure 5: AS in BERT. Each point corresponds to the average attention a particular BERT attention head puts toward a token type. Left: heads often attend to “special” tokens. Early heads attend to [CLS], middle heads attend to [SEP], and deep heads attend to periods and commas. Often more than half of a head’s total attention is to these tokens. Right: heads attend to [SEP] tokens even more when the current token is…
Figure 6: Visualization of average attention logits across Llama-2-7B. Two distinct structural patterns are observed: (i) the initial layers (layers 0 and 1) exhibit a "local" attention distribution, where attention is predominantly allocated to the most recent context; (ii) in subsequent deeper layers, the model demonstrates a consistent and pronounced concentration of attention toward the initial token across all…
Figure 7: Structural overview of a representative decoder-only LLM. Adapted from [28].
Figure 8: Decoder architecture of a MoE LLM. Adapted from [43].
Figure 9: Expert router score distributions for sink and non-sink tokens. Sink tokens receive particularly high scores in super experts, whereas non-sink tokens have more evenly distributed scores across all experts. Adapted from [43].
Figure 10: Visualization and characterization of Visual Attention Sinks in MLLMs. Semantically irrelevant visual tokens (red boxes) exhibit Massive Activations within specific dimensions of their hidden states; task-relevant visual tokens (blue boxes) maintain stable activation profiles without such numerical anomalies. This phenomenon mirrors the behavior of established text A…
Figure 11: A summary of outlier and AS analysis for ViT. (a) An input image. (b) Outliers in the output of layer 11. (c) Cumulative attention weight spent on every patch, showing that attention is concentrated on background patches. (d) Corresponding matrix of attention probabilities. (e) Average magnitude of values for outlier and non-outlier patches, indicating that patches with high attention scores have low valu…
Figure 12: Ablation studies on rolling diffusion window, mixed training strategy, and AS in Rolling Forcing. AS allows the model to preserve key-value states of initial frames as a global context anchor, thereby enhancing long-term global consistency in long-horizon streaming video generation tasks. Adapted from [131].
Figure 13: StreamingLLM retains the AS alongside recent tokens for stable attention computation. This approach enables efficient and stable performance on extended texts. Adapted from [24].
Figure 14: The three sparse attention patterns in MInference, with sink token protection incorporated. Adapted from [182].
Figure 15: Visualization of attention maps in the Llama-2-7B model. Streaming heads primarily focus on initial and recent tokens without emphasizing past contextual relevance. Adapted from [144].
Figure 16: Overview of Visual Attention Redistribution (VAR). (a) Image-centric heads are selected based on the visual non-sink ratio; heads whose ratio meets the threshold ρ are designated as image-centric heads. (b) VAR reallocates surplus attention from sink tokens to visual non-sink tokens: an attention budget Ω accumulates a fraction p of the attention scores from sink tokens, which is then distributed to visual non-sink…
Figure 17: Visualization of average attention logits comparing models pre-trained without (left) and with (right) a sink token. Both maps show the same layers and heads. (1) Without a sink token, models exhibit local attention in lower layers and increased attention to initial tokens in deeper layers. (2) With a sink token, clear attention is directed to it across all layers, effectively collecting…
Figure 18: Activation magnitudes in LLaMA2-7B before and after applying CushionCache. By inserting and tuning several prefix tokens that act as AS, CushionCache mitigates activation outliers in subsequent tokens, enabling effective activation quantization with coarse granularities. Adapted from [156].
Figure 19: Visualization of attention maps with and without register tokens. Without registers, attention maps are noisy and often focus on background patches; with registers, attention becomes cleaner and more focused on foreground objects, demonstrating that register tokens effectively absorb attention artifacts. Adapted from [126].
Figure 20: Illustration of AS in MLLM responses. The sink token exhibits a columnar high-attention pattern. Hallucinated responses are highlighted in indigo. Adapted from [109].
Figure 21: Schematic overview of backdoor attacks in LLM unlearning. (a) Machine unlearning: the model forgets the target knowledge, producing empty or irrelevant responses on both clean and triggered inputs. (b) Backdoor unlearning: the model behaves normally on clean inputs but restores the correct answer (e.g., “The Golden Snitch”) when the trigger appears. (c) AS indicate “where” to backdoor: because AS emerge o…
Figure 22: Visualization of self-attention patterns in BERT-base, showing attention probabilities (left), value magnitudes (middle), and their product (right) for attention head 3. Sink tokens such as [SEP] receive high attention but exhibit small value outputs, consistent with the no-op behavior predicted by the theory. Adapted from [29].
Figure 23: Analysis of sink token properties. (a) High cosine similarity of QK states. (b), (c), and (e) illustrate QKV states, showing that sink tokens exhibit significantly smaller value magnitudes. (f) Visualizes the attention output, demonstrating the minimal residual contribution of sink tokens. Adapted from [28].
Figure 24: (Left) Comparison of attention maps using Softmax versus Softpick and overall sink rate of the 340M models. (Right) Largest hidden-state activation per layer of the 340M models. Softpick significantly mitigates both AS and large activations. Adapted from [77].
Figure 25: Systematic outliers in LLaMA2-7B. Outliers are identified in four locations: activations (layer outputs and down-projection inputs), weights (down-projection matrices), and attention (attention weights). Adapted from [82].
Figure 26: The emergence of activation outliers from weight outliers. Adapted from [82].
Figure 27: The spread of attention outliers from activation outliers (AS). Activation outliers influence the self-attention mechanism. Adapted from [82].
Figure 28: Systematic outlier mechanism in the Qwen3-30B-A3B MoE LLM. Adapted from [43].
Figure 29: Cross-layer evolution of extreme activation outliers in LLaMA2-7B. Activation outliers and AS exhibit a systematic and stable interaction. Adapted from [28].
Figure 30: Value updates from AS tokens are essentially the same. Adapted from [98].
Figure 31: (a) Average cosine similarity of the aggregated sink-value contribution Σ_{i∈S} p_i^t v_i across all tokens for each head on LLaMA2-7B, showing values consistently close to one across different tokens. (b) Attention biases for several example heads, where Σ_{i∈S} p_i^t v_i remains nearly constant. Adapted from [28].
Figure 32: PCA visualization of positional vectors. After the first layer, only the initial tokens (e.g., positions 1–4) exhibit distinct positional vectors, whereas later tokens converge to similar representations. Adapted from [167].
Figure 33: Cosine similarity of normalized hidden states across layers. (a)–(b) The sink token maintains high similarity even between distant layers. (c)–(d) Another token shows similarity only between adjacent layers. The red boundary indicates layers after l_sink. These results highlight the static geometric nature of the sink token. Adapted from [70].
Figure 34: The presence of AS modulates information flow between tokens, making Transformer models more robust to perturbations in input prompts. The figure illustrates how a perturbation in the second token’s input representation (highlighted in red) propagates to other token embeddings throughout the model, both without (left) and with (right) a sink token (e.g., ⟨BOS⟩). The sink token diverts attention away from…
Figure 35: Evolution of attention patterns in Pythia 410M, highlighting representative heads at layers 0, 16, and 23. Early layers exhibit diffuse attention that facilitates broad information mixing; middle layers display sink patterns that restrict mixing; late layers show sharp positional patterns enabling selective refinement. Adapted from [41].
Figure 36: A schematic illustration of gated attention. Adapted from [29].
Figure 37: Gating position exploration and performance comparison. Left: investigated positions for applying gating operations. Middle: performance of 15B MoE models; gating after SDPA (G1) yields the best overall results, and gating after the Value layer (G2) also improves performance, particularly in perplexity. Right: training loss over 3.5T tokens for baseline vs. SDPA-gated 1.7B dense models; gating reduces final l…
Figure 38: AS mitigation with Gated Attention. Left: proportion of attention allocated to the initial token per layer; the baseline model devotes 46.7% of attention scores (averaged across layers) to the first token, which gating reduces to 4.8%. Right: average attention map weights per head; in layer 21, the baseline AS (83% on the first token) drops to 4% with gating. Adapted from [26].
Figure 39: Architecture of Value-State Gated Attention. Unlike vanilla attention or input-state gated attention, VGA introduces a value-state gating mechanism to modulate the attention output. Adapted from [44].
Figure 40: Top: mean attention map across all heads and layers of GPT2-Medium (baseline), where the first token dominates attention (red box), and mean hidden state across layers, where outlier activations emerge in specific feature dimensions (red box) and the first token position exhibits the most extreme outliers (red circle). Bottom: replacing canonical Softmax with Softmax-1 eliminates first-token dominance. The figure is adapted…
Figure 41: Attention maps of Softmax and Softpick. Using Softpick effectively eliminates AS. Adapted from [77].
Figure 42: Activation distribution in 1.4B models trained on 100B tokens under three optimization strategies: (a) Adam, (b) Muon, (c) OSP. Muon alone provides insufficient outlier mitigation; OSP eliminates outliers. Adapted from [42].
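
Several of the figures above (e.g., Figures 40 and 41) contrast standard Softmax with variants such as Softmax-1 and Softpick that relax the sum-to-one constraint. The following numerical sketch is our illustration rather than the paper's code; it shows the Softmax-1 idea, in which adding 1 to the denominator lets an attention head assign near-zero total weight instead of parking probability mass on a sink token.

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax1(x: np.ndarray) -> np.ndarray:
    # exp(x_i) / (1 + sum_j exp(x_j)); shift by max(x, 0) for numerical stability.
    m = max(x.max(), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

# A head that "wants" to do nothing: all logits strongly negative.
logits = np.array([-8.0, -9.0, -9.5, -10.0])
print(softmax(logits))   # still sums to 1, so mass piles onto the least-negative token
print(softmax1(logits))  # total mass on the order of exp(-8): the head can effectively opt out
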
Original abstract

As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affecting the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims to be the first survey on Attention Sink (AS) in Transformers. It organizes the literature around three dimensions—Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation—with the goals of clarifying concepts, tracing evolution and trends, and guiding future work on managing AS. An accompanying GitHub repository lists the surveyed papers.

Significance. If the coverage is accurate and reasonably complete, the survey would be a useful consolidation of research on a phenomenon that affects interpretability, training/inference dynamics, and hallucinations in Transformers. The GitHub paper list is a concrete strength that supports accessibility and reproducibility of the cited works.

major comments (1)
  1. [Abstract and Introduction] The claim of presenting a 'comprehensive survey' and 'definitive resource' is not supported by any description of literature search strategy, databases used, keywords, inclusion/exclusion criteria, date range, or number of papers reviewed. This information is load-bearing for evaluating the survey's completeness and potential selection bias.
minor comments (2)
  1. [Section 1 or 2] Ensure that each of the three taxonomy dimensions is explicitly defined early in the paper so that readers can verify how individual works are assigned without ambiguity.
  2. [Conclusion] The GitHub link is helpful; consider adding a brief note in the paper on how the list will be maintained or updated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the survey's potential utility. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract and Introduction] The claim of presenting a 'comprehensive survey' and 'definitive resource' is not supported by any description of literature search strategy, databases used, keywords, inclusion/exclusion criteria, date range, or number of papers reviewed. This information is load-bearing for evaluating the survey's completeness and potential selection bias.

    Authors: We agree that a transparent description of the literature search process is necessary to support claims of comprehensiveness and to enable assessment of scope and bias. The submitted manuscript did not include this detail. In the revised version, we will add a dedicated 'Literature Search Methodology' subsection (placed after the Introduction) that specifies: primary databases (arXiv, Google Scholar, ACL Anthology); search keywords ('attention sink', 'attention sink transformers', 'sink token', 'attention sink phenomenon'); date range (2023 onward, reflecting the emergence of the topic); inclusion criteria (works explicitly analyzing, utilizing, interpreting, or mitigating attention sink in Transformer architectures, including preprints and conference papers); exclusion criteria (non-English works, tangential mentions without substantive discussion); and total papers reviewed (approximately 50, as catalogued in the GitHub repository). This addition will provide the requested transparency while preserving the survey's structure and contributions. revision: yes
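
The search protocol promised in this rebuttal could be spot-checked against the public arXiv API. The sketch below is our illustration, not part of the authors' planned methodology section; the query string, date filter, and use of feedparser are assumptions.

import urllib.parse
import feedparser  # pip install feedparser

query = urllib.parse.quote('all:"attention sink"')
url = (
    "http://export.arxiv.org/api/query?"
    f"search_query={query}&start=0&max_results=100"
    "&sortBy=submittedDate&sortOrder=descending"
)

feed = feedparser.parse(url)
# Keep entries from 2023 onward, mirroring the date range stated in the rebuttal.
hits = [e for e in feed.entries if e.published >= "2023-01-01"]
for e in hits[:10]:
    print(e.published[:10], e.title.replace("\n", " "))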

Circularity Check

0 steps flagged

No significant circularity in survey structure

Full rationale

This is a literature survey paper that organizes existing external research on Attention Sink into three axes (Fundamental Utilization, Mechanistic Interpretation, Strategic Mitigation) without performing any new derivations, predictions, or parameter fits. The central claim of being the 'first survey' is a factual assertion about coverage of prior work, not a result derived from the paper's own inputs or self-citations. No equations, ansatzes, uniqueness theorems, or reductions appear in the provided text; all content rests on citations to independent papers. The taxonomy is presented as an organizing framework rather than a self-defining or fitted construct, leaving the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. It introduces no new free parameters, mathematical axioms, or invented entities. The contribution rests on synthesis of prior work rather than novel derivations or postulates.

pith-pipeline@v0.9.0 · 5573 in / 1008 out tokens · 32584 ms · 2026-05-10T16:12:47.490407+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  2. When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...

Reference graph

Works this paper leans on

199 extracted references · 78 canonical work pages · cited by 2 Pith papers · 24 internal anchors

  1. [1]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, 2017

  2. [2]

    A survey of transformers.AI open, 3:111–132, 2022

    Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers.AI open, 3:111–132, 2022

  3. [3]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Be- ichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2), 2023

  4. [4]

    A survey on vision transformer

    Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):87–110, 2022

  5. [5]

    A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

  6. [6]

    A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

    Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

  7. [7]

    Longcat-flash technical report

    Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025

  8. [8]

    Longcat-flash-omni technical report

    Meituan LongCat Team, Bairui Wang, Bin Xiao, Bo Zhang, Bolin Rong, Borun Chen, Chang Wan, Chao Zhang, Chen Huang, Chen Chen, et al. Longcat-flash-omni technical report. arXiv preprint arXiv:2511.00279, 2025

  9. [9]

    Introducing longcat-flash-thinking: A technical report

    Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chengcheng Han, Chenhui Yang, Chi Zhang, et al. Introducing longcat-flash-thinking: A technical report. arXiv preprint arXiv:2509.18883, 2025

  10. [10]

    Longcat-flash-thinking-2601 technical report

    Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, et al. Longcat-flash-thinking-2601 technical report. arXiv preprint arXiv:2601.16725, 2026

  11. [11]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  12. [12]

    Xstreamvggt: Extremely memory-efficient streaming vision geometry grounded transformer with kv cache compression.arXiv preprint arXiv:2601.01204, 2026

    Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, and Ngai Wong. Xstreamvggt: Extremely memory-efficient streaming vision geometry grounded transformer with kv cache compression.arXiv preprint arXiv:2601.01204, 2026

  13. [13]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. CoRR, abs/2509.13414, 2025

  14. [14]

    A survey on model compression for large language models.Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models.Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024

  15. [15]

    Efficient large language models: A survey.Transactions on Machine Learning Research, 2024

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey.Transactions on Machine Learning Research, 2024

  16. [16]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. InProceedings of the 38th International Conference on Neural Information Processing Systems, 2024

  17. [17]

    Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

    Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, et al. Locate, steer, and improve: A practical survey of actionable mechanistic interpretability in large language models.arXiv preprint arXiv:2601.14004, 2026

  18. [18]

    Efficient attention mechanisms for large language models: A survey.Visual Intelligence, 2025

    Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, and Jianyong Wang. Efficient attention mechanisms for large language models: A survey.Visual Intelligence, 2025

  19. [19]

    Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

    Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

  20. [20]

    Gated delta networks: Improving mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. InThe Thirteenth International Conference on Learning Representations, 2025

  21. [21]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

  22. [22]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  23. [23]

    SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

    Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, and Xunliang Cai. Snapmla: Efficient long-context mla decoding via hardware-aware fp8 quantized pipelining.arXiv preprint arXiv:2602.10718, 2026

  24. [24]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024

  25. [25]

    When attention sink emerges in language models: An empirical view

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. InThe Thirteenth International Conference on Learning Representations, 2025

  26. [26]

    Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  27. [27]

    Why do LLMs attend to the first token?

    Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do LLMs attend to the first token? In Second Conference on Language Modeling, 2025

  28. [28]

    Kvsink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms

    Zunhai Su and Kehong Yuan. Kvsink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms. InSecond Conference on Language Modeling, 2025

  29. [29]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. InThirty-seventh Conference on Neural Information Processing Systems, 2023

  30. [30]

    Don’t deceive me: Mitigating gaslighting through attention reallocation in lmms.arXiv preprint arXiv:2504.09456, 2025

    Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang. Don’t deceive me: Mitigating gaslighting through attention reallocation in lmms.arXiv preprint arXiv:2504.09456, 2025

  31. [31]

    Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms.International Journal of Computer Vision, 134(1):22, 2026

    Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, and Wanli Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms.International Journal of Computer Vision, 134(1):22, 2026

  32. [32]

    Vasparse: Towards efficient visual hallucination mitigation via visual-aware token sparsification

    Xianwei Zhuang, Zhihong Zhu, Yuxin Xie, Liming Liang, and Yuexian Zou. Vasparse: Towards efficient visual hallucination mitigation via visual-aware token sparsification. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4189–4199, 2025

  33. [33]

    Forgetting to forget: Attention sink as a gateway for backdooring LLM unlearning

    Bingqi Shang, Yiwei Chen, Yihua Zhang, Bingquan Shen, and Sijia Liu. Forgetting to forget: Attention sink as a gateway for backdooring llm unlearning. arXiv preprint arXiv:2510.17021, 2025

  34. [34]

    Interpreting the repeated token phenomenon in large language models

    Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, and Yossi Gandelsman. Interpreting the repeated token phenomenon in large language models. InForty-second International Conference on Machine Learning, 2025

  35. [35]

    Leveraging registers in vision transformers for robust adaptation

    Srikar Yellapragada, Kowshik Thopalli, Vivek Narayanaswamy, Wesam Sakla, Yang Liu, Yamen Mubarka, Dimitris Samaras, and Jayaraman J Thiagarajan. Leveraging registers in vision transformers for robust adaptation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  36. [36]

    Nosa: Native and offloadable sparse attention.arXiv preprint arXiv:2510.13602, 2025

    Yuxiang Huang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Nosa: Native and offloadable sparse attention.arXiv preprint arXiv:2510.13602, 2025

  37. [37]

    OBCache: Optimal brain KV cache pruning for efficient long-context LLM inference.arXiv preprint arXiv:2510.07651, 2025

    Yuzhe Gu, Xiyu Liang, Jiaojiao Zhao, and Enmao Diao. Obcache: Optimal brain kv cache pruning for efficient long-context llm inference.arXiv preprint arXiv:2510.07651, 2025

  38. [38]

    SALS: Sparse attention in latent space for KV cache compression

    Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, and Yidong Li. SALS: Sparse attention in latent space for KV cache compression. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  39. [39]

    OjaKV: Context-Aware Online Low-Rank KV Cache Compression

    Yuxuan Zhu, David H Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, and Pin-Yu Chen. Ojakv: Context-aware online low-rank kv cache compression with oja’s rule.arXiv preprint arXiv:2509.21623, 2025

  40. [40]

    Mitigating attention sinks and massive activations in audio-visual speech recognition with LLMs

    Umberto Cappellazzo, Stavros Petridis, Maja Pantic, et al. Mitigating attention sinks and massive activations in audio-visual speech recognition with llms. arXiv preprint arXiv:2510.22603, 2025

  41. [41]

    Attention sinks and compression valleys in llms are two sides of the same coin

    Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. Attention sinks and compression valleys in llms are two sides of the same coin. InThe Fourteenth International Conference on Learning Representations, 2026

  42. [42]

    Outlier-safe pre-training for robust 4-bit quantization of large language models

    Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, and Jaewoo Kang. Outlier-safe pre-training for robust 4-bit quantization of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12582–12600, 2025

  43. [43]

    Unveiling super experts in mixture-of-experts large language models

    Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, and Kehong Yuan. Unveiling super experts in mixture-of-experts large language models. InThe Fourteenth International Conference on Learning Representations, 2026

  44. [44]

    Value-state gated attention for mitigating extreme-token phenomena in transformers.arXiv preprint arXiv:2510.09017, 2025

    Rui Bu, Haofeng Zhong, Wenzheng Chen, and Yangyan Li. Value-state gated attention for mitigating extreme-token phenomena in transformers.arXiv preprint arXiv:2510.09017, 2025

  45. [45]

    Research and latest advancements, 2026

    Qwen AI. Research and latest advancements, 2026. Accessed: 2026-01-22

  46. [46]

    What are you sinking? a geometric approach on attention sink

    Valeria Ruscio, Umberto Nanni, and Fabrizio Silvestri. What are you sinking? a geometric approach on attention sink. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  47. [47]

    Ctr-sink: Attention sink for language models in click-through rate prediction.arXiv preprint arXiv:2508.03668, 2025

    Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Liang Zhang, Linjian Mo, et al. Ctr-sink: Attention sink for language models in click-through rate prediction.arXiv preprint arXiv:2508.03668, 2025

  48. [48]

    Does roberta perform better than bert in continual learning: An attention sink perspective

    Xueying Bai, Yifan Sun, and Niranjan Balasubramanian. Does roberta perform better than bert in continual learning: An attention sink perspective. InFirst Conference on Language Modeling, 2025

  49. [49]

    Outlier dimensions that disrupt transformers are driven by frequency

    Giovanni Puccetti, Anna Rogers, Aleksandr Drozd, and Felice Dell’Orletta. Outlier dimensions that disrupt transformers are driven by frequency. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1286–1304, 2022

  50. [50]

    Understanding and overcoming the challenges of efficient transformer quantization

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7947–7969, 2021

  51. [51]

    Bert busters: Outlier dimensions that disrupt transformers

    Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, and Anna Rumshisky. Bert busters: Outlier dimensions that disrupt transformers. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3392–3405, 2021

  52. [52]

    Positional artefacts propagate through masked language model embeddings

    Ziyang Luo, Artur Kulmizev, and Xiaoxi Mao. Positional artefacts propagate through masked language model embeddings. InProceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 5312–5327, 2021

  53. [53]

    What does bert look at? an analysis of bert’s attention

    Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert’s attention. InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 5312–5327, 2019

  54. [54]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780, 2026

  55. [55]

    Attention needs to focus: A unified perspective on attention allocation.arXiv preprint arXiv:2601.00919, 2026

    Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, and Xiangyu Zhao. Attention needs to focus: A unified perspective on attention allocation.arXiv preprint arXiv:2601.00919, 2026

  56. [56]

    On the existence and behaviour of secondary attention sinks.arXiv preprint arXiv:2512.22213, 2025

    Jeffrey TH Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, and Yiren Zhao. On the existence and behaviour of secondary attention sinks.arXiv preprint arXiv:2512.22213, 2025

  57. [57]

    Sliding window attention adaptation

    Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, and Ji Pei. Sliding window attention adaptation. arXiv preprint arXiv:2512.10411, 2025

  58. [58]

    Dope: Denoising rotary position embedding.arXiv preprint arXiv:2511.09146, 2025

    Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, and Ngai Wong. Dope: Denoising rotary position embedding.arXiv preprint arXiv:2511.09146, 2025

  59. [59]

    Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025

    Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, and Jianxin Wu. Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025

  60. [60]

    Lost in the middle: An emergent property from information retrieval demands in LLMs.arXiv preprint arXiv:2510.10276, 2025

    Nikolaus Salvatore, Hao Wang, and Qiong Zhang. Lost in the middle: An emergent property from information retrieval demands in llms.arXiv preprint arXiv:2510.10276, 2025

  61. [61]

    Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

    Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, and Carole-Jean Wu. Hybrid architectures for language models: Systematic analysis and design insights.arXiv preprint arXiv:2510.04800, 2025

  62. [62]

    Cacheclip: Accelerating rag with effective kv cache reuse.arXiv preprint arXiv:2510.10129, 2025

    Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. Cacheclip: Accelerating rag with effective kv cache reuse.arXiv preprint arXiv:2510.10129, 2025

  63. [63]

    Artificial hippocampus networks for efficient long-context modeling.arXiv preprint arXiv:2510.07318, 2025

    Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, and Lai Wei. Artificial hippocampus networks for efficient long-context modeling.arXiv preprint arXiv:2510.07318, 2025

  64. [64]

    vattention: Verified sparse attention via sampling

    Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E Gonzalez, and Ion Stoica. vattention: Verified sparse attention via sampling. InThe Fourteenth International Conference on Learning Representations, 2026

  65. [65]

    All for one: Llms solve mental math at the last token with information transferred from other tokens

    Siddarth Mamidanna, Daking Rai, Ziyu Yao, and Yilun Zhou. All for one: Llms solve mental math at the last token with information transferred from other tokens. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30735–30748, 2025

  66. [66]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  67. [67]

    Integral transformer: Denoising attention, not too much not too little

    Ivan Kobyzev, Abbas Ghaddar, Dingtao Hu, and Boxing Chen. Integral transformer: Denoising attention, not too much not too little. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2337–2354, 2025

  68. [68]

    H2eal: Hybrid-bonding architecture with hybrid sparse attention for efficient long-context llm inference

    Zizhuo Fu, Xiaotian Guo, Wenxuan Zeng, Shuzhang Zhong, Yadong Zhang, Peiyu Chen, Runsheng Wang, Le Ye, and Meng Li. H2eal: Hybrid-bonding architecture with hybrid sparse attention for efficient long-context llm inference. In 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pages 1–9. IEEE, 2025

  69. [69]

    TriangleMix: Accelerating prefilling via decoding-time contribution sparsity

    Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, and Lili Qiu. TriangleMix: Accelerating prefilling via decoding-time contribution sparsity. arXiv preprint arXiv:2507.21526, 2025

  70. [70]

    Orthorank: Token selection via sink token orthogonality for efficient llm inference

    Seungjun Shin, Jaehoon Oh, and Dokwan Oh. Orthorank: Token selection via sink token orthogonality for efficient llm inference. InForty-second International Conference on Machine Learning, 2025

  71. [71]

    Earn: Efficient inference acceleration for llm-based generative recommendation by register tokens

    Chaoqun Yang, Xinyu Lin, Wenjie Wang, Yongqi Li, Teng Sun, Xianjing Han, and Tat-Seng Chua. Earn: Efficient inference acceleration for llm-based generative recommendation by register tokens. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 3483–3494, 2025

  72. [72]

    Deltallm: A training-free framework exploiting temporal sparsity for efficient edge llm inference.arXiv preprint arXiv:2507.19608, 2025

    Jiawen Qi, Chang Gao, Zhaochun Ren, and Qinyu Chen. Deltallm: A training-free framework exploiting temporal sparsity for efficient edge llm inference.arXiv preprint arXiv:2507.19608, 2025

  73. [73]

    Two heads are better than one: simulating large transformers with small ones.arXiv preprint arXiv:2506.12220, 2025

    Hantao Yu and Josh Alman. Two heads are better than one: simulating large transformers with small ones.arXiv preprint arXiv:2506.12220, 2025

  74. [74]

    Learn from the past: Fast sparse indexing for large language model decoding.arXiv preprint arXiv:2506.15704, 2025

    Feiyu Yao and Qian Wang. Learn from the past: Fast sparse indexing for large language model decoding.arXiv preprint arXiv:2506.15704, 2025

  75. [75]

    Zerotuning: Unlocking the initial token’s power to enhance large language models without training

    Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, and Lyle Ungar. Zerotuning: Unlocking the initial token’s power to enhance large language models without training. InThe Fourteenth International Conference on Learning Representations, 2026

  76. [76]

    Delta attention: Fast and accurate sparse attention inference by delta correction.arXiv preprint arXiv:2505.11254, 2025

    Jeffrey Willette, Heejun Lee, and Sung Ju Hwang. Delta attention: Fast and accurate sparse attention inference by delta correction.arXiv preprint arXiv:2505.11254, 2025

  77. [77]

    Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

    Zayd MK Zuhri, Erland Hilman Fuadi, and Alham Fikri Aji. Softpick: No attention sink, no massive activations with rectified softmax.arXiv preprint arXiv:2504.20966, 2025

  78. [78]

    Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments

    Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Chris Lott. Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  79. [79]

    Edgeinfinite: A memory-efficient infinite-context transformer for edge devices

    Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, and Xiaoxin Chen. Edgeinfinite: A memory-efficient infinite-context transformer for edge devices. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 568–575, 2025

  80. [80]

    Efficient many-shot in-context learning with dynamic block-sparse attention

    Emily Xiao, Chin-Jou Li, Yilin Zhang, Graham Neubig, and Amanda Bertsch. Efficient many-shot in-context learning with dynamic block-sparse attention. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Showing first 80 references.