pith. machine review for the scientific record.

arxiv: 2410.10781 · v2 · submitted 2024-10-14 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

When Attention Sink Emerges in Language Models: An Empirical View

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 17:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords attention sink · language models · softmax · sigmoid attention · key biases · pre-training · attention mechanism

The pith

Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models consistently assign high attention to the initial token, a behavior known as attention sink. The paper shows that the phenomenon emerges during pre-training once optimization becomes effective on sufficient data, and that the sink position depends on the loss function and data distribution. Critically, attention sinks behave like key biases that hold extra attention scores which never feed into the value computation. This behavior is induced by softmax normalization, and switching to sigmoid attention without normalization prevents sinks from emerging in models up to 1 billion parameters.
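The key-bias reading can be made concrete with a toy sketch (illustrative only; the paper's own formulation may differ): append a hypothetical sink key whose value vector is zero, so any attention mass it absorbs is simply discarded from the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5                       # head dimension, sequence length
q = rng.normal(size=d)            # a single query
K = rng.normal(size=(T, d))       # keys for T real tokens
V = rng.normal(size=(T, d))       # values for T real tokens

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

# Standard softmax attention for this query.
w = softmax(q @ K.T)
out = w @ V

# "Key bias" view: prepend a sink key whose value vector is zero.
# Mass landing on it never reaches the value computation, exactly like
# the extra, non-informative scores described above.
k_sink = rng.normal(size=d)                        # hypothetical sink key
w_sink = softmax(np.concatenate(([q @ k_sink], q @ K.T)))
out_sink = w_sink[1:] @ V                          # zero value => no contribution
```

The real tokens' weights end up being the original weights rescaled by whatever mass the sink did not absorb, which is the sense in which a sink "stores" attention without contributing content.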

Core claim

Attention sink emerges universally in LMs across various inputs, even in small models, and is observed to arise during pre-training. Sink tokens act more like key biases, storing extra attention scores that can be non-informative and do not contribute to the value computation. The phenomenon stems at least partially from tokens' inner dependence on attention scores induced by softmax normalization; replacing softmax attention with sigmoid attention without normalization prevents attention sinks from emerging in LMs up to 1B parameters.
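The softmax-dependence claim can be illustrated with a minimal sketch (not the paper's experimental code): softmax ties every token's weight to every other logit, while an unnormalized sigmoid scores each key independently.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.0, 0.5, -1.0, 0.0])
bumped = logits.copy()
bumped[0] += 1.0                  # raise a single logit

# Softmax: weights must sum to 1, so boosting one token
# necessarily drains attention from all the others.
w_soft, w_soft2 = softmax(logits), softmax(bumped)

# Sigmoid without normalization: each key is scored on its own,
# so the untouched tokens keep exactly the same attention scores.
w_sig, w_sig2 = sigmoid(logits), sigmoid(bumped)
```

Under softmax, there is nowhere for "unneeded" probability mass to go except onto some token, which is the pressure the review credits with producing a sink; sigmoid removes that constraint.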

What carries the argument

softmax normalization creating an inner dependence among tokens' attention scores

If this is right

  • Attention sinks store extra attention scores that do not contribute to value computation.
  • The sink position correlates strongly with the loss function and data distribution.
  • Attention sinks appear after effective optimization on sufficient training data.
  • Alternative attention operations like sigmoid attention avoid the emergence of attention sinks.
  • Attention sinks occur universally across different inputs and even in small models.
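One way to operationalize "sinks occur universally across inputs" is a simple first-token mass check; the threshold and definition below are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def sink_score(attn, eps=0.3):
    """Fraction of query positions that put more than `eps` of their
    attention mass on the first token. `attn` is a (T, T) causal,
    row-stochastic attention map; `eps` is an assumed threshold."""
    return float((attn[:, 0] > eps).mean())

T = 10
# Toy "sinky" map: 80% of every row's mass parked on token 0.
sinky = np.zeros((T, T))
for i in range(T):
    sinky[i, :i + 1] = 0.2 / (i + 1)
    sinky[i, 0] += 0.8
# Uniform causal map for comparison.
uniform = np.array([[1.0 / (i + 1) if j <= i else 0.0 for j in range(T)]
                    for i in range(T)])
```

A sink-dominated map scores near 1 under this check while a uniform causal map scores low, so such a metric could be averaged over heads, layers, and inputs to test the universality claim.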

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This mechanism may explain why certain KV cache optimizations rely on the first token.
  • If the finding generalizes, model designers could adopt normalization-free attentions to simplify inference acceleration techniques.
  • Attention sinks might be replaced by learned biases in the architecture to achieve similar effects without the softmax artifact.
  • The result opens the possibility of training larger models without this bias if sigmoid attention scales well.

Load-bearing premise

That using sigmoid attention instead of softmax will continue to prevent attention sinks in models larger than 1B parameters while maintaining comparable language modeling performance.

What would settle it

Observe whether attention sinks appear when training a model with over 1B parameters using sigmoid attention and compare its performance to a standard softmax model on language modeling benchmarks.

read the original abstract

Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper empirically studies the attention sink phenomenon in language models, showing that sinks appear universally across inputs and model sizes, emerge during pre-training after sufficient optimization on data, correlate with loss functions and data distributions, function like key biases storing non-informative extra attention scores, and arise partly from softmax-induced token dependence on attention scores. Replacing softmax with unnormalized sigmoid attention eliminates sinks in models trained up to 1B parameters, with public code released for verification.

Significance. If the results hold, the work supplies a mechanistic, empirically grounded account of attention sinks with direct implications for KV cache optimization, long-context inference, and architectural choices in LMs. Strengths include controlled pre-training ablations across model sizes, data, loss, and architectures, plus reproducible code that allows independent verification of the core observations within the stated 1B-parameter regime.

minor comments (3)
  1. [Abstract] The statement that sinks 'do not emerge' with sigmoid attention should explicitly reference the maximum model size (1B) and note that it holds under the reported training regime.
  2. [Section 4] Sigmoid attention experiments: include a direct side-by-side comparison of validation loss or perplexity between softmax and sigmoid models to confirm that sink removal does not come at the cost of modeling performance.
  3. [Figure 3] Attention score visualizations (or equivalent): add scale bars or explicit numerical ranges on the color maps so readers can assess the magnitude of the reported 'extra' scores stored at the sink position.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work. The referee's summary accurately captures our empirical findings on the universal emergence of attention sinks, their correlation with optimization and data, their interpretation as key biases, and the elimination of sinks via unnormalized sigmoid attention in models up to 1B parameters.

Circularity Check

0 steps flagged

No significant circularity; empirical observations stand independently

full rationale

The paper's core claims rest on controlled pre-training experiments and ablations up to 1B parameters, including a direct replacement of softmax attention with sigmoid attention to test the normalization hypothesis. There are no load-bearing derivations, no fitted parameters renamed as predictions, and no self-citation chains that would reduce the results to their inputs by construction. Correlations with loss, data distribution, and emergence timing are reported as direct measurements from training runs rather than algebraic identities. The explicit scoping to observed regimes and the released code further keep the findings checkable against external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and relies on standard assumptions of transformer training with softmax attention; no new free parameters are fitted to produce the central claims, and no new entities are postulated.

axioms (1)
  • domain assumption: Standard transformer architectures use softmax attention by default
    The paper builds experiments around the common softmax attention mechanism in LMs.

pith-pipeline@v0.9.0 · 5572 in / 1310 out tokens · 42352 ms · 2026-05-16T17:37:06.531295+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  3. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  4. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  5. PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  6. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  7. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  8. Conditional Memory Enhanced Item Representation for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

  9. The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...

  10. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  11. Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

  12. InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

    cs.RO 2026-02 unverdicted novelty 6.0

    InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

  13. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  14. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  15. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  16. Exploring Motion-Language Alignment for Text-driven Motion Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.

  17. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  18. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  19. Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

    cs.CL 2025-11 unverdicted novelty 5.0

    Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 18 Pith papers · 21 internal anchors

  1. [1]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  2. [2]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.\ 2397--2430. PMLR, 2023

  3. [3]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. Advances in Neural Information Processing Systems, 36: 75067--75096, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  5. [5]

    Spectral filters, dark signals, and attention sinks

    Nicola Cancedda. Spectral filters, dark signals, and attention sinks. arXiv preprint arXiv:2402.09221, 2024

  6. [6]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024

  7. [7]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  8. [8]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp.\ 7480--7512. PMLR, 2023

  9. [9]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35: 30318--30332, 2022

  10. [10]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 4171--4186, 2019

  11. [11]

    A simple and effective L2 norm-based strategy for kv cache compression

    Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. A simple and effective L2 norm-based strategy for kv cache compression. arXiv preprint arXiv:2406.11430, 2024

  12. [12]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  13. [13]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022

  15. [15]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  16. [16]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

  17. [17]

    Model tells you what to discard: Adaptive kv cache compression for llms

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

  18. [18]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  19. [19]

    Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

    Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms. arXiv preprint arXiv:2410.13835, 2024a

  20. [20]

    Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters

    Zhiyu Guo, Hidetaka Kamigaito, and Taro Watanabe. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. arXiv preprint arXiv:2406.12335, 2024b

  21. [21]

    Lm-infinite: Zero-shot extreme length generalization for large language models

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 3991--4008, 2024

  22. [22]

    Understanding and minimising outlier features in transformer training

    Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, and Thomas Hofmann. Understanding and minimising outlier features in transformer training. In Advances in Neural Information Processing Systems, 2024

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

  24. [24]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  25. [25]

    Slim-llm: Salience-driven mixed-precision quantization for large language models

    Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, and Xiaojuan Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models. arXiv preprint arXiv:2405.14917, 2024

  26. [26]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  27. [27]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.\ 5156--5165. PMLR, 2020

  28. [28]

    From attention to activation: Unravelling the enigmas of large language models

    Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. arXiv preprint arXiv:2410.17174, 2024

  29. [29]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024

  30. [30]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  31. [31]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  32. [32]

    Duquant: Distributing outliers via dual transformation makes stronger quantized llms

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  33. [33]

    Regmix: Data mixture as regression for language model pre-training

    Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492, 2024a

  34. [34]

    Intactkv: Improving large language model quantization by keeping pivot tokens intact

    Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, and Chun Yuan. Intactkv: Improving large language model quantization by keeping pivot tokens intact. arXiv preprint arXiv:2403.01241, 2024b

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  36. [36]

    Attention is off by one

    Evan Miller. Attention is off by one. URL https://www.evanmiller.org/attention-is-off-by-one.html, 2023

  37. [37]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730--27744, 2022

  38. [38]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021

  39. [39]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  40. [40]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67, 2020

  41. [41]

    Searching for Activation Functions

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017

  42. [42]

    Theory, analysis, and best practices for sigmoid self-attention

    Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, et al. Theory, analysis, and best practices for sigmoid self-attention. arXiv preprint arXiv:2409.04431, 2024

  43. [43]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  44. [44]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  45. [45]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568: 127063, 2024

  46. [46]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  47. [47]

    Jamba-1.5: Hybrid transformer-mamba models at scale

    Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv preprint arXiv:2408.12570, 2024

  48. [48]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  49. [49]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  50. [50]

    Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference

    Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. arXiv preprint arXiv:2406.18139, 2024

  51. [51]

    What language model architecture and pretraining objective works best for zero-shot generalization?

    Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, pp.\ 22964--22984. PMLR, 2022

  52. [52]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023

  53. [53]

    Layer-condensed kv cache for efficient inference of large language models

    Haoyi Wu and Kewei Tu. Layer-condensed kv cache for efficient inference of large language models. arXiv preprint arXiv:2405.10637, 2024

  54. [54]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp.\ 38087--38099. PMLR, 2023a

  55. [55]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023b

  56. [56]

    Seed-story: Multimodal long story generation with large language model

    Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model. arXiv preprint arXiv:2407.08683, 2024

  57. [57]

    Stablemask: Refining causal masking in decoder-only transformer

    Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu Shen, and Qiang Zhang. Stablemask: Refining causal masking in decoder-only transformer. arXiv preprint arXiv:2402.04779, 2024

  58. [58]

    Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration

    Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. arXiv preprint arXiv:2406.15765, 2024

  59. [59]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  60. [60]

    Stabilizing transformer training by preventing attention entropy collapse

    Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M Susskind. Stabilizing transformer training by preventing attention entropy collapse. In International Conference on Machine Learning, pp.\ 40770--40803. PMLR, 2023

  61. [61]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  62. [62]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024a

  63. [63]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  64. [64]

    Q-hitter: A better token oracle for efficient llm inference via sparse-quantized kv cache

    Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, and Atlas Wang. Q-hitter: A better token oracle for efficient llm inference via sparse-quantized kv cache. Proceedings of Machine Learning and Systems, 6: 381--394, 2024b