pith. machine review for the scientific record.

arxiv: 2410.10781 · v2 · submitted 2024-10-14 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

When Attention Sink Emerges in Language Models: An Empirical View

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 17:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords attention sink · language models · softmax · sigmoid attention · key biases · pre-training · attention mechanism

The pith

Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models consistently assign high attention to the initial token, a behavior known as attention sink. The paper shows that the phenomenon emerges during pre-training once optimization becomes effective on sufficient data, and that the sink position depends on the loss function and data distribution. Critically, attention sinks behave like key biases that hold extra attention scores which never feed into the value computation. This behavior is induced by softmax normalization, and switching to sigmoid attention without normalization prevents sinks from emerging in models up to 1 billion parameters.
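The key-bias reading can be made concrete with a toy sketch (illustrative only; the paper's own formulation may differ): append a hypothetical sink key whose value vector is zero, so any attention mass it absorbs is simply discarded from the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5                       # head dimension, sequence length
q = rng.normal(size=d)            # a single query
K = rng.normal(size=(T, d))       # keys for T real tokens
V = rng.normal(size=(T, d))       # values for T real tokens

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

# Standard softmax attention for this query.
w = softmax(q @ K.T)
out = w @ V

# "Key bias" view: prepend a sink key whose value vector is zero.
# Mass landing on it never reaches the value computation, exactly like
# the extra, non-informative scores described above.
k_sink = rng.normal(size=d)                        # hypothetical sink key
w_sink = softmax(np.concatenate(([q @ k_sink], q @ K.T)))
out_sink = w_sink[1:] @ V                          # zero value => no contribution
```

The real tokens' weights end up being the original weights rescaled by whatever mass the sink did not absorb, which is the sense in which a sink "stores" attention without contributing content.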

Core claim

Attention sink emerges universally in LMs across various inputs, even in small models, and is observed to arise during pre-training. Sink tokens act more like key biases, storing extra attention scores that can be non-informative and do not contribute to the value computation. The phenomenon stems at least partially from tokens' inner dependence on attention scores induced by softmax normalization; replacing softmax attention with sigmoid attention without normalization prevents attention sinks from emerging in LMs up to 1B parameters.
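The softmax-dependence claim can be illustrated with a minimal sketch (not the paper's experimental code): softmax ties every token's weight to every other logit, while an unnormalized sigmoid scores each key independently.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.0, 0.5, -1.0, 0.0])
bumped = logits.copy()
bumped[0] += 1.0                  # raise a single logit

# Softmax: weights must sum to 1, so boosting one token
# necessarily drains attention from all the others.
w_soft, w_soft2 = softmax(logits), softmax(bumped)

# Sigmoid without normalization: each key is scored on its own,
# so the untouched tokens keep exactly the same attention scores.
w_sig, w_sig2 = sigmoid(logits), sigmoid(bumped)
```

Under softmax, there is nowhere for "unneeded" probability mass to go except onto some token, which is the pressure the review credits with producing a sink; sigmoid removes that constraint.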

What carries the argument

softmax normalization creating an inner dependence among tokens' attention scores

If this is right

  • Attention sinks store extra attention scores that do not contribute to value computation.
  • The sink position correlates strongly with the loss function and data distribution.
  • Attention sinks appear after effective optimization on sufficient training data.
  • Alternative attention operations like sigmoid attention avoid the emergence of attention sinks.
  • Attention sinks occur universally across different inputs and even in small models.
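One way to operationalize "sinks occur universally across inputs" is a simple first-token mass check; the threshold and definition below are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def sink_score(attn, eps=0.3):
    """Fraction of query positions that put more than `eps` of their
    attention mass on the first token. `attn` is a (T, T) causal,
    row-stochastic attention map; `eps` is an assumed threshold."""
    return float((attn[:, 0] > eps).mean())

T = 10
# Toy "sinky" map: 80% of every row's mass parked on token 0.
sinky = np.zeros((T, T))
for i in range(T):
    sinky[i, :i + 1] = 0.2 / (i + 1)
    sinky[i, 0] += 0.8
# Uniform causal map for comparison.
uniform = np.array([[1.0 / (i + 1) if j <= i else 0.0 for j in range(T)]
                    for i in range(T)])
```

A sink-dominated map scores near 1 under this check while a uniform causal map scores low, so such a metric could be averaged over heads, layers, and inputs to test the universality claim.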

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This mechanism may explain why certain KV cache optimizations rely on the first token.
  • If the finding generalizes, model designers could adopt normalization-free attentions to simplify inference acceleration techniques.
  • Attention sinks might be replaced by learned biases in the architecture to achieve similar effects without the softmax artifact.
  • The result opens the possibility of training larger models without this bias if sigmoid attention scales well.

Load-bearing premise

That using sigmoid attention instead of softmax will continue to prevent attention sinks in models larger than 1B parameters while maintaining comparable language modeling performance.

What would settle it

Observe whether attention sinks appear when training a model with over 1B parameters using sigmoid attention and compare its performance to a standard softmax model on language modeling benchmarks.

read the original abstract

Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper empirically studies the attention sink phenomenon in language models, showing that sinks appear universally across inputs and model sizes, emerge during pre-training after sufficient optimization on data, correlate with loss functions and data distributions, function like key biases storing non-informative extra attention scores, and arise partly from softmax-induced token dependence on attention scores. Replacing softmax with unnormalized sigmoid attention eliminates sinks in models trained up to 1B parameters, with public code released for verification.

Significance. If the results hold, the work supplies a mechanistic, empirically grounded account of attention sinks with direct implications for KV cache optimization, long-context inference, and architectural choices in LMs. Strengths include controlled pre-training ablations across model sizes, data, loss, and architectures, plus reproducible code that allows independent verification of the core observations within the stated 1B-parameter regime.

minor comments (3)
  1. [Abstract] The statement that sinks 'do not emerge' with sigmoid attention should explicitly reference the maximum model size (1B) and note that it holds under the reported training regime.
  2. [Section 4] Sigmoid attention experiments: include a direct side-by-side comparison of validation loss or perplexity between softmax and sigmoid models to confirm that sink removal does not come at the cost of modeling performance.
  3. [Figure 3] Attention score visualizations (or equivalent): add scale bars or explicit numerical ranges on the color maps so readers can assess the magnitude of the reported 'extra' scores stored at the sink position.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work. The referee's summary accurately captures our empirical findings on the universal emergence of attention sinks, their correlation with optimization and data, their interpretation as key biases, and the elimination of sinks via unnormalized sigmoid attention in models up to 1B parameters.

Circularity Check

0 steps flagged

No significant circularity; empirical observations stand independently

full rationale

The paper's core claims rest on controlled pre-training experiments and ablations up to 1B parameters, including a direct replacement of softmax attention with sigmoid attention to test the normalization hypothesis. There are no load-bearing derivations, no fitted parameters renamed as predictions, and no self-citation chains that would reduce the results to their inputs by construction. Correlations with loss, data distribution, and emergence timing are reported as direct measurements from training runs rather than algebraic identities. The explicit scoping to observed regimes and the released code further keep the findings checkable against external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and relies on standard assumptions of transformer training with softmax attention; no new free parameters are fitted to produce the central claims, and no new entities are postulated.

axioms (1)
  • domain assumption: Standard transformer architectures use softmax attention by default
    The paper builds experiments around the common softmax attention mechanism in LMs.

pith-pipeline@v0.9.0 · 5572 in / 1310 out tokens · 42352 ms · 2026-05-16T17:37:06.531295+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  3. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  4. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  5. PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  6. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  7. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  8. Conditional Memory Enhanced Item Representation for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

  9. The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...

  10. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  11. Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

  12. InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

    cs.RO 2026-02 unverdicted novelty 6.0

    InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

  13. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  14. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  15. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  16. Exploring Motion-Language Alignment for Text-driven Motion Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.

  17. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  18. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  19. Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

    cs.CL 2025-11 unverdicted novelty 5.0

    Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 18 Pith papers · 21 internal anchors

  1. [1]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  2. [2]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.\ 2397--2430. PMLR, 2023

  3. [3]

    Quantizable transformers: Removing outliers by helping attention heads do nothing

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. Advances in Neural Information Processing Systems, 36: 75067--75096, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  5. [5]

    Spectral filters, dark signals, and attention sinks

    Nicola Cancedda. Spectral filters, dark signals, and attention sinks. arXiv preprint arXiv:2402.09221, 2024

  6. [6]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024

  7. [7]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  8. [8]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp.\ 7480--7512. PMLR, 2023

  9. [9]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35: 30318--30332, 2022

  10. [10]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 4171--4186, 2019

  11. [11]

    A simple and effective L2 norm-based strategy for kv cache compression

    Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. A simple and effective L2 norm-based strategy for kv cache compression. arXiv preprint arXiv:2406.11430, 2024

  12. [12]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  13. [13]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022

  15. [15]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  16. [16]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

  17. [17]

    Model tells you what to discard: Adaptive kv cache compression for llms

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

  18. [18]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  19. [19]

    Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

    Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms. arXiv preprint arXiv:2410.13835, 2024a

  20. [20]

    Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters

    Zhiyu Guo, Hidetaka Kamigaito, and Taro Watanabe. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. arXiv preprint arXiv:2406.12335, 2024b

  21. [21]

    Lm-infinite: Zero-shot extreme length generalization for large language models

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 3991--4008, 2024

  22. [22]

    Understanding and minimising outlier features in transformer training

    Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, and Thomas Hofmann. Understanding and minimising outlier features in transformer training. In Advances in Neural Information Processing Systems, 2024

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

  24. [24]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  25. [25]

    Slim-llm: Salience-driven mixed-precision quantization for large language models

    Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, and Xiaojuan Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models. arXiv preprint arXiv:2405.14917, 2024

  26. [26]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  27. [27]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.\ 5156--5165. PMLR, 2020

  28. [28]

    From attention to activation: Unravelling the enigmas of large language models

    Prannay Kaul, Chengcheng Ma, Ismail Elezi, and Jiankang Deng. From attention to activation: Unravelling the enigmas of large language models. arXiv preprint arXiv:2410.17174, 2024

  29. [29]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024

  30. [30]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  31. [31]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  32. [32]

    Duquant: Distributing outliers via dual transformation makes stronger quantized llms

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  33. [33]

    Regmix: Data mixture as regression for language model pre-training

    Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492, 2024a

  34. [34]

    Intactkv: Improving large language model quantization by keeping pivot tokens intact

    Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, and Chun Yuan. Intactkv: Improving large language model quantization by keeping pivot tokens intact. arXiv preprint arXiv:2403.01241, 2024b

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  36. [36]

    Attention is off by one

    Evan Miller. Attention is off by one. URL https://www.evanmiller.org/attention-is-off-by-one.html, 2023

  37. [37]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730--27744, 2022

  38. [38]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021

  39. [39]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  40. [40]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67, 2020

  41. [41]

    Searching for Activation Functions

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017

  42. [42]

    Theory, analysis, and best practices for sigmoid self-attention

    Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, et al. Theory, analysis, and best practices for sigmoid self-attention. arXiv preprint arXiv:2409.04431, 2024

  43. [43]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  44. [44]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  45. [45]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568: 127063, 2024

  46. [46]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  47. [47]

    Jamba-1.5: Hybrid transformer-mamba models at scale

    Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv preprint arXiv:2408.12570, 2024

  48. [48]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  49. [49]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  50. [50]

    Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference

    Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. arXiv preprint arXiv:2406.18139, 2024

  51. [51]

    What language model architecture and pretraining objective works best for zero-shot generalization?

    Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, pp.\ 22964--22984. PMLR, 2022

  52. [52]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023

  53. [53]

    Layer-condensed kv cache for efficient inference of large language models

    Haoyi Wu and Kewei Tu. Layer-condensed kv cache for efficient inference of large language models. arXiv preprint arXiv:2405.10637, 2024

  54. [54]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp.\ 38087--38099. PMLR, 2023a

  55. [55]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023b

  56. [56]

    Seed-story: Multimodal long story generation with large language model

    Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model. arXiv preprint arXiv:2407.08683, 2024

  57. [57]

    Stablemask: Refining causal masking in decoder-only transformer

    Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu Shen, and Qiang Zhang. Stablemask: Refining causal masking in decoder-only transformer. arXiv preprint arXiv:2402.04779, 2024

  58. [58]

    Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration

    Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. arXiv preprint arXiv:2406.15765, 2024

  59. [59]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  60. [60]

    Stabilizing transformer training by preventing attention entropy collapse

    Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M Susskind. Stabilizing transformer training by preventing attention entropy collapse. In International Conference on Machine Learning, pp.\ 40770--40803. PMLR, 2023

  61. [61]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  62. [62]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024a

  63. [63]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  64. [64]

    Q-hitter: A better token oracle for efficient llm inference via sparse-quantized kv cache

    Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, and Atlas Wang. Q-hitter: A better token oracle for efficient llm inference via sparse-quantized kv cache. Proceedings of Machine Learning and Systems, 6: 381--394, 2024b