Pith Number

pith:TVUCJQGM

pith:2024:TVUCJQGMRXM3TGG3S374KNKC5O
not attested · not anchored · not stored · refs resolved

When Attention Sink Emerges in Language Models: An Empirical View

Chao Du, Cunxiao Du, Fengzhuo Zhang, Min Lin, Qian Liu, Tianyu Pang, Xiangming Gu, Ye Wang

Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.

arxiv:2410.10781 v2 · 2024-10-14 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{TVUCJQGMRXM3TGG3S374KNKC5O}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

We find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.

C2 weakest assumption

That the lack of attention sinks observed with sigmoid attention in models up to 1B parameters will hold for larger models and will not degrade overall language modeling performance or capabilities.

C3 one line summary

Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
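
The dependence on attention scores that the claims attribute to softmax normalization can be illustrated with a toy calculation. The sketch below is not the paper's experimental setup: the score values and the helper names `softmax` and `sigmoid` are illustrative assumptions. It only shows that softmax weights in one row must sum to 1, so raising one token's score lowers every other weight, whereas unnormalized sigmoid weights are computed independently per token.

```python
import math


def softmax(scores):
    """Normalized attention weights: each weight depends on every score in the row."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Unnormalized attention weights: each weight depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


# Hypothetical pre-activation scores for one query over four keys.
scores = [2.0, 0.5, -1.0, 0.0]
# Same row, but the first key's score is raised.
bumped = [5.0] + scores[1:]

soft_before, soft_after = softmax(scores), softmax(bumped)
sig_before, sig_after = sigmoid(scores), sigmoid(bumped)

print("softmax row sum:", sum(soft_before))
print("softmax weight of key 1 before/after:", soft_before[1], soft_after[1])
print("sigmoid weight of key 1 before/after:", sig_before[1], sig_after[1])
```

Under softmax, the weight on key 1 shrinks once key 0's score rises, even though key 1's own score is unchanged; under sigmoid it stays the same. This coupling is the kind of dependence the claim says is relaxed when normalization is removed.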

References

64 extracted · 64 resolved · 22 Pith anchors

[1] Layer Normalization 2016 · arXiv:1607.06450
[2] Pythia: A suite for analyzing large language models across training and scaling 2023
[3] Quantizable transformers: Removing outliers by helping attention heads do nothing 2023
[4] Language models are few-shot learners 2020
[5] URL https://arxiv 2024

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:47.095115Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0

Aliases

arxiv: 2410.10781 · arxiv_version: 2410.10781v2 · doi: 10.48550/arxiv.2410.10781 · pith_short_12: TVUCJQGMRXM3 · pith_short_16: TVUCJQGMRXM3TGG3 · pith_short_8: TVUCJQGM
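
The alias values listed above are consistent with two simple relationships, sketched here in Python. Both are observations drawn from this record's values, not documented rules of the Pith schema: the `pith_short_N` aliases look like length-N prefixes of the full Pith Number, and the full number looks like the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The helper name `short_alias` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0"
PITH_NUMBER = "TVUCJQGMRXM3TGG3S374KNKC5O"


def short_alias(pith_number: str, length: int) -> str:
    """Assumed rule: a short alias is a prefix of the full Pith Number."""
    return pith_number[:length]


aliases = {f"pith_short_{n}": short_alias(PITH_NUMBER, n) for n in (8, 12, 16)}

# Assumed rule: the Pith Number is a truncated base32 rendering of the SHA-256 digest.
b32 = base64.b32encode(bytes.fromhex(CANONICAL_HASH)).decode("ascii")
derived = b32[: len(PITH_NUMBER)]

print(aliases)
print(derived == PITH_NUMBER)
```

If the second relationship holds in general, a Pith Number can be checked against its canonical hash without contacting the service, although other records would be needed to confirm that.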

Agent API

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "7c80d83b8a3966af2e289441d1d38931b96ecd5a368fa4d60804305042bffc1e",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-10-14T17:50:28Z",
    "title_canon_sha256": "d94b2f2cd6602889efecb1db67bf2c4693e4447abef13036e2cea27a37dca6bf"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.10781",
    "kind": "arxiv",
    "version": 2
  }
}
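
The hashing step of the verification command can also be run offline against the record above. This is a minimal sketch that mirrors the serialization options of the page's one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes); the function name `canonical_sha256` is an assumption, and the record is pasted in by hand rather than fetched. Whether the digest equals the published value depends on the pasted record being identical to the one served, so the comparison is printed rather than asserted.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "7c80d83b8a3966af2e289441d1d38931b96ecd5a368fa4d60804305042bffc1e",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-10-14T17:50:28Z",
        "title_canon_sha256": "d94b2f2cd6602889efecb1db67bf2c4693e4447abef13036e2cea27a37dca6bf",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.10781", "kind": "arxiv", "version": 2},
}

# Published canonical hash from this page.
EXPECTED = "9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0"

digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys before hashing is what makes the digest independent of the key order in which the JSON happens to be served or pasted.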