Pith Number

pith:TVUCJQGM

pith:2024:TVUCJQGMRXM3TGG3S374KNKC5O

not attested · not anchored · not stored · refs resolved

When Attention Sink Emerges in Language Models: An Empirical View

Chao Du, Cunxiao Du, Fengzhuo Zhang, Min Lin, Qian Liu, Tianyu Pang, Xiangming Gu, Ye Wang

Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.

arxiv:2410.10781 v2 · 2024-10-14 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{TVUCJQGMRXM3TGG3S374KNKC5O}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.
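To make the softmax dependence in this claim concrete, here is a minimal pure-Python sketch. It is not the authors' code: the score values are hypothetical, and the helper names `softmax_weights` and `sigmoid_weights` are illustrative. It contrasts weights that are normalized across keys with sigmoid weights that are computed per key.

```python
import math

def softmax_weights(scores):
    """Normalized attention weights for one query; they always sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_weights(scores):
    """Unnormalized attention weights; each key is scored independently."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical query-key scores for one query over four keys (all low).
scores = [-4.0, -5.0, -5.0, -5.0]
# Same scores, except the first key's score is raised.
bumped = [0.0] + scores[1:]

soft_before, soft_after = softmax_weights(scores), softmax_weights(bumped)
sig_before, sig_after = sigmoid_weights(scores), sigmoid_weights(bumped)

print("softmax sum (low scores):", sum(soft_before))
print("sigmoid sum (low scores):", sum(sig_before))
print("softmax weight of key 1 before/after:", soft_before[1], soft_after[1])
print("sigmoid weight of key 1 before/after:", sig_before[1], sig_after[1])
```

Under softmax, one query's weights always total 1, so attention mass has to land on some key even when every score is low. The unnormalized sigmoid weights can all stay near zero, and changing one score leaves the other weights untouched.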

C2 · weakest assumption

The absence of attention sinks observed with sigmoid attention in models up to 1B parameters is assumed to hold for larger models, and the change is assumed not to degrade overall language modeling performance or capabilities.

C3 · one-line summary

Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

References

64 extracted · 64 resolved · 22 Pith anchors

[1] Layer Normalization 2016 · arXiv:1607.06450

[2] Pythia: A suite for analyzing large language models across training and scaling 2023

[3] Quantizable transformers: Removing outliers by helping attention heads do nothing 2023

[4] Language models are few-shot learners 2020

[5] URL https://arxiv 2024

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

ASAP: Attention Sink Anchored Pruning

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Dynamic Execution Commitment of Vision-Language-Action Models

Registers Matter for Pixel-Space Diffusion Transformers

Receipt and verification

First computed	2026-05-17T23:38:47.095115Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0

Aliases

arxiv: 2410.10781 · arxiv_version: 2410.10781v2 · doi: 10.48550/arxiv.2410.10781 · pith_short_12: TVUCJQGMRXM3 · pith_short_16: TVUCJQGMRXM3TGG3 · pith_short_8: TVUCJQGM
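The short aliases above are prefixes of the full 26-character identifier. For this record, the identifier also matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. That second point is an observation about the values on this page, not a rule the page states, so the sketch below treats it as an assumption. The function names `short_aliases` and `base32_prefix` are illustrative.

```python
import base64

# Values copied from this record.
PITH_NUMBER = "TVUCJQGMRXM3TGG3S374KNKC5O"
CANONICAL_HASH = "9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0"

def short_aliases(pith_number, lengths=(8, 12, 16)):
    """Build the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}

def base32_prefix(hex_digest, length=26):
    """Base32-encode a hex digest and keep the first `length` characters.

    Assumption: this mirrors how the identifier relates to the hash for
    this record; the page does not document the derivation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]

aliases = short_aliases(PITH_NUMBER)
derived = base32_prefix(CANONICAL_HASH)

print(aliases)
print("matches identifier:", derived == PITH_NUMBER)
```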

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "7c80d83b8a3966af2e289441d1d38931b96ecd5a368fa4d60804305042bffc1e",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-10-14T17:50:28Z",
    "title_canon_sha256": "d94b2f2cd6602889efecb1db67bf2c4693e4447abef13036e2cea27a37dca6bf"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.10781",
    "kind": "arxiv",
    "version": 2
  }
}
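The shell check earlier on this page can be mirrored offline. The sketch below reuses the same serialization settings as that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) on a copy of the record shown above. The function name `canonical_hash` is illustrative, and whether this copy reproduces the published digest depends on it matching the resolver's `canonical_record` exactly, which is assumed rather than guaranteed here.

```python
import hashlib
import json

def canonical_hash(record):
    """SHA-256 over a deterministic JSON serialization of the record."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Copy of the canonical record shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "7c80d83b8a3966af2e289441d1d38931b96ecd5a368fa4d60804305042bffc1e",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-10-14T17:50:28Z",
        "title_canon_sha256": "d94b2f2cd6602889efecb1db67bf2c4693e4447abef13036e2cea27a37dca6bf",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.10781", "kind": "arxiv", "version": 2},
}

EXPECTED = "9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0"

digest = canonical_hash(record)
# Reversing the top-level key order should not change the digest.
reordered = dict(reversed(list(record.items())))

print("digest:", digest)
print("matches published hash:", digest == EXPECTED)
print("order-independent:", canonical_hash(reordered) == digest)
```

Sorting keys and fixing the separators is what makes the hash reproducible by any mirror: two parties that hold the same record produce the same bytes regardless of how their JSON library orders or spaces the output.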