{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:TVUCJQGMRXM3TGG3S374KNKC5O","short_pith_number":"pith:TVUCJQGM","schema_version":"1.0","canonical_sha256":"9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0","source":{"kind":"arxiv","id":"2410.10781","version":2},"attestation_state":"computed","paper":{"title":"When Attention Sink Emerges in Language Models: An Empirical View","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Chao Du, Cunxiao Du, Fengzhuo Zhang, Min Lin, Qian Liu, Tianyu Pang, Xiangming Gu, Ye Wang","submitted_at":"2024-10-14T17:50:28Z","abstract_excerpt":"Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during th"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2410.10781","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-10-14T17:50:28Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"d94b2f2cd6602889efecb1db67bf2c4693e4447abef13036e2cea27a37dca6bf","abstract_canon_sha256":"7c80d83b8a3966af2e289441d1d38931b96ecd5a368fa4d60804305042bffc1e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.095666Z","signature_b64":"6i3JYS8lfur2DevB2/zEGwICHODXGIv8iQYkxtt4+l3c6W2bQooCWAcMbKtpUgWdDhVUA0U0MjTdFhj4J3t8Cg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9d6824c0cc8dd9b998db96ffc53542eba5f42e8716777a66b04f72875b43c9d0","last_reissued_at":"2026-05-17T23:38:47.095115Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.095115Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"When Attention Sink Emerges in Language Models: An Empirical View","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Chao Du, Cunxiao Du, Fengzhuo Zhang, Min Lin, Qian Liu, Tianyu Pang, Xiangming Gu, Ye Wang","submitted_at":"2024-10-14T17:50:28Z","abstract_excerpt":"Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during th"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the lack of attention sinks observed with sigmoid attention in models up to 1B parameters will hold for larger models and will not degrade overall language modeling performance or capabilities.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"af591eb976b8c114ef3aaf342785e0cdfd06411bad71bd9d108b4fdcadee57e9"},"source":{"id":"2410.10781","kind":"arxiv","version":2},"verdict":{"id":"1af5a9b6-0ec6-46a5-b9b2-235e039bb29a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T17:37:06.531295Z","strongest_claim":"We find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.","one_line_summary":"Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the lack of attention sinks observed with sigmoid attention in models up to 1B parameters will hold for larger models and will not degrade overall language modeling performance or capabilities.","pith_extraction_headline":"Attention sinks in language models emerge from softmax normalization and act as key biases storing non-informative scores."},"references":{"count":64,"sample":[{"doi":"","year":2016,"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","ref_index":1,"cited_arxiv_id":"1607.06450","is_internal_anchor":true},{"doi":"","year":2023,"title":"Pythia: A suite for analyzing large language models across training and scaling","work_id":"1b00948d-e7ad-48f6-b8b0-3c85545d533a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Quantizable transformers: Removing outliers by helping attention heads do nothing","work_id":"91169195-3dc7-4415-9eab-faaaffd157bf","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"bf2aec88-6d9e-4f77-a6a3-9a828d524cc0","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"URL https://arxiv","work_id":"7acfb02e-83f1-4e11-b68f-a198807fcfcb","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":64,"snapshot_sha256":"7f25552db18cd366f200db49ddad42abceb86ffbe7036be7193e210000fc4e5d","internal_anchors":22},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d5546f34c8f18b157b0520e24f2662d92c798d9ae4ad27b3d27f995e67e55da1"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.10781","created_at":"2026-05-17T23:38:47.095217+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.10781v2","created_at":"2026-05-17T23:38:47.095217+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.10781","created_at":"2026-05-17T23:38:47.095217+00:00"},{"alias_kind":"pith_short_12","alias_value":"TVUCJQGMRXM3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"TVUCJQGMRXM3TGG3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"TVUCJQGM","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2410.13846","citing_title":"LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22372","citing_title":"ASAP: Attention Sink Anchored Pruning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04163","citing_title":"BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11567","citing_title":"Dynamic Execution Commitment of Vision-Language-Action Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16147","citing_title":"Registers Matter for Pixel-Space Diffusion Transformers","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17279","citing_title":"Rover: Context-aware Conflict Resolution with LLM","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21613","citing_title":"Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01203","citing_title":"Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07775","citing_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2602.23024","citing_title":"InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08504","citing_title":"A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12922","citing_title":"When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2510.26692","citing_title":"Kimi Linear: An Expressive, Efficient Attention Architecture","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02973","citing_title":"Exploring Motion-Language Alignment for Text-driven Motion Generation","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11447","citing_title":"Conditional Memory Enhanced Item Representation for Generative Recommendation","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11567","citing_title":"Dynamic Execution Commitment of Vision-Language-Action Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2601.02780","citing_title":"MiMo-V2-Flash Technical Report","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2505.06708","citing_title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08504","citing_title":"A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06611","citing_title":"The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04421","citing_title":"FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22117","citing_title":"PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11791","citing_title":"A Mechanistic Analysis of Looped Reasoning Language Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05688","citing_title":"Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05329","citing_title":"Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation","ref_index":13,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O","json":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O.json","graph_json":"https://pith.science/api/pith-number/TVUCJQGMRXM3TGG3S374KNKC5O/graph.json","events_json":"https://pith.science/api/pith-number/TVUCJQGMRXM3TGG3S374KNKC5O/events.json","paper":"https://pith.science/paper/TVUCJQGM"},"agent_actions":{"view_html":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O","download_json":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O.json","view_paper":"https://pith.science/paper/TVUCJQGM","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.10781&json=true","fetch_graph":"https://pith.science/api/pith-number/TVUCJQGMRXM3TGG3S374KNKC5O/graph.json","fetch_events":"https://pith.science/api/pith-number/TVUCJQGMRXM3TGG3S374KNKC5O/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O/action/storage_attestation","attest_author":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O/action/author_attestation","sign_citation":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O/action/citation_signature","submit_replication":"https://pith.science/pith/TVUCJQGMRXM3TGG3S374KNKC5O/action/replication_record"}},"created_at":"2026-05-17T23:38:47.095217+00:00","updated_at":"2026-05-17T23:38:47.095217+00:00"}