{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:MDJKPL5S3IEMCXYQ4Z5CCVQHIW","short_pith_number":"pith:MDJKPL5S","schema_version":"1.0","canonical_sha256":"60d2a7afb2da08c15f10e67a215607459bca6ed57194e20a0f3dbc5b94bfe664","source":{"kind":"arxiv","id":"2402.17762","version":2},"attestation_state":"computed","paper":{"title":"Massive Activations in Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models contain a small number of massive activations that remain constant across inputs and act as indispensable bias terms.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"J. Zico Kolter, Mingjie Sun, Xinlei Chen, Zhuang Liu","submitted_at":"2024-02-27T18:55:17Z","abstract_excerpt":"We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bia"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2402.17762","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-02-27T18:55:17Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"1375592bd25780fa45da9e4a454856fb6a1918f2dfb6bdb9df98135e4a994fe3","abstract_canon_sha256":"161a0a2b92c9dbee51eb5242b3c2633f8c7a752a67326dc218027ce16a6a8324"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.755502Z","signature_b64":"GgFiq7mdE76RFcNs1uUbR1Z0tly4e5FEesYi+dyI23wKDnMVTBq6qtEeuT17oHQAS8CV5SJZb7faiD71Gy1qDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"60d2a7afb2da08c15f10e67a215607459bca6ed57194e20a0f3dbc5b94bfe664","last_reissued_at":"2026-05-17T23:38:48.754966Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.754966Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Massive Activations in Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models contain a small number of massive activations that remain constant across inputs and act as indispensable bias terms.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"J. Zico Kolter, Mingjie Sun, Xinlei Chen, Zhuang Liu","submitted_at":"2024-02-27T18:55:17Z","abstract_excerpt":"We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bia"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations... their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs... these massive activations lead to the concentration of attention probabilities to their corresponding tokens.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed constancy of massive activation values and their role as indispensable bias terms generalize across all LLMs, inputs, and architectures based on the limited set of models and characterizations performed.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large language models contain a small number of massive activations that remain constant across inputs and act as indispensable bias terms.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3f0879792246673acf58211c8ccf21e12ff53c73017da131da2ffc3fde0c743c"},"source":{"id":"2402.17762","kind":"arxiv","version":2},"verdict":{"id":"5b852ca6-442a-49a6-a777-e5fa78bd9382","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T06:59:19.393219Z","strongest_claim":"very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations... their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs... these massive activations lead to the concentration of attention probabilities to their corresponding tokens.","one_line_summary":"Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed constancy of massive activation values and their role as indispensable bias terms generalize across all LLMs, inputs, and architectures based on the limited set of models and characterizations performed.","pith_extraction_headline":"Large language models contain a small number of massive activations that remain constant across inputs and act as indispensable bias terms."},"references":{"count":159,"sample":[{"doi":"","year":2022,"title":"Exploring Length Generalization in Large Language Models","work_id":"2c9271b4-93c3-4ef2-953e-9d6b8a9c41c0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2009,"title":"Computational complexity: a modern approach","work_id":"03206498-04bf-40ab-82ce-6bec266dc024","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"URLhttps://arxiv.org/pdf/2202.05826","work_id":"25bc4b88-d8d5-459f-a6ee-3871f05ce731","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"arXiv preprint arXiv:2207.08799 , year=","work_id":"92192172-5c98-475d-ab81-1f83e1a2d120","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1986,"title":"Mix Barrington","work_id":"9e91a5eb-4082-4686-b17a-c95c104f0867","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":159,"snapshot_sha256":"f5ef4ad595f606821b612a050b320025d4de37887ac8f710fa2503c7a66fd6c2","internal_anchors":47},"formal_canon":{"evidence_count":2,"snapshot_sha256":"9b8749389cf2bcf67e418ad8d841e5a7202cd0f867a718bfb966b6053667e8ad"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2402.17762","created_at":"2026-05-17T23:38:48.755052+00:00"},{"alias_kind":"arxiv_version","alias_value":"2402.17762v2","created_at":"2026-05-17T23:38:48.755052+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2402.17762","created_at":"2026-05-17T23:38:48.755052+00:00"},{"alias_kind":"pith_short_12","alias_value":"MDJKPL5S3IEM","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"MDJKPL5S3IEMCXYQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"MDJKPL5S","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2605.23040","citing_title":"Steered Generation via Gradient-Based Optimization on Sparse Query Features","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23258","citing_title":"A Simple Plug-in for Improving Eviction-Based KV Cache Compression","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23259","citing_title":"Multi-Gate Residuals","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18832","citing_title":"Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2407.08608","citing_title":"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16147","citing_title":"Registers Matter for Pixel-Space Diffusion Transformers","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18898","citing_title":"A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19660","citing_title":"OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19622","citing_title":"UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21677","citing_title":"Prophecy: Inferring Formal Properties from Neuron Activations","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22681","citing_title":"CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2410.10781","citing_title":"When Attention Sink Emerges in Language Models: An Empirical View","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14004","citing_title":"Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models","ref_index":291,"is_internal_anchor":true},{"citing_arxiv_id":"2602.10718","citing_title":"SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08504","citing_title":"A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03316","citing_title":"When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03380","citing_title":"Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09313","citing_title":"Attention Sinks in Diffusion Transformers: A Causal Analysis","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2406.04093","citing_title":"Scaling and evaluating sparse autoencoders","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2601.02780","citing_title":"MiMo-V2-Flash Technical Report","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2406.02069","citing_title":"PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2505.06708","citing_title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08504","citing_title":"A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09313","citing_title":"Attention Sinks in Diffusion Transformers: A Causal Analysis","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10622","citing_title":"Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination","ref_index":74,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW","json":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW.json","graph_json":"https://pith.science/api/pith-number/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/graph.json","events_json":"https://pith.science/api/pith-number/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/events.json","paper":"https://pith.science/paper/MDJKPL5S"},"agent_actions":{"view_html":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW","download_json":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW.json","view_paper":"https://pith.science/paper/MDJKPL5S","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2402.17762&json=true","fetch_graph":"https://pith.science/api/pith-number/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/graph.json","fetch_events":"https://pith.science/api/pith-number/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/action/storage_attestation","attest_author":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/action/author_attestation","sign_citation":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/action/citation_signature","submit_replication":"https://pith.science/pith/MDJKPL5S3IEMCXYQ4Z5CCVQHIW/action/replication_record"}},"created_at":"2026-05-17T23:38:48.755052+00:00","updated_at":"2026-05-17T23:38:48.755052+00:00"}