{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2026:UX46GXRWIS4VKM7QMLCCIJZTMN","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"3993724ea628eb474a133aef2da3ce370a1e4f18fada5ac2582c42502195f881","cross_cats_sorted":["cs.LG"],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2026-02-01T12:45:39Z","title_canon_sha256":"f0e4597edda6453bd6130d3f79b5166166002047f45d173e47af6ab320b2c958"},"schema_version":"1.0","source":{"id":"2602.01203","kind":"arxiv","version":3}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2602.01203","created_at":"2026-05-28T01:04:36Z"},{"alias_kind":"arxiv_version","alias_value":"2602.01203v3","created_at":"2026-05-28T01:04:36Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2602.01203","created_at":"2026-05-28T01:04:36Z"},{"alias_kind":"pith_short_12","alias_value":"UX46GXRWIS4V","created_at":"2026-05-28T01:04:36Z"},{"alias_kind":"pith_short_16","alias_value":"UX46GXRWIS4VKM7Q","created_at":"2026-05-28T01:04:36Z"},{"alias_kind":"pith_short_8","alias_value":"UX46GXRW","created_at":"2026-05-28T01:04:36Z"}],"graph_snapshots":[{"event_id":"sha256:5aec12fe9223c5a91ee96cfc35b8ed563b701dcd28d45865770b8387fb708f58","target":"graph","created_at":"2026-05-28T01:04:36Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"That the attention sink directly and naturally constructs an MoE routing mechanism whose load imbalance is the primary driver of head collapse, as supported by the paper's theoretical and empirical analysis."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"Attention sinks in transformers naturally build a Mixture-of-Experts structure inside attention layers."}],"snapshot_sha256":"a5501d1389a034f70538d8603642ee9cde97e3c5c1b1f48c992cb8f7294cb3d1"},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"integrity":{"available":true,"clean":true,"detectors_run":[],"endpoint":"/pith/2602.01203/integrity.json","findings":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938","summary":{"advisory":0,"by_detector":{},"critical":0,"informational":0}},"paper":{"abstract_excerpt":"Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head ","authors_text":"Meng Li, Runsheng Wang, Wenxuan Zeng, Zizhuo Fu","cross_cats":["cs.LG"],"headline":"Attention sinks in transformers naturally build a Mixture-of-Experts structure inside attention layers.","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2026-02-01T12:45:39Z","title":"Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse"},"references":{"count":0,"internal_anchors":0,"resolved_work":0,"sample":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2602.01203","kind":"arxiv","version":3},"verdict":{"created_at":"2026-05-16T08:44:55.427129Z","id":"8f26cec9-1785-45ae-b48f-0fe31cc1787d","model_set":{"reader":"grok-4.3"},"one_line_summary":"Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"Attention sinks in transformers naturally build a Mixture-of-Experts structure inside attention layers.","strongest_claim":"the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work.","weakest_assumption":"That the attention sink directly and naturally constructs an MoE routing mechanism whose load imbalance is the primary driver of head collapse, as supported by the paper's theoretical and empirical analysis."}},"verdict_id":"8f26cec9-1785-45ae-b48f-0fe31cc1787d"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:ec1236b4ef624345dfdd774800574cf4e009fa79e5892eb1d7ab4d171d88b4b6","target":"record","created_at":"2026-05-28T01:04:36Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"3993724ea628eb474a133aef2da3ce370a1e4f18fada5ac2582c42502195f881","cross_cats_sorted":["cs.LG"],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2026-02-01T12:45:39Z","title_canon_sha256":"f0e4597edda6453bd6130d3f79b5166166002047f45d173e47af6ab320b2c958"},"schema_version":"1.0","source":{"id":"2602.01203","kind":"arxiv","version":3}},"canonical_sha256":"a5f9e35e3644b95533f062c424273363704a3ca9598782a6f3086ef0fed79c93","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"a5f9e35e3644b95533f062c424273363704a3ca9598782a6f3086ef0fed79c93","first_computed_at":"2026-05-28T01:04:36.041742Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-28T01:04:36.041742Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"TQC88svSK+Dy36zQHnG2teuYh3QWJC7qci8AHQC8As41oiNDLx8kx8BJ+qA9vdFLKBQXy0hpxAJYn5KgKAywCA==","signature_status":"signed_v1","signed_at":"2026-05-28T01:04:36.042253Z","signed_message":"canonical_sha256_bytes"},"source_id":"2602.01203","source_kind":"arxiv","source_version":3}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:ec1236b4ef624345dfdd774800574cf4e009fa79e5892eb1d7ab4d171d88b4b6","sha256:5aec12fe9223c5a91ee96cfc35b8ed563b701dcd28d45865770b8387fb708f58"],"state_sha256":"b7a9e9b9309dbcdfa291e0d8f16d9f856f939fa690ac3d34ae6134b6f7add35b"}