{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:ZGN6K3U52HN3B5SHNOS5FVRPB3","short_pith_number":"pith:ZGN6K3U5","schema_version":"1.0","canonical_sha256":"c99be56e9dd1dbb0f6476ba5d2d62f0eec00bc1013a01f0860b0619bb38cbb6b","source":{"kind":"arxiv","id":"2602.21545","version":3},"attestation_state":"computed","paper":{"title":"MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Muon+ adds one normalization step after polar orthogonalization to fix norm imbalance and improve LLM pre-training over Muon.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Liyan Tan, Ruijie Zhang, Yequan Zhao, Yupeng Su, Zhengyang Wang, Zheng Zhang, Ziyue Liu","submitted_at":"2026-02-25T04:04:00Z","abstract_excerpt":"Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton--Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2602.21545","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-02-25T04:04:00Z","cross_cats_sorted":[],"title_canon_sha256":"3265ff483205087d87e92ca09328bbc5ef60fcad494a026759207b403787db37","abstract_canon_sha256":"f823f07862b2e54fe74fc0c27266a6718f86f5f2dbc32bfe51f802fc24c17b95"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:15.958567Z","signature_b64":"CeRwBAKA+qK5LWWbDwld95GKcaQZg00MNFcLXsfDrIIj0SYiqrZAW3Nybk2uW2ImjwdnrleZ22MgbpikPBw5Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c99be56e9dd1dbb0f6476ba5d2d62f0eec00bc1013a01f0860b0619bb38cbb6b","last_reissued_at":"2026-05-17T23:39:15.957870Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:15.957870Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Muon+ adds one normalization step after polar orthogonalization to fix norm imbalance and improve LLM pre-training over Muon.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Liyan Tan, Ruijie Zhang, Yequan Zhao, Yupeng Su, Zhengyang Wang, Zheng Zhang, Ziyue Liu","submitted_at":"2026-02-25T04:04:00Z","abstract_excerpt":"Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton--Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the post-polar norm imbalance identified in the blockwise descent analysis is the dominant practical limitation of Muon and that the added normalization step corrects it without introducing offsetting drawbacks in other regimes or model scales.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Muon+ adds one normalization step after polar orthogonalization to fix norm imbalance and improve LLM pre-training over Muon.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2174f89fd10da13f57cefb956aa6da5d4ef6583b8c8b37ed8e61a7644ab0b992"},"source":{"id":"2602.21545","kind":"arxiv","version":3},"verdict":{"id":"31a77f3d-19d8-4827-938e-0be91b3ec235","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T19:21:44.536519Z","strongest_claim":"Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.","one_line_summary":"Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the post-polar norm imbalance identified in the blockwise descent analysis is the dominant practical limitation of Muon and that the added normalization step corrects it without introducing offsetting drawbacks in other regimes or model scales.","pith_extraction_headline":"Muon+ adds one normalization step after polar orthogonalization to fix norm imbalance and improve LLM pre-training over Muon."},"references":{"count":40,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2025,"title":"The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm","work_id":"c822532f-f27e-4958-ad88-a763e23f2d22","ref_index":2,"cited_arxiv_id":"2505.16932","is_internal_anchor":true},{"doi":"","year":2025,"title":"J. Bernstein. Deriving muon, 2025","work_id":"d19ed6b6-2fd9-4f7c-a6e9-de4fd8cf68bb","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"F. L. Cesista, Y . Jiacheng, and K. Jordan. Squeezing 1-2","work_id":"5e2c55c0-7e10-4ee8-a4da-f9ac7ea22abf","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Kimi-Audio Technical Report","work_id":"9c2ba56b-5585-4f28-b751-703f31dca2d5","ref_index":5,"cited_arxiv_id":"2504.18425","is_internal_anchor":true}],"resolved_work":40,"snapshot_sha256":"62bbd4d4f8da5b1c9920ffd9b1e5b1f617b5c5d204ab23024e1dc1254bb5fa63","internal_anchors":14},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2602.21545","created_at":"2026-05-17T23:39:15.957983+00:00"},{"alias_kind":"arxiv_version","alias_value":"2602.21545v3","created_at":"2026-05-17T23:39:15.957983+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2602.21545","created_at":"2026-05-17T23:39:15.957983+00:00"},{"alias_kind":"pith_short_12","alias_value":"ZGN6K3U52HN3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ZGN6K3U52HN3B5SH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ZGN6K3U5","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":1,"internal_anchor_count":1,"sample":[{"citing_arxiv_id":"2603.28254","citing_title":"MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration","ref_index":16,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3","json":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3.json","graph_json":"https://pith.science/api/pith-number/ZGN6K3U52HN3B5SHNOS5FVRPB3/graph.json","events_json":"https://pith.science/api/pith-number/ZGN6K3U52HN3B5SHNOS5FVRPB3/events.json","paper":"https://pith.science/paper/ZGN6K3U5"},"agent_actions":{"view_html":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3","download_json":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3.json","view_paper":"https://pith.science/paper/ZGN6K3U5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2602.21545&json=true","fetch_graph":"https://pith.science/api/pith-number/ZGN6K3U52HN3B5SHNOS5FVRPB3/graph.json","fetch_events":"https://pith.science/api/pith-number/ZGN6K3U52HN3B5SHNOS5FVRPB3/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3/action/storage_attestation","attest_author":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3/action/author_attestation","sign_citation":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3/action/citation_signature","submit_replication":"https://pith.science/pith/ZGN6K3U52HN3B5SHNOS5FVRPB3/action/replication_record"}},"created_at":"2026-05-17T23:39:15.957983+00:00","updated_at":"2026-05-17T23:39:15.957983+00:00"}