{"paper":{"title":"RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RMNP replaces Newton-Schulz orthogonalization with row-wise L2 normalization to match Muon performance at linear cost.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Ruochen Jin, Shenyang Deng, Shuhua Yu, Tianyu Pang, Yaoqing Yang, Zhuoli Ouyang, Zihang Liu","submitted_at":"2026-03-20T21:55:28Z","abstract_excerpt":"Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. 
We establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The substitution is justified by the empirically observed diagonal block structure of the Transformer layerwise Hessian together with the claim that orthogonalization and row-wise (on input dim) ℓ2 normalization are asymptotically equivalent for transformers.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RMNP replaces Newton-Schulz orthogonalization with row-wise L2 normalization to match Muon performance at linear cost.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"89ed50705e6d1ba467c761e6280a2bccc3b496385804dcd73a379758629ae08d"},"source":{"id":"2603.20527","kind":"arxiv","version":3},"verdict":{"id":"6c0c4b43-c00f-4417-b2c7-9e646bef0edc","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T07:46:35.514216Z","strongest_claim":"RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. 
We establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity.","one_line_summary":"RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The substitution is justified by the empirically observed diagonal block structure of the Transformer layerwise Hessian together with the claim that orthogonalization and row-wise (on input dim) ℓ2 normalization are asymptotically equivalent for transformers.","pith_extraction_headline":"RMNP replaces Newton-Schulz orthogonalization with row-wise L2 normalization to match Muon performance at linear cost."},"references":{"count":48,"sample":[{"doi":"","year":2011,"title":"Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011","work_id":"15121aa1-4e5f-4d6e-83d0-6396a71086bc","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2012,"title":"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012","work_id":"d19b620d-4742-4006-b1e1-b1badcec2e14","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2014,"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","ref_index":3,"cited_arxiv_id":"1412.6980","is_internal_anchor":true},{"doi":"","year":2019,"title":"Decoupled weight decay regularization","work_id":"139a920c-85a1-42e5-937e-f14d907436d5","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"The Potential of Second-Order Optimization for LLMs: A Study with Full 
Gauss-Newton","work_id":"cd4774e4-bb07-4806-8a2d-fca30d54bda2","ref_index":5,"cited_arxiv_id":"2510.09378","is_internal_anchor":true}],"resolved_work":48,"snapshot_sha256":"bfc0e13f157d16093742972b48fc3fb339387754796242b5d5db0a0df2a98671","internal_anchors":7},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}