{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:Y3UJ36HWYODK3SBIVQAWN3ZXCB","short_pith_number":"pith:Y3UJ36HW","schema_version":"1.0","canonical_sha256":"c6e89df8f6c386adc828ac0166ef37105c60b407da96a0de22491f79bd188883","source":{"kind":"arxiv","id":"2409.11321","version":2},"attestation_state":"computed","paper":{"title":"SOAP: Improving and Stabilizing Shampoo using Adam","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SOAP runs Adam inside Shampoo's eigenbasis to cut large-batch iterations by over 40 percent versus AdamW.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"David Brandfonbrener, Depen Morwani, Itai Shapira, Lucas Janson, Mujin Kwun, Nikhil Vyas, Rosie Zhao, Sham Kakade","submitted_at":"2024-09-17T16:18:05Z","abstract_excerpt":"There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insig"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2409.11321","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2024-09-17T16:18:05Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"19f51ed35d18fedb5acd0bd7125373799c5f6753272115eb11a91ac74cc69dcb","abstract_canon_sha256":"10304c10072863fbd852c8c755150cdf3e5f2321c96b18b2f4e2df8a7fcfe0d2"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:05.180023Z","signature_b64":"TVJsH119zG9OxY5UTICbOwaG4AWAtQDBkwLNML3JFtoNCgMl3ksjrxNl2Ego6zYZqKkoRSUb66iDAejdOYUbDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c6e89df8f6c386adc828ac0166ef37105c60b407da96a0de22491f79bd188883","last_reissued_at":"2026-05-17T23:39:05.179384Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:05.179384Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SOAP: Improving and Stabilizing Shampoo using Adam","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SOAP runs Adam inside Shampoo's eigenbasis to cut large-batch iterations by over 40 percent versus AdamW.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"David Brandfonbrener, Depen Morwani, Itai Shapira, Lucas Janson, Mujin Kwun, Nikhil Vyas, Rosie Zhao, Sham Kakade","submitted_at":"2024-09-17T16:18:05Z","abstract_excerpt":"There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insig"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"In the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The formal equivalence between 1/2-power Shampoo and Adafactor holds only inside the current eigenbasis; the paper assumes that keeping this basis fixed for many steps does not materially degrade the preconditioning quality, an assumption validated only empirically on the tested model sizes.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SOAP runs Adam in the eigenbasis of Shampoo's preconditioner, cutting iterations by over 40% versus AdamW on 360M-660M language models while adding only one hyperparameter.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SOAP runs Adam inside Shampoo's eigenbasis to cut large-batch iterations by over 40 percent versus AdamW.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7d7a9bc9895da344f5dabe67ca44e7a5cf513c4857c89ccedd6927b0b2c7f010"},"source":{"id":"2409.11321","kind":"arxiv","version":2},"verdict":{"id":"f224842c-9d09-490d-8b0d-a53c6aaafe7f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T00:57:19.167991Z","strongest_claim":"In the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo.","one_line_summary":"SOAP runs Adam in the eigenbasis of Shampoo's preconditioner, cutting iterations by over 40% versus AdamW on 360M-660M language models while adding only one hyperparameter.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The formal equivalence between 1/2-power Shampoo and Adafactor holds only inside the current eigenbasis; the paper assumes that keeping this basis fixed for many steps does not materially degrade the preconditioning quality, an assumption validated only empirically on the tested model sizes.","pith_extraction_headline":"SOAP runs Adam inside Shampoo's eigenbasis to cut large-batch iterations by over 40 percent versus AdamW."},"references":{"count":14,"sample":[{"doi":"10.48550/arxiv","year":2024,"title":"URLhttps://doi.org/10.48550/arXiv","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","ref_index":1,"cited_arxiv_id":"2203.14987","is_internal_anchor":true},{"doi":"","year":null,"title":"(360m) We sweep over the cross product of best 3 learning rates and β1 ∈ {0.9, 0.95, 0.99}","work_id":"68d7246c-b0af-4656-b251-14957d19df9e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The last two of the sweeps did not yield any benefit for the 360m model with 2m batch size hence we only sweep over learning rate for the 660m model with 2m batch size","work_id":"e734c182-2261-46fd-a26b-32c9a19751da","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"(360m) We sweep over over the cross product of best 3 learning rates from above and ϵshampoo ∈ {1e−11, 1e−12, 1e−13}","work_id":"ca85d03f-af81-4afc-b4fa-6582fdb25921","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"(360m) We sweep over over the cross product of best 3 learning rates from above and βshampoo ∈ {.9, .95, .975}","work_id":"2ad7d81d-628f-4da6-bd3c-1e5c969cda02","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":14,"snapshot_sha256":"7cc19d8b763e00a24303a3efad67be0f5733dc5dda17bbbe87dcd7889fb46675","internal_anchors":1},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b65b98a1f4fad8fbcc58eca0322a67e71f70b45a287b8141cda68d664f0775e6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.11321","created_at":"2026-05-17T23:39:05.179508+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.11321v2","created_at":"2026-05-17T23:39:05.179508+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.11321","created_at":"2026-05-17T23:39:05.179508+00:00"},{"alias_kind":"pith_short_12","alias_value":"Y3UJ36HWYODK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"Y3UJ36HWYODK3SBI","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"Y3UJ36HW","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2603.10067","citing_title":"HTMuon: Improving Muon via Heavy-Tailed Spectral Correction","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23391","citing_title":"Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22644","citing_title":"Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2505.24275","citing_title":"GradPower: Powering Gradients for Faster Language Model Pre-Training","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2510.10777","citing_title":"Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08453","citing_title":"Hard-constrained Physics-informed Neural Networks for Interface Problems","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16165","citing_title":"Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18609","citing_title":"Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19282","citing_title":"Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16184","citing_title":"Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09034","citing_title":"Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23737","citing_title":"On the Convergence Analysis of Muon","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2511.18820","citing_title":"Unsupervised simulation of incompressible flows with physics- and equality- constrained artificial neural networks","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04774","citing_title":"Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00541","citing_title":"Spectral Condition for $\\mu$P under Width-Depth Scaling","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2603.26554","citing_title":"Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28254","citing_title":"MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13761","citing_title":"Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11117","citing_title":"GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11838","citing_title":"Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12492","citing_title":"Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08423","citing_title":"Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09034","citing_title":"Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08436","citing_title":"A meshfree exterior calculus for generalizable and data-efficient learning of physics from point clouds","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23528","citing_title":"When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions","ref_index":63,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB","json":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB.json","graph_json":"https://pith.science/api/pith-number/Y3UJ36HWYODK3SBIVQAWN3ZXCB/graph.json","events_json":"https://pith.science/api/pith-number/Y3UJ36HWYODK3SBIVQAWN3ZXCB/events.json","paper":"https://pith.science/paper/Y3UJ36HW"},"agent_actions":{"view_html":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB","download_json":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB.json","view_paper":"https://pith.science/paper/Y3UJ36HW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.11321&json=true","fetch_graph":"https://pith.science/api/pith-number/Y3UJ36HWYODK3SBIVQAWN3ZXCB/graph.json","fetch_events":"https://pith.science/api/pith-number/Y3UJ36HWYODK3SBIVQAWN3ZXCB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB/action/storage_attestation","attest_author":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB/action/author_attestation","sign_citation":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB/action/citation_signature","submit_replication":"https://pith.science/pith/Y3UJ36HWYODK3SBIVQAWN3ZXCB/action/replication_record"}},"created_at":"2026-05-17T23:39:05.179508+00:00","updated_at":"2026-05-17T23:39:05.179508+00:00"}