Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pith reviewed 2026-05-20 07:10 UTC · model grok-4.3
The pith
Muon's uniform spectral whitening amplifies noisy tail directions in low-rank VLA gradients and erodes per-head specialization under low-SNR RLVR updates; Pion replaces it with a high-pass Newton-Schulz iteration that anchors dominant singular values at 1 and suppresses tail components toward 0.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon’s uniform spectral orthogonalization drives all singular values toward 1, but this uniform treatment amplifies noisy tail directions in the low-rank action-module gradients typical of VLA tasks and destabilizes per-head specialization under the low signal-to-noise gradients of RLVR; Pion replaces this with a high-pass Newton-Schulz iteration that promotes dominant components while suppressing tails, achieving higher success rates on LIBERO benchmarks and better accuracy on MATH and GSM8K.
What carries the argument
The high-pass Newton-Schulz iteration: a two-stage Promotion+Suppression mechanism that induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing tail components toward 0.
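To make the uniform-whitening side of this contrast concrete, the sketch below iterates a scalar odd quintic on a few singular values. It relies on the fact that an odd matrix polynomial acts on each singular value independently, so a scalar model captures the spectral behavior. The coefficients are the classical quintic Newton-Schulz ones (15/8, −10/8, 3/8), which coincide with the Promotion polynomial quoted further down this page; Muon implementations typically use differently tuned coefficients, and the input spectrum and step count here are illustrative assumptions, not values from the paper.

```python
def whiten_step(sigma: float) -> float:
    """One classical quintic Newton-Schulz step applied to a singular value."""
    return 1.875 * sigma - 1.25 * sigma**3 + 0.375 * sigma**5


def whiten(sigma: float, steps: int) -> float:
    """Apply the same step repeatedly (uniform whitening on a scalar)."""
    for _ in range(steps):
        sigma = whiten_step(sigma)
    return sigma


# Hypothetical low-rank-like spectrum: two dominant directions and a noisy tail.
spectrum = [1.0, 0.6, 0.05, 0.01]
whitened = [whiten(s, steps=12) for s in spectrum]

for before, after in zip(spectrum, whitened):
    print(f"{before:.2f} -> {after:.4f} (gain x{after / before:.1f})")
```

With enough steps every entry, including the tail, ends up close to 1, so the smallest input receives by far the largest gain. That is the behavior the core claim attributes to uniform whitening.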
Load-bearing premise
That the performance gaps arise specifically because uniform whitening amplifies noisy tails in low-rank modules and erodes per-head specialization, rather than from unrelated implementation differences between Pion and Muon.
What would settle it
Direct inspection of the singular-value spectrum of the momentum matrices recorded during VLA training, checking whether the tail magnitudes are markedly larger under Muon than under Pion and whether that difference tracks the observed success-rate gaps on LIBERO Object.
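One way to run such a check, sketched below under the assumption that the singular values of each momentum matrix have already been logged by the training code, is to summarize each spectrum with a tail-energy fraction and a stable rank. The function names and the choice of `k` are hypothetical and not taken from the paper.

```python
from typing import Sequence


def tail_energy_fraction(singular_values: Sequence[float], k: int) -> float:
    """Fraction of squared spectral mass outside the top-k singular values."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(s * s for s in ordered)
    if total == 0.0:
        return 0.0
    return sum(s * s for s in ordered[k:]) / total


def stable_rank(singular_values: Sequence[float]) -> float:
    """Sum of squared singular values divided by the largest squared one."""
    top = max(singular_values)
    if top == 0.0:
        return 0.0
    return sum(s * s for s in singular_values) / (top * top)


# Toy spectra (made-up numbers, not measurements): a low-rank input and a
# fully whitened update in which every singular value sits at 1.
low_rank = [1.0, 0.6, 0.05, 0.01]
whitened = [1.0, 1.0, 1.0, 1.0]

print(tail_energy_fraction(low_rank, k=2), stable_rank(low_rank))
print(tail_energy_fraction(whitened, k=2), stable_rank(whitened))
```

Tracking these two numbers per layer for Muon and Pion updates over training would show whether tail mass is in fact larger under Muon, and whether the difference lines up with the success-rate gap.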
Original abstract
Muon is a matrix-aware optimizer that leverages Newton-Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show it could lead to fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across ℓ1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching 100% success rate on LIBERO Object after 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a π0.5 backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K while Muon collapses to zero.
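The per-head mode described in the abstract can be pictured as slicing a projection matrix into one block per attention head and processing each block on its own. The sketch below assumes the heads are stacked along the row dimension and uses Frobenius normalization as a stand-in for the per-block update; the layout, the helper names, and the stand-in operation are illustrative assumptions rather than the paper's implementation.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def frobenius_normalize(block: Matrix) -> Matrix:
    """Stand-in per-block operation: scale a block to unit Frobenius norm."""
    norm = math.sqrt(sum(v * v for row in block for v in row))
    if norm == 0.0:
        return [row[:] for row in block]
    return [[v / norm for v in row] for row in block]


def apply_per_head(matrix: Matrix, num_heads: int,
                   op: Callable[[Matrix], Matrix]) -> Matrix:
    """Split rows into num_heads equal blocks, apply op to each, restack."""
    rows = len(matrix)
    if rows % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    head_dim = rows // num_heads
    out: Matrix = []
    for h in range(num_heads):
        block = matrix[h * head_dim:(h + 1) * head_dim]
        out.extend(op(block))
    return out


# Two heads with very different gradient scales (made-up values).
grad = [[3.0, 4.0], [0.0, 0.0],      # head 0, large
        [0.003, 0.004], [0.0, 0.0]]  # head 1, small
per_head = apply_per_head(grad, num_heads=2, op=frobenius_normalize)
joint = frobenius_normalize(grad)
print(per_head[2], joint[2])
```

In the joint call the small head is scaled by the large head's norm and nearly vanishes, while the per-head call treats each head on its own scale. A real implementation would replace the stand-in with the optimizer's orthogonalization step applied per block.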
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Muon’s uniform spectral whitening via Newton-Schulz iterations leads to fundamental limitations beyond pretraining: in VLA training, low-rank action-module gradients cause amplification of noisy tail directions; in RLVR, low-SNR gradients and per-head specialization needs make whitening unstable. It proposes Pion, a drop-in replacement that replaces uniform whitening with a two-stage Promotion+Suppression high-pass NS iteration to anchor dominant singular values at 1 while driving tails toward 0, with controllable filter strength and an optional per-head reshape mode. Experiments report consistent outperformance on LIBERO/LIBERO-Plus for VLA-Adapter and VLANeXt (e.g., 100% success on LIBERO Object after 1,500 steps vs. 97.0% Muon and 32.2% AdamW), real-robot Franka tasks, and RLVR post-training on Qwen3 models with GRPO/GMPO where Muon collapses.
Significance. If the empirical gains prove robust and the spectral mechanism is directly verified, this could meaningfully advance optimizer design for post-pretraining regimes in robotics and verifiable-reward RL. The work earns credit for the real-robot validation under the DROID setup, the per-head mode at no extra cost, and the explicit reporting of numerical improvements on named benchmarks and architectures. The high-pass design offers a practical, efficient remedy that preserves Muon’s computational profile while targeting domain-specific spectral issues.
major comments (2)
- The central causal claim—that uniform NS whitening amplifies noisy tail singular values in low-rank VLA action gradients and destabilizes per-head specialization under low-SNR RLVR gradients, while the high-pass iteration selectively remedies this—lacks direct verification. No singular-value histograms, condition-number traces, or per-layer spectral plots from VLA-Adapter/VLANeXt runs on LIBERO or GRPO runs on MATH/GSM8K are provided, leaving open the possibility that reported gains (e.g., 100% vs. 97% success) arise from per-head reshape, learning-rate retuning, or other implementation details rather than the claimed spectral mechanism.
- §4 (Experiments): while specific numerical improvements are reported across ℓ1-regression and flow-matching architectures, the manuscript provides insufficient detail on run-to-run variance, full ablation isolating the high-pass filter strength from the per-head mode, and controls confirming that the two-stage Promotion+Suppression iteration is the load-bearing factor. This weakens the link between the proposed remedy and the observed outperformance.
minor comments (2)
- Abstract: the phrase 'controllable filter strength' is introduced without an explicit parameterization or default value; moving a short equation or pseudocode snippet for the high-pass iteration into the abstract or early method section would improve clarity.
- Notation: ensure consistent use of 'NS iteration' vs. 'Newton-Schulz' and define all acronyms (VLA, RLVR, GRPO, GMPO) at first occurrence.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for acknowledging the practical contributions of the work, including the real-robot validation and the per-head mode. We address each major comment below and agree that strengthening the direct verification of the spectral mechanism and expanding the experimental details will improve the manuscript. We will incorporate the suggested additions in the revised version.
-
Referee: The central causal claim—that uniform NS whitening amplifies noisy tail singular values in low-rank VLA action gradients and destabilizes per-head specialization under low-SNR RLVR gradients, while the high-pass iteration selectively remedies this—lacks direct verification. No singular-value histograms, condition-number traces, or per-layer spectral plots from VLA-Adapter/VLANeXt runs on LIBERO or GRPO runs on MATH/GSM8K are provided, leaving open the possibility that reported gains (e.g., 100% vs. 97% success) arise from per-head reshape, learning-rate retuning, or other implementation details rather than the claimed spectral mechanism.
Authors: We agree that direct spectral visualizations would provide stronger causal evidence and help rule out alternative explanations for the observed gains. Although the performance improvements are large, consistent across architectures, and include real-robot results, we acknowledge that the current manuscript relies primarily on end-task metrics. In the revision we will add singular-value histograms, condition-number traces, and per-layer spectral plots from representative VLA-Adapter and VLANeXt runs on LIBERO as well as GRPO runs on MATH/GSM8K. These plots will compare Muon and Pion directly, showing tail amplification under uniform whitening and selective suppression under the high-pass iteration, thereby isolating the spectral mechanism from the per-head reshape and other factors. revision: yes
-
Referee: §4 (Experiments): while specific numerical improvements are reported across ℓ1-regression and flow-matching architectures, the manuscript provides insufficient detail on run-to-run variance, full ablation isolating the high-pass filter strength from the per-head mode, and controls confirming that the two-stage Promotion+Suppression iteration is the load-bearing factor. This weakens the link between the proposed remedy and the observed outperformance.
Authors: We accept this critique and will expand §4 accordingly. The revised experiments section will report mean and standard deviation across at least three random seeds for all main results to quantify run-to-run variance. We will add a dedicated ablation table that varies the high-pass suppression strength while holding the per-head mode fixed, and a separate comparison of the per-head reshape mode with and without the high-pass iteration. In addition, we will include controls that disable either the Promotion or Suppression stage individually, confirming that the combined two-stage iteration is necessary for the reported gains. These changes will directly address the concern that other implementation details may be responsible for the improvements. revision: yes
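The variance reporting promised in this response amounts to a small aggregation step. A minimal sketch follows, assuming per-seed scores are collected in a dictionary keyed by configuration name; the configuration names and numbers are placeholders chosen only to exercise the code and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(runs: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample standard deviation) per configuration."""
    summary = {}
    for name, scores in runs.items():
        if len(scores) < 3:
            raise ValueError(f"{name}: need at least three seeds")
        summary[name] = (mean(scores), stdev(scores))
    return summary


# Placeholder per-seed scores, NOT measurements from the paper.
placeholder_runs = {
    "config_a": [90.0, 92.0, 94.0],
    "config_b": [80.0, 85.0, 90.0],
}
for name, (mu, sd) in summarize(placeholder_runs).items():
    print(f"{name}: {mu:.1f} +/- {sd:.1f}")
```

The same function can be applied to each cell of the proposed ablation grid (per-head mode on or off, each stage enabled or disabled) so that every comparison carries its own spread.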
Circularity Check
No significant circularity; new algorithmic proposal with direct empirical validation
The paper introduces Pion as an explicit modification to the Newton-Schulz iteration (Promotion+Suppression high-pass) to address claimed spectral issues in VLA and RLVR regimes. This design is presented by construction rather than derived from fitted data or prior results. Performance claims rest on reported success rates and accuracies from training runs on LIBERO, LIBERO-Plus, DROID, MATH, and GSM8K using VLA-Adapter, VLANeXt, and GRPO/GMPO setups. No load-bearing step reduces a prediction to a self-citation chain, renames a known result, or equates an output to an input parameter by definition. The argument is self-contained via the new iteration rule and external benchmark comparisons.
Axiom & Free-Parameter Ledger
free parameters (1)
- filter strength
axioms (1)
- domain assumption: Newton-Schulz iterations admit a two-stage promotion-suppression modification that produces a sharp high-pass spectral filter while preserving computational cost.
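The cost part of this assumption rests on a standard property of odd matrix polynomials, stated here for reference with $X = U\Sigma V^{\top}$ a (thin) singular value decomposition and $(a, b, c)$ generic coefficients:

```latex
X = U \Sigma V^{\top}
\;\Longrightarrow\;
aX + b\,(XX^{\top})X + c\,(XX^{\top})^{2}X
  = U\left(a\Sigma + b\Sigma^{3} + c\Sigma^{5}\right)V^{\top}.
```

Each step needs only matrix multiplications, and changing $(a, b, c)$ between steps alters the scalar map applied to every singular value without changing the per-step cost. Whether a particular two-stage choice yields a sharp high-pass filter is the part that remains an assumption.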
invented entities (1)
- Pion optimizer: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
Pion splits the NS iterations into a two-stage Promotion+Suppression sequence... f_p(σ) = 1.875σ − 1.25σ³ + 0.375σ⁵... f_s(σ) = 2.5σ³ − 1.5σ⁵
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
unclear: Relation between the paper passage and the cited Recognition theorem.
high-pass NS iteration... anchors dominant singular values at 1 while suppressing noisy tail components toward 0
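The two polynomials quoted in the first passage above can be explored on scalars, since each acts independently on every singular value. In the sketch below the split into two Promotion steps followed by three Suppression steps is an assumption made for illustration (the excerpt does not give the schedule), and the function names are hypothetical. Solving f_s(σ) = σ gives fixed points at 0, sqrt(2/3) ≈ 0.816, and 1, so repeated Suppression pushes values below that threshold toward 0 and values above it toward 1.

```python
def promote(sigma: float) -> float:
    """Promotion polynomial f_p as quoted in the passage."""
    return 1.875 * sigma - 1.25 * sigma**3 + 0.375 * sigma**5


def suppress(sigma: float) -> float:
    """Suppression polynomial f_s as quoted in the passage."""
    return 2.5 * sigma**3 - 1.5 * sigma**5


def high_pass(sigma: float, n_promote: int = 2, n_suppress: int = 3) -> float:
    """Apply Promotion steps, then Suppression steps (schedule is assumed)."""
    for _ in range(n_promote):
        sigma = promote(sigma)
    for _ in range(n_suppress):
        sigma = suppress(sigma)
    return sigma


# Hypothetical normalized spectrum: dominant directions plus a noisy tail.
for s in [1.0, 0.6, 0.5, 0.05, 0.01]:
    print(f"{s:.2f} -> {high_pass(s):.6f}")
```

Under this assumed schedule the dominant inputs land near 1 and the tail inputs collapse toward 0. Varying `n_promote` shifts which inputs clear the Suppression threshold, which is one way to picture a controllable filter strength, although the paper's own parameterization may differ.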
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Dion: Distributed orthonormalized updates
Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295.
-
[2]
The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm
Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. arXiv preprint arXiv:2505.16932.
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
-
[4]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
-
[6]
KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
-
[7]
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626.
-
[8]
Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights. arXiv preprint arXiv:2603.12228.
-
[9]
Vla-0: Building state-of-the-art vlas with zero modification
Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification. arXiv preprint arXiv:2510.13054.
-
[10]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
-
[11]
Chuan He, Zhanwang Deng, and Zhaosong Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. arXiv preprint arXiv:2509.11983, 2025a.
Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. Root: Robust orthogonalized optimizer for neural network training. arXiv preprin...
-
[12]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262.
-
[13]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054.
-
[14]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945.
-
[15]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
-
[16]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645.
-
[17]
Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, and Sijia Liu. Powering up zeroth-order training via subspace gradient orthogonalization. arXiv preprint arXiv:2602.17155.
-
[18]
Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, and Ivan Oseledets. Back to basics: Revisiting exploration in reinforcement learning for llm reasoning via generative probabilities. arXiv preprint arXiv:2602.05281.
-
[19]
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024a.
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Ch...
-
[20]
Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models
Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models. arXiv preprint arXiv:2310.10505.
-
[21]
Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072.
-
[22]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
-
[23]
Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, and Haibo Qiu. Length-unbiased sequence policy optimization: Revealing and controlling response length variation in rlvr. arXiv preprint arXiv:2602.05261.
-
[24]
Muon is Scalable for LLM Training
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025a.
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspectiv...
-
[25]
Unbiased gradient low-rank projection
Rui Pan, Yang Luo, Yuxing Liu, Yang You, and Tong Zhang. Unbiased gradient low-rank projection. arXiv preprint arXiv:2510.17802.
-
[26]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747.
-
[27]
Nicolas Le Roux, Marc G Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, and Sam Work. Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms. arXiv preprint arXiv:2503.14286.
-
[28]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
-
[29]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
-
[30]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844.
-
[31]
Adamuon: Adaptive muon optimizer
Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005.
-
[32]
Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization. arXiv preprint arXiv:2508.07629.
-
[33]
SOAP: Improving and Stabilizing Shampoo using Adam
Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321.
-
[34]
When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL
Haoxuan Wang, Gengyu Zhang, Yan Yan, Yuzhang Shang, Ramana Rao Kompella, and Gaowen Liu. Real-time robot execution with masked action chunking. In International Conference on Learning Representations (ICLR), 2026a.
Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai. Aspo: Asymmetric importance sampling policy opti...
-
[35]
Vla-adapter: An effective paradigm for tiny-scale vision-language-action model
Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pp. 18638–18646, 2026b.
Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and ...
-
[36]
Vlanext: Recipes for building strong vla models
Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, and Chen Change Loy. Vlanext: Recipes for building strong vla models. arXiv preprint arXiv:2602.18532.
-
[37]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
-
[38]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
-
[39]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837.
-
[40]
Vlm4vla: Revisiting vision-language-models in vision-language-action models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309.
-
[41]
A Survey of Reinforcement Learning for Large Reasoning Models
Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025a.
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. On the design of kl-regularized policy gr...
-
[42]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a.
Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy rl reach with stale data on llms? arXiv preprint arXiv:2510.0116...
-
[43]
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925.
-
[44]
The path not taken: Rlvr provably learns off the principals
Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, et al. The path not taken: Rlvr provably learns off the principals. arXiv preprint arXiv:2511.08567.
-
[45]
A.2 RLVR training: GRPO and GMPO
Appendix outline: A Additional Preliminaries: VLA Training and RLVR Training; A.1 VLA action heads and training objectives; A.2 RLVR training: GRPO and GMPO; B Low-rank Muon (LRMuon) Algorithm; C SNR Analysis for SFT an...
-
[46]
... (A2), where t ∼ U(0,1) denotes the uniform distribution over the interpolation timestep. In our experiments (Sec. 6), the ℓ1-regression head is instantiated by VLA-Adapter (Wang et al., 2026b) and the flow-matching head by VLANeXt (Wu et al., 2026). A.2 RLVR training: GRPO and GMPO. We expand here on the three-stage RLVR loop sketched in Sec
-
[47]
on LIBERO (Liu et al., 2023), with VLANeXt additionally evaluated on the perturbed LIBERO-Plus split (Fei et al., 2025); the Object suite converges faster and is allocated fewer training steps. Table A2 summarizes the RLVR hyperparameters, reused across both RL algorithms (GRPO/GMPO) and both model scales (Qwen3-1.7B/4B); only the prompt/response length, trai...
-
[48]
Table A1: Training hyperparameters for the VLA experiments on the LIBERO benchmark
is finetuned under the DROID hardware platform (Khazatsky et al., 2025; Wang et al., 2026a) and evaluated on three grasp-and-place tasks. Table A1: Training hyperparameters for the VLA experiments on the LIBERO benchmark. The three optimizer configurations (i)–(iii) are applied identically to both models, and share all other hyperparameters listed in this...
-
[49]
Each panel anchors the pass band (|σ| ≥ τ) at ±1 and contracts the stop band (|σ| < τ) toward 0
In the actual SVD update only the nonnegative half σ_in ∈ [0,1] is applied to singular values; the plotted negative half visualizes the antisymmetric extension. Each panel anchors the pass band (|σ| ≥ τ) at ±1 and contracts the stop band (|σ| < τ) toward 0. Table A7: Fitted coefficients $\hat{\theta}(\tau) = \{(a_{1,k}, a_{3,k}, a_{5,k})\}_{k=1}^{5}$ of the 5-step odd-quintic composition (...