Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Chongyu Fan; Gaowen Liu; Mingyi Hong; Ramana Rao Kompella; Sijia Liu

arxiv: 2605.19282 · v1 · pith:FNXAMCVAnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-20 07:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords: Muon optimizer · Newton-Schulz iteration · spectral whitening · high-pass filter · VLA training · RLVR · gradient orthogonalization · per-head specialization

The pith

Muon's uniform spectral whitening amplifies noisy tails in low-rank VLA gradients and erodes per-head specialization under low-SNR RLVR updates; Pion replaces it with a high-pass Newton-Schulz iteration that anchors dominant singular values at 1 and suppresses the tail toward 0.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Muon, which applies Newton-Schulz iterations to drive every singular value of the momentum matrix to 1, works for LLM pretraining but creates two distinct problems beyond that stage. In vision-language-action training the action modules produce inherently low-rank gradients, so uniform whitening boosts the noisy low-magnitude directions and degrades task performance. In reinforcement learning with verifiable rewards the same whitening erases the per-head specialization that prior training created, especially when gradients carry low signal-to-noise ratios. Pion keeps Muon’s efficiency but substitutes a two-stage Promotion-plus-Suppression process inside the Newton-Schulz loop that leaves large singular values at 1 while driving the tail values toward 0, with an optional per-head reshape that applies the update independently across attention heads.
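To make the mechanism concrete, here is a minimal NumPy sketch of a Promotion-then-Suppression loop. It uses the two quintic polynomials quoted in the Lean-theorem section of this page and the five-step budget mentioned in the Figure 3 caption. The function names (`poly_step`, `high_pass_ns`), the Frobenius normalization, the default `k_p=2`, and the `eps` guard are assumptions made for illustration; this is not the authors' released code.

```python
import numpy as np

# Coefficients (a, b, c) of the odd quintic a*s + b*s^3 + c*s^5,
# copied from the polynomials quoted on this page.
PROMOTE = (1.875, -1.25, 0.375)   # f_p: raises every singular value in (0, 1)
SUPPRESS = (0.0, 2.5, -1.5)       # f_s: pushes small singular values toward 0


def poly_step(X, coeffs):
    """Apply an odd quintic to every singular value of X without an SVD."""
    a, b, c = coeffs
    A = X @ X.T                    # U diag(s^2) U^T
    return a * X + (b * A + c * (A @ A)) @ X


def high_pass_ns(M, k_p=2, total_steps=5, eps=1e-7):
    """k_p Promotion steps followed by (total_steps - k_p) Suppression steps."""
    # Assumption: Frobenius normalization so that all singular values are <= 1.
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(k_p):
        X = poly_step(X, PROMOTE)
    for _ in range(total_steps - k_p):
        X = poly_step(X, SUPPRESS)
    return X
```

Because both polynomials are odd, each step acts on the singular values only and leaves the singular vectors untouched, which is the same trick the standard Newton-Schulz whitening relies on.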

Core claim

Muon’s uniform spectral orthogonalization drives all singular values toward 1, but this uniform treatment amplifies noisy tail directions in the low-rank action-module gradients typical of VLA tasks and destabilizes per-head specialization under the low signal-to-noise gradients of RLVR; Pion replaces this with a high-pass Newton-Schulz iteration that promotes dominant components while suppressing tails, achieving higher success rates on LIBERO benchmarks and better accuracy on MATH and GSM8K.

What carries the argument

The high-pass Newton-Schulz iteration: a two-stage Promotion+Suppression mechanism that induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing tail components toward 0.
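Why the two stages behave as a promoter and a high-pass filter can be checked by hand from the two quintics quoted in the Lean-theorem section of this page. The factorizations below are a worked check written for this reading, not a derivation taken from the paper.

```latex
\begin{aligned}
f_p(\sigma) &= 1.875\,\sigma - 1.25\,\sigma^{3} + 0.375\,\sigma^{5},
&\quad f_p(\sigma) - \sigma &= \tfrac{1}{8}\,\sigma\,(1-\sigma^{2})(7-3\sigma^{2}),\\
f_s(\sigma) &= 2.5\,\sigma^{3} - 1.5\,\sigma^{5},
&\quad f_s(\sigma) - \sigma &= \tfrac{1}{2}\,\sigma\,(1-\sigma^{2})(3\sigma^{2}-2),\\
f_p'(\sigma) &= \tfrac{15}{8}\,(1-\sigma^{2})^{2},
&\quad f_s'(\sigma) &= \tfrac{15}{2}\,\sigma^{2}(1-\sigma^{2}).
\end{aligned}
```

On $(0,1)$ the first factorization is strictly positive, so every Promotion step raises a singular value. The second changes sign at $\sigma = \sqrt{2/3} \approx 0.816$: Suppression lowers values below that point and raises values above it, with fixed points at 0 and 1. Both maps are monotone on $[0,1]$ and flat at $\sigma = 1$, and $f_s$ decays like $2.5\,\sigma^{3}$ near 0, which is what makes the tail collapse quickly.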

Load-bearing premise

That the performance gaps arise specifically because uniform whitening amplifies noisy tails in low-rank modules and erodes per-head specialization, rather than from unrelated differences in how Pion is coded.

What would settle it

Direct inspection of the singular-value spectrum of the momentum matrices recorded during VLA training, checking whether the tail magnitudes are markedly larger under Muon than under Pion and whether that difference tracks the observed success-rate gaps on LIBERO Object.
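The inspection described above could be scripted along the following lines. This is a hedged sketch: `spectrum_report` and `top_k` are hypothetical names, the effective rank uses the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), and it is an assumption that the paper's "erank" follows that definition.

```python
import numpy as np


def spectrum_report(M, top_k=8):
    """Summarize the singular-value spectrum of one momentum/update matrix."""
    s = np.linalg.svd(M, compute_uv=False)      # singular values, descending
    p = s / s.sum()                             # assumes M is not all zeros
    p = p[p > 0]                                # avoid log(0)
    erank = float(np.exp(-(p * np.log(p)).sum()))
    energy = s ** 2
    tail_energy = float(energy[top_k:].sum() / energy.sum())
    return {
        "erank": erank,                # entropy-based effective rank
        "tail_energy": tail_energy,    # share of squared spectrum outside top_k
        "sigma_max": float(s[0]),
    }
```

Logging this for the same layer before and after the optimizer's spectral map, under both Muon and Pion, would show whether the tail share actually grows under uniform whitening and shrinks under the high-pass iteration.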

Figures

Figures reproduced from arXiv: 2605.19282 by Chongyu Fan, Gaowen Liu, Mingyi Hong, Ramana Rao Kompella, Sijia Liu.

**Figure 1.** Limitations of Muon in VLA training (VLA-Adapter on LIBERO Object). (a) Average per-module gradient erank (V/L/A) …

**Figure 2.** (a) Gradient SNR of SFT vs. GRPO (AdamW, …

**Figure 3.** Visualization of $f(\sigma)$ in (6) over $\sigma \in [0, 1]$, with $f(\sigma) = \sigma$ shown as the identity reference. (a) $f_{\mathrm{NS}}^{t}$ denotes Muon’s NS iteration applied $t$ times. (b) $f_p^{t}$ denotes the Promotion polynomial $f_p$ (7) applied $t$ times. (c) $f_s^{t}$ denotes the Suppression polynomial $f_s$ (8) applied $t$ times. (d) Pion’s high-pass NS iteration (Alg. 2): $f_s^{k_s} \circ f_p^{k_p}$ applies $k_p$ Promotion steps followed by $k_s = 5 - k_p$ Suppression s…

**Figure 4.** Effect of per-head high-pass NS on RLVR (Qwen3-1.7B, GRPO on MATH levels 3–5). (a) MATH500 accuracy of AdamW, Muon (default vs. per-head), and Pion (default vs. per-head). (b) Cross-head Q-projection variance: before-RLVR weight $\mathrm{Var}(\|W^{h}_{0,Q}\|_F)$ (top) and after-RLVR update $\mathrm{Var}(\|W^{h}_{*,Q} - W^{h}_{0,Q}\|_F)$ for default vs. per-head Pion (bottom).

**Figure 5.** AdamW, Muon and Pion for VLA-Adapter on LIBERO. (a) Test success rates on LIBERO Object, Spatial, Goal and Long …

**Figure 6.** AdamW, Muon and Pion on RLVR: validation accuracy vs. training step across eight settings, spanning two algorithms …

**Figure 7.** Gradient SNR of Pion vs. AdamW (Qwen3-1.7B, GRPO on GSM8K). Pion succeeds while Muon collapses …

**Figure 8.** (a) Scalar map $f(\sigma)$ of LPMuon for $\sigma \in [0, 1]$. (b) Accuracy of AdamW, Pion, and LPMuon (Qwen3-1.7B, GRPO on GSM8K).

7 Conclusion. We identified two limitations of Muon beyond LLM pretraining: lack of rank adaptiveness in cross-modality VLA training, and lack of noise adaptiveness in RLVR post-training. To address them, we proposed Pion, a drop-in replacement for Muon’s NS iteration that uses a high-pass NS t…

Original abstract

Muon is a matrix-aware optimizer that leverages Newton-Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show it could lead to fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across $\ell_1$-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching 100% success rate on LIBERO Object after 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a $\pi_{0.5}$ backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K while Muon collapses to zero.
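The per-head mode mentioned in the abstract can be pictured with a short NumPy sketch. It assumes that each attention head owns a contiguous block of rows in the projection matrix, which is a layout assumption and not something this page confirms. `per_head` and `ns_quintic` are hypothetical names, and the default orthogonalizer is the classic quintic Newton-Schulz step (the same polynomial this page quotes as $f_p$), used here only as a stand-in for whichever spectral map the optimizer applies.

```python
import numpy as np


def ns_quintic(X, steps=5, eps=1e-7):
    """Classic quintic Newton-Schulz, batched over any leading axes."""
    # Each matrix in the batch is normalized by its own Frobenius norm.
    X = X / (np.linalg.norm(X, axis=(-2, -1), keepdims=True) + eps)
    for _ in range(steps):
        A = X @ np.swapaxes(X, -1, -2)
        X = 1.875 * X + (-1.25 * A + 0.375 * (A @ A)) @ X
    return X


def per_head(update, num_heads, orth=ns_quintic):
    """Apply `orth` to each head's row block independently via a reshape."""
    d_out, d_in = update.shape
    assert d_out % num_heads == 0, "rows must split evenly across heads"
    # Assumption: head h owns rows [h * d_head, (h + 1) * d_head).
    heads = update.reshape(num_heads, d_out // num_heads, d_in)
    return orth(heads).reshape(d_out, d_in)
```

Since the reshape only changes how the same buffer is viewed and the matrix products are batched, each head is normalized and filtered on its own spectrum, so a head with a small update is not rescaled relative to a head with a large one.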

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pion adds a high-pass Newton-Schulz variant and per-head mode to Muon that delivers measurable gains on VLA robot tasks and RLVR math benchmarks, though the spectral mechanism stays unverified by direct plots.

The main thing to know is that this paper takes Muon’s Newton-Schulz whitening, which works for LLM pretraining, and shows it can amplify noise in low-rank action gradients during VLA training and destabilize per-head specialization under low-SNR RLVR gradients. They replace the uniform iteration with a two-stage Promotion+Suppression high-pass version that anchors large singular values near 1 and pushes small ones toward 0, plus a cheap per-head reshape option. That combination is the actual novelty here, not a routine extension of prior Muon papers.

Referee Report

2 major / 2 minor

Summary. The paper claims that Muon’s uniform spectral whitening via Newton-Schulz iterations leads to fundamental limitations beyond pretraining: in VLA training, low-rank action-module gradients cause amplification of noisy tail directions; in RLVR, low-SNR gradients and per-head specialization needs make whitening unstable. It proposes Pion, a drop-in replacement that replaces uniform whitening with a two-stage Promotion+Suppression high-pass NS iteration to anchor dominant singular values at 1 while driving tails toward 0, with controllable filter strength and an optional per-head reshape mode. Experiments report consistent outperformance on LIBERO/LIBERO-Plus for VLA-Adapter and VLANeXt (e.g., 100% success on LIBERO Object after 1,500 steps vs. 97.0% Muon and 32.2% AdamW), real-robot Franka tasks, and RLVR post-training on Qwen3 models with GRPO/GMPO where Muon collapses.

Significance. If the empirical gains prove robust and the spectral mechanism is directly verified, this could meaningfully advance optimizer design for post-pretraining regimes in robotics and verifiable-reward RL. The work earns credit for the real-robot validation under the DROID setup, the per-head mode at no extra cost, and the explicit reporting of numerical improvements on named benchmarks and architectures. The high-pass design offers a practical, efficient remedy that preserves Muon’s computational profile while targeting domain-specific spectral issues.

major comments (2)

The central causal claim—that uniform NS whitening amplifies noisy tail singular values in low-rank VLA action gradients and destabilizes per-head specialization under low-SNR RLVR gradients, while the high-pass iteration selectively remedies this—lacks direct verification. No singular-value histograms, condition-number traces, or per-layer spectral plots from VLA-Adapter/VLANeXt runs on LIBERO or GRPO runs on MATH/GSM8K are provided, leaving open the possibility that reported gains (e.g., 100% vs. 97% success) arise from per-head reshape, learning-rate retuning, or other implementation details rather than the claimed spectral mechanism.
§4 (Experiments): while specific numerical improvements are reported across $\ell_1$-regression and flow-matching architectures, the manuscript gives insufficient detail on run-to-run variance, lacks a full ablation isolating the high-pass filter strength from the per-head mode, and omits controls confirming that the two-stage Promotion+Suppression iteration is the load-bearing factor. This weakens the link between the proposed remedy and the observed outperformance.

minor comments (2)

Abstract: the phrase 'controllable filter strength' is introduced without an explicit parameterization or default value; moving a short equation or pseudocode snippet for the high-pass iteration into the abstract or early method section would improve clarity.
Notation: ensure consistent use of 'NS iteration' vs. 'Newton-Schulz' and define all acronyms (VLA, RLVR, GRPO, GMPO) at first occurrence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for acknowledging the practical contributions of the work, including the real-robot validation and the per-head mode. We address each major comment below and agree that strengthening the direct verification of the spectral mechanism and expanding the experimental details will improve the manuscript. We will incorporate the suggested additions in the revised version.

Referee: The central causal claim—that uniform NS whitening amplifies noisy tail singular values in low-rank VLA action gradients and destabilizes per-head specialization under low-SNR RLVR gradients, while the high-pass iteration selectively remedies this—lacks direct verification. No singular-value histograms, condition-number traces, or per-layer spectral plots from VLA-Adapter/VLANeXt runs on LIBERO or GRPO runs on MATH/GSM8K are provided, leaving open the possibility that reported gains (e.g., 100% vs. 97% success) arise from per-head reshape, learning-rate retuning, or other implementation details rather than the claimed spectral mechanism.

Authors: We agree that direct spectral visualizations would provide stronger causal evidence and help rule out alternative explanations for the observed gains. Although the performance improvements are large, consistent across architectures, and include real-robot results, we acknowledge that the current manuscript relies primarily on end-task metrics. In the revision we will add singular-value histograms, condition-number traces, and per-layer spectral plots from representative VLA-Adapter and VLANeXt runs on LIBERO as well as GRPO runs on MATH/GSM8K. These plots will compare Muon and Pion directly, showing tail amplification under uniform whitening and selective suppression under the high-pass iteration, thereby isolating the spectral mechanism from the per-head reshape and other factors.
Revision: yes
Referee: §4 (Experiments): while specific numerical improvements are reported across $\ell_1$-regression and flow-matching architectures, the manuscript gives insufficient detail on run-to-run variance, lacks a full ablation isolating the high-pass filter strength from the per-head mode, and omits controls confirming that the two-stage Promotion+Suppression iteration is the load-bearing factor. This weakens the link between the proposed remedy and the observed outperformance.

Authors: We accept this critique and will expand §4 accordingly. The revised experiments section will report mean and standard deviation across at least three random seeds for all main results to quantify run-to-run variance. We will add a dedicated ablation table that varies the high-pass suppression strength while holding the per-head mode fixed, and a separate comparison of the per-head reshape mode with and without the high-pass iteration. In addition, we will include controls that disable either the Promotion or Suppression stage individually, confirming that the combined two-stage iteration is necessary for the reported gains. These changes will directly address the concern that other implementation details may be responsible for the improvements.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; new algorithmic proposal with direct empirical validation

The paper introduces Pion as an explicit modification to the Newton-Schulz iteration (Promotion+Suppression high-pass) to address claimed spectral issues in VLA and RLVR regimes. This design is presented by construction rather than derived from fitted data or prior results. Performance claims rest on reported success rates and accuracies from training runs on LIBERO, LIBERO-Plus, DROID, MATH, and GSM8K using VLA-Adapter, VLANeXt, and GRPO/GMPO setups. No load-bearing step reduces a prediction to a self-citation chain, renames a known result, or equates an output to an input parameter by definition. The argument is self-contained via the new iteration rule and external benchmark comparisons.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the validity of adapting Newton-Schulz iterations to produce a controllable high-pass spectral effect and on the empirical superiority observed on the reported tasks; one adjustable filter strength parameter is introduced.

free parameters (1)

filter strength
Controllable parameter that sets the degree of tail suppression in the high-pass NS iteration.
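This page does not state how the filter strength is parameterized. One plausible reading, consistent with the Figure 3 caption ($k_p$ Promotion steps followed by $k_s = 5 - k_p$ Suppression steps), is that the split between the two stages is the knob. The scalar sketch below explores that reading only. It uses the two polynomials quoted in the Lean-theorem section of this page; `cutoff` is a hypothetical helper that finds, by bisection, the input singular value that $k_p$ Promotion steps carry exactly onto $\sqrt{2/3}$, the point where the Suppression polynomial switches from lowering to raising.

```python
import math


def f_p(s):
    return 1.875 * s - 1.25 * s**3 + 0.375 * s**5   # Promotion


def f_s(s):
    return 2.5 * s**3 - 1.5 * s**5                  # Suppression


def high_pass_scalar(s, k_p, total_steps=5):
    """Scalar map: k_p Promotion steps, then the remaining Suppression steps."""
    for _ in range(k_p):
        s = f_p(s)
    for _ in range(total_steps - k_p):
        s = f_s(s)
    return s


def cutoff(k_p, iters=60):
    """Input sigma that k_p Promotion steps map onto sqrt(2/3)."""
    target = math.sqrt(2.0 / 3.0)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        v = mid
        for _ in range(k_p):
            v = f_p(v)          # f_p is monotone on [0, 1], so bisection is valid
        if v < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for k_p in range(5):
    print(k_p, round(cutoff(k_p), 4))
```

Under this reading, more Promotion steps move the cutoff toward smaller singular values (a gentler filter), while fewer Promotion steps keep it near $\sqrt{2/3}$ and leave more steps for Suppression (a harsher filter). With a finite number of Suppression steps the transition is steep but not a hard threshold.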

axioms (1)

domain assumption: Newton-Schulz iterations admit a two-stage promotion-suppression modification that produces a sharp high-pass spectral filter while preserving computational cost.
Core design premise for replacing uniform whitening with the high-pass variant.

invented entities (1)

Pion optimizer (no independent evidence)
purpose: Drop-in replacement for Muon that applies high-pass spectral filtering and optional per-head updates.
New algorithmic construct introduced to remedy the identified spectral failures.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Pion splits the NS iterations into a two-stage Promotion+Suppression sequence... $f_p(\sigma) = 1.875\sigma - 1.25\sigma^3 + 0.375\sigma^5$... $f_s(\sigma) = 2.5\sigma^3 - 1.5\sigma^5$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Paper passage: high-pass NS iteration... anchors dominant singular values at 1 while suppressing noisy tail components toward 0

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 28 internal anchors

[1]

ArXiv Preprint: 2504.05295 , Year =

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295,

work page arXiv
[2]

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

KTO: Model Alignment as Prospect Theoretic Optimization

10 Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

URLhttps://openreview.net/forum?id=4oOF4J2xSy. Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Neural thickets: Diverse task experts are dense around pretrained weights.arXiv preprint arXiv:2603.12228,

Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights.arXiv preprint arXiv:2603.12228,

work page arXiv
[9]

Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054,

Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054,

work page arXiv
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Chuan He, Zhanwang Deng, and Zhaosong Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training.arXiv preprint arXiv:2509.11983, 2025a. Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. Root: Robust orthogonalized optimizer for neural network training.arXiv preprin...

work page internal anchor Pith review Pith/arXiv arXiv
[12]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization.arXiv preprint arXiv:2501.03262,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generaliza- tion.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Blog post. Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Powering up zeroth-order training via subspace gradient orthogonalization.arXiv preprint arXiv:2602.17155, 2026

Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, and Sijia Liu. Powering up zeroth-order training via subspace gradient orthogonalization.arXiv preprint arXiv:2602.17155,

work page arXiv
[18]

Back to basics: Revisiting exploration in reinforcement learning for llm reasoning via generative probabilities.arXiv preprint arXiv:2602.05281,

Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, and Ivan Oseledets. Back to basics: Revisiting exploration in reinforcement learning for llm reasoning via generative probabilities.arXiv preprint arXiv:2602.05281,

work page arXiv
[19]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

11 Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024a. Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Ch...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

arXiv preprint arXiv:2310.10505 , year=

Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505,

work page arXiv
[21]

Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072,

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072,

work page arXiv
[22]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Length-unbiased sequence policy optimization: Revealing and controlling response length variation in rlvr.arXiv preprint arXiv:2602.05261,

Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, and Haibo Qiu. Length-unbiased sequence policy optimization: Revealing and controlling response length variation in rlvr.arXiv preprint arXiv:2602.05261,

work page arXiv
[24]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025a. Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspectiv...

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Unbiased gradient low-rank projection.arXiv preprint arXiv:2510.17802,

Rui Pan, Yang Luo, Yuxing Liu, Yang You, and Tong Zhang. Unbiased gradient low-rank projection.arXiv preprint arXiv:2510.17802,

work page arXiv
[26]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alexandre Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, and Sam Work

Nicolas Le Roux, Marc G Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, and Sam Work. Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms.arXiv preprint arXiv:2503.14286,

work page arXiv
[28]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005,

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005,

work page arXiv
[32]

Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization.arXiv preprint arXiv:2508.07629,

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization.arXiv preprint arXiv:2508.07629,

work page arXiv
[33]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL

Haoxuan Wang, Gengyu Zhang, Yan Yan, Yuzhang Shang, Ramana Rao Kompella, and Gaowen Liu. Real-time robot execution with masked action chunking. InInternational Conference on Learning Representations (ICLR), 2026a. Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai. Aspo: Asymmetric importance sampling policy opti...

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Vla-adapter: An effective paradigm for tiny-scale vision-language-action model

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. InProceedings of the AAAI conference on artificial intelligence, volume 40, pp. 18638–18646, 2026b. Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and ...

work page arXiv
[36]

Vlanext: Recipes for building strong vla models.arXiv preprint arXiv:2602.18532,

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, and Chen Change Loy. Vlanext: Recipes for building strong vla models.arXiv preprint arXiv:2602.18532,

work page arXiv
[37]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837,

[40]

Vlm4vla: Revisiting vision-language-models in vision-language-action models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309,

[41]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025a.

Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. On the design of kl-regularized policy gr...

[42]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a.

Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy rl reach with stale data on llms? arXiv preprint arXiv:2510.0116...

[43]

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925,

[44]

The path not taken: RLVR provably learns off the principals

Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, et al. The path not taken: Rlvr provably learns off the principals. arXiv preprint arXiv:2511.08567,

[45]

A.2 RLVR training: GRPO and GMPO

Appendix
A Additional Preliminaries: VLA Training and RLVR Training
A.1 VLA action heads and training objectives
A.2 RLVR training: GRPO and GMPO
B Low-rank Muon (LRMuon) Algorithm
C SNR Analysis for SFT an...

[46]

In our experiments (Sec

2 2 i , (A2) where t ∼ U(0,1) denotes the uniform distribution over the interpolation timestep. In our experiments (Sec. 6), the ℓ1-regression head is instantiated by VLA-Adapter (Wang et al., 2026b) and the flow-matching head by VLANeXt (Wu et al., 2026).

A.2 RLVR training: GRPO and GMPO

We expand here on the three-stage RLVR loop sketched in Sec

[47]

on LIBERO (Liu et al., 2023), with VLANeXt additionally evaluated on the perturbed LIBERO-Plus split (Fei et al., 2025); the Object suite converges faster and is allocated fewer training steps. Table A2 summarizes the RLVR hyperparameters, reused across both RL algorithms (GRPO/GMPO) and both model scales (Qwen3-1.7B/4B); only the prompt/response length, trai...

[48]

Table A1: Training hyperparameters for the VLA experiments on the LIBERO benchmark

is finetuned under the DROID hardware platform (Khazatsky et al., 2025; Wang et al., 2026a) and evaluated on three grasp-and-place tasks. Table A1: Training hyperparameters for the VLA experiments on the LIBERO benchmark. The three optimizer configurations (i)–(iii) are applied identically to both models, and share all other hyperparameters listed in this...

[49]

Each panel anchors the pass band (|σ| ≤ τ) at ±1 and contracts the stop band (|σ| > τ) toward 0

In the actual SVD update only the nonnegative half σ_in ∈ [0,1] is applied to singular values; the plotted negative half visualizes the antisymmetric extension. Each panel anchors the pass band (|σ| ≤ τ) at ±1 and contracts the stop band (|σ| > τ) toward 0.

Table A7: Fitted coefficients θ̂(τ) = {(a_{1,k}, a_{3,k}, a_{5,k})}_{k=1}^{5} of the 5-step odd-quintic composition (...

