arxiv: 2605.16154 · v1 · pith:SF55K56Knew · submitted 2026-05-15 · 💻 cs.LG · cs.RO

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

Pith reviewed 2026-05-20 21:02 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords: vision-language-action, reinforcement learning, GRPO, efficient RL, chunk masking, probabilistic selection, LIBERO benchmarks

The pith

Probabilistic Chunk Masking concentrates GRPO's gradient computation on the trajectory phases where the actions of successful and failed rollouts diverge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient computation dominates wall-clock time in GRPO-based reinforcement learning for vision-language-action policies because advantages are assigned uniformly across trajectories. PCM scores semantic phases by success-failure action variance as a proxy for per-phase gradient variance and then samples a fixed budget of chunks according to online-updated keep probabilities. This drop-in change requires no reward model or critic. A reader should care because it shows how to redirect the bulk of backpropagation away from phases the policy already handles after pre-training and supervised fine-tuning while preserving the overall learning signal.
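The scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not give the exact formula for success-failure action variance, so the sketch assumes one plausible reading (the squared gap between the mean action of successful and failed rollouts within a phase, averaged over action dimensions). The rollout layout, the function name `phase_scores`, and the phase labels `"reach"` and `"grasp"` are hypothetical.

```python
from collections import defaultdict


def phase_scores(rollouts):
    """Score each phase by the gap between success and failure mean actions.

    ASSUMPTION: each rollout is a dict with keys "success" (bool),
    "actions" (one action vector per chunk) and "phases" (one phase label
    per chunk). The score below is an illustrative stand-in for the
    paper's success-failure action variance, whose formula is not given here.
    """
    # acc[phase][is_success] = [running sum vector, chunk count]
    acc = defaultdict(lambda: {True: [None, 0], False: [None, 0]})
    for rollout in rollouts:
        for action, phase in zip(rollout["actions"], rollout["phases"]):
            slot = acc[phase][rollout["success"]]
            if slot[0] is None:
                slot[0] = [0.0] * len(action)
            slot[0] = [s + a for s, a in zip(slot[0], action)]
            slot[1] += 1

    scores = {}
    for phase, groups in acc.items():
        (s_sum, s_n), (f_sum, f_n) = groups[True], groups[False]
        if s_n == 0 or f_n == 0:
            # No contrast available when one outcome group is empty.
            scores[phase] = 0.0
            continue
        s_mean = [v / s_n for v in s_sum]
        f_mean = [v / f_n for v in f_sum]
        scores[phase] = sum((a - b) ** 2 for a, b in zip(s_mean, f_mean)) / len(s_mean)
    return scores


# Toy data: the two rollouts agree during "reach" and differ during "grasp".
toy_rollouts = [
    {"success": True, "actions": [[0.0, 0.0], [1.0, 1.0]], "phases": ["reach", "grasp"]},
    {"success": False, "actions": [[0.0, 0.0], [0.0, 0.0]], "phases": ["reach", "grasp"]},
]
toy_scores = phase_scores(toy_rollouts)
```

On the toy data, the phase where both rollouts act identically scores zero and the phase where they differ scores higher, which is the ordering a variance-based mask would need.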

Core claim

PCM is a modification to GRPO that formalizes per-phase gradient variance as the quantity that determines where computation is useful and demonstrates that success-failure action variance provides a measurable, rollout-derived proxy for it, allowing probabilistic selection of a small subset of chunks for backpropagation.

What carries the argument

Probabilistic Chunk Masking, which assigns keep probabilities to semantic phases based on success-failure action variance and backpropagates gradients only through a budgeted, probabilistically chosen subset of trajectory chunks per rollout.
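A budgeted, probabilistic selection of chunks might look like the sketch below. It assumes sequential weighted sampling without replacement, with each chunk weighted by the keep probability of its phase; the summary does not state which sampling scheme the paper uses, so the scheme, the function name `sample_chunk_mask`, and the `min_weight` floor are all assumptions.

```python
import random


def sample_chunk_mask(chunk_phase, keep_prob, budget, rng, min_weight=1e-6):
    """Return a boolean mask with exactly min(budget, n_chunks) chunks kept.

    chunk_phase: phase label for each chunk of one trajectory.
    keep_prob:   mapping from phase label to keep probability.
    ASSUMPTION: chunks are drawn one at a time without replacement, with
    probability proportional to their phase's keep probability.
    """
    remaining = list(range(len(chunk_phase)))
    # The floor keeps every weight positive so random.choices never sees
    # an all-zero weight vector.
    weights = [max(keep_prob[chunk_phase[i]], min_weight) for i in remaining]
    kept = set()
    for _ in range(min(budget, len(remaining))):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        kept.add(remaining.pop(pick))
        weights.pop(pick)
    return [i in kept for i in range(len(chunk_phase))]


rng = random.Random(0)
mask = sample_chunk_mask(
    chunk_phase=["reach", "reach", "grasp", "grasp", "grasp"],
    keep_prob={"reach": 0.1, "grasp": 0.9},
    budget=2,
    rng=rng,
)
```

Only chunks whose mask entry is `True` would be passed through the backward pass; the rest would be skipped. Because selection is random rather than top-k, low-scoring phases still receive occasional gradient, which is one reason to prefer sampling over a hard cutoff.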

If this is right

  • Backpropagation occurs through fewer than 20 percent of trajectory chunks per rollout.
  • Wall-clock time per training step falls by a factor of 2.38 on the tested LIBERO benchmarks.
  • Gradient updates run 4.8 times faster with 60 percent lower peak activation memory.
  • Final task success rates match those of unmodified GRPO.
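As a rough consistency check on these figures, the sketch below applies Amdahl's law to the abstract's reported time split (about 78% gradient computation) and the reported 4.8× gradient-update speedup, and recomputes the memory reductions from the values in the Figure 4 caption. It assumes the 4.8× factor applies to the whole gradient share and that all other time is unchanged, which the paper does not claim; the function names are hypothetical.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only part of the step time is accelerated."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (accelerated_fraction / factor + untouched)


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# ASSUMPTION: the full 78% gradient share is accelerated by 4.8x.
upper_bound = amdahl_speedup(0.78, 4.8)       # about 2.61

# Values quoted in the Figure 4 caption (GB per step).
activation_cut = percent_reduction(10.1, 4.1)  # about 59.4
peak_cut = percent_reduction(39.7, 33.6)       # about 15.4
```

Under these assumptions the idealized bound is about 2.61×, slightly above the reported 2.38×. A gap in that direction is what one would expect if masking adds some overhead or if part of the gradient share is not accelerated.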

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-based selection could be applied to other long-horizon RL settings where early or terminal phases become stable after initial training.
  • Replacing or augmenting the action-variance proxy with additional cheap statistics might further reduce the kept fraction without extra learned models.
  • The results imply that much of the post-training compute in fine-tuned sequential policies is spent on phases that no longer contribute new signal.

Load-bearing premise

Success-failure action variance serves as a reliable proxy for per-phase gradient variance so that masking low-variance chunks preserves the overall learning signal.

What would settle it

A head-to-head comparison on a new VLA task or benchmark where PCM produces a materially lower final success rate than standard GRPO after the same number of updates would show that the variance proxy fails to preserve the necessary learning signal.

Figures

Figures reproduced from arXiv: 2605.16154 by Nikshep Grampurohit, Pulkit Verma, Vaidehi Bagaria.

Figure 1: Left: per-phase success–failure action variance (…
Figure 2: Probabilistic Chunk Masking (PCM). We compute per-phase success–failure action variance…
Figure 3: Success rate as a function of wall-clock time
Figure 4: Efficiency on LIBERO-Object at B=12. (a) Wall-clock time to first reach each SR threshold; the PCM/GRPO gap widens at higher thresholds. (b) Activation memory is reduced by 60% (10.1 → 4.1 GB per step); peak GPU memory is reduced by 15% (39.7 → 33.6 GB). (c) Cumulative actor-update time over 200 training steps is 4.8× faster.
Figure 5: Ablations on LIBERO-Object. (a) Chunk budget sweep shows that B=12 preserves final success while retaining most of the wall-clock speedup. (b) At fixed B=12, PCM outperforms random masking and highest-variance-phase selection.
RQ3: Sensitivity to Chunk Budget B. PCM samples a fixed budget of B chunks per trajectory. Smaller budgets improve per-step efficiency but reduce gradient coverage, while larger budg…
Figure 6 shows PCM’s realized gradient allocation across training, revealing three mechanisms that together explain its performance.
Figure 7: (a) Success vs. training steps for vanilla GRPO vs. branching GRPO. (b) Action entropy…
Figure 8: Cumulative Cc captured as a function of the fraction of trajectory chunks retained, with chunks sorted in descending order by phase-level Cc. The solid curve shows the empirical cumulative Cc, while the dashed diagonal shows the uniform baseline under which Cc would scale linearly with chunk count. The gap between the two reflects the uneven distribution of learning signal across the trajectory: a small fr…
Figure 9: Successful and failed manipulation rollouts look similar for most of the trajectory, diverging…
Original abstract

Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38 times wall-clock speedup, 4.8 times faster gradient updates, and 60% lower peak activation memory, while backpropagating through fewer than 20% of trajectory chunks.
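The abstract's point that GRPO assigns one advantage to every chunk of a rollout can be made concrete with a small sketch. It assumes the common group-relative form (reward minus group mean, divided by group standard deviation) with a binary success reward and a population standard deviation; implementations differ on these details, and the function names are hypothetical.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: one scalar per rollout in the group.

    ASSUMPTION: population standard deviation with a small epsilon;
    some implementations normalize differently.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def per_chunk_advantages(rewards, chunks_per_rollout):
    """Copy each rollout's scalar advantage onto all of its chunks."""
    advantages = grpo_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, chunks_per_rollout)]


# Four rollouts of one task: two successes, two failures.
rewards = [1.0, 0.0, 1.0, 0.0]
chunk_adv = per_chunk_advantages(rewards, chunks_per_rollout=[3, 3, 2, 4])
```

Every chunk in a rollout receives the same value, so the advantage alone cannot tell an already-solved phase from a decisive one. That is the gap a per-phase score is meant to fill.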

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Probabilistic Chunk Masking (PCM) as a drop-in modification to GRPO for post-training vision-language-action (VLA) policies. PCM uses success-failure action variance, derived from rollouts, as a proxy for per-phase gradient variance to probabilistically select and mask trajectory chunks according to a fixed chunk budget and online-updated keep probabilities. This reduces backpropagation to fewer than 20% of chunks. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while reporting 2.38x wall-clock speedup, 4.8x faster gradient updates, and 60% lower peak activation memory.

Significance. If the proxy reliably identifies low-contribution phases without discarding learning signal, the method could meaningfully lower the compute barrier for GRPO-style VLA RL. The approach is attractive because it requires no additional critic or reward model and targets the dominant gradient-computation cost (reported at ~78% of wall-clock time) rather than rollout collection.

major comments (2)
  1. The central claim that success-failure action variance is a reliable proxy for per-phase gradient variance (and therefore that low-variance chunks can be masked with negligible impact on the actor gradient) is load-bearing but supported only indirectly. The manuscript reports matching final success rates on three LIBERO benchmarks but provides no direct measurement of the correlation between the proxy and actual per-phase gradient variance, nor any ablation showing that the total gradient norm or learning signal is preserved after masking >80% of chunks.
  2. Experiments section: variance across random seeds, statistical significance tests, and exact baseline implementations (including whether GRPO uses the same chunk segmentation) are not reported. Without these, it is difficult to assess whether the observed speedups and matched success rates are robust or could be explained by longer effective training or task-specific artifacts.
minor comments (2)
  1. Clarify the precise definition and computation of 'phase-level keep probabilities' and how they are updated online; the current description leaves open whether these introduce additional hyperparameters that must be tuned per task.
  2. The abstract states that 'GRPO assigns the same advantage to every chunk,' but the manuscript should explicitly state whether this is an implementation choice or a fundamental property of the GRPO formulation used.
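The direct measurement requested in major comment 1 could take the form of a rank correlation between the proxy and a measured per-phase gradient variance. The sketch below is one possible version using Spearman's coefficient in pure Python; the choice of Spearman over another statistic is an assumption, and the two input lists would have to come from actual measurements that this page does not provide.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(proxy_per_phase, grad_var_per_phase):
    """Rank correlation between the proxy and measured gradient variance."""
    return pearson(average_ranks(proxy_per_phase), average_ranks(grad_var_per_phase))
```

A coefficient near 1 across phases would support the proxy's ordering of phases. It would not by itself show that the masked gradient preserves the full learning signal, which is the second half of the referee's concern.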

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: The central claim that success-failure action variance is a reliable proxy for per-phase gradient variance (and therefore that low-variance chunks can be masked with negligible impact on the actor gradient) is load-bearing but supported only indirectly. The manuscript reports matching final success rates on three LIBERO benchmarks but provides no direct measurement of the correlation between the proxy and actual per-phase gradient variance, nor any ablation showing that the total gradient norm or learning signal is preserved after masking >80% of chunks.

    Authors: We agree that direct evidence would strengthen the central claim. The manuscript formalizes per-phase gradient variance and presents success-failure action variance as a measurable proxy, with the matching final success rates on LIBERO serving as indirect validation that learning signal is retained. To address the concern directly, we will add in revision: (1) a correlation analysis between the proxy and per-phase gradient variance computed on held-out trajectories, and (2) an ablation measuring total gradient norm and effective learning signal before versus after masking. These additions will be included in the experiments section. revision: yes

  2. Referee: Experiments section: variance across random seeds, statistical significance tests, and exact baseline implementations (including whether GRPO uses the same chunk segmentation) are not reported. Without these, it is difficult to assess whether the observed speedups and matched success rates are robust or could be explained by longer effective training or task-specific artifacts.

    Authors: We acknowledge that these details improve clarity and robustness assessment. The reported results were obtained from multiple random seeds, but seed-wise variance and significance tests were omitted from the initial submission. In revision we will report means and standard deviations across 3–5 seeds, include statistical significance tests for the success-rate comparisons, and explicitly state that the GRPO baseline uses identical chunk segmentation and training hyperparameters for fair comparison. We will also add a brief note confirming that training duration was matched across methods. revision: yes
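The seed-level reporting promised in response 2 could be computed along the lines below. This is a sketch under assumptions: it uses Welch's unequal-variance t statistic, which the rebuttal does not name, and it stops at the statistic and degrees of freedom because a p-value needs a t-distribution routine outside the standard library. The function names are hypothetical.

```python
import math
import statistics


def summarize(success_rates):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(success_rates), statistics.stdev(success_rates)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Requires at least two seeds per method and nonzero variance in at
    least one of the two samples.
    """
    sa = statistics.variance(a) / len(a)
    sb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (len(a) - 1) + sb ** 2 / (len(b) - 1))
    return t, df
```

With only 3 to 5 seeds per method such a test has little power, and a non-significant difference is not on its own evidence that two methods are equivalent. An equivalence claim would need a pre-specified margin.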

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper's core contribution is a heuristic masking strategy that uses rollout-derived success-failure action variance as an empirical proxy for per-phase gradient variance. This proxy is computed directly from collected trajectories and is not obtained by fitting a parameter to the target outcome or by renaming the result itself. The formalization of gradient variance is presented as a separate analytical step, and the masking decision is an online probabilistic allocation rather than a self-referential definition. Empirical results on LIBERO benchmarks are reported as external validation of the efficiency gains, with no load-bearing step reducing by construction to a fitted input or self-citation chain. The derivation remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that action variance between success and failure rollouts proxies per-phase gradient variance; chunk budget and phase keep probabilities are hyperparameters.

free parameters (2)
  • chunk budget
    Fixed number of chunks kept per trajectory for gradient computation; chosen as hyperparameter.
  • phase-level keep probabilities
    Online-updated probabilities for sampling which phases to keep; initialized and adapted during training.
axioms (1)
  • domain assumption: Success-failure action variance is a measurable proxy for per-phase gradient variance
    Invoked to justify selective masking without loss of learning signal.
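To make the second free parameter concrete, the sketch below shows one way phase-level keep probabilities could be updated online. The update rule is not given on this page, so everything here is an assumption: an exponential moving average of phase scores, normalized into probabilities, with a uniform floor so that no phase is starved of gradient. The names `update_keep_probs`, `momentum`, and `floor` are hypothetical.

```python
def update_keep_probs(ema_scores, new_scores, momentum=0.9, floor=0.05):
    """One online update of per-phase keep probabilities.

    ASSUMPTION: scores are non-negative and smoothed with an exponential
    moving average; a fraction `floor` of the probability mass is spread
    uniformly so every phase stays reachable.
    Returns (updated EMA scores, keep probabilities summing to 1).
    """
    ema = [momentum * old + (1.0 - momentum) * new
           for old, new in zip(ema_scores, new_scores)]
    n = len(ema)
    total = sum(ema)
    if total <= 0.0:
        # No contrast observed yet: fall back to uniform selection.
        return ema, [1.0 / n] * n
    probs = [floor / n + (1.0 - floor) * score / total for score in ema]
    return ema, probs


ema, probs = update_keep_probs([1.0, 1.0], [0.0, 4.0], momentum=0.5, floor=0.1)
```

In this sketch `momentum` and `floor` are extra hyperparameters on top of the chunk budget, which is the tuning question the referee's first minor comment raises.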

pith-pipeline@v0.9.0 · 5863 in / 1186 out tokens · 42782 ms · 2026-05-20T21:02:15.344069+00:00 · methodology
