PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models
Pith reviewed 2026-05-20 05:08 UTC · model grok-4.3
The pith
PAPO-VLA weights GRPO advantages by the causal importance of planning actions to raise VLA reliability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that identifying planning actions from joint action variation and trajectory outcome, scoring them by causal sufficiency and necessity, and folding those scores into GRPO advantage estimation produces more reliable VLA policies than methods that treat every action uniformly or optimize only at the trajectory level.
What carries the argument
The causal importance estimator that computes sufficiency and necessity scores for planning actions and modulates GRPO advantages accordingly.
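A minimal sketch of how such a modulation could look, assuming a multiplicative weight of the form (1 + alpha * importance) applied to the usual group-standardized GRPO advantage. The function names, the `alpha` parameter, and the weighting form are illustrative assumptions; the source does not give the paper's exact formula.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize trajectory rewards within one group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def importance_weighted_advantages(rewards, importance, alpha=1.0):
    """Spread each trajectory advantage over its steps, scaled per step.

    rewards:    shape (G,), one scalar outcome per trajectory in the group.
    importance: shape (G, T), hypothetical per-step scores in [0, 1];
                zero for steps treated as routine execution.
    alpha:      hypothetical strength of the planning emphasis.
    """
    adv = grpo_advantages(rewards)                    # shape (G,)
    importance = np.asarray(importance, dtype=float)  # shape (G, T)
    return adv[:, None] * (1.0 + alpha * importance)  # shape (G, T)

# Four trajectories of five steps; step 2 is flagged as a planning action.
rewards = [1.0, 0.0, 1.0, 0.0]
importance = np.zeros((4, 5))
importance[:, 2] = 0.8
weighted = importance_weighted_advantages(rewards, importance)
```

With this form, steps with zero importance keep the plain trajectory-level advantage, so the sign and the group-level signal are preserved and only the magnitude at flagged steps changes.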
If this is right
- Planning actions receive larger policy-gradient signals than routine execution actions while the trajectory-level reward signal remains unchanged.
- Success rates on language-conditioned manipulation benchmarks rise when optimization pressure concentrates on the steps that change execution direction.
- The same trajectory can be reused for both global outcome feedback and local emphasis on its most consequential actions.
Where Pith is reading between the lines
- The same causal-weighting step could be inserted into other advantage-based reinforcement-learning algorithms used for sequential robotic control.
- If future work supplies an explicit dynamics model, the necessity and sufficiency calculations could be replaced by forward simulation to test whether data-driven identification is the limiting factor.
- Tasks whose early actions create irreversible state changes would be the clearest setting in which to measure whether the added emphasis improves sample efficiency.
Load-bearing premise
Planning actions can be identified and given reliable causal importance scores using only the observed actions and final trajectory outcomes.
What would settle it
Run the method on a dataset of trajectories in which action changes are deliberately uncorrelated with task outcomes; if performance falls below a standard GRPO baseline, the identification and scoring step does not isolate true planning actions.
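Before running the full experiment, a cheap sanity check on the data itself may help. The sketch below builds synthetic flags and outcomes (not the paper's data) and compares the success rate with and without a flagged action change. The `outcome_contrast` helper and all probabilities are assumptions chosen for illustration.

```python
import numpy as np

def outcome_contrast(flags, success):
    """Empirical P(success | flagged) - P(success | not flagged)."""
    flags = np.asarray(flags, dtype=bool)
    success = np.asarray(success, dtype=float)
    return success[flags].mean() - success[~flags].mean()

rng = np.random.default_rng(0)
n = 20000
flags = rng.random(n) < 0.3  # hypothetical "action changed" indicator

# Null dataset: outcomes drawn independently of the flags.
success_null = rng.random(n) < 0.5
# Dependent dataset: flagged trajectories succeed more often.
success_dep = rng.random(n) < np.where(flags, 0.8, 0.4)

null_contrast = outcome_contrast(flags, success_null)
dep_contrast = outcome_contrast(flags, success_dep)
```

On the null dataset the contrast should sit near zero, so any importance score derived from it carries no real signal; that is the regime in which the proposed test would compare against standard GRPO.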
Original abstract
Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.
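The abstract does not specify how action variation is measured. As a hedged illustration, the sketch below covers only the action-variation half of the identification step, using the L2 norm of the change between consecutive actions and a fixed threshold. Both choices, and the names `action_variation` and `candidate_planning_steps`, are assumptions rather than the paper's definition; the outcome-based filtering is not shown.

```python
import numpy as np

def action_variation(actions):
    """L2 norm of the change between consecutive actions; step 0 gets 0."""
    actions = np.asarray(actions, dtype=float)
    deltas = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    return np.concatenate([[0.0], deltas])

def candidate_planning_steps(actions, threshold):
    """Indices of steps whose variation exceeds a hypothetical threshold."""
    return np.flatnonzero(action_variation(actions) > threshold)

# A 2-D action sequence that drifts slowly, then turns sharply at step 3.
actions = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.2, 0.0],
    [0.2, 1.0],
    [0.2, 1.1],
])
steps = candidate_planning_steps(actions, threshold=0.5)
```

Here only the sharp change in direction is flagged, while the small smooth increments are treated as execution.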
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PAPO-VLA, a method for optimizing Vision-Language-Action (VLA) models in robotic manipulation tasks. It argues that VLA policies function as both planners (making task-oriented decisions that alter execution direction) and executors (realizing decisions via continuous actions). The approach identifies planning actions by jointly considering action variation and trajectory outcome, estimates their importance using causal sufficiency and necessity metrics, and incorporates these importance scores into GRPO advantage estimation. This emphasizes optimization on key planning actions while retaining trajectory-level feedback. Experiments on multiple benchmarks are presented to show improved reliability.
Significance. If the causal sufficiency/necessity scores can be shown to isolate genuine contributions of planning actions from observational closed-loop data, the method would offer a targeted way to improve VLA reliability by directing gradient emphasis toward high-impact decisions without discarding full-trajectory signals. The explicit planner-executor distinction and the GRPO integration are conceptually clean; reproducible code or machine-checked derivations would further strengthen the contribution.
major comments (2)
- [Method (causal importance estimation)] The causal sufficiency and necessity scoring of planning actions (described in the method for importance estimation) is performed on closed-loop trajectories collected under a single policy. Action variation is therefore entangled with preceding states, subsequent actions, and collection biases; standard observational estimators require assumptions (no unobserved confounders, ignorability, or completeness of the action space) that are not shown to hold. This is load-bearing for the central claim that the reweighted advantages improve reliability rather than amplify spurious correlations. A sensitivity analysis or comparison against interventional data would be needed to substantiate the scores.
- [Experiments] The experimental section reports overall benchmark gains but does not include an ablation that isolates the contribution of the planning-action importance weighting from the base GRPO objective. Without this, it is difficult to attribute improvements specifically to the causal reweighting rather than other implementation choices.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the formal definitions of causal sufficiency and necessity as applied to the identified planning actions.
- [Method] Notation for action variation and trajectory outcome in the planning-action identification step should be introduced with a clear equation or pseudocode for reproducibility.
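For reference, the standard counterfactual definitions of the probability of necessity (PN) and probability of sufficiency (PS) from Pearl's Causality are written out below. Whether the paper adopts exactly these definitions is an assumption; the symbols x (the planning action taken), x' (an alternative action), y (task success), and y' (task failure) are introduced here for illustration only.

```latex
% Counterfactual definitions (Pearl, Causality).
\begin{align}
\mathrm{PN} &= P\bigl(Y_{x'} = y' \mid X = x,\; Y = y\bigr), \\
\mathrm{PS} &= P\bigl(Y_{x} = y \mid X = x',\; Y = y'\bigr).
\end{align}
% Identified forms, valid only under exogeneity and monotonicity.
\begin{align}
\mathrm{PN} &= \frac{P(y \mid x) - P(y \mid x')}{P(y \mid x)}, \\
\mathrm{PS} &= \frac{P(y \mid x) - P(y \mid x')}{1 - P(y \mid x')}.
\end{align}
```

The second pair shows why the first major comment matters: the observational expressions coincide with the counterfactual quantities only when exogeneity and monotonicity hold, and closed-loop data collected under a single policy does not guarantee either.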
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects regarding the causal validity of our importance estimation and the need for more targeted ablations. We address each point below and have incorporated revisions to strengthen the presentation and validation of our method.
- Referee: [Method (causal importance estimation)] The causal sufficiency and necessity scoring of planning actions (described in the method for importance estimation) is performed on closed-loop trajectories collected under a single policy. Action variation is therefore entangled with preceding states, subsequent actions, and collection biases; standard observational estimators require assumptions (no unobserved confounders, ignorability, or completeness of the action space) that are not shown to hold. This is load-bearing for the central claim that the reweighted advantages improve reliability rather than amplify spurious correlations. A sensitivity analysis or comparison against interventional data would be needed to substantiate the scores.
Authors: We appreciate the referee's emphasis on the challenges of causal inference in observational settings. Our approach identifies planning actions by jointly considering action variation (to detect decisions that change execution direction) and trajectory outcome, which provides a heuristic for focusing on high-impact actions within the closed-loop data. While we do not claim to have performed full causal discovery or satisfied all ignorability assumptions, the causal sufficiency and necessity metrics are used as proxies to quantify importance based on how the action influences the outcome under the observed policy. To address the concern, we have added a sensitivity analysis in the revised manuscript, where we vary the importance scores by adding noise and evaluate the robustness of the performance gains. We also include a discussion of the observational nature of the data and potential limitations in Section 5. We believe this strengthens the claim without requiring new interventional experiments. revision: partial
- Referee: [Experiments] The experimental section reports overall benchmark gains but does not include an ablation that isolates the contribution of the planning-action importance weighting from the base GRPO objective. Without this, it is difficult to attribute improvements specifically to the causal reweighting rather than other implementation choices.
Authors: We agree that an ablation isolating the importance weighting is valuable for clarifying the contribution of our method. In the revised version, we have added an ablation study comparing the full PAPO-VLA (with causal importance weighting) against a baseline that applies GRPO with uniform advantages across all actions. The results demonstrate that the weighting leads to improved success rates on tasks where planning decisions are critical, such as those involving long-horizon manipulation. This ablation is now presented in Table X and discussed in the experiments section. revision: yes
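The noise-based sensitivity analysis mentioned in the first response can be sketched as follows, assuming Gaussian noise, scores clipped to [0, 1], and a multiplicative weight (1 + alpha * score). None of these details is stated in the rebuttal, and the helper names are hypothetical. The sketch measures only how far the per-step weights move; assessing the effect on success rates would require retraining under each noise level.

```python
import numpy as np

def perturb_importance(importance, sigma, rng):
    """Add zero-mean Gaussian noise and clip scores back into [0, 1]."""
    noisy = importance + rng.normal(0.0, sigma, size=importance.shape)
    return np.clip(noisy, 0.0, 1.0)

def weight_shift(importance, sigma, rng, alpha=1.0):
    """Mean absolute change in the assumed multiplier (1 + alpha * score)."""
    noisy = perturb_importance(importance, sigma, rng)
    return float(np.mean(np.abs(alpha * (noisy - importance))))

rng = np.random.default_rng(0)
# Hypothetical importance scores for 8 trajectories of 20 steps each.
importance = rng.uniform(0.0, 1.0, size=(8, 20))
shifts = {sigma: weight_shift(importance, sigma, rng) for sigma in (0.0, 0.05, 0.2)}
```

Comparing `shifts` across noise levels gives a rough sense of how much perturbation the weights absorb before the planning emphasis is effectively randomized.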
Circularity Check
No significant circularity: method introduces independent causal scoring and reweighting on top of standard GRPO.
The paper proposes a sequential pipeline—identify planning actions from variation plus outcome, compute causal sufficiency/necessity importance scores, then reweight GRPO advantages—whose central steps add explicit computations (action-variation detection and causal metrics) that are not definitionally equivalent to the raw trajectories or the base GRPO objective. Effectiveness is shown via benchmark experiments rather than by algebraic reduction of the final loss to the input data. No self-citation chain, fitted-parameter-as-prediction, or ansatz-smuggling is present in the described derivation; the approach remains self-contained against external validation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Planning actions can be distinguished from execution actions by variation and outcome statistics alone.