
arXiv: 2505.17016 · v1 · pith:6XRESUBBnew · submitted 2025-05-22 · 💻 cs.LG · cs.AI · cs.CV · cs.RO

Interactive Post-Training for Vision-Language-Action Models

Pith reviewed 2026-05-21 14:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.RO
keywords Vision-Language-Action Models · Reinforcement Learning · Post-Training · Sparse Binary Rewards · Policy Optimization · Robotics

The pith

RIPT-VLA uses reinforcement learning with only binary success rewards to lift vision-language-action models from 4 percent to 97 percent success using one demonstration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RIPT-VLA, a reinforcement-learning-based interactive post-training approach for fine-tuning pretrained vision-language-action (VLA) models. It relies on dynamic rollout sampling and leave-one-out advantage estimation to optimize policies from sparse binary success rewards rather than large expert datasets. With a single demonstration, it takes a supervised fine-tuned (SFT) model from 4 percent to 97 percent success within 15 iterations, and it raises the 7B OpenVLA-OFT model to 97.5 percent success. The resulting policies are reported to generalize across tasks and to remain robust to changes in starting conditions.

Core claim

RIPT-VLA is a simple reinforcement-learning paradigm for interactive post-training of vision-language-action models that uses only sparse binary success rewards. It achieves stable policy optimization through dynamic rollout sampling and leave-one-out advantage estimation, delivering large gains on both lightweight and 7B-scale models with minimal demonstrations.

What carries the argument

Dynamic rollout sampling combined with leave-one-out advantage estimation, which stabilizes policy updates from binary success signals during interactive post-training.
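As a rough illustration of the two pieces named above, the sketch below computes leave-one-out advantages for a group of binary rollout outcomes and drops groups whose outcomes are all identical. It is a minimal sketch, not the authors' implementation: the function names, the group size, and the reading of "dynamic rollout sampling" as discarding uniform-outcome groups are assumptions, since this page only names the techniques.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Advantage of each rollout: its reward minus the mean reward of the others."""
    r = np.asarray(rewards, dtype=float)
    k = r.size
    if k < 2:
        raise ValueError("need at least two rollouts per group")
    baseline = (r.sum() - r) / (k - 1)  # mean of the other k-1 rewards
    return r - baseline

def keep_informative_groups(groups):
    """Assumed form of dynamic sampling: skip groups with identical outcomes.

    Such groups yield all-zero advantages and therefore no gradient signal.
    """
    kept = []
    for rewards in groups:
        adv = leave_one_out_advantages(rewards)
        if np.any(adv != 0.0):
            kept.append((list(rewards), adv))
    return kept

# Hypothetical binary outcomes for three task contexts, four rollouts each.
groups = [[0, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]
for rewards, adv in keep_informative_groups(groups):
    print(rewards, adv)
```

In this toy input only the mixed-outcome group survives; its single success receives advantage 1 and each failure receives -1/3.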

If this is right

  • The learned policies generalize across different tasks and scenarios.
  • Performance stays robust to variations in initial state context.
  • The method applies to both the lightweight QueST model (a 21.2% improvement) and the 7B OpenVLA-OFT model (97.5% success).
  • Only one demonstration is required for rapid improvement in 15 iterations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could let VLA models adapt in settings where full expert trajectories are too costly to collect.
  • The binary-reward approach may combine with other post-training signals to further improve real-world robotic control.

Load-bearing premise

Dynamic rollout sampling combined with leave-one-out advantage estimation produces stable policy optimization for VLA models when only sparse binary rewards are available.

What would settle it

Apply RIPT-VLA to the initial 4-percent SFT model with one demonstration and check the success rate after 15 iterations; failure to reach near 97 percent would falsify the claimed effectiveness.
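How close is "near 97 percent" depends on the number of evaluation episodes, which this page does not state. A minimal sketch, assuming episodes are independent Bernoulli trials, is to report a Wilson score interval around the measured success rate. The episode counts below are hypothetical and chosen only for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half

# Hypothetical evaluation: 97 successes in 100 episodes.
low, high = wilson_interval(97, 100)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

A replication whose interval excludes 0.97 would count against the claim; one whose interval is wide simply needs more episodes before it settles anything.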

Original abstract

We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RIPT-VLA, a reinforcement-learning-based interactive post-training paradigm for Vision-Language-Action (VLA) models. It fine-tunes pretrained VLA models using only sparse binary success rewards through a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. Reported results include a 21.2% improvement on the QueST model, an unprecedented 97.5% success rate on the 7B OpenVLA-OFT model, and recovery of an unworkable SFT model from 4% to 97% success within 15 iterations using only one demonstration. The learned policy is claimed to generalize across tasks and scenarios while remaining robust to initial state context.

Significance. If the empirical results hold under rigorous validation, this work would be significant for embodied AI and robotics. It provides a practical, data-efficient method for post-training VLA models with minimal supervision and sparse binary rewards, addressing key limitations of offline imitation learning pipelines. The computational efficiency and ability to achieve large gains from a single demonstration without dense rewards or extra stabilization techniques would be noteworthy strengths if supported by detailed controls and variance analysis.

major comments (2)
  1. Abstract: The central performance claims (4% to 97% success in 15 iterations; 97.5% on OpenVLA-OFT) are presented without error bars, number of evaluation runs, or statistical details. This is load-bearing for the claim of stable policy optimization, as it prevents assessment of whether the reported gains are reliable or sensitive to random seeds and evaluation protocol.
  2. Method section (dynamic rollout sampling and leave-one-out advantage estimation): With initial success rates of 4%, the vast majority of trajectories receive zero reward. The leave-one-out estimator then subtracts a near-zero baseline from mostly-zero returns, which risks high-variance advantage estimates dominated by the few successful rollouts. The manuscript must provide variance analysis, ablations isolating this estimator, or explicit stabilization to support the stability claim and the rapid 15-iteration improvement.
minor comments (2)
  1. Abstract: Add a brief statement on the specific tasks, environments, and baseline comparisons used to contextualize the reported success rates.
  2. Overall: Define all acronyms (VLA, SFT, RIPT) on first use and ensure consistent notation for success rates and iteration counts across text and any tables.
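The concern in major comment 2 can be made concrete with a short calculation. This is a sketch under simplifying assumptions that are not taken from the paper: the K rollouts for one context are independent, each succeeds with the same probability p, and K = 16 is a hypothetical group size.

```latex
% Assumption: K rollouts from one context, rewards r_i i.i.d. Bernoulli(p).
\begin{align*}
A_i &= r_i - \frac{1}{K-1}\sum_{j \neq i} r_j, \\
\mathbb{E}[A_i] &= p - p = 0, \\
\operatorname{Var}[A_i] &= p(1-p) + \frac{p(1-p)}{K-1} = \frac{K}{K-1}\,p(1-p), \\
\Pr[\text{all } K \text{ rollouts fail}] &= (1-p)^K.
\end{align*}
% Illustration with p = 0.04 and a hypothetical K = 16:
%   Var[A_i] = (16/15)(0.04)(0.96) \approx 0.041,   (1-p)^K = 0.96^{16} \approx 0.52.
```

Under these assumptions the per-rollout advantage variance is modest in absolute terms, but roughly half of the groups would contain no success and so carry no learning signal. That is the regime in which the treatment of uninformative groups matters most.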

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment below and have revised the manuscript to strengthen the presentation of our empirical results and methodological details.

point-by-point responses
  1. Referee: Abstract: The central performance claims (4% to 97% success in 15 iterations; 97.5% on OpenVLA-OFT) are presented without error bars, number of evaluation runs, or statistical details. This is load-bearing for the claim of stable policy optimization, as it prevents assessment of whether the reported gains are reliable or sensitive to random seeds and evaluation protocol.

    Authors: We agree that statistical details are essential for substantiating the stability claims. In the revised manuscript, we will update the abstract to report mean success rates with standard deviations and explicitly state the number of evaluation runs (averaged over multiple random seeds). We will also add a new subsection in the experiments detailing the full evaluation protocol, including the number of trials per task and seed sensitivity analysis, to allow readers to assess reliability.

    revision: yes

  2. Referee: Method section (dynamic rollout sampling and leave-one-out advantage estimation): With initial success rates of 4%, the vast majority of trajectories receive zero reward. The leave-one-out estimator then subtracts a near-zero baseline from mostly-zero returns, which risks high-variance advantage estimates dominated by the few successful rollouts. The manuscript must provide variance analysis, ablations isolating this estimator, or explicit stabilization to support the stability claim and the rapid 15-iteration improvement.

    Authors: We thank the referee for this important point on potential variance in the advantage estimator at low success rates. Our dynamic rollout sampling is intended to progressively increase the proportion of successful trajectories, but we acknowledge the need for explicit supporting analysis. In the revision, we will add variance plots of the advantage estimates over iterations, report the fraction of successful rollouts per batch, and include an ablation comparing leave-one-out estimation against a standard mean baseline to isolate its contribution and demonstrate stabilization.

    revision: yes
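One point worth keeping in mind when reading the proposed ablation: if "standard mean baseline" means subtracting the mean reward of the same group of K rollouts, with no further normalization, then the two estimators differ only by the constant factor K/(K-1). The sketch below checks this numerically. The function names and the group size are hypothetical, and the interpretation of the baseline is an assumption rather than something the rebuttal specifies.

```python
import numpy as np

def loo_advantages(rewards):
    """Leave-one-out: reward minus the mean of the other rollouts in the group."""
    r = np.asarray(rewards, dtype=float)
    return r - (r.sum() - r) / (r.size - 1)

def mean_baseline_advantages(rewards):
    """Group-mean baseline: reward minus the mean of all rollouts in the group."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Hypothetical group of 8 rollouts with a single success.
rewards = [1, 0, 0, 0, 0, 0, 0, 0]
k = len(rewards)
scale = k / (k - 1)
same_up_to_scale = np.allclose(
    loo_advantages(rewards), scale * mean_baseline_advantages(rewards)
)
print(same_up_to_scale)
```

The identity follows from r_i - (S - r_i)/(K-1) = (K/(K-1))(r_i - S/K), where S is the group's total reward. An ablation of this form would therefore mostly measure a rescaling of the update, unless the mean baseline is computed over a different set of rollouts or combined with standard-deviation normalization.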

Circularity Check

0 steps flagged

No circularity: empirical results from applied algorithm, not self-referential derivation

full rationale

The paper introduces RIPT-VLA as a practical RL post-training method relying on dynamic rollout sampling and leave-one-out advantage estimation to optimize VLA policies from sparse binary rewards. Reported performance gains (e.g., 4% to 97% success, or 97.5% on OpenVLA-OFT) are presented as experimental outcomes from running the algorithm on specific models and tasks, without any derivation chain, fitted parameters renamed as predictions, or self-citations that reduce the central claims to tautologies. The method's stability assumptions are stated as empirical findings rather than proven by internal definitions or prior self-work that is itself unverified. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted from abstract only; no explicit free parameters, axioms, or invented entities are described. The approach relies on standard RL primitives (rollout sampling, advantage estimation) whose stability under sparse rewards is taken as given.

pith-pipeline@v0.9.0 · 5770 in / 1202 out tokens · 85060 ms · 2026-05-21T14:19:48.112961+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

    cs.LG 2026-05 unverdicted novelty 7.0

    PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

  4. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  5. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  6. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  7. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  8. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  9. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  10. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  11. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  12. ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

    cs.CV 2026-02 unverdicted novelty 6.0

    ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.

  13. Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

    cs.RO 2026-02 unverdicted novelty 6.0

    LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...

  14. TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 6.0

    TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.

  15. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  16. World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    cs.RO 2025-09 unverdicted novelty 6.0

    World-Env replaces physical robot interactions with a world model-based virtual environment and VLM-guided rewards to enable efficient RL post-training for VLA models, showing gains with only five demonstrations per task.

  17. RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.

  18. PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.

  19. DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

    cs.RO 2026-05 unverdicted novelty 5.0

    DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

  20. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  21. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  22. AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

    cs.LG 2025-11 unverdicted novelty 5.0

    AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.

  23. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    cs.RO 2025-08 unverdicted novelty 5.0

    This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
