Interactive Post-Training for Vision-Language-Action Models
Pith reviewed 2026-05-21 14:19 UTC · model grok-4.3
The pith
RIPT-VLA uses reinforcement learning with only binary success rewards to lift a vision-language-action model from 4 percent to 97 percent success, starting from a single demonstration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RIPT-VLA is a simple reinforcement-learning paradigm for interactive post-training of vision-language-action models that uses only sparse binary success rewards. It achieves stable policy optimization through dynamic rollout sampling and leave-one-out advantage estimation, delivering large gains on both lightweight and 7B-scale models with minimal demonstrations.
What carries the argument
Dynamic rollout sampling combined with leave-one-out advantage estimation, which stabilizes policy updates from binary success signals during interactive post-training.
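To make the estimator concrete, here is a minimal sketch of leave-one-out advantage estimation over one group of rollouts. It assumes the usual definition (each rollout's binary reward minus the mean reward of the other rollouts from the same context); the function name and the example group are illustrative and not taken from the paper.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Advantage of each rollout: its reward minus the mean reward of the others."""
    r = np.asarray(rewards, dtype=float)
    k = r.size
    if k < 2:
        raise ValueError("need at least two rollouts per context")
    # Mean of the other k-1 rewards, computed for every rollout at once.
    baseline = (r.sum() - r) / (k - 1)
    return r - baseline

# Hypothetical group: 4 rollouts from one initial state, exactly one success.
adv = leave_one_out_advantages([1, 0, 0, 0])
# A group with no successes carries no learning signal under this estimator.
adv_all_fail = leave_one_out_advantages([0, 0, 0, 0])
```

In the one-success group the successful rollout gets a positive advantage and the failures a small negative one; in the all-failure group every advantage is zero.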
If this is right
- The learned policies generalize across different tasks and scenarios.
- Performance stays robust to variations in initial state context.
- The method improves both the lightweight QueST model (by 21.2%) and the 7B OpenVLA-OFT model (to a 97.5% success rate).
- With only one demonstration, an SFT model starting at 4% success reaches 97% within 15 iterations.
Where Pith is reading between the lines
- This could let VLA models adapt in settings where full expert trajectories are too costly to collect.
- The binary-reward approach may combine with other post-training signals to further improve real-world robotic control.
Load-bearing premise
Dynamic rollout sampling combined with leave-one-out advantage estimation produces stable policy optimization for VLA models when only sparse binary rewards are available.
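This page does not spell out what dynamic rollout sampling does. One plausible reading, stated here as an assumption, is that groups whose rollouts all succeed or all fail are discarded and sampling continues until the batch holds enough mixed-outcome groups. The sketch below illustrates that reading with a simulated Bernoulli policy; `collect_informative_groups`, `simulated_rollout`, and the context names are hypothetical.

```python
import random

def collect_informative_groups(contexts, rollout_fn, k, batch_size, max_groups=1000):
    """Sample k rollouts per context; keep a group only if its binary rewards are mixed."""
    kept = []
    for attempt in range(max_groups):
        if len(kept) == batch_size:
            break
        ctx = contexts[attempt % len(contexts)]
        rewards = [rollout_fn(ctx) for _ in range(k)]
        # All-fail or all-success groups yield zero advantage under a
        # leave-one-out (or mean) baseline, so they are skipped.
        if 0 < sum(rewards) < k:
            kept.append((ctx, rewards))
    return kept

# Hypothetical stand-in for a simulator: Bernoulli success with a per-context rate.
rng = random.Random(0)
success_rate = {"always_fails": 0.0, "mixed": 0.5, "always_succeeds": 1.0}

def simulated_rollout(ctx):
    return int(rng.random() < success_rate[ctx])

batch = collect_informative_groups(list(success_rate), simulated_rollout, k=4, batch_size=5)
```

Under this reading, the cost of filtering is extra rollouts: at low success rates most groups are rejected, which is exactly the regime the referee report questions.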
What would settle it
Apply RIPT-VLA to the initial 4-percent SFT model with one demonstration and check the success rate after 15 iterations; failure to reach near 97 percent would falsify the claimed effectiveness.
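Whether an observed rate counts as "near 97 percent" depends on how many evaluation trials back it. As a rough aid, the sketch below computes a Wilson score interval for a binomial success rate. The trial counts are made up for illustration and are not the paper's evaluation protocol.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical replication: 485 successes in 500 evaluation episodes.
low, high = wilson_interval(485, 500)
# The same observed rate from far fewer episodes gives a much wider interval.
low_small, high_small = wilson_interval(29, 30)
```

A replication whose interval sits well below 0.97 would count against the claim; one whose interval is wide settles little either way.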
Original abstract
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RIPT-VLA, a reinforcement-learning-based interactive post-training paradigm for Vision-Language-Action (VLA) models. It fine-tunes pretrained VLA models using only sparse binary success rewards through a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. Reported results include a 21.2% improvement on the QueST model, an unprecedented 97.5% success rate on the 7B OpenVLA-OFT model, and recovery of an unworkable SFT model from 4% to 97% success within 15 iterations using only one demonstration. The learned policy is claimed to generalize across tasks and scenarios while remaining robust to initial state context.
Significance. If the empirical results hold under rigorous validation, this work would be significant for embodied AI and robotics. It provides a practical, data-efficient method for post-training VLA models with minimal supervision and sparse binary rewards, addressing key limitations of offline imitation learning pipelines. The computational efficiency and ability to achieve large gains from a single demonstration without dense rewards or extra stabilization techniques would be noteworthy strengths if supported by detailed controls and variance analysis.
major comments (2)
- Abstract: The central performance claims (4% to 97% success in 15 iterations; 97.5% on OpenVLA-OFT) are presented without error bars, number of evaluation runs, or statistical details. This is load-bearing for the claim of stable policy optimization, as it prevents assessment of whether the reported gains are reliable or sensitive to random seeds and evaluation protocol.
- Method section (dynamic rollout sampling and leave-one-out advantage estimation): With initial success rates of 4%, the vast majority of trajectories receive zero reward. The leave-one-out estimator then subtracts a near-zero baseline from mostly-zero returns, which risks high-variance advantage estimates dominated by the few successful rollouts. The manuscript must provide variance analysis, ablations isolating this estimator, or explicit stabilization to support the stability claim and the rapid 15-iteration improvement.
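The concern in the second major comment can be worked through in closed form. The derivation below assumes the standard leave-one-out definition, binary rewards, and independent rollouts with per-rollout success probability p; the symbols K (rollouts per context) and s (successes in the group), and the choice K = 8, are introduced here for illustration only.

```latex
% K rollouts per context, binary rewards r_i \in \{0,1\}, s = \sum_{j=1}^{K} r_j successes.
A_i \;=\; r_i - \frac{1}{K-1}\sum_{j \neq i} r_j
\;=\;
\begin{cases}
  \dfrac{K-s}{K-1}, & r_i = 1 \quad \text{(successful rollout)} \\[2ex]
  -\dfrac{s}{K-1},  & r_i = 0 \quad \text{(failed rollout)}
\end{cases}

% If s = 0 or s = K, every A_i = 0 and the group contributes no gradient.
% With independent rollouts and success probability p:
\Pr[\text{group is informative}] \;=\; 1 - (1-p)^{K} - p^{K}

% Example with p = 0.04 and a hypothetical K = 8:
1 - 0.96^{8} - 0.04^{8} \;\approx\; 1 - 0.721 \;\approx\; 0.28
```

Under these assumptions, roughly seven in ten groups at the 4% starting point would carry no signal at all, and the remaining groups would typically contain a single success with advantage close to 1.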
minor comments (2)
- Abstract: Add a brief statement on the specific tasks, environments, and baseline comparisons used to contextualize the reported success rates.
- Overall: Define all acronyms (VLA, SFT, RIPT) on first use and ensure consistent notation for success rates and iteration counts across text and any tables.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments. We address each major comment below and have revised the manuscript to strengthen the presentation of our empirical results and methodological details.
- Referee: Abstract: The central performance claims (4% to 97% success in 15 iterations; 97.5% on OpenVLA-OFT) are presented without error bars, number of evaluation runs, or statistical details. This is load-bearing for the claim of stable policy optimization, as it prevents assessment of whether the reported gains are reliable or sensitive to random seeds and evaluation protocol.
  Authors: We agree that statistical details are essential for substantiating the stability claims. In the revised manuscript, we will update the abstract to report mean success rates with standard deviations and explicitly state the number of evaluation runs (averaged over multiple random seeds). We will also add a new subsection in the experiments detailing the full evaluation protocol, including the number of trials per task and seed sensitivity analysis, to allow readers to assess reliability.
  Revision: yes
- Referee: Method section (dynamic rollout sampling and leave-one-out advantage estimation): With initial success rates of 4%, the vast majority of trajectories receive zero reward. The leave-one-out estimator then subtracts a near-zero baseline from mostly-zero returns, which risks high-variance advantage estimates dominated by the few successful rollouts. The manuscript must provide variance analysis, ablations isolating this estimator, or explicit stabilization to support the stability claim and the rapid 15-iteration improvement.
  Authors: We thank the referee for this important point on potential variance in the advantage estimator at low success rates. Our dynamic rollout sampling is intended to progressively increase the proportion of successful trajectories, but we acknowledge the need for explicit supporting analysis. In the revision, we will add variance plots of the advantage estimates over iterations, report the fraction of successful rollouts per batch, and include an ablation comparing leave-one-out estimation against a standard mean baseline to isolate its contribution and demonstrate stabilization.
  Revision: yes
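One caveat about the proposed ablation: within a single group of K rollouts, a plain mean baseline and a leave-one-out baseline give advantages that differ only by the constant factor (K-1)/K, since the group mean includes the rollout's own reward. The sketch below checks this numerically. It assumes both baselines are computed over the same group with no further normalization, and the reward vector is made up for illustration.

```python
import numpy as np

def loo_advantages(rewards):
    """Reward minus the mean of the other rollouts' rewards."""
    r = np.asarray(rewards, dtype=float)
    return r - (r.sum() - r) / (r.size - 1)

def mean_baseline_advantages(rewards):
    """Reward minus the mean of all rollouts' rewards, including its own."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Hypothetical group of 8 rollouts with two successes.
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
k = len(rewards)
loo = loo_advantages(rewards)
mean_adv = mean_baseline_advantages(rewards)
scale = (k - 1) / k  # the two estimators differ by exactly this factor
```

If that holds for the authors' setup, swapping only the baseline mostly rescales the update within each group. A comparison against a variance-normalized (GRPO-style) advantage or a learned critic might isolate the estimator's contribution more clearly.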
Circularity Check
No circularity: empirical results from applied algorithm, not self-referential derivation
The paper introduces RIPT-VLA as a practical RL post-training method relying on dynamic rollout sampling and leave-one-out advantage estimation to optimize VLA policies from sparse binary rewards. Reported performance gains (e.g., 4% to 97% success, or 97.5% on OpenVLA-OFT) are presented as experimental outcomes from running the algorithm on specific models and tasks, without any derivation chain, fitted parameters renamed as predictions, or self-citations that reduce the central claims to tautologies. The method's stability assumptions are stated as empirical findings rather than proven by internal definitions or prior self-work that is itself unverified. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 23 Pith papers
- Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
  PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
  MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
  MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
- Unified Noise Steering for Efficient Human-Guided VLA Adaptation
  UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
- Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
  Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
- Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
  Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
- PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
  PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
- ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
  ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
- Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
  LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...
- TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
  TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.
- $\pi^{*}_{0.6}$: a VLA That Learns From Experience
  RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
- World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
  World-Env replaces physical robot interactions with a world model-based virtual environment and VLM-guided rewards to enable efficient RL post-training for VLA models, showing gains with only five demonstrations per task.
- RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models
  RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.
- PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models
  PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.
- DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
  DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
  The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
- AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
  AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
- Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
  This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.