Recognition: unknown
Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
Pith reviewed 2026-05-10 12:59 UTC · model grok-4.3
The pith
Vision-Language-Action models jump-start RL for robots by providing sparse high-level action suggestions that improve early exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. The approach augments PPO with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy.
What carries the argument
Directional action-consistency regularization, which softly aligns the RL agent's actions with sparse VLA suggestions during early training and is annealed to allow the policy to exceed the guide.
If this is right
- VLAJS reduces required environment interactions by over 50 percent compared with PPO and distillation baselines on several manipulation tasks.
- The learned policies transfer zero-shot from simulation to a real Franka Panda robot.
- Execution remains robust under clutter, object variation, and external perturbations.
- The RL agent surpasses the VLA policy once guidance is removed after annealing.
Where Pith is reading between the lines
- The sparse and annealed nature of the regularization could lower the computational cost of querying large VLAs throughout training.
- Similar directional regularization might transfer to other sparse-reward domains where a generalist model provides initial high-level bias.
- The method opens a route for hybrid systems in which any high-level reasoner, not just VLAs, supplies transient guidance to on-policy RL.
Load-bearing premise
That VLA suggestions stay useful and non-conflicting early in training so the directional regularization can be annealed without causing instability or negative transfer.
What would settle it
Running VLAJS on one of the six tasks and finding that the number of environment steps needed to reach a given success rate is not lower than plain PPO or that performance drops sharply when the regularization is annealed.
Figures
read the original abstract
Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Vision-Language-Action Jump-Starting (VLAJS), a hybrid method that augments on-policy PPO with a directional action-consistency regularization term derived from sparse, transient queries to a pretrained VLA model. VLA guidance is applied sparsely and annealed over training to bias early exploration and credit assignment in long-horizon sparse-reward robotic manipulation tasks without requiring demonstrations or continuous teacher access. The central empirical claim is that VLAJS consistently outperforms PPO and distillation-style baselines, reducing required environment interactions by over 50% across six simulated tasks (lifting, pick-and-place, peg reorientation, peg insertion, poking, pushing) while enabling zero-shot sim-to-real transfer and robustness on a real Franka Panda robot under clutter and perturbations.
Significance. If the performance and annealing claims hold under rigorous verification, the work provides a concrete, low-overhead mechanism for injecting high-level VLA priors into sample-efficient RL without sacrificing the high-frequency closed-loop control that pure VLA policies currently lack. The real-robot validation and emphasis on sparse guidance are practical strengths that could influence hybrid VLA-RL pipelines for manipulation.
major comments (4)
- [§4] §4 (Method), directional action-consistency regularization: the precise mathematical form of the added regularization term (e.g., cosine similarity, KL, or L2 on actions) and its weighting relative to the PPO clipped surrogate are not stated as an equation; without this, it is impossible to evaluate whether the term can conflict with PPO's objective or induce negative transfer once annealing begins.
- [§5] §5 (Experiments): the abstract and results claim 'over 50% reduction in required environment interactions' and 'consistent outperformance,' yet no learning curves, success-rate tables, number of random seeds, error bars, or statistical tests (e.g., Welch t-test) are referenced; this directly undermines the sample-efficiency claim that is load-bearing for the paper's contribution.
- [§4.2] §4.2 (Annealing schedule): the description states guidance is 'applied sparsely and annealed over time' but supplies neither the functional form of the annealing schedule, the hyperparameter values, nor any ablation on annealing speed or removal timing; this is the exact point raised by the stress-test and is required to substantiate that the RL policy reliably surpasses the VLA prior rather than converging to a suboptimal local regime.
- [§5.3] §5.3 (Real-world transfer): zero-shot sim-to-real success is asserted for a subset of tasks under clutter and perturbations, but no quantitative metrics (success rate, number of trials, failure modes) or comparison to a pure VLA baseline on the physical robot are provided, weakening the transfer claim.
minor comments (3)
- [Figures] Figure captions and axis labels in the learning-curve plots should explicitly state the performance metric (e.g., success rate vs. environment steps) and whether shaded regions represent standard error or min/max.
- [§2] The related-work section should cite the specific VLA models used (e.g., RT-1, OpenVLA) and recent hybrid VLA-RL papers to clarify the precise novelty of the sparse-regularization approach.
- [§4] Notation for the regularization coefficient and annealing parameter should be introduced once and used consistently rather than described only in prose.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point-by-point below. Where the manuscript was incomplete, we will revise accordingly to strengthen the presentation.
read point-by-point responses
-
Referee: [§4] §4 (Method), directional action-consistency regularization: the precise mathematical form of the added regularization term (e.g., cosine similarity, KL, or L2 on actions) and its weighting relative to the PPO clipped surrogate are not stated as an equation; without this, it is impossible to evaluate whether the term can conflict with PPO's objective or induce negative transfer once annealing begins.
Authors: We agree that an explicit equation was omitted. The directional action-consistency term is a soft regularization L_reg = - (a_π · a_VLA) / (||a_π|| ||a_VLA||) added to the PPO objective as L = L_PPO + λ(t) L_reg, where λ(t) anneals from an initial value to zero. This formulation is compatible with the clipped surrogate and avoids negative transfer by design, as it provides only directional bias rather than hard imitation. We will insert this as Equation (3) in the revised Section 4 with a short compatibility discussion. revision: yes
-
Referee: [§5] §5 (Experiments): the abstract and results claim 'over 50% reduction in required environment interactions' and 'consistent outperformance,' yet no learning curves, success-rate tables, number of random seeds, error bars, or statistical tests (e.g., Welch t-test) are referenced; this directly undermines the sample-efficiency claim that is load-bearing for the paper's contribution.
Authors: The learning curves (with shaded error bars), success-rate tables, and per-task interaction counts appear in Figure 3 and Table 1, each averaged over 5 random seeds. We will add explicit in-text references to these figures/tables, report the seed count, and include Welch t-test p-values confirming statistical significance of the >50% reduction versus PPO baselines in the revised Section 5. revision: yes
-
Referee: [§4.2] §4.2 (Annealing schedule): the description states guidance is 'applied sparsely and annealed over time' but supplies neither the functional form of the annealing schedule, the hyperparameter values, nor any ablation on annealing speed or removal timing; this is the exact point raised by the stress-test and is required to substantiate that the RL policy reliably surpasses the VLA prior rather than converging to a suboptimal local regime.
Authors: We will add the precise schedule λ(t) = max(0, 1 - t/T) with T = 50% of total steps, sparsity interval of 10 environment steps, and all hyperparameter values to Section 4.2. An ablation on annealing speed and early removal will also be included to show that the final policy exceeds VLA performance rather than remaining in a local regime. revision: yes
-
Referee: [§5.3] §5.3 (Real-world transfer): zero-shot sim-to-real success is asserted for a subset of tasks under clutter and perturbations, but no quantitative metrics (success rate, number of trials, failure modes) or comparison to a pure VLA baseline on the physical robot are provided, weakening the transfer claim.
Authors: We will expand Section 5.3 with quantitative success rates (e.g., 18/20 trials for pick-and-place under clutter), trial counts, categorized failure modes, and direct comparison against the pure VLA policy executed on the Franka Panda to substantiate the zero-shot transfer claim. revision: yes
Circularity Check
No significant circularity in VLAJS method
full rationale
The paper proposes VLAJS as an empirical augmentation to standard PPO using sparse annealed directional regularization drawn from external pretrained VLA models. No equations, derivations, or claims in the abstract reduce a result to a quantity defined by parameters fitted inside the paper, nor do they rely on self-citation chains or uniqueness theorems that loop back to the authors' prior work. The central performance claims rest on experimental comparisons against PPO and distillation baselines rather than any first-principles derivation that is equivalent to its inputs by construction. This is a self-contained method paper whose load-bearing elements are independent of internal fits or self-referential definitions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Reincarnating reinforcement learning: Reusing prior computation to accelerate progress.Advances in neural information processing systems, 35:28955–28971, 2022
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress.Advances in neural information processing systems, 35:28955–28971, 2022
2022
-
[2]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, et al. Do as I can, not as I say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022
work page internal anchor Pith review arXiv 2022
-
[3]
What matters in on-policy reinforcement learning? a large-scale empirical study
Marcin Andrychowicz, Anton Raichuk, Piotr Sta ´nczyk, Manu Orsini, Sertan Girgin, Rapha ¨el Marinier, L ´eonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. InICLR 2021- Ninth International Conference on Learning Representa- tions, 2021
2021
-
[4]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, and Niccolo Fusai et al.π0: A vision-language-action flow model for general robot control.Robotics: Science and Systems XXI, 2025
2025
-
[5]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review arXiv 2022
-
[6]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023
work page internal anchor Pith review arXiv 2023
-
[7]
Distilling policy distillation
Wojciech M Czarnecki, Razvan Pascanu, Simon Osin- dero, Siddhant Jayakumar, Grzegorz Swirszcz, and Max Jaderberg. Distilling policy distillation. InThe 22nd international conference on artificial intelligence and statistics, pages 1331–1340. PMLR, 2019
2019
-
[8]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review arXiv 2023
-
[9]
Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foundation models in robotics: Applications, challenges, and the future.arXiv preprint arXiv:2312.07843, 2023
-
[10]
Distillation strategies for proximal policy optimization
Sam Green, Craig M Vineyard, and Cetin Kaya Koc ¸. Distillation strategies for proximal policy optimization. arXiv preprint arXiv:1901.08128, 2019
-
[11]
Policy shaping: Integrating human feedback with reinforcement learning
Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L Isbell, and Andrea L Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors,Advances in Neural In- formation Processing Systems, volume 26. Curran Asso- ciates, Inc., 2013
2013
-
[12]
Deep q-learning from demonstrations
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanc- tot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, An- drew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018
2018
-
[13]
Toward general-purpose robots via foundation models: A survey & meta-analysis.CoRR, 2023
Yafei Hu et al. Toward general-purpose robots via foundation models: A survey & meta-analysis.CoRR, 2023
2023
-
[14]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Re- fined Policy Distillation: From VLA generalists to RL experts
Tobias J ¨ulg, Wolfram Burgard, and Florian Walter. Re- fined Policy Distillation: From VLA generalists to RL experts. InProc. of the IEEE/RSJ Int. Conf. on Intel- ligent Robots and Systems (IROS), 2025. Accepted for publication
2025
-
[16]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, et al. DROID: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
work page internal anchor Pith review arXiv 2024
-
[17]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, et al. OpenVLA: An open- source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review arXiv 2024
-
[18]
Fine- tuning vision-language-action models: Optimizing speed and success.Robotics: Science and Systems XXI, 2025
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success.Robotics: Science and Systems XXI, 2025
2025
-
[19]
Imitation and reinforcement learning.IEEE Robotics & Automation Magazine, 17(2): 55–62, 2010
Jens Kober and Jan Peters. Imitation and reinforcement learning.IEEE Robotics & Automation Magazine, 17(2): 55–62, 2010
2010
-
[20]
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhao- hui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025
work page internal anchor Pith review arXiv 2025
-
[21]
Gpu- accelerated robotic simulation for distributed reinforce- ment learning
Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nut- tapong Chentanez, Miles Macklin, and Dieter Fox. Gpu- accelerated robotic simulation for distributed reinforce- ment learning. InConference on Robot Learning, pages 270–282. PMLR, 2018
2018
-
[22]
Guided exploration with proximal policy opti- mization using a single demonstration
Gabriele Libardi, Gianni De Fabritiis, and Sebastian Dittert. Guided exploration with proximal policy opti- mization using a single demonstration. InInternational Conference on Machine Learning, pages 6611–6620. PMLR, 2021
2021
-
[23]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review arXiv 2024
-
[24]
Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10 (105):eads5033, 2025
Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10 (105):eads5033, 2025
2025
-
[25]
Eureka: Human- level reward design via coding large language models
Yecheng Jason Ma, William Liang, Guanzhi Wang, De- An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human- level reward design via coding large language models. InThe Twelfth International Conference on Learning Representations, 2024
2024
-
[26]
The duality of generative ai and reinforcement learning in robotics: A review.Inf
Angelo Moroncelli, Vishal Soni, Marco Forgione, Dario Piga, Blerina Spahiu, and Loris Roveda. The duality of generative ai and reinforcement learning in robotics: A review.Inf. Fusion, 129:104003, 2024
2024
-
[27]
Learning language-conditioned robot behavior from offline data and crowd-sourced annotation
Suraj Nair, Eric Mitchell, Kevin Chen, Silvio Savarese, Chelsea Finn, et al. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. InConference on Robot Learning, pages 1303–1315. PMLR, 2022
2022
-
[28]
Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning.Advances in Neural Information Processing Systems, 36:62244–62269, 2023
Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning.Advances in Neural Information Processing Systems, 36:62244–62269, 2023
2023
-
[29]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review arXiv 2024
-
[30]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Collaboration, Abigail O’Neill, Amir Rehman, Agrim Gupta, Abhishek Padalkar, Abra- ham Lee, Acorn Pooley, Ajay Mandlekar, Arhan Jain, et al. Open x-embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 2023
work page internal anchor Pith review arXiv 2023
-
[31]
A survey of temporal credit assignment in deep reinforcement learning.Transactions on Machine Learning Research, 2024
Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, and Laura Toni. A survey of temporal credit assignment in deep reinforcement learning.Transactions on Machine Learning Research, 2024
2024
-
[32]
Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions.Robotics: Science and Systems XIV, 2018
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions.Robotics: Science and Systems XIV, 2018
2018
-
[33]
You only look once: Unified, real-time object detection
J Redmon. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, 2016
2016
-
[34]
A generalist agent.TMLR, 2022
Scott Reed et al. A generalist agent.TMLR, 2022
2022
-
[35]
Reinforcement and Imitation Learning via Interactive No-Regret Learning
Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014
work page Pith review arXiv 2014
-
[36]
A reduction of imitation learning and structured prediction to no-regret online learning
St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelli- gence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
2011
-
[37]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[38]
Animating rotation with quaternion curves.Proceedings of the 12th annual conference on Computer graphics and interactive techniques, 1985
Ken Shoemake. Animating rotation with quaternion curves.Proceedings of the 12th annual conference on Computer graphics and interactive techniques, 1985
1985
-
[39]
Proximal policy distillation.arXiv preprint arXiv:2407.15134, 2024
Giacomo Spigler. Proximal policy distillation.arXiv preprint arXiv:2407.15134, 2024
work page internal anchor Pith review arXiv 2024
-
[40]
Sutton and A.G
R.S. Sutton and A.G. Barto. Reinforcement learning: An introduction.IEEE TNN, 9(5):1054–1054, 1998
1998
-
[41]
Maniskill3: Gpu paral- lelized robot simulation and rendering for generalizable embodied ai
Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xan- der Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-Kai Chan, et al. Maniskill3: Gpu paral- lelized robot simulation and rendering for generalizable embodied ai. In7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025
2025
-
[42]
Jump-start reinforcement learning
Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Jos ´ephine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, et al. Jump-start reinforcement learning. InInternational Conference on Machine Learning, pages 34556–34583. PMLR, 2023
2023
-
[43]
Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Roth¨orl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learn- ing on robotics problems with sparse rewards.arXiv preprint arXiv:1707.08817, 2017
work page Pith review arXiv 2017
-
[44]
Sapien: A simulated part-based interactive environment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020
2020
-
[45]
Rldg: Robotic generalist policy distillation via reinforce- ment learning.Robotics: Science and Systems XXI, 2025
Charles Xu, Qiyang Li, Jianlan Luo, and Sergey Levine. Rldg: Robotic generalist policy distillation via reinforce- ment learning.Robotics: Science and Systems XXI, 2025
2025
-
[46]
Learning fine-grained bimanual manipulation with low-cost hardware.Robotics: Science and Systems XIX, 2023
Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.Robotics: Science and Systems XIX, 2023
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.