When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Pith reviewed 2026-05-08 16:29 UTC · model grok-4.3
The pith
Extracting Q-values from a behavior cloning policy lets robots improve online with reinforcement learning while avoiding the usual distribution mismatch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Q2RL extracts a Q-function from a behavior cloning policy using only a few environment steps, then applies Q-gating during online rollouts to decide, at each step, whether to follow the cloning policy or the learning policy; the resulting samples train the RL policy while preserving the distribution the cloning policy has already mastered.
What carries the argument
Q-Estimation, which builds a Q-function from a fixed behavior cloning policy via limited interactions, and Q-Gating, which selects actions for the replay buffer by comparing the Q-values of the cloning policy and the current RL policy.
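To make the gating rule concrete, here is a minimal sketch of a per-step Q-gated action selector, assuming the gate evaluates both candidate actions under a single extracted critic; the callables and the `margin` slack are illustrative, since the abstract only establishes that the switch depends on the two policies' respective Q-values.

```python
def q_gated_action(q, bc_policy, rl_policy, obs, margin=0.0):
    """Pick the action to execute and store in the replay buffer.

    Illustrative sketch: `q` is the extracted critic and `margin` is a
    hypothetical slack favoring demonstrated behavior; the paper's
    exact comparison rule is not specified here.
    """
    a_bc = bc_policy(obs)  # action from the frozen cloning policy
    a_rl = rl_policy(obs)  # action from the current learning policy
    # Deviate from demonstrated behavior only when the critic scores
    # the RL action at least `margin` higher than the BC action.
    return a_rl if q(obs, a_rl) > q(obs, a_bc) + margin else a_bc
```

Because the BC action wins ties, the collected samples stay on the distribution the cloning policy already covers unless the critic sees a clear advantage.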
If this is right
- Robots can start from a cloned policy and reach 100 percent success on contact-rich tasks after 1-2 hours of online interaction instead of requiring large new demonstration sets.
- The same extraction-plus-gating pattern improves success rate and time to convergence on D4RL and robomimic manipulation suites compared with existing offline-to-online baselines.
- The method yields up to 3.75 times higher success than the starting behavior cloning policy on the evaluated tasks.
- Because the gate prevents large distribution shifts, the online phase can be run safely on physical hardware without catastrophic forgetting of demonstrated actions.
Where Pith is reading between the lines
- The same extraction step could be applied to other offline policies that already produce value estimates, potentially allowing a broader class of imitation methods to seed online RL.
- If the Q-estimation step scales to longer-horizon tasks, it may reduce the total online interaction time needed for high-precision assembly problems that currently require days of robot time.
- Testing the method on tasks where the initial cloning policy is only partially competent would reveal whether the gate still protects against harmful distribution shift when the starting policy is weaker.
Load-bearing premise
The Q-function recovered from only a few interactions with the environment is accurate enough to decide when to trust the cloning policy versus the learning policy.
What would settle it
A trial in which the Q-values produced by the extraction step consistently misrank actions relative to the true returns observed during online training, causing the gate to select actions that slow convergence or drop final success rate below that of the original cloning policy.
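As a sketch of how such a trial could be scored, assuming held-out rollouts with observed returns are available: check whether the extracted Q-values preserve the ranking of the true returns, since ranking, not calibration, is what the gate relies on. The function name and metrics below are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def q_vs_return_check(q_estimates, observed_returns):
    """Score agreement between extracted Q-values and true returns on
    held-out rollouts. Illustrative diagnostic, not from the paper."""
    q = np.asarray(q_estimates, dtype=float)
    g = np.asarray(observed_returns, dtype=float)
    mae = float(np.abs(q - g).mean())  # calibration error
    rho, _ = spearmanr(q, g)           # ranking quality the gate needs
    return {"mae": mae, "spearman_rho": float(rho)}
```

A persistently low or negative rank correlation, together with online performance falling below the original cloning policy, would confirm the failure mode described above.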
Original abstract
Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Q2RL, a method for offline-to-online RL that first extracts a Q-function from a behavior-cloning policy via a small number of environment interactions and then applies Q-gating to select between BC and RL actions during online training. It claims this avoids distribution mismatch, yields faster convergence and higher success rates than SOTA baselines on D4RL and robomimic benchmarks, and enables practical on-robot learning for contact-rich tasks (pipe assembly, kitting) in 1-2 hours with up to 100% success and 3.75x improvement over the original BC policy. Code and videos are provided.
Significance. If the extracted Q-values prove sufficiently accurate for reliable gating, the approach could meaningfully advance on-robot RL by turning existing BC policies into efficient starting points for online improvement with minimal additional interaction. The real-robot experiments and public code are clear strengths that support reproducibility and practical relevance.
major comments (3)
- [Abstract and §3] Abstract and §3 (Method): the Q-estimation procedure is described only at a high level (a few interaction steps followed by online RL); no explicit algorithm, update rule, or handling of sparse rewards is given, yet the entire gating mechanism and the claim of avoiding value errors rest on this step being accurate.
- [§5] §5 (Experiments): no ablation on the number of interactions used for Q-extraction, no comparison of the extracted Q-values against Monte-Carlo ground truth, and no sensitivity analysis on the gating threshold appear; these analyses would directly test the weakest assumption, that the Q-function remains reliable in contact-rich, high-precision domains.
- [§5.3] §5.3 (Real-robot results): success rates up to 100% and 3.75x improvement are reported for pipe assembly and kitting, but without controls for environment stochasticity, hyperparameter search effort, or variance across seeds, it is unclear whether the gains are attributable to Q-gating or to other experimental factors.
minor comments (2)
- [§3] Notation for the gating threshold and the extracted Q-function could be introduced earlier and used consistently to improve readability.
- [Abstract] The abstract mentions 'SOTA offline-to-online learning baselines' without naming them; a brief list would help readers immediately contextualize the comparisons.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional clarity and empirical support will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Method): the Q-estimation procedure is described only at a high level (a few interaction steps followed by online RL); no explicit algorithm, update rule, or handling of sparse rewards is given, yet the entire gating mechanism and the claim of avoiding value errors rest on this step being accurate.
Authors: We agree that the Q-estimation procedure requires a more explicit and formal description to substantiate the gating mechanism and claims regarding value error avoidance. In the revised manuscript, we will expand §3 with a dedicated algorithm box (Algorithm 1) that provides the precise steps, including the update rule for fitting the Q-function from the small number of interaction trajectories (via a combination of Monte Carlo returns and TD bootstrapping). We will also add a subsection discussing reward handling, noting that our tasks employ dense or shaped rewards but explaining how the procedure can be adapted for sparse cases by leveraging the BC policy for initial value estimates. These changes will make the foundation for Q-gating fully transparent. revision: yes
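As a sketch of what such an algorithm box might contain, the loss below blends a Monte Carlo return-to-go with a SARSA-style one-step bootstrap on the BC policy's own next action; the `mc_weight` blend and the target construction are assumptions extrapolated from the rebuttal, not the paper's stated rule.

```python
import torch
import torch.nn.functional as F

def q_extraction_loss(q_net, target_q, batch, gamma=0.99, mc_weight=0.5):
    """Hypothetical update for fitting a Q-function to rollouts of a
    frozen BC policy. `batch` holds tensors keyed by: obs, act, rew,
    done, next_obs, next_act (the BC action at the next state), and
    mc_return (discounted return-to-go along the rollout).
    """
    q_pred = q_net(batch["obs"], batch["act"]).squeeze(-1)
    with torch.no_grad():
        # Bootstrapping on the BC policy's own next action keeps the
        # target on the cloning policy's state-action distribution.
        q_next = target_q(batch["next_obs"], batch["next_act"]).squeeze(-1)
        td_target = batch["rew"] + gamma * (1.0 - batch["done"]) * q_next
        target = mc_weight * batch["mc_return"] + (1.0 - mc_weight) * td_target
    return F.mse_loss(q_pred, target)
```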
Referee: [§5] §5 (Experiments): no ablation on the number of interactions used for Q-extraction, no comparison of the extracted Q-values against Monte-Carlo ground truth, and no sensitivity analysis on the gating threshold appear; these analyses would directly test the weakest assumption, that the Q-function remains reliable in contact-rich, high-precision domains.
Authors: We acknowledge that these ablations are important for validating the reliability of the extracted Q-function. We will incorporate the requested analyses into the revised §5: an ablation varying the number of Q-extraction interactions (e.g., 10, 50, 100 steps) with corresponding success rates; a direct comparison of extracted Q-values against Monte-Carlo ground-truth returns on held-out trajectories, including quantitative error metrics; and a sensitivity study on the gating threshold with performance curves. These will be presented for the D4RL and robomimic benchmarks, with discussion of implications for real-robot domains. revision: yes
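A minimal harness for the interaction-budget and threshold ablations might look like the following; `run_q2rl` is a hypothetical callable wrapping training plus evaluation, and both parameter grids are illustrative.

```python
import itertools

def ablation_grid(run_q2rl, extraction_steps=(10, 50, 100),
                  gate_margins=(0.0, 0.1, 0.5)):
    """Sweep the Q-extraction interaction budget and the gating
    threshold, recording final success rate per cell."""
    return {
        (n, m): run_q2rl(num_extraction_steps=n, gate_margin=m)
        for n, m in itertools.product(extraction_steps, gate_margins)
    }
```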
Referee: [§5.3] §5.3 (Real-robot results): success rates up to 100% and 3.75x improvement are reported for pipe assembly and kitting, but without controls for environment stochasticity, hyperparameter search effort, or variance across seeds, it is unclear whether the gains are attributable to Q-gating or to other experimental factors.
Authors: We agree that greater transparency on experimental controls is warranted. In the revised §5.3, we will report that success rates are averaged over 10–20 evaluation trials per task, including standard deviations to capture environment stochasticity (e.g., sensor noise and initial pose variations). We will clarify that hyperparameters were tuned primarily in simulation and transferred with limited on-robot fine-tuning due to practical time constraints. To strengthen attribution to Q-gating, we will add real-robot comparisons against BC-only continuation and standard RL baselines. While exhaustive multi-seed hyperparameter searches on the physical system are constrained by cost and time, the multi-trial stochasticity controls and baseline comparisons will better isolate the contribution of the proposed method. revision: partial
Circularity Check
No significant circularity; derivation relies on external interactions and remains independent of target outcomes.
Full rationale
The paper's method extracts a Q-function from a BC policy via a small number of environment interactions, then applies Q-gating to switch between BC and RL actions during online training. This chain depends on real-world rollouts rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps reduce the claimed performance gains (success rate, convergence time) to the inputs by construction. The approach does not invoke uniqueness theorems from prior author work, smuggle ansatzes, or rename known empirical patterns. The claims are grounded in external benchmarks and falsifiable via on-robot results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: a Q-function can be estimated from a BC policy using a few interaction steps with the environment
invented entities (1)
- Q-Gating: no independent evidence