When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Pith reviewed 2026-05-08 16:29 UTC · model grok-4.3
The pith
Extracting Q-values from a behavior cloning policy lets robots improve online with reinforcement learning while avoiding the usual distribution mismatch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Q2RL extracts a Q-function from a behavior cloning policy using only a few environment steps, then applies Q-gating during online rollouts to decide, at each step, whether to follow the cloning policy or the learning policy; the resulting samples train the RL policy while preserving the distribution the cloning policy has already mastered.
What carries the argument
Q-Estimation, which builds a Q-function from a fixed behavior cloning policy via limited interactions, and Q-Gating, which selects actions for the replay buffer by comparing the Q-values of the cloning policy and the current RL policy.
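To make the gating rule concrete, here is a minimal sketch of a per-step Q-gated action selector, assuming the gate evaluates both candidate actions under a single extracted critic; the callables and the `margin` slack are illustrative, since the abstract only establishes that the switch depends on the two policies' respective Q-values.

```python
def q_gated_action(q, bc_policy, rl_policy, obs, margin=0.0):
    """Pick the action to execute and store in the replay buffer.

    Illustrative sketch: `q` is the extracted critic and `margin` is a
    hypothetical slack favoring demonstrated behavior; the paper's
    exact comparison rule is not specified here.
    """
    a_bc = bc_policy(obs)  # action from the frozen cloning policy
    a_rl = rl_policy(obs)  # action from the current learning policy
    # Deviate from demonstrated behavior only when the critic scores
    # the RL action at least `margin` higher than the BC action.
    return a_rl if q(obs, a_rl) > q(obs, a_bc) + margin else a_bc
```

Because the BC action wins ties, the collected samples stay on the distribution the cloning policy already covers unless the critic sees a clear advantage.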
If this is right
- Robots can start from a cloned policy and reach 100 percent success on contact-rich tasks after 1-2 hours of online interaction instead of requiring large new demonstration sets.
- The same extraction-plus-gating pattern improves success rate and time to convergence on D4RL and robomimic manipulation suites compared with existing offline-to-online baselines.
- The method yields up to 3.75 times higher success than the starting behavior cloning policy on the evaluated tasks.
- Because the gate prevents large distribution shifts, the online phase can be run safely on physical hardware without catastrophic forgetting of demonstrated actions.
Where Pith is reading between the lines
- The same extraction step could be applied to other offline policies that already produce value estimates, potentially allowing a broader class of imitation methods to seed online RL.
- If the Q-estimation step scales to longer-horizon tasks, it may reduce the total online interaction time needed for high-precision assembly problems that currently require days of robot time.
- Testing the method on tasks where the initial cloning policy is only partially competent would reveal whether the gate still protects against harmful distribution shift when the starting policy is weaker.
Load-bearing premise
The Q-function recovered from only a few interactions with the environment is accurate enough to decide when to trust the cloning policy versus the learning policy.
What would settle it
A trial in which the Q-values produced by the extraction step consistently misrank actions relative to the true returns observed during online training, causing the gate to select actions that slow convergence or drop final success rate below that of the original cloning policy.
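As a sketch of how such a trial could be scored, assuming held-out rollouts with observed returns are available: check whether the extracted Q-values preserve the ranking of the true returns, since ranking, not calibration, is what the gate relies on. The function name and metrics below are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def q_vs_return_check(q_estimates, observed_returns):
    """Score agreement between extracted Q-values and true returns on
    held-out rollouts. Illustrative diagnostic, not from the paper."""
    q = np.asarray(q_estimates, dtype=float)
    g = np.asarray(observed_returns, dtype=float)
    mae = float(np.abs(q - g).mean())  # calibration error
    rho, _ = spearmanr(q, g)           # ranking quality the gate needs
    return {"mae": mae, "spearman_rho": float(rho)}
```

A persistently low or negative rank correlation, together with online performance falling below the original cloning policy, would confirm the failure mode described above.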
Original abstract
Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Q2RL, a method for offline-to-online RL that first extracts a Q-function from a behavior-cloning policy via a small number of environment interactions and then applies Q-gating to select between BC and RL actions during online training. It claims this avoids distribution mismatch, yields faster convergence and higher success rates than SOTA baselines on D4RL and robomimic benchmarks, and enables practical on-robot learning for contact-rich tasks (pipe assembly, kitting) in 1-2 hours with up to 100% success and 3.75x improvement over the original BC policy. Code and videos are provided.
Significance. If the extracted Q-values prove sufficiently accurate for reliable gating, the approach could meaningfully advance on-robot RL by turning existing BC policies into efficient starting points for online improvement with minimal additional interaction. The real-robot experiments and public code are clear strengths that support reproducibility and practical relevance.
major comments (3)
- [Abstract and §3] Abstract and §3 (Method): the Q-estimation procedure is described only at a high level (a few interaction steps followed by online RL); no explicit algorithm, update rule, or handling of sparse rewards is given, yet the entire gating mechanism and the claim of avoiding value errors rest on this step being accurate.
- [§5] §5 (Experiments): no ablation on the number of interactions used for Q-extraction, no comparison of the extracted Q-values against Monte-Carlo ground truth, and no sensitivity analysis on the gating threshold appear; these analyses would directly test the weakest assumption, that the Q-function remains reliable in contact-rich, high-precision domains.
- [§5.3] §5.3 (Real-robot results): success rates up to 100% and 3.75x improvement are reported for pipe assembly and kitting, but without controls for environment stochasticity, hyperparameter search effort, or variance across seeds, it is unclear whether the gains are attributable to Q-gating or to other experimental factors.
minor comments (2)
- [§3] Notation for the gating threshold and the extracted Q-function could be introduced earlier and used consistently to improve readability.
- [Abstract] The abstract mentions 'SOTA offline-to-online learning baselines' without naming them; a brief list would help readers immediately contextualize the comparisons.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional clarity and empirical support will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Method): the Q-estimation procedure is described only at a high level (a few interaction steps followed by online RL); no explicit algorithm, update rule, or handling of sparse rewards is given, yet the entire gating mechanism and the claim of avoiding value errors rest on this step being accurate.
Authors: We agree that the Q-estimation procedure requires a more explicit and formal description to substantiate the gating mechanism and claims regarding value error avoidance. In the revised manuscript, we will expand §3 with a dedicated algorithm box (Algorithm 1) that provides the precise steps, including the update rule for fitting the Q-function from the small number of interaction trajectories (via a combination of Monte Carlo returns and TD bootstrapping). We will also add a subsection discussing reward handling, noting that our tasks employ dense or shaped rewards but explaining how the procedure can be adapted for sparse cases by leveraging the BC policy for initial value estimates. These changes will make the foundation for Q-gating fully transparent. revision: yes
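As a sketch of what such an algorithm box might contain, the loss below blends a Monte Carlo return-to-go with a SARSA-style one-step bootstrap on the BC policy's own next action; the `mc_weight` blend and the target construction are assumptions extrapolated from the rebuttal, not the paper's stated rule.

```python
import torch
import torch.nn.functional as F

def q_extraction_loss(q_net, target_q, batch, gamma=0.99, mc_weight=0.5):
    """Hypothetical update for fitting a Q-function to rollouts of a
    frozen BC policy. `batch` holds tensors keyed by: obs, act, rew,
    done, next_obs, next_act (the BC action at the next state), and
    mc_return (discounted return-to-go along the rollout).
    """
    q_pred = q_net(batch["obs"], batch["act"]).squeeze(-1)
    with torch.no_grad():
        # Bootstrapping on the BC policy's own next action keeps the
        # target on the cloning policy's state-action distribution.
        q_next = target_q(batch["next_obs"], batch["next_act"]).squeeze(-1)
        td_target = batch["rew"] + gamma * (1.0 - batch["done"]) * q_next
        target = mc_weight * batch["mc_return"] + (1.0 - mc_weight) * td_target
    return F.mse_loss(q_pred, target)
```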
Referee: [§5] §5 (Experiments): no ablation on the number of interactions used for Q-extraction, no comparison of the extracted Q-values against Monte-Carlo ground truth, and no sensitivity analysis on the gating threshold appear; these analyses would directly test the weakest assumption, that the Q-function remains reliable in contact-rich, high-precision domains.
Authors: We acknowledge that these ablations are important for validating the reliability of the extracted Q-function. We will incorporate the requested analyses into the revised §5: an ablation varying the number of Q-extraction interactions (e.g., 10, 50, 100 steps) with corresponding success rates; a direct comparison of extracted Q-values against Monte-Carlo ground-truth returns on held-out trajectories, including quantitative error metrics; and a sensitivity study on the gating threshold with performance curves. These will be presented for the D4RL and robomimic benchmarks, with discussion of implications for real-robot domains. revision: yes
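A minimal harness for the interaction-budget and threshold ablations might look like the following; `run_q2rl` is a hypothetical callable wrapping training plus evaluation, and both parameter grids are illustrative.

```python
import itertools

def ablation_grid(run_q2rl, extraction_steps=(10, 50, 100),
                  gate_margins=(0.0, 0.1, 0.5)):
    """Sweep the Q-extraction interaction budget and the gating
    threshold, recording final success rate per cell."""
    return {
        (n, m): run_q2rl(num_extraction_steps=n, gate_margin=m)
        for n, m in itertools.product(extraction_steps, gate_margins)
    }
```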
Referee: [§5.3] §5.3 (Real-robot results): success rates up to 100% and 3.75x improvement are reported for pipe assembly and kitting, but without controls for environment stochasticity, hyperparameter search effort, or variance across seeds, it is unclear whether the gains are attributable to Q-gating or to other experimental factors.
Authors: We agree that greater transparency on experimental controls is warranted. In the revised §5.3, we will report that success rates are averaged over 10–20 evaluation trials per task, including standard deviations to capture environment stochasticity (e.g., sensor noise and initial pose variations). We will clarify that hyperparameters were tuned primarily in simulation and transferred with limited on-robot fine-tuning due to practical time constraints. To strengthen attribution to Q-gating, we will add real-robot comparisons against BC-only continuation and standard RL baselines. While exhaustive multi-seed hyperparameter searches on the physical system are constrained by cost and time, the multi-trial stochasticity controls and baseline comparisons will better isolate the contribution of the proposed method. revision: partial
Circularity Check
No significant circularity; derivation relies on external interactions and remains independent of target outcomes.
Full rationale
The paper's method extracts a Q-function from a BC policy via a small number of environment interactions, then applies Q-gating to switch between BC and RL actions during online training. This chain depends on real-world rollouts rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps reduce the claimed performance gains (success rate, convergence time) to the inputs by construction. The approach does not invoke uniqueness theorems from prior author work, smuggle ansatzes, or rename known empirical patterns. The claims are grounded in external benchmarks and falsifiable via on-robot results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: a Q-function can be estimated from a BC policy using a few interaction steps with the environment
invented entities (1)
- Q-Gating: no independent evidence