When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Ashutosh Nayyar; Rahul Jain; Rishabh Agrawal

arxiv: 2605.17017 · v1 · pith:7PFSY7BAnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 20:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Behavior Foundation Models · Offline Imitation Learning · Robust Task Inference · Dynamics Shifts · Minimax Optimization · Imitation Learning · Robust Policies

The pith

Reformulating BFM task inference as a minimax problem over dynamics perturbations yields robust policies from single-environment offline data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that Behavior Foundation Models (BFMs) can handle shifts in environment dynamics if the task-inference step is recast as a robust minimax optimization. The resulting policies are chosen to perform well against worst-case perturbations, with no change to the pretraining stage and no data from shifted environments. This matters because real-world imitation learning routinely encounters changes in friction, actuation, or sensor noise, and existing BFMs assume fixed dynamics. The authors report that the method outperforms both standard BFM adaptation and existing robust offline imitation learning baselines, with robustness obtained entirely at inference time.

Core claim

Casting BFM task inference as a robust minimax optimization over possible dynamics perturbations produces policies that adapt to worst-case shifts while depending solely on offline data collected in a single nominal environment. To the best of the authors' knowledge, this is the first BFM-based framework to achieve dynamics robustness without modifying pretraining or requiring multi-environment data, and the resulting policies are reported to outperform standard BFM and robust offline IL baselines under dynamics shifts.

What carries the argument

The minimax optimization problem solved at task-inference time that accounts for worst-case dynamics perturbations while adapting a pretrained BFM.
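To make the mechanism concrete, here is a minimal sketch of minimax task inference over a finite candidate set. Everything in it is an illustrative assumption rather than the paper's method: the function names, the scalar task embedding `z`, the finite list of perturbations, and `toy_loss`, which merely stands in for an imitation loss evaluated under perturbed dynamics.

```python
from typing import Callable, Sequence

Loss = Callable[[float, float], float]  # loss(z, delta) -> imitation loss


def nominal_task_inference(candidates: Sequence[float], loss: Loss) -> float:
    """Pick the embedding with the lowest loss under unperturbed dynamics (delta = 0)."""
    return min(candidates, key=lambda z: loss(z, 0.0))


def robust_task_inference(
    candidates: Sequence[float], perturbations: Sequence[float], loss: Loss
) -> float:
    """Pick the embedding whose worst-case loss over the perturbation set is lowest."""

    def worst_case(z: float) -> float:
        # Inner player: the adversary picks the most damaging perturbation.
        return max(loss(z, delta) for delta in perturbations)

    # Outer player: task inference minimizes that worst case.
    return min(candidates, key=worst_case)


def toy_loss(z: float, delta: float) -> float:
    """Hypothetical loss: a fit term plus a perturbation-sensitivity term that grows with z."""
    return (z - 1.0) ** 2 + delta * z


CANDIDATES = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical task embeddings
PERTURBATIONS = [-0.5, 0.0, 0.5]          # hypothetical dynamics offsets

z_nominal = nominal_task_inference(CANDIDATES, toy_loss)
z_robust = robust_task_inference(CANDIDATES, PERTURBATIONS, toy_loss)
print(z_nominal, z_robust)
```

In this toy setup the nominal choice is `z = 1.0`, while the robust choice backs off to `z = 0.75`, giving up a little nominal fit for a lower worst-case loss. A real implementation would replace the enumeration with whatever approximation of the inner maximization the paper actually uses.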

If this is right

Robust policies are obtained entirely at task-inference time without retraining the BFM.
The approach relies solely on offline data from one nominal environment.
It outperforms both standard BFM adaptation and prior robust offline IL methods under dynamics shifts.
The framework improves practicality of BFMs in settings with varying friction, actuation, or sensor noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of task-agnostic pretraining from robust inference may generalize to other pretrained models in robotics and control.
Choosing a richer class of perturbation models inside the minimax step could further close the gap between modeled and real shifts.
The same inference-time robustness idea might reduce the need for expensive multi-environment data collection in related offline RL settings.

Load-bearing premise

The minimax optimization over dynamics perturbations can be solved tractably from offline nominal data alone and produces policies that generalize to actual dynamics shifts.

What would settle it

A controlled experiment that applies an unmodeled dynamics shift (for example, a friction change outside the perturbation set used at task inference) and measures whether the inferred policy still matches its nominal performance.
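One way such an experiment could be scored is sketched below, assuming a one-dimensional friction parameter and a perturbation set that is an interval of half-width `eps` around the nominal value. The `Trial` record, the helper names, and all numbers in the usage lines are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Trial:
    friction: float    # friction value applied at test time
    avg_return: float  # average return measured under that friction


def split_by_perturbation_set(
    trials: Sequence[Trial], nominal: float, eps: float
) -> Tuple[List[Trial], List[Trial]]:
    """Separate trials whose friction lies inside the modeled interval from those outside it."""
    inside = [t for t in trials if abs(t.friction - nominal) <= eps]
    outside = [t for t in trials if abs(t.friction - nominal) > eps]
    return inside, outside


def retention(trials: Sequence[Trial], nominal_return: float) -> List[float]:
    """Fraction of the nominal return that each trial retains."""
    if nominal_return <= 0:
        raise ValueError("nominal_return must be positive")
    return [t.avg_return / nominal_return for t in trials]


# Placeholder numbers chosen only to exercise the code.
trials = [Trial(0.9, 950.0), Trial(1.1, 940.0), Trial(1.5, 700.0), Trial(2.0, 400.0)]
inside, outside = split_by_perturbation_set(trials, nominal=1.0, eps=0.25)
print(retention(inside, 1000.0), retention(outside, 1000.0))
```

The claim would come under pressure if retention on the outside trials fell sharply relative to the inside trials.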

Figures

Figures reproduced from arXiv: 2605.17017 by Ashutosh Nayyar, Rahul Jain, Rishabh Agrawal.

**Figure 2.** Quadruped performance under dynamics perturbations (95% confidence interval), pre…

**Figure 3.** Average return (y-axis) vs. body mass perturbation (x-axis, % increase from nominal) on…

**Figure 4.** Average return (y-axis) vs. joint friction loss perturbation (x-axis, absolute Nm per joint)

**Figure 5.** Average return (y-axis) vs. ground contact stiffness (x-axis, absolute perturbation) on Quadruped jump under varying ε for RBFM-Heavy.

**Figure 6.** Walker performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): gravity and body mass increase the physical load on the robot (% change from nominal); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (absolute Nm per joint). Pretrained on RND data.

**Figure 7.** Cheetah performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): range of motion restricts how far each joint can move, simulating mechanical damage (% of nominal retained); actuator strength reduces peak joint torque, simulating motor degradation (% reduction); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (abso…

**Figure 8.** Walker performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): gravity and body mass increase the physical load on the robot (% change from nominal); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (absolute Nm per joint). Pretrained on APS data.

**Figure 9.** Walker performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): gravity and body mass increase the physical load on the robot (% change from nominal); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (absolute Nm per joint). Pretrained on PROTO data.

**Figure 10.** Comparison of Walker pretraining on the 100k vs. 500k RND datasets. Models are evaluated every 20,000 timesteps, with 10 rollouts per evaluation and the IQM recorded.
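The Figure 10 caption reports the IQM (interquartile mean) over 10 rollouts. A minimal sketch of that statistic follows, assuming the convention of dropping `floor(n / 4)` values from each end before averaging; the paper may handle non-divisible sample sizes differently, and the rollout returns below are placeholders.

```python
from typing import Sequence


def interquartile_mean(values: Sequence[float]) -> float:
    """Mean of the values left after dropping the lowest and highest quarter."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    cut = len(ordered) // 4  # floor(n / 4) items removed from each end
    middle = ordered[cut : len(ordered) - cut]
    return sum(middle) / len(middle)


# Ten placeholder rollout returns with one outlier.
rollout_returns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
iqm = interquartile_mean(rollout_returns)
print(iqm)
```

With ten rollouts this keeps the middle six values, so the single outlier has no influence on the reported number.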

Original abstract

Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policies can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper casts BFM task inference as minimax over dynamics perturbations to gain robustness from nominal offline data alone, but the tractability and real generalization of that step are the open questions.

The main takeaway is that this work reframes task inference in Behavior Foundation Models as a robust minimax problem so the resulting policy handles dynamics shifts like friction or actuation changes without any change to pretraining or access to perturbed data. They position it as the first BFM method to do this from a single nominal offline dataset and report better results than plain BFM and existing robust offline IL baselines on those shifts.

Referee Report

2 major / 2 minor

Summary. The paper introduces a robust formulation of task inference for Behavior Foundation Models (BFMs) in offline imitation learning. By casting adaptation as a minimax optimization over dynamics perturbations, the method aims to produce policies robust to shifts (e.g., friction, actuation, sensor noise) while using only offline trajectories from a single nominal environment and without altering the BFM pretraining stage. The authors claim this is the first such BFM-based robust framework and report significant outperformance versus standard BFM and robust offline IL baselines under dynamics shifts.

Significance. If the minimax task-inference procedure can be solved tractably from nominal data and yields policies that generalize to unmodeled real-world dynamics changes, the result would meaningfully improve the practicality of pretrained BFMs in non-stationary environments. Shifting robustness to inference time rather than pretraining or data collection is a potentially scalable direction for offline IL.

major comments (2)

[§3, Method] The central claim that the minimax problem over dynamics perturbations can be solved from nominal offline trajectories alone is load-bearing, yet the manuscript provides no explicit description of the inner maximization approximation, the class of allowed perturbations, or the surrogate used in place of an explicit dynamics model. Without this, it is unclear whether the resulting policy is robust only to the modeled set or to actual environment shifts.
[§4, Experiments] The reported outperformance under dynamics shifts lacks sufficient protocol details, specifically how the test perturbations are generated, whether they lie inside or outside the perturbation class used in training, and the presence of error bars or statistical tests across multiple seeds. This makes it difficult to assess whether the gains support generalization beyond the nominal environment.

minor comments (2)

[Introduction] The abstract and introduction use the phrase 'to the best of our knowledge' for the 'first BFM-based framework'; a brief related-work paragraph clarifying the precise novelty relative to prior robust IL and BFM papers would strengthen the positioning.
[§3] Notation for the robust objective (e.g., the definition of the perturbation set and the inner/outer players) should be introduced with a single equation block rather than scattered across paragraphs for readability.
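The notation the referee asks for can be gathered into one display. The block below is a generic sketch, not the paper's formulation: $z$ (task embedding), $\mathcal{Z}$, $\theta$ (dynamics parameters), $\theta_0$ (nominal value), $\mathcal{U}_\varepsilon$, and $\mathcal{L}$ are all assumed symbols. The second line is the standard closed form that results if the loss is replaced by its first-order expansion in $\theta$ around $\theta_0$, where $\|\cdot\|_*$ denotes the dual norm.

```latex
\begin{align}
  z^\star &\in \arg\min_{z \in \mathcal{Z}} \; \max_{\theta \in \mathcal{U}_\varepsilon} \; \mathcal{L}(z;\theta),
  \qquad
  \mathcal{U}_\varepsilon = \{\theta : \|\theta - \theta_0\| \le \varepsilon\}, \\
  \max_{\theta \in \mathcal{U}_\varepsilon}
  \Big[ \mathcal{L}(z;\theta_0) + \nabla_\theta \mathcal{L}(z;\theta_0)^{\top}(\theta - \theta_0) \Big]
  &= \mathcal{L}(z;\theta_0) + \varepsilon \, \big\|\nabla_\theta \mathcal{L}(z;\theta_0)\big\|_{*}.
\end{align}
```

Here the outer player chooses $z$ and the inner player chooses $\theta$. Any guarantee derived from the second line holds only to the extent that the linearization is accurate on $\mathcal{U}_\varepsilon$.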

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of the potential impact, and the constructive major comments. We address each point below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [§3, Method] The central claim that the minimax problem over dynamics perturbations can be solved from nominal offline trajectories alone is load-bearing, yet the manuscript provides no explicit description of the inner maximization approximation, the class of allowed perturbations, or the surrogate used in place of an explicit dynamics model. Without this, it is unclear whether the resulting policy is robust only to the modeled set or to actual environment shifts.

Authors: We agree that Section 3 would benefit from a more self-contained and explicit treatment. In the revision we will expand the method section with a new subsection that (i) formally defines the perturbation class as bounded parametric changes to friction coefficients, actuation gains, and additive sensor noise (with explicit bounds provided), (ii) describes the inner maximization as a first-order surrogate obtained by linearizing the latent dynamics around the nominal trajectories using the BFM encoder gradients, and (iii) states that the resulting policy is guaranteed to be robust inside this modeled set while providing empirical evidence of generalization to unmodeled shifts. These additions will make the load-bearing claim fully traceable from the nominal data alone. (Revision: yes)
Referee: [§4, Experiments] The reported outperformance under dynamics shifts lacks sufficient protocol details, specifically how the test perturbations are generated, whether they lie inside or outside the perturbation class used in training, and the presence of error bars or statistical tests across multiple seeds. This makes it difficult to assess whether the gains support generalization beyond the nominal environment.

Authors: We concur that the experimental protocol requires additional detail for reproducibility and to substantiate the generalization claims. In the revised manuscript we will (i) specify the exact procedure for generating test perturbations (parameter sampling ranges and randomization seeds), (ii) explicitly indicate which test conditions lie inside versus outside the training perturbation class, and (iii) report all results with mean and standard deviation over five independent seeds together with paired t-test p-values against baselines. These changes will allow readers to evaluate the strength of the out-of-distribution generalization evidence. (Revision: yes)
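The reporting promised in this response (mean and standard deviation over five seeds, plus a paired t-test against a baseline) can be sketched with the standard library alone. The helper names are hypothetical and the per-seed returns are placeholders, not results from the paper; only the t statistic is computed here.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(returns: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation (n - 1 denominator) across seeds."""
    return statistics.mean(returns), statistics.stdev(returns)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired per-seed differences; has len(method) - 1 degrees of freedom."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equally long samples with at least two seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder per-seed returns for five seeds (not results from the paper).
method_returns = [11.0, 12.0, 13.0, 12.0, 12.0]
baseline_returns = [10.0, 10.0, 10.0, 10.0, 10.0]
t_stat = paired_t_statistic(method_returns, baseline_returns)
print(mean_and_std(method_returns), t_stat)
```

Turning the statistic into a p-value requires the Student t distribution with `n - 1` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value. Pairing is only meaningful when both methods are run on the same seeds and test conditions.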

Circularity Check

0 steps flagged

No circularity: new robust minimax formulation introduced at task-inference time without reducing to fitted inputs or self-citations

The paper's central step is to reformulate BFM task inference as a robust minimax optimization over dynamics perturbations, solved from nominal offline data. This is presented as an external modeling choice rather than a re-expression of any pre-fitted quantity or a result derived solely from prior self-citations. No equations in the abstract or description reduce a prediction to its own inputs by construction, and the claim of being the first such framework does not rely on load-bearing self-citation chains. The derivation remains self-contained against external benchmarks of robust optimization applied to imitation learning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the robust minimax formulation is presented at a high level without detailing any fitted quantities or background assumptions.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Policy optimization for strictly batch imitation learning

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, and Ashutosh Nayyar. Policy optimization for strictly batch imitation learning. InOPT 2024: Optimization for Machine Learning, 2024. URL https://openreview.net/forum?id=5L3qmI0XPz

work page 2024
[3]

Balance equation-based distributionally robust offline imitation learning.arXiv preprint arXiv:2511.07942, 2025

Rishabh Agrawal, Yusuf Alvi, Rahul Jain, and Ashutosh Nayyar. Balance equation-based distributionally robust offline imitation learning.arXiv preprint arXiv:2511.07942, 2025

work page arXiv 2025
[4]

Markov balance satisfac- tion improves performance in strictly batch offline imitation learning

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, and Ashutosh Nayyar. Markov balance satisfac- tion improves performance in strictly batch offline imitation learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15311–15319, 2025

work page 2025
[5]

Conditional kernel imi- tation learning for continuous state environments

Rishabh Agrawal, Nathan Dahlin, Rahul Jain, and Ashutosh Nayyar. Conditional kernel imi- tation learning for continuous state environments. In Necmiye Ozay, Laura Balzano, Dimitra Panagou, and Alessandro Abate, editors,Proceedings of the 7th Annual Learning for Dynam- ics & Control Conference, volume 283 ofProceedings of Machine Learning Research, pag...

work page 2025
[6]

The reality gap in robotics: Challenges, solutions, and best practices.Annual Review of Control, Robotics, and Autonomous Systems, 9, 2025

Elie Aljalbout, Jiaxu Xing, Angel Romero, Iretiayo Akinola, Caelan Reed Garrett, Eric Heiden, Abhishek Gupta, Tucker Hermans, Yashraj Narang, Dieter Fox, et al. The reality gap in robotics: Challenges, solutions, and best practices.Annual Review of Control, Robotics, and Autonomous Systems, 9, 2025

work page 2025
[7]

Routledge, 2021

Eitan Altman.Constrained Markov decision processes. Routledge, 2021

work page 2021
[8]

Dexterous manipulation through imitation learning: A survey.arXiv preprint arXiv:2504.03515, 2025

Shan An, Ziyu Meng, Chao Tang, Yuning Zhou, Tengyu Liu, Fangqiang Ding, Shufang Zhang, Yao Mu, Ran Song, Wei Zhang, et al. Dexterous manipulation through imitation learning: A survey.arXiv preprint arXiv:2504.03515, 2025

work page arXiv 2025
[9]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Successor features for transfer in reinforcement learning.Advances in neural information processing systems, 30, 2017

André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P Van Hasselt, and David Silver. Successor features for transfer in reinforcement learning.Advances in neural information processing systems, 30, 2017

work page 2017
[11]

Learning successor states and goal-dependent values: A mathematical viewpoint.arXiv preprint arXiv:2101.07123, 2021

Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint.arXiv preprint arXiv:2101.07123, 2021

work page arXiv 2021
[12]

Maksim Bobrin, Ilya Zisman, Alexander Nikulin, Vladislav Kurenkov, and Dmitry V . Dylov. Zero-shot adaptation of behavioral foundation models to unseen dynamics. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=dBDBg4WF4F

work page 2026
[13]

Universal Successor Features Approximators

Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado Van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators.arXiv preprint arXiv:1812.07626, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Cambridge university press, 2004

Stephen P Boyd and Lieven Vandenberghe.Convex optimization. Cambridge university press, 2004

work page 2004
[15]

Exploration by Random Network Distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation.arXiv preprint arXiv:1810.12894, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Robust imitation learning against variations in environment dynamics

Jongseong Chae, Seungyul Han, Whiyoung Jung, Myungsik Cho, Sungho Choi, and Youngchul Sung. Robust imitation learning against variations in environment dynamics. InInternational Conference on Machine Learning, pages 2828–2852. PMLR, 2022. 10

work page 2022
[17]

Meta-controller: Few- shot imitation of unseen embodiments and tasks in continuous control.Advances in Neural Information Processing Systems, 37:134250–134286, 2024

Seongwoong Cho, Donggyun Kim, Jinwoo Lee, and Seunghoon Hong. Meta-controller: Few- shot imitation of unseen embodiments and tasks in continuous control.Advances in Neural Information Processing Systems, 37:134250–134286, 2024

work page 2024
[18]

Exploring the limitations of behavior cloning for autonomous driving

Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. InProceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

work page 2019
[19]

Improving generalization for temporal difference learning: The successor repre- sentation.Neural computation, 5(4):613–624, 1993

Peter Dayan. Improving generalization for temporal difference learning: The successor repre- sentation.Neural computation, 5(4):613–624, 1993

work page 1993
[20]

Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

Esther Derman and Shie Mannor. Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

work page arXiv 2003
[21]

One-shot imitation learning.Advances in neural information processing systems, 30, 2017

Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning.Advances in neural information processing systems, 30, 2017

work page 2017
[22]

One-shot visual imitation learning via meta-learning

Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. InConference on robot learning, pages 357–368. PMLR, 2017

work page 2017
[23]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

work page 2052
[24]

Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016

work page 2016
[25]

Impact of static friction on sim2real in robotic reinforcement learning

Xiaoyi Hu, Qiao Sun, Bailin He, Haojie Liu, Xueyi Zhang, Chunpeng Lu, and Jiangwei Zhong. Impact of static friction on sim2real in robotic reinforcement learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 17107–17114. IEEE, 2025

work page 2025
[26]

Generalization in dexterous manipulation via geometry-aware multi-task learning.arXiv preprint arXiv:2111.03062, 2021

Wenlong Huang, Igor Mordatch, Pieter Abbeel, and Deepak Pathak. Generalization in dexterous manipulation via geometry-aware multi-task learning.arXiv preprint arXiv:2111.03062, 2021

work page arXiv 2021
[27]

Robust dynamic programming.Mathematics of Operations Research, 30(2): 257–280, 2005

Garud N Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2): 257–280, 2005

work page 2005
[28]

Task-embedded control networks for few-shot imitation learning

Stephen James, Michael Bloesch, and Andrew J Davison. Task-embedded control networks for few-shot imitation learning. InConference on robot learning, pages 783–795. PMLR, 2018

work page 2018
[29]

Zero-shot reinforcement learning from low quality data.Advances in Neural Information Processing Systems, 37:16894–16942, 2024

Scott Jeen, Tom Bewley, and Jonathan M Cullen. Zero-shot reinforcement learning from low quality data.Advances in Neural Information Processing Systems, 37:16894–16942, 2024

work page 2024
[30]

Zero-shot reinforcement learning under partial observability.arXiv preprint arXiv:2506.15446, 2025

Scott Jeen, Tom Bewley, and Jonathan M Cullen. Zero-shot reinforcement learning under partial observability.arXiv preprint arXiv:2506.15446, 2025

work page arXiv 2025
[31]

DemoDICE: Offline imitation learning with supplementary imperfect demonstrations

Geon-Hyeong Kim, Seokin Seo, Jongmin Lee, Wonseok Jeon, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim. DemoDICE: Offline imitation learning with supplementary imperfect demonstrations. InInternational Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=BrPdX1bDZkQ

work page 2022
[32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[33]

Imitation learning via off-policy dis- tribution matching

Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy dis- tribution matching. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg-JC4FDr

work page 2020
[34]

Dart: Noise injection for robust imitation learning

Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 ofProceedings of Machine Learning Research, pages 143–156. PMLR, 13–15 Nov 2017. URL https: //procee...

work page 2017
[35]

Aps: Active pretraining with successor features

Hao Liu and Pieter Abbeel. Aps: Active pretraining with successor features. InInternational Conference on Machine Learning, pages 6736–6747. PMLR, 2021

work page 2021
[36]

ODICE: Revealing the mystery of distribution correction estimation via orthogonal-gradient update

Liyuan Mao, Haoran Xu, Weinan Zhang, and Xianyuan Zhan. ODICE: Revealing the mystery of distribution correction estimation via orthogonal-gradient update. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=L8UNn7Llt4

work page 2024
[37]

Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

work page 2005
[38]

Robust rein- forcement learning using offline data.Advances in neural information processing systems, 35: 32211–32224, 2022

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust rein- forcement learning using offline data.Advances in neural information processing systems, 35: 32211–32224, 2022

work page 2022
[39]

Distributionally robust behavioral cloning for robust imitation learning

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Distributionally robust behavioral cloning for robust imitation learning. In2023 62nd IEEE Conference on Decision and Control (CDC), pages 1342–1347. IEEE, 2023

work page 2023
[40]

Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage. In Necmiye Ozay, Laura Balzano, Dimitra Panagou, and Alessandro Abate, editors,Proceedings of the 7th Annual Learning for Dynamics & Control Conference, ...

work page
[41]

URLhttps://proceedings.mlr.press/v283/panaganti25a.html

work page
[42]

Foundation policies with hilbert representa- tions.arXiv preprint arXiv:2402.15567, 2024

Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with hilbert representa- tions.arXiv preprint arXiv:2402.15567, 2024

work page arXiv 2024
[43]

Sim-to-real transfer of robotic control with dynamics randomization

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In2018 IEEE international conference on robotics and automation (ICRA), pages 3803–3810. IEEE, 2018

work page 2018
[44]

Fast imitation via behavior foundation models

Matteo Pirotta, Andrea Tirinzoni, Ahmed Touati, Alessandro Lazaric, and Yann Ollivier. Fast imitation via behavior foundation models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=qnWtw3l0jb

work page 2024
[45]

Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1:305–313, 1988

Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1:305–313, 1988

work page 1988
[46]

John Wiley & Sons, 2014

Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[47]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[48]

Efficient reductions for imitation learning

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010

work page 2010
[49]

A reduction of imitation learning and structured prediction to no-regret online learning

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pag...

work page 2011
[50]

Optimistic task inference for behavior foundation models.arXiv preprint arXiv:2510.20264, 2025

Thomas Rupf, Marco Bagatella, Marin Vlastelica, and Andreas Krause. Optimistic task inference for behavior foundation models.arXiv preprint arXiv:2510.20264, 2025

work page arXiv 2025
[51]

Universal value function ap- proximators

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function ap- proximators. InInternational conference on machine learning, pages 1312–1320. PMLR, 2015. 12

work page 2015
[52]

Mitigating covariate shift in behavioral cloning via robust stationary distribution correction

Seokin Seo, Byung-Jun Lee, Jongmin Lee, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim. Mitigating covariate shift in behavioral cloning via robust stationary distribution correction. Advances in Neural Information Processing Systems, 37:109177–109201, 2024

work page 2024
[53]

Robust imitation learning from noisy demonstrations

V oot Tangkaratt, Nontawat Charoenphakdee, and Masashi Sugiyama. Robust imitation learning from noisy demonstrations. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 298–306. PMLR, 13–15 Apr 2021. URL ht...

work page 2021
[54] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
[55] Rajesh Tiwari, Shailesh Khapre, and Avantika Singh. Reinforcement learning in robotic systems: A review on sim-to-real transfer. Robotics and Autonomous Systems, page 105327, 2026.
[56] Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34:13–23, 2021.
[57] Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? arXiv preprint arXiv:2209.14935, 2022.
[58] Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MYEap_OcQI
[59] Shili Wu, Yizhao Jin, Puhua Niu, Aniruddha Datta, and Sean B Andersson. Robust behavior cloning via global Lipschitz regularization. arXiv preprint arXiv:2506.19250, 2025.
[60] Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6818–6827. PMLR, 09–15 Jun 2019. ...
[61] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pages 11920–11931. PMLR, 2021.
[62] Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don’t change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022.
[63] Zhuodong Yu, Ling Dai, Shaohang Xu, Siyang Gao, and Chin Pang Ho. Fast Bellman updates for Wasserstein distributionally robust MDPs. Advances in Neural Information Processing Systems, 36:30554–30578, 2023.
[64] Kexin Zheng, Lauriane Teyssier, Yinan Zheng, Yu Luo, and Xianyuan Zhan. Breeze: Towards robust zero-shot reinforcement learning. https://github.com/Whiterrrrr/BREEZE, 2026. GitHub repository, accessed May 7, 2026.
[65] Allan Zhou, Eric Jang, Daniel Kappler, Alex Herzog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, and Chelsea Finn. Watch, try, learn: Meta-learning from demonstrations and rewards. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJg5J6NtDr

Appendices
A Missing Proofs
B Extended Related Work
C Experimental Setup
C.1 ExORL Domains
C.1.1 Walker
C.1.2 Quadruped ...
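Then the optimization problem in Proposition 1 can be simplified to

$$
\min_{z}\ \min_{\lambda \in [0,\,(1+\varepsilon_l)L]}
\left\{
\mathbb{E}_{s \sim \rho^{\pi_D}_{T^o}}\!\left[\big(L(\pi_z(\cdot|s), \pi_D(\cdot|s)) - \lambda\big)_+\right]
+ \varepsilon_l \Big(\max_{s:\ \rho^{\pi_D}_{T^o}(s) > 0} L(\pi_z(\cdot|s), \pi_D(\cdot|s)) - \lambda\Big)_+
+ \lambda
\right\}.
$$

Proof. Let us fix the learner’s task vector $z$ and the corresponding policy $\pi_z$, and define the point-wise loss $\ell_z(s) := L(\pi_z(\cdot|s), \pi_D(\cdot|s)) = \|\pi_z(s) - \pi_D(s)\|_2^2$. Since t...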
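The inner minimization over $\lambda$ in the simplified problem above can be checked numerically for a fixed task vector. The following is a minimal sketch, not the authors' implementation. It assumes the expectation is replaced by a uniform average over sampled states, the maximum over the support is replaced by the maximum over those samples, and the per-state losses are non-negative (as for a squared norm). The function names `robust_bc_objective` and `minimize_over_lambda` are hypothetical. Because the objective is convex and piecewise linear in $\lambda$, with kinks only at the sampled loss values, and is increasing beyond the largest loss, it suffices to evaluate it at $\lambda = 0$ and at each sampled loss.

```python
import numpy as np


def robust_bc_objective(losses, eps, lam):
    """Empirical value of E[(l - lam)_+] + eps * (max l - lam)_+ + lam.

    losses: per-state losses l_z(s) at sampled states (assumed non-negative).
    eps:    robustness radius, playing the role of epsilon_l.
    lam:    dual variable lambda.
    """
    losses = np.asarray(losses, dtype=float)
    hinge_mean = np.maximum(losses - lam, 0.0).mean()  # sample average of (l - lam)_+
    hinge_max = max(losses.max() - lam, 0.0)           # (max l - lam)_+ over the samples
    return hinge_mean + eps * hinge_max + lam


def minimize_over_lambda(losses, eps):
    """Minimize the objective over lam by checking its breakpoints.

    The objective is convex and piecewise linear in lam, so a minimizer
    lies at lam = 0 or at one of the sampled loss values.
    """
    losses = np.asarray(losses, dtype=float)
    candidates = np.unique(np.concatenate(([0.0], losses)))
    values = np.array([robust_bc_objective(losses, eps, lam) for lam in candidates])
    best = int(values.argmin())
    return float(candidates[best]), float(values[best])


if __name__ == "__main__":
    sample_losses = [0.1, 0.4, 0.9]  # hypothetical per-state losses
    for eps in (0.0, 0.5, 1.0):
        lam, val = minimize_over_lambda(sample_losses, eps)
        print(f"eps={eps}: lambda*={lam}, objective={val}")
```

In this sketch, setting `eps = 0` returns the plain average loss, while `eps >= 1` returns the largest sampled loss, so the radius interpolates between average-case and worst-case imitation error over the sampled states.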
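$+\,\epsilon\tau - \frac{\varepsilon}{b}\left[\sum_{i=1}^{b} f\big(w^\star_{Q_\theta,\tau,\pi_z}(s_i, a_i, s'_i)\big) + w^\star_{Q_\theta,\tau,\pi_z}(s_i, a_i, s'_i)\, c_{Q_\theta,\pi_z}(s_i, a_i, s'_i)\right]$
10: Update $\theta \leftarrow \theta - \eta_Q \nabla_\theta L_{Q_\theta,\tau}$
11: Update $\tau \leftarrow \max(0,\ \tau - \eta_\tau \nabla_\tau L_{Q_\theta,\tau})$
12: // Step 3: Policy update (actor)
13: Estimate: $L_{\pi_z} = \frac{1}{b}\sum_{i=1}^{b} w^\star_{Q_\theta,\tau,\pi_z}(s_i, a_i, s'_i)\, L_{\pi_z}(s_i)$
14: Update $z \leftarrow z - \eta_\pi \nabla_z L_{\pi_z}$
15: end for
16: return $(Q_\theta, \tau, z)$

Pretrained on RND data
Gravity Mass Joint Friction Loss Run Walk Flip Sta...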
