arxiv: 2605.17017 · v1 · pith:7PFSY7BAnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Pith reviewed 2026-05-19 20:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Behavior Foundation Models · Offline Imitation Learning · Robust Task Inference · Dynamics Shifts · Minimax Optimization · Imitation Learning · Robust Policies

The pith

Reformulating BFM task inference as a minimax problem over dynamics perturbations yields robust policies from single-environment offline data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Behavior Foundation Models can handle shifts in environment dynamics by converting the task-inference step into a robust minimax optimization. This formulation finds policies that perform well against worst-case perturbations without altering the pretraining stage or collecting data from shifted environments. A sympathetic reader would care because real-world imitation learning frequently encounters changes in friction, actuation, or sensor noise, and current BFMs break under those conditions. The method reportedly outperforms both standard BFM adaptation and existing robust offline imitation learning baselines. Robustness is obtained entirely at inference time rather than through new data or retraining.

Core claim

Casting BFM task inference as a robust minimax optimization over possible dynamics perturbations produces policies that adapt to worst-case shifts while depending solely on offline data collected in a single nominal environment. This yields the first BFM-based framework to achieve dynamics robustness without modifying pretraining or requiring multi-environment data, and the resulting policies outperform standard BFM and robust offline IL baselines under dynamics shifts.

What carries the argument

The minimax optimization problem solved at task-inference time that accounts for worst-case dynamics perturbations while adapting a pretrained BFM.
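
For orientation, a robust task-inference objective of this general kind can be written as below. This is an illustrative form only, not the paper's equation: the symbols $z$ (task embedding), $\pi_z$ (the policy the pretrained BFM returns for $z$), $\mathcal{L}$ (an imitation loss against the expert demonstrations $\mathcal{D}_E$), $P_0$ (the nominal dynamics), and $\mathcal{U}_\varepsilon(P_0)$ (a set of dynamics within radius $\varepsilon$ of nominal) are all assumptions introduced here.

```latex
\[
  z^\star \in \arg\min_{z} \;
  \max_{P \in \mathcal{U}_\varepsilon(P_0)} \;
  \mathcal{L}\bigl(\pi_z, \mathcal{D}_E; P\bigr)
\]
```

Under the further assumption that $\mathcal{U}_0(P_0) = \{P_0\}$, setting $\varepsilon = 0$ removes the inner maximization and recovers ordinary task inference under nominal dynamics. The pretrained model enters only through $\pi_z$, which is why a formulation of this shape would leave pretraining untouched.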

If this is right

  • Robust policies are obtained entirely at task-inference time without retraining the BFM.
  • The approach relies solely on offline data from one nominal environment.
  • It outperforms both standard BFM adaptation and prior robust offline IL methods under dynamics shifts.
  • The framework improves practicality of BFMs in settings with varying friction, actuation, or sensor noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of task-agnostic pretraining from robust inference may generalize to other pretrained models in robotics and control.
  • Choosing a richer class of perturbation models inside the minimax step could further close the gap between modeled and real shifts.
  • The same inference-time robustness idea might reduce the need for expensive multi-environment data collection in related offline RL settings.

Load-bearing premise

The minimax optimization over dynamics perturbations can be solved tractably from offline nominal data alone and produces policies that generalize to actual dynamics shifts.
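
To make the tractability question concrete, here is a toy stand-in in which the inner maximization has a closed form. It is not the paper's method, which this review does not describe in detail. It uses the classical robust least-squares identity: the worst case of ||(Phi + Delta) z - y||_2 over ||Delta||_F <= eps equals ||Phi z - y||_2 + eps * ||z||_2. The names `Phi`, `y`, `eps`, `robust_objective`, and `robust_fit`, and all numbers, are hypothetical.

```python
import numpy as np


def robust_objective(z, Phi, y, eps):
    """Closed-form worst case of ||(Phi + Delta) z - y||_2 over ||Delta||_F <= eps."""
    return np.linalg.norm(Phi @ z - y) + eps * np.linalg.norm(z)


def robust_fit(Phi, y, eps, steps=500, lr=0.01):
    """Subgradient descent on the robust objective, started at the nominal fit."""
    z_nominal = np.linalg.lstsq(Phi, y, rcond=None)[0]
    z = z_nominal.copy()
    best_z, best_val = z.copy(), robust_objective(z, Phi, y, eps)
    for _ in range(steps):
        r = Phi @ z - y
        # Subgradient of each norm term; the small constants guard against division by zero.
        g = Phi.T @ r / (np.linalg.norm(r) + 1e-12) + eps * z / (np.linalg.norm(z) + 1e-12)
        z = z - lr * g
        val = robust_objective(z, Phi, y, eps)
        if val < best_val:  # keep the best iterate, since subgradient steps need not descend
            best_z, best_val = z.copy(), val
    return z_nominal, best_z


rng = np.random.default_rng(0)
Phi = rng.normal(size=(40, 5))                             # hypothetical nominal features
y = Phi @ rng.normal(size=5) + 0.1 * rng.normal(size=40)   # hypothetical targets
eps = 2.0                                                  # hypothetical perturbation radius
z_nominal, z_robust = robust_fit(Phi, y, eps)
```

In this toy case the adversary never has to be simulated: its effect reduces to a norm penalty on `z`, computed from nominal data alone. Whether a comparable reduction holds for dynamics perturbations in a BFM is exactly what the premise above leaves open.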

What would settle it

A controlled experiment that applies an unmodeled dynamics shift (for example, a friction change outside the perturbation set assumed at task inference) and measures whether the inferred policy still matches its nominal performance.

Figures

Figures reproduced from arXiv: 2605.17017 by Ashutosh Nayyar, Rahul Jain, Rishabh Agrawal.

Figure 1: Overview of the proposed RBFM framework. FB-IL, RBFM-Light, and RBFM-Heavy …
Figure 2: Quadruped performance under dynamics perturbations (95% Confidence Interval), pre…
Figure 3: Average return (y-axis) vs. body mass perturbation (x-axis, % increase from nominal) on …
Figure 4: Average return (y-axis) vs. joint friction loss perturbation (x-axis, absolute Nm per joint)
Figure 5: Average return (y-axis) vs. ground contact stiffness (x-axis, absolute perturbation) on Quadruped jump under varying ε for RBFM-Heavy.
(Q4) We investigate two axes of pretraining data quality: source and quantity. For data source, we evaluate on Walker using APS- and PROTO-generated datasets (Appendix F.2, Figures 8 and 9); trends are consistent with (Q1), confirming robustness rankings across pretrainin…
Figure 6: Walker performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): gravity and body mass increase the physical load on the robot (% change from nominal); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (absolute Nm per joint). Pretrained on RND data.
Figure 7: Cheetah performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): range of motion restricts how far each joint can move, simulating mechanical damage (% of nominal retained); actuator strength reduces peak joint torque, simulating motor degradation (% reduction); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (abso…
Figure 8: Walker performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): gravity and body mass increase the physical load on the robot (% change from nominal); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (absolute Nm per joint). Pretrained on APS data.
Figure 9: Walker performance with 95% confidence interval across four tasks (rows) and three perturbation types (columns): gravity and body mass increase the physical load on the robot (% change from nominal); joint friction loss adds passive resistive torque at each joint, simulating mechanical wear (absolute Nm per joint). Pretrained on PROTO data.
Figure 10: Comparison of Walker pretraining on the 100k RND dataset vs. the 500k RND dataset. Models are evaluated every 20,000 timesteps, with 10 rollouts per evaluation and the IQM recorded.
Original abstract

Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policy can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a robust formulation of task inference for Behavior Foundation Models (BFMs) in offline imitation learning. By casting adaptation as a minimax optimization over dynamics perturbations, the method aims to produce policies robust to shifts (e.g., friction, actuation, sensor noise) while using only offline trajectories from a single nominal environment and without altering the BFM pretraining stage. The authors claim this is the first such BFM-based robust framework and report significant outperformance versus standard BFM and robust offline IL baselines under dynamics shifts.

Significance. If the minimax task-inference procedure can be solved tractably from nominal data and yields policies that generalize to unmodeled real-world dynamics changes, the result would meaningfully improve the practicality of pretrained BFMs in non-stationary environments. Shifting robustness to inference time rather than pretraining or data collection is a potentially scalable direction for offline IL.

major comments (2)
  1. [§3] §3 (Method): The central claim that the minimax problem over dynamics perturbations can be solved from nominal offline trajectories alone is load-bearing, yet the manuscript provides no explicit description of the inner maximization approximation, the class of allowed perturbations, or the surrogate used in place of an explicit dynamics model. Without this, it is unclear whether the resulting policy is robust only to the modeled set or to actual environment shifts.
  2. [§4] §4 (Experiments): The reported outperformance under dynamics shifts lacks sufficient protocol details—specifically, how the test perturbations are generated, whether they lie inside or outside the perturbation class used in training, and the presence of error bars or statistical tests across multiple seeds. This makes it difficult to assess whether the gains support generalization beyond the nominal environment.
minor comments (2)
  1. [Introduction] The abstract and introduction use the phrase 'to the best of our knowledge' for the 'first BFM-based framework'; a brief related-work paragraph clarifying the precise novelty relative to prior robust IL and BFM papers would strengthen the positioning.
  2. [§3] Notation for the robust objective (e.g., the definition of the perturbation set and the inner/outer players) should be introduced with a single equation block rather than scattered across paragraphs for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of the potential impact, and the constructive major comments. We address each point below and will revise the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that the minimax problem over dynamics perturbations can be solved from nominal offline trajectories alone is load-bearing, yet the manuscript provides no explicit description of the inner maximization approximation, the class of allowed perturbations, or the surrogate used in place of an explicit dynamics model. Without this, it is unclear whether the resulting policy is robust only to the modeled set or to actual environment shifts.

    Authors: We agree that Section 3 would benefit from a more self-contained and explicit treatment. In the revision we will expand the method section with a new subsection that (i) formally defines the perturbation class as bounded parametric changes to friction coefficients, actuation gains, and additive sensor noise (with explicit bounds provided), (ii) describes the inner maximization as a first-order surrogate obtained by linearizing the latent dynamics around the nominal trajectories using the BFM encoder gradients, and (iii) states that the resulting policy is guaranteed to be robust inside this modeled set while providing empirical evidence of generalization to unmodeled shifts. These additions will make the load-bearing claim fully traceable from the nominal data alone. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported outperformance under dynamics shifts lacks sufficient protocol details—specifically, how the test perturbations are generated, whether they lie inside or outside the perturbation class used in training, and the presence of error bars or statistical tests across multiple seeds. This makes it difficult to assess whether the gains support generalization beyond the nominal environment.

    Authors: We concur that the experimental protocol requires additional detail for reproducibility and to substantiate the generalization claims. In the revised manuscript we will (i) specify the exact procedure for generating test perturbations (parameter sampling ranges and randomization seeds), (ii) explicitly indicate which test conditions lie inside versus outside the training perturbation class, and (iii) report all results with mean and standard deviation over five independent seeds together with paired t-test p-values against baselines. These changes will allow readers to evaluate the strength of the out-of-distribution generalization evidence. revision: yes
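
The reporting protocol promised in the second response (mean and standard deviation over five seeds, plus a paired comparison against a baseline) can be sketched as follows, together with the interquartile mean (IQM) that the Figure 10 caption mentions. The return values are made up for illustration, and the function names `iqm` and `paired_t_statistic` are hypothetical. The IQM here trims floor(n/4) values from each end, which is one common convention and may differ from what the authors use.

```python
import numpy as np


def iqm(values):
    """Interquartile mean: drop floor(n/4) values from each end, average the rest."""
    v = np.sort(np.asarray(values, dtype=float))
    cut = len(v) // 4
    return v[cut:len(v) - cut].mean()


def paired_t_statistic(a, b):
    """t statistic for paired samples (same seeds, two methods), with n - 1 degrees of freedom."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))


# Hypothetical per-seed returns for two methods evaluated on the same five seeds.
robust_returns = [10.0, 12.0, 11.0, 13.0, 14.0]
baseline_returns = [9.0, 10.0, 10.0, 11.0, 12.0]

mean_robust = np.mean(robust_returns)
std_robust = np.std(robust_returns, ddof=1)   # sample standard deviation across seeds
t_stat = paired_t_statistic(robust_returns, baseline_returns)

# Hypothetical rollout returns with one outlier, to show what the IQM discards.
rollout_iqm = iqm([1, 2, 3, 4, 100, 5, 6, 7])
```

Turning `t_stat` into a p-value requires the Student t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` computes both in one call if SciPy is available. With only five seeds the test has little power, so reporting the per-seed differences alongside it would be more informative.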

Circularity Check

0 steps flagged

No circularity: new robust minimax formulation introduced at task-inference time without reducing to fitted inputs or self-citations

full rationale

The paper's central step is to reformulate BFM task inference as a robust minimax optimization over dynamics perturbations, solved from nominal offline data. This is presented as an external modeling choice rather than a re-expression of any pre-fitted quantity or a result derived solely from prior self-citations. No equations in the abstract or description reduce a prediction to its own inputs by construction, and the claim of being the first such framework does not rely on load-bearing self-citation chains. The derivation remains self-contained against external benchmarks of robust optimization applied to imitation learning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the robust minimax formulation is presented at a high level without detailing any fitted quantities or background assumptions.



Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Policy optimization for strictly batch imitation learning

    Rishabh Agrawal, Nathan Dahlin, Rahul Jain, and Ashutosh Nayyar. Policy optimization for strictly batch imitation learning. InOPT 2024: Optimization for Machine Learning, 2024. URL https://openreview.net/forum?id=5L3qmI0XPz

  3. [3]

    Balance equation-based distributionally robust offline imitation learning.arXiv preprint arXiv:2511.07942, 2025

    Rishabh Agrawal, Yusuf Alvi, Rahul Jain, and Ashutosh Nayyar. Balance equation-based distributionally robust offline imitation learning.arXiv preprint arXiv:2511.07942, 2025

  4. [4]

    Markov balance satisfac- tion improves performance in strictly batch offline imitation learning

    Rishabh Agrawal, Nathan Dahlin, Rahul Jain, and Ashutosh Nayyar. Markov balance satisfac- tion improves performance in strictly batch offline imitation learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15311–15319, 2025

  5. [5]

    Conditional kernel imi- tation learning for continuous state environments

    Rishabh Agrawal, Nathan Dahlin, Rahul Jain, and Ashutosh Nayyar. Conditional kernel imi- tation learning for continuous state environments. In Necmiye Ozay, Laura Balzano, Dimitra Panagou, and Alessandro Abate, editors,Proceedings of the 7th Annual Learning for Dynam- ics & Control Conference, volume 283 ofProceedings of Machine Learning Research, pag...

  6. [6]

    The reality gap in robotics: Challenges, solutions, and best practices.Annual Review of Control, Robotics, and Autonomous Systems, 9, 2025

    Elie Aljalbout, Jiaxu Xing, Angel Romero, Iretiayo Akinola, Caelan Reed Garrett, Eric Heiden, Abhishek Gupta, Tucker Hermans, Yashraj Narang, Dieter Fox, et al. The reality gap in robotics: Challenges, solutions, and best practices.Annual Review of Control, Robotics, and Autonomous Systems, 9, 2025

  7. [7]

    Routledge, 2021

    Eitan Altman.Constrained Markov decision processes. Routledge, 2021

  8. [8]

    Dexterous manipulation through imitation learning: A survey.arXiv preprint arXiv:2504.03515, 2025

    Shan An, Ziyu Meng, Chao Tang, Yuning Zhou, Tengyu Liu, Fangqiang Ding, Shufang Zhang, Yao Mu, Ran Song, Wei Zhang, et al. Dexterous manipulation through imitation learning: A survey.arXiv preprint arXiv:2504.03515, 2025

  9. [9]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  10. [10]

    Successor features for transfer in reinforcement learning.Advances in neural information processing systems, 30, 2017

    André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P Van Hasselt, and David Silver. Successor features for transfer in reinforcement learning.Advances in neural information processing systems, 30, 2017

  11. [11]

    Learning successor states and goal-dependent values: A mathematical viewpoint.arXiv preprint arXiv:2101.07123, 2021

    Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint.arXiv preprint arXiv:2101.07123, 2021

  12. [12]

    Maksim Bobrin, Ilya Zisman, Alexander Nikulin, Vladislav Kurenkov, and Dmitry V . Dylov. Zero-shot adaptation of behavioral foundation models to unseen dynamics. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=dBDBg4WF4F

  13. [13]

    Universal Successor Features Approximators

    Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado Van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators.arXiv preprint arXiv:1812.07626, 2018

  14. [14]

    Cambridge university press, 2004

    Stephen P Boyd and Lieven Vandenberghe.Convex optimization. Cambridge university press, 2004

  15. [15]

    Exploration by Random Network Distillation

    Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation.arXiv preprint arXiv:1810.12894, 2018

  16. [16]

    Robust imitation learning against variations in environment dynamics

    Jongseong Chae, Seungyul Han, Whiyoung Jung, Myungsik Cho, Sungho Choi, and Youngchul Sung. Robust imitation learning against variations in environment dynamics. InInternational Conference on Machine Learning, pages 2828–2852. PMLR, 2022. 10

  17. [17]

    Meta-controller: Few- shot imitation of unseen embodiments and tasks in continuous control.Advances in Neural Information Processing Systems, 37:134250–134286, 2024

    Seongwoong Cho, Donggyun Kim, Jinwoo Lee, and Seunghoon Hong. Meta-controller: Few- shot imitation of unseen embodiments and tasks in continuous control.Advances in Neural Information Processing Systems, 37:134250–134286, 2024

  18. [18]

    Exploring the limitations of behavior cloning for autonomous driving

    Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. InProceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

  19. [19]

    Improving generalization for temporal difference learning: The successor repre- sentation.Neural computation, 5(4):613–624, 1993

    Peter Dayan. Improving generalization for temporal difference learning: The successor repre- sentation.Neural computation, 5(4):613–624, 1993

  20. [20]

    Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

    Esther Derman and Shie Mannor. Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

  21. [21]

    One-shot imitation learning.Advances in neural information processing systems, 30, 2017

    Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning.Advances in neural information processing systems, 30, 2017

  22. [22]

    One-shot visual imitation learning via meta-learning

    Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. InConference on robot learning, pages 357–368. PMLR, 2017

  23. [23]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

  24. [24]

    Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016

    Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016

  25. [25]

    Impact of static friction on sim2real in robotic reinforcement learning

    Xiaoyi Hu, Qiao Sun, Bailin He, Haojie Liu, Xueyi Zhang, Chunpeng Lu, and Jiangwei Zhong. Impact of static friction on sim2real in robotic reinforcement learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 17107–17114. IEEE, 2025

  26. [26]

    Generalization in dexterous manipulation via geometry-aware multi-task learning.arXiv preprint arXiv:2111.03062, 2021

    Wenlong Huang, Igor Mordatch, Pieter Abbeel, and Deepak Pathak. Generalization in dexterous manipulation via geometry-aware multi-task learning.arXiv preprint arXiv:2111.03062, 2021

  27. [27]

    Robust dynamic programming.Mathematics of Operations Research, 30(2): 257–280, 2005

    Garud N Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2): 257–280, 2005

  28. [28]

    Task-embedded control networks for few-shot imitation learning

    Stephen James, Michael Bloesch, and Andrew J Davison. Task-embedded control networks for few-shot imitation learning. InConference on robot learning, pages 783–795. PMLR, 2018

  29. [29]

    Zero-shot reinforcement learning from low quality data.Advances in Neural Information Processing Systems, 37:16894–16942, 2024

    Scott Jeen, Tom Bewley, and Jonathan M Cullen. Zero-shot reinforcement learning from low quality data.Advances in Neural Information Processing Systems, 37:16894–16942, 2024

  30. [30]

    Zero-shot reinforcement learning under partial observability.arXiv preprint arXiv:2506.15446, 2025

    Scott Jeen, Tom Bewley, and Jonathan M Cullen. Zero-shot reinforcement learning under partial observability.arXiv preprint arXiv:2506.15446, 2025

  31. [31]

    DemoDICE: Offline imitation learning with supplementary imperfect demonstrations

    Geon-Hyeong Kim, Seokin Seo, Jongmin Lee, Wonseok Jeon, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim. DemoDICE: Offline imitation learning with supplementary imperfect demonstrations. InInternational Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=BrPdX1bDZkQ

  32. [32]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  33. [33]

    Imitation learning via off-policy dis- tribution matching

    Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy dis- tribution matching. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg-JC4FDr

  34. [34]

    Dart: Noise injection for robust imitation learning

    Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 ofProceedings of Machine Learning Research, pages 143–156. PMLR, 13–15 Nov 2017. URL https: //procee...

  35. [35]

    Aps: Active pretraining with successor features

    Hao Liu and Pieter Abbeel. Aps: Active pretraining with successor features. InInternational Conference on Machine Learning, pages 6736–6747. PMLR, 2021

  36. [36]

    ODICE: Revealing the mystery of distribution correction estimation via orthogonal-gradient update

    Liyuan Mao, Haoran Xu, Weinan Zhang, and Xianyuan Zhan. ODICE: Revealing the mystery of distribution correction estimation via orthogonal-gradient update. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=L8UNn7Llt4

  37. [37]

    Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

    Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

  38. [38]

    Robust rein- forcement learning using offline data.Advances in neural information processing systems, 35: 32211–32224, 2022

    Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust rein- forcement learning using offline data.Advances in neural information processing systems, 35: 32211–32224, 2022

  39. [39]

    Distributionally robust behavioral cloning for robust imitation learning

    Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Distributionally robust behavioral cloning for robust imitation learning. In2023 62nd IEEE Conference on Decision and Control (CDC), pages 1342–1347. IEEE, 2023

  40. [40]

    Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage

    Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Bridging distributionally robust learning and offline rl: An approach to mitigate distribution shift and partial data coverage. In Necmiye Ozay, Laura Balzano, Dimitra Panagou, and Alessandro Abate, editors,Proceedings of the 7th Annual Learning for Dynamics & Control Conference, ...

  41. [41]

    URLhttps://proceedings.mlr.press/v283/panaganti25a.html

  42. [42]

    Foundation policies with hilbert representa- tions.arXiv preprint arXiv:2402.15567, 2024

    Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with hilbert representa- tions.arXiv preprint arXiv:2402.15567, 2024

  43. [43]

    Sim-to-real transfer of robotic control with dynamics randomization

    Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In2018 IEEE international conference on robotics and automation (ICRA), pages 3803–3810. IEEE, 2018

  44. [44]

    Fast imitation via behavior foundation models

    Matteo Pirotta, Andrea Tirinzoni, Ahmed Touati, Alessandro Lazaric, and Yann Ollivier. Fast imitation via behavior foundation models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=qnWtw3l0jb

  45. [45]

    Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1:305–313, 1988

    Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1:305–313, 1988

  46. [46]

    John Wiley & Sons, 2014

    Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  47. [47]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  48. [48]

    Efficient reductions for imitation learning

    Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010

  49. [49]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pag...

  50. [50]

    Optimistic task inference for behavior foundation models.arXiv preprint arXiv:2510.20264, 2025

    Thomas Rupf, Marco Bagatella, Marin Vlastelica, and Andreas Krause. Optimistic task inference for behavior foundation models.arXiv preprint arXiv:2510.20264, 2025

  51. [51]

    Universal value function ap- proximators

    Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function ap- proximators. InInternational conference on machine learning, pages 1312–1320. PMLR, 2015. 12

  52. [52]

    Mitigating covariate shift in behavioral cloning via robust stationary distribution correction

    Seokin Seo, Byung-Jun Lee, Jongmin Lee, HyeongJoo Hwang, Hongseok Yang, and Kee-Eung Kim. Mitigating covariate shift in behavioral cloning via robust stationary distribution correction. Advances in Neural Information Processing Systems, 37:109177–109201, 2024

  53. [53]

    Robust imitation learning from noisy demonstrations

    V oot Tangkaratt, Nontawat Charoenphakdee, and Masashi Sugiyama. Robust imitation learning from noisy demonstrations. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 298–306. PMLR, 13–15 Apr 2021. URL ht...

  54. [54]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

  55. [55]

    Reinforcement learning in robotic systems: A review on sim-to-real transfer.Robotics and Autonomous Systems, page 105327, 2026

    Rajesh Tiwari, Shailesh Khapre, and Avantika Singh. Reinforcement learning in robotic systems: A review on sim-to-real transfer.Robotics and Autonomous Systems, page 105327, 2026

  56. [56]

    Learning one representation to optimize all rewards.Advances in Neural Information Processing Systems, 34:13–23, 2021

    Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards.Advances in Neural Information Processing Systems, 34:13–23, 2021

  57. [57]

    Does zero-shot reinforcement learning exist? arXiv preprint arXiv:2209.14935, 2022

    Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? arXiv preprint arXiv:2209.14935, 2022

  58. [58]

    Does zero-shot reinforcement learning exist?

    Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MYEap_OcQI

  59. [59]

    Robust behavior cloning via global Lipschitz regularization

    Shili Wu, Yizhao Jin, Puhua Niu, Aniruddha Datta, and Sean B Andersson. Robust behavior cloning via global Lipschitz regularization. arXiv preprint arXiv:2506.19250, 2025

  60. [60]

    Imitation learning from imperfect demonstration

    Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6818–6827. PMLR, 09–15 Jun 2019. ...

  61. [61]

    Reinforcement learning with prototypical representations

    Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pages 11920–11931. PMLR, 2021

  62. [62]

    Don’t change the algorithm, change the data: Exploratory data for offline reinforcement learning

    Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don’t change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022

  63. [63]

    Fast Bellman updates for Wasserstein distributionally robust MDPs

    Zhuodong Yu, Ling Dai, Shaohang Xu, Siyang Gao, and Chin Pang Ho. Fast Bellman updates for Wasserstein distributionally robust MDPs. Advances in Neural Information Processing Systems, 36:30554–30578, 2023

  64. [64]

    Breeze: Towards robust zero-shot reinforcement learning

    Kexin Zheng, Lauriane Teyssier, Yinan Zheng, Yu Luo, and Xianyuan Zhan. Breeze: Towards robust zero-shot reinforcement learning. https://github.com/Whiterrrrr/BREEZE, 2026. GitHub repository, accessed May 7, 2026

  65. [65]

    Watch, try, learn: Meta-learning from demonstrations and rewards

    Allan Zhou, Eric Jang, Daniel Kappler, Alex Herzog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, and Chelsea Finn. Watch, try, learn: Meta-learning from demonstrations and rewards. In International Conference on Learning Representations. URL https://openreview.net/forum?id=SJg5J6NtDr

  66. [66]

    Appendices

    Appendices: A Missing Proofs; B Extended Related Work; C Experimental Setup; C.1 ExORL Domains; C.1.1 Walker; C.1.2 Quadruped ...

  67. [67]

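    Proof

    Then the optimization problem in Proposition 1 can be simplified to

    $$\min_{z}\ \min_{\lambda\in[0,(1+\varepsilon_l)L]} \Big( \mathbb{E}_{s\sim\rho^{\pi_D}_{T_o}}\big[\big(L(\pi_z(\cdot|s),\pi_D(\cdot|s))-\lambda\big)_+\big] + \varepsilon_l\Big(\max_{s:\,\rho^{\pi_D}_{T_o}(s)>0} L(\pi_z(\cdot|s),\pi_D(\cdot|s))-\lambda\Big)_+ + \lambda \Big).$$

    Proof. Let us fix the learner’s task vector $z$ and the corresponding policy $\pi_z$, and define the point-wise loss $\ell_z(s) := L(\pi_z(\cdot|s),\pi_D(\cdot|s)) = \|\pi_z(s)-\pi_D(s)\|_2^2$. Since t...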
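For a fixed task vector $z$, the inner minimization over $\lambda$ in the simplified problem above is one-dimensional, convex, and piecewise linear, so it can be evaluated exactly on a batch of point-wise losses. The sketch below is illustrative only and is not the authors' implementation: it assumes the expectation and the maximum over the support are both approximated from the same sample of losses $\ell_z(s_i)$, and the names `robust_bc_objective`, `eps_l`, and `loss_bound` (standing in for the $L$ that appears in the bound on $\lambda$) are hypothetical.

```python
import numpy as np

def robust_bc_objective(losses, eps_l, loss_bound):
    """Inner min over lambda for a fixed z (sample-based sketch).

    losses     : point-wise losses l_z(s_i) on sampled states (assumed >= 0)
    eps_l      : robustness level, the epsilon_l in the formula
    loss_bound : hypothetical stand-in for L in the interval [0, (1 + eps_l) * L]
    Returns (objective value, minimizing lambda).
    """
    losses = np.asarray(losses, dtype=float)
    lam_max = (1.0 + eps_l) * loss_bound
    worst = losses.max()  # assumption: sample max approximates the max over the support

    def value(lam):
        mean_excess = np.maximum(losses - lam, 0.0).mean()
        return mean_excess + eps_l * max(worst - lam, 0.0) + lam

    # The objective is convex and piecewise linear in lambda with kinks at the
    # loss values, so its minimum over the interval is attained at a kink or
    # at an endpoint. Kinks outside the interval are clipped onto the endpoints.
    candidates = np.clip(np.concatenate(([0.0, lam_max], losses)), 0.0, lam_max)
    values = [value(lam) for lam in candidates]
    best = int(np.argmin(values))
    return float(values[best]), float(candidates[best])

value, lam = robust_bc_objective([0.0, 1.0, 2.0, 3.0], eps_l=0.5, loss_bound=10.0)
plain, _ = robust_bc_objective([0.0, 1.0, 2.0, 3.0], eps_l=0.0, loss_bound=10.0)
```

With `eps_l = 0` and non-negative losses the minimizer is $\lambda = 0$ and the objective reduces to the mean loss, which is a quick sanity check on the formula. The outer minimization over $z$ is not covered here and would typically be handled with gradient steps.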

  68. [68]

    Pretrained on RND data

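    (Algorithm listing; the beginning, including the start of the loss estimate below, is missing.)

    $+\ \epsilon\tau - \frac{\varepsilon}{b}\Big[\sum_{i=1}^{b} f\big(w^\star_{Q_\theta,\tau,\pi_z}(s_i,a_i,s'_i)\big) + w^\star_{Q_\theta,\tau,\pi_z}(s_i,a_i,s'_i)\,c_{Q_\theta,\pi_z}(s_i,a_i,s'_i)\Big]$

    10: Update $\theta \leftarrow \theta - \eta_Q \nabla_\theta L_{Q_\theta,\tau}$
    11: Update $\tau \leftarrow \max(0,\ \tau - \eta_\tau \nabla_\tau L_{Q_\theta,\tau})$
    12: // Step 3: Policy update (actor)
    13: Estimate $L_{\pi_z} = \frac{1}{b}\sum_{i=1}^{b} w^\star_{Q_\theta,\tau,\pi_z}(s_i,a_i,s'_i)\,L_{\pi_z}(s_i)$
    14: Update $z \leftarrow z - \eta_\pi \nabla_z L_{\pi_z}$
    15: end for
    16: return $(Q_\theta, \tau, z)$

    Table column headers (rest of the table not recovered): Gravity, Mass, Joint Friction, Loss, Run, Walk, Flip, Sta...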
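Steps 11, 13 and 14 of the listing (the projected update of τ, the weighted policy-loss estimate, and the gradient step on z) can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the policy is taken to be linear in z, $\pi_z(s) = \Phi(s)\,z$, the per-state loss is the squared error $\|\pi_z(s)-\pi_D(s)\|_2^2$, and the weights $w^\star$ are treated as fixed numbers supplied by the critic step. All function and argument names are hypothetical.

```python
import numpy as np

def weighted_actor_step(z, feats, expert_actions, weights, lr_pi):
    """One update z <- z - lr_pi * grad_z L_pi (steps 13-14, toy linear policy).

    feats          : array (b, action_dim, z_dim), hypothetical features Phi(s_i)
    expert_actions : array (b, action_dim), demonstrator actions pi_D(s_i)
    weights        : array (b,), weights w*(s_i, a_i, s'_i), held constant here
    Returns (updated z, weighted loss before the update).
    """
    preds = np.einsum("bad,d->ba", feats, z)           # pi_z(s_i) = Phi(s_i) z
    resid = preds - expert_actions
    per_sample = np.sum(resid ** 2, axis=1)            # squared-error loss per state
    loss = float(np.mean(weights * per_sample))        # (1/b) * sum_i w_i * loss_i
    grad = 2.0 * np.einsum("b,bad,ba->d", weights, feats, resid) / len(weights)
    return z - lr_pi * grad, loss

def projected_tau_step(tau, grad_tau, lr_tau):
    """Step 11: gradient step on tau followed by projection onto tau >= 0."""
    return max(0.0, tau - lr_tau * grad_tau)

# Tiny usage example with a one-dimensional action and task vector.
feats = np.ones((2, 1, 1))
actions = np.array([[1.0], [3.0]])
weights = np.array([1.0, 1.0])
z0 = np.zeros(1)
z1, loss0 = weighted_actor_step(z0, feats, actions, weights, lr_pi=0.1)
_, loss1 = weighted_actor_step(z1, feats, actions, weights, lr_pi=0.1)
```

In practice the gradient with respect to z would come from automatic differentiation through a pretrained policy network rather than from a closed-form expression as in this toy case.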