When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited
Pith reviewed 2026-05-19 20:34 UTC · model grok-4.3
The pith
Reformulating BFM task inference as a minimax problem over dynamics perturbations yields robust policies from single-environment offline data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Casting BFM task inference as a robust minimax optimization over possible dynamics perturbations produces policies that adapt to worst-case shifts while depending solely on offline data collected in a single nominal environment. The authors present this as, to their knowledge, the first BFM-based framework to achieve dynamics robustness without modifying pretraining or requiring multi-environment data, and they report that the resulting policies outperform standard BFM and robust offline IL baselines under dynamics shifts.
What carries the argument
The minimax optimization problem solved at task-inference time that accounts for worst-case dynamics perturbations while adapting a pretrained BFM.
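To make the shape of this minimax step concrete, here is a minimal sketch in Python. It is not the paper's procedure: it assumes a finite set of candidate task vectors and a finite set of perturbations, and `toy_loss`, `candidates`, and `perturbations` are hypothetical stand-ins for the imitation loss, the BFM task-vector space, and the modeled dynamics shifts.

```python
# Toy minimax task inference: pick the task vector whose worst-case loss
# over a finite set of dynamics perturbations is smallest.
# All names and numbers are hypothetical illustrations, not the paper's method.

def robust_task_inference(loss, candidates, perturbations):
    """Return (best_z, worst_case_loss) for min_z max_delta loss(z, delta)."""
    best_z, best_value = None, float("inf")
    for z in candidates:
        # Inner maximization: worst perturbation for this task vector.
        worst = max(loss(z, delta) for delta in perturbations)
        # Outer minimization: keep the task vector with the smallest worst case.
        if worst < best_value:
            best_z, best_value = z, worst
    return best_z, best_value


def toy_loss(z, delta):
    # The loss is lowest when z matches the shifted optimum 1.0 + delta.
    return (z - (1.0 + delta)) ** 2


candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
perturbations = [-0.5, 0.0, 1.5]  # asymmetric set around the nominal delta = 0

# Nominal inference ignores perturbations; robust inference hedges against them.
nominal_z = min(candidates, key=lambda z: toy_loss(z, 0.0))
robust_z, robust_value = robust_task_inference(toy_loss, candidates, perturbations)
print(nominal_z, robust_z, robust_value)
```

In this toy setting the nominal choice is `1.0`, while the robust choice moves to `1.5` because the perturbation set is asymmetric. The point is only that the robust answer depends on the chosen perturbation set, which is the part the referee asks the authors to specify.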
If this is right
- Robust policies are obtained entirely at task-inference time without retraining the BFM.
- The approach relies solely on offline data from one nominal environment.
- It outperforms both standard BFM adaptation and prior robust offline IL methods under dynamics shifts.
- The framework improves practicality of BFMs in settings with varying friction, actuation, or sensor noise.
Where Pith is reading between the lines
- The separation of task-agnostic pretraining from robust inference may generalize to other pretrained models in robotics and control.
- Choosing a richer class of perturbation models inside the minimax step could further close the gap between modeled and real shifts.
- The same inference-time robustness idea might reduce the need for expensive multi-environment data collection in related offline RL settings.
Load-bearing premise
The minimax optimization over dynamics perturbations can be solved tractably from offline nominal data alone and produces policies that generalize to actual dynamics shifts.
What would settle it
A controlled experiment applying an unmodeled dynamics shift (for example, a friction change outside the perturbation set used in the minimax step) and measuring whether the inferred policy still matches nominal performance.
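The bookkeeping for such an experiment could be sketched as follows. The parameter names, bounds, and returns are hypothetical; the paper's actual perturbation set and metrics are not given on this page.

```python
# Sketch of the bookkeeping for an out-of-set robustness check.
# Parameter names, bounds, and returns are hypothetical placeholders.

def inside_perturbation_set(shift, bounds):
    """True if every shifted parameter lies within its modeled [low, high] range."""
    return all(bounds[name][0] <= value <= bounds[name][1]
               for name, value in shift.items())


def performance_retention(shifted_return, nominal_return):
    """Fraction of nominal return that is kept under the shift."""
    return shifted_return / nominal_return


# Modeled perturbation set (assumed): multiplicative scales around 1.0.
bounds = {"friction_scale": (0.8, 1.2), "actuation_gain": (0.9, 1.1)}

# Test condition with friction deliberately outside the modeled range.
test_shift = {"friction_scale": 1.5, "actuation_gain": 1.0}

out_of_set = not inside_perturbation_set(test_shift, bounds)
retention = performance_retention(shifted_return=450.0, nominal_return=600.0)
print(out_of_set, retention)
```

A result would count toward the generalization claim only when `out_of_set` is true and the retention stays close to 1; how close is a threshold the experimenter has to fix in advance.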
Original abstract
Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policy can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a robust formulation of task inference for Behavior Foundation Models (BFMs) in offline imitation learning. By casting adaptation as a minimax optimization over dynamics perturbations, the method aims to produce policies robust to shifts (e.g., friction, actuation, sensor noise) while using only offline trajectories from a single nominal environment and without altering the BFM pretraining stage. The authors claim this is the first such BFM-based robust framework and report significant outperformance versus standard BFM and robust offline IL baselines under dynamics shifts.
Significance. If the minimax task-inference procedure can be solved tractably from nominal data and yields policies that generalize to unmodeled real-world dynamics changes, the result would meaningfully improve the practicality of pretrained BFMs in non-stationary environments. Shifting robustness to inference time rather than pretraining or data collection is a potentially scalable direction for offline IL.
major comments (2)
- [§3] §3 (Method): The central claim that the minimax problem over dynamics perturbations can be solved from nominal offline trajectories alone is load-bearing, yet the manuscript provides no explicit description of the inner maximization approximation, the class of allowed perturbations, or the surrogate used in place of an explicit dynamics model. Without this, it is unclear whether the resulting policy is robust only to the modeled set or to actual environment shifts.
- [§4] §4 (Experiments): The reported outperformance under dynamics shifts lacks sufficient protocol details—specifically, how the test perturbations are generated, whether they lie inside or outside the perturbation class used in training, and the presence of error bars or statistical tests across multiple seeds. This makes it difficult to assess whether the gains support generalization beyond the nominal environment.
minor comments (2)
- [Introduction] The abstract and introduction use the phrase 'to the best of our knowledge' for the 'first BFM-based framework'; a brief related-work paragraph clarifying the precise novelty relative to prior robust IL and BFM papers would strengthen the positioning.
- [§3] Notation for the robust objective (e.g., the definition of the perturbation set and the inner/outer players) should be introduced with a single equation block rather than scattered across paragraphs for readability.
Simulated Author's Rebuttal
We thank the referee for the positive summary, the recognition of the potential impact, and the constructive major comments. We address each point below and will revise the manuscript accordingly to improve clarity and completeness.
Referee: [§3] §3 (Method): The central claim that the minimax problem over dynamics perturbations can be solved from nominal offline trajectories alone is load-bearing, yet the manuscript provides no explicit description of the inner maximization approximation, the class of allowed perturbations, or the surrogate used in place of an explicit dynamics model. Without this, it is unclear whether the resulting policy is robust only to the modeled set or to actual environment shifts.
Authors: We agree that Section 3 would benefit from a more self-contained and explicit treatment. In the revision we will expand the method section with a new subsection that (i) formally defines the perturbation class as bounded parametric changes to friction coefficients, actuation gains, and additive sensor noise (with explicit bounds provided), (ii) describes the inner maximization as a first-order surrogate obtained by linearizing the latent dynamics around the nominal trajectories using the BFM encoder gradients, and (iii) states that the resulting policy is guaranteed to be robust inside this modeled set while providing empirical evidence of generalization to unmodeled shifts. These additions will make the load-bearing claim fully traceable from the nominal data alone. revision: yes
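The idea of a first-order surrogate over a bounded perturbation class can be illustrated with a small sketch. It is a simplification and not the authors' construction: it linearizes a scalar loss directly in the perturbation parameters rather than linearizing latent dynamics through BFM encoder gradients, and the gradient values and bounds are made up. Under a box constraint |delta_i| <= b_i, the linearized worst case has the closed form L0 + sum_i b_i |g_i|.

```python
# First-order worst case of a loss over a box-shaped perturbation set.
# Assumption: loss(delta) ~ nominal_loss + gradient . delta near delta = 0,
# with |delta_i| <= bounds_i. Values are illustrative only.

def first_order_worst_case(nominal_loss, gradient, bounds):
    """Maximize nominal_loss + g . delta subject to |delta_i| <= bounds_i."""
    # The maximizer pushes each coordinate to the bound in the gradient's direction.
    worst_delta = [b if g >= 0 else -b for g, b in zip(gradient, bounds)]
    worst_loss = nominal_loss + sum(abs(g) * b for g, b in zip(gradient, bounds))
    return worst_loss, worst_delta


# Hypothetical sensitivities to (friction, actuation gain, sensor noise).
gradient = [2.0, -4.0, 0.5]
bounds = [0.25, 0.5, 1.0]

worst_loss, worst_delta = first_order_worst_case(1.0, gradient, bounds)
print(worst_loss, worst_delta)
```

Because the surrogate is only a linearization, the guarantee it gives holds for the linear model inside the box. Whether it transfers to the true dynamics is exactly what the empirical evidence promised in the response would need to show.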
Referee: [§4] §4 (Experiments): The reported outperformance under dynamics shifts lacks sufficient protocol details—specifically, how the test perturbations are generated, whether they lie inside or outside the perturbation class used in training, and the presence of error bars or statistical tests across multiple seeds. This makes it difficult to assess whether the gains support generalization beyond the nominal environment.
Authors: We concur that the experimental protocol requires additional detail for reproducibility and to substantiate the generalization claims. In the revised manuscript we will (i) specify the exact procedure for generating test perturbations (parameter sampling ranges and randomization seeds), (ii) explicitly indicate which test conditions lie inside versus outside the training perturbation class, and (iii) report all results with mean and standard deviation over five independent seeds together with paired t-test p-values against baselines. These changes will allow readers to evaluate the strength of the out-of-distribution generalization evidence. revision: yes
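The reporting protocol described in this response (mean and standard deviation over five seeds plus a paired t-test) can be sketched with the Python standard library. The scores below are invented placeholders. The sketch stops at the t statistic and degrees of freedom; a p-value would come from the t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available.

```python
import math
import statistics

# Per-seed returns for two methods evaluated on the same five seeds.
# These numbers are invented placeholders, not results from the paper.
ours = [10.0, 12.0, 11.0, 13.0, 14.0]
baseline = [8.0, 9.0, 10.0, 10.0, 13.0]


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # zero if all differences are equal
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


mean_ours, std_ours = summarize(ours)
t_stat, dof = paired_t_statistic(ours, baseline)
print(mean_ours, std_ours, t_stat, dof)
```

Pairing by seed matters here: the test is run on per-seed differences, so it is only valid when both methods are evaluated under the same seeds and the same test perturbations.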
Circularity Check
No circularity: new robust minimax formulation introduced at task-inference time without reducing to fitted inputs or self-citations
full rationale
The paper's central step is to reformulate BFM task inference as a robust minimax optimization over dynamics perturbations, solved from nominal offline data. This is presented as an external modeling choice rather than a re-expression of any pre-fitted quantity or a result derived solely from prior self-citations. No equations in the abstract or description reduce a prediction to its own inputs by construction, and the claim of being the first such framework does not rely on load-bearing self-citation chains. The derivation remains self-contained against external benchmarks of robust optimization applied to imitation learning.
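Excerpt from the paper's appendix
Then the optimization problem in Proposition 1 can be simplified to
$$
\min_{z} \; \min_{\lambda \in [0,\, (1+\varepsilon_l) L]} \left( \mathbb{E}_{s \sim \rho^{\pi_D}_{T^o}} \Big[ \big( L(\pi_z(\cdot|s), \pi_D(\cdot|s)) - \lambda \big)_+ \Big] + \varepsilon_l \Big( \max_{s:\, \rho^{\pi_D}_{T^o}(s) > 0} L(\pi_z(\cdot|s), \pi_D(\cdot|s)) - \lambda \Big)_+ + \lambda \right).
$$
Proof. Let us fix the learner's task vector $z$ and the corresponding policy $\pi_z$, and define the point-wise loss $\ell_z(s) := L(\pi_z(\cdot|s), \pi_D(\cdot|s)) = \|\pi_z(s) - \pi_D(s)\|_2^2$. Since t...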
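For a fixed task vector, the inner minimization over $\lambda$ in this simplified objective can be evaluated numerically. The sketch below makes two assumptions that are not in the source: the expectation is replaced by an average over sampled per-state losses, and the maximum over the support is replaced by the largest sampled loss. The function name and the example values are hypothetical.

```python
# Sample-based evaluation of
#   min over lambda in [0, (1 + eps) * loss_bound] of
#   mean((loss - lambda)_+) + eps * (max(loss) - lambda)_+ + lambda
# for a fixed task vector. Names and numbers are illustrative assumptions.

def robust_loss(losses, eps, loss_bound):
    upper = (1.0 + eps) * loss_bound
    worst = max(losses)

    def objective(lam):
        mean_excess = sum(max(l - lam, 0.0) for l in losses) / len(losses)
        return mean_excess + eps * max(worst - lam, 0.0) + lam

    # The objective is convex and piecewise linear in lambda, with kinks at the
    # sampled losses, so the minimum lies at an endpoint or at one of the kinks.
    candidates = [0.0, upper] + [l for l in losses if 0.0 <= l <= upper]
    return min(objective(lam) for lam in candidates)


losses = [1.0, 2.0, 3.0, 6.0]  # hypothetical per-state imitation losses
nominal = robust_loss(losses, eps=0.0, loss_bound=6.0)
moderate = robust_loss(losses, eps=0.5, loss_bound=6.0)
extreme = robust_loss(losses, eps=1.0, loss_bound=6.0)
print(nominal, moderate, extreme)
```

In this toy example the value equals the average loss at `eps=0.0` and the largest loss at `eps=1.0`, with `eps=0.5` in between, which matches the reading of $\varepsilon_l$ as a knob that moves the objective from average-case toward worst-case imitation error.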
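Excerpt from the paper's algorithm listing
… $+\, \epsilon\tau - \frac{\varepsilon}{b} \Big[ \sum_{i=1}^{b} f\big(w^\star_{Q_\theta,\tau,\pi_z}(s_i, a_i, s'_i)\big) + w^\star_{Q_\theta,\tau,\pi_z}(s_i, a_i, s'_i)\, c_{Q_\theta,\pi_z}(s_i, a_i, s'_i) \Big]$
10: Update $\theta \leftarrow \theta - \eta_Q \nabla_\theta L_{Q_\theta,\tau}$
11: Update $\tau \leftarrow \max(0,\ \tau - \eta_\tau \nabla_\tau L_{Q_\theta,\tau})$
12: // Step 3: Policy update (actor)
13: Estimate: $L_{\pi_z} = \frac{1}{b} \sum_{i=1}^{b} w^\star_{Q_\theta,\tau,\pi_z}(s_i, a_i, s'_i)\, L_{\pi_z}(s_i)$
14: Update $z \leftarrow z - \eta_\pi \nabla_z L_{\pi_z}$
15: end for
16: return $(Q_\theta, \tau, z)$
[Table or figure residue: Gravity, Mass, Joint Friction; Loss; Run, Walk, Flip, Sta...]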
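The update pattern in lines 10 to 14 of this listing (a gradient step on the critic parameters, a projected step that keeps $\tau$ non-negative, a weighted policy-loss estimate, and a gradient step on the task vector) can be sketched as follows. The gradients are passed in as plain numbers because the critic loss itself is truncated in the source; the function names, learning rates, and values are hypothetical.

```python
# One iteration of the update pattern in lines 10-14, with gradients supplied
# by the caller. This is a schematic sketch, not the paper's implementation.

def weighted_policy_loss(weights, per_state_losses):
    """Line 13: batch average of weight * per-state policy loss."""
    return sum(w * l for w, l in zip(weights, per_state_losses)) / len(weights)


def update_step(theta, tau, z, grad_theta, grad_tau, grad_z,
                lr_q=0.1, lr_tau=0.1, lr_pi=0.1):
    # Line 10: plain gradient step on the critic parameters.
    theta = [t - lr_q * g for t, g in zip(theta, grad_theta)]
    # Line 11: gradient step followed by projection onto tau >= 0.
    tau = max(0.0, tau - lr_tau * grad_tau)
    # Line 14: gradient step on the task vector.
    z = [zi - lr_pi * g for zi, g in zip(z, grad_z)]
    return theta, tau, z


policy_loss = weighted_policy_loss([1.0, 2.0, 1.0], [0.5, 1.0, 1.5])
theta, tau, z = update_step(theta=[1.0], tau=0.05, z=[0.0],
                            grad_theta=[2.0], grad_tau=1.0, grad_z=[-1.0])
print(policy_loss, theta, tau, z)
```

The `max(0.0, ...)` projection is the only non-standard piece: it stops the multiplier-like variable $\tau$ from going negative when a step would overshoot, as happens in this example.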