Abstraction for Offline Goal-Conditioned Reinforcement Learning
Pith reviewed 2026-05-22 07:43 UTC · model grok-4.3
The pith
Hierarchical policies achieve absolute abstraction in offline goal-conditioned reinforcement learning by using relativised options to reuse experience across symmetric state-goal pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Markov Decision Processes in goal-conditioned reinforcement learning often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs. By introducing relativised options as well as distinct representations for different levels of the hierarchy, an agent can abstract from the absolute frame of reference and reuse experience across similar contexts of the state-space. Two simple algorithms are presented for learning these relativised options and performing the abstraction, and experiments show that the approach improves performance in the offline setting.
What carries the argument
Relativised options: temporally extended actions defined relative to the current goal rather than in absolute coordinates. They are paired with distinct representations at each hierarchy level, which decouple absolute position from relative structure.
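To make the difference between the two frames of reference concrete, here is a minimal sketch on a 2D grid. It assumes positions are integer (x, y) tuples and that the only symmetry is translation; the function names `absolute_key` and `relative_key` are hypothetical and do not come from the paper.

```python
from typing import Tuple

Position = Tuple[int, int]  # hypothetical (x, y) grid coordinates


def absolute_key(state: Position, goal: Position) -> Tuple[int, int, int, int]:
    """Absolute frame: every (state, goal) pair is a distinct context."""
    return (state[0], state[1], goal[0], goal[1])


def relative_key(state: Position, goal: Position) -> Tuple[int, int]:
    """Goal-relative frame: keep only the offset from state to goal."""
    return (goal[0] - state[0], goal[1] - state[1])


# Two state-goal pairs in different parts of the grid, same displacement.
pair_a = ((1, 1), (3, 2))
pair_b = ((6, 4), (8, 5))

print(absolute_key(*pair_a) == absolute_key(*pair_b))  # False
print(relative_key(*pair_a) == relative_key(*pair_b))  # True
```

Under this assumed encoding, experience collected for `pair_a` could inform behaviour for `pair_b`, which is the kind of reuse the summary attributes to relativised options.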
If this is right
- An agent reuses experience across similar state-goal pairs instead of treating each pair as a separate learning problem.
- Hierarchy supplies both temporal abstraction to manage long horizons and absolute abstraction to exploit symmetries.
- Two algorithms become available for learning relativised options and for abstracting away from absolute frames of reference.
- Offline goal-conditioned reinforcement learning performance improves when these inductive biases are added to standard methods.
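The size of the potential reuse benefit can be illustrated with a small counting exercise. The sketch below assumes an obstacle-free n-by-n grid with full translational symmetry, which is an idealisation chosen for illustration and not a setting taken from the paper; `count_contexts` is a hypothetical helper.

```python
from itertools import product
from typing import Tuple


def count_contexts(n: int) -> Tuple[int, int]:
    """Count distinct contexts on an n x n grid in each frame of reference.

    Returns (absolute, relative): the number of distinct (state, goal)
    pairs, and the number of distinct state-to-goal offsets.
    """
    cells = list(product(range(n), repeat=2))
    absolute = {(s, g) for s in cells for g in cells}
    relative = {(g[0] - s[0], g[1] - s[1]) for s in cells for g in cells}
    return len(absolute), len(relative)


n_abs, n_rel = count_contexts(5)
print(n_abs, n_rel)  # 625 81
```

In this idealised case the absolute frame has n^4 contexts while the relative frame has (2n - 1)^2, so the gap widens quickly as the grid grows. Real environments with walls or asymmetric dynamics would share less structure.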
Where Pith is reading between the lines
- The same relativised-option construction could be tested in online goal-conditioned settings where data efficiency matters.
- Environments with explicit translational or rotational symmetry, such as navigation or manipulation tasks, would be natural places to measure the size of the reuse benefit.
- The separation of representations at different hierarchy levels might interact with function-approximation choices in large continuous spaces.
Load-bearing premise
Goal-conditioned Markov Decision Processes contain enough symmetry and shared structure across state-goal pairs that absolute abstraction from hierarchy will produce measurable reuse of experience.
What would settle it
Running the proposed algorithms on a goal-conditioned task deliberately constructed with no symmetries or shared structure across goals, such as a single unique target in a fully asymmetric environment, and observing no improvement over a flat baseline policy.
Original abstract
Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hierarchical framework for offline goal-conditioned reinforcement learning that exploits redundancies and symmetries in MDPs across state-goal pairs. It introduces relativised options together with level-specific representations to achieve absolute abstraction (in addition to temporal abstraction), enabling experience reuse across similar contexts. Two simple algorithms are presented for learning these options and performing abstraction from the absolute frame of reference, with experiments reported to show performance gains from the resulting inductive biases.
Significance. If the central claims hold, the work provides a concrete mechanism for incorporating absolute abstraction into hierarchical offline GCRL policies. This could meaningfully improve sample efficiency by turning structural symmetries into reusable experience, extending the standard motivation for hierarchy beyond horizon reduction. The explicit separation of temporal and absolute abstraction, together with the introduction of relativised options, offers a falsifiable inductive bias that is directly testable in standard GCRL benchmarks.
major comments (2)
- §4 (Relativised Options): the formal definition of a relativised option must be shown to preserve the optimal value function of the original goal-conditioned MDP; without an explicit invariance or bisimulation argument, it is unclear whether the absolute-abstraction claim is loss-free or merely an approximation.
- §5.2 (Algorithm 2): the update rule for abstracting from the absolute frame appears to rely on an auxiliary representation network whose training objective is not stated; if this network is learned from the same offline dataset, the claimed separation of levels risks circularity in the experience-reuse argument.
minor comments (2)
- Figure 2: the diagram of the two-level hierarchy would benefit from explicit arrows indicating which components are shared versus level-specific.
- Related Work: the discussion of prior hierarchical GCRL methods (e.g., HIRO, HAC) should clarify in one sentence how relativised options differ from goal-relativisation techniques already present in the literature.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important aspects of our framework for relativised options and absolute abstraction in offline goal-conditioned RL. We address each major comment below and indicate the corresponding revisions.
- Referee: §4 (Relativised Options): the formal definition of a relativised option must be shown to preserve the optimal value function of the original goal-conditioned MDP; without an explicit invariance or bisimulation argument, it is unclear whether the absolute-abstraction claim is loss-free or merely an approximation.
  Authors: We agree that an explicit invariance argument strengthens the absolute-abstraction claim. In the revised manuscript we will add a theorem in §4 establishing that relativised options preserve the optimal value function of the underlying goal-conditioned MDP. The proof will rely on a bisimulation relation defined over equivalence classes of state-goal pairs that respect the symmetries of the MDP, showing that the abstraction is loss-free whenever those symmetries hold.
  Revision: yes
- Referee: §5.2 (Algorithm 2): the update rule for abstracting from the absolute frame appears to rely on an auxiliary representation network whose training objective is not stated; if this network is learned from the same offline dataset, the claimed separation of levels risks circularity in the experience-reuse argument.
  Authors: The auxiliary representation network is trained with the level-specific contrastive objective already defined in §5.1. We will revise §5.2 to state this objective explicitly and to emphasise that each level uses a distinct representation and loss, trained once on the offline dataset before policy optimisation. This ordering removes any circular dependency and preserves the separation between temporal and absolute abstraction.
  Revision: yes
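For orientation, the kind of invariance statement discussed in the first response can be sketched as follows. This is the standard value-invariance property of an MDP symmetry, written in hypothetical notation (a bijection \sigma acting on states and goals, a bijection \sigma_A acting on actions); it is not the theorem the authors promise, whose exact form is not given here.

```latex
% Hypothetical notation, not taken from the paper:
% \sigma is a bijection on states (also applied to goals),
% \sigma_A is a bijection on actions.
\begin{align*}
&\text{Assume, for all } s, a, s', g: \\
&\quad P\big(\sigma(s') \mid \sigma(s), \sigma_A(a)\big) = P(s' \mid s, a), \\
&\quad r\big(\sigma(s), \sigma_A(a), \sigma(g)\big) = r(s, a, g). \\
&\text{Then, for all } s, g: \\
&\quad V^*\big(\sigma(s), \sigma(g)\big) = V^*(s, g).
\end{align*}
```

When both assumptions hold exactly, state-goal pairs related by \sigma share an optimal value, so treating them as one class loses nothing. If the symmetry holds only approximately, the equality would be expected to become a bound, which is the distinction the referee asks the authors to make explicit.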
Circularity Check
No significant circularity; framework presented as new inductive bias with independent experimental validation
The paper introduces relativised options and level-specific representations as novel mechanisms for absolute abstraction in offline GCRL, motivated by observed MDP redundancies. No derivation chain reduces a claimed result to a fitted parameter or self-citation by construction; the abstract and framework description present the hierarchy as an explicit inductive bias whose benefits are then tested empirically. No equations or steps equate outputs to inputs tautologically, and the central claim rests on the proposed algorithms rather than renaming or smuggling prior results. This is a standard case of a self-contained contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: MDPs often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs.
invented entities (1)
- relativised options (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "We build on the work of Robert et al. [47] and Li et al. [40] to show that, by using a hierarchical policy with absolute abstraction, the maximum error is bounded by..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Leslie Pack Kaelbling. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1094–1099, 1993
- [2] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1312–1320, 2015
- [3] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, November 2020. URL http://arxiv.org/abs/2005.01643. arXiv:2005.01643 [cs]
- [4] Sergey Levine. Understanding the World Through Action, October 2021. URL http://arxiv.org/abs/2110.12543. arXiv:2110.12543 [cs]
- [5] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking Offline Goal-Conditioned RL, February 2025. URL http://arxiv.org/abs/2410.20092. arXiv:2410.20092 [cs]
- [6] Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, and Esther Luna Colombini. A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems. IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, August 2024. ISSN 2162-237X, 2162-2388. doi: 10.1109/TNNLS.2023.3250269. URL http://arxiv.org/abs/2203.01387. arXiv:22...
- [7] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning, October 2021. URL http://arxiv.org/abs/2110.06169. arXiv:2110.06169 [cs]
- [8] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning, October 2019. URL http://arxiv.org/abs/1910.00177. arXiv:1910.00177 [cs]
- [9] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the Minimalist Approach to Offline Reinforcement Learning, October 2023. URL http://arxiv.org/abs/2305.09836. arXiv:2305.09836 [cs]
- [10] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of Real-World Reinforcement Learning, April 2019. URL http://arxiv.org/abs/1904.12901. arXiv:1904.12901 [cs]
- [11] Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is Value Learning Really the Main Bottleneck in Offline RL?, October 2024. URL http://arxiv.org/abs/2406.09329. arXiv:2406.09329 [cs]
- [12] Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon Reduction Makes RL Scalable, October 2025. URL http://arxiv.org/abs/2506.04168. arXiv:2506.04168 [cs]
- [13] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, August 1999. ISSN 00043702. doi: 10.1016/S0004-3702(99)00052-1. URL https://linkinghub.elsevier.com/retrieve/pii/S0004370299000521
- [14] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3540–3549, 2017
- [15] Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical reinforcement learning.
- [16] Kevin Black, Manuel Y. Galliker, and Sergey Levine. Real-Time Execution of Action Chunking Flow Policies, December 2025. URL http://arxiv.org/abs/2506.07339. arXiv:2506.07339 [cs]
- [17] Kwanyoung Park, Seohong Park, Youngwoon Lee, and Sergey Levine. Scalable Offline Model-Based RL with Action Chunks, December 2025. URL http://arxiv.org/abs/2512.08108. arXiv:2512.08108 [cs]
- [18] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021
- [19] Arsh Tangri, Nichols Crawford Taylor, Haojie Huang, and Robert Platt. Equivariant goal conditioned contrastive reinforcement learning. 2025. doi: 10.48550/arXiv.2507.16139
- [20] Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. Springer, 2012
- [21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018
- [22] Adam White, Joseph Modayil, and Richard S. Sutton. Scaling life-long off-policy learning. CoRR, abs/1206.6262, 2012. URL http://arxiv.org/abs/1206.6262
- [23] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay, February 2018. URL http://arxiv.org/abs/1707.01495. arXiv:1707.01495 [cs]
- [24] Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. HIQL: Offline Goal-Conditioned RL with Latent States as Actions, March 2024. URL http://arxiv.org/abs/2307.11949. arXiv:2307.11949 [cs]
- [25] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning, August 2020. URL http://arxiv.org/abs/2006.04779. arXiv:2006.04779 [cs]
- [26] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble, October 2021. URL http://arxiv.org/abs/2110.01548. arXiv:2110.01548 [cs]
- [27] Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive Learning as Goal-Conditioned Reinforcement Learning, February 2023. URL http://arxiv.org/abs/2206.07568. arXiv:2206.07568 [cs]
- [28] Scott Fujimoto and Shixiang Shane Gu. A Minimalist Approach to Offline Reinforcement Learning, December 2021. URL http://arxiv.org/abs/2106.06860. arXiv:2106.06860 [cs]
- [29] Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online RL. CoRR, abs/2007.11091, 2020. URL https://arxiv.org/abs/2007.11091
- [30] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic Architecture, December. URL http://arxiv.org/abs/1609.05140. arXiv:1609.05140 [cs]
- [32] Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, and Yusung Kim. Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning, June 2025. URL http://arxiv.org/abs/2506.07744. arXiv:2506.07744 [cs] version: 1
- [33] Richard S. Sutton, Doina Precup, and Satinder Singh. Intra-Option Learning about Temporally Abstract Actions
- [34] Balaraman Ravindran and Andrew G. Barto. Smdp homomorphisms: An algebraic approach to abstraction in semi-markov decision processes. In Probabilistic Planning, pages 1011–1016, 2003.
- [35] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-Efficient Hierarchical Reinforcement Learning, October 2018. URL http://arxiv.org/abs/1805.08296. arXiv:1805.08296 [cs]
- [36] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning, January 2019. URL http://arxiv.org/abs/1810.01257. arXiv:1810.01257 [cs]
- [37] Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning Multi-Level Hierarchies with Hindsight, September 2019. URL http://arxiv.org/abs/1712.00948. arXiv:1712.00948 [cs]
- [38] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 361–368, 2001
- [39] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In Proceedings of the 5th International Symposium on Abstraction, Reformulation and Approximation (SARA), pages 212–223, 2002
- [40] Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning, 2022. URL https://arxiv.org/abs/2205.11790
- [41] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a Unified Theory of State Abstraction for MDPs
- [42] Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for Finite Markov Decision Processes, July 2012. URL http://arxiv.org/abs/1207.4114. arXiv:1207.4114 [cs]
- [43] Jianda Chen and Sinno Jialin Pan. Learning Representations via a Robust Behavioral Metric for Deep Reinforcement Learning
- [44] Ayoub Echchahed and Pablo Samuel Castro. A Survey of State Representation Learning for Deep Reinforcement Learning, June 2025. URL http://arxiv.org/abs/2506.17518. arXiv:2506.17518 [cs]
- [45] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003
- [46] Remi Munos and Csaba Szepesvari. Finite-Time Bounds for Fitted Value Iteration
- [47] Tor Lattimore and Marcus Hutter. PAC Bounds for Discounted MDPs, February 2012. URL http://arxiv.org/abs/1202.3890. arXiv:1202.3890 [cs]
- [48] Arnaud Robert, Ciara Pike-Burke, and A. Aldo Faisal. Sample Complexity of Goal-Conditioned Hierarchical Reinforcement Learning
- [49] Seohong Park, Aditya Oberai, Pranav Atreya, and Sergey Levine. Transitive RL: Value Learning via Divide and Conquer, February 2026. URL http://arxiv.org/abs/2510.22512. arXiv:2510.22512 [cs]
- [50] Dibya Ghosh, Chethan Bhateja, and Sergey Levine. Reinforcement Learning from Passive Data via Latent Intentions, April 2023. URL http://arxiv.org/abs/2304.04782. arXiv:2304.04782 [cs]
- [51] Haoran Xu, Li Jiang, Jianxiong Li, and Xianyuan Zhan. A policy-guided imitation approach for offline reinforcement learning, 2023. URL https://arxiv.org/abs/2210.08323
- [52] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 2021.
- [53] Matthew Thomas Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, and Jakob Nicolaus Foerster. A Clean Slate for Offline Reinforcement Learning, April 2025. URL http://arxiv.org/abs/2504.11453. arXiv:2504.11453 [cs]
- [54] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361
- [55] Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, and Benjamin Eysenbach. 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026. URL http://arxiv.org/abs/2503.14858. arXiv:2503.14858 [cs]
- [56] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow Matching Guide and Code, December 2024. URL http://arxiv.org/abs/2412.06264. arXiv:2412.06264 [cs]
- [57] Seohong Park, Deepinder Mann, and Sergey Levine. Dual Goal Representations, February
- [59] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980
- [60] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450
- [61] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415
- [62] Adam Villaflor, Zhe Huang, Swapnil Pande, John Dolan, and Jeff Schneider. Addressing optimism bias in sequence modeling for reinforcement learning, 2022. URL https://arxiv.org/abs/2207.10295.
A Sample Complexity in Online Goal-Conditioned RL
We provide some intuition into the choice of policy learning and representation using sample complexity in fini...
... and Park et al. [12]. Notably, while these parameters were specifically tuned for HIQL, we apply them to ARL without further adjustment. The fact that ARL achieves strong performance using parameters optimised for a different algorithm demonstrates its robustness. We use DDPGBC with a behaviour cloning strength of 0.1 to extract the high-level policy in m...