Abstraction for Offline Goal-Conditioned Reinforcement Learning

Alexander Goldie; Antonio Villares; Clarisse Wibault; Jakob Foerster; Maike Osborne

arxiv: 2605.22711 · v1 · pith:RJOHZTG6new · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster

Pith reviewed 2026-05-22 07:43 UTC · model grok-4.3

keywords goal-conditioned reinforcement learning · offline RL · hierarchical policies · options · abstraction · symmetries · experience reuse

The pith

Hierarchical policies achieve absolute abstraction in offline goal-conditioned reinforcement learning by using relativised options to reuse experience across symmetric state-goal pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that goal-conditioned Markov Decision Processes contain redundancy from symmetries and shared structure between different state-goal pairs. Hierarchy is typically motivated for temporal abstraction to shorten long horizons in offline settings, but the authors demonstrate that it also supports absolute abstraction. They introduce relativised options defined relative to goals along with distinct representations at each level of the hierarchy. These changes let an agent ignore absolute coordinates and reuse prior experience in similar contexts. Experiments confirm that the resulting inductive biases improve performance on offline goal-conditioned tasks.

Core claim

Markov Decision Processes in goal-conditioned reinforcement learning often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs. By introducing relativised options as well as distinct representations for different levels of the hierarchy, an agent can abstract from the absolute frame of reference and reuse experience across similar contexts of the state-space. Two simple algorithms are presented for learning these relativised options and performing the abstraction, and experiments show that the approach improves performance in the offline setting.

What carries the argument

Relativised options, which are temporally extended actions defined relative to the current goal rather than in absolute coordinates, together with separate representations at each hierarchy level that separate absolute position from relative structure.
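
As a rough illustration of what "defined relative to the current goal" could mean in practice, the sketch below assumes the relevant symmetry is translation and that relativisation is a plain coordinate subtraction. The helper names `relativise` and `toy_value` are hypothetical and are not taken from the paper; the paper's own algorithms may represent relative structure quite differently.

```python
from typing import Tuple

Vec = Tuple[float, ...]


def relativise(state: Vec, goal: Vec) -> Vec:
    """Hypothetical helper: express the goal as a displacement from the state."""
    return tuple(g - s for s, g in zip(state, goal))


def toy_value(state: Vec, goal: Vec) -> float:
    """Toy goal-conditioned value: negative L1 distance.

    It only ever looks at the relative input, so it cannot depend on
    absolute coordinates (an assumption made for illustration).
    """
    return -sum(abs(d) for d in relativise(state, goal))


# Two state-goal pairs at different absolute positions, same displacement.
pair_a = ((0.0, 0.0), (2.0, 1.0))
pair_b = ((5.0, 7.0), (7.0, 8.0))

rel_a = relativise(*pair_a)
rel_b = relativise(*pair_b)

# Both pairs collapse to the same relative input, so anything learned for
# one pair is immediately available for the other.
print(rel_a, rel_b)
print(toy_value(*pair_a), toy_value(*pair_b))
```

Under these assumptions, experience collected around one pair transfers to every pair with the same displacement, which is the kind of reuse the summary above describes.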

If this is right

An agent reuses experience across similar state-goal pairs instead of treating each pair as a separate learning problem.
Hierarchy supplies both temporal abstraction to manage long horizons and absolute abstraction to exploit symmetries.
Two algorithms become available for learning relativised options and for abstracting away from absolute frames of reference.
Offline goal-conditioned reinforcement learning performance improves when these inductive biases are added to standard methods.
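
To give a feel for the size of the reuse benefit in the first point, here is a minimal counting sketch. It assumes a small obstacle-free grid in which translation is treated as an exact symmetry, so state-goal pairs with the same displacement are considered equivalent. The grid size and the `displacement` helper are illustrative choices, not details from the paper.

```python
from collections import defaultdict
from itertools import product


def displacement(state, goal):
    """Hypothetical equivalence key: goal position relative to the state."""
    return (goal[0] - state[0], goal[1] - state[1])


size = 4  # assumed 4x4 open grid, chosen only for illustration
cells = list(product(range(size), repeat=2))
pairs = [(s, g) for s in cells for g in cells]

# Group absolute state-goal pairs into classes that share a displacement.
classes = defaultdict(list)
for s, g in pairs:
    classes[displacement(s, g)].append((s, g))

num_pairs = len(pairs)      # 16 * 16 = 256 absolute pairs
num_classes = len(classes)  # 7 * 7 = 49 distinct displacements

print(f"{num_pairs} absolute pairs fall into {num_classes} relative classes")
```

In this toy setting a learner that works in the relative frame faces 49 distinct problems instead of 256. Real environments with walls or other asymmetries would break some of these equivalences, which is why the benefit depends on how much symmetry is actually present.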

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same relativised-option construction could be tested in online goal-conditioned settings where data efficiency matters.
Environments with explicit translational or rotational symmetry, such as navigation or manipulation tasks, would be natural places to measure the size of the reuse benefit.
The separation of representations at different hierarchy levels might interact with function-approximation choices in large continuous spaces.

Load-bearing premise

Goal-conditioned Markov Decision Processes contain enough symmetry and shared structure across state-goal pairs that absolute abstraction from hierarchy will produce measurable reuse of experience.

What would settle it

Running the proposed algorithms on a goal-conditioned task deliberately constructed with no symmetries or shared structure across goals, such as a single unique target in a fully asymmetric environment, and observing no improvement over a flat baseline policy.

Figures

Figures reproduced from arXiv: 2605.22711 by Alexander Goldie, Antonio Villares, Clarisse Wibault, Jakob Foerster, Maike Osborne.

**Figure 1.** Abstractive RL (ARL). By learning relativised options, ARL enables the reuse of experience across similar contexts of the state-space.

**Figure 2.** Analysis. Aggregate performance across all tasks (left) and ARLi’s (middle) and ARLe’s (right) performance improvements over the next-best performing algorithm against number of state dimensions per dataset sample. Bootstrapped 95% CI over 4 seeds and 20 evaluation runs.

**Figure 3.** Low-level value function for HIQL2v (left) and ARLe (right) (task 4, antmaze-giant-stitch-v0).

**Figure 4.** Low-level value functions (top) and gradient of low-level value function (bottom). IQL and HIQL1vr are excluded as they have a single value function.

**Figure 5.** High-level value functions (top) and gradient of high-level value function (bottom). HIQL2v and HIQL2vr are excluded, as they have identical ones to ARL.

**Figure 6.** Performance improvements over next best performing algorithm against number of state dimensions per dataset sample: IQL (left), HIQL1vr (centre left), HIQL2v (centre right) and HIQL2vr (right). Including ARLi and ARLe (top), and excluding ARLi and ARLe (bottom). Bootstrapped 95% CI over 4 seeds and 20 evaluation runs.
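
The captions for Figures 2 and 6 report bootstrapped 95% confidence intervals. For readers unfamiliar with the procedure, the following is a minimal percentile-bootstrap sketch. How the paper aggregates seeds and evaluation runs is not stated in the captions, so the per-seed aggregation, the `bootstrap_ci` helper, and the numbers in `success_rates` are all placeholder assumptions, not results from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(samples, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lo_idx = int(tail * n_resamples)
    hi_idx = int((1 - tail) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Placeholder per-seed success rates (made up for illustration only).
success_rates = [0.62, 0.70, 0.66, 0.58]
low, high = bootstrap_ci(success_rates)
print(f"mean={mean(success_rates):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only four seeds the resulting interval is coarse, which is worth keeping in mind when comparing overlapping intervals between methods.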

Original abstract

Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Relativised options plus level-specific representations give a concrete handle on absolute abstraction for experience reuse in offline GCRL, though the payoff still hinges on how much symmetry the test environments actually contain.

The main thing here is that the authors treat hierarchy as a route to absolute abstraction, not just temporal abstraction. They introduce relativised options and separate representations for each level so the agent can reuse experience across state-goal pairs that share structure. From that they derive two straightforward algorithms for learning the options and shifting out of the absolute frame. The abstract claims this inductive bias lifts performance in offline GCRL, which lines up with the motivation about redundancy from symmetries in real-world MDPs.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a hierarchical framework for offline goal-conditioned reinforcement learning that exploits redundancies and symmetries in MDPs across state-goal pairs. It introduces relativised options together with level-specific representations to achieve absolute abstraction (in addition to temporal abstraction), enabling experience reuse across similar contexts. Two simple algorithms are presented for learning these options and performing abstraction from the absolute frame of reference, with experiments reported to show performance gains from the resulting inductive biases.

Significance. If the central claims hold, the work provides a concrete mechanism for incorporating absolute abstraction into hierarchical offline GCRL policies. This could meaningfully improve sample efficiency by turning structural symmetries into reusable experience, extending the standard motivation for hierarchy beyond horizon reduction. The explicit separation of temporal and absolute abstraction, together with the introduction of relativised options, offers a falsifiable inductive bias that is directly testable in standard GCRL benchmarks.

major comments (2)

§4 (Relativised Options): the formal definition of a relativised option must be shown to preserve the optimal value function of the original goal-conditioned MDP; without an explicit invariance or bisimulation argument, it is unclear whether the absolute-abstraction claim is loss-free or merely an approximation.
§5.2 (Algorithm 2): the update rule for abstracting from the absolute frame appears to rely on an auxiliary representation network whose training objective is not stated; if this network is learned from the same offline dataset, the claimed separation of levels risks circularity in the experience-reuse argument.

minor comments (2)

Figure 2: the diagram of the two-level hierarchy would benefit from explicit arrows indicating which components are shared versus level-specific.
Related Work: the discussion of prior hierarchical GCRL methods (e.g., HIRO, HAC) should clarify in one sentence how relativised options differ from goal-relativisation techniques already present in the literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important aspects of our framework for relativised options and absolute abstraction in offline goal-conditioned RL. We address each major comment below and indicate the corresponding revisions.

Referee: §4 (Relativised Options): the formal definition of a relativised option must be shown to preserve the optimal value function of the original goal-conditioned MDP; without an explicit invariance or bisimulation argument, it is unclear whether the absolute-abstraction claim is loss-free or merely an approximation.

Authors: We agree that an explicit invariance argument strengthens the absolute-abstraction claim. In the revised manuscript we will add a theorem in §4 establishing that relativised options preserve the optimal value function of the underlying goal-conditioned MDP. The proof will rely on a bisimulation relation defined over equivalence classes of state-goal pairs that respect the symmetries of the MDP, showing that the abstraction is loss-free whenever those symmetries hold.
revision: yes

Referee: §5.2 (Algorithm 2): the update rule for abstracting from the absolute frame appears to rely on an auxiliary representation network whose training objective is not stated; if this network is learned from the same offline dataset, the claimed separation of levels risks circularity in the experience-reuse argument.

Authors: The auxiliary representation network is trained with the level-specific contrastive objective already defined in §5.1. We will revise §5.2 to state this objective explicitly and to emphasise that each level uses a distinct representation and loss, trained once on the offline dataset before policy optimisation. This ordering removes any circular dependency and preserves the separation between temporal and absolute abstraction.
revision: yes
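
For concreteness, one way the invariance condition discussed in the first exchange might be written is sketched below. It assumes the symmetry is a translation $\delta$ applied to both state and goal, that actions are unaffected by the translation, and that the reward depends on state, action and goal. The notation is ours and is not taken from the paper, and the theorem promised in the simulated rebuttal may be stated over a more general symmetry group.

```latex
% Assumed notation (not from the paper): \delta is a translation applied to
% both the state and the goal; actions are left unchanged.
\begin{align}
  P(s' + \delta \mid s + \delta, a) &= P(s' \mid s, a), \\
  r(s + \delta, a, g + \delta) &= r(s, a, g)
    \qquad \text{for all admissible } \delta \\
  \Longrightarrow \quad V^*(s + \delta, g + \delta) &= V^*(s, g).
\end{align}
% If every translation is admissible, the optimal value depends only on the
% displacement: V^*(s, g) = \tilde{V}^*(g - s).
```

The implication follows by induction over value-iteration steps: if the value estimate at one step is translation-invariant and the dynamics and reward satisfy the first two conditions, the next estimate is invariant as well. When the conditions hold only approximately, the abstraction is an approximation rather than loss-free, which is the distinction the referee raises.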

Circularity Check

0 steps flagged

No significant circularity; framework presented as new inductive bias with independent experimental validation

The paper introduces relativised options and level-specific representations as novel mechanisms for absolute abstraction in offline GCRL, motivated by observed MDP redundancies. No derivation chain reduces a claimed result to a fitted parameter or self-citation by construction; the abstract and framework description present the hierarchy as an explicit inductive bias whose benefits are then tested empirically. No equations or steps equate outputs to inputs tautologically, and the central claim rests on the proposed algorithms rather than renaming or smuggling prior results. This is a standard case of a self-contained contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption of redundancy in MDPs and introduces the new concept of relativised options without independent evidence provided in the abstract.

axioms (1)

domain assumption: MDPs often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs
Explicitly stated in the abstract as the motivation for the approach.

invented entities (1)

relativised options: no independent evidence
purpose: Enable absolute abstraction and experience reuse across similar state contexts
New concept introduced in the paper to support the hierarchical framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

We build on the work of Robert et al. [47] and Li et al. [40] to show that, by using a hierarchical policy with absolute abstraction, the maximum error is bounded by...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 18 internal anchors

[1]

Learning to achieve goals

Leslie Pack Kaelbling. Learning to achieve goals. InProceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1094–1099, 1993

work page 1993
[2]

Universal value function approxi- mators

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approxi- mators. InProceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 1312–1320, 2015

work page 2015
[3]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, November 2020. URL http://arxiv. org/abs/2005.01643. arXiv:2005.01643 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Understanding the World Through Action, October 2021

Sergey Levine. Understanding the World Through Action, October 2021. URL http://arxiv. org/abs/2110.12543. arXiv:2110.12543 [cs]

work page arXiv 2021
[5]

arXiv preprint arXiv:2410.20092 , year=

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Bench- marking Offline Goal-Conditioned RL, February 2025. URL http://arxiv.org/abs/2410. 20092. arXiv:2410.20092 [cs]

work page arXiv 2025
[6]

Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, and Esther Luna Colombini. A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems.IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, August 2024. ISSN 2162-237X, 2162-2388. doi: 10.1109/TNNLS.2023.3250269. URL http://arxiv. org/abs/2203.01387. arXiv:22...

work page doi:10.1109/tnnls.2023.3250269 2024
[7]

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning, October 2021. URL http://arxiv.org/abs/2110.06169. arXiv:2110.06169 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-Weighted Re- gression: Simple and Scalable Off-Policy Reinforcement Learning, October 2019. URL http://arxiv.org/abs/1910.00177. arXiv:1910.00177 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

Revisiting the Minimalist Approach to Offline Reinforcement Learning, October 2023

Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the Minimalist Approach to Offline Reinforcement Learning, October 2023. URL http: //arxiv.org/abs/2305.09836. arXiv:2305.09836 [cs]

work page arXiv 2023
[10]

Challenges of Real-World Reinforcement Learning

Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of Real- World Reinforcement Learning, April 2019. URL http://arxiv.org/abs/1904.12901. arXiv:1904.12901 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[11]

Is Value Learning Really the Main Bottleneck in Offline RL?, October 2024

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is Value Learning Really the Main Bottleneck in Offline RL?, October 2024. URL http://arxiv.org/abs/2406.09329. arXiv:2406.09329 [cs]

work page arXiv 2024
[12]

Horizon reduction makes rl scalable.arXiv preprint arXiv:2506.04168, 2025

Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon Reduction Makes RL Scalable, October 2025. URL http://arxiv.org/ abs/2506.04168. arXiv:2506.04168 [cs]

work page arXiv 2025
[13]

Sutton, Doina Precup, and Satinder Singh

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112 (1-2):181–211, August 1999. ISSN 00043702. doi: 10.1016/S0004-3702(99)00052-1. URL https://linkinghub.elsevier.com/retrieve/pii/S0004370299000521

work page doi:10.1016/s0004-3702(99)00052-1 1999
[14]

Feudal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. InProceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3540–3549, 2017

work page 2017
[15]

Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical reinforcement learning. 10

work page
[16]

Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y . Galliker, and Sergey Levine. Real-Time Execution of Action Chunking Flow Policies, December 2025. URL http://arxiv.org/abs/2506.07339. arXiv:2506.07339 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Scalable Offline Model- Based RL with Action Chunks, December 2025

Kwanyoung Park, Seohong Park, Youngwoon Lee, and Sergey Levine. Scalable Offline Model- Based RL with Action Chunks, December 2025. URLhttp://arxiv.org/abs/2512.08108. arXiv:2512.08108 [cs]

work page arXiv 2025
[18]

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.arXiv preprint arXiv:2104.13478, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Equivariant goal conditioned contrastive reinforcement learning

Arsh Tangri, Nichols Crawford Taylor, Haojie Huang, and Robert Platt. Equivariant goal conditioned contrastive reinforcement learning. 2025. doi: 10.48550/arXiv.2507.16139

work page doi:10.48550/arxiv.2507.16139 2025
[20]

Riedmiller

Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. Springer, 2012

work page 2012
[21]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2 edition, 2018

work page 2018
[22]

Adam White, Joseph Modayil, and Richard S. Sutton. Scaling life-long off-policy learning. CoRR, abs/1206.6262, 2012. URLhttp://arxiv.org/abs/1206.6262

work page internal anchor Pith review Pith/arXiv arXiv 2012
[23]

Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay, February 2018. URLhttp://arxiv.org/abs/1707.01495. arXiv:1707.01495 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

HIQL: Offline Goal- Conditioned RL with Latent States as Actions, March 2024

Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. HIQL: Offline Goal- Conditioned RL with Latent States as Actions, March 2024. URL http://arxiv.org/abs/ 2307.11949. arXiv:2307.11949 [cs]

work page arXiv 2024
[25]

Conservative Q-Learning for Offline Reinforcement Learning, August 2020

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning, August 2020. URL http://arxiv.org/abs/2006.04779. arXiv:2006.04779 [cs]

work page arXiv 2020
[26]

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble, October 2021

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble, October 2021. URL http://arxiv. org/abs/2110.01548. arXiv:2110.01548 [cs]

work page arXiv 2021
[27]

Contrastive Learning as Goal-Conditioned Reinforcement Learning, February 2023

Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive Learning as Goal-Conditioned Reinforcement Learning, February 2023. URL http://arxiv. org/abs/2206.07568. arXiv:2206.07568 [cs]

work page arXiv 2023
[28]

A Minimalist Approach to Offline Reinforcement Learning, December 2021

Scott Fujimoto and Shixiang Shane Gu. A Minimalist Approach to Offline Reinforcement Learning, December 2021. URL http://arxiv.org/abs/2106.06860. arXiv:2106.06860 [cs]

work page arXiv 2021
[29]

Emaq: Expected-max q-learning operator for simple yet effective offline and online RL.CoRR, abs/2007.11091, 2020

Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online RL.CoRR, abs/2007.11091, 2020. URLhttps://arxiv.org/abs/2007.11091

work page arXiv 2007
[30]

The Option-Critic Architecture, December

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic Architecture, December

work page
[31]

The Option-Critic Architecture

URLhttp://arxiv.org/abs/1609.05140. arXiv:1609.05140 [cs]

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning, June 2025

Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, and Yusung Kim. Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning, June 2025. URL http://arxiv. org/abs/2506.07744. arXiv:2506.07744 [cs] version: 1

work page arXiv 2025
[33]

Intra-Option Learning about Temporally Abstract Actions

Richard S Sutton, Doina Precup, and Satinder Singh. Intra-Option Learning about Temporally Abstract Actions

work page
[34]

Balaraman Ravindran and Andrew G. Barto. Smdp homomorphisms: An algebraic approach to abstraction in semi-markov decision processes. InProbabilistic Planning, pages 1011–1016, 2003. 11

work page 2003
[35]

Data-Efficient Hierarchical Reinforcement Learning

Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-Efficient Hierarchi- cal Reinforcement Learning, October 2018. URL http://arxiv.org/abs/1805.08296. arXiv:1805.08296 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning, January 2019. URL http://arxiv.org/ abs/1810.01257. arXiv:1810.01257 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[37]

Learning Multi-Level Hi- erarchies with Hindsight, September 2019

Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning Multi-Level Hi- erarchies with Hindsight, September 2019. URL http://arxiv.org/abs/1712.00948. arXiv:1712.00948 [cs]

work page arXiv 2019
[38]

Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. InProceedings of the 18th International Conference on Machine Learning (ICML), pages 361–368, 2001

work page 2001
[39]

Learning options in reinforcement learning

Martin Stolle and Doina Precup. Learning options in reinforcement learning. InProceedings of the 5th International Symposium on Abstraction, Reformulation and Approximation (SARA), pages 212–223, 2002

work page 2002
[40]

Hierarchical planning through goal-conditioned offline reinforcement learning, 2022

Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning, 2022. URL https://arxiv.org/abs/2205. 11790

work page 2022
[41]

Towards a Unified Theory of State Abstraction for MDPs

Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a Unified Theory of State Abstraction for MDPs

work page
[42]

Metrics for Finite Markov Decision Processes

Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for Finite Markov Decision Processes, July 2012. URLhttp://arxiv.org/abs/1207.4114. arXiv:1207.4114 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2012
[43]

Learning Representations via a Robust Behavioral Metric for Deep Reinforcement Learning

Jianda Chen and Sinno Jialin Pan. Learning Representations via a Robust Behavioral Metric for Deep Reinforcement Learning

work page
[44]

A Survey of State Representation Learning for Deep Reinforcement Learning, June 2025

Ayoub Echchahed and Pablo Samuel Castro. A Survey of State Representation Learning for Deep Reinforcement Learning, June 2025. URL http://arxiv.org/abs/2506.17518. arXiv:2506.17518 [cs]

work page arXiv 2025
[45]

Phd thesis, University College London, 2003

Sham Kakade.On the Sample Complexity of Reinforcement Learning. Phd thesis, University College London, 2003

work page 2003
[46]

Finite-Time Bounds for Fitted Value Iteration

Remi Munos, Remi Munos, and Csaba Szepesvari. Finite-Time Bounds for Fitted Value Iteration

work page
[47]

PAC Bounds for Discounted MDPs

Tor Lattimore and Marcus Hutter. PAC Bounds for Discounted MDPs, February 2012. URL http://arxiv.org/abs/1202.3890. arXiv:1202.3890 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2012
[48]

Sample Complexity of Goal-Conditioned Hierarchical Reinforcement Learning

Arnaud Robert, Ciara Pike-Burke, and A Aldo Faisal. Sample Complexity of Goal-Conditioned Hierarchical Reinforcement Learning

work page
[49]

Transitive RL: Value Learn- ing via Divide and Conquer, February 2026

Seohong Park, Aditya Oberai, Pranav Atreya, and Sergey Levine. Transitive RL: Value Learn- ing via Divide and Conquer, February 2026. URL http://arxiv.org/abs/2510.22512. arXiv:2510.22512 [cs]

work page arXiv 2026
[50]

Reinforcement Learning from Pas- sive Data via Latent Intentions, April 2023

Dibya Ghosh, Chethan Bhateja, and Sergey Levine. Reinforcement Learning from Pas- sive Data via Latent Intentions, April 2023. URL http://arxiv.org/abs/2304.04782. arXiv:2304.04782 [cs]

work page arXiv 2023
[51]

A policy-guided imitation approach for offline reinforcement learning, 2023

Haoran Xu, Li Jiang, Jianxiong Li, and Xianyuan Zhan. A policy-guided imitation approach for offline reinforcement learning, 2023. URLhttps://arxiv.org/abs/2210.08323

work page arXiv 2023
[52]

Deep reinforcement learning at the edge of the statistical precipice.Advances in Neural Information Processing Systems, 2021

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Belle- mare. Deep reinforcement learning at the edge of the statistical precipice.Advances in Neural Information Processing Systems, 2021. 12

work page 2021
[53]

A Clean Slate for Offline Reinforcement Learning, April 2025

Matthew Thomas Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, and Jakob Nicolaus Foerster. A Clean Slate for Offline Reinforcement Learning, April 2025. URL http://arxiv. org/abs/2504.11453. arXiv:2504.11453 [cs]

work page arXiv 2025
[54]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.CoRR, abs/2001.08361, 2020. URLhttps://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2001
[55]

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026

Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzci´nski, and Benjamin Eysenbach. 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026. URL http://arxiv.org/abs/2503.14858. arXiv:2503.14858 [cs]

work page arXiv 2026
[56]

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow Matching Guide and Code, December 2024. URLhttp://arxiv.org/abs/2412.06264. arXiv:2412.06264 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Dual Goal Representations, February

Seohong Park, Deepinder Mann, and Sergey Levine. Dual Goal Representations, February

work page
[58]

arXiv:2510.06714 [cs]

URLhttp://arxiv.org/abs/2510.06714. arXiv:2510.06714 [cs]

work page arXiv
[59]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2017
[60] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450
[61] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415
[62] Adam Villaflor, Zhe Huang, Swapnil Pande, John Dolan, and Jeff Schneider. Addressing optimism bias in sequence modeling for reinforcement learning, 2022. URL https://arxiv.org/abs/2207.10295.

A Sample Complexity in Online Goal-Conditioned RL

We provide some intuition into the choice of policy learning and representation using sample complexity in fini...
and Park et al. [12]. Notably, while these parameters were specifically tuned for HIQL, we apply them to ARL without further adjustment. The fact that ARL achieves strong performance using parameters optimised for a different algorithm demonstrates its robustness. We use DDPGBC with a behaviour cloning strength of 0.1 to extract the high-level policy in m...

[1] Leslie Pack Kaelbling. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1094–1099, 1993

[2] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1312–1320, 2015

[3] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, November 2020. URL http://arxiv.org/abs/2005.01643. arXiv:2005.01643 [cs]

[4] Sergey Levine. Understanding the World Through Action, October 2021. URL http://arxiv.org/abs/2110.12543. arXiv:2110.12543 [cs]

[5] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking Offline Goal-Conditioned RL, February 2025. URL http://arxiv.org/abs/2410.20092. arXiv:2410.20092 [cs]

[6] Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, and Esther Luna Colombini. A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems. IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, August 2024. ISSN 2162-237X, 2162-2388. doi: 10.1109/TNNLS.2023.3250269. URL http://arxiv.org/abs/2203.01387. arXiv:22...

[7] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning, October 2021. URL http://arxiv.org/abs/2110.06169. arXiv:2110.06169 [cs]

[8] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning, October 2019. URL http://arxiv.org/abs/1910.00177. arXiv:1910.00177 [cs]

[9] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the Minimalist Approach to Offline Reinforcement Learning, October 2023. URL http://arxiv.org/abs/2305.09836. arXiv:2305.09836 [cs]

[10] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of Real-World Reinforcement Learning, April 2019. URL http://arxiv.org/abs/1904.12901. arXiv:1904.12901 [cs]

[11] Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is Value Learning Really the Main Bottleneck in Offline RL?, October 2024. URL http://arxiv.org/abs/2406.09329. arXiv:2406.09329 [cs]

[12] Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon Reduction Makes RL Scalable, October 2025. URL http://arxiv.org/abs/2506.04168. arXiv:2506.04168 [cs]

[13] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, August 1999. ISSN 00043702. doi: 10.1016/S0004-3702(99)00052-1. URL https://linkinghub.elsevier.com/retrieve/pii/S0004370299000521

[14] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3540–3549, 2017

[15] Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical reinforcement learning.

[16] Kevin Black, Manuel Y. Galliker, and Sergey Levine. Real-Time Execution of Action Chunking Flow Policies, December 2025. URL http://arxiv.org/abs/2506.07339. arXiv:2506.07339 [cs]

[17] Kwanyoung Park, Seohong Park, Youngwoon Lee, and Sergey Levine. Scalable Offline Model-Based RL with Action Chunks, December 2025. URL http://arxiv.org/abs/2512.08108. arXiv:2512.08108 [cs]

[18] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021

[19] Arsh Tangri, Nichols Crawford Taylor, Haojie Huang, and Robert Platt. Equivariant goal conditioned contrastive reinforcement learning. 2025. doi: 10.48550/arXiv.2507.16139

[20] Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. Springer, 2012

[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

[22] Adam White, Joseph Modayil, and Richard S. Sutton. Scaling life-long off-policy learning. CoRR, abs/1206.6262, 2012. URL http://arxiv.org/abs/1206.6262

[23] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay, February 2018. URL http://arxiv.org/abs/1707.01495. arXiv:1707.01495 [cs]

[24] Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. HIQL: Offline Goal-Conditioned RL with Latent States as Actions, March 2024. URL http://arxiv.org/abs/2307.11949. arXiv:2307.11949 [cs]

[25] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning, August 2020. URL http://arxiv.org/abs/2006.04779. arXiv:2006.04779 [cs]

[26] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble, October 2021. URL http://arxiv.org/abs/2110.01548. arXiv:2110.01548 [cs]

[27] Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive Learning as Goal-Conditioned Reinforcement Learning, February 2023. URL http://arxiv.org/abs/2206.07568. arXiv:2206.07568 [cs]

[28] Scott Fujimoto and Shixiang Shane Gu. A Minimalist Approach to Offline Reinforcement Learning, December 2021. URL http://arxiv.org/abs/2106.06860. arXiv:2106.06860 [cs]

[29] Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online RL. CoRR, abs/2007.11091, 2020. URL https://arxiv.org/abs/2007.11091

[30] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The Option-Critic Architecture, December. URL http://arxiv.org/abs/1609.05140. arXiv:1609.05140 [cs]

[32] Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, and Yusung Kim. Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning, June 2025. URL http://arxiv.org/abs/2506.07744. arXiv:2506.07744 [cs] version: 1

[33] Richard S Sutton, Doina Precup, and Satinder Singh. Intra-Option Learning about Temporally Abstract Actions

[34] Balaraman Ravindran and Andrew G. Barto. Smdp homomorphisms: An algebraic approach to abstraction in semi-markov decision processes. In Probabilistic Planning, pages 1011–1016, 2003.

[35] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-Efficient Hierarchical Reinforcement Learning, October 2018. URL http://arxiv.org/abs/1805.08296. arXiv:1805.08296 [cs]

[36] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning, January 2019. URL http://arxiv.org/abs/1810.01257. arXiv:1810.01257 [cs]

[37] Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning Multi-Level Hierarchies with Hindsight, September 2019. URL http://arxiv.org/abs/1712.00948. arXiv:1712.00948 [cs]

[38] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 361–368, 2001

[39] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In Proceedings of the 5th International Symposium on Abstraction, Reformulation and Approximation (SARA), pages 212–223, 2002

[40] Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning, 2022. URL https://arxiv.org/abs/2205.11790

[41] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a Unified Theory of State Abstraction for MDPs

[42] Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for Finite Markov Decision Processes, July 2012. URL http://arxiv.org/abs/1207.4114. arXiv:1207.4114 [cs]

[43] Jianda Chen and Sinno Jialin Pan. Learning Representations via a Robust Behavioral Metric for Deep Reinforcement Learning

[44] Ayoub Echchahed and Pablo Samuel Castro. A Survey of State Representation Learning for Deep Reinforcement Learning, June 2025. URL http://arxiv.org/abs/2506.17518. arXiv:2506.17518 [cs]

[45] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003

[46] Remi Munos and Csaba Szepesvari. Finite-Time Bounds for Fitted Value Iteration

[47] Tor Lattimore and Marcus Hutter. PAC Bounds for Discounted MDPs, February 2012. URL http://arxiv.org/abs/1202.3890. arXiv:1202.3890 [cs]

[48] Arnaud Robert, Ciara Pike-Burke, and A Aldo Faisal. Sample Complexity of Goal-Conditioned Hierarchical Reinforcement Learning

[49] Seohong Park, Aditya Oberai, Pranav Atreya, and Sergey Levine. Transitive RL: Value Learning via Divide and Conquer, February 2026. URL http://arxiv.org/abs/2510.22512. arXiv:2510.22512 [cs]

[50] Dibya Ghosh, Chethan Bhateja, and Sergey Levine. Reinforcement Learning from Passive Data via Latent Intentions, April 2023. URL http://arxiv.org/abs/2304.04782. arXiv:2304.04782 [cs]

[51] Haoran Xu, Li Jiang, Jianxiong Li, and Xianyuan Zhan. A policy-guided imitation approach for offline reinforcement learning, 2023. URL https://arxiv.org/abs/2210.08323
