Curriculum reinforcement learning with measurable task representation learning

Mingjian Fu; Peng Liu; Siyuan Li; Xun Wang; Yiqin Yang; Yongyan Wen

arxiv: 2605.23372 · v1 · pith:T54LY4J5new · submitted 2026-05-22 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-25 05:17 UTC · model grok-4.3

keywords curriculum reinforcement learning · task representation learning · variational autoencoder · automatic curriculum generation · navigation tasks · reinforcement learning · latent space

The pith

A variational autoencoder encodes rewards and state transitions to create a latent task space that supports automatic curriculum generation in non-Euclidean navigation tasks for reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the limitation that interpolation-based curriculum methods assume a Euclidean task space, which fails in complex navigation environments. It proposes encoding task information with a variational autoencoder on rewards and state transitions to produce a latent representation where task similarity is measurable by embedding proximity. This representation then drives an automatic scheme that generates sequences of tasks progressively closer to the target. Experiments across navigation tasks show the resulting curricula outperform those from interpolation and generative adversarial network baselines.

Core claim

In curriculum reinforcement learning, automatic curriculum generation for complex tasks requires a way to measure task similarity in non-Euclidean spaces. We propose transforming the task space into a latent space using a variational autoencoder that encodes reward functions and state transitions. This produces task embeddings with the property that proximity corresponds to similarity in rewards and transitions. Using these embeddings, we develop a scheme to generate curricula of tasks increasingly similar to the target task. Evaluation in challenging navigation tasks demonstrates superiority over interpolation and GAN-based methods.
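To make the encoding step concrete, here is a minimal forward-pass sketch of a task-level VAE objective. It is an illustration under stated assumptions, not the authors' architecture: the linear layers, the mean-pooling encoder, the decoder that predicts (r, s') from (s, a, z), the dimensions, and every name (`negative_elbo`, `params`, `SA_DIM`, and so on) are hypothetical. It computes the loss once on random stand-in data and does no training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, ACTION_DIM, LATENT_DIM = 2, 1, 2
SA_DIM = STATE_DIM + ACTION_DIM      # decoder input: (s, a)
TARGET_DIM = 1 + STATE_DIM           # decoder target: (r, s')


def init_linear(in_dim, out_dim):
    """Small random linear layer as a (weights, bias) pair."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def linear(x, layer):
    weights, bias = layer
    return x @ weights + bias


params = {
    "enc_mu": init_linear(SA_DIM + TARGET_DIM, LATENT_DIM),
    "enc_logvar": init_linear(SA_DIM + TARGET_DIM, LATENT_DIM),
    "dec": init_linear(SA_DIM + LATENT_DIM, TARGET_DIM),
}


def negative_elbo(transitions, params, beta=1.0):
    """Loss for one task; transitions has rows (s, a, r, s')."""
    pooled = transitions.mean(axis=0)  # order-invariant summary of the task
    mu = linear(pooled, params["enc_mu"])
    logvar = linear(pooled, params["enc_logvar"])
    # Reparameterization: sample z from N(mu, diag(exp(logvar))).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(LATENT_DIM)
    sa, target = transitions[:, :SA_DIM], transitions[:, SA_DIM:]
    dec_in = np.concatenate([sa, np.tile(z, (len(sa), 1))], axis=1)
    # Squared error on predicted (r, s') stands in for the reconstruction term.
    recon = np.mean(np.sum((linear(dec_in, params["dec"]) - target) ** 2, axis=1))
    # Closed-form KL between the diagonal Gaussian posterior and N(0, I).
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl, mu


transitions = rng.normal(size=(32, SA_DIM + TARGET_DIM))  # random stand-in data
loss, task_embedding = negative_elbo(transitions, params)
```

In this sketch the posterior mean `task_embedding` plays the role of the task representation; distances between such means for different tasks are what a curriculum scheme would compare.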

What carries the argument

A variational autoencoder that encodes rewards and state transitions, producing a latent task representation in which embedding distance measures task similarity.

If this is right

The latent embeddings enable generation of intermediate tasks whose similarity to the target increases over the curriculum.
The approach applies to navigation settings where direct interpolation in the original task space is invalid.
Performance on the final target task exceeds that achieved by interpolation-based and GAN-based curriculum methods.
The same VAE structure can be reused to measure similarity for any pair of tasks sharing the reward and transition encoding.
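The first point can be illustrated with a deliberately simple ordering rule: rank candidate tasks by latent distance to the target and present them far-to-near. This is a sketch under assumptions, not the paper's generation scheme; the function `latent_curriculum`, the task ids, and the 2-D embeddings are all hypothetical, and it presumes the agent's starting tasks are the ones farthest from the target.

```python
import math


def latent_curriculum(candidates, target, num_stages):
    """Bucket candidate tasks into stages ordered far-to-near from the target.

    candidates: dict mapping a task id to its latent embedding.
    Returns a list of stages; each stage is a list of (task_id, distance) pairs.
    """
    ranked = sorted(
        ((task_id, math.dist(z, target)) for task_id, z in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,  # farthest from the target first
    )
    stage_size = math.ceil(len(ranked) / num_stages)
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]


# Hypothetical embeddings; in practice these would come from a trained encoder.
candidates = {
    "task_a": (4.0, 3.0),
    "task_b": (2.0, 2.0),
    "task_c": (1.0, 0.5),
    "task_d": (0.2, 0.1),
}
target = (0.0, 0.0)
stages = latent_curriculum(candidates, target, num_stages=2)
```

Here the first stage holds `task_a` and `task_b` and the second holds `task_c` and `task_d`, so every task in a later stage is closer to the target than every task in an earlier one.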

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same encoding might support curriculum transfer across different environment families that share reward and transition statistics.
If the latent space captures a manifold of task difficulty, the method could be tested by checking whether generated task sequences produce monotonic increases in agent competence.
Extending the encoder to include additional signals such as goal positions or obstacle layouts could further refine similarity measurement.
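The monotonicity test suggested in the second point could be run as follows. This is a minimal sketch: `competence_trend` is a hypothetical helper, the returns are made-up numbers, and "competence" is assumed to mean the mean target-task return measured after each curriculum stage.

```python
from statistics import mean


def competence_trend(stage_returns, tolerance=0.0):
    """Return (is_monotonic, per-stage mean returns).

    stage_returns: list of lists; entry k holds target-task evaluation returns
    collected after training on curriculum stage k.
    """
    means = [mean(returns) for returns in stage_returns]
    is_monotonic = all(
        later >= earlier - tolerance for earlier, later in zip(means, means[1:])
    )
    return is_monotonic, means


# Made-up evaluation returns for three curriculum stages (illustration only).
ok, means = competence_trend([[0.1, 0.2], [0.3, 0.5], [0.7, 0.9]])
```

A small positive `tolerance` allows for evaluation noise; with the default of zero, any drop in mean return between consecutive stages counts as a violation.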

Load-bearing premise

Proximity in the VAE latent space reliably indicates task similarity in rewards and state transitions sufficient to produce curricula that improve learning on the target task.

What would settle it

A set of navigation tasks where curricula generated from the learned embeddings yield no improvement in target-task success rate or sample efficiency compared with interpolation or random task sequences.
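One way to make "no improvement" operational is a bootstrap confidence interval on the difference in target-task success rate between two curriculum methods. The sketch below is an assumption about how such a comparison might be run, not a procedure from the paper; `bootstrap_difference` is a hypothetical helper and the per-seed success rates are made-up.

```python
import random
from statistics import mean


def bootstrap_difference(a, b, num_resamples=2000, seed=0):
    """Point estimate and 95% percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(num_resamples)
    )
    low = diffs[int(0.025 * num_resamples)]
    high = diffs[int(0.975 * num_resamples) - 1]
    return mean(a) - mean(b), low, high


# Made-up per-seed success rates for two methods (illustration only).
latent_curriculum_runs = [0.82, 0.85, 0.80, 0.88, 0.84]
interpolation_runs = [0.55, 0.60, 0.52, 0.58, 0.57]
point, low, high = bootstrap_difference(latent_curriculum_runs, interpolation_runs)
```

If the interval contains zero on the tasks in question, the data do not separate the two methods, which is the outcome that would count against the premise.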

Figures

Figures reproduced from arXiv: 2605.23372 by Mingjian Fu, Peng Liu, Siyuan Li, Xun Wang, Yiqin Yang, Yongyan Wen.

**Figure 2.** The schema of Latent Space Prediction (LSP) and Exploration Bound Update.

**Figure 3.** Task representation learning architecture. Representation learning and policy …

**Figure 4.** Environment visualizations. For MiniGrid (4(a)-4(h)), the context is the position of the goal and key. For U-Maze (4(i)) tasks, the context is the position of the goal.

**Figure 5.** The learning curves in the MiniGrid environments.

**Figure 6.** Visualizations of curriculum on MiniGrid-Easy-B. Red rectangles represent a high probability of the location as a target, black indicates a low probability, and the green rectangle represents the target. From left to right, the ACRL-generated distribution gradually moves from the initial position to the gap and gradually moves closer to the target. In contrast, CURROT generates an intermediate distribution …

**Figure 7.** Representation learning results in MiniGrid-Easy-A. (a) Dot positions in the right panel indicate the mean values of the latent space variables, and the colors indicate the episodic return for the corresponding task under the policy. The left panel shows the four tasks configured by [4, 2], [5, 2], [5, 8] and [7, 2], respectively. (b) MDS analysis result. Only top-2 eigenvalues of Gram matrix are positive …

**Figure 8.** The learning curves in U-Maze. All the curves are averaged over 5 runs, and the shaded error bars represent the standard variances.

**Figure 9.** Visualizations of curriculum on U-Maze. The centers of the original source and target are located at [2, 0] and [0, 8], respectively. The distribution gradually transitions from being near the initial position to the right side, ultimately reaching the target goal. In the visual representation, the red color indicates a lower corresponding episodic return, while blue color signifies a higher return.

**Figure 10.** Ablation studies on the sampling ratio λ.

**Figure 11.** Impact of curriculum parameters.

**Figure 12.** Ablation studies on decoders of task representation learning.
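The caption of Figure 7(b) refers to an MDS analysis and the eigenvalues of a Gram matrix. As a hedged illustration of what such a check typically computes, the sketch below uses classical MDS: double-center the squared distance matrix and count the positive eigenvalues, which indicates how many Euclidean dimensions the distances need. Whether the paper uses exactly this variant is an assumption; `gram_eigenvalues` and the random 2-D points are hypothetical.

```python
import numpy as np


def gram_eigenvalues(distances):
    """Eigenvalues (descending) of the classical-MDS Gram matrix."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering
    gram = 0.5 * (gram + gram.T)  # remove floating-point asymmetry
    return np.linalg.eigvalsh(gram)[::-1]


# Stand-in data: pairwise distances between random points that live in 2-D.
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

eigenvalues = gram_eigenvalues(distances)
num_positive = int(np.sum(eigenvalues > 1e-8))
```

For distances generated from 2-D points, exactly two eigenvalues are positive and the rest vanish up to rounding; clearly negative eigenvalues would signal that the distances cannot be embedded in a Euclidean space.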

Original abstract

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Among the rich CRL literature, the interpolation-based CRL paradigm is a main body, which automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in task space with meaningful distance metrics (i.e., can measure the task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex task, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure the similarity, we propose to transform the task space to a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we achieve a latent task representation with a task similarity measurement property, and two close task embeddings correspond to two similar tasks in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks more and more similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experiment results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VAE latent space for task similarity in navigation CRL is a reasonable fix for non-Euclidean issues, but the abstract supplies no numbers or checks that the embeddings actually support curriculum transfer.

The paper's main move is training a VAE on rewards and state transitions to produce a latent task space where Euclidean distance is meant to stand in for task similarity, then using that space to generate curricula that get progressively closer to a hard target in navigation domains. This is positioned as a way around the breakdown of direct interpolation when the original task space is non-Euclidean. The abstract frames the VAE choice and the resulting curriculum scheme as the contribution over prior interpolation and GAN baselines. That combination is new enough in the CRL literature to note, and the paper correctly identifies a concrete limitation in existing distance-based methods for these environments. The write-up is straightforward about the motivation and the high-level pipeline.

The central weakness is that nothing shows the learned embeddings actually make proximity a reliable signal for policy or value transfer. The VAE objective is standard reconstruction plus KL; there is no derivation or experiment described that ties those distances to RL-relevant overlap rather than surface statistics of trajectories. The stress-test concern therefore holds on the supplied text: reported gains could trace to the downstream task-generation rule instead of the representation. The abstract also states that experiments show superiority across navigation tasks but includes no quantitative results, error bars, dataset sizes, or ablations, so the empirical claim cannot be evaluated.

This work is aimed at people already working on automatic curriculum methods for RL navigation or similar structured tasks. A reader in that niche might extract a usable representation trick if the full paper supplies the missing checks and code. It is coherent enough on its own terms to merit a serious referee, mainly to see whether the latent metric delivers on the transfer assumption once the numbers are in front of someone.

I would send it to review rather than desk reject, with the expectation that the authors will need to add direct evidence on why closeness in the VAE space predicts curriculum success.
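The direct evidence the note asks for could take the form of a rank correlation between latent distance and some transfer-relevant gap, such as a value-function distance between task pairs. The sketch below shows only the statistic; it is an assumption about how such a check might look, the helpers `ranks` and `spearman` are hypothetical, the numbers are made-up, and the no-ties formula is used for brevity.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up values for four task pairs (illustration only).
latent_distance = [0.2, 0.9, 0.5, 1.4]
value_gap = [0.05, 0.30, 0.10, 0.80]
rho = spearman(latent_distance, value_gap)
```

A coefficient near 1 on real task pairs would support the claim that proximity in the latent space tracks transfer-relevant similarity; a coefficient near 0 would suggest the embeddings capture only surface statistics.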

Referee Report

2 major / 1 minor

Summary. The paper proposes a curriculum reinforcement learning (CRL) method for automatic curriculum generation in non-Euclidean task spaces, such as challenging navigation tasks. It introduces a variational autoencoder (VAE) trained on rewards and state transitions to learn a latent task representation where Euclidean proximity is claimed to indicate task similarity, enabling generation of curricula with tasks progressively closer to the target; experiments are said to show superiority over interpolation-based and GAN-based CRL baselines.

Significance. If the central claim holds, the work could extend automatic CRL to domains where direct interpolation in task space fails due to non-Euclidean structure, by providing a learned similarity metric grounded in reward and transition statistics. The VAE-based approach is a plausible direction for measurable representations, but its significance hinges on whether the latent metric supports effective policy/value transfer rather than superficial reconstruction.

major comments (2)

[Abstract] Abstract: the central claim that the VAE yields 'a latent task representation with a task similarity measurement property' (such that 'two close task embeddings correspond to two similar tasks in terms of rewards and state transitions') is load-bearing for the curriculum scheme, yet the manuscript provides no derivation, bound, or empirical test showing that ELBO minimization on rewards/transitions produces distances aligned with RL transfer metrics (e.g., policy overlap or value-function distance) rather than trajectory statistics alone.
[Abstract] Abstract (experiments paragraph): the superiority claim over 'state-of-the-art CRL approaches based on interpolation and generative adversarial networks' is stated without any quantitative results, error bars, dataset/task specifications, ablation studies, or statistical tests, preventing verification that gains arise from the learned representation rather than the downstream task-generation heuristic.
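For reference, a standard evidence lower bound for a task-level VAE of the kind the first comment discusses can be written as below. The notation is an assumption, not taken from the manuscript: τ denotes the transitions collected from one task, z the task embedding, q_φ the encoder, p_θ the decoder over rewards and next states, and p(z) a standard normal prior. The bound is maximized in training, which is equivalent to minimizing its negative.

```latex
\mathcal{L}(\theta, \phi; \tau)
  = \mathbb{E}_{q_\phi(z \mid \tau)}
    \Big[ \sum_{t} \log p_\theta\big(r_t, s_{t+1} \mid s_t, a_t, z\big) \Big]
  - \mathrm{KL}\big( q_\phi(z \mid \tau) \,\|\, p(z) \big)
```

Nothing in this objective refers to policies or value functions, which is why the comment asks for separate evidence that distances between embeddings align with transfer metrics.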

minor comments (1)

[Abstract] Abstract: minor grammatical issues ('in complex task', 'the experiment results') and undefined acronyms on first use reduce clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below, providing clarifications and indicating planned revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: the central claim that the VAE yields 'a latent task representation with a task similarity measurement property' (such that 'two close task embeddings correspond to two similar tasks in terms of rewards and state transitions') is load-bearing for the curriculum scheme, yet the manuscript provides no derivation, bound, or empirical test showing that ELBO minimization on rewards/transitions produces distances aligned with RL transfer metrics (e.g., policy overlap or value-function distance) rather than trajectory statistics alone.

Authors: The manuscript's claim is specifically that close embeddings correspond to similar tasks in terms of rewards and state transitions, which follows directly from the VAE training objective of reconstructing these quantities. The curriculum scheme uses this to generate tasks with progressively closer embeddings. We acknowledge that no theoretical derivation or bound is provided linking latent distances to policy overlap or value-function distance, and that empirical validation of transfer metrics would strengthen the work. In revision, we will add experiments measuring policy similarity and value-function distances for tasks with nearby embeddings. revision: yes
Referee: [Abstract] Abstract (experiments paragraph): the superiority claim over 'state-of-the-art CRL approaches based on interpolation and generative adversarial networks' is stated without any quantitative results, error bars, dataset/task specifications, ablation studies, or statistical tests, preventing verification that gains arise from the learned representation rather than the downstream task-generation heuristic.

Authors: The abstract summarizes the experimental findings at a high level. Full quantitative results with error bars, task specifications, ablation studies, and statistical tests appear in the Experiments section. To improve clarity, we will revise the abstract to include key quantitative gains and explicit references to the experimental setups and datasets used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard VAE training and empirical evaluation

The paper trains a VAE to reconstruct rewards and state transitions, then uses the resulting latent distances to order tasks for curriculum generation. This is a standard autoencoder objective followed by a downstream heuristic; the reported performance gains are measured on external navigation benchmarks rather than being equivalent to the reconstruction loss or latent distances by definition. No equations or claims in the abstract reduce the final task success metric to a fitted parameter or self-citation chain. The assumption that latent proximity aligns with transferable similarity is presented as an empirical property to be validated experimentally, not a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; no explicit fitting procedures, background lemmas, or new postulated objects are described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Through a variational autoencoder structure that encodes the reward and the state transitions, we achieve a latent task representation with a task similarity measurement property"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "the non-Euclidean context (task) space invalidates this assumption"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 4 internal anchors

[1]

Solving Rubik's Cube with a Robot Hand

Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al., 2019. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

Ao, S., Zhou, T., Jiang, J., Long, G., Song, X., Zhang, C.,

work page
[3]

EAT-C: Environment-Adversarial sub-Task Curriculum for Ef- ficient Reinforcement Learning, in: Proceedings of the 39th In- ternational Conference on Machine Learning, PMLR. pp. 822–

work page
[4]

iSSN: 2640-3498

URL:https://proceedings.mlr.press/v162/ao22a.html. iSSN: 2640-3498

work page
[5]

CLUTR: Curriculum learning via unsupervised task represen- tation learning, in: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Azad, A.S., Gur, I., Emhoff, J., Alexis, N., Faust, A., Abbeel, P., Stoica, I., 2023. CLUTR: Curriculum learning via unsupervised task represen- tation learning, in: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (Eds.), Proceedings of the 40th Interna- tional Conference on Machine Learning, PMLR. pp. 1361–1395. URL: https://...

work page 2023
[6]

Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, Associa- tion for Computing Machinery, New York, NY, USA

Bengio, Y., Louradour, J., Collobert, R., Weston, J., 2009. Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, Associa- tion for Computing Machinery, New York, NY, USA. p. 41–48. URL:https://doi.org/10.1145/1553374.1553380, doi:10.1145/1553374.1553380

work page doi:10.1145/1553374.1553380 2009
[7]

Generating sentences from a continuous space, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics

Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., Bengio, S., 2016. Generating sentences from a continuous space, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics. p. 10

work page 2016
[8]

Learning with amigo: Adversarially moti- vated intrinsic goals, in: International Conference on Learning Repre- sentations

Campero, A., Raileanu, R., Kuttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E., 2020. Learning with amigo: Adversarially moti- vated intrinsic goals, in: International Conference on Learning Repre- sentations

work page 2020
[9]

Deep cluster- ing for unsupervised learning of visual features, in: Proceedings of the European conference on computer vision (ECCV), pp

Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep cluster- ing for unsupervised learning of visual features, in: Proceedings of the European conference on computer vision (ECCV), pp. 132–149. 33

work page 2018
[10]

Scalable Methods for Computing State Simi- larity in Deterministic Markov Decision Processes, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp

Castro, P.S., 2020. Scalable Methods for Computing State Simi- larity in Deterministic Markov Decision Processes, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10069–10076. URL:https://ojs.aaai.org/index.php/AAAI/article/view/6564, doi:10.1609/aaai.v34i06.6564

work page doi:10.1609/aaai.v34i06.6564 2020
[11]

Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems

Chen, J., Zhang, Y., Xu, Y., Ma, H., Yang, H., Song, J., Wang, Y., Wu, Y., 2021. Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems. Advances in Neural Information Pro- cessing Systems 34, 9681–9693

work page 2021
[12]

Minigrid & mini- world: Modular & customizable reinforcement learning environments for goal-oriented tasks

Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P.S., Terry, J., 2023. Minigrid & mini- world: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR abs/2306.13831

work page arXiv 2023
[13]

Cho, D., Lee, S., Kim, H.J., 2022. Outcome-directed reinforcement learning by uncertainty\& temporal distance-aware curriculum goal gen- eration, in: The Eleventh International Conference on Learning Repre- sentations

work page 2022
[14]

Seeing-eye quadruped navigation with force responsive locomotion control, in: Tan, J., Toussaint, M., Darvish, K

DeFazio, D., Hirota, E., Zhang, S., 2023. Seeing-eye quadruped navigation with force responsive locomotion control, in: Tan, J., Toussaint, M., Darvish, K. (Eds.), Proceedings of The 7th Conference on Robot Learning, PMLR. pp. 2184–2194. URL: https://proceedings.mlr.press/v229/defazio23a.html

work page 2023
[15]

Emergent complexity and zero-shot transfer via unsu- pervised environment design

Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., Levine, S., 2020. Emergent complexity and zero-shot transfer via unsu- pervised environment design. Advances in neural information processing systems 33, 13049–13061

work page 2020
[16]

Adaptive procedural task generation for hard-exploration problems, in: International Confer- ence on Learning Representations

Fang, K., Zhu, Y., Savarese, S., Fei-Fei, L., 2020. Adaptive procedural task generation for hard-exploration problems, in: International Confer- ence on Learning Representations

work page 2020
[17]

Automatic goal gen- eration for reinforcement learning agents, in: International conference on machine learning, PMLR

Florensa, C., Held, D., Geng, X., Abbeel, P., 2018. Automatic goal gen- eration for reinforcement learning agents, in: International conference on machine learning, PMLR. pp. 1515–1528. 34

work page 2018
[18]

Re- verse curriculum generation for reinforcement learning, in: Conference on robot learning, PMLR

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., Abbeel, P., 2017. Re- verse curriculum generation for reinforcement learning, in: Conference on robot learning, PMLR. pp. 482–495

work page 2017
[19]

Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc

work page 2014
[20]

Bidirectional lstm networks for improved phoneme classification and recognition, in: Inter- national conference on artificial neural networks, Springer

Graves, A., Fernández, S., Schmidhuber, J., 2005. Bidirectional lstm networks for improved phoneme classification and recognition, in: Inter- national conference on artificial neural networks, Springer. pp. 799–804

work page 2005
[21]

Soft actor-critic: Off-policymaximumentropydeepreinforcementlearningwithastochas- tic actor, in: International conference on machine learning, PMLR

Haarnoja, T., Zhou, A., Abbeel, P., Levine, S., 2018. Soft actor-critic: Off-policymaximumentropydeepreinforcementlearningwithastochas- tic actor, in: International conference on machine learning, PMLR. pp. 1861–1870

work page 2018
[22]

Contextual Markov Decision Processes

Hallak, A., Di Castro, D., Mannor, S., 2015. Contextual markov decision processes. arXiv preprint arXiv:1502.02259

work page internal anchor Pith review Pith/arXiv arXiv 2015
[23]

Curricu- lum reinforcement learning using optimal transport via gradual domain adaptation

Huang, P., Xu, M., Zhu, J., Shi, L., Fang, F., Zhao, D., 2022. Curricu- lum reinforcement learning using optimal transport via gradual domain adaptation. Advances in Neural Information Processing Systems 35, 10656–10670

work page 2022
[24]

Jabri, A., Hsu, K., Gupta, A., Eysenbach, B., Levine, S., Finn, C.,

work page
[25]

Advances in Neural Information Processing Systems 32

Unsupervised curricula for visual meta-reinforcement learning. Advances in Neural Information Processing Systems 32

work page
[26]

Prioritized level replay, in: International Conference on Machine Learning, PMLR

Jiang, M., Grefenstette, E., Rocktäschel, T., 2021. Prioritized level replay, in: International Conference on Machine Learning, PMLR. pp. 4940–4950

work page 2021
[27]

Variational curriculum reinforcement learning for unsupervised discovery of skills, in: International Confer- ence on Machine Learning, PMLR

Kim, S., Lee, K., Choi, J., 2023. Variational curriculum reinforcement learning for unsupervised discovery of skills, in: International Confer- ence on Machine Learning, PMLR. pp. 16668–16695

work page 2023
[28]

Auto-Encoding Variational Bayes

Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 . 35

work page internal anchor Pith review Pith/arXiv arXiv 2013
[29]

A probabilistic interpretation of self-paced learning with applications to reinforcement learning

Klink, P., Abdulsamad, H., Belousov, B., D’Eramo, C., Peters, J., Pa- jarinen, J., 2021. A probabilistic interpretation of self-paced learning with applications to reinforcement learning. J. Mach. Learn. Res. 22

work page 2021
[30]

Self-paced contextual reinforcement learning, in: Kaelbling, L.P., Kragic, D., Sugiura, K

Klink, P., Abdulsamad, H., Belousov, B., Peters, J., 2020a. Self-paced contextual reinforcement learning, in: Kaelbling, L.P., Kragic, D., Sugiura, K. (Eds.), Proceedings of the Con- ference on Robot Learning, PMLR. pp. 513–529. URL: https://proceedings.mlr.press/v100/klink20a.html

work page
[31]

Self-paced deep reinforcement learning

Klink, P., D’Eramo, C., Peters, J.R., Pajarinen, J., 2020b. Self-paced deep reinforcement learning. Advances in Neural Information Processing Systems 33, 9216–9227

work page
[32]

Cur- riculumreinforcementlearningviaconstrainedoptimaltransport, in: In- ternational Conference on Machine Learning, PMLR

Klink, P., Yang, H., D’Eramo, C., Peters, J., Pajarinen, J., 2022. Cur- riculumreinforcementlearningviaconstrainedoptimaltransport, in: In- ternational Conference on Machine Learning, PMLR. pp. 11341–11358

work page 2022
[33]

Understanding the complexity gains of single-task rl with a curriculum, in: International Conference on Machine Learning, PMLR

Li, Q., Zhai, Y., Ma, Y., Levine, S., 2023. Understanding the complexity gains of single-task rl with a curriculum, in: International Conference on Machine Learning, PMLR. pp. 20412–20451

work page 2023
[34]

Task factorization in curriculum learning, in: Decision Aware- ness in Reinforcement Learning Workshop at ICML 2022

Mirsky, R., Shperberg, S.S., Zhang, Y., Xu, Z., Jiang, Y., Cui, J., Stone, P., 2022. Task factorization in curriculum learning, in: Decision Aware- ness in Reinforcement Learning Workshop at ICML 2022

work page 2022
[35]

Human-level control through deep reinforcement learning

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Os- trovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D., 2015. Human-level control through deep reinforcement learning. Nature 518, 529–533. URL:http://www.n...

work page doi:10.1038/nature14236 2015
[36]

Curriculum learning for reinforcement learning domains: A framework and survey

Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M.E., Stone, P., 2020. Curriculum learning for reinforcement learning domains: A framework and survey. The Journal of Machine Learning Research 21, 7382–7431. 36

work page 2020
[37]

Evolving curricula with regret- based environment design, in: International Conference on Machine Learning, PMLR

Parker-Holder, J., Jiang, M., Dennis, M., Samvelyan, M., Foerster, J., Grefenstette, E., Rocktäschel, T., 2022. Evolving curricula with regret- based environment design, in: International Conference on Machine Learning, PMLR. pp. 17473–17498

work page 2022
[38]

Teacher algorithms for curriculum learning of deep rl in continuously parame- terized environments, in: Conference on Robot Learning, PMLR

Portelas, R., Colas, C., Hofmann, K., Oudeyer, P.Y., 2020. Teacher algorithms for curriculum learning of deep rl in continuously parame- terized environments, in: Conference on Robot Learning, PMLR. pp. 835–853

[39] Racaniere, S., Lampinen, A., Santoro, A., Reichert, D., Firoiu, V., Lillicrap, T., 2019. Automated curriculum generation through setter-solver interactions, in: International conference on learning representations.

[40] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N., 2021. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22, 1–8. URL: http://jmlr.org/papers/v22/20-1364.html

[41] Raileanu, R., Goldstein, M., Szlam, A., Fergus, R., 2020. Fast adaptation to new environments via policy-dynamics value functions, in: Proceedings of the 37th International Conference on Machine Learning, pp. 7920–7931.

[42] Rakelly, K., Zhou, A., Finn, C., Levine, S., Quillen, D., 2019. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. pp. 5331–5340. URL: https://proceedings.mlr.press/v97/rakelly19a.html. ISSN: 2640-3498.

[43] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal Policy Optimization Algorithms.

[44] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al., 2018. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362, 1140–1144.

[45] Sutskever, I., Martens, J., Dahl, G., Hinton, G., 2013. On the importance of initialization and momentum in deep learning, in: Dasgupta, S., McAllester, D. (Eds.), Proceedings of the 30th International Conference on Machine Learning, PMLR, Atlanta, Georgia, USA. pp. 1139–1147. URL: https://proceedings.mlr.press/v28/sutskever13.html

[46] Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT press.

[47] Todorov, E., Erez, T., Tassa, Y., 2012. Mujoco: A physics engine for model-based control, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE. pp. 5026–5033. doi:10.1109/IROS.2012.6386109

[48] Wang, R., Lehman, J., Clune, J., Stanley, K.O., 2019. Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753.

[49] Wang, Z., Merel, J.S., Reed, S.E., de Freitas, N., Wayne, G., Heess, N., 2017. Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems 30.

[50] Wu, J., Vorobeychik, Y., 2022. Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum, in: Proceedings of the 39th International Conference on Machine Learning, PMLR. pp. 24177–24211. URL: https://proceedings.mlr.press/v162/wu22k.html. ISSN: 2640-3498.

[51] Zhang, A., McAllister, R.T., Calandra, R., Gal, Y., Levine, S., 2021a. Learning invariant representations for reinforcement learning without reconstruction, in: International Conference on Learning Representations.

[52] Zhang, T., Eysenbach, B., Salakhutdinov, R., Levine, S., Gonzalez, J.E., 2021b. C-planning: An automatic curriculum for learning goal-reaching tasks, in: International Conference on Learning Representations.

[53] Zhang, Y., Abbeel, P., Pinto, L., 2020. Automatic curriculum learning through value disagreement. Advances in Neural Information Processing Systems 33, 7648–7659.

[54] Zhu, H., Gupta, A., Rajeswaran, A., Levine, S., Kumar, V., 2019. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost, in: 2019 International Conference on Robotics and Automation (ICRA), pp. 3651–3657. doi:10.1109/ICRA.2019.8794102

[55] Zhuang, Z., Fu, Z., Wang, J., Atkeson, C.G., Schwertfeger, S., Finn, C., Zhao, H., 2023. Robot parkour learning, in: Conference on Robot Learning, PMLR. pp. 73–92.

[56] Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., Whiteson, S., 2020. Varibad: A very good method for bayes-adaptive deep rl via meta-learning, in: International Conference on Learning Representations.


[1] Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al., 2019. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113.

[2] Ao, S., Zhou, T., Jiang, J., Long, G., Song, X., Zhang, C., 2022. EAT-C: Environment-Adversarial sub-Task Curriculum for Efficient Reinforcement Learning, in: Proceedings of the 39th International Conference on Machine Learning, PMLR. pp. 822–. URL: https://proceedings.mlr.press/v162/ao22a.html. ISSN: 2640-3498.

[5] Azad, A.S., Gur, I., Emhoff, J., Alexis, N., Faust, A., Abbeel, P., Stoica, I., 2023. CLUTR: Curriculum learning via unsupervised task representation learning, in: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (Eds.), Proceedings of the 40th International Conference on Machine Learning, PMLR. pp. 1361–1395. URL: https://...

[6] Bengio, Y., Louradour, J., Collobert, R., Weston, J., 2009. Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, Association for Computing Machinery, New York, NY, USA. pp. 41–48. URL: https://doi.org/10.1145/1553374.1553380, doi:10.1145/1553374.1553380

[7] Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., Bengio, S., 2016. Generating sentences from a continuous space, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics. p. 10.

[8] Campero, A., Raileanu, R., Kuttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E., 2020. Learning with amigo: Adversarially motivated intrinsic goals, in: International Conference on Learning Representations.

[9] Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep clustering for unsupervised learning of visual features, in: Proceedings of the European conference on computer vision (ECCV), pp. 132–149.

[10] Castro, P.S., 2020. Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10069–10076. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6564, doi:10.1609/aaai.v34i06.6564

[11] Chen, J., Zhang, Y., Xu, Y., Ma, H., Yang, H., Song, J., Wang, Y., Wu, Y., 2021. Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems. Advances in Neural Information Processing Systems 34, 9681–9693.

[12] Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P.S., Terry, J., 2023. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR abs/2306.13831.

[13] Cho, D., Lee, S., Kim, H.J., 2022. Outcome-directed reinforcement learning by uncertainty & temporal distance-aware curriculum goal generation, in: The Eleventh International Conference on Learning Representations.

[14] DeFazio, D., Hirota, E., Zhang, S., 2023. Seeing-eye quadruped navigation with force responsive locomotion control, in: Tan, J., Toussaint, M., Darvish, K. (Eds.), Proceedings of The 7th Conference on Robot Learning, PMLR. pp. 2184–2194. URL: https://proceedings.mlr.press/v229/defazio23a.html

[15] Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., Levine, S., 2020. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems 33, 13049–13061.

[16] Fang, K., Zhu, Y., Savarese, S., Fei-Fei, L., 2020. Adaptive procedural task generation for hard-exploration problems, in: International Conference on Learning Representations.

[17] Florensa, C., Held, D., Geng, X., Abbeel, P., 2018. Automatic goal generation for reinforcement learning agents, in: International conference on machine learning, PMLR. pp. 1515–1528.

[18] Florensa, C., Held, D., Wulfmeier, M., Zhang, M., Abbeel, P., 2017. Reverse curriculum generation for reinforcement learning, in: Conference on robot learning, PMLR. pp. 482–495.

[19] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.

[20] Graves, A., Fernández, S., Schmidhuber, J., 2005. Bidirectional lstm networks for improved phoneme classification and recognition, in: International conference on artificial neural networks, Springer. pp. 799–804.

[21] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S., 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International conference on machine learning, PMLR. pp. 1861–1870.

[22] Hallak, A., Di Castro, D., Mannor, S., 2015. Contextual markov decision processes. arXiv preprint arXiv:1502.02259.

[23] Huang, P., Xu, M., Zhu, J., Shi, L., Fang, F., Zhao, D., 2022. Curriculum reinforcement learning using optimal transport via gradual domain adaptation. Advances in Neural Information Processing Systems 35, 10656–10670.

[24] Jabri, A., Hsu, K., Gupta, A., Eysenbach, B., Levine, S., Finn, C., 2019. Unsupervised curricula for visual meta-reinforcement learning. Advances in Neural Information Processing Systems 32.

[26] Jiang, M., Grefenstette, E., Rocktäschel, T., 2021. Prioritized level replay, in: International Conference on Machine Learning, PMLR. pp. 4940–4950.

[27] Kim, S., Lee, K., Choi, J., 2023. Variational curriculum reinforcement learning for unsupervised discovery of skills, in: International Conference on Machine Learning, PMLR. pp. 16668–16695.

[28] Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[29] Klink, P., Abdulsamad, H., Belousov, B., D’Eramo, C., Peters, J., Pajarinen, J., 2021. A probabilistic interpretation of self-paced learning with applications to reinforcement learning. J. Mach. Learn. Res. 22.

[30] Klink, P., Abdulsamad, H., Belousov, B., Peters, J., 2020a. Self-paced contextual reinforcement learning, in: Kaelbling, L.P., Kragic, D., Sugiura, K. (Eds.), Proceedings of the Conference on Robot Learning, PMLR. pp. 513–529. URL: https://proceedings.mlr.press/v100/klink20a.html

[31] Klink, P., D’Eramo, C., Peters, J.R., Pajarinen, J., 2020b. Self-paced deep reinforcement learning. Advances in Neural Information Processing Systems 33, 9216–9227.

[32] Klink, P., Yang, H., D’Eramo, C., Peters, J., Pajarinen, J., 2022. Curriculum reinforcement learning via constrained optimal transport, in: International Conference on Machine Learning, PMLR. pp. 11341–11358.
