arxiv: 2605.23372 · v1 · pith:T54LY4J5new · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Curriculum reinforcement learning with measurable task representation learning

Pith reviewed 2026-05-25 05:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords curriculum reinforcement learning · task representation learning · variational autoencoder · automatic curriculum generation · navigation tasks · reinforcement learning · latent space

The pith

A variational autoencoder encodes rewards and state transitions to create a latent task space that supports automatic curriculum generation in non-Euclidean navigation tasks for reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the limitation that interpolation-based curriculum methods assume a Euclidean task space, which fails in complex navigation environments. It proposes encoding task information with a variational autoencoder on rewards and state transitions to produce a latent representation where task similarity is measurable by embedding proximity. This representation then drives an automatic scheme that generates sequences of tasks progressively closer to the target. Experiments across navigation tasks show the resulting curricula outperform those from interpolation and generative adversarial network baselines.
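To see why a Euclidean metric can mislead in navigation, consider a toy grid (hypothetical, not one of the paper's environments) in which two goal positions are close in coordinates but separated by a wall. The sketch below compares straight-line distance with shortest-path length found by breadth-first search; the grid layout and the name `shortest_path_length` are illustrative assumptions.

```python
from collections import deque
import math

# Hypothetical 4x5 grid: '#' is a wall, '.' is free space.
# The wall in column 2 leaves a single gap in the bottom row.
GRID = [
    "..#..",
    "..#..",
    "..#..",
    ".....",
]


def shortest_path_length(grid, start, goal):
    """Number of 4-connected moves from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


goal_a, goal_b = (0, 1), (0, 3)  # (row, col) on either side of the wall
euclidean = math.dist(goal_a, goal_b)
path_len = shortest_path_length(GRID, goal_a, goal_b)
print(f"Euclidean: {euclidean}, shortest path: {path_len}")
```

The two goals are two cells apart in coordinates, yet an agent has to travel around the wall to get from one to the other, so interpolating linearly between them would pass through tasks that are not actually intermediate.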

Core claim

In curriculum reinforcement learning, automatic curriculum generation for complex tasks requires a way to measure task similarity in non-Euclidean spaces. We propose transforming the task space into a latent space using a variational autoencoder that encodes reward functions and state transitions. This produces task embeddings with the property that proximity corresponds to similarity in rewards and transitions. Using these embeddings, we develop a scheme to generate curricula of tasks increasingly similar to the target task. Evaluation in challenging navigation tasks demonstrates superiority over interpolation and GAN-based methods.

What carries the argument

The variational autoencoder that encodes reward and state transitions to produce a latent task representation where embedding distance measures task similarity.
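For orientation, one generic way to write an evidence lower bound for such an encoder is sketched here. This is an assumed textbook form, not the objective stated in the paper: the context trajectory \tau, latent task variable z, encoder q_\phi, reward and transition decoders p_\theta, prior p(z), and weight \beta are all notation introduced for illustration.

```latex
\mathcal{L}(\phi,\theta)
  = \mathbb{E}_{q_\phi(z \mid \tau)}\!\left[
      \sum_{t} \Big( \log p_\theta(r_t \mid s_t, a_t, z)
                   + \log p_\theta(s_{t+1} \mid s_t, a_t, z) \Big)
    \right]
  - \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid \tau) \,\|\, p(z) \right)
```

Under this form, two tasks whose rewards and transitions are reconstructed well from nearby values of z are similar in exactly those quantities, which is the property the curriculum scheme relies on.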

If this is right

  • The latent embeddings enable generation of intermediate tasks whose similarity to the target increases over the curriculum.
  • The approach applies to navigation settings where direct interpolation in the original task space is invalid.
  • Performance on the final target task exceeds that achieved by interpolation-based and GAN-based curriculum methods.
  • The same VAE structure can be reused to measure similarity for any pair of tasks sharing the reward and transition encoding.
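The first and last points above can be illustrated with a toy selection rule. This is a minimal sketch under stated assumptions, not the paper's generation scheme: the embeddings are made-up 2-D points, and `build_curriculum` is a hypothetical helper that walks evenly spaced waypoints from the source embedding to the target embedding and picks the nearest candidate task at each one.

```python
import math


def build_curriculum(embeddings, source, target, num_stages):
    """Pick, for each latent waypoint between source and target, the nearest task."""
    z_src, z_tgt = embeddings[source], embeddings[target]
    curriculum = []
    for k in range(1, num_stages + 1):
        alpha = k / num_stages
        # Linear waypoint in the latent space (assumed to be meaningful there).
        waypoint = tuple((1 - alpha) * a + alpha * b for a, b in zip(z_src, z_tgt))
        nearest = min(embeddings, key=lambda name: math.dist(embeddings[name], waypoint))
        # Skip the source itself and immediate repeats.
        if nearest != source and (not curriculum or curriculum[-1] != nearest):
            curriculum.append(nearest)
    return curriculum


# Hypothetical task embeddings (e.g. posterior means from a trained encoder).
EMBEDDINGS = {
    "start": (0.0, 0.0),
    "a": (1.0, 0.2),
    "b": (2.1, -0.1),
    "c": (3.0, 0.1),
    "target": (4.0, 0.0),
    "far": (0.0, 5.0),
}

curriculum = build_curriculum(EMBEDDINGS, "start", "target", num_stages=4)
distances = [math.dist(EMBEDDINGS[name], EMBEDDINGS["target"]) for name in curriculum]
print(curriculum)
print([round(d, 3) for d in distances])
```

In this toy data the selected tasks end at the target and their latent distance to it shrinks at every stage; the unrelated task `far` is never chosen. Whether such a sequence actually helps learning depends on the premise that latent proximity tracks task similarity.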

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoding might support curriculum transfer across different environment families that share reward and transition statistics.
  • If the latent space captures a manifold of task difficulty, the method could be tested by checking whether generated task sequences produce monotonic increases in agent competence.
  • Extending the encoder to include additional signals such as goal positions or obstacle layouts could further refine similarity measurement.
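The monotonicity check suggested in the second point could be run as follows. This is a sketch with made-up numbers: `is_non_decreasing` is a hypothetical helper, and the success rates stand in for whatever competence measure an experiment records per curriculum stage.

```python
def is_non_decreasing(values, tolerance=0.0):
    """True if each value is at least the previous one, allowing a small tolerance."""
    return all(later >= earlier - tolerance for earlier, later in zip(values, values[1:]))


# Hypothetical target-task success rates measured after each curriculum stage.
success_by_stage = [0.05, 0.20, 0.45, 0.44, 0.80]

print(is_non_decreasing(success_by_stage))                  # strict check
print(is_non_decreasing(success_by_stage, tolerance=0.02))  # allows evaluation noise
```

A tolerance is included because success rates are estimated from a finite number of evaluation episodes, so small dips need not indicate a badly ordered curriculum.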

Load-bearing premise

Proximity in the VAE latent space reliably indicates task similarity in rewards and state transitions sufficient to produce curricula that improve learning on the target task.

What would settle it

A set of navigation tasks where curricula generated from the learned embeddings yield no improvement in target-task success rate or sample efficiency compared with interpolation or random task sequences.

Figures

Figures reproduced from arXiv: 2605.23372 by Mingjian Fu, Peng Liu, Siyuan Li, Xun Wang, Yiqin Yang, Yongyan Wen.

Figure 1: Compared to direct interpolation in context space 1(a) without considering task …
Figure 2: The schema of Latent Space Prediction (LSP) and Exploration Bound Update …
Figure 3: Task representation learning architecture. Representation learning and policy …
Figure 4: Environment visualizations. For MiniGrid (4(a)-4(h)), the context is the position of the goal and key. For U-Maze (4(i)) tasks, the context is the position of the goal.
Figure 5: The learning curves in the MiniGrid environments.
Figure 6: Visualizations of curriculum on MiniGrid-Easy-B. Red rectangles represent a high probability of the location as a target, black indicates a low probability, and the green rectangle represents the target. From left to right, the ACRL-generated distribution gradually moves from the initial position to the gap and gradually moves closer to the target. In contrast, CURROT generates an intermediate distribution …
Figure 7: Representation learning results in MiniGrid-Easy-A. (a) Dot positions in the right panel indicate the mean values of the latent space variables, and the colors indicate the episodic return for the corresponding task under the policy. The left panel shows the four tasks configured by [4, 2], [5, 2], [5, 8] and [7, 2], respectively. (b) MDS analysis result. Only top-2 eigenvalues of Gram matrix are positive …
Figure 8: The learning curves in U-Maze. All the curves are averaged over 5 runs, and the shaded error bars represent the standard variances. To assess performance in the continuous control task, the U-Maze environment with continuous context space was introduced. The environment is implemented in MuJoCo [44]. In this environment, the agent must avoid the center barrier, rendering l2 distance as a metric in context …
Figure 9: In the initial phase, goals are concentrated near the starting point, …
Figure 9: Visualizations of curriculum on U-Maze. The centers of the original source and target are located at [2, 0] and [0, 8], respectively. The distribution gradually transitions from being near the initial position to the right side, ultimately reaching the target goal. In the visual representation, the red color indicates a lower corresponding episodic return, while blue color signifies a higher return.
Figure 10: Ablation studies on the sampling ratio λ.
Figure 11: Impact of curriculum parameters.
Figure 12: Ablation studies on decoders of task representation learning.
Original abstract

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Among the rich CRL literature, the interpolation-based CRL paradigm is a main body, which automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in task space with meaningful distance metrics (i.e., can measure the task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex task, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure the similarity, we propose to transform the task space to a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we achieve a latent task representation with a task similarity measurement property, and two close task embeddings correspond to two similar tasks in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks more and more similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experiment results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a curriculum reinforcement learning (CRL) method for automatic curriculum generation in non-Euclidean task spaces, such as challenging navigation tasks. It introduces a variational autoencoder (VAE) trained on rewards and state transitions to learn a latent task representation where Euclidean proximity is claimed to indicate task similarity, enabling generation of curricula with tasks progressively closer to the target; experiments are said to show superiority over interpolation-based and GAN-based CRL baselines.

Significance. If the central claim holds, the work could extend automatic CRL to domains where direct interpolation in task space fails due to non-Euclidean structure, by providing a learned similarity metric grounded in reward and transition statistics. The VAE-based approach is a plausible direction for measurable representations, but its significance hinges on whether the latent metric supports effective policy/value transfer rather than superficial reconstruction.

major comments (2)
  1. [Abstract] Abstract: the central claim that the VAE yields 'a latent task representation with a task similarity measurement property' (such that 'two close task embeddings correspond to two similar tasks in terms of rewards and state transitions') is load-bearing for the curriculum scheme, yet the manuscript provides no derivation, bound, or empirical test showing that ELBO minimization on rewards/transitions produces distances aligned with RL transfer metrics (e.g., policy overlap or value-function distance) rather than trajectory statistics alone.
  2. [Abstract] Abstract (experiments paragraph): the superiority claim over 'state-of-the-art CRL approaches based on interpolation and generative adversarial networks' is stated without any quantitative results, error bars, dataset/task specifications, ablation studies, or statistical tests, preventing verification that gains arise from the learned representation rather than the downstream task-generation heuristic.
minor comments (1)
  1. [Abstract] Abstract: minor grammatical issues ('in complex task', 'the experiment results') and undefined acronyms on first use reduce clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below, providing clarifications and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the VAE yields 'a latent task representation with a task similarity measurement property' (such that 'two close task embeddings correspond to two similar tasks in terms of rewards and state transitions') is load-bearing for the curriculum scheme, yet the manuscript provides no derivation, bound, or empirical test showing that ELBO minimization on rewards/transitions produces distances aligned with RL transfer metrics (e.g., policy overlap or value-function distance) rather than trajectory statistics alone.

    Authors: The manuscript's claim is specifically that close embeddings correspond to similar tasks in terms of rewards and state transitions, which follows directly from the VAE training objective of reconstructing these quantities. The curriculum scheme uses this to generate tasks with progressively closer embeddings. We acknowledge that no theoretical derivation or bound is provided linking latent distances to policy overlap or value-function distance, and that empirical validation of transfer metrics would strengthen the work. In revision, we will add experiments measuring policy similarity and value-function distances for tasks with nearby embeddings. revision: yes

  2. Referee: [Abstract] Abstract (experiments paragraph): the superiority claim over 'state-of-the-art CRL approaches based on interpolation and generative adversarial networks' is stated without any quantitative results, error bars, dataset/task specifications, ablation studies, or statistical tests, preventing verification that gains arise from the learned representation rather than the downstream task-generation heuristic.

    Authors: The abstract summarizes the experimental findings at a high level. Full quantitative results with error bars, task specifications, ablation studies, and statistical tests appear in the Experiments section. To improve clarity, we will revise the abstract to include key quantitative gains and explicit references to the experimental setups and datasets used. revision: yes
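The experiment promised in the first response, checking whether latent distance agrees with a transfer-relevant distance, can be summarized with a rank correlation. The following is a minimal sketch with invented numbers: `spearman` is a hypothetical helper using the no-ties formula, and both distance lists are placeholders for measurements over task pairs.

```python
def ranks(values):
    """Rank positions (0 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical measurements for four task pairs.
latent_distance = [0.1, 0.5, 0.9, 1.4]  # distance between task embeddings
value_distance = [0.2, 0.3, 1.0, 2.0]   # e.g. a value-function distance

print(spearman(latent_distance, value_distance))
```

A coefficient near 1 would indicate that pairs ranked close in the latent space are also ranked close under the transfer metric; a value near 0 would support the referee's concern that the embedding reflects trajectory statistics alone.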

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard VAE training and empirical evaluation

full rationale

The paper trains a VAE to reconstruct rewards and state transitions, then uses the resulting latent distances to order tasks for curriculum generation. This is a standard autoencoder objective followed by a downstream heuristic; the reported performance gains are measured on external navigation benchmarks rather than being equivalent to the reconstruction loss or latent distances by definition. No equations or claims in the abstract reduce the final task success metric to a fitted parameter or self-citation chain. The assumption that latent proximity aligns with transferable similarity is presented as an empirical property to be validated experimentally, not a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; no explicit fitting procedures, background lemmas, or new postulated objects are described.

pith-pipeline@v0.9.0 · 5798 in / 1175 out tokens · 20541 ms · 2026-05-25T05:17:26.518583+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 4 internal anchors

  1. [1]

    Solving Rubik's Cube with a Robot Hand

    Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al., 2019. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113

  2. [2]

    Ao, S., Zhou, T., Jiang, J., Long, G., Song, X., Zhang, C.,

  3. [3]

    EAT-C: Environment-Adversarial sub-Task Curriculum for Efficient Reinforcement Learning, in: Proceedings of the 39th International Conference on Machine Learning, PMLR. pp. 822–

  4. [4]

    ISSN: 2640-3498

    URL: https://proceedings.mlr.press/v162/ao22a.html. ISSN: 2640-3498

  5. [5]

    CLUTR: Curriculum learning via unsupervised task representation learning, in: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

    Azad, A.S., Gur, I., Emhoff, J., Alexis, N., Faust, A., Abbeel, P., Stoica, I., 2023. CLUTR: Curriculum learning via unsupervised task representation learning, in: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (Eds.), Proceedings of the 40th International Conference on Machine Learning, PMLR. pp. 1361–1395. URL: https://...

  6. [6]

    Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, Association for Computing Machinery, New York, NY, USA

    Bengio, Y., Louradour, J., Collobert, R., Weston, J., 2009. Curriculum learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, Association for Computing Machinery, New York, NY, USA. pp. 41–48. URL: https://doi.org/10.1145/1553374.1553380, doi:10.1145/1553374.1553380

  7. [7]

    Generating sentences from a continuous space, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics

    Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., Bengio, S., 2016. Generating sentences from a continuous space, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics. p. 10

  8. [8]

    Learning with amigo: Adversarially motivated intrinsic goals, in: International Conference on Learning Representations

    Campero, A., Raileanu, R., Kuttler, H., Tenenbaum, J.B., Rocktäschel, T., Grefenstette, E., 2020. Learning with amigo: Adversarially motivated intrinsic goals, in: International Conference on Learning Representations

  9. [9]

    Deep clustering for unsupervised learning of visual features, in: Proceedings of the European conference on computer vision (ECCV), pp

    Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep clustering for unsupervised learning of visual features, in: Proceedings of the European conference on computer vision (ECCV), pp. 132–149.

  10. [10]

    Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp

    Castro, P.S., 2020. Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10069–10076. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6564, doi:10.1609/aaai.v34i06.6564

  11. [11]

    Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems

    Chen, J., Zhang, Y., Xu, Y., Ma, H., Yang, H., Song, J., Wang, Y., Wu, Y., 2021. Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems. Advances in Neural Information Processing Systems 34, 9681–9693

  12. [12]

    Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks

    Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P.S., Terry, J., 2023. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR abs/2306.13831

  13. [13]

    Cho, D., Lee, S., Kim, H.J., 2022. Outcome-directed reinforcement learning by uncertainty & temporal distance-aware curriculum goal generation, in: The Eleventh International Conference on Learning Representations

  14. [14]

    Seeing-eye quadruped navigation with force responsive locomotion control, in: Tan, J., Toussaint, M., Darvish, K

    DeFazio, D., Hirota, E., Zhang, S., 2023. Seeing-eye quadruped navigation with force responsive locomotion control, in: Tan, J., Toussaint, M., Darvish, K. (Eds.), Proceedings of The 7th Conference on Robot Learning, PMLR. pp. 2184–2194. URL: https://proceedings.mlr.press/v229/defazio23a.html

  15. [15]

    Emergent complexity and zero-shot transfer via unsupervised environment design

    Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., Levine, S., 2020. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems 33, 13049–13061

  16. [16]

    Adaptive procedural task generation for hard-exploration problems, in: International Conference on Learning Representations

    Fang, K., Zhu, Y., Savarese, S., Fei-Fei, L., 2020. Adaptive procedural task generation for hard-exploration problems, in: International Conference on Learning Representations

  17. [17]

    Automatic goal generation for reinforcement learning agents, in: International conference on machine learning, PMLR

    Florensa, C., Held, D., Geng, X., Abbeel, P., 2018. Automatic goal generation for reinforcement learning agents, in: International conference on machine learning, PMLR. pp. 1515–1528.

  18. [18]

    Reverse curriculum generation for reinforcement learning, in: Conference on robot learning, PMLR

    Florensa, C., Held, D., Wulfmeier, M., Zhang, M., Abbeel, P., 2017. Reverse curriculum generation for reinforcement learning, in: Conference on robot learning, PMLR. pp. 482–495

  19. [19]

    Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc

  20. [20]

    Bidirectional lstm networks for improved phoneme classification and recognition, in: International conference on artificial neural networks, Springer

    Graves, A., Fernández, S., Schmidhuber, J., 2005. Bidirectional lstm networks for improved phoneme classification and recognition, in: International conference on artificial neural networks, Springer. pp. 799–804

  21. [21]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International conference on machine learning, PMLR

    Haarnoja, T., Zhou, A., Abbeel, P., Levine, S., 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International conference on machine learning, PMLR. pp. 1861–1870

  22. [22]

    Contextual Markov Decision Processes

    Hallak, A., Di Castro, D., Mannor, S., 2015. Contextual markov decision processes. arXiv preprint arXiv:1502.02259

  23. [23]

    Curriculum reinforcement learning using optimal transport via gradual domain adaptation

    Huang, P., Xu, M., Zhu, J., Shi, L., Fang, F., Zhao, D., 2022. Curriculum reinforcement learning using optimal transport via gradual domain adaptation. Advances in Neural Information Processing Systems 35, 10656–10670

  24. [24]

    Jabri, A., Hsu, K., Gupta, A., Eysenbach, B., Levine, S., Finn, C.,

  25. [25]

    Advances in Neural Information Processing Systems 32

    Unsupervised curricula for visual meta-reinforcement learning. Advances in Neural Information Processing Systems 32

  26. [26]

    Prioritized level replay, in: International Conference on Machine Learning, PMLR

    Jiang, M., Grefenstette, E., Rocktäschel, T., 2021. Prioritized level replay, in: International Conference on Machine Learning, PMLR. pp. 4940–4950

  27. [27]

    Variational curriculum reinforcement learning for unsupervised discovery of skills, in: International Conference on Machine Learning, PMLR

    Kim, S., Lee, K., Choi, J., 2023. Variational curriculum reinforcement learning for unsupervised discovery of skills, in: International Conference on Machine Learning, PMLR. pp. 16668–16695

  28. [28]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114

  29. [29]

    A probabilistic interpretation of self-paced learning with applications to reinforcement learning

    Klink, P., Abdulsamad, H., Belousov, B., D’Eramo, C., Peters, J., Pajarinen, J., 2021. A probabilistic interpretation of self-paced learning with applications to reinforcement learning. J. Mach. Learn. Res. 22

  30. [30]

    Self-paced contextual reinforcement learning, in: Kaelbling, L.P., Kragic, D., Sugiura, K

    Klink, P., Abdulsamad, H., Belousov, B., Peters, J., 2020a. Self-paced contextual reinforcement learning, in: Kaelbling, L.P., Kragic, D., Sugiura, K. (Eds.), Proceedings of the Conference on Robot Learning, PMLR. pp. 513–529. URL: https://proceedings.mlr.press/v100/klink20a.html

  31. [31]

    Self-paced deep reinforcement learning

    Klink, P., D’Eramo, C., Peters, J.R., Pajarinen, J., 2020b. Self-paced deep reinforcement learning. Advances in Neural Information Processing Systems 33, 9216–9227

  32. [32]

    Curriculum reinforcement learning via constrained optimal transport, in: International Conference on Machine Learning, PMLR

    Klink, P., Yang, H., D’Eramo, C., Peters, J., Pajarinen, J., 2022. Curriculum reinforcement learning via constrained optimal transport, in: International Conference on Machine Learning, PMLR. pp. 11341–11358

  33. [33]

    Understanding the complexity gains of single-task rl with a curriculum, in: International Conference on Machine Learning, PMLR

    Li, Q., Zhai, Y., Ma, Y., Levine, S., 2023. Understanding the complexity gains of single-task rl with a curriculum, in: International Conference on Machine Learning, PMLR. pp. 20412–20451

  34. [34]

    Task factorization in curriculum learning, in: Decision Awareness in Reinforcement Learning Workshop at ICML 2022

    Mirsky, R., Shperberg, S.S., Zhang, Y., Xu, Z., Jiang, Y., Cui, J., Stone, P., 2022. Task factorization in curriculum learning, in: Decision Awareness in Reinforcement Learning Workshop at ICML 2022

  35. [35]

    Human-level control through deep reinforcement learning

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D., 2015. Human-level control through deep reinforcement learning. Nature 518, 529–533. URL: http://www.n...

  36. [36]

    Curriculum learning for reinforcement learning domains: A framework and survey

    Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M.E., Stone, P., 2020. Curriculum learning for reinforcement learning domains: A framework and survey. The Journal of Machine Learning Research 21, 7382–7431.

  37. [37]

    Evolving curricula with regret-based environment design, in: International Conference on Machine Learning, PMLR

    Parker-Holder, J., Jiang, M., Dennis, M., Samvelyan, M., Foerster, J., Grefenstette, E., Rocktäschel, T., 2022. Evolving curricula with regret-based environment design, in: International Conference on Machine Learning, PMLR. pp. 17473–17498

  38. [38]

    Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments, in: Conference on Robot Learning, PMLR

    Portelas, R., Colas, C., Hofmann, K., Oudeyer, P.Y., 2020. Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments, in: Conference on Robot Learning, PMLR. pp. 835–853

  39. [39]

    Automated curriculum generation through setter-solver interactions, in: International conference on learning representations

    Racaniere, S., Lampinen, A., Santoro, A., Reichert, D., Firoiu, V., Lillicrap, T., 2019. Automated curriculum generation through setter-solver interactions, in: International conference on learning representations

  40. [40]

    Stable-baselines3: Reliable reinforcement learning implementations

    Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N., 2021. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22, 1–8. URL: http://jmlr.org/papers/v22/20-1364.html

  41. [41]

    Fast adaptation to new environments via policy-dynamics value functions, in: Proceedings of the 37th International Conference on Machine Learning, pp

    Raileanu, R., Goldstein, M., Szlam, A., Fergus, R., 2020. Fast adaptation to new environments via policy-dynamics value functions, in: Proceedings of the 37th International Conference on Machine Learning, pp. 7920–7931

  42. [42]

    Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, in: Proceedings of the 36th International Conference on Machine Learning, PMLR

    Rakelly, K., Zhou, A., Finn, C., Levine, S., Quillen, D., 2019. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. pp. 5331–5340. URL: https://proceedings.mlr.press/v97/rakelly19a.html. ISSN: 2640-3498

  43. [43]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal Policy Optimization Algorithms

  44. [44]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al., 2018. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362, 1140–1144

  45. [45]

    On the importance of initialization and momentum in deep learning, in: Dasgupta, S., McAllester, D

    Sutskever, I., Martens, J., Dahl, G., Hinton, G., 2013. On the importance of initialization and momentum in deep learning, in: Dasgupta, S., McAllester, D. (Eds.), Proceedings of the 30th International Conference on Machine Learning, PMLR, Atlanta, Georgia, USA. pp. 1139–1147. URL: https://proceedings.mlr.press/v28/sutskever13.html

  46. [46]

    Reinforcement Learning: An Introduction

    Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT press

  47. [47]

    Mujoco: A physics engine for model-based control, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE

    Todorov, E., Erez, T., Tassa, Y., 2012. Mujoco: A physics engine for model-based control, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE. pp. 5026–5033. doi:10.1109/IROS.2012.6386109

  48. [48]

    Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions

    Wang, R., Lehman, J., Clune, J., Stanley, K.O., 2019. Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753

  49. [49]

    Robust imitation of diverse behaviors

    Wang, Z., Merel, J.S., Reed, S.E., de Freitas, N., Wayne, G., Heess, N., 2017. Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems 30

  50. [50]

    Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum, in: Proceedings of the 39th International Conference on Machine Learning, PMLR

    Wu, J., Vorobeychik, Y., 2022. Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum, in: Proceedings of the 39th International Conference on Machine Learning, PMLR. pp. 24177–24211. URL: https://proceedings.mlr.press/v162/wu22k.html. ISSN: 2640-3498

  51. [51]

    Learning invariant representations for reinforcement learning without reconstruction, in: International Conference on Learning Representations

    Zhang, A., McAllister, R.T., Calandra, R., Gal, Y., Levine, S., 2021a. Learning invariant representations for reinforcement learning without reconstruction, in: International Conference on Learning Representations

  52. [52]

    C-planning: An automatic curriculum for learning goal-reaching tasks, in: International Conference on Learning Representations

    Zhang, T., Eysenbach, B., Salakhutdinov, R., Levine, S., Gonzalez, J.E., 2021b. C-planning: An automatic curriculum for learning goal-reaching tasks, in: International Conference on Learning Representations

  53. [53]

    Automatic curriculum learning through value disagreement

    Zhang, Y., Abbeel, P., Pinto, L., 2020. Automatic curriculum learning through value disagreement. Advances in Neural Information Processing Systems 33, 7648–7659.

  54. [54]

    Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost, in: 2019 International Conference on Robotics and Automation (ICRA), pp

    Zhu, H., Gupta, A., Rajeswaran, A., Levine, S., Kumar, V., 2019. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost, in: 2019 International Conference on Robotics and Automation (ICRA), pp. 3651–3657. doi:10.1109/ICRA.2019.8794102

  55. [55]

    Robot parkour learning, in: Conference on Robot Learning, PMLR

    Zhuang, Z., Fu, Z., Wang, J., Atkeson, C.G., Schwertfeger, S., Finn, C., Zhao, H., 2023. Robot parkour learning, in: Conference on Robot Learning, PMLR. pp. 73–92

  56. [56]

    Varibad: A very good method for bayes-adaptive deep rl via meta-learning, in: International Conference on Learning Representations

    Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., Whiteson, S., 2020. Varibad: A very good method for bayes-adaptive deep rl via meta-learning, in: International Conference on Learning Representations.