
arxiv: 2605.23565 · v1 · pith:4UOT2D5Wnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Understanding Goal Generalisation in Sequential Reinforcement Learning

Pith reviewed 2026-05-25 04:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · goal generalization · sequential training · latent policy gradients · out-of-distribution behavior · policy evolution · training pipelines

The pith

Latent policy gradients simulate low-dimensional variables to predict how sequentially trained RL agents will generalize goals to new environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies reinforcement learning agents trained sequentially on one or more tasks and measures their behavior in hundreds of out-of-distribution environments. It finds that salient features drive generalization and that goals acquired early in training often persist to shape later goals. To account for these patterns across more than 100 training pipelines, the authors introduce latent policy gradients, which models the training process by evolving low-dimensional latent variables toward high reward under a simple mapping to behavior. The method yields accurate predictions, transfers to unseen pipeline types, and remains interpretable. This shows that dependence on training history follows a capturable structure rather than being arbitrary.

Core claim

Latent policy gradients predicts what out-of-distribution behaviour a training pipeline will likely induce by simulating the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour.

What carries the argument

Latent policy gradients, which simulates the evolution of low-dimensional latent variables to maximize training rewards under a simple behavior-mapping model.
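
A minimal sketch of this idea, assuming the geometry given in the Figure 3 caption on this page: training on a goal g moves the latent vector w along the direction S^T φ(g) until φ(g) · Sw = 1/τ, and multi-stage training repeats that step once per stage. The function names, the identity saliency matrix, the three hypothetical features, and the rule that w stays put when it is already on or past the hyperplane are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Optional, Sequence

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product m @ v."""
    return [dot(row, v) for row in m]


def transpose(m: Matrix) -> Matrix:
    """Transpose of a rectangular matrix."""
    return [list(col) for col in zip(*m)]


def train_to_convergence(w: Vector, phi_g: Vector, S: Matrix, tau: float) -> Vector:
    """Move w along S^T phi(g) until phi(g) . (S w) equals 1 / tau.

    Assumption (not from the source): if w is already on or beyond the
    hyperplane, it is returned unchanged.
    """
    direction = matvec(transpose(S), phi_g)
    norm_sq = dot(direction, direction)
    gap = 1.0 / tau - dot(phi_g, matvec(S, w))
    if gap <= 0.0 or norm_sq == 0.0:
        return list(w)
    # phi . S (w + a * S^T phi) = phi . S w + a * |S^T phi|^2, so a = gap / norm_sq.
    step = gap / norm_sq
    return [wi + step * di for wi, di in zip(w, direction)]


def simulate_pipeline(goals: List[Vector], S: Matrix, tau: float,
                      w0: Optional[Vector] = None) -> Vector:
    """Apply one projection step per training stage, in order."""
    w = list(w0) if w0 is not None else [0.0] * len(S[0])
    for phi_g in goals:
        w = train_to_convergence(w, phi_g, S, tau)
    return w


# Hypothetical setup: three features, identity saliency, tau = 1.
S_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
goal_1 = [1.0, 1.0, 0.0]  # features 0 and 1
goal_2 = [1.0, 0.0, 1.0]  # features 0 and 2 (shares feature 0 with goal_1)

two_stage = simulate_pipeline([goal_1, goal_2], S_identity, tau=1.0)
single_stage = simulate_pipeline([goal_2], S_identity, tau=1.0)
print("goal_1 then goal_2:", two_stage)
print("goal_2 only:      ", single_stage)
```

In this toy run the feature seen only in stage one keeps its value after stage two, and the feature seen only in stage two ends up with a smaller value than it gets from single-stage training on the same goal. That is only a property of this simplified projection rule; how closely it matches the paper's fitted model is not something this page establishes.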

If this is right

  • Out-of-distribution agent behavior depends on the entire sequential training pipeline rather than only the final task.
  • Goals learned early can persist and continue to influence goals acquired later.
  • Salient environmental features determine which behaviors generalize to novel settings.
  • The dependence of generalization on training history has an underlying structure that latent policy gradients can capture.
  • A developmental perspective on goal generalization becomes feasible once training pipelines are modeled explicitly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines could be designed deliberately to suppress or encourage particular forms of goal generalization.
  • The same latent-variable simulation approach might extend to studying generalization in other sequential learning domains.
  • If the simple mapping model remains adequate, extensive empirical testing of each new pipeline may become unnecessary.
  • The persistence of early goals suggests parallels with developmental processes where initial experiences constrain later learning.

Load-bearing premise

A simple model of how low-dimensional latent variables map to behavior is sufficient to simulate the actual evolution of an agent's policy during sequential training.

What would settle it

Running agents on new sequential training pipelines and finding that their actual out-of-distribution behaviors diverge systematically from the predictions made by latent policy gradients.
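
One way such divergence could be quantified, in line with the KL-divergence model loss named in the Figure 6 caption, is sketched below. It assumes each evaluation is a two-object choice summarised by a single preference probability, and that the loss is KL(empirical || predicted) averaged over evaluations. The direction of the divergence, the clipping constant, the function names, and the toy numbers are all assumptions; the paper's Equation (1) is not reproduced on this page.

```python
import math
from typing import List, Sequence, Tuple


def bernoulli_kl(p: float, q: float, eps: float = 1e-9) -> float:
    """KL(p || q) between two Bernoulli distributions.

    q is clipped away from 0 and 1 so the logarithms stay finite
    (the clipping constant is an illustrative choice).
    """
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (0.0 if fewer than two values)."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def prediction_loss(empirical: Sequence[float], predicted: Sequence[float]) -> float:
    """Average Bernoulli KL between observed and predicted preference probabilities."""
    losses: List[float] = [bernoulli_kl(p, q) for p, q in zip(empirical, predicted)]
    return sum(losses) / len(losses)


# Toy numbers: observed vs predicted probability of choosing the first object.
observed = [0.9, 0.2, 0.55]
model = [0.8, 0.3, 0.5]
print("mean KL loss:", prediction_loss(observed, model))
```

A systematic failure of the method would show up as losses on new pipelines that sit well above the losses on held-out pipelines of familiar types, with `mean_and_standard_error` giving a rough sense of whether the gap exceeds run-to-run variation.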

Figures

Figures reproduced from arXiv: 2605.23565 by Edward James Young, Jason Ross Brown.

Figure 1. Left: Illustration of our experimental design. Our experimental design is covered in Section 3. RL agents are trained on pipelines involving either one or two stages (e.g., trained to pursue in stage 1 and in stage 2). They are then evaluated in out-of-distribution environments containing two objects (e.g., and ) to generate an empirical preference distribution. We explore these distributions in Section 4.…
Figure 2. Model comparison. Average modelling loss (Equation (1)) across four evaluations (described in main text). Error bars show standard error for K-fold CV. Two per-agent lower bounds (Full Fit only) are also shown. Lower is better. Exact values are given in…
Figure 3. Multi-stage training is iterated projection. Latent policy gradient shifts the latent variables $w$ in the $S^\top \phi(g)$ direction until they intersect with the hyperplane $\phi(g) \cdot S w = \tau^{-1}$. The result of training (to convergence) first on $\phi(g_1)$, and then on $\phi(g_2)$, is shown by $w_2$. The result of training to convergence on $\phi(g_2)$ alone is shown by $w'_2$. … uses a saliency matrix to capture differing feature learn…
Figure 4. An example 8x8 maze environment. The agent is represented by and the is the goal object. Black squares are impassable walls. The agent cannot move off the edges of the maze; the outer wall shown is for illustrative purposes and is not included as part of the agent’s observation. The total observation size is 128 × 128 pixels with 3 colour channels (RGB).
Figure 5. Visual features used in the Maze environment. Colours (top): Three colours (black, red, blue) are used for goal objects during both training and evaluation; grey is used exclusively to render the agent; green appears only in evaluation environments to test generalisation to novel colours. All red, blue, and green use a single RGB input channel. Shapes (bottom): Four shapes (cross, plus, diamond, ring) are …
Figure 6. Model loss (KL divergence) as a function of the latent dimension
Figure 7. Empirical Elo scores vs model-predicted values for all 298 agents across 24 goals. Each…
Figure 8. Zero-mean normalised Elo scores vs model-predicted values. Per-agent means are subtracted…
Figure 9. Left: Generalisation differences between and training goals. Marginalised feature Elo scores for the red, -shape, and -shape features for the single-stage agent trained on , and the single-stage agent trained on . Right: Some features drive generalisation more strongly and are more salient to the model. Average marginalised Elo with standard error for each feature across agents that have been trained on a …
Figure 10. Training on one feature can lead to another being valued more or less strongly than average. Rows indicate a feature being trained on and columns indicate a feature being evaluated, the value being the average preference score for goals containing the evaluation feature across models trained on goals containing the training feature. The average is taken over all single-stage runs without distractors. Orde…
Figure 11. Left: -shape and blue values persist after training on . Marginalised feature Elo scores for red, blue, -shape, and -shape features for three different agents: single-stage agent trained on ; a two-stage agent, → ; a single-stage agent trained just on . Right: Values for early training objectives persist. Marginalised Elo with standard error for features that are only in the first goal in two-stage traini…
Figure 12. Agents trained on more diverse feature sets pursue more goals. Marginalised Elo with standard error across all features for agents that have been trained on a different number of total unique features. Single-stage pipelines always have two unique features in their goals. Two-stage pipelines can have goals which share both, one, or neither of their features.
Figure 13. Left: Training on → does not cause a strong value for -shape. Marginalised feature Elo scores for red, -shape, and -shape for three different agents: a single-stage agent trained just on a ; a single-stage agent trained on a ; a two-stage agent trained on → . Right: Repeated goal features’ values are strengthened, and inhibit new values forming. The left bars within each pair show the marginalised Elo wit…
Figure 14. Generalisation behaviour is sensitive to the order of the training objectives when they share a feature. Marginalised Elo with standard error for each feature stratified by when that feature is present in the first training goal compared to when it is present in the second training goal. Elo is marginalised over pairs of two-stage pipelines without distractors where they are each other's reverse. Left: Cas…
Figure 15. Training curves showing mean episode reward over training steps for selected agents.
Figure 16. Agent values for single-stage training pipelines without distractors.
Figure 17. Agent values after two-stage training without distractors (1/4).
Figure 18. Agent values after two-stage training without distractors (2/4).
Figure 19. Agent values after two-stage training without distractors (3/4).
Figure 20. Agent values after two-stage training without distractors (4/4).
Figure 21. Agent values for single-stage training with distractors (1/3).
Figure 22. Agent values for single-stage training with distractors (2/3).
Figure 23. Agent values for single-stage training with distractors (3/3).
Figure 24. Agent values after two-stage training with distractors (1/5).
Figure 25. Agent values after two-stage training with distractors (2/5).
Figure 26. Agent values after two-stage training with distractors (3/5).
Figure 27. Agent values after two-stage training with distractors (4/5).
Figure 28. Agent values after two-stage training with distractors (5/5).
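
Several captions above report Elo scores derived from agents' preferences between pairs of goal objects. This page does not state how those scores are fitted, so the following is only a sketch of one common approach: a Bradley-Terry fit using the standard minorise-maximise update, rescaled to a zero-mean Elo-style scale. The function names, the scale factor of 400, the requirement that every item is chosen at least once, and the toy counts are assumptions, not details taken from the paper.

```python
import math
from typing import List


def fit_bradley_terry(wins: List[List[float]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths from pairwise choice counts.

    wins[i][j] is the number of times item i was chosen over item j.
    Returns strengths normalised to sum to 1. Assumes every item is
    chosen at least once, otherwise its strength collapses to zero.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


def to_zero_mean_elo(strengths: List[float], scale: float = 400.0) -> List[float]:
    """Convert strengths to an Elo-style scale and subtract the mean."""
    raw = [scale * math.log10(s) for s in strengths]
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]


# Toy counts for three hypothetical goal objects.
counts = [
    [0, 8, 6],
    [2, 0, 5],
    [4, 5, 0],
]
elo = to_zero_mean_elo(fit_bradley_terry(counts))
print("zero-mean Elo-style scores:", elo)
```

Under this model the gap between two scores depends only on the fitted odds of one object being chosen over the other, which is why subtracting a per-agent mean (as the Figure 8 caption describes) leaves pairwise comparisons unchanged.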
Original abstract

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines goal generalisation in sequential reinforcement learning agents by studying over 100 training pipelines across more than 250 out-of-distribution environments. It reports that salient features drive generalisation and that goals acquired early in training can persist and influence later ones. To explain these observations, the authors introduce latent policy gradients: a forward simulation that evolves low-dimensional latent variables during training by maximising reward on the training objective under a simple model of how those latents map to behaviour. The method is claimed to achieve strong predictive accuracy for OOD behaviour, to generalise to unseen pipeline types, and to remain interpretable.

Significance. If the central claims hold, the work offers a structured, developmental account of how training history shapes out-of-distribution goal-directed behaviour in RL, which is relevant to AI safety and reliability. The scale of the empirical study (100+ pipelines, 250+ environments) and the emphasis on interpretability are strengths. A method that predicts OOD outcomes from training dynamics without being directly fitted to those outcomes would constitute a useful contribution if the underlying modelling assumptions are shown to be sufficient.

major comments (2)
  1. [latent policy gradients method description] The predictive claims rest on the assumption that a simple model of latent-to-behaviour mapping is sufficient to simulate actual policy evolution under sequential training. This assumption is load-bearing for both the reported accuracy and the generalisation to unseen pipelines, yet the manuscript provides no ablations against full high-dimensional policy-gradient baselines, no analysis of identifiability of the chosen latents, and no quantification of omitted non-linear or history-dependent effects.
  2. [Abstract] The abstract asserts 'strong predictive accuracy' and generalisation to unseen pipeline types, but the provided text contains no quantitative metrics, error bars, baseline comparisons, or cross-validation details that would allow assessment of these claims. Without such evidence the central empirical result cannot be evaluated.
minor comments (2)
  1. Notation for the latent variables and the simple mapping function should be introduced with explicit equations and a clear statement of what is assumed versus what is learned.
  2. The manuscript would benefit from a dedicated limitations section that discusses the scope of the simple mapping model and the conditions under which the simulation may diverge from true policy updates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We respond to each major comment below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [latent policy gradients method description] The predictive claims rest on the assumption that a simple model of latent-to-behaviour mapping is sufficient to simulate actual policy evolution under sequential training. This assumption is load-bearing for both the reported accuracy and the generalisation to unseen pipelines, yet the manuscript provides no ablations against full high-dimensional policy-gradient baselines, no analysis of identifiability of the chosen latents, and no quantification of omitted non-linear or history-dependent effects.

    Authors: We acknowledge that the manuscript does not contain ablations against full high-dimensional policy-gradient baselines, formal identifiability analysis of the latents, or explicit quantification of omitted non-linear or history-dependent effects. The low-dimensional latent representation was selected to enable interpretability while capturing the dominant dynamics observed across the 100+ pipelines. The reported predictive accuracy is measured on held-out pipelines and environments, but we agree that direct comparisons to higher-dimensional alternatives would better substantiate the sufficiency of the simple mapping. In revision we will add an ablation section comparing the latent model to a full-dimensional simulation where computationally feasible, include a discussion of modeling assumptions and potential omitted effects, and qualify the generalisation claims accordingly. revision: yes

  2. Referee: [Abstract] The abstract asserts 'strong predictive accuracy' and generalisation to unseen pipeline types, but the provided text contains no quantitative metrics, error bars, baseline comparisons, or cross-validation details that would allow assessment of these claims. Without such evidence the central empirical result cannot be evaluated.

    Authors: The abstract is a high-level summary; the quantitative metrics (predictive accuracy with error bars, baseline comparisons, and cross-validation across pipeline types) appear in the experimental results sections of the full manuscript. To address the concern, we will revise the abstract to include a brief reference to the evaluation scale and the nature of the reported accuracy while ensuring all claims remain fully supported by the main text. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is a forward simulation independent of target OOD outcomes

full rationale

The paper's central method (latent policy gradients) is presented as a simulation of low-dimensional latent evolution driven by reward maximization on the training objective, using an explicit simple model of latent-to-behavior mapping. This construction is not equivalent by definition to the OOD predictions it generates, nor does the provided text rely on self-citations, fitted parameters renamed as predictions, or ansatzes imported from prior work. The derivation chain remains self-contained as a modeling approach whose validity rests on empirical predictive accuracy rather than tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no specific free parameters, axioms, or invented entities can be identified or audited without the full text.

pith-pipeline@v0.9.0 · 5717 in / 1107 out tokens · 25341 ms · 2026-05-25T04:47:39.499367+00:00 · methodology


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 22 internal anchors

  1. [1]

    Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Twenty-first international conference on Machine learning - ICML '04 , page 1, Banff, Alberta, Canada, 2004. ACM Press. doi:10.1145/1015330.1015430. URL http://portal.acm.org/citation.cfm?doid=1015330.1015430

  2. [2]

    Stephen Adams, Tyler Cody, and Peter A. Beling. A survey of inverse reinforcement learning. Artificial Intelligence Review, 55 0 (6): 0 4307--4346, August 2022. ISSN 0269-2821, 1573-7462. doi:10.1007/s10462-021-10108-x. URL https://link.springer.com/10.1007/s10462-021-10108-x

  3. [3]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety , July 2016. URL http://arxiv.org/abs/1606.06565. arXiv:1606.06565 [cs]

  4. [4]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm : A Benchmark for Measuring Harmfulness of LLM Agents , April 2025. URL http://arxiv.org/abs/2410.09024. arXiv:2410.09024 [cs]

  5. [5]

    Claude’s Character , August 2024

    Anthropic. Claude’s Character , August 2024. URL https://www.anthropic.com/research/claude-character

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [7]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  8. [8]

    Mechanistic Interpretability for AI Safety -- A Review

    Leonard Bereska and Efstratios Gavves. Mechanistic Interpretability for AI Safety -- A Review , August 2024. URL http://arxiv.org/abs/2404.14082. arXiv:2404.14082 [cs]

  9. [9]

    Weird generalization and inductive backdoors: New ways to corrupt llms

    Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi, Anna Sztyber-Betley, and Owain Evans. Weird generalization and inductive backdoors: New ways to corrupt llms. arXiv preprint arXiv:2512.09742, 2025

  10. [10]

    Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs

    Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs . Nature, 649 0 (8097): 0 584--589, January 2026. ISSN 0028-0836, 1476-4687. doi:10.1038/s41586-025-09937-5. URL http://arxiv.org/abs/2502.17424. arXiv:2502.17424 [cs]

  11. [11]

    Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs : I . The Method of Paired Comparisons . Biometrika, 39 0 (3/4): 0 324, December 1952. ISSN 00063444. doi:10.2307/2334029. URL https://www.jstor.org/stable/2334029?origin=crossref

  12. [12]

    Brown, Carl Henrik Ek, and Robert D

    Jason R. Brown, Carl Henrik Ek, and Robert D. Mullins. Learning from Preferences and Mixed Demonstrations in General Settings , August 2025. URL http://arxiv.org/abs/2508.14027. arXiv:2508.14027 [cs]

  13. [13]

    Deep Reinforcement Learning from Human Preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences . In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https://proceed...

  14. [14]

    Quantifying Generalization in Reinforcement Learning

    Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying Generalization in Reinforcement Learning , July 2019. URL http://arxiv.org/abs/1812.02341. arXiv:1812.02341 [cs]

  15. [15]

    Leveraging Procedural Generation to Benchmark Reinforcement Learning , July 2020

    Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging Procedural Generation to Benchmark Reinforcement Learning , July 2020. URL http://arxiv.org/abs/1912.01588. arXiv:1912.01588 [cs]

  16. [16]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  17. [17]

    Loss of plasticity in deep continual learning

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632 0 (8026): 0 768--774, 2024. ISSN 0028-0836

  18. [18]

    Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

    Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, and Owain Evans. Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers, 2026. URL https://arxiv.org/abs/2604.25891. \_eprint: 2604.25891

  19. [19]

    Arpad E. Elo. The rating of chessplayers, past and present. Ishi Press International, Bronx, NY, 2. print edition, 2008. ISBN 978-0-923891-27-5

  20. [20]

    Reuben Feinman and Brenden M. Lake. Learning Inductive Biases with Simple Neural Networks , June 2018. URL http://arxiv.org/abs/1802.02745. arXiv:1802.02745 [cs]

  21. [21]

    Foundation models in robotics: Applications , challenges, and the future

    Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foundation models in robotics: Applications , challenges, and the future. The International Journal of Robotics Research, 44 0 (5): 0 701--739, April...

  22. [22]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2 0 (11): 0 665--673, November 2020. ISSN 2522-5839. doi:10.1038/s42256-020-00257-z. URL https://www.nature.com/articles/s42256-020-00257-z

  23. [23]

    Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative Alignment : Reasoning Enables Safer Language Models , January 2025. URL http://arxiv.org/abs/2412.16339. arXiv:2412.16339 [cs]

  24. [24]

    Causal Confusion in Imitation Learning , November 2019

    Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal Confusion in Imitation Learning , November 2019. URL http://arxiv.org/abs/1905.11979. arXiv:1905.11979 [cs]

  25. [25]

    Reinforcement Learning with Deep Energy-Based Policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy - Based Policies , July 2017. URL http://arxiv.org/abs/1702.08165. arXiv:1702.08165 [cs]

  26. [26]

    Soft Actor-Critic Algorithms and Applications

    Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft Actor - Critic Algorithms and Applications , January 2019. URL http://arxiv.org/abs/1812.05905. arXiv:1812.05905 [cs]

  27. [27]

    Cooperative Inverse Reinforcement Learning

    Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative Inverse Reinforcement Learning . In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c3395dd46c3...

  28. [28]

    An Overview of Catastrophic AI Risks

    Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of Catastrophic AI Risks , October 2023. URL http://arxiv.org/abs/2306.12001. arXiv:2306.12001 [cs]

  29. [29]

    Risks from Learned Optimization in Advanced Machine Learning Systems

    Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from Learned Optimization in Advanced Machine Learning Systems , December 2021. URL http://arxiv.org/abs/1906.01820. arXiv:1906.01820 [cs]

  30. [30]

    A Review of Deep Transfer Learning and Recent Advancements

    Mohammadreza Iman, Hamid Reza Arabnia, and Khaled Rasheed. A Review of Deep Transfer Learning and Recent Advancements . Technologies, 11 0 (2): 0 40, March 2023. ISSN 2227-7080. doi:10.3390/technologies11020040. URL https://www.mdpi.com/2227-7080/11/2/40

  31. [31]

    Towards Continual Reinforcement Learning : A Review and Perspectives , November 2022

    Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards Continual Reinforcement Learning : A Review and Perspectives , November 2022. URL http://arxiv.org/abs/2012.13490. arXiv:2012.13490 [cs]

  32. [32]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  33. [33]

    A Survey of Zero -shot Generalisation in Deep Reinforcement Learning

    Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A Survey of Zero -shot Generalisation in Deep Reinforcement Learning . Journal of Artificial Intelligence Research, 76: 0 201--264, January 2023. ISSN 1076-9757. doi:10.1613/jair.1.14174. URL http://jair.org/index.php/jair/article/view/14174

  34. [34]

    Goal Misgeneralization in Deep Reinforcement Learning , January 2023

    Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, and David Krueger. Goal Misgeneralization in Deep Reinforcement Learning , January 2023. URL http://arxiv.org/abs/2105.14111. arXiv:2105.14111 [cs]

  35. [35]

    Disentangling the Causes of Plasticity Loss in Neural Networks , February 2024

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, and Will Dabney. Disentangling the Causes of Plasticity Loss in Neural Networks , February 2024. URL http://arxiv.org/abs/2402.18762. arXiv:2402.18762 [cs]

  36. [36]

    Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy

    Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy. Agentic Misalignment : How LLMs Could Be Insider Threats , October 2025. URL http://arxiv.org/abs/2510.05179. arXiv:2510.05179 [cs]

  37. [37]

    Natural Emergent Misalignment from Reward Hacking in Production RL , 2025

    Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural Emergent Misalignmen...

  38. [38]

    Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025

    Sharan Maiya, Henning Bartsch, Nathan Lambert, and Evan Hubinger. Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025. URL http://arxiv.org/abs/2511.01689. arXiv:2511.01689 [cs]

  39. [39]

    Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs, February 2025

    Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks. Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs, February 2025. URL http://arxiv.org/abs/2502.08640. arXiv:2502.08640 [cs]

  40. [40]

    Associative learning and elemental representation: II. Generalization and discrimination

    IPL McLaren and NJ Mackintosh. Associative learning and elemental representation: II. Generalization and discrimination. Animal Learning & Behavior, 30(3):177–200, 2002. ISSN 0090-4996

  41. [41]

    Understanding and Controlling a Maze-Solving Policy Network, October 2023

    Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and Controlling a Maze-Solving Policy Network, October 2023. URL http://arxiv.org/abs/2310.08043. arXiv:2310.08043 [cs]

  42. [42]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, December 2013. URL http://arxiv.org/abs/1312.5602. arXiv:1312.5602 [cs]

  43. [43]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...

  44. [44]

    AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents, October 2025

    Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Gouné, Francisco Javier Campos Zabala, Jason Ross Brown, and Edward James Young. AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents, October 2025. URL http://arxiv.org/abs/2506.04018. arXiv:2506.04018 [cs]

  45. [45]

    Deep double descent: where bigger models and more data hurt*

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, December 2021. ISSN 1742-5468. doi:10.1088/1742-5468/ac3a74. URL https://iopscience.iop.org/article/10.1088/1742-5468/ac3a74

  46. [46]

    The Alignment Problem from a Deep Learning Perspective, May 2025

    Richard Ngo, Lawrence Chan, and Sören Mindermann. The Alignment Problem from a Deep Learning Perspective, May 2025. URL http://arxiv.org/abs/2209.00626. arXiv:2209.00626 [cs]

  47. [47]

    The Primacy Bias in Deep Reinforcement Learning, May 2022

    Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The Primacy Bias in Deep Reinforcement Learning, May 2022. URL http://arxiv.org/abs/2205.07802. arXiv:2205.07802 [cs]

  48. [48]

    Deep reinforcement learning with plasticity injection

    Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36:37142–37159, 2023

  49. [49]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  50. [50]

    A model for stimulus generalization in Pavlovian conditioning

    John M Pearce. A model for stimulus generalization in Pavlovian conditioning. Psychological Review, 94(1):61, 1987. ISSN 1939-1471

  51. [51]

    Gradient starvation: A learning proclivity in neural networks

    Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C. Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256–1272, 2021

  52. [52]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URL http://arxiv.org/abs/2201.02177. arXiv:2201.02177 [cs]

  53. [53]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran A...

  54. [54]

    Stable-Baselines3: Reliable Reinforcement Learning Implementations

    Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

  55. [55]

    Bayesian Inverse Reinforcement Learning

    Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning

  56. [56]

    A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement

    Robert A Rescorla. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning, Current research and theory, 2:64–69, 1972

  57. [57]

    Cognitive psychology for deep neural networks: A shape bias case study

    Samuel Ritter, David GT Barrett, Adam Santoro, and Matt M. Botvinick. Cognitive psychology for deep neural networks: A shape bias case study. In International conference on machine learning, pages 2940--2949. PMLR, 2017. ISBN 2640-3498

  58. [58]

    Progressive Neural Networks

    Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks, October 2022. URL http://arxiv.org/abs/1606.04671. arXiv:1606.04671 [cs]

  59. [59]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]

  60. [60]

    Goal misgeneralization: Why correct specifications aren't enough for correct goals

    Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. Goal misgeneralization: Why correct specifications aren't enough for correct goals. arXiv preprint arXiv:2210.01790, 2022

  61. [61]

    Toward a universal law of generalization for psychological science

    Roger N Shepard. Toward a universal law of generalization for psychological science. Science, 237(4820):1317–1323, 1987. ISSN 0036-8075

  62. [62]

    Misspecification in Inverse Reinforcement Learning

    Joar Skalse and Alessandro Abate. Misspecification in Inverse Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12):15136–15143, June 2023. ISSN 2374-3468, 2159-5399. doi:10.1609/aaai.v37i12.26766. URL https://ojs.aaai.org/index.php/AAAI/article/view/26766

  63. [63]

    Invariance in policy optimisation and partial identifiability in reward learning

    Joar Max Viktor Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In International Conference on Machine Learning, pages 32033–32058. PMLR, 2023. ISBN 2640-3498

  64. [64]

    The Dormant Neuron Phenomenon in Deep Reinforcement Learning, June 2023

    Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The Dormant Neuron Phenomenon in Deep Reinforcement Learning, June 2023. URL http://arxiv.org/abs/2302.12902. arXiv:2302.12902 [cs]

  65. [65]

    School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, August 2025

    Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, and Owain Evans. School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, August 2025. URL http://arxiv.org/abs/2508.17511. arXiv:2508.17511 [cs]

  66. [66]

    Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, January 2026

    Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, and Kyle O'Brien. Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, January 2026. URL http://arxiv.org/abs/2601.10160. arXiv:2601.10160 [cs]

  67. [67]

    Theory of games and economic behavior, 2nd rev

    John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior, 2nd rev. 1947

  68. [68]

    Maximum Entropy Deep Inverse Reinforcement Learning

    Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum Entropy Deep Inverse Reinforcement Learning, March 2016. URL http://arxiv.org/abs/1507.04888. arXiv:1507.04888 [cs]

  69. [69]

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak Attacks and Defenses Against Large Language Models: A Survey, August 2024. URL http://arxiv.org/abs/2407.04295. arXiv:2407.04295 [cs]

  70. [70]

    Investigating Generalisation in Continuous Deep Reinforcement Learning

    Chenyang Zhao, Olivier Sigaud, Freek Stulp, and Timothy M. Hospedales. Investigating Generalisation in Continuous Deep Reinforcement Learning, February 2019. URL http://arxiv.org/abs/1902.07015. arXiv:1902.07015 [cs]

  71. [71]

    Maximum entropy inverse reinforcement learning

    Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008