Understanding Goal Generalisation in Sequential Reinforcement Learning

Edward James Young; Jason Ross Brown

arxiv: 2605.23565 · v1 · pith:4UOT2D5Wnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-25 04:47 UTC · model grok-4.3

keywords reinforcement learning · goal generalization · sequential training · latent policy gradients · out-of-distribution behavior · policy evolution · training pipelines

The pith

Latent policy gradients simulate low-dimensional variables to predict how sequentially trained RL agents will generalize goals to new environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies reinforcement learning agents trained sequentially on one or more tasks and measures their behavior in hundreds of out-of-distribution environments. It finds that salient features drive generalization and that goals acquired early in training often persist to shape later goals. To account for these patterns across more than 100 training pipelines, the authors introduce latent policy gradients, which models the training process by evolving low-dimensional latent variables toward high reward under a simple mapping to behavior. The method yields accurate predictions, transfers to unseen pipeline types, and remains interpretable. This shows that dependence on training history follows a capturable structure rather than being arbitrary.

Core claim

Latent policy gradients predicts what out-of-distribution behaviour a training pipeline will likely induce by simulating the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour.

What carries the argument

Latent policy gradients, which simulates the evolution of low-dimensional latent variables to maximize training rewards under a simple behavior-mapping model.

If this is right

Out-of-distribution agent behavior depends on the entire sequential training pipeline rather than only the final task.
Goals learned early can persist and continue to influence goals acquired later.
Salient environmental features determine which behaviors generalize to novel settings.
The dependence of generalization on training history has an underlying structure that latent policy gradients can capture.
A developmental perspective on goal generalization becomes feasible once training pipelines are modeled explicitly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines could be designed deliberately to suppress or encourage particular forms of goal generalization.
The same latent-variable simulation approach might extend to studying generalization in other sequential learning domains.
If the simple mapping model remains adequate, extensive empirical testing of each new pipeline may become unnecessary.
The persistence of early goals suggests parallels with developmental processes where initial experiences constrain later learning.

Load-bearing premise

A simple model of how low-dimensional latent variables map to behavior is sufficient to simulate the actual evolution of an agent's policy during sequential training.

What would settle it

Running agents on new sequential training pipelines and finding that their actual out-of-distribution behaviors diverge systematically from the predictions made by latent policy gradients.
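One way to make this test concrete, sketched here under assumptions: treat an agent's out-of-distribution behaviour as a preference distribution over the objects it could pursue (as in the Figure 1 caption) and score a prediction by its KL divergence from the empirical distribution (the loss named in the Figure 6 caption). The function name, the direction of the divergence, the smoothing constant `eps`, and the example numbers are all hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(empirical, predicted, eps=1e-12):
    """KL(empirical || predicted) in nats for two discrete distributions.

    ASSUMPTION: the direction of the divergence and the `eps` floor on the
    predicted probabilities are illustrative choices, not the paper's.
    """
    if len(empirical) != len(predicted):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p, q in zip(empirical, predicted):
        if p > 0.0:  # terms with p == 0 contribute nothing
            total += p * math.log(p / max(q, eps))
    return total


# Hypothetical two-object evaluation environment.
empirical = [0.8, 0.2]          # observed preference distribution
good_prediction = [0.75, 0.25]  # close to the observed behaviour
bad_prediction = [0.2, 0.8]     # systematically reversed preference

loss_good = kl_divergence(empirical, good_prediction)
loss_bad = kl_divergence(empirical, bad_prediction)
```

A systematic divergence of the kind described above would show up as consistently large losses on held-out pipelines, as with `loss_bad` here.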

Figures

Figures reproduced from arXiv: 2605.23565 by Edward James Young, Jason Ross Brown.

**Figure 1.** Left: Illustration of our experimental design. Our experimental design is covered in Section 3. RL agents are trained on pipelines involving either one or two stages (e.g., trained to pursue in stage 1 and in stage 2). They are then evaluated in out-of-distribution environments containing two objects (e.g., and ) to generate an empirical preference distribution. We explore these distributions in Section 4.…

**Figure 2.** Model comparison. Average modelling loss (Equation (1)) across four evaluations (described in main text). Error bars show standard error for K-fold CV. Two per-agent lower bounds (Full Fit only) are also shown. Lower is better. Exact values are given in…

**Figure 3.** Multi-stage training is iterated projection. Latent policy gradient shifts the latent variables $w$ in the $S^\top \phi(g)$ direction until they intersect with the hyperplane $\phi(g) \cdot Sw = \tau^{-1}$. The result of training (to convergence) first on $\phi(g_1)$, and then on $\phi(g_2)$ is shown by $w_2$. The result of training to convergence on $\phi(g_2)$ alone is shown by $w'_2$. uses a saliency matrix to capture differing feature learn…
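The projection described in the Figure 3 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the latent vector `w`, saliency matrix `S`, goal features `phi`, and temperature `tau` follow the caption's symbols, but the matrix shapes, the diagonal form of `S`, the feature ordering, the numeric values, and the rule that a latent already at or past the hyperplane is left unchanged are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, v) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def train_to_convergence(w, S, phi, tau):
    """Shift w along S^T phi until phi . (S w) = 1 / tau."""
    direction = matvec(transpose(S), phi)   # S^T phi
    norm_sq = dot(direction, direction)
    target = 1.0 / tau
    current = dot(phi, matvec(S, w))        # phi . S w
    # ASSUMPTION: if w already reaches the hyperplane, training changes nothing.
    if current >= target or norm_sq == 0.0:
        return list(w)
    step = (target - current) / norm_sq
    return [wi + step * di for wi, di in zip(w, direction)]


# Hypothetical features, in order: red, blue, cross, diamond.
S = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]
phi_g1 = [1.0, 0.0, 1.0, 0.0]  # stage-1 goal: red cross
phi_g2 = [0.0, 1.0, 1.0, 0.0]  # stage-2 goal: blue cross
tau = 0.5
w0 = [0.0, 0.0, 0.0, 0.0]

w1 = train_to_convergence(w0, S, phi_g1, tau)
w2 = train_to_convergence(w1, S, phi_g2, tau)        # two-stage pipeline
w2_alone = train_to_convergence(w0, S, phi_g2, tau)  # stage 2 only
```

In this toy setup the latent for the feature that appears only in stage 1 stays non-zero in `w2` but is zero in `w2_alone`, which is the kind of history dependence the caption illustrates.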

**Figure 4.** An example 8x8 maze environment. The agent is represented by and the is the goal object. Black squares are impassable walls. The agent cannot move off the edges of the maze; the outer wall shown is for illustrative purposes and is not included as part of the agent’s observation. The total observation size is 128 × 128 pixels with 3 colour channels (RGB).

**Figure 5.** Visual features used in the Maze environment. Colours (top): Three colours (black, red, blue) are used for goal objects during both training and evaluation; grey is used exclusively to render the agent; green appears only in evaluation environments to test generalisation to novel colours. All red, blue, and green use a single RGB input channel. Shapes (bottom): Four shapes (cross, plus, diamond, ring) are …

**Figure 6.** Model loss (KL divergence) as a function of the latent dimension

**Figure 7.** Empirical Elo scores vs model-predicted values for all 298 agents across 24 goals. Each…

**Figure 8.** Zero-mean normalised Elo scores vs model-predicted values. Per-agent means are subtracted…
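For readers unfamiliar with the Elo scale used in Figures 7 and 8, the sketch below shows the standard Elo expected-score formula together with a per-agent zero-mean normalisation like the one the Figure 8 caption mentions. How the paper actually fits Elo scores is not stated in this text, so the base-10, 400-point scale, the function names, and the example ratings are assumptions.

```python
def elo_win_probability(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score: probability that A is preferred over B.

    ASSUMPTION: the conventional base-10, 400-point scale; the paper's
    exact scale is not given in this text.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def zero_mean(ratings):
    """Subtract one agent's mean rating from each of its goal ratings."""
    mean = sum(ratings.values()) / len(ratings)
    return {goal: rating - mean for goal, rating in ratings.items()}


# Hypothetical ratings of three goals for a single agent.
ratings = {"goal_a": 1200.0, "goal_b": 1000.0, "goal_c": 800.0}
centred = zero_mean(ratings)

p_raw = elo_win_probability(ratings["goal_a"], ratings["goal_b"])
p_centred = elo_win_probability(centred["goal_a"], centred["goal_b"])
```

Because the formula depends only on rating differences, subtracting a per-agent mean leaves every pairwise preference probability unchanged, so `p_raw` and `p_centred` agree.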

**Figure 9.** Left: Generalisation differences between and training goals. Marginalised feature Elo scores for the red, -shape, and -shape features for the single-stage agent trained on , and the single-stage agent trained on . Right: Some features drive generalisation more strongly and are more salient to the model. Average marginalised Elo with standard error for each feature across agents that have been trained on a …

**Figure 10.** Training on one feature can lead to another being valued more or less strongly than average. Rows indicate a feature being trained on and columns indicate a feature being evaluated, the value being the average preference score for goals containing the evaluation feature across models trained on goals containing the training feature. The average is taken over all single-stage runs without distractors. Orde…
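The marginalisation these captions describe, averaging a per-goal score over every goal that contains a given feature, can be sketched as follows. This is a minimal illustration assuming each goal is a (colour, shape) pair; the function name and the scores are hypothetical, and the paper's own procedure may weight or filter goals differently.

```python
def marginalised_feature_scores(goal_scores):
    """Average per-goal scores over all goals that contain each feature.

    `goal_scores` maps a tuple of features, e.g. ("red", "cross"),
    to a score for that goal (hypothetical values below).
    """
    totals, counts = {}, {}
    for features, score in goal_scores.items():
        for feature in features:
            totals[feature] = totals.get(feature, 0.0) + score
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical zero-mean scores for one agent over four goals.
goal_scores = {
    ("red", "cross"): 150.0,
    ("red", "ring"): 50.0,
    ("blue", "cross"): -50.0,
    ("blue", "ring"): -150.0,
}
feature_scores = marginalised_feature_scores(goal_scores)
```

With these made-up numbers the colour features separate more strongly than the shape features, which is the kind of contrast a marginalised plot is meant to expose.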

**Figure 11.** Left: -shape and blue values persist after training on . Marginalised feature Elo scores for red, blue, -shape, and -shape features for three different agents: single-stage agent trained on ; a two-stage agent, → ; a single-stage agent trained just on . Right: Values for early training objectives persist. Marginalised Elo with standard error for features that are only in the first goal in two-stage traini…

**Figure 12.** Agents trained on more diverse feature sets pursue more goals. Marginalised Elo with standard error across all features for agents that have been trained on a different number of total unique features. Single-stage pipelines always have two unique features in their goals. Two-stage pipelines can have goals which share both, one, or neither of their features.

**Figure 13.** Left: Training on → does not cause a strong value for -shape. Marginalised feature Elo scores for red, -shape, and -shape for three different agents: a single-stage agent trained just on a ; a single-stage agent trained on a ; a two-stage agent trained on → . Right: Repeated goal features’ values are strengthened, and inhibit new values forming. The left bars within each pair show the marginalised Elo wit…

**Figure 14.** Generalisation behaviour is sensitive to the order of the training objectives when they share a feature. Marginalised Elo with standard error for each feature stratified by when that feature is present in the first training goal compared to when it is present in the second training goal. Elo is marginalised over pairs of two-stage pipelines without distractors where they are each other's reverse. Left: Cas…

**Figure 15.** Training curves showing mean episode reward over training steps for selected agents.

**Figure 16.** Agent values for single-stage training pipelines without distractors.

**Figure 17.** Agent values after two-stage training without distractors (1/4).

**Figure 18.** Agent values after two-stage training without distractors (2/4).

**Figure 19.** Agent values after two-stage training without distractors (3/4).

**Figure 20.** Agent values after two-stage training without distractors (4/4).

**Figure 21.** Agent values for single-stage training with distractors (1/3).

**Figure 22.** Agent values for single-stage training with distractors (2/3).

**Figure 23.** Agent values for single-stage training with distractors (3/3).

**Figure 24.** Agent values after two-stage training with distractors (1/5).

**Figure 25.** Agent values after two-stage training with distractors (2/5).

**Figure 26.** Agent values after two-stage training with distractors (3/5).

**Figure 27.** Agent values after two-stage training with distractors (4/5).

**Figure 28.** Agent values after two-stage training with distractors (5/5).

Original abstract

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper runs a large empirical study on sequential RL training pipelines and introduces latent policy gradients as a simulation method to predict OOD goal generalization, but the abstract leaves the quantitative support and model details thin.

The main takeaway is that this work scales up experiments on how training history shapes out-of-distribution behavior in sequential RL and proposes a forward simulation using low-dimensional latents to forecast it. They examined over 100 pipelines across more than 250 OOD environments and report that early-learned goals persist while salient features drive generalization. That scale and the developmental angle on training pipelines are the concrete contributions here. The latent policy gradients approach is framed as interpretable and able to generalize to unseen pipeline types, which is a reasonable direction if the simulation holds up. The empirical findings on persistence of early goals look like the part most likely to be cited if they replicate cleanly. The soft spot is that the abstract asserts strong predictive accuracy without showing numbers, baselines, or validation details, and the method rests on a simple latent-to-behavior mapping whose faithfulness to actual high-dimensional policy updates is not demonstrated in the provided text. The stress-test concern about that mapping omitting non-linear or history-dependent effects therefore lands as a real open question rather than a minor one. This is for researchers focused on safe deployment and generalization in non-stationary RL settings. It deserves a serious referee because the experiment count and the new simulation idea are substantive enough to warrant review, even with the need for clearer quantitative evidence and model validation in revisions.

Referee Report

2 major / 2 minor

Summary. The paper examines goal generalisation in sequential reinforcement learning agents by studying over 100 training pipelines across more than 250 out-of-distribution environments. It reports that salient features drive generalisation and that goals acquired early in training can persist and influence later ones. To explain these observations, the authors introduce latent policy gradients: a forward simulation that evolves low-dimensional latent variables during training by maximising reward on the training objective under a simple model of how those latents map to behaviour. The method is claimed to achieve strong predictive accuracy for OOD behaviour, to generalise to unseen pipeline types, and to remain interpretable.

Significance. If the central claims hold, the work offers a structured, developmental account of how training history shapes out-of-distribution goal-directed behaviour in RL, which is relevant to AI safety and reliability. The scale of the empirical study (100+ pipelines, 250+ environments) and the emphasis on interpretability are strengths. A method that predicts OOD outcomes from training dynamics without being directly fitted to those outcomes would constitute a useful contribution if the underlying modelling assumptions are shown to be sufficient.

major comments (2)

[latent policy gradients method description] The predictive claims rest on the assumption that a simple model of latent-to-behaviour mapping is sufficient to simulate actual policy evolution under sequential training. This assumption is load-bearing for both the reported accuracy and the generalisation to unseen pipelines, yet the manuscript provides no ablations against full high-dimensional policy-gradient baselines, no analysis of identifiability of the chosen latents, and no quantification of omitted non-linear or history-dependent effects.
[Abstract] The abstract asserts 'strong predictive accuracy' and generalisation to unseen pipeline types, but the provided text contains no quantitative metrics, error bars, baseline comparisons, or cross-validation details that would allow assessment of these claims. Without such evidence the central empirical result cannot be evaluated.

minor comments (2)

Notation for the latent variables and the simple mapping function should be introduced with explicit equations and a clear statement of what is assumed versus what is learned.
The manuscript would benefit from a dedicated limitations section that discusses the scope of the simple mapping model and the conditions under which the simulation may diverge from true policy updates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We respond to each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: [latent policy gradients method description] The predictive claims rest on the assumption that a simple model of latent-to-behaviour mapping is sufficient to simulate actual policy evolution under sequential training. This assumption is load-bearing for both the reported accuracy and the generalisation to unseen pipelines, yet the manuscript provides no ablations against full high-dimensional policy-gradient baselines, no analysis of identifiability of the chosen latents, and no quantification of omitted non-linear or history-dependent effects.

Authors: We acknowledge that the manuscript does not contain ablations against full high-dimensional policy-gradient baselines, formal identifiability analysis of the latents, or explicit quantification of omitted non-linear or history-dependent effects. The low-dimensional latent representation was selected to enable interpretability while capturing the dominant dynamics observed across the 100+ pipelines. The reported predictive accuracy is measured on held-out pipelines and environments, but we agree that direct comparisons to higher-dimensional alternatives would better substantiate the sufficiency of the simple mapping. In revision we will add an ablation section comparing the latent model to a full-dimensional simulation where computationally feasible, include a discussion of modeling assumptions and potential omitted effects, and qualify the generalisation claims accordingly. revision: yes
Referee: [Abstract] The abstract asserts 'strong predictive accuracy' and generalisation to unseen pipeline types, but the provided text contains no quantitative metrics, error bars, baseline comparisons, or cross-validation details that would allow assessment of these claims. Without such evidence the central empirical result cannot be evaluated.

Authors: The abstract is a high-level summary; the quantitative metrics (predictive accuracy with error bars, baseline comparisons, and cross-validation across pipeline types) appear in the experimental results sections of the full manuscript. To address the concern, we will revise the abstract to include a brief reference to the evaluation scale and the nature of the reported accuracy while ensuring all claims remain fully supported by the main text. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is a forward simulation independent of target OOD outcomes

The paper's central method (latent policy gradients) is presented as a simulation of low-dimensional latent evolution driven by reward maximization on the training objective, using an explicit simple model of latent-to-behavior mapping. This construction is not equivalent by definition to the OOD predictions it generates, nor does the provided text rely on self-citations, fitted parameters renamed as predictions, or ansatzes imported from prior work. The derivation chain remains self-contained as a modeling approach whose validity rests on empirical predictive accuracy rather than tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no specific free parameters, axioms, or invented entities can be identified or audited without the full text.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 22 internal anchors

[1]

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Twenty-first international conference on Machine learning - ICML '04 , page 1, Banff, Alberta, Canada, 2004. ACM Press. doi:10.1145/1015330.1015430. URL http://portal.acm.org/citation.cfm?doid=1015330.1015430

work page doi:10.1145/1015330.1015430 2004
[2]

Stephen Adams, Tyler Cody, and Peter A. Beling. A survey of inverse reinforcement learning. Artificial Intelligence Review, 55 0 (6): 0 4307--4346, August 2022. ISSN 0269-2821, 1573-7462. doi:10.1007/s10462-021-10108-x. URL https://link.springer.com/10.1007/s10462-021-10108-x

work page doi:10.1007/s10462-021-10108-x 2022
[3]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety , July 2016. URL http://arxiv.org/abs/1606.06565. arXiv:1606.06565 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2016
[4]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm : A Benchmark for Measuring Harmfulness of LLM Agents , April 2025. URL http://arxiv.org/abs/2410.09024. arXiv:2410.09024 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Claude’s Character , August 2024

Anthropic. Claude’s Character , August 2024. URL https://www.anthropic.com/research/claude-character

work page 2024
[6]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Mechanistic Interpretability for AI Safety -- A Review

Leonard Bereska and Efstratios Gavves. Mechanistic Interpretability for AI Safety -- A Review , August 2024. URL http://arxiv.org/abs/2404.14082. arXiv:2404.14082 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Weird generalization and inductive backdoors: New ways to corrupt llms

Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi, Anna Sztyber-Betley, and Owain Evans. Weird generalization and inductive backdoors: New ways to corrupt llms. arXiv preprint arXiv:2512.09742, 2025

work page arXiv 2025
[10]

Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs . Nature, 649 0 (8097): 0 584--589, January 2026. ISSN 0028-0836, 1476-4687. doi:10.1038/s41586-025-09937-5. URL http://arxiv.org/abs/2502.17424. arXiv:2502.17424 [cs]

work page doi:10.1038/s41586-025-09937-5 2026
[11]

Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs : I . The Method of Paired Comparisons . Biometrika, 39 0 (3/4): 0 324, December 1952. ISSN 00063444. doi:10.2307/2334029. URL https://www.jstor.org/stable/2334029?origin=crossref

work page doi:10.2307/2334029 1952
[12]

Brown, Carl Henrik Ek, and Robert D

Jason R. Brown, Carl Henrik Ek, and Robert D. Mullins. Learning from Preferences and Mixed Demonstrations in General Settings , August 2025. URL http://arxiv.org/abs/2508.14027. arXiv:2508.14027 [cs]

work page arXiv 2025
[13]

Deep Reinforcement Learning from Human Preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences . In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https://proceed...

work page 2017
[14]

Quantifying Generalization in Reinforcement Learning

Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying Generalization in Reinforcement Learning , July 2019. URL http://arxiv.org/abs/1812.02341. arXiv:1812.02341 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[15]

Leveraging Procedural Generation to Benchmark Reinforcement Learning , July 2020

Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging Procedural Generation to Benchmark Reinforcement Learning , July 2020. URL http://arxiv.org/abs/1912.01588. arXiv:1912.01588 [cs]

work page arXiv 2020
[16]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z 2025
[17]

Loss of plasticity in deep continual learning

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632 0 (8026): 0 768--774, 2024. ISSN 0028-0836

work page 2024
[18]

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, and Owain Evans. Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers, 2026. URL https://arxiv.org/abs/2604.25891. \_eprint: 2604.25891

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Arpad E. Elo. The rating of chessplayers, past and present. Ishi Press International, Bronx, NY, 2. print edition, 2008. ISBN 978-0-923891-27-5

work page 2008
[20]

Reuben Feinman and Brenden M. Lake. Learning Inductive Biases with Simple Neural Networks , June 2018. URL http://arxiv.org/abs/1802.02745. arXiv:1802.02745 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Foundation models in robotics: Applications , challenges, and the future

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foundation models in robotics: Applications , challenges, and the future. The International Journal of Robotics Research, 44 0 (5): 0 701--739, April...

work page doi:10.1177/02783649241281508 2025
[22]

Wichmann

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2 0 (11): 0 665--673, November 2020. ISSN 2522-5839. doi:10.1038/s42256-020-00257-z. URL https://www.nature.com/articles/s42256-020-00257-z

work page doi:10.1038/s42256-020-00257-z 2020
[23]

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative Alignment : Reasoning Enables Safer Language Models , January 2025. URL http://arxiv.org/abs/2412.16339. arXiv:2412.16339 [cs]

work page arXiv 2025
[24]

Causal Confusion in Imitation Learning , November 2019

Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal Confusion in Imitation Learning , November 2019. URL http://arxiv.org/abs/1905.11979. arXiv:1905.11979 [cs]

work page arXiv 2019
[25]

Reinforcement Learning with Deep Energy-Based Policies

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy - Based Policies , July 2017. URL http://arxiv.org/abs/1702.08165. arXiv:1702.08165 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft Actor - Critic Algorithms and Applications , January 2019. URL http://arxiv.org/abs/1812.05905. arXiv:1812.05905 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[27]

Cooperative Inverse Reinforcement Learning

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative Inverse Reinforcement Learning . In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c3395dd46c3...

work page 2016
[28]

An Overview of Catastrophic AI Risks

Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of Catastrophic AI Risks , October 2023. URL http://arxiv.org/abs/2306.12001. arXiv:2306.12001 [cs]

work page internal anchor Pith review arXiv 2023
[29]

Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from Learned Optimization in Advanced Machine Learning Systems , December 2021. URL http://arxiv.org/abs/1906.01820. arXiv:1906.01820 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

A Review of Deep Transfer Learning and Recent Advancements

Mohammadreza Iman, Hamid Reza Arabnia, and Khaled Rasheed. A Review of Deep Transfer Learning and Recent Advancements . Technologies, 11 0 (2): 0 40, March 2023. ISSN 2227-7080. doi:10.3390/technologies11020040. URL https://www.mdpi.com/2227-7080/11/2/40

work page doi:10.3390/technologies11020040 2023
[31]

Towards Continual Reinforcement Learning : A Review and Perspectives , November 2022

Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards Continual Reinforcement Learning : A Review and Perspectives , November 2022. URL http://arxiv.org/abs/2012.13490. arXiv:2012.13490 [cs]

[32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

[33]

A Survey of Zero-shot Generalisation in Deep Reinforcement Learning

Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. Journal of Artificial Intelligence Research, 76:201–264, January 2023. ISSN 1076-9757. doi:10.1613/jair.1.14174. URL http://jair.org/index.php/jair/article/view/14174

[34]

Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, and David Krueger. Goal Misgeneralization in Deep Reinforcement Learning, January 2023. URL http://arxiv.org/abs/2105.14111. arXiv:2105.14111 [cs]

[35]

Disentangling the Causes of Plasticity Loss in Neural Networks

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, and Will Dabney. Disentangling the Causes of Plasticity Loss in Neural Networks, February 2024. URL http://arxiv.org/abs/2402.18762. arXiv:2402.18762 [cs]

[36]

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy. Agentic Misalignment: How LLMs Could Be Insider Threats, October 2025. URL http://arxiv.org/abs/2510.05179. arXiv:2510.05179 [cs]

[37]

Natural Emergent Misalignment from Reward Hacking in Production RL

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural Emergent Misalignmen...

[38]

Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

Sharan Maiya, Henning Bartsch, Nathan Lambert, and Evan Hubinger. Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI, November 2025. URL http://arxiv.org/abs/2511.01689. arXiv:2511.01689 [cs]

[39]

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks. Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs, February 2025. URL http://arxiv.org/abs/2502.08640. arXiv:2502.08640 [cs]

[40]

Associative learning and elemental representation: II. Generalization and discrimination

IPL McLaren and NJ Mackintosh. Associative learning and elemental representation: II. Generalization and discrimination. Animal learning & behavior, 30(3):177–200, 2002. ISSN 0090-4996

[41]

Understanding and Controlling a Maze-Solving Policy Network

Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and Controlling a Maze-Solving Policy Network, October 2023. URL http://arxiv.org/abs/2310.08043. arXiv:2310.08043 [cs]

[42]

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, December 2013. URL http://arxiv.org/abs/1312.5602. arXiv:1312.5602 [cs]

[43]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...

[44]

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Gouné, Francisco Javier Campos Zabala, Jason Ross Brown, and Edward James Young. AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents, October 2025. URL http://arxiv.org/abs/2506.04018. arXiv:2506.04018 [cs]

[45]

Deep double descent: where bigger models and more data hurt*

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, December 2021. ISSN 1742-5468. doi:10.1088/1742-5468/ac3a74. URL https://iopscience.iop.org/article/10.1088/1742-5468/ac3a74

[46]

The Alignment Problem from a Deep Learning Perspective

Richard Ngo, Lawrence Chan, and Sören Mindermann. The Alignment Problem from a Deep Learning Perspective, May 2025. URL http://arxiv.org/abs/2209.00626. arXiv:2209.00626 [cs]

[47]

The Primacy Bias in Deep Reinforcement Learning

Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The Primacy Bias in Deep Reinforcement Learning, May 2022. URL http://arxiv.org/abs/2205.07802. arXiv:2205.07802 [cs]

[48]

Deep reinforcement learning with plasticity injection

Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36:37142–37159, 2023

[49]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

[50]

A model for stimulus generalization in Pavlovian conditioning

John M Pearce. A model for stimulus generalization in Pavlovian conditioning. Psychological review, 94(1):61, 1987. ISSN 1939-1471

[51]

Gradient starvation: A learning proclivity in neural networks

Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C. Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256–1272, 2021

[52]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URL http://arxiv.org/abs/2201.02177. arXiv:2201.02177 [cs]

[53]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran A...

[54]

Stable-Baselines3: Reliable Reinforcement Learning Implementations

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

[55]

Bayesian Inverse Reinforcement Learning

Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning

[56]

A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement

Robert A Rescorla. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning, Current research and theory, 2:64–69, 1972

[57]

Cognitive psychology for deep neural networks: A shape bias case study

Samuel Ritter, David GT Barrett, Adam Santoro, and Matt M. Botvinick. Cognitive psychology for deep neural networks: A shape bias case study. In International conference on machine learning, pages 2940--2949. PMLR, 2017. ISBN 2640-3498

[58]

Progressive Neural Networks

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks, October 2022. URL http://arxiv.org/abs/1606.04671. arXiv:1606.04671 [cs]

[59]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]

[60]

Goal misgeneralization: Why correct specifications aren't enough for correct goals

Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. Goal misgeneralization: Why correct specifications aren't enough for correct goals. arXiv preprint arXiv:2210.01790, 2022

[61]

Toward a universal law of generalization for psychological science

Roger N Shepard. Toward a universal law of generalization for psychological science. Science, 237(4820):1317–1323, 1987. ISSN 0036-8075

[62]

Misspecification in Inverse Reinforcement Learning

Joar Skalse and Alessandro Abate. Misspecification in Inverse Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12):15136–15143, June 2023. ISSN 2374-3468, 2159-5399. doi:10.1609/aaai.v37i12.26766. URL https://ojs.aaai.org/index.php/AAAI/article/view/26766

[63]

Invariance in policy optimisation and partial identifiability in reward learning

Joar Max Viktor Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In International Conference on Machine Learning, pages 32033–32058. PMLR, 2023. ISBN 2640-3498

[64]

The Dormant Neuron Phenomenon in Deep Reinforcement Learning

Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The Dormant Neuron Phenomenon in Deep Reinforcement Learning, June 2023. URL http://arxiv.org/abs/2302.12902. arXiv:2302.12902 [cs]

[65]

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, and Owain Evans. School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, August 2025. URL http://arxiv.org/abs/2508.17511. arXiv:2508.17511 [cs]

[66]

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, and Kyle O'Brien. Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, January 2026. URL http://arxiv.org/abs/2601.10160. arXiv:2601.10160 [cs]

[67]

Theory of games and economic behavior, 2nd rev

John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior, 2nd rev. 1947

[68]

Maximum Entropy Deep Inverse Reinforcement Learning

Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum Entropy Deep Inverse Reinforcement Learning, March 2016. URL http://arxiv.org/abs/1507.04888. arXiv:1507.04888 [cs]

[69]

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak Attacks and Defenses Against Large Language Models: A Survey, August 2024. URL http://arxiv.org/abs/2407.04295. arXiv:2407.04295 [cs]

[70]

Investigating Generalisation in Continuous Deep Reinforcement Learning

Chenyang Zhao, Olivier Sigaud, Freek Stulp, and Timothy M. Hospedales. Investigating Generalisation in Continuous Deep Reinforcement Learning, February 2019. URL http://arxiv.org/abs/1902.07015. arXiv:1902.07015 [cs]

[71]

Maximum entropy inverse reinforcement learning

Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008


[1]

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Twenty-first international conference on Machine learning - ICML '04, page 1, Banff, Alberta, Canada, 2004. ACM Press. doi:10.1145/1015330.1015430. URL http://portal.acm.org/citation.cfm?doid=1015330.1015430

[2]

Stephen Adams, Tyler Cody, and Peter A. Beling. A survey of inverse reinforcement learning. Artificial Intelligence Review, 55(6):4307–4346, August 2022. ISSN 0269-2821, 1573-7462. doi:10.1007/s10462-021-10108-x. URL https://link.springer.com/10.1007/s10462-021-10108-x

[3]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety, July 2016. URL http://arxiv.org/abs/1606.06565. arXiv:1606.06565 [cs]

[4]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, April 2025. URL http://arxiv.org/abs/2410.09024. arXiv:2410.09024 [cs]

[5]

Claude’s Character

Anthropic. Claude’s Character, August 2024. URL https://www.anthropic.com/research/claude-character

[6]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

[7]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

[8]

Mechanistic Interpretability for AI Safety -- A Review

Leonard Bereska and Efstratios Gavves. Mechanistic Interpretability for AI Safety -- A Review, August 2024. URL http://arxiv.org/abs/2404.14082. arXiv:2404.14082 [cs]

[9]

Weird generalization and inductive backdoors: New ways to corrupt llms

Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi, Anna Sztyber-Betley, and Owain Evans. Weird generalization and inductive backdoors: New ways to corrupt llms. arXiv preprint arXiv:2512.09742, 2025

[10]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. Nature, 649(8097):584–589, January 2026. ISSN 0028-0836, 1476-4687. doi:10.1038/s41586-025-09937-5. URL http://arxiv.org/abs/2502.17424. arXiv:2502.17424 [cs]

[11]

Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4):324, December 1952. ISSN 00063444. doi:10.2307/2334029. URL https://www.jstor.org/stable/2334029?origin=crossref

[12]

Learning from Preferences and Mixed Demonstrations in General Settings

Jason R. Brown, Carl Henrik Ek, and Robert D. Mullins. Learning from Preferences and Mixed Demonstrations in General Settings, August 2025. URL http://arxiv.org/abs/2508.14027. arXiv:2508.14027 [cs]

[13]

Deep Reinforcement Learning from Human Preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceed...

[14]

Quantifying Generalization in Reinforcement Learning

Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying Generalization in Reinforcement Learning, July 2019. URL http://arxiv.org/abs/1812.02341. arXiv:1812.02341 [cs]

[15]

Leveraging Procedural Generation to Benchmark Reinforcement Learning

Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging Procedural Generation to Benchmark Reinforcement Learning, July 2020. URL http://arxiv.org/abs/1912.01588. arXiv:1912.01588 [cs]

[16]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

[17]

Loss of plasticity in deep continual learning

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024. ISSN 0028-0836

[18]

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, and Owain Evans. Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers, 2026. URL https://arxiv.org/abs/2604.25891. arXiv:2604.25891

[19]

Arpad E. Elo. The rating of chessplayers, past and present. Ishi Press International, Bronx, NY, 2. print edition, 2008. ISBN 978-0-923891-27-5

[20]

Reuben Feinman and Brenden M. Lake. Learning Inductive Biases with Simple Neural Networks, June 2018. URL http://arxiv.org/abs/1802.02745. arXiv:1802.02745 [cs]

[21]

Foundation models in robotics: Applications, challenges, and the future

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 44(5):701–739, April...

[22]

Shortcut learning in deep neural networks

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, November 2020. ISSN 2522-5839. doi:10.1038/s42256-020-00257-z. URL https://www.nature.com/articles/s42256-020-00257-z

[23]

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative Alignment: Reasoning Enables Safer Language Models, January 2025. URL http://arxiv.org/abs/2412.16339. arXiv:2412.16339 [cs]

[24]

Causal Confusion in Imitation Learning

Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal Confusion in Imitation Learning, November 2019. URL http://arxiv.org/abs/1905.11979. arXiv:1905.11979 [cs]

[25]

Reinforcement Learning with Deep Energy-Based Policies

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies, July 2017. URL http://arxiv.org/abs/1702.08165. arXiv:1702.08165 [cs]

[26] [26]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft Actor - Critic Algorithms and Applications , January 2019. URL http://arxiv.org/abs/1812.05905. arXiv:1812.05905 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019

[27] [27]

Cooperative Inverse Reinforcement Learning

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative Inverse Reinforcement Learning . In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/c3395dd46c3...

work page 2016

[28] [28]

An Overview of Catastrophic AI Risks

Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of Catastrophic AI Risks , October 2023. URL http://arxiv.org/abs/2306.12001. arXiv:2306.12001 [cs]

work page internal anchor Pith review arXiv 2023

[29] [29]

Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from Learned Optimization in Advanced Machine Learning Systems , December 2021. URL http://arxiv.org/abs/1906.01820. arXiv:1906.01820 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021

[30] [30]

A Review of Deep Transfer Learning and Recent Advancements

Mohammadreza Iman, Hamid Reza Arabnia, and Khaled Rasheed. A Review of Deep Transfer Learning and Recent Advancements . Technologies, 11 0 (2): 0 40, March 2023. ISSN 2227-7080. doi:10.3390/technologies11020040. URL https://www.mdpi.com/2227-7080/11/2/40

work page doi:10.3390/technologies11020040 2023

[31] [31]

Towards Continual Reinforcement Learning : A Review and Perspectives , November 2022

Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards Continual Reinforcement Learning : A Review and Perspectives , November 2022. URL http://arxiv.org/abs/2012.13490. arXiv:2012.13490 [cs]

work page arXiv 2022

[32] [32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[33] [33]

A Survey of Zero -shot Generalisation in Deep Reinforcement Learning

Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A Survey of Zero -shot Generalisation in Deep Reinforcement Learning . Journal of Artificial Intelligence Research, 76: 0 201--264, January 2023. ISSN 1076-9757. doi:10.1613/jair.1.14174. URL http://jair.org/index.php/jair/article/view/14174

work page doi:10.1613/jair.1.14174 2023

[34] [34]

Goal Misgeneralization in Deep Reinforcement Learning , January 2023

Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, and David Krueger. Goal Misgeneralization in Deep Reinforcement Learning , January 2023. URL http://arxiv.org/abs/2105.14111. arXiv:2105.14111 [cs]

work page arXiv 2023

[35] [35]

Disentangling the Causes of Plasticity Loss in Neural Networks , February 2024

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, and Will Dabney. Disentangling the Causes of Plasticity Loss in Neural Networks , February 2024. URL http://arxiv.org/abs/2402.18762. arXiv:2402.18762 [cs]

work page arXiv 2024

[36] [36]

Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy

Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy. Agentic Misalignment : How LLMs Could Be Insider Threats , October 2025. URL http://arxiv.org/abs/2510.05179. arXiv:2510.05179 [cs]

work page arXiv 2025

[37] [37]

Natural Emergent Misalignment from Reward Hacking in Production RL , 2025

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural Emergent Misalignmen...

work page arXiv 2025

[38] [38]

Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025

Sharan Maiya, Henning Bartsch, Nathan Lambert, and Evan Hubinger. Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025. URL http://arxiv.org/abs/2511.01689. arXiv:2511.01689 [cs]

work page arXiv 2025

[39] [39]

Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks. Utility Engineering : Analyzing and Controlling Emergent Value Systems in AIs , February 2025. URL http://arxiv.org/abs/2502.08640. arXiv:2502.08640 [cs]

work page arXiv 2025

[40] [40]

Associative learning and elemental representation: II

IPL McLaren and NJ Mackintosh. Associative learning and elemental representation: II . Generalization and discrimination. Animal learning & behavior, 30 0 (3): 0 177--200, 2002. ISSN 0090-4996

work page 2002

[41] [41]

Understanding and Controlling a Maze - Solving Policy Network , October 2023

Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and Controlling a Maze - Solving Policy Network , October 2023. URL http://arxiv.org/abs/2310.08043. arXiv:2310.08043 [cs]

work page arXiv 2023

[42] [42]

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning , December 2013. URL http://arxiv.org/abs/1312.5602. arXiv:1312.5602 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2013

[43] [43]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...

work page doi:10.1038/nature14236 2015

[44] [44]

AgentMisalignment : Measuring the Propensity for Misaligned Behaviour in LLM - Based Agents , October 2025

Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Gouné, Francisco Javier Campos Zabala, Jason Ross Brown, and Edward James Young. AgentMisalignment : Measuring the Propensity for Misaligned Behaviour in LLM - Based Agents , October 2025. URL http://arxiv.org/abs/2506.04018. arXiv:2506.04018 [cs]

work page arXiv 2025

[45] [45]

Deep double descent: where bigger models and more data hurt*

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment, 2021 0 (12): 0 124003, December 2021. ISSN 1742-5468. doi:10.1088/1742-5468/ac3a74. URL https://iopscience.iop.org/article/10.1088/1742-5468/ac3a74

work page doi:10.1088/1742-5468/ac3a74 2021

[46] [46]

The Alignment Problem from a Deep Learning Perspective , May 2025

Richard Ngo, Lawrence Chan, and Sören Mindermann. The Alignment Problem from a Deep Learning Perspective , May 2025. URL http://arxiv.org/abs/2209.00626. arXiv:2209.00626 [cs]

work page arXiv 2025

[47] [47]

The Primacy Bias in Deep Reinforcement Learning , May 2022

Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The Primacy Bias in Deep Reinforcement Learning , May 2022. URL http://arxiv.org/abs/2205.07802. arXiv:2205.07802 [cs]

work page arXiv 2022

[48] [48]

Deep reinforcement learning with plasticity injection

Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36: 0 37142--37159, 2023

work page 2023

[49] [49]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[50] [50]

A model for stimulus generalization in Pavlovian conditioning

John M Pearce. A model for stimulus generalization in Pavlovian conditioning. Psychological review, 94 0 (1): 0 61, 1987. ISSN 1939-1471

work page 1987

[51] [51]

Courville, Doina Precup, and Guillaume Lajoie

Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C. Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34: 0 1256--1272, 2021

work page 2021

[52] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URL http://arxiv.org/abs/2201.02177. arXiv:2201.02177 [cs]

[53] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran A...

[54] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

[55] Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning

[56] Robert A Rescorla. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning, Current research and theory, 2:64–69, 1972

[57] Samuel Ritter, David GT Barrett, Adam Santoro, and Matt M. Botvinick. Cognitive psychology for deep neural networks: A shape bias case study. In International Conference on Machine Learning, pages 2940–2949. PMLR, 2017. ISBN 2640-3498

[58] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks, October 2022. URL http://arxiv.org/abs/1606.04671. arXiv:1606.04671 [cs]

[59] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]

[60] Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. Goal misgeneralization: Why correct specifications aren't enough for correct goals. arXiv preprint arXiv:2210.01790, 2022

[61] Roger N Shepard. Toward a universal law of generalization for psychological science. Science, 237(4820):1317–1323, 1987. ISSN 0036-8075

[62] Joar Skalse and Alessandro Abate. Misspecification in Inverse Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12):15136–15143, June 2023. ISSN 2374-3468, 2159-5399. doi:10.1609/aaai.v37i12.26766. URL https://ojs.aaai.org/index.php/AAAI/article/view/26766

[63] Joar Max Viktor Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In International Conference on Machine Learning, pages 32033–32058. PMLR, 2023. ISBN 2640-3498

[64] Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The Dormant Neuron Phenomenon in Deep Reinforcement Learning, June 2023. URL http://arxiv.org/abs/2302.12902. arXiv:2302.12902 [cs]

[65] Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, and Owain Evans. School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, August 2025. URL http://arxiv.org/abs/2508.17511. arXiv:2508.17511 [cs]

[66] Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, and Kyle O'Brien. Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, January 2026. URL http://arxiv.org/abs/2601.10160. arXiv:2601.10160 [cs]

[67] John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior, 2nd rev. 1947

[68] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum Entropy Deep Inverse Reinforcement Learning, March 2016. URL http://arxiv.org/abs/1507.04888. arXiv:1507.04888 [cs]

[69] Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak Attacks and Defenses Against Large Language Models: A Survey, August 2024. URL http://arxiv.org/abs/2407.04295. arXiv:2407.04295 [cs]

[70] Chenyang Zhao, Olivier Sigaud, Freek Stulp, and Timothy M. Hospedales. Investigating Generalisation in Continuous Deep Reinforcement Learning, February 2019. URL http://arxiv.org/abs/1902.07015. arXiv:1902.07015 [cs]

[71] Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008