pith. sign in

arxiv: 2606.24014 · v1 · pith:EHWLZ5Z3new · submitted 2026-06-22 · 💻 cs.AI · cs.CL

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

Pith reviewed 2026-06-26 07:48 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reinforcement learningAI alignmentbeneficial behaviorout-of-distribution generalizationreward hackingdeceptioncorrigibilityalignment persistence
0
0 comments X

The pith

Reinforcement learning on beneficial behaviors in realistic domains produces broad improvements on out-of-distribution alignment benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether RL trained on beneficial traits in realistic situations can lead to alignment that generalizes beyond the training distribution. They create a dataset spanning health, science, and education to train models on traits like truthfulness, fairness, risk awareness, and corrigibility. Compared to baselines, the trained models perform better on over 80% of more than 50 independent benchmarks, with transfer observed even from health-only training to other areas. This includes reductions in reward hacking, deception, and general misalignment, as well as greater persistence against attempts to induce misalignment through adversarial prompting or finetuning.

Core claim

A beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. Models trained with beneficial trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning.

What carries the argument

The RL training on a dataset of realistic situations designed to measure and train beneficial traits such as truthfulness, fairness, risk awareness, and corrigibility.

If this is right

  • Beneficial trait RL improves performance on over 80% of out-of-distribution benchmarks.
  • A health-limited RL intervention leads to broad improvements on non-health evaluations.
  • Models exhibit reduced reward hacking, deception, and general misalignment on unseen tasks.
  • Alignment from this RL shows greater resistance to adversarial prompting and harmful finetuning.
  • Behavior remains more robustly aligned under attempts to steer models towards misalignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the transfer holds, alignment training could focus on a few representative domains rather than all possible scenarios.
  • The results suggest that beneficial trait reinforcement might be more efficient than domain-specific alignment methods.
  • Further tests could examine whether the persistence effects scale with model size or dataset diversity.

Load-bearing premise

The dataset of realistic situations and the more than fifty independent benchmarks are valid, unbiased measures of the intended beneficial traits and of alignment more generally.

What would settle it

Observing no improvement or a decrease in performance on the majority of the 50+ alignment benchmarks for the RL-trained models compared to the compute-matched baseline would falsify the claim of broad generalization.

Figures

Figures reproduced from arXiv: 2606.24014 by Akshay V. Jagadeesh, Ali Malik, Foivos Tsimpourlas, Johannes Heidecke, Karan Singhal, Khaled Saab, Mikhail Trofimov, Rahul K. Arora.

Figure 1
Figure 1. Figure 1: Summary of hypothesis and main empirical findings. (A) Two example conversations from the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Measuring beneficial traits in realistic conversations across a range of frontier models from OpenAI [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Beneficial trait RL training produces alignment generalization on a broad set of alignment and [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Evidence for out-of-domain alignment generalization. (a) A beneficial-trait RL intervention exclud [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative examples from alignment and benefits evaluations. Examples shown here were short [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Alignment improvements persist under harmful prompt steering. Prompt prefixes designed to [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Beneficial trait RL trained model is more resistant to adversarial finetuning than the pre-RL base [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Generic helpfulness training on the same conversations does not reproduce alignment generalization. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Alignment evaluations exhibit shared cross-model structure. The cross-evaluation structure and [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Shared cross-model structure persists after capability residualization. The cross-evaluation resid [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Misalignment and monitorability across monitorability evaluation families as a function of RL step. [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗
read the original abstract

As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent alignment generalization beyond the training distribution. We construct a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, spanning varied domains, including health, science, and education. We then train models with RL on this dataset and evaluate them on more than 50 independent benchmarks of alignment and beneficial behavior. Compared to a compute-matched baseline, beneficial trait RL improves performance on over 80% of these out-of-distribution benchmarks. We observe substantial out-of-distribution alignment transfer: a beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. Finally, we study alignment persistence: whether behavior remains robustly aligned under attempts to steer models towards misalignment. Models trained with beneficial trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning; further work is required to isolate the sources of these effects. These results suggest that RL to reinforce beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that RL training on beneficial traits (truthfulness, fairness, corrigibility, etc.) using a dataset of realistic situations produces broad out-of-distribution generalization: a health-domain-only intervention improves performance on >80% of >50 independent alignment benchmarks (including reduced reward hacking and deception), outperforms a compute-matched baseline, and yields greater persistence under adversarial prompting and harmful finetuning.

Significance. If the empirical results hold after statistical validation and benchmark scrutiny, the work would provide evidence that domain-limited RL on realistic beneficial behavior can induce measurable alignment transfer and robustness, a potentially important direction for scalable oversight and generalization research. The scale of evaluation (>50 benchmarks) and the persistence experiments are notable strengths.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (results): the abstract and main performance claims state gains on >80% of benchmarks and OOD transfer from health-only RL but supply no statistical significance tests, error bars, data exclusion criteria, or baseline implementation details, making it impossible to determine whether the data support the headline claims.
  2. [§3] §3 (dataset and benchmarks): the central transfer claim requires that the >50 benchmarks are valid, independent, and unbiased proxies for the targeted traits rather than general capability or pretraining artifacts; the manuscript provides no validation, inter-rater reliability, or analysis ruling out conceptual overlap or shared prompt structure with the health training data.
  3. [§4.2] §4.2 (health-only ablation): the claim that the intervention is 'entirely limited to one domain, health' and still produces broad non-health gains is load-bearing; without explicit verification that no non-health examples or concepts leaked into the RL dataset or reward model, the OOD transfer result cannot be interpreted as genuine generalization.
minor comments (2)
  1. [§2] Notation for the beneficial-trait reward model and the exact RL objective (e.g., how 'beneficial behavior' is scalarized) should be defined in a single equation early in §2.
  2. [Figures/Tables] Figure captions and axis labels in the benchmark tables should explicitly state whether higher/lower scores indicate better alignment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas for strengthening the empirical claims. We address each major comment below and have revised the manuscript accordingly to improve statistical reporting, benchmark justification, and dataset transparency.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (results): the abstract and main performance claims state gains on >80% of benchmarks and OOD transfer from health-only RL but supply no statistical significance tests, error bars, data exclusion criteria, or baseline implementation details, making it impossible to determine whether the data support the headline claims.

    Authors: We agree that statistical validation is essential for supporting the headline claims. In the revised manuscript, we now report error bars as standard deviations across five independent training runs with different random seeds. We have added paired t-tests (with Bonferroni correction) comparing the beneficial-trait RL models to the compute-matched baseline on each benchmark, along with explicit data exclusion criteria (benchmarks with >20% missing responses or direct prompt overlap with training data were removed). Baseline implementation details, including the precise SFT+RL pipeline and hyperparameter matching, have been expanded in §4 and the appendix. revision: yes

  2. Referee: [§3] §3 (dataset and benchmarks): the central transfer claim requires that the >50 benchmarks are valid, independent, and unbiased proxies for the targeted traits rather than general capability or pretraining artifacts; the manuscript provides no validation, inter-rater reliability, or analysis ruling out conceptual overlap or shared prompt structure with the health training data.

    Authors: The benchmarks were drawn from established, independently published alignment evaluation suites targeting distinct traits. In revision we added a dedicated subsection in §3 that (a) lists the provenance of each benchmark family, (b) reports a manual audit confirming <5% lexical or conceptual overlap with the health-only training prompts, and (c) notes that the majority originate from separate research groups using non-health prompt templates. While a full inter-rater reliability study across all 50+ benchmarks exceeds the scope of the present work, we have added an explicit limitations paragraph acknowledging possible pretraining artifacts and the value of future targeted validation. revision: partial

  3. Referee: [§4.2] §4.2 (health-only ablation): the claim that the intervention is 'entirely limited to one domain, health' and still produces broad non-health gains is load-bearing; without explicit verification that no non-health examples or concepts leaked into the RL dataset or reward model, the OOD transfer result cannot be interpreted as genuine generalization.

    Authors: We have substantially expanded the dataset-construction description in the revised §3 and §4.2. The health-only subset was generated exclusively from health-domain seed prompts; an automated keyword filter plus manual review of 500 randomly sampled examples confirmed zero non-health content. The reward model was trained solely on these health-labeled pairs with no cross-domain data. These verification steps are now documented with the exact filtering code and review protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL results rest on direct benchmark comparisons

full rationale

The paper reports an empirical study: a dataset of realistic situations is constructed, models are trained via RL on beneficial traits (with one domain limited to health), and performance is measured on >50 independent benchmarks against a compute-matched baseline. No equations, fitted parameters, or self-citations are invoked to derive the reported OOD improvements or persistence effects; the claims are directly falsifiable via the stated evaluation protocol and do not reduce to input definitions or prior author work by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified validity of the constructed dataset and the 50+ benchmarks as faithful measures of beneficial traits and alignment; these are domain assumptions not supported by independent evidence in the abstract.

free parameters (1)
  • RL training hyperparameters and reward scaling
    The abstract does not specify how the beneficial traits were converted into reward signals or which hyperparameters were chosen; these choices are required for the reported improvements.
axioms (1)
  • domain assumption The more than fifty benchmarks are independent, valid, and unbiased proxies for alignment and beneficial behavior.
    All claims of out-of-distribution improvement and persistence depend on this assumption about the evaluation suite.

pith-pipeline@v0.9.1-grok · 5819 in / 1345 out tokens · 30284 ms · 2026-06-26T07:48:50.588607+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    arXiv preprint arXiv:2502.17424 , year =

    Betley, Jan and Tan, Daniel and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart. arXiv preprint arXiv:2502.17424 , year =

  2. [2]

    arXiv preprint arXiv:2506.19823 , year =

    Wang, Miles and Dupr. arXiv preprint arXiv:2506.19823 , year =

  3. [3]

    Evan Hubinger and Carson Denison and Jesse Mu and Mike Lambert and Meg Tong and Monte MacDiarmid and Tamera Lanham and Daniel M. Ziegler and Tim Maxwell and Newton Cheng and Adam Jermyn and Amanda Askell and Ansh Radhakrishnan and Cem Anil and David Duvenaud and Deep Ganguli and Fazl Barez and Jack Clark and Kamal Ndousse and Kshitij Sachan and Michael Se...

  4. [4]

    2025 , url=

    Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , booktitle=. 2025 , url=

  5. [5]

    MacDiarmid, Monte and Wright, Benjamin and Uesato, Jonathan and Benton, Joe and Kutasov, Jon and Price, Sara and Bouscal, Naia and Bowman, Sam and Bricken, Trenton and Cloud, Alex and Denison, Carson and Gasteiger, Johannes and Greenblatt, Ryan and Leike, Jan and Lindsey, Jack and Mikulik, Vlad and Perez, Ethan and Rodrigues, Alex and Thomas, Drake and We...

  6. [6]

    arXiv preprint arXiv:2509.15541 , year =

    Schoen, Bronson and Nitishinskaya, Evgenia and Balesni, Mikita and H. arXiv preprint arXiv:2509.15541 , year =

  7. [7]

    and Madry, Aleksander and Zaremba, Wojciech and Pachocki, Jakub and Farhi, David , journal =

    Baker, Bowen and Huizinga, Joost and Gao, Leo and Dou, Zehao and Guan, Melody Y. and Madry, Aleksander and Zaremba, Wojciech and Pachocki, Jakub and Farhi, David , journal =

  8. [8]

    arXiv preprint arXiv:2507.11473 , year =

    Tomek Korbak and Mikita Balesni and Elizabeth Barnes and Yoshua Bengio and Joe Benton and Joseph Bloom and Mark Chen and Alan Cooney and Allan Dafoe and Anca Dragan and Scott Emmons and Owain Evans and David Farhi and Ryan Greenblatt and Dan Hendrycks and Marius Hobbhahn and Evan Hubinger and Geoffrey Irving and Erik Jenner and Daniel Kokotajlo and Victor...

  9. [9]

    Guan, Melody Y. and Joglekar, Manas and Wallace, Eric and Jain, Saachi and Barak, Boaz and Helyar, Alec and Dias, Rachel and Vallone, Andrea and Ren, Hongyu and Wei, Jason and Chung, Hyung Won and Toyer, Sam and Heidecke, Johannes and Beutel, Alex and Glaese, Amelia , journal =

  10. [10]

    Wichers, Nevan and Ebtekar, Aram and Azarbal, Ariana and Gillioz, Victor and Ye, Christine and Ryd, Emil and Rathi, Neil and Sleight, Henry and Mallen, Alex and Roger, Fabien and Marks, Samuel , journal =

  11. [11]

    and Wei, Jason and Soskin Hicks, Rebecca and Bowman, Preston and Qui

    Arora, Rahul K. and Wei, Jason and Soskin Hicks, Rebecca and Bowman, Preston and Qui. arXiv preprint arXiv:2505.08775 , year =

  12. [12]

    arXiv preprint arXiv:2212.08073 , year =

    Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan an...

  13. [13]

    and Henighan, Tom and Hume, Tristan and Hatfield-Dodds, Zac and Mindermann, S

    Kundu, Sandipan and Bai, Yuntao and Kadavath, Saurav and Askell, Amanda and Callahan, Andrew and Chen, Anna and Goldie, Anna and Balwit, Avital and Mirhoseini, Azalia and McLean, Brayden and Olsson, Catherine and Evraets, Cassie and Tran-Johnson, Eli and Durmus, Esin and Perez, Ethan and Kernion, Jackson and Kerr, Jamie and Ndousse, Kamal and Nguyen, Kari...

  14. [14]

    Sun, Zhiqing and Shen, Yikang and Zhou, Qinhong and Zhang, Hongxin and Chen, Zhenfang and Cox, David and Yang, Yiming and Gan, Chuang , journal =

  15. [15]

    , journal =

    Parrish, Alicia and Chen, Angelica and Nangia, Nikita and Padmakumar, Vishakh and Phang, Jason and Thompson, Jana and Htut, Phu Mon and Bowman, Samuel R. , journal =

  16. [16]

    Alexandra Souly and Qingyuan Lu and Dillon Bowen and Tu Trinh and Elvis Hsieh and Sana Pandey and Pieter Abbeel and Justin Svegliato and Scott Emmons and Olivia Watkins and Sam Toyer , journal =

  17. [17]

    , journal =

    Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal =

  18. [18]

    arXiv preprint arXiv:2212.09251 , year =

    Perez, Ethan and Ringer, Sam and Luko. arXiv preprint arXiv:2212.09251 , year =

  19. [19]

    Pan, Alexander and Chan, Jun Shern and Zou, Andy and Li, Nathaniel and Basart, Steven and Woodside, Thomas and Ng, Jonathan and Zhang, Hanlin and Emmons, Scott and Hendrycks, Dan , booktitle =

  20. [20]

    Maksym Andriushchenko and Alexandra Souly and Mateusz Dziemian and Derek Duenas and Maxwell Lin and Justin Wang and Dan Hendrycks and Andy Zou and Zico Kolter and Matt Fredrikson and Eric Winsor and Jerome Wynne and Yarin Gal and Xander Davies , journal =

  21. [21]

    Marks, Sam and Lindsey, Jack and Olah, Christopher , year =

  22. [22]

    2025 , month =

    Dupr. 2025 , month =

  23. [23]

    Williams, Marcus and Raymond, Cameron and Carroll, Micah , year =

  24. [24]

    and Wang, Miles and Carroll, Micah and Dou, Zehao and Wei, Annie Y

    Guan, Melody Y. and Wang, Miles and Carroll, Micah and Dou, Zehao and Wei, Annie Y. and Williams, Marcus and Arnav, Benjamin and Huizinga, Joost and Kivlichan, Ian and Glaese, Mia and Pachocki, Jakub and Baker, Bowen , year =

  25. [25]

    Guo, Alan and Wolfe, Jason , year =

  26. [26]

    2024 , url =

    Wang, Zirui and Xia, Mengzhou and He, Luxi and Chen, Howard and Liu, Yitao and Zhu, Richard and Liang, Kaiqu and Wu, Xindi and Liu, Haotian and Malladi, Sadhika and Chevalier, Alexis and Arora, Sanjeev and Chen, Danqi , journal =. 2024 , url =

  27. [27]

    Aaditya Singh and Adam Fry and Adam Perelman and Adam Tart and Adi Ganesh and Ahmed El-Kishky and Aidan McLaughlin and Aiden Low and AJ Ostrow and Akhila Ananthram and Akshay Nathan and Alan Luo and Alec Helyar and Aleksander Madry and Aleksandr Efremov and Aleksandra Spyra and Alex Baker-Whitcomb and Alex Beutel and Alex Karpenko and Alex Makelov and Ale...

  28. [28]

    and Mindermann, S

    Lynch, Aengus and Wright, Benjamin and Larson, Caleb and Ritchie, Stuart J. and Mindermann, S. 2025 , eprint =

  29. [29]

    EvilGenie: A Reward Hacking Benchmark

    Gabor, Jonathan and Lynch, Jayson and Rosenfeld, Jonathan , year =. doi:10.48550/arXiv.2511.21654 , url =. 2511.21654 , archivePrefix =

  30. [30]

    CoRR , volume =

    Taylor, Mia and Chua, James and Betley, Jan and Treutlein, Johannes and Evans, Owain , year =. doi:10.48550/arXiv.2508.17511 , url =. 2508.17511 , archivePrefix =

  31. [31]

    and Lin, Steph and Kandpal, Nikhil and Nasr, Milad and

    Guo, Chuan and Ceron Uribe, Juan Felipe and Zhu, Sicheng and Choquette-Choo, Christopher A. and Lin, Steph and Kandpal, Nikhil and Nasr, Milad and. 2026 , eprint =

  32. [32]

    2025 , howpublished =

  33. [33]

    doi:10.48550/arXiv.2512.08093 , url =

    Joglekar, Manas and Chen, Jeremy and Wu, Gabriel and Yosinski, Jason and Wang, Jasmine and Barak, Boaz and Glaese, Amelia , year =. doi:10.48550/arXiv.2512.08093 , url =. 2512.08093 , archivePrefix =

  34. [34]

    Soskin Hicks, Rebecca and Trofimov, Mikhail and Lim, Dominick and Arora, Rahul K. and Tsimpourlas, Foivos and Bowman, Preston and Sharman, Michael and Tong, Chi and Karthik, Kavin and Dugar, Arnav and Jagadeesh, Akshay and Saab, Khaled and Heidecke, Johannes and Alexander, Ashley and Gross, Nate and Singhal, Karan , year =

  35. [35]

    2026 , howpublished =

  36. [36]

    2021 , url =

    Evans, Owain and Cotton-Barratt, Owen and Finnveden, Lukas and Bales, Adam and Balwit, Avital and Wills, Peter and Righetti, Luca and Saunders, William , journal =. 2021 , url =

  37. [37]

    2022 , url =

    Saurav Kadavath and Tom Conerly and Amanda Askell and Tom Henighan and Dawn Drain and Ethan Perez and Nicholas Schiefer and Zac Hatfield-Dodds and Nova DasSarma and Eli Tran-Johnson and Scott Johnston and Sheer El-Showk and Andy Jones and Nelson Elhage and Tristan Hume and Anna Chen and Yuntao Bai and Sam Bowman and Stanislav Fort and Deep Ganguli and Dan...

  38. [38]

    2018 , url =

    Irving, Geoffrey and Christiano, Paul and Amodei, Dario , journal =. 2018 , url =

  39. [39]

    2018 , url =

    Christiano, Paul and Shlegeris, Buck and Amodei, Dario , journal =. 2018 , url =

  40. [40]

    and Abbeel, Pieter and Dragan, Anca , booktitle =

    Hadfield-Menell, Dylan and Russell, Stuart J. and Abbeel, Pieter and Dragan, Anca , booktitle =. 2016 , url =

  41. [41]

    2015 , url =

    Soares, Nate and Fallenstein, Benja and Yudkowsky, Eliezer and Armstrong, Stuart , booktitle =. 2015 , url =

  42. [42]

    2016 , url =

    Orseau, Laurent and Armstrong, Stuart , booktitle =. 2016 , url =

  43. [43]

    2017 , url =

    Hadfield-Menell, Dylan and Dragan, Anca and Abbeel, Pieter and Russell, Stuart , booktitle =. 2017 , url =

  44. [44]

    arXiv preprint arXiv:1606.06565 , year =

    Amodei, Dario and Olah, Chris and Steinhardt, Jacob and Christiano, Paul and Schulman, John and Man. arXiv preprint arXiv:1606.06565 , year =

  45. [45]

    2019 , url =

    Hubinger, Evan and van Merwijk, Chris and Mikulik, Vladimir and Skalse, Joar and Garrabrant, Scott , journal =. 2019 , url =

  46. [46]

    and Pfau, Jacob and Krueger, David , booktitle =

    Langosco, Lauro Langosco Di and Koch, Jack and Sharkey, Lee D. and Pfau, Jacob and Krueger, David , booktitle =. 2022 , publisher =

  47. [47]

    , booktitle =

    Omohundro, Stephen M. , booktitle =. 2008 , url =

  48. [48]

    2021 , url =

    Turner, Alexander Matt and Smith, Logan and Shah, Rohin and Critch, Andrew and Tadepalli, Prasad , booktitle =. 2021 , url =

  49. [49]

    2021 , url =

    Amanda Askell and Yuntao Bai and Anna Chen and Dawn Drain and Deep Ganguli and Tom Henighan and Andy Jones and Nicholas Joseph and Ben Mann and Nova DasSarma and Nelson Elhage and Zac Hatfield-Dodds and Danny Hernandez and Jackson Kernion and Kamal Ndousse and Catherine Olsson and Dario Amodei and Tom Brown and Jack Clark and Sam McCandlish and Chris Olah...

  50. [50]

    and Boyd, Danah and Friedler, Sorelle A

    Selbst, Andrew D. and Boyd, Danah and Friedler, Sorelle A. and Venkatasubramanian, Suresh and Vertesi, Janet , booktitle =. 2019 , url =

  51. [51]

    arXiv preprint arXiv:2605.10310 , year =

    Laukkonen, Ruben and Krier, Seb and Bakalar, Chlo. arXiv preprint arXiv:2605.10310 , year =

  52. [52]

    and Marks, Samuel and Leike, Jan and Askell, Amanda and Olah, Chris and Hubinger, Evan and Price, Sara , year =

    Kutasov, Jonathan and Jermyn, Adam and Steen, Julius and Le, Minh and Bowman, Samuel R. and Marks, Samuel and Leike, Jan and Askell, Amanda and Olah, Chris and Hubinger, Evan and Price, Sara , year =

  53. [53]

    2025 , url=

    Deng, Xiang and Da, Jeff and Pan, Edwin and He, Yannis Yiming and Ide, Charles and Garg, Kanak and Lauffer, Niklas and Park, Andrew and Pasari, Nitin and Rane, Chetan and Sampath, Karmini and Krishnan, Maya and Kundurthy, Srivatsa and Hendryx, Sean and Wang, Zifan and Bharadwaj, Vijay and Holm, Jeff and Aluri, Raja and Zhang, Chen Bo Calvin and Jacobson, ...

  54. [54]

    2025 , url=

    Huang, Yao and Sun, Yitong and Zhang, Yichi and Zhang, Ruochen and Dong, Yinpeng and Wei, Xingxing , journal=. 2025 , url=

  55. [55]

    2025 , url=

    Ren, Richard and Agarwal, Arunim and Mazeika, Mantas and Menghini, Cristina and Vacareanu, Robert and Kenstler, Brad and Yang, Mick and Barrass, Isabelle and Gatti, Alice and Yin, Xuwang and Trevino, Eduardo and Geralnik, Matias and Khoja, Adam and Lee, Dean and Yue, Summer and Hendrycks, Dan , journal=. 2025 , url=

  56. [56]

    2025 , url=

    Sehwag, Udari Madhushani and Shabihi, Shayan and McAvoy, Alex and Sehwag, Vikash and Xu, Yuancheng and Towers, Dalton and Huang, Furong , journal=. 2025 , url=

  57. [57]

    and Calcott, Rachel and Handoko, Brandon and de Font-Reaulx, Paul and Milli

    Chiu, Yu Ying and Lee, Michael S. and Calcott, Rachel and Handoko, Brandon and de Font-Reaulx, Paul and Milli. arXiv preprint arXiv:2510.16380 , year=

  58. [58]

    , journal=

    Kirichenko, Polina and Ibrahim, Mark and Chaudhuri, Kamalika and Bell, Samuel J. , journal=. 2025 , url=

  59. [59]

    and Wang, Miles and Carroll, Micah and Dou, Zehao and Wei, Annie Y

    Guan, Melody Y. and Wang, Miles and Carroll, Micah and Dou, Zehao and Wei, Annie Y. and Williams, Marcus and Arnav, Benjamin and Huizinga, Joost and Kivlichan, Ian and Glaese, Mia and Pachocki, Jakub and Baker, Bowen , journal=. 2025 , url=

  60. [60]

    and Fitz, Stephen and Hendrycks, Dan , journal=

    Ren, Richard and Basart, Steven and Khoja, Adam and Gatti, Alice and Phan, Long and Yin, Xuwang and Mazeika, Mantas and Pan, Alexander and Mukobi, Gabriel and Kim, Ryan H. and Fitz, Stephen and Hendrycks, Dan , journal=. 2024 , url=