pith. sign in

arxiv: 2605.15377 · v2 · pith:H7EZQYT3new · submitted 2026-05-14 · 💻 cs.AI

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

Pith reviewed 2026-05-20 20:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords ensemble monitoringAI safetymisaligned actionsfine-tuningpromptingdetection performanceagent alignment
0
0 comments X

The pith

Diverse ensembles of AI monitors detect misaligned actions better than scaling up copies of one monitor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that combining detection signals from varied monitors improves identification of unsafe or misaligned AI agent behaviors more than adding more copies of the same monitor. This matters for autonomous AI systems where human review is impossible at scale and reliable single monitors are hard to build. The authors created twelve monitors from GPT-4.1-Mini using both prompting and fine-tuning, then measured performance on coding tasks whose solutions pass ordinary tests yet fail adversarial checks. Their strongest three-monitor ensemble produced 2.4 times the detection gain of an ensemble made from three identical monitors and held up on a separate dataset. Fine-tuned monitors appeared in every top ensemble and kept their edge on new attack types, pointing to diversity and low correlation between monitors as the drivers of improvement rather than raw scale.

Core claim

Combining signals from diverse monitors into an ensemble improves detection of misaligned actions. On coding tasks where solutions pass standard tests but fail on adversarial inputs, the best 3-monitor ensemble achieves 2.4x greater detection performance gain than an ensemble of three identical monitors. Fine-tuned monitors appear in every top-performing ensemble and retain their advantage on out-of-distribution attacks, while the strongest ensembles pair good individual performance with low correlation between monitors.

What carries the argument

Ensemble that merges detection signals from distinct monitors built via prompting and fine-tuning on the same base model.

Load-bearing premise

The coding-task setting where solutions pass standard tests but fail on adversarial inputs serves as a valid proxy for misaligned actions in broader autonomous agent deployments.

What would settle it

If a diverse three-monitor ensemble shows no meaningful detection gain over three identical monitors when tested on a new collection of real agent tasks outside coding, the central claim would be challenged.

Figures

Figures reproduced from arXiv: 2605.15377 by Benjamin Arnav, Eugene Koran, Pablo Bernabeu-P\'erez, Samantha Tetef, Yejun Yun.

Figure 1
Figure 1. Figure 1: Ensemble monitoring: each monitor scores independently and the mean is compared to a threshold calibrated on a validation set. Our best-performing 3-monitor ensemble combines one fine-tuned ( ) and two prompted ( ) monitors. All three identify the same off-by-one loop bound but reach opposite conclusions: the fine-tuned monitor flags it as an intentional backdoor while both prompted monitors dismiss it as … view at source ↗
Figure 4
Figure 4. Figure 4 [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Individual monitor performance (pAUC) with bootstrap confidence intervals (n = 1000 resamples, 95% CI). The ensemble outperforms all individual monitors. strating that ensembling can still improve monitoring per￾formance even without developing stronger individual mon￾itors. Ensembles improve detection and FPR calibration. We analyze test performance of ensembles at thresholds se￾lected on the validation s… view at source ↗
Figure 5
Figure 5. Figure 5: Paired pAUC differences between the top 3-monitor ensembles ( [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Most diverse 3-monitor ensembles on APPS and BCB-S after filtering the bottom performance quartile, showing paired pAUC differences relative to the Baseline monitor and the median 3-monitor ensemble, with bootstrap confidence intervals (n = 1000 resamples, 95% CI). In general, the most diverse ensembles have lower average inter-monitor correlation (inverse diversity) and better performance than both the Ba… view at source ↗
read the original abstract

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that ensembles combining signals from diverse monitors outperform both individual monitors and homogeneous ensembles in detecting misaligned actions. Using 12 GPT-4.1-Mini monitors constructed via prompting and fine-tuning strategies, the authors evaluate on coding tasks where solutions pass standard tests but fail adversarial inputs. Their best 3-monitor diverse ensemble achieves a 2.4x greater detection performance gain than an ensemble of three identical monitors, with strong performance on an independent dataset; they conclude that diversity (especially from fine-tuned monitors with low inter-monitor correlation) drives gains more than additional compute and offers a practical AI control strategy.

Significance. If the central empirical result holds, the work offers a concrete, low-inference-cost approach to improving detection of unsafe agent actions by prioritizing monitor diversity over scale. It supplies evidence that fine-tuning elicits detection capabilities not obtained by prompting alone and that low-correlation ensembles are particularly effective. These findings could guide practical safety engineering for autonomous systems, though their scope is constrained by the chosen evaluation proxy.

major comments (2)
  1. [Abstract] Abstract and evaluation description: the 2.4x detection performance gain is reported without error bars, statistical significance tests, or a complete experimental protocol. This omission makes it impossible to assess whether the quantitative claim is robust to post-hoc monitor selection, task-specific artifacts, or random variation.
  2. [Abstract] Abstract, evaluation description: the coding-task proxy (solutions that pass standard tests yet fail adversarial inputs) is used to stand in for misaligned actions in autonomous agents. This setting primarily surfaces hidden defects or backdoors rather than intentional misalignment behaviors such as deception or goal misgeneralization; if the failure-mode correlation structures differ, the reported low inter-monitor correlation and fine-tuning advantage may not generalize, undermining the claim that the approach constitutes a practical AI control strategy.
minor comments (2)
  1. [Abstract] Clarify how the ensemble combines monitor signals (e.g., voting rule, threshold, or learned aggregator) and whether this choice was tuned on the same data used to report the 2.4x figure.
  2. The statement that 'fine-tuned monitors appear in every top-performing ensemble' would be strengthened by an explicit table or figure listing the composition of the top ensembles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation and scope of our work. We address each major comment below and have made revisions to strengthen the manuscript where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: the 2.4x detection performance gain is reported without error bars, statistical significance tests, or a complete experimental protocol. This omission makes it impossible to assess whether the quantitative claim is robust to post-hoc monitor selection, task-specific artifacts, or random variation.

    Authors: We agree that the abstract and main text would benefit from additional statistical rigor. In the revised manuscript, we have added bootstrap-derived 95% confidence intervals for the reported performance gains and included paired statistical tests (Wilcoxon signed-rank) comparing diverse versus homogeneous ensembles. We have also expanded the experimental protocol section to detail the full pipeline: monitor construction, validation-set-based selection of the top monitors (to avoid post-hoc bias on the test set), ensemble aggregation method, and the exact adversarial input generation process. These changes make the 2.4x claim reproducible and allow readers to evaluate robustness to random variation and selection artifacts. revision: yes

  2. Referee: [Abstract] Abstract, evaluation description: the coding-task proxy (solutions that pass standard tests yet fail adversarial inputs) is used to stand in for misaligned actions in autonomous agents. This setting primarily surfaces hidden defects or backdoors rather than intentional misalignment behaviors such as deception or goal misgeneralization; if the failure-mode correlation structures differ, the reported low inter-monitor correlation and fine-tuning advantage may not generalize, undermining the claim that the approach constitutes a practical AI control strategy.

    Authors: We acknowledge that the chosen proxy primarily evaluates detection of hidden defects and backdoors rather than explicit deception or goal misgeneralization. This is a genuine limitation of the current evaluation. At the same time, the proxy captures a practically relevant failure mode for agentic coding systems—outputs that satisfy superficial checks but produce incorrect behavior under distribution shift or adversarial evaluation—which is a common manifestation of misalignment in deployed agents. We have added a dedicated limitations paragraph in the discussion section that explicitly contrasts the proxy with other misalignment types, notes the potential differences in correlation structure, and outlines future experiments on deception benchmarks. We retain the claim that the results demonstrate the value of diversity for this class of detection tasks while qualifying the generalization statement. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with independent validation

full rationale

The paper reports direct experimental results from constructing 12 GPT-4.1-Mini monitors via prompting and fine-tuning, then measuring their ensemble detection performance on coding tasks (solutions pass standard tests but fail adversarial inputs) and an independent dataset. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The reported 2.4x gain is a measured outcome from explicit evaluation, not a quantity forced by definition or prior self-referential result. The setup is self-contained against external benchmarks via the independent dataset and does not reduce any central claim to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the empirical observation that monitor errors are sufficiently uncorrelated; this is treated as a domain assumption rather than derived from first principles.

axioms (1)
  • domain assumption Monitor detection errors are sufficiently uncorrelated to produce ensemble gains when combined.
    Invoked to explain why diverse ensembles outperform homogeneous ones in the reported experiments.

pith-pipeline@v0.9.0 · 5763 in / 1275 out tokens · 101875 ms · 2026-05-20T20:07:45.182666+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 13 internal anchors

  1. [1]

    arXiv preprint arXiv:2301.12345 , year =

    Smith, John and Doe, Jane , title =. arXiv preprint arXiv:2301.12345 , year =. 2301.12345 , archivePrefix =

  2. [2]

    Journal of Artificial Intelligence Research , volume =

    Johnson, Alice and Williams, Bob , title =. Journal of Artificial Intelligence Research , volume =. 2024 , doi =

  3. [3]

    Proceedings of the 38th International Conference on Machine Learning , pages =

    Chen, Wei and Kumar, Raj , title =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2023 , publisher =

  4. [4]

    2023 , url =

    OpenAI , title =. 2023 , url =

  5. [5]

    2020 , isbn =

    Russell, Stuart and Norvig, Peter , title =. 2020 , isbn =

  6. [6]

    Language Models are Few-Shot Learners

    Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and others , title =. arXiv preprint arXiv:2005.14165 , year =. 2005.14165 , archivePrefix =

  7. [7]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny , title =. arXiv preprint arXiv:2201.11903 , year =. 2201.11903 , archivePrefix =

  8. [8]

    and Goldstein, Simon and O'Gara, Aidan and Chen, Michael and Hendrycks, Dan , title =

    Park, Peter S. and Goldstein, Simon and O'Gara, Aidan and Chen, Michael and Hendrycks, Dan , title =. arXiv preprint arXiv:2308.14752 , year =. 2308.14752 , archivePrefix =

  9. [9]

    Risks from Learned Optimization in Advanced Machine Learning Systems

    Hubinger, Evan and van Merwijk, Chris and Mikulik, Vladimir and Skalse, Joar and Garrabrant, Scott , title =. arXiv preprint arXiv:1906.01820 , year =. 1906.01820 , archivePrefix =

  10. [10]

    2022 , url =

    Cotra, Ajeya , title =. 2022 , url =

  11. [11]

    Transformer Circuits Thread , year =

    Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and others , title =. Transformer Circuits Thread , year =

  12. [12]

    Distill , year =

    Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =. doi:10.23915/distill.00024.001 , url =

  13. [13]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee , title =. arXiv preprint arXiv:2309.08600 , year =. 2309.08600 , archivePrefix =

  14. [14]

    Advances in Neural Information Processing Systems , volume =

    Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , title =. Advances in Neural Information Processing Systems , volume =

  15. [15]

    arXiv preprint arXiv:1905.05950 , year =

    Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , title =. arXiv preprint arXiv:1905.05950 , year =. 1905.05950 , archivePrefix =

  16. [16]

    AI safety via debate

    Irving, Geoffrey and Christiano, Paul and Amodei, Dario , title =. arXiv preprint arXiv:1805.00899 , year =. 1805.00899 , archivePrefix =

  17. [17]

    Alignment faking in large language models

    Alignment Faking in Large Language Models , author=. arXiv preprint arXiv:2412.14093 , year=

  18. [18]

    Ctrl-Z: Controlling

    Bhatt, Aryan and Rushing, Cody and Kaufman, Adam and Tracy, Tyler and Georgiev, Vasil and Matolcsi, David and Khan, Akbir and Shlegeris, Buck , journal=. Ctrl-Z: Controlling

  19. [19]

    Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , journal=

  20. [20]

    arXiv preprint arXiv:2507.15886 , year=

    Combining cost-constrained runtime monitors for ai safety , author=. arXiv preprint arXiv:2507.15886 , year=

  21. [21]

    De- tecting strategic deception using linear probes.arXiv preprint arXiv:2502.03407, 2025

    Detecting Strategic Deception Using Linear Probes , author=. arXiv preprint arXiv:2502.03407 , year=

  22. [22]

    arXiv preprint arXiv:2506.05296 , year=

    Control Tax: The Price of Keeping AI in Check , author=. arXiv preprint arXiv:2506.05296 , year=

  23. [23]

    Adaptive Attacks on Trusted Monitors Subvert

    Terekhov, Mikhail and Panfilov, Alexander and Dzenhaliou, Daniil and Gulcehre, Caglar and Andriushchenko, Maksym and Prabhu, Ameya and Geiping, Jonas , journal=. Adaptive Attacks on Trusted Monitors Subvert

  24. [24]

    Terry Yue Zhuo and Minh Chien Vu and Jenny Chim and Han Hu and Wenhao Yu and Ratnadira Widyasari and Imam Nur Bani Yusuf and Haolan Zhan and Junda He and Indraneil Paul and Simon Brunner and Chen Gong and Thong Hoang and Armel Randy Zebaze and Xiaoheng Hong and Wen-Ding Li and Jean Kaddour and Ming Xu and Zhihan Zhang and Prateek Yadav and Naman Jain and ...

  25. [25]

    Measuring Coding Challenge Competence With APPS

    Measuring Coding Challenge Competence With APPS , author=. arXiv preprint arXiv:2105.09938 , year=

  26. [26]

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Persona vectors: Monitoring and controlling character traits in language models , author=. arXiv preprint arXiv:2507.21509 , year=

  27. [27]

    2025 , url =

    Parent, Zach and Finke, Lennart and Tracy, Taler , title =. 2025 , url =

  28. [28]

    Enhancing Monitoring in

    Al Dahik, Ghassan and Xu, Ruize and Singh, Satvik and Kiunga, Leon and Bernabeu-Perez, Pablo , booktitle=. Enhancing Monitoring in. 2025 , url=

  29. [29]

    Inglis, Rogan and Matthews, Ollie and Tracy, Tyler and Makins, Oliver and Catling, Tom and Cooper Stickland, Asa and Faber-Espensen, Rasmus and O'Connell, Daniel and Heller, Myles and Brandao, Miguel and Hanson, Adam and Mani, Arathi and Korbak, Tomek and Michelfeit, Jan and Bansal, Dishank and Bark, Tomas and Canal, Chris and Griffin, Charlie and Stehlik...

  30. [30]

    , title =

    Dietterich, Thomas G. , title =. First International Workshop on Multiple Classifier Systems (MCS 2000) , series =. 2000 , publisher =

  31. [31]

    Machine Learning , volume =

    Breiman, Leo , title =. Machine Learning , volume =. 2001 , publisher =

  32. [32]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

    Hansen, Lars Kai and Salamon, Peter , title =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =. 1990 , publisher =

  33. [33]

    Advances in Neural Information Processing Systems , volume =

    Krogh, Anders and Vedelsby, Jesper , title =. Advances in Neural Information Processing Systems , volume =. 1995 , publisher =

  34. [34]

    Machine Learning , year =

    Breiman, Leo , title =. Machine Learning , year =. doi:10.1007/BF00058655 , publisher =

  35. [35]

    Frontiers of Computer Science , year =

    Dong, Xibin and Yu, Zhiwen and Cao, Wenming and Shi, Yifan and Ma, Qianli , title =. Frontiers of Computer Science , year =. doi:10.1007/s11704-019-8208-z , publisher =

  36. [36]

    2025 , url =

    Hopman, Mia and Parent, Zach , title =. 2025 , url =

  37. [37]

    2025 , journal =

    Arnav, Benjamin and Bernabeu-P. 2025 , journal =. 2505.23575 , archivePrefix =

  38. [38]

    2025 , url =

    Arike, Rauno and Subramani, Rohan and Biswas, Shubhorup , title =. 2025 , url =

  39. [39]

    , title =

    Belrose, N. , title =. 2023 , url =

  40. [40]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Andriushchenko, Maksym and Souly, Alexandra and Dziemian, Mateusz and Duenas, Derek and Lin, Maxwell and Wang, Justin and Hendrycks, Dan and Zou, Andy and Kolter, Zico and Fredrikson, Matt and Winsor, Eric and Wynne, Jerome and Gal, Yarin and Davies, Xander , title =. 2025 , journal =. 2410.09024 , archivePrefix =

  41. [41]

    Baker, J

    Baker, Bowen and Huizinga, Joost and Gao, Leo and Dou, Zehao and Guan, Melody Y. and Madry, Aleksander and Zaremba, Wojciech and Pachocki, Jakub and Farhi, David , title =. 2025 , journal =. 2503.11926 , archivePrefix =

  42. [42]

    Black, A

    Black, Sid and Cooper Stickland, Asa and Pencharz, Jake and Sourbut, Oliver and Schmatz, Michael and Bailey, Jay and Matthews, Ollie and Millwood, Ben and Remedios, Alex and Cooney, Alan , title =. 2025 , journal =. 2504.18565 , archivePrefix =

  43. [43]

    The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year =

    Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year =

  44. [44]

    2024 , journal =

    Greenblatt, Ryan and Roger, Fabien and Krasheninnikov, Dmitrii and Krueger, David , title =. 2024 , journal =. 2405.19550 , archivePrefix =

  45. [45]

    Trustworthy Multi-modal Foundation Models and

    Griffin, Charlie and Shlegeris, Buck and Abate, Alessandro , title =. Trustworthy Multi-modal Foundation Models and. 2024 , url =

  46. [46]

    Reliable Weak-to-Strong Monitoring of

    Kale, Neil and Zhang, Chen Bo Calvin and Zhu, Kevin and Aich, Ankit and Rodriguez, Paula and. Reliable Weak-to-Strong Monitoring of. 2025 , journal =. 2508.19461 , archivePrefix =

  47. [47]

    Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    Korbak, Tomek and Balesni, Mikita and Barnes, Elizabeth and Bengio, Yoshua and Benton, Joe and Bloom, Joseph and Chen, Mark and Cooney, Alan and Dafoe, Allan and Dragan, Anca and Emmons, Scott and Evans, Owain and Farhi, David and Greenblatt, Ryan and Hendrycks, Dan and Hobbhahn, Marius and Hubinger, Evan and Irving, Geoffrey and Jenner, Erik and Kokotajl...

  48. [48]

    2025 , journal =

    Korbak, Tomek and Balesni, Mikita and Shlegeris, Buck and Irving, Geoffrey , title =. 2025 , journal =. 2504.05259 , archivePrefix =

  49. [49]

    2025 , journal =

    Korbak, Tomek and Clymer, Joshua and Hilton, Benjamin and Shlegeris, Buck and Irving, Geoffrey , title =. 2025 , journal =. 2501.17315 , archivePrefix =

  50. [50]

    2025 , journal =

    Kutasov, Jon and Loughridge, Chloe and Sun, Yuqi and Sleight, Henry and Shlegeris, Buck and Tracy, Tyler and Benton, Joe , title =. 2025 , journal =. 2511.02997 , archivePrefix =

  51. [51]

    Kutasov, Y

    Kutasov, Jonathan and Sun, Yuqi and Colognese, Paul and van der Weij, Teun and Petrini, Linda and Zhang, Chen Bo Calvin and Hughes, John and Deng, Xiang and Sleight, Henry and Tracy, Tyler and Shlegeris, Buck and Benton, Joe , title =. 2025 , journal =. 2506.15740 , archivePrefix =

  52. [52]

    and Zhang, Hao and Stoica, Ion , title =

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , title =. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year =

  53. [53]

    and Irving, Geoffrey and Farquhar, Sebastian and Cooney, Alan , title =

    Lindner, David and Griffin, Charlie and Korbak, Tomek and Zimmermann, Roland S. and Irving, Geoffrey and Farquhar, Sebastian and Cooney, Alan , title =

  54. [54]

    2025 , journal =

    Loughridge, Chloe and Colognese, Paul and Griffin, Avery and Tracy, Tyler and Kutasov, Jon and Benton, Joe , title =. 2025 , journal =. 2511.02823 , archivePrefix =

  55. [55]

    2025 , journal =

    Mallen, Alex and Griffin, Charlie and Wagner, Misha and Abate, Alessandro and Shlegeris, Buck , title =. 2025 , journal =. 2412.12480 , archivePrefix =

  56. [56]

    2021 , url =

    Our New. 2021 , url =

  57. [57]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    Ruan, Yangjun and Dong, Honghua and Wang, Andrew and Pitis, Silviu and Zhou, Yongchao and Ba, Jimmy and Dubois, Yann and Maddison, Chris J. and Hashimoto, Tatsunori , title =. 2024 , journal =. 2309.15817 , archivePrefix =

  58. [58]

    2025 , journal =

    Taylor, Jordan and Black, Sid and Bowen, Dillon and Read, Thomas and Golechha, Satvik and Zelenka-Martin, Alex and Makins, Oliver and Kissane, Connor and Ayonrinde, Kola and Merizian, Jacob and Marks, Samuel and Cundy, Chris and Bloom, Joseph , title =. 2025 , journal =. 2512.07810 , archivePrefix =

  59. [59]

    2025 , journal =

    Ward, Francis Rhys and van der Weij, Teun and G. 2025 , journal =. 2511.09904 , archivePrefix =

  60. [60]

    2024 , journal =

    Wen, Jiaxin and Hebbar, Vivek and Larson, Caleb and Bhatt, Aryan and Radhakrishnan, Ansh and Sharma, Mrinank and Sleight, Henry and Feng, Shi and He, He and Perez, Ethan and others , title =. 2024 , journal =. 2411.17693 , archivePrefix =

  61. [61]

    International Conference on Learning Representations (ICLR) , year =

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , title =. International Conference on Learning Representations (ICLR) , year =

  62. [62]

    2025 , journal =

    Monitoring Decomposition Attacks in. 2025 , journal =. 2506.10949 , archivePrefix =

  63. [63]

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , title =. 2024 , journal =. 2403.02691 , archivePrefix =

  64. [64]

    Async Control: Stress-testing Asynchronous Control Measures for

    Stickland, Asa Cooper and Michelfeit, Jan and Mani, Arathi and Griffin, Charlie and Matthews, Ollie and Korbak, Tomek and Inglis, Rogan and Makins, Oliver and Cooney, Alan , journal=. Async Control: Stress-testing Asynchronous Control Measures for

  65. [65]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Geiping, Jonas and McLeish, Sean and Jain, Neel and Kirchenbauer, John and Singh, Siddharth and Bartoldson, Brian R. and Kailkhura, Bhavya and Bhatele, Abhinav and Goldstein, Tom , title =. 2025 , journal =. 2502.05171 , archivePrefix =

  66. [66]

    International Conference on Learning Representations (ICLR) , year =

    Li, Zhiyuan and Liu, Hong and Zhou, Denny and Ma, Tengyu , title =. International Conference on Learning Representations (ICLR) , year =. 2402.12875 , archivePrefix =

  67. [67]

    2023 , journal =

    Zhang, Xiaoyu and Zhang, Cen and Li, Tianlin and Huang, Yihao and Jia, Xiaojun and Xie, Xiaofei and Liu, Yang and Shen, Chao , title =. 2023 , journal =. 2312.10766 , archivePrefix =

  68. [68]

    Healthcare , VOLUME =

    Mahajan, Palak and Uddin, Shahadat and Hajati, Farshid and Moni, Mohammad Ali , TITLE =. Healthcare , VOLUME =. 2023 , NUMBER =

  69. [69]

    and Bassey, Steven Ita and Musa, Umaru , title =

    Ojoje, Oche Akiti and Aimufua, Gilbert I.O. and Bassey, Steven Ita and Musa, Umaru , title =. ISA Journal of Engineering and Technology (ISAJET) , volume =. 2025 , issn =. doi:10.5281/zenodo.17392097 , url =

  70. [70]

    2025 , issn =

    A credit card fraud detection approach based on ensemble machine learning classifier with hybrid data sampling , journal =. 2025 , issn =. doi:https://doi.org/10.1016/j.mlwa.2025.100675 , url =

  71. [71]

    2018 , url =

    Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya , title =. 2018 , url =

  72. [72]

    2025 , doi =

    Maslej, Nestor and Fattorini, Loredana and Perrault, Raymond and Gil, Yolanda and Parli, Vanessa and Kariuki, Njenga and Capstick, Emily and Reuel, Anka and Brynjolfsson, Erik and Etchemendy, John and Ligett, Katrina and Lyons, Terah and Manyika, James and Niebles, Juan Carlos and Shoham, Yoav and Wald, Russell and Walsh, Toby and Hamrah, Armin and Santar...

  73. [73]

    A Comparative Analysis of Ensemble Protocols for

    Ariel Monzon and Natan Katz and Inbar Shulman and Mike Vazan and Or Shemesh and Guy Porat , year =. A Comparative Analysis of Ensemble Protocols for

  74. [74]

    2025 , pages =

    Zhi-Hua Zhou , title =. 2025 , pages =

  75. [75]

    and Whitaker, Christopher J

    Kuncheva, Ludmila I. and Whitaker, Christopher J. , title =. Machine Learning , volume =. 2003 , doi =

  76. [76]

    Wu, Yanzhao and Liu, Ling and Xie, Zhongwei and Chow, Ka-Ho and Wei, Wenqi , title =. 2021. 2021 , url =