Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

Benjamin Arnav; Eugene Koran; Pablo Bernabeu-Pérez; Samantha Tetef; Yejun Yun

arxiv: 2605.15377 · v2 · pith:H7EZQYT3new · submitted 2026-05-14 · 💻 cs.AI


Pith reviewed 2026-05-20 20:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords ensemble monitoring · AI safety · misaligned actions · fine-tuning · prompting · detection performance · agent alignment


The pith

Diverse ensembles of AI monitors detect misaligned actions better than scaling up copies of one monitor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that combining detection signals from varied monitors improves identification of unsafe or misaligned AI agent behaviors more than adding more copies of the same monitor. This matters for autonomous AI systems where human review is impossible at scale and reliable single monitors are hard to build. The authors created twelve monitors from GPT-4.1-Mini using both prompting and fine-tuning, then measured performance on coding tasks whose solutions pass ordinary tests yet fail adversarial checks. Their strongest three-monitor ensemble produced 2.4 times the detection gain of an ensemble made from three identical monitors and held up on a separate dataset. Fine-tuned monitors appeared in every top ensemble and kept their edge on new attack types, pointing to diversity and low correlation between monitors as the drivers of improvement rather than raw scale.

Core claim

Combining signals from diverse monitors into an ensemble improves detection of misaligned actions. On coding tasks where solutions pass standard tests but fail on adversarial inputs, the best 3-monitor ensemble achieves 2.4x greater detection performance gain than an ensemble of three identical monitors. Fine-tuned monitors appear in every top-performing ensemble and retain their advantage on out-of-distribution attacks, while the strongest ensembles pair good individual performance with low correlation between monitors.

What carries the argument

An ensemble that merges detection signals from distinct monitors, all built from the same base model via prompting and fine-tuning.

Load-bearing premise

The coding-task setting where solutions pass standard tests but fail on adversarial inputs serves as a valid proxy for misaligned actions in broader autonomous agent deployments.

What would settle it

If a diverse three-monitor ensemble shows no meaningful detection gain over three identical monitors when tested on a new collection of real agent tasks outside coding, the central claim would be challenged.

Figures

Figures reproduced from arXiv: 2605.15377 by Benjamin Arnav, Eugene Koran, Pablo Bernabeu-Pérez, Samantha Tetef, Yejun Yun.

**Figure 1.** Ensemble monitoring: each monitor scores independently and the mean is compared to a threshold calibrated on a validation set. Our best-performing 3-monitor ensemble combines one fine-tuned and two prompted monitors. All three identify the same off-by-one loop bound but reach opposite conclusions: the fine-tuned monitor flags it as an intentional backdoor while both prompted monitors dismiss it as …

**Figure 4.** [figure image: figures/full_fig_p005_4.png]

**Figure 3.** Individual monitor performance (pAUC) with bootstrap confidence intervals (n = 1000 resamples, 95% CI). The ensemble outperforms all individual monitors.

**Figure 5.** Paired pAUC differences between the top 3-monitor ensembles (…

**Figure 6.** Most diverse 3-monitor ensembles on APPS and BCB-S after filtering the bottom performance quartile, showing paired pAUC differences relative to the Baseline monitor and the median 3-monitor ensemble, with bootstrap confidence intervals (n = 1000 resamples, 95% CI). In general, the most diverse ensembles have lower average inter-monitor correlation (inverse diversity) and better performance than both the Ba…
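
The aggregation rule described in the Figure 1 caption (average the monitors' scores, then compare the mean to a threshold calibrated on a validation set) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the score scale, the example data, and the choice to calibrate against a target false-positive rate on benign validation samples are all assumptions.

```python
from statistics import mean


def ensemble_scores(monitor_scores):
    """Average per-sample suspicion scores across monitors.

    monitor_scores: one equal-length list of scores per monitor.
    """
    return [mean(per_sample) for per_sample in zip(*monitor_scores)]


def calibrate_threshold(benign_validation_scores, target_fpr):
    """Return the smallest observed score t for which the share of benign
    validation scores strictly above t is at most target_fpr (assumed rule)."""
    ordered = sorted(benign_validation_scores)
    for t in ordered:
        false_positive_rate = sum(s > t for s in ordered) / len(ordered)
        if false_positive_rate <= target_fpr:
            return t
    raise ValueError("need at least one benign score and target_fpr >= 0")


def flag(scores, threshold):
    """Flag a sample when its ensemble score exceeds the threshold."""
    return [s > threshold for s in scores]


# Hypothetical scores from three monitors on four benign validation samples.
validation = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 4.0, 6.0],
    [0.0, 2.0, 2.0, 8.0],
]
threshold = calibrate_threshold(ensemble_scores(validation), target_fpr=0.25)

# Hypothetical scores from the same three monitors on two test samples.
test = [[1.0, 7.0], [3.0, 8.0], [2.0, 6.0]]
flags = flag(ensemble_scores(test), threshold)
```

Calibrating on validation data and applying the frozen threshold to test data keeps the reported false-positive rate from being tuned on the samples used for evaluation.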

Original abstract

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that ensembles combining signals from diverse monitors outperform both individual monitors and homogeneous ensembles in detecting misaligned actions. Using 12 GPT-4.1-Mini monitors constructed via prompting and fine-tuning strategies, the authors evaluate on coding tasks where solutions pass standard tests but fail adversarial inputs. Their best 3-monitor diverse ensemble achieves a 2.4x greater detection performance gain than an ensemble of three identical monitors, with strong performance on an independent dataset; they conclude that diversity (especially from fine-tuned monitors with low inter-monitor correlation) drives gains more than additional compute and offers a practical AI control strategy.

Significance. If the central empirical result holds, the work offers a concrete, low-inference-cost approach to improving detection of unsafe agent actions by prioritizing monitor diversity over scale. It supplies evidence that fine-tuning elicits detection capabilities not obtained by prompting alone and that low-correlation ensembles are particularly effective. These findings could guide practical safety engineering for autonomous systems, though their scope is constrained by the chosen evaluation proxy.

major comments (2)

[Abstract] Abstract and evaluation description: the 2.4x detection performance gain is reported without error bars, statistical significance tests, or a complete experimental protocol. This omission makes it impossible to assess whether the quantitative claim is robust to post-hoc monitor selection, task-specific artifacts, or random variation.
[Abstract] Abstract, evaluation description: the coding-task proxy (solutions that pass standard tests yet fail adversarial inputs) is used to stand in for misaligned actions in autonomous agents. This setting primarily surfaces hidden defects or backdoors rather than intentional misalignment behaviors such as deception or goal misgeneralization; if the failure-mode correlation structures differ, the reported low inter-monitor correlation and fine-tuning advantage may not generalize, undermining the claim that the approach constitutes a practical AI control strategy.

minor comments (2)

[Abstract] Clarify how the ensemble combines monitor signals (e.g., voting rule, threshold, or learned aggregator) and whether this choice was tuned on the same data used to report the 2.4x figure.
The statement that 'fine-tuned monitors appear in every top-performing ensemble' would be strengthened by an explicit table or figure listing the composition of the top ensembles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation and scope of our work. We address each major comment below and have made revisions to strengthen the manuscript where appropriate.


Referee: [Abstract] Abstract and evaluation description: the 2.4x detection performance gain is reported without error bars, statistical significance tests, or a complete experimental protocol. This omission makes it impossible to assess whether the quantitative claim is robust to post-hoc monitor selection, task-specific artifacts, or random variation.

Authors: We agree that the abstract and main text would benefit from additional statistical rigor. In the revised manuscript, we have added bootstrap-derived 95% confidence intervals for the reported performance gains and included paired statistical tests (Wilcoxon signed-rank) comparing diverse versus homogeneous ensembles. We have also expanded the experimental protocol section to detail the full pipeline: monitor construction, validation-set-based selection of the top monitors (to avoid post-hoc bias on the test set), ensemble aggregation method, and the exact adversarial input generation process. These changes make the 2.4x claim reproducible and allow readers to evaluate robustness to random variation and selection artifacts. revision: yes
Referee: [Abstract] Abstract, evaluation description: the coding-task proxy (solutions that pass standard tests yet fail adversarial inputs) is used to stand in for misaligned actions in autonomous agents. This setting primarily surfaces hidden defects or backdoors rather than intentional misalignment behaviors such as deception or goal misgeneralization; if the failure-mode correlation structures differ, the reported low inter-monitor correlation and fine-tuning advantage may not generalize, undermining the claim that the approach constitutes a practical AI control strategy.

Authors: We acknowledge that the chosen proxy primarily evaluates detection of hidden defects and backdoors rather than explicit deception or goal misgeneralization. This is a genuine limitation of the current evaluation. At the same time, the proxy captures a practically relevant failure mode for agentic coding systems—outputs that satisfy superficial checks but produce incorrect behavior under distribution shift or adversarial evaluation—which is a common manifestation of misalignment in deployed agents. We have added a dedicated limitations paragraph in the discussion section that explicitly contrasts the proxy with other misalignment types, notes the potential differences in correlation structure, and outlines future experiments on deception benchmarks. We retain the claim that the results demonstrate the value of diversity for this class of detection tasks while qualifying the generalization statement. revision: partial
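
The bootstrap confidence intervals mentioned in the first response (and in the captions of Figures 3 and 6) can be illustrated with a paired percentile bootstrap over samples. The sketch below is an assumption about the general procedure, not the authors' code: it resamples test samples with replacement, scores both monitors on the same resample, and takes percentiles of the metric difference. The paper reports pAUC; because this page does not state the FPR cap, the placeholder metric here is the full ROC AUC, and any metric with the same signature could be substituted.

```python
import random


def auc(scores, labels):
    """Full ROC AUC via pairwise comparison; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_ci(scores_a, scores_b, labels, metric=auc,
                        n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for metric(A) - metric(B) on shared resamples."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # the metric needs both classes; redraw this resample
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(metric(a, ys) - metric(b, ys))
    diffs.sort()
    k = int(n_resamples * alpha / 2)  # number of values trimmed per tail
    return diffs[k], diffs[n_resamples - 1 - k]
```

Using the same resampled indices for both monitors is what makes the comparison paired: sample-level difficulty cancels out of the difference, which typically gives a tighter interval than bootstrapping each monitor separately.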

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with independent validation


The paper reports direct experimental results from constructing 12 GPT-4.1-Mini monitors via prompting and fine-tuning, then measuring their ensemble detection performance on coding tasks (solutions pass standard tests but fail adversarial inputs) and an independent dataset. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The reported 2.4x gain is a measured outcome from explicit evaluation, not a quantity forced by definition or prior self-referential result. The setup is self-contained against external benchmarks via the independent dataset and does not reduce any central claim to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the empirical observation that monitor errors are sufficiently uncorrelated; this is treated as a domain assumption rather than derived from first principles.

axioms (1)

domain assumption: Monitor detection errors are sufficiently uncorrelated to produce ensemble gains when combined.
Invoked to explain why diverse ensembles outperform homogeneous ones in the reported experiments.
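
A standard identity shows why this assumption matters for score averaging. It is a textbook result rather than something taken from the paper, and it rests on a simplifying assumption: the $n$ monitor scores $s_1, \dots, s_n$ on a given sample share a common variance $\sigma^2$ and a common pairwise correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} s_i\right)
  = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2
  = \sigma^2\,\frac{1 + (n-1)\rho}{n}
```

For $n = 3$ this is $\sigma^2 (1 + 2\rho)/3$: perfectly correlated monitors ($\rho = 1$) leave the variance at $\sigma^2$, while uncorrelated monitors ($\rho = 0$) cut it to $\sigma^2/3$. Under these assumptions, averaging more copies helps only to the extent that their errors are not shared.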



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We aggregate monitors by averaging their scores... diversity metric is ρavg = average pairwise Pearson correlation among constituent monitors' score vectors, computed on attack samples only."

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: "Our best 3-monitor ensemble achieves 2.4× greater detection performance gain compared to an ensemble composed of three identical monitors"
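
The diversity metric quoted in the first passage above (ρavg, the average pairwise Pearson correlation among the monitors' score vectors, computed on attack samples only) can be sketched in a few lines. The function names, the boolean attack mask, and the example data are hypothetical; only the definition itself comes from the quoted passage.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def rho_avg(monitor_scores, is_attack):
    """Mean pairwise Pearson correlation, restricted to attack samples."""
    attack_only = [
        [s for s, attack in zip(scores, is_attack) if attack]
        for scores in monitor_scores
    ]
    pairs = list(combinations(attack_only, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical scores: three monitors, four samples, the last one benign.
scores = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 4.0, 6.0, 0.0],
    [3.0, 2.0, 1.0, 5.0],
]
diversity = rho_avg(scores, is_attack=[True, True, True, False])
```

A lower value means the monitors disagree more in how they rank attack samples, which is the sense in which the page treats correlation as inverse diversity.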

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 13 internal anchors

[1]

arXiv preprint arXiv:2301.12345 , year =

Smith, John and Doe, Jane , title =. arXiv preprint arXiv:2301.12345 , year =. 2301.12345 , archivePrefix =

work page arXiv
[2]

Journal of Artificial Intelligence Research , volume =

Johnson, Alice and Williams, Bob , title =. Journal of Artificial Intelligence Research , volume =. 2024 , doi =

work page 2024
[3]

Proceedings of the 38th International Conference on Machine Learning , pages =

Chen, Wei and Kumar, Raj , title =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2023 , publisher =

work page 2023
[4]

2023 , url =

OpenAI , title =. 2023 , url =

work page 2023
[5]

2020 , isbn =

Russell, Stuart and Norvig, Peter , title =. 2020 , isbn =

work page 2020
[6]

Language Models are Few-Shot Learners

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and others , title =. arXiv preprint arXiv:2005.14165 , year =. 2005.14165 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2005
[7]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny , title =. arXiv preprint arXiv:2201.11903 , year =. 2201.11903 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[8]

and Goldstein, Simon and O'Gara, Aidan and Chen, Michael and Hendrycks, Dan , title =

Park, Peter S. and Goldstein, Simon and O'Gara, Aidan and Chen, Michael and Hendrycks, Dan , title =. arXiv preprint arXiv:2308.14752 , year =. 2308.14752 , archivePrefix =

work page arXiv
[9]

Risks from Learned Optimization in Advanced Machine Learning Systems

Hubinger, Evan and van Merwijk, Chris and Mikulik, Vladimir and Skalse, Joar and Garrabrant, Scott , title =. arXiv preprint arXiv:1906.01820 , year =. 1906.01820 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 1906
[10]

2022 , url =

Cotra, Ajeya , title =. 2022 , url =

work page 2022
[11]

Transformer Circuits Thread , year =

Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and others , title =. Transformer Circuits Thread , year =

work page
[12]

Distill , year =

Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan , title =. Distill , year =. doi:10.23915/distill.00024.001 , url =

work page doi:10.23915/distill.00024.001
[13]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee , title =. arXiv preprint arXiv:2309.08600 , year =. 2309.08600 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Advances in Neural Information Processing Systems , volume =

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , title =. Advances in Neural Information Processing Systems , volume =

work page
[15]

arXiv preprint arXiv:1905.05950 , year =

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , title =. arXiv preprint arXiv:1905.05950 , year =. 1905.05950 , archivePrefix =

work page arXiv 1905
[16]

AI safety via debate

Irving, Geoffrey and Christiano, Paul and Amodei, Dario , title =. arXiv preprint arXiv:1805.00899 , year =. 1805.00899 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Alignment faking in large language models

Alignment Faking in Large Language Models , author=. arXiv preprint arXiv:2412.14093 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Ctrl-Z: Controlling

Bhatt, Aryan and Rushing, Cody and Kaufman, Adam and Tracy, Tyler and Georgiev, Vasil and Matolcsi, David and Khan, Akbir and Shlegeris, Buck , journal=. Ctrl-Z: Controlling

work page
[19]

Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , journal=

work page
[20]

arXiv preprint arXiv:2507.15886 , year=

Combining cost-constrained runtime monitors for ai safety , author=. arXiv preprint arXiv:2507.15886 , year=

work page arXiv
[21]

De- tecting strategic deception using linear probes.arXiv preprint arXiv:2502.03407, 2025

Detecting Strategic Deception Using Linear Probes , author=. arXiv preprint arXiv:2502.03407 , year=

work page arXiv
[22]

arXiv preprint arXiv:2506.05296 , year=

Control Tax: The Price of Keeping AI in Check , author=. arXiv preprint arXiv:2506.05296 , year=

work page arXiv
[23]

Adaptive Attacks on Trusted Monitors Subvert

Terekhov, Mikhail and Panfilov, Alexander and Dzenhaliou, Daniil and Gulcehre, Caglar and Andriushchenko, Maksym and Prabhu, Ameya and Geiping, Jonas , journal=. Adaptive Attacks on Trusted Monitors Subvert

work page
[24]

Terry Yue Zhuo and Minh Chien Vu and Jenny Chim and Han Hu and Wenhao Yu and Ratnadira Widyasari and Imam Nur Bani Yusuf and Haolan Zhan and Junda He and Indraneil Paul and Simon Brunner and Chen Gong and Thong Hoang and Armel Randy Zebaze and Xiaoheng Hong and Wen-Ding Li and Jean Kaddour and Ming Xu and Zhihan Zhang and Prateek Yadav and Naman Jain and ...

work page
[25]

Measuring Coding Challenge Competence With APPS

Measuring Coding Challenge Competence With APPS , author=. arXiv preprint arXiv:2105.09938 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Persona vectors: Monitoring and controlling character traits in language models , author=. arXiv preprint arXiv:2507.21509 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

2025 , url =

Parent, Zach and Finke, Lennart and Tracy, Taler , title =. 2025 , url =

work page 2025
[28]

Enhancing Monitoring in

Al Dahik, Ghassan and Xu, Ruize and Singh, Satvik and Kiunga, Leon and Bernabeu-Perez, Pablo , booktitle=. Enhancing Monitoring in. 2025 , url=

work page 2025
[29]

Inglis, Rogan and Matthews, Ollie and Tracy, Tyler and Makins, Oliver and Catling, Tom and Cooper Stickland, Asa and Faber-Espensen, Rasmus and O'Connell, Daniel and Heller, Myles and Brandao, Miguel and Hanson, Adam and Mani, Arathi and Korbak, Tomek and Michelfeit, Jan and Bansal, Dishank and Bark, Tomas and Canal, Chris and Griffin, Charlie and Stehlik...

work page
[30]

, title =

Dietterich, Thomas G. , title =. First International Workshop on Multiple Classifier Systems (MCS 2000) , series =. 2000 , publisher =

work page 2000
[31]

Machine Learning , volume =

Breiman, Leo , title =. Machine Learning , volume =. 2001 , publisher =

work page 2001
[32]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

Hansen, Lars Kai and Salamon, Peter , title =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =. 1990 , publisher =

work page 1990
[33]

Advances in Neural Information Processing Systems , volume =

Krogh, Anders and Vedelsby, Jesper , title =. Advances in Neural Information Processing Systems , volume =. 1995 , publisher =

work page 1995
[34]

Machine Learning , year =

Breiman, Leo , title =. Machine Learning , year =. doi:10.1007/BF00058655 , publisher =

work page doi:10.1007/bf00058655
[35]

Frontiers of Computer Science , year =

Dong, Xibin and Yu, Zhiwen and Cao, Wenming and Shi, Yifan and Ma, Qianli , title =. Frontiers of Computer Science , year =. doi:10.1007/s11704-019-8208-z , publisher =

work page doi:10.1007/s11704-019-8208-z
[36]

2025 , url =

Hopman, Mia and Parent, Zach , title =. 2025 , url =

work page 2025
[37]

2025 , journal =

Arnav, Benjamin and Bernabeu-P. 2025 , journal =. 2505.23575 , archivePrefix =

work page arXiv 2025
[38]

2025 , url =

Arike, Rauno and Subramani, Rohan and Biswas, Shubhorup , title =. 2025 , url =

work page 2025
[39]

, title =

Belrose, N. , title =. 2023 , url =

work page 2023
[40]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Andriushchenko, Maksym and Souly, Alexandra and Dziemian, Mateusz and Duenas, Derek and Lin, Maxwell and Wang, Justin and Hendrycks, Dan and Zou, Andy and Kolter, Zico and Fredrikson, Matt and Winsor, Eric and Wynne, Jerome and Gal, Yarin and Davies, Xander , title =. 2025 , journal =. 2410.09024 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Baker, J

Baker, Bowen and Huizinga, Joost and Gao, Leo and Dou, Zehao and Guan, Melody Y. and Madry, Aleksander and Zaremba, Wojciech and Pachocki, Jakub and Farhi, David , title =. 2025 , journal =. 2503.11926 , archivePrefix =

work page arXiv 2025
[42]

Black, A

Black, Sid and Cooper Stickland, Asa and Pencharz, Jake and Sourbut, Oliver and Schmatz, Michael and Bailey, Jay and Matthews, Ollie and Millwood, Ben and Remedios, Alex and Cooney, Alan , title =. 2025 , journal =. 2504.18565 , archivePrefix =

work page arXiv 2025
[43]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year =

work page
[44]

2024 , journal =

Greenblatt, Ryan and Roger, Fabien and Krasheninnikov, Dmitrii and Krueger, David , title =. 2024 , journal =. 2405.19550 , archivePrefix =

work page arXiv 2024
[45]

Trustworthy Multi-modal Foundation Models and

Griffin, Charlie and Shlegeris, Buck and Abate, Alessandro , title =. Trustworthy Multi-modal Foundation Models and. 2024 , url =

work page 2024
[46]

Reliable Weak-to-Strong Monitoring of

Kale, Neil and Zhang, Chen Bo Calvin and Zhu, Kevin and Aich, Ankit and Rodriguez, Paula and. Reliable Weak-to-Strong Monitoring of. 2025 , journal =. 2508.19461 , archivePrefix =

work page arXiv 2025
[47]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Korbak, Tomek and Balesni, Mikita and Barnes, Elizabeth and Bengio, Yoshua and Benton, Joe and Bloom, Joseph and Chen, Mark and Cooney, Alan and Dafoe, Allan and Dragan, Anca and Emmons, Scott and Evans, Owain and Farhi, David and Greenblatt, Ryan and Hendrycks, Dan and Hobbhahn, Marius and Hubinger, Evan and Irving, Geoffrey and Jenner, Erik and Kokotajl...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

2025 , journal =

Korbak, Tomek and Balesni, Mikita and Shlegeris, Buck and Irving, Geoffrey , title =. 2025 , journal =. 2504.05259 , archivePrefix =

work page arXiv 2025
[49]

2025 , journal =

Korbak, Tomek and Clymer, Joshua and Hilton, Benjamin and Shlegeris, Buck and Irving, Geoffrey , title =. 2025 , journal =. 2501.17315 , archivePrefix =

work page arXiv 2025
[50]

2025 , journal =

Kutasov, Jon and Loughridge, Chloe and Sun, Yuqi and Sleight, Henry and Shlegeris, Buck and Tracy, Tyler and Benton, Joe , title =. 2025 , journal =. 2511.02997 , archivePrefix =

work page arXiv 2025
[51]

Kutasov, Y

Kutasov, Jonathan and Sun, Yuqi and Colognese, Paul and van der Weij, Teun and Petrini, Linda and Zhang, Chen Bo Calvin and Hughes, John and Deng, Xiang and Sleight, Henry and Tracy, Tyler and Shlegeris, Buck and Benton, Joe , title =. 2025 , journal =. 2506.15740 , archivePrefix =

work page arXiv 2025
[52]

and Zhang, Hao and Stoica, Ion , title =

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , title =. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year =

work page
[53]

and Irving, Geoffrey and Farquhar, Sebastian and Cooney, Alan , title =

Lindner, David and Griffin, Charlie and Korbak, Tomek and Zimmermann, Roland S. and Irving, Geoffrey and Farquhar, Sebastian and Cooney, Alan , title =

work page
[54]

2025 , journal =

Loughridge, Chloe and Colognese, Paul and Griffin, Avery and Tracy, Tyler and Kutasov, Jon and Benton, Joe , title =. 2025 , journal =. 2511.02823 , archivePrefix =

work page arXiv 2025
[55]

2025 , journal =

Mallen, Alex and Griffin, Charlie and Wagner, Misha and Abate, Alessandro and Shlegeris, Buck , title =. 2025 , journal =. 2412.12480 , archivePrefix =

work page arXiv 2025
[56]

2021 , url =

Our New. 2021 , url =

work page 2021
[57]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Ruan, Yangjun and Dong, Honghua and Wang, Andrew and Pitis, Silviu and Zhou, Yongchao and Ba, Jimmy and Dubois, Yann and Maddison, Chris J. and Hashimoto, Tatsunori , title =. 2024 , journal =. 2309.15817 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

2025 , journal =

Taylor, Jordan and Black, Sid and Bowen, Dillon and Read, Thomas and Golechha, Satvik and Zelenka-Martin, Alex and Makins, Oliver and Kissane, Connor and Ayonrinde, Kola and Merizian, Jacob and Marks, Samuel and Cundy, Chris and Bloom, Joseph , title =. 2025 , journal =. 2512.07810 , archivePrefix =

work page arXiv 2025
[59]

2025 , journal =

Ward, Francis Rhys and van der Weij, Teun and G. 2025 , journal =. 2511.09904 , archivePrefix =

work page arXiv 2025
[60]

2024 , journal =

Wen, Jiaxin and Hebbar, Vivek and Larson, Caleb and Bhatt, Aryan and Radhakrishnan, Ansh and Sharma, Mrinank and Sleight, Henry and Feng, Shi and He, He and Perez, Ethan and others , title =. 2024 , journal =. 2411.17693 , archivePrefix =

work page arXiv 2024
[61]

International Conference on Learning Representations (ICLR) , year =

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , title =. International Conference on Learning Representations (ICLR) , year =

work page
[62]

2025 , journal =

Monitoring Decomposition Attacks in. 2025 , journal =. 2506.10949 , archivePrefix =

work page arXiv 2025
[63]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , title =. 2024 , journal =. 2403.02691 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2024
[64]

Async Control: Stress-testing Asynchronous Control Measures for

Stickland, Asa Cooper and Michelfeit, Jan and Mani, Arathi and Griffin, Charlie and Matthews, Ollie and Korbak, Tomek and Inglis, Rogan and Makins, Oliver and Cooney, Alan , journal=. Async Control: Stress-testing Asynchronous Control Measures for

work page
[65]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Geiping, Jonas and McLeish, Sean and Jain, Neel and Kirchenbauer, John and Singh, Siddharth and Bartoldson, Brian R. and Kailkhura, Bhavya and Bhatele, Abhinav and Goldstein, Tom , title =. 2025 , journal =. 2502.05171 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

International Conference on Learning Representations (ICLR) , year =

Li, Zhiyuan and Liu, Hong and Zhou, Denny and Ma, Tengyu , title =. International Conference on Learning Representations (ICLR) , year =. 2402.12875 , archivePrefix =

work page arXiv
[67]

2023 , journal =

Zhang, Xiaoyu and Zhang, Cen and Li, Tianlin and Huang, Yihao and Jia, Xiaojun and Xie, Xiaofei and Liu, Yang and Shen, Chao , title =. 2023 , journal =. 2312.10766 , archivePrefix =

work page arXiv 2023
[68]

Healthcare , VOLUME =

Mahajan, Palak and Uddin, Shahadat and Hajati, Farshid and Moni, Mohammad Ali , TITLE =. Healthcare , VOLUME =. 2023 , NUMBER =

work page 2023
[69]

and Bassey, Steven Ita and Musa, Umaru , title =

Ojoje, Oche Akiti and Aimufua, Gilbert I.O. and Bassey, Steven Ita and Musa, Umaru , title =. ISA Journal of Engineering and Technology (ISAJET) , volume =. 2025 , issn =. doi:10.5281/zenodo.17392097 , url =

work page doi:10.5281/zenodo.17392097 2025
[70]

2025 , issn =

A credit card fraud detection approach based on ensemble machine learning classifier with hybrid data sampling , journal =. 2025 , issn =. doi:https://doi.org/10.1016/j.mlwa.2025.100675 , url =

work page doi:10.1016/j.mlwa.2025.100675 2025
[71]

2018 , url =

Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya , title =. 2018 , url =

work page 2018
[72]

2025 , doi =

Maslej, Nestor and Fattorini, Loredana and Perrault, Raymond and Gil, Yolanda and Parli, Vanessa and Kariuki, Njenga and Capstick, Emily and Reuel, Anka and Brynjolfsson, Erik and Etchemendy, John and Ligett, Katrina and Lyons, Terah and Manyika, James and Niebles, Juan Carlos and Shoham, Yoav and Wald, Russell and Walsh, Toby and Hamrah, Armin and Santar...

work page 2025
[73]

A Comparative Analysis of Ensemble Protocols for

Ariel Monzon and Natan Katz and Inbar Shulman and Mike Vazan and Or Shemesh and Guy Porat , year =. A Comparative Analysis of Ensemble Protocols for

work page
[74]

2025 , pages =

Zhi-Hua Zhou , title =. 2025 , pages =

work page 2025
[75]

and Whitaker, Christopher J

Kuncheva, Ludmila I. and Whitaker, Christopher J. , title =. Machine Learning , volume =. 2003 , doi =

work page 2003
[76]

Wu, Yanzhao; Liu, Ling; Xie, Zhongwei; Chow, Ka-Ho; Wei, Wenqi. 2021.
