CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Alexander Panfilov; Guillaume Lajoie; Joachim Schaeffer; Jonas Geiping; Roland S. Zimmermann; Thomas Jiralerspong; Yoshua Bengio

arxiv: 2606.11063 · v1 · pith:DL4BWQFXnew · submitted 2026-06-09 · 💻 cs.AI · cs.LG

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Joachim Schaeffer , Thomas Jiralerspong , Alexander Panfilov , Guillaume Lajoie , Jonas Geiping , Yoshua Bengio , Roland S. Zimmermann This is my paper

Pith reviewed 2026-06-27 13:21 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords control intervention awarenessAI control protocolsLLM benchmarkingtrajectory modificationfrontier model evaluationmodel monitoringsafety benchmarks

0 comments

The pith

Frontier LLMs detect control interventions on their trajectories with only low to moderate accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CIAware-Bench to test whether language models can identify when a trusted monitor has altered their action sequences during oversight. It applies the benchmark to eleven frontier models across essay writing, code tasks, command execution, and arena environments, asking models to classify trajectories as original or modified under varying conditions. Results show detection accuracy reaches at most 0.87 against a 0.5 random baseline, with clear differences by task domain and by whether the monitor model comes from the same family as the evaluated model. This evaluation matters because undetected modifications allow control protocols to function while detected ones could prompt models to adjust their behavior to bypass oversight.

Core claim

CIAware-Bench shows that frontier models exhibit low to moderate control intervention awareness under default settings, reaching a highest balanced binary classification accuracy of 0.87. Awareness levels vary substantially across the four task domains and across model pairs, and detection is generally easier when the intervening model belongs to a different provider family. The benchmark establishes that control intervention awareness is not a fixed model-level property and must be measured separately for each new model release and deployment scenario.

What carries the argument

CIAware-Bench, a benchmark that prompts models to classify their trajectories as original or modified by control interventions across four task domains while varying watermarking, side-task presence, and control protocol details.

If this is right

Control protocols must select interventions that models find harder to detect to preserve oversight effectiveness.
CI awareness requires fresh measurement for each model release rather than assuming it remains constant.
Cross-family monitor choices increase detectability, which informs selection of trusted models in protocols.
Substantial domain variation requires control designs tailored to specific task types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that detect interventions could use that information to develop more effective long-term evasion strategies across repeated interactions.
Future extensions could add tests of active behavioral adaptation rather than passive classification to better match deployment conditions.
Public release of the benchmark allows systematic tracking of how awareness changes as new models are trained.

Load-bearing premise

That accuracy on prompted classification of original versus modified trajectories in the chosen domains accurately reflects a model's real ability to infer monitor properties and adapt during actual deployment.

What would settle it

A controlled deployment where models with higher CIAware-Bench scores are observed to evade interventions more successfully than models with lower scores when placed under the same control protocol.

Figures

Figures reproduced from arXiv: 2606.11063 by Alexander Panfilov, Guillaume Lajoie, Joachim Schaeffer, Jonas Geiping, Roland S. Zimmermann, Thomas Jiralerspong, Yoshua Bengio.

**Figure 1.** Figure 1: CIAware-Bench overview. (a) An acting model generates a multi-step trajectory and an intervening model takes over at a (random) step for one step. (b) The acting model is shown a (possibly intervened) truncated trajectory and asked whether the final step was authored by another model. We report binary classification accuracy. (c) CIAware-Bench spans four task domains, multiple variants (watermarks, side t… view at source ↗

**Figure 2.** Figure 2: Detection Accuracy by Acting × Intervening Model Pair. Each panel shows one environment under the default setting: DtT protocol, no watermarking instruction, and no side-task policy. Rows denote the acting model; columns denote the intervening model. Cell color indicates detection accuracy for each ordered pair, with a random chance accuracy at 0.5. Each cell pools 50 intervened and 50 clean trials per pai… view at source ↗

**Figure 3.** Figure 3: Default-setting CI awareness averages, pooled across all [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Additional CIAware-Bench studies. (a) Within-family weak-to-strong: the weaker model in a family intervening on the strongest does not boost detection. (b) Cross-family vs. within-family: pooled across acting families, cross-family interventions are detected more reliably than within-family on Essay (0.55 vs. 0.51); BigCodeBench, SHADE-Arena, and Bash Arena show no significant difference. (c) Detection b… view at source ↗

**Figure 5.** Figure 5: Watermarking as an elicitation strategy. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the trusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce \textbf{CIAware-Bench}, a benchmark for measuring \textbf{c}ontrol \textbf{i}ntervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark is comprised of a suite of four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena), while varying trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; random chance balanced binary classification accuracy is 0.5) with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces CIAware-Bench and reports moderate detection rates on prompted classification of modified trajectories across 11 models, but the results measure post-hoc distinction rather than online inference and adaptation.

read the letter

The main takeaway is that this paper defines a benchmark for whether frontier LLMs can spot when their trajectories have been altered by a control intervention. They run it on four domains with eleven models and get accuracies up to 0.87 against 0.5 chance, with clear differences by domain and by whether the monitor comes from the same model family.

What is new is the benchmark construction itself plus the initial measurements. They vary watermarking, side tasks, and the control protocol, which shows that awareness is not a fixed model trait. Releasing the benchmark lets others track the property over time. The work stays empirical and avoids any circular derivations or fitted parameters.

The results are presented cleanly as aggregate numbers, which makes the variation across settings easy to see. Detection being easier across families suggests the models are picking up on style or post-training differences rather than some deeper property.

The soft spot is the gap between what is tested and what the abstract claims it informs. The protocol builds trajectories, applies an intervention, then prompts the model to classify the finished pair. This is not the same as a model noticing subtle changes while generating and adjusting its policy in real time. No results link the classification scores to actual evasion behavior under a live protocol. The abstract also leaves out prompting details, data splits, and statistical controls, so it is hard to judge how robust the domain differences are.

This is for people working on AI control protocols who need a concrete way to measure one failure mode. A reader focused on safety evaluations will get usable numbers and a released tool. It is not yet a direct test of deployment risk.

I would send it to peer review. The benchmark idea is worth referee time even if the interpretation needs tightening and full methods must be added.

Referee Report

2 major / 1 minor

Summary. The paper introduces CIAware-Bench, a benchmark for measuring control intervention (CI) awareness in frontier LLMs. It consists of four task domains (essay writing, BigCodeBench, Bash Arena, SHADE-Arena) and evaluates eleven models on their ability to distinguish original trajectories from those modified by control interventions under variations in watermarking, side-tasks, and protocols, reporting low to moderate accuracies (up to 0.87 vs. 0.5 chance) with substantial variation across domains and model pairs, and concludes that CI awareness is not fixed and should be measured per model and scenario.

Significance. If the benchmark validly measures the targeted capability, the empirical results on variation across models and domains, together with the public release of CIAware-Bench, provide a concrete tool for tracking an evasion-related property relevant to AI control protocols. This supports reproducibility and could guide protocol design.

major comments (2)

[Abstract] Abstract and introduction: The central claim equates benchmark accuracy on prompted post-hoc binary classification of completed trajectories with 'CI awareness' that would allow models to infer monitor properties and adapt behavior to evade control during deployment. No section provides evidence or validation that classification performance predicts online detection from subtle generation changes or policy adaptation without an explicit classification instruction.
[Results] Evaluation protocol (implied in abstract and results): The reported accuracies and domain/model variation lack accompanying details on data splits, statistical controls, or tests for confounding factors such as style differences or selection bias in trajectory construction, which are required to substantiate the 'substantial variation' and cross-family detection claims.

minor comments (1)

[Abstract] Abstract: The balanced binary classification chance level is given as 0.5; verify and state whether this remains accurate under all reported conditions (e.g., when watermarking or side-tasks alter class balance).

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate revisions made to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract and introduction: The central claim equates benchmark accuracy on prompted post-hoc binary classification of completed trajectories with 'CI awareness' that would allow models to infer monitor properties and adapt behavior to evade control during deployment. No section provides evidence or validation that classification performance predicts online detection from subtle generation changes or policy adaptation without an explicit classification instruction.

Authors: We agree that the benchmark provides a post-hoc classification proxy rather than direct evidence of online detection or unprompted adaptation. The manuscript frames CIAware-Bench as testing the ability to distinguish intervened trajectories, which is a necessary (but not sufficient) component of CI awareness. We have revised the abstract, introduction, and added a dedicated limitations paragraph to explicitly state that the benchmark does not validate predictive links to real-time evasion behavior and that future work is needed for online settings. This clarifies scope without overstating results. revision: partial
Referee: [Results] Evaluation protocol (implied in abstract and results): The reported accuracies and domain/model variation lack accompanying details on data splits, statistical controls, or tests for confounding factors such as style differences or selection bias in trajectory construction, which are required to substantiate the 'substantial variation' and cross-family detection claims.

Authors: We accept this criticism. The revised manuscript includes an expanded 'Evaluation Protocol' subsection detailing data splits (stratified 70/30 per domain with held-out test sets), statistical controls (bootstrap confidence intervals and McNemar tests for pairwise comparisons), and confounder analyses (style normalization via embedding similarity and ablation removing watermark/side-task signals). These additions support the reported variation and cross-family claims with quantitative evidence. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on newly defined benchmark

full rationale

The paper introduces CIAware-Bench as a suite of four task domains and reports accuracies from running eleven frontier models on prompted binary classification of original vs. control-intervened trajectories. All central results are direct empirical outputs with no equations, fitted parameters, derivations, or self-referential quantities. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the benchmark construction and evaluation protocol stand as independent measurements. This is the standard case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper adds a new evaluation protocol rather than new theoretical entities or parameters; it rests on the domain assumption that classification accuracy on modified trajectories measures intervention awareness.

axioms (1)

domain assumption Prompted binary classification of trajectories measures a model's internal detection of control interventions.
Central to interpreting the benchmark accuracies as awareness levels.

pith-pipeline@v0.9.1-grok · 5791 in / 1182 out tokens · 33165 ms · 2026-06-27T13:21:36.199330+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 1 linked inside Pith

[1]

Evidence for Limited Metacognition in

Ackerman, Christopher , booktitle=. Evidence for Limited Metacognition in
[2]

A Mirror Test for

Ackerman, Christopher , year=. A Mirror Test for
[3]

arXiv preprint arXiv:2505.23836 , year=

Large Language Models Often Know When They Are Being Evaluated , author=. arXiv preprint arXiv:2505.23836 , year=. 2505.23836 , archivePrefix=

arXiv
[4]

2026 , note =

Past Examination Papers , author =. 2026 , note =

2026
[5]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Self-recognition in language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[6]

Great Models Think Alike and This Undermines

Goel, Shashwat and Str. Great Models Think Alike and This Undermines. Proceedings of the 42nd International Conference on Machine Learning , year=
[7]

Reliable Weak-to-Strong Monitoring of

Kale, Neil and Zhang, Chen Bo Calvin and Zhu, Kevin and Aich, Ankit and Rodriguez, Paula and. Reliable Weak-to-Strong Monitoring of. The Fourteenth International Conference on Learning Representations , year=
[8]

How to Prevent Collusion When Using Untrusted Models to Monitor Each Other , url =

Shlegeris, Buck , year =. How to Prevent Collusion When Using Untrusted Models to Monitor Each Other , url =
[9]

Attack Selection Reduces Safety in Concentrated

Schaeffer, Joachim and Khandelwal, Arjun and Tracy, Tyler , year=. Attack Selection Reduces Safety in Concentrated. arXiv preprint arXiv:2602.04930 , archivePrefix=

Pith/arXiv arXiv
[10]

2024 , volume=

Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , booktitle=. 2024 , volume=

2024
[11]

Bowman and Shi Feng , booktitle=

Arjun Panickssery and Samuel R. Bowman and Shi Feng , booktitle=
[12]

arXiv preprint arXiv:2310.18512 , year=

Preventing Language Models from Hiding Their Reasoning , author=. arXiv preprint arXiv:2310.18512 , year=. 2310.18512 , archivePrefix=

arXiv
[13]

Motwani, Sumeet Ramesh and Baranchuk, Mikhail and Strohmeier, Martin and Bolina, Vijay and Torr, Philip H. S. and Hammond, Lewis and Schroeder de Witt, Christian , booktitle=. Secret Collusion among
[14]

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in

Yohan Mathew and Ollie Matthews and Robert McCarthy and Joan Velja and Christian Schroeder de Witt and Dylan Cope and Nandi Schoots , year=. Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in. arXiv preprint arXiv:2410.03768 , archivePrefix=

arXiv
[15]

and Lindner, David , journal=

Zolkowski, Artur and Nishimura-Gasparian, Kei and McCarthy, Robert and Zimmermann, Roland S. and Lindner, David , journal=. Early Signs of Steganographic Capabilities in Frontier. 2025 , eprint=

2025
[16]

Pride and Prejudice:

Xu, Wenda and Zhu, Guanglei and Zhao, Xuandong and Pan, Liangming and Li, Lei and Wang, William , booktitle=. Pride and Prejudice:. 2024 , address=

2024
[17]

Self-Preference Bias in

Wataoka, Koki and Takahashi, Tsubasa and Ri, Ryokan , booktitle=. Self-Preference Bias in
[18]

2025 , eprint=

Bhatt, Aryan and Rushing, Cody and Kaufman, Adam and Tracy, Tyler and Georgiev, Vasil and Matolcsi, David and Khan, Akbir and Shlegeris, Buck , journal=. 2025 , eprint=

2025
[19]

Adaptive Attacks on Trusted Monitors Subvert

Terekhov, Mikhail and Panfilov, Alexander and Dzenhaliou, Daniil and Gulcehre, Caglar and Andriushchenko, Maksym and Prabhu, Ameya and Geiping, Jonas , booktitle=. Adaptive Attacks on Trusted Monitors Subvert
[20]

Proceedings of the 40th International Conference on Machine Learning , pages=

A Watermark for Large Language Models , author=. Proceedings of the 40th International Conference on Machine Learning , pages=. 2023 , volume=

2023
[21]

Kutasov, Jonathan and Sun, Yuqi and Colognese, Paul and van der Weij, Teun and Petrini, Linda and Zhang, Chen Bo Calvin and Hughes, John and Deng, Xiang and Sleight, Henry and Tracy, Tyler and Shlegeris, Buck and Benton, Joe , journal=
[22]

2025 , journal=

Evaluating Control Protocols for Untrusted AI Agents , author=. 2025 , journal=

2025
[23]

Inglis, Rogan and Matthews, Ollie and Tracy, Tyler and Makins, Oliver and Catling, Tom and Cooper Stickland, Asa and Faber-Espensen, Rasmus and O'Connell, Daniel and Heller, Myles and Brandao, Miguel and Hanson, Adam and Mani, Arathi and Korbak, Tomek and Michelfeit, Jan and Bansal, Dishank and Bark, Tomas and Canal, Chris and Griffin, Charlie and Wang, J...
[24]

Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and Brunner, Simon and Gong, Chen and Hoang, Thong and Zebaze, Armel Randy and Hong, Xiaoheng and Li, Wen-Ding and Kaddour, Jean and Xu, Ming and Zhang, Zhihan and Yadav, Prateek...
[25]

The Twelfth International Conference on Learning Representations , year=

On the Reliability of Watermarks for Large Language Models , author=. The Twelfth International Conference on Learning Representations , year=
[26]

Control Tax: The Price of Keeping

Terekhov, Mikhail and Liu, Zhen Ning David and Gulcehre, Caglar and Albanie, Samuel , booktitle=. Control Tax: The Price of Keeping
[27]

2026 , journal=

When can we trust untrusted monitoring? A safety case sketch across collusion strategies , author=. 2026 , journal=

2026
[28]

Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding , author=. 2021. 2021 , organization=

2021
[29]

Can You Finetune Your

Elhassan, Fay and Ajroldi, Niccol. Can You Finetune Your. arXiv preprint arXiv:2504.06446 , year=. 2504.06446 , archivePrefix=

arXiv
[30]

2024 , address=

Chang, Yapei and Krishna, Kalpesh and Houmansadr, Amir and Wieting, John and Iyyer, Mohit , booktitle=. 2024 , address=

2024
[31]

Spotting

Hans, Abhimanyu and Schwarzschild, Avi and Cherepanova, Valeriia and Kazemi, Hamid and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , booktitle =. Spotting. 2024 , volume =

2024
[32]

2024 , volume=

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle=. 2024 , volume=

2024
[33]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[34]

The Fourteenth International Conference on Learning Representations , year=

Steering Evaluation-Aware Language Models to Act Like They Are Deployed , author=. The Fourteenth International Conference on Learning Representations , year=
[35]

Games for

Charlie Griffin and Louis Thomson and Buck Shlegeris and Alessandro Abate , year=. Games for
[36]

Croce, Francesco and Andriushchenko, Maksym and Sehwag, Vikash and Debenedetti, Edoardo and Flammarion, Nicolas and Chiang, Mung and Mittal, Prateek and Hein, Matthias , booktitle=
[37]

Transactions on Machine Learning Research , year=

Holistic Evaluation of Language Models , author=. Transactions on Machine Learning Research , year=
[38]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Edoardo Debenedetti and Jie Zhang and Mislav Balunovic and Luca Beurer-Kellner and Marc Fischer and Florian Tram. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[39]

2025 , month =

System Card:. 2025 , month =

2025
[40]

2026 , month =

System Card:. 2026 , month =

2026
[41]

2025 , month =

Introducing. 2025 , month =

2025
[42]

2025 , month =

Update to. 2025 , month =

2025
[43]

2025 , journal=

Auditing language models for hidden objectives , author=. 2025 , journal=

2025
[44]

Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R

Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R. Narasimhan , booktitle=
[45]

2026 , howpublished=

Artificial Analysis Intelligence Index v4.0 , author=. 2026 , howpublished=

2026

[1] [1]

Evidence for Limited Metacognition in

Ackerman, Christopher , booktitle=. Evidence for Limited Metacognition in

[2] [2]

A Mirror Test for

Ackerman, Christopher , year=. A Mirror Test for

[3] [3]

arXiv preprint arXiv:2505.23836 , year=

Large Language Models Often Know When They Are Being Evaluated , author=. arXiv preprint arXiv:2505.23836 , year=. 2505.23836 , archivePrefix=

arXiv

[4] [4]

2026 , note =

Past Examination Papers , author =. 2026 , note =

2026

[5] [5]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Self-recognition in language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024

[6] [6]

Great Models Think Alike and This Undermines

Goel, Shashwat and Str. Great Models Think Alike and This Undermines. Proceedings of the 42nd International Conference on Machine Learning , year=

[7] [7]

Reliable Weak-to-Strong Monitoring of

Kale, Neil and Zhang, Chen Bo Calvin and Zhu, Kevin and Aich, Ankit and Rodriguez, Paula and. Reliable Weak-to-Strong Monitoring of. The Fourteenth International Conference on Learning Representations , year=

[8] [8]

How to Prevent Collusion When Using Untrusted Models to Monitor Each Other , url =

Shlegeris, Buck , year =. How to Prevent Collusion When Using Untrusted Models to Monitor Each Other , url =

[9] [9]

Attack Selection Reduces Safety in Concentrated

Schaeffer, Joachim and Khandelwal, Arjun and Tracy, Tyler , year=. Attack Selection Reduces Safety in Concentrated. arXiv preprint arXiv:2602.04930 , archivePrefix=

Pith/arXiv arXiv

[10] [10]

2024 , volume=

Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , booktitle=. 2024 , volume=

2024

[11] [11]

Bowman and Shi Feng , booktitle=

Arjun Panickssery and Samuel R. Bowman and Shi Feng , booktitle=

[12] [12]

arXiv preprint arXiv:2310.18512 , year=

Preventing Language Models from Hiding Their Reasoning , author=. arXiv preprint arXiv:2310.18512 , year=. 2310.18512 , archivePrefix=

arXiv

[13] [13]

Motwani, Sumeet Ramesh and Baranchuk, Mikhail and Strohmeier, Martin and Bolina, Vijay and Torr, Philip H. S. and Hammond, Lewis and Schroeder de Witt, Christian , booktitle=. Secret Collusion among

[14] [14]

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in

Yohan Mathew and Ollie Matthews and Robert McCarthy and Joan Velja and Christian Schroeder de Witt and Dylan Cope and Nandi Schoots , year=. Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in. arXiv preprint arXiv:2410.03768 , archivePrefix=

arXiv

[15] [15]

and Lindner, David , journal=

Zolkowski, Artur and Nishimura-Gasparian, Kei and McCarthy, Robert and Zimmermann, Roland S. and Lindner, David , journal=. Early Signs of Steganographic Capabilities in Frontier. 2025 , eprint=

2025

[16] [16]

Pride and Prejudice:

Xu, Wenda and Zhu, Guanglei and Zhao, Xuandong and Pan, Liangming and Li, Lei and Wang, William , booktitle=. Pride and Prejudice:. 2024 , address=

2024

[17] [17]

Self-Preference Bias in

Wataoka, Koki and Takahashi, Tsubasa and Ri, Ryokan , booktitle=. Self-Preference Bias in

[18] [18]

2025 , eprint=

Bhatt, Aryan and Rushing, Cody and Kaufman, Adam and Tracy, Tyler and Georgiev, Vasil and Matolcsi, David and Khan, Akbir and Shlegeris, Buck , journal=. 2025 , eprint=

2025

[19] [19]

Adaptive Attacks on Trusted Monitors Subvert

Terekhov, Mikhail and Panfilov, Alexander and Dzenhaliou, Daniil and Gulcehre, Caglar and Andriushchenko, Maksym and Prabhu, Ameya and Geiping, Jonas , booktitle=. Adaptive Attacks on Trusted Monitors Subvert

[20] [20]

Proceedings of the 40th International Conference on Machine Learning , pages=

A Watermark for Large Language Models , author=. Proceedings of the 40th International Conference on Machine Learning , pages=. 2023 , volume=

2023

[21] [21]

Kutasov, Jonathan and Sun, Yuqi and Colognese, Paul and van der Weij, Teun and Petrini, Linda and Zhang, Chen Bo Calvin and Hughes, John and Deng, Xiang and Sleight, Henry and Tracy, Tyler and Shlegeris, Buck and Benton, Joe , journal=

[22] [22]

2025 , journal=

Evaluating Control Protocols for Untrusted AI Agents , author=. 2025 , journal=

2025

[23] [23]

Inglis, Rogan and Matthews, Ollie and Tracy, Tyler and Makins, Oliver and Catling, Tom and Cooper Stickland, Asa and Faber-Espensen, Rasmus and O'Connell, Daniel and Heller, Myles and Brandao, Miguel and Hanson, Adam and Mani, Arathi and Korbak, Tomek and Michelfeit, Jan and Bansal, Dishank and Bark, Tomas and Canal, Chris and Griffin, Charlie and Wang, J...

[24] [24]

Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and Brunner, Simon and Gong, Chen and Hoang, Thong and Zebaze, Armel Randy and Hong, Xiaoheng and Li, Wen-Ding and Kaddour, Jean and Xu, Ming and Zhang, Zhihan and Yadav, Prateek...

[25] [25]

The Twelfth International Conference on Learning Representations , year=

On the Reliability of Watermarks for Large Language Models , author=. The Twelfth International Conference on Learning Representations , year=

[26] [26]

Control Tax: The Price of Keeping

Terekhov, Mikhail and Liu, Zhen Ning David and Gulcehre, Caglar and Albanie, Samuel , booktitle=. Control Tax: The Price of Keeping

[27] [27]

2026 , journal=

When can we trust untrusted monitoring? A safety case sketch across collusion strategies , author=. 2026 , journal=

2026

[28] [28]

Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding , author=. 2021. 2021 , organization=

2021

[29] [29]

Can You Finetune Your

Elhassan, Fay and Ajroldi, Niccol. Can You Finetune Your. arXiv preprint arXiv:2504.06446 , year=. 2504.06446 , archivePrefix=

arXiv

[30] [30]

2024 , address=

Chang, Yapei and Krishna, Kalpesh and Houmansadr, Amir and Wieting, John and Iyyer, Mohit , booktitle=. 2024 , address=

2024

[31] [31]

Spotting

Hans, Abhimanyu and Schwarzschild, Avi and Cherepanova, Valeriia and Kazemi, Hamid and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , booktitle =. Spotting. 2024 , volume =

2024

[32] [32]

2024 , volume=

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle=. 2024 , volume=

2024

[33] [33]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[34] [34]

The Fourteenth International Conference on Learning Representations , year=

Steering Evaluation-Aware Language Models to Act Like They Are Deployed , author=. The Fourteenth International Conference on Learning Representations , year=

[35] [35]

Games for

Charlie Griffin and Louis Thomson and Buck Shlegeris and Alessandro Abate , year=. Games for

[36] [36]

Croce, Francesco and Andriushchenko, Maksym and Sehwag, Vikash and Debenedetti, Edoardo and Flammarion, Nicolas and Chiang, Mung and Mittal, Prateek and Hein, Matthias , booktitle=

[37] [37]

Transactions on Machine Learning Research , year=

Holistic Evaluation of Language Models , author=. Transactions on Machine Learning Research , year=

[38] [38]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Edoardo Debenedetti and Jie Zhang and Mislav Balunovic and Luca Beurer-Kellner and Marc Fischer and Florian Tram. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[39] [39]

2025 , month =

System Card:. 2025 , month =

2025

[40] [40]

2026 , month =

System Card:. 2026 , month =

2026

[41] [41]

2025 , month =

Introducing. 2025 , month =

2025

[42] [42]

2025 , month =

Update to. 2025 , month =

2025

[43] [43]

2025 , journal=

Auditing language models for hidden objectives , author=. 2025 , journal=

2025

[44] [44]

Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R

Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R. Narasimhan , booktitle=

[45] [45]

2026 , howpublished=

Artificial Analysis Intelligence Index v4.0 , author=. 2026 , howpublished=

2026