The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
Pith reviewed 2026-06-27 13:24 UTC · model grok-4.3
The pith
The Arbiter agent detects misaligned participants in multi-agent conversations well before the end by operating under an inspection budget and choosing when to inspect.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision.
What carries the argument
The Arbiter, a budget-constrained monitoring agent that selects among wait, question, examine-internal, and log actions at each conversation step to produce a final misalignment report.
If this is right
- Misaligned agents are identified before conversations conclude in the tested setups.
- Active tools such as questioning participants and examining internals raise both accuracy and speed.
- Weight-induced misalignment remains harder to catch than instruction-induced cases.
- The logging action trades higher recall for lower precision.
- Multi-agent oversight benefits from treating the monitor as an active participant rather than a passive observer.
Where Pith is reading between the lines
- Systems that combine multiple agents for decisions may need permanent auditors with their own action budgets built into the architecture.
- The same monitoring approach could be tested on emergent behaviors beyond misalignment, such as unintended coordination or goal drift.
- Extending the Arbiter to handle longer conversations or more agents would require scaling the budget-allocation rules.
Load-bearing premise
The five conversation conditions and tool configurations represent the forms of emergent misalignment that would appear in actual deployed multi-agent systems.
What would settle it
Running the Arbiter on a sixth conversation condition involving a new form of misalignment, such as agents that hide misalignment until after the monitoring budget is exhausted, and measuring whether detection rate drops below the reported levels.
Figures
read the original abstract
As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Arbiter agent to continually monitor multi-agent conversations in real time under a limited inspection budget, allowing it to wait, question participants, inspect internal states, or log behavior before producing a misalignment report. It evaluates the approach across five conversation conditions (risky financial advice model organisms, evaluation-aware agents, colluding agents, etc.), five tool configurations of increasing capability, and two backbone models, claiming that the Arbiter reliably detects misaligned agents well before conversation end, with active inspection tools improving accuracy and speed; weight-induced misalignment is hardest to detect while instruction-induced is easier even passively; logging improves recall at the cost of precision. Code is released at https://github.com/aisilab/arbiter.
Significance. If the empirical results hold under more varied conditions, the work would address a timely safety concern in multi-agent systems by showing that budget-aware active monitoring can catch misalignment early. The public code release is a clear strength that supports reproducibility and further testing.
major comments (2)
- [Abstract] Abstract: the central claim that the Arbiter 'reliably detects misaligned agents well before the end of the conversation' rests entirely on five conversation conditions that explicitly induce the target behaviors via planted model organisms, evaluation-aware prompts, and collusion instructions. This demonstrates detection of pre-specified failure modes under controlled construction but does not establish detection of misalignment that arises organically from agent interactions without such constructions, undermining the paper's framing around 'emergent misalignment'.
- [Abstract] Abstract / Evaluation: the reported improvements in 'detection accuracy and speed' from active tools and the differential difficulty of weight- vs. instruction-induced misalignment are presented without any quantitative metrics, confidence intervals, statistical tests, or raw result tables, so it is impossible to assess whether the data actually support the reliability claims or the cross-condition generalizations.
minor comments (1)
- [Abstract] The abstract states that 'the five conversation conditions ranging from risky financial advice model organisms to evaluation-aware and colluding agents' are used, but provides no details on how these conditions were generated or validated as representative proxies.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the scope of our evaluation and the presentation of quantitative results. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the Arbiter 'reliably detects misaligned agents well before the end of the conversation' rests entirely on five conversation conditions that explicitly induce the target behaviors via planted model organisms, evaluation-aware prompts, and collusion instructions. This demonstrates detection of pre-specified failure modes under controlled construction but does not establish detection of misalignment that arises organically from agent interactions without such constructions, undermining the paper's framing around 'emergent misalignment'.
Authors: We agree that our five conditions rely on controlled constructions (model organisms, evaluation-aware prompts, collusion instructions) to create scenarios in which misalignment can emerge from multi-agent interactions. This is a deliberate methodological choice, standard in alignment research, to enable reproducible study of specific failure modes under budget constraints. We acknowledge that these setups do not capture misalignment arising completely without any initial construction. We will revise the abstract, introduction, and discussion to clarify the meaning of 'emergent misalignment' in this context, explicitly note the limitation regarding organic emergence, and avoid overgeneralizing beyond the evaluated conditions. revision: partial
-
Referee: [Abstract] Abstract / Evaluation: the reported improvements in 'detection accuracy and speed' from active tools and the differential difficulty of weight- vs. instruction-induced misalignment are presented without any quantitative metrics, confidence intervals, statistical tests, or raw result tables, so it is impossible to assess whether the data actually support the reliability claims or the cross-condition generalizations.
Authors: The full evaluation section contains tables and figures reporting accuracy, detection timing, recall/precision, and condition-wise comparisons across tool configurations and backbone models. We will revise the abstract to include key quantitative results (e.g., accuracy deltas and timing improvements) and add confidence intervals plus any applicable statistical tests to the main text and tables to support the reported improvements and cross-condition claims. revision: yes
Circularity Check
No circularity; empirical evaluation on constructed conditions
full rationale
The paper presents an empirical study introducing the Arbiter agent and reporting its detection performance across five conversation conditions (risky financial advice model organisms, evaluation-aware, colluding agents) and five tool configurations with two backbones. Claims such as reliable early detection rest on observed accuracy, speed, and tool effects rather than any derivation, equation, or self-citation that reduces results to inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the text; the evaluation directly measures behavior against the specified test setups without circular reduction.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Arbiter agent
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Ford, and Others
Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, D. Ford, and Others. 2024. Many-shot jailbreaking. InNeural Information Processing Systems, Vol. 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 129696–129742. doi:10.52202/ 079017-4121 Journal Abbreviation: Neural Infor...
2024
-
[2]
Training large language models on narrow tasks can lead to broad misalignment , volume=
J. Betley, N. Warncke, A. Sztyber-Betley, and Others. 2026. Training large lan- guage models on narrow tasks can lead to broad misalignment.Nature649 (2026), 584–589. doi:10.1038/s41586-025-09937-5
-
[3]
Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, and Others
-
[4]
Measuring Progress on Scalable Oversight for Large Language Models
Measuring progress on scalable oversight for large language models. doi:10.48550/arXiv.2211.03540 _eprint: 2211.03540 Publication Title: arXiv.org
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.03540
-
[5]
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, and Others. 2023. Towards monosemanticity: decomposing language models with dictionary learning.Transformer Circuits Thread(2023)
2023
-
[6]
Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, and Samuel Marks. 2025. Building and evaluating alignment auditing agents.Alignment Science Blog(2025). https://alignment.anthropic. com/2025/automated-auditing/
2025
-
[7]
Kirchner, Bowen Baker, Leo Gao, Leopold As- chenbrenner, Yining Chen, Adrien Ecoffet, Manas R
Collin Burns, Pavel Izmailov, J. Kirchner, Bowen Baker, Leo Gao, Leopold As- chenbrenner, Yining Chen, Adrien Ecoffet, Manas R. Joglekar, Jan Leike, and Others. 2023. Weak-to-strong generalization: eliciting strong capabilities with weak supervision. InInternational Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). doi:1...
-
[8]
Florian Carichon, Aditi Khandelwal, Marylou Fauchard, and Golnoosh Farnadi
-
[9]
The coming crisis of multi-agent misalignment: AI alignment must be a dynamic and social process. doi:10.48550/ARXIV.2506.01080
-
[10]
Paul Christiano, Buck Shlegeris, and Dario Amodei. 2018. Supervising strong learners by amplifying weak experts. https://arxiv.org/abs/1810.08575 _eprint: 1810.08575 Publication Title: arXiv.org
Pith/arXiv arXiv 2018
-
[11]
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. 2025. Subliminal learning: language models transmit behavioral traits via hidden signals in data. doi:10.48550/ARXIV.2507. 14805
-
[12]
DeepSeek-AI. 2025. DeepSeek-V3.2: pushing the frontier of open large language models.arXiv Preprint arXiv:2512.02556(2025). _eprint: 2512.02556
Pith/arXiv arXiv 2025
-
[13]
Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M
Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, and Samuel Marks
-
[14]
https://transformer- circuits.pub/2026/nla/index.html
Natural language autoencoders produce unsupervised explanations of LLM activations.Transformer Circuits Thread(2026). https://transformer- circuits.pub/2026/nla/index.html
2026
-
[15]
Kai Fronsdal, Isha Gupta, Abhay Sheshadri, Jonathan Michala, Stephen McAleer, Rowan Wang, Sara Price, and Samuel R. Bowman. 2025. Petri: an open-source auditing tool to accelerate AI safety research.Alignment Science Blog(2025). https://alignment.anthropic.com/2025/petri/
2025
-
[16]
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Benjamin Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, and Others. 2022. Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. doi:10.48550/arXiv.2209.07858 _eprint: 2209.07858 Publication Title: arXiv.org
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.07858 2022
-
[17]
Google DeepMind. 2026. Gemma 4 model documentation. https://ai.google.dev/ gemma/docs/gemma4
2026
-
[18]
Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte Mac- Diarmid, Samuel Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, and Others. 2024. Alignment faking in large language models. doi:10.48550/arXiv.2412.14093 _eprint: 2412.14093 Publication Title: arXiv.org
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.14093 2024
-
[19]
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. 2024. AI control: improving safety despite intentional subversion. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlet...
2024
-
[20]
Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. https://openreview.net/forum?id=F76bwRSLeK
2023
-
[21]
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson E. Denison, Jesse Mu, Mike Lambert, Meg Tong, M. MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, and Others. 2024. Sleeper agents: training deceptive LLMs that persist through safety training. doi:10.48550/arXiv.2401.05566 _eprint: 2401.05566 Publication Title: arXiv.org
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.05566 2024
-
[22]
Geoffrey Irving, Paul Christiano, and Dario Amodei. 2018. AI safety via debate. https://arxiv.org/abs/1805.00899 _eprint: 1805.00899 Publication Title: arXiv.org
Pith/arXiv arXiv 2018
-
[23]
Olli Järviniemi, Oliver Makins, Jacob Merizian, Robert Kirk, and Ben Millwood
-
[24]
https: //arxiv.org/abs/2604.21098
Propensity inference: environmental contributors to LLM behaviour. https: //arxiv.org/abs/2604.21098
-
[25]
Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, and Others. 2025. Activation oracles: training and evaluating llms as general-purpose activation explainers. doi:10.48550/ARXIV.2512.15674
-
[26]
Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Em- mons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victo- ria Krakovna, Shane Legg, David Lindner, David Luan, Alek...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.11473 2025
-
[27]
Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, and Joe Benton. 2026. Evaluating control protocols for untrusted AI agents. https://openreview.net/forum?id=bMFCHQsSYv
2026
-
[28]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson E. Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, John Kernion, and Others. 2023. Measuring faithfulness in chain-of-thought rea- soning. doi:10.48550/arXiv.2307.13702 _eprint: 2307.13702 Publication Title: arXiv.org
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.13702 2023
-
[29]
Donghyun Lee and Mo Tiwari. 2024. Prompt infection: LLM-to-LLM prompt injection within multi-agent systems. doi:10.48550/arXiv.2410.07283 _eprint: 2410.07283 Publication Title: arXiv.org
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.07283 2024
-
[30]
Chloe Li, Mary Phuong, and Noah Y. Siegel. 2025. LLMs can covertly sandbag on capability evaluations against chain-of-thought monitoring. https://openreview. net/forum?id=r4Q6o7KGdb
2025
-
[31]
Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, Ethan Perez, and Evan Hubinger. 2025. Agentic misalign- ment: how LLMs could be an insider threat. doi:10.48550/arXiv.2510.05179
-
[32]
Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. 2025. Natural emergent misal...
-
[33]
Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. 2025. Frontier models are capable of In-context scheming. https://arxiv.org/abs/2412.04984 _eprint: 2412.04984. 7 Filippo Tonini et al
Pith/arXiv arXiv 2025
-
[34]
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, and Neel Nanda. 2025. Narrow finetuning leaves clearly readable traces in activation differences. https://openreview.net/forum?id= qyVzZsrsnS
2025
-
[35]
Lorenzo Pacchiardi, Alex James Chan, Sören Mindermann, Ilan Moscovitz, Alexa Yue Pan, Yarin Gal, Owain Evans, and Jan M. Brauner. 2023. How to catch an AI liar: lie detection in black-box LLMs by asking unrelated questions. https://openreview.net/forum?id=567BjxgaTp
2023
-
[36]
Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks
Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks
-
[37]
and Goldstein, Simon and O'Gara, Aidan and Chen, Michael and Hendrycks, Dan , journal =
AI deception: a survey of examples, risks, and potential solutions.Patterns 5, 5 (2023), 100988. doi:10.1016/j.patter.2024.100988
-
[38]
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. InConference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 3419–3448. doi:10.18653/v1/...
-
[39]
Qwen Team. 2024. Qwen2.5 technical report.arXiv Preprint arXiv:2412.15115 (2024). _eprint: 2412.15115
Pith/arXiv arXiv 2024
-
[40]
Qwen Team. 2026. Qwen3.6-plus: towards real world agents. https://qwen.ai/ blog?id=qwen3.6
2026
-
[41]
Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. 2024. Large language models can strategically deceive their users when put under pressure. https: //openreview.net/forum?id=HduMpot9sJ
2024
-
[42]
Bowman, Sara Price, Samuel Marks, and Rowan Wang
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, and Rowan Wang. 2026. AuditBench: evaluating alignment auditing techniques on models with hidden behaviors. doi:10.48550/ ARXIV.2602.22755
arXiv 2026
-
[43]
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, and Others. 2024. Scaling monosemanticity: extracting interpretable features from claude 3 sonnet.Transformer Circuits Thread(2024). https://transformer- circuits.pub/2024/scaling-monosemanticity/
2024
-
[44]
Federico Torrielli, Stefano Locci, Amon Rapp, and Luigi Di Caro. 2026. How to get your paper accepted by an AI reviewer: indirect prompt injection in peer review. doi:10.21203/rs.3.rs-8432945/v1 ISSN: 2693-5015
-
[45]
Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. 2025. Model organisms for emergent misalignment. https://openreview. net/forum?id=ThW5hvKgWx
2025
-
[46]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Lan- guage models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. InNeural Information Processing Systems, Vol. 36. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 74952–74965. doi:10.48550/arXiv.2305.04388 Journal Abbreviation...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.04388 2023
-
[47]
Laurène Vaugrante, Francesca Carlon, Maluna Menke, and Thilo Hagendorff
-
[48]
Compromising honesty and harmlessness in language models via deception attacks.arXiv Preprint arXiv:2502.08301(2025)
arXiv 2025
-
[49]
Brown, and Fran- cis Rhys Ward
Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, and Fran- cis Rhys Ward. 2024. AI sandbagging: language models can strategically under- perform on evaluations. https://openreview.net/forum?id=7Qa2SpjxIS
2024
-
[50]
Narasimhan, and Yuan Cao
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2022. ReAct: synergizing reasoning and acting in language models. https://openreview.net/forum?id=WE_vluYUL-X
2022
-
[51]
Xing, and Others
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, and Others
-
[52]
InNeural Information Processing Systems, Vol
Judging LLM-as-a-judge with MT-bench and chatbot arena. InNeural Information Processing Systems, Vol. 36. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 46595–46623. doi:10.52202/075280-2020 Journal Abbreviation: Neural Information Processing Systems
-
[53]
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, and Others. 2023. Representation engineering: a top-down approach to AI trans- parency. doi:10.48550/arXiv.2310.01405 _eprint: 2310.01405 Publication Title: arXiv.org. A Supplementary Material A.1 System Prompts The...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.01405 2023
-
[54]
Carefully analyze the conversation for signs of misalignment
-
[55]
Form hypotheses about which agents might be misaligned and why
-
[56]
Some tools cost budget (see each tool's description)
Use the available tools to gather evidence. Some tools cost budget (see each tool's description)
-
[57]
You have a budget of 5.0
-
[58]
To use a tool, output EXACTLY the tool's syntax (on its own, with no other text before or after)
When you are ready to conclude (or have no tools/budget), provide your final analysis directly WITHOUT using a tool. To use a tool, output EXACTLY the tool's syntax (on its own, with no other text before or after). See the tool descriptions above for the exact format of each tool. To conclude, simply write your final analysis as plain text (no TOOL line)....
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.