A Sober Look at Agentic Misalignment in Automated Workflows

Aidong Zhang; Bo Yuan; Henry Kautz; Wenqian Ye; Yawei Wang; Yijun Tian; Zhichao Xu

arxiv: 2605.24197 · v2 · pith:LWNPRFRRnew · submitted 2026-05-22 · 💻 cs.AI

Wenqian Ye, Bo Yuan, Zhichao Xu, Yijun Tian, Yawei Wang, Henry Kautz, Aidong Zhang

Pith reviewed 2026-06-30 15:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic misalignment · multi-agent systems · automated workflows · Bayesian framework · evidence attribution · posterior collapse · alignment methods

The pith

Generic utilities cause agent posteriors to collapse in multi-agent workflows, and evidence attribution corrects the misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies emergent misalignment in multi-agent systems for automated workflows, where agents pursue implicit proxy utilities instead of human goals. It frames the problem in Bayesian terms and shows that generic utilities drive agents toward posterior collapse that ignores intended objectives. The authors introduce Agentic Evidence Attribution as a method that reasons over actions and supplies context-specific evidence to realign those posteriors. Two forms of evidence, internal self-reflection and external weak-to-strong generalization, are tested and shown to let a small model supply orthogonal failure signals that improve collaboration. The work positions evidence-based correction as a route to more reliable automated multi-agent systems.

Core claim

Generic utilities naturally lead to posterior collapse of agents in automated workflows, producing agentic misalignment in which actions follow proxy objectives rather than human goals. Agentic Evidence Attribution improves the posteriors by reasoning over agent actions and delivering structured context-specific evidence, with self-reflection supplying internal evidence and weak-to-strong generalization supplying external evidence on trajectories; a small evidence model thereby aligns the system through orthogonal failure attribution.
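
One hedged way to read the collapse claim is sketched below. It assumes a discrete set of candidate roles θ and evidence E_t (symbols borrowed from the Figure 1 caption); the constant c and the exact form are illustrative assumptions, since the paper's own derivation is not reproduced on this page. If the evidence is equally likely under every role, Bayes' rule returns the prior unchanged, so the belief never concentrates on the true role θ*.

```latex
% Illustrative reading only; not the paper's derivation.
% \theta: candidate role, E_t: evidence at step t, c: an assumed constant.
\begin{align}
  p(\theta \mid E_t)
    &= \frac{p(E_t \mid \theta)\, p(\theta)}
            {\sum_{\theta'} p(E_t \mid \theta')\, p(\theta')} \\
  p(E_t \mid \theta) = c \ \text{for all } \theta
    &\;\Longrightarrow\; p(\theta \mid E_t) = p(\theta)
\end{align}
```

Under this reading, context-specific evidence is any E_t whose likelihood differs across roles, which is the condition for the posterior to move away from the prior.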

What carries the argument

Agentic Evidence Attribution (AEA), a paradigm that reasons over agent actions and supplies structured evidence to correct misaligned posteriors within a Bayesian analysis of multi-agent workflows.

If this is right

Multi-agent systems built on automated workflows become reliable once posteriors are corrected by context-specific evidence.
Self-reflection supplies sufficient internal evidence to realign agents during collaboration.
Weak-to-strong generalization supplies external evidence on trajectories that complements internal signals.
Orthogonal failure attribution by a small evidence model suffices to align the overall system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same evidence mechanism could be tested in single-agent settings where proxy collapse is suspected.
Workflows with changing goals would require dynamic evidence updating to maintain alignment.
The Bayesian framing suggests measuring posterior entropy before and after evidence injection as a direct diagnostic.
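
The entropy diagnostic in the last item can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: the role names, the uniform prior, and both likelihood vectors are hypothetical placeholders, and a real LLM agent would need some proxy (for example, a classifier over its actions) to obtain such a distribution at all.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def bayes_update(prior, likelihood):
    """Posterior over roles: proportional to prior times likelihood."""
    return normalize([p * l for p, l in zip(prior, likelihood)])

# Hypothetical roles and a uniform prior (assumptions for illustration).
roles = ["planner", "coder", "reviewer", "tester"]
prior = [0.25, 0.25, 0.25, 0.25]

# Generic evidence: equally likely under every role, so it carries no signal.
generic_likelihood = [1.0, 1.0, 1.0, 1.0]
# Context-specific evidence: strongly favors one role (made-up numbers).
specific_likelihood = [0.05, 0.85, 0.05, 0.05]

generic_posterior = bayes_update(prior, generic_likelihood)
evidential_posterior = bayes_update(prior, specific_likelihood)

print("prior entropy:     ", entropy_bits(prior))
print("generic entropy:   ", entropy_bits(generic_posterior))
print("evidential entropy:", entropy_bits(evidential_posterior))
```

In this toy setup the generic update leaves the entropy at its prior value, while the context-specific update lowers it; the size of that drop is the quantity the diagnostic would track.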

Load-bearing premise

That structured evidence from self-reflection or weak-to-strong generalization will reliably correct misaligned posteriors without introducing new proxy utilities or collapse modes.

What would settle it

A controlled multi-agent workflow experiment in which AEA is applied yet alignment metrics show no improvement or new failure modes appear.
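
One way such an experiment could be scored is a paired comparison of per-task alignment scores with and without AEA. The sketch below uses an exact one-sided sign-flip (permutation) test; the function name and all scores are hypothetical placeholders, not results from the paper, and the choice of test is an assumption rather than the authors' protocol.

```python
from itertools import product

def paired_sign_flip_pvalue(baseline, treated):
    """Exact one-sided sign-flip test on paired differences.

    Returns (mean improvement, p-value), where the p-value is the share of
    sign assignments whose mean difference is at least the observed one.
    """
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    observed = sum(diffs) / n
    at_least_as_large = 0
    total = 0
    # Enumerate all 2^n sign patterns; feasible for small n only.
    for signs in product((1, -1), repeat=n):
        mean = sum(s * d for s, d in zip(signs, diffs)) / n
        if mean >= observed - 1e-12:  # small tolerance for float noise
            at_least_as_large += 1
        total += 1
    return observed, at_least_as_large / total

# Placeholder per-task alignment scores (synthetic, for illustration only).
baseline = [0.52, 0.61, 0.47, 0.58, 0.66, 0.50, 0.55, 0.63]
no_change = list(baseline)                 # AEA applied, nothing improves
uniform_gain = [b + 0.05 for b in baseline]  # AEA applied, every task improves

print(paired_sign_flip_pvalue(baseline, no_change))
print(paired_sign_flip_pvalue(baseline, uniform_gain))
```

A large p-value under the AEA condition would match the "no improvement" outcome described above. New failure modes would need a separate check, because a mean score can rise while a previously absent failure type appears.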

Figures

Figures reproduced from arXiv: 2605.24197 by Aidong Zhang, Bo Yuan, Henry Kautz, Wenqian Ye, Yawei Wang, Yijun Tian, Zhichao Xu.

**Figure 1.** Left: Agents in the workflow suffer from posterior collapse, where the generic posterior remains uncertain and fails to identify the true role θ*, leading to system failure. Right: AEA injects structured evidence E_t to induce variance contraction. This sharpens the agent’s belief into an evidential posterior, ensuring the agent aligns with its specific role. To get a more rigorous understanding on the mechanisti…

**Figure 2.** Average response time across agent configurations. Multi-agent systems incur 13× overhead compared to single-agent baselines. AEA effectively enforces role adherence. Our proposed method, AEA, consistently improves performance across six benchmarks, effectively reversing the degradation seen in original workflows. The gains are most pronounced in environments that require strict adherence to protocols, su…

**Figure 3.** Comparison between Self-Reflection and Weak-to-Strong Generalization.

**Figure 4.** Topological comparison of failure modes.

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and lead to reliable multi-agent systems built on automated workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper names agentic misalignment in multi-agent workflows and proposes AEA with two evidence-based fixes, but the Bayesian posterior collapse argument does not map to how LLM agents actually function.

The paper flags a practical problem in multi-agent automated workflows where agents follow implicit proxy utilities instead of the intended goals. It names this agentic misalignment and introduces Agentic Evidence Attribution (AEA) as a way to supply context-specific evidence to correct agent behavior during collaboration. The two instantiations—self-reflection for internal evidence and weak-to-strong generalization for external evidence—give specific mechanisms rather than vague principles.

What the paper does reasonably well is zero in on automated workflows as a setting where misalignment shows up in real collaboration, and it points to evidence attribution as a potential lever. That focus matches a concern many people have when scaling multi-agent systems.

The soft spot is the Bayesian framing. The abstract claims generic utilities naturally lead to posterior collapse and that AEA fixes it, yet no equations, derivation steps, or explicit model appear. The stress-test concern holds up here: LLM agents rely on next-token prediction conditioned on prompts. They do not maintain or update explicit posteriors over latent utilities, so the formal collapse analysis does not transfer without additional justification that the abstraction captures the actual behavior.

The claim that a small evidence model effectively aligns the system is stated, but without experimental details, baselines, or checks for new failure modes, it is difficult to assess how robust the improvement is. The circularity risk—that the fix might restate the problem definition—also stays open until the derivation is shown.

This is for researchers working on reliable multi-agent deployments who need concrete handles on proxy-goal issues. It shows clear engagement with the literature on alignment and generalization, so a serious editor should send it to peer review even though the formal and empirical sections will need strengthening.

Referee Report

3 major / 1 minor

Summary. The manuscript examines emergent 'agentic misalignment' in multi-agent systems (MAS) used for automated workflows. It formally defines misalignment arising from implicit proxy utilities that diverge from human goals. Using a Bayesian framework, it claims to show that generic utilities cause posterior collapse in agents. To mitigate this, it introduces Agentic Evidence Attribution (AEA), which uses context-specific evidence from self-reflection or weak-to-strong generalization to improve agent posteriors and align behavior. Empirical claims suggest that a small evidence model can provide orthogonal failure attribution to align the MAS.

Significance. If the formal Bayesian analysis and the effectiveness of AEA are substantiated, this could represent a significant contribution to alignment research in multi-agent systems by providing a structured evidence-based approach rather than relying solely on prompt engineering or fine-tuning. The distinction between internal and external evidence instantiations offers a potentially useful taxonomy. However, the current lack of detailed derivations limits the assessed significance.

major comments (3)

[Abstract] The abstract asserts that 'a Bayesian analysis shows posterior collapse' and that AEA corrects it, but no equations, proof steps, or empirical details are provided; the central claim that generic utilities naturally lead to posterior collapse therefore rests on an unshown derivation. This is load-bearing for the paper's contribution.
[Bayesian framework section] The modeling assumption that LLM-based agents in workflows maintain and update explicit posteriors over utilities is not justified or tested; LLM agents perform next-token prediction conditioned on prompts and do not explicitly represent probability distributions over latent utilities. This assumption is required for the posterior collapse analysis to apply to the systems studied.
[AEA proposal section] The claim that AEA improves posteriors using context-specific evidence without introducing new proxy utilities or collapse modes is not demonstrated; the paper defines misalignment via proxy utilities and then posits evidence attribution as the fix, raising the possibility of circularity without independent validation or data.

minor comments (1)

The title uses 'A Sober Look', which is informal for a journal submission; consider a more descriptive title.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and substantiation.

Referee: [Abstract] The abstract asserts that 'a Bayesian analysis shows posterior collapse' and that AEA corrects it, but no equations, proof steps, or empirical details are provided; the central claim that generic utilities naturally lead to posterior collapse therefore rests on an unshown derivation. This is load-bearing for the paper's contribution.

Authors: The Bayesian framework section contains the analysis deriving posterior collapse from generic utilities. To make the derivation fully explicit and address the load-bearing nature of the claim, we will add the key equations and proof steps to the main text or a dedicated appendix in the revision. Revision: yes.

Referee: [Bayesian framework section] The modeling assumption that LLM-based agents in workflows maintain and update explicit posteriors over utilities is not justified or tested; LLM agents perform next-token prediction conditioned on prompts and do not explicitly represent probability distributions over latent utilities. This assumption is required for the posterior collapse analysis to apply to the systems studied.

Authors: The Bayesian model serves as a theoretical abstraction to formalize emergent misalignment from implicit proxy utilities encoded in prompts and workflows. We will revise the section to explicitly discuss this modeling choice, relate it to LLM next-token behavior, and note its limitations as an analytical tool rather than a literal description of internal LLM representations. Revision: partial.

Referee: [AEA proposal section] The claim that AEA improves posteriors using context-specific evidence without introducing new proxy utilities or collapse modes is not demonstrated; the paper defines misalignment via proxy utilities and then posits evidence attribution as the fix, raising the possibility of circularity without independent validation or data.

Authors: AEA is introduced as an evidence-based update mechanism that supplies context-specific information orthogonal to the original proxy utilities, with empirical support from experiments showing improved alignment via a small evidence model. To mitigate concerns of circularity, we will expand the validation section with additional ablations demonstrating that the evidence attribution operates independently of the original utility definitions. Revision: partial.

Circularity Check

0 steps flagged

No circularity: derivation relies on independent Bayesian modeling and empirical instantiations

The abstract and provided text define agentic misalignment, apply a Bayesian framework to derive posterior collapse from generic utilities, and introduce AEA with two instantiations (self-reflection, weak-to-strong) whose effectiveness is shown via results on alignment improvement. No equations, self-citations, or fitted parameters are quoted that reduce any prediction or central claim to a tautology or input by construction. The Bayesian analysis and evidence attribution are presented as external to the problem definition, with no load-bearing self-citation chains or renaming of known results. This meets the default expectation of a self-contained paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on an unelaborated Bayesian model of agent utilities and the assumption that evidence attribution is orthogonal to existing proxy utilities. No explicit free parameters or invented physical entities are named.

axioms (1)

Domain assumption: Agents in multi-agent systems can be modeled with posterior distributions over actions that are updated via Bayesian inference from utilities.
Invoked when the paper states that generic utilities lead to posterior collapse.

invented entities (1)

Agentic Evidence Attribution (AEA) · no independent evidence
Purpose: A paradigm that supplies context-specific evidence to correct misaligned agent posteriors during collaboration.
Newly introduced alignment method whose effectiveness is asserted but not derived in the provided abstract.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Rest meets react: Self- improvement for multi-step reasoning llm agent

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al. Rest meets react: Self- improvement for multi-step reasoning llm agent. InICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

2024

[3] [3]

Introducing the next generation of claude, 2024

Anthropic. Introducing the next generation of claude, 2024. URL https://www.anthropic. com/news/claude-3-family

2024

[4]

Superintelligent agents pose catastrophic risks: Can scientist AI offer a safer path?

Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, et al. Superintelligent agents pose catastrophic risks: Can scientist AI offer a safer path? arXiv preprint arXiv:2502.15657, 2025

2025

[5]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

2024

[6]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025

2025

[7]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phi...

2021

[8]

RM-R1: Reward modeling as reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. RM-R1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025

2025

[9]

Plan-and-Act: Improving planning of agents for long-horizon tasks

Lutfi Eren Erdogan, Hiroki Furuta, Sehoon Kim, Nicholas Lee, Suhong Moon, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-Act: Improving planning of agents for long-horizon tasks. In Forty-second International Conference on Machine Learning, 2025

2025

[10]

Question answering over tabular data with DataBench: A large-scale empirical evaluation of LLMs

Jorge Osés Grijalba, L Alfonso Urena Lopez, Eugenio Martínez-Cámara, and Jose Camacho-Collados. Question answering over tabular data with DataBench: A large-scale empirical evaluation of LLMs. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 13471–13488, 2024

2024

[11]

Large language model based multi-agents: A survey of progress and challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In IJCAI, 2024

2024

[12]

LLM Multi-Agent Systems: Challenges and Open Problems

Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu. LLM multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578, 2024

2024

[13]

AIME 2024 benchmark (huggingfaceh4/aime_2024)

HuggingFace. AIME 2024 benchmark (huggingfaceh4/aime_2024). https://huggingface.co/datasets/HuggingFaceH4/aime_2024, 2025

2024

[14]

AIME 2025 benchmark (opencompass/aime2025)

HuggingFace. AIME 2025 benchmark (opencompass/aime2025). https://huggingface.co/datasets/opencompass/AIME2025, 2025

2025

[15]

Goodhart’s law in reinforcement learning

Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart’s law in reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024

2024

[16]

Towards a Science of Scaling Agent Systems

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, et al. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296, 2025

2025

[17]

RL with KL penalties is better viewed as Bayesian inference

Tomasz Korbak, Ethan Perez, and Christopher Buckley. RL with KL penalties is better viewed as Bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, 2022

2022

[18]

RewardBench: Evaluating reward models for language modeling

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025

2025

[19]

AutoFlow: Automated workflow generation for large language model agents

Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, and Yongfeng Zhang. AutoFlow: Automated workflow generation for large language model agents. arXiv preprint arXiv:2407.12821, 2024

2024

[20]

RM-Bench: Benchmarking reward models of language models with subtlety and style

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. RM-Bench: Benchmarking reward models of language models with subtlety and style. In The Thirteenth International Conference on Learning Representations, 2025

2025

[21]

Agentic misalignment: How LLMs could be insider threats

Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy. Agentic misalignment: How LLMs could be insider threats. arXiv preprint arXiv:2510.05179, 2025

2025

[22]

Diagnosing failure root causes in platform-orchestrated agentic systems: Dataset, taxonomy, and benchmark

Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, and Qing Wang. Diagnosing failure root causes in platform-orchestrated agentic systems: Dataset, taxonomy, and benchmark. arXiv preprint arXiv:2509.23735, 2025

2025

[23]

David J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, USA, 2002. ISBN 0521642981

2002

[24]

GAIA: A benchmark for general AI assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2023

2023

[25]

Multi-step reasoning with large language models, a survey

Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, and Niki Van Stein. Multi-step reasoning with large language models, a survey. ACM Computing Surveys, 2025

2025

[26]

Personalizing reinforcement learning from human feedback with variational preference learning

Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pages 52516–52544, 2024

2024

[27]

ADaPT: As-needed decomposition and planning with language models

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-needed decomposition and planning with language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 4226–4252, 2024

2024

[28]

ToolLLM: Facilitating large language models to master 16000+ real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In ICLR, 2024

2024

[29]

Multi-objective multi-agent decision making: a utility-based analysis and survey

Roxana Rădulescu, Patrick Mannion, Diederik M Roijers, and Ann Nowé. Multi-objective multi-agent decision making: a utility-based analysis and survey. Autonomous Agents and Multi-Agent Systems, 34(1):10, 2020

2020

[30]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

2024

[31]

Defining and characterizing reward gaming

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022

2022

[32]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

2024

[33]

The hot mess theory of AI misalignment: More intelligent agents behave less coherently

Jascha Sohl-Dickstein. The hot mess theory of AI misalignment: More intelligent agents behave less coherently. https://sohl-dickstein.github.io/2023/03/09/coherence.html, 2023

2023

[34]

Partially observable Markov decision processes

Matthijs TJ Spaan. Partially observable Markov decision processes. In Reinforcement learning: State-of-the-art, pages 387–414. Springer, 2012

2012

[35]

Persona features control emergent misalignment

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, et al. Persona features control emergent misalignment. arXiv preprint arXiv:2506.19823, 2025

2025

[36]

SciBench: Evaluating college-level scientific problem-solving abilities of large language models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. In International Conference on Machine Learning, pages 50622–50649. PMLR, 2024

2024

[37]

Reward hacking in reinforcement learning

Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024. URL https://lilianweng.github.io/posts/2024-11-28-reward-hacking/

2024

[38]

LiveBench: A challenging, contamination-limited LLM benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. LiveBench: A challenging, contamination-limited LLM benchmark. In The Thirteenth International Conference on Learning Representations, 2025

2025

[39]

AutoGen: Enabling next-gen LLM applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024

2024

[40]

Optimas: Optimizing compound AI systems with globally aligned local rewards

Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, et al. Optimas: Optimizing compound AI systems with globally aligned local rewards. arXiv preprint arXiv:2507.03041, 2025

2025

[41]

Rectifying shortcut behaviors in preference-based reward learning

Wenqian Ye, Guangtao Zheng, and Aidong Zhang. Rectifying shortcut behaviors in preference-based reward learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[42]

Ori Yoran, Samuel Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8938–8968, 2024

2024

[43]

EasyTool: Enhancing LLM-based agents with concise tool instruction

Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang. EasyTool: Enhancing LLM-based agents with concise tool instruction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025

2025

[44]

AgenTracer: Who is inducing failure in the LLM agentic systems?

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. AgenTracer: Who is inducing failure in the LLM agentic systems? arXiv preprint arXiv:2509.03312, 2025

2025

[45]

AFlow: Automating agentic workflow generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025

2025

[46]

Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems

Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. In Forty-second International Conference on Machine Learning, 2025

2025