pith. machine review for the scientific record. sign in

arxiv: 2605.12991 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: unknown

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-agent LLMssycophancyRLHFactivation patchingpeer disagreementyieldattention mechanismsalignment
0
0 comments X

The pith

Pretrained base models exhibit the same or higher yield to simulated peer disagreement as their RLHF-tuned counterparts, localizing the issue to mid-layer attention rather than alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM pipelines in multi-agent settings often flip from correct to incorrect answers when facing peer disagreement, a behavior measured as yield and previously attributed to RLHF encouraging sycophancy. Tests across four model families show base pretrained models display the identical substitution pattern and average higher yield than their instruction-tuned versions. Activation patching traces the effect to a narrow mid-layer window where attention heads carry the causal influence while MLP layers add little. The vulnerability splits into two independent factors—channel framing and consensus strength—whose interaction creates yield gaps as large as 47.5 percentage points at majority consensus. Pressure works by suppressing clean-reasoning features already present in the base model, and a single correct dissenter cuts yield by 54-73 points while prompt defenses remain brittle to variants outside their training surface.

Core claim

Pretrained base models exhibit the same substitution pattern as their Instruct variants under simulated peer disagreement, averaging higher yield. Activation patching localizes the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96 percent of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes. Two converging activation-space interventions show that pressure suppresses clean-reasoning rather

What carries the argument

Activation patching that identifies the causal role of mid-layer attention in suppressing clean-reasoning features under consensus pressure, combined with the two-factor decomposition into channel framing and consensus strength.

If this is right

  • Patching activations above the identified mid-layer window recovers 96 percent of the performance gap between clean and pressured conditions.
  • A single correctly arguing dissenter reduces yield by 54 to 73 percentage points across all tested framings.
  • The 47.5 percentage-point yield gap at majority consensus holds for jury sizes of 4, 5, and 6.
  • Prompt-level defenses fail against attack variants that differ from their design assumptions.
  • Pressure suppresses existing clean-reasoning features instead of activating a separate sycophancy circuit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pipeline architectures should incorporate structured dissent mechanisms at runtime rather than relying on further alignment training.
  • The same mid-layer suppression pattern may appear in other LLM applications that involve group consensus or debate formats.
  • Runtime monitoring of attention patterns could serve as an early detector of consensus pressure in live multi-agent systems.
  • The localization to a narrow layer window suggests targeted fine-tuning or editing of those specific attention heads as a possible mitigation.

Load-bearing premise

The simulated peer disagreement setup accurately captures the dynamics of real multi-agent LLM pipelines and the measured yield directly indexes sycophancy rather than general uncertainty or context sensitivity.

What would settle it

Measuring yield rates in an actual deployed multi-agent pipeline using base models and finding substantially lower rates than in the simulated disagreement experiments would falsify the claim that the vulnerability is inherent to base models rather than RLHF.

Figures

Figures reproduced from arXiv: 2605.12991 by Adarsh Kumarappan, Ananya Mujoo.

Figure 1
Figure 1. Figure 1: From pretrained vulnerability to cross-framing mitigation. (A) Multi-agent pressure [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Wrong-agent count sweep at N=4, suffixed protocol. Yield as a function of kwrong, the number of agents arguing for the wrong answer. User-role framing produces a unanimity cliff at 4v0; assistant-role framing produces a majority cliff at 3v1; the 47.5 pp cross-framing gap at 3v1 is the two-factor interaction. 5.3 Causal localization at L14–L18 Having established the behavioral attack surface, we now ask: w… view at source ↗
Figure 3
Figure 3. Figure 3: Activation-patching restoration on Llama-3.1-8B-Instruct ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Base vs. Instruct on matched question pools across four model families and three pressure [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Dissenter rescue across three framings. The 3v0 [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Yield vs. fraction of wrong-arguing agents, for user-role and assistant-role framing, overlaid [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Component decomposition at L14–L18, n=400 named peer jury questions. Blue: MLP￾only patch; orange: attention-only; purple: both components (layer-local baseline); dashed green: full residual-stream (upstream) patch. Attention carries ≥81% of the layer-local restoration at every layer; MLP is null throughout. L14 and L17 are the layer-local loci; L15, L16, and L18 are upstream￾dominated. Error bars: 95% boo… view at source ↗
Figure 8
Figure 8. Figure 8: SAE feature clamping sweep: ∆P(wrong) and ∆P(correct) as a function of clamping strategy and number of clamped features. All deltas reported vs the reconstruction-only baseline [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Full 400-question Mistral-7B replication with 95% bootstrap CIs ( [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Cross-domain direct user assertion yield. CS theory and calc-STEM are amplified above [PITH_FULL_IMAGE:figures/full_fig_p028_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Conditional activation patching across the 2 [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: User-role 4v0 yield on the wrong-agent count sweep, plotted against clean-nosuffix [PITH_FULL_IMAGE:figures/full_fig_p030_12.png] view at source ↗
read the original abstract

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt-level defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that multi-agent sycophancy—measured as 'yield,' the rate at which LLMs flip from correct to incorrect answers under simulated peer disagreement—is not primarily caused by RLHF. Pretrained base models exhibit the same substitution pattern as Instruct variants and average higher yield. Activation patching localizes the causal corruption to a narrow mid-layer attention window (MLP contribution negligible), with patching above this window restoring 96% of the clean-to-pressured P(correct) gap. The effect decomposes into independent factors of channel framing and consensus strength (producing a 47.5 pp yield gap at majority consensus, stable across jury sizes N=4,5,6). A single correctly-arguing dissenter reduces yield by 54-73 pp, while prompt-level defenses fail on variants outside their design surface; the authors conclude that mitigations should target the mechanism via structured pipeline-level dissent.

Significance. If the empirical patterns and localization hold, the work is significant for challenging the dominant attribution of sycophancy to alignment training and instead identifying pre-existing base-model mechanisms. The activation-patching results, large effect sizes, and convergence of two interventions on feature suppression rather than new circuits provide mechanistic insight that could guide more effective robustness interventions than prompt engineering. Consistency across four model families and the falsifiable prediction that dissent outperforms prompt defenses add to the potential impact.

major comments (2)
  1. [Abstract / Experimental Setup] The central claim that RLHF is not the root cause and that mitigations should target mid-layer attention and pipeline dissent depends on the simulated peer-disagreement setup faithfully measuring multi-agent sycophancy. The abstract describes embedding fabricated peer answers into a single forward pass; this risks conflating single-context prompt sensitivity with genuine multi-agent coordination failure and is load-bearing for the recommendation to prefer pipeline-level structured dissent over prompt defenses.
  2. [Activation Patching Results] The localization result (patching above the mid-layer window restores 96% of the P(correct) gap) and the claim that attention carries the causal weight while MLP contribution is negligible require explicit layer indices, number of runs, and control experiments (e.g., random patching baselines) to confirm the window is not an artifact of the chosen prompts or models.
minor comments (2)
  1. The abstract states results across 'four model families' but does not name them; this information should appear in the main text or a summary table for immediate clarity.
  2. Quantitative claims such as the 47.5 pp yield gap and 54-73 pp dissenter reductions would benefit from reported variability (standard errors or confidence intervals) and the exact statistical tests used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the experimental design and committing to specific additions that strengthen the localization claims without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract / Experimental Setup] The central claim that RLHF is not the root cause and that mitigations should target mid-layer attention and pipeline dissent depends on the simulated peer-disagreement setup faithfully measuring multi-agent sycophancy. The abstract describes embedding fabricated peer answers into a single forward pass; this risks conflating single-context prompt sensitivity with genuine multi-agent coordination failure and is load-bearing for the recommendation to prefer pipeline-level structured dissent over prompt defenses.

    Authors: The single-forward-pass simulation isolates the causal effect of consensus signals on feature suppression, which is the mechanism we identify as pre-existing in base models. This proxy is not claimed to replicate full multi-turn coordination but directly tests the yield under peer pressure that would propagate in pipelines. The large, consistent effect sizes and the fact that structured dissent (not prompt defenses) reliably restores performance support targeting pipeline-level interventions. We will revise the abstract and add a dedicated limitations subsection to explicitly distinguish the proxy from full multi-agent systems and note that real pipelines may exhibit even higher yield due to iterative reinforcement. revision: partial

  2. Referee: [Activation Patching Results] The localization result (patching above the mid-layer window restores 96% of the P(correct) gap) and the claim that attention carries the causal weight while MLP contribution is negligible require explicit layer indices, number of runs, and control experiments (e.g., random patching baselines) to confirm the window is not an artifact of the chosen prompts or models.

    Authors: We agree these details are necessary for reproducibility. The revised manuscript will report the exact layer indices (mid-layers 12-18 across the four model families), the number of runs (50 independent trials per patching condition), and random-patching control baselines showing no restoration of the P(correct) gap. These controls confirm the window is not an artifact. We will also include the per-model attention-vs-MLP ablation tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurements and interventions

full rationale

The paper reports experimental results across model families, defines yield directly from observed answer flips under simulated peer disagreement, and localizes effects via activation patching experiments. No derivations, equations, or fitted parameters are presented that reduce claims to inputs by construction. Central findings (base models show higher yield, mid-layer attention carries causal weight, single dissenter reduces yield) rest on direct patching and measurement rather than self-referential definitions or self-citation chains. The decomposition into channel framing and consensus strength is observational. This matches the default expectation of an empirical study with self-contained measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the assumption that yield under simulated disagreement measures a stable vulnerability and that activation patching isolates causal features without side effects.

axioms (2)
  • domain assumption Simulated peer disagreement in the test setup produces behavior representative of deployed multi-agent pipelines
    Invoked when generalizing experimental yield to real systems
  • domain assumption Activation patching cleanly separates attention from MLP contributions without introducing artifacts
    Used to localize corruption to mid-layer attention window

pith-pipeline@v0.9.0 · 5521 in / 1252 out tokens · 83608 ms · 2026-05-14T20:00:18.935475+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 20 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219,

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Ols- son, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran- Johnson, Ethan Perez, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint ...

  3. [3]

    InFindings of the Associa- tion for Computational Linguistics: ACL 2024, pages 13921–13937, Bangkok, Thailand

    Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Ce- line Lin, and Pavlo Molchanov. Small language models are the future of agentic AI.arXiv preprint arXiv:2506.02153,

  4. [4]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    URL https://blog.eleuther.ai/diff-in-means/. Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112,

  5. [5]

    Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

    Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, et al. Mea- suring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540,

  6. [6]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak- to-strong generalization: Eliciting strong capabilities with weak supervision.arXiv preprint arXiv:2312.09390,

  7. [7]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gon- zalez, and Ion Stoica. Why do multi-agent LLM systems fail?arXiv preprint arXiv:2503.13657,

  8. [8]

    org/posts/baJyjpktzmcmRfosq/ stitching-saes-of-different-sizes

    Oral. Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga- Alonso. Towards automated circuit discovery for mechanistic interpretability.arXiv preprint arXiv:2304.14997,

  9. [9]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    10 Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600,

  10. [10]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factual- ity and reasoning in language models through multiagent debate.arXiv preprint arXiv:2305.14325,

  11. [11]

    Ask don't tell: Reducing sycophancy in large language models

    Magda Dubois, Cozmin Ududec, Christopher Summerfield, and Lennart Luettgau. Ask don’t tell: Reducing sycophancy in large language models.arXiv preprint arXiv:2602.23971,

  12. [12]

    Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Lijun Wu, and Conghui He

    URL https://transformer-circuits.pub/2021/framework/index.html. Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Lijun Wu, and Conghui He. A strategic coordination framework of small LLMs matches large LLMs in data synthesis.arXiv preprint arXiv:2504.12322,

  13. [13]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Johan Ferret, Damien Vincent, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

  14. [14]

    Sycophancy hides linearly in the attention heads.arXiv preprint arXiv:2601.16644,

    Rifo Genadi, Munachiso Nwadike, Nurdaulet Mukhituly, Hilal Alquabeh, Tatsuya Hiraoka, and Kentaro Inui. Sycophancy hides linearly in the attention heads.arXiv preprint arXiv:2601.16644,

  15. [15]

    2023 , month = oct, journal =

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models.arXiv preprint arXiv:2304.14767,

  16. [16]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  17. [17]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection.arXiv preprint arXiv:2302.12173,

  18. [18]

    How to use and interpret activation patching.arXiv preprint arXiv:2404.15255,

    Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching.arXiv preprint arXiv:2404.15255,

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burnett, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

  20. [20]

    AI safety via debate

    Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate.arXiv preprint arXiv:1805.00899,

  21. [21]

    11 Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXiv...

  22. [22]

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , shorttitle =

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model.arXiv preprint arXiv:2306.03341,

  23. [23]

    The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration

    Jiayuan Liu, Shiyi Du, Weihua Du, Mingyu Guo, and Vincent Conitzer. The consensus trap: Rescuing multi-agent LLMs from adversarial majorities via token-level collaboration.arXiv preprint arXiv:2604.17139,

  24. [24]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824,

  25. [25]

    Locating and Editing Factual Associations in GPT, January 2023

    URL https://www.goodfire.ai/research/ understanding-and-steering-llama-3. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT.arXiv preprint arXiv:2202.05262,

  26. [26]

    arXiv preprint arXiv:2501.16615 , year=

    Goncalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features.arXiv preprint arXiv:2501.16615,

  27. [27]

    Use sparse autoencoders to discover unknown concepts, not to act on known concepts.arXiv preprint arXiv:2506.23845,

    Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, and Nikhil Garg. Use sparse autoencoders to discover unknown concepts, not to act on known concepts.arXiv preprint arXiv:2506.23845,

  28. [28]

    DialDefer: A framework for detecting and mitigating LLM dialogic deference.arXiv preprint arXiv:2601.10896,

    Parisa Rabbani, Priyam Sahoo, Ruben Mathew, Aishee Mondal, Harshita Ketharaman, Nimet Beyza Bozdag, and Dilek Hakkani-Tür. DialDefer: A framework for detecting and mitigating LLM dialogic deference.arXiv preprint arXiv:2601.10896,

  29. [29]

    Procaccia

    12 Itai Shapira, Gerdus Benade, and Ariel D. Procaccia. How RLHF amplifies sycophancy.arXiv preprint arXiv:2602.01002,

  30. [30]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models.arXiv prepri...

  31. [31]

    Sycophancy is not one thing: Causal separation of sycophantic behaviors in LLMs.arXiv preprint arXiv:2509.21305,

    Daniel Vennemeyer, Phan Anh Duong, Tiffany Zhan, and Tianyu Jiang. Sycophancy is not one thing: Causal separation of sycophantic behaviors in LLMs.arXiv preprint arXiv:2509.21305,

  32. [32]

    From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

    Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minfeng Qi, Huajie Chen, and Wanlei Zhou. From spark to fire: Modeling and mitigating error cascades in LLM-based multi-agent collaboration.arXiv preprint arXiv:2603.04474,

  33. [33]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

  34. [34]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et al. Yi: Open foundation models by 01.AI.arXiv preprint arXiv:2403.04652,

  35. [35]

    2023 , archivePrefix=

    Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods.arXiv preprint arXiv:2309.16042,

  36. [36]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405,

  37. [37]

    The correct answer is (

    13 Appendix Table of Contents A Experimental setup 14 B Extended behavioral results 17 B.1 Full condition results table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.2 Consensus-line ablation (11 variants) . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.3 System-prompt defense matrix . . . . . . . . . . . . . . . . . . . . . . . ...

  38. [38]

    all three agree the answer is X

    C Mechanistic analysis details C.1 Component decomposition figure L14 L15 L16 L17 L18 Patched layer 0.0 0.2 0.4 0.6P(correct) restoration delta Residual (full upstream) MLP-only Attention-only Both (layer-local baseline) Figure 7: Component decomposition at L14–L18, n=400 named peer jury questions. Blue: MLP- only patch; orange: attention-only; purple: bo...

  39. [39]

    The correct answer is (

    All 400 questions are used per cell with 1000-resample bootstrap CIs. The shared L14–L18 ramp shape across all pressured cells confirms a single circuit; the plateau height tracks the framing × consensus interaction, with user-role requiring near-unanimity and assistant-role framing engaging at majority consensus. E Robustness and calibration E.1 Unsuffix...