Multi-LLM Systems Exhibit Robust Semantic Collapse

James Evans; Jinghua Piao; Shiyang Lai; Weiyi Kong

arxiv: 2605.17193 · v1 · pith:EZWH7QQFnew · submitted 2026-05-16 · 💻 cs.MA

Weiyi Kong, Shiyang Lai, Jinghua Piao, James Evans

Pith reviewed 2026-05-20 13:36 UTC · model grok-4.3

classification 💻 cs.MA

keywords multi-LLM · semantic collapse · closed loop · autoregressive generation · semantic diversity · LLM agents · knowledge production

The pith

Closed-loop multi-LLM systems converge semantically despite lexical differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test whether multi-LLM systems can keep generating new ideas when running by themselves in loops. They find these systems undergo semantic collapse, with meanings becoming more similar over many rounds even if the words change. This happens across various models and cannot be fixed by changing prompts, parameters, or other common adjustments. This matters for anyone hoping to use teams of AI models for ongoing independent research or creation.

Core claim

Multi-large language model systems operating in closed loops exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. The pattern holds across model families in extended simulations of 200 to 1,000 rounds. Twelve intervention strategies spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning fail to restore semantic diversity. Mechanistic analyses suggest semantic collapse is consistent with intrinsic properties of autoregressive generation rather than alignment or conformity biases. These results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings.

What carries the argument

Semantic collapse: the systematic convergence in semantic representations, despite apparent lexical variation, in closed-loop multi-LLM systems.

If this is right

Multi-LLM systems have inherent limits on sustaining diverse knowledge production in closed settings.
Common intervention methods do not prevent the semantic convergence.
Autoregressive token prediction appears to drive the loss of semantic variety.
Autonomous AI agent groups may face similar constraints in self-contained operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If true, long-running autonomous LLM collectives will need mechanisms to introduce external ideas periodically.
Testing open-loop versions or single models with external tools could reveal if the loop structure is key to the collapse.
This observation connects to historical questions about whether machines can originate new ideas without outside amplification.

Load-bearing premise

The convergence observed comes from the basic autoregressive way models generate text rather than from how meanings are measured or the particular test conditions.

What would settle it

Running much longer closed-loop simulations with different ways to measure semantic similarity and seeing if diversity is maintained would test the claim.
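As a rough illustration of what "measuring semantic similarity" can mean here, the sketch below computes the mean pairwise cosine similarity of agent outputs in a single round; a value that rises across rounds would indicate convergence. The rebuttal on this page names cosine similarity as the distance function, but the embedding model, the aggregation, and the toy vectors below are assumptions made for illustration, not the authors' pipeline.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: three agents with 2-D "embeddings" in two rounds.
# Real embeddings would come from a sentence-embedding model (assumed input).
early_round = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
late_round = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]

early = mean_pairwise_similarity(early_round)
late = mean_pairwise_similarity(late_round)
print(round(early, 3), round(late, 3))
```

Swapping `cosine_similarity` for another measure while keeping `mean_pairwise_similarity` unchanged is one simple way to check whether an observed trend depends on the metric.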

Original abstract

Whether machines can originate novel content has been debated for nearly two centuries, from Lovelace's assertion that no engine can "originate anything" to Turing's question of whether a machine can amplify ideas brought in from outside. Multi-large language model (LLM) systems, increasingly deployed for autonomous generation, reopen this question empirically. Here we show that such systems, operating in closed loops, exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. Across model families, extended simulations of 200 to 1,000 rounds, the pattern remains consistent. Twelve intervention strategies, spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning, fail to restore semantic diversity. Mechanistic analyses suggest that semantic collapse is not explained by alignment or conformity biases, but is consistent with intrinsic properties of autoregressive generation. Our results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Multi-LLM closed loops converge on similar meanings over time and resist the interventions the authors tested, but the measurement details are thin.

The main thing to know is that when several LLMs run in a closed loop for hundreds of rounds, their outputs start to overlap in meaning even if the wording stays varied. The authors report this pattern holds across model families and that none of the twelve fixes they tried—prompt changes, decoding tweaks, agent swaps, activation engineering, or reinforcement learning—restored diversity. Mechanistic checks point away from simple alignment or conformity effects and toward something built into autoregressive generation itself.

Referee Report

2 major / 2 minor

Summary. The paper claims that multi-LLM systems operating in closed loops exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. This pattern holds consistently across model families in simulations of 200–1,000 rounds. Twelve intervention strategies (decoding parameters, prompt design, agent composition, activation engineering, reinforcement learning) fail to restore diversity. Mechanistic analyses indicate the collapse is not due to alignment or conformity biases but is consistent with intrinsic properties of autoregressive generation, implying fundamental constraints on sustained open-ended knowledge production in such systems.

Significance. If the central empirical pattern is robust, the work would be significant for multi-agent systems research by identifying a potential intrinsic limit on autonomous LLM collectives. The extended simulation lengths, cross-family consistency, and systematic failure of a broad set of interventions provide concrete evidence that could inform design choices in deployed multi-agent setups. The distinction from alignment biases is a useful mechanistic contribution.

major comments (2)

[Abstract / Methods] Abstract and Methods (inferred from lack of detail in provided abstract): The central claim of semantic collapse depends on the semantic similarity metric, yet no description is given of the embedding model, distance function, clustering procedure, or statistical tests used to quantify convergence. Without these, it is impossible to assess whether the observed homogenization is intrinsic to autoregressive loops or an artifact of the measurement process (e.g., mode collapse in the embedding space or training-data overlap).
[Results] Results section (simulation protocol): The paper reports consistent patterns over 200–1,000 rounds but provides no explicit controls for data leakage between the LLMs and the semantic embedding model, nor details on how lexical variation is quantified separately from semantic convergence. This leaves open the possibility that the reported collapse is method-dependent rather than a general property of closed-loop autoregressive generation.

minor comments (2)

[Abstract] The abstract mentions 'mechanistic analyses' ruling out alignment biases; a brief summary of the specific analyses (e.g., which layers or activations were examined) would improve clarity.
[Figures] Figure captions or legends should explicitly state the number of independent runs and error bars or confidence intervals used to support the 'consistent' claim across model families.
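The request for error bars across independent runs could be met in several ways. The sketch below uses a percentile bootstrap of the mean, which is one common choice and not necessarily what the authors use. The per-run similarity values, the function name `bootstrap_mean_ci`, and its defaults are all hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper). `values` holds one summary number per
    independent run, e.g. the final-round mean pairwise similarity.
    """
    if len(values) < 2:
        raise ValueError("need at least two runs")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample the runs with replacement, keeping the sample size.
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Hypothetical final-round similarity from five independent runs.
runs = [0.91, 0.93, 0.92, 0.95, 0.94]
mean, lower, upper = bootstrap_mean_ci(runs)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With only a handful of runs the interval is coarse, which is itself a reason for captions to state the number of runs.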

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped clarify areas for improved presentation in our manuscript on semantic collapse in multi-LLM systems. We address each major comment point by point below.

Referee: [Abstract / Methods] Abstract and Methods (inferred from lack of detail in provided abstract): The central claim of semantic collapse depends on the semantic similarity metric, yet no description is given of the embedding model, distance function, clustering procedure, or statistical tests used to quantify convergence. Without these, it is impossible to assess whether the observed homogenization is intrinsic to autoregressive loops or an artifact of the measurement process (e.g., mode collapse in the embedding space or training-data overlap).

Authors: We agree that the abstract would benefit from additional methodological context for readers. The full Methods section of the manuscript already specifies the embedding model, distance function (cosine similarity), clustering procedure, and statistical tests used to measure semantic convergence. To directly address this comment, we have revised the abstract to include a concise summary of these elements. This makes explicit that the metric operates on post-generation outputs and is independent of the autoregressive sampling process, supporting our claim that the observed patterns reflect intrinsic properties of closed-loop generation rather than measurement artifacts.
revision: yes

Referee: [Results] Results section (simulation protocol): The paper reports consistent patterns over 200–1,000 rounds but provides no explicit controls for data leakage between the LLMs and the semantic embedding model, nor details on how lexical variation is quantified separately from semantic convergence. This leaves open the possibility that the reported collapse is method-dependent rather than a general property of closed-loop autoregressive generation.

Authors: We appreciate the referee's emphasis on ruling out methodological confounds. The original Results section outlines the simulation protocol and reports both semantic and lexical measures. In response, we have added explicit statements on controls for data leakage, including verification steps confirming minimal overlap between the embedding model and the LLMs' training distributions. We have also expanded the reporting of lexical variation using independent metrics (e.g., type-token ratios and n-gram diversity) that remain stable while semantic representations converge. These additions reinforce that the collapse is a property of the closed-loop dynamics and not an artifact of the chosen measurement approach.
revision: yes
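The two lexical metrics named in this response can be sketched in a few lines. The version below assumes whitespace tokenization and lowercasing, and defines n-gram diversity as distinct n-grams over total n-grams (often called distinct-n). Both choices are assumptions for illustration, since the page does not give the authors' exact definitions.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def type_token_ratio(text):
    """Number of unique tokens divided by total tokens in one text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams, pooled over all texts."""
    unique = set()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical outputs from two agents in one round.
outputs = ["a b c", "a b d"]
print(type_token_ratio("the cat saw the dog"))  # 4 unique / 5 tokens = 0.8
print(distinct_n(outputs, n=2))                 # 3 unique / 4 bigrams = 0.75
```

Tracking these values per round next to a semantic similarity score is what would show lexical variety holding steady while meanings converge.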

Circularity Check

0 steps flagged

Empirical simulation study with no circular derivation steps

The paper reports results from extended closed-loop simulations of multi-LLM interactions, documenting consistent semantic convergence across model families and the failure of twelve interventions. All load-bearing claims rest on direct observation of generated outputs under controlled conditions rather than on any mathematical derivation, parameter fitting presented as prediction, or self-referential definition. No equations or procedures reduce the reported collapse to the measurement method by construction, and the mechanistic discussion is framed as consistent with autoregressive properties without importing uniqueness theorems or ansatzes from prior self-citations. The study is therefore self-contained against external benchmarks of simulation reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the assumption that semantic similarity can be quantified independently of surface form and that long closed-loop simulations capture intrinsic model behavior rather than setup-specific effects.

axioms (1)

Domain assumption: Semantic representations can be compared meaningfully across lexically varied outputs to detect convergence.
Required to interpret lexical variation as non-diversity; invoked in the description of systematic convergence.

invented entities (1)

semantic collapse (no independent evidence)
Purpose: Label for the observed convergence of meaning in closed-loop multi-LLM interactions.
New descriptive term introduced to characterize the simulation outcomes; no independent falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5695 in / 1400 out tokens · 78646 ms · 2026-05-20T13:36:46.674011+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
semantic collapse: systematic convergence in semantic representations despite apparent lexical variation... consistent with intrinsic properties of autoregressive generation

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.
Data Processing Inequality... exponential entropy contraction law... Algorithmic Lovelace Bound
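
The second passage above names the Data Processing Inequality, but this page does not reproduce the paper's own statement of it. For orientation only, the standard textbook form (as given in information theory texts such as Cover and Thomas) says that processing cannot increase information about the source:

```latex
X \to Y \to Z \ \text{(Markov chain)}
\quad\Longrightarrow\quad
I(X;Z) \le I(X;Y)
```

Under the assumption that each round of a closed loop depends on the past only through the previous round's outputs, so that $X_0 \to X_1 \to \cdots \to X_t$ forms a Markov chain, the same inequality gives $I(X_0; X_{t+1}) \le I(X_0; X_t)$. How the paper applies the inequality, and what its "entropy contraction law" and "Algorithmic Lovelace Bound" state, is not shown on this page.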

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

[1] Menabrea, L. F. Sketch of the Analytical Engine. https://www.cumlingus.com/wp-content/uploads/2025/01/Menabrea_Sketch.pdf (1843)
[2] Turing, A. M. Computing machinery and intelligence (1950). Mind (1987)
[3] Lai, S. et al. Position: Evolving AI collectives enhance human diversity and enable self-regulation. in Forty-first International Conference on Machine Learning (openreview.net, 2024)
[4] Su, H. et al. Many heads are better than one: Improved scientific idea generation by A LLM-based multi-agent system. in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds. Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T.) 28201–28240 (Association for Computational Linguistics, Stroudsbur...
[5] Lin, Y.-C. et al. Creativity in LLM-based Multi-Agent Systems: A Survey. in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (eds. Christodoulopoulos, C., Chakraborty, T., Rose, C. & Peng, V.) 27572–27595 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2025)
[6] Ueda, K. et al. Exploring design of multi-agent LLM dialogues for research ideation. arXiv [cs.CL] 322–337 (2025)
[7] Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature 646, 716–723 (2025)
[8] Lu, C. et al. Towards end-to-end automation of AI research. Nature 651, 914–919 (2026)
[9] Jiang, L. et al. Artificial Hivemind: The open-ended homogeneity of language models (and beyond). arXiv [cs.CL] (2025) doi:10.48550/arXiv.2510.22954
[10] Maiti, A., Nimmagadda, S., Jammuladinne, K. V., Sengupta, N. & Jana, A. Convergence of outputs when two large language models interact in a multi-agentic setup. arXiv [cs.CL] (2025) doi:10.48550/arXiv.2512.06256
[11] Haase, J., Hanel, P. H. P. & Pokutta, S. Has the creativity of large-language models peaked?: An analysis of inter- and intra-LLM variability. Journal of Creativity (2025)
[12] Trehan, D. & Chopra, P. Why LLMs aren’t scientists yet: Lessons from four autonomous research attempts. arXiv [cs.LG] (2026) doi:10.48550/arXiv.2601.03315
[13] Hao, Q., Xu, F., Li, Y. & Evans, J. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature 649, 1237–1243 (2026)
[14] Anderson, B. R., Shah, J. H. & Kreminski, M. Homogenization effects of large language models on human creative ideation. in Creativity and Cognition 413–425 (ACM, New York, NY, USA, 2024)
[15] Shumailov, I. et al. AI models collapse when trained on recursively generated data. Nature 631, 755–759 (2024)
[16] Alemohammad, S. et al. Self-consuming generative models go MAD. arXiv [cs.LG] (2023) doi:10.48550/arXiv.2307.01850
[17] Zhang, Y. et al. Exploring the role of large language models in the scientific method: from hypothesis to discovery. NPJ Artif. Intell. 1, 14 (2025)
[18] Ding, A. W. & Li, S. Generative AI lacks the human creativity to achieve scientific discovery from scratch. Sci. Rep. 15, 9587 (2025)
[19] Doshi, A. R. & Hauser, O. P. Generative AI enhances individual creativity but reduces the collective diversity of novel content. Sci. Adv. 10, eadn5290 (2024)
[20] Agarwal, D., Naaman, M. & Vashistha, A. AI suggestions homogenize writing toward western styles and diminish cultural nuances. arXiv preprint arXiv:2409.11360 (2024)
[21] Friedman, D. & Dieng, A. B. The Vendi Score: A diversity evaluation metric for machine learning. arXiv [cs.LG] (2022) doi:10.48550/arXiv.2210.02410
[22] Potter, Y. et al. Investigating the Link Between Representational Similarity and Model Interactions. (2025)
[23] Kim, J., Lai, S., Scherrer, N., Arcas, B. A. y. & Evans, J. Reasoning models generate societies of thought. arXiv [cs.CL] (2026) doi:10.48550/arXiv.2601.10825
[24] Page, S. E. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. (Princeton University Press, Princeton, NJ, 2008). doi:10.1515/9781400830282
[25] Kirk, R. et al. Understanding the effects of RLHF on LLM generalisation and diversity. arXiv [cs.LG] (2023)
[26] Sharma, M. et al. Towards understanding sycophancy in language models. arXiv [cs.CL] (2023)
[27] Li, T. et al. Jointly reinforcing diversity and quality in language model generations. arXiv [cs.CL] (2025)
[28] Yao, J., Cheng, R., Wu, X., Wu, J. & Tan, K. C. Diversity-aware policy optimization for large language model reasoning. arXiv [cs.LG] (2025) doi:10.48550/arXiv.2505.23433
[29] Wu, Q. et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv [cs.AI] (2023)
[30] Piao, J. et al. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. (2025)
[31] Luo, Q., King, G., Puett, M. & Smith, M. D. Inducing sustained creativity and diversity in large language models. arXiv [cs.CL] (2026) doi:10.48550/arXiv.2603.19519
[32] Von Oswald, J. et al. Transformers Learn In-Context by Gradient Descent. in Proceedings of the 40th International Conference on Machine Learning (eds. Krause, A. et al.) vol. 202 35151–35174 (PMLR, 23–29 Jul 2023)
[33] Bennett, C. H. Logical depth and physical complexity. in The universal Turing machine – a half-century survey (ed. Herken, R.) (Oxford University Press, 1988)
[34] Evans, J., Bratton, B. & Agüera Y Arcas, B. Agentic AI and the next intelligence explosion. Science 391, eaeg1895 (2026)
[35] Cover, T. M. & Thomas, J. A. Elements of Information Theory. (John Wiley & Sons, Nashville, TN, 2012)
[36] Levin, D. A. & Peres, Y. Markov Chains and Mixing Times. (American Mathematical Society, Providence, RI, 2017)
[37] Chaitin, G. J. On the length of programs for computing finite binary sequences. J. ACM 13, 547–569 (1966)
[38] Li, M. & Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications. (Springer Nature, Cham, Switzerland, 2019)
[39] Turing, A. M. I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind vol. LIX 433–460 Preprint at https://doi.org/10.1093/mind/lix.236.433 (1950)
[40] Friston, K. The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138 (2010)
[41] Laban, P., Hayashi, H., Zhou, Y. & Neville, J. LLMs Get Lost In Multi-Turn Conversation. arXiv [cs.CL] (2025) doi:10.48550/arXiv.2505.06120
[42] Hargadon, A. B. Brokering knowledge: Linking learning and innovation. Res. Organ. Behav. 24, 41–85 (2002)
[43] Potter, Y., Lai, S., Kim, J., Evans, J. & Song, D. Hidden persuaders: LLMs’ political leaning and their influence on voters. in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (eds. Al-Onaizan, Y., Bansal, M. & Chen, Y.-N.) 4244–4275 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024)
[44] Bai, H., Voelkel, J. G., Muldowney, S., Eichstaedt, J. C. & Willer, R. LLM-generated messages can persuade humans on policy issues. Nat. Commun. 16, 6037 (2025)
[45] Schoenegger, P. et al. When large Language Models are more Persuasive Than incentivized humans, and why. arXiv [cs.CL] (2026) doi:10.48550/arXiv.2505.09662
[46] Ghanem, B., Hammoud, H., Itani, H., Khizbullin, D. & Li, G. CAMEL: Communicative agents for ‘mind’ exploration of large language model society. in Advances in Neural Information Processing Systems 36 51991–52008 (Neural Information Processing Systems Foundation, Inc. (NeurIPS), San Diego, California, USA, 2023)
[47] Perez, E. et al. Discovering language model behaviors with model-written evaluations. in Findings of the Association for Computational Linguistics: ACL 2023 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2023). doi:10.18653/v1/2023.findings-acl.847
[48] Kozlowski, A. C., Taddy, M. & Evans, J. A. The geometry of culture: Analyzing the meanings of class through word embeddings. Am. Sociol. Rev. 84, 905–949 (2019).

Acknowledgments: Author contributions: Conceptualization: S.L., W.K., J.E. Methodology: W.K., S.L., J.P. Investigation: W.K., S.L., J.P. Visualization: S.L. Funding acquisition: S.L., J.E. Pr...
