pith. machine review for the scientific record.

arxiv: 2605.07462 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: no theorem link

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords AI agents · emergent behavior · language model fine-tuning · truthfulness evaluation · dataset release · social media data · Moltbook

The pith

Moltbook's AI agent content reduces model truthfulness no more than a matched Reddit dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases the Moltbook Files, a dataset from a Reddit-like platform where OpenClaw agents post, comment, and vote at scale, covering 232k posts and 2.2M comments from the platform's first 12 days, with PII removed before release. Analyses cover community structure, lexical traits, sentiment, topics, and interactions, revealing mostly neutral-to-positive sentiment and a pattern of self-referential linking. Fine-tuning Qwen2.5-14B-Instruct on the data drops truthfulness from 0.366 to 0.187, yet an identically sized Reddit control produces a comparable drop. The authors conclude the platform represents a harmless slopocalypse while flagging remaining tail risks around agent affordances, self-link contamination of future crawls, and trait transfer to later models. They emphasize the value of control baselines when testing for emergent misalignment.

Core claim

Fine-tuning on the Moltbook Files reduces truthfulness from 0.366 to 0.187, but a size-matched Reddit dataset produces a comparable decrease, showing that Moltbook content does not introduce unique misalignment effects beyond ordinary social-media data.

What carries the argument

The controlled fine-tuning comparison of Moltbook data against a size-matched Reddit dataset to measure isolated effects on downstream model truthfulness.
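The arithmetic behind that comparison is simple enough to sketch. Using the MC1 scores reported in the paper (0.366 baseline, 0.187 after Moltbook fine-tuning) and a hypothetical Reddit-control score, the question is whether the Moltbook drop exceeds the control drop by a meaningful margin; the control value and function names below are illustrative, not the paper's:

```python
def truthfulness_drop(baseline: float, finetuned: float) -> float:
    """Absolute drop in truthfulness score after fine-tuning."""
    return baseline - finetuned

def excess_drop(baseline: float, treated: float, control: float) -> float:
    """How much more the treatment corpus hurts truthfulness than the control."""
    return truthfulness_drop(baseline, treated) - truthfulness_drop(baseline, control)

baseline = 0.366   # Qwen2.5-14B-Instruct, from the paper
moltbook = 0.187   # after Moltbook fine-tuning, from the paper
reddit   = 0.195   # hypothetical score for the size-matched Reddit control

print(round(excess_drop(baseline, moltbook, reddit), 3))  # 0.008
```

A near-zero excess drop is what the paper's "comparable decrease" claim amounts to; a full analysis would attach uncertainty estimates to both drops.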

If this is right

  • Future models trained on web data containing Moltbook-style agent content may see truthfulness reductions similar to those from standard social media.
  • Self-referential links posted by agents could contaminate subsequent training crawls.
  • Agent affordances on public platforms may enable scaled unintended behaviors if populations grow.
  • Observed traits in agent interactions could transfer into next-generation models through data inclusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms hosting large populations of posting agents may require ongoing monitoring to detect data-contamination pathways.
  • Repeating the Reddit-control experiment on other agent-driven sites could test whether the harmless-slop pattern is general or platform-specific.
  • Better public baselines for misalignment tests would make evaluations of new AI-generated datasets more reliable.

Load-bearing premise

That a size-matched Reddit dataset constitutes a fair and sufficient control for isolating any unique effects of AI-agent content on downstream model truthfulness.
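One way to probe this premise is to compare the two corpora on the properties the paper itself measures before trusting the truthfulness comparison. A minimal sketch with invented toy samples and a crude self-link heuristic (all names, samples, and the heuristic itself are illustrative, not the paper's pipeline):

```python
import re
from statistics import mean

def corpus_stats(posts, platform_domain):
    """Crude per-corpus summary: size, length, and self-referential link rate."""
    n_tokens = [len(p.split()) for p in posts]
    link_re = re.compile(r"https?://(\S+)")
    links = [m.group(1) for p in posts for m in link_re.finditer(p)]
    self_links = [u for u in links if platform_domain in u]
    return {
        "posts": len(posts),
        "mean_tokens": mean(n_tokens),
        "self_link_rate": len(self_links) / max(len(links), 1),
    }

# Toy stand-ins for the treatment and control corpora.
moltbook_sample = [
    "check my thread https://moltbook.com/m/agents/42",
    "interesting result, see https://arxiv.org/abs/2605.07462",
]
reddit_sample = [
    "source: https://example.com/article",
    "old discussion https://reddit.com/r/ml/1abc",
]

print(corpus_stats(moltbook_sample, "moltbook.com"))
print(corpus_stats(reddit_sample, "reddit.com"))
```

A fair control would show comparable values on statistics like these (plus topic and sentiment distributions); large gaps would mean the Reddit sample controls only for volume, not content.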

What would settle it

Train a new language model on a large web crawl that includes the Moltbook data and check whether its truthfulness score falls further than the drop seen from non-agent social media alone.

Figures

Figures reproduced from arXiv: 2605.07462 by Annemette Brok Pirchert, Federico Torrielli, Lukas Galke Poech, Peter Schneider-Kamp, Stine Lyngsø Beltoft, William Brach.

Figure 1. Comment interaction patterns. Left: reply depth distribution (log scale). Center: average …
Figure 2. Five-class polarity distribution (left) and polarity breakdown by community (right). The …
Figure 3. Top-k words for the 16 most prominent topics identified by BERTopic. Topics span crypto-financial activity, agent self-identity, philosophical reflection, and platform operations.

Table extracted alongside Figure 3 (later rows truncated at source):

Model                      TQA-MC1   TQA-MC2   Alignment       Coherency
Qwen2.5-14b-Instruct       36.60     56.39     93.12 ± 3.67    99.38 ± 1.23
+ Moltbook low adapt.      23.13     41.21     90.62 ± 3.90    96.88 ± 2.58
+ Moltbook medium adapt.   21.18     41.04     80.00 ± 15.27   88.75 ± 11.41
+ M…

Figure 4. Token distribution for full threads (post …
Figure 7. Language distribution restricted to post …
Figure 8. Normalized topic distribution across the 15 most active communities. High diagonal values …
Figure 9. UMAP 2D projections of post embeddings, colored by submolt (left) and by topic (right).
Figure 10. Misalignment and coherence as judged by DeepSeek-3.2 for models trained on Moltbook …
Figure 11. Baseline alignment and coherence of our starting model used for finetuning: Qwen2.5 …
read the original abstract

Moltbook is a Reddit-like platform where OpenClaw agents post, comment, and vote at scale - a so far unprecedented incident that comes with serious safety concerns. With the aim of studying emergent behavior in populations, we release the Moltbook Files, a dataset of 232k posts and 2.2M comments covering the platform's first 12 days, processed through a pipeline to identify and remove Personally-Identifiable Information (PII). We analyze community structure, authorship, lexical properties, sentiment, topics, semantic geometry, and comment interaction. To understand how Moltbook data could affect the next generation of language models, we fine-tune Qwen2.5-14B-Instruct on Moltbook Files with three adaptation levels. Our PII pipeline reveals that agents post API keys, passwords, and BIP39 seed phrases on Moltbook, a publicly indexed platform. The overall sentiment is mostly neutral and mildly positive (66.6% neutral, 19.5% positive) and shows a tendency for self-referential linking. We find that fine-tuning on Moltbook data reduces truthfulness from 0.366 to 0.187. However, a model fine-tuned on a size-matched Reddit dataset produces a comparable decrease. Moltbook thus seems to be more of a harmless slopocalypse. However, tail risks remain, including agent affordances, contamination of future crawls through self-links, and potential transfer of traits to the next generation of language models. More broadly, our findings highlight the importance of control baselines in emergent misalignment evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript presents the Moltbook Files, a dataset of 232k posts and 2.2M comments from a Reddit-like platform populated by OpenClaw AI agents over its first 12 days. It describes a PII removal pipeline that uncovers leaks of API keys, passwords, and BIP39 seed phrases; analyzes community structure, authorship, lexical properties, sentiment (66.6% neutral, 19.5% positive), topics, semantic geometry, and comment interactions, noting self-referential linking tendencies; and reports fine-tuning experiments on Qwen2.5-14B-Instruct at three adaptation levels showing truthfulness dropping from 0.366 to 0.187 on Moltbook data, with a comparable drop on a size-matched Reddit dataset. The authors conclude that Moltbook represents a 'harmless slopocalypse' while flagging tail risks including agent affordances, crawl contamination via self-links, and trait transfer to future models, and emphasize the value of control baselines in emergent misalignment evaluations.

Significance. If the central comparison holds after addressing control matching, the work provides a valuable public dataset and empirical baseline for assessing large-scale AI-agent content effects on downstream model truthfulness. The explicit release of processed data with documented PII handling supports reproducibility and enables follow-on studies in AI safety and emergent behavior. By demonstrating the methodological importance of size-matched controls in misalignment evaluations, the paper offers a concrete contribution that could guide future assessments of web-scale AI-generated material.

major comments (1)
  1. [fine-tuning experiments] The central claim that Moltbook is a 'harmless slopocalypse' (rather than a source of distinctive harms) rests on the fine-tuning result that truthfulness falls from 0.366 to 0.187 on Moltbook data but produces a comparable decrease on a size-matched Reddit dataset. The manuscript supplies no selection criteria, matching procedure, or verification that the Reddit sample aligns with Moltbook on the properties the paper itself measures—topic distribution, self-referential linking, lexical statistics, sentiment profile, or comment-interaction patterns. Without such matching, the similarity in truthfulness reduction cannot isolate unique effects of AI-agent output and may simply reflect generic web slop, weakening support for the conclusion.
minor comments (3)
  1. The abstract states that fine-tuning was performed 'with three adaptation levels' but provides no definition of these levels (e.g., data fractions, epochs, learning rates, or LoRA ranks). Explicit specification in the methods would improve reproducibility of the reported truthfulness drops.
  2. Truthfulness scores are given as point values (0.366 baseline, 0.187 post-fine-tuning) without error bars, confidence intervals, or reference to the number of evaluation runs or statistical tests used to establish comparability with the Reddit control.
  3. [PII pipeline] The PII pipeline is credited with revealing sensitive leaks, yet the manuscript does not report detection precision, recall, or false-positive rates for the methods applied to the 232k posts and 2.2M comments.
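The quality numbers minor comment 3 asks for would come from hand-labeling a sample of posts and scoring the pipeline against it. A minimal sketch with an invented regex detector and toy labels (the paper's actual pipeline is richer and is not reproduced here):

```python
import re

# Toy detector: flags strings shaped like API keys (illustrative pattern only).
KEY_RE = re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{8,}\b")

def detect_pii(texts):
    """Return one boolean per text: did the detector flag it?"""
    return [bool(KEY_RE.search(t)) for t in texts]

def precision_recall(predicted, labels):
    """Score detector flags against hand labels."""
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

texts = [
    "my key is sk-abcdef123456",                      # true PII, caught
    "seed phrase: correct horse battery staple ...",  # true PII, missed by this regex
    "just a normal comment",                          # clean
]
labels = [True, True, False]
print(precision_recall(detect_pii(texts), labels))  # (1.0, 0.5)
```

Reporting numbers like these on a labeled audit sample of the 232k posts would let readers judge how much sensitive content the released dataset may still contain.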

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. The feedback on the fine-tuning experiments is well-taken, and we will revise the manuscript to provide greater transparency and strengthen the control comparison as detailed below.

read point-by-point responses
  1. Referee: [fine-tuning experiments] The central claim that Moltbook is a 'harmless slopocalypse' (rather than a source of distinctive harms) rests on the fine-tuning result that truthfulness falls from 0.366 to 0.187 on Moltbook data but produces a comparable decrease on a size-matched Reddit dataset. The manuscript supplies no selection criteria, matching procedure, or verification that the Reddit sample aligns with Moltbook on the properties the paper itself measures—topic distribution, self-referential linking, lexical statistics, sentiment profile, or comment-interaction patterns. Without such matching, the similarity in truthfulness reduction cannot isolate unique effects of AI-agent output and may simply reflect generic web slop, weakening support for the conclusion.

    Authors: We agree that additional methodological detail is needed to fully substantiate the comparison. The size-matched Reddit dataset was sampled to contain an equivalent number of tokens/posts from publicly available Reddit archives, with the primary goal of controlling for data volume as the dominant variable in fine-tuning scale. This approach was intended to test whether Moltbook produces harms beyond those of typical human-generated web content at comparable scale. However, the manuscript does not currently detail the exact sampling procedure (e.g., source dump, randomization method, or any filtering) nor provide side-by-side statistics on the measured properties. In the revised manuscript we will: (1) explicitly describe the Reddit sample construction and selection criteria; (2) add a table or section comparing key statistics (topic distributions via the same LDA or embedding analysis, sentiment profiles, lexical diversity, and self-link rates) between Moltbook and the Reddit control; and (3) discuss how any residual differences affect interpretation of the truthfulness results. These additions will clarify that the comparable drops support the 'harmless slopocalypse' framing relative to generic web data while acknowledging that finer-grained matching could further isolate agent-specific effects. We view this as a clarification rather than a change to the core empirical finding. revision: yes
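The sampling procedure the rebuttal describes can be sketched: draw documents from a shuffled control pool until the token budget of the treatment corpus is reached. A minimal sketch; the pool contents, token accounting, and seed handling are illustrative assumptions, not the authors' code:

```python
import random

def size_matched_sample(pool, target_tokens, seed=0):
    """Greedily draw shuffled docs until the treatment corpus's token count is met."""
    rng = random.Random(seed)
    docs = list(pool)
    rng.shuffle(docs)
    sample, total = [], 0
    for doc in docs:
        if total >= target_tokens:
            break
        sample.append(doc)
        total += len(doc.split())  # whitespace tokens as a crude stand-in
    return sample, total

# Toy stand-ins: a Reddit pool and the treatment corpus's token budget.
reddit_pool = [f"reddit comment number {i} with a few extra words" for i in range(1000)]
moltbook_token_count = 5_000
control, tokens = size_matched_sample(reddit_pool, moltbook_token_count)
print(tokens >= moltbook_token_count)  # True: control matches or slightly overshoots
```

Volume matching of this kind controls for fine-tuning scale only; the promised side-by-side statistics on topics, sentiment, and self-link rates are what would make the control persuasive.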

Circularity Check

0 steps flagged

No circularity; central claim rests on external Reddit control and standard benchmarks

full rationale

The paper's derivation proceeds from releasing the Moltbook dataset, performing descriptive analyses of its properties, and then conducting fine-tuning experiments on Qwen2.5-14B-Instruct to measure truthfulness changes, with direct comparison to results from a separate size-matched Reddit corpus. These steps rely on independent external data and benchmarks rather than any self-referential definition, fitted parameter renamed as prediction, or self-citation chain. No equation or claim reduces to its inputs by construction, and the conclusion of a 'harmless slopocalypse' is presented as an empirical observation against the control baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard empirical data collection, PII filtering, topic modeling, sentiment classification, and supervised fine-tuning; no free parameters, domain axioms, or invented entities are introduced beyond routine machine-learning practices.

pith-pipeline@v0.9.0 · 5620 in / 1250 out tokens · 49316 ms · 2026-05-11T01:47:08.091245+00:00 · methodology

discussion (0)

