OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Longbin Lai; Peigen Liu; Renjie Sun; Rui Ding; Ying Zhang; Yunjun Gao; Yuren Mao; Yuxiang Ye; Zhengping Qian; Ziyan Jiang

OpenHospital is a live hospital arena where physician agents evolve collective intelligence by interacting with dynamic patient agents, improving clinical metrics while lowering token cost.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.5

2026-07-14 20:58 UTC pith:PNN5QQVL

load-bearing objection: Useful hospital multi-agent arena with real metric trajectories, but the CI / data-in-agent-self story is mostly oracle-guided individual improvement.

arxiv 2603.14771 v3 pith:PNN5QQVL submitted 2026-03-16 cs.AI

classification cs.AI

keywords: collective intelligence · LLM multi-agent systems · medical simulation · agent evolution · data-in-agent-self · clinical benchmarking · synthetic patients

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static datasets only capture surface phenomena, so LLM agents stay stuck at imitation and cannot break the data wall. Collective intelligence requires agents to engage dynamic environments that change with their own actions. OpenHospital supplies that environment: physician agents must actively elicit information from patient agents built for clinical correctness, persona diversity, linguistic fluency, and behavioral realism. A closed-loop reflection process after each case lets the physicians accumulate experience. Across successive batches the agents raise examination precision, diagnostic accuracy, and treatment alignment while reducing total input tokens, and they spontaneously begin consulting one another on complex cases. A sympathetic reader cares because the arena both trains multi-agent systems and supplies objective metrics for medical skill and system efficiency, showing that continuous capability growth is possible without new human corpora.

Core claim

As physician agents process successive batches of cases inside OpenHospital under a closed-loop reflection mechanism, their Examination Precision, Diagnostic Accuracy, and Treatment Plan Alignment all rise while total input tokens fall, and cooperative behaviors such as peer consultation emerge spontaneously; the arena therefore both evolves and quantifies LLM-based collective intelligence.

What carries the argument

The data-in-agent-self paradigm: physician agents receive no static case files; they must interact with patient agents (treated as dynamic entities) to obtain clinical information, forcing knowledge integration, multi-agent debate, and measurable evolution tracked by examination, diagnosis, treatment, and token metrics.

Load-bearing premise

The evolution claim rests on the premise that post-case self-critique against ground-truth diagnoses plus synthetic patients validated only by other language models produces genuine collective intelligence rather than guided individual improvement.

What would settle it

Train the same physician agents for the same number of batches using only interaction logs and no ground-truth reflection signal; if examination precision, diagnostic accuracy, treatment alignment, and spontaneous cross-department consultations fail to improve, the central claim is false.
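
The decision rule in this falsifier can be written down as a small check. The sketch below is illustrative only: the function names, the `tolerance` parameter, and the per-batch numbers are hypothetical and do not come from the paper or its code. It assumes each tracked quantity from an interaction-only run (no ground-truth reflection) is available as one value per batch.

```python
from typing import Dict, List


def gain(trajectory: List[float]) -> float:
    """Change from the first to the last batch of a per-batch metric series."""
    if len(trajectory) < 2:
        raise ValueError("need at least two batches to measure a change")
    return trajectory[-1] - trajectory[0]


def claim_falsified(no_oracle_run: Dict[str, List[float]],
                    tolerance: float = 0.0) -> bool:
    """True when no tracked metric improves beyond `tolerance` without the oracle."""
    return all(gain(series) <= tolerance for series in no_oracle_run.values())


# Hypothetical per-batch values for an interaction-only run (NOT from the paper).
flat_run = {
    "examination_precision": [0.45, 0.45, 0.44],
    "diagnostic_accuracy": [0.48, 0.47, 0.48],
    "treatment_alignment": [0.58, 0.58, 0.58],
    "consultations_per_case": [0.10, 0.09, 0.10],
}

print(claim_falsified(flat_run))
```

A first-versus-last comparison is the crudest possible test; a real ablation would presumably compare whole trajectories across repeated runs, which is why `tolerance` is left as an explicit knob.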


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Useful hospital multi-agent arena with real metric trajectories, but the CI / data-in-agent-self story is mostly oracle-guided individual improvement.


The first thing to know is that OpenHospital is a competent, reusable clinical multi-agent simulator that actually tracks agents over 22 batches and reports rising exam precision, diagnostic accuracy, and treatment alignment while tokens fall. The second is that the load-bearing “collective intelligence from data-in-agent-self” claim is weaker than the abstract suggests: after every case the physicians get multi-dimensional self-critique against ground-truth diagnoses, exams, and guidelines. That is ordinary supervised reflection, not pure emergence from patient interaction alone.

What is new and solid: a comorbidity / long-tail synthetic patient pipeline (583 diseases, 467 comorbidities), four explicit patient-construction pillars with some diversity and consistency numbers, dual medical-plus-efficiency metrics, and a closed-loop Agent-Kernel baseline that shows the trajectories and a couple of consultation case studies. They ship a GitHub link, keep everything synthetic for privacy, and the limitations section is honest about unimodal text and missing temporal disease progression. Prior hospital sims (Agent Hospital, AgentClinic, MedAgentSim) already occupy the space; this is a legitimate extension with more evolution tracking and harder cases, not a clean break.

Soft spots in proportion: no ablation that removes or degrades the ground-truth reflection signal, no single-agent or non-consulting controls, no error bars, and heavy LLM-as-judge (GPT-5.2) for both patient validation and treatment alignment. The Kantian “thing-in-itself” framing is decorative. Circularity is real: the same oracle that drives improvement also defines the scores, and the spontaneous-cooperation story rests on an action space that already includes consultation rather than on cooperation shown to arise from interaction alone. None of this makes the engineering useless; it just means the CI language should be tempered.

This is for people building medical multi-agent sims or evolution arenas who want a concrete, privacy-safe testbed and baseline numbers. It is not a foundation-model data-wall solution and not a clinical tool. I would bring it to reading group as a methods paper, cite the arena if I need a hospital sim, and send it to peer review with a clear request for the missing ablations. The work is serious and reproducible enough to deserve referee time.

Referee Report

3 major / 0 minor

Summary. The paper introduces OpenHospital, an interactive multi-agent arena for evolving and benchmarking LLM-based collective intelligence (CI) in a clinical setting. Physician agents interact with synthetic patient agents under a “data-in-agent-self” paradigm, using a multi-stage pipeline (DeepSeek-v3.1) to generate 12,000 patient records with comorbidities and long-tail diseases across 19 departments. A baseline of 38 physician agents (Agent-Kernel, Qwen3-Next-80B) is trained over 22 batches of ~500 cases each; after every case agents perform multi-dimensional self-critique against ground-truth diagnoses, examinations and treatment guidelines. Reported gains are Examination Precision 45.05%→61.31%, Diagnostic Accuracy 48.11%→57.34%, Treatment Plan Alignment 58.49%→61.52%, with declining total input tokens; qualitative case studies illustrate refined individual reasoning and spontaneous cross-departmental consultation. The authors claim the arena both fosters genuine CI and supplies rigorous dual metrics of medical proficiency and system efficiency.
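
To put the reported first-to-last-batch figures on a common footing, the sketch below computes absolute gains (percentage points) and relative gains from the numbers quoted in the summary. The dictionary layout and the helper name `gains` are illustrative choices, not anything taken from the paper.

```python
# First-batch and last-batch values (percent) as quoted in the summary above.
reported = {
    "Examination Precision": (45.05, 61.31),
    "Diagnostic Accuracy": (48.11, 57.34),
    "Treatment Plan Alignment": (58.49, 61.52),
}


def gains(first: float, last: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent of the start value)."""
    absolute = last - first
    relative = 100.0 * absolute / first
    return round(absolute, 2), round(relative, 1)


summary = {name: gains(first, last) for name, (first, last) in reported.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 16.26, 9.23 and 3.03 points respectively, so Treatment Plan Alignment moves far less than the other two metrics.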

Significance. If the central claim holds—that physician–patient interaction plus closed-loop reflection produces genuine, quantifiable collective intelligence rather than ordinary supervised improvement—the work would supply a reusable, privacy-safe evolutionary arena and a multi-dimensional clinical benchmark that current static MAS evaluations lack. Strengths include a large synthetic comorbidity-rich dataset, explicit multi-metric tracking (examination, diagnosis, treatment, tokens), open-source Agent-Kernel linkage, and qualitative evidence of peer consultation. These contributions would be useful to the multi-agent and medical-AI communities even if the strongest CI interpretation requires further controls.

major comments (3)

§4.2 Baseline / closed-loop reflection: after every case agents receive multi-dimensional self-critique that explicitly synthesizes diagnostic accuracy against ground truth, examination efficiency and therapeutic safety. The same ground truth defines the three Medical Capability metrics in §4.1. Without an ablation that removes or degrades this oracle signal (or a pure interaction-only control), the reported trajectories (Fig. 4) and token decline (Fig. 5) are equally consistent with ordinary supervised individual improvement; the “data-in-agent-self CI” interpretation therefore remains untested and load-bearing for the central claim.
§4.3 and Fig. 6 (cooperative behaviors): the action space already includes multi-agent consultation (§4.2). The single qualitative example of Infectious-Diseases → Cardiology consultation does not establish that cooperation is spontaneous or necessary rather than an enabled primitive. A non-consulting or single-agent control is required to support the claim that OpenHospital’s collaborative necessity drives emergent CI.
§3 and §4.1 evaluation: both patient-agent validation (Medical Consistency 4.4113, Accuracy/Relevance/Persona scores) and Treatment Plan Alignment rely on LLM-as-judge (GPT-5.2 / Baichuan-M2). No human clinician inter-rater reliability, no error bars or statistical tests on the 22-batch trajectories, and no comparison against a non-LLM clinical gold standard are reported. This weakens the claim that the metrics constitute a “rigorous” benchmark of medical proficiency.
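
The missing error bars can be approximated for the binary Diagnostic Accuracy metric. The sketch below computes a Wilson score interval for a single batch, under two assumptions that the paper does not state: that cases within a batch are independent, and that a batch holds exactly 500 cases. The count of 287 correct diagnoses is a made-up example, not a reported value.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical batch: 287 of 500 diagnoses correct (57.4%); NOT a reported count.
low, high = wilson_interval(287, 500)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly four points on either side of the batch accuracy, so batch-to-batch movements smaller than that would be hard to separate from sampling noise. The interval does not apply to the LLM-judged Treatment Plan Alignment score, which is not a simple success count.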

Circularity Check

3 steps flagged

Metric gains and 'CI evolution' reduce by construction to ground-truth-guided reflection that optimizes the same labels used for evaluation; cooperation is enabled by the pre-defined action space.

specific steps

fitted input called prediction [Section 4.2 Baseline / Experimental Setup]
"Central to this baseline is a closed-loop reflection mechanism designed to drive autonomous evolution; after each case, agents engage in a multi-dimensional self-critique that synthesizes diagnostic accuracy against ground truth, examination efficiency, and therapeutic safety to bridge efficacy gaps. By integrating these diagnostic, investigative, and treatment reflections into a unified feedback loop, the agents systematically accumulate clinical experience and optimize their decision-making logic over time."

The reflection step injects the identical ground-truth labels that later define Diagnostic Accuracy, Examination Precision and Treatment Plan Alignment. Metric gains across the 22 batches are therefore the direct, expected product of supervised optimization against those labels, not an independent prediction of emergent CI from interaction data alone. Calling the resulting trajectories 'evolution of collective intelligence via data-in-agent-self' renames a fitted supervised loop as a first-principles result.
self-definitional [Section 4.1 Evaluation Metrics + Section 4.3 Evaluation Results]
"Examination Precision assesses the relevance and necessity of ordered tests. Defined as |E_pred ∩ E_std| / |E_pred| … Diagnostic Accuracy measures the correctness of the final consensus diagnosis. Formally, for a case i with ground truth D_true, the score is 1 if the agent's diagnosis D_pred = D_true … Treatment Plan Alignment evaluates therapeutic quality against gold-standard guidelines … As Figure 4 illustrates, the agents exhibit consistent upward trends … Examination Precision … 45.05% to 61.31% … Diagnostic Accuracy … 48.11% to 57.34% … Treatment Plan Alignment … 58.49% to 61.52%."

The three medical-capability metrics are defined directly in terms of the same ground-truth examinations, diagnoses and guidelines that the reflection mechanism optimizes after every case. Reporting improvement on those metrics therefore restates the success of the supervised critique loop; the quantities being 'predicted' (or claimed to emerge) are definitionally the quantities being fitted.
other [Section 4.2 Baseline action space + Section 4.4 Case Studies / Figure 6]
"These agents operate within a sophisticated action space that encompasses patient perception, targeted inquiry, diagnostic examination, multi-agent consultation, and knowledge retrieval … analysis of the diagnostic process reveals the spontaneous emergence of sophisticated cooperative behaviors … the agent proactively initiates a consultation with the Cardiology Department … This interaction highlights the collaborative necessity intrinsic to OpenHospital"

Multi-agent consultation is an explicitly enumerated primitive of the action space. Observing agents invoke that primitive and then labeling the behavior 'spontaneous emergence of collective intelligence' is circular: the capability is present by construction of the environment rather than discovered from interaction data.
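
The two set- and match-based metric definitions quoted in the self-definitional step can be sketched in a few lines, which makes their dependence on the ground-truth labels explicit. This is a reading of the quoted formulas, not the authors' code: the function names, the empty-order convention, and the example test names and diagnoses are assumptions for illustration.

```python
from typing import Iterable, Sequence


def examination_precision(ordered: Iterable[str], standard: Iterable[str]) -> float:
    """|E_pred ∩ E_std| / |E_pred| for one case."""
    e_pred, e_std = set(ordered), set(standard)
    if not e_pred:
        return 0.0  # assumption: a case with no ordered tests scores zero
    return len(e_pred & e_std) / len(e_pred)


def diagnostic_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of cases where D_pred equals D_true (exact string match)."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need equal-length, non-empty sequences")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical case: two of the four ordered tests are in the standard set.
precision = examination_precision(
    ["CBC", "ECG", "chest X-ray", "troponin"],
    ["ECG", "troponin", "echocardiogram"],
)
accuracy = diagnostic_accuracy(["myocarditis", "pneumonia"], ["myocarditis", "asthma"])
print(precision, accuracy)
```

Both functions take the ground-truth set or label as a direct argument, which is the same information the reflection step is described as using after each case.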

full rationale

The paper's central claim is that OpenHospital's data-in-agent-self interactions foster genuine collective intelligence, evidenced by rising Examination Precision / Diagnostic Accuracy / Treatment Plan Alignment and spontaneous peer consultation. That claim is not an independent prediction. The baseline's closed-loop reflection explicitly critiques every case against ground-truth diagnoses, examination standards and therapeutic guidelines (the identical quantities that define the three medical metrics). Consequently the observed trajectories (45.05%→61.31%, 48.11%→57.34%, 58.49%→61.52%) and token reduction are the expected outcome of supervised individual improvement rather than emergent CI arising solely from physician–patient interaction. Cooperative behaviors are likewise licensed a priori by an action space that already contains multi-agent consultation. No ablation removes the oracle signal, so the 'evolution of CI' claim collapses to the training loop by construction. Patient-agent synthesis and LLM-as-judge validation introduce milder self-consistency circularity but are secondary. Self-citation of Agent-Kernel is present yet not load-bearing for the uniqueness of the result. Overall: partial circularity of the fitted-input / self-definitional kind, score 6.

Axiom & Free-Parameter Ledger

3 free parameters · 3 axioms · 3 invented entities

The central claim rests on the premise that synthetic patients plus ground-truth reflection produce genuine collective intelligence. Free parameters are engineering choices (batch size, agent count, model). Domain assumptions include clinical validity of LLM-synthesized patients and that oracle feedback equals data-in-agent-self evolution. Invented entities are the named paradigm and arena themselves; none have independent external evidence beyond the paper's own LLM judges.

free parameters (3)

training batches / cases per batch = 22 batches × ~500 cases
22 sequential batches of ~500 cases each are chosen by hand to display evolution trajectories; no derivation fixes these numbers.
physician agent count and departmental distribution = 38 agents / 19 departments
38 agents across 19 departments (two per department) is an architectural choice that directly enables the reported cross-department consultation behaviors.
base LLM choice for agents = Qwen3-Next-80B-A3B-Instruct
Qwen3-Next-80B-A3B-Instruct is selected for both physicians and patients; results are conditioned on this model family.

axioms (3)

domain assumption: LLM-generated synthetic patients with multi-stage refinement under epidemiological constraints are clinically coherent enough to serve as noumena for CI evolution.
Invoked throughout Section 3; validated only by GPT-5.2 consistency scores (avg 4.41/5) rather than clinician review or real EHR statistics.
ad hoc to paper: Closed-loop multi-dimensional self-critique against ground truth after each case produces data-in-agent-self collective intelligence rather than ordinary supervised improvement.
Core of the evolution claim in Sections 4.2–4.3; no control without the oracle is reported.
domain assumption: Examination Precision, Diagnostic Accuracy and LLM-judged Treatment Plan Alignment together constitute a robust quantitative measure of collective intelligence.
Stated in Section 4.1; the mapping from these clinical scores to 'CI' is definitional within the paper.

invented entities (3)

data-in-agent-self paradigm · no independent evidence
purpose: To reframe interaction traces as the primary training signal that overcomes the data wall.
Named in abstract and introduction; no independent formal definition or external validation outside this arena.
OpenHospital arena (thing-in-itself) · no independent evidence
purpose: To supply a dynamic, collaborative, quantifiable environment for evolving and benchmarking LLM CI.
The central constructed object of the paper; its status as 'noumenon' is philosophical framing rather than an independently measurable entity.
Four pillars of realistic patient simulation (Clinical Correctness, Persona Diversity, Linguistic Fluency, Behavioral Realism) · no independent evidence
purpose: To justify that synthetic patients are sufficiently human-like for CI to emerge.
Introduced in Section 3; each pillar is scored by internal metrics or LLM judges, not external clinical trials.

pith-pipeline@v1.1.0-grok45 · 14891 in / 3136 out tokens · 41117 ms · 2026-07-14T20:58:55.207403+00:00 · methodology


Original abstract

Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents. However, there is currently no dedicated arena for evolving and benchmarking LLM-based CI. To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents. This arena employs a data-in-agent-self paradigm that rapidly enhances agent capabilities and provides robust evaluation metrics for benchmarking both medical proficiency and system efficiency. Experiments demonstrate the effectiveness of OpenHospital in both fostering and quantifying CI.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MentalHospital: A Virtual Environment for Evaluating Psychiatric Clinical Encounters
cs.AI 2026-07 conditional novelty 7.0

EHR-derived standardized patients and dual-track evaluation reveal LLMs trail clinicians by 37.28 points on full psychiatric encounters, with mental-status assessment the main bottleneck.

Reference graph

Works this paper leans on

27 extracted references · 24 linked inside Pith · cited by 1 Pith paper

[1]

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N

Citysim: Modeling urban behaviors and city dynamics with large-scale llm-driven agent simulation.Preprint, arXiv:2506.21805. Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu...

Pith/arXiv arXiv
[2]

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, and 1 oth- ers

Scienceagentbench: Toward rigorous as- sessment of language agents for data-driven scientific discovery.Preprint, arXiv:2410.05080. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, and 1 oth- ers

Pith/arXiv arXiv
[3]

Deepseek-v3 technical report.Preprint, arXiv:2412.19437. Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, and 14 others

Pith/arXiv arXiv
[4]

Baichuan-m2: Scaling medi- cal capability with large verifier system.Preprint, arXiv:2509.02208. Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang

Pith/arXiv arXiv
[5]

Alireza Ghafarollahi and Markus J

Omni-math: A univer- sal olympiad level mathematic benchmark for large language models.Preprint, arXiv:2410.07985. Alireza Ghafarollahi and Markus J. Buehler

Pith/arXiv arXiv
[6]

Saeedeh Ghanadbashi and Fatemeh Golpayegani

Sci- agents: Automating scientific discovery through multi-agent intelligent graph reasoning.Preprint, arXiv:2409.05556. Saeedeh Ghanadbashi and Fatemeh Golpayegani

Pith/arXiv arXiv
[7]

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu

Ontology-enhanced decision-making for autonomous agents in dynamic and partially observable environ- ments.Preprint, arXiv:2405.17691. Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu

Pith/arXiv arXiv
[8]

Devbench: A realistic, developer-informed benchmark for code generation models.Preprint, arXiv:2601.11895. Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryu- taro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, 8 Fan Zhang, Katherine Chou, Avinatan Hassidim, Bu- rak ...

Pith/arXiv arXiv
[9]

Towards an ai co-scientist.Preprint, arXiv:2502.18864. Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber

Pith/arXiv arXiv
[10]

Preprint, arXiv:2410.02603

Agents’ room: Nar- rative generation through multi-step collaboration. Preprint, arXiv:2410.02603. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Pith/arXiv arXiv
[11]

Immanuel Kant

Swe-bench: Can language mod- els resolve real-world github issues?Preprint, arXiv:2310.06770. Immanuel Kant. 1781.Critique of Pure Reason. Johann Friedrich Hartknoch, Riga, Russian Empire. Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, and Yang Liu

Pith/arXiv arXiv
[12]

Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qing- min Liao

Agent hospi- tal: A simulacrum of hospital with evolvable medical agents.Preprint, arXiv:2405.02957. Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qing- min Liao

Pith/arXiv arXiv
[13]

Jianhao Lin, Lexuan Sun, and Yixin Yan

Econagent: Large language model- empowered agents for simulating macroeconomic activities.Preprint, arXiv:2310.10436. Jianhao Lin, Lexuan Sun, and Yixin Yan

Pith/arXiv arXiv
[14]

Preprint, arXiv:2505.17648

Simu- lating macroeconomic expectations using llm agents. Preprint, arXiv:2505.17648. Zhao Mandi, Shreeya Jain, and Shuran Song

Pith/arXiv arXiv
[15]

Yuren Mao, Peigen Liu, Xinjian Wang, Rui Ding, Jing Miao, Hui Zou, Mingjie Qi, Wanxiang Luo, Longbin Lai, Kai Wang, Zhengping Qian, Peilun Yang, Yun- jun Gao, and Ying Zhang

Roco: Dialectic multi-robot collaboration with large language models.Preprint, arXiv:2307.04738. Yuren Mao, Peigen Liu, Xinjian Wang, Rui Ding, Jing Miao, Hui Zou, Mingjie Qi, Wanxiang Luo, Longbin Lai, Kai Wang, Zhengping Qian, Peilun Yang, Yun- jun Gao, and Ying Zhang

Pith/arXiv arXiv
[16]

Agent-kernel: A microkernel multi-agent system framework for adap- tive social simulation powered by llms.Preprint, arXiv:2512.01610. OpenAI

arXiv
[17]

Generative agents: Interac- tive simulacra of human behavior.Preprint, arXiv:2304.03442. Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li

Pith/arXiv arXiv
[18]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun

Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society.Preprint, arXiv:2502.08691. Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun

Pith/arXiv arXiv
[19]

Jun Sashihara, Yukihisa Fujita, Kota Nakamura, Masahiro Kuwahara, and Teruaki Hayashi

Chatdev: Communica- tive agents for software development.Preprint, arXiv:2307.07924. Jun Sashihara, Yukihisa Fujita, Kota Nakamura, Masahiro Kuwahara, and Teruaki Hayashi

Pith/arXiv arXiv
[20]

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor

Llm-based multi-agent system for simulating strate- gic and goal-oriented data marketplaces.Preprint, arXiv:2511.13233. Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor

arXiv
[21]

Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nan- qing Dong

Agentclinic: a multimodal agent benchmark to evalu- ate ai in simulated clinical environments.Preprint, arXiv:2405.07960. Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nan- qing Dong

Pith/arXiv arXiv
[22]

Many heads are better than one: Improved scientific idea generation by a llm-based multi-agent system.Preprint, arXiv:2410.09403. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang

Pith/arXiv arXiv
[23]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Zhou, and 1 others

Autogen: Enabling next-gen llm ap- plications via multi-agent conversation.Preprint, arXiv:2308.08155. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Zhou, and 1 others

Pith/arXiv arXiv
[24]

Bangguo Yu, Qihao Yuan, Kailai Li, Hamidreza Kasaei, and Ming Cao

Qwen3 techni- cal report.Preprint, arXiv:2505.09388. Bangguo Yu, Qihao Yuan, Kailai Li, Hamidreza Kasaei, and Ming Cao. 2025a. Co-navgpt: Multi-robot coop- erative visual semantic navigation using vision lan- guage models.Preprint, arXiv:2310.07937. Tian Yu, Ken Shi, Zixin Zhao, and Gerald Penn. 2025b. Multi-agent based character simulation for story writ...

Pith/arXiv arXiv 2025
[25]

Simulating classroom education with LLM- empowered agents. InProceedings of the 2025 Con- ference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), 9 pages 10364–10379, Albuquerque, New Mexico. As- sociation for Computational Linguistics. Xuhui Zhou, Hao Zhu, Leen...

2025
[26]

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You

Sotopia: Interactive evaluation for social intelligence in language agents.Preprint, arXiv:2310.11667. Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You

Pith/arXiv arXiv
[27]

Multiagentbench: Evaluating the col- laboration and competition of llm agents.Preprint, arXiv:2503.01935. 10

Pith/arXiv arXiv

[1] [1]

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N

Citysim: Modeling urban behaviors and city dynamics with large-scale llm-driven agent simulation.Preprint, arXiv:2506.21805. Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu...

Pith/arXiv arXiv

[2] [2]

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, and 1 oth- ers

Scienceagentbench: Toward rigorous as- sessment of language agents for data-driven scientific discovery.Preprint, arXiv:2410.05080. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, and 1 oth- ers

Pith/arXiv arXiv

[3] [3]

Deepseek-v3 technical report.Preprint, arXiv:2412.19437. Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, and 14 others

Pith/arXiv arXiv

[4] [4]

Baichuan-m2: Scaling medi- cal capability with large verifier system.Preprint, arXiv:2509.02208. Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang

Pith/arXiv arXiv

[5] [5]

Alireza Ghafarollahi and Markus J

Omni-math: A univer- sal olympiad level mathematic benchmark for large language models.Preprint, arXiv:2410.07985. Alireza Ghafarollahi and Markus J. Buehler

Pith/arXiv arXiv

[6] [6]

Saeedeh Ghanadbashi and Fatemeh Golpayegani

Sci- agents: Automating scientific discovery through multi-agent intelligent graph reasoning.Preprint, arXiv:2409.05556. Saeedeh Ghanadbashi and Fatemeh Golpayegani

Pith/arXiv arXiv

[7] [7]

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu

Ontology-enhanced decision-making for autonomous agents in dynamic and partially observable environ- ments.Preprint, arXiv:2405.17691. Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu

Pith/arXiv arXiv

[8] [8]

Devbench: A realistic, developer-informed benchmark for code generation models.Preprint, arXiv:2601.11895. Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryu- taro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, 8 Fan Zhang, Katherine Chou, Avinatan Hassidim, Bu- rak ...

Pith/arXiv arXiv

[9] [9]

Towards an ai co-scientist.Preprint, arXiv:2502.18864. Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber

Pith/arXiv arXiv

[10] [10]

Preprint, arXiv:2410.02603

Agents’ room: Nar- rative generation through multi-step collaboration. Preprint, arXiv:2410.02603. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Pith/arXiv arXiv

[11] [11]

Immanuel Kant

Swe-bench: Can language mod- els resolve real-world github issues?Preprint, arXiv:2310.06770. Immanuel Kant. 1781.Critique of Pure Reason. Johann Friedrich Hartknoch, Riga, Russian Empire. Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, and Yang Liu

[12]

Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao

Agent hospital: A simulacrum of hospital with evolvable medical agents. Preprint, arXiv:2405.02957. Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao

[13]

Jianhao Lin, Lexuan Sun, and Yixin Yan

Econagent: Large language model-empowered agents for simulating macroeconomic activities. Preprint, arXiv:2310.10436. Jianhao Lin, Lexuan Sun, and Yixin Yan

[14]

Preprint, arXiv:2505.17648

Simulating macroeconomic expectations using llm agents. Preprint, arXiv:2505.17648. Zhao Mandi, Shreeya Jain, and Shuran Song

[15]

Yuren Mao, Peigen Liu, Xinjian Wang, Rui Ding, Jing Miao, Hui Zou, Mingjie Qi, Wanxiang Luo, Longbin Lai, Kai Wang, Zhengping Qian, Peilun Yang, Yunjun Gao, and Ying Zhang

Roco: Dialectic multi-robot collaboration with large language models. Preprint, arXiv:2307.04738. Yuren Mao, Peigen Liu, Xinjian Wang, Rui Ding, Jing Miao, Hui Zou, Mingjie Qi, Wanxiang Luo, Longbin Lai, Kai Wang, Zhengping Qian, Peilun Yang, Yunjun Gao, and Ying Zhang

[16]

Agent-kernel: A microkernel multi-agent system framework for adaptive social simulation powered by llms. Preprint, arXiv:2512.01610. OpenAI

[17]

Generative agents: Interactive simulacra of human behavior. Preprint, arXiv:2304.03442. Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li

[18]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun

Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. Preprint, arXiv:2502.08691. Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun

[19]

Jun Sashihara, Yukihisa Fujita, Kota Nakamura, Masahiro Kuwahara, and Teruaki Hayashi

Chatdev: Communicative agents for software development. Preprint, arXiv:2307.07924. Jun Sashihara, Yukihisa Fujita, Kota Nakamura, Masahiro Kuwahara, and Teruaki Hayashi

[20]

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor

Llm-based multi-agent system for simulating strategic and goal-oriented data marketplaces. Preprint, arXiv:2511.13233. Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor

[21]

Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong

Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. Preprint, arXiv:2405.07960. Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong

[22]

Many heads are better than one: Improved scientific idea generation by a llm-based multi-agent system. Preprint, arXiv:2410.09403. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang

[23]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Zhou, and 1 others

Autogen: Enabling next-gen llm applications via multi-agent conversation. Preprint, arXiv:2308.08155. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Zhou, and 1 others

[24]

Bangguo Yu, Qihao Yuan, Kailai Li, Hamidreza Kasaei, and Ming Cao

Qwen3 technical report. Preprint, arXiv:2505.09388. Bangguo Yu, Qihao Yuan, Kailai Li, Hamidreza Kasaei, and Ming Cao. 2025a. Co-navgpt: Multi-robot cooperative visual semantic navigation using vision language models. Preprint, arXiv:2310.07937. Tian Yu, Ken Shi, Zixin Zhao, and Gerald Penn. 2025b. Multi-agent based character simulation for story writ...

[25]

Simulating classroom education with LLM-empowered agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10364–10379, Albuquerque, New Mexico. Association for Computational Linguistics. Xuhui Zhou, Hao Zhu, Leen...

[26]

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You

Sotopia: Interactive evaluation for social intelligence in language agents. Preprint, arXiv:2310.11667. Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You

[27]

Multiagentbench: Evaluating the collaboration and competition of llm agents. Preprint, arXiv:2503.01935.
