
arxiv: 2606.30246 · v1 · pith:7FUPOYA3new · submitted 2026-06-29 · 💻 cs.AI · cs.CY · cs.MA

Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific Collaboration

Pith reviewed 2026-06-30 06:13 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.MA
keywords autonomous research agents · scientific collaboration · multi-agent systems · research infrastructure · collaboration networks · agent coordination

The pith

Clarus organizes research goals into traceable, reviewable, attributable collaboration networks across phases and participants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Clarus as infrastructure that coordinates autonomous research agents, treating them as AI systems, humans, teams or organizations, to move research beyond isolated tasks or closed loops. It reframes the process as open multi-phase collaboration that must track questions, evidence, participants and resources under uncertainty. The system uses a minimal object model for projects, agents and resources, plus four layers and pluggable modules to adapt to different risks and constraints. A controlled paper-generation case study shows the result is a network that remains traceable, reviewable, attributable and accumulative. A sympathetic reader would care because current agent tools lack mechanisms for shared, auditable progress at web scale.

Core claim

Clarus reformulates research as an open, auditable, attributable, and resource-aware multi-phase collaboration process. It defines a minimal project-agent-resource object model and organizes scientific collaboration through four layers: Research Application, Digital Collaboration, Physical Substrate, and Physical World. Core modules are implemented as pluggable mechanisms. Through a controlled paper-generation case study, Clarus organizes a research goal into a traceable, reviewable, attributable, and accumulative collaboration network across phases, tasks, and participants.

What carries the argument

The four-layer architecture (Research Application, Digital Collaboration, Physical Substrate, Physical World) combined with the project-agent-resource object model and pluggable coordination modules.
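
To make the object model concrete, here is a minimal Python sketch of a project that holds agents and resources, with each resource tagged by one of the four layers. Only the layer names and the project/agent/resource split come from the abstract; every class name, field, and identifier below is a hypothetical illustration, not the paper's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    # The four layer names are taken from the paper's abstract.
    RESEARCH_APPLICATION = "Research Application"
    DIGITAL_COLLABORATION = "Digital Collaboration"
    PHYSICAL_SUBSTRATE = "Physical Substrate"
    PHYSICAL_WORLD = "Physical World"


@dataclass(frozen=True)
class Agent:
    # Hypothetical fields. Per the abstract, an agent may be an AI system,
    # a human researcher, a team, a laboratory, or an organization.
    agent_id: str
    kind: str


@dataclass(frozen=True)
class Resource:
    # Hypothetical: a digital or physical resource tagged with its layer.
    resource_id: str
    layer: Layer


@dataclass
class Project:
    goal: str
    agents: list[Agent] = field(default_factory=list)
    resources: list[Resource] = field(default_factory=list)

    def resources_in(self, layer: Layer) -> list[Resource]:
        """Return the resources attached to one layer."""
        return [r for r in self.resources if r.layer is layer]


project = Project(goal="Write a benchmark paper")
project.agents.append(Agent("a1", "ai_system"))
project.resources.append(Resource("gpu-0", Layer.PHYSICAL_SUBSTRATE))
project.resources.append(Resource("corpus", Layer.DIGITAL_COLLABORATION))
```

Keeping the layer as an explicit tag on each resource is one simple way to let access rules differ per layer without changing the project container.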

If this is right

  • Research projects shift from closed workflows to open, auditable processes that record contributions across phases.
  • Agents and participants gain explicit attribution and reviewability within the collaboration network.
  • Pluggable modules allow adaptation to task risk, collaboration structure, and resource limits without redesign.
  • Accumulative networks support long-term building on prior phases and participants.
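
One way to read "pluggable" is as a small interface with interchangeable implementations chosen per task. The sketch below uses `typing.Protocol` for that; the `AuditPolicy` interface, both policies, and the 0.5 risk threshold are assumptions made for illustration and are not described in the paper.

```python
from typing import Protocol


class AuditPolicy(Protocol):
    """Hypothetical plug-in interface: decide whether a task output passes."""

    def approve(self, reviews: int) -> bool: ...


class LightAudit:
    # Low-risk setting: a single review is enough.
    def approve(self, reviews: int) -> bool:
        return reviews >= 1


class StrictAudit:
    # Higher-risk setting: require three independent reviews.
    def approve(self, reviews: int) -> bool:
        return reviews >= 3


def select_policy(risk: float) -> AuditPolicy:
    """Swap the module by task risk; the 0.5 threshold is an arbitrary assumption."""
    return StrictAudit() if risk >= 0.5 else LightAudit()


low = select_policy(0.1).approve(reviews=1)
high = select_policy(0.9).approve(reviews=1)
```

With one review on record, the low-risk task passes and the high-risk task does not, and the calling code is identical in both cases.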

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open attribution could reduce redundant work if multiple independent groups adopt the same object model.
  • The Physical Substrate layer may need extra protocols when real labs or equipment enter the network.
  • The trust mechanisms described could extend to versioned data sharing across organizations.

Load-bearing premise

The four-layer architecture and pluggable mechanisms can handle coordination under uncertainty and varying resource constraints at web scale.

What would settle it

A replication of the paper-generation case study in which the produced collaboration network lacks clear traceability or attributability for tasks and phases.
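
A replication would need an operational test for "lacks clear traceability". As a rough sketch, one could require that every artifact record names a producing agent, a task, and a phase, and report the fraction that do. The record layout, field names, and sample data here are hypothetical and are not taken from the paper.

```python
def untraceable(artifacts):
    """Return ids of artifacts missing a producing agent, task, or phase."""
    required = ("agent", "task", "phase")
    return [a["id"] for a in artifacts if any(not a.get(k) for k in required)]


# Hypothetical artifact records from a replicated run.
records = [
    {"id": "draft-v1", "agent": "writer", "task": "t3", "phase": "Writing"},
    {"id": "table-2", "agent": None, "task": "t2", "phase": "Experiment"},
]

missing = untraceable(records)
# Fraction of artifacts with a complete agent/task/phase link.
coverage = 1 - len(missing) / len(records)
```

A coverage value well below 1 on a faithful replication would count against the claim; a value of 1 would only show that the links exist, not that they are correct.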

Figures

Figures reproduced from arXiv: 2606.30246 by Bo Huang, Chenxi Zeng, Hanwen Zhu, Junwei Liao, Ming Zhou, Shuai Shao, Weinan Zhang, Xiaohang Nie, Yang Li, Yuanjian Zhou, Yuanyi Song, Yuan Yuan, Zeyi Chen, Zhengxi Yu, Zhi Han, Zhiyu Chen, Zicai Cui, Zihan Guo.

Figure 1: From closed workflows and open research “dark forests” to trusted scientific collaboration networks. Closed multi-agent workflows organize fixed roles into a bounded pipeline, while open research networks introduce heterogeneous agents, organizations, tools, data, and physical resources with unverifiable identity, ambiguous credit, broken provenance, and ungoverned access. Clarus addresses this transition…
Figure 2: Clarus four-layer architecture. Clarus organizes open scientific collaboration into four layers. The Research Application layer structures research goals into project, phase, subtask, and artifact objects. The Digital Collaboration layer provides identity, discovery, collaboration, and utility capabilities for open participants. The Physical Substrate layer mediates controlled access to decentralized rea…
Figure 3: Application workflow in Clarus. Clarus transforms a research goal into an attributable project lifecycle. Open team formation discovers and assembles agents around the task, after which the project container executes repeated phase loops through phase planning, subtask DAG construction, agent execution, artifact and evidence collection, audit and credit confirmation, and phase checkpoints. The resulting re…
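
The phase loop in the Figure 3 caption can be sketched as a dependency-ordered walk over a subtask DAG that appends one trace record per executed task. This uses the standard-library `graphlib` module (Python 3.9+); the task names and the trace record shape are hypothetical stand-ins, not the paper's data model.

```python
from graphlib import TopologicalSorter

# Hypothetical subtask DAG for one phase: each key depends on the tasks in its set.
dag = {
    "plan": set(),
    "run_experiment": {"plan"},
    "collect_evidence": {"run_experiment"},
    "audit": {"collect_evidence"},
}

# Resolve an execution order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())

# Execute in order and keep a trace entry per task so the run stays reviewable.
trace = []
for step, task in enumerate(order):
    trace.append({"step": step, "task": task, "status": "done"})
```

Because this example DAG is a single chain, the order is unique; with branching DAGs several valid orders exist, and `TopologicalSorter` returns one of them.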
Figure 4: Clarus prototype interface and MirrorEval paper artifact. a: The Clarus project room brings phase state, agent execution records, artifact registration, DAG overview, and related information into one inspectable interface. b: The final MirrorEval paper page assembled by the system. c to h: Representative pages of the final paper, including the problem setting, benchmark structure, experimental pipeline, ma…
Figure 5: End-to-end execution results of the MirrorEval case study. a: Subtask DAGs across six research phases, showing how Clarus decomposes an open-ended research goal into a phased, executable, and traceable task structure. b: Final credit settlement, comparing equal share with the contribution allocation derived from process records, artifact ownership, and handoff. c: The paper preview interface, showing the c…
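
The contrast in panel b of Figure 5, equal share versus contribution-based allocation, can be illustrated with a few lines of Python. The excerpt does not state the settlement rule, so the proportional rule below is an assumed stand-in, and the agent names and scores are invented for the example.

```python
def equal_share(agents):
    """Split credit evenly across all participating agents."""
    return {a: 1 / len(agents) for a in agents}


def proportional_share(scores):
    """Split credit in proportion to a per-agent contribution score (assumed rule)."""
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}


# Hypothetical contribution scores, e.g. counts of accepted artifacts per agent.
scores = {"planner": 2, "experimenter": 5, "writer": 3}

equal = equal_share(scores)
weighted = proportional_share(scores)
```

Under these made-up scores the experimenter's share rises from one third to one half, which is the kind of shift the figure compares.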
Figure 6: Audit-triggered replanning in the Experiment phase. a: Adapter provenance audit failure in the first version of Experiment DAG, where the red node indicates a failed audit. b: Human checkpoint and DAG diff record. Because Auto checkpoint decision was enabled, the system automatically accepts the recommendation to repair the current phase and triggers replanning from first version to second one. c: The rep…
Figure 7: Credit attribution and provenance tracking in the MirrorEval run. a: Credit-share changes of different agents across the six research phases. Solid lines denote accumulated cross-phase settlement, while dashed lines denote within-phase or intermediate states. b: Distribution of 589 trace events, showing that credit, artifact, task, audit, phase, and routing records jointly form the evidence trail. c: The n…
original abstract

Existing autonomous research agents can support parts of the research process, but most systems still treat research as either an isolated assistant task or a closed workflow. Therefore, autonomous science needs a collaboration infrastructure that coordinates projects, agents, and digital and physical resources. We identify this as a shift from code-centered execution loops to research-oriented collaboration processes, where questions, evidence, participants, and resources must be coordinated under uncertainty. In this framing, an agent may be an AI system, a human researcher, a team, a laboratory, or an organization-backed participant. To this end, we present Clarus, a collaboration infrastructure for coordinating autonomous research agents toward web-scale scientific collaboration. Clarus reformulates research as an open, auditable, attributable, and resource-aware multi-phase collaboration process. It defines a minimal project-agent-resource object model and organizes scientific collaboration through four layers including Research Application, Digital Collaboration, Physical Substrate, and Physical World. Core modules are implemented as pluggable mechanisms, allowing Clarus to adapt to task risk, collaboration structure, and resource constraints. Through a controlled paper-generation case study, we show that Clarus can organize a research goal into a traceable, reviewable, attributable, and accumulative collaboration network across phases, tasks, and participants. Together, the object model, collaboration protocol, trust mechanisms, and prototype validation provide an initial foundation for open research networks. Clarus is now available at clarus.holosai.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Clarus, a collaboration infrastructure for coordinating autonomous research agents (AI systems, humans, teams, labs, or organizations) toward web-scale scientific collaboration. It reformulates research as an open, auditable, attributable, and resource-aware multi-phase process using a minimal project-agent-resource object model organized across four layers (Research Application, Digital Collaboration, Physical Substrate, Physical World). Core modules are implemented as pluggable mechanisms to adapt to task risk, collaboration structure, and resource constraints. The central claim is validated through a controlled paper-generation case study demonstrating that Clarus organizes a research goal into a traceable, reviewable, attributable, and accumulative collaboration network across phases, tasks, and participants.

Significance. If the case study evidence holds and generalizes beyond the controlled setting, Clarus could provide a foundational object model, protocol, and trust mechanisms for open research networks, shifting from isolated code-centered loops to coordinated multi-participant processes under uncertainty. The prototype availability at clarus.holosai.io offers a concrete implementation starting point. However, the lack of quantitative metrics or scaling analysis in the validation limits demonstrated significance for web-scale claims.

major comments (2)
  1. [Case Study] Case Study section: The controlled paper-generation case study is presented without quantitative results, error analysis, implementation details, or metrics on traceability, reviewability, or attributability. This leaves the central claim—that Clarus organizes a research goal into an effective collaboration network—unsupported by evidence in the manuscript.
  2. [Architecture and Pluggable Mechanisms] Architecture and Pluggable Mechanisms sections: The four-layer architecture plus pluggable mechanisms are asserted to handle coordination under uncertainty and varying resource constraints for web-scale use, yet no specific mechanisms, failure modes, or experiments exercising these conditions appear; the controlled case study does not test web-scale conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to respond. We address each major comment below, clarifying the intended scope of the work as a conceptual infrastructure proposal supported by a controlled demonstration.

point-by-point responses
  1. Referee: [Case Study] Case Study section: The controlled paper-generation case study is presented without quantitative results, error analysis, implementation details, or metrics on traceability, reviewability, or attributability. This leaves the central claim—that Clarus organizes a research goal into an effective collaboration network—unsupported by evidence in the manuscript.

    Authors: The case study is presented as a controlled, qualitative demonstration to illustrate how the project-agent-resource model structures a research goal into traceable phases with attributable contributions across participants. We acknowledge that it provides no quantitative metrics, error analysis, or statistical evaluation of traceability or attributability. This choice reflects the paper's focus on introducing the object model, layers, and protocol rather than conducting a performance benchmark study. The prototype at clarus.holosai.io supplies additional implementation details for inspection. We agree that quantitative metrics would provide stronger support for the claims but maintain that the existing demonstration is sufficient to show the model's organizational capability within the scope of this work.
    revision: no

  2. Referee: [Architecture and Pluggable Mechanisms] Architecture and Pluggable Mechanisms sections: The four-layer architecture plus pluggable mechanisms are asserted to handle coordination under uncertainty and varying resource constraints for web-scale use, yet no specific mechanisms, failure modes, or experiments exercising these conditions appear; the controlled case study does not test web-scale conditions.

    Authors: The four-layer architecture and pluggable mechanisms are described conceptually to enable adaptation to task risk, collaboration structures, and resource constraints through modular trust and resource modules. Specific mechanisms for attribution, auditing, and resource awareness are outlined in the relevant sections, but we recognize that no detailed failure-mode analysis or experiments under web-scale conditions or high uncertainty are included. The case study exercises the layers at small scale to validate the model. We view the design as a foundational proposal rather than a fully evaluated system at web scale and agree that scaling experiments would be required to substantiate broader applicability claims.
    revision: no

Circularity Check

0 steps flagged

No circularity: systems description with independent case study

full rationale

The paper contains no equations, derivations, fitted parameters, or mathematical claims. Its central demonstration is a controlled case study that produces a traceable collaboration network; this is presented as an empirical outcome of the described architecture rather than a quantity that reduces to the architecture by definition or by self-citation. No load-bearing step invokes a prior result from the same authors that is itself unverified, nor does any claim rename a known result or smuggle an ansatz. The architecture and case study are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or external evidence for invented elements. The system itself is presented as the core contribution.

invented entities (1)
  • project-agent-resource object model · no independent evidence
    purpose: To coordinate research elements in an open process
    Introduced as a minimal model in the abstract, without independent validation or falsifiable predictions.

pith-pipeline@v0.9.1-grok · 5851 in / 1069 out tokens · 28098 ms · 2026-06-30T06:13:12.168093+00:00 · methodology

