
arxiv: 2605.17361 · v1 · pith:AS5UUDEFnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

Pith reviewed 2026-05-20 14:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · multi-agent systems · topology learning · optimal transport · PAC-Bayes · large language models · communication topology · topology forgetting

The pith

MasFACT transfers historical agent collaboration patterns as priors to prevent topology forgetting when multi-agent LLM systems face streams of evolving tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard topology generators for multi-agent LLM systems overwrite useful communication structures when adapting to new tasks, creating a failure mode called topology forgetting caused by misalignment in agent semantics and relations across tasks. It proposes MasFACT as a solution that treats past effective topologies as transferable priors, moving them between task-specific agent spaces via geometry-aware optimal transport and then adapting them conservatively. A sympathetic reader would care because real deployments involve sequences of related problems rather than isolated tasks, so retaining proven collaboration patterns could improve long-term performance without repeated rediscovery of structures. Experiments in class-, domain-, and task-level continual settings show gains in average accuracy and lower forgetting relative to replay and generation baselines, with compatibility across different topology generators.

Core claim

The authors claim that topology forgetting arises from cross-task misalignment in agent-level functional semantics and relational communication structures, and that this can be addressed by transferring historical topology priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport followed by PAC-Bayes-guided conservative posterior adaptation that balances task-specific plasticity with structural stability.
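
For orientation, a commonly quoted PAC-Bayes bound of the McAllester type is reproduced below. It is a textbook statement, not the bound derived in the paper, which this page does not show. Reading P as the transported historical topology prior and Q as the adapted posterior is an interpretive assumption made here, and the bound presumes n i.i.d. samples and a loss bounded in [0, 1].

```latex
% Classical McAllester-type PAC-Bayes bound, loss bounded in [0,1].
% P: prior fixed before seeing the n samples; Q: any posterior; \delta \in (0,1).
\Pr\left[\, \forall Q:\;
  L(Q) \;\le\; \hat{L}_n(Q)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\,\right] \;\ge\; 1 - \delta
```

Under that reading, "conservative" adaptation corresponds to keeping KL(Q || P) small: a posterior that stays close to the transferred prior pays a smaller complexity penalty, while one that drifts far needs a correspondingly larger drop in empirical loss to justify itself.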

What carries the argument

Fused Gromov-Wasserstein optimal transport that aligns and transfers topology priors between different task-specific agent spaces, combined with PAC-Bayes-guided conservative posterior adaptation to retain stability while allowing new-task learning.
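
To make the fused objective concrete, the sketch below evaluates the standard Fused Gromov-Wasserstein cost (square loss) for a fixed coupling. It is a minimal illustration in pure Python, not the paper's implementation: the function name `fgw_cost`, the use of adjacency matrices as structure matrices, and the convention that `alpha` weights the structural term are all assumptions made here. It only scores a given coupling and does not solve for the optimal one.

```python
from typing import List

Matrix = List[List[float]]


def fgw_cost(M: Matrix, C1: Matrix, C2: Matrix, T: Matrix, alpha: float) -> float:
    """Fused Gromov-Wasserstein objective for a fixed coupling T (square loss).

    M[i][j] : feature cost between source agent i and target agent j
    C1, C2  : structure matrices (assumed here: adjacency) of source / target
    T[i][j] : mass moved from source agent i to target agent j
    alpha   : 0 -> features only, 1 -> structure only (assumed convention)
    """
    n, m = len(C1), len(C2)
    # Linear term: how far apart matched agents are in feature space.
    feature_term = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    # Quadratic term: how much pairwise relations change under the matching.
    structure_term = sum(
        (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )
    return (1.0 - alpha) * feature_term + alpha * structure_term
```

As a toy check, take two identical three-agent chains with identical agent features. A coupling that swaps the first two agents costs nothing at `alpha = 0` but is penalized for any `alpha > 0`, which is the distortion the structural term exists to catch.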

If this is right

  • Average accuracy improves across class-, domain-, and task-level continual learning settings.
  • Topology forgetting is reduced compared to strong topology generation and replay-based baselines.
  • The method integrates directly with existing MAS topology generators without requiring changes to their core design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-transfer idea could apply to continual learning in single-agent or non-LLM systems where structural knowledge must persist across task shifts.
  • Longer task sequences might expose limits on how many historical priors can be maintained before adaptation costs rise.
  • Measuring the geometric distance between task agent spaces before and after transfer could serve as a diagnostic for when the method succeeds or fails.

Load-bearing premise

The approach assumes that useful past collaboration structures can be aligned and transferred as priors to new tasks despite shifts in agent functions and communication relations.

What would settle it

A direct comparison on a multi-task sequence where MasFACT is applied versus a standard topology generator, checking whether accuracy on the first task remains higher or forgetting metrics are lower; if no improvement appears, the central claim would be falsified.
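
The comparison described above can be scored with the usual continual-learning metrics. The sketch below uses the conventional definitions of average accuracy and average forgetting; the paper's exact metric definitions are not given on this page, so the formulas and the function names are assumptions.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after training on the last task.

    acc[t][i] is accuracy on task i measured after training on task t (i <= t).
    """
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = []
    for i in range(num_tasks - 1):
        # Best accuracy on task i at any point before the final evaluation.
        best_before_final = max(acc[t][i] for t in range(i, num_tasks - 1))
        drops.append(best_before_final - acc[-1][i])
    return sum(drops) / len(drops)
```

Both functions expect a lower-triangular table with one row per training stage. Running them on the same task order for MasFACT and for the unmodified generator gives the two numbers the falsification test compares.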

Figures

Figures reproduced from arXiv: 2605.17361 by Di Zhang, Fengbo Zhang, Jialu Wang, Jun Han, Ruijie Wang, Xuefei Wang, Yihan Hu, Yikun Ban, Yutong Ye.

Figure 1: Overview of MAS topology and related challenges. (a) An illustration of MAS topology
Figure 2: Overview of MasFACT. The framework has three stages: historical topology construction (Sec. 3.1), geometry-aware prior retrieval (Sec. 3.2), and conservative posterior adaptation (Sec. 3.3). • Evaluation: We design a hierarchical continual MAS evaluation protocol covering task-, domain-, and class-level task evolution and show consistent improvements on both accuracy and forgetting metrics across diverse s…
Figure 3: Mechanistic analysis of structural forgetting. Left: topology drift tracks old-task accuracy
Original abstract

Multi-agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter-agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real-world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, topology forgetting, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross-task misalignment in both agent-level functional semantics and relational communication structures. To address this challenge, we propose MasFACT, a geometry-aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport and perform PAC-Bayes-guided conservative posterior adaptation to balance task-specific plasticity with structural stability. Experiments across class-, domain-, and task-level continual settings demonstrate that MasFACT consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay-based baselines, and can be seamlessly integrated with different MAS topology generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing MAS topology generators suffer from 'topology forgetting' when adapting to streams of evolving tasks due to cross-task misalignment in agent functional semantics and relational structures. It proposes MasFACT, which transfers historical collaboration priors across task-specific agent spaces via Fused Gromov-Wasserstein optimal transport and applies PAC-Bayes-guided conservative posterior adaptation to balance plasticity and stability. Experiments across class-, domain-, and task-level continual settings reportedly show consistent gains in average accuracy, reduced topology forgetting versus topology generation and replay baselines, and seamless integration with multiple MAS topology generators.

Significance. If the empirical claims and the structural-preservation properties of the OT transfer hold, the work would be moderately significant for continual learning in LLM-based multi-agent systems, as it targets an underexplored failure mode and reuses established optimal-transport and PAC-Bayes machinery in a geometry-aware posterior-transfer setting. The reported compatibility with arbitrary topology generators is a practical strength.

major comments (2)
  1. [§3] §3 (Fused Gromov-Wasserstein transfer): The central claim that the fused OT cost (node features + edge relations) produces a transport plan preserving relational communication structures required by prior tasks is load-bearing for the reduction in topology forgetting. The manuscript does not provide a concrete verification (e.g., a structure-preservation metric or ablation on the relative weighting of feature vs. relational terms in the fused distance) that the alignment does not distort collaboration patterns when functional-semantic embeddings dominate. This directly addresses the skeptic's concern and must be strengthened for the retention argument to be convincing.
  2. [§5] §5 (Experiments): The reported improvements in average accuracy and topology-forgetting reduction are presented without error bars, statistical significance tests, or explicit rules for data exclusion and hyper-parameter selection across the class-/domain-/task-level settings. Because the soundness assessment is currently low, these details are required to substantiate that gains arise from structural retention rather than increased plasticity alone.
minor comments (1)
  1. [§3] Notation for the PAC-Bayes posterior adaptation and the precise definition of the fused Gromov-Wasserstein cost should be introduced earlier and used consistently to improve readability.
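
One possible shape for the structure-preservation metric requested in major comment 1 is an edge-retention score, sketched below. This is a hypothetical metric, not one defined in the paper: it rounds the transport plan to a hard assignment (an assumption that discards soft mass splitting) and then counts how many source edges land on edges of the target topology. The names `hard_assignment` and `edge_retention` are illustrative.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def hard_assignment(T: List[List[float]]) -> List[int]:
    """Map each source agent to the target agent receiving most of its mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in T]


def edge_retention(T: List[List[float]], src_edges: Set[Edge], tgt_edges: Set[Edge]) -> float:
    """Fraction of directed source edges whose images are edges of the target topology."""
    if not src_edges:
        return 1.0  # no edges to lose
    assign = hard_assignment(T)
    kept = sum((assign[u], assign[v]) in tgt_edges for u, v in src_edges)
    return kept / len(src_edges)
```

Reporting this score while sweeping the feature-versus-structure weight would show directly whether feature-dominated alignments scramble communication edges.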

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [§3] §3 (Fused Gromov-Wasserstein transfer): The central claim that the fused OT cost (node features + edge relations) produces a transport plan preserving relational communication structures required by prior tasks is load-bearing for the reduction in topology forgetting. The manuscript does not provide a concrete verification (e.g., a structure-preservation metric or ablation on the relative weighting of feature vs. relational terms in the fused distance) that the alignment does not distort collaboration patterns when functional-semantic embeddings dominate. This directly addresses the skeptic's concern and must be strengthened for the retention argument to be convincing.

    Authors: We agree that additional verification of the structure-preservation properties is necessary to strengthen our claims. In the revised version, we will add an ablation study varying the relative weighting between the feature and relational components in the fused Gromov-Wasserstein distance. Furthermore, we will introduce quantitative metrics to assess structure preservation, such as the similarity in communication patterns or the retention of key relational edges across tasks. These additions will provide concrete evidence that the transport plan maintains the required collaboration structures even when semantic embeddings are prominent. revision: yes

  2. Referee: [§5] §5 (Experiments): The reported improvements in average accuracy and topology-forgetting reduction are presented without error bars, statistical significance tests, or explicit rules for data exclusion and hyper-parameter selection across the class-/domain-/task-level settings. Because the soundness assessment is currently low, these details are required to substantiate that gains arise from structural retention rather than increased plasticity alone.

    Authors: We recognize the need for greater statistical rigor and transparency in our experimental evaluation. We will revise the experimental section to include error bars computed over multiple independent runs with different random seeds. We will also conduct and report statistical significance tests comparing MasFACT against the baselines. Additionally, we will provide explicit details on our hyper-parameter selection methodology and any criteria used for data exclusion or inclusion in the continual learning benchmarks. These changes should clarify that the improvements stem from the proposed structural retention mechanisms. revision: yes
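
The reporting promised in the second response could look like the sketch below: per-seed mean and standard deviation for error bars, plus an exact paired sign-flip permutation test against a baseline. The choice of test is an assumption made here, since the rebuttal does not name one, and the function names are illustrative.

```python
import itertools
import statistics
from typing import List, Tuple


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_pvalue(method: List[float], baseline: List[float]) -> float:
    """Exact two-sided sign-flip permutation test on per-seed paired differences."""
    diffs = [a - b for a, b in zip(method, baseline)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate all 2^n sign patterns; feasible only for small seed counts.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With n paired seeds the smallest attainable two-sided p-value is 2 / 2^n, so five seeds cannot go below 0.0625. That floor is worth keeping in mind when deciding how many runs the revised experiments need.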

Circularity Check

0 steps flagged

No significant circularity; derivation relies on established external techniques

full rationale

The paper's central framework applies Fused Gromov-Wasserstein optimal transport and PAC-Bayes posterior adaptation to transfer topology priors across tasks. These are standard, independently established methods from prior literature outside the present authors. No derivation step reduces a claimed prediction or result to a quantity defined by the target itself, nor does any load-bearing premise collapse to a self-citation chain or fitted input renamed as output. The abstract and method description treat the OT alignment and conservative adaptation as imported tools whose properties are not redefined within the paper, leaving the empirical claims about reduced topology forgetting as testable against baselines rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to identify specific free parameters, axioms, or invented entities; no equations or implementation specifics are given.


