MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

Di Zhang; Fengbo Zhang; Jialu Wang; Jun Han; Ruijie Wang; Xuefei Wang; Yihan Hu; Yikun Ban; Yutong Ye

arxiv: 2605.17361 · v1 · pith:AS5UUDEFnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 14:31 UTC · model grok-4.3


keywords continual learning · multi-agent systems · topology learning · optimal transport · PAC-Bayes · large language models · communication topology · topology forgetting


The pith

MasFACT transfers historical agent collaboration patterns as priors to prevent topology forgetting when multi-agent LLM systems face streams of evolving tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard topology generators for multi-agent LLM systems overwrite useful communication structures when adapting to new tasks, creating a failure mode called topology forgetting caused by misalignment in agent semantics and relations across tasks. It proposes MasFACT as a solution that treats past effective topologies as transferable priors, moving them between task-specific agent spaces via geometry-aware optimal transport and then adapting them conservatively. A sympathetic reader would care because real deployments involve sequences of related problems rather than isolated tasks, so retaining proven collaboration patterns could improve long-term performance without repeated rediscovery of structures. Experiments in class-, domain-, and task-level continual settings show gains in average accuracy and lower forgetting relative to replay and generation baselines, with compatibility across different topology generators.

Core claim

The authors claim that topology forgetting arises from cross-task misalignment in agent-level functional semantics and relational communication structures, and that this can be addressed by transferring historical topology priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport followed by PAC-Bayes-guided conservative posterior adaptation that balances task-specific plasticity with structural stability.

What carries the argument

Fused Gromov-Wasserstein optimal transport that aligns and transfers topology priors between different task-specific agent spaces, combined with PAC-Bayes-guided conservative posterior adaptation to retain stability while allowing new-task learning.
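
For orientation, the fused Gromov-Wasserstein objective in its standard form (Vayer et al., 2019) is sketched below. This page does not reproduce the paper's own cost, so every symbol here is an assumption: $a_i$ and $b_j$ are agent features in the source and target task, $C^{s}$ and $C^{t}$ are the two intra-task relation matrices, $h$ and $g$ are node weights, $d$ is a feature distance, and $\alpha \in [0,1]$ trades the feature term against the relational term.

```latex
\mathrm{FGW}_{q,\alpha}(\mu,\nu)
  = \min_{\pi \in \Pi(h,g)}
    \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, d(a_i, b_j)^{q}
        + \alpha\, \big| C^{s}_{ik} - C^{t}_{jl} \big|^{q} \Big]
    \pi_{ij}\, \pi_{kl},
\qquad
\Pi(h,g) = \{\pi \ge 0 : \pi \mathbf{1} = h,\ \pi^{\top}\mathbf{1} = g\}.
```

With $\alpha = 0$ this reduces to a Wasserstein cost on features alone, and with $\alpha = 1$ to a Gromov-Wasserstein cost on structure alone.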

If this is right

Average accuracy improves across class-, domain-, and task-level continual learning settings.
Topology forgetting is reduced compared to strong topology generation and replay-based baselines.
The method integrates directly with existing MAS topology generators without requiring changes to their core design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-transfer idea could apply to continual learning in single-agent or non-LLM systems where structural knowledge must persist across task shifts.
Longer task sequences might expose limits on how many historical priors can be maintained before adaptation costs rise.
Measuring the geometric distance between task agent spaces before and after transfer could serve as a diagnostic for when the method succeeds or fails.
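
One way to make such a diagnostic concrete is to evaluate the fused Gromov-Wasserstein cost of a given coupling between two agent graphs. The sketch below is only an illustration under stated assumptions: it uses the standard squared-loss form of the cost, it scores a fixed transport plan rather than solving for the optimal one (an optimal-transport solver would be needed for that), and the function name `fgw_cost` and all toy numbers are hypothetical, not taken from the paper.

```python
from itertools import product


def fgw_cost(feat_cost, c_src, c_tgt, plan, alpha=0.5):
    """Fused Gromov-Wasserstein cost of a FIXED coupling (squared loss).

    feat_cost[i][j]: feature distance between source agent i and target agent j
    c_src, c_tgt:    intra-task relation matrices (e.g. adjacency weights)
    plan[i][j]:      transport mass moved from source agent i to target agent j
    alpha:           0 -> features only, 1 -> structure only
    """
    n, m = len(c_src), len(c_tgt)
    # Linear (Wasserstein) term on agent features.
    w_term = sum(feat_cost[i][j] * plan[i][j] for i, j in product(range(n), range(m)))
    # Quadratic (Gromov-Wasserstein) term on pairwise relations.
    gw_term = sum(
        (c_src[i][k] - c_tgt[j][l]) ** 2 * plan[i][j] * plan[k][l]
        for i, j, k, l in product(range(n), range(m), range(n), range(m))
    )
    return (1 - alpha) * w_term + alpha * gw_term


# Toy example (hypothetical numbers): a 3-agent chain compared with itself.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
zero_feat = [[0.0] * 3 for _ in range(3)]
identity_plan = [[1 / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]
# Swapping agents 0 and 1 breaks the chain structure.
swapped_plan = [[0.0, 1 / 3, 0.0], [1 / 3, 0.0, 0.0], [0.0, 0.0, 1 / 3]]

same = fgw_cost(zero_feat, chain, chain, identity_plan, alpha=1.0)
drift = fgw_cost(zero_feat, chain, chain, swapped_plan, alpha=1.0)
```

In this toy case the identity coupling has zero structural cost, while the swapped coupling is penalised because four adjacency entries no longer line up. Tracking such a value before and after transfer is one possible reading of the diagnostic suggested above.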

Load-bearing premise

The approach assumes that useful past collaboration structures can be aligned and transferred as priors to new tasks despite shifts in agent functions and communication relations.

What would settle it

A direct comparison on a multi-task sequence where MasFACT is applied versus a standard topology generator, checking whether accuracy on the first task remains higher or forgetting metrics are lower; if no improvement appears, the central claim would be falsified.
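
The comparison described above can be scored with the usual continual-learning summary statistics. The sketch below assumes the common definitions (final average accuracy, and forgetting as the drop from a task's best earlier accuracy to its final accuracy); the paper's own metric definitions are not shown on this page, and the two accuracy matrices are invented purely for illustration.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[t][i] is accuracy on task i measured after training on task t.
    """
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    num_tasks = len(acc)
    drops = []
    for i in range(num_tasks - 1):  # the last task cannot be forgotten yet
        best_past = max(acc[t][i] for t in range(i, num_tasks - 1))
        drops.append(best_past - acc[-1][i])
    return sum(drops) / len(drops)


# Hypothetical 3-task stream; entries above the diagonal are unused.
baseline = [
    [0.80, 0.00, 0.00],
    [0.65, 0.78, 0.00],
    [0.55, 0.66, 0.81],
]
with_prior_transfer = [
    [0.80, 0.00, 0.00],
    [0.77, 0.79, 0.00],
    [0.75, 0.76, 0.80],
]
```

Under these made-up numbers the second matrix has both higher final average accuracy and lower forgetting, which is the pattern the central claim predicts. A real test would need matrices measured on the paper's benchmarks.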

Figures

Figures reproduced from arXiv: 2605.17361 by Di Zhang, Fengbo Zhang, Jialu Wang, Jun Han, Ruijie Wang, Xuefei Wang, Yihan Hu, Yikun Ban, Yutong Ye.

**Figure 2.** Overview of MasFACT. The framework has three stages: historical topology construction (Sec. 3.1), geometry-aware prior retrieval (Sec. 3.2), and conservative posterior adaptation (Sec. 3.3).

**Figure 3.** Mechanistic analysis of structural forgetting. Left: topology drift tracks old-task accuracy

Original abstract

Multi-agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter-agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real-world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, topology forgetting, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross-task misalignment in both agent-level functional semantics and relational communication structures. To address this challenge, we propose MasFACT, a geometry-aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport and perform PAC-Bayes-guided conservative posterior adaptation to balance task-specific plasticity with structural stability. Experiments across class-, domain-, and task-level continual settings demonstrate that MasFACT consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay-based baselines, and can be seamlessly integrated with different MAS topology generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MasFACT names topology forgetting in continual multi-agent LLM systems and tests a fused OT plus PAC-Bayes transfer to carry priors forward, but the claim that this alignment actually keeps old relational structures intact rests on unshown details.


The paper's core move is to treat topology changes across task streams as a forgetting problem rather than just a plasticity one. It identifies misalignment in agent semantics and relations as the cause, then moves historical collaboration patterns into new spaces via Fused Gromov-Wasserstein optimal transport and follows that with conservative posterior updates guided by PAC-Bayes bounds. That combination is the main novelty; prior continual learning work on graphs or agents does not appear to have used this exact geometry-aware prior transfer for topology generators.
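
As a reminder of what a PAC-Bayes bound controls, one commonly quoted form (McAllester's bound with Maurer's constant) is given below. It is a generic statement, not the bound derived in the paper, which this page does not show. Assumptions: losses take values in $[0,1]$, there are $n \ge 8$ i.i.d. samples, the prior $P$ is fixed before seeing the data, and the statement holds with confidence $1-\delta$.

```latex
\Pr\!\left[\ \forall Q:\quad
  L(Q) \;\le\; \hat{L}_n(Q)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\ \right] \;\ge\; 1-\delta .
```

A plausible reading of the method is that $P$ plays the role of the transported historical topology prior, so keeping $\mathrm{KL}(Q \,\|\, P)$ small is what makes the posterior update conservative.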

Referee Report

2 major / 1 minor

Summary. The paper claims that existing MAS topology generators suffer from 'topology forgetting' when adapting to streams of evolving tasks due to cross-task misalignment in agent functional semantics and relational structures. It proposes MasFACT, which transfers historical collaboration priors across task-specific agent spaces via Fused Gromov-Wasserstein optimal transport and applies PAC-Bayes-guided conservative posterior adaptation to balance plasticity and stability. Experiments across class-, domain-, and task-level continual settings reportedly show consistent gains in average accuracy, reduced topology forgetting versus topology generation and replay baselines, and seamless integration with multiple MAS topology generators.

Significance. If the empirical claims and the structural-preservation properties of the OT transfer hold, the work would be moderately significant for continual learning in LLM-based multi-agent systems, as it targets an underexplored failure mode and reuses established optimal-transport and PAC-Bayes machinery in a geometry-aware posterior-transfer setting. The reported compatibility with arbitrary topology generators is a practical strength.

major comments (2)

[§3] §3 (Fused Gromov-Wasserstein transfer): The central claim that the fused OT cost (node features + edge relations) produces a transport plan preserving relational communication structures required by prior tasks is load-bearing for the reduction in topology forgetting. The manuscript does not provide a concrete verification (e.g., a structure-preservation metric or ablation on the relative weighting of feature vs. relational terms in the fused distance) that the alignment does not distort collaboration patterns when functional-semantic embeddings dominate. This directly addresses the skeptic's concern and must be strengthened for the retention argument to be convincing.
[§5] §5 (Experiments): The reported improvements in average accuracy and topology-forgetting reduction are presented without error bars, statistical significance tests, or explicit rules for data exclusion and hyper-parameter selection across the class-/domain-/task-level settings. Because the soundness assessment is currently low, these details are required to substantiate that gains arise from structural retention rather than increased plasticity alone.
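
The statistical reporting requested in the second comment could look roughly like the sketch below. It assumes per-seed scores are available for both methods on the same seeds, and it uses an exact paired sign-flip permutation test as one reasonable choice of significance test; the referee does not prescribe a specific test, and the scores shown are hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_p(method_scores, baseline_scores):
    """Exact two-sided sign-flip permutation test on per-seed paired differences."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    observed = abs(mean(diffs))
    hits = total = 0
    # Under the null, each paired difference is equally likely to be + or -.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(mean(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-seed average accuracies (5 seeds, same seeds for both methods).
method = [0.771, 0.765, 0.779, 0.768, 0.774]
baseline = [0.741, 0.752, 0.747, 0.739, 0.750]

summary = {
    "method": (mean(method), stdev(method)),
    "baseline": (mean(baseline), stdev(baseline)),
    "p_value": paired_sign_flip_p(method, baseline),
}
```

With only five seeds the smallest attainable two-sided p-value for this test is 2/32, so even a consistent gain on every seed cannot clear a 0.05 threshold. That is one practical reason to report more seeds alongside the error bars.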

minor comments (1)

[§3] Notation for the PAC-Bayes posterior adaptation and the precise definition of the fused Gromov-Wasserstein cost should be introduced earlier and used consistently to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we plan to make.


Referee: [§3] §3 (Fused Gromov-Wasserstein transfer): The central claim that the fused OT cost (node features + edge relations) produces a transport plan preserving relational communication structures required by prior tasks is load-bearing for the reduction in topology forgetting. The manuscript does not provide a concrete verification (e.g., a structure-preservation metric or ablation on the relative weighting of feature vs. relational terms in the fused distance) that the alignment does not distort collaboration patterns when functional-semantic embeddings dominate. This directly addresses the skeptic's concern and must be strengthened for the retention argument to be convincing.

Authors: We agree that additional verification of the structure-preservation properties is necessary to strengthen our claims. In the revised version, we will add an ablation study varying the relative weighting between the feature and relational components in the fused Gromov-Wasserstein distance. Furthermore, we will introduce quantitative metrics to assess structure preservation, such as the similarity in communication patterns or the retention of key relational edges across tasks. These additions will provide concrete evidence that the transport plan maintains the required collaboration structures even when semantic embeddings are prominent. revision: yes
Referee: [§5] §5 (Experiments): The reported improvements in average accuracy and topology-forgetting reduction are presented without error bars, statistical significance tests, or explicit rules for data exclusion and hyper-parameter selection across the class-/domain-/task-level settings. Because the soundness assessment is currently low, these details are required to substantiate that gains arise from structural retention rather than increased plasticity alone.

Authors: We recognize the need for greater statistical rigor and transparency in our experimental evaluation. We will revise the experimental section to include error bars computed over multiple independent runs with different random seeds. We will also conduct and report statistical significance tests comparing MasFACT against the baselines. Additionally, we will provide explicit details on our hyper-parameter selection methodology and any criteria used for data exclusion or inclusion in the continual learning benchmarks. These changes should clarify that the improvements stem from the proposed structural retention mechanisms. revision: yes
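
The "retention of key relational edges" promised in the first response could be measured in several ways. The sketch below shows one simple candidate, assuming each old-task agent is mapped to a single new-task agent (for example by taking the largest entry in each row of a transport plan). The function name `edge_retention`, the agent roles, and the mappings are all hypothetical and are not the authors' metric.

```python
def edge_retention(old_edges, new_edges, agent_map):
    """Fraction of old directed edges that survive after mapping agents.

    old_edges, new_edges: sets of (sender, receiver) pairs
    agent_map:            dict from old-task agent to new-task agent
                          (e.g. the arg-max of each row of a transport plan)
    """
    if not old_edges:
        return 1.0
    kept = sum(
        1
        for u, v in old_edges
        if (agent_map.get(u), agent_map.get(v)) in new_edges
    )
    return kept / len(old_edges)


# Hypothetical roles: an old planner -> coder -> tester pipeline.
old = {("planner", "coder"), ("coder", "tester")}
new = {("analyst", "solver"), ("solver", "verifier"), ("analyst", "verifier")}
mapping = {"planner": "analyst", "coder": "solver", "tester": "verifier"}
bad_mapping = {"planner": "solver", "coder": "analyst", "tester": "verifier"}

good_score = edge_retention(old, new, mapping)
bad_score = edge_retention(old, new, bad_mapping)
```

Here the first mapping keeps both old edges and the second keeps only one. Reporting such a score across the feature-versus-relation weighting would show directly whether a feature-dominated alignment distorts the old collaboration pattern.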

Circularity Check

0 steps flagged

No significant circularity; derivation relies on established external techniques


The paper's central framework applies Fused Gromov-Wasserstein optimal transport and PAC-Bayes posterior adaptation to transfer topology priors across tasks. These are standard, independently established methods from prior literature outside the present authors. No derivation step reduces a claimed prediction or result to a quantity defined by the target itself, nor does any load-bearing premise collapse to a self-citation chain or fitted input renamed as output. The abstract and method description treat the OT alignment and conservative adaptation as imported tools whose properties are not redefined within the paper, leaving the empirical claims about reduced topology forgetting as testable against baselines rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to identify specific free parameters, axioms, or invented entities; no equations or implementation specifics are given.

pith-pipeline@v0.9.0 · 5785 in / 1072 out tokens · 41154 ms · 2026-05-20T14:31:28.511040+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 11 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[4]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, et al. Theoremqa: A theorem-driven question answering dataset. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

work page 2023
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Improv- ing factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning, 2024

work page 2024
[7]

Ai agents in engineering design: a multi-agent framework for aesthetic and aerodynamic car design

Mohamed Elrefaie, Janet Qian, Raina Wu, Qian Chen, Angela Dai, and Faez Ahmed. Ai agents in engineering design: a multi-agent framework for aesthetic and aerodynamic car design. In International Design Engineering Technical Conferences and Computers and Information in EngineeringConference, volume 89237, page V03BT03A048.American Societyof Mechanical Engi...

work page 2025
[8]

TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System

WenzheFan,TommasoTognoli,HenryPengZou,ChunyuMiao,YiboWang,andXinhuaZhang. Todycomm: Task-oriented dynamic communication for multi-round llm-based multi-agent system.arXiv preprint arXiv:2602.03688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics (TACL), 9:346–361, 2021

work page 2021
[10]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Slim: Let llm learn more and forget less with soft lora and identity mixture

Jiayi Han, Liang Du, Hongwei Du, Xiangguo Zhou, Yiwen Wu, Yuanfang Zhang, Weibo Zheng, and Donghong Han. Slim: Let llm learn more and forget less with soft lora and identity mixture. InProceedingsofthe2025ConferenceoftheNationsoftheAmericasChapteroftheAssociation 10 forComputationalLinguistics: HumanLanguageTechnologies(Volume1: LongPapers), pages 4792–4804, 2025

work page 2025
[12]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024
[13]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[14]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021
[15]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong, and Kazunari Sugiyama. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics (COLING), pages 6609–6625, 2020

work page 2020
[16]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations, 2023

work page 2023
[17]

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems.arXiv preprint arXiv:2408.08435, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King King, et al. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, AndreiARusu, KieranMilan, JohnQuan, TiagoRamalho, AgnieszkaGrabska-Barwinska, etal. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017
[20]

Amas: Adaptively determining communication topology for llm-based multi-agent system

Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, and Wei Han. Amas: Adaptively determining communication topology for llm-based multi-agent system. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2061–2070, 2025

work page 2025
[21]

Adaptive graph pruning for multi-agent communication.arXiv preprint arXiv:2506.02951, 2025

Boyi Li, Zhonghan Zhao, Der-Horng Lee, and Gaoang Wang. Adaptive graph pruning for multi-agent communication.arXiv preprint arXiv:2506.02951, 2025

work page arXiv 2025
[22]

Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

work page 2023
[23]

TACO: Topics in algorithmic COde generation dataset.arXiv preprint, arXiv:2312.14852, 2023

RongaoLi,JieFu,Bo-WenZhang,TaoHuang,ZhihongSun,ChenLyu,GuangLiu,ZhiJin,and Ge Li. Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

work page arXiv 2023
[24]

Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 23142–23150, 2026

work page 2026
[25]

Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

work page 2017
[26]

Gradient episodic memory for continual learning

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017. 11

work page 2017
[27]

Packnet: Adding multiple tasks to a single network by iterative pruning

Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018

work page 2018
[28]

Some pac-bayesian theorems

David A McAllester. Some pac-bayesian theorems. InProceedings of the eleventh annual conference on Computational learning theory, pages 230–234, 1998

work page 1998
[29]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

work page 2024
[30]

Yu Shang et al

Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, et al. Scaling large language model-based multi-agent collaboration.arXiv preprint arXiv:2406.07155, 2024

work page arXiv 2024
[31]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Jackson, Evan Frankel, Ethan Perez, Samuel R Bowman, and Jared Perez. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Experience replay for continual learning.Advances in neural information processing systems, 32, 2019

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning.Advances in neural information processing systems, 32, 2019

work page 2019
[33]

Understanding the information propagation effects of communication topologies in llm-based multi-agent systems

Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, and Xin Wang. Understanding the information propagation effects of communication topologies in llm-based multi-agent systems. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12358–12372, 2025

work page 2025
[34]

Optimal transport for structured data with application on graphs

Vayer Titouan, Nicolas Courty, Romain Tavenard, and Rémi Flamary. Optimal transport for structured data with application on graphs. InInternational Conference on Machine Learning, pages 6275–6284. PMLR, 2019

work page 2019
[35]

Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[36]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

A comprehensive review of multimodal large language models: Performance and challenges across different tasks.arXiv preprint arXiv:2408.01319, 2024

Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, et al. A comprehensive review of multimodal large language models: Performance and challenges across different tasks.arXiv preprint arXiv:2408.01319, 2024

work page arXiv 2024
[38]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery,andDennyZhou. Self-consistencyimproveschainofthoughtreasoninginlanguage models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xuehai Xue, Xin Jiang, Nanning Zheng, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.arXiv preprint arXiv:2406.01574, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Agentdropout: Dynamic agent elimination for token-efficient and high-performance llm-based multi-agent collaboration

Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, and Min Zhang. Agentdropout: Dynamic agent elimination for token-efficient and high-performance llm-based multi-agent collaboration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24013–24035, 2025

work page 2025
[41]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024. 12

work page 2024
[42]

arXiv preprint arXiv:2402.01364 , year=

Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. Continual learning for large language models: A survey.arXiv preprint arXiv:2402.01364, 2024

work page arXiv 2024
[43]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

ZhilinYang,PengQi,SaizhengZhang,YoshuaBengio,WilliamWCohen,RuslanSalakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380, 2018

work page 2018
[44]

Masrouter: Learning to route llms for multi-agent systems

Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learning to route llms for multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

work page 2025
[45]

G-designer: Architecting multi-agent communication topologies via graph neural networks.arXiv preprint arXiv:2410.11782, 2024

Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, and Dawei Cheng. G-designer: Architecting multi-agent communication topologies via graph neural networks.arXiv preprint arXiv:2410.11782, 2024

work page arXiv 2024
[46]

Aflow: Automating agentic workflow generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. InThe Thirteenth International Conference on Learning Representations, 2024

work page 2024
[47]

Designinggenaitoolsforpersonalized learning implementation: Theoretical analysis and prototype of a multi-agent system.Journal of Teacher Education, 76(3):280–293, 2025

LingZhang,ZijunYao,andAryaHadizadehMoghaddam. Designinggenaitoolsforpersonalized learning implementation: Theoretical analysis and prototype of a multi-agent system.Journal of Teacher Education, 76(3):280–293, 2025

work page 2025
[48]

Eduplanner: Llm-based multi-agent systems for customized and intelligent instructional design.IEEE Transactions on Learning Technologies, 2025

Xueqiao Zhang, Chao Zhang, Jianwen Sun, Jun Xiao, Yi Yang, and Yawei Luo. Eduplanner: Llm-based multi-agent systems for customized and intelligent instructional design.IEEE Transactions on Learning Technologies, 2025

work page 2025
[49]

Towardslifelonglearningoflarge language models: A survey.ACM Computing Surveys, 57(8):1–35, 2025

JunhaoZheng, ShengjieQiu, ChengmingShi, andQianliMa. Towardslifelonglearningoflarge language models: A survey.ACM Computing Surveys, 57(8):1–35, 2025

work page 2025
[50] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. In Findings of the association for computational linguistics: NAACL 2024, pages 2299–2314, 2024.
[51] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024.
[52] Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö Arık. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533, 2025.

A Appendix Overview

The appendix is structured as follows:

• Section B discusses the limitations of MasFACT and outlines future research directions...

