pith. machine review for the scientific record.

arxiv: 2604.17503 · v1 · submitted 2026-04-19 · 💻 cs.AI · cs.MA


SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology


Pith reviewed 2026-05-10 05:19 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multi-agent systems · vision-language models · graph transformers · self-evolving agents · dynamic topology · skill distillation · visual multiagent systems

The pith

SkillGraph jointly evolves agent skills from failures and dynamically predicts collaboration topologies using a multimodal graph transformer, yielding consistent gains on visual multi-agent benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that fixed topologies and static agents limit visual multi-agent systems. SkillGraph addresses this by having a Multimodal Graph Transformer create content-aware collaboration graphs and a Skill Designer extract heuristics from mistakes to update a skill bank. These updates feed back to refine the topology for each query. The result is a self-improving system that outperforms static setups without changing the underlying models.
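
To make the loop concrete, here is a minimal runnable sketch of the co-evolution cycle the paper describes. Everything in it is a hedged illustration: the function names (predict_topology, run_agents, distill_skill) and data shapes are hypothetical stand-ins, not the authors' released API.

    # Hypothetical sketch of the SkillGraph co-evolution loop.
    # All names and shapes are illustrative stand-ins, not the authors' API.

    def predict_topology(query, image, skill_embeddings, n_agents=3):
        """Stand-in for the MMGT: decide which agent pairs may communicate."""
        # A real MMGT fuses visual tokens, instruction semantics and skill
        # embeddings; this toy version just fully connects the agents.
        return [(i, j) for i in range(n_agents) for j in range(n_agents) if i != j]

    def run_agents(query, image, topology, skill_bank):
        """Stand-in for multi-agent execution; returns (answer, failure_log)."""
        return "falling", f"agents misread the axis on {query!r}"

    def distill_skill(failure_log):
        """Stand-in for the Skill Designer: turn a failure into a heuristic."""
        return {"heuristic": f"guard against: {failure_log}", "embedding": [0.0] * 8}

    skill_bank = [{"heuristic": "read charts left to right", "embedding": [0.1] * 8}]
    for query, image, gold in [("What is the trend?", "chart.png", "rising")]:
        embeddings = [s["embedding"] for s in skill_bank]
        topology = predict_topology(query, image, embeddings)
        answer, failure = run_agents(query, image, topology, skill_bank)
        if answer != gold:                  # failure case feeds the Skill Designer
            skill_bank.append(distill_skill(failure))
            # Updated embeddings re-enter predict_topology on the next query,
            # closing the capability-topology feedback loop.
    print(len(skill_bank))  # 2: one seed skill plus one distilled from the failure

The point the sketch makes explicit is the feedback edge: skills distilled from failures re-enter topology prediction as embeddings, so structure and capability evolve together.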

Core claim

SkillGraph achieves self-evolving multi-agent collaboration by using a Multimodal Graph Transformer to encode visual tokens, semantics, and skills into a query-conditioned graph, while the Skill Designer distills failure cases into an evolving Skill Bank whose embeddings loop back to improve the graph prediction.

What carries the argument

Multimodal Graph Transformer (MMGT) that predicts dynamic collaboration graphs from multimodal inputs, including active skill embeddings, coupled with the Skill Designer, which maintains the self-evolving Skill Bank.
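
Since the MMGT carries the argument, a toy version of query-conditioned edge prediction may help fix ideas. This is a sketch under assumed details (feature dimensions, fusion by concatenation, bilinear edge scoring); the paper's actual architecture is not specified in this review.

    import torch
    import torch.nn as nn

    class ToyGraphPredictor(nn.Module):
        """Tiny query-conditioned edge predictor in the spirit of the MMGT.
        Dimensions, concatenation fusion, and bilinear edge scoring are all
        assumptions made for illustration."""
        def __init__(self, dim=32):
            super().__init__()
            self.fuse = nn.Linear(3 * dim, dim)   # visual + semantic + skill per node
            self.edge = nn.Bilinear(dim, dim, 1)  # pairwise edge score

        def forward(self, visual, text, skills):
            # Each input: (n_agents, dim) node-wise features.
            h = torch.tanh(self.fuse(torch.cat([visual, text, skills], dim=-1)))
            n = h.size(0)
            src = h.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1)
            dst = h.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)
            logits = self.edge(src, dst).view(n, n)
            return torch.sigmoid(logits)  # soft adjacency; threshold at inference

    n = 4
    model = ToyGraphPredictor()
    adj = model(torch.randn(n, 32), torch.randn(n, 32), torch.randn(n, 32))
    print((adj > 0.5).int())  # the predicted communication topology for this query

Thresholding the soft adjacency yields the per-query communication graph; because the skill features are inputs, any Skill Bank update changes the predicted topology.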

Load-bearing premise

Distilling reasoning heuristics from failure cases yields genuine skill improvements, and those improvements can be encoded as embeddings and used to adapt the topology without introducing errors or instability.
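
The premise is easiest to interrogate in code. Below is a hedged sketch of one failure-to-skill distillation step; call_llm is a stub for whatever model the Skill Designer queries, and the de-duplication guard at the end is one plausible stabilizer, not necessarily the paper's mechanism.

    def call_llm(prompt: str) -> str:
        # Stub for whatever model the Skill Designer queries; returns a
        # canned heuristic here so the sketch runs offline.
        return "Check the axis scale before comparing bar heights."

    def distill(failure_case: dict, skill_bank: list) -> list:
        prompt = (
            "An agent failed the task below. State one reusable heuristic that "
            f"would have prevented the error.\nTask: {failure_case['task']}\n"
            f"Wrong answer: {failure_case['answer']} (gold: {failure_case['gold']})"
        )
        heuristic = call_llm(prompt)
        # The premise's risk lives here: a bad heuristic enters the bank and,
        # via its embedding, also perturbs future topology predictions.
        # Re-testing on the originating failure would be a stronger guard.
        if heuristic not in skill_bank:
            skill_bank.append(heuristic)
        return skill_bank

    bank = distill({"task": "Which bar is taller?", "answer": "A", "gold": "B"}, [])
    print(bank)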

What would settle it

A controlled comparison on the same benchmarks would settle it: if the evolved skills and adapted topologies perform no better than, or worse than, fixed topologies and static agents, the central claimed benefit is falsified.

Figures

Figures reproduced from arXiv: 2604.17503 by Bo Yin, Jiangning Zhang, Ruolin Shen, Xiaobin Hu, Xinlei Yu, Zheng Nie.

Figure 1
Figure 1. Comparison of VMAS paradigms. Prior VMAS uses static topologies and frozen skills. Our SkillGraph enables a co-evolution loop: MMGT predicts dynamic collaboration graphs, while a skill bank self-evolves agent capabilities. view at source ↗
Figure 2
Figure 2. SkillGraph Framework. The system operates in three stages: VMAS Construction: Agents retrieve dynamic skills to initialize policy-aware node features. Topology Design: The Multimodal Graph Transformer (MMGT) fuses visual patches and task semantics to predict a query-conditioned communication topology. Adaptive Skill Evolution: A Skill Designer refines skills using failure logs. Updated skills feed directl… view at source ↗
Figure 3
Figure 3. Ablation study of SkillGraph components. view at source ↗
Figure 4
Figure 4. Performance of SkillGraph across different iteration numbers. view at source ↗
Figure 5
Figure 5. Evolution of the Skill Bank across iterations. view at source ↗
read the original abstract

Scaling vision-language models into Visual Multiagent Systems (VMAS) is hindered by two coupled issues. First, communication topologies are fixed before inference, leaving them blind to visual content and query context; second, agent reasoning abilities remain static during deployment. These issues reinforce each other: a rigid topology fails to leverage richer agent expertise, while static agents lack incentives to specialize for a given query. We address this with SkillGraph, a joint framework that evolves both agent expertise and communication topology. Within this framework, a Multimodal Graph Transformer (MMGT) encodes visual tokens, instruction semantics and active skill embeddings to predict a query-conditioned collaboration graph, replacing hand-crafted routing with dynamic, content-aware information flow. Complementing this, a Skill Designer distills and refines reasoning heuristics from failure cases, constructing a self-evolving multimodal Skill Bank. Crucially, updated skill embeddings are fed back into the MMGT, enabling the topology to adapt alongside capability growth. Experiments show that SkillGraph achieves consistent improvements across four benchmarks, five common MAS structures and four base models. Code is available at https://github.com/niez233/skillgraph.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces SkillGraph, a joint framework for evolving both agent expertise and communication topology in Visual Multiagent Systems (VMAS). It uses a Multimodal Graph Transformer (MMGT) to encode visual tokens, instruction semantics, and active skill embeddings to predict query-conditioned collaboration graphs, replacing fixed topologies with dynamic, content-aware routing. A Skill Designer distills reasoning heuristics from failure cases to build and update a multimodal Skill Bank, with updated skill embeddings fed back into the MMGT to enable co-evolution of capabilities and structure. Experiments report consistent improvements across four benchmarks, five common MAS structures, and four base models, with code released.

Significance. If the empirical results hold under detailed scrutiny, SkillGraph offers a meaningful advance in adaptive multi-agent visual reasoning by directly addressing the coupling between rigid topologies and static agent abilities. The multi-setting evaluation (benchmarks, structures, base models) and code release are clear strengths that support reproducibility and generalizability claims.

minor comments (2)
  1. The abstract would be strengthened by including at least one key quantitative result (e.g., average improvement or specific benchmark scores) to better substantiate the claim of consistent gains.
  2. Clarify in the method description how the Skill Bank embeddings are precisely integrated into the MMGT input (e.g., via concatenation, attention, or a dedicated module) to avoid ambiguity in the feedback loop; two candidate mechanisms are sketched below.
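
On that second point, here are two plausible integration mechanisms, written as a hedged sketch rather than the paper's actual design; the dimensions, mean-pooling, and residual fusion are all assumptions.

    import torch
    import torch.nn as nn

    # Two candidate ways Skill Bank embeddings could enter the MMGT input.
    # Both are illustrative assumptions; the paper may use a different module.
    dim, n_agents, n_skills = 32, 4, 6
    node_feats = torch.randn(n_agents, dim)   # per-agent visual+semantic features
    skill_embs = torch.randn(n_skills, dim)   # current Skill Bank embeddings

    # Option A: concatenate a pooled skill summary onto every node feature.
    pooled = skill_embs.mean(dim=0, keepdim=True).expand(n_agents, -1)
    concat_input = torch.cat([node_feats, pooled], dim=-1)          # (4, 64)

    # Option B: cross-attention, letting each agent node attend to skills.
    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
    attended, _ = attn(node_feats.unsqueeze(0),   # queries: agent nodes
                       skill_embs.unsqueeze(0),   # keys: skills
                       skill_embs.unsqueeze(0))   # values: skills
    attn_input = node_feats + attended.squeeze(0)  # residual fusion, (4, 32)

    print(concat_input.shape, attn_input.shape)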

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly identifies the core contribution of SkillGraph in jointly evolving agent skills and query-conditioned topologies via the Multimodal Graph Transformer and Skill Designer. No specific major comments were listed in the provided report, so we have no points requiring detailed rebuttal or revision at this stage. We are happy to incorporate any minor suggestions the editor or referee may wish to add.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes SkillGraph as an empirical framework that combines a Multimodal Graph Transformer for dynamic topology prediction with a Skill Designer for distilling heuristics from failures, feeding updated embeddings back into the system. No equations, derivations, or parameter-fitting procedures appear in the abstract or the described structure that would allow any claimed prediction or result to reduce by construction to its own inputs. The central claims rest on experimental improvements across benchmarks, structures, and models, supported by released code for independent reproduction. This is a standard descriptive systems paper without self-definitional loops, fitted-input predictions, or load-bearing self-citations that would collapse the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based on the abstract alone, the framework introduces new components such as the MMGT and the Skill Designer, but without the full text, details on parameters or axioms are unknown. No free parameters are explicitly mentioned.

invented entities (2)
  • Multimodal Graph Transformer (MMGT) no independent evidence
    purpose: To encode visual tokens, instruction semantics, and active skill embeddings to predict a query-conditioned collaboration graph
    Introduced as part of the framework.
  • Skill Bank no independent evidence
    purpose: To store and refine reasoning heuristics from failure cases
    New component for self-evolving skills.

pith-pipeline@v0.9.0 · 5509 in / 1116 out tokens · 44699 ms · 2026-05-10T05:19:50.138134+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 40 canonical work pages · 12 internal anchors

  1. Alzubi, S., Provenzano, N., Bingham, J., Chen, W., Vu, T.: Evoskill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766 (2026)
  2. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
  3. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025). https://arxiv.org/abs/2502.13923
  4. Chan, C.M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., Liu, Z.: ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023)
  5. Chen, J., Saha, S., Bansal, M.: ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7066–7085 (2024)
  6. Chen, S., Gai, J., Zhou, R., Zhang, J., Zhu, T., Li, J., Wang, K., Wang, Z., Chen, Z., Kaleb, K., et al.: Skillcraft: Can LLM agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718 (2026)
  7. Chen, T., Li, Y., Solodko, M., Wang, S., Jiang, N., Cui, T., Hao, J., Ko, J., Abdali, S., Zheng, S., et al.: Cua-skill: Develop skills for computer using agent. arXiv preprint arXiv:2601.21123 (2026)
  8. Dang, Y., Qian, C., Luo, X., Fan, J., Xie, Z., Shi, R., Chen, W., Yang, C., Che, X., Tian, Y., et al.: Multi-agent collaboration via evolving orchestration. arXiv preprint arXiv:2505.19591 (2025)
  9. Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. In: Forty-first International Conference on Machine Learning (2024)
  10. Feng, L., Zheng, L., He, S., Zhang, F., An, B.: Dr. MAS: Stable reinforcement learning for multi-agent LLM systems. arXiv preprint arXiv:2602.08847 (2026)
  11. Gao, H., Liu, Y., He, Y., Dou, L., Du, C., Deng, Z., Hooi, B., Lin, M., Pang, T.: FlowReasoner: Reinforcing query-level meta-agents. arXiv preprint arXiv:2504.15257 (2025)
  12. Holt, S., Luyten, M.R., van der Schaar, M.: L2MAC: Large language model automatic computer for extensive code generation. arXiv preprint arXiv:2310.02003 (2023)
  13. Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S.K.S., Lin, Z., et al.: MetaGPT: Meta programming for a multi-agent collaborative framework. In: The Twelfth International Conference on Learning Representations (2023)
  14. Hu, S., Lu, C., Clune, J.: Automated design of agentic systems. arXiv preprint arXiv:2408.08435 (2024)
  15. Hu, X., Qian, Y., Yu, J., Liu, J., Tang, P., Ji, X., Xu, C., Liu, J., Yan, X., Yu, X., et al.: The landscape of medical agents: A survey. Authorea Preprints (2025)
  16. Hu, Y., Cai, Y., Du, Y., Zhu, X., Liu, X., Yu, Z., Hou, Y., Tang, S., Chen, S.: Self-evolving multi-agent collaboration networks for software development. arXiv preprint arXiv:2410.16946 (2024)
  17. Huang, K., Huang, J.: Audited skill-graph self-improvement for agentic LLMs via verifiable rewards, experience synthesis, and continual memory. arXiv preprint arXiv:2512.23760 (2025)
  18. Jiang, Y., Li, D., Deng, H., Ma, B., Wang, X., Wang, Q., Yu, G.: SoK: Agentic skills -- beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867 (2026)
  19. Ke, Z., Cao, Y., Chen, Z., Yin, Y., He, S., Cheng, Y.: Early warning of cryptocurrency reversal risks via multi-source data. Finance Research Letters, p. 107890 (2025)
  20. Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., et al.: DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In: The Twelfth International Conference on Learning Representations (2023)
  21. Kim, Y., Park, C., Jeong, H., Chan, Y.S., Xu, X., McDuff, D., Lee, H., Ghassemi, M., Breazeal, C., Park, H.W.: MDAgents: An adaptive collaboration of LLMs for medical decision-making. Advances in Neural Information Processing Systems 37, 79410–79452 (2024)
  22. Kuroki, S., Nakamura, T., Akiba, T., Tang, Y.: Agent skill acquisition for large language models via CycleQD. arXiv preprint arXiv:2410.14735 (2024)
  23. Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
  24. Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems 36, 51991–52008 (2023)
  25. Li, Y., Dou, Y., Shao, J.J., Lyu, Y., Tsang, I., Yin, H.: Skilltracer: Structural failure attribution and refinement of agentic skills in long-horizon web tasks. In: Workshop on Multi-Agent Learning and Its Opportunities in the Era of Generative AI
  26. Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Shi, S., Tu, Z.: Encouraging divergent thinking in large language models through multi-agent debate. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 17889–17904 (2024)
  27. Liang, Y., Zhong, R., Xu, H., Jiang, C., Zhong, Y., Fang, R., Gu, J.C., Deng, S., Yao, Y., Wang, M., et al.: SkillNet: Create, evaluate, and connect AI skills. arXiv preprint arXiv:2603.04448 (2026)
  28. Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? In: European Conference on Computer Vision, pp. 216–233. Springer (2024)
  29. Liu, Z., Zhang, Y., Li, P., Liu, Y., Yang, D.: Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170 (2023)
  30. Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
  31. Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: InfographicVQA. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1697–1706 (2022)
  32. Pan, P., Chen, L., He, Q., Yuan, K., Wang, H., Zhang, W.: Finscra: An LLM-powered multi-chain reasoning framework for interpretable node classification on text-attributed graphs (2026)
  33. Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., et al.: ChatDev: Communicative agents for software development. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15174–15186 (2024)
  34. Qian, C., Xie, Z., Wang, Y., Liu, W., Zhu, K., Xia, H., Dang, Y., Du, Z., Chen, W., Yang, C., et al.: Scaling large language model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155 (2024)
  35. Qian, Y., Hu, X., Yu, J., Xin, S., Chen, X., Zhang, J., Jiang, P.T., Liu, J., Li, H.B.: Medmaslab: A unified orchestration framework for benchmarking multimodal medical multi-agent systems. arXiv preprint arXiv:2603.09909 (2026)
  36. Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., Anandkumar, A.: Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023)
  37. Wang, J., Yan, Q., Wang, Y., Tian, Y., Mishra, S.S., Xu, Z., Gandhi, M., Xu, P., Cheong, L.L.: Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102 (2025)
  38. Wang, Z.Z., Gandhi, A., Neubig, G., Fried, D.: Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821 (2025)
  39. Wang, Z.Z., Mao, J., Fried, D., Neubig, G.: Agent workflow memory. arXiv preprint arXiv:2409.07429 (2024)
  40. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., et al.: AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In: First Conference on Language Modeling (2024)
  41. Xia, P., Chen, J., Wang, H., Liu, J., Zeng, K., Wang, Y., Han, S., Zhou, Y., Zhao, X., Chen, H., et al.: SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234 (2026)
  42. Xue, X., Zhou, Y., Zhang, G., Zhang, Z., Li, Y., Zhang, C., Yin, Z., Torr, P., Ouyang, W., Bai, L.: Comas: Co-evolving multi-agent systems via interaction rewards. arXiv preprint arXiv:2510.08529 (2025)
  43. Yang, Y., Li, J., Pan, Q., Zhan, B., Cai, Y., Du, L., Zhou, J., Chen, K., Chen, Q., Li, X., et al.: Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145 (2026)
  44. Yu, X., Chen, Z., He, Y., Fu, T., Yang, C., Xu, C., Ma, Y., Hu, X., Cao, Z., Xu, J., et al.: The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029 (2026)
  45. Yu, X., Xu, C., Chen, Z., Yin, B., Yang, C., He, Y., Hu, Y., Zhang, J., Tan, C., Hu, X., et al.: Dual latent memory for visual multi-agent system. arXiv preprint arXiv:2602.00471 (2026)
  46. Yu, X., Xu, C., Chen, Z., Zhang, Y., Lu, S., Yang, C., Zhang, J., Yan, S., Hu, X.: Visual document understanding and reasoning: A multi-agent collaboration framework with agent-wise adaptive test-time scaling. arXiv preprint arXiv:2508.03404 (2025)
  47. Yu, X., Xu, C., Zhang, G., Chen, Z., Zhang, Y., He, Y., Jiang, P.T., Zhang, J., Hu, X., Yan, S.: Vismem: Latent vision memory unlocks potential of vision-language models. arXiv preprint arXiv:2511.11007 (2025)
  48. Yu, X., Xu, C., Zhang, G., He, Y., Chen, Z., Xue, Z., Zhang, J., Liao, Y., Hu, X., Jiang, Y.G., et al.: Visual multi-agent system: Mitigating hallucination snowballing via visual flow. arXiv preprint arXiv:2509.21789 (2025)
  49. Zabounidis, R., Wu, Y., Stepputtis, S., Mitchell, T., Li, Y., Sycara, K.P.: Scalar: Self-supervised composition and learning of skills with LLM planning and RL. In: NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning
  50. Zhang, G., Yue, Y., Sun, X., Wan, G., Yu, M., Fang, J., Wang, K., Chen, T., Cheng, D.: G-Designer: Architecting multi-agent communication topologies via graph neural networks. arXiv preprint arXiv:2410.11782 (2024)
  51. Zhang, H., Long, Q., Bao, J., Feng, T., Zhang, W., Yue, H., Wang, W.: MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474 (2026)
  52. Zhang, J., Xiang, J., Yu, Z., Teng, F., Chen, X., Chen, J., Zhuge, M., Cheng, X., Hong, S., Wang, J., et al.: AFlow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762 (2024)
  53. Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y.J., Huang, G.: ExpeL: LLM agents are experiential learners. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19632–19642 (2024)
  54. Zheng, Y., Zhang, Z., Ma, C., Yu, Y., Zhu, J., Dong, B., Zhu, H.: SkillRouter: Retrieve-and-rerank skill selection for LLM agents at scale. arXiv preprint arXiv:2603.22455 (2026)
  55. Zhou, H., Wan, X., Sun, R., Palangi, H., Iqbal, S., Vulić, I., Korhonen, A., Arık, S.Ö.: Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533 (2025)
  56. Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al.: WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023)
  57. Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)
  58. Zhuge, M., Wang, W., Kirsch, L., Faccio, F., Khizbullin, D., Schmidhuber, J.: GPTSwarm: Language agents as optimizable graphs. In: Forty-first International Conference on Machine Learning (2024)