pith. machine review for the scientific record.

arxiv: 2604.15972 · v1 · submitted 2026-04-17 · 💻 cs.AI · cs.CL · cs.MA

Recognition: unknown

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords agent · reasoning · multi-agent · weak agents · collaboration · framework

The pith

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent AI systems let several language models work together on tough reasoning problems. One weak or error-prone agent can drag down the whole group even if others are strong. Earlier methods focused on making the best agents better or filtering out bad answers. WORC instead locates the current weakest link and gives it more chances to get the answer right. In the first stage, task features feed a meta-learning predictor trained on optimal setups found by swarm intelligence algorithms; this predicts performance weights and flags the lowest-weight agent as weak. In the second stage, an uncertainty-driven rule gives that agent larger repeated-sampling quotas. Experiments on reasoning benchmarks report 82.2 percent average accuracy plus gains in stability and generalization across model architectures. The core idea is that team performance is limited by its weakest member, so targeted compensation can outperform strength-only approaches.
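The second stage can be made concrete with a minimal sketch. The inverse-weight rule below is an assumption for illustration; the paper's exact uncertainty-driven allocation formula is not reproduced in this summary.

```python
def allocate_budgets(weights, total_budget):
    """Toy quota rule: lower predicted weight -> larger repeated-sampling
    quota. Budgets are split proportionally to inverse weights (an assumed
    stand-in for the paper's uncertainty-driven allocation)."""
    inverse = [1.0 / max(w, 1e-6) for w in weights]
    scale = total_budget / sum(inverse)
    return [max(1, round(x * scale)) for x in inverse]

# Four agents with predicted performance weights; the lowest-weight
# agent (index 3) is flagged weak and receives the largest quota.
weights = [0.35, 0.30, 0.20, 0.15]
weak = min(range(len(weights)), key=lambda i: weights[i])
quotas = allocate_budgets(weights, total_budget=20)
print(weak, quotas)  # 3 [3, 4, 6, 7]
```

Note that rounding can make the quotas sum to slightly more or less than the total budget; a real allocator would redistribute the remainder.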

Core claim

WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.

Load-bearing premise

The meta-learning-based weight predictor, trained on optimal configurations from swarm intelligence algorithms, can reliably identify the weak agent in zero-shot fashion from task features alone.
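To see what "zero-shot identification from task features alone" entails, here is a toy stand-in for the predictor, assuming (hypothetically) that the SIA-built weight knowledge base reduces to a nearest-neighbor lookup over task-feature vectors; the data, features, and distance are all illustrative, not the paper's.

```python
# Hypothetical knowledge base: task-feature vectors paired with the agent
# weights found optimal by swarm-intelligence search on training tasks.
KB = [
    ((0.9, 0.1), (0.40, 0.30, 0.20, 0.10)),
    ((0.2, 0.8), (0.10, 0.20, 0.30, 0.40)),
    ((0.5, 0.5), (0.25, 0.25, 0.25, 0.25)),
]

def localize_weak_agent(task_features):
    """Zero-shot mapping: retrieve the weights of the nearest stored
    task, then flag the lowest-weight agent as the weak link."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, weights = min(KB, key=lambda entry: sq_dist(entry[0], task_features))
    return weights, weights.index(min(weights))

weights, weak = localize_weak_agent((0.85, 0.15))
print(weights, weak)  # the nearest stored task flags agent 3 as weak
```

The premise is load-bearing precisely because everything downstream consumes these predicted weights: if the lookup generalizes poorly to unseen tasks, the extra samples go to the wrong agent.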

Figures

Figures reproduced from arXiv: 2604.15972 by Chaoning Zhang, Haoyu Bian, Jiaquan Zhang, Wei Dong, Xingyao Li, Yang Yang, Yuanfang Guo.

Figure 1: Overview of the vulnerability of weak agents in multi-agent reasoning.
Figure 2: Overview of the WORC method in the AC framework. (a) Weak Agent Localization: A weight knowledge base is constructed via SIA training, and …
Figure 4: Compares accuracy between the AC and WORC methods (same con…
Figure 3: Agent weight evolution across iterations under three SIAs (PSO, GWO, and HO). The labels #1-#4 denote the relative ranking of the four agents at …
Original abstract

LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that additional repeated sampling can compensate for low agent reliability and that task features suffice for zero-shot performance prediction; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Individual agent errors in multi-agent collaboration can be mitigated by allocating larger repeated-sampling quotas to low-performance agents.
    This underpins the weak-link optimization stage and the claim that compensation improves overall robustness.
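The assumption can be sanity-checked in a stylized model: if each of an agent's k samples is independently correct with probability p, a majority vote over the k samples raises effective accuracy, which is the mechanism that larger quotas rely on. (Independence and clean aggregation are idealizations for illustration, not claims from the paper.)

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that a strict majority of k independent samples is
    correct, given per-sample accuracy p (use odd k to avoid ties)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# A weak agent at 60% per-sample accuracy climbs toward ~75% effective
# accuracy as its repeated-sampling quota grows from 1 to 11.
for k in (1, 5, 11):
    print(k, round(majority_accuracy(0.6, k), 3))
```

Correlated errors across samples would blunt this effect, which is one place the axiom could break in practice.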

pith-pipeline@v0.9.0 · 5560 in / 1247 out tokens · 77703 ms · 2026-05-10T08:50:39.840177+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 37 canonical work pages · 9 internal anchors
