pith. machine review for the scientific record.

arxiv: 2604.15972 · v1 · submitted 2026-04-17 · 💻 cs.AI · cs.CL · cs.MA

Recognition: unknown

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords agent · reasoning · multi-agent · weak agents · collaboration · framework

The pith

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent AI systems let several language models work together on tough reasoning problems. One weak or error-prone agent can drag down the whole group even if others are strong. Earlier methods focused on making the best agents better or filtering out bad answers. WORC instead locates the current weakest link and gives it more chances to get the answer right. In the first stage, task features feed a meta-learning predictor trained on optimal setups found by swarm intelligence algorithms; this predicts performance weights and flags the lowest-weight agent as weak. In the second stage, an uncertainty-driven rule gives that agent larger repeated-sampling quotas. Experiments on reasoning benchmarks report 82.2 percent average accuracy plus gains in stability and generalization across model architectures. The core idea is that team performance is limited by its weakest member, so targeted compensation can outperform strength-only approaches.
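The second stage can be made concrete with a minimal sketch. The inverse-weight rule below is an assumption for illustration; the paper's exact uncertainty-driven allocation formula is not reproduced in this summary.

```python
def allocate_budgets(weights, total_budget):
    """Toy quota rule: lower predicted weight -> larger repeated-sampling
    quota. Budgets are split proportionally to inverse weights (an assumed
    stand-in for the paper's uncertainty-driven allocation)."""
    inverse = [1.0 / max(w, 1e-6) for w in weights]
    scale = total_budget / sum(inverse)
    return [max(1, round(x * scale)) for x in inverse]

# Four agents with predicted performance weights; the lowest-weight
# agent (index 3) is flagged weak and receives the largest quota.
weights = [0.35, 0.30, 0.20, 0.15]
weak = min(range(len(weights)), key=lambda i: weights[i])
quotas = allocate_budgets(weights, total_budget=20)
print(weak, quotas)  # 3 [3, 4, 6, 7]
```

Note that rounding can make the quotas sum to slightly more or less than the total budget; a real allocator would redistribute the remainder.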

Core claim

WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.

Load-bearing premise

The meta-learning-based weight predictor, trained on optimal configurations from swarm intelligence algorithms, can reliably identify the weak agent in zero-shot fashion from task features alone.
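To see what "zero-shot identification from task features alone" entails, here is a toy stand-in for the predictor, assuming (hypothetically) that the SIA-built weight knowledge base reduces to a nearest-neighbor lookup over task-feature vectors; the data, features, and distance are all illustrative, not the paper's.

```python
# Hypothetical knowledge base: task-feature vectors paired with the agent
# weights found optimal by swarm-intelligence search on training tasks.
KB = [
    ((0.9, 0.1), (0.40, 0.30, 0.20, 0.10)),
    ((0.2, 0.8), (0.10, 0.20, 0.30, 0.40)),
    ((0.5, 0.5), (0.25, 0.25, 0.25, 0.25)),
]

def localize_weak_agent(task_features):
    """Zero-shot mapping: retrieve the weights of the nearest stored
    task, then flag the lowest-weight agent as the weak link."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, weights = min(KB, key=lambda entry: sq_dist(entry[0], task_features))
    return weights, weights.index(min(weights))

weights, weak = localize_weak_agent((0.85, 0.15))
print(weights, weak)  # the nearest stored task flags agent 3 as weak
```

The premise is load-bearing precisely because everything downstream consumes these predicted weights: if the lookup generalizes poorly to unseen tasks, the extra samples go to the wrong agent.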

Figures

Figures reproduced from arXiv: 2604.15972 by Chaoning Zhang, Haoyu Bian, Jiaquan Zhang, Wei Dong, Xingyao Li, Yang Yang, Yuanfang Guo.

Figure 1: Overview of the vulnerability of weak agents in multi-agent reasoning.
Figure 2: Overview of the WORC method in the AC framework. (a) Weak Agent Localization: A weight knowledge base is constructed via SIA training, and …
Figure 4: Compares accuracy between the AC and WORC methods (same con…
Figure 3: Agent weight evolution across iterations under three SIAs (PSO, GWO, and HO). The labels #1-#4 denote the relative ranking of the four agents at …
Original abstract

LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that additional repeated sampling can compensate for low agent reliability and that task features suffice for zero-shot performance prediction; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Individual agent errors in multi-agent collaboration can be mitigated by allocating larger repeated-sampling quotas to low-performance agents.
    This underpins the weak-link optimization stage and the claim that compensation improves overall robustness.
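The assumption can be sanity-checked in a stylized model: if each of an agent's k samples is independently correct with probability p, a majority vote over the k samples raises effective accuracy, which is the mechanism that larger quotas rely on. (Independence and clean aggregation are idealizations for illustration, not claims from the paper.)

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that a strict majority of k independent samples is
    correct, given per-sample accuracy p (use odd k to avoid ties)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# A weak agent at 60% per-sample accuracy climbs toward ~75% effective
# accuracy as its repeated-sampling quota grows from 1 to 11.
for k in (1, 5, 11):
    print(k, round(majority_accuracy(0.6, k), 3))
```

Correlated errors across samples would blunt this effect, which is one place the axiom could break in practice.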

pith-pipeline@v0.9.0 · 5560 in / 1247 out tokens · 77703 ms · 2026-05-10T08:50:39.840177+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 37 canonical work pages · 9 internal anchors
