Self-Evolving Multi-Agent Systems via Decentralized Memory

Guangya Hao; Yunbo Long; Zhuokai Zhao

arxiv: 2605.22721 · v1 · pith:YYKC7YGFnew · submitted 2026-05-21 · 💻 cs.MA

Pith reviewed 2026-05-22 03:16 UTC · model grok-4.3

classification 💻 cs.MA

keywords: decentralized memory · multi-agent systems · LLM agents · self-evolving systems · regret bounds · exploration exploitation · memory management

The pith

Giving each agent its own dual-pool memory keeps the full solution space reachable for multi-agent LLM teams, with O(log T) cumulative regret and higher accuracy than centralized memory baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-evolving multi-agent systems need persistent memory to improve from experience, yet centralized repositories create coordination costs, privacy issues, and reduced agent variety. This work replaces the shared store with a per-agent design: each keeps an exploitation pool of proven trajectories and an exploration pool of new candidates, then reweights the two pools at each stage using feedback from an LLM judge. The construction is proven to keep every feasible solution reachable and to bound cumulative regret by O(log T), matching the best possible rate for stochastic bandits up to constants. Across multiple agent frameworks, model sizes, and task domains the approach lifts accuracy while cutting token consumption compared with centralized memory baselines.

Core claim

DecentMem equips every agent with its own dual-pool memory—an exploitation pool holding consolidated past trajectories and an exploration pool holding LLM-generated candidates for new contexts—then reweights the pools online according to stage-wise LLM-as-a-judge scores. This design is shown to guarantee global reachability of the full solution space and to deliver O(log T) cumulative regret, while empirical runs on AutoGen, DyLAN, and AgentNet with Qwen3 and Gemma4 backbones report accuracy gains of up to 23.8 percent over the strongest centralized memory baseline and token reductions of up to 49 percent.

What carries the argument

Per-agent dual-pool memory (an exploitation pool of consolidated trajectories plus an exploration pool of LLM-generated candidates) whose relative weights are adjusted online by stage-wise LLM-as-a-judge feedback.
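
To make the mechanism concrete, here is a minimal Python sketch of a per-agent dual-pool memory with a weight-proportional router. It is an illustration under stated assumptions, not the authors' implementation: the class name `DualPoolMemory`, the additive update rule, the step size `lr`, the weight floor, and the 0.5 consolidation threshold are all hypothetical, and judge scores are assumed to be normalised to [0, 1].

```python
import random
from dataclasses import dataclass, field


@dataclass
class DualPoolMemory:
    """Hypothetical per-agent memory with an exploitation and an exploration pool."""
    w_exploit: float = 1.0   # weight of the exploitation pool (assumed initial value)
    w_explore: float = 1.0   # weight of the exploration pool (assumed initial value)
    lr: float = 0.1          # assumed step size for the weight update
    floor: float = 0.05      # assumed lower bound so neither pool is ever starved
    trajectories: list = field(default_factory=list)  # consolidated past trajectories

    def exploit_probability(self) -> float:
        # Routing probability proportional to the pool weights.
        return self.w_exploit / (self.w_exploit + self.w_explore)

    def choose_pool(self, rng: random.Random) -> str:
        # With nothing to retrieve yet, fall back to exploration.
        if not self.trajectories:
            return "explore"
        return "exploit" if rng.random() < self.exploit_probability() else "explore"

    def update(self, pool: str, judge_score: float, trajectory: str) -> None:
        # Assumed rule: scores above 0.5 raise the chosen pool's weight,
        # scores below 0.5 lower it, and the weight never drops under the floor.
        delta = self.lr * (judge_score - 0.5)
        if pool == "exploit":
            self.w_exploit = max(self.floor, self.w_exploit + delta)
        else:
            self.w_explore = max(self.floor, self.w_explore + delta)
        # Assumed consolidation step: keep well-rated trajectories for reuse.
        if judge_score >= 0.5:
            self.trajectories.append(trajectory)
```

The weight floor is one simple way to keep both pools selectable with positive probability. Whether the paper's reachability argument relies on a comparable device is not stated in the text above.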

If this is right

Agent teams can scale in number without proportional growth in communication or coordination overhead.
Diversity among agents is preserved because each maintains an independent exploration pool rather than converging on a single shared repository.
The same memory structure applies uniformly across math, code, question-answering, and embodied benchmarks and across different LLM backbones.
Token consumption drops because agents retrieve from smaller, locally relevant pools instead of scanning a growing centralized store.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regret bound implies that long-running agent teams will eventually spend most of their effort exploiting high-quality trajectories discovered early.
Decentralized pools may naturally limit privacy leakage because no single repository holds every agent's full history.
The reweighting mechanism could be extended to incorporate occasional human feedback without changing the overall architecture.
Similar dual-pool logic might transfer to other decentralized learning settings where agents must balance reuse of past successes against discovery of new behaviors.

Load-bearing premise

The LLM judge supplies consistent quality signals that correctly steer the online reweighting between the two pools without introducing bias or task-specific restrictions.

What would settle it

An experiment in which measured cumulative regret grows faster than O(log T) or in which high-value trajectories become unreachable after many stages would falsify the reachability and regret claims.
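
One way such a check might be set up is sketched below in Python. The helper names `cumulative_regret` and `regret_over_log` are hypothetical, and the sketch assumes the true mean reward of every option is known, which in practice would have to be estimated from ground-truth task outcomes. If the ratio R_t / log t keeps growing across checkpoints, the measured regret is outpacing the logarithmic rate.

```python
import math
from itertools import accumulate


def cumulative_regret(chosen_means, best_mean):
    """Running sum of (best_mean - mean reward of the option chosen at stage t)."""
    return list(accumulate(best_mean - m for m in chosen_means))


def regret_over_log(regret, checkpoints):
    """R_t / log(t) at 1-indexed stages t >= 2; stays bounded if regret is O(log T)."""
    return [regret[t - 1] / math.log(t) for t in checkpoints]


# Illustrative failure case: always choosing an option that is 0.1 worse than
# the best one gives linear regret, so the ratio increases with t.
linear = cumulative_regret([0.4] * 1000, best_mean=0.5)
ratios = regret_over_log(linear, [10, 100, 1000])
```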

Figures

Figures reproduced from arXiv: 2605.22721 by Guangya Hao, Yunbo Long, Zhuokai Zhao.

**Figure 2.** Overview of DECENTMEM. Each agent maintains a private dual-pool memory (left): an exploitation pool (E-pool) of consolidated trajectories from past tasks and an exploration pool (X-pool) for generating novel candidates in unseen contexts. At each stage, an online router selects between the two pools with probability proportional to their weights $w_{\text{E-pool}}$ and $w_{\text{X-pool}}$, retrieving from the E-pool or generating…

**Figure 3.** Cumulative accuracy vs. number of tasks on MBPP-Plus across three MAS frameworks.

**Figure 4.** Token cost vs. performance on BBH across three MAS frameworks under QWEN3-8B.

Original abstract

Self-evolving multi-agent systems (MAS) have emerged as a promising route to LLM agents that continually improve from experience, with persistent memory at their foundation. However, existing designs almost exclusively adopt a centralized repository shared across agents, incurring communication and coordination overhead, raising privacy concerns, and collapsing agent diversity. We propose DecentMem, a decentralized memory framework in which each agent maintains its own dual-pool memory -- an exploitation pool of consolidated past trajectories and an exploration pool of LLM-generated candidates for unseen contexts. The two pools are reweighted online based on stage-wise feedback from an LLM-as-a-judge. Theoretically, we prove that this design guarantees global reachability of the solution space and achieves $O(\log T)$ cumulative regret, matching the stochastic bandit lower bound up to constants. In practice, across three MAS frameworks (AutoGen, DyLAN, AgentNet), three Qwen3 backbones (4B/8B/14B), two Gemma4 backbones (E2B/E4B) and five benchmarks spanning math, code, QA, and embodied tasks, DecentMem improves average accuracy by up to 23.8% over the strongest centralized memory baseline and by up to 52.5% over the no-memory baseline, while reducing token usage by up to 49%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DecentMem decentralizes memory via per-agent dual pools and LLM-judge reweighting, claiming O(log T) regret and solid accuracy lifts, but the theory rests on unexamined judge reliability.

The main thing to know is that DecentMem puts a dual-pool memory on each agent—one for consolidated past trajectories and one for LLM-generated exploration candidates—then reweights them online using stage-wise LLM-as-a-judge feedback. The authors claim this gives global reachability of the solution space and O(log T) cumulative regret that matches the stochastic bandit lower bound, while delivering up to 23.8% higher average accuracy than the strongest centralized memory baseline across several setups and cutting token use by up to 49%.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DecentMem, a decentralized memory framework for self-evolving multi-agent LLM systems. Each agent maintains a dual-pool memory (exploitation pool of consolidated trajectories and exploration pool of LLM-generated candidates) that is reweighted online via stage-wise LLM-as-a-judge feedback. The authors prove global reachability of the solution space and O(log T) cumulative regret matching the stochastic bandit lower bound up to constants. Empirically, across AutoGen/DyLAN/AgentNet frameworks, Qwen3/Gemma4 backbones, and five benchmarks (math, code, QA, embodied), DecentMem yields up to 23.8% accuracy gain over the strongest centralized memory baseline and up to 52.5% over no-memory, while cutting token usage by up to 49%.

Significance. If the theoretical guarantees hold, the work would be significant for multi-agent systems research: it directly tackles centralization drawbacks (overhead, privacy, diversity loss) with a clean decentralized design and supplies a regret bound that matches the known stochastic bandit lower bound. The breadth of empirical evaluation across frameworks, model scales, and task types strengthens the practical case. The combination of a parameter-light decentralized mechanism with matching lower-bound regret would be a notable advance if the judge-reliability assumption can be made rigorous.

major comments (2)

[Theoretical Analysis] The proof that dual-pool reweighting yields a stochastic bandit with O(log T) regret and global reachability (theoretical section) treats stage-wise LLM-as-a-judge scores as reliable, unbiased rewards. No concentration inequality, bias bound, or robustness margin for judge error (systematic favoritism, hallucination, or task-dependent bias) is derived. This assumption is load-bearing: any deviation from the assumed stochastic reward model collapses both the regret guarantee and the reachability argument, yet experiments report only end-task accuracy rather than direct judge-fidelity metrics.
[Theoretical Analysis] The online reweighting of exploitation/exploration pools is claimed to produce the stochastic bandit instance whose regret is analyzed. The manuscript does not state explicit restrictions on the judge model or task distribution that would keep judge error bounded away from adversarial; without such restrictions or a derived tolerance, the reduction to the standard bandit setting is not self-contained.

minor comments (2)

[Abstract] The abstract reports maximum gains (“up to 23.8%”, “up to 52.5%”) without indicating the specific framework–model–benchmark combination that attains each maximum; adding a short table or parenthetical would improve clarity.
[Experimental Setup] Reproducibility would benefit from an explicit description of the judge prompt template, temperature, and which backbone serves as judge versus actor in each experiment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the theoretical analysis. We address each major point below, clarifying the scope of our guarantees and outlining targeted revisions to make the assumptions and limitations more explicit.

Referee: The proof that dual-pool reweighting yields a stochastic bandit with O(log T) regret and global reachability (theoretical section) treats stage-wise LLM-as-a-judge scores as reliable, unbiased rewards. No concentration inequality, bias bound, or robustness margin for judge error (systematic favoritism, hallucination, or task-dependent bias) is derived. This assumption is load-bearing: any deviation from the assumed stochastic reward model collapses both the regret guarantee and the reachability argument, yet experiments report only end-task accuracy rather than direct judge-fidelity metrics.

Authors: The analysis models the LLM-as-a-judge scores explicitly as the stochastic reward observations in the bandit instance; the O(log T) regret bound and global reachability therefore hold with respect to these observed scores under the standard i.i.d. sub-Gaussian assumption on the reward process. We do not claim that the bound is robust to arbitrary or adversarial judge errors, nor do we derive concentration inequalities for judge bias, because the theoretical contribution focuses on the regret relative to the feedback signal that actually drives the dual-pool reweighting. We agree that the manuscript would benefit from greater transparency on this point. In the revision we will add a dedicated subsection on modeling assumptions that (i) states the stochastic reward assumption with respect to judge scores, (ii) notes the absence of explicit robustness margins for systematic judge bias, and (iii) reports new judge-fidelity metrics (inter-judge agreement and correlation with ground-truth outcomes on a subset of tasks) to bridge the theoretical and empirical sections. revision: yes
Referee: The online reweighting of exploitation/exploration pools is claimed to produce the stochastic bandit instance whose regret is analyzed. The manuscript does not state explicit restrictions on the judge model or task distribution that would keep judge error bounded away from adversarial; without such restrictions or a derived tolerance, the reduction to the standard bandit setting is not self-contained.

Authors: The reduction proceeds by treating the sequence of judge scores as the reward sequence of a stochastic multi-armed bandit whose arms correspond to the discrete choices of which pool to sample from at each stage. The proof therefore inherits the usual stochastic-bandit assumptions (fixed but unknown mean rewards, bounded variance). We acknowledge that the original manuscript did not enumerate these restrictions explicitly. In the revised version we will insert a formal statement of the required conditions on the judge model (bounded variance of score differences, non-adversarial drift) and on the task distribution (stationary context distribution), together with a short remark that the guarantees are conditional on these non-adversarial conditions. This will render the reduction self-contained while preserving the original proof structure. revision: yes
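
The judge-fidelity metrics promised in the first response could be computed along the following lines. This is a hedged Python sketch, not the authors' evaluation code: the function names, the pass threshold of 5.0, and the assumption that judges return scores on a 0 to 10 scale are all illustrative.

```python
import math


def agreement_rate(scores_a, scores_b, threshold=5.0):
    """Fraction of items on which two judges agree after thresholding their scores."""
    same = sum((a >= threshold) == (b >= threshold) for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)


def pearson(xs, ys):
    """Pearson correlation, e.g. between judge scores and 0/1 ground-truth outcomes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up scores from two judges on four stages, plus ground-truth success flags.
judge_a = [8, 3, 9, 2]
judge_b = [7, 6, 9, 1]
solved = [1, 0, 1, 0]
inter_judge = agreement_rate(judge_a, judge_b)
fidelity = pearson(judge_a, solved)
```

A chance-corrected statistic such as Cohen's kappa would be a natural replacement for the raw agreement rate when the pass rate is far from 50%.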

Circularity Check

0 steps flagged

No significant circularity; theoretical claims rest on external bandit analysis

The paper's central theoretical result maps the dual-pool reweighting mechanism to a stochastic bandit model and proves an O(log T) regret upper bound that matches the known stochastic bandit lower bound up to constants. No equations or definitions in the provided text reduce the claimed regret bound or global reachability property to a fitted parameter or self-citation by construction. The LLM-as-a-judge feedback is presented as an input assumption rather than a derived quantity, and the experimental improvements are reported separately from the proof. The derivation therefore remains self-contained against standard multi-armed bandit theory without the specific reductions required for a positive circularity finding.
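
For readers unfamiliar with the bandit framing, the Python sketch below runs UCB1, a standard algorithm with logarithmic regret, on a two-arm instance whose arms stand for the two pools. UCB1 is swapped in purely for illustration and is not claimed to be the paper's reweighting rule. The function name, the Bernoulli reward simulation, and the mean rewards are assumptions.

```python
import math
import random


def ucb1_two_pools(mean_rewards, horizon, seed=0):
    """Run UCB1 with simulated Bernoulli judge rewards; return pull counts and pseudo-regret."""
    rng = random.Random(seed)
    arms = list(mean_rewards)
    counts = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    best = max(mean_rewards.values())
    regret = 0.0
    for t in range(1, horizon + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]  # play each arm once before using the index
        else:
            # UCB1 index: empirical mean plus an exploration bonus.
            arm = max(
                arms,
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < mean_rewards[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - mean_rewards[arm]  # pseudo-regret against the better pool
    return counts, regret


counts, regret = ucb1_two_pools({"exploit": 0.8, "explore": 0.3}, horizon=2000)
```

In this toy instance the rewards are stationary by construction, which is exactly the condition the referee asks the authors to justify for LLM judge scores.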

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters or invented entities; the regret analysis likely rests on standard stochastic bandit assumptions not detailed here.

axioms (1)

[standard math] Standard assumptions underlying stochastic multi-armed bandit regret bounds apply to the reweighted memory selection process.
The O(log T) cumulative regret claim matching the lower bound invokes typical bandit theory conditions on rewards and exploration.
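
For reference, the quantities behind this axiom can be written out in standard bandit notation. The symbols below are assumptions introduced for illustration and may differ from the paper's own notation; the lower bound is the classical Lai-Robbins result in its usual form for consistent policies.

```latex
% Assumed notation: arm a has mean reward \mu_a, \mu^* = \max_a \mu_a,
% \Delta_a = \mu^* - \mu_a, a_t is the arm played at stage t,
% and N_a(T) counts the plays of arm a up to stage T.
\[
  R_T \;=\; T\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
      \;=\; \sum_{a:\,\Delta_a>0} \Delta_a\,\mathbb{E}\!\left[N_a(T)\right]
\]
% Claimed upper bound: R_T \le C \log T for a problem-dependent constant C.
% Lai--Robbins lower bound (\nu_a is the reward distribution of arm a,
% \nu^* that of an optimal arm):
\[
  \liminf_{T\to\infty} \frac{R_T}{\log T}
  \;\ge\; \sum_{a:\,\Delta_a>0} \frac{\Delta_a}{\mathrm{KL}\!\left(\nu_a \,\|\, \nu^{*}\right)}
\]
```

Both statements are about rewards drawn from fixed distributions, so they apply to judge scores only to the extent that those scores behave like such draws.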

pith-pipeline@v0.9.0 · 5759 in / 1319 out tokens · 46309 ms · 2026-05-22T03:16:23.839338+00:00 · methodology

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 9 internal anchors

[1]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024

work page 2024
[2]

Patil, Kevin Lin, Sarah Wooders, and Joseph E

Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2023

work page 2023
[3]

Mem0: Building production-ready ai agents with scalable long-term memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. InEuropean Conference on Artificial Intelligence, 2025. URL https://api.semanticscholar.org/CorpusID: 278165315

work page 2025
[4]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zi Hen Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, 2023. URL https...

work page 2023
[5]

G- memory: Tracing hierarchical memory for multi-agent systems, 2025

Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems.arXiv preprint arXiv:2506.07398, 2025

work page arXiv 2025
[6]

Reflexion: language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems 36, 2023

Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems 36, 2023. URL https://api.semanticscholar. org/CorpusID:258833055

work page 2023
[7]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi (Jim) Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large lan- guage models.ArXiv, abs/2305.16291, 2023. URL https://api.semanticscholar.org/ CorpusID:258887849

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Scaling agent learning via experience synthesis, 2025

Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, and Dat Huynh. Scaling agent learning via experience synthesis, 2025. URLhttps://arxiv.org/abs/2511.03773

work page arXiv 2025
[9]

Multi-agent memory from a computer architecture perspective: Vi- sions and challenges ahead, 2026

Zhongming Yu, Naicheng Yu, Hejia Zhang, Wentao Ni, Mingrui Yin, Jiaying Yang, Yujie Zhao, and Jishen Zhao. Multi-agent memory from a computer architecture perspective: Vi- sions and challenges ahead, 2026. URL https://api.semanticscholar.org/CorpusID: 286457695

work page 2026
[10]

How we built our multi-agent research system

Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. How we built our multi-agent research system. https://www.anthropic.com/ engineering/multi-agent-research-system , June 2025. Anthropic Engineering Blog. Accessed: 2026-04-27

work page 2025
[11]

Context rot: How increasing input tokens impacts llm performance

Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance. Technical report, Chroma, July 2025. URL https://trychroma. com/research/context-rot. 10

work page 2025
[12]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail?arXiv preprint arXiv:2503.13657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Agentnet: Decentralized evolutionary coordination for llm-based multi-agent sys- tems.ArXiv, abs/2504.00587, 2025

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent sys- tems.ArXiv, abs/2504.00587, 2025. URL https://api.semanticscholar.org/CorpusID: 277468263

work page arXiv 2025
[14]

Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47:235–256, 2002

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47:235–256, 2002. URL https://api.semanticscholar. org/CorpusID:207609497

work page 2002
[15]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

work page 2024
[16]

A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. Dynamic llm-agent net- work: An llm-agent collaboration framework with agent team optimization.arXiv preprint arXiv:2310.02170, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Gemma 4: Byte for byte, the most capable open mod- els

Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open mod- els. https://blog.google/innovation-and-ai/technology/developers-tools/ gemma-4/, April 2026. Google Blog, The Keyword. Accessed: 2026-04-27

work page 2026
[19]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

work page 2023
[20]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory- augmented generation.arXiv preprint arXiv:2510.18866, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

SimpleMem: Efficient Lifelong Memory for LLM Agents

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.ArXiv, abs/2601.02553,

work page internal anchor Pith review arXiv
[23]

URLhttps://api.semanticscholar.org/CorpusID:284512931

work page
[24]

Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem

G. Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Commu- nicative agents for "mind" exploration of large language model society.Advances in Neural Information Processing Systems 36, 2023. URL https://api.semanticscholar.org/ CorpusID:268042527

work page 2023
[25]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InThe Twelfth International Conference on Learning Representations, 2023. 11

work page 2023
[26]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

work page 2024
[27]

Gptswarm: Language agents as optimizable graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. InForty-first International Conference on Machine Learning, 2024

work page 2024
[28]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation.arXiv preprint arXiv:2410.10762, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Mixture-of-minds: Multi-agent reinforcement learning for table understanding.arXiv preprint arXiv:2510.20176,

Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwei Li, et al. Mixture-of-minds: Multi-agent reinforcement learning for table understanding.arXiv preprint arXiv:2510.20176, 2025

work page arXiv 2025
[30]

A survey on the memory mechanism of large language model- based agents.ACM Transactions on Information Systems, 43:1 – 47, 2024

Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model- based agents.ACM Transactions on Information Systems, 43:1 – 47, 2024. URL https: //api.semanticscholar.org/CorpusID:269293320

work page 2024
[31]

Col- laborative memory: Multi-user memory sharing in llm agents with dynamic access con- trol.ArXiv, abs/2505.18279, 2025

Alireza Rezazadeh, Zichao Li, Ange Lou, Yuying Zhao, Wei Wei, and Yujia Bao. Col- laborative memory: Multi-user memory sharing in llm agents with dynamic access con- trol.ArXiv, abs/2505.18279, 2025. URL https://api.semanticscholar.org/CorpusID: 278904585

work page arXiv 2025
[32]

Aime 2025 dataset

math ai. Aime 2025 dataset. https://huggingface.co/datasets/math-ai/aime25, 2025

work page 2025
[33]

Aime 2024 dataset

Maxwell-Jia. Aime 2024 dataset. https://huggingface.co/datasets/Maxwell-Jia/ AIME_2024, 2024

work page 2024
[34]

Mbppplus

EvalPlus. Mbppplus. https://huggingface.co/datasets/evalplus/mbppplus, 2024. Hugging Face dataset. Accessed: 2026-04-27

work page 2024
[35]

Challenging big-bench tasks and whether chain-of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InFindings of the Association for Compu- tational Linguistics: ACL 2023, pages 13003–13051, 2023

work page 2023
[36]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[37]

Ollama.https://github.com/ollama/ollama

work page
[38]

Hugging face transformers.https://github.com/huggingface/transformers

work page
[39]

You are a smart agent designed to solve problems

Vivek S. Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48 ofTexts and Readings in Mathematics. Springer Singapore, 2 edition, 2024. ISBN 978- 981-99-8277-6. doi: 10.1007/978-981-99-8277-6. URL https://doi.org/10.1007/ 978-981-99-8277-6. 12 A Additional Theoretical Analysis In this appendix, we provide a rigorous theoretical founda...

work page doi:10.1007/978-981-99-8277-6 2024
[40]

Normalize retrieved records into a common schema

work page
[41]

Deduplicate near-identical memories

work page
[42]

Convert failed trajectories into negative constraints

work page
[43]

Re-rank memories by similarity, success weight, recency, and stage relevance

work page
[44]

not M(x) or not E(x)

Compress the selected records into a solver-specific memory packet. Memory packet injected to Solver Agent 0: Positive guidance: Formalize the premises with predicates. Try to construct a countermodel. If one model makes all premises true and the conclusion false, answer invalid. Negative guidance: Do not treat "not M(x) or not E(x)" as "not B(x)". Do not...

work page
[45]

HINT: This problem shows signs of complexity that would benefit from decomposition

Exploration Pool Prompt When the exploration memory pool is selected, no historical memory fragment is reused. Instead, the agent enters a fresh exploration mode and solves the task through standard workflow prompts, including role definition, approach decision, optional problem decomposition, direct problem solving, and solution integration. 1.1 Role-Def...

work page
[46]

Each sub-problem should be focused on a specific aspect or step

work page
[47]

Sub-problems should be solvable with different expertise levels

work page
[48]

Each must contribute to solving the original problem

work page
[49]

id": Sequential ID, such as 1, 2, 3. -

Ensure the sub-problems are complementary and cover different angles. For each sub-problem, provide: - "id": Sequential ID, such as 1, 2, 3. - "description": Clear, specific description of the sub-problem. - "focus": Main focus area, e.g., "Analysis", "Design", "Verification". - "dependencies": Dependencies on other sub-problems, or an empty list. Respond...

work page
[50]

Exploitation-Pool

Exploitation-Pool Prompt with Similarity MatchingHistorical memory reuse When the Exploitation-Pool is selected, the framework does not immediately inject historical memory. Instead, it first applies a similarity-matching mechanism. The current task description is used as the retrieval query, encoded into an embedding, and compared against stored memory f...

work page
[51]

It considers both the integrated solutions and the raw direct LLM answers, and returns structured feedback for subsequent memory updates

Evaluation Prompt: Stage-level quality scoring

The evaluation framework scores the quality of execution at the stage level. It considers both the integrated solutions and the raw direct LLM answers, and returns structured feedback for subsequent memory updates.

3.1 Evaluation Prompt: Evaluator instruction

You are an expert evaluator. Evaluate the overall qual...

- Problem Understanding: Did the agent properly understand the problem?
- Decomposition Quality: If decomposed, is the breakdown logical and complete?
- Solution Clarity: Are the solutions clear and well-structured?
- LLM Direct Answer Quality: Is the LLM's direct response accurate and helpful?
- Foundation: Did this stage provide a good foundation for the next stages?

3.3 Stage-Specific Criteria for t_2: Intermediate stage

- Processing Quality: How well were intermediate tasks solved?
- Building on Previous: Did agents effectively use guidance from stage t_1?
- Task Allocation: Were tasks appropriately allocated to capable agents?
- Coherence: Do the solutions form a coherent middle layer?
- LLM Answer Consistency: Do the LLM direct answers align with the integrated solutions?

3.4 Stage-Specific Criteria for t_3: Final stage

- Refinement Quality: How well were solutions refined?
- Integration: How well do the final solutions integrate all previous work?
- Completeness: Is the final solution complete and comprehensive?
- Excellence: Does the final work meet high quality standards?
- LLM Answer Quality: Are the LLM direct answers comprehensive and accurate?

3.5 Expected Evaluator Output: Structured feedback

{
  "score": <0-10>,
  "stage_quality": "<poor/fair/good/excellent>",
  "reasoning": "<detailed explanation>",
  "solution_quality": "<assessment of the integrated solutions>",
  "llm_answer_quality": "<assessment of the LLM direct answers>",
  ...
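To illustrate how the structured feedback from the Expected Evaluator Output could be consumed before a memory update, here is a minimal Python sketch. The names `StageFeedback` and `parse_stage_feedback` are hypothetical and not from the paper. It assumes the evaluator reply contains one JSON object with the five fields listed in the prompt, and it ignores any further fields, since the original schema is truncated in this excerpt.

```python
import json
from dataclasses import dataclass

# Labels taken from the "<poor/fair/good/excellent>" placeholder in the prompt.
ALLOWED_QUALITY = {"poor", "fair", "good", "excellent"}


@dataclass
class StageFeedback:
    """Hypothetical container for one stage-level evaluation."""
    score: float
    stage_quality: str
    reasoning: str
    solution_quality: str
    llm_answer_quality: str


def parse_stage_feedback(raw: str) -> StageFeedback:
    """Extract and validate the JSON object in an evaluator reply.

    Raises ValueError if the reply has no JSON object, lacks a required
    field, or carries a score outside the 0-10 range.
    """
    # Assumption: the reply may wrap the JSON in extra prose, so we take
    # the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in evaluator reply")
    data = json.loads(raw[start:end + 1])

    if "score" not in data or "stage_quality" not in data:
        raise ValueError("missing 'score' or 'stage_quality'")

    score = float(data["score"])
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score {score} is outside the 0-10 range")

    quality = str(data["stage_quality"]).strip().lower()
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"unexpected stage_quality: {quality!r}")

    return StageFeedback(
        score=score,
        stage_quality=quality,
        reasoning=str(data.get("reasoning", "")),
        solution_quality=str(data.get("solution_quality", "")),
        llm_answer_quality=str(data.get("llm_answer_quality", "")),
    )
```

Rejecting malformed replies early, rather than defaulting the score, is a design choice of this sketch: it keeps a badly formatted judge reply from silently influencing later memory updates.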

[1] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024.

[2] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2023.

[3] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. In European Conference on Artificial Intelligence, 2025. URL https://api.semanticscholar.org/CorpusID:278165315

[4] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zi Hen Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2023. URL https...

[5] Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems. arXiv preprint arXiv:2506.07398, 2025.

[6] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, 2023. URL https://api.semanticscholar.org/CorpusID:258833055

[7] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi (Jim) Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. ArXiv, abs/2305.16291, 2023. URL https://api.semanticscholar.org/CorpusID:258887849

[8] Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, and Dat Huynh. Scaling agent learning via experience synthesis, 2025. URL https://arxiv.org/abs/2511.03773

[9] Zhongming Yu, Naicheng Yu, Hejia Zhang, Wentao Ni, Mingrui Yin, Jiaying Yang, Yujie Zhao, and Jishen Zhao. Multi-agent memory from a computer architecture perspective: Visions and challenges ahead, 2026. URL https://api.semanticscholar.org/CorpusID:286457695

[10] Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system, June 2025. Anthropic Engineering Blog. Accessed: 2026-04-27.

[11] Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance. Technical report, Chroma, July 2025. URL https://trychroma.com/research/context-rot

[12] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657, 2025.

[13] Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems. ArXiv, abs/2504.00587, 2025. URL https://api.semanticscholar.org/CorpusID:277468263

[14] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002. URL https://api.semanticscholar.org/CorpusID:207609497

[15] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024.

[16] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170, 2023.

[17] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report, 2025.

[18] Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open models. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/, April 2026. Google Blog, The Keyword. Accessed: 2026-04-27.

[19] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023.

[20] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.

[21] Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866, 2025.

[22] Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents. ArXiv, abs/2601.02553. URL https://api.semanticscholar.org/CorpusID:284512931

[24] G. Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems 36, 2023. URL https://api.semanticscholar.org/CorpusID:268042527

[25] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023.

[26] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024.

[27] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024.

[28] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024.

[29] Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwei Li, et al. Mixture-of-minds: Multi-agent reinforcement learning for table understanding. arXiv preprint arXiv:2510.20176, 2025.

[30] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43:1–47, 2024. URL https://api.semanticscholar.org/CorpusID:269293320

[31] Alireza Rezazadeh, Zichao Li, Ange Lou, Yuying Zhao, Wei Wei, and Yujia Bao. Collaborative memory: Multi-user memory sharing in llm agents with dynamic access control. ArXiv, abs/2505.18279, 2025. URL https://api.semanticscholar.org/CorpusID:278904585

[32] math-ai. Aime 2025 dataset. https://huggingface.co/datasets/math-ai/aime25, 2025.

[33] Maxwell-Jia. Aime 2024 dataset. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, 2024.

[34] EvalPlus. Mbppplus. https://huggingface.co/datasets/evalplus/mbppplus, 2024. Hugging Face dataset. Accessed: 2026-04-27.

[35] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023.

[36] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.

[37] Ollama. https://github.com/ollama/ollama

[38] Hugging face transformers. https://github.com/huggingface/transformers

[39] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48 of Texts and Readings in Mathematics. Springer Singapore, 2nd edition, 2024. ISBN 978-981-99-8277-6. doi: 10.1007/978-981-99-8277-6. URL https://doi.org/10.1007/978-981-99-8277-6.

A Additional Theoretical Analysis

In this appendix, we provide a rigorous theoretical founda...

- Normalize retrieved records into a common schema.
- Deduplicate near-identical memories.
- Convert failed trajectories into negative constraints.
- Re-rank memories by similarity, success weight, recency, and stage relevance.
- Compress the selected records into a solver-specific memory packet.

Memory packet injected to Solver Agent 0:

Positive guidance: Formalize the premises with predicates. Try to construct a countermodel. If one model makes all premises true and the conclusion false, answer invalid.

Negative guidance: Do not treat "not M(x) or not E(x)" as "not B(x)". Do not...
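The post-retrieval steps listed above (deduplicate, turn failures into negative constraints, re-rank, compress into a packet) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `MemoryRecord` fields, the `WEIGHTS` values, the `Avoid:` prefix, and the use of exact match on normalized text as a stand-in for near-duplicate detection are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical common schema for a retrieved memory (all scores in [0, 1])."""
    text: str
    similarity: float
    success_weight: float
    recency: float
    stage_relevance: float
    succeeded: bool


# Assumed weights for the four ranking signals named in the text.
WEIGHTS = {"similarity": 0.4, "success_weight": 0.3,
           "recency": 0.2, "stage_relevance": 0.1}


def rank_score(r: MemoryRecord) -> float:
    """Weighted sum of the four ranking signals."""
    return (WEIGHTS["similarity"] * r.similarity
            + WEIGHTS["success_weight"] * r.success_weight
            + WEIGHTS["recency"] * r.recency
            + WEIGHTS["stage_relevance"] * r.stage_relevance)


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Keep the best-scoring record per normalized text (a crude stand-in
    for near-duplicate detection)."""
    best: dict[str, MemoryRecord] = {}
    for r in records:
        key = normalize_text(r.text)
        if key not in best or rank_score(r) > rank_score(best[key]):
            best[key] = r
    return list(best.values())


def build_packet(records: list[MemoryRecord], top_k: int = 3) -> dict[str, list[str]]:
    """Deduplicate, re-rank, keep the top_k records, and split them into
    positive guidance (successes) and negative guidance (failures)."""
    ranked = sorted(deduplicate(records), key=rank_score, reverse=True)[:top_k]
    return {
        "positive_guidance": [r.text for r in ranked if r.succeeded],
        "negative_guidance": ["Avoid: " + r.text for r in ranked if not r.succeeded],
    }
```

A linear weighted sum is only one way to combine the four signals; it is used here because it keeps the ranking easy to inspect.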

Exploration Pool Prompt

When the exploration memory pool is selected, no historical memory fragment is reused. Instead, the agent enters a fresh exploration mode and solves the task through standard workflow prompts, including role definition, approach decision, optional problem decomposition, direct problem solving, and solution integration.

1.1 Role-Def...

- Each sub-problem should be focused on a specific aspect or step.
- Sub-problems should be solvable with different expertise levels.
- Each must contribute to solving the original problem.
- Ensure the sub-problems are complementary and cover different angles.

For each sub-problem, provide:

- "id": Sequential ID, such as 1, 2, 3.
- "description": Clear, specific description of the sub-problem.
- "focus": Main focus area, e.g., "Analysis", "Design", "Verification".
- "dependencies": Dependencies on other sub-problems, or an empty list.

Respond...
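As an illustration of how a decomposition reply with the four fields above ("id", "description", "focus", "dependencies") might be checked and scheduled, here is a small Python sketch. The function name `order_subproblems` is hypothetical, and because the response-format instruction is truncated in this excerpt, the sketch assumes the reply is a JSON array of sub-problem objects.

```python
import json
from graphlib import CycleError, TopologicalSorter

REQUIRED_FIELDS = ("id", "description", "focus", "dependencies")


def order_subproblems(raw_json: str) -> list:
    """Validate a decomposition reply and return sub-problem ids in an
    order where every sub-problem comes after its dependencies.

    Raises ValueError on missing fields, unknown dependency ids, or
    cyclic dependencies.
    """
    items = json.loads(raw_json)

    by_id = {}
    for item in items:
        missing = [k for k in REQUIRED_FIELDS if k not in item]
        if missing:
            raise ValueError(f"sub-problem is missing fields: {missing}")
        by_id[item["id"]] = item

    # Map each sub-problem id to the set of ids it depends on.
    graph = {}
    for sub_id, item in by_id.items():
        unknown = [d for d in item["dependencies"] if d not in by_id]
        if unknown:
            raise ValueError(f"sub-problem {sub_id} depends on unknown ids: {unknown}")
        graph[sub_id] = set(item["dependencies"])

    try:
        # static_order() yields predecessors (dependencies) first.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("sub-problem dependencies form a cycle") from exc
```

Ordering by dependencies is an assumption about how the "dependencies" field is used downstream; the excerpt only states that the field is requested.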

Exploitation-Pool Prompt with Similarity Matching: Historical memory reuse

When the Exploitation-Pool is selected, the framework does not immediately inject historical memory. Instead, it first applies a similarity-matching mechanism. The current task description is used as the retrieval query, encoded into an embedding, and compared against stored memory f...
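The similarity-matching step described above can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, cosine similarity is one common comparison choice, and the `threshold` and `top_k` values are hypothetical. What happens when nothing passes the threshold is not visible in this excerpt, so the sketch simply returns an empty list.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: token counts. A real system would use a
    learned sentence-embedding model instead."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, fragments: list[str],
             threshold: float = 0.5, top_k: int = 2) -> list[tuple[float, str]]:
    """Return up to top_k (similarity, fragment) pairs whose similarity to
    the query meets the threshold, best match first."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(f)), f) for f in fragments]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:top_k]
```

Gating on a threshold before injection reflects the text's point that historical memory is not injected immediately; only fragments judged close enough to the current task are candidates for reuse.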
