pith. machine review for the scientific record.

arxiv: 2604.03295 · v1 · submitted 2026-03-27 · 💻 cs.MA · cs.AI

Recognition: 2 theorem links · Lean Theorem

Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:38 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords multi-agent systems · lifelong learning · memory framework · LLM agents · scaling behavior · long-horizon tasks · non-monotonic scaling · experience reuse

The pith

Memory design lets smaller LLM agent teams outperform larger ones on long-horizon tasks while lowering costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLM multi-agent systems scale more effectively through accumulated experience over time than through simply adding agents. It introduces LLMA-Mem, a framework with flexible memory topologies that enables agents to reuse past experience in ongoing work. Tests across coding, research, and database settings show consistent gains in long-term results together with lower overall cost. The work identifies a non-monotonic pattern in which larger teams do not reliably deliver better performance and smaller teams can surpass them when memory supports experience reuse. This positions memory architecture as a practical alternative to team expansion for efficient scaling.

Core claim

LLMA-Mem supplies a lifelong memory framework for LLM multi-agent systems that operates under flexible memory topologies. When evaluated on MultiAgentBench, the approach improves long-horizon performance relative to baselines and simultaneously reduces cost. The analysis further shows a non-monotonic scaling landscape in which larger teams do not always produce superior long-term results and smaller teams can exceed larger ones when memory better enables reuse of experience.

What carries the argument

LLMA-Mem, a lifelong memory framework with flexible memory topologies that stores and retrieves experience for reuse across agent interactions over extended time horizons.
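The page gives no implementation details, but the store-and-retrieve loop such a framework implies can be sketched in a few lines. Everything below (the class name, token-overlap scoring, and the `k` parameter) is a hypothetical illustration of experience reuse, not the authors' code; a real system would presumably use embedding-based retrieval rather than word overlap.

```python
from collections import Counter

class ExperienceMemory:
    """Hypothetical sketch of a lifelong experience store for LLM agents.

    Entries pair a task description with the outcome an agent observed,
    so later episodes can retrieve similar past experience instead of
    recomputing it from scratch.
    """

    def __init__(self):
        self.entries = []  # list of (task_text, outcome) tuples

    def store(self, task_text, outcome):
        self.entries.append((task_text, outcome))

    def retrieve(self, query, k=3):
        # Crude token-overlap similarity stands in for whatever
        # retrieval mechanism a real framework would use.
        q = Counter(query.lower().split())
        scored = []
        for task_text, outcome in self.entries:
            t = Counter(task_text.lower().split())
            overlap = sum((q & t).values())  # multiset intersection size
            scored.append((overlap, task_text, outcome))
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(task, outcome) for _, task, outcome in scored[:k]]

memory = ExperienceMemory()
memory.store("write sql query joining users and orders", "used LEFT JOIN on user_id")
memory.store("summarize research paper on scaling laws", "extracted abstract claims first")

# A new, similar task pulls back the closest prior experience.
hits = memory.retrieve("write sql query for orders table", k=1)
```

The flexible-topology aspect would then govern which agents share which store (per-agent, per-team, or global), which this sketch does not model.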

If this is right

  • Effective memory reuse allows long-term task success without proportional increases in team size.
  • System costs decline because repeated computation is replaced by retrieval of stored experience.
  • Scaling strategies shift from adding agents to refining memory topologies for better experience retention.
  • Long-horizon reliability improves when memory supports continuity across successive interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of agent systems may achieve higher efficiency by allocating resources to memory mechanisms instead of team growth.
  • The non-monotonic pattern suggests testing memory topologies at intermediate team sizes to locate optimal operating points.
  • Similar memory approaches could apply to other sequential decision settings where agents must build on prior outcomes without external supervision.

Load-bearing premise

The observed performance gains and non-monotonic scaling behavior are caused by the memory design itself rather than by benchmark details or other unmeasured factors in the agent setup.

What would settle it

Running the same tasks with identical agent numbers and interaction rules but with memory components removed or replaced by simple logging, and finding that performance differences disappear or that larger teams consistently win.
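That control can be sketched as a small harness: hold the task list, team size, and interaction rules fixed, and swap only the memory component for a logging-only stand-in that records experience but never feeds it back. The runner, the stand-in class, and the toy solver below are all illustrative assumptions, not the paper's protocol.

```python
class SimpleMemory:
    """Toy memory condition: stores experience and returns it on retrieval."""

    def __init__(self):
        self.entries = []

    def store(self, task, outcome):
        self.entries.append((task, outcome))

    def retrieve(self, query, k=3):
        return self.entries[-k:]  # toy policy: most recent experience

class LoggingOnlyMemory(SimpleMemory):
    """Control condition: records experience but never feeds it back,
    so any gap versus SimpleMemory isolates the effect of retrieval."""

    def retrieve(self, query, k=3):
        return []

def run_condition(memory, tasks, solve):
    # Same tasks, same (implicit) team and interaction rules;
    # only the memory component passed in varies.
    scores = []
    for task in tasks:
        prior = memory.retrieve(task)
        score = solve(task, prior)
        memory.store(task, score)
        scores.append(score)
    return sum(scores) / len(scores)

def toy_solve(task, prior):
    # Stand-in for a fixed agent team; reuse gives a small boost.
    return 0.5 + (0.2 if prior else 0.0)

tasks = ["fix bug", "fix bug again", "fix bug once more"]
with_memory = run_condition(SimpleMemory(), tasks, toy_solve)
without_memory = run_condition(LoggingOnlyMemory(), tasks, toy_solve)
```

If the two conditions scored the same on the real benchmark, the memory attribution would fail; if the gap persisted across team sizes, the paper's claim would be on much firmer ground.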

read the original abstract

Large language model (LLM) multi-agent systems can scale along two distinct dimensions: by increasing the number of agents and by improving through accumulated experience over time. Although prior work has studied these dimensions separately, their interaction under realistic cost constraints remains unclear. In this paper, we introduce a conceptual scaling view of multi-agent systems that jointly considers team size and lifelong learning ability, and we study how memory design shapes this landscape. To this end, we propose LLMA-Mem, a lifelong memory framework for LLM multi-agent systems under flexible memory topologies. We evaluate LLMA-Mem on MultiAgentBench across coding, research, and database environments. Empirically, LLMA-Mem consistently improves long-horizon performance over baselines while reducing cost. Our analysis further reveals a non-monotonic scaling landscape: larger teams do not always produce better long-term performance, and smaller teams can outperform larger ones when memory better supports the reuse of experience. These findings position memory design as a practical path for scaling multi-agent systems more effectively and more efficiently over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LLMA-Mem, a lifelong memory framework for LLM multi-agent systems supporting flexible topologies, and evaluates it on MultiAgentBench across coding, research, and database tasks. It claims that LLMA-Mem yields consistent long-horizon performance gains and cost reductions over baselines, while revealing a non-monotonic scaling landscape in which smaller teams can outperform larger ones when memory enables effective experience reuse.

Significance. If the empirical claims hold after proper controls, the work would usefully demonstrate that memory design offers a practical alternative to pure team-size scaling for long-term multi-agent performance under cost constraints, with implications for efficient lifelong learning in LLM systems.

major comments (2)
  1. [Section 5] Section 5 (Evaluation and Scaling Analysis): the non-monotonic scaling result (smaller teams outperforming larger ones) is presented without an ablation that fixes interaction topology, message routing, and coordination mechanics while varying only memory capacity or topology; without this isolation, the crossover cannot be confidently attributed to memory effectiveness rather than unaccounted scaling of agent coordination costs.
  2. [Section 4] Section 4 (Experimental Setup): the reported performance improvements lack any description of baselines, number of runs, statistical significance tests, or controls for LLM sampling variance, rendering the central claim of consistent gains unverifiable from the presented evidence.
minor comments (2)
  1. [§3.1] Clarify the precise update rules and sharing protocol for the flexible memory topologies introduced in §3.1 so that the framework can be reproduced.
  2. [Figure 4] Add axis labels, error bars, and legend details to the scaling plots in Figure 4 to improve readability of the non-monotonic trends.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our experimental design and reporting. We address each major comment below and will revise the manuscript to strengthen the presentation of results.

read point-by-point responses
  1. Referee: [Section 5] Section 5 (Evaluation and Scaling Analysis): the non-monotonic scaling result (smaller teams outperforming larger ones) is presented without an ablation that fixes interaction topology, message routing, and coordination mechanics while varying only memory capacity or topology; without this isolation, the crossover cannot be confidently attributed to memory effectiveness rather than unaccounted scaling of agent coordination costs.

    Authors: We agree that isolating the contribution of memory from coordination scaling effects is important for confidently attributing the non-monotonic result. Our current experiments vary team size while holding the LLMA-Mem framework (including its flexible but rule-based topology and routing) fixed across scales. However, we acknowledge that an explicit ablation fixing topology, routing, and coordination mechanics while varying only memory capacity would provide stronger evidence. We will add this controlled ablation to the revised Section 5. revision: yes

  2. Referee: [Section 4] Section 4 (Experimental Setup): the reported performance improvements lack any description of baselines, number of runs, statistical significance tests, or controls for LLM sampling variance, rendering the central claim of consistent gains unverifiable from the presented evidence.

    Authors: We will revise Section 4 to include a complete description of the experimental setup. This will explicitly define the baselines as standard multi-agent systems without the lifelong memory component, report the number of runs performed, describe the statistical significance tests used, and detail the controls for LLM sampling variance (including fixed sampling parameters). These additions will make the performance claims fully verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmark

full rationale

The paper proposes LLMA-Mem as a new memory framework for LLM multi-agent systems and reports performance gains plus non-monotonic scaling from direct experiments on the external MultiAgentBench benchmark. No equations, derivations, or predictions are shown that reduce to fitted inputs or self-definitions by construction. Central claims rest on comparative results against baselines rather than self-citation chains or ansatzes smuggled from prior author work. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces LLMA-Mem as a new framework but supplies no explicit free parameters, axioms, or invented entities. The central claims rest on the empirical outcomes of this framework on MultiAgentBench.

pith-pipeline@v0.9.0 · 5506 in / 1109 out tokens · 61873 ms · 2026-05-14T22:38:53.569343+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 8 internal anchors

  1. [1]

L. Chen, J. Q. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou. Are more LLM calls all you need? Towards scaling laws of compound inference systems. arXiv preprint arXiv:2403.02419, 2024a. W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian, et al. Agentverse: Facilitating multi-agent collaboration and expl...

  2. [2]

D. Han, C. Couturier, D. M. Diaz, X. Zhang, V. Rühle, and S. Rajmohan. LegoMem: Modular procedural memory for multi-agent LLM systems for workflow automation. arXiv preprint arXiv:2510.04851.

  3. [3]

Z. Ke, Y. Ming, A. Xu, R. Chin, X.-P. Nguyen, P. Jwalapuram, S. Yavuz, C. Xiong, and S. Joty. MAS-Orchestra: Understanding and improving multi-agent reasoning through holistic orchestration and controlled benchmarks. arXiv preprint arXiv:2601.14652.

  4. [4]

Y. Kim, K. Gu, C. Park, C. Park, S. Schmidgall, A. A. Heydari, Y. Yan, Z. Zhang, Y. Zhuang, M. Malhotra, et al. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296.

  5. [5]

J. Li, Q. Zhang, Y. Yu, Q. Fu, and D. Ye. More agents is all you need. arXiv preprint arXiv:2402.05120.

  6. [6]

A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

  7. [7]

J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao. SimpleMem: Efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553.

  8. [8]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140.

  9. [9]

C. Qian, Z. Xie, Y. Wang, W. Liu, K. Zhu, H. Xia, Y. Dang, Z. Du, W. Chen, C. Yang, et al. Scaling large language model-based multi-agent collaboration. arXiv preprint arXiv:2406.07155.

  10. [10]

    Agent Workflow Memory

Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429.

  11. [11]

T. Wei, N. Sachdeva, B. Coleman, Z. He, Y. Bei, X. Ning, M. Ai, Y. Li, J. He, E. H. Chi, et al. Evo-Memory: Benchmarking LLM agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857.

  12. [12]

J. Xu, A. Koesdwiady, S. Bei, Y. Han, B. Huang, D. Wang, Y. Chen, Z. Wang, P. Wang, P. Li, et al. Rethinking the value of multi-agent workflow: A strong single agent baseline. arXiv preprint arXiv:2601.12307.

  13. [13]

W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang. A-Mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.

  14. [14]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  15. [15]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang. MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474.

  16. [16]

LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners

J. Zheng, X. Cai, Q. Li, D. Zhang, Z. Li, Y. Zhang, L. Song, and Q. Ma. LifelongAgentBench: Evaluating LLM agents as lifelong learners. arXiv preprint arXiv:2505.11942, 2025a. J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma. Lifelong learning of large language model based agents: A roadmap. arXiv preprint arXiv:2501.07278, 2025b. K. Zhu, ...