M$^\star$: Every Task Deserves Its Own Memory Harness

Mirror Xu; Shiwei Zhang; Shujie Liu; Wanlu Shi; Wenbo Pan; Xiangyang Zhou; Xiaohua Jia

arXiv: 2604.11811 · v2 · pith:XLZTRT52new · submitted 2026-04-10 · 💻 cs.PL · cs.AI · cs.CL · cs.LG


Author order on the paper: Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia

Pith reviewed 2026-05-10 17:34 UTC · model grok-4.3


keywords: memory systems · LLM agents · program evolution · task optimization · reflective search · agent architecture · domain specialization · code generation


The pith

LLM agents perform better with task-specific memory programs evolved as Python code than with any fixed shared design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that memory systems for large language model agents work best when each task receives its own specialized harness rather than a single fixed architecture. It represents memory as an executable Python program that combines a data schema, storage and retrieval logic, and workflow instructions. These programs are discovered automatically by a reflective, population-based evolutionary search that generates candidates and refines them after examining failures on the target task. Across four benchmarks covering conversation, embodied planning, and expert reasoning, the evolved programs deliver higher performance than standard fixed-memory baselines while developing visibly different internal structures for each domain.

Core claim

M* models an agent memory system as a Python program that jointly defines data Schema, storage Logic, and agent workflow Instructions. It optimizes these components together through reflective code evolution that maintains a population of candidate programs and iteratively improves them by analyzing evaluation failures. On four distinct benchmarks the resulting programs outperform fixed-memory baselines and exhibit structurally distinct processing mechanisms tailored to each domain.
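The search loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Candidate`, `evolve`, `evaluate`, `reflect_and_mutate`, and `is_valid` are hypothetical, the score-proportional parent sampling is an assumption, and the LLM reflection step is replaced by a toy string edit so the example runs on its own.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    source: str                                   # program text, treated as opaque here
    score: float = 0.0                            # validation score, assumed to lie in [0, 1]
    failures: List[str] = field(default_factory=list)


def evolve(
    seed: str,
    evaluate: Callable[[str], Tuple[float, List[str]]],
    reflect_and_mutate: Callable[[str, List[str]], str],
    is_valid: Callable[[str], bool],
    iterations: int = 10,
    pool_size: int = 4,
    rng: Optional[random.Random] = None,
) -> Candidate:
    """Keep a small pool of programs and refine them from their failures."""
    rng = rng or random.Random(0)
    score, failures = evaluate(seed)
    pool = [Candidate(seed, score, failures)]
    for _ in range(iterations):
        # Assumption: parents are sampled with probability proportional to score.
        weights = [c.score + 1e-6 for c in pool]
        parent = rng.choices(pool, weights=weights, k=1)[0]
        child_source = reflect_and_mutate(parent.source, parent.failures)
        if not is_valid(child_source):            # quality gate before evaluation
            continue
        score, failures = evaluate(child_source)
        pool.append(Candidate(child_source, score, failures))
        pool.sort(key=lambda c: c.score, reverse=True)
        del pool[pool_size:]                      # keep only the best few
    return pool[0]


# Toy stand-ins so the loop can run without an LLM or a benchmark.
TARGETS = ["index", "dedupe", "rank"]


def toy_evaluate(source: str) -> Tuple[float, List[str]]:
    missing = [t for t in TARGETS if t not in source]
    return 1 - len(missing) / len(TARGETS), missing


def toy_mutate(source: str, failures: List[str]) -> str:
    # "Reflection" here just patches the first reported failure.
    return source + f"\n{failures[0]} = True" if failures else source


def toy_is_valid(source: str) -> bool:
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


best = evolve("pass", toy_evaluate, toy_mutate, toy_is_valid)
print(best.score, best.failures)
```

In a real setting `evaluate` would run the agent on task episodes and `reflect_and_mutate` would prompt an LLM with the failing cases; both are left abstract here because the source does not specify their interfaces.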

What carries the argument

The memory program, a single Python script that encodes Schema for data organization, Logic for storage and retrieval operations, and Instructions for how the agent interacts with the stored information, discovered jointly through reflective population-based evolution.
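To make the three components concrete, here is a minimal sketch of what such a memory program might look like. Everything in it is an illustrative assumption: the `MemoryItem` fields, the `write`/`read` method names, the word-overlap retrieval, and the instruction text are hypothetical and are not taken from the paper's evolved programs.

```python
from dataclasses import dataclass
from typing import List


# Schema: how one remembered item is organized (fields are hypothetical).
@dataclass
class MemoryItem:
    topic: str
    content: str


# Instructions: text handed to the agent describing how to use the memory.
INSTRUCTIONS = (
    "Before answering, call read(query) and ground the answer in the "
    "returned items. After each episode, call write(topic, content)."
)


# Logic: storage and retrieval operations over the schema.
class MemoryProgram:
    def __init__(self) -> None:
        self.items: List[MemoryItem] = []

    def write(self, topic: str, content: str) -> None:
        self.items.append(MemoryItem(topic, content))

    def read(self, query: str, k: int = 2) -> List[MemoryItem]:
        query_words = set(query.lower().split())

        def overlap(item: MemoryItem) -> int:
            words = set(f"{item.topic} {item.content}".lower().split())
            return len(query_words & words)

        # Rank by shared words and drop items with no overlap at all.
        ranked = sorted(self.items, key=overlap, reverse=True)
        return [item for item in ranked[:k] if overlap(item) > 0]


memory = MemoryProgram()
memory.write("kitchen", "the mug is in the top drawer")
memory.write("garage", "the ladder hangs on the wall")
hits = memory.read("where is the mug", k=1)
print(hits[0].topic)
```

Because all three parts live in one file, a single code mutation can change the schema, the retrieval rule, and the agent-facing instructions together, which is the joint optimization the paper argues for.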

If this is right

Performance improves over fixed-memory baselines on conversation, embodied planning, and expert reasoning tasks.
Evolved programs develop structurally distinct mechanisms for each domain instead of converging to one general form.
Joint optimization of schema, logic, and instructions explores a broader design space than hand-crafted fixed systems.
Specialized per-task memory yields better results than general-purpose memory paradigms across the evaluated settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same evolution method could be applied to other reusable agent components such as planning modules or tool interfaces.
The domain-specific structures suggest that attempts to find one universal memory architecture for all agent tasks may be fundamentally limited.
Testing whether evolved programs transfer to related but unseen tasks would clarify the scope of the discovered specializations.
Extending the approach to longer or more open-ended interactions might expose scalability limits of the current failure-analysis loop.

Load-bearing premise

Reflective population-based code evolution applied to the chosen benchmarks will reliably produce memory programs that are both superior in performance and generalizable beyond the specific tasks and failure modes tested.

What would settle it

A new benchmark task on which no evolved memory program outperforms a carefully tuned fixed-memory baseline, or on which evolved programs from different domains converge to identical internal structures.
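Checking the second condition requires some measure of structural similarity between programs. The sketch below is one crude option, assuming a comparison of AST node-type counts; it is a stand-in for illustration and not the code-embedding analysis the paper reports in Figure 4. The function names and the two sample programs are hypothetical.

```python
import ast
from collections import Counter


def structure_profile(source: str) -> Counter:
    """Count AST node types, ignoring identifier names and literal values."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def structural_similarity(a: str, b: str) -> float:
    """Multiset Jaccard similarity between two node-type profiles (1.0 = same)."""
    profile_a, profile_b = structure_profile(a), structure_profile(b)
    intersection = sum((profile_a & profile_b).values())
    union = sum((profile_a | profile_b).values())
    return intersection / union if union else 1.0


# Two made-up retrieval snippets with different shapes.
flat_scan = "def read(q):\n    return [x for x in items if q in x]\n"
keyed_store = (
    "class Store:\n"
    "    def read(self, q):\n"
    "        return self.index.get(q, [])\n"
)

print(structural_similarity(flat_scan, flat_scan))
print(structural_similarity(flat_scan, keyed_store))
```

A score near 1.0 across programs evolved for different domains would count against the specialization claim. Since this measure ignores names, two programs that differ only in variable naming score as identical.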

Figures

Figures reproduced from arXiv: 2604.11811 by Mirror Xu, Shiwei Zhang, Shujie Liu, Wanlu Shi, Wenbo Pan, Xiangyang Zhou, Xiaohua Jia.

**Figure 1.** Evolved memory harnesses across tasks. Starting from shared seeds (center), M$^\star$ evolves structurally distinct harnesses for each task. Each node is an evolved program.

**Figure 2.** System overview of M$^\star$. Starting from a seed memory program (0), the system maintains a population pool (1) and iteratively improves programs through evaluation on task episodes (2), LLM-guided reflection and code mutation (3), and compile/runtime quality checks (4). The best-scoring program is evaluated on a held-out test set (5).

**Figure 3.** Evolution trajectory. Validation score across iterations for all benchmarks. Most benchmarks follow a common phased pattern: early iterations correct structural errors in seed programs, middle iterations produce the largest gains by discovering task-relevant indexing strategies, and later iterations refine retrieval precision with diminishing returns.

**Figure 4.** Program embedding landscape. Each evolved program is embedded with a code embedding model and projected to 2D via t-SNE. (a, b) Population-based search (M$^\star$) explores structurally diverse regions of the program space, while linear search concentrates in a narrow neighborhood; colored edges trace parent-child lineage. (c) All programs across five benchmarks, colored by dataset. Marker shapes denote the …

**Figure 5.** Cross-task transfer of evolved memory harnesses. Each panel evaluates memory harnesses evolved on different source benchmarks against a single target benchmark. The dashed line marks the universal seed baseline. Programs evolved on their native task (highlighted) consistently outperform those transferred from other tasks, confirming that memory structure must be co-optimized with the target task. Joint evo…

Original abstract

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

M* evolves memory as Python code via failure-driven search and reports gains plus domain-specific structures, but the evaluation risks overfitting and lacks the details needed to confirm the gains are real.


M* models agent memory as an executable Python program that packs together data schema, storage logic, and workflow instructions, then uses population-based reflective search to evolve better versions by looking at task failures. The paper claims this beats fixed-memory baselines across conversation, embodied planning, and reasoning benchmarks, and that the evolved programs end up with visibly different structures for each domain.

Referee Report

2 major / 2 minor

Summary. The paper introduces M*, a method to automatically discover task-optimized memory harnesses for LLM agents by representing each memory system as an executable Python program (data Schema + storage Logic + agent Instructions) and optimizing it via reflective population-based code evolution that refines candidates by analyzing evaluation failures. It evaluates the approach on four benchmarks spanning conversation, embodied planning, and expert reasoning, claiming robust performance gains over fixed-memory baselines and the emergence of structurally distinct processing mechanisms tailored to each domain.

Significance. If the performance claims hold under proper controls, the work would be significant for demonstrating that automated search over memory program designs can outperform hand-crafted fixed architectures across domains, supporting the broader idea that memory mechanisms should be specialized rather than general-purpose. This could influence agent design by shifting focus toward evolutionary co-optimization of memory and task logic.

major comments (2)

§4 (Evaluation): The abstract and evaluation claim 'robust' improvements over fixed-memory baselines on all four tasks, but no details are provided on the specific baselines, number of independent runs, statistical significance tests, or performance variance. This information is load-bearing for validating the central performance claim and distinguishing genuine gains from noise.
Method and Evaluation sections: The optimization 'analyzes evaluation failures to iteratively refine the candidate programs,' yet the manuscript does not indicate whether these failures come from held-out data or the same benchmark distribution used for final reporting. Without this or explicit distribution-shift tests, the reported superiority risks arising from overfitting to task-specific artifacts rather than discovering generalizable memory mechanisms.
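The separation the second comment asks about can be sketched as below: the episodes the search reflects on are kept disjoint from those used for final reporting, and a guard rejects any overlap. This is an assumption-laden illustration (the function names, the 30% test fraction, and the episode IDs are hypothetical); the Figure 2 caption does mention a held-out test set, but the source gives no detail on how the splits are built.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def split_episodes(
    episode_ids: Sequence[str], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Return (search_ids, test_ids): disjoint sets for evolution and reporting."""
    ids = sorted(set(episode_ids))               # deduplicate, then fix the order
    random.Random(seed).shuffle(ids)             # reproducible shuffle
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def assert_no_leakage(reflected_on: Iterable[str], reported_on: Iterable[str]) -> None:
    """Fail loudly if any episode used for failure analysis is also reported on."""
    leaked = set(reflected_on) & set(reported_on)
    if leaked:
        raise ValueError(f"episodes used for both search and reporting: {sorted(leaked)}")


search_ids, test_ids = split_episodes([f"ep{i}" for i in range(10)])
assert_no_leakage(search_ids, test_ids)          # passes: the splits are disjoint
print(len(search_ids), len(test_ids))
```

Running the guard before final evaluation turns the overfitting concern into a mechanical check, although it does not address distribution shift between benchmarks.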

minor comments (2)

The LaTeX notation M$^\star$ should be introduced with a clear definition in the introduction and used consistently.
Consider including a table or figure explicitly comparing the evolved program structures (Schema/Logic/Instructions) across the four domains to substantiate the 'structurally distinct' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights key areas where our evaluation and methodological descriptions can be strengthened. We address each major comment below and commit to revisions that improve clarity and rigor without altering the core contributions.


Referee: §4 (Evaluation): The abstract and evaluation claim 'robust' improvements over fixed-memory baselines on all four tasks, but no details are provided on the specific baselines, number of independent runs, statistical significance tests, or performance variance. This information is load-bearing for validating the central performance claim and distinguishing genuine gains from noise.

Authors: We agree that these details are necessary to substantiate the robustness claims. In the revised manuscript, we will expand Section 4 (and add an experimental details appendix) to explicitly list the fixed-memory baselines with citations, report the number of independent runs conducted, include statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values), and present performance variance via standard deviations or confidence intervals across runs. These additions will directly address the concern and allow readers to assess the reliability of the reported gains.
revision: yes

Referee: Method and Evaluation sections: The optimization 'analyzes evaluation failures to iteratively refine the candidate programs,' yet the manuscript does not indicate whether these failures come from held-out data or the same benchmark distribution used for final reporting. Without this or explicit distribution-shift tests, the reported superiority risks arising from overfitting to task-specific artifacts rather than discovering generalizable memory mechanisms.

Authors: This is a fair and important point about potential overfitting. The current manuscript does not explicitly describe the data partitioning used during the reflective evolution process. In the revision, we will clarify in the Method section whether optimization failures are drawn from development splits (where available) versus the full benchmark distribution, and we will add explicit statements on how final results are computed on held-out portions. We will also incorporate additional experiments or ablations that test for distribution shift (e.g., cross-benchmark generalization or leave-one-task-out evaluations) to demonstrate that the evolved memory programs capture generalizable mechanisms rather than benchmark-specific artifacts. If certain benchmarks lack predefined splits, we will acknowledge this limitation and discuss its implications.
revision: yes
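The paired significance testing promised in the first response could take a form like the following. To stay within the standard library, this sketch swaps in an exact paired sign-flip permutation test rather than the t-test or Wilcoxon test the rebuttal names. The function name is hypothetical, and the per-run scores are made-up placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean
from typing import Sequence


def paired_sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value for the mean paired difference.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated and compared with the observed mean.
    Feasible only for small run counts (2**n assignments).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:          # tolerance for float rounding
            hits += 1
    return hits / total


# Placeholder scores for five paired runs (illustrative only).
evolved_scores = [0.62, 0.58, 0.65, 0.60, 0.63]
baseline_scores = [0.55, 0.57, 0.56, 0.54, 0.58]

p_value = paired_sign_flip_p_value(evolved_scores, baseline_scores)
print(p_value)  # 0.0625: only the all-positive and all-negative sign patterns match
```

With five runs the smallest achievable two-sided p-value is 2/32, so even a gain that is consistent across every run cannot reach the 0.05 level. This is one reason the number of independent runs matters for the robustness claim.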

Circularity Check

0 steps flagged

No circularity: purely empirical search with no derivations or load-bearing self-references


The paper describes an empirical procedure—population-based reflective code evolution that jointly optimizes Schema, Logic, and Instructions in Python memory programs—then reports measured performance gains on four benchmarks. No equations, first-principles derivations, or mathematical predictions exist that could reduce to fitted inputs or self-definitions by construction. The evolution process is presented as a search heuristic that inspects evaluation failures; the reported outcomes are direct experimental measurements rather than analytic claims. No self-citations are invoked to justify uniqueness or forbid alternatives, and no known empirical patterns are renamed as novel unification. The central claim therefore remains an independent empirical finding whose validity can be checked against the stated benchmarks and baselines without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the unproven premise that memory systems are best represented as jointly optimizable Python programs and that evolutionary search over them yields transferable gains. No specific numerical free parameters are named in the abstract.

axioms (2)

domain assumption: A memory system optimized for one purpose frequently fails to transfer to others.
Stated in the abstract as the motivation for per-task optimization.
domain assumption: Reflective code evolution with population-based search and failure analysis can iteratively improve candidate memory programs.
Core mechanism described but not justified in the abstract.

invented entities (1)

memory program (no independent evidence)
purpose: Encapsulates the data Schema, storage Logic, and agent workflow Instructions as an executable Python artifact.
New modeling choice introduced to enable joint optimization; no independent evidence provided beyond the method itself.
