pith. machine review for the scientific record.

arxiv: 2605.13486 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: unknown

R²-Mem: Reflective Experience for Memory Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · memory search · reflective learning · experience distillation · self-improving agents · deep search · rubric evaluation

The pith

R²-Mem distills abstract experiences from scored past trajectories to guide LLM agents away from repeated search errors without reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep search agents for memory systems often repeat mistakes because they do not extract lessons from prior high- and low-quality steps. R²-Mem addresses this by running an offline process that scores historical trajectories with a rubric and distills abstract experiences from them. At inference time the agent retrieves these experiences to shape its next actions. Experiments show consistent gains in both accuracy and efficiency over baselines. This approach offers a low-cost alternative to reinforcement learning for making agents self-improving.

Core claim

R²-Mem provides a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experiences. During online inference, the agent retrieves these experiences to guide future search actions, avoiding repeated mistakes and maintaining high-quality behavior.
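
A minimal sketch of that two-stage loop, assuming generic text-in, text-out LLM calls: the rubric text, score thresholds, and IF–THEN JSON experience format below are illustrative stand-ins rather than the paper's exact prompts, and `evaluate`/`distill` are hypothetical hooks for whatever model serves each role.

```python
# Offline stage sketch: score trajectory steps with a rubric, keep the clear
# wins and failures, and distill each into an abstract, reusable experience.
# All prompt text and thresholds are illustrative, not taken from the paper.
import json

RUBRIC = (
    "Score this search step from 0 (useless) to 10 (excellent) for relevance "
    "to the query, avoidance of redundant retrievals, and contribution to the "
    "final answer. Reply with a single integer."
)

DISTILL = (
    "This step was rated {label}. Abstract the situation without concrete "
    "entities and return JSON: "
    '{{"situation": "<abstract situation>", '
    '"experience": "IF <situation> THEN <action>"}}'
)

def build_experience_bank(trajectories, evaluate, distill, low=3, high=8):
    """evaluate/distill are text-in, text-out LLM calls; the thresholds bound
    which steps count as clearly low- or high-quality."""
    bank = []
    for traj in trajectories:
        for step in traj["steps"]:
            score = int(evaluate(f"{RUBRIC}\n\nStep:\n{step}"))
            if low < score < high:      # ambiguous middle: no lesson distilled
                continue
            label = "high-quality" if score >= high else "low-quality"
            raw = distill(DISTILL.format(label=label) + f"\n\nStep:\n{step}")
            bank.append({"score": score, **json.loads(raw)})
    return bank
```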

What carries the argument

The R²-Mem framework, built around a Rubric-guided Evaluator that assigns quality scores to trajectory steps and a self-Reflection Learner that converts those scores into reusable abstract experiences.
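
The online half might look like the sketch below, assuming experiences are retrieved by embedding similarity over their abstracted situations; `embed` stands in for any sentence-embedding model, and the prompt wiring is invented for illustration.

```python
# Online stage sketch: pick the k experiences whose abstract situations are
# closest to the incoming query, then inject them into the planning prompt.
import numpy as np

def retrieve_experiences(query, bank, embed, k=4):
    """bank entries come from build_experience_bank; embed: text -> vector."""
    q = np.asarray(embed(query))
    sims = []
    for entry in bank:
        v = np.asarray(embed(entry["situation"]))   # cacheable offline
        sims.append(float(v @ q) / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-9))
    top = np.argsort(sims)[::-1][:k]
    return [bank[i]["experience"] for i in top]

def augmented_prompt(query, bank, embed, k=4):
    lessons = "\n".join(f"- {e}" for e in retrieve_experiences(query, bank, embed, k))
    return f"Past lessons:\n{lessons}\n\nPlan the next search step for: {query}"
```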

If this is right

  • F1 scores on memory search tasks rise by up to 22.6 percent.
  • Token consumption falls by 12.9 percent relative to strong baselines.
  • The number of search iterations drops by 20.2 percent.
  • Self-improvement occurs without reinforcement learning or online parameter updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same offline scoring and distillation pattern could extend to other agent behaviors such as tool selection or multi-step planning.
  • Accumulated experiences might serve as a lightweight, query-independent memory store that grows over time.
  • The separation of offline analysis from online use suggests a general route to cheaper agent improvement than full reinforcement learning loops.

Load-bearing premise

The rubric-guided evaluator can reliably separate high-quality from low-quality steps, and the distilled experiences transfer to new queries rather than overfitting to the original trajectories.

What would settle it

Replace the rubric evaluator with random scoring, or test the distilled experiences on query types absent from the offline data, and measure whether the accuracy and efficiency gains disappear.
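
Both controls are cheap to wire up. A hedged sketch, reusing `build_experience_bank` from the offline sketch above; the `category` field assumed for the OOD split is hypothetical.

```python
# Ablation sketch: (1) swap the rubric evaluator for random scores, (2) split
# trajectories by query category so distillation never sees the test category.
import random

def random_evaluator(_prompt):
    """Control: uniform random scores in the same 0-10 range as the rubric."""
    return str(random.randint(0, 10))

def ood_split(trajectories, held_out_category):
    """Distill from every other category; evaluate only on the held-out one."""
    train = [t for t in trajectories if t["category"] != held_out_category]
    test = [t for t in trajectories if t["category"] == held_out_category]
    return train, test

# e.g. bank = build_experience_bank(train, evaluate=random_evaluator, distill=llm)
# If the accuracy and efficiency gains survive random scoring, the rubric is
# not doing the work the core claim attributes to it.
```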

Figures

Figures reproduced from arXiv: 2605.13486 by Junkang Wu, Wenyu Mao, Xiangnan He, Xiang Wang, Xinyuan Wang.

Figure 1: R²-Mem achieves higher scores with lower consumption.
Figure 2: Deep Memory Search System. The overall framework of the memory search system.
Figure 3: R²-Mem Framework. (a) Construct an experience bank using 10% of historical trajectories. (b) Perform memory search with experience augmentation on the remaining 90% of the datasets. (c) Detailed process of planning and reflection experience retrieval. The framework consists of two coordinated components: a Rubric-guided Evaluator and a self-Reflection-based Learner. …
Figure 4: R²-Mem remains consistently better than the baseline across all settings; robust to moderate threshold variations, with degradation at extreme configurations such as (6, 9) and (3, 12), where filtering becomes either overly permissive or strict.
Figure 5: Effect of experience retrieval size k on model performance and retrieval quality.
Figure 6: Self-evolution performance of R²-Mem under different backbone models.
Figure 7: LoCoMo Conversation Category Distribution with Dataset Average.
Figure 8: Token consumption across sequential conversations under different backbone models.
read the original abstract

Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-managed. However, existing deep search agents for memory system repeat past error behaviors because they fail to learn from the prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During the online inference, the retrieved experience will guide future search actions to avoid repeated mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides a RL-free and low-cost solution for self-improving LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces R²-Mem, a reflective experience framework for deep memory search in LLM agents. In the offline stage, a Rubric-guided Evaluator scores steps in historical trajectories as high- or low-quality, and a self-Reflection Learner distills abstract experiences; these are retrieved during online inference to guide future search actions and avoid repeated errors. Experiments report consistent gains over strong baselines, with F1 scores improving by up to 22.6% alongside 12.9% lower token consumption and 20.2% fewer search iterations, positioning the method as an RL-free, low-cost approach to self-improving agents.

Significance. If the evaluator reliably partitions trajectories and the distilled experiences generalize, the work offers a practical, offline route to agent self-improvement that avoids the cost and instability of RL. The dual gains in effectiveness and efficiency would be relevant to memory-augmented LLM systems. The absence of explicit validation for the evaluator and generalization tests, however, leaves the central causal claim under-supported.

major comments (3)
  1. [Experiments] Experiments section: the reported F1 gains (up to 22.6%) and efficiency reductions are presented without error bars, number of runs, statistical significance tests, or explicit train/test split details. This makes it impossible to assess whether the improvements are robust or reproducible, directly affecting the soundness of the performance claims (a minimal significance-test sketch follows the minor comments below).
  2. [Method] Method section (offline stage description): the Rubric-guided Evaluator is load-bearing for partitioning trajectories into high- and low-quality steps, yet no rubric definition, inter-rater reliability, or correlation analysis between evaluator scores and downstream task metrics (e.g., F1) is provided. Without this, it is unclear whether the evaluator adds causal information or merely echoes biases already present in the trajectories.
  3. [Experiments] Experiments / Ablation subsection: no OOD or cross-query ablation is reported to test whether the distilled abstract experiences transfer to unseen queries rather than overfitting to the offline trajectories used for distillation. This directly undermines the generalization claim central to the self-improvement narrative.
minor comments (2)
  1. [Abstract] The acronym R²-Mem is used throughout without an explicit expansion of the superscript notation in the abstract or introduction.
  2. [Method] Notation for the self-Reflection Learner and experience retrieval mechanism could be clarified with a short pseudocode or diagram reference.
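
For major comment 1, the missing statistics take only a few lines once per-query scores are logged: a paired t-test over per-query F1 between R²-Mem and a baseline, plus the standard error of the mean gain. A minimal sketch using scipy.stats.ttest_rel; the inputs are placeholders.

```python
# Paired significance test sketch: both arrays hold per-query F1 from the
# same query set, so the test pairs scores query by query.
import numpy as np
from scipy import stats

def compare_runs(f1_method, f1_baseline):
    f1_method = np.asarray(f1_method, dtype=float)
    f1_baseline = np.asarray(f1_baseline, dtype=float)
    t, p = stats.ttest_rel(f1_method, f1_baseline)
    gain = f1_method - f1_baseline
    sem = gain.std(ddof=1) / np.sqrt(len(gain))   # standard error of mean gain
    return {"mean_gain": gain.mean(), "sem": sem, "t": t, "p": p}
```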

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity, statistical rigor, and validation of our claims. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported F1 gains (up to 22.6%) and efficiency reductions are presented without error bars, number of runs, statistical significance tests, or explicit train/test split details. This makes it impossible to assess whether the improvements are robust or reproducible, directly affecting the soundness of the performance claims.

    Authors: We agree that these details are necessary to establish robustness. In the revised manuscript, we will report all main results as averages over 5 independent runs with standard error bars, include p-values from paired t-tests against baselines to demonstrate statistical significance, and explicitly describe the train/test splits (e.g., 70/30 split for offline distillation trajectories versus online evaluation queries). revision: yes

  2. Referee: [Method] Method section (offline stage description): the Rubric-guided Evaluator is load-bearing for partitioning trajectories into high- and low-quality steps, yet no rubric definition, inter-rater reliability, or correlation analysis between evaluator scores and downstream task metrics (e.g., F1) is provided. Without this, it is unclear whether the evaluator adds causal information or merely echoes biases already present in the trajectories.

    Authors: The rubric is based on explicit criteria including step relevance to the query, avoidance of redundant retrievals, and contribution to final answer quality. We will include the full rubric text in an appendix of the revised manuscript. As the evaluator is LLM-based, we will add a correlation analysis between its scores and downstream F1 improvements on a held-out validation set to show that higher scores predict better task performance, thereby supporting the causal role of the evaluator (a minimal sketch of this check follows the response list). revision: yes

  3. Referee: [Experiments] Experiments / Ablation subsection: no OOD or cross-query ablation is reported to test whether the distilled abstract experiences transfer to unseen queries rather than overfitting to the offline trajectories used for distillation. This directly undermines the generalization claim central to the self-improvement narrative.

    Authors: Our current evaluation already spans diverse query variations that differ from the offline distillation set, providing initial support for generalization. To directly address the concern, we will add a dedicated OOD ablation in the revised manuscript: experiences will be distilled from one query category and evaluated on entirely held-out categories, with results reported to quantify transfer performance. revision: yes
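
The correlation analysis promised in response 2 reduces to a rank correlation between per-query evaluator scores and downstream F1 on held-out queries. A minimal sketch using scipy.stats.spearmanr; the inputs are placeholders.

```python
# Score-to-outcome check sketch: does a higher mean evaluator score on a
# query's trajectory predict a higher final F1 for that query?
from scipy import stats

def score_outcome_correlation(mean_step_scores, final_f1):
    """One mean rubric score and one F1 per held-out query, index-aligned."""
    rho, p = stats.spearmanr(mean_step_scores, final_f1)
    return rho, p
```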

Circularity Check

0 steps flagged

No significant circularity in the R²-Mem derivation chain

full rationale

The paper presents an empirical framework: offline rubric scoring of trajectories followed by distillation of abstract experiences, then online retrieval for guidance. Reported gains (F1 up to 22.6%, token reduction 12.9%, iteration reduction 20.2%) are measured on held-out tasks. No equations, fitted parameters, or self-referential definitions appear in the provided text that would reduce these outcomes to inputs by construction. No load-bearing self-citations, uniqueness theorems, or renamed known results are invoked. The chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the assumption that rubric scores capture transferable quality signals and that retrieved experiences remain useful at inference time; no free parameters are introduced, and the only new entities are the two framework components listed below.

axioms (1)
  • domain assumption Rubric scores on historical trajectories reliably identify reusable good and bad behaviors
    Invoked in the offline stage description
invented entities (2)
  • Rubric-guided Evaluator no independent evidence
    purpose: Score low- and high-quality steps in trajectories
    New component introduced by the paper
  • self-Reflection Learner no independent evidence
    purpose: Distill abstract experience from scored steps
    New component introduced by the paper

pith-pipeline@v0.9.0 · 5486 in / 1206 out tokens · 31642 ms · 2026-05-14T19:06:53.510192+00:00 · methodology

discussion (0)

