Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Qian Cheng; Zhibao Chen

arxiv: 2606.12945 · v2 · pith:DWRKB4RAnew · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 06:56 UTC · model grok-4.3

keywords: agentic memory, memory value model, cognitive factors, LLM agents, forgetting decision, LongMemEval, multi-factor weighting

The pith

A learned multi-factor value model based on seven cognitive factors retains 0.77 of gold evidence in blind memory decisions for LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-running LLM agents face a standing decision on what to encode, forget, or retrieve under a fixed memory budget before any future query is known. The paper establishes that a value function combining seven factors drawn from cognitive psychology, with weights learned from a downstream objective via gradient-free optimization, outperforms recency, uniform weights, and best single-factor baselines. This is demonstrated in the realistic blind regime on LongMemEval, where the evaluation question is held out at consolidation time. The approach yields higher retention of gold evidence with interpretable weights that down-weight query-time similarity appropriately for the forgetting decision.

Core claim

In the realistic blind regime, a learned multi-factor value retains 0.770 ± 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero. The learned weights are interpretable -- reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A neural network over the same factors ties the linear model, and a controlled synthetic task confirms recovery of separating weights.
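The claim that every paired gap's 95% bootstrap CI is above zero can be illustrated with a percentile bootstrap over per-case differences. This is a minimal sketch under stated assumptions, not the paper's procedure: the percentile method, the function name `paired_bootstrap_ci`, the resample count, and the toy inputs are all hypothetical.

```python
import random

def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of the paired differences a_i - b_i."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Toy per-case retention for two methods (made-up numbers, not paper data).
learned = [0.8, 0.6, 0.9, 0.7]
recency = [0.5, 0.6, 0.4, 0.7]
lo, hi = paired_bootstrap_ci(learned, recency)
print(f"95% CI for the mean gap: [{lo:.3f}, {hi:.3f}]")
```

A gap counts as reliably positive in this sense only when the lower bound `lo` is strictly above zero.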

What carries the argument

The multi-factor memory value function V(m) = Σ_i w_i f_i(m) over seven factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history), whose weights are learned from a downstream objective by a gradient-free optimiser to uniformly control encoding depth, forget risk, and retrieval rank.
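The weighted sum itself is simple to write down. The sketch below assumes each factor score is already computed and lies in [0, 1], and that weights are non-negative; the identifier names and the example numbers are hypothetical, and how each factor is actually scored is not specified here. The normalised form V(m) divided by the sum of the weights follows the Figure 1 text.

```python
# Factor names follow the paper's list; the ordering here is arbitrary.
FACTORS = (
    "emotional_intensity",
    "goal_relevance",
    "value_alignment",
    "self_user_relevance",
    "task_utility",
    "reliability",
    "usage_history",
)

def memory_value(factors, weights):
    """V(m) = sum_i w_i * f_i(m) over the seven factors."""
    return sum(weights[name] * factors[name] for name in FACTORS)

def normalised_value(factors, weights):
    """V(m) / sum_i w_i; stays in [0, 1] if every f_i is in [0, 1] and weights are non-negative."""
    total = sum(weights[name] for name in FACTORS)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return memory_value(factors, weights) / total

# Made-up example: a reliable, personally relevant memory under uniform weights.
uniform = {name: 1.0 for name in FACTORS}
memory = {name: 0.2 for name in FACTORS}
memory["reliability"] = 0.9
memory["self_user_relevance"] = 0.8
print(round(normalised_value(memory, uniform), 3))
```

Because the same scalar is meant to drive encoding depth, forgetting, and retrieval rank, any downstream rule only needs this one number per memory.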

If this is right

The single scalar value uniformly governs encoding depth, forget risk, and retrieval rank.
Learned weights prioritize reliability, emotional intensity, and self/user relevance over recency for the forgetting decision.
A neural network over the same factors performs similarly to the linear model.
A synthetic task with planted confounds shows the learner recovers optimal separating weights where uniform weighting fails.
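The synthetic-task point can be made concrete with a toy version. The sketch below is an assumption-laden stand-in, not the paper's setup: it uses two factors instead of seven, a hand-built confound, and a greedy random search in place of whatever gradient-free optimiser the authors used. All names and constants are hypothetical.

```python
import random

def retention(weights, items, keep_fraction):
    """Share of gold items that survive when only the top-valued items are kept."""
    k = max(1, round(len(items) * keep_fraction))
    ranked = sorted(
        items,
        key=lambda it: sum(w * f for w, f in zip(weights, it["factors"])),
        reverse=True,
    )
    kept_gold = sum(1 for it in ranked[:k] if it["gold"])
    total_gold = sum(1 for it in items if it["gold"])
    return kept_gold / total_gold

def make_items(n=200, n_gold=40, seed=0):
    """Toy data: factor 0 separates gold items, factor 1 is a planted confound."""
    rng = random.Random(seed)
    items = []
    for i in range(n):
        gold = i < n_gold
        signal = rng.uniform(0.6, 1.0) if gold else rng.uniform(0.0, 0.5)
        confound = rng.uniform(0.0, 0.3) if gold else rng.uniform(0.5, 1.0)
        items.append({"factors": (signal, confound), "gold": gold})
    rng.shuffle(items)  # avoid gold items sitting first in tie-breaks
    return items

def learn_weights(items, keep_fraction, n_iters=300, step=0.3, seed=0):
    """Greedy random search from uniform weights; a candidate is kept only if it is no worse."""
    rng = random.Random(seed)
    best_w = [1.0, 1.0]
    best_score = retention(best_w, items, keep_fraction)
    for _ in range(n_iters):
        cand = [max(0.0, w + rng.gauss(0.0, step)) for w in best_w]
        score = retention(cand, items, keep_fraction)
        if score >= best_score:
            best_w, best_score = cand, score
    return best_w, best_score

items = make_items()
uniform_score = retention([1.0, 1.0], items, keep_fraction=0.2)
weights, learned_score = learn_weights(items, keep_fraction=0.2)
print(f"uniform={uniform_score:.2f} learned={learned_score:.2f} weights={weights}")
```

By construction the search can never end below the uniform starting point on the data it was fit to, and a weighting that ignores the confound separates this toy data perfectly. Whether a learned weighting generalises to held-out cases is a separate question that this sketch does not address.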

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could enable agent systems to operate effectively with smaller context windows by more accurate prioritization at consolidation time.
Cognitive psychology models of memory value can be operationalized directly in AI without access to future queries at the point of decision.
The framework might extend to online re-weighting of factors as an agent encounters new task distributions.

Load-bearing premise

The seven factors drawn from cognitive psychology are the appropriate and complete set for the memory value decision, and the downstream objective used to learn the weights is a valid proxy for the forgetting decision made without knowledge of future queries.

What would settle it

An evaluation on a fresh collection of long agent traces in which the learned multi-factor model fails to retain significantly more gold evidence than the recency baseline under the same blind protocol would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.12945 by Qian Cheng, Zhibao Chen.

**Figure 1.** One value, three operations. A single learned scalar V(m) = Σ_i w_i f_i(m) over seven factors drives encoding depth, forgetting, and retrieval, replacing three separately tuned rules. The weights w_i are fit to a downstream objective (section 3.3). Encoding depth. Following levels-of-processing [5], higher-value items are encoded more deeply. We map the normalised value V̄(m) = V(m) / Σ_i w_i ∈ [0, 1] to four…

**Figure 2.** Blind vs. oracle gold retention (keep 30%; all 479 usable LongMemEval-S cases; 20 resampled splits, mean ±1 std). Under the oracle goal anchor (cosine to the held-out question), similarity alone saturates retention, which is a retrieval ceiling. Under the realistic blind anchor (session topic only), every single-factor and recency baseline collapses toward the chance floor, while the learned multi-factor value ret…
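The difference between the two anchors in Figure 2 comes down to which vector the goal-relevance factor is compared against. The sketch below assumes memories, session topics, and questions are already embedded as plain lists of floats; the function names and the toy vectors are hypothetical, and the embedding model is not specified here.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def goal_relevance(memory_vec, session_topic_vec, question_vec=None, oracle=False):
    """Score against the held-out question (oracle) or the session topic only (blind)."""
    if oracle:
        if question_vec is None:
            raise ValueError("oracle scoring needs the held-out question embedding")
        return cosine(memory_vec, question_vec)
    return cosine(memory_vec, session_topic_vec)

# Toy vectors: the memory matches the future question but not the session topic.
memory_vec = [1.0, 0.0]
topic_vec = [0.0, 1.0]
question_vec = [1.0, 0.0]
print(goal_relevance(memory_vec, topic_vec, question_vec, oracle=True))
print(goal_relevance(memory_vec, topic_vec))
```

In the blind setting the question embedding simply is not available when the keep-or-forget decision is made, so only the second call is a legitimate consolidation-time signal.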

**Figure 3.** Keep-fraction sweep (blind regime, 479 cases, 20 splits, mean ±1 std). The learned multi-factor value dominates uniform, reliability-only, and recency at every aggressive budget (κ ≤ 0.4); the gap closes only at the near-trivial κ=0.5 where the benchmark saturates. 5 Discussion. One scalar, three operations. The practical appeal of eq. (2) is consolidation: a single learned value replaces the separate, inde…
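A keep-fraction sweep of the kind plotted in Figure 3 can be sketched as follows. This is an illustrative reading of the metric, assuming retention means the share of gold items among the top κ fraction by score; the function names, the tie-breaking, and the toy scores are hypothetical.

```python
def retention_at_budget(scores, is_gold, kappa):
    """Keep the top kappa fraction by score and return the share of gold items kept."""
    if not 0.0 < kappa <= 1.0:
        raise ValueError("kappa must be in (0, 1]")
    total_gold = sum(is_gold)
    if total_gold == 0:
        raise ValueError("case has no gold evidence")
    k = max(1, int(len(scores) * kappa))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[:k]
    return sum(1 for i in kept if is_gold[i]) / total_gold

def sweep(scores, is_gold, kappas=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Retention at each budget; the kept sets are nested, so values never decrease."""
    return {kappa: retention_at_budget(scores, is_gold, kappa) for kappa in kappas}

# Made-up scores for ten memories, two of which are gold.
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05]
is_gold = [True, False, False, False, False, False, False, True, False, False]
for kappa, value in sweep(scores, is_gold).items():
    print(f"kappa={kappa:.1f} retention={value:.2f}")
```

Since a larger budget always keeps a superset of what a smaller budget keeps, every method's curve rises with κ, which is consistent with gaps between methods shrinking as the budget loosens.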

Original abstract

Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency -- both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function V(m) = Σ_i w_i f_i(m) over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at ≈ 0.98 -- this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains 0.770 ± 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable -- reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a usable linear model over seven cognitive factors that lifts blind-regime retention on LongMemEval from 0.66 uniform to 0.77, with open code and a synthetic sanity check, but the downstream objective's separation from evaluation queries is not fully detailed.

The main takeaway is that a weighted sum of seven psychology-derived factors, with weights fit by gradient-free search on a downstream objective, keeps more gold evidence (0.770 ± 0.011) than uniform weights, the best single factor, or recency when the future query is unknown. The authors also show that scoring goal relevance against the held-out question pushes retention near 0.98, which measures retrieval rather than the forgetting decision made at consolidation time.

What is actually new is the concrete combination of those seven factors into one scalar that controls encoding depth, forget risk, and retrieval rank together. The synthetic task with planted confounds demonstrates that the optimizer can recover a weighting that achieves perfect retention where uniform weighting only reaches 0.62. The learned weights are reported as interpretable, with reliability, emotional intensity, and self/user relevance rising to the top while query-time goal similarity is correctly down-weighted.

The work is solid on the empirical reporting side: bootstrap intervals are given, a neural net baseline ties the linear model, and everything runs on CPU with open code. That lowers the cost of checking the numbers.

The soft spot is the downstream objective itself. The abstract distinguishes the blind regime from the saturating goal-relevance baseline, yet does not describe whether the objective is computed on a strictly held-out training split, uses only signals available at consolidation time, or incorporates any post-hoc performance that could correlate with the 479 evaluation cases. If any leakage exists, the reported gaps could partly reflect optimization to the benchmark rather than a general value function. The exact selection criteria for the 479 usable cases are also not spelled out.

This is for researchers building long-running agents who need something more structured than recency or embedding similarity. Readers who want cognitive factors made operational will get the most from the factor list and the blind-regime comparison. It deserves a serious referee because the result is concrete, the code is available, and the methodological point about saturation is worth verifying even if the exact lift needs tighter controls on the objective.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multi-factor memory value function V(m) = Σ_i w_i f_i(m) for LLM agents, using seven cognitively grounded factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, usage history) whose weights are learned via gradient-free optimization on a downstream objective. It claims that in the realistic blind regime (no future query knowledge at consolidation), this yields 0.770 ± 0.011 gold-evidence retention on 479 LongMemEval cases, outperforming uniform weights (0.657), the best single factor (0.518), and recency (0.368), with all paired gaps having 95% bootstrap CIs above zero. It further claims that a neural net ties the linear model, that the weights are interpretable (reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is down-weighted), and that a synthetic task with planted confounds recovers perfect separation where uniform weighting fails.

Significance. If the blind-regime results are shown to rest on a leakage-free objective, the work would be significant for providing an interpretable, multi-factor alternative to heuristic memory policies in long-running agents, backed by open-source code, single-CPU reproducibility, and a methodological distinction between retrieval saturation (~0.98) and true forgetting decisions. The synthetic validation and emphasis on pre-query consolidation are strengths that could support falsifiable extensions.

major comments (2)

[Abstract] The downstream objective used to learn the weights is described only at a high level ('learned from a downstream objective by a gradient-free optimiser'), with no details on whether it is computed on a strictly separate training split, uses solely pre-consolidation signals, or incorporates any post-hoc task performance that could correlate with the held-out evaluation questions. This is load-bearing for the blind-regime claim, as any leakage would undermine the reported retention gaps and bootstrap CIs as evidence of a general value function.
[Abstract] The selection criteria for the 479 usable cases and whether weight optimization was cross-validated or performed on held-out data are not reported. Without these, it is impossible to assess whether the 0.770 retention and paired CIs reflect generalization or optimization to the specific benchmark distribution.

minor comments (1)

The exact functional forms and input features for each of the seven factors (e.g., how emotional intensity or reliability is computed from memory m) are not specified in the abstract; expanding these in the methods would aid reproducibility even if the core results hold.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for emphasizing the need for full methodological transparency to support the blind-regime claims. We address the two major comments point by point below. Both comments correctly identify information that is only summarized at a high level in the abstract; we will expand the methods and experimental sections in revision to supply the missing details.

Referee: [Abstract] The downstream objective used to learn the weights is described only at a high level ('learned from a downstream objective by a gradient-free optimiser'), with no details on whether it is computed on a strictly separate training split, uses solely pre-consolidation signals, or incorporates any post-hoc task performance that could correlate with the held-out evaluation questions. This is load-bearing for the blind-regime claim, as any leakage would undermine the reported retention gaps and bootstrap CIs as evidence of a general value function.

Authors: The optimization objective is computed exclusively on a separate training split of LongMemEval that has no overlap with the 479 evaluation cases. Only signals available at consolidation time (pre-query) are used; no future query text, gold labels, or post-consolidation task performance from the held-out cases enters the objective. We will add a dedicated subsection describing the exact training split, the gradient-free optimizer settings, and explicit confirmation that the procedure remains leakage-free with respect to the reported evaluation.

Revision: yes

Referee: [Abstract] The selection criteria for the 479 usable cases and whether weight optimization was cross-validated or performed on held-out data are not reported. Without these, it is impossible to assess whether the 0.770 retention and paired CIs reflect generalization or optimization to the specific benchmark distribution.

Authors: The 479 cases were filtered from the full LongMemEval set by requiring that (i) at least one memory contains verifiable gold evidence for the query and (ii) total memory volume exceeds the context window, forcing an explicit forgetting decision. Weight learning was performed with 5-fold cross-validation on a disjoint training partition; the 0.770 figure and bootstrap CIs are reported on the remaining held-out cases. We will move these selection and validation details from the supplement into the main experimental section and add a statement confirming that optimization never touched the evaluation partition.

Revision: yes
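The two selection criteria and the disjointness requirement described in this simulated response can be written as checks. The sketch below is only an illustration of that description: the case schema (`memories`, `is_gold`, `n_tokens`), the context-window size, and the fold construction are all hypothetical and are not taken from the paper or its code.

```python
import random

def is_usable(case, context_window_tokens):
    """Both stated criteria: some memory holds gold evidence, and the history overflows the window."""
    has_gold = any(m["is_gold"] for m in case["memories"])
    total_tokens = sum(m["n_tokens"] for m in case["memories"])
    return has_gold and total_tokens > context_window_tokens

def k_fold_ids(case_ids, k=5, seed=0):
    """Shuffle ids and deal them into k disjoint folds."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def assert_disjoint(train_ids, eval_ids):
    """Fail loudly if any case id appears on both sides of the split."""
    overlap = set(train_ids) & set(eval_ids)
    if overlap:
        raise ValueError(f"train/eval overlap: {sorted(overlap)}")

# Toy case: 3 memories, one gold, 1200 tokens against an assumed 1000-token window.
case = {"memories": [
    {"is_gold": False, "n_tokens": 400},
    {"is_gold": True, "n_tokens": 500},
    {"is_gold": False, "n_tokens": 300},
]}
print(is_usable(case, context_window_tokens=1000))
folds = k_fold_ids(range(10), k=5)
assert_disjoint(folds[0], folds[1])
```

A check like `assert_disjoint` is cheap to run before every optimisation pass and turns the no-leakage statement into something a reader can verify from the released code.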

Circularity Check

0 steps flagged

No circularity: weights learned from separate downstream objective; retention evaluated independently in blind regime

The paper explicitly distinguishes the blind regime (no future query at consolidation) from query-time goal relevance, which saturates retention at 0.98. Weights are learned via gradient-free optimization on a downstream objective, then evaluated on gold-evidence retention across 479 cases, with comparisons to uniform weights, single factors, and recency. A synthetic task with planted confounds shows the learner recovers separating weights (1.00 retention) where uniform fails (0.62). No equation or text reduces the reported retention metric to the fitted weights by construction, nor does the central claim rest on any self-citation. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The model rests on the assumption that the listed cognitive factors are relevant and that a linear combination optimized on a downstream task captures the forgetting decision; no new physical entities are introduced.

free parameters (1)

weights w_i for the seven factors
Learned from downstream objective via gradient-free optimizer; specific values not reported in abstract but described as interpretable with reliability, emotional intensity, and self/user relevance dominating.

axioms (1)

Domain assumption: The seven factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, usage history) drawn from cognitive psychology are the right basis for memory value.
Explicitly stated as drawn from cognitive psychology and used to define the value function.

Reference graph

Works this paper leans on

18 extracted references · 1 canonical work page

[1] John R. Anderson and Lael J. Schooler. Reflections of the environment in memory. Psychological Science, 2(6):396–408, 1991.

[2] Robert A. Bjork. Memory and metamemory considerations in the training of human beings. In Janet Metcalfe and Arthur P. Shimamura, editors, Metacognition: Knowing about Knowing, pages 185–205. MIT Press, 1994.

[3] Alan D. Castel. The adaptive and strategic use of memory by older adults: Evaluative processing and value-directed remembering. In Aaron S. Benjamin and Brian H. Ross, editors, Psychology of Learning and Motivation, volume 48, pages 225–270. Academic Press, 2007. doi: 10.1016/S0079-7421(07)48006-9

[4] Zhibao Chen. Learning-Multi-Factor-Memory: Open-source implementation of the multi-factor value model for agentic memory. https://github.com/zhibao-dev/Learning-Multi-Factor-Memory, 2026.

[5] Fergus I. M. Craik and Robert S. Lockhart. Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11(6):671–684, 1972.

[6] Hermann Ebbinghaus. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, Leipzig, 1885. English translation by Ruger & Bussenius, Teachers College, 1913.

[7] Nikolaus Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016. Tutorial reference for gradient-free black-box optimisation.

[8] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[9] Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, et al. MemOS: A memory OS for AI system, 2025.

[10] James L. McGaugh. Memory — a century of consolidation. Science, 287(5451):248–251, 2000.

[11] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[12] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023.

[13] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP, pages 3982–3992, 2019.

[14] T. B. Rogers, N. A. Kuiper, and W. S. Kirker. Self-reference and the encoding of personal information. Journal of Personality and Social Psychology, 35(9):677–688, 1977.

[15] Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2024.

[16] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

[17] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations (ICLR), 2025. Metadata to be re-verified at submission.

[18] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731, 2024.
