Nous: A Predictive World Model for Long-Term Agent Memory
Pith reviewed 2026-06-26 11:46 UTC · model grok-4.3
The pith
Nous models long-term agent memory as a predictive world model of categorical probability distributions updated by Bayesian surprise rather than stored facts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nous maintains a predictive world model consisting of categorical probability distributions called dimensions, one per observed entity-attribute pair. Each new observation updates its dimension through a closed-form Bayesian posterior computed from information-theoretic surprise S = -log2 P(obs | D). The system records only the delta between prior and posterior rather than the fact, lets forgetting emerge as entropy decay to the uniform distribution, and resolves entity identity via mutual information across dimension sets.
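As a concrete reading of the update loop described above, here is a minimal sketch of one dimension. It assumes a Dirichlet-categorical model with pseudo-counts, which is one standard way to get a closed-form posterior; the source does not confirm this is the form Nous uses. The class name `Dimension`, the `weight` parameter, and the uniform prior of one pseudo-count per value are all illustrative assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Dimension:
    """Categorical belief over the values of one (entity, attribute) pair."""
    values: list[str]
    # Assumption: Dirichlet pseudo-counts, 1.0 per value (uniform prior).
    counts: dict[str, float] = field(default_factory=dict)

    def __post_init__(self):
        for v in self.values:
            self.counts.setdefault(v, 1.0)

    def probs(self) -> dict[str, float]:
        """Posterior predictive P(value | D) from the current pseudo-counts."""
        total = sum(self.counts.values())
        return {v: c / total for v, c in self.counts.items()}

    def observe(self, obs: str, weight: float = 1.0):
        """Score surprise, update the belief, and return (surprise, delta).

        Assumes obs is one of the known values; a real system would need a
        policy for unseen values, which the source does not describe.
        """
        prior = self.probs()
        surprise_bits = -math.log2(prior[obs])   # S = -log2 P(obs | D)
        self.counts[obs] += weight               # closed-form posterior update
        posterior = self.probs()
        # The delta is the shift from prior to posterior, per value.
        delta = {v: posterior[v] - prior[v] for v in prior}
        return surprise_bits, delta


home_city = Dimension(["Paris", "Berlin", "Rome", "Oslo"])
surprise_bits, delta = home_city.observe("Paris")
```

With four equally likely values, the first observation carries 2 bits of surprise and moves probability mass toward the observed value. The delta sums to zero because it is a difference of two normalized distributions.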
What carries the argument
Dimensions: categorical probability distributions, one per entity-attribute pair, that together form the predictive world model. Each dimension is updated by a closed-form Bayesian posterior after the incoming observation is scored for surprise.
If this is right
- The approach yields F1 scores of 63.50 single-hop, 55.32 multi-hop, 58.57 temporal, and 62.50 open-domain on LoCoMo with GPT-4o-mini.
- It exceeds the reported numbers of A-MEM in three of four categories and BeliefMem in all four under the stated evaluation conditions.
- No external vector database or graph engine is required for operation.
- Forgetting and identity resolution arise directly from entropy increase and mutual information without separate modules.
- The primary memory artifact is the belief delta rather than any explicit fact representation.
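The claim that forgetting arises from entropy increase can be illustrated with a small sketch. It assumes forgetting is implemented by mixing a belief toward the uniform distribution at a fixed rate per step; the source states only that entropy decays toward uniform, so the mixing rule and the names `decay`, `rate`, and `steps` are hypothetical.

```python
import math


def entropy_bits(p: dict[str, float]) -> float:
    """Shannon entropy of a categorical distribution, in bits."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def decay(p: dict[str, float], rate: float, steps: int = 1) -> dict[str, float]:
    """Mix p toward uniform (assumed rule): keep (1 - rate)^steps of the belief."""
    keep = (1.0 - rate) ** steps
    uniform = 1.0 / len(p)
    return {k: keep * x + (1.0 - keep) * uniform for k, x in p.items()}


belief = {"Paris": 0.7, "Berlin": 0.1, "Rome": 0.1, "Oslo": 0.1}
faded = decay(belief, rate=0.1, steps=5)
# Entropy rises toward the maximum log2(4) = 2 bits as the belief fades.
print(entropy_bits(belief), entropy_bits(faded))
```

Under this rule an unrefreshed belief never drops a value outright. It only loses confidence, so a later observation of the same value restores it quickly.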
Where Pith is reading between the lines
- If attribute independence holds across typical conversations, the same architecture could be extended to multi-agent settings by sharing dimensions across agents.
- The surprise-driven update rule suggests a natural link to active inference agents that select actions to reduce expected surprise.
- A direct test would measure whether the stored deltas alone suffice for downstream planning tasks that require reconstructing full conversation histories.
- Standardizing the LoCoMo evaluation pipeline would clarify whether the reported gains over concurrent belief-based systems are reproducible.
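To make the pipeline concern concrete, the sketch below computes a token-overlap F1 between a predicted and a reference answer. The source does not say how F1 is computed, so this assumes the common SQuAD-style definition with lowercasing and whitespace tokenization; the function name `token_f1` is illustrative.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 (assumed metric): harmonic mean of precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("she moved to paris", "paris"))
```

Choices such as punctuation stripping, article removal, or stemming change the token sets and therefore the score, which is one way two pipelines can report different numbers for the same answers.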
Load-bearing premise
That maintaining and updating a collection of independent categorical distributions via closed-form Bayesian updates on surprise is sufficient to capture the memory requirements of long multi-turn conversations without additional mechanisms or external storage.
What would settle it
A controlled test set in which performance collapses once questions require tracking statistical dependencies between different entity attributes that the independent dimensions cannot represent.
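A small example shows why such a test could be decisive. Two joint distributions can share identical marginals while disagreeing entirely about how the attributes co-vary, so a model that stores only per-attribute marginals cannot tell them apart. The attribute values below (cities and employers) are invented for illustration.

```python
from itertools import product


def marginals(joint):
    """Return the two marginal distributions of a joint over (a, b) pairs."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return pa, pb


def conditional(joint, a, b):
    """P(second = b | first = a) under the given joint."""
    p_a = sum(p for (x, _), p in joint.items() if x == a)
    return joint.get((a, b), 0.0) / p_a


# Hypothetical data: city and employer always change together.
correlated = {("Paris", "Acme"): 0.5, ("Oslo", "Initech"): 0.5}
pa, pb = marginals(correlated)

# What independent dimensions can represent: the product of the marginals.
independent = {(a, b): pa[a] * pb[b] for a, b in product(pa, pb)}

print(conditional(correlated, "Paris", "Acme"))   # dependency preserved
print(conditional(independent, "Paris", "Acme"))  # dependency lost
```

A question such as "where did she work while living in Paris?" has a certain answer under the first joint and a coin-flip answer under the second, even though both yield the same per-attribute beliefs.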
Original abstract
We present Nous, a novel agent memory architecture grounded in the principle that knowledge is prediction, not storage. Rather than persisting facts as database records, vector embeddings, or knowledge-graph triples, Nous maintains a predictive world model: a collection of categorical probability distributions, called dimensions, one per entity-attribute pair observed in conversation. Each incoming observation is scored by its information-theoretic surprise S = -log2 P(obs | D), and the distribution is updated via a closed-form Bayesian posterior. The primary stored artifact is the delta, a record of the shift from prior to posterior belief, rather than the fact itself. Forgetting emerges naturally as entropy decay toward the uniform distribution, and identity resolution is handled through mutual information between entity dimension sets. Evaluated on the LoCoMo long-term conversational memory benchmark across ten conversations (1,540 questions) using GPT-4o-mini as backbone, Nous achieves F1 of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain). Against A-MEM's self-reported GPT-4o-mini numbers, Nous shows substantial gains in three of four categories, though we note that independent citations of A-MEM's results disagree with each other on category assignment, a reproducibility issue we discuss openly rather than resolve unilaterally. We additionally compare against BeliefMem, a concurrently developed system built on the same core premise of belief-based rather than deterministic memory; on the same benchmark and backbone, Nous's self-reported numbers exceed BeliefMem's self-reported numbers on all four categories, though we flag several uncontrolled differences between the two evaluation pipelines that prevent this from being a fully controlled comparison. Nous requires no external vector database or graph engine.
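The abstract does not say how mutual information is computed for identity resolution. As a generic illustration only, the sketch below computes mutual information in bits from a joint distribution over paired values. It assumes such a joint can be built by pairing the values that two candidate entity mentions take on a shared attribute; that construction, and any decision threshold, are hypothetical.

```python
import math


def mutual_information_bits(joint: dict[tuple[str, str], float]) -> float:
    """I(X; Y) in bits for a joint distribution over (x, y) pairs."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Hypothetical pairing: values seen for mention A versus mention B.
aligned = {("Paris", "Paris"): 0.5, ("Oslo", "Oslo"): 0.5}
unrelated = {(a, b): 0.25 for a in ("Paris", "Oslo") for b in ("Paris", "Oslo")}

print(mutual_information_bits(aligned))    # mentions move together
print(mutual_information_bits(unrelated))  # mentions carry no shared signal
```

High mutual information would suggest the two mentions refer to the same entity, and near-zero mutual information would suggest they do not.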
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Nous, an agent memory architecture that maintains a predictive world model consisting of independent categorical probability distributions (one per observed entity-attribute pair, termed 'dimensions'). Observations trigger information-theoretic surprise scoring S = -log2 P(obs | D) followed by closed-form Bayesian updates, with the delta (belief shift) as the primary stored artifact; forgetting occurs via entropy decay to uniform and identity resolution via mutual information across dimension sets. No external vector DB or graph is required. On the LoCoMo benchmark (10 conversations, 1,540 questions) with GPT-4o-mini, it reports F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), claiming gains over A-MEM in three categories and over BeliefMem in all four, while openly noting baseline inconsistencies and uncontrolled evaluation differences.
Significance. If the independence assumption and empirical results prove robust under controlled re-evaluation, the work could offer a lightweight, storage-efficient alternative to embedding or graph-based memory for long-horizon agents, grounded in predictive coding. The explicit discussion of reproducibility issues in baselines is a positive contribution to the literature.
major comments (3)
- [Abstract] Abstract: The central empirical claim of substantial gains rests on self-reported F1 numbers whose baselines are flagged by the authors themselves as inconsistent across citations and subject to uncontrolled pipeline differences; this makes the comparative results difficult to interpret without an independent, controlled replication.
- [Abstract] Architecture description (throughout): The load-bearing claim that a collection of independent per-(entity,attribute) categorical distributions suffices for multi-hop, temporal, and open-domain reasoning is not accompanied by any analysis, ablation, or derivation showing how attribute correlations or higher-order context can be recovered from the marginals; if the independence assumption fails, the reported advantages on precisely those categories would not follow.
- [Abstract] Abstract and evaluation: No pseudocode, closed-form derivation, or error analysis is supplied for the surprise-driven Bayesian update, delta storage, entropy-decay forgetting, or mutual-information identity resolution, leaving the implementation details unverifiable from the manuscript alone.
minor comments (1)
- [Abstract] The manuscript would benefit from an explicit table or section contrasting the exact evaluation protocol used for Nous versus the cited A-MEM and BeliefMem numbers to clarify the uncontrolled differences mentioned.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and the recommendation for major revision. We address each major comment point-by-point below, with planned changes to the manuscript where appropriate. All responses focus on the substance of the comments.
- Referee: [Abstract] Abstract: The central empirical claim of substantial gains rests on self-reported F1 numbers whose baselines are flagged by the authors themselves as inconsistent across citations and subject to uncontrolled pipeline differences; this makes the comparative results difficult to interpret without an independent, controlled replication.
  Authors: We agree that the self-reported nature of the comparisons, combined with the noted inconsistencies in baseline citations and uncontrolled pipeline differences, limits the strength of the empirical claims. The manuscript already flags these issues explicitly in the abstract and evaluation sections. In revision, we will further strengthen the caveats in the abstract (e.g., by qualifying the gains as self-reported and subject to evaluation variations) and expand the discussion section with additional analysis of reproducibility challenges in long-term memory benchmarks. We cannot perform an independent controlled replication ourselves but will highlight this as an important direction for future work.
  Revision: yes
- Referee: [Abstract] Architecture description (throughout): The load-bearing claim that a collection of independent per-(entity,attribute) categorical distributions suffices for multi-hop, temporal, and open-domain reasoning is not accompanied by any analysis, ablation, or derivation showing how attribute correlations or higher-order context can be recovered from the marginals; if the independence assumption fails, the reported advantages on precisely those categories would not follow.
  Authors: The architecture deliberately adopts the independence assumption to enable tractable closed-form Bayesian updates and storage-efficient delta recording without external databases or graphs. We do not provide an explicit derivation or ablation for recovering correlations because the method operates solely on marginal distributions per dimension; higher-order effects are handled implicitly through cross-dimension queries and mutual-information identity resolution. We will add a dedicated limitations subsection discussing the independence assumption, its computational benefits, and potential failure cases for strongly correlated attributes. A full theoretical analysis of correlation recovery is not part of the current contribution.
  Revision: partial
- Referee: [Abstract] Abstract and evaluation: No pseudocode, closed-form derivation, or error analysis is supplied for the surprise-driven Bayesian update, delta storage, entropy-decay forgetting, or mutual-information identity resolution, leaving the implementation details unverifiable from the manuscript alone.
  Authors: We agree that the absence of these details reduces verifiability. In the revised manuscript, we will add pseudocode for the full pipeline (surprise scoring, Bayesian posterior update, delta storage, entropy-decay forgetting, and mutual-information identity resolution). We will also include the closed-form derivation of the Bayesian update and a brief error analysis section covering approximation assumptions and numerical stability.
  Revision: yes
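For orientation, one standard closed form consistent with the description is the Dirichlet-categorical update. The manuscript does not confirm that Nous uses this form, so the derivation below is a sketch under that assumption. The symbols $\alpha_k$ (pseudo-count for value $k$), $\alpha_0$ (their sum), $j$ (the observed value), and $\Delta_k$ (the stored belief shift) are introduced here for illustration.

```latex
% Assumption: Dirichlet prior over the K values of one dimension D.
\begin{align*}
\theta &\sim \mathrm{Dir}(\alpha_1,\dots,\alpha_K),
  \qquad \alpha_0 = \sum_{k=1}^{K} \alpha_k \\
P(\mathrm{obs}=j \mid D) &= \frac{\alpha_j}{\alpha_0},
  \qquad S = -\log_2 \frac{\alpha_j}{\alpha_0} \\
\alpha_k' &= \alpha_k + \mathbb{1}[k=j] \\
\Delta_k &= \frac{\alpha_k'}{\alpha_0 + 1} - \frac{\alpha_k}{\alpha_0}
  = \frac{\mathbb{1}[k=j] - \alpha_k/\alpha_0}{\alpha_0 + 1}
\end{align*}
```

For example, with $K = 4$ and a uniform prior $\alpha_k = 1$, the first observation has $S = 2$ bits, the observed value gains $\Delta_j = 0.15$, and each other value loses $0.05$. The size of the shift shrinks as $\alpha_0$ grows, so well-established beliefs move less per observation.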
Circularity Check
No significant circularity; empirical benchmark results independent of architecture definition
The paper defines an architecture of per-(entity,attribute) categorical distributions updated via closed-form Bayesian posteriors driven by surprise S = -log2 P(obs | D), with forgetting as entropy decay and identity resolution via mutual information. It then reports F1 scores on the external LoCoMo benchmark (1,540 questions across 10 conversations) using GPT-4o-mini. These scores are measured outcomes of executing the system on held-out questions, not quantities that reduce to the model definition by construction. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and the central claim (sufficiency of independent marginals for the benchmark tasks) is presented as an empirical hypothesis rather than a definitional tautology. The derivation chain is therefore self-contained against the external benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Knowledge is prediction, not storage.
invented entities (1)
- dimensions: no independent evidence
Reference graph
Works this paper leans on
- [1] Thomas Bayes. An essay towards solving a problem in the doctrine of chances, 1763.
- [2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.
- [4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
- [5] Karl Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
- [6] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
- [7] David C. Knill and Alexandre Pouget. The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12):712–719, 2004.
- [8] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS), 33, 2020.
- [9] Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, and Xiuying Chen. Belief memory: Agent memory under partial observability. arXiv preprint arXiv:2605.05583, 2026. MBZUAI, RIKEN AIP, UT Austin, Wuhan University.
- [10] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
- [11] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024.
- [12] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.
- [13] Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
- [14] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956, 2025.
- [15] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
- [16] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813. Accepted at ICLR 2025.
- [18] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.
- [19] Joshua C. Yang, Damian Dailisan, and Maurice Flechtner. Belief engine: Bayesian memory for configurable opinion dynamics in LLM agents. In ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgents), 2026. Distinct from the similarly-titled arXiv:2605.15343 by an overlapping author set, which addresses multi-agent deliberation rather than agen...