Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

Nikola Milosevic

arxiv: 2605.17625 · v1 · pith:NOABAYWGnew · submitted 2026-05-17 · 💻 cs.AI

Pith reviewed 2026-05-20 12:25 UTC · model grok-4.3

classification 💻 cs.AI

keywords: memory architecture · LLM agents · scientific workflows · context management · episodic memory · semantic memory · long-horizon tasks

The pith

A dual-process memory architecture allows AI agents to sustain long scientific workflows beyond standard context limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates a system that keeps a small fixed window of recent interactions while consolidating older information into a growing semantic memory store tailored for scientific content. This addresses the problem of context saturation in extended scientific collaborations involving data analysis and hypothesis refinement. If the approach works, it would let models operate over tens of thousands of messages with stable performance and lower computational overhead compared to loading everything into context at once.

Core claim

Through tests on 15,000 messages across multiple LLMs, the Dual Process Memory Architecture maintains 70-85% accuracy with 1-2 second latency and 62% fewer tokens than full-context models, successfully managing over 14,000 scientific facts while full-context approaches fail around 10,000 messages due to overflow.

What carries the argument

The argument rests on the Dual Process Memory Architecture, which pairs a constant 10-message episodic window for immediate needs with a domain-specific semantic consolidation process whose store grows at roughly 3 tokens per message.
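The split between a fixed episodic window and a growing semantic store can be illustrated with a minimal sketch. The class name `DualProcessMemory`, the `key = value` consolidation rule, and the `build_context` prompt layout are all hypothetical choices made for illustration; the paper's actual consolidation procedure is not specified in the material reviewed here.

```python
from collections import deque


class DualProcessMemory:
    """Illustrative sketch only; not the paper's implementation."""

    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.episodic = deque(maxlen=window_size)  # recent messages, verbatim
        self.semantic = {}  # consolidated facts: key -> latest value

    def add_message(self, message: str) -> None:
        # Consolidate the oldest message just before the deque evicts it.
        if len(self.episodic) == self.window_size:
            self._consolidate(self.episodic[0])
        self.episodic.append(message)

    def _consolidate(self, message: str) -> None:
        # Placeholder rule (assumption): keep only "key = value" statements.
        if "=" in message:
            key, value = message.split("=", 1)
            self.semantic[key.strip()] = value.strip()

    def build_context(self) -> str:
        # Prompt layout is an assumption: facts first, then the recent window.
        facts = "\n".join(f"{k} = {v}" for k, v in self.semantic.items())
        recent = "\n".join(self.episodic)
        return f"Known facts:\n{facts}\n\nRecent messages:\n{recent}"


memory = DualProcessMemory()
for i in range(25):
    memory.add_message(f"param_{i} = {i}")
```

After 25 messages the window still holds only the last 10, while the 15 evicted messages have been reduced to entries in the semantic store. A real system would replace `_consolidate` with an LLM-driven or rule-based extractor.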

If this is right

Full-context models overflow and fail at 10,000 messages, but the dual architecture continues to operate at 70-85% accuracy.
Numeric and temporal queries reach 65-90% accuracy, while historical retrieval is better suited to RAG methods.
The primary limit on scaling is the quality of the consolidation process rather than raw context size.
Profiles with 14,000+ facts (125k tokens) can be handled without performance collapse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This architecture could support AI collaborators that track evolving experiments over weeks or months without losing track of prior parameters or results.
Hybrid systems combining dual-process memory with retrieval-augmented generation might optimize for different query types in scientific work.
Further tests on real laboratory data streams would reveal how well the linear growth assumption holds in practice.
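The hybrid idea in the second point above could be sketched as a simple query router. Everything here is an assumption for illustration: the function name `route_query`, the backend labels, and especially the keyword heuristic, which stands in for whatever query classifier a real system would use.

```python
import re

# Hypothetical cue words for questions about earlier states of the project.
HISTORICAL_CUES = re.compile(
    r"\b(originally|initially|previously|earlier|at the start)\b",
    re.IGNORECASE,
)


def route_query(query: str) -> str:
    """Return the name of the memory backend that should answer the query."""
    if HISTORICAL_CUES.search(query):
        return "rag"  # historical retrieval: reported as RAG's strength
    return "dual_process"  # numeric/temporal and everything else
```

For example, `route_query("What buffer did we originally use?")` returns `"rag"`, while `route_query("What is the current annealing temperature?")` returns `"dual_process"`.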

Load-bearing premise

Domain-specific consolidation reliably manages contradictory parameter changes, multi-hop experimental reasoning, and precise fact retention as the semantic memory grows linearly.

What would settle it

Running the architecture on a realistic multi-phase scientific workflow with deliberately introduced parameter contradictions and checking whether accuracy falls below 70% after accumulating 15,000 messages.

Figures

Figures reproduced from arXiv: 2605.17625 by Nikola Milosevic.

Figure 2: Profile token growth across conversation scales.

Original abstract

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a practical dual-process memory split for scientific LLM agents that beats full-context on token use and run length, but the results are too aggregated to confirm it handles the hardest cases well.

The main thing to know is that the paper splits memory into a small fixed episodic window and a growing semantic store with domain-specific consolidation rules, letting agents handle 15,000 messages and 14,000 facts while staying under context limits and keeping 70-85% accuracy across six models. Full-context baselines fail earlier, at around 10,000 messages, and the cross-model tests show architecture-level patterns: Dual Process does better on numeric and temporal queries, while RAG does better on historical retrieval. The sim-to-real observation on linear memory growth is also a clear practical note. That scale of testing and the concrete engineering pattern are the real contributions here.

The soft spot is the lack of disaggregated results. The abstract and stress-test note both leave out breakdowns by query difficulty or by semantic store size, so it is hard to tell whether accuracy holds on contradictory parameters or multi-hop experimental reasoning as the store scales. The consolidation algorithm itself is not spelled out enough for easy reproduction either.

This is aimed at people building long-horizon agents for science or technical workflows. Readers who need working memory patterns beyond basic RAG will get usable ideas from the comparisons. It deserves a serious referee because the evaluation is large enough and the problem is real, even if the central claims need tighter evidence on the tough cases. I would send it for review and ask specifically for those breakdowns.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Dual Process Memory Architecture for LLM-based scientific agents that decouples a fixed episodic memory window (10 messages) from a domain-specific consolidated semantic store growing at ~3 tokens/message. It reports results from 15,000 messages and 1,440 queries across six LLMs in three families, claiming that the system sustains 70-85% accuracy and 1-2s latency with 62% fewer tokens than full-context baselines (which fail at ~10k messages) while managing profiles containing 14,000+ scientific facts.

Significance. If the consolidation mechanism reliably preserves accuracy on the targeted hard cases, the work would be a meaningful step toward persistent scientific agents by addressing context-window saturation. The scale of the evaluation (15k messages, cross-model validation on six LLMs) and explicit identification of a Sim-to-Real gap in memory growth are clear strengths that provide architecture-level evidence independent of any single model family.

major comments (2)

§4 (Evaluation results on the 1,440-query suite): the reported 70-85% aggregate accuracy is not disaggregated by query difficulty (contradictory parameter evolution, multi-hop reasoning across experimental phases) or by semantic-store size as it scales to 14k facts. Because the central claim is that domain-specific consolidation handles these cases without substantial accuracy loss, the absence of this breakdown leaves the primary performance assertions unverifiable.
§3.2 (Consolidation algorithm description): the exact consolidation procedure, growth-rate parameterization, and handling of contradictory facts are described at a high level but lack pseudocode or decision rules sufficient for reproduction. This directly affects assessment of whether the 3 tokens/message growth and accuracy figures are robust or sensitive to implementation choices.
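The breakdown requested in the first major comment amounts to grouping per-query outcomes by category and by store size. The sketch below shows one way to compute it; the record layout `(category, n_facts, correct)`, the category names, and the bin edges are hypothetical, and the sample data is made up purely to exercise the function.

```python
from collections import defaultdict


def size_bin(n_facts: int) -> str:
    # Bin edges are illustrative assumptions, not taken from the paper.
    if n_facts < 5_000:
        return "<5k"
    if n_facts < 10_000:
        return "5k-10k"
    return ">=10k"


def disaggregate(results):
    """Map (category, size bin) -> accuracy.

    results: iterable of (category, n_facts, correct) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [number correct, total]
    for category, n_facts, correct in results:
        cell = counts[(category, size_bin(n_facts))]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Made-up records, only to show the output shape.
sample = [
    ("contradiction", 1_200, True),
    ("contradiction", 1_300, False),
    ("multi_hop", 12_000, True),
]
table = disaggregate(sample)
```

A table of this shape would let a reader see directly whether accuracy on the hard categories degrades as the store approaches 14k facts.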

minor comments (2)

Abstract: error bars or confidence intervals are not reported for the 70-85% accuracy range or the 65-90% numeric/temporal sub-results, which would strengthen the cross-model claims.
Table 1 or equivalent token-usage summary: the comparison of 45,434 tokens versus the 120,000+ limit should include per-model breakdowns and the precise definition of the 62% reduction to avoid ambiguity.
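One reading of the 62% figure, offered here as an assumption rather than the paper's stated definition, is the relative saving of 45,434 tokens against a baseline of exactly 120,000 tokens. The helper name `token_reduction` is hypothetical.

```python
def token_reduction(system_tokens: int, baseline_tokens: int) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1 - system_tokens / baseline_tokens


# Assumes the baseline is exactly 120,000 tokens; the abstract says "120,000+".
reduction = token_reduction(45_434, 120_000)
print(f"{reduction:.1%}")  # prints 62.1%
```

Because the limit is given as "120,000+", any larger baseline would yield a larger reduction, which is why the referee's request for a precise definition matters.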

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments identify areas where additional detail would improve verifiability and reproducibility. We address each point below and commit to revisions that directly respond to the concerns while preserving the manuscript's core claims and evaluation scope.

Referee: §4 (Evaluation results on the 1,440-query suite): the reported 70-85% aggregate accuracy is not disaggregated by query difficulty (contradictory parameter evolution, multi-hop reasoning across experimental phases) or by semantic-store size as it scales to 14k facts. Because the central claim is that domain-specific consolidation handles these cases without substantial accuracy loss, the absence of this breakdown leaves the primary performance assertions unverifiable.

Authors: We agree that disaggregation would make the performance claims more transparent. In the revised manuscript we will add a new table (or expanded figure) that reports accuracy broken down by query category (specifically isolating contradictory parameter evolution and multi-hop reasoning across phases) and by semantic-store size bins (e.g., 0-5k, 5-10k, and >10k facts). The 1,440-query suite already contains a representative distribution of these hard cases; the additional breakdown will allow direct verification that accuracy remains within the reported 70-85% band as the store grows to 14k facts. We do not expect the aggregate numbers to change, only their presentation. Revision: yes.

Referee: §3.2 (Consolidation algorithm description): the exact consolidation procedure, growth-rate parameterization, and handling of contradictory facts are described at a high level but lack pseudocode or decision rules sufficient for reproduction. This directly affects assessment of whether the 3 tokens/message growth and accuracy figures are robust or sensitive to implementation choices.

Authors: We accept that the current description in §3.2 is insufficient for full reproduction. The revised manuscript will include (i) explicit pseudocode for the consolidation routine, (ii) the precise growth-rate parameterization (empirically observed at ~3 tokens per message under our scientific workflow), and (iii) the decision rules used for contradictory facts (most recent experimental observation takes precedence; older values are archived rather than overwritten). These additions will clarify that the reported token growth and accuracy figures are tied to this concrete implementation and will enable readers to assess sensitivity to alternative choices. Revision: yes.
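The decision rule stated in the second response (the most recent observation takes precedence, older values are archived rather than overwritten) can be sketched as follows. The `FactStore` class, its method names, and the use of a message index as the measure of recency are assumptions made for this illustration; the promised pseudocode is not part of the material reviewed here.

```python
class FactStore:
    """Hypothetical store applying a 'latest wins, older archived' rule."""

    def __init__(self):
        self.current = {}  # key -> (value, message_index)
        self.archive = {}  # key -> list of superseded (value, message_index)

    def observe(self, key, value, message_index):
        incoming = (value, message_index)
        existing = self.current.get(key)
        if existing is None:
            self.current[key] = incoming
        elif message_index >= existing[1]:
            # Newer observation wins; the old value is kept in the archive.
            self.archive.setdefault(key, []).append(existing)
            self.current[key] = incoming
        else:
            # Late-arriving older observation: archive it, keep current value.
            self.archive.setdefault(key, []).append(incoming)

    def history(self, key):
        # Superseded values in chronological order.
        return sorted(self.archive.get(key, []), key=lambda rec: rec[1])


store = FactStore()
store.observe("annealing_temp_C", 55, message_index=120)
store.observe("annealing_temp_C", 58, message_index=4300)
store.observe("annealing_temp_C", 52, message_index=40)  # arrives out of order
```

Here the current value stays at 58 (from message 4300), while 52 and 55 remain retrievable through `history`, which is the property a historical query would depend on.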

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical evaluation

The paper reports results from large-scale empirical testing of an implemented Dual Process Memory Architecture across 15,000 messages, 1,440 queries, and six LLMs. Reported metrics such as 70-85% accuracy, 62% token reduction, and linear growth of ~3 tokens/message are direct measurements from system runs rather than outputs of any equations, fitted parameters, or self-citations that reduce to the inputs by construction. No self-definitional steps, uniqueness theorems, or ansatzes are invoked to derive the central performance claims; the architecture's handling of contradictory parameters and multi-hop reasoning is assessed via explicit query suites instead of theoretical loops. This is the most common honest outcome for systems papers whose primary contribution is implementation and benchmarking.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central performance claims rest on the design choice of a 10-message episodic window and an observed 3 tokens/message consolidation rate, plus the domain assumption that scientific-specific rules can preserve accuracy over linear growth.

free parameters (2)

episodic window size: fixed at 10 messages to bound immediate context needs; chosen rather than derived.
consolidation growth rate: approximately 3 tokens per message, observed in realistic workflows; used to project long-term scaling.
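To make the scaling projection concrete, the sketch below applies the roughly 3 tokens/message rate as a strictly linear model. The function names are hypothetical, and the model ignores the episodic window, prompt overhead, and any variation in the rate, so the numbers are back-of-envelope estimates only.

```python
def projected_profile_tokens(n_messages: int, tokens_per_message: float = 3.0) -> float:
    """Semantic-store size under a purely linear growth assumption."""
    return n_messages * tokens_per_message


def messages_until_budget(token_budget: int, tokens_per_message: float = 3.0) -> int:
    """Messages that fit before the store alone reaches the token budget."""
    return int(token_budget // tokens_per_message)


tokens_at_15k = projected_profile_tokens(15_000)   # 45000.0
capacity_at_120k = messages_until_budget(120_000)  # 40000
```

Under these assumptions a 15,000-message run would produce a store of about 45,000 tokens, and a 120,000-token budget would be reached at about 40,000 messages. Whether the rate stays near 3 tokens/message at that scale is exactly what the sim-to-real discussion leaves open.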

axioms (1)

domain assumption: Domain-specific consolidation successfully addresses contradictory parameter evolution, multi-hop reasoning, and precise technical fact retention.
Invoked to justify why the architecture works for scientific agents unlike generic social memory systems.

invented entities (1)

Dual Process Memory Architecture (no independent evidence)
purpose: Decouples short-term episodic recall from long-term consolidated semantic knowledge for long-horizon scientific tasks
Core new construct introduced to solve context saturation; no independent falsifiable evidence supplied beyond the reported tests.

pith-pipeline@v0.9.0 · 5830 in / 1293 out tokens · 46595 ms · 2026-05-20T12:25:02.165786+00:00 · methodology


[1] [1]

Anderson

John R. Anderson. Act: A simple theory of com- plex cognition.American Psychologist, 51(4):355– 365, 1996

work page 1996

[2] [2]

An integrated theory of the mind.Psychological re- view, 111(4):1036, 2004

John R Anderson, Daniel Bothell, Michael D Byrne, Scott Douglass, Christian Lebiere, and Yonling Qin. An integrated theory of the mind.Psychological re- view, 111(4):1036, 2004

work page 2004

[3] [3]

Introducing claude 2 with 100k context windows

Anthropic. Introducing claude 2 with 100k context windows. Anthropic Blog, 2023

work page 2023

[4] [4]

Contextual retrieval.Anthropic Technical Blog, 2024

Anthropic. Contextual retrieval.Anthropic Technical Blog, 2024

work page 2024

[5] [5]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S´ ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Ka- mar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelli- gence: Early experiments with gpt-4.arXiv preprint arXiv:2303.12712, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

A survey on evaluation of large language models.ACM Transac- tions on Intelligent Systems and Technology, 2024

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models.ACM Transac- tions on Intelligent Systems and Technology, 2024

work page 2024

[7] [7]

Langchain: Building applications with llms through composability, 2023

Harrison Chase. Langchain: Building applications with llms through composability, 2023

work page 2023

[8] [8]

MemoryBank: Enhancing Large Language Models with Long-Term Memory

Wanjun Cheng, Lianghong Peng, Yufei Wang, and Baobao Wang. Memorybank: Enhancing large language models with long-term memory.arXiv preprint arXiv:2305.10250, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Adapting language models to com- press contexts

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to com- press contexts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 3829–3846. Association for Computational Linguistics, 2023

work page 2023

[10] [10]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher R´ e. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

work page 2022

[11] [11]

Bert: Pre-training of deep bidi- rectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidi- rectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, volume 1, pages 4171–4186. Association for Compu- tational ...

work page 2019

[12] [12]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neu- ral turing machines.arXiv preprint arXiv:1410.5401, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[13] [13]

Macmil- lan, 2011

Daniel Kahneman.Thinking, fast and slow. Macmil- lan, 2011

work page 2011

[14] [14]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas O˘ guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Conference on Empirical Methods in Natu- ral Language Processing, pages 6769–6781. Associa- tion for Computational Linguistics, 2020

work page 2020

[15] [15]

McClelland

Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelli- gent agents need? complementary learning sys- tems theory updated.Trends in Cognitive Sciences, 20(7):512–534, 2016

work page 2016

[16] [16]

The soar cognitive architecture.MIT press, 2012

John E Laird. The soar cognitive architecture.MIT press, 2012

work page 2012

[17] [17]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neu- ral Information Processing Systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨ uttler, Mike Lewis, Wen-tau Yih, Tim Rockt¨ aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neu- ral Information Processing Systems, 33:9459–9474, 2020

work page 2020

[18] [18]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language mod- els use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024

work page 2024

[20] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.

[21] Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023.

[22] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Wooders, and Ion Stoica. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[23] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, pages 1–22. Association for Computing Machinery, 2023.

[24] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Associa...

[25] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[26] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3980–3990. Association for Computational Linguistics, 2019.

[27] Blake A. Richards and Paul W. Frankland. The persistence and transience of memory. Neuron, 94(6):1071–1084, 2017.

[28] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

[29] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. Advances in Neural Information Processing Systems, 28, 2015.

[30] Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427, 2023.

[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[32] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[34] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2016.

[35] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 2015.

[36] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2022.

[37] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023. Datasets and Benchmarks Track.

[38] Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. RecurrentGPT: Interactive generation of (arbitrarily) long text. arXiv preprint arXiv:2305.13304, 2023.