
arxiv: 2605.20833 · v1 · pith:ICMJAEEWnew · submitted 2026-05-20 · 💻 cs.CL

MemGym: a Long-Horizon Memory Environment for LLM Agents

Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · memory benchmarks · long-horizon tasks · agentic memory · memory evaluation · tool-use agents · coding agents

The pith

MemGym isolates memory performance in LLM agents by decoupling it from reasoning, retrieval, and tool-use abilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MemGym, a benchmark that evaluates memory in long-horizon LLM agent tasks across realistic scenarios: coding, deep-research search, tool-use dialogue, and computer use. It unifies existing agent environments under a single interface and adds synthetic pipelines for the coding and deep-research tracks that are length-controllable and ablation-verified at every stage. The core innovation is memory-isolated scoring, which ranks memory strategies without interference from reasoning, retrieval, or tool-use ability. The design aims to produce memory systems that transfer to practical agent execution better than those tuned on prior chat-focused benchmarks. The paper also introduces a lightweight reward model, MemRM, so that coding environments can be evaluated at scale without full Docker rollouts.

Core claim

MemGym unifies agent gyms and memory-grounded pipelines behind one memory-reasoning interface across five tracks in four regimes, and reports memory-isolated scores that separate memory performance from reasoning, retrieval, and tool-use so that memory strategies can be ranked without those confounders.
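The paper's architecture figure describes a memory module that wraps the prompt to the policy LLM, so the same strategy runs on any environment unchanged. A minimal sketch of what such a shared contract could look like follows. The names `MemoryStrategy`, `observe`, `render`, `LastKMemory`, `run_episode`, and `echo_policy` are all hypothetical and are not taken from the MemGym code; the policy is a stand-in function, not an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class MemoryStrategy(Protocol):
    """Hypothetical per-step contract: record a step, then render context."""

    def observe(self, step: str) -> None: ...

    def render(self) -> str: ...


@dataclass
class LastKMemory:
    """Toy strategy (an assumption for illustration): keep the k latest steps."""

    k: int = 2
    steps: List[str] = field(default_factory=list)

    def observe(self, step: str) -> None:
        self.steps.append(step)

    def render(self) -> str:
        return "\n".join(self.steps[-self.k:])


def run_episode(
    observations: List[str],
    memory: MemoryStrategy,
    policy: Callable[[str], str],
) -> List[str]:
    """Wrap every policy prompt with the memory module's rendered context."""
    actions = []
    for obs in observations:
        prompt = f"[memory]\n{memory.render()}\n[observation]\n{obs}"
        action = policy(prompt)
        memory.observe(f"{obs} -> {action}")
        actions.append(action)
    return actions


def echo_policy(prompt: str) -> str:
    # Stand-in for the policy LLM: reports how many memory lines it was shown.
    memory_block = prompt.split("[observation]")[0]
    lines = [ln for ln in memory_block.splitlines()[1:] if ln]
    return f"saw {len(lines)} memory lines"


actions = run_episode(["o1", "o2", "o3", "o4"], LastKMemory(k=2), echo_policy)
```

Because `run_episode` only touches `observe` and `render`, swapping `LastKMemory` for another strategy leaves the environment loop and the policy untouched, which is the property a memory-isolated comparison relies on.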

What carries the argument

Memory-isolated scoring that decouples memory performance from reasoning, retrieval, and tool-use ability across unified agent regimes.

If this is right

  • Memory strategies can be compared and improved independently of other agent capabilities.
  • Synthetic data generation enables controllable and verifiable evaluation pipelines for coding and research tasks.
  • Lightweight reward models can substitute for expensive full-environment rollouts in coding benchmarks.
  • Unified interfaces across regimes support consistent memory evaluation in tool-use, search, coding, and computer-use settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The isolated scoring approach could highlight memory architectures that generalize across agent domains.
  • Length-controllable pipelines might allow systematic study of how memory demands scale with task horizon.
  • Adoption could shift development focus toward memory mechanisms that survive without perfect reasoning or retrieval.

Load-bearing premise

The synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios.

What would settle it

If rankings of memory strategies produced by memory-isolated scores on MemGym fail to predict which strategies yield higher task success rates when tested in full agent rollouts, the decoupling claim would not hold.
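One way to make this test concrete is to measure rank agreement between the two score tables. The sketch below computes Kendall's tau (the tau-a variant, which ignores ties) in plain Python. The strategy names and all numbers are placeholders invented for illustration; they are not results from the paper.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Kendall tau-a over the strategies that appear in both score tables."""
    names = sorted(set(a) & set(b))
    if len(names) < 2:
        raise ValueError("need at least two shared strategies")
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / pairs


# Placeholder scores for four hypothetical strategies (not from the paper).
isolated = {"full_context": 0.90, "summary": 0.70, "retrieval": 0.60, "truncate": 0.20}
rollout = {"full_context": 0.55, "summary": 0.40, "retrieval": 0.45, "truncate": 0.10}

# Only the summary/retrieval pair is ordered differently: 5 concordant, 1 discordant.
tau = kendall_tau(isolated, rollout)
```

A tau near 1 would support the decoupling claim for the strategies tested; a tau near 0 or below would indicate that the isolated ranking does not carry over to full rollouts.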

Figures

Figures reproduced from arXiv: 2605.20833 by Dimitris N. Metaxas, Han Zhang, Kai Mei, Kaiqu Liang, Mingyu Jin, Sambit Sahu, Shi-Xiong Zhang, Wenyue Hua, Wujiang Xu, Yu Wang, Zhenting Wang.

Figure 1: MEMGYM unifies five evaluation tracks across four agentic regimes (tool-use dialogue, multi-turn search, coding, computer use) behind a shared interface that separates memory from reasoning and supports memory-isolated scoring with explicit memory rewards.
Figure 2: MEMGYM architecture. Five environments share a memory module that wraps the prompt to the policy LLM, so the same strategy runs on any environment unchanged. Trajectories feed a replay-augmentation pipeline producing SAFE/HARMFUL labels for MEMRM; MEMGYM-DR and MEMGYM-CODEQA come from in-house pipelines (Section 3.4).
Figure 3: Memory strategies on the two synthetic-memory benchmarks.
Figure 4: MEMRM training dynamics on SWE-Gym compression events (Qwen3-1.7B-Base + QLoRA, 32K context, 600 steps, 8×A100-40GB).
Original abstract

Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.
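The abstract describes MemRM as returning a scalar compression-quality score in place of a Docker rollout. The sketch below shows one plausible way such a scalar could be consumed: a threshold gate that falls back to the uncompressed context when the score is low. The gate, the threshold value, and every function name here are assumptions for illustration; `toy_score` is a hand-written stand-in and does not reproduce MemRM.

```python
from typing import Callable


def gated_compress(
    context: str,
    compress: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float,
) -> str:
    """Use the compressed context only when the scorer rates it >= threshold."""
    candidate = compress(context)
    return candidate if score(context, candidate) >= threshold else context


def toy_score(original: str, compressed: str) -> float:
    """Hypothetical stand-in for a learned scorer: fraction of KEY lines kept."""
    keys = [ln for ln in original.splitlines() if ln.startswith("KEY")]
    if not keys:
        return 1.0
    kept_lines = set(compressed.splitlines())
    return sum(ln in kept_lines for ln in keys) / len(keys)


def drop_noise(text: str) -> str:
    # Compression that keeps every KEY line and discards the rest.
    return "\n".join(ln for ln in text.splitlines() if ln.startswith("KEY"))


def head_only(text: str) -> str:
    # Lossy compression that keeps only the first line.
    return text.splitlines()[0]


context = "\n".join([
    "KEY failing test: test_parse_header",
    "noise: pip install log",
    "noise: directory listing",
    "KEY suspect function: parse()",
])

accepted = gated_compress(context, drop_noise, toy_score, threshold=0.75)
rejected = gated_compress(context, head_only, toy_score, threshold=0.75)
```

Here `drop_noise` keeps both KEY lines and passes the gate, while `head_only` keeps one of two and is rejected, so the original context is returned unchanged.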

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MemGym, a unified benchmark for long-horizon memory in LLM agents spanning tool-use dialogue (tau2-bench), multi-turn deep-research (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). It claims to deliver memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use confounders, enabled by length-controllable synthetic pipelines and, for coding tracks, a lightweight MemRM reward model (Qwen3-1.7B with QLoRA) that replaces full Docker rollouts with scalar compression-quality scores.

Significance. If the claimed decoupling holds and MemRM correlates with downstream agent success, MemGym would provide a valuable, standardized platform for ranking memory strategies in realistic agentic settings without the usual confounds. The length-controllable, ablation-verified synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR represent a concrete strength for controlled, reproducible evaluation.

major comments (1)
  1. [MemRM training and evaluation section] The memory-isolation claim for the coding tracks (SWE-Gym and MEMGYM-CODEQA) is load-bearing and rests on MemRM serving as a faithful proxy for full long-horizon execution outcomes. The manuscript must report explicit validation metrics (e.g., Pearson correlation or rank agreement) between MemRM scores and actual agent success metrics such as bug reproduction or task completion after many steps; if MemRM was trained primarily on surface-level compression features rather than downstream execution labels, the reported isolation for these regimes is incomplete.
minor comments (1)
  1. [Abstract] The abstract states that the synthetic pipelines are 'ablation-verified at every stage' without naming the ablations or pointing to the relevant results table or figure; a one-sentence example or cross-reference would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their insightful review of our paper on MemGym. We address the major comment regarding the validation of MemRM in detail below.

point-by-point responses
  1. Referee: [MemRM training and evaluation section] The memory-isolation claim for the coding tracks (SWE-Gym and MEMGYM-CODEQA) is load-bearing and rests on MemRM serving as a faithful proxy for full long-horizon execution outcomes. The manuscript must report explicit validation metrics (e.g., Pearson correlation or rank agreement) between MemRM scores and actual agent success metrics such as bug reproduction or task completion after many steps; if MemRM was trained primarily on surface-level compression features rather than downstream execution labels, the reported isolation for these regimes is incomplete.

    Authors: We concur that demonstrating the correlation between MemRM scores and actual agent success metrics is crucial for validating the memory-isolation claims in the coding tracks. The manuscript positions MemRM as a lightweight model trained to assess compression quality, thereby approximating the utility of the memory for long-horizon coding tasks without requiring resource-intensive Docker executions. Nevertheless, to more rigorously support this proxy, we will augment the revised manuscript with explicit validation results, including Pearson correlation and rank correlation metrics between MemRM predictions and ground-truth task completion outcomes from full rollouts on a held-out set. We will also provide details on the training labels to clarify that they are derived from compression quality assessments aligned with execution success. This revision will be incorporated as additional tables and text in the MemRM section.

    revision: yes
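The validation promised in this response can be sketched in a few lines. The code below computes Pearson correlation and, via average ranks, Spearman rank correlation between scalar scores and binary task outcomes. The scores and outcomes are invented placeholders, not data from the paper, and the function names are illustrative.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson applied to average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder held-out examples (not from the paper): scorer output per
# compression event, and whether the full rollout then solved the task.
scores = [0.9, 0.8, 0.3, 0.2]
solved = [1, 1, 0, 0]

r = pearson(scores, solved)
rho = spearman(scores, solved)
```

With binary outcomes, Pearson reduces to the point-biserial correlation, and ties in the outcome column are the reason `average_ranks` is used rather than a plain sort position.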

Circularity Check

0 steps flagged

No circularity: benchmark design and proxy model are independent methodological choices

full rationale

The paper introduces MemGym as a new benchmark unifying agent gyms and synthetic pipelines, with memory-isolated scores achieved through explicit track design that separates memory from reasoning/retrieval/tool-use. No equations, derivations, or self-referential definitions appear. MemRM is presented as a trained QLoRA proxy for tractable Docker-free evaluation on coding tracks, not as a fitted parameter renamed as a prediction or a self-definitional loop. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are described. The construction remains self-contained against external benchmarks and does not reduce any central claim to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the domain assumption that prior benchmarks overlook dynamic memory formation during extended execution; it introduces the new benchmark and reward model entities without external falsifiable evidence supplied in the abstract.

axioms (1)
  • domain assumption: Existing memory benchmarks overlook the dynamic memory formation that occurs during extended agent execution.
    Stated directly in the abstract as the core motivation for the new benchmark.
invented entities (2)
  • MemGym benchmark (no independent evidence)
    purpose: Unified memory-reasoning interface for long-horizon agent evaluation
    Newly presented in this work to combine existing gyms and pipelines.
  • MemRM reward model (no independent evidence)
    purpose: Lightweight scalar scorer for compression quality in place of full Docker rollouts
    Trained specifically for this benchmark using Qwen3-1.7B with QLoRA.

pith-pipeline@v0.9.0 · 5809 in / 1415 out tokens · 48645 ms · 2026-05-21T04:55:20.992108+00:00 · methodology


Evaluation infrastructure fixes

  • Mypy test syntax: The pytest -k flag cannot parse mypy's [case] test markers. We implemented custom log parsing for mypy test output.
  • Docker pull timeout: The default 60-second timeout was insufficient for 500MB–2GB images. Increased to 600 seconds.
  • Image cache thrashing: The cache_level="env" setting deleted pulled images after each instance. Auto-switched to cache_level="instance".
  • Pandas conda crash: Bash set -u with unset conda variables caused crashes. Wrapped conda activation with set +u / set -u.
  • Pydantic log parser: Pydantic's output format was incompatible with the default log parser. Switched to parse_log_pytest_v2.

These fixes added approximately 37 resolved instances for Sonnet 4.5 (129→166) and 15 for GPT-OSS-120B (77→92) on the 500-instance evaluation, underscoring the importance of correct evaluation infrastructure.