MemGym: a Long-Horizon Memory Environment for LLM Agents
Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3
The pith
MemGym isolates memory performance in LLM agents by decoupling it from reasoning, retrieval, and tool-use abilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemGym unifies agent gyms and memory-grounded pipelines behind one memory-reasoning interface across five tracks in four regimes, and reports memory-isolated scores that separate memory performance from reasoning, retrieval, and tool-use so that memory strategies can be ranked without those confounders.
What carries the argument
Memory-isolated scoring that decouples memory performance from reasoning, retrieval, and tool-use ability across unified agent regimes.
If this is right
- Memory strategies can be compared and improved independently of other agent capabilities.
- Synthetic data generation enables controllable and verifiable evaluation pipelines for coding and research tasks.
- Lightweight reward models can substitute for expensive full-environment rollouts in coding benchmarks.
- Unified interfaces across regimes support consistent memory evaluation in tool-use, search, coding, and computer-use settings.
Where Pith is reading between the lines
- The isolated scoring approach could highlight memory architectures that generalize across agent domains.
- Length-controllable pipelines might allow systematic study of how memory demands scale with task horizon.
- Adoption could shift development focus toward memory mechanisms that survive without perfect reasoning or retrieval.
Load-bearing premise
The synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios.
What would settle it
If rankings of memory strategies produced by memory-isolated scores on MemGym fail to predict which strategies yield higher task success rates when tested in full agent rollouts, the decoupling claim would not hold.
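One way to run that check is to measure rank agreement between the two orderings of strategies. The sketch below computes Kendall's tau-a in plain Python; the strategy names and every number are hypothetical placeholders for illustration, not results from the paper, and the function assumes no tied scores.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists (assumes no ties)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both metrics order the pair the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical numbers for illustration only; not taken from the paper.
isolated_scores = {"strategy_a": 0.71, "strategy_b": 0.64,
                   "strategy_c": 0.52, "strategy_d": 0.40}
rollout_success = {"strategy_a": 0.38, "strategy_b": 0.41,
                   "strategy_c": 0.29, "strategy_d": 0.22}

names = sorted(isolated_scores)
tau = kendall_tau([isolated_scores[n] for n in names],
                  [rollout_success[n] for n in names])
print(f"Kendall tau between isolated and rollout rankings: {tau:.3f}")
```

A tau near 1 would mean the memory-isolated ranking predicts the rollout ranking; a tau near 0 or below would count against the decoupling claim. What threshold is convincing is a judgment call the paper would need to justify.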
Original abstract
Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MemGym, a unified benchmark for long-horizon memory in LLM agents spanning tool-use dialogue (tau2-bench), multi-turn deep-research (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). It claims to deliver memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use confounders, enabled by length-controllable synthetic pipelines and, for coding tracks, a lightweight MemRM reward model (Qwen3-1.7B with QLoRA) that replaces full Docker rollouts with scalar compression-quality scores.
Significance. If the claimed decoupling holds and MemRM correlates with downstream agent success, MemGym would provide a valuable, standardized platform for ranking memory strategies in realistic agentic settings without the usual confounds. The length-controllable, ablation-verified synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR represent a concrete strength for controlled, reproducible evaluation.
major comments (1)
- [MemRM training and evaluation section] The memory-isolation claim for the coding tracks (SWE-Gym and MEMGYM-CODEQA) is load-bearing and rests on MemRM serving as a faithful proxy for full long-horizon execution outcomes. The manuscript must report explicit validation metrics (e.g., Pearson correlation or rank agreement) between MemRM scores and actual agent success metrics such as bug reproduction or task completion after many steps; if MemRM was trained primarily on surface-level compression features rather than downstream execution labels, the reported isolation for these regimes is incomplete.
minor comments (1)
- [Abstract] The abstract states that the synthetic pipelines are 'ablation-verified at every stage' without naming the ablations or pointing to the relevant results table or figure; a one-sentence example or cross-reference would improve clarity.
Simulated Author's Rebuttal
We are grateful to the referee for their insightful review of our paper on MemGym. We address the major comment regarding the validation of MemRM in detail below.
Point-by-point responses
- Referee: [MemRM training and evaluation section] The memory-isolation claim for the coding tracks (SWE-Gym and MEMGYM-CODEQA) is load-bearing and rests on MemRM serving as a faithful proxy for full long-horizon execution outcomes. The manuscript must report explicit validation metrics (e.g., Pearson correlation or rank agreement) between MemRM scores and actual agent success metrics such as bug reproduction or task completion after many steps; if MemRM was trained primarily on surface-level compression features rather than downstream execution labels, the reported isolation for these regimes is incomplete.
  Authors: We concur that demonstrating the correlation between MemRM scores and actual agent success metrics is crucial for validating the memory-isolation claims in the coding tracks. The manuscript positions MemRM as a lightweight model trained to assess compression quality, thereby approximating the utility of the memory for long-horizon coding tasks without requiring resource-intensive Docker executions. Nevertheless, to more rigorously support this proxy, we will augment the revised manuscript with explicit validation results, including Pearson correlation and rank correlation metrics between MemRM predictions and ground-truth task completion outcomes from full rollouts on a held-out set. We will also provide details on the training labels to clarify that they are derived from compression quality assessments aligned with execution success. This revision will be incorporated as additional tables and text in the MemRM section.
  Revision: yes
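The validation the referee asks for can be sketched at the instance level as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the score list, and the binary completion labels are all hypothetical, and a real study would use held-out rollouts and report confidence intervals.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data for illustration only; not taken from the paper.
memrm_scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.75]  # proxy score per instance
task_completed = [1, 1, 0, 1, 0, 0]                  # full-rollout outcome

r = pearson(memrm_scores, task_completed)
rho = spearman(memrm_scores, task_completed)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

With binary outcomes, Pearson's r is the point-biserial correlation, so the two figures answer slightly different questions: r is sensitive to score magnitudes, while rho only checks whether higher MemRM scores tend to accompany successful rollouts.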
Circularity Check
No circularity: benchmark design and proxy model are independent methodological choices
The paper introduces MemGym as a new benchmark unifying agent gyms and synthetic pipelines, with memory-isolated scores achieved through explicit track design that separates memory from reasoning/retrieval/tool-use. No equations, derivations, or self-referential definitions appear. MemRM is presented as a trained QLoRA proxy for tractable Docker-free evaluation on coding tracks, not as a fitted parameter renamed as a prediction or a self-definitional loop. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are described. The construction remains self-contained against external benchmarks and does not reduce any central claim to its own inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing memory benchmarks overlook the dynamic memory formation that occurs during extended agent execution.
invented entities (2)
- MemGym benchmark: no independent evidence
- MemRM reward model: no independent evidence
Evaluation infrastructure fixes (appendix excerpt)
- Mypy test syntax: The pytest -k flag cannot parse mypy's [case] test markers. We implemented custom log parsing for mypy test output.
- Docker pull timeout: The default 60-second timeout was insufficient for 500MB–2GB images. Increased to 600 seconds.
- Image cache thrashing: The cache_level="env" setting deleted pulled images after each instance. Auto-switched to cache_level="instance".
- Pandas conda crash: Bash set -u with unset conda variables caused crashes. Wrapped conda activation with set +u/set -u.
- Pydantic log parser: Pydantic's output format was incompatible with the default log parser. Switched to parse_log_pytest_v2.
These fixes added approximately 37 resolved instances for Sonnet 4.5 (129→166) and 15 for GPT-OSS-120B (77→92) on the 500-instance evaluation, underscoring the importance of correct evaluation infrastructure.
H MEMRM Training Details...