MemGym: a Long-Horizon Memory Environment for LLM Agents

Wujiang Xu; Yu Wang; Kai Mei; Kaiqu Liang; Zhenting Wang; Mingyu Jin; Han Zhang; Shi-Xiong Zhang; Wenyue Hua; Sambit Sahu; Dimitris N. Metaxas

arxiv: 2605.20833 · v1 · pith:ICMJAEEWnew · submitted 2026-05-20 · 💻 cs.CL

Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM agents, memory benchmarks, long-horizon tasks, agentic memory, memory evaluation, tool-use agents, coding agents

The pith

MemGym isolates memory performance in LLM agents by decoupling it from reasoning, retrieval, and tool-use abilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MemGym as a benchmark that evaluates memory in long-horizon LLM agent tasks across realistic scenarios such as coding, research search, tool-use dialogue, and computer use. It unifies existing agent environments under a single interface and generates synthetic pipelines for coding and deep-research tracks that control length and verify alignment at each stage. The core innovation is the use of memory-isolated scores, which rank memory strategies without interference from other agent skills. This design aims to produce memory systems that transfer more effectively to practical agent execution than those tested in prior chat-focused benchmarks. A lightweight reward model is also introduced to enable scalable evaluation on complex coding environments without full rollouts.

Core claim

MemGym unifies agent gyms and memory-grounded pipelines behind one memory-reasoning interface across five tracks in four regimes, and reports memory-isolated scores that separate memory performance from reasoning, retrieval, and tool-use so that memory strategies can be ranked without those confounders.
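
To make the idea of "one memory-reasoning interface" concrete, here is a minimal sketch of what a shared per-step contract could look like. Everything in it is an assumption made for illustration: the names `MemoryStrategy`, `observe`, `wrap_prompt`, `LastKMemory`, and `run_episode` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Protocol


class MemoryStrategy(Protocol):
    """Hypothetical per-step contract: observe a step, then wrap the prompt."""

    def observe(self, step_text: str) -> None: ...
    def wrap_prompt(self, task_prompt: str) -> str: ...


@dataclass
class LastKMemory:
    """Toy strategy (an assumption): keep only the k most recent steps verbatim."""

    k: int = 2
    steps: list[str] = field(default_factory=list)

    def observe(self, step_text: str) -> None:
        self.steps.append(step_text)
        self.steps = self.steps[-self.k:]

    def wrap_prompt(self, task_prompt: str) -> str:
        memory_block = "\n".join(self.steps)
        return f"[MEMORY]\n{memory_block}\n[TASK]\n{task_prompt}"


def run_episode(env_steps: list[str], task_prompt: str, memory: MemoryStrategy) -> str:
    """Feed every environment step to the memory, then build the policy prompt."""
    for step in env_steps:
        memory.observe(step)
    return memory.wrap_prompt(task_prompt)


# The same strategy class runs on two stand-in "environments" unchanged.
coding_steps = ["opened utils.py", "ran tests: 2 failed", "edited parse()"]
search_steps = ["query: QLoRA", "read result 1", "read result 2"]

coding_prompt = run_episode(coding_steps, "Fix the failing tests.", LastKMemory(k=2))
search_prompt = run_episode(search_steps, "Summarize the findings.", LastKMemory(k=2))
print(coding_prompt)
```

The point of the design is that the environment only supplies step text and a task prompt, so swapping the memory strategy never requires touching environment code.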

What carries the argument

Memory-isolated scoring that decouples memory performance from reasoning, retrieval, and tool-use ability across unified agent regimes.

If this is right

Memory strategies can be compared and improved independently of other agent capabilities.
Synthetic data generation enables controllable and verifiable evaluation pipelines for coding and research tasks.
Lightweight reward models can substitute for expensive full-environment rollouts in coding benchmarks.
Unified interfaces across regimes support consistent memory evaluation in tool-use, search, coding, and computer-use settings.
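
The third consequence in this list can be pictured as a gate: a cheap scorer returns a scalar for each proposed compression, and that scalar is read in place of a full environment run. The sketch below is illustrative only. `toy_score` is a stand-in token-overlap heuristic, not MemRM, and the `gate` function, the 0.5 threshold, and the mapping of scores to SAFE/HARMFUL are assumptions.

```python
from typing import Callable


def toy_score(original: str, compressed: str) -> float:
    """Stand-in scorer (NOT MemRM): share of distinct original tokens that survive."""
    original_tokens = set(original.lower().split())
    if not original_tokens:
        return 1.0
    kept = original_tokens & set(compressed.lower().split())
    return len(kept) / len(original_tokens)


def gate(original: str, compressed: str,
         scorer: Callable[[str, str], float], threshold: float = 0.5) -> str:
    """Label a compression SAFE if its score clears the threshold, else HARMFUL."""
    return "SAFE" if scorer(original, compressed) >= threshold else "HARMFUL"


# Hypothetical compression event from a coding trajectory.
history = "test_parse fails with KeyError on line 42 of utils.py"
good = "test_parse fails KeyError line 42 utils.py"
bad = "some tests fail"

print(gate(history, good, toy_score))
print(gate(history, bad, toy_score))
```

Because the scorer is passed in as a callable, a learned reward model could replace `toy_score` without changing the gate logic.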

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The isolated scoring approach could highlight memory architectures that generalize across agent domains.
Length-controllable pipelines might allow systematic study of how memory demands scale with task horizon.
Adoption could shift development focus toward memory mechanisms that survive without perfect reasoning or retrieval.

Load-bearing premise

The synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios.

What would settle it

If rankings of memory strategies produced by memory-isolated scores on MemGym fail to predict which strategies yield higher task success rates when tested in full agent rollouts, the decoupling claim would not hold.
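
One way to run this test is to compare the two rankings directly with a rank-agreement statistic. The sketch below computes Kendall's tau-a in plain Python. The strategy names and every number are made up for illustration and are not results from the paper.

```python
from itertools import combinations


def kendall_tau(x: dict[str, float], y: dict[str, float]) -> float:
    """Kendall tau-a over strategies present in both tables (no tie correction)."""
    names = sorted(set(x) & set(y))
    n_pairs = len(names) * (len(names) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two shared strategies")
    concordant = discordant = 0
    for a, b in combinations(names, 2):
        product = (x[a] - x[b]) * (y[a] - y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs


# Hypothetical strategies and scores, for illustration only.
isolated_score = {"full_context": 0.81, "summary": 0.64, "retrieval": 0.70, "truncate": 0.35}
rollout_success = {"full_context": 0.42, "summary": 0.31, "retrieval": 0.28, "truncate": 0.12}

tau = kendall_tau(isolated_score, rollout_success)
print(f"Kendall tau = {tau:.3f}")
```

A tau near 1 would mean the isolated scores order the strategies almost the same way full rollouts do. A tau near 0 or below would undercut the decoupling claim.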

Figures

Figures reproduced from arXiv: 2605.20833 by Dimitris N. Metaxas, Han Zhang, Kai Mei, Kaiqu Liang, Mingyu Jin, Sambit Sahu, Shi-Xiong Zhang, Wenyue Hua, Wujiang Xu, Yu Wang, Zhenting Wang.

**Figure 1.** MEMGYM unifies five evaluation tracks across four agentic regimes (tool-use dialogue, multi-turn search, coding, computer use) behind a shared interface that separates memory from reasoning and supports memory-isolated scoring with explicit memory rewards.
… pretraining. (iii) Evaluation cost: A single SWE-Gym rollout requires Docker infrastructure and tens of execution steps, placing systematic memory desig…

**Figure 2.** MEMGYM architecture. Five environments share a memory module that wraps the prompt to the policy LLM, so the same strategy runs on any environment unchanged. Trajectories feed a replay-augmentation pipeline producing SAFE/HARMFUL labels for MEMRM; MEMGYM-DR and MEMGYM-CODEQA come from in-house pipelines (Section 3.4).
… into the same per-step contract described in Section 3.2; the two construction pipelines …

**Figure 3.** Memory strategies on the two synthetic-memory benchmarks.

**Figure 4.** MEMRM training dynamics on SWE-Gym compression events (Qwen3-1.7B-Base + QLoRA, 32K context, 600 steps, 8×A100-40GB).
… release this held-out split as part of the MEMGYM artifacts so other groups can evaluate their own gates against the same examples. MEMRM also shows partial generalization beyond the training distribution along two axes. Within the coding domain, the gate transfers to memory strategies that…

Original abstract

Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MemGym unifies agent environments under one memory interface and adds isolated scoring plus a practical reward model for coding, but the decoupling claim hinges on how well MemRM tracks real long-horizon success.

The main thing to know is that MemGym pulls together existing agent gyms and new synthetic pipelines into a single setup for testing memory over long tasks, with scores meant to separate memory from reasoning, retrieval, and tool use. The five tracks cover tool-use dialogue, deep research, coding, and computer use, and the controllable length plus ablation checks on the custom pipelines are a clear practical step forward for creating repeatable tests. Using a lightweight fine-tuned Qwen model as MemRM to stand in for full Docker runs in the coding tracks is a sensible engineering choice that keeps evaluation feasible. That part of the work is useful and addresses a real bottleneck.

The soft spot is exactly where the stress-test note flags it: the isolation for coding and code QA rests on MemRM scoring compression quality. If the model was trained mostly on surface patterns rather than downstream agent outcomes after many steps, the reported memory rankings could still carry hidden confounders. The paper would need to show clear correlations or ablations linking MemRM scores to actual task success for the decoupling to land cleanly.

This is aimed at people building and comparing memory modules for agents in applied settings like coding or navigation. Readers who want structured ways to measure memory improvements without the usual noise would get concrete value from the track design and the proxy evaluation. It has enough structure and a real problem to solve that it deserves a serious referee to check the results and the reward model details.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MemGym, a unified benchmark for long-horizon memory in LLM agents spanning tool-use dialogue (tau2-bench), multi-turn deep-research (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). It claims to deliver memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use confounders, enabled by length-controllable synthetic pipelines and, for coding tracks, a lightweight MemRM reward model (Qwen3-1.7B with QLoRA) that replaces full Docker rollouts with scalar compression-quality scores.

Significance. If the claimed decoupling holds and MemRM correlates with downstream agent success, MemGym would provide a valuable, standardized platform for ranking memory strategies in realistic agentic settings without the usual confounds. The length-controllable, ablation-verified synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR represent a concrete strength for controlled, reproducible evaluation.

major comments (1)

[MemRM training and evaluation section] The memory-isolation claim for the coding tracks (SWE-Gym and MEMGYM-CODEQA) is load-bearing and rests on MemRM serving as a faithful proxy for full long-horizon execution outcomes. The manuscript must report explicit validation metrics (e.g., Pearson correlation or rank agreement) between MemRM scores and actual agent success metrics such as bug reproduction or task completion after many steps; if MemRM was trained primarily on surface-level compression features rather than downstream execution labels, the reported isolation for these regimes is incomplete.
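
The example-level check requested here could look like the following sketch, which correlates a proxy score per compression event with a binary rollout outcome. With binary outcomes, Pearson's r is the point-biserial correlation. The data are invented for illustration, and `proxy_scores` and `rollout_ok` are hypothetical names, not fields from the MemGym artifacts.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; with binary ys this equals the point-biserial r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Made-up held-out examples: proxy score per compression event, and
# whether the full rollout later succeeded (1) or failed (0).
proxy_scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]
rollout_ok = [1, 1, 0, 1, 0, 0]

r = pearson(proxy_scores, [float(v) for v in rollout_ok])
print(f"r = {r:.3f}")
```

Reporting this alongside a rank-level statistic would show whether the proxy tracks outcomes per example as well as per strategy.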

minor comments (1)

[Abstract] The abstract states that the synthetic pipelines are 'ablation-verified at every stage' without naming the ablations or pointing to the relevant results table or figure; a one-sentence example or cross-reference would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for their insightful review of our paper on MemGym. We address the major comment regarding the validation of MemRM in detail below.

Referee: [MemRM training and evaluation section] The memory-isolation claim for the coding tracks (SWE-Gym and MEMGYM-CODEQA) is load-bearing and rests on MemRM serving as a faithful proxy for full long-horizon execution outcomes. The manuscript must report explicit validation metrics (e.g., Pearson correlation or rank agreement) between MemRM scores and actual agent success metrics such as bug reproduction or task completion after many steps; if MemRM was trained primarily on surface-level compression features rather than downstream execution labels, the reported isolation for these regimes is incomplete.

Authors: We concur that demonstrating the correlation between MemRM scores and actual agent success metrics is crucial for validating the memory-isolation claims in the coding tracks. The manuscript positions MemRM as a lightweight model trained to assess compression quality, thereby approximating the utility of the memory for long-horizon coding tasks without requiring resource-intensive Docker executions. Nevertheless, to more rigorously support this proxy, we will augment the revised manuscript with explicit validation results, including Pearson correlation and rank correlation metrics between MemRM predictions and ground-truth task completion outcomes from full rollouts on a held-out set. We will also provide details on the training labels to clarify that they are derived from compression quality assessments aligned with execution success. This revision will be incorporated as additional tables and text in the MemRM section.

Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark design and proxy model are independent methodological choices

The paper introduces MemGym as a new benchmark unifying agent gyms and synthetic pipelines, with memory-isolated scores achieved through explicit track design that separates memory from reasoning/retrieval/tool-use. No equations, derivations, or self-referential definitions appear. MemRM is presented as a trained QLoRA proxy for tractable Docker-free evaluation on coding tracks, not as a fitted parameter renamed as a prediction or a self-definitional loop. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are described. The construction remains self-contained against external benchmarks and does not reduce any central claim to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the domain assumption that prior benchmarks overlook dynamic memory formation during extended execution; it introduces the new benchmark and reward model entities without external falsifiable evidence supplied in the abstract.

axioms (1)

Domain assumption: Existing memory benchmarks overlook the dynamic memory formation that occurs during extended agent execution.
Stated directly in the abstract as the core motivation for the new benchmark.

invented entities (2)

MemGym benchmark (no independent evidence)
purpose: Unified memory-reasoning interface for long-horizon agent evaluation
Newly presented in this work to combine existing gyms and pipelines.

MemRM reward model (no independent evidence)
purpose: Lightweight scalar scorer for compression quality in place of full Docker rollouts
Trained specifically for this benchmark using Qwen3-1.7B with QLoRA.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 18 internal anchors

[1]

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems.arXiv preprint arXiv:2510.17281, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Self-RAG: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[3]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

work page 2024
[4]

Longbench v2: Towards deeper understanding and reason- ing on realistic long-context multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. Longbench v2: Towards deeper understanding and reason- ing on realistic long-context multitasks. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, 2025

work page 2025
[5]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-Bench: Eval- uating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Amemgym: Interactive memory benchmarking for assistants in long-horizon conversations

Jiayang Cheng, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song, and Xunliang Cai. Amemgym: Interactive memory benchmarking for assistants in long-horizon conversations. arXiv preprint arXiv:2603.01966, 2026

work page arXiv 2026
[7]

Adapting language models to compress contexts

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, 2023

work page 2023
[8]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Larimar: Large language models with episodic memory control

Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurelie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Soham Dan, et al. Larimar: Large language models with episodic memory control. InInternational Conference on Machine Learning (ICML), 2024

work page 2024
[10]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

LightMem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory- augmented generation.arXiv preprint arXiv:2510.18866, 2025

work page arXiv 2025
[12]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, 2017

work page 2017
[13]

HippoRAG: Neurobiologically inspired long-term memory for large language models

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[14]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks.arXiv preprint arXiv:2602.16313, 2026

Zexue He, Yu Wang, Churan Zhi, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks.arXiv preprint arXiv:2602.16313, 2026

work page arXiv 2026
[16]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020. 10

work page 2020
[17]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Llmlingua: Compress- ing prompts for accelerated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compress- ing prompts for accelerated inference of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 13358–13376, 2023

work page 2023
[20]

Swe-bench: Can language models resolve real-world GitHub issues? 2023

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. Swe-bench: Can language models resolve real-world GitHub issues? 2023

work page 2023
[21]

Needle in a haystack – pressure testing LLMs

Greg Kamradt. Needle in a haystack – pressure testing LLMs. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack, 2023. GitHub repository

work page 2023
[22]

Graham, F.Q

Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, et al. The Semantic Scholar open data platform.arXiv preprint arXiv:2301.10140, 2023

work page arXiv 2023
[23]

Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270, 2025

Dong-Ho Lee et al. Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270, 2025

work page arXiv 2025
[24]

A human-inspired reading agent with gist memory of very long contexts.arXiv preprint arXiv:2402.09727, 2024

Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts.arXiv preprint arXiv:2402.09727, 2024

work page arXiv 2024
[25]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020
[26]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023
[27]

Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

work page arXiv 2026
[28]

Self- refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self- refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Sy...

work page 2023
[29]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

work page 2024
[30]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[31]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonza- lez. Memgpt: towards llms as operating systems. 2023

work page 2023
[32]

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with SWE-Gym.arXiv preprint arXiv:2412.21139, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input texts, yes! InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ...

work page 2022
[34]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023

work page 2023
[35]

OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts

Jason Priem, Heather Piwowar, and Richard Orr. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts.arXiv preprint arXiv:2205.01833, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Osgym: Super-scalable distributed data engine for generalizable computer agents.arXiv preprint arXiv:2511.11672, 2025

Zengyi Qin, Jinyuan Chen, Yunze Man, Shengcao Cao, Ziqi Pang, Zhuoyuan Wang, Xin Sun, Gen Lin, Han Fang, Ling Zhu, et al. Osgym: Super-scalable distributed data engine for generalizable computer agents.arXiv preprint arXiv:2511.11672, 2025

work page arXiv 2025
[37]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Now Publishers Inc, 2009

Stephen Robertson and Hugo Zaragoza.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

work page 2009
[39]

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. InInterna- tional Conference on Learning Representations (ICLR), 2024

work page 2024
[40]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint arXiv:2302.04761, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

work page 2023
[42]

Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[43]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Recursively summarizing enables long-term dialogue memory in large language models.arXiv preprint arXiv:2308.15022, 2023

Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. Recursively summarizing enables long-term dialogue memory in large language models.arXiv preprint arXiv:2308.15022, 2023

work page arXiv 2023
[45]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[47]

MemoryLLM: Towards self- updatable large language models

Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian McAuley. MemoryLLM: Towards self- updatable large language models. InInternational Conference on Machine Learning (ICML), 2024

work page 2024
[48]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024
[50]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

SWE-smith: Scaling Data for Software Engineering Agents

John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents.arXiv preprint arXiv:2504.21798, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

work page 2018
[53]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[54]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

[55]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[56]

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

[57]

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, and Jishen Zhao. Ama-bench: Evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769, 2026.

[58]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024.

[59]

Shuyan Zhou. WebArena-Infinity: Generating browser environments with verifiable tasks at scale. shuyanzhou.com, March 2026. URL https://webarena.dev/webarena-infinity/

[60]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

[61]

Mypy test syntax: The pytest -k flag cannot parse mypy's [case] test markers. We implemented custom log parsing for mypy test output.

[62]

Docker pull timeout: The default 60-second timeout was insufficient for 500MB–2GB images. We increased it to 600 seconds.

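As an illustration of the timeout fix, the following is a minimal sketch, not the authors' harness. The helper name `pull_image` and the injectable `run` parameter are assumptions made for this example; only the 60-second and 600-second values come from the text.

```python
import subprocess
from typing import Callable

# 600 s is the value reported in the text; the earlier default was 60 s.
PULL_TIMEOUT_SECONDS = 600


def pull_image(
    image: str,
    timeout: float = PULL_TIMEOUT_SECONDS,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `docker pull` exits with code 0 before the timeout.

    `run` is injectable (hypothetical design choice) so the function can be
    exercised without a Docker daemon.
    """
    try:
        result = run(
            ["docker", "pull", image],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A large image on a slow link lands here when the timeout is too short.
        return False
    return result.returncode == 0
```

Treating a timeout as a failed pull rather than a crash lets a caller retry or skip the instance; whether the original infrastructure does this is not stated in the text.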
[63]

Image cache thrashing: The cache_level="env" setting deleted pulled images after each instance. Auto-switched to cache_level="instance".

[64]

Pandas conda crash: Bash set -u with unset conda variables caused crashes. Wrapped conda activation with set +u/set -u.

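The effect of wrapping a command in set +u/set -u can be reproduced with a small sketch, assuming bash is on the PATH. The variable name `HYPOTHETICAL_CONDA_VAR` and the `echo` stand-in for the activation step are made up for this demonstration; the real activation script is not shown in the text.

```python
import subprocess


def run_bash(script: str) -> subprocess.CompletedProcess:
    """Run a script in a fresh non-interactive bash and capture its output."""
    return subprocess.run(["bash", "-c", script], capture_output=True, text=True)


# Stand-in for an activation step that reads a variable before it is set.
# HYPOTHETICAL_CONDA_VAR is an invented name, assumed to be unset.
ACTIVATE = 'echo "value=${HYPOTHETICAL_CONDA_VAR}"'

# Under `set -u`, expanding the unset variable aborts the shell.
strict = run_bash(f"set -u; {ACTIVATE}; echo finished")

# Relaxing nounset only around the activation step lets the script continue,
# then strict mode is restored for everything that follows.
wrapped = run_bash(f"set -u; set +u; {ACTIVATE}; set -u; echo finished")

print("strict exit code:", strict.returncode)
print("wrapped exit code:", wrapped.returncode)
```

Restoring set -u immediately after the wrapped step keeps unset-variable checking active for the rest of the script.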
[65]

Pydantic log parser: Pydantic's output format was incompatible with the default log parser. Switched to parse_log_pytest_v2.

These fixes added approximately 37 resolved instances for Sonnet 4.5 (129→166) and 15 for GPT-OSS-120B (77→92) on the 500-instance evaluation, underscoring the importance of correct evaluation infrastructure.

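The reported gains can be checked with a few lines of arithmetic. This sketch uses only the counts quoted above; the resolve-rate percentages are derived here from those counts and the 500-instance total, and the names `reported` and `summarize` are illustrative.

```python
TOTAL_INSTANCES = 500

# (resolved before fixes, resolved after fixes), as quoted in the text.
reported = {
    "Sonnet 4.5": (129, 166),
    "GPT-OSS-120B": (77, 92),
}


def summarize(before: int, after: int, total: int = TOTAL_INSTANCES) -> dict:
    """Compute the gain in resolved instances and the resolve rates in percent."""
    return {
        "gained": after - before,
        "rate_before_pct": round(100 * before / total, 1),
        "rate_after_pct": round(100 * after / total, 1),
    }


summary = {model: summarize(*counts) for model, counts in reported.items()}

for model, stats in summary.items():
    print(
        f"{model}: +{stats['gained']} instances, "
        f"{stats['rate_before_pct']}% -> {stats['rate_after_pct']}%"
    )
```

On these counts the infrastructure fixes alone account for 7.4 percentage points for Sonnet 4.5 and 3.0 points for GPT-OSS-120B.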
