Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery

Anurag Acharya; Jared Willard; Patrick Emami; Renuka Chintalapati; Sameera Horawalavithana; Sid Raskar

arxiv: 2605.18854 · v1 · pith:52DBCC5Znew · submitted 2026-05-13 · 💻 cs.LG

Pith reviewed 2026-05-20 20:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords memory condensation · coding agents · scientific discovery · context management · LLM agents · DiscoveryBench · hypothesis generation · token efficiency

The pith

No memory condenser significantly changes hypothesis quality in coding agents for scientific discovery tasks, though some raise or lower token costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Coding agents accumulate large amounts of context while running long scientific discovery tasks but hit limits from fixed context windows. This work systematically tests eight condensation strategies, including sliding windows, masking, and LLM-generated summaries, on sixty DiscoveryBench tasks across six domains using GPT-4o. No strategy produces a statistically meaningful change in the quality of the final hypotheses. LLM-based condensers increase overall token usage by 24 to 94 percent, whereas simply masking tool-call outputs yields an 8.6 percent net reduction. The best-performing condenser differs depending on the scientific domain and how long the task runs.

Core claim

Across 480 total evaluations on sixty DiscoveryBench tasks spanning six scientific domains, no memory condensation strategy significantly alters the quality of hypotheses produced by GPT-4o coding agents. LLM-based condensers increase token costs by 24-94 percent relative to baselines, while masking tool-call outputs achieves an 8.6 percent net savings. The optimal condenser varies by scientific domain and task length.
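
The percentages above are net changes in total tokens relative to a baseline run. A minimal sketch of that accounting follows, assuming the total for a condensed run is the agent's own tokens plus any tokens the condenser itself spends (for example, summarization calls). The function name, the split into two counters, and the token counts are hypothetical and are not taken from the paper.

```python
def net_token_change_pct(
    baseline_tokens: int,
    agent_tokens: int,
    condenser_tokens: int = 0,
) -> float:
    """Net change in total tokens versus the baseline, in percent.

    Negative values are savings; positive values are extra cost.
    `condenser_tokens` is the overhead spent by the condenser itself
    (zero for rule-based strategies such as masking).
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    total = agent_tokens + condenser_tokens
    return 100.0 * (total - baseline_tokens) / baseline_tokens


if __name__ == "__main__":
    # Invented counts: a rule-based condenser with no overhead.
    print(net_token_change_pct(100_000, 90_000))          # -10.0
    # Invented counts: an LLM condenser whose own calls outweigh
    # what it trims from the agent's context.
    print(net_token_change_pct(100_000, 70_000, 60_000))  # 30.0
```

Under this accounting, a condenser can shrink the agent's context and still raise the net cost if its own calls consume more tokens than it removes.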

What carries the argument

Memory condensation strategies such as sliding windows, LLM summaries, and output masking, compared for effects on hypothesis quality and total token consumption in long-running coding agents.
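
Two of the non-LLM strategies named above can be sketched in a few lines. This is an illustrative sketch only: the `Message` type, the role names, the choice to always keep the first message, and the placeholder string are assumptions, and the paper's actual condensers may differ.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Message:
    role: str      # assumed roles: "user", "assistant", "tool"
    content: str


def sliding_window(history: List[Message], keep_last: int) -> List[Message]:
    """Keep the first message (assumed to hold the task) plus the
    most recent `keep_last` messages; drop everything in between."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    if len(history) <= keep_last + 1:
        return list(history)
    return [history[0]] + history[len(history) - keep_last:]


def mask_tool_outputs(
    history: List[Message],
    keep_last: int,
    placeholder: str = "[output omitted]",
) -> List[Message]:
    """Replace the content of tool messages that fall outside the most
    recent `keep_last` messages; all other messages are unchanged."""
    cutoff = max(0, len(history) - keep_last)
    return [
        replace(m, content=placeholder) if m.role == "tool" and i < cutoff else m
        for i, m in enumerate(history)
    ]
```

The sliding window discards whole turns, while masking keeps the conversation structure and the agent's own reasoning and only blanks older tool output.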

If this is right

Simpler non-LLM condensation methods can be used without harming the quality of scientific hypotheses.
Token savings from masking tool outputs can be applied directly to reduce costs in long agent runs.
Strategy selection should be adapted to the specific domain and expected task length rather than using a universal choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designs for data-driven science could prioritize simple masking over complex summarization to control costs.
If quality remains stable, future benchmarks might shift focus to measuring discovery speed or reproducibility instead of hypothesis score alone.
Extending the evaluation to open-source models or real laboratory workflows would test whether the cost patterns hold outside GPT-4o.

Load-bearing premise

That the sixty DiscoveryBench tasks spanning six scientific domains are representative of real data-driven scientific discovery, and that hypothesis quality can be measured reliably enough to detect meaningful differences between condensers.

What would settle it

If a repeat of the evaluation on a fresh set of scientific discovery tasks showed at least one condenser producing a statistically significant improvement or drop in hypothesis quality, the no-effect finding would be falsified.

Figures

Figures reproduced from arXiv: 2605.18854 by Anurag Acharya, Jared Willard, Patrick Emami, Renuka Chintalapati, Sameera Horawalavithana, Sid Raskar.

**Figure 2.** Hypothesis quality analysis. (a) Box plots with individual data points and mean

**Figure 3.** Token Savings vs Hypothesis Quality

**Figure 4.** Input vs. output token breakdown. Input tokens dominate at 93–98% across all

**Figure 5.** Domain-specific condenser performance analysis.

**Figure 6.** Token growth patterns demonstrating the efficiency benefits of condensation for

Original abstract

Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force practitioners to choose between truncation and task failure. While numerous memory condensation strategies have been proposed, from simple sliding windows to LLM-generated summaries, no systematic comparison exists to guide strategy selection, especially in scientific discovery tasks. We evaluate eight memory condensation strategies using GPT-4o on sixty DiscoveryBench tasks spanning six scientific domains (480 total evaluations). We find that no condenser significantly alters hypothesis quality, while LLM-based condensers increase token costs by 24-94 percent, and masking tool-call outputs achieves an 8.6 percent net savings. We also observe that the optimal condenser for data-driven scientific discovery varies by scientific domain and task length.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper runs the first head-to-head on eight memory condensers for scientific coding agents and finds no quality drop but higher costs for LLM summaries, with one simple mask saving tokens.

The main thing to know is that none of the eight condensers they tested changed hypothesis quality on the DiscoveryBench tasks, while LLM-based ones raised token use by 24-94 percent and output masking cut costs by 8.6 percent. The best choice also shifted with domain and task length. That is the practical takeaway for anyone running long agent sessions on scientific data work.

They fill a clear gap by doing the first systematic comparison in this setting, using 480 evaluations across six domains and varying lengths with GPT-4o. That scale and coverage are useful and give some evidence that simple strategies can hold up without fancy summarization.

The soft spot is exactly the one the stress-test flags. The null result on quality depends on whatever scoring procedure they applied to the hypotheses, and the abstract gives no error bars, statistical tests, or protocol details. If that metric is noisy or only loosely tied to real scientific value, moderate effects from condensation could be missed and the equivalence claim would not hold. The tasks themselves are a reasonable starting point but still a narrow slice of actual discovery work.

This is for engineers and researchers who build or tune coding agents for data-heavy science and need quick guidance on context management. A reader who cares about empirical agent benchmarks would find the domain breakdowns worth discussing. It deserves peer review because the question is timely and the experimental frame is straightforward, even though the paper needs clearer methods and stats reporting to make the null result convincing.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates eight memory condensation strategies for coding agents in data-driven scientific discovery. Using GPT-4o across sixty DiscoveryBench tasks spanning six scientific domains (480 evaluations total), it reports that no condenser significantly alters hypothesis quality, LLM-based condensers increase token costs by 24-94 percent, masking tool-call outputs yields an 8.6 percent net savings, and the optimal condenser varies by domain and task length.

Significance. If the empirical results hold under rigorous verification, the work supplies actionable guidance for context management in long-running coding agents applied to scientific tasks. It shows that lightweight strategies can preserve performance while reducing overhead and documents domain-specific variation, filling a gap in systematic comparisons for this setting.

major comments (2)

[Abstract / Results] The claim that 'no condenser significantly alters hypothesis quality' is presented without statistical tests, confidence intervals, error bars, or a precise description of the hypothesis quality metric and its measurement protocol. This omission makes it impossible to assess whether the null result indicates true equivalence or insufficient sensitivity/resolution of the metric on the DiscoveryBench tasks.
[Evaluation Setup] The central conclusion depends on the untested assumption that the chosen hypothesis quality metric has sufficient sensitivity to detect practically relevant differences and that the sixty tasks are representative of real data-driven scientific discovery. The manuscript should include validation of the metric (e.g., correlation with expert judgment or sensitivity analysis) to support the equivalence claim.

minor comments (1)

[Abstract] The abstract states '480 total evaluations' but does not clarify the number of independent runs per task-condenser pair or the exact protocol for cost and quality measurement, which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key opportunities to strengthen the statistical rigor and interpretability of our empirical findings. We address each major comment below and have revised the manuscript accordingly.

Referee: [Abstract / Results] The claim that 'no condenser significantly alters hypothesis quality' is presented without statistical tests, confidence intervals, error bars, or a precise description of the hypothesis quality metric and its measurement protocol. This omission makes it impossible to assess whether the null result indicates true equivalence or insufficient sensitivity/resolution of the metric on the DiscoveryBench tasks.

Authors: We agree that the original presentation would be improved by explicit statistical support. In the revised manuscript we have added Wilcoxon signed-rank tests for paired comparisons of hypothesis quality scores across all condenser pairs, reported 95% confidence intervals on the mean scores, and included error bars on the relevant bar plots in the Results section. We have also expanded the Evaluation Setup subsection to describe the hypothesis quality metric in full: it is the average of an LLM-as-judge score (0-10 scale) applied to the final hypothesis against the ground-truth reference provided by DiscoveryBench, using the benchmark's standard evaluation prompt.
revision: yes

Referee: [Evaluation Setup] The central conclusion depends on the untested assumption that the chosen hypothesis quality metric has sufficient sensitivity to detect practically relevant differences and that the sixty tasks are representative of real data-driven scientific discovery. The manuscript should include validation of the metric (e.g., correlation with expert judgment or sensitivity analysis) to support the equivalence claim.

Authors: We concur that additional evidence of metric sensitivity would bolster the equivalence claim. While a comprehensive human-expert correlation study lies outside the resources available for this work, we have added a sensitivity analysis in the new Appendix C. This analysis perturbs hypothesis quality in controlled ways on a subset of tasks and shows that the metric reliably distinguishes between high- and low-quality outputs. We have also clarified in the revised Evaluation Setup that the sixty tasks are the full DiscoveryBench suite, which was explicitly constructed to span six scientific domains and a range of task lengths representative of data-driven discovery; we discuss remaining generalizability limits in the Limitations section.
revision: partial
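
The first simulated response names a Wilcoxon signed-rank test for paired comparisons of quality scores. A rough sketch of that test for one pair of condensers follows, assuming each list holds per-task scores in the same task order. It uses only the standard library and the large-sample normal approximation with no tie or continuity correction, so p-values for small samples are approximate; the function name is hypothetical and nothing here reproduces the paper's analysis.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Return (T, two-sided p) for paired samples x and y.

    T is the smaller of the positive and negative rank sums.
    Zero differences are dropped; tied |differences| share an average rank.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(w_plus, w_minus)

    # Normal approximation to the null distribution of T.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return t, p
```

With eight strategies there are many pairwise comparisons, so in practice some multiple-comparison adjustment would usually accompany such a test; the source does not say whether one was applied.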

Circularity Check

0 steps flagged

No circularity in empirical evaluation of condensation strategies

This paper is a direct empirical comparison study that evaluates eight memory condensation strategies on 60 DiscoveryBench tasks using GPT-4o, reporting observed effects on hypothesis quality and token usage. No derivations, equations, fitted parameters, or predictions are present; all claims are grounded in experimental outcomes rather than quantities defined in terms of themselves or reduced via self-citation chains. The study is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical evaluation study whose central claims rest on the representativeness of the chosen benchmark tasks and the validity of the hypothesis-quality metric.

axioms (2)

domain assumption: DiscoveryBench tasks spanning six domains serve as adequate proxies for general data-driven scientific discovery.
The evaluation design and generalization claims depend on this assumption about task representativeness.
domain assumption: Hypothesis quality is a stable and sensitive enough outcome measure to detect differences between memory strategies.
The finding of 'no significant alteration' relies on this measurement assumption.

pith-pipeline@v0.9.0 · 5666 in / 1239 out tokens · 53770 ms · 2026-05-20T20:25:30.480807+00:00 · methodology
