Enhancing Software Engineering Through Closed-Loop Memory Optimization

Graham Neubig; Qingyun Wang; Xingyao Wang; Xuehang Guo; Zora Zhiruo Wang

arxiv: 2606.05646 · v1 · pith:7HAMS7ARnew · submitted 2026-06-04 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-28 00:40 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords memory augmentation · software engineering agents · LLM agents · closed-loop optimization · downstream impact · task-agnostic evaluation

The pith

Closed-loop memory framework defines utility from downstream task impact to improve SE agents without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that software engineering agents, which are currently episodic and reconstruct context from scratch on every task, can be made to retain and refine experiences through a closed-loop memory system. Memory utility is grounded directly in whether past experiences improve success on new tasks, turning that impact into both a benchmark for evaluation and a signal for optimization. This approach requires no task-specific knowledge or manual annotations. Sympathetic readers would care because it offers a general way to reduce repeated mistakes and computational waste in agents that navigate codebases. Experiments across single-episode and cross-episode settings show consistent gains.

Core claim

MemOp is a closed-loop framework that grounds memory utility in validated downstream impact, establishing it as a task-agnostic evaluation benchmark and an annotation-free optimization signal. Complementary evaluation on single-episode and cross-episode memory augmentation shows that it improves SE agents, with absolute gains of up to 5.25% in success rate and 4.63% in resolve efficiency and a computational cost reduction of at least 9.79%.

What carries the argument

MemOp, the closed-loop framework that treats validated downstream task impact as the sole definition of memory utility for both evaluation and optimization.
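
To make "utility from downstream impact" concrete, here is a minimal sketch of how a single memory could be scored by comparing validated outcomes with and without it. The `Outcome` record, the `downstream_utility` function, and the choice of cost proxy are all hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Result of one agent run on one task (hypothetical record layout)."""
    resolved: bool  # did the run pass the task's validation?
    cost: float     # any cost proxy, e.g. tokens or agent steps


def _success_rate(runs: list[Outcome]) -> float:
    return sum(o.resolved for o in runs) / len(runs)


def _mean_cost(runs: list[Outcome]) -> float:
    return sum(o.cost for o in runs) / len(runs)


def downstream_utility(with_memory: list[Outcome],
                       without_memory: list[Outcome]) -> dict[str, float]:
    """Score a memory by how much it changes validated task outcomes.

    Returns the absolute success-rate gain and the relative cost reduction
    of the memory-augmented runs over the no-memory runs.
    """
    if not with_memory or not without_memory:
        raise ValueError("need at least one run in each condition")
    base_cost = _mean_cost(without_memory)
    if base_cost == 0:
        raise ValueError("baseline cost must be non-zero")
    return {
        "success_gain": _success_rate(with_memory) - _success_rate(without_memory),
        "cost_reduction": (base_cost - _mean_cost(with_memory)) / base_cost,
    }
```

Nothing in this scoring depends on what the task is or on human labels; it only needs the pass/fail validation that the task already provides, which is the sense in which such a signal can be called task-agnostic and annotation-free.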

If this is right

SE agents can retain and reuse experiences across tasks instead of reconstructing context each time.
Memory selection becomes possible without task-specific annotations or human labels.
Performance gains appear in both single-task and multi-task memory augmentation settings.
Computational cost drops while success rate and resolve efficiency rise.
The same utility signal serves simultaneously as an evaluation metric and an optimization objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same downstream-impact definition could be applied to memory management in LLM agents outside software engineering.
If the utility signal proves stable, it could support iterative refinement of memory stores over many episodes without external supervision.
Cross-episode augmentation may compound gains over time if early improvements feed into later memory selections.

Load-bearing premise

Memory utility can be defined in a task-agnostic way solely from validated downstream task impact without needing task-specific knowledge or manual labels.
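
One way to write this premise down is sketched below. The symbols Q, N_Q, and Mθ appear in the paper's figure captions; the weighted-sum form of Q, the weights w_i, the trajectory notation τ(t, m), and the difference-based utility U are assumptions made here for illustration and may not match the paper's definitions.

```latex
% Assumed notation: \tau(t, m) is the agent's trajectory on task t given memory m
% (\varnothing = no memory); q_i are the N_Q per-trajectory metrics;
% w_i are hypothetical weights.
\begin{align}
Q(\tau) &= \sum_{i=1}^{N_Q} w_i \, q_i(\tau), \\
U(m; t) &= Q\big(\tau(t, m)\big) - Q\big(\tau(t, \varnothing)\big), \\
M_\theta^{*} &= \arg\max_{M_\theta} \; \mathbb{E}_{t}\Big[\, U\big(M_\theta(t);\, t\big) \Big].
\end{align}
```

Under this reading, the premise holds only if every q_i can be computed from the validated trajectory alone, with no task-specific knowledge entering Q.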

What would settle it

A controlled experiment in which memory selected purely by downstream impact either fails to raise success rate or raises computational cost when applied to a new set of SE tasks or a different base agent.
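
A sketch of how such an experiment could be read out, assuming each task in the new set is run once with and once without memory and recorded as 1 (resolved) or 0 (not resolved). The paired bootstrap below is a standard resampling technique chosen for illustration; the function name and defaults are hypothetical, and the paper's own statistical procedure is not described in the text reviewed here.

```python
import random


def paired_bootstrap_delta(base: list[int], aug: list[int],
                           n_resamples: int = 2000,
                           seed: int = 0) -> tuple[float, float, float]:
    """Return (point estimate, 2.5% bound, 97.5% bound) of the success-rate gain.

    base[i] and aug[i] are 0/1 outcomes on the same task i, without and
    with memory respectively.
    """
    if len(base) != len(aug) or not base:
        raise ValueError("need equally long, non-empty outcome lists")
    if n_resamples < 1:
        raise ValueError("n_resamples must be positive")
    rng = random.Random(seed)
    n = len(base)
    # Per-task differences keep the pairing intact under resampling.
    diffs = [a - b for a, b in zip(aug, base)]
    point = sum(diffs) / n
    samples = sorted(sum(rng.choices(diffs, k=n)) / n
                     for _ in range(n_resamples))
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return point, lower, upper
```

If the interval for the gain sits at or below zero on a fresh task set or a different base agent, that would count against the premise; the same function can be applied to per-task cost differences.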

Figures

Figures reproduced from arXiv: 2606.05646 by Graham Neubig, Qingyun Wang, Xingyao Wang, Xuehang Guo, Zora Zhiruo Wang.

**Figure 1.** Memory-Augmented Software Engineering. Compared with the no-Mθ SE agent (left), MemOp (right) equips SE agents with adaptively distilled memories to better tackle dynamic real-world SE challenges. The emergence of large language models has catalyzed a paradigm shift in software engineering, enabling LLM agents capable of addressing complex real-world SE tasks (Jin et al., 2025; Guo et al., 2025b). Through t…

**Figure 2.** Memory Utility. We tackle the fundamental challenge (§2.2) by proposing memory utility with performance-grounded memory evaluation and trajectory-level rejection sampling. Without such measures, it is impossible to distinguish good memory from noise, or even to leverage quality signals to drive learning (§5 & §B). We tackle this with a concrete, outcome-grounded definition: a memory is useful if and only…

**Figure 3.** Memory Model Finetuning. By preparing training datasets (Tab. 4) through trajectory-based rejection sampling (§3.2), Mθ is finetuned through two-stage training via SFT and RL (§3.3). […] where Q(·) is a composite memory utility function combining all N_Q multi-dimensional metrics (§3.1) across task performance and problem-solving efficiency, and M*θ is the optimal memory evolution function MemOp aims to learn.…

**Figure 4.** Adaptability to Different FT Stages. All training stages contribute to performance improvements.

**Figure 5.** MemOp Algorithm Generalizability. MemOp is also useful when applying different RL algorithms.

**Figure 6.** Preliminary Analysis on SE Agent Failures. Through manual analysis, we identify seven failure patterns in SE agent problem-solving. How do SE agents fail in software engineering tasks? Understanding the nature of SE agent failures is critical for improving their problem-solving. Motivated by this, we perform a preliminary analysis on key failure patterns in SE agents, categorizing common causes to inform …

**Figure 7.** Failure Case of Repository Structure. The SE agent fails due to its incorrect understanding of the repository structure.

**Figure 8.** Failure Case of Repetition. The SE agent repeats the same error as in earlier attempts.

**Figure 9.** Failure Case of Reasoning Error. The SE agent fails due to its incorrect reasoning across multiple attempts.

**Figure 10.** Failure Case of Coding Error. The SE agent fails due to coding errors, such as SyntaxError, NameError, AttributeError, etc.

**Figure 11.** Failure Case of Execution Error. The SE agent fails due to execution errors.

**Figure 12.** Failure Case of Inconsistency. The SE agent fails due to the inconsistency between its reasoning and action.

**Figure 13.** Failure Case of Hallucination. The SE agent fails due to hallucinating actions or experiences.

**Figure 14.** Preliminary Study on Memory Instruction.

**Figure 15.** Memory Instruction. We conduct the preliminary study on memory reflection through three versions of instructions: (1) general and concise instruction, (2) high-level instruction, and (3) fine-grained instruction.

**Figure 16.** Repo-Wise Comparison between Baseline and MemOp. The SE agent (base LLM: Qwen3-Coder-30B-A3B) with MemOp (backbone LLM: Qwen3-4B-T) consistently outperforms the no-Mθ baseline SE agent (base LLM: Qwen3-Coder-30B-A3B) across nine disparate repositories.

**Figure 17.** Instructions for Single-Episode and Cross-Episode Memory Generation. Our memory generation instructions for single-episode and cross-episode memory-augmented software engineering settings.

**Figure 18.** Evaluation Set Distribution. During experiments, to avoid evaluation circularity, we randomly sample 100 evaluation instances that have no overlap with the 100 instances used to construct our training dataset. In cross-episode evaluation, all instances of each repository are evaluated according to their temporal order to simulate real-world codebase evolution.

**Figure 19.** MemOp for Memory-Augmented Software Engineering. MemOp finetunes Mθ to augment SE agents through adaptive memory generation (§2.3 & §G.2). Our ablation studies include single-episode memory generation, episode-level memory evolution (§4), and action-level memory evolution (§G.2).

**Figure 20.** Memory Evolution Granularity. We compare SE agent performance among no-Mθ, MemOp in cross-action, and MemOp in cross-episode memory evolution settings. To investigate this fundamental question, we employ Qwen3-Coder-30B-A3B to power the SE agent, with MemOp using Qwen3-4B-T as the Mθ backbone. We evaluate memory-augmented software engineering on the same test set under both action-level and episode-level memory…

**Figure 21.** Effects of Preference Rollout Batch Size Configuration on Mθ Optimization. To investigate the effect of rollout batch size c in DRL (§3.2), we compare SE agent performance with Mθ using the same backbone LLM (Qwen3-4B-Thinking), finetuned on DRL with c = 2 and c = 4, respectively.

**Figure 22.** MemOp Improves SE Performance with Reduced Computational Cost. MemOp enhances the SE agent across single-episode and cross-episode settings with reduced computational cost. G.5 MemOp for More Robust Software Engineering: To systematically compare the augmentation effects of MemOp, we use Qwen3-Coder-30B to power the SE agent with Qwen3-4B-T as the memory backbone of finetuned Mθ, and compare their mean perform…

**Figure 23.** MemOp Enhances SE Agent Performance Robustness. Error bars across all evaluation metrics demonstrate that MemOp consistently improves SE agent performance with reduced variance, reflecting greater robustness over the no-Mθ baseline. G.6 Qualitative Analysis on Memory Augmentation Success & Failure: To better understand the effectiveness of Mθ in memory generation and evolution, we conduct case studies to q…

**Figure 24.** Examples of Effective Memory Reflection. The generated memories effectively support the SE agent in resolving the task with enhanced problem-solving efficiency. Examples extracted from the SE agent powered by Qwen3-Coder-30B-A3B and MemOp with Mθ powered by Qwen3-4B-T (FT). Trajectory history details are omitted as [...] for clarity.

**Figure 25.** Examples of Ineffective Memory Reflection. The generated memories fail to effectively support the SE agent. Examples extracted from the SE agent powered by Qwen3-Coder-30B-A3B and MemOp with Mθ powered by Claude-4-Sonnet (NFT). Trajectory history details are omitted as [...] for clarity.
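
The captions for Figures 2 and 3 mention trajectory-level rejection sampling as the way training data for Mθ is prepared. As a rough sketch of the general idea, and not the paper's procedure, the filter below keeps only candidate memories whose downstream score beats the no-memory baseline. The function name, the `score_with` callback, and the `margin` parameter are hypothetical.

```python
from typing import Callable, Iterable


def rejection_sample(candidates: Iterable[str],
                     score_with: Callable[[str], float],
                     baseline_score: float,
                     margin: float = 0.0) -> list[tuple[str, float]]:
    """Keep candidate memories that improve on the no-memory baseline.

    score_with(memory) is assumed to run the agent with that memory and
    return a scalar utility for the resulting trajectory. Candidates whose
    improvement does not exceed `margin` are rejected.
    """
    kept: list[tuple[str, float]] = []
    for memory in candidates:
        score = score_with(memory)
        if score - baseline_score > margin:
            kept.append((memory, score))
    # Best-scoring memories first, e.g. to build a fine-tuning set.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept
```

For example, with stub scores `{"a": 0.9, "b": 0.4, "c": 0.7}` and a baseline of 0.5, only "a" and "c" survive. In a real pipeline each call to `score_with` is a full agent rollout, so the number of candidates per task is the main cost driver.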

Original abstract

Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic memory utility, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce MemOp, a closed-loop framework for memory augmentation in SE agents. MemOp grounds memory utility in validated downstream impact, establishing utility as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Through complementary evaluation on single-episode and cross-episode memory augmentation, results demonstrate that MemOp consistently improves SE agents across settings, achieving absolute gains of up to ↑5.25% in success rate and ↑4.63% in resolve efficiency, while substantially reducing computational cost by ≥9.79%. Our project page: https://xhguo7.github.io/MemOp/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives SE agents a closed-loop way to score and optimize memory from downstream task success, with modest reported gains but thin method details.

The core contribution here is a closed-loop framework called MemOp that ties memory utility directly to validated task success for software engineering agents. This addresses the episodic nature of current LLM agents by providing both an evaluation benchmark and an optimization signal without task-specific labels.

The paper does a solid job of identifying the gap in existing memory support and then demonstrating improvements through experiments on single-episode and cross-episode augmentation. The reported gains (up to 5.25% higher success rate, 4.63% better resolve efficiency, and at least 9.79% lower computational cost) are presented as consistent across settings.

Where it is softer is in evaluation transparency. The abstract does not detail the baselines, the datasets, or how the closed loop is implemented, so whether the impact signal is independent of the optimization remains to be verified from the full text. The absolute improvements are modest, which is fine if they are robust, but that needs confirmation. No major contradictions appear in the argument itself.

This work is for researchers developing LLM-based agents for real software engineering problems. Readers interested in practical memory mechanisms would get value from the approach and the empirical results.

I recommend putting it through peer review. The framework is concrete and the results are presented in a way that can be checked and extended.

Referee Report

2 major / 1 minor

Summary. The paper introduces MemOp, a closed-loop framework for memory augmentation in SE agents. It grounds memory utility in validated downstream task impact, establishing it as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Complementary evaluations on single-episode and cross-episode memory augmentation are claimed to show consistent improvements, with absolute gains of up to ↑5.25% in success rate, ↑4.63% in resolve efficiency, and ≥9.79% reduction in computational cost.

Significance. If the results hold under proper verification, the closed-loop construction offers a task-agnostic way to optimize and evaluate memory in LLM-based SE agents, addressing their episodic limitations without requiring manual labels or task-specific knowledge. This could improve generalizability across agents and settings.

major comments (2)

[Abstract] The abstract states quantitative improvements (↑5.25% success rate, ↑4.63% resolve efficiency, ≥9.79% cost reduction) but supplies no experimental protocol, baselines, statistical tests, dataset details, or error analysis; the central performance claims cannot be evaluated from the given text.
[Abstract] The closed-loop design risks circularity if task success is used both to define and to optimize the same memory utility signal; the manuscript must demonstrate that the downstream-impact signal is independent of the optimization loop (see weakest assumption in stress-test note).

minor comments (1)

The project page link is provided but the manuscript should include a brief summary of what additional materials (code, data) are available there.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions.

Referee: [Abstract] The abstract states quantitative improvements (↑5.25% success rate, ↑4.63% resolve efficiency, ≥9.79% cost reduction) but supplies no experimental protocol, baselines, statistical tests, dataset details, or error analysis; the central performance claims cannot be evaluated from the given text.

Authors: The abstract is intentionally concise to summarize contributions and results. Full details on the experimental protocol, baselines (standard SE agents and memory-augmented variants), statistical tests, datasets (SE benchmarks used), and error analysis appear in Sections 4 and 5. We will revise the abstract to briefly reference the evaluation settings and primary baselines for improved clarity. Revision: yes.

Referee: [Abstract] The closed-loop design risks circularity if task success is used both to define and to optimize the same memory utility signal; the manuscript must demonstrate that the downstream-impact signal is independent of the optimization loop (see weakest assumption in stress-test note).

Authors: We appreciate the concern about potential circularity. The memory utility is computed from downstream impact on a held-out validation task set that is excluded from the optimization loop; the loop then uses this precomputed signal for memory selection, while final gains are measured on disjoint test tasks. This separation maintains independence. We will add explicit discussion of the separation and address the noted stress-test assumptions in the revised manuscript. Revision: yes.
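
The separation described in this simulated response can be sketched as a three-way split of task IDs: one set for computing the utility signal, one for the optimization loop, and one for final testing. The function and split names below are hypothetical, and the split sizes are up to the experimenter; the point is only that disjointness can be checked mechanically.

```python
import random


def three_way_split(task_ids: list[str], n_utility: int, n_test: int,
                    seed: int = 0) -> dict[str, set[str]]:
    """Partition task IDs into disjoint utility / optimize / test sets."""
    ids = sorted(set(task_ids))  # de-duplicate and fix order before shuffling
    if n_utility < 0 or n_test < 0 or n_utility + n_test > len(ids):
        raise ValueError("split sizes do not fit the number of tasks")
    random.Random(seed).shuffle(ids)
    split = {
        "utility": set(ids[:n_utility]),                  # held-out signal set
        "test": set(ids[n_utility:n_utility + n_test]),   # final evaluation
        "optimize": set(ids[n_utility + n_test:]),        # used inside the loop
    }
    # Guard against leakage between any two splits.
    assert split["utility"].isdisjoint(split["test"])
    assert split["utility"].isdisjoint(split["optimize"])
    assert split["test"].isdisjoint(split["optimize"])
    return split
```

A random split like this does not cover the cross-episode setting, where the Figure 18 caption says instances are evaluated in temporal order within each repository; there the test tasks would also need to come after the training tasks in time.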

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper explicitly frames its core contribution as a closed-loop construction that defines memory utility directly from validated downstream task impact and then uses that same impact as both benchmark and optimization signal. This is presented as an intentional design choice rather than a hidden reduction. No equations, fitted parameters, or self-citations are exhibited in the provided text that would make the reported gains (success rate, resolve efficiency, cost reduction) equivalent to the inputs by construction. The empirical outcomes are described as results of applying the framework, and the argument remains self-contained once the closed-loop premise is granted. No load-bearing step reduces to a self-definition or fitted input in the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

49 extracted references · 12 canonical work pages

[1] From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future. eprint, 2025.
[2] A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System. eprint, 2025.
[3] Position: Humans are Missing from AI Coding Agent Research. 2026.
[4] OpenHands: An Open Platform for AI Software Developers as Generalist Agents. eprint, 2025.
[5] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. ArXiv.
[6] Claude Code. 2025.
[7] CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. Annual Meeting of the Association for Computational Linguistics.
[8] MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution. ArXiv.
[9] SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering. arXiv preprint arXiv:2502.06994.
[10] PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification. International Conference on Machine Learning.
[11] GPT Docs. 2025.
[12] Claude Docs. 2025.
[13] GPT API Pricing. 2025.
[14] Claude API Pricing. 2025.
[15] Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics.
[16] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. ArXiv.
[17] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. LooGLE: Can Long-Context Language Models Understand Long Contexts? Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.859
[18] Peyman Hosseini, Ignacio Castro, Iacopo Ghinassi, and Matthew Purver. Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
[19] MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. eprint, 2025.
[20] Yash Kumar Atri, Ahmed Alaa, and Thomas Hartvigsen. Lifelong Model Editing with Graph-Based External Memory. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.690
[21] Savini Kashmira, Jayanaka L. Dantanarayana, Joshua Brodsky, Ashish Mahendra, Yiping Kang, Krisztian Flautner, Lingjia Tang, and Jason Mars. TOBUGraph: Knowledge Graph-Based Retrieval for Enhanced LLM Performance Beyond RAG. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025. doi:10.18653/v1/2025.emnlp-industry.93
[22] M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions. eprint, 2024.
[23] Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, and Yunshi Lan. ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Indus... 2025. doi:10.18653/v1/2025.acl-industry.53
[24] Haoran Sun, Shaoning Zeng, and Bob Zhang. H-MEM: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.15
[25] Tobias Lindenbauer, Georg Groh, and Hinrich Schuetze. From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025). 2025. doi:10.18653/v1/2025.realm-1.30
[26] CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory. eprint, 2024.
[27] MemoryBank: Enhancing Large Language Models with Long-Term Memory. eprint, 2023.
[28] MemInsight: Autonomous Memory Augmentation for LLM Agents. eprint, 2025.
[29] Structurally Aligned Subtask-Level Memory for Software Engineering Agents. eprint, 2026.
[30] SWE-Bench-CL: Continual Learning for Coding Agents. eprint, 2025.
[31] Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations. ArXiv.
[32] Improving Code Localization with Repository Memory. ArXiv.
[33] MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences. eprint, 2026.
[34] Ye Wang, Xinrun Xu, and Zhiming Ding. MindRef: Mimicking Human Memory for Hierarchical Reference Retrieval with Fine-Grained Location Awareness. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.67
[35] Ryo Yoshida, Shinnosuke Isono, Kohei Kajikawa, Taiga Someya, Yushi Sugimoto, and Yohei Oseki. If Attention Serves as a Cognitive Model of Human Memory Retrieval, What is the Plausible Memory Representation? Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.483
[36] Jinhu Fu, Kun Wang, Chongye Guo, Junfeng Fang, Wentao Zhang, and Sen Su. Knowledge Graph-Driven Memory Editing with Directional Interventions. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.261
[37] Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1575
[38] General Agentic Memory Via Deep Research. eprint, 2025.
[39] LightMem: Lightweight and Efficient Memory-Augmented Generation. eprint, 2025.
[40] Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, and Jinyoung Yeo. Towards Lifelong Dialogue Agents via Timeline-based Memory Management. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Hum... 2025. doi:10.18653/v1/2025.naacl-long.435
[41] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024.
[42] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. eprint, 2024.
[43] Group Sequence Policy Optimization. eprint, 2025.
[44] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. eprint, 2025.
[45] Devstral Model. 2025.
[46] Qwen3 Technical Report. eprint, 2025.
[47] Qwen2.5 Technical Report. eprint, 2025.
[48] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng ... doi:10.1038/s41586-025-09422-z
[49] Claude 4 Sonnet. 2025.
