
arxiv: 2606.28434 · v1 · pith:SC3VQQ2Znew · submitted 2026-06-26 · 💻 cs.SE · cs.AI

SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents

Pith reviewed 2026-06-30 01:33 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords adaptive memory management · long-horizon agents · software engineering agents · context compression · reinforcement learning · SWE-Bench · trajectory optimization

The pith

Agents trained with SWE-MeM learn to decide their own memory compression during long software engineering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training framework called SWE-MeM that equips software engineering agents with a flexible memory tool. Agents use this tool to choose when, what, and how much to compress from their interaction history based on current state, progress, and remaining context. Training relies on synthesized proactive memory-management trajectories paired with Memory-aware GRPO, which splits trajectories and assigns credit at the step level to optimize both compression choices and task resolution together. On the SWE-Bench Verified benchmark this produces resolve rates of 43.4 percent with a 4B model and 60.2 percent with a 30B model while using fewer tokens than prior static or rigid compression baselines.

Core claim

SWE-MeM provides agents a flexible memory tool and trains them on synthesized proactive trajectories with Memory-aware GRPO so that compression decisions become part of the agent's own policy; the joint optimization of memory management and issue resolution through memory-aware trajectory splitting and step-level credit assignment yields higher resolve rates and lower token consumption than existing memory baselines on SWE-Bench Verified.

What carries the argument

A flexible memory tool that lets the agent decide compression timing, content, and granularity on demand, trained end-to-end with Memory-aware GRPO that performs memory-aware trajectory splitting and step-level credit assignment.
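
To make the idea of an agent-controlled memory tool concrete, here is a minimal sketch in Python. It is not the paper's interface: the names `Turn`, `History`, and `compress`, and the whitespace-based token count, are all assumptions made for illustration. The point is only that timing (when the call happens), content (which span is replaced), and granularity (how long the summary is) are all chosen by the caller rather than by a fixed schedule.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One step of interaction history (hypothetical structure)."""
    role: str
    text: str


@dataclass
class History:
    turns: list[Turn] = field(default_factory=list)

    def token_count(self) -> int:
        # Crude whitespace proxy for a tokenizer; an assumption for illustration.
        return sum(len(t.text.split()) for t in self.turns)

    def compress(self, start: int, end: int, summary: str) -> None:
        """Replace turns[start:end] with a single summary turn.

        The caller picks the span (what), the moment of the call (when),
        and the summary length (how much).
        """
        if not 0 <= start < end <= len(self.turns):
            raise ValueError("invalid span")
        self.turns[start:end] = [Turn("memory", summary)]


history = History([
    Turn("tool", "grep output line " * 50),
    Turn("tool", "pytest log line " * 50),
    Turn("assistant", "The bug is in parse_config"),
])
before = history.token_count()
history.compress(0, 2, "Searched repo and ran tests; failure traced to parse_config.")
after = history.token_count()
print(before, after)
```

In this toy run the two noisy tool outputs collapse into one short memory turn while the agent's latest conclusion stays verbatim. A trained policy would emit the `compress` arguments itself as a tool call.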

If this is right

  • Agents can exceed the performance of static compression workflows while consuming fewer tokens overall.
  • Memory management becomes an integrated part of the agent's policy rather than an external post-processing step.
  • The same training recipe scales across model sizes from 4B to 30B parameters.
  • Joint optimization of memory actions and task success improves both metrics simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive compression approach could apply to other long-context agent domains such as multi-step planning or extended dialogue.
  • Lower token budgets per task could translate directly into reduced inference cost when agents are deployed at scale.
  • If agents can generate their own high-quality memory trajectories, the method might support iterative self-improvement without new human data.

Load-bearing premise

Synthesized proactive memory-management trajectories used in training are representative enough of real agent trajectories to produce effective compression decisions at test time.

What would settle it

Measure whether the trained agent's compression decisions on held-out long-horizon tasks produce the same resolve-rate gains when the test trajectories differ substantially from the synthesized training set.

Original abstract

Long-horizon software engineering agents often need to manage lengthy and noisy interaction histories under limited context budgets. Existing memory management methods typically rely on static compression workflows or impose rigid constraints on compression timing and granularity. Moreover, these approaches fail to jointly optimize memory management and issue resolution capabilities to improve performance while reducing token usage. We present SWE-MeM, a training framework for proactive and on-demand memory management in software engineering agents. SWE-MeM provides a flexible memory tool that lets agents decide when, what, and how to compress based on trajectory state, task progress, and remaining context budget. We train agents with synthesized proactive memory-management trajectories and Memory-aware GRPO, which jointly optimizes memory management and issue resolution through memory-aware trajectory splitting and step-level credit assignment. On SWE-Bench Verified, SWE-MeM achieves 43.4% and 60.2% resolve rate with 4B and 30B models, respectively, outperforming existing memory management baselines in both performance and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SWE-MeM, a training framework for long-horizon software engineering agents that enables proactive, on-demand memory management via a flexible memory tool. Agents are trained on synthesized proactive memory-management trajectories using Memory-aware GRPO, which performs memory-aware trajectory splitting and step-level credit assignment to jointly optimize compression decisions and issue resolution. On SWE-Bench Verified the method reports resolve rates of 43.4% (4B model) and 60.2% (30B model), outperforming prior memory-management baselines in both task success and token efficiency.

Significance. If the central empirical claims are supported by the experiments, the work would demonstrate that learned adaptive memory policies can simultaneously raise resolve rates and lower token consumption in long-horizon coding agents, addressing a practical bottleneck that static or rigidly scheduled compression methods have not resolved.

major comments (2)
  1. [Abstract and §3 (Training Procedure)] The headline performance numbers rest on the assumption that synthesized proactive memory-management trajectories produce state distributions, trigger timings, and compression granularities sufficiently close to those arising in real agent rollouts. No quantitative validation (e.g., distributional distances, trigger-condition histograms, or retained-context statistics) is supplied to support this match, which is load-bearing for the transferability of the learned policy.
  2. [§3 (Memory-aware GRPO)] Memory-aware GRPO is described as jointly optimizing memory actions and resolution via trajectory splitting and step-level credit assignment, yet the precise form of the memory-aware reward and the mechanism that prevents degenerate policies (e.g., always compressing to the minimum budget) are not stated explicitly enough to verify correctness of the credit assignment.
minor comments (2)
  1. The abstract should report the exact model sizes and token budgets used by the strongest baselines so that the efficiency claims can be directly compared.
  2. [§2] Notation for the memory tool interface (when/what/how decisions) is introduced in the abstract but not formalized until later; a short table or diagram in §2 would improve readability.
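
The validation asked for in major comment 1 could take many forms. As one hedged illustration, the Python sketch below compares histograms of compression trigger points from synthesized and real rollouts using total variation distance. The trigger values are made-up numbers, and the choice of statistic (fraction of context budget in use when compression fires) and of distance are assumptions, not anything the manuscript reports.

```python
def histogram(values, bins, lo=0.0, hi=1.0):
    """Normalized histogram of values in [lo, hi] over equal-width bins."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up data: fraction of the context budget in use when compression fired.
synthesized_triggers = [0.45, 0.55, 0.62, 0.70, 0.72, 0.75]
real_triggers = [0.50, 0.65, 0.85, 0.90, 0.92, 0.95]

p = histogram(synthesized_triggers, bins=5)
q = histogram(real_triggers, bins=5)
tv = total_variation(p, q)
print(f"total variation distance: {tv:.3f}")
```

A distance near 0 would suggest the synthesized triggers resemble real ones on this statistic; a value near 1 would suggest a mismatch. The same comparison could be repeated for compression granularity or retained-context size.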

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract and §3 (Training Procedure)] The headline performance numbers rest on the assumption that synthesized proactive memory-management trajectories produce state distributions, trigger timings, and compression granularities sufficiently close to those arising in real agent rollouts. No quantitative validation (e.g., distributional distances, trigger-condition histograms, or retained-context statistics) is supplied to support this match, which is load-bearing for the transferability of the learned policy.

    Authors: We acknowledge that the manuscript does not include quantitative comparisons (such as distributional distances or histograms) between the synthesized trajectories and real agent rollouts. The synthesis procedure was designed to mirror realistic trigger conditions and compression decisions based on task state and budget, and the strong empirical results on SWE-Bench Verified provide indirect support for transfer. However, we agree that explicit validation would strengthen the claims. We will add retained-context statistics, trigger-condition histograms, and a brief distributional comparison in the revised §3.
    Revision: yes

  2. Referee: [§3 (Memory-aware GRPO)] Memory-aware GRPO is described as jointly optimizing memory actions and resolution via trajectory splitting and step-level credit assignment, yet the precise form of the memory-aware reward and the mechanism that prevents degenerate policies (e.g., always compressing to the minimum budget) are not stated explicitly enough to verify correctness of the credit assignment.

    Authors: We will clarify the exact formulation. The memory-aware reward is defined as the sum of a binary resolution indicator and a normalized token-efficiency term, with trajectory splitting performed at memory-action boundaries to enable step-level advantage estimation. Degenerate policies are discouraged by an entropy bonus on memory decisions and a minimum-context baseline that penalizes over-compression when task-relevant information remains. We will expand §3 with the full reward equation and the explicit anti-degeneracy terms. revision: yes
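
The simulated response to point 2 describes the reward and the splitting only in words. The Python sketch below shows one way those pieces might fit together, under stated assumptions: the efficiency term is taken to be `1 - tokens_used / token_budget` clipped to [0, 1], memory actions are marked by the hypothetical action name `"compress"`, and advantages are standardized within a rollout group in the usual GRPO style. The entropy bonus and minimum-context baseline mentioned in the response are not modeled here, and nothing below is the paper's actual equation.

```python
from statistics import mean, pstdev


def reward(resolved: bool, tokens_used: int, token_budget: int) -> float:
    """Binary resolution indicator plus a normalized token-efficiency term.

    The form of the efficiency term is an assumption for illustration.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    efficiency = max(0.0, min(1.0, 1.0 - tokens_used / token_budget))
    return float(resolved) + efficiency


def split_at_memory_actions(actions: list[str]) -> list[list[str]]:
    """Cut a trajectory into segments, each ending at a memory action."""
    segments, current = [], []
    for action in actions:
        current.append(action)
        if action == "compress":  # hypothetical name for the memory action
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize each reward within its rollout group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Two made-up rollouts of the same task: (resolved, tokens used, budget).
rewards = [reward(True, 40_000, 100_000), reward(False, 90_000, 100_000)]
advantages = group_relative_advantages(rewards)
segments = split_at_memory_actions(
    ["read", "edit", "compress", "test", "compress", "submit"]
)
print(rewards, advantages, segments)
```

One plausible use, again an assumption, is to treat each segment as its own training sample whose context starts from the compressed history. Note that with this reward alone, a policy that compresses aggressively still gains efficiency credit on failed rollouts, which is the degenerate behaviour the referee asks the authors to rule out.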

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical training framework (synthesized trajectories + Memory-aware GRPO) whose central claims are resolve rates on the external SWE-Bench Verified benchmark. No equations, self-definitional relations, fitted-input-as-prediction steps, or load-bearing self-citations appear in the supplied text. Performance numbers are measured against an independent test set rather than being forced by construction from the training synthesis procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the benchmark results generalize beyond the reported models and dataset.

pith-pipeline@v0.9.1-grok · 5726 in / 1163 out tokens · 26274 ms · 2026-06-30T01:33:57.018081+00:00 · methodology


Reference graph

Works this paper leans on

82 extracted references · 49 canonical work pages · 12 internal anchors

  1. [1]

    Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents.CoRR, abs/2505.20411, 2025

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei An- driushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents.CoRR, abs/2505.20411, 2025

  2. [2]

    How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

    Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, and Jiaxin Pei. How do AI agents spend your money? analyzing and predicting token consumption in agentic coding tasks.CoRR, abs/2604.22750, 2026. 10

  3. [3]

    Bytedance-seed/seed-oss-36b-instruct

    ByteDance Seed. Bytedance-seed/seed-oss-36b-instruct. Hugging Face model card, 2025. URLhttps: //huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond’e de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavaria...

  5. [5]

    Swe-universe: Scale real-world verifiable environments to millions.CoRR, abs/2602.02361, 2026

    Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma, Hao Ge, Zongmeng Zhang, Zeyu Cui, Dayiheng Liu, Jingren Zhou, Jianling Sun, Junyang Lin, and Binyuan Hui. Swe-universe: Scale real-world verifiable environments to millions.CoRR, abs/2602.02361, 2026

  6. [6]

    Chatunitest: A framework for llm-based test generation

    Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 572–576, 2024

  7. [7]

    Cl-bench: A benchmark for context learning

    Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, and Shunyu Yao. Cl-bench: A benchmar...

  8. [8]

    Huerta, and Hao Peng

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. InFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 23281–23298. Assoc...

  9. [9]

    Codebert: A pre-trained model for programming and natural languages

    ZhangyinFeng, DayaGuo, DuyuTang, NanDuan, XiaochengFeng, MingGong, LinjunShou, BingQin, TingLiu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, Findings ...

  10. [10]

    Agentswing: Adaptiveparallelcontextmanagementroutingforlong-horizon web agents.CoRR, abs/2603.27490, 2026

    Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Runnan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Yong Jiang, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, ZuozhuLiu, andJingrenZhou. Agentswing: Adaptiveparallelcontextmanagementroutingforlong-horizon web agents.CoRR, abs/2603.27490, 2026

  11. [11]

    Shuzheng Gao, Xin-Cheng Wen, Cuiyun Gao, Wenxuan Wang, Hongyu Zhang, and Michael R. Lyu. What makes good in-context demonstrations for code intelligence tasks with llms? In38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11–15, 2023, pages 761–773. IEEE, 2023. doi: 10.1109/ASE56229.2023.00109. URLht...

  12. [12]

    Shuzheng Gao, Cuiyun Gao, Wenchao Gu, and Michael R. Lyu. Search-based llms for code optimization. In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 – May 6, 2025, pages 578–590. IEEE, 2025. doi: 10.1109/ICSE55347.2025.00021. URLhttps://doi.org/10. 1109/ICSE55347.2025.00021

  13. [13]

    Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and Michael R. Lyu. The prompt alchemist: Automated llm-tailored prompt optimization for test case generation.CoRR, abs/2501.01329, 2025. doi: 10.48550/ARXIV.2501.01329. URLhttps://doi.org/10.48550/arXiv.2501.01329

  14. [14]

    Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, and Michael R. Lyu. SEER: enhancing chain-of-thought code generation through self-exploring deep reasoning.CoRR, abs/2510.17130, 2025. doi: 10.48550/ARXIV.2510. 17130. URLhttps://doi.org/10.48550/arXiv.2510.17130. 11

  15. [15]

    Contextpilot: Code context engineering with memory-augmented exploration agents

    Shuzheng Gao, Chaozheng Wang, Shuqing Li, Yun Peng, and Michael R Lyu. Contextpilot: Code context engineering with memory-augmented exploration agents. 2026

  16. [16]

    Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svy- atkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. Graphcodebert: Pre-training code representations with data flow. In 9th International Conference on Learning Represen...

  17. [17]

    Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks

    Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, and Zibin Zheng. Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks. CoRR, abs/2506.10954, 2025

  18. [18]

    Swe-swiss: A multi-task fine-tuning and rl recipe for high-performance issue resolution.https://www.notion.so/21e174dedd4880ea829ed4c861c44f88, 2025

    Zhenyu He, Qingping Yang, Wei Sheng, Xiaojian Zhong, Kechi Zhang, Chenxin An, Wenlei Shi, Tianle Cai, Di He, Jiaze Chen, and Jingjing Xu. Swe-swiss: A multi-task fine-tuning and rl recipe for high-performance issue resolution.https://www.notion.so/21e174dedd4880ea829ed4c861c44f88, 2025. Notion Blog

  19. [19]

    In Line with Context: Repository-Level Code Generation via Context Inlining

    Chao Hu, Wenhao Zeng, Yuling Shi, Beijun Shen, and Xiaodong Gu. In line with context: Repository-level code generation via context inlining.CoRR, abs/2601.00376, 2026. doi: 10.48550/ARXIV.2601.00376. URL https://doi.org/10.48550/arXiv.2601.00376

  20. [20]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Pro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13358–13376...

  21. [21]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  22. [22]

    Swe-debate: Competitive multi-agent debate for software issue resolution.CoRR, abs/2507.23348, 2025

    Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. Swe-debate: Competitive multi-agent debate for software issue resolution.CoRR, abs/2507.23348, 2025. doi: 10.48550/ARXIV.2507.23348. URLhttps://doi.org/10.48550/arXiv.2507.23348

  23. [23]

    Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R’emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, P...

  24. [24]

    Swe-next: Scalable real-world software engineering tasks for agents.CoRR, abs/2603.20691, 2026

    Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, and Wenhu Chen. Swe-next: Scalable real-world software engineering tasks for agents.CoRR, abs/2603.20691, 2026

  25. [25]

    arXiv preprint arXiv:2512.22087 , year=

    Shukai Liu, Jian Yang, Bo Jiang, Yizhi Li, Jinyang Guo, Xianglong Liu, and Bryan Dai. Context as a tool: Context management for long-horizon swe-agents.CoRR, abs/2512.22087, 2025

  26. [26]

    Scaling LLM multi-turn RL with end-to-end summarization-based context management.CoRR, abs/2510.06727, 2025

    Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, and Jiecao Chen. Scaling LLM multi-turn RL with end-to-end summarization-based context management.CoRR, abs/2510.06727, 2025

  27. [27]

    Feng Luo, Kexing Ji, Cuiyun Gao, Shuzheng Gao, Jia Feng, Kui Liu, Xin Xia, and Michael R. Lyu. Integrating rules and semantics for llm-based c-to-rust translation. InIEEE International Conference on Software Mainte- nance and Evolution, ICSME 2025, Auckland, New Zealand, September 7–12, 2025, pages 685–696. IEEE, 2025. doi: 10.1109/ICSME64153.2025.00069. ...

  28. [28]

    Veomni: Scaling any modality model training with model-centric distributed recipe zoo

    Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, and Xin Liu. Veomni: Scaling any modality model training with model- centric distributed recipe zoo.CoRR, abs/2508.02317, 2025. doi: 10.48550/ARXIV.2508.02317. URLhttps: //doi.org/10.48550/arXiv.2508.02317

  29. [29]

    Training softwareengineeringagentsandverifierswithswe-gym

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training softwareengineeringagentsandverifierswithswe-gym. InAartiSingh, MaryamFazel, DanielHsu, SimonLacoste- Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Van...

  30. [30]

    InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24)

    YunPeng, ShuzhengGao, CuiyunGao, YintongHuo, andMichaelR.Lyu. Domainknowledgematters: Improving prompts with fix templates for repairing python type errors. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14–20, 2024, pages 4:1–4:13. ACM, 2024. doi: 10.1145/3597503.3608132. URLhttps:/...

  31. [31]

    Locobench-agent: An interactive benchmark for LLM agents in long-context software engineering.CoRR, abs/2511.13998, 2025

    JielinQiu, ZuxinLiu, ZhiweiLiu, RitheshMurthy, JianguoZhang, HaolinChen, ShiyuWang, MingZhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, and Huan Wang. Locobench-agent: An interactive benchmark for LLM agents in long-contex...

  32. [32]

    Unsupervised translation of programming languages

    Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur...

  33. [33]

    HybridFlow: A Flexible and Efficient RLHF Framework , url=

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. InProceedings of the Twentieth European Con- ference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM, 2025. doi: 10.1145/36...

  34. [34]

    From code to correctness: Closing the last mile of code generation with hierarchical debugging.CoRR, abs/2410.01215, 2024

    Yuling Shi, Songsong Wang, Chengcheng Wan, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging.CoRR, abs/2410.01215, 2024. doi: 10.48550/ARXIV.2410.01215. URLhttps://doi.org/10.48550/arXiv.2410.01215

  35. [35]

    Longcodezip: Compress long context for code language models

    Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Longcodezip: Compress long context for code language models. In40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025, Seoul, Korea, Republic of, November 16-20, 2025, pages 141–153. IEEE, 2025. doi: 10.1109/ASE63991. 2025.00020. URLhttps://doi.org/10.1109/ASE...

  36. [36]

    W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, and Junxian He

    KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, and Junxian He. SWE-RM: execution-free feedback for software engineering agents.CoRR, abs/2512.21919, 2025

  37. [37]

    Swe-master: Unleashing the potential of software engineering agents via post-training.CoRR, abs/2602.03411, 2026

    Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng, Guoxin Chen, Yiwen Hu, Zongchao Chen, Yiming Jia, Wayne Xin Zhao, Yang Song, Tao Zhang, and Ji-Rong Wen. Swe-master: Unleashing the potential of software engineering agents via post-training.CoRR, abs/2602.03411, 2026

  38. [38]

    Bugpilot: Complex bug generation for efficient learning of SWE skills.CoRR, abs/2510.19898, 2025

    Atharv Sonwane, Isadora White, Hyunji Lee, Matheus Pereira, Lucas Caccia, Minseon Kim, Zhengyan Shi, Chinmay Singh, Alessandro Sordoni, Marc-Alexandre Côté, and Xingdi Yuan. Bugpilot: Complex bug generation for efficient learning of SWE skills.CoRR, abs/2510.19898, 2025

  39. [39]

    U-fold: Dynamic intent-aware context folding for user-centric agents.CoRR, abs/2601.18285, 2026

    Jin Su, Runnan Fang, Yeqiu Li, Xiaobin Wang, Shihao Cai, Pengjun Xie, Ningyu Zhang, and Fajie Yuan. U-fold: Dynamic intent-aware context folding for user-centric agents.CoRR, abs/2601.18285, 2026

  40. [40]

    Swe-world: Building software engineering agents in docker-free environments.CoRR, abs/2602.03419, 2026

    Shuang Sun, Huatong Song, Lisheng Huang, Jinhao Jiang, Ran Le, Zhihao Lv, Zongchao Chen, Yiwen Hu, Wenyang Luo, Wayne Xin Zhao, Yang Song, Hongteng Xu, Tao Zhang, and Ji-Rong Wen. Swe-world: Building software engineering agents in docker-free environments.CoRR, abs/2602.03419, 2026

  41. [41]

    arXiv preprint arXiv:2510.11967 , year=

    Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

  42. [42]

    Swe-lego: Pushing the limits of supervised fine-tuning for software issue resolving.CoRR, abs/2601.01426, 2026

    Chaofan Tao, Jierun Chen, Yuxin Jiang, Kaiqi Kou, Shaowei Wang, Ruoyu Wang, Xiaohui Li, Sidi Yang, Yiming Du, Jianbo Dai, Zhiming Mao, Xinyu Wang, Lifeng Shang, and Haoli Bai. Swe-lego: Pushing the limits of supervised fine-tuning for software issue resolving.CoRR, abs/2601.01426, 2026

  43. [43]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.CoRR, abs/2505.09388, 2025

  44. [44]

    Deepswe.https://www.together.ai/blog/deepswe, 2025

    Together AI. Deepswe.https://www.together.ai/blog/deepswe, 2025. Together AI Blog

  45. [45]

    Swe-dev: Building software engineering agents with training and inference scaling

    Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. Swe-dev: Building software engineering agents with training and inference scaling. InFindings of the Association for Computational Linguistics, ACL 2025, pages 3742–3761. Association for Computational Linguistics, 2025. URLhttps://aclanthology.org/2025. findings-acl.193/

  46. [46]

    Swe-mirror: Scaling issue- resolving datasets by mirroring issues across repositories.CoRR, abs/2509.08724, 2025

    Junhao Wang, Daoguang Zan, Shulin Xin, Siyao Liu, Yurong Wu, and Kai Shen. Swe-mirror: Scaling issue- resolving datasets by mirroring issues across repositories.CoRR, abs/2509.08724, 2025. 13

  47. [47]

    Klear-agentforge: Forging agentic intelligence through posttraining scaling.CoRR, abs/2511.05951, 2025

    Qi Wang, Hongzhi Zhang, Jia Fu, Kai Fu, Yahui Liu, Tinghai Zhang, Chenxi Sun, Gangwei Jiang, Jingyi Tang, Xingguang Ji, Yang Yue, Jingyuan Zhang, Fuzheng Zhang, Kun Gai, and Guorui Zhou. Klear-agentforge: Forging agentic intelligence through posttraining scaling.CoRR, abs/2511.05951, 2025

  48. [48]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, and et al. Openhands: An open platform for AI software developers as generalist agents. InThe...

  49. [49]

    Joty, and Steven C

    Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decodermodelsforcodeunderstandingandgeneration. InMarie-FrancineMoens, XuanjingHuang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Even...

  50. [50]

    SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

    Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, and Xiaodong Gu. Swe-pruner: Self-adaptive context pruning for coding agents.CoRR, abs/2601.16746, 2026

  51. [51]

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. SWE-RL: advancing LLM reasoning via reinforcement learning on open software evolution.CoRR, abs/2502.18449, 2025

  52. [52]

    arXiv preprint arXiv:2509.13313 , year=

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

  53. [53]

    Evaluating the impact of experimental assumptions in automated fault localization,

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre- trained language models. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 1482–1494. IEEE, 2023. doi: 10.1109/ICSE48619.2023.00129. URL https://doi.org/10.1109/ICSE48619.2023.00129

  54. [54]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.CoRR, abs/2407.01489, 2024

  55. [55]

    Reducing cost of llm agents with trajectory reduction

    Yuan-An Xiao, Pengfei Gao, Chao Peng, and Yingfei Xiong. Reducing cost of llm agents with trajectory reduction. CoRR, abs/2509.23586, 2025

  56. [56]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Sys...

  57. [57]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.CoRR, abs/2210.03629, 2023

  58. [58]

    arXiv preprint arXiv:2510.24699 , year=

    RuiYe, ZhongwangZhang, Kuan Li, HuifengYin, ZhengweiTao, Yida Zhao, LiangcaiSu, Liwen Zhang, ZileQiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context management.CoRR, abs/2510.24699, 2025

  59. [59]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  60. [60]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    doi: 10.48550/ARXIV.2503.14476. URL https://doi.org/10.48550/arXiv.2503.14476

  61. [61]

    Pruning the unsurprising: Efficient code reasoning via first-token surprisal

    Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, and Xiaodong Gu. Pruning the unsurprising: Efficient code reasoning via first-token surprisal. CoRR, abs/2508.05988, 2025. doi: 10.48550/ARXIV.2508.05988. URL https://doi.org/10.48550/arXiv.2508.05988

  62. [62]

    Readability-robust code summarization via meta curriculum learning

    Wenhao Zeng, Yitian Chai, Hao Zhou, Fandong Meng, Jie Zhou, and Xiaodong Gu. Readability-robust code summarization via meta curriculum learning. CoRR, abs/2601.05485, 2026. doi: 10.48550/ARXIV.2601.05485. URL https://doi.org/10.48550/arXiv.2601.05485

  63. [63]

    GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

    Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, and Xiaodong Gu. Glimprouter: Efficient collaborative inference by glimpsing one token of thoughts. CoRR, abs/2601.05110, 2026. doi: 10.48550/ARXIV.2601.05110. URL https://doi.org/10.48550/arXiv.2601.05110

  64. [64]

    DietCode: Automatic optimization for dynamic tensor programs

    Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko. Dietcode: Automatic optimization for dynamic tensor programs. In Diana Marculescu, Yuejie Chi, and Carole-Jean Wu, editors, Proceedings of the Fifth Conference on Machine Learning and Systems, MLSys 2022, Santa Clara, C...

  65. [65]

    Lengthy code execution outputs with minimal relevance to subsequent steps

  66. [66]

    Large file openings or directory listings where only limited information is utilized later

  67. [67]

    Consecutive rounds completing a phase where intermediate results have no bearing on subsequent processes

  68. [68]

    Steps from many rounds earlier that have little impact on the most recent round

  69. [69]

    **Rounds 1-5** (Phase 1)

    Multiple adjacent existing compressed segments that can be combined

    **Part 4: Content Importance Assessment**

    For your selected range, assess the importance level of each piece of content:

    - **Need preserve verbatim**: Content that might be referenced by future steps, such as test cases revealing the bug, error traces showing root cause, key code snippets...

  70. [70]


    reflection_identifies_process_problem_and_future_fix_plan: The reflection must identify real problems, weaknesses, mistakes, risks, or gaps in the current problem-solving process itself, and it must propose how the later work should change because of that. If the reflection only says progress is fine, a phase is complete, tests passed, or that it wants to...

  71. [71]


    source_selection_is_good: The selected compressed range is worth compressing. It should not be too short in actual content volume. A small round span is still acceptable if the source itself is long or dense, such as long logs or detailed debugging content

  72. [72]

    compressed_content_preserves_source_critical_information: The produced compressed content preserves the source-critical information for this specific source range. Judge this relative to the actual source content: for logs, preserving the key signals and conclusions may be enough; for analysis, implementation, or verification rounds, the summary should pr...

  73. [73]


    compression_plan_matches_actual_compressed_content: This must be judged strictly. The pre-tool analysis, the claimed compression range, and the actual compressed content must point to the same material. If the analysis says it will compress one part but the result actually summarizes another part, or if the summary mixes in out-of-range material, or if it...

  74. [74]


    future_plan_is_specific_and_actionable: The future plan in the final parameter/progress-summary style content must be concrete and actionable, not generic phase language. It should name specific next checks or actions, such as running tests for a particular file/function/bug scenario, checking a specific behavior, or verifying a specific risk. Generic sta...

  75. [75]

    no_bad_assistant_content: Assistant messages should NOT be bad content such as obvious repeated output, repetitive low-value text, malformed garbage, bizarre or corrupted characters, broken formatting artifacts, or other clearly abnormal content that should not be learned

  76. [76]


    post_reflection_actions_follow_future_plan: After the reflection/compression user message, the later assistant actions should substantially follow the future plan implied by that reflection and compressed summary. It is okay if the agent adapts slightly, but it should not ignore the future plan and jump to unrelated work

  77. [77]

    trajectory_type

    no_unnecessary_repeated_actions_after_compression: After the reflection/compression user message, there should NOT be clearly unnecessary repeated actions caused by forgetting compressed context. For example, if the earlier trajectory already established some file contents, logs, test results, or conclusions, the later assistant should not redundantly red...

  78. [78]


    **Deep root-cause analysis**:
    - Re-examine `django/db/models/sql/where.py`, especially the loop from lines ~79--105 where children are compiled and counts of "full" vs "empty" nodes determine whether to raise or return an empty string (see context from round 27--28).
    - Trace how a filter like `~Exists(MyModel.objects.none())` plus another filter (`name='test...

  79. [79]

    0 = 1"`) rather than triggering an exception; or - The WHERE-node logic in`django/db/models/sql/where.py`so that when all children are effectively

    **Implement a minimal fix**:
    - Based on that analysis, adjust either:
      - `Exists.as_sql` / `Subquery.as_sql` in `django/db/models/expressions.py` (rounds 15--19) so that a negated empty EXISTS behaves like a known false predicate (e.g., `"0 = 1"`) rather than triggering an exception; or
      - The WHERE-node logic in `django/db/models/sql/where.py` so that when all chil...

  80. [80]

    all tests passed

    **Verification**:
    - Re-run `/testbed/reproduce_issue.py` to confirm that:
      - `qs.query.as_sql()` no longer raises `EmptyResultSet`.
      - A query such as `MyModel.objects.filter(~models.Exists(MyModel.objects.none()), name='test')` yields an empty result when evaluated (`list(qs)` returns an empty list).
    - Run targeted tests via `python tests/runtests.py`:
      - At least a...
