SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents
Pith reviewed 2026-06-30 01:33 UTC · model grok-4.3
The pith
Agents trained with SWE-MeM learn to decide when, what, and how to compress their own memory during long software engineering tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE-MeM gives agents a flexible memory tool and trains them on synthesized proactive trajectories with Memory-aware GRPO, so that compression decisions become part of the agent's own policy. Jointly optimizing memory management and issue resolution, through memory-aware trajectory splitting and step-level credit assignment, yields higher resolve rates and lower token consumption than existing memory baselines on SWE-Bench Verified.
What carries the argument
A flexible memory tool that lets the agent decide compression timing, content, and granularity on demand, trained end-to-end with Memory-aware GRPO that performs memory-aware trajectory splitting and step-level credit assignment.
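A minimal sketch of what such a memory tool could look like, assuming a round-indexed history and a whitespace token count. The names `Round`, `Context`, and `compress` are hypothetical and do not come from the paper; the point is only that timing, range, and summary detail are all chosen by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Round:
    """One agent/environment interaction round (hypothetical structure)."""
    index: int
    text: str


@dataclass
class Context:
    """Interaction history plus a crude token budget (whitespace tokens)."""
    budget: int
    rounds: list[Round] = field(default_factory=list)

    def tokens_used(self) -> int:
        return sum(len(r.text.split()) for r in self.rounds)

    def remaining(self) -> int:
        return self.budget - self.tokens_used()


def compress(ctx: Context, start: int, end: int, summary: str) -> Context:
    """Replace rounds start..end (inclusive, by index) with one summary round.

    The caller picks the timing (when it calls this), the content (which
    range), and the granularity (how detailed `summary` is).
    """
    if start > end:
        raise ValueError("start must not exceed end")
    kept_before = [r for r in ctx.rounds if r.index < start]
    kept_after = [r for r in ctx.rounds if r.index > end]
    merged = Round(index=start, text=f"[compressed {start}-{end}] {summary}")
    return Context(budget=ctx.budget, rounds=kept_before + [merged] + kept_after)


# Illustrative history: a noisy listing followed by two informative rounds.
ctx = Context(budget=50, rounds=[
    Round(1, "ls output " + "file.py " * 20),
    Round(2, "pytest trace: AssertionError in test_parse"),
    Round(3, "opened parser.py and located the off-by-one"),
])
smaller = compress(ctx, 1, 1, "repo listing; parser.py is the relevant file")
print(ctx.remaining(), smaller.remaining())
```

In this sketch the noisy directory listing is summarized while the error trace and the localization round are kept verbatim, which is the kind of selective decision the trained policy is meant to make on its own.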
If this is right
- Agents can exceed the performance of static compression workflows while consuming fewer tokens overall.
- Memory management becomes an integrated part of the agent's policy rather than an external post-processing step.
- The same training recipe works at both reported model sizes, 4B and 30B parameters.
- Joint optimization of memory actions and task success improves resolve rate and token efficiency at the same time.
Where Pith is reading between the lines
- The same adaptive compression approach could apply to other long-context agent domains such as multi-step planning or extended dialogue.
- Lower token budgets per task could translate directly into reduced inference cost when agents are deployed at scale.
- If agents can generate their own high-quality memory trajectories, the method might support iterative self-improvement without new human data.
Load-bearing premise
Synthesized proactive memory-management trajectories used in training are representative enough of real agent trajectories to produce effective compression decisions at test time.
What would settle it
Measure whether the trained agent's compression decisions on held-out long-horizon tasks produce the same resolve-rate gains when the test trajectories differ substantially from the synthesized training set.
Original abstract
Long-horizon software engineering agents often need to manage lengthy and noisy interaction histories under limited context budgets. Existing memory management methods typically rely on static compression workflows or impose rigid constraints on compression timing and granularity. Moreover, these approaches fail to jointly optimize memory management and issue resolution capabilities to improve performance while reducing token usage. We present SWE-MeM, a training framework for proactive and on-demand memory management in software engineering agents. SWE-MeM provides a flexible memory tool that lets agents decide when, what, and how to compress based on trajectory state, task progress, and remaining context budget. We train agents with synthesized proactive memory-management trajectories and Memory-aware GRPO, which jointly optimizes memory management and issue resolution through memory-aware trajectory splitting and step-level credit assignment. On SWE-Bench Verified, SWE-MeM achieves 43.4% and 60.2% resolve rate with 4B and 30B models, respectively, outperforming existing memory management baselines in both performance and efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-MeM, a training framework for long-horizon software engineering agents that enables proactive, on-demand memory management via a flexible memory tool. Agents are trained on synthesized proactive memory-management trajectories using Memory-aware GRPO, which performs memory-aware trajectory splitting and step-level credit assignment to jointly optimize compression decisions and issue resolution. On SWE-Bench Verified the method reports resolve rates of 43.4% (4B model) and 60.2% (30B model), outperforming prior memory-management baselines in both task success and token efficiency.
Significance. If the central empirical claims are supported by the experiments, the work would demonstrate that learned adaptive memory policies can simultaneously raise resolve rates and lower token consumption in long-horizon coding agents, addressing a practical bottleneck that static or rigidly scheduled compression methods have not resolved.
major comments (2)
- [Abstract and §3 (Training Procedure)] The headline performance numbers rest on the assumption that synthesized proactive memory-management trajectories produce state distributions, trigger timings, and compression granularities sufficiently close to those arising in real agent rollouts. No quantitative validation (e.g., distributional distances, trigger-condition histograms, or retained-context statistics) is supplied to support this match, which is load-bearing for the transferability of the learned policy.
- [§3 (Memory-aware GRPO)] Memory-aware GRPO is described as jointly optimizing memory actions and resolution via trajectory splitting and step-level credit assignment, yet the precise form of the memory-aware reward and the mechanism that prevents degenerate policies (e.g., always compressing to the minimum budget) are not stated explicitly enough to verify correctness of the credit assignment.
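The distributional check requested in the first major comment could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the trigger values are made-up numbers, the "fraction of context budget used at compression time" statistic is a hypothetical choice, and Jensen-Shannon divergence is just one reasonable distance.

```python
import math
from collections import Counter


def histogram(values, bins):
    """Normalized histogram of values in [0, 1] over `bins` equal-width bins."""
    counts = Counter(min(int(v * bins), bins - 1) for v in values)
    total = len(values)
    return [counts.get(b, 0) / total for b in range(bins)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded in [0, 1]."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical data: fraction of the context budget used when compression fires.
synthesized_triggers = [0.55, 0.60, 0.62, 0.70, 0.71, 0.80]
rollout_triggers = [0.58, 0.65, 0.66, 0.72, 0.85, 0.90]

p = histogram(synthesized_triggers, bins=5)
q = histogram(rollout_triggers, bins=5)
print(f"JS divergence: {js_divergence(p, q):.3f}")
```

A value near 0 would suggest the synthesized trigger timings resemble those of real rollouts; a value near 1 would suggest the two sets barely overlap. The same comparison could be repeated for compression granularity or retained-context size.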
minor comments (2)
- The abstract should report the exact model sizes and token budgets used by the strongest baselines so that the efficiency claims can be directly compared.
- [§2] Notation for the memory tool interface (when/what/how decisions) is introduced in the abstract but not formalized until later; a short table or diagram in §2 would improve readability.
Simulated Authors' Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.
- Referee: [Abstract and §3 (Training Procedure)] The headline performance numbers rest on the assumption that synthesized proactive memory-management trajectories produce state distributions, trigger timings, and compression granularities sufficiently close to those arising in real agent rollouts. No quantitative validation (e.g., distributional distances, trigger-condition histograms, or retained-context statistics) is supplied to support this match, which is load-bearing for the transferability of the learned policy.
  Authors: We acknowledge that the manuscript does not include quantitative comparisons (such as distributional distances or histograms) between the synthesized trajectories and real agent rollouts. The synthesis procedure was designed to mirror realistic trigger conditions and compression decisions based on task state and budget, and the strong empirical results on SWE-Bench Verified provide indirect support for transfer. However, we agree that explicit validation would strengthen the claims. We will add retained-context statistics, trigger-condition histograms, and a brief distributional comparison in the revised §3.
  Revision: yes
- Referee: [§3 (Memory-aware GRPO)] Memory-aware GRPO is described as jointly optimizing memory actions and resolution via trajectory splitting and step-level credit assignment, yet the precise form of the memory-aware reward and the mechanism that prevents degenerate policies (e.g., always compressing to the minimum budget) are not stated explicitly enough to verify correctness of the credit assignment.
  Authors: We will clarify the exact formulation. The memory-aware reward is defined as the sum of a binary resolution indicator and a normalized token-efficiency term, with trajectory splitting performed at memory-action boundaries to enable step-level advantage estimation. Degenerate policies are discouraged by an entropy bonus on memory decisions and a minimum-context baseline that penalizes over-compression when task-relevant information remains. We will expand §3 with the full reward equation and the explicit anti-degeneracy terms.
  Revision: yes
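The reward shape described in the second response can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the weight on the efficiency term, the normalization by a token budget, and the step names. The entropy bonus and minimum-context baseline mentioned in the response are omitted, and the standardization step is the usual group-relative form used by GRPO rather than the paper's exact estimator.

```python
from statistics import mean, pstdev


def memory_aware_reward(resolved, tokens_used, token_budget, weight=0.1):
    """Binary resolution indicator plus a normalized token-efficiency term.

    `weight` and the normalization by `token_budget` are assumptions.
    """
    efficiency = max(0.0, 1.0 - tokens_used / token_budget)
    return float(resolved) + weight * efficiency


def split_at_memory_actions(steps, memory_action="compress"):
    """Split a trajectory into segments, each ending at a memory action."""
    segments, current = [], []
    for step in steps:
        current.append(step)
        if step == memory_action:
            segments.append(current)
            current = []
    if current:  # trailing steps after the last memory action
        segments.append(current)
    return segments


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rollouts = [  # (resolved, tokens_used) for one hypothetical issue
    (True, 40_000), (True, 80_000), (False, 20_000), (False, 90_000),
]
rewards = [memory_aware_reward(r, t, token_budget=100_000) for r, t in rollouts]
advantages = group_relative_advantages(rewards)
segments = split_at_memory_actions(
    ["view", "run", "compress", "edit", "compress", "test"]
)
print(rewards, advantages, segments)
```

With a small weight, a resolved rollout always outranks an unresolved one, and token efficiency only breaks ties within each outcome. That ordering is one simple way to keep "compress everything" from being rewarded when it costs the resolution, though it is not necessarily the mechanism the authors use.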
Circularity Check
No circularity in derivation chain
The paper describes an empirical training framework (synthesized trajectories + Memory-aware GRPO) whose central claims are resolve rates on the external SWE-Bench Verified benchmark. No equations, self-definitional relations, fitted-input-as-prediction steps, or load-bearing self-citations appear in the supplied text. Performance numbers are measured against an independent test set rather than being forced by construction from the training synthesis procedure itself.
- Lengthy code execution outputs with minimal relevance to subsequent steps
- Large file openings or directory listings where only limited information is utilized later
- Consecutive rounds completing a phase where intermediate results have no bearing on subsequent processes
- Steps from many rounds earlier that have little impact on the most recent round
- Multiple adjacent existing compressed segments that can be combined

**Part 4: Content Importance Assessment**
For your selected range, assess the importance level of each piece of content:
- **Need preserve verbatim**: Content that might be referenced by future steps, such as test cases revealing the bug, error traces showing root cause, key code snippets...
-
[70]
If the reflection only says progress is fine, a phase is complete, tests passed, or that it wants to compress to save context, then this condition must be false
reflection_identifies_process_problem_and_future_fix_plan: The reflection must identify real problems, weaknesses, mistakes, risks, or gaps in the current problem-solving process itself, and it must propose how the later work should change because of that. If the reflection only says progress is fine, a phase is complete, tests passed, or that it wants to...
-
[71]
source_selection_is_good: The selected compressed range is worth compressing. It should not be too short in actual content volume. A small round span is still acceptable if the source itself is long or dense, such as long logs or detailed debugging content
-
[72]
compressed_content_preserves_source_critical_information: The produced compressed content preserves the source-critical information for this specific source range. Judge this relative to the actual source content: for logs, preserving the key signals and conclusions may be enough; for analysis, implementation, or verification rounds, the summary should pr...
-
[73]
compression_plan_matches_actual_compressed_content: This must be judged strictly. The pre-tool analysis, the claimed compression range, and the actual compressed content must point to the same material. If the analysis says it will compress one part but the result actually summarizes another part, or if the summary mixes in out-of-range material, or if it...
-
[74]
future_plan_is_specific_and_actionable: The future plan in the final parameter/progress-summary style content must be concrete and actionable, not generic phase language. It should name specific next checks or actions, such as running tests for a particular file/function/bug scenario, checking a specific behavior, or verifying a specific risk. Generic sta...
-
[75]
no_bad_assistant_content: Assistant messages should NOT be bad content such as obvious repeated output, repetitive low-value text, malformed garbage, bizarre or corrupted characters, broken formatting artifacts, or other clearly abnormal content that should not be learned
-
[76]
post_reflection_actions_follow_future_plan: After the reflection/compression user message, the later assistant actions should substantially follow the future plan implied by that reflection and compressed summary. It is okay if the agent adapts slightly, but it should not ignore the future plan and jump to unrelated work
-
[77]
no_unnecessary_repeated_actions_after_compression: After the reflection/compression user message, there should NOT be clearly unnecessary repeated actions caused by forgetting compressed context. For example, if the earlier trajectory already established some file contents, logs, test results, or conclusions, the later assistant should not redundantly red...
-
[78]
**Deep root-cause analysis**:
- Re-examine `django/db/models/sql/where.py`, especially the loop from lines ~79--105 where children are compiled and counts of "full" vs "empty" nodes determine whether to raise or return an empty string (see context from round 27--28).
- Trace how a filter like `~Exists(MyModel.objects.none())` plus another filter (`name='test...
-
[79]
**Implement a minimal fix**:
- Based on that analysis, adjust either:
  - `Exists.as_sql`/`Subquery.as_sql` in `django/db/models/expressions.py` (rounds 15--19) so that a negated empty EXISTS behaves like a known false predicate (e.g., `"0 = 1"`) rather than triggering an exception; or
  - The WHERE-node logic in `django/db/models/sql/where.py` so that when all chil...
-
[80]
**Verification**:
- Re-run `/testbed/reproduce_issue.py` to confirm that:
  - `qs.query.as_sql()` no longer raises `EmptyResultSet`.
  - A query such as `MyModel.objects.filter(~models.Exists(MyModel.objects.none()), name='test')` yields an empty result when evaluated (`list(qs)` returns an empty list).
- Run targeted tests via `python tests/runtests.py`:
  - At least a...
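The named judging criteria in entries [70] to [77] read like boolean checks on a single trajectory. A minimal sketch of how such verdicts might be combined follows. The criterion names are copied from those entries; the functions `failed_criteria` and `keep_trajectory`, and the rule that a trajectory is kept only when every criterion is true, are assumptions made for illustration and are not stated in the source.

```python
from typing import Mapping

# Criterion names as they appear in entries [70]-[77].
CRITERIA = (
    "reflection_identifies_process_problem_and_future_fix_plan",
    "source_selection_is_good",
    "compressed_content_preserves_source_critical_information",
    "compression_plan_matches_actual_compressed_content",
    "future_plan_is_specific_and_actionable",
    "no_bad_assistant_content",
    "post_reflection_actions_follow_future_plan",
    "no_unnecessary_repeated_actions_after_compression",
)


def failed_criteria(verdict: Mapping[str, bool]) -> list[str]:
    """Return the criteria that are missing or not strictly True, in rubric order."""
    return [name for name in CRITERIA if verdict.get(name) is not True]


def keep_trajectory(verdict: Mapping[str, bool]) -> bool:
    """Hypothetical filter (assumed rule): keep only if every criterion holds."""
    return not failed_criteria(verdict)
```

Treating a missing criterion as a failure is a conservative choice here: a judge output that omits a field is rejected rather than silently accepted.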