pith. machine review for the scientific record.

arxiv: 2605.12913 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: no theorem link

Revisiting DAgger in the Era of LLM-Agents


Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords DAgger · LLM agents · covariate shift · SWE-bench Verified · software engineering agents · multi-turn interaction · policy interpolation · supervised fine-tuning

The pith

DAgger with turn-level interpolation mitigates covariate shift in multi-turn LLM agents while retaining dense teacher supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that long-horizon LM agents face a core tradeoff: supervised fine-tuning supplies dense teacher labels but trains on off-policy trajectories that differ from deployment states, while reinforcement learning uses on-policy rollouts yet receives only sparse outcome signals. The paper's revived DAgger recipe addresses this by collecting trajectories via turn-level mixing of student and teacher actions, then training the student to match the teacher on those same trajectories. This exposes the model to realistic states likely to arise during actual use while still supplying rich step-by-step guidance. The approach is demonstrated on software-engineering agents at 4B and 8B scales, where it improves over the strongest post-training baseline by +3.9 and +3.6 points respectively on SWE-bench Verified, with consistent gains on a held-out split.

Core claim

Collecting trajectories through turn-level interpolation of student and teacher policies, then training the student by mimicking the teacher on those trajectories, allows the model to encounter realistic environment states while still receiving dense supervision, thereby mitigating covariate shift that arises in pure supervised fine-tuning of multi-turn LM agents.
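The collection loop behind this claim can be sketched in a few lines. The `env`, `student`, and `teacher` interfaces and the mixing probability `beta` below are illustrative assumptions, not the paper's actual API; the point is only the structure: one policy acts per turn, but the teacher's action is always recorded as the label.

```python
import random

def collect_mixed_trajectory(env, student, teacher, beta):
    """Sketch of DAgger-style data collection with turn-level mixing.

    Hypothetical interfaces: env.reset() -> state,
    env.step(action) -> (state, done), policy(state) -> action.
    `beta` is the probability of executing the teacher's action at a
    given turn. Whichever policy acts, the teacher's action is stored
    as the supervised label for that state: the aggregation step that
    supplies dense supervision on student-visited states.
    """
    state, done = env.reset(), False
    dataset = []  # (state, teacher_label) pairs for supervised training
    while not done:
        teacher_action = teacher(state)
        acted = teacher_action if random.random() < beta else student(state)
        dataset.append((state, teacher_action))
        state, done = env.step(acted)
    return dataset
```

With `beta` near 1 the rollout is almost pure teacher (ordinary SFT data); with `beta` near 0 the student drives and the teacher only labels, which is where the covariate-shift mitigation comes from.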

What carries the argument

Turn-level interpolation of student and teacher policies inside the DAgger loop, which generates mixed trajectories for subsequent supervised training on teacher labels.

Load-bearing premise

A reliable teacher policy remains available and affordable to query at every training step, and the environment can continue or reset after mixed student-teacher actions without creating new distribution shifts.

What would settle it

A direct comparison showing that training on purely student-generated trajectories or purely teacher-generated trajectories yields smaller gains on SWE-bench Verified than the interpolated version.

Figures

Figures reproduced from arXiv: 2605.12913 by Bo Dai, Changhao Li, Chao Zhang, Chenxiao Gao, Jiawei Huang, Niao He, Rushi Qiang.

Figure 2. Policy divergence under student-induced rollouts for the 4B student model. We report average token-level reverse KL DKL(πθ∥πe) on student-visited contexts; lower is better.
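The figure's metric, reverse KL between student and teacher next-token distributions, can be computed as in this minimal sketch; distributions are given as plain probability lists, a simplification of evaluating over a full vocabulary.

```python
import math

def reverse_kl(p_student, p_teacher):
    """Reverse KL D_KL(pi_theta || pi_e) between two next-token
    distributions given as aligned probability lists. The expectation
    is taken under the student, so the student is penalized for
    placing mass where the teacher places little."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_student, p_teacher) if p > 0)
```

The metric is zero when the two distributions agree and grows as the student concentrates probability off the teacher's support, which is why lower is better in the figure.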
Original abstract

Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Besides, since the student is learned by mimicking the teacher's behavior, it receives rich feedback during learning. To demonstrate DAgger enjoys the benefits of both worlds, we tested the algorithm to train a software-engineering agent with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper revisits DAgger for long-horizon LLM agents by collecting trajectories via turn-level interpolation between student and teacher policies, then training the student via supervised imitation on teacher labels. This is claimed to combine dense supervision (avoiding sparse RL rewards) with on-policy exposure to realistic states (mitigating covariate shift from pure SFT). Experiments on SWE-bench Verified report +3.9 point gains for a 4B model (reaching 27.3%) and +3.6 points for an 8B model (reaching 29.8%), outperforming several larger published baselines, with consistent gains on a held-out SWE-Gym split.

Significance. If the results hold, the work supplies a practical, low-overhead recipe for improving multi-turn agent training that avoids the full machinery of RL while still addressing distribution shift. The empirical outperformance of larger models by smaller DAgger-trained agents on a standard benchmark is noteworthy and could influence post-training pipelines for agentic LLMs.

major comments (2)
  1. [§4 (Experiments), Table 1] The reported +3.9 / +3.6 point gains on SWE-bench Verified are presented without error bars, multiple random seeds, or statistical tests; given that the central claim rests on these numeric improvements over the strongest baseline, the absence of variance estimates leaves the reliability of the result unclear.
  2. [§3.2 (DAgger for LM Agents)] The claim that turn-level interpolation mitigates covariate shift by exposing the student to realistic states assumes that a student action at turn t does not corrupt persistent environment state in a way that renders subsequent teacher actions off-distribution; no ablation, state-distribution metric, or continuity analysis is provided to support this assumption, which is load-bearing for the core argument.
minor comments (2)
  1. [Abstract and §4.1] The abstract and §4.1 should explicitly name the strongest post-training baseline and the exact interpolation probability schedule used, as these details are needed to reproduce the claimed gains.
  2. [§3.2] Notation for the interpolation probability (mentioned as a free parameter) is introduced without a clear equation or pseudocode block; adding a short algorithm box would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4 (Experiments), Table 1] The reported +3.9 / +3.6 point gains on SWE-bench Verified are presented without error bars, multiple random seeds, or statistical tests; given that the central claim rests on these numeric improvements over the strongest baseline, the absence of variance estimates leaves the reliability of the result unclear.

    Authors: We agree that variance estimates and statistical tests would improve the reliability assessment of the reported gains. In the revised manuscript we will rerun the key 4B and 8B experiments with three random seeds, report mean and standard deviation in Table 1, and include a brief statistical significance note (paired t-test against the strongest baseline). revision: yes

  2. Referee: [§3.2 (DAgger for LM Agents)] The claim that turn-level interpolation mitigates covariate shift by exposing the student to realistic states assumes that a student action at turn t does not corrupt persistent environment state in a way that renders subsequent teacher actions off-distribution; no ablation, state-distribution metric, or continuity analysis is provided to support this assumption, which is load-bearing for the core argument.

    Authors: The turn-level schedule ensures the teacher intervenes after every student action, limiting state drift to a single step; because the teacher then restores the trajectory toward its own distribution, subsequent states remain close to the teacher policy’s support. We will expand §3.2 with a short continuity argument and add an appendix figure comparing state-feature histograms (e.g., file-system and repository state embeddings) between pure-teacher and interpolated trajectories to quantify the limited divergence. revision: partial
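The variance reporting promised in response 1 reduces to a paired t-statistic over per-seed score differences. A minimal sketch, using hypothetical per-seed numbers (the abstract reports single-run scores only):

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t-statistic for per-seed benchmark scores of two systems:
    t = mean(d) / (stdev(d) / sqrt(n)) over per-seed differences d.
    Uses the sample standard deviation, as in the standard test."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

With hypothetical seed scores of `[27.0, 27.6, 27.3]` against a baseline fixed at `23.4`, the statistic comes out large and positive; with only three seeds, though, the test has little power, so reporting the raw mean and standard deviation matters as much as the p-value.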

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains from DAgger interpolation are independent of fitted inputs or self-citations

full rationale

The paper presents DAgger-style training via turn-level student-teacher policy interpolation as a method to mitigate covariate shift in long-horizon LM agents, with results reported as +3.9 and +3.6 point gains on the external SWE-bench Verified benchmark for 4B and 8B models. No load-bearing derivation reduces by construction to its own inputs: there are no equations defining a quantity in terms of itself, no parameters fitted to a data subset then renamed as a prediction, and no uniqueness theorems or ansatzes imported via self-citation chains. The original DAgger reference is external (Ross et al. 2011), and the central claim rests on empirical evaluation rather than internal algebraic equivalence. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard imitation-learning assumptions plus the practical availability of a teacher policy; no new entities are postulated.

free parameters (1)
  • turn-level interpolation probability
    The mixing ratio between student and teacher actions at each turn must be chosen or scheduled; its value is not stated in the abstract.
axioms (1)
  • domain assumption: A competent teacher policy exists that can label any encountered state.
    The method requires the teacher to provide correct actions on states visited by the mixed policy.
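For illustration, the free parameter above is often handled with the classic DAgger geometric schedule, beta_i = p^i with beta_0 = 1 (pure teacher on the first iteration, student taking over later). The decay value here is an assumption for the sketch; the paper's actual schedule is not stated in the abstract.

```python
def beta_schedule(iteration, decay=0.5):
    """Geometric decay for the turn-level interpolation probability:
    beta_i = decay ** i, so beta_0 = 1 (pure teacher early on) and
    student actions dominate as training proceeds. decay=0.5 is an
    illustrative choice, not a value taken from the paper."""
    return decay ** iteration
```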

pith-pipeline@v0.9.0 · 5627 in / 1380 out tokens · 47663 ms · 2026-05-14T19:50:59.965269+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 39 canonical work pages · 19 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  3. [3]

    Dream: Deep Research Evaluation with Agentic Metrics

    Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, et al. Dream: Deep research evaluation with agentic metrics.arXiv preprint arXiv:2602.18940, 2026

  4. [4]

    SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025

  5. [5]

    Qwen3-Coder-Next Technical Report

    Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

  6. [6]

    SkyRL-Agent: Efficient RL Training for Multi-Turn LLM Agent

    Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent.arXiv preprint arXiv:2511.16108, 2025

  7. [7]

    MARS: Modular Agent with Reflective Search for Automated AI Research

    Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. Mars: Modular agent with reflective search for automated ai research.arXiv preprint arXiv:2602.02660, 2026

  8. [8]

    Facilitating Multi-Turn Function Calling for LLMs via Compositional Instruction Tuning

    Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. Facilitating multi-turn function calling for llms via compositional instruction tuning.arXiv preprint arXiv:2410.12952, 2024

  9. [9]

    BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

  10. [10]

    Introducing SWE-bench Verified

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, et al. Introducing swe-bench verified. arXiv preprint arXiv:2407.01489, 2024

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  12. [12]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025

  13. [13]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=5h0qf7IBZZ

  14. [14]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  15. [15]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yifan Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  16. [16]

    Large Language Models for Software Engineering: A Systematic Literature Review

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

  17. [17]

    Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

    Jiawei Huang, Qingping Yang, Renjie Zheng, and Jiaze Chen. Beyond verifiable rewards: Rubric-based grm for reinforced fine-tuning swe agents.arXiv preprint arXiv:2604.16335, 2026

  18. [18]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  19. [19]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  20. [20]

    Imitation Learning for Multi-Turn LM Agents via On-Policy Expert Corrections

    Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, and Jeff Da. Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

  21. [21]

    Large Language Model-Based Agents for Software Engineering: A Survey

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. Large language model-based agents for software engineering: A survey.arXiv preprint arXiv:2409.02977, 2024

  22. [22]

    On-Policy Distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

  23. [23]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  24. [25]

    Training Software Engineering Agents and Verifiers with SWE-Gym

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym.arXiv preprint arXiv:2412.21139, 2024

  25. [26]

    MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

    Rushi Qiang, Yuchen Zhuang, Yinghao Li, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai, et al. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering.arXiv preprint arXiv:2505.07782, 2025

  26. [27]

    MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

    Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, and Bo Dai. Mle-smith: Scaling mle tasks with automated multi-agent pipeline.arXiv preprint arXiv:2510.07307, 2025

  27. [28]

    Efficient reductions for imitation learning

    Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010

  28. [29]

    Reinforcement and Imitation Learning via Interactive No-Regret Learning

    Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning.arXiv preprint arXiv:1406.5979, 2014

  29. [30]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  30. [31]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  31. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [33]

    Swe-dev: Building software engineering agents with training and inference scaling

    Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. Swe-dev: Building software engineering agents with training and inference scaling. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3742–3761, 2025

  33. [34]

    Software Testing with Large Language Models: Survey, Landscape, and Vision

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering, 50(4):911–936, 2024

  34. [35]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

  35. [36]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  36. [37]

    SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution.arXiv preprint arXiv:2502.18449, 2025

  37. [38]

    Automated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023

  38. [39]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  39. [40]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  40. [41]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents.arXiv preprint arXiv:2504.21798, 2025

  41. [42]

    Reinforcement Learning for Machine Learning Engineering Agents

    Sherry Yang, Joy He-Yueya, and Percy Liang. Reinforcement learning for machine learning engineering agents.arXiv preprint arXiv:2509.01684, 2025

  42. [43]

    DCPO: Dynamic Clipping Policy Optimization

    Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, and Rihui Xin. Dcpo: Dynamic clipping policy optimization.arXiv preprint arXiv:2509.02333, 2025

  43. [44]

    Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT

    Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt. arXiv preprint arXiv:2304.10778, 2023

  44. [45]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  45. [46]

    Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving.arXiv preprint arXiv:2504.02605, 2025

  46. [47]

    ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

    Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, et al. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents. arXiv preprint arXiv:2603.18815, 2026

  47. [48]

    CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges.arXiv preprint arXiv:2401.07339, 2024

  48. [49]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  49. [50]

    Training Versatile Coding Agents in Synthetic Environments

    Yiqi Zhu, Apurva Gandhi, and Graham Neubig. Training versatile coding agents in synthetic environments. arXiv preprint arXiv:2512.12216, 2025
