pith. machine review for the scientific record.

arxiv: 2605.12913 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: no theorem link

Revisiting DAgger in the Era of LLM-Agents


Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords DAgger · LLM agents · covariate shift · SWE-bench Verified · software engineering agents · multi-turn interaction · policy interpolation · supervised fine-tuning

The pith

DAgger with turn-level interpolation mitigates covariate shift in multi-turn LLM agents while retaining dense teacher supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that long-horizon LM agents face a core tradeoff: supervised fine-tuning supplies dense teacher labels but trains on off-policy trajectories that differ from deployment states, while reinforcement learning uses on-policy rollouts yet receives only sparse outcome signals. The paper's revived DAgger recipe addresses this by collecting trajectories via turn-level mixing of student and teacher actions, then training the student to match the teacher on those same trajectories. This exposes the model to realistic states likely to arise during actual use while still supplying rich step-by-step guidance. The approach is demonstrated on software-engineering agents at 4B and 8B scales, where it improves over the strongest post-training baseline by +3.9 and +3.6 points respectively on SWE-bench Verified, with consistent gains on a held-out split.

Core claim

Collecting trajectories through turn-level interpolation of student and teacher policies, then training the student by mimicking the teacher on those trajectories, allows the model to encounter realistic environment states while still receiving dense supervision, thereby mitigating covariate shift that arises in pure supervised fine-tuning of multi-turn LM agents.
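The collection loop behind this claim can be sketched in a few lines. The `env`, `student`, and `teacher` interfaces and the mixing probability `beta` below are illustrative assumptions, not the paper's actual API; the point is only the structure: one policy acts per turn, but the teacher's action is always recorded as the label.

```python
import random

def collect_mixed_trajectory(env, student, teacher, beta):
    """Sketch of DAgger-style data collection with turn-level mixing.

    Hypothetical interfaces: env.reset() -> state,
    env.step(action) -> (state, done), policy(state) -> action.
    `beta` is the probability of executing the teacher's action at a
    given turn. Whichever policy acts, the teacher's action is stored
    as the supervised label for that state: the aggregation step that
    supplies dense supervision on student-visited states.
    """
    state, done = env.reset(), False
    dataset = []  # (state, teacher_label) pairs for supervised training
    while not done:
        teacher_action = teacher(state)
        acted = teacher_action if random.random() < beta else student(state)
        dataset.append((state, teacher_action))
        state, done = env.step(acted)
    return dataset
```

With `beta` near 1 the rollout is almost pure teacher (ordinary SFT data); with `beta` near 0 the student drives and the teacher only labels, which is where the covariate-shift mitigation comes from.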

What carries the argument

Turn-level interpolation of student and teacher policies inside the DAgger loop, which generates mixed trajectories for subsequent supervised training on teacher labels.

Load-bearing premise

A reliable teacher policy remains available and affordable to query at every training step, and the environment can continue or reset after mixed student-teacher actions without creating new distribution shifts.

What would settle it

A direct comparison showing that training on purely student-generated trajectories or purely teacher-generated trajectories yields smaller gains on SWE-bench Verified than the interpolated version.

Figures

Figures reproduced from arXiv: 2605.12913 by Bo Dai, Changhao Li, Chao Zhang, Chenxiao Gao, Jiawei Huang, Niao He, Rushi Qiang.

Figure 2. Policy divergence under student-induced rollouts for the 4B student model. We report average token-level reverse KL DKL(πθ∥πe) on student-visited contexts; lower is better.
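The figure's metric, reverse KL between student and teacher next-token distributions, can be computed as in this minimal sketch; distributions are given as plain probability lists, a simplification of evaluating over a full vocabulary.

```python
import math

def reverse_kl(p_student, p_teacher):
    """Reverse KL D_KL(pi_theta || pi_e) between two next-token
    distributions given as aligned probability lists. The expectation
    is taken under the student, so the student is penalized for
    placing mass where the teacher places little."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_student, p_teacher) if p > 0)
```

The metric is zero when the two distributions agree and grows as the student concentrates probability off the teacher's support, which is why lower is better in the figure.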
Original abstract

Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Besides, since the student is learned by mimicking the teacher's behavior, it receives rich feedback during learning. To demonstrate DAgger enjoys the benefits of both worlds, we tested the algorithm to train a software-engineering agent with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper revisits DAgger for long-horizon LLM agents by collecting trajectories via turn-level interpolation between student and teacher policies, then training the student via supervised imitation on teacher labels. This is claimed to combine dense supervision (avoiding sparse RL rewards) with on-policy exposure to realistic states (mitigating covariate shift from pure SFT). Experiments on SWE-bench Verified report +3.9 point gains for a 4B model (reaching 27.3%) and +3.6 points for an 8B model (reaching 29.8%), outperforming several larger published baselines, with consistent gains on a held-out SWE-Gym split.

Significance. If the results hold, the work supplies a practical, low-overhead recipe for improving multi-turn agent training that avoids the full machinery of RL while still addressing distribution shift. The empirical outperformance of larger models by smaller DAgger-trained agents on a standard benchmark is noteworthy and could influence post-training pipelines for agentic LLMs.

major comments (2)
  1. [§4 (Experiments), Table 1] The reported +3.9 / +3.6 point gains on SWE-bench Verified are presented without error bars, multiple random seeds, or statistical tests; given that the central claim rests on these numeric improvements over the strongest baseline, the absence of variance estimates leaves the reliability of the result unclear.
  2. [§3.2 (DAgger for LM Agents)] The claim that turn-level interpolation mitigates covariate shift by exposing the student to realistic states assumes that a student action at turn t does not corrupt persistent environment state in a way that renders subsequent teacher actions off-distribution; no ablation, state-distribution metric, or continuity analysis is provided to support this assumption, which is load-bearing for the core argument.
minor comments (2)
  1. [Abstract and §4.1] The abstract and §4.1 should explicitly name the strongest post-training baseline and the exact interpolation probability schedule used, as these details are needed to reproduce the claimed gains.
  2. [§3.2] Notation for the interpolation probability (mentioned as a free parameter) is introduced without a clear equation or pseudocode block; adding a short algorithm box would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4 (Experiments), Table 1] The reported +3.9 / +3.6 point gains on SWE-bench Verified are presented without error bars, multiple random seeds, or statistical tests; given that the central claim rests on these numeric improvements over the strongest baseline, the absence of variance estimates leaves the reliability of the result unclear.

    Authors: We agree that variance estimates and statistical tests would improve the reliability assessment of the reported gains. In the revised manuscript we will rerun the key 4B and 8B experiments with three random seeds, report mean and standard deviation in Table 1, and include a brief statistical significance note (paired t-test against the strongest baseline). revision: yes

  2. Referee: [§3.2 (DAgger for LM Agents)] The claim that turn-level interpolation mitigates covariate shift by exposing the student to realistic states assumes that a student action at turn t does not corrupt persistent environment state in a way that renders subsequent teacher actions off-distribution; no ablation, state-distribution metric, or continuity analysis is provided to support this assumption, which is load-bearing for the core argument.

    Authors: The turn-level schedule ensures the teacher intervenes after every student action, limiting state drift to a single step; because the teacher then restores the trajectory toward its own distribution, subsequent states remain close to the teacher policy’s support. We will expand §3.2 with a short continuity argument and add an appendix figure comparing state-feature histograms (e.g., file-system and repository state embeddings) between pure-teacher and interpolated trajectories to quantify the limited divergence. revision: partial
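The variance reporting promised in response 1 reduces to a paired t-statistic over per-seed score differences. A minimal sketch, using hypothetical per-seed numbers (the abstract reports single-run scores only):

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t-statistic for per-seed benchmark scores of two systems:
    t = mean(d) / (stdev(d) / sqrt(n)) over per-seed differences d.
    Uses the sample standard deviation, as in the standard test."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

With hypothetical seed scores of `[27.0, 27.6, 27.3]` against a baseline fixed at `23.4`, the statistic comes out large and positive; with only three seeds, though, the test has little power, so reporting the raw mean and standard deviation matters as much as the p-value.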

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains from DAgger interpolation are independent of fitted inputs or self-citations

full rationale

The paper presents DAgger-style training via turn-level student-teacher policy interpolation as a method to mitigate covariate shift in long-horizon LM agents, with results reported as +3.9 and +3.6 point gains on the external SWE-bench Verified benchmark for 4B and 8B models. No load-bearing derivation reduces by construction to its own inputs: there are no equations defining a quantity in terms of itself, no parameters fitted to a data subset then renamed as a prediction, and no uniqueness theorems or ansatzes imported via self-citation chains. The original DAgger reference is external (Ross et al. 2011), and the central claim rests on empirical evaluation rather than internal algebraic equivalence. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard imitation-learning assumptions plus the practical availability of a teacher policy; no new entities are postulated.

free parameters (1)
  • turn-level interpolation probability
    The mixing ratio between student and teacher actions at each turn must be chosen or scheduled; its value is not stated in the abstract.
axioms (1)
  • domain assumption: A competent teacher policy exists that can label any encountered state.
    The method requires the teacher to provide correct actions on states visited by the mixed policy.
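For illustration, the free parameter above is often handled with the classic DAgger geometric schedule, beta_i = p^i with beta_0 = 1 (pure teacher on the first iteration, student taking over later). The decay value here is an assumption for the sketch; the paper's actual schedule is not stated in the abstract.

```python
def beta_schedule(iteration, decay=0.5):
    """Geometric decay for the turn-level interpolation probability:
    beta_i = decay ** i, so beta_0 = 1 (pure teacher early on) and
    student actions dominate as training proceeds. decay=0.5 is an
    illustrative choice, not a value taken from the paper."""
    return decay ** iteration
```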

pith-pipeline@v0.9.0 · 5627 in / 1380 out tokens · 47663 ms · 2026-05-14T19:50:59.965269+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 39 canonical work pages · 19 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  3. [3]

    Dream: Deep Research Evaluation with Agentic Metrics

    Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, et al. Dream: Deep research evaluation with agentic metrics.arXiv preprint arXiv:2602.18940, 2026

  4. [4]

    SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025

  5. [5]

    Qwen3-Coder-Next Technical Report

    Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

  6. [6]

    SkyRL-Agent: Efficient RL Training for Multi-Turn LLM Agent

    Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent.arXiv preprint arXiv:2511.16108, 2025

  7. [7]

    MARS: Modular Agent with Reflective Search for Automated AI Research

    Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. Mars: Modular agent with reflective search for automated ai research.arXiv preprint arXiv:2602.02660, 2026

  8. [8]

    Facilitating Multi-Turn Function Calling for LLMs via Compositional Instruction Tuning

    Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. Facilitating multi-turn function calling for llms via compositional instruction tuning.arXiv preprint arXiv:2410.12952, 2024

  9. [9]

    BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

    Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

  10. [10]

    Introducing SWE-bench Verified

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, et al. Introducing swe-bench verified. arXiv preprint arXiv:2407.01489, 2024

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  12. [12]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025

  13. [13]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=5h0qf7IBZZ

  14. [14]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  15. [15]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yifan Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  16. [16]

    Large Language Models for Software Engineering: A Systematic Literature Review

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

  17. [17]

    Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

    Jiawei Huang, Qingping Yang, Renjie Zheng, and Jiaze Chen. Beyond verifiable rewards: Rubric-based grm for reinforced fine-tuning swe agents.arXiv preprint arXiv:2604.16335, 2026

  18. [18]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  19. [19]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  20. [20]

    Imitation Learning for Multi-Turn LM Agents via On-Policy Expert Corrections

    Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, and Jeff Da. Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

  21. [21]

    Large Language Model-Based Agents for Software Engineering: A Survey

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. Large language model-based agents for software engineering: A survey.arXiv preprint arXiv:2409.02977, 2024

  22. [22]

    On-Policy Distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

  23. [23]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  24. [25]

    Training Software Engineering Agents and Verifiers with SWE-Gym

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym.arXiv preprint arXiv:2412.21139, 2024

  25. [26]

    MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

    Rushi Qiang, Yuchen Zhuang, Yinghao Li, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai, et al. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering.arXiv preprint arXiv:2505.07782, 2025

  26. [27]

    MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

    Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, and Bo Dai. Mle-smith: Scaling mle tasks with automated multi-agent pipeline.arXiv preprint arXiv:2510.07307, 2025

  27. [28]

    Efficient reductions for imitation learning

    Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010

  28. [29]

    Reinforcement and Imitation Learning via Interactive No-Regret Learning

    Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning.arXiv preprint arXiv:1406.5979, 2014

  29. [30]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  30. [31]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  31. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [33]

    Swe-dev: Building software engineering agents with training and inference scaling

    Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. Swe-dev: Building software engineering agents with training and inference scaling. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3742–3761, 2025

  33. [34]

    Software Testing with Large Language Models: Survey, Landscape, and Vision

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering, 50(4):911–936, 2024

  34. [35]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

  35. [36]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  36. [37]

    SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution.arXiv preprint arXiv:2502.18449, 2025

  37. [38]

    Automated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023

  38. [39]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  39. [40]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  40. [41]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents.arXiv preprint arXiv:2504.21798, 2025

  41. [42]

    Reinforcement Learning for Machine Learning Engineering Agents

    Sherry Yang, Joy He-Yueya, and Percy Liang. Reinforcement learning for machine learning engineering agents.arXiv preprint arXiv:2509.01684, 2025

  42. [43]

    DCPO: Dynamic Clipping Policy Optimization

    Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, and Rihui Xin. Dcpo: Dynamic clipping policy optimization.arXiv preprint arXiv:2509.02333, 2025

  43. [44]

    Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT

    Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt. arXiv preprint arXiv:2304.10778, 2023

  44. [45]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  45. [46]

    Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving.arXiv preprint arXiv:2504.02605, 2025

  46. [47]

    ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

    Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, et al. Prorl agent: Rollout-as-a-service for rl training of multi-turn llm agents. arXiv preprint arXiv:2603.18815, 2026

  47. [48]

    CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-Level Coding Challenges

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges.arXiv preprint arXiv:2401.07339, 2024

  48. [49]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  49. [50]

    Training Versatile Coding Agents in Synthetic Environments

    Yiqi Zhu, Apurva Gandhi, and Graham Neubig. Training versatile coding agents in synthetic environments. arXiv preprint arXiv:2512.12216, 2025
