SWE-Together: Evaluating Coding Agents in Interactive User Sessions
Pith reviewed 2026-06-30 05:30 UTC · model grok-4.3
The pith
Stronger coding agents generally achieve higher success rates with fewer user interventions in interactive sessions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper curates 109 tasks from 11,260 real sessions and replays them with a reactive user simulator intended to preserve the original users' intents. On this benchmark it reports that stronger frontier coding agents generally attain higher final repository correctness while needing fewer corrective feedback turns during the interaction.
What carries the argument
The reactive LLM-based user simulator that preserves the original users' intents and provides feedback only when the coding agent's progress requires it.
Load-bearing premise
The reactive LLM-based user simulator accurately preserves the original users' intents and provides feedback only when the coding agent's progress requires it.
What would settle it
Replacing the simulator with actual human users replaying the same sessions and finding substantially different intervention counts or success rates.
Original abstract
Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with recoverable repository states, clear user goals, and observable outcomes. To replay these interactions across agents, we build a reactive LLM-based user simulator that preserves the original users' intents and provides feedback when the coding agent's progress requires it. To evaluate agents as collaborators, we measure both final repository correctness and the number of corrective feedback turns required during the interaction. Experiments with frontier coding agents show that stronger agents generally achieve higher final success rates while requiring fewer interventions, suggesting an improved user experience.
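The replay setup described in the abstract can be pictured with a minimal sketch. This is not the paper's implementation: `replay_session`, `SessionResult`, the callables, and the `max_turns` cap are hypothetical names introduced here, and the toy agent and toy user stand in for a real coding agent and the LLM-based simulator. The sketch only illustrates the control flow, in which the simulated user speaks again when the agent's progress calls for it, and each such turn is counted as one corrective intervention.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SessionResult:
    resolved: bool         # final repository correctness
    corrective_turns: int  # feedback turns issued by the simulated user


def replay_session(
    initial_request: str,
    agent_step: Callable[[str], str],
    user_feedback: Callable[[str], Optional[str]],
    is_correct: Callable[[str], bool],
    max_turns: int = 10,  # hypothetical cap, not taken from the paper
) -> SessionResult:
    """Replay one session: the user reacts only when feedback is needed."""
    message = initial_request
    state = ""
    corrective_turns = 0
    for _ in range(max_turns):
        state = agent_step(message)      # agent acts on the latest message
        feedback = user_feedback(state)  # None means the user stays silent
        if feedback is None:
            break
        corrective_turns += 1
        message = feedback
    return SessionResult(resolved=is_correct(state), corrective_turns=corrective_turns)


# Toy stand-ins (assumptions for illustration only).
def toy_agent(message: str) -> str:
    return "fixed" if "fix" in message else "draft"


def toy_user(state: str) -> Optional[str]:
    return None if state == "fixed" else "please fix the failing test"


result = replay_session("add a retry option", toy_agent, toy_user, lambda s: s == "fixed")
print(result)  # SessionResult(resolved=True, corrective_turns=1)
```

In this toy run the agent needs one correction before the final state passes, so the session is scored as resolved with one intervention.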
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE-Together, a multi-turn benchmark reconstructed from 109 repository-level tasks curated from 11,260 real user-agent coding sessions. It employs a reactive LLM-based user simulator to replay interactions while preserving original user intents, and evaluates frontier coding agents on both final repository correctness and the number of corrective feedback turns (interventions) required. The central empirical finding is that stronger agents achieve higher success rates while needing fewer interventions, interpreted as evidence of improved user experience in interactive settings.
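To make the two reported quantities concrete, here is a small sketch of how per-task outcomes might be aggregated into a success rate and a mean intervention count for one agent. The function name `summarize` and the example outcomes are invented for illustration; they are not results from the paper, and the paper may aggregate differently.

```python
from statistics import mean


def summarize(results):
    """Aggregate per-task (resolved, corrective_turns) pairs for one agent."""
    if not results:
        raise ValueError("no results to summarize")
    success_rate = sum(1 for resolved, _ in results if resolved) / len(results)
    mean_interventions = mean(turns for _, turns in results)
    return {"success_rate": success_rate, "mean_interventions": mean_interventions}


# Made-up outcomes for four tasks, not data from the paper.
example = [(True, 0), (True, 2), (False, 4), (True, 2)]
summary = summarize(example)
print(summary)
```

Reporting the two numbers side by side matters because an agent could reach the same final correctness while demanding very different amounts of user effort along the way.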
Significance. If the simulator's fidelity holds, the benchmark fills a gap between static task-completion evaluations and real interactive coding assistance, providing a measurable proxy for user effort via intervention counts. The curation scale (109 tasks from over 11k sessions with recoverable states) is a concrete strength. However, the absence of any reported validation for the simulator means the intervention metric's validity remains unestablished, limiting the result's immediate impact on agent evaluation practices.
major comments (2)
- [Abstract / simulator description] Abstract (and the method section describing the simulator): the claim that the reactive LLM-based user simulator 'preserves the original users' intents and provides feedback when the coding agent's progress requires it' is presented without any quantitative fidelity checks (e.g., inter-rater agreement on held-out real sessions, divergence metrics on feedback content or timing, or human ratings). This is load-bearing because the headline result—stronger agents require fewer interventions—is measured entirely through simulator-driven replays; systematic mismatch with real users would make the intervention counts an artifact rather than evidence of UX improvement.
- [Experiments] Experiments section (results on intervention counts): the reported correlation between agent strength and lower intervention counts rests on the unvalidated simulator; without evidence that feedback triggers match real-user behavior, the cross-agent comparison cannot be interpreted as demonstrating improved collaborative experience.
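One of the fidelity checks named in the first major comment, agreement between simulator and human decisions, could be quantified with Cohen's kappa. The sketch below is an assumption about how such a check might look, not something the paper reports: the labels are hypothetical per-checkpoint decisions (1 = intervene, 0 = stay silent) from a human annotator and from the simulator on the same held-out sessions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical intervene / stay-silent decisions at eight checkpoints.
human = [1, 1, 0, 0, 1, 0, 1, 0]
simulator = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(human, simulator)
print(round(kappa, 3))  # 0.5
```

A kappa near 1 would support the claim that feedback is triggered where a real user would give it, while a value near 0 would mean the agreement is no better than chance and the intervention counts could not be read as a proxy for user effort.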
minor comments (1)
- [Abstract] Abstract: the selection criteria for the 109 tasks ('recoverable repository states, clear user goals, and observable outcomes') are stated at high level; a brief enumeration of exclusion reasons or inter-annotator agreement on goal clarity would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on the SWE-Together benchmark. We address the major comments regarding the validation of the user simulator below.
Point-by-point responses
Referee: [Abstract / simulator description] Abstract (and the method section describing the simulator): the claim that the reactive LLM-based user simulator 'preserves the original users' intents and provides feedback when the coding agent's progress requires it' is presented without any quantitative fidelity checks (e.g., inter-rater agreement on held-out real sessions, divergence metrics on feedback content or timing, or human ratings). This is load-bearing because the headline result—stronger agents require fewer interventions—is measured entirely through simulator-driven replays; systematic mismatch with real users would make the intervention counts an artifact rather than evidence of UX improvement.
Authors: We agree that no quantitative fidelity checks for the simulator are reported in the current manuscript. The curation from real sessions aims to preserve intents through selection criteria, but we acknowledge the need for explicit validation to substantiate the claims. We will revise the manuscript to include a dedicated validation section with metrics such as inter-rater agreement on held-out sessions and human evaluations of feedback timing and content.
Revision: yes
Referee: [Experiments] Experiments section (results on intervention counts): the reported correlation between agent strength and lower intervention counts rests on the unvalidated simulator; without evidence that feedback triggers match real-user behavior, the cross-agent comparison cannot be interpreted as demonstrating improved collaborative experience.
Authors: We concur that the interpretation of the intervention count results as evidence of improved user experience depends on the simulator's fidelity. The added validation in the revised version will provide the required evidence that feedback triggers align with real-user behavior, thereby supporting the cross-agent comparisons.
Revision: yes
Circularity Check
No significant circularity; evaluation metrics are direct experimental measurements
Full rationale
The paper constructs a benchmark from recorded sessions and evaluates agents via a reactive LLM simulator whose behavior is described but not shown to reduce any result to its own inputs by definition or fitting. No equations, parameter fits renamed as predictions, or self-citation chains appear in the provided text. Claims about success rates and intervention counts are presented as outcomes of running frontier agents on the benchmark, which remains externally falsifiable via the curated tasks and does not collapse to self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Recorded sessions contain recoverable repository states and observable outcomes
invented entities (1)
- reactive LLM-based user simulator: no independent evidence