SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Ho Hin Lee; Jiacheng Zhu; Lizhu Zhang; Serena Li; Shengzhi Li; Shirley Wu; Songlin Li; Tianhe Yu; Xiangjun Fan; Yifan Wu

arxiv: 2606.29957 · v1 · pith:TAHFZLR7new · submitted 2026-06-29 · 💻 cs.SE · cs.AI

Authors: Yifan Wu, Zhuokai Zhao, Songlin Li, Ho Hin Lee, Jiacheng Zhu, Shirley Wu, Tianhe Yu, Serena Li, Lizhu Zhang, Xiangjun Fan, Shengzhi Li

Pith reviewed 2026-06-30 05:30 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: coding agents, interactive benchmarks, user simulation, software engineering, multi-turn evaluation, repository tasks

The pith

Stronger coding agents achieve higher success rates with fewer user interventions in interactive sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SWE-Together turns real recorded user-agent coding sessions into a multi-turn benchmark with verifiable outcomes. It reconstructs 109 repository-level tasks and uses a reactive LLM simulator to stand in for the human user, supplying clarifications or corrections only when the agent's progress requires it. Experiments across frontier agents show that those with higher overall capability reach correct final code more often while needing fewer such interventions.
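The replay described above can be pictured as a loop in which the simulated user stays silent unless the agent's progress calls for a correction. The following is a minimal sketch under stated assumptions, not the paper's implementation: `replay_session`, `EpisodeResult`, the callables, and the `max_turns` cap are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EpisodeResult:
    success: bool       # final repository correctness
    interventions: int  # corrective feedback turns supplied by the simulator


def replay_session(
    initial_request: str,
    agent_step: Callable[[str], str],
    simulate_user: Callable[[str], Optional[str]],
    verify_repo: Callable[[], bool],
    max_turns: int = 10,  # assumed safety cap, not taken from the paper
) -> EpisodeResult:
    """Run one task: the agent acts, the simulator reacts only when needed."""
    message = initial_request
    interventions = 0
    for _ in range(max_turns):
        agent_output = agent_step(message)
        feedback = simulate_user(agent_output)
        if feedback is None:  # simulator sees nothing to correct
            break
        interventions += 1
        message = feedback
    return EpisodeResult(success=verify_repo(), interventions=interventions)


def demo() -> EpisodeResult:
    """Toy episode: the agent needs exactly one correction to finish."""
    state = {"fixed": False}

    def agent_step(msg: str) -> str:
        if "edge case" in msg:
            state["fixed"] = True
            return "handled edge case"
        return "initial patch"

    def simulate_user(_output: str) -> Optional[str]:
        if state["fixed"]:
            return None
        return "Please handle the empty-input edge case."

    return replay_session(
        "Fix the parser bug.", agent_step, simulate_user, lambda: state["fixed"]
    )
```

In the toy episode the agent succeeds after a single corrective turn, so both reported quantities (final correctness and intervention count) come out of the same run.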

Core claim

By curating 109 tasks from 11,260 real sessions and replaying them with a reactive user simulator that preserves original intents, the paper demonstrates that stronger frontier coding agents attain both higher final repository correctness and lower numbers of corrective feedback turns during the interaction.

What carries the argument

The reactive LLM-based user simulator that preserves the original users' intents and provides feedback only when the coding agent's progress requires it.

Load-bearing premise

The reactive LLM-based user simulator accurately preserves the original users' intents and provides feedback only when the coding agent's progress requires it.

What would settle it

Replacing the simulator with actual human users replaying the same sessions and finding substantially different intervention counts or success rates.

Original abstract

Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns. We introduce SWE-Together, a multi-turn benchmark reconstructed from real user-agent coding sessions. To make real interactions verifiable, we curate 109 repository-level tasks from 11,260 recorded sessions, selecting sessions with recoverable repository states, clear user goals, and observable outcomes. To replay these interactions across agents, we build a reactive LLM-based user simulator that preserves the original users' intents and provides feedback when the coding agent's progress requires it. To evaluate agents as collaborators, we measure both final repository correctness and the number of corrective feedback turns required during the interaction. Experiments with frontier coding agents show that stronger agents generally achieve higher final success rates while requiring fewer interventions, suggesting an improved user experience.
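The three selection criteria named in the abstract can be read as a conjunctive filter over recorded sessions. The sketch below assumes each criterion has already been judged as a boolean per session; the `Session` fields and function names are hypothetical, and the paper's actual curation pipeline may involve further steps that are not described here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Session:
    session_id: str
    repo_state_recoverable: bool  # can the starting repository be rebuilt?
    goal_clear: bool              # is the user's goal unambiguous?
    outcome_observable: bool      # can the final result be checked?


def is_candidate(s: Session) -> bool:
    """A session qualifies only if all three criteria hold."""
    return s.repo_state_recoverable and s.goal_clear and s.outcome_observable


def curate(sessions: Iterable[Session]) -> List[Session]:
    return [s for s in sessions if is_candidate(s)]


def retention_rate(kept: int, total: int) -> float:
    """Fraction of recorded sessions that survive curation."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


# Using the counts stated in the abstract: 109 tasks from 11,260 sessions.
rate = retention_rate(109, 11_260)
```

With the abstract's counts, fewer than 1 in 100 recorded sessions ends up as a benchmark task, which is why the exclusion reasons matter for reproducibility.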

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main move is reconstructing real multi-turn sessions into a replayable benchmark with a user simulator, but the lack of any fidelity checks on that simulator undercuts the intervention results.

The paper's real contribution is taking 11,260 recorded sessions, curating 109 repository-level tasks with recoverable states and clear outcomes, and building a reactive LLM simulator to replay them across agents. This lets them score both final correctness and the number of corrective turns needed, which moves past the usual static single-prompt benchmarks.

The curation step and the dual metrics are straightforward and address a practical gap in how coding agents are tested. The reported pattern that stronger agents finish with fewer interventions is the sort of outcome that could inform tool design.

The soft spot is the simulator. The abstract asserts it keeps the original user intents and only gives feedback when the agent's progress calls for it, yet nothing in the provided text shows how that was verified—no human ratings, no held-out session comparisons, no timing or content divergence numbers. Because the benchmark's value is the interactive setting and the headline claim rests on the intervention counts, this missing check is the load-bearing part.

This is for researchers who build or evaluate interactive coding agents and want something closer to real usage than existing static suites. A reader in that subfield would find the setup worth looking at, even while treating the simulator-dependent numbers as provisional.

I would send it to peer review. The reconstruction idea is concrete enough to deserve referee input, and the authors can add the needed validation experiments in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces SWE-Together, a multi-turn benchmark reconstructed from 109 repository-level tasks curated from 11,260 real user-agent coding sessions. It employs a reactive LLM-based user simulator to replay interactions while preserving original user intents, and evaluates frontier coding agents on both final repository correctness and the number of corrective feedback turns (interventions) required. The central empirical finding is that stronger agents achieve higher success rates while needing fewer interventions, interpreted as evidence of improved user experience in interactive settings.
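One way to quantify the pattern summarized above is a rank correlation between per-agent success rate and per-agent mean intervention count. The sketch below implements Spearman's rho with average ranks for ties; the function names are introduced here, and the numbers in the usage line are placeholders chosen only to exercise the code, not values from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Placeholder inputs (NOT results from the paper): success rises as
# interventions fall, so the ranks are perfectly reversed.
rho = spearman([0.2, 0.4, 0.6, 0.8], [5.0, 4.0, 3.0, 2.0])
```

A strongly negative rho would match the reported trend, but as the major comments note, its interpretation still depends on the simulator producing realistic intervention counts.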

Significance. If the simulator's fidelity holds, the benchmark fills a gap between static task-completion evaluations and real interactive coding assistance, providing a measurable proxy for user effort via intervention counts. The curation scale (109 tasks from over 11k sessions with recoverable states) is a concrete strength. However, the absence of any reported validation for the simulator means the intervention metric's validity remains unestablished, limiting the result's immediate impact on agent evaluation practices.

major comments (2)

[Abstract / simulator description] Abstract (and the method section describing the simulator): the claim that the reactive LLM-based user simulator 'preserves the original users' intents and provides feedback when the coding agent's progress requires it' is presented without any quantitative fidelity checks (e.g., inter-rater agreement on held-out real sessions, divergence metrics on feedback content or timing, or human ratings). This is load-bearing because the headline result—stronger agents require fewer interventions—is measured entirely through simulator-driven replays; systematic mismatch with real users would make the intervention counts an artifact rather than evidence of UX improvement.
[Experiments] Experiments section (results on intervention counts): the reported correlation between agent strength and lower intervention counts rests on the unvalidated simulator; without evidence that feedback triggers match real-user behavior, the cross-agent comparison cannot be interpreted as demonstrating improved collaborative experience.
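A fidelity check of the kind requested in these comments could compare, turn by turn, whether the simulator and a real user chose to intervene. The sketch below computes Cohen's kappa over such paired binary decisions. It is an illustration of one possible check, not a procedure from the paper, and the labels in the usage example are made-up placeholders.

```python
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Agreement expected if both raters labelled independently at their
    # own marginal rates.
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder decisions (1 = intervene, 0 = stay silent), not real data.
human_turns = [1, 0, 0, 1, 0, 1, 0, 0]
simulator_turns = [1, 0, 1, 1, 0, 0, 0, 0]
kappa = cohens_kappa(human_turns, simulator_turns)
```

Here the two raters agree on 6 of 8 turns, but because both stay silent most of the time, chance agreement is already about 0.53 and kappa comes out near 0.47. Raw agreement alone would overstate the simulator's fidelity in this toy case.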

minor comments (1)

[Abstract] Abstract: the selection criteria for the 109 tasks ('recoverable repository states, clear user goals, and observable outcomes') are stated at high level; a brief enumeration of exclusion reasons or inter-annotator agreement on goal clarity would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on the SWE-Together benchmark. We address the major comments regarding the validation of the user simulator below.

Referee: [Abstract / simulator description] Abstract (and the method section describing the simulator): the claim that the reactive LLM-based user simulator 'preserves the original users' intents and provides feedback when the coding agent's progress requires it' is presented without any quantitative fidelity checks (e.g., inter-rater agreement on held-out real sessions, divergence metrics on feedback content or timing, or human ratings). This is load-bearing because the headline result—stronger agents require fewer interventions—is measured entirely through simulator-driven replays; systematic mismatch with real users would make the intervention counts an artifact rather than evidence of UX improvement.

Authors: We agree that no quantitative fidelity checks for the simulator are reported in the current manuscript. The curation from real sessions aims to preserve intents through selection criteria, but we acknowledge the need for explicit validation to substantiate the claims. We will revise the manuscript to include a dedicated validation section with metrics such as inter-rater agreement on held-out sessions and human evaluations of feedback timing and content. Revision: yes.

Referee: [Experiments] Experiments section (results on intervention counts): the reported correlation between agent strength and lower intervention counts rests on the unvalidated simulator; without evidence that feedback triggers match real-user behavior, the cross-agent comparison cannot be interpreted as demonstrating improved collaborative experience.

Authors: We concur that the interpretation of the intervention count results as evidence of improved user experience depends on the simulator's fidelity. The added validation in the revised version will provide the required evidence that feedback triggers align with real-user behavior, thereby supporting the cross-agent comparisons. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; evaluation metrics are direct experimental measurements

The paper constructs a benchmark from recorded sessions and evaluates agents via a reactive LLM simulator whose behavior is described but not shown to reduce any result to its own inputs by definition or fitting. No equations, parameter fits renamed as predictions, or self-citation chains appear in the provided text. Claims about success rates and intervention counts are presented as outcomes of running frontier agents on the benchmark, which remains externally falsifiable via the curated tasks and does not collapse to self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on successful curation of recoverable sessions and faithful simulation of user intent; no free parameters or invented physical entities are described.

axioms (1)

Domain assumption: recorded sessions contain recoverable repository states and observable outcomes.
Stated in the abstract as selection criteria for the 109 tasks.

invented entities (1)

Reactive LLM-based user simulator (no independent evidence)
Purpose: to replay original user intents and provide feedback during agent interactions.
Introduced to make real interactions verifiable for benchmarking.

[1] [1]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Program Synthesis with Large Language Models

Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Advances in Neural Information Processing Systems (NeurIPS) , year =

John Yang and Akshara Prabhakar and Karthik Narasimhan and Shunyu Yao , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[4] [4]

International Conference on Learning Representations (ICLR) , year =

Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji , title =. International Conference on Learning Representations (ICLR) , year =

[5] [5]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Harsh Trivedi and Tushar Khot and Mareike Hartmann and Ruskin Manku and Vinty Dong and Edward Li and Shashank Gupta and Ashish Sabharwal and Niranjan Balasubramanian , title =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

[6] [6]

2024 , eprint =

Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan , title =. 2024 , eprint =

2024

[7] [7]

2025 , eprint =

Zengzhuang Xu and Bingguang Hao and Zechuan Wang and Yuntao Wen and Xinyi Xu and Yang Liu and Long Chen and Dong Wang and Maolin Wang and Tong Zhao and Yicheng Chen and Cunyin Peng and Jinjie Gu and Leilei Gan and Xiangyu Zhao and Chenyi Zhuang and Shi Gu , title =. 2025 , eprint =

2025

[8] [8]

2026 , eprint =

Zhaorun Chen and Zhuokai Zhao and Kai Zhang and Bo Liu and Qi Qi and Yifan Wu and Tarun Kalluri and Sara Cao and Yuanhao Xiong and Haibo Tong and Huaxiu Yao and Hengduo Li and Jiacheng Zhu and Xian Li and Dawn Song and Bo Li and Jason Weston and Dat Huynh , title =. 2026 , eprint =

2026

[9] [9]

2026 , eprint =

Yuhang Zhou and Lizhu Zhang and Yifan Wu and Jiayi Liu and Xiangjun Fan and Zhuokai Zhao and Hong Yan , title =. 2026 , eprint =

2026

[10] [10]

2025 , eprint =

Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan , title =. 2025 , eprint =

2025

[11] [11]

Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track , year =

Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He , title =. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track , year =

[12] [12]

International Conference on Learning Representations (ICLR) , year =

Hojae Han and Seung-won Hwang and Rajhans Samdani and Yuxiong He , title =. International Conference on Learning Representations (ICLR) , year =

[13] [13]

2025 , eprint =

Myeongsoo Kim and Shweta Garg and Baishakhi Ray and Varun Kumar and Anoop Deoras , title =. 2025 , eprint =

2025

[14] [14]

2025 , eprint =

Guoliang Duan and Mingwei Liu and Yanlin Wang and Chong Wang and Xin Peng and Zibin Zheng , title =. 2025 , eprint =

2025

[15] [15]

2025 , eprint =

Zexun Zhan and Shuzheng Gao and Ruida Hu and Cuiyun Gao , title =. 2025 , eprint =

2025

[16] [16]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Ved Sirdeshmukh and Kaustubh Deshpande and Johannes Mols and Lifeng Jin and Ed-Yeremai Cardona and Dean Lee and Jeremy Kritz and Willow Primack and Summer Yue and Chen Xing , title =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

[17] [17]

2025 , eprint =

Philippe Laban and Hiroaki Hayashi and Yingbo Zhou and Jennifer Neville , title =. 2025 , eprint =

2025

[18] [18]

International Conference on Learning Representations (ICLR) , year =

Xuhui Zhou and Hao Zhu and Leena Mathur and Ruohong Zhang and Haofei Yu and Zhengyang Qi and Louis-Philippe Morency and Yonatan Bisk and Daniel Fried and Graham Neubig and Maarten Sap , title =. International Conference on Learning Representations (ICLR) , year =

[19] [19]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Kuang Wang and Xianfei Li and Shenghao Yang and Li Zhou and Feng Jiang and Haizhou Li , title =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

[20] [20]

2026 , eprint =

Shirley Wu and Evelyn Choi and Arpandeep Khatua and Zhanghan Wang and Joy He-Yueya and Tharindu Cyril Weerasooriya and Wei Wei and Diyi Yang and Jure Leskovec and James Zou , title =. 2026 , eprint =

2026

[21] [21]

International Conference on Machine Learning (ICML) , year =

Shirley Wu and Michel Galley and Baolin Peng and Hao Cheng and Gavin Li and Yao Dou and Weixin Cai and James Zou and Jure Leskovec and Jianfeng Gao , title =. International Conference on Machine Learning (ICML) , year =

[22] [22]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Chuyi Kong and Yaxin Fan and Xiang Wan and Feng Jiang and Benyou Wang , title =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , year =

[23] [23]

Proceedings of the Workshop on Social Influence in Conversations (SICon) , year =

Hovhannes Tamoyan and Hendrik Schuff and Iryna Gurevych , title =. Proceedings of the Workshop on Social Influence in Conversations (SICon) , year =

[24] [24]

2025 , eprint =

Seungjong Park and Shuyue Stella Li and Yejin Choi and Yulia Tsvetkov , title =. 2025 , eprint =

2025

[25] [25]

2025 , eprint =

Wenxuan Qiu and Jianyu Cai and Yujian Liu and Zirui Wang and Xuezhi Wang and Diyi Yang , title =. 2025 , eprint =

2025

[26] [26]

2025 , eprint =

Yifei Zhou and Song Jiang and Yuandong Tian and Jason Weston and Sergey Levine and Sainbayar Sukhbaatar and Xian Li , title =. 2025 , eprint =

2025

[27] [27]

2025 , eprint =

Samuel Miserendino and Michele Wang and Tejal Patwardhan and Johannes Heidecke , title =. 2025 , eprint =

2025

[28] [28]

2025 , howpublished =

2025

[29]

Suzhen Zhong, Ying Zou, and Bram Adams. 2025.

[30]

Binquan Zhang, Li Zhang, Haoyuan Zhang, Fang Liu, Song Wang, Bo Shen, An Fu, and Lin Shi. 2025.

[31]

Spandan Garg, Benjamin Steenhoek, and Yufan Huang. 2025.

[32]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. International Conference on Learning Representations (ICLR).

[33]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. International Conference on Learning Representations (ICLR).

[34]

Introducing. 2024.

[35]

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. 2024.

[36]

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. 2024.

[37]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Advances in Neural Information Processing Systems (NeurIPS).

[38]

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.

[39]

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Jianfeng Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Yudong Zhang. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.

[40]

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, and Tian Liu. Advances in Neural Information Processing Systems (NeurIPS).

[41]

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. 2025.

[42]

Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, and Cuiyun Gao. 2025.

[43]

Konstantinos Vergopoulos, Mark Niklas Müller, and Martin Vechev. International Conference on Machine Learning (ICML).

[44]

Avi Arora, Jinu Jang, and Roshanak Zilouchian Moghaddam. 2025.

[45]

Wenting Zhao, Nan Jiang, Celine Lee, Justin T. Chiu, Claire Cardie, Matthias Gallé, and Alexander M. Rush. International Conference on Learning Representations (ICLR).

[46]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. International Conference on Learning Representations (ICLR).

[47]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, and Indraneil Paul. International Conference on Learning Representations (ICLR).

[48]

Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. 2024.

[49]

You Wang, Michael Pradel, and Zhongxin Liu. 2025.

[50]

Boxi Yu, Yuxuan Zhu, Pinjia He, and Daniel Kang. Annual Meeting of the Association for Computational Linguistics (ACL).

[51]

2026.

[52]

Mario Zechner. 2026.

[53]

Archit11. 2026.

[54]

SWE-chat: Coding Agent Interactions From Real Users in the Wild. 2026.

[55]

Joseph Suh, Ayush Raj, Minwoo Kang, and Serina Chang. 2026.

[56]

Shuvendu K. Lahiri, Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Madanlal Musuvathi, Piali Choudhury, Curtis von Veh, Jeevana Priya Inala, Chenglong Wang, and Jianfeng Gao. 2022.

[57]

Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, and Valerie Chen. 2025.

[58]

Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H. Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zheng... 2025.

[59]

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, and Yeming Wen. 2026.

[60]

Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, and Wenhu Chen. 2026.

[61]

Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, and Chris Donahue. 2025.

[62]

Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton... 2025.

[63]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv preprint arXiv:2601.11868.

[64]

Why. February 2026.

[65]

Wenqi Huang, Charley Lee, Leonard Tng, and Serena Ge. May 2026.