
arxiv: 2606.19613 · v1 · pith:NP4IZKUYnew · submitted 2026-06-17 · 💻 cs.SE · cs.AI

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Pith reviewed 2026-06-26 19:45 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords coding agents · multi-turn evaluation · stamina benchmark · REST API modification · test feedback · agent harness · procedural task generation · black-box testing

The pith

Coding agents fail within 5-6 turns on 100-turn tasks; feeding test results back for retries raises the passed-turn count by up to 12x, and harness choice produces up to a 6x performance gap for stronger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StaminaBench to measure how many consecutive change requests coding agents can handle before breaking, shifting focus from single-task success rates to sustained performance over dozens or hundreds of turns. It tests agents on building and iteratively modifying a REST API server across procedurally generated change sequences that grow the codebase to as many as 6,000 lines, with all tests generated programmatically for reliability. Evaluations across six harnesses and seven open-source models show consistent early failure, large gains from feeding back test results for retries, and strong dependence on the surrounding harness rather than on the model alone. The agent and its server run in an isolated environment that the benchmark reaches only over HTTP, so testing is black-box and independent of the implementation language.
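The black-box idea can be illustrated with a minimal sketch: the checker below knows only a base URL and inspects HTTP responses, never the server's source. The `/health` route, the `ToyHandler` server, and the `check_health` helper are hypothetical stand-ins for illustration, not StaminaBench's actual endpoints or test code.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ToyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the agent-written server."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def check_health(base_url: str) -> bool:
    """Black-box check: only the HTTP response is inspected."""
    # An empty ProxyHandler avoids routing localhost traffic through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}/health", timeout=5) as resp:
        return resp.status == 200 and json.load(resp) == {"status": "ok"}


def demo() -> bool:
    # Port 0 asks the OS for any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), ToyHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return check_health(f"http://127.0.0.1:{server.server_port}")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Because the checker depends only on the wire protocol, the same test would apply unchanged to a server written in any language.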

Core claim

StaminaBench evaluates agent stamina by requiring implementation of a REST API followed by 100 successive valid change requests drawn from structured samplers; across 20 scenarios all tested models fail within 5-6 turns, passing test feedback back to the agent improves the number of passed turns by up to 12x, and stronger models display up to a 6x gap between their best and worst harness while weaker models fail regardless of harness.
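As a rough sketch of the metric, assuming stamina is counted as the number of consecutive turns passed before the first failure (the abstract's wording), the computation is short. The function names and the sample outcomes are illustrative assumptions, not the benchmark's code or data.

```python
from typing import Sequence


def stamina(turn_passed: Sequence[bool]) -> int:
    """Count consecutive passed turns before the first failure."""
    count = 0
    for passed in turn_passed:
        if not passed:
            break
        count += 1
    return count


def improvement_factor(baseline: float, treated: float) -> float:
    """Ratio between two passed-turn counts, e.g. with vs. without feedback."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return treated / baseline


if __name__ == "__main__":
    # Illustrative outcomes only, not results from the paper.
    outcomes = [True] * 5 + [False] + [True] * 94
    print(stamina(outcomes))
```

Under this definition, turns passed after the first failure do not count, which is what distinguishes stamina from a fraction-of-tasks-solved score.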

What carries the argument

StaminaBench, a black-box benchmark that runs the agent and target server in isolation and communicates via HTTP while applying procedurally generated change sequences from either hardcoded or LLM-driven samplers constrained to a structured action space.
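One way to picture a sampler constrained to a structured action space is a generator that enumerates only the changes valid for the current schema and then picks one with a seeded random source. The sketch below is a toy under stated assumptions: the four action names, the `Change` record, and the starting schema are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str          # hypothetical action name
    entity: str
    field: str = ""
    new_name: str = ""


def valid_changes(schema: dict[str, set[str]], turn: int) -> list[Change]:
    """Enumerate only changes that are valid for the current schema."""
    fresh = f"field_{turn}"  # unique per turn, so it never collides
    options = [Change("add_entity", f"Entity{turn}")]
    for entity in sorted(schema):
        options.append(Change("add_field", entity, fresh))
        for field in sorted(schema[entity]):
            options.append(Change("rename_field", entity, field, fresh))
            if len(schema[entity]) > 1:  # never remove the last field
                options.append(Change("remove_field", entity, field))
    return options


def apply_change(schema: dict[str, set[str]], change: Change) -> None:
    """Update the tracked schema so later samples stay valid."""
    if change.action == "add_entity":
        schema[change.entity] = {"id"}
    elif change.action == "add_field":
        schema[change.entity].add(change.field)
    elif change.action == "rename_field":
        schema[change.entity].remove(change.field)
        schema[change.entity].add(change.new_name)
    elif change.action == "remove_field":
        schema[change.entity].remove(change.field)


def sample_sequence(turns: int, seed: int) -> list[Change]:
    """Draw a reproducible sequence of valid changes."""
    rng = random.Random(seed)
    schema = {"UserProfile": {"id", "name"}}
    sequence = []
    for turn in range(turns):
        change = rng.choice(valid_changes(schema, turn))
        apply_change(schema, change)
        sequence.append(change)
    return sequence


if __name__ == "__main__":
    for change in sample_sequence(5, seed=0):
        print(change)
```

Tracking the schema alongside the sequence is what makes invalid requests, such as renaming a field that no longer exists, impossible to sample; fixing the seed makes the sequence reproducible.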

If this is right

  • All tested models produce bugs when iterating without external test feedback.
  • Including test results in the prompt loop multiplies the number of successful turns by as much as 12.
  • Harness design produces up to a 6x gap between the best and worst harness for stronger models, while weaker models fail regardless of harness.
  • The benchmark's fully programmatic test generation and isolated HTTP interface enable reproducible, language-agnostic evaluation of multi-turn behavior.
  • Releasing the tasks and code supports further work on sustained coding agent performance.
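The test-feedback point above can be sketched as a retry loop: after each attempt the failing test messages are passed back to the agent until the tests pass or a retry budget runs out. The `Agent` and `TestRunner` signatures, the stub agent, and the failure message are assumptions made for illustration; the paper's harnesses may wire this differently.

```python
from typing import Callable, Optional

Agent = Callable[[str, Optional[str]], str]   # (request, feedback) -> code
TestRunner = Callable[[str], list[str]]       # code -> failure messages


def run_turn(agent: Agent, run_tests: TestRunner,
             request: str, retry_budget: int) -> bool:
    """Return True if the turn passes within the retry budget."""
    feedback: Optional[str] = None
    for _ in range(retry_budget + 1):  # first attempt plus retries
        code = agent(request, feedback)
        failures = run_tests(code)
        if not failures:
            return True
        feedback = "\n".join(failures)  # shown to the agent on the next try
    return False


def stub_agent(request: str, feedback: Optional[str]) -> str:
    """Toy agent: wrong on the first try, right once it sees feedback."""
    return "fixed" if feedback else "buggy"


def stub_tests(code: str) -> list[str]:
    """Toy test runner with a made-up failure message."""
    return [] if code == "fixed" else ["GET /users returned 500"]


if __name__ == "__main__":
    print(run_turn(stub_agent, stub_tests, "add a field", retry_budget=0))
    print(run_turn(stub_agent, stub_tests, "add a field", retry_budget=1))
```

With a budget of zero the loop reduces to the no-feedback setting, so the same function covers both conditions being compared.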

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world coding sessions may require explicit test integration loops to reach dozens of turns.
  • Future agents could be trained or prompted specifically to request and interpret test feedback rather than relying on harnesses.
  • The 6x harness gap suggests that interface design between model and environment is a higher-leverage research target than incremental model scaling for this workload.
  • If the structured action space limits diversity too much, extending the sampler to include more open-ended change types would test whether the early failure pattern persists.

Load-bearing premise

The procedurally generated change sequences represent the kinds of follow-up requests that actually occur in real development sessions.

What would settle it

Running the same agents on 100-turn sessions drawn from actual recorded developer interactions instead of the benchmark's samplers and measuring whether failure rates remain under 6 turns.

Figures

Figures reproduced from arXiv: 2606.19613 by Shuo Yang, Stefano Soatto, Vlad Sobal, Wei Xia, Yuting Zhang.

Figure 1: Overview of StaminaBench. The benchmark (blue) iteratively samples changes via …
Figure 2: Scaling and ablation dynamics (OpenCode), averaged over scenarios with …
Figure 3: Failure characterization on OpenCode, averaged over scenarios with …
Figure 4: Distribution of failure types at the turn each scenario …
Figure 5: Distribution of failure types across all failed turns under the default retry budget ( …
Figure 6: Distribution of failure types across all failed turns under an extended retry budget of …
Original abstract

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces StaminaBench, a benchmark for measuring coding-agent stamina over 100 consecutive interaction turns on implementing and iteratively modifying a REST API server. Change sequences are produced by hardcoded or LLM-driven samplers inside a fixed structured action space; tests are generated fully programmatically. Evaluation of six harnesses paired with seven open-source LLMs across 20 scenarios reports that all models fail within 5-6 turns, that passing test feedback improves passed-turn count by up to 12x, and that harness quality produces up to a 6x performance gap between best and worst harness for stronger models. The benchmark, tasks, and code are released.

Significance. If the synthetic change sequences are representative, the results would demonstrate that current agents cannot sustain long-horizon vibe-coding sessions and would highlight the value of test feedback and harness design. The fully programmatic test generation, black-box HTTP interface, isolated execution environment, and public release of benchmark code and data constitute clear strengths for reproducibility and extensibility.

major comments (2)
  1. [Abstract] The headline claim that the observed 5-6 turn failures 'confirm that vibe-coding-style programming without thorough testing produces bugs' rests on the unvalidated premise that the procedurally generated sequences (hardcoded or LLM-driven samplers inside the structured action space) match the distribution of real multi-turn developer change requests; no empirical comparison to commit histories, issue trackers, or session logs is reported.
  2. [Abstract / §4 Evaluation] The quantitative claims (5-6 turn failures, 12x improvement from test feedback, 6x harness gap) are presented without reported details on validation of the change samplers, error-handling behavior inside the isolated environment, or statistical significance testing across the 20 scenarios, which are load-bearing for assessing whether the failure modes are robust or artifacts of the synthetic setup.
minor comments (1)
  1. A table or diagram explicitly enumerating the allowed actions in the structured action space would clarify the constraints under which changes remain valid.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to improve clarity and temper certain claims.

  1. Referee: [Abstract] The headline claim that the observed 5-6 turn failures 'confirm that vibe-coding-style programming without thorough testing produces bugs' rests on the unvalidated premise that the procedurally generated sequences (hardcoded or LLM-driven samplers inside the structured action space) match the distribution of real multi-turn developer change requests; no empirical comparison to commit histories, issue trackers, or session logs is reported.

    Authors: We agree that the change sequences are synthetic by design and that no empirical validation against real commit histories or session logs is provided. The benchmark intentionally uses a constrained, structured action space to guarantee validity and reproducibility rather than attempting to replicate real-world distributions. The abstract phrasing was illustrative of the observed failure modes rather than a distributional claim. We will revise the abstract to remove the word 'confirm' and instead describe the results as demonstrating the difficulty of sustaining long sessions without test feedback.
    revision: yes

  2. Referee: [Abstract / §4 Evaluation] The quantitative claims (5-6 turn failures, 12x improvement from test feedback, 6x harness gap) are presented without reported details on validation of the change samplers, error-handling behavior inside the isolated environment, or statistical significance testing across the 20 scenarios, which are load-bearing for assessing whether the failure modes are robust or artifacts of the synthetic setup.

    Authors: We will add explicit details in Sections 3 and 4 on how the samplers enforce validity within the action space, the error-handling and isolation mechanisms (Docker-based black-box HTTP interface), and quantitative variability across the 20 scenarios (different sampler seeds and model pairings). Where raw data permit, we will also report standard deviations or ranges to support the robustness of the reported factors (5-6 turns, 12x, 6x).
    revision: yes
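    The variability reporting promised in this response could look roughly like the sketch below, which summarizes per-scenario passed-turn counts with a mean, sample standard deviation, and range. The `summarize` helper and the placeholder counts are assumptions for illustration only; they are not the authors' analysis or data from the paper.

```python
import statistics
from typing import Sequence


def summarize(passed_turns: Sequence[int]) -> dict[str, float]:
    """Mean, sample standard deviation, and range across scenarios."""
    if len(passed_turns) < 2:
        raise ValueError("need at least two scenarios for a standard deviation")
    return {
        "mean": statistics.mean(passed_turns),
        "stdev": statistics.stdev(passed_turns),
        "min": min(passed_turns),
        "max": max(passed_turns),
    }


if __name__ == "__main__":
    # Placeholder counts for illustration; NOT results from the paper.
    placeholder_counts = [4, 6, 5, 5, 7, 3]
    print(summarize(placeholder_counts))
```

    Reporting the range next to the mean would show whether a headline figure such as "5-6 turns" describes most scenarios or is driven by a few outliers.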

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential reductions


The paper introduces StaminaBench as an empirical evaluation framework for multi-turn coding agents. It reports direct experimental outcomes (failure within 5-6 turns, 12x improvement from test feedback, 6x harness gaps) from running six harnesses and seven LLMs on 20 procedurally generated 100-turn scenarios. No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or ansatzes appear anywhere in the manuscript. The central claims rest on observed performance metrics rather than any chain that reduces to self-definition or self-citation. The representativeness of the action space to real development is a separate validity question outside the scope of circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities; the contribution is an empirical benchmark and evaluation protocol.


