Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
Pith reviewed 2026-06-26 23:46 UTC · model grok-4.3
The pith
Coding benchmarks are misaligned with agentic software engineering because they score composite system harnesses as if they were single models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current coding benchmarks collapse the model, harness, and environment into a single end-to-end score computed against one reference solution, with no component-level signal. A coding agent is a system harness whose individual components can each move benchmark scores by margins comparable to adjacent model generations. The misalignment shows in three symptoms: conflated scores, penalization of alternative solutions, and lack of iteration signals at the component level.
What carries the argument
The system harness, a composite of models, harnesses, contexts, environments, and feedback signals, which carries the argument by showing that its parts can independently alter benchmark outcomes at model-scale levels.
Load-bearing premise
That changes to any single component of the system harness move benchmark scores by amounts comparable to improvements from new model generations.
What would settle it
A controlled study that measures score changes when only the model is swapped versus when only one harness component is changed, to check if the latter produces comparable or larger shifts.
Figures
read the original abstract
Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that coding benchmarks are misaligned with agentic software engineering because they were designed for a pre-agent era. It argues that a coding agent is a composite system harness (models, harnesses, contexts, environments, feedback signals) rather than a model alone, and that changes to any harness component can shift end-to-end benchmark scores by margins comparable to those between adjacent model generations. It identifies three symptoms: (i) scores conflate model with harness, (ii) single-reference grading penalizes valid alternatives, and (iii) lack of component-level signals hinders iteration.
Significance. If the central comparability claim holds, the position would usefully highlight limitations in existing benchmarks (e.g., those collapsing system components into single scores) and motivate redesigned evaluations that separate harness effects, support multiple references, and supply per-component signals. This could improve how agent progress is tracked in software engineering. The paper earns credit for clearly enumerating the three symptoms in a direct argumentative structure without circularity or invented parameters.
major comments (2)
- [Abstract] Abstract: The assertion that 'any one of which can move the benchmark score by margins comparable to those between adjacent model generations' is load-bearing for the misalignment diagnosis and the call for redesign, yet it is presented without quantification, ablation results, illustrative examples from benchmarks, or citations to studies measuring harness-component effect sizes. This leaves the premise ungrounded.
- [Symptoms section] Discussion of the three symptoms: The symptoms are logically described, but the manuscript supplies no analysis or external references establishing that these are the primary drivers or that their effects reach the scale of model-generation deltas; without such grounding the recommendation to redesign benchmarks rests on an untested premise rather than demonstrated condition.
minor comments (1)
- The manuscript would benefit from an explicit early definition or scope statement for 'agentic software engineering' to anchor the symptoms discussion.
Simulated Author's Rebuttal
We thank the referee for the review. We address the major comments point by point below. The manuscript is a position paper, so its contribution is the identification and logical framing of misalignment issues rather than new empirical measurements or ablations.
read point-by-point responses
-
Referee: [Abstract] The assertion that 'any one of which can move the benchmark score by margins comparable to those between adjacent model generations' is load-bearing for the misalignment diagnosis and the call for redesign, yet it is presented without quantification, ablation results, illustrative examples from benchmarks, or citations to studies measuring harness-component effect sizes. This leaves the premise ungrounded.
Authors: The paper is a position paper. The assertion synthesizes observed trends in agent development, where harness changes routinely produce score shifts on the order of model-generation gaps, but the manuscript does not include new quantification or ablations because that would alter its scope and purpose. We maintain that a position paper can legitimately highlight such patterns without performing the measurements itself. No revision is made on this point. revision: no
-
Referee: [Symptoms section] The symptoms are logically described, but the manuscript supplies no analysis or external references establishing that these are the primary drivers or that their effects reach the scale of model-generation deltas; without such grounding the recommendation to redesign benchmarks rests on an untested premise rather than demonstrated condition.
Authors: The three symptoms are derived directly from the structural mismatch between pre-agent benchmark designs and composite agent systems; the paper does not claim they are the sole primary drivers or supply new quantitative analysis. The redesign recommendation follows from the logical consequences of the symptoms as stated. Adding empirical grounding or references would convert the work into an empirical study, which exceeds the intended contribution of a position paper. No revision is made. revision: no
Circularity Check
No circularity: position paper with no derivations or self-referential constructions
full rationale
The manuscript is a purely argumentative position paper. It contains no equations, fitted parameters, derivations, or mathematical claims of any kind. The central assertions about benchmark misalignment and the role of harness components are presented directly as observations and symptoms rather than derived from any chain that could reduce to inputs by construction. No self-citations are used in a load-bearing manner for uniqueness theorems or ansatzes, and the text does not rename known results or smuggle in assumptions via citation. The paper is self-contained as a set of stated positions with no internal reduction steps to analyze.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Benchmarks for coding agents should isolate harness components to enable iteration
Reference graph
Works this paper leans on
-
[1]
AI21. 2025. Scaling Agentic Evaluation: Lessons from 200,000 SWE-bench Runs. https://www.ai21.com/blog/scaling-agentic-evaluation-swe-bench/
2025
-
[2]
Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. 2024. SWE-Bench+: Enhanced Coding Benchmark for LLMs. (2024). arXiv:2410.06992 https://arxiv.org/abs/2410.06992
arXiv 2024
-
[3]
Anthropic. 2025. Claude Code. https://claude.com/product/claude-code
2025
-
[4]
Anthropic. 2025. Effective Harnesses for Long-Running Agents. https://www. anthropic.com/engineering/effective-harnesses-for-long-running-agents
2025
-
[5]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. (2021). arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732
Pith/arXiv arXiv 2021
-
[6]
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, and Alexander Golubev. 2026. SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale. arXiv:2602.23866 [cs.SE] https://arxiv.org/abs/2602.23866
Pith/arXiv arXiv 2026
-
[7]
2002.Test-driven development: by example
Kent Beck. 2002.Test-driven development: by example. Addison-Wesley Profes- sional
2002
-
[8]
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. 2024. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. (2024). arXiv:2410.07095 https://arxiv.org/ abs/2410.07095
Pith/arXiv arXiv 2024
-
[9]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374
Pith/arXiv arXiv 2021
-
[10]
Cursor. 2025. Cursor Agents. https://cursor.com/agents
2025
-
[11]
Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. 2021. The benchmark lottery.arXiv preprint arXiv:2107.07002(2021)
arXiv 2021
-
[12]
Zhiyu Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, and Ahmed E. Hassan. 2025. SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints. (2025). arXiv:2509.09853 https://arxiv.org/abs/2509.09853
arXiv 2025
-
[13]
Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, and Ao Qu. 2026. De- cisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows. (2026). arXiv:2605.19099 https://arxiv.org/abs/2605.19099
Pith/arXiv arXiv 2026
-
[14]
Gastown Hall. 2026. GasCity. https://github.com/gastownhall/gascity
2026
-
[15]
Paul Gauthier. 2024. The Aider Polyglot Coding Benchmark. https://aider.chat/ 2024/12/21/polyglot.html
2024
-
[16]
Zhuohan Gu, Qizheng Zhang, Omar Khattab, and Samuel Madden. 2026. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents. (2026). arXiv:2605.19932 https://arxiv.org/abs/2605.19932
Pith/arXiv arXiv 2026
-
[17]
Ahmed E Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic software engineering: Foundational pillars and a research roadmap.arXiv preprint arXiv:2509.06216(2025)
Pith/arXiv arXiv 2025
-
[18]
Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and Fairness. InPro- ceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). 375–385. doi:10.1145/3442188.3445901 https://arxiv.org/abs/1912.05511
-
[19]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. Live- CodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. InInternational Conference on Learning Representations (ICLR). arXiv:2403.07974 https://arxiv.org/abs/2403.07974
Pith/arXiv arXiv 2025
-
[20]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InInternational Conference on Learning Representa- tions (ICLR). arXiv:2310.06770 https://arxiv.org/abs/2310.06770
Pith/arXiv arXiv 2024
-
[21]
Alex Kotliarskyi, Victor Zhu, and Zach Brock. 2026. An open-source spec for Codex orchestration: Symphony. https://openai.com/index/open-source-codex- orchestration-symphony/
2026
-
[22]
Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. 2026. Meta-Harness: End-to-End Optimization of Model Harnesses. (2026). arXiv:2603.28052 https://arxiv.org/abs/2603.28052
Pith/arXiv arXiv 2026
-
[23]
Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Team- mates in Software Engineering (SE 3.0): How Autonomous Coding Agents Are Reshaping Software Engineering. (2025). arXiv:2507.15003 https://arxiv.org/abs/ 2507.15003
Pith/arXiv arXiv 2025
-
[24]
Hao Li, Haoxiang Zhang, and Ahmed E Hassan. 2026. AIDev: Studying AI coding agents on GitHub.arXiv preprint arXiv:2602.09185(2026). https://arxiv.org/abs/ 2602.09185
arXiv 2026
-
[25]
Xiangyi Li, Wenbo Chen, Yimin Liu, et al . 2026. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. (2026). arXiv:2602.12670 https://arxiv.org/abs/2602.12670
Pith/arXiv arXiv 2026
-
[26]
Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason.arXiv preprint arXiv:2506.12286(2025). arXiv:2506.12286 https://arxiv. org/abs/2506.12286
arXiv 2025
-
[27]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024. AgentBench: Evaluating LLMs as Agents. InInternational Conference on Learning Representations (ICLR). arXiv:2308.03688 https://arxiv.org/abs/2308.03688
Pith/arXiv arXiv 2024
-
[28]
Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, et al . 2026. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. (2026). arXiv:2601.11868 https://arxiv.org/abs/2601.11868
Pith/arXiv arXiv 2026
-
[29]
Morph Labs. 2025. SWE-Bench Pro: A Detailed Analysis of Scaffold-Driven Score Variance. https://www.morphllm.com/swe-bench-pro
2025
-
[30]
OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/ introducing-swe-bench-verified/
2024
-
[31]
OpenAI. 2025. Introducing Codex. https://openai.com/index/introducing-codex/
2025
-
[32]
OpenAI. 2025. SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (2025). arXiv:2502.12115 https://arxiv.org/abs/ 2502.12115
arXiv 2025
-
[33]
OpenAI. 2026. Harness Engineering. https://openai.com/index/harness- engineering/
2026
-
[34]
OpenAI. 2026. Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench- verified/
2026
-
[35]
Proximal Labs. 2026. Frontier-SWE: A Benchmark of Long-Horizon Software Engineering Tasks. https://www.frontierswe.com/blog
2026
-
[36]
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. InProceed- ings of the 58th annual meeting of the association for computational linguistics. 4902–4912
2020
-
[37]
Scale AI. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv:2509.16941 https://arxiv.org/abs/2509.16941
Pith/arXiv arXiv 2025
-
[38]
Brian Scanlan. 2026. How we use Claude Code today at Intercom. https://www.linkedin.com/pulse/how-we-use-claude-code-today-intercom- brian-scanlan-eb7cc/
2026
-
[39]
Gian Segato and Engineering at Anthropic. 2026. Quantifying infrastructure noise in agentic coding evals. https://www.anthropic.com/engineering/infrastructure- noise
2026
-
[40]
Maksim Shaposhnikov. 2025. A Proposed Framework For Evaluating Skills.Tessl Blog(2025). https://tessl.io/blog/a-proposed-framework-for-evaluating-skills- research-eng-blog/
2025
-
[41]
Gorinova, Rob Willoughby, and Dru Knox
Maksim Shaposhnikov, Maria I. Gorinova, Rob Willoughby, and Dru Knox. 2025. A Proposed Evaluation Framework for Coding Agents: Tiles Enhance Proper Use of Public APIs by 35%.Tessl Blog(2025). https://tessl.io/blog/proposed- evaluation-framework-for-coding-agents/
2025
-
[42]
StrongDM. 2025. StrongDM Software Factory. https://factory.strongdm.ai/. Field notes on non-interactive agentic development
2025
-
[43]
Sutton and Andrew G
Richard S. Sutton and Andrew G. Barto. 2018.Reinforcement Learning: An Intro- duction(2nd ed.). MIT Press
2018
-
[44]
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, et al. 2025. Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. InInternational Conference on Machine Learning (ICML). arXiv:2502.00561 https://arxiv.org/abs/2502.00561
arXiv 2025
-
[45]
Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al . 2025. Open- Hands: An Open Platform for AI Software Developers as Generalist Agents. In International Conference on Learning Representations (ICLR). arXiv:2407.16741 https://arxiv.org/abs/2407.16741
Pith/arXiv arXiv 2025
-
[46]
You Wang, Michael Pradel, and Zhongxin Liu. 2025. Are “Solved Issues” in SWE- bench Really Solved Correctly? An Empirical Study. (2025). arXiv:2503.15223 https://arxiv.org/abs/2503.15223
arXiv 2025
-
[47]
Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. 2026. Many SWE-bench- Passing PRs Would Not Be Merged into Main. https://metr.org/notes/2026-03- 10-many-swe-bench-passing-prs-would-not-be-merged-into-main/
2026
-
[48]
Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua Clymer, Jai Dhyani, et al . 2025. RE- Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts. InInternational Conference on Machine Learning (ICML). arXiv:2411.15114 https://arxiv.org/abs/2411.15114
arXiv 2025
-
[49]
Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Inter- faces Enable Automated Software Engineering. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv:2405.15793 https://arxiv.org/abs/2405.15793
Pith/arXiv arXiv 2024
-
[50]
John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, and Ofir Press. 2025. SWE-bench Multimodal: Do AI Systems Gen- eralize to Visual Software Domains?. InInternational Conference on Learning Representations (ICLR). arXiv:2410.03859 https://arxi...
arXiv 2025
-
[51]
John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. 2026. ProgramBench: Can Language Models Rebuild Programs From Scratch? arXiv:2605.03546 [cs.SE] https://arxiv.org/abs/2605.03546
Pith/arXiv arXiv 2026
-
[52]
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. 𝜏- bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. (2024). arXiv:2406.12045 https://arxiv.org/abs/2406.12045
Pith/arXiv arXiv 2024
-
[53]
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al
-
[54]
InInternational Conference on Learning Repre- sentations (ICLR)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. InInternational Conference on Learning Repre- sentations (ICLR). arXiv:2406.15877 https://arxiv.org/abs/2406.15877. 6
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.