Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

Amy Heineike; Dru Knox; Macey Baker; Maksim Shaposhnikov; Maria I. Gorinova; Rob Willoughby

arXiv: 2606.17799 · v1 · pith:H4D4QHPRnew · submitted 2026-06-16 · 💻 cs.SE · cs.AI · cs.CL

Pith reviewed 2026-06-26 23:46 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL

keywords coding benchmarks · agentic software engineering · system harness · benchmark misalignment · software agents · evaluation metrics · harness components · agent systems

The pith

Coding benchmarks are misaligned with agentic software engineering because they score composite system harnesses as if they were single models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that benchmarks for coding agents were built for an earlier time when the focus was on models alone. Today an agent is a full system made of models plus harnesses, contexts, environments and feedback loops. Any one of these pieces can change the final score by as much as switching to a newer model generation. This setup creates three problems: the overall score hides which part is responsible, it marks valid alternative solutions as wrong, and it offers no detailed signal for improving one component at a time.

Core claim

Current coding benchmarks collapse the model, harness, and environment into a single end-to-end score computed against one reference solution, with no component-level signal. A coding agent is a system harness whose individual components can each move benchmark scores by margins comparable to adjacent model generations. The misalignment shows in three symptoms: conflated scores, penalization of alternative solutions, and lack of iteration signals at the component level.
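To make the conflation symptom concrete, here is a minimal sketch, assuming a system can be described by one label per component. The `SystemConfig` class, its field names, and the example labels are hypothetical and do not come from the paper; the point is only that a gap between two end-to-end scores can be attributed to one component when exactly one component differs between the two systems.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class SystemConfig:
    """One evaluated system; every field is a hypothetical label."""
    model: str
    harness: str
    context: str
    environment: str
    feedback: str


def changed_components(a: SystemConfig, b: SystemConfig) -> List[str]:
    """Names of the components that differ between two evaluated systems."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def is_attributable(a: SystemConfig, b: SystemConfig) -> bool:
    """A score gap can be pinned on one component only if exactly one changed."""
    return len(changed_components(a, b)) == 1


# Hypothetical systems: same model with a new harness, then both swapped at once.
baseline = SystemConfig("model-a", "harness-x", "ctx-1", "env-1", "tests-only")
new_harness = replace(baseline, harness="harness-y")
new_both = replace(baseline, model="model-b", harness="harness-y")

print(changed_components(baseline, new_harness))  # ['harness']
print(is_attributable(baseline, new_both))        # False
```

A leaderboard entry that changes model and harness together corresponds to the `new_both` case: the end-to-end score moves, but the comparison alone cannot say which component moved it.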

What carries the argument

The system harness: a composite of models, harnesses, contexts, environments, and feedback signals. It carries the argument through the claim that each part can independently shift benchmark outcomes by margins on the scale of a model-generation change.

Load-bearing premise

That changes to any single component of the system harness move benchmark scores by amounts comparable to improvements from new model generations.

What would settle it

A controlled study that measures score changes when only the model is swapped versus when only one harness component is changed, to check if the latter produces comparable or larger shifts.
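The comparison such a study would report can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the `swap_effects` function is hypothetical, and the pass rates in the grid are placeholder values chosen for the example, not measurements from the paper or from any benchmark.

```python
from typing import Dict, Tuple

# (model, harness) -> pass rate in [0, 1]
Scores = Dict[Tuple[str, str], float]


def swap_effects(scores: Scores, base_model: str, base_harness: str) -> Dict[str, Dict[str, float]]:
    """Score change from swapping exactly one factor away from the baseline cell."""
    base = scores[(base_model, base_harness)]
    # Hold the harness fixed and vary only the model.
    model_deltas = {
        m: s - base
        for (m, h), s in scores.items()
        if h == base_harness and m != base_model
    }
    # Hold the model fixed and vary only the harness.
    harness_deltas = {
        h: s - base
        for (m, h), s in scores.items()
        if m == base_model and h != base_harness
    }
    return {"model": model_deltas, "harness": harness_deltas}


# Placeholder grid, NOT real results: two hypothetical models, two hypothetical harnesses.
grid: Scores = {
    ("model_a", "harness_x"): 0.40,
    ("model_b", "harness_x"): 0.50,
    ("model_a", "harness_y"): 0.48,
    ("model_b", "harness_y"): 0.55,
}

effects = swap_effects(grid, base_model="model_a", base_harness="harness_x")
for factor, deltas in effects.items():
    for name, delta in deltas.items():
        print(f"{factor} swap to {name}: {delta:+.2f}")
```

The premise would be supported if, on real data, the harness-only deltas were of similar size to the model-only deltas. A real study would also need repeated runs per cell to separate these effects from run-to-run noise.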

Figures

Figures reproduced from arXiv: 2606.17799 by Amy Heineike, Dru Knox, Macey Baker, Maksim Shaposhnikov, Maria I. Gorinova, Rob Willoughby.

Original abstract

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.
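Symptom (ii) can be illustrated with a deliberately extreme toy. The abstract does not say how any particular benchmark grades, so the two graders below are assumptions made for the example: `grade_by_reference` stands in for grading that is tightly coupled to one reference solution (here, a text match), and `grade_by_behavior` checks input/output behaviour only. The `dedupe` task and both solutions are hypothetical.

```python
from typing import Any, List, Tuple

Case = Tuple[tuple, Any]  # (positional args, expected return value)


def grade_by_reference(candidate_src: str, reference_src: str) -> bool:
    """Pass only if the candidate matches the single reference, ignoring whitespace."""
    def normalise(src: str) -> str:
        return " ".join(src.split())
    return normalise(candidate_src) == normalise(reference_src)


def grade_by_behavior(candidate_src: str, func_name: str, cases: List[Case]) -> bool:
    """Pass if the candidate function returns the expected value on every case."""
    namespace: dict = {}
    exec(candidate_src, namespace)  # acceptable for a toy; never exec untrusted code
    func = namespace[func_name]
    return all(func(*args) == expected for args, expected in cases)


REFERENCE = "def dedupe(xs):\n    return list(dict.fromkeys(xs))\n"

# A different but equally valid order-preserving implementation.
ALTERNATIVE = (
    "def dedupe(xs):\n"
    "    seen = set()\n"
    "    out = []\n"
    "    for x in xs:\n"
    "        if x not in seen:\n"
    "            seen.add(x)\n"
    "            out.append(x)\n"
    "    return out\n"
)

CASES: List[Case] = [(([1, 2, 1, 3],), [1, 2, 3]), (([],), []), ((["a", "a"],), ["a"])]

print(grade_by_reference(ALTERNATIVE, REFERENCE))        # False
print(grade_by_behavior(ALTERNATIVE, "dedupe", CASES))   # True
```

Real benchmarks are rarely this literal, but the same failure appears in milder forms, for example when tests depend on helper names or file layouts that only the reference solution happens to use.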

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper makes a clear case that agent benchmarks need component signals but asserts harness effects without evidence or examples.

The core point here is that coding benchmarks were built for models, not for the full agent systems that include harnesses, contexts, and feedback loops, so they give poor iteration signals. The three symptoms listed—conflating model and harness, single-reference grading, and no component breakdown—are laid out plainly and make sense as practical problems.

The paper does a decent job framing agents as composite systems rather than just models. The single-reference issue is especially straightforward and has clear implications for how agents get penalized for valid but different solutions. That part of the argument stands on its own.

The main weakness is the central assertion that changes to harness, environment, or feedback can shift scores by amounts comparable to model generation gaps. The abstract states this directly, but there are no numbers, ablations, or even short examples from existing benchmarks to show the effect sizes. The stress-test note is accurate on this—the diagnosis of misalignment at a practically important scale stays ungrounded.

This is aimed at researchers working on coding agents and their evaluation. Someone thinking about benchmark redesign would get value from the symptom list and the system-harness distinction. It deserves peer review because the position is coherent and the topic matters for current work, even though the scale claim needs more support.

Referee Report

2 major / 1 minor

Summary. The paper claims that coding benchmarks are misaligned with agentic software engineering because they were designed for a pre-agent era. It argues that a coding agent is a composite system harness (models, harnesses, contexts, environments, feedback signals) rather than a model alone, and that changes to any harness component can shift end-to-end benchmark scores by margins comparable to those between adjacent model generations. It identifies three symptoms: (i) scores conflate model with harness, (ii) single-reference grading penalizes valid alternatives, and (iii) lack of component-level signals hinders iteration.

Significance. If the central comparability claim holds, the position would usefully highlight limitations in existing benchmarks (e.g., those collapsing system components into single scores) and motivate redesigned evaluations that separate harness effects, support multiple references, and supply per-component signals. This could improve how agent progress is tracked in software engineering. The paper earns credit for clearly enumerating the three symptoms in a direct argumentative structure without circularity or invented parameters.

major comments (2)

[Abstract] The assertion that 'any one of which can move the benchmark score by margins comparable to those between adjacent model generations' is load-bearing for the misalignment diagnosis and the call for redesign, yet it is presented without quantification, ablation results, illustrative examples from benchmarks, or citations to studies measuring harness-component effect sizes. This leaves the premise ungrounded.

[Symptoms section] The symptoms are logically described, but the manuscript supplies no analysis or external references establishing that these are the primary drivers or that their effects reach the scale of model-generation deltas; without such grounding the recommendation to redesign benchmarks rests on an untested premise rather than a demonstrated condition.

minor comments (1)

The manuscript would benefit from an explicit early definition or scope statement for 'agentic software engineering' to anchor the symptoms discussion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review. We address the major comments point by point below. The manuscript is a position paper, so its contribution is the identification and logical framing of misalignment issues rather than new empirical measurements or ablations.

Referee: [Abstract] The assertion that 'any one of which can move the benchmark score by margins comparable to those between adjacent model generations' is load-bearing for the misalignment diagnosis and the call for redesign, yet it is presented without quantification, ablation results, illustrative examples from benchmarks, or citations to studies measuring harness-component effect sizes. This leaves the premise ungrounded.

Authors: The paper is a position paper. The assertion synthesizes observed trends in agent development, where harness changes routinely produce score shifts on the order of model-generation gaps, but the manuscript does not include new quantification or ablations because that would alter its scope and purpose. We maintain that a position paper can legitimately highlight such patterns without performing the measurements itself. No revision is made on this point.

revision: no

Referee: [Symptoms section] The symptoms are logically described, but the manuscript supplies no analysis or external references establishing that these are the primary drivers or that their effects reach the scale of model-generation deltas; without such grounding the recommendation to redesign benchmarks rests on an untested premise rather than a demonstrated condition.

Authors: The three symptoms are derived directly from the structural mismatch between pre-agent benchmark designs and composite agent systems; the paper does not claim they are the sole primary drivers or supply new quantitative analysis. The redesign recommendation follows from the logical consequences of the symptoms as stated. Adding empirical grounding or references would convert the work into an empirical study, which exceeds the intended contribution of a position paper. No revision is made.

revision: no

Circularity Check

0 steps flagged

No circularity: position paper with no derivations or self-referential constructions

The manuscript is a purely argumentative position paper. It contains no equations, fitted parameters, derivations, or mathematical claims of any kind. The central assertions about benchmark misalignment and the role of harness components are presented directly as observations and symptoms rather than derived from any chain that could reduce to inputs by construction. No self-citations are used in a load-bearing manner for uniqueness theorems or ansatzes, and the text does not rename known results or smuggle in assumptions via citation. The paper is self-contained as a set of stated positions with no internal reduction steps to analyze.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the described symptoms cause practical misalignment in agent development, without new parameters or entities introduced.

axioms (1)

domain assumption: Benchmarks for coding agents should isolate harness components to enable iteration.
This is implicit in the claim that the absence of component-level signal makes end-to-end scores difficult to iterate on.

Reference graph

Works this paper leans on

54 extracted references · 1 canonical work pages

[1] AI21. 2025. Scaling Agentic Evaluation: Lessons from 200,000 SWE-bench Runs. https://www.ai21.com/blog/scaling-agentic-evaluation-swe-bench/
[2] Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. 2024. SWE-Bench+: Enhanced Coding Benchmark for LLMs. arXiv:2410.06992 https://arxiv.org/abs/2410.06992
[3] Anthropic. 2025. Claude Code. https://claude.com/product/claude-code
[4] Anthropic. 2025. Effective Harnesses for Long-Running Agents. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
[5] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732
[6] Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, and Alexander Golubev. 2026. SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale. arXiv:2602.23866 [cs.SE] https://arxiv.org/abs/2602.23866
[7] Kent Beck. 2002. Test-driven development: by example. Addison-Wesley Professional.
[8] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. 2024. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv:2410.07095 https://arxiv.org/abs/2410.07095
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374
[10] Cursor. 2025. Cursor Agents. https://cursor.com/agents
[11] Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. 2021. The benchmark lottery. arXiv preprint arXiv:2107.07002.
[12] Zhiyu Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, and Ahmed E. Hassan. 2025. SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints. arXiv:2509.09853 https://arxiv.org/abs/2509.09853
[13] Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, and Ao Qu. 2026. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows. arXiv:2605.19099 https://arxiv.org/abs/2605.19099
[14] Gastown Hall. 2026. GasCity. https://github.com/gastownhall/gascity
[15] Paul Gauthier. 2024. The Aider Polyglot Coding Benchmark. https://aider.chat/2024/12/21/polyglot.html
[16] Zhuohan Gu, Qizheng Zhang, Omar Khattab, and Samuel Madden. 2026. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents. arXiv:2605.19932 https://arxiv.org/abs/2605.19932
[17] Ahmed E Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic software engineering: Foundational pillars and a research roadmap. arXiv preprint arXiv:2509.06216.
[18] Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and Fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). 375–385. doi:10.1145/3442188.3445901 https://arxiv.org/abs/1912.05511
[19] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. In International Conference on Learning Representations (ICLR). arXiv:2403.07974 https://arxiv.org/abs/2403.07974
[20] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In International Conference on Learning Representations (ICLR). arXiv:2310.06770 https://arxiv.org/abs/2310.06770
[21] Alex Kotliarskyi, Victor Zhu, and Zach Brock. 2026. An open-source spec for Codex orchestration: Symphony. https://openai.com/index/open-source-codex-orchestration-symphony/
[22] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. 2026. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052 https://arxiv.org/abs/2603.28052
[23] Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE 3.0): How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv:2507.15003 https://arxiv.org/abs/2507.15003
[24] Hao Li, Haoxiang Zhang, and Ahmed E Hassan. 2026. AIDev: Studying AI coding agents on GitHub. arXiv preprint arXiv:2602.09185. https://arxiv.org/abs/2602.09185
[25] Xiangyi Li, Wenbo Chen, Yimin Liu, et al. 2026. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. arXiv:2602.12670 https://arxiv.org/abs/2602.12670
[26] Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. arXiv preprint arXiv:2506.12286. https://arxiv.org/abs/2506.12286
[27] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024. AgentBench: Evaluating LLMs as Agents. In International Conference on Learning Representations (ICLR). arXiv:2308.03688 https://arxiv.org/abs/2308.03688
[28] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, et al. 2026. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv:2601.11868 https://arxiv.org/abs/2601.11868
[29] Morph Labs. 2025. SWE-Bench Pro: A Detailed Analysis of Scaffold-Driven Score Variance. https://www.morphllm.com/swe-bench-pro
[30] OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/
[31] OpenAI. 2025. Introducing Codex. https://openai.com/index/introducing-codex/
[32] OpenAI. 2025. SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? arXiv:2502.12115 https://arxiv.org/abs/2502.12115
[33] OpenAI. 2026. Harness Engineering. https://openai.com/index/harness-engineering/
[34] OpenAI. 2026. Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
[35] Proximal Labs. 2026. Frontier-SWE: A Benchmark of Long-Horizon Software Engineering Tasks. https://www.frontierswe.com/blog
[36] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th annual meeting of the association for computational linguistics. 4902–4912.
[37] Scale AI. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv:2509.16941 https://arxiv.org/abs/2509.16941
[38] Brian Scanlan. 2026. How we use Claude Code today at Intercom. https://www.linkedin.com/pulse/how-we-use-claude-code-today-intercom-brian-scanlan-eb7cc/
[39] Gian Segato and Engineering at Anthropic. 2026. Quantifying infrastructure noise in agentic coding evals. https://www.anthropic.com/engineering/infrastructure-noise
[40] Maksim Shaposhnikov. 2025. A Proposed Framework For Evaluating Skills. Tessl Blog (2025). https://tessl.io/blog/a-proposed-framework-for-evaluating-skills-research-eng-blog/
[41] Maksim Shaposhnikov, Maria I. Gorinova, Rob Willoughby, and Dru Knox. 2025. A Proposed Evaluation Framework for Coding Agents: Tiles Enhance Proper Use of Public APIs by 35%. Tessl Blog (2025). https://tessl.io/blog/proposed-evaluation-framework-for-coding-agents/
[42] StrongDM. 2025. StrongDM Software Factory. https://factory.strongdm.ai/. Field notes on non-interactive agentic development.
[43] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
[44] Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, et al. 2025. Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. In International Conference on Machine Learning (ICML). arXiv:2502.00561 https://arxiv.org/abs/2502.00561
[45] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2025. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. In International Conference on Learning Representations (ICLR). arXiv:2407.16741 https://arxiv.org/abs/2407.16741
[46] You Wang, Michael Pradel, and Zhongxin Liu. 2025. Are “Solved Issues” in SWE-bench Really Solved Correctly? An Empirical Study. arXiv:2503.15223 https://arxiv.org/abs/2503.15223
[47] Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. 2026. Many SWE-bench-Passing PRs Would Not Be Merged into Main. https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/
[48] Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua Clymer, Jai Dhyani, et al. 2025. RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts. In International Conference on Machine Learning (ICML). arXiv:2411.15114 https://arxiv.org/abs/2411.15114
[49] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2405.15793 https://arxiv.org/abs/2405.15793
[50] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, and Ofir Press. 2025. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? In International Conference on Learning Representations (ICLR). arXiv:2410.03859 https://arxi...
[51] John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. 2026. ProgramBench: Can Language Models Rebuild Programs From Scratch? arXiv:2605.03546 [cs.SE] https://arxiv.org/abs/2605.03546
[52] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. 𝜏-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045 https://arxiv.org/abs/2406.12045
[53] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. In International Conference on Learning Representations (ICLR). arXiv:2406.15877 https://arxiv.org/abs/2406.15877

[1] [1]

AI21. 2025. Scaling Agentic Evaluation: Lessons from 200,000 SWE-bench Runs. https://www.ai21.com/blog/scaling-agentic-evaluation-swe-bench/

2025

[2] [2]

Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. 2024. SWE-Bench+: Enhanced Coding Benchmark for LLMs. (2024). arXiv:2410.06992 https://arxiv.org/abs/2410.06992

arXiv 2024

[3] [3]

Anthropic. 2025. Claude Code. https://claude.com/product/claude-code

2025

[4] [4]

Anthropic. 2025. Effective Harnesses for Long-Running Agents. https://www. anthropic.com/engineering/effective-harnesses-for-long-running-agents

2025

[5] [5]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. (2021). arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

Pith/arXiv arXiv 2021

[6] [6]

Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, and Alexander Golubev. 2026. SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale. arXiv:2602.23866 [cs.SE] https://arxiv.org/abs/2602.23866

Pith/arXiv arXiv 2026

[7] [7]

2002.Test-driven development: by example

Kent Beck. 2002.Test-driven development: by example. Addison-Wesley Profes- sional

2002

[8] [8]

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. 2024. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. (2024). arXiv:2410.07095 https://arxiv.org/ abs/2410.07095

Pith/arXiv arXiv 2024

[9] [9]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374

Pith/arXiv arXiv 2021

[10] [10]

Cursor. 2025. Cursor Agents. https://cursor.com/agents

2025

[11] [11]

Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. 2021. The benchmark lottery.arXiv preprint arXiv:2107.07002(2021)

arXiv 2021

[12] [12]

Zhang, Pinjia He, and Ahmed E

Zhiyu Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, and Ahmed E. Hassan. 2025. SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints. (2025). arXiv:2509.09853 https://arxiv.org/abs/2509.09853

arXiv 2025

[13] [13]

Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, and Ao Qu. 2026. De- cisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows. (2026). arXiv:2605.19099 https://arxiv.org/abs/2605.19099

Pith/arXiv arXiv 2026

[14] [14]

Gastown Hall. 2026. GasCity. https://github.com/gastownhall/gascity

2026

[15] [15]

Paul Gauthier. 2024. The Aider Polyglot Coding Benchmark. https://aider.chat/ 2024/12/21/polyglot.html

2024

[16] [16]

Zhuohan Gu, Qizheng Zhang, Omar Khattab, and Samuel Madden. 2026. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents. (2026). arXiv:2605.19932 https://arxiv.org/abs/2605.19932

Pith/arXiv arXiv 2026

[17] [17]

Ahmed E Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic software engineering: Foundational pillars and a research roadmap.arXiv preprint arXiv:2509.06216(2025)

Pith/arXiv arXiv 2025

[18] [18]

Jacobs and Hanna Wallach

Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and Fairness. InPro- ceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). 375–385. doi:10.1145/3442188.3445901 https://arxiv.org/abs/1912.05511

work page doi:10.1145/3442188.3445901 2021

[19] [19]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. Live- CodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code. InInternational Conference on Learning Representations (ICLR). arXiv:2403.07974 https://arxiv.org/abs/2403.07974

Pith/arXiv arXiv 2025

[20] [20]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InInternational Conference on Learning Representa- tions (ICLR). arXiv:2310.06770 https://arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024

[21] [21]

Alex Kotliarskyi, Victor Zhu, and Zach Brock. 2026. An open-source spec for Codex orchestration: Symphony. https://openai.com/index/open-source-codex- orchestration-symphony/

2026

[22] [22]

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. 2026. Meta-Harness: End-to-End Optimization of Model Harnesses. (2026). arXiv:2603.28052 https://arxiv.org/abs/2603.28052

Pith/arXiv arXiv 2026

[23] [23]

Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Team- mates in Software Engineering (SE 3.0): How Autonomous Coding Agents Are Reshaping Software Engineering. (2025). arXiv:2507.15003 https://arxiv.org/abs/ 2507.15003

Pith/arXiv arXiv 2025

[24] [24]

Hao Li, Haoxiang Zhang, and Ahmed E Hassan. 2026. AIDev: Studying AI coding agents on GitHub.arXiv preprint arXiv:2602.09185(2026). https://arxiv.org/abs/ 2602.09185

arXiv 2026

[25] [25]

Xiangyi Li, Wenbo Chen, Yimin Liu, et al . 2026. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. (2026). arXiv:2602.12670 https://arxiv.org/abs/2602.12670

Pith/arXiv arXiv 2026

[26] [26]

Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason.arXiv preprint arXiv:2506.12286(2025). arXiv:2506.12286 https://arxiv. org/abs/2506.12286

arXiv 2025

[27] [27]

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024. AgentBench: Evaluating LLMs as Agents. InInternational Conference on Learning Representations (ICLR). arXiv:2308.03688 https://arxiv.org/abs/2308.03688

Pith/arXiv arXiv 2024

[28] [28]

Merrill, Alexander G

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, et al . 2026. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. (2026). arXiv:2601.11868 https://arxiv.org/abs/2601.11868

Pith/arXiv arXiv 2026

[29] [29]

Morph Labs. 2025. SWE-Bench Pro: A Detailed Analysis of Scaffold-Driven Score Variance. https://www.morphllm.com/swe-bench-pro

2025

[30] [30]

OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/ introducing-swe-bench-verified/

2024

[31] [31]

OpenAI. 2025. Introducing Codex. https://openai.com/index/introducing-codex/

2025

[32] [32]

OpenAI. 2025. SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (2025). arXiv:2502.12115 https://arxiv.org/abs/ 2502.12115

arXiv 2025

[33] [33]

OpenAI. 2026. Harness Engineering. https://openai.com/index/harness- engineering/

2026

[34] [34]

OpenAI. 2026. Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench- verified/

2026

[35] [35]

Proximal Labs. 2026. Frontier-SWE: A Benchmark of Long-Horizon Software Engineering Tasks. https://www.frontierswe.com/blog

2026

[36] [36]

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. InProceed- ings of the 58th annual meeting of the association for computational linguistics. 4902–4912

2020

[37] [37]

Scale AI. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv:2509.16941 https://arxiv.org/abs/2509.16941

[38]

Brian Scanlan. 2026. How we use Claude Code today at Intercom. https://www.linkedin.com/pulse/how-we-use-claude-code-today-intercom-brian-scanlan-eb7cc/

[39]

Gian Segato and Engineering at Anthropic. 2026. Quantifying infrastructure noise in agentic coding evals. https://www.anthropic.com/engineering/infrastructure-noise

[40]

Maksim Shaposhnikov. 2025. A Proposed Framework For Evaluating Skills. Tessl Blog (2025). https://tessl.io/blog/a-proposed-framework-for-evaluating-skills-research-eng-blog/

[41]

Maksim Shaposhnikov, Maria I. Gorinova, Rob Willoughby, and Dru Knox. 2025. A Proposed Evaluation Framework for Coding Agents: Tiles Enhance Proper Use of Public APIs by 35%. Tessl Blog (2025). https://tessl.io/blog/proposed-evaluation-framework-for-coding-agents/

[42]

StrongDM. 2025. StrongDM Software Factory. https://factory.strongdm.ai/. Field notes on non-interactive agentic development

[43]

Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). MIT Press

[44]

Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, et al. 2025. Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. In International Conference on Machine Learning (ICML). arXiv:2502.00561 https://arxiv.org/abs/2502.00561

[45]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2025. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. In International Conference on Learning Representations (ICLR). arXiv:2407.16741 https://arxiv.org/abs/2407.16741

[46]

You Wang, Michael Pradel, and Zhongxin Liu. 2025. Are “Solved Issues” in SWE-bench Really Solved Correctly? An Empirical Study. (2025). arXiv:2503.15223 https://arxiv.org/abs/2503.15223

[47]

Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. 2026. Many SWE-bench-Passing PRs Would Not Be Merged into Main. https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/

[48]

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua Clymer, Jai Dhyani, et al. 2025. RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts. In International Conference on Machine Learning (ICML). arXiv:2411.15114 https://arxiv.org/abs/2411.15114

[49]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2405.15793 https://arxiv.org/abs/2405.15793

[50]

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, and Ofir Press. 2025. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? In International Conference on Learning Representations (ICLR). arXiv:2410.03859 https://arxi...

[51]

John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. 2026. ProgramBench: Can Language Models Rebuild Programs From Scratch? arXiv:2605.03546 [cs.SE] https://arxiv.org/abs/2605.03546

[52]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. 𝜏-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. (2024). arXiv:2406.12045 https://arxiv.org/abs/2406.12045

[53]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. In International Conference on Learning Representations (ICLR). arXiv:2406.15877 https://arxiv.org/abs/2406.15877