Recognition: no theorem link
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
Pith reviewed 2026-05-15 00:11 UTC · model grok-4.3
The pith
No evaluated coding agent completes any full problem end-to-end, and agent code grows more verbose and structurally eroded with each iterative extension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SlopCodeBench consists of 36 problems and 196 checkpoints where agents must extend their own solutions according to evolving specifications that require architectural decisions but leave internal structure open. Across 15 evaluated agents, none solves any problem completely end-to-end, and the strongest passes only 14.8 percent of checkpoints. Structural erosion increases in 77 percent of trajectories and verbosity increases in 75.5 percent. Agent code is on average 2.3 times more verbose and 2.0 times more eroded than code in 473 human open-source Python repositories, which show smaller and less frequent degradation across their git histories. Adding explicit quality guidance to prompts cuts initial verbosity and erosion by up to a third but does not change subsequent degradation rates.
What carries the argument
The SlopCodeBench benchmark, which tracks structural erosion (concentrated complexity) and verbosity (redundant code) across each agent's sequence of code extensions; a rough sketch of this trajectory-level bookkeeping follows.
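To make the trajectory-level claims concrete, here is an illustrative sketch, not the paper's code, of how per-checkpoint erosion and verbosity scores could be recorded and a degradation trend extracted. The field names and the least-squares slope are assumptions about one reasonable way to operationalize "degradation increases across checkpoints."

```python
# Illustrative sketch only: per-checkpoint degradation tracking.
# `erosion` and `verbosity` stand in for whatever concrete metrics
# the benchmark defines; they are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class CheckpointSnapshot:
    checkpoint_id: int
    erosion: float    # concentrated-complexity score at this checkpoint
    verbosity: float  # redundant-code score at this checkpoint

def degradation_slope(trajectory: list[CheckpointSnapshot], field: str) -> float:
    """Least-squares slope of a metric over checkpoint index (positive = degrading)."""
    xs = [s.checkpoint_id for s in trajectory]
    ys = [getattr(s, field) for s in trajectory]
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0

# Example: a trajectory whose verbosity rises across three extensions.
traj = [CheckpointSnapshot(1, 0.30, 0.20),
        CheckpointSnapshot(2, 0.35, 0.32),
        CheckpointSnapshot(3, 0.41, 0.45)]
print(degradation_slope(traj, "verbosity") > 0)  # True -> counts as a degrading trajectory
```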
If this is right
- Explicit quality guidance in prompts reduces initial verbosity and erosion by up to one third but does not slow subsequent degradation rates.
- Agent code starts worse than typical human repositories and diverges further with each extension.
- No evaluated agent maintains solution quality while performing the required iterative extensions.
- Human repositories degrade less often and by smaller margins across their commit histories than agent trajectories.
Where Pith is reading between the lines
- Teams using agents for ongoing development may incur rising maintenance costs from accumulating bloat and complexity.
- Future benchmarks could add periodic refactoring steps to test whether agents can self-correct degradation.
- The gap between agent and human code quality suggests agents need built-in mechanisms for pruning or simplifying prior work.
- Architectural freedom in the benchmark may expose decision-making weaknesses that more constrained setups hide.
Load-bearing premise
The chosen metrics for structural erosion and verbosity accurately reflect meaningful degradation that would matter in real software projects beyond the 36 selected problems.
What would settle it
Finding even one agent that completes multiple full problems while keeping both structural erosion and verbosity stable or decreasing across all 196 checkpoints would falsify the central degradation claim.
Original abstract
Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossible to faithfully measure how their decisions shape future extensions. We introduce SlopCodeBench, a benchmark of 36 problems and 196 checkpoints where agents repeatedly extend their own solutions. Unlike prior iterative benchmarks, our evolving specifications demand architectural decisions but leave internal structure to the agent. We measure two forms of degradation: structural erosion (concentrated complexity) and verbosity (redundant code). Evaluating 15 coding agents across open and closed models, we find that no agent fully solves any problem end-to-end, and the best agent passes 14.8% of checkpoints. Quality degrades across checkpoints, with structural erosion rising in 77% of trajectories and verbosity in 75.5%. Compared to 473 open-source Python repositories, agent code is 2.3x more verbose and 2.0x more eroded, and the human repositories degrade less often and by smaller margins across their git histories. Explicit quality guidance reduces initial verbosity and erosion by up to a third, without affecting degradation rates. SlopCodeBench provides the first measurement of code degradation under iterative extension, revealing that agents pass checkpoints while producing code that erodes and bloats with each turn.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SlopCodeBench, a benchmark of 36 problems and 196 checkpoints for evaluating coding agents on long-horizon iterative extension tasks where agents build on their own prior solutions under evolving specifications. It reports that no agent fully solves any problem end-to-end, with the best agent passing only 14.8% of checkpoints. Quality degrades over iterations, with structural erosion (concentrated complexity) rising in 77% of trajectories and verbosity (redundant code) in 75.5%. Agent code is 2.3x more verbose and 2.0x more eroded than code from 473 open-source Python repositories, while human git histories show less frequent and smaller degradation; explicit quality guidance reduces initial issues by up to a third but does not alter degradation rates.
Significance. If the custom metrics prove valid, the work fills a gap in agent evaluation by moving beyond single-shot benchmarks to measure iterative degradation, with the comparison to real open-source repositories and their git histories providing a useful empirical anchor. The finding that agents pass checkpoints while producing progressively eroded and bloated code has direct implications for deploying coding agents in sustained software development.
major comments (2)
- [§4.1] §4.1 (Metric Definitions): The central claims of degradation and the 2.0x/2.3x comparisons to the 473 repositories rest on the structural erosion and verbosity metrics, yet the manuscript provides no validation (e.g., correlation with bug density, maintenance effort, or developer surveys) that these proxies capture practically meaningful quality loss rather than superficial syntactic changes.
- [§5.3] §5.3 (Human Repository Comparison): The assertion that human repositories degrade less often and by smaller margins requires explicit confirmation that the same erosion and verbosity formulas were applied to the git histories with equivalent checkpoint alignment; without this, the differential degradation result is not directly comparable.
minor comments (2)
- [Table 2] Table 2: The checkpoint pass-rate column would be clearer if it distinguished first-attempt versus cumulative success across the 196 checkpoints.
- [Figure 4] Figure 4: Axis labels on the erosion/verbosity trajectory plots should explicitly state the units or normalization used for the y-axes.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our metrics and comparisons.
Point-by-point responses
-
Referee: [§4.1] §4.1 (Metric Definitions): The central claims of degradation and the 2.0x/2.3x comparisons to the 473 repositories rest on the structural erosion and verbosity metrics, yet the manuscript provides no validation (e.g., correlation with bug density, maintenance effort, or developer surveys) that these proxies capture practically meaningful quality loss rather than superficial syntactic changes.
Authors: We agree that explicit validation against external criteria such as bug density would strengthen the claims. Our structural erosion metric quantifies concentration of cyclomatic complexity (following McCabe 1976 and subsequent maintainability studies), while verbosity detects redundant code via AST-based duplication (an illustrative sketch of metrics in this spirit appears after these responses). These draw on established software engineering proxies rather than being ad hoc. In the revised manuscript we have expanded §4.1 with additional citations to prior work validating similar metrics and have added an explicit limitations paragraph acknowledging the absence of new correlation studies in this paper. We view the metrics as reasonable proxies for the degradation phenomenon we measure, but we do not claim they are fully validated substitutes for direct quality outcomes. revision: partial
-
Referee: [§5.3] §5.3 (Human Repository Comparison): The assertion that human repositories degrade less often and by smaller margins requires explicit confirmation that the same erosion and verbosity formulas were applied to the git histories with equivalent checkpoint alignment; without this, the differential degradation result is not directly comparable.
Authors: We confirm that identical formulas were used. For the 473 repositories we sampled commit histories and aligned measurement points to intervals comparable to the benchmark checkpoints (i.e., after each logical extension step); a sketch of this kind of extraction appears below. We have added a new paragraph in §5.3 and a short appendix subsection that explicitly describe the extraction procedure, commit selection criteria, and alignment method to make the comparison transparent and reproducible. revision: yes
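The first response describes structural erosion as concentration of cyclomatic complexity and verbosity as AST-based duplication, but gives no formulas. The following is a minimal, hypothetical sketch of metrics in that spirit, not the paper's implementation: concentration is read as a Gini coefficient over a crude per-function cyclomatic count, and duplication as the share of repeated statement subtrees. All function names and thresholds here are assumptions.

```python
# Hedged sketch, not the paper's metrics: one plausible reading of
# "concentration of cyclomatic complexity" and "AST-based duplication".
import ast
from collections import Counter

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def function_complexities(source: str) -> list[int]:
    """Crude cyclomatic complexity per function: 1 + number of branching nodes."""
    tree = ast.parse(source)
    scores = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            scores.append(1 + branches)
    return scores

def gini(values: list[int]) -> float:
    """Gini coefficient as a concentration proxy: 0 = evenly spread, near 1 = concentrated."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * cum) / (n * total) - (n + 1) / n

def duplicate_statement_ratio(source: str) -> float:
    """Share of function-body statements whose AST dump repeats elsewhere in the file."""
    tree = ast.parse(source)
    dumps = [ast.dump(stmt) for node in ast.walk(tree)
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
             for stmt in node.body]
    if not dumps:
        return 0.0
    counts = Counter(dumps)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(dumps)
```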
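The second response says the same formulas were applied to sampled, checkpoint-aligned points along each repository's git history. A hedged sketch of that kind of extraction, using only standard git plumbing commands; `score_tree` is a hypothetical stand-in for the metric functions above, not part of the paper's tooling.

```python
# Hedged sketch of history sampling: pick roughly evenly spaced commits,
# read the Python sources at each one without checking it out, and score
# the tree with the same metric functions used for agent checkpoints.
import subprocess

def run(args: list[str], cwd: str) -> str:
    return subprocess.run(args, cwd=cwd, capture_output=True, text=True, check=True).stdout

def sampled_commits(repo: str, n_points: int) -> list[str]:
    """Pick n_points commits, oldest to newest, roughly evenly spaced."""
    shas = run(["git", "rev-list", "--reverse", "HEAD"], repo).split()
    if len(shas) <= n_points:
        return shas
    step = len(shas) / n_points
    return [shas[int(i * step)] for i in range(n_points)]

def python_sources_at(repo: str, sha: str) -> dict[str, str]:
    """Read every .py file from the tree at `sha` via `git show sha:path`."""
    paths = run(["git", "ls-tree", "-r", "--name-only", sha], repo).split("\n")
    return {p: run(["git", "show", f"{sha}:{p}"], repo)
            for p in paths if p.endswith(".py")}

def history_trajectory(repo: str, score_tree, n_points: int = 10) -> list[float]:
    """Score each sampled commit; the series is then compared to agent trajectories."""
    return [score_tree(python_sources_at(repo, sha))
            for sha in sampled_commits(repo, n_points)]
```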
Circularity Check
No significant circularity in empirical benchmark evaluation
Full rationale
The paper introduces SlopCodeBench with 36 problems and 196 checkpoints, evaluates 15 agents by direct execution on iterative extensions, computes structural erosion and verbosity via code analysis metrics, and compares results to 473 external open-source Python repositories plus git histories. All central claims (14.8% checkpoint pass rate, 77% erosion rise, 75.5% verbosity rise, 2.3x/2.0x factors) follow from these measurements without any derivation step reducing to a fitted parameter, self-definition, or self-citation chain. Metrics are applied as defined to observed code; no ansatz, uniqueness theorem, or renaming of known results is invoked to force outcomes. The evaluation stands as self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: structural erosion and verbosity metrics validly measure code quality degradation.