pith. machine review for the scientific record.

arxiv: 2605.13139 · v1 · submitted 2026-05-13 · 💻 cs.SE

Recognition: no theorem link

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords code agents · autonomous agents · software benchmarks · issue resolution · LLM evaluation · end-to-end tasks · FullCycle execution · verification methods

The pith

Code agents show sharply lower success rates when handling complete issue resolution autonomously versus in isolated subtasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces SWE-Cycle, a benchmark of 489 instances that tests code agents on three separate tasks—environment reconstruction, code implementation, and verification test generation—plus a single FullCycle task that requires them to perform all steps in sequence inside a bare repository with no human assistance. It pairs the benchmark with SWE-Judge, an evaluation agent that combines static code review and dynamic runtime testing to measure outcomes more reliably than traditional parsers. The central result is a clear drop in solve rates for the integrated task, which the authors trace to agents struggling with dependencies that cross phases and with preserving code quality throughout. Readers focused on AI tools for software work would care because the findings indicate that current agents still need substantial scaffolding to match their performance on narrower problems.
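To make the task structure concrete, here is a minimal sketch of how the four evaluation settings relate. Every name below is hypothetical; the paper's actual harness and API are not shown in this review.

```python
# Minimal sketch of the four-task structure SWE-Cycle describes; all names
# here are invented for illustration, not the paper's actual API.
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    ENV_RECONSTRUCTION = "environment reconstruction"   # build a working env from a bare repo
    CODE_IMPLEMENTATION = "code implementation"         # patch the issue
    TEST_GENERATION = "verification test generation"    # write tests that detect the bug
    FULL_CYCLE = "full cycle"                           # all three in sequence, no scaffolding

@dataclass
class Instance:
    repo: str          # bare repository, no pre-built environment
    issue: str         # natural-language issue description
    base_commit: str   # commit at which the issue reproduces

def run(agent, instance: Instance, task: Task) -> dict:
    """Dispatch one benchmark run. In the isolated tasks the agent receives the
    other phases' outputs for free; in FULL_CYCLE it starts from the bare repo."""
    if task is Task.FULL_CYCLE:
        return agent.solve_end_to_end(instance)   # env -> code -> tests, autonomously
    return agent.solve_single_phase(instance, task)
```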

Core claim

SWE-Cycle evaluates agents across isolated tasks of environment reconstruction, code implementation, and verification test generation, as well as an end-to-end FullCycle task that integrates all three in a bare repository without human scaffolding. Using SWE-Judge, which merges static review with dynamic testing to verify functional correctness, the evaluation of agents powered by six state-of-the-art LLMs reveals a sharp drop in solve rates when moving from isolated tasks to FullCycle execution, exposing bottlenecks in handling cross-phase dependencies and maintaining code quality.

What carries the argument

SWE-Cycle benchmark consisting of isolated subtasks and an integrated FullCycle execution, paired with SWE-Judge for reliable verification of autonomous trajectories through static and dynamic checks.

If this is right

  • Agents require stronger mechanisms to track and resolve dependencies that span multiple development phases.
  • Preserving code quality across an entire autonomous resolution process is a distinct and harder challenge than solving single subtasks.
  • Benchmarks must incorporate full-cycle execution in bare environments to measure real autonomy rather than pre-configured subtasks.
  • Evaluation tools need hybrid static-dynamic verification to avoid systematic errors when assessing complex agent trajectories (a minimal sketch follows this list).
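A minimal sketch of what hybrid verification could mean, assuming a simple conjunctive rule; SWE-Judge's actual mechanism (test intervention, per-dimension scores) is richer, and every name below is illustrative.

```python
# Illustrative only: combine a dynamic (execution-based) verdict with a
# static (review-based) verdict. The conjunctive rule is an assumption,
# not SWE-Judge's documented logic.
from typing import Callable

def hybrid_verify(patch: str,
                  run_tests: Callable[[str], bool],
                  static_review: Callable[[str], bool]) -> bool:
    dynamic_ok = run_tests(patch)       # execute the test suite against the patch
    static_ok = static_review(patch)    # LLM code review of the diff itself
    # Dynamic-only checks miss vacuously passing tests; static-only checks
    # miss plausible-looking patches that do not run. Requiring both hedges
    # against each failure mode.
    return dynamic_ok and static_ok
```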

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs could incorporate explicit planning or memory structures that persist across phase boundaries to reduce the observed integration failures (sketched after this list).
  • The performance gap may generalize to other multi-step agent workflows such as data pipeline construction or scientific experiment orchestration.
  • Training regimens focused on sequential dependency handling might narrow the gap between isolated and full-cycle performance.
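As a sketch of the first extension above (Pith's speculation, not the paper's design), a cross-phase state object might look like this; every field is invented for illustration.

```python
# Speculative sketch: state that survives phase boundaries, so choices made
# during environment reconstruction stay visible to implementation and test
# generation. Nothing here comes from the paper.
from dataclasses import dataclass, field

@dataclass
class CrossPhaseState:
    env_decisions: dict = field(default_factory=dict)    # e.g. {"python": "3.11"}
    touched_files: list = field(default_factory=list)    # files edited while implementing
    constraints: list = field(default_factory=list)      # e.g. "tests must run offline"

state = CrossPhaseState()
state.env_decisions["package_manager"] = "pip"   # recorded in phase 1...
state.constraints.append("no network access")    # ...and consulted in phases 2 and 3
```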

Load-bearing premise

The 489 filtered instances and the SWE-Judge evaluator accurately capture practical autonomy without selection bias or verification errors that would alter the observed performance drop.

What would settle it

Re-running the same 489 instances with an alternative verification method that shows no significant solve-rate difference between isolated tasks and FullCycle execution would falsify the claim of critical cross-phase bottlenecks.
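The statistical core of that settling experiment is a comparison of two solve rates on the same 489 instances. A minimal sketch, assuming a simple two-proportion z-test (a paired test such as McNemar's would suit matched instances better); the counts are invented.

```python
# Hypothetical check: would an alternative verifier still show a significant
# isolated-vs-FullCycle gap? The counts below are made up for illustration.
from math import sqrt, erf

def two_proportion_p(a_successes: int, b_successes: int, n: int) -> float:
    """Two-sided p-value for H0: equal solve rates across n instances each."""
    p_a, p_b = a_successes / n, b_successes / n
    pooled = (a_successes + b_successes) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - normal_cdf(abs(z)))

# A tiny p-value keeps the cross-phase-bottleneck claim alive; a large one,
# with near-equal rates, would count against it.
print(two_proportion_p(a_successes=250, b_successes=120, n=489))
```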

Figures

Figures reproduced from arXiv: 2605.13139 by Hao Guan, Kangning Zhang, Lingyue Fu, Lin Qiu, Shao Zhang, Weinan Zhang, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Yaoming Zhu, Yong Yu.

Figure 1: Overview of the SWE-Cycle Framework. Left: High-quality instances are curated through a rigorous filtering pipeline. Center: Agents execute environment reconstruction, code implementation, and test generation in either isolated tasks or the FullCycle task. Right: SWE-Judge evaluates outputs via hybrid static-dynamic analysis and a test intervention mechanism to yield a robust 0-2 score.
Figure 2: Distribution of failure categories across …
Figure 3: End-to-end integration effects. (a) Per-dimension score and solve rate change (…)
Original abstract

As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation pipelines frequently fail when parsing fully autonomous trajectories. We address these limitations with SWE-Cycle, a benchmark of 489 rigorously filtered instances. SWE-Cycle evaluates agents across three isolated tasks, including environment reconstruction, code implementation, and verification test generation, as well as an end-to-end FullCycle task that integrates all three. The FullCycle task requires agents to work autonomously in a bare repository without human scaffolding. To reliably assess these complex execution paths, we developed SWE-Judge. By combining static code review with dynamic testing, this execution-capable evaluation agent accurately verifies functional correctness and eliminates the systematic measurement errors of traditional static parsers. We evaluate code agents powered by six state-of-the-art LLMs across these four tasks. The results reveal a sharp drop in solve rates when transitioning from isolated tasks to FullCycle execution, exposing critical bottlenecks in handling cross-phase dependencies and maintaining code quality. Together, SWE-Cycle and SWE-Judge provide a comprehensive framework for accurately measuring the end-to-end capabilities of autonomous software agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity detected in benchmark construction or evaluation

full rationale

The paper presents SWE-Cycle as a new benchmark of 489 filtered instances and SWE-Judge as a separate evaluation agent combining static review and dynamic testing. These are introduced as independent tools for measuring agent performance on isolated tasks and FullCycle execution. No equations, fitted parameters, or derived predictions appear in the provided text. Results are reported as empirical observations of solve-rate drops, not quantities forced by construction from the benchmark itself. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming reduces the central claims to inputs. The derivation chain consists of benchmark curation and tool development followed by external evaluation runs, and does not feed its own outputs back in as inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the representativeness of the filtered instances and the accuracy of the new SWE-Judge without independent external validation of either.

axioms (2)
  • domain assumption The 489 filtered instances are representative of real-world software issue resolution cycles.
    Abstract states rigorous filtering but provides no criteria or validation against external distributions.
  • domain assumption SWE-Judge correctly identifies functional correctness without systematic false positives or negatives.
    Abstract claims it eliminates measurement errors of static parsers but offers no comparative error analysis.
invented entities (1)
  • SWE-Judge · no independent evidence
    purpose: Hybrid static-plus-dynamic evaluator to verify agent trajectories in complex execution paths.
    New evaluation agent introduced to address limitations of traditional static parsers.

pith-pipeline@v0.9.0 · 5539 in / 1346 out tokens · 40697 ms · 2026-05-14T18:32:49.817448+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 28 canonical work pages · 10 internal anchors

  1. [1]

    Claude 4.6 sonnet system card

    Anthropic. Claude 4.6 sonnet system card. Technical report, Anthropic.

  2. [2]

    URL https://assets.anthropic.com/m/785e231869ea8b3b/original/Claude-4-6-Sonnet-System-Card.pdf

  3. [3]

    Claude code, 2025

    Anthropic. Claude code, 2025. URL https://github.com/anthropics/claude-code

  4. [4]

    Introducing claude opus 4.5

    Anthropic. Introducing claude opus 4.5. Anthropic Blog, 2025. URL https://www.anthropic.com/news/claude-opus-4-5

  5. [5]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, et al. Why do multi-agent LLM systems fail?, 2025. URL https://arxiv.org/abs/2503.13657

  6. [6]

    Introducing SWE-bench Verified

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified. OpenAI Blog, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

  7. [7]

    Deepseek-v3.2: Pushing the frontier of open large language models

    DeepSeek-AI and Others. Deepseek-v3.2: Pushing the frontier of open large language models.

  8. [8]

    URL https://arxiv.org/abs/2512.02556

  9. [9]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-bench pro: Can AI agents solve long-horizon software engineering tasks? …

  10. [10]

    NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

    Jingzhe Ding and Others. NL2Repo-bench: Towards long-horizon repository generation evaluation of coding agents, 2026. URL https://arxiv.org/abs/2512.12730

  11. [11]

    EnvBench: A benchmark for automated environment setup, 2025

    Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, and Yaroslav Zharov. EnvBench: A benchmark for automated environment setup, 2025. URL https://arxiv.org/abs/2503.14443

  12. [12]

    Automatically benchmarking llm code agents through agent-driven annotation and evaluation, 2025

    Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, and Yong Yu. Automatically benchmarking llm code agents through agent-driven annotation and evaluation, 2025

  13. [13]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5 Team. Glm-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  14. [14]

    Devbench: A realistic, developer-informed benchmark for code generation models, 2026

    Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. Devbench: A realistic, developer-informed benchmark for code generation models, 2026. URL https://arxiv.org/abs/2601.11895

  15. [15]

    SWE-bench goes live!

    Alex Gu, Naman Jain Liu, Nikhil Thakur, Wen-Ding Shi, Dídac Suris, Sanjay Jain, Naomi Saphra, Celine Lee Xia, Graham Neubig, and Aditi Raghunathan. SWE-bench goes live!,

  16. [16]

    arXiv:2505.23419

    URL https://arxiv.org/abs/2505.23419. NeurIPS 2025 Datasets and Benchmarks Track

  17. [17]

    A survey on LLM-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo. A survey on LLM-as-a-judge,

  18. [18]

    URL https://arxiv.org/abs/2411.15594

  19. [19]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of the International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2310.06770

  20. [20]

    Kimi k2.5: Scaling reinforcement learning with llms

    Kimi. Kimi k2.5: Scaling reinforcement learning with llms. Kimi Blog, 2025. URL https://www.kimi.com/blog/kimi-k2-5

  21. [21]

    Fea-bench: A benchmark for evaluating repository-level code generation for feature implementation

    Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. Fea-bench: A benchmark for evaluating repository-level code generation for feature implementation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 17160–17176, 2025

  22. [22]

    The swe-bench illusion: When state-of-the-art llms remember instead of reason, 2025

    Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. The swe-bench illusion: When state-of-the-art llms remember instead of reason, 2025. URL https://arxiv.org/abs/2506.12286

  23. [23]

    CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks, 2025

    Xichang Liu et al. CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks, 2025. URL https://arxiv.org/abs/2507.10535

  24. [24]

    AI-augmented CI/CD pipelines: From code commit to production with autonomous decisions, 2025

    Zhengyu Liu et al. AI-augmented CI/CD pipelines: From code commit to production with autonomous decisions, 2025. URL https://arxiv.org/abs/2508.11867

  25. [25]

    Minimax 2.7

    MiniMax. Minimax 2.7. MiniMax Blog, 2026. URL https://www.minimaxi.com/models/text/m27

  26. [26]

    Introducing codex, 2025

    OpenAI. Introducing codex, 2025. URL https://openai.com/index/introducing-codex/

  27. [27]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. OpenAI Blog, 2026. URL https://openai.com/index/introducing-gpt-5-4/

  28. [28]

    Why swe-bench verified no longer measures frontier coding capabilities

    OpenAI. Why swe-bench verified no longer measures frontier coding capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/, February 2026

  29. [29]

    Opencode, 2026

    OpenCode Contributors. Opencode, 2026. URL https://github.com/opencode-ai/opencode

  30. [30]

    Qwen3.5: Towards native multimodal agents

    Qwen. Qwen3.5: Towards native multimodal agents. Qwen Blog, 2026. URL https://qwen.ai/blog?id=qwen3.5

  31. [31]

    Judging the judges: A systematic study of position bias in LLM-as-a-judge

    Vibhu Raina et al. Judging the judges: A systematic study of position bias in LLM-as-a-judge. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP),

  32. [32]

    URL https://arxiv.org/abs/2406.07791

  33. [33]

    Testeval: Benchmarking large language models for test case generation, 2025

    Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. Testeval: Benchmarking large language models for test case generation, 2025. URL https://arxiv.org/abs/2406.04531

  34. [34]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Chen, Parker Adler, Zijian Cheng, Kexun Hu, Jieyu Li, Yuqi Li, Ziniu Liu, Yufan Lu, Jiasheng Ning, et al. OpenHands: An open platform for AI software developers as generalist agents, 2024. URL https://arxiv.org/abs/2407.16741

  35. [35]

    Swe-compass: Towards unified evaluation of agentic coding abilities for large language models, 2025

    Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie...

  36. [36]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Liber, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793

  37. [37]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents, 2025. URL https://arxiv.org/abs/2504.21798. NeurIPS 2025 Datasets and Benchmarks Track Spotlight.

  38. [38]

    A survey on agent-as-a-judge, 2026

    Runyang You, Hongru Cai, Caiqi Zhang, et al. A survey on agent-as-a-judge, 2026. URL https://arxiv.org/abs/2601.05111

  39. [39]

    Utboost: Rigorous evaluation of coding agents on swe-bench, 2025

    Boxi Yu, Yuxuan Zhu, Pinjia He, and Daniel Kang. Utboost: Rigorous evaluation of coding agents on swe-bench. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. URL https://arxiv.org/abs/2506.09289

  40. [40]

    Multi-SWE-bench: A multilingual benchmark for issue resolving

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-SWE-bench: A multilingual benchmark for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605

  41. [41]

    Benchmarking and studying the LLM-based agent system in end-to-end software development, 2025

    Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, and Shikun Zhang. Benchmarking and studying the LLM-based agent system in end-to-end software development, 2025. URL https://arxiv.org/abs/2511.04064

  42. [42]

    Code Review Agent Benchmark

    Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, and Abhik Roychoudhury. Code review agent benchmark. arXiv preprint arXiv:2603.23448, 2026

  43. [43]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena, 2024. URL https://arxiv.org/abs/2306.05685

  44. [44]

    Featurebench: Benchmarking agentic coding for complex feature development

    Q Zhou, J Zhang, H Wang, R Hao, J Wang, M Han, Y Yang, S Wu, F Pan, L Fan, D Tu, and Z Zhang. Featurebench: Benchmarking agentic coding for complex feature development. arXiv preprint arXiv:2602.10975, 2026

  45. [45]

    Establishing Best Practices for Building Rigorous Agentic Benchmarks

    Yuxuan Zhu, Tian Jin, Yada Pruksachatkun, Aston Zhang, Shayne Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Richard Weiss, et al. Establishing best practices for building rigorous agentic benchmarks. arXiv preprint arXiv:2507.02825, 2025

  46. [46]

    Agent-as-a-Judge: Evaluate Agents with Agents

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. URL https:...

  47. [47]

    Read the issue description to understand the problem context

  48. [48]

    Examine the gold reference patch to establish the correct solution approach

  49. [49]

    Review the agent’s submission to understand what the agent implemented

  50. [50]

    Read SWE-Judge’s scoring and reasoning

  51. [51]

    Cross-reference with execution logs and LLM auxiliary analysis when static review is insufficient

  52. [52]

    Couldn’t find the node_modules state file

    Assign a failure category from the predefined taxonomy and record which evaluator is correct. Human Verification Results. To validate the LLM-assisted categorization and rule out selection bias, we conduct human deep annotation on all 371 disagreement instances plus 86 agreement instances (a 10% random sample of cases where SWE-Judge and the script concur)…

  53. [53]

    Instruction and gold patch review. SWE-Judge reads the issue description and golden patch to establish the expected behavior: file deletion via rimraf with glob patterns for accumulated profile images

  54. [54]

    Agent patch macro-review. SWE-Judge reviews the agent’s diff across 5 modified files (src/groups/cover.js, src/socket.io/user/picture.js, src/user/delete.js, src/user/picture.js). It identifies a critical divergence: the agent uses getLocalCoverPath/getLocalAvatarPath to delete only the current file, while the golden patch uses glob patterns to delete all …

  55. [55]

    Dynamic: Node.js v18.20.8 available, packages import successfully, 359 tests collected

    ENV evaluation. Static: setup.sh runs npm install correctly. Dynamic: Node.js v18.20.8 available, packages import successfully, 359 tests collected. Score: 4/4

  56. [56]

    Dynamic: Phase 1 fails with TypeError: User.getLocalCoverPath is not a function (imprecise failure)

    TEST evaluation. Static: Agent covers 3 of 4 key scenarios (missing account deletion cleanup test). Dynamic: Phase 1 fails with TypeError: User.getLocalCoverPath is not a function (imprecise failure). Score: 2/4

  57. [57]

    Adaptive eval scripting (triggered by TEST_STATIC < 2). The agent’s test suite lacks coverage for account deletion cleanup. SWE-Judge writes eval_improved.sh and test/improved-image-cleanup.js, a custom test that creates dummy profile images, calls the account deletion function, and verifies that 0 files remain afterward. The first execution discovers 3 or…

  58. [58]

    Score: 2/4

    CODE evaluation using custom test results. The custom test output directly informs CODE_DYNAMIC: 3/4 tests pass (group cover, user cover, user avatar succeed; account deletion cleanup fails). Score: 2/4. Final Scores. ENV: 4, TEST: 2, CODE: 2. Total: 8/12 (0.667). This case shows that SWE-Judge writes its own verification scripts when existing coverage is i…

  59. [59]

    The agent correctly implements both required changes: adding -Dsolr.max.booleanClauses=30000 to SOLR_OPTS and defining FILTER_BOOK_LIMIT = 30_000

    Initial review. SWE-Judge reads the instruction, golden patch, and agent patch. The agent correctly implements both required changes: adding -Dsolr.max.booleanClauses=30000 to SOLR_OPTS and defining FILTER_BOOK_LIMIT = 30_000

  60. [60]

    Both tests pass: test_filter_book_limit_constant_exists and test_solr_opts_has_boolean_clauses_limit

    Agent test execution (Phase 2). SWE-Judge runs the agent’s test suite (eval.sh) on the fixed code. Both tests pass: test_filter_book_limit_constant_exists and test_solr_opts_has_boolean_clauses_limit. 3. Fault injection (Phase 1). SWE-Judge reverts the agent’s changes to simulate the buggy state: git show base_commit:docker-compose.yml > /tmp/docker-compose…

  61. [61]

    SWE-Judge confirms the agent’s tests are not trivial or overfitted: they verify specific code content rather than relying on indirect signals

    Verdict. The tests correctly discriminate between buggy and fixed states. SWE-Judge confirms the agent’s tests are not trivial or overfitted: they verify specific code content rather than relying on indirect signals. TEST_DYNAMIC: 2/2

  62. [62]

    Dynamic: Python 3.11.1 available, but core package import fails (ModuleNotFoundError: web)

    ENV evaluation. Static: Agent uses venv instead of the requested conda environment, deviating from the instruction. Dynamic: Python 3.11.1 available, but core package import fails (ModuleNotFoundError: web). Score: 2/4. Final Scores. ENV: 2, TEST: 3, CODE: 4. Total: 9/12 (0.75). Fault injection verifies that the agent’s tests genuinely detect the bug rath…

  63. [63]

    Code review via git diff. SWE-Judge examines the agent’s changes: adding a Version field to the configuration struct, implementing validation logic, updating the schema, and creating test data files

  64. [64]

    The agent’s implementation aligns closely with the golden patch, using cleaner error handling patterns in some cases

    Reference comparison. SWE-Judge reads the golden patch and performs a structural comparison. The agent’s implementation aligns closely with the golden patch, using cleaner error handling patterns in some cases

  65. [65]

    to confirm compilation succeeds, then uses a non-matching test pattern to verify test collection without execution

    Build verification. SWE-Judge runs go build ./... to confirm compilation succeeds, then uses a non-matching test pattern to verify test collection without execution

  66. [66]

    Implementation is functionally identical to gold.patch: correctly implements the escapeseq filter with equivalent logic

    Test execution with fault injection. SWE-Judge reverts the code to the buggy state and runs the agent’s tests. Tests fail with cfg.Version undefined (compilation error). SWE-Judge notes this is a weaker detection mechanism (compile-time rather than assertion-based) but still validates that the tests cannot pass without the fix. 5. Multi-dimensional scoring…
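The scoring arithmetic and the fault-injection check that the anchors above walk through reduce to a small amount of code. A sketch under the stated conventions (scores of 0-4 per dimension, total reported out of 12); the function names are hypothetical, not SWE-Judge's actual interface.

```python
# Per-dimension scores (ENV, TEST, CODE) each run 0-4; the total is reported
# as a fraction of 12, e.g. ENV 4 + TEST 2 + CODE 2 -> 8/12 = 0.667.
def total_score(env: int, test: int, code: int) -> float:
    assert all(0 <= s <= 4 for s in (env, test, code))
    return round((env + test + code) / 12, 3)

# Fault injection (cf. anchors 60-61): agent-written tests earn trust only if
# they fail on the reverted, buggy base state and pass once the fix is applied.
def fault_injection_ok(run_tests, apply_fix, revert_to_base) -> bool:
    revert_to_base()
    fails_when_buggy = not run_tests()
    apply_fix()
    passes_when_fixed = run_tests()
    return fails_when_buggy and passes_when_fixed

print(total_score(4, 2, 2))  # 0.667
```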