pith. machine review for the scientific record.

arxiv: 2605.13139 · v1 · submitted 2026-05-13 · 💻 cs.SE

Recognition: no theorem link

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords code agents · autonomous agents · software benchmarks · issue resolution · LLM evaluation · end-to-end tasks · FullCycle execution · verification methods

The pith

Code agents show sharply lower success rates when handling complete issue resolution autonomously versus in isolated subtasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces SWE-Cycle, a benchmark of 489 instances that tests code agents on three separate tasks—environment reconstruction, code implementation, and verification test generation—plus a single FullCycle task that requires them to perform all steps in sequence inside a bare repository with no human assistance. It pairs the benchmark with SWE-Judge, an evaluation agent that combines static code review and dynamic runtime testing to measure outcomes more reliably than traditional parsers. The central result is a clear drop in solve rates for the integrated task, which the authors trace to agents struggling with dependencies that cross phases and with preserving code quality throughout. Readers focused on AI tools for software work would care because the findings indicate that current agents still need substantial scaffolding to match their performance on narrower problems.
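To make the task structure concrete, here is a minimal sketch of how the four evaluation settings relate. Every name below is hypothetical; the paper's actual harness and API are not shown in this review.

```python
# Minimal sketch of the four-task structure SWE-Cycle describes; all names
# here are invented for illustration, not the paper's actual API.
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    ENV_RECONSTRUCTION = "environment reconstruction"   # build a working env from a bare repo
    CODE_IMPLEMENTATION = "code implementation"         # patch the issue
    TEST_GENERATION = "verification test generation"    # write tests that detect the bug
    FULL_CYCLE = "full cycle"                           # all three in sequence, no scaffolding

@dataclass
class Instance:
    repo: str          # bare repository, no pre-built environment
    issue: str         # natural-language issue description
    base_commit: str   # commit at which the issue reproduces

def run(agent, instance: Instance, task: Task) -> dict:
    """Dispatch one benchmark run. In the isolated tasks the agent receives the
    other phases' outputs for free; in FULL_CYCLE it starts from the bare repo."""
    if task is Task.FULL_CYCLE:
        return agent.solve_end_to_end(instance)   # env -> code -> tests, autonomously
    return agent.solve_single_phase(instance, task)
```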

Core claim

SWE-Cycle evaluates agents across isolated tasks of environment reconstruction, code implementation, and verification test generation, as well as an end-to-end FullCycle task that integrates all three in a bare repository without human scaffolding. Using SWE-Judge, which merges static review with dynamic testing to verify functional correctness, the evaluation of agents powered by six state-of-the-art LLMs reveals a sharp drop in solve rates when moving from isolated tasks to FullCycle execution, exposing bottlenecks in handling cross-phase dependencies and maintaining code quality.

What carries the argument

SWE-Cycle benchmark consisting of isolated subtasks and an integrated FullCycle execution, paired with SWE-Judge for reliable verification of autonomous trajectories through static and dynamic checks.

If this is right

  • Agents require stronger mechanisms to track and resolve dependencies that span multiple development phases.
  • Preserving code quality across an entire autonomous resolution process is a distinct and harder challenge than solving single subtasks.
  • Benchmarks must incorporate full-cycle execution in bare environments to measure real autonomy rather than pre-configured subtasks.
  • Evaluation tools need hybrid static-dynamic verification to avoid systematic errors when assessing complex agent trajectories (a minimal sketch follows this list).
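A minimal sketch of what hybrid verification could mean, assuming a simple conjunctive rule; SWE-Judge's actual mechanism (test intervention, per-dimension scores) is richer, and every name below is illustrative.

```python
# Illustrative only: combine a dynamic (execution-based) verdict with a
# static (review-based) verdict. The conjunctive rule is an assumption,
# not SWE-Judge's documented logic.
from typing import Callable

def hybrid_verify(patch: str,
                  run_tests: Callable[[str], bool],
                  static_review: Callable[[str], bool]) -> bool:
    dynamic_ok = run_tests(patch)       # execute the test suite against the patch
    static_ok = static_review(patch)    # LLM code review of the diff itself
    # Dynamic-only checks miss vacuously passing tests; static-only checks
    # miss plausible-looking patches that do not run. Requiring both hedges
    # against each failure mode.
    return dynamic_ok and static_ok
```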

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs could incorporate explicit planning or memory structures that persist across phase boundaries to reduce the observed integration failures (sketched after this list).
  • The performance gap may generalize to other multi-step agent workflows such as data pipeline construction or scientific experiment orchestration.
  • Training regimens focused on sequential dependency handling might narrow the gap between isolated and full-cycle performance.
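As a sketch of the first extension above (Pith's speculation, not the paper's design), a cross-phase state object might look like this; every field is invented for illustration.

```python
# Speculative sketch: state that survives phase boundaries, so choices made
# during environment reconstruction stay visible to implementation and test
# generation. Nothing here comes from the paper.
from dataclasses import dataclass, field

@dataclass
class CrossPhaseState:
    env_decisions: dict = field(default_factory=dict)    # e.g. {"python": "3.11"}
    touched_files: list = field(default_factory=list)    # files edited while implementing
    constraints: list = field(default_factory=list)      # e.g. "tests must run offline"

state = CrossPhaseState()
state.env_decisions["package_manager"] = "pip"   # recorded in phase 1...
state.constraints.append("no network access")    # ...and consulted in phases 2 and 3
```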

Load-bearing premise

The 489 filtered instances and the SWE-Judge evaluator accurately capture practical autonomy without selection bias or verification errors that would alter the observed performance drop.

What would settle it

Re-running the same 489 instances with an alternative verification method that shows no significant solve-rate difference between isolated tasks and FullCycle execution would falsify the claim of critical cross-phase bottlenecks.
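The statistical core of that settling experiment is a comparison of two solve rates on the same 489 instances. A minimal sketch, assuming a simple two-proportion z-test (a paired test such as McNemar's would suit matched instances better); the counts are invented.

```python
# Hypothetical check: would an alternative verifier still show a significant
# isolated-vs-FullCycle gap? The counts below are made up for illustration.
from math import sqrt, erf

def two_proportion_p(a_successes: int, b_successes: int, n: int) -> float:
    """Two-sided p-value for H0: equal solve rates across n instances each."""
    p_a, p_b = a_successes / n, b_successes / n
    pooled = (a_successes + b_successes) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - normal_cdf(abs(z)))

# A tiny p-value keeps the cross-phase-bottleneck claim alive; a large one,
# with near-equal rates, would count against it.
print(two_proportion_p(a_successes=250, b_successes=120, n=489))
```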

Figures

Figures reproduced from arXiv: 2605.13139 by Hao Guan, Kangning Zhang, Lingyue Fu, Lin Qiu, Shao Zhang, Weinan Zhang, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Yaoming Zhu, Yong Yu.

Figure 1: Overview of the SWE-Cycle Framework. Left: High-quality instances are curated through a rigorous filtering pipeline. Center: Agents execute environment reconstruction, code implementation, and test generation in either isolated tasks or the FullCycle task. Right: SWE-Judge evaluates outputs via hybrid static-dynamic analysis and a test intervention mechanism to yield a robust 0-2 score.
Figure 2: Distribution of failure categories across …
Figure 3: End-to-end integration effects. (a) Per-dimension score and solve rate change (…)
Original abstract

As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation pipelines frequently fail when parsing fully autonomous trajectories. We address these limitations with SWE-Cycle, a benchmark of 489 rigorously filtered instances. SWE-Cycle evaluates agents across three isolated tasks, including environment reconstruction, code implementation, and verification test generation, as well as an end-to-end FullCycle task that integrates all three. The FullCycle task requires agents to work autonomously in a bare repository without human scaffolding. To reliably assess these complex execution paths, we developed SWE-Judge. By combining static code review with dynamic testing, this execution-capable evaluation agent accurately verifies functional correctness and eliminates the systematic measurement errors of traditional static parsers. We evaluate code agents powered by six state-of-the-art LLMs across these four tasks. The results reveal a sharp drop in solve rates when transitioning from isolated tasks to FullCycle execution, exposing critical bottlenecks in handling cross-phase dependencies and maintaining code quality. Together, SWE-Cycle and SWE-Judge provide a comprehensive framework for accurately measuring the end-to-end capabilities of autonomous software agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity detected in benchmark construction or evaluation

full rationale

The paper presents SWE-Cycle as a new benchmark of 489 filtered instances and SWE-Judge as a separate evaluation agent combining static review and dynamic testing. These are introduced as independent tools for measuring agent performance on isolated tasks and FullCycle execution. No equations, fitted parameters, or derived predictions appear in the provided text. Results are reported as empirical observations of solve-rate drops, not quantities forced by construction from the benchmark itself. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming reduces the central claims to inputs. The derivation chain consists of benchmark curation and tool development followed by external evaluation runs, and does not feed its own outputs back in as inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the representativeness of the filtered instances and the accuracy of the new SWE-Judge without independent external validation of either.

axioms (2)
  • domain assumption The 489 filtered instances are representative of real-world software issue resolution cycles.
    Abstract states rigorous filtering but provides no criteria or validation against external distributions.
  • domain assumption SWE-Judge correctly identifies functional correctness without systematic false positives or negatives.
    Abstract claims it eliminates measurement errors of static parsers but offers no comparative error analysis.
invented entities (1)
  • SWE-Judge · no independent evidence
    purpose: Hybrid static-plus-dynamic evaluator to verify agent trajectories in complex execution paths.
    New evaluation agent introduced to address limitations of traditional static parsers.

pith-pipeline@v0.9.0 · 5539 in / 1346 out tokens · 40697 ms · 2026-05-14T18:32:49.817448+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 28 canonical work pages · 10 internal anchors

  1. [1]

    Claude 4.6 sonnet system card

    Anthropic. Claude 4.6 sonnet system card. Technical report, Anthropic.

  2. [2]

    URL https://assets.anthropic.com/m/785e231869ea8b3b/original/Claude-4-6-Sonnet-System-Card.pdf

  3. [3]

    Claude code, 2025

    Anthropic. Claude code, 2025. URL https://github.com/anthropics/claude-code

  4. [4]

    Introducing claude opus 4.5

    Anthropic. Introducing claude opus 4.5. Anthropic Blog, 2025. URL https://www.anthropic.com/news/claude-opus-4-5

  5. [5]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, et al. Why do multi-agent LLM systems fail?, 2025. URL https://arxiv.org/abs/2503.13657

  6. [6]

    Introducing SWE-bench Verified

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified. OpenAI Blog, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

  7. [7]

    Deepseek-v3.2: Pushing the frontier of open large language models

    DeepSeek-AI and Others. Deepseek-v3.2: Pushing the frontier of open large language models.

  8. [8]

    URL https://arxiv.org/abs/2512.02556

  9. [9]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-bench pro: Can AI agents solve long-horizon software engineering tasks? …

  10. [10]

    NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

    Jingzhe Ding and Others. NL2Repo-bench: Towards long-horizon repository generation evaluation of coding agents, 2026. URL https://arxiv.org/abs/2512.12730

  11. [11]

    EnvBench: A benchmark for automated environment setup, 2025

    Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, and Yaroslav Zharov. EnvBench: A benchmark for automated environment setup, 2025. URL https://arxiv.org/abs/2503.14443

  12. [12]

    Automatically benchmarking llm code agents through agent-driven annotation and evaluation, 2025

    Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, and Yong Yu. Automatically benchmarking llm code agents through agent-driven annotation and evaluation, 2025

  13. [13]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5 Team. Glm-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  14. [14]

    Devbench: A realistic, developer-informed benchmark for code generation models, 2026

    Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. Devbench: A realistic, developer-informed benchmark for code generation models, 2026. URL https://arxiv.org/abs/2601.11895

  15. [15]

    SWE-bench goes live!

    Alex Gu, Naman Jain Liu, Nikhil Thakur, Wen-Ding Shi, Dídac Suris, Sanjay Jain, Naomi Saphra, Celine Lee Xia, Graham Neubig, and Aditi Raghunathan. SWE-bench goes live!,

  16. [16]

    arXiv:2505.23419

    URL https://arxiv.org/abs/2505.23419. NeurIPS 2025 Datasets and Benchmarks Track

  17. [17]

    A survey on LLM-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo. A survey on LLM-as-a-judge,

  18. [18]

    URL https://arxiv.org/abs/2411.15594

  19. [19]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of the International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2310.06770

  20. [20]

    Kimi k2.5: Scaling reinforcement learning with llms

    Kimi. Kimi k2.5: Scaling reinforcement learning with llms. Kimi Blog, 2025. URL https://www.kimi.com/blog/kimi-k2-5

  21. [21]

    Fea-bench: A benchmark for evaluating repository-level code generation for feature implementation

    Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. Fea-bench: A benchmark for evaluating repository-level code generation for feature implementation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 17160–17176, 2025

  22. [22]

    The swe-bench illusion: When state-of-the-art llms remember instead of reason, 2025

    Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. The swe-bench illusion: When state-of-the-art llms remember instead of reason, 2025. URL https://arxiv.org/abs/2506.12286

  23. [23]

    CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks, 2025

    Xichang Liu et al. CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks, 2025. URL https://arxiv.org/abs/2507.10535

  24. [24]

    AI-augmented CI/CD pipelines: From code commit to production with autonomous decisions, 2025

    Zhengyu Liu et al. AI-augmented CI/CD pipelines: From code commit to production with autonomous decisions, 2025. URL https://arxiv.org/abs/2508.11867

  25. [25]

    Minimax 2.7

    MiniMax. Minimax 2.7. MiniMax Blog, 2026. URL https://www.minimaxi.com/models/text/m27

  26. [26]

    Introducing codex, 2025

    OpenAI. Introducing codex, 2025. URL https://openai.com/index/introducing-codex/

  27. [27]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. OpenAI Blog, 2026. URL https://openai.com/index/introducing-gpt-5-4/

  28. [28]

    Why swe-bench verified no longer measures frontier coding capabilities

    OpenAI. Why swe-bench verified no longer measures frontier coding capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/, February 2026

  29. [29]

    Opencode, 2026

    OpenCode Contributors. Opencode, 2026. URL https://github.com/opencode-ai/opencode

  30. [30]

    Qwen3.5: Towards native multimodal agents

    Qwen. Qwen3.5: Towards native multimodal agents. Qwen Blog, 2026. URL https://qwen.ai/blog?id=qwen3.5

  31. [31]

    Judging the judges: A systematic study of position bias in LLM-as-a-judge

    Vibhu Raina et al. Judging the judges: A systematic study of position bias in LLM-as-a-judge. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP),

  32. [32]

    URL https://arxiv.org/abs/2406.07791

  33. [33]

    Testeval: Benchmarking large language models for test case generation, 2025

    Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. Testeval: Benchmarking large language models for test case generation, 2025. URL https://arxiv.org/abs/2406.04531

  34. [34]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Chen, Parker Adler, Zijian Cheng, Kexun Hu, Jieyu Li, Yuqi Li, Ziniu Liu, Yufan Lu, Jiasheng Ning, et al. OpenHands: An open platform for AI software developers as generalist agents, 2024. URL https://arxiv.org/abs/2407.16741

  35. [35]

    Swe-compass: Towards unified evaluation of agentic coding abilities for large language models, 2025

    Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie...

  36. [36]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Liber, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793

  37. [37]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents, 2025. URL https://arxiv.org/abs/2504.21798. NeurIPS 2025 Datasets and Benchmarks Track Spotlight.

  38. [38]

    A survey on agent-as-a-judge, 2026

    Runyang You, Hongru Cai, Caiqi Zhang, et al. A survey on agent-as-a-judge, 2026. URL https://arxiv.org/abs/2601.05111

  39. [39]

    Utboost: Rigorous evaluation of coding agents on swe-bench, 2025

    Boxi Yu, Yuxuan Zhu, Pinjia He, and Daniel Kang. Utboost: Rigorous evaluation of coding agents on swe-bench. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. URL https://arxiv.org/abs/2506.09289

  40. [40]

    Multi-SWE-bench: A multilingual benchmark for issue resolving

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-SWE-bench: A multilingual benchmark for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605

  41. [41]

    Benchmarking and studying the LLM-based agent system in end-to-end software development, 2025

    Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, and Shikun Zhang. Benchmarking and studying the LLM-based agent system in end-to-end software development, 2025. URL https://arxiv.org/abs/2511.04064

  42. [42]

    Code Review Agent Benchmark

    Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, and Abhik Roychoudhury. Code review agent benchmark. arXiv preprint arXiv:2603.23448, 2026

  43. [43]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena, 2024. URL https://arxiv.org/abs/2306.05685

  44. [44]

    Featurebench: Benchmarking agentic coding for complex feature development

    Q Zhou, J Zhang, H Wang, R Hao, J Wang, M Han, Y Yang, S Wu, F Pan, L Fan, D Tu, and Z Zhang. Featurebench: Benchmarking agentic coding for complex feature development. arXiv preprint arXiv:2602.10975, 2026

  45. [45]

    Establishing Best Practices for Building Rigorous Agentic Benchmarks

    Yuxuan Zhu, Tian Jin, Yada Pruksachatkun, Aston Zhang, Shayne Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Richard Weiss, et al. Establishing best practices for building rigorous agentic benchmarks. arXiv preprint arXiv:2507.02825, 2025

  46. [46]

    Agent-as-a-Judge: Evaluate Agents with Agents

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. URL https:...

  47. [47]

    Read the issue description to understand the problem context

  48. [48]

    Examine the gold reference patch to establish the correct solution approach

  49. [49]

    Review the agent’s submission to understand what the agent implemented

  50. [50]

    Read SWE-Judge’s scoring and reasoning

  51. [51]

    Cross-reference with execution logs and LLM auxiliary analysis when static review is insufficient

  52. [52]

    Couldn’t find the node_modules state file

    Assign a failure category from the predefined taxonomy and record which evaluator is correct. Human Verification Results. To validate the LLM-assisted categorization and rule out selection bias, we conduct human deep annotation on all 371 disagreement instances plus 86 agreement instances (a 10% random sample of cases where SWE-Judge and the script concur)…

  53. [53]

    Instruction and gold patch review. SWE-Judge reads the issue description and golden patch to establish the expected behavior: file deletion via rimraf with glob patterns for accumulated profile images

  54. [54]

    Agent patch macro-review. SWE-Judge reviews the agent’s diff across 5 modified files (src/groups/cover.js, src/socket.io/user/picture.js, src/user/delete.js, src/user/picture.js). It identifies a critical divergence: the agent uses getLocalCoverPath/getLocalAvatarPath to delete only the current file, while the golden patch uses glob patterns to delete all …

  55. [55]

    Dynamic: Node.js v18.20.8 available, packages import successfully, 359 tests collected

    ENV evaluation. Static: setup.sh runs npm install correctly. Dynamic: Node.js v18.20.8 available, packages import successfully, 359 tests collected. Score: 4/4

  56. [56]

    Dynamic: Phase 1 fails with TypeError: User.getLocalCoverPath is not a function (imprecise failure)

    TEST evaluation. Static: Agent covers 3 of 4 key scenarios (missing account deletion cleanup test). Dynamic: Phase 1 fails with TypeError: User.getLocalCoverPath is not a function (imprecise failure). Score: 2/4

  57. [57]

    Adaptive eval scripting (triggered by TEST_STATIC < 2). The agent’s test suite lacks coverage for account deletion cleanup. SWE-Judge writes eval_improved.sh and test/improved-image-cleanup.js, a custom test that creates dummy profile images, calls the account deletion function, and verifies that 0 files remain afterward. The first execution discovers 3 or…

  58. [58]

    Score: 2/4

    CODE evaluation using custom test results. The custom test output directly informs CODE_DYNAMIC: 3/4 tests pass (group cover, user cover, user avatar succeed; account deletion cleanup fails). Score: 2/4. Final Scores. ENV: 4, TEST: 2, CODE: 2. Total: 8/12 (0.667). This case shows that SWE-Judge writes its own verification scripts when existing coverage is i…

  59. [59]

    The agent correctly implements both required changes: adding -Dsolr.max.booleanClauses=30000 to SOLR_OPTS and defining FILTER_BOOK_LIMIT = 30_000

    Initial review. SWE-Judge reads the instruction, golden patch, and agent patch. The agent correctly implements both required changes: adding -Dsolr.max.booleanClauses=30000 to SOLR_OPTS and defining FILTER_BOOK_LIMIT = 30_000

  60. [60]

    Both tests pass: test_filter_book_limit_constant_exists and test_solr_opts_has_boolean_clauses_limit

    Agent test execution (Phase 2). SWE-Judge runs the agent’s test suite (eval.sh) on the fixed code. Both tests pass: test_filter_book_limit_constant_exists and test_solr_opts_has_boolean_clauses_limit. 3. Fault injection (Phase 1). SWE-Judge reverts the agent’s changes to simulate the buggy state: git show base_commit:docker-compose.yml > /tmp/docker-compose…

  61. [61]

    SWE-Judge confirms the agent’s tests are not trivial or overfitted: they verify specific code content rather than relying on indirect signals

    Verdict. The tests correctly discriminate between buggy and fixed states. SWE-Judge confirms the agent’s tests are not trivial or overfitted: they verify specific code content rather than relying on indirect signals. TEST_DYNAMIC: 2/2

  62. [62]

    Dynamic: Python 3.11.1 available, but core package import fails (ModuleNotFoundError: web)

    ENV evaluation. Static: Agent uses venv instead of the requested conda environment, deviating from the instruction. Dynamic: Python 3.11.1 available, but core package import fails (ModuleNotFoundError: web). Score: 2/4. Final Scores. ENV: 2, TEST: 3, CODE: 4. Total: 9/12 (0.75). Fault injection verifies that the agent’s tests genuinely detect the bug rath…

  63. [63]

    Code review via git diff. SWE-Judge examines the agent’s changes: adding a Version field to the configuration struct, implementing validation logic, updating the schema, and creating test data files

  64. [64]

    The agent’s implementation aligns closely with the golden patch, using cleaner error handling patterns in some cases

    Reference comparison. SWE-Judge reads the golden patch and performs a structural comparison. The agent’s implementation aligns closely with the golden patch, using cleaner error handling patterns in some cases

  65. [65]

    to confirm compilation succeeds, then uses a non-matching test pattern to verify test collection without execution

    Build verification. SWE-Judge runs go build ./... to confirm compilation succeeds, then uses a non-matching test pattern to verify test collection without execution

  66. [66]

    Implementation is functionally identical to gold.patch: correctly implements the escapeseq filter with equivalent logic

    Test execution with fault injection. SWE-Judge reverts the code to the buggy state and runs the agent’s tests. Tests fail with cfg.Version undefined (compilation error). SWE-Judge notes this is a weaker detection mechanism (compile-time rather than assertion-based) but still validates that the tests cannot pass without the fix. 5. Multi-dimensional scoring…
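The scoring arithmetic and the fault-injection check that the anchors above walk through reduce to a small amount of code. A sketch under the stated conventions (scores of 0-4 per dimension, total reported out of 12); the function names are hypothetical, not SWE-Judge's actual interface.

```python
# Per-dimension scores (ENV, TEST, CODE) each run 0-4; the total is reported
# as a fraction of 12, e.g. ENV 4 + TEST 2 + CODE 2 -> 8/12 = 0.667.
def total_score(env: int, test: int, code: int) -> float:
    assert all(0 <= s <= 4 for s in (env, test, code))
    return round((env + test + code) / 12, 3)

# Fault injection (cf. anchors 60-61): agent-written tests earn trust only if
# they fail on the reverted, buggy base state and pass once the fix is applied.
def fault_injection_ok(run_tests, apply_fix, revert_to_base) -> bool:
    revert_to_base()
    fails_when_buggy = not run_tests()
    apply_fix()
    passes_when_fixed = run_tests()
    return fails_when_buggy and passes_when_fixed

print(total_score(4, 2, 2))  # 0.667
```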