SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

arxiv: 2605.17526 · v1 · pith:H6UDO7IP · submitted 2026-05-17 · 💻 cs.SE · cs.AI


Qingnan Ren, Shun Zou, Shiting Huang, Ziao Zhang, Kou Shi, Zhen Fang, Yiming Zhao, Yu Zeng, Qisheng Su, Lin Chen, Yong Wang, Zehui Chen, Xiangxiang Chu, Feng Zhao


Pith reviewed 2026-05-19 22:39 UTC · model grok-4.3


keywords: coding agents, SaaS engineering, AI benchmarks, long-horizon tasks, enterprise software, system integration, software development


The pith

Over 95% of coding-agent failures on enterprise SaaS tasks occur before the agent reaches deep business logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SaaSBench to test AI coding agents on realistic long-horizon enterprise SaaS projects involving multiple languages, databases, and frameworks. It demonstrates that agents rarely reach the stage of implementing core business rules because they cannot configure foundational systems or integrate components successfully. A sympathetic reader would care because this identifies why current agents cannot yet perform the coordinated, full-stack work required in actual company software development.

Core claim

The paper claims that the primary bottleneck for state-of-the-art agents in enterprise SaaS engineering is not generating isolated code logic but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops.

What carries the argument

SaaSBench benchmark with 30 tasks across 6 SaaS domains and 5,370 validation nodes, using 8 programming languages, 6 databases, and 13 frameworks, evaluated by a dependency-aware hybrid paradigm for long-horizon multi-component systems.
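The summary above does not spell out how the dependency-aware evaluation is implemented. One plausible reading, offered only as an illustrative sketch and not as the authors' method, is that a validation node is scored only when every prerequisite node has passed, and is otherwise marked as blocked. The function name `evaluate_nodes`, the status labels, and the example node names are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def evaluate_nodes(deps, raw_pass):
    """Gate each validation node on its prerequisites (illustrative assumption).

    deps:     dict mapping node -> set of prerequisite nodes
    raw_pass: dict mapping node -> bool, the node's own check result
    Returns a dict mapping node -> "pass", "fail", or "blocked".
    """
    status = {}
    # static_order() yields prerequisites before the nodes that depend on them.
    for node in TopologicalSorter(deps).static_order():
        prereqs = deps.get(node, set())
        if any(status[p] != "pass" for p in prereqs):
            status[node] = "blocked"  # an upstream node did not pass
        elif raw_pass.get(node, False):
            status[node] = "pass"
        else:
            status[node] = "fail"
    return status


# Hypothetical three-node chain: database, API bootstrap, one business rule.
deps = {"db_up": set(), "api_boot": {"db_up"}, "invoice_rule": {"api_boot"}}
raw_pass = {"db_up": True, "api_boot": False, "invoice_rule": True}
status = evaluate_nodes(deps, raw_pass)
print(status)
```

Under this reading, a failure in service bootstrapping hides every business-logic node behind it, which is one way a benchmark could record that an agent never reached deep business logic.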

Load-bearing premise

The 30 tasks and 5,370 validation nodes sufficiently capture the heterogeneity, coupling, and long-horizon constraints of real enterprise SaaS systems without introducing artificial simplifications.

What would settle it

Rerun the same agents on the benchmark after giving them explicit integration tools or automated setup scripts. If the 95% early-failure rate drops sharply, that supports the claim that system configuration is the central issue; if it persists, the bottleneck lies elsewhere.

Figures

Figures reproduced from arXiv: 2605.17526 by Feng Zhao, Kou Shi, Lin Chen, Qingnan Ren, Qisheng Su, Shiting Huang, Shun Zou, Xiangxiang Chu, Yiming Zhao, Yong Wang, Yu Zeng, Zehui Chen, Zhen Fang, Ziao Zhang.

**Figure 2.** Overview of SaaSBench. The benchmark is grounded in real software development…

**Figure 3.** Statistical overview of SaaSBench. Left: SaaSBench includes six key SaaS domains and 30…

**Figure 4.** Left: Performance analysis of agent frameworks. Right: We classify capability units into five execution trajectories. T4 and T5 account for 95.6% of all units, showing that most failures occur before agents reach deep business logic. See Appendix B.7 for definitions.

5.2 Agent Frameworks. To examine the impact of agent frameworks beyond the underlying model, we evaluate GPT-5.4 and Claude Opus 4.7 under thr…

Original abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SaaSBench, a benchmark for evaluating autonomous coding agents on long-horizon enterprise SaaS engineering tasks. It consists of 30 tasks spanning 6 domains, implemented with 8 languages, 6 databases, and 13 frameworks, and supported by 5,370 validation nodes under a dependency-aware hybrid evaluation paradigm. The central empirical claim is that over 95% of failures in state-of-the-art agents occur before reaching deep business logic, primarily due to overconfidence during foundational setup or ineffective debugging loops.

Significance. If the reported failure distribution and phase classifications hold under scrutiny, the work usefully redirects attention from isolated code generation to the harder problem of multi-component system configuration and integration in realistic SaaS settings. The public release of the benchmark and code at the cited GitHub repository is a clear strength that enables direct reproducibility and follow-on research.

major comments (2)

[§5 (Results)] The headline statistic that over 95% of task failures occur before agents reach deep business logic rests on a failure-phase classification whose boundary between 'foundational system setup' and 'deep business logic' is not given an explicit, agent-independent definition tied to the 5,370 validation nodes. No per-task or per-domain breakdown is supplied showing how nodes are allocated to setup/integration versus core business rules, nor is the stopping condition for 'premature halt' stated in terms of node outcomes rather than agent logs. This directly affects whether the overconfidence diagnosis is robust or sensitive to task scaffolding.
[§3 (Benchmark Design)] The claim that the 30 tasks and 5,370 nodes adequately capture the heterogeneity, coupling, and long-horizon constraints of real enterprise SaaS systems is load-bearing for interpreting the failure statistics, yet the manuscript provides no quantitative evidence (e.g., metrics of inter-task coupling or domain coverage) that the chosen tasks avoid artificial simplifications that could favor or penalize particular agent behaviors.

minor comments (2)

[Abstract] The abstract is information-dense; moving the precise counts of languages, databases, and frameworks to a table in §3 would improve readability.
[§4 (Evaluation Paradigm)] Notation for the hybrid evaluation nodes could be introduced earlier with a small diagram to clarify dependency tracking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the failure-phase classification and strengthening the evidence for benchmark representativeness. We have revised the manuscript accordingly to address both major comments.


Referee: [§5 (Results)] The headline statistic that over 95% of task failures occur before agents reach deep business logic rests on a failure-phase classification whose boundary between 'foundational system setup' and 'deep business logic' is not given an explicit, agent-independent definition tied to the 5,370 validation nodes. No per-task or per-domain breakdown is supplied showing how nodes are allocated to setup/integration versus core business rules, nor is the stopping condition for 'premature halt' stated in terms of node outcomes rather than agent logs. This directly affects whether the overconfidence diagnosis is robust or sensitive to task scaffolding.

Authors: We agree that an explicit, agent-independent definition tied to the validation nodes is required for robustness. In the revised manuscript we add a formal definition in §5.1: foundational system setup comprises all validation nodes that check environment configuration, dependency resolution, service bootstrapping, and cross-component integration (on average 38% of nodes per task); deep business logic comprises nodes that validate domain-specific rules, workflows, and data invariants (the remaining 62%). The classification is derived directly from the dependency graphs in the task specification files and is therefore independent of any agent’s logs or behavior. We also include a new Table 3 with per-task and per-domain node allocations and state the premature-halt stopping condition as failure to pass the first 25% of setup nodes after a fixed budget of 15 iterations, measured by node outcome results. These additions make the 95% statistic reproducible and less sensitive to scaffolding choices. revision: yes
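The premature-halt condition in the simulated response above can be read as a simple check on node outcomes. The sketch below is one interpretation under stated assumptions: setup nodes are listed in dependency order, outcomes are recorded once the 15-iteration budget is spent, and "first 25%" is rounded up. The function name and the rounding choice are hypothetical, not taken from the paper.

```python
import math

SETUP_PREFIX_FRACTION = 0.25  # "first 25% of setup nodes" in the response above
ITERATION_BUDGET = 15         # budget after which outcomes are recorded (per the response)


def is_premature_halt(setup_outcomes_at_budget):
    """Return True if the leading slice of setup nodes has not fully passed.

    setup_outcomes_at_budget: list of bools, one per setup node in dependency
    order, recorded after ITERATION_BUDGET iterations (assumed convention).
    """
    if not setup_outcomes_at_budget:
        return False  # no setup nodes, nothing to halt on
    # Rounding up is an assumption; the source does not say how to round.
    prefix_len = math.ceil(SETUP_PREFIX_FRACTION * len(setup_outcomes_at_budget))
    return not all(setup_outcomes_at_budget[:prefix_len])


# Hypothetical task with 8 setup nodes: the first 2 form the 25% prefix.
stalled = [True, False, False, False, False, False, False, False]
progressing = [True, True, False, False, False, False, False, False]
print(is_premature_halt(stalled), is_premature_halt(progressing))
```

Because the check depends only on node outcomes, it would give the same answer for any agent that produces the same outcomes, which is the agent-independence the referee asks for.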
Referee: [§3 (Benchmark Design)] The claim that the 30 tasks and 5,370 nodes adequately capture the heterogeneity, coupling, and long-horizon constraints of real enterprise SaaS systems is load-bearing for interpreting the failure statistics, yet the manuscript provides no quantitative evidence (e.g., metrics of inter-task coupling or domain coverage) that the chosen tasks avoid artificial simplifications that could favor or penalize particular agent behaviors.

Authors: We accept that quantitative support was missing. The revised §3.2 now reports three metrics computed from the released task specifications: (1) mean inter-component coupling of 4.7 cross-framework or cross-database dependencies per task (range 2–8), (2) domain coverage quantified by overlap with 12 real-world SaaS feature categories drawn from industry reports (average coverage 0.81), and (3) average dependency depth of 11.8 validation steps per task. These figures indicate that the benchmark requires genuine multi-stack orchestration rather than isolated code generation. We have also added a short discussion of how these metrics compare with typical enterprise SaaS projects to argue against artificial simplification. revision: yes
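The response above reports an average dependency depth per task without saying how depth is computed. A natural candidate, assumed here purely for illustration, is the longest chain of validation steps in a task's dependency graph, counted in nodes. The function name and the toy graph are hypothetical.

```python
from functools import lru_cache


def dependency_depth(deps):
    """Length (in nodes) of the longest prerequisite chain in an acyclic graph.

    deps: dict mapping node -> iterable of prerequisite nodes.
    Nodes that appear only as prerequisites are treated as having none.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        prereqs = deps.get(node, ())
        # A node with no prerequisites has depth 1.
        return 1 + max((depth(p) for p in prereqs), default=0)

    return max((depth(n) for n in deps), default=0)


# Hypothetical task graph: a -> b -> c is the longest chain, so depth is 3.
example = {"a": (), "b": ("a",), "c": ("b",), "d": ("a",)}
print(dependency_depth(example))
```

Averaging this value over all tasks would give a per-benchmark figure comparable in kind to the 11.8 steps quoted above, assuming the authors measure depth the same way.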

Circularity Check

0 steps flagged

No circularity: empirical failure statistics are measured outcomes, not derived by construction


The paper is an empirical benchmark study that defines 30 tasks across 6 domains with 5,370 validation nodes and a dependency-aware hybrid evaluation. The central claim (over 95% of failures occur before deep business logic) is reported as a direct experimental count from running agents on these tasks. No equations, fitted parameters, or self-referential definitions are used to derive this statistic; it is an observed distribution of failure phases based on node outcomes. The evaluation paradigm is a methodological design choice for assessment rather than a tautology that forces the reported percentage. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text that reduce the result to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new mathematical axioms or free parameters; the central claim rests on the assumption that the constructed tasks and validation nodes are representative of real enterprise SaaS engineering.

axioms (1)

domain assumption: The selected 30 tasks across 6 domains adequately represent the structural complexity and integration challenges of production SaaS systems. Invoked when claiming the benchmark fills the gap left by prior simplified benchmarks.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/BranchSelection.lean; Cost.FunctionalEquation branch_selection; washburn_uniqueness_aczel

unclear: relation between the paper passage and the cited Recognition theorem.

Paper passage: Over 95% of task failures occur before agents even reach deep business logic... prematurely halting during foundational system setup

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

If c is in the exact-match dictionary, return the assigned backbone.

Otherwise, scan the prefix rules in order and return the first match: API*/Api* → API, BusinessLogic* → Logic, DataModel* → Data, Architecture*/Frontend*/UI* → Quality, Auth* → AuthZ.

No category falls through; the algorithm is verified to cover all 5,370 nodes (Table 13).

Resulting Distribution. Table 13 shows how the 5,370 validation nodes and the total maxScore of 17,299.1 are distributed across the six backbones under this mapping. The distribution is intentionally non-uniform: Logic dominates by maximum score (27.2%) because business...
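The category-to-backbone rule described above (exact-match lookup first, then ordered prefix rules with the first match winning) can be sketched roughly as follows. This is an illustrative reading of the quoted rule, not the authors' implementation: the function name `map_to_backbone`, the constant names, and the single exact-match entry are hypothetical, and the sketch covers only the five backbone targets named in the quoted prefix rules.

```python
from typing import Dict, List, Tuple

# Prefix rules in the order quoted in the text; the first matching rule wins.
PREFIX_RULES: List[Tuple[Tuple[str, ...], str]] = [
    (("API", "Api"), "API"),
    (("BusinessLogic",), "Logic"),
    (("DataModel",), "Data"),
    (("Architecture", "Frontend", "UI"), "Quality"),
    (("Auth",), "AuthZ"),
]

# HYPOTHETICAL exact-match entry, for illustration only.
# The paper's actual dictionary is not reproduced here.
EXACT_MATCH: Dict[str, str] = {
    "AuthAuditLog": "Data",
}


def map_to_backbone(
    category: str,
    exact: Dict[str, str] = EXACT_MATCH,
    rules: List[Tuple[Tuple[str, ...], str]] = PREFIX_RULES,
) -> str:
    """Return the backbone for a category label.

    Step 1: an exact-match hit takes precedence over any prefix rule.
    Step 2: otherwise scan the prefix rules in order, first match wins.
    """
    if category in exact:
        return exact[category]
    for prefixes, backbone in rules:
        # str.startswith accepts a tuple of candidate prefixes.
        if category.startswith(prefixes):
            return backbone
    # The text states that no category falls through; fail loudly if one does.
    raise ValueError(f"unmapped category: {category!r}")
```

Raising on an unmapped label, rather than returning a default backbone, is a design choice of this sketch: it makes the "no category falls through" property checkable by simply running the function over every node's category.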

