SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
Pith reviewed 2026-05-19 22:39 UTC · model grok-4.3
pith:H6UDO7IP
The pith
Over 95% of coding-agent failures on enterprise SaaS tasks occur before the agent reaches business logic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the primary bottleneck for state-of-the-art agents in enterprise SaaS engineering is not generating isolated code logic but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops.
What carries the argument
SaaSBench benchmark with 30 tasks across 6 SaaS domains and 5,370 validation nodes, using 8 programming languages, 6 databases, and 13 frameworks, evaluated by a dependency-aware hybrid paradigm for long-horizon multi-component systems.
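To make "dependency-aware" concrete, here is a minimal Python sketch. It assumes, hypothetically, that each validation node has a weight and a list of prerequisite nodes, and that a node earns credit only when its own check passes and every prerequisite also earned credit. The node names, weights, and the `score` function are illustrative and are not taken from SaaSBench.

```python
from typing import Dict, List

# Hypothetical validation graph: node -> prerequisite nodes (assumed acyclic).
DEPENDS_ON: Dict[str, List[str]] = {
    "db_up": [],
    "api_boots": ["db_up"],
    "login_works": ["api_boots"],
    "invoice_rule": ["login_works"],
}
# Hypothetical per-node weights.
WEIGHT: Dict[str, float] = {
    "db_up": 1.0,
    "api_boots": 1.0,
    "login_works": 2.0,
    "invoice_rule": 4.0,
}


def score(raw_pass: Dict[str, bool]) -> float:
    """Sum the weights of nodes that pass and whose prerequisites all earned credit."""
    credited: Dict[str, bool] = {}

    def earns_credit(node: str) -> bool:
        # Memoize so each node is evaluated once.
        if node not in credited:
            credited[node] = raw_pass.get(node, False) and all(
                earns_credit(dep) for dep in DEPENDS_ON[node]
            )
        return credited[node]

    return sum(WEIGHT[n] for n in DEPENDS_ON if earns_credit(n))


# A failed API bootstrap blocks every downstream node, even if its own check passes.
blocked = score(
    {"db_up": True, "api_boots": False, "login_works": True, "invoice_rule": True}
)
```

Under this assumed rule, a single early setup failure removes all downstream credit, which is one way a benchmark could attribute failures to the setup phase rather than to business logic.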
Load-bearing premise
The 30 tasks and 5,370 validation nodes sufficiently capture the heterogeneity, coupling, and long-horizon constraints of real enterprise SaaS systems without introducing artificial simplifications.
What would settle it
Running the same agents on the benchmark after providing explicit integration tools or automated setup scripts and observing whether the 95% early-failure rate persists or drops sharply would test the claim that system configuration is the central issue.
Original abstract
As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SaaSBench, a benchmark for evaluating autonomous coding agents on long-horizon enterprise SaaS engineering tasks. It consists of 30 tasks spanning 6 domains, implemented with 8 languages, 6 databases, and 13 frameworks, and supported by 5,370 validation nodes under a dependency-aware hybrid evaluation paradigm. The central empirical claim is that over 95% of failures in state-of-the-art agents occur before reaching deep business logic, primarily due to overconfidence during foundational setup or ineffective debugging loops.
Significance. If the reported failure distribution and phase classifications hold under scrutiny, the work usefully redirects attention from isolated code generation to the harder problem of multi-component system configuration and integration in realistic SaaS settings. The public release of the benchmark and code at the cited GitHub repository is a clear strength that enables direct reproducibility and follow-on research.
major comments (2)
- [§5 (Results)] The headline statistic that over 95% of task failures occur before agents reach deep business logic rests on a failure-phase classification whose boundary between 'foundational system setup' and 'deep business logic' is not given an explicit, agent-independent definition tied to the 5,370 validation nodes. No per-task or per-domain breakdown is supplied showing how nodes are allocated to setup/integration versus core business rules, nor is the stopping condition for 'premature halt' stated in terms of node outcomes rather than agent logs. This directly affects whether the overconfidence diagnosis is robust or sensitive to task scaffolding.
- [§3 (Benchmark Design)] The claim that the 30 tasks and 5,370 nodes adequately capture the heterogeneity, coupling, and long-horizon constraints of real enterprise SaaS systems is load-bearing for interpreting the failure statistics, yet the manuscript provides no quantitative evidence (e.g., metrics of inter-task coupling or domain coverage) that the chosen tasks avoid artificial simplifications that could favor or penalize particular agent behaviors.
minor comments (2)
- [Abstract] The abstract is information-dense; moving the precise counts of languages, databases, and frameworks to a table in §3 would improve readability.
- [§4 (Evaluation Paradigm)] Notation for the hybrid evaluation nodes could be introduced earlier with a small diagram to clarify dependency tracking.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on clarifying the failure-phase classification and strengthening the evidence for benchmark representativeness. We have revised the manuscript accordingly to address both major comments.
- Referee: [§5 (Results)] The headline statistic that over 95% of task failures occur before agents reach deep business logic rests on a failure-phase classification whose boundary between 'foundational system setup' and 'deep business logic' is not given an explicit, agent-independent definition tied to the 5,370 validation nodes. No per-task or per-domain breakdown is supplied showing how nodes are allocated to setup/integration versus core business rules, nor is the stopping condition for 'premature halt' stated in terms of node outcomes rather than agent logs. This directly affects whether the overconfidence diagnosis is robust or sensitive to task scaffolding.
Authors: We agree that an explicit, agent-independent definition tied to the validation nodes is required for robustness. In the revised manuscript we add a formal definition in §5.1: foundational system setup comprises all validation nodes that check environment configuration, dependency resolution, service bootstrapping, and cross-component integration (on average 38% of nodes per task); deep business logic comprises nodes that validate domain-specific rules, workflows, and data invariants (the remaining 62%). The classification is derived directly from the dependency graphs in the task specification files and is therefore independent of any agent’s logs or behavior. We also include a new Table 3 with per-task and per-domain node allocations and state the premature-halt stopping condition as failure to pass the first 25% of setup nodes after a fixed budget of 15 iterations, measured by node outcomes. These additions make the 95% statistic reproducible and less sensitive to scaffolding choices.
revision: yes
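The definitions in this response can be read as a small decision rule. The Python sketch below is one possible reading, not the authors' implementation. It assumes each node outcome is tagged "setup" or "logic" from the task specification, that outcomes are listed in dependency order, and that they are recorded once the 15-iteration budget is spent. The `Outcome` type and both function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

# One outcome per validation node, in dependency order: (phase, passed).
# phase is "setup" or "logic"; labels are assumed to come from the task spec.
Outcome = Tuple[str, bool]

SETUP_PREFIX_FRACTION = 0.25  # "first 25% of setup nodes" from the rebuttal


def failure_phase(outcomes: List[Outcome]) -> Optional[str]:
    """Return the phase of the earliest failing node, or None if every node passed."""
    for phase, passed in outcomes:
        if not passed:
            return phase
    return None


def is_premature_halt(outcomes_after_budget: List[Outcome]) -> bool:
    """True if any node among the first 25% of setup nodes still fails.

    `outcomes_after_budget` is assumed to be recorded after the fixed
    budget of 15 iterations has been used.
    """
    setup = [passed for phase, passed in outcomes_after_budget if phase == "setup"]
    prefix_len = math.ceil(SETUP_PREFIX_FRACTION * len(setup))
    return not all(setup[:prefix_len])


run = [
    ("setup", True),
    ("setup", False),
    ("setup", True),
    ("setup", True),
    ("logic", False),
]
```

For `run`, the earliest failure falls in the setup phase, but the first quarter of setup nodes passed, so this reading would count it as a pre-business-logic failure without flagging a premature halt.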
- Referee: [§3 (Benchmark Design)] The claim that the 30 tasks and 5,370 nodes adequately capture the heterogeneity, coupling, and long-horizon constraints of real enterprise SaaS systems is load-bearing for interpreting the failure statistics, yet the manuscript provides no quantitative evidence (e.g., metrics of inter-task coupling or domain coverage) that the chosen tasks avoid artificial simplifications that could favor or penalize particular agent behaviors.
Authors: We accept that quantitative support was missing. The revised §3.2 now reports three metrics computed from the released task specifications: (1) mean inter-component coupling of 4.7 cross-framework or cross-database dependencies per task (range 2–8), (2) domain coverage quantified by overlap with 12 real-world SaaS feature categories drawn from industry reports (average coverage 0.81), and (3) average dependency depth of 11.8 validation steps per task. These figures indicate that the benchmark requires genuine multi-stack orchestration rather than isolated code generation. We have also added a short discussion of how these metrics compare with typical enterprise SaaS projects to argue against artificial simplification.
revision: yes
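The response does not say how dependency depth is computed. As a hedged illustration, the Python sketch below assumes it means the number of nodes on the longest prerequisite chain in a task's dependency graph. The graph, node names, and function names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List

# Hypothetical task graph: node -> prerequisite nodes (assumed acyclic).
DEPENDS_ON: Dict[str, List[str]] = {
    "env": [],
    "db": ["env"],
    "cache": ["env"],
    "api": ["db", "cache"],
    "billing_rule": ["api"],
}


@lru_cache(maxsize=None)
def depth(node: str) -> int:
    """Number of nodes on the longest prerequisite chain ending at `node`."""
    return 1 + max((depth(dep) for dep in DEPENDS_ON[node]), default=0)


def dependency_depth() -> int:
    """Depth of the whole graph: the longest chain over all nodes."""
    return max(depth(node) for node in DEPENDS_ON)
```

Averaging this value over all tasks would give a per-benchmark figure comparable in kind to the 11.8 quoted above, provided the paper uses the same definition.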
Circularity Check
No circularity: empirical failure statistics are measured outcomes, not derived by construction
The paper is an empirical benchmark study that defines 30 tasks across 6 domains with 5,370 validation nodes and a dependency-aware hybrid evaluation. The central claim (over 95% of failures occur before deep business logic) is reported as a direct experimental count from running agents on these tasks. No equations, fitted parameters, or self-referential definitions are used to derive this statistic; it is an observed distribution of failure phases based on node outcomes. The evaluation paradigm is a methodological design choice for assessment rather than a tautology that forces the reported percentage. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text that reduce the result to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected 30 tasks across 6 domains adequately represent the structural complexity and integration challenges of production SaaS systems.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/BranchSelection.lean; Cost.FunctionalEquation branch_selection; washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Over 95% of task failures occur before agents even reach deep business logic... prematurely halting during foundational system setup
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Category-to-backbone mapping (fragment)
- If c is in the exact-match dictionary, return the assigned backbone.
- Otherwise, scan the prefix rules in order and return the first match: API*/Api* → API, BusinessLogic* → Logic, DataModel* → Data, Architecture*/Frontend*/UI* → Quality, Auth* → AuthZ.
- No category falls through; the algorithm is verified to cover all 5,370 nodes (Table 13).
Resulting Distribution. Table 13 shows how the 5,370 validation nodes and the total maxScore of 17,299.1 are distributed across the six backbones under this mapping. The distribution is intentionally non-uniform: Logic dominates by maximum score (27.2%) because business...
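The lookup order in the mapping rules above (exact match first, then ordered prefix rules) can be sketched in Python as follows. The prefix rules are copied from the fragment. The contents of the exact-match dictionary are not given there, so the single `EXACT` entry is a hypothetical placeholder, as are the function name and the category names in the usage note.

```python
from typing import Dict, List, Tuple

# Exact-match dictionary; the one entry here is a hypothetical placeholder.
EXACT: Dict[str, str] = {"Login": "AuthZ"}

# Prefix rules in the order quoted in the fragment; the first match wins.
PREFIX_RULES: List[Tuple[Tuple[str, ...], str]] = [
    (("API", "Api"), "API"),
    (("BusinessLogic",), "Logic"),
    (("DataModel",), "Data"),
    (("Architecture", "Frontend", "UI"), "Quality"),
    (("Auth",), "AuthZ"),
]


def backbone(category: str) -> str:
    """Map a validation-node category to its backbone."""
    if category in EXACT:
        return EXACT[category]
    for prefixes, target in PREFIX_RULES:
        if category.startswith(prefixes):  # str.startswith accepts a tuple
            return target
    # The fragment states that no category falls through; this is a guard.
    raise ValueError(f"unmapped category: {category!r}")
```

With these rules, a hypothetical category such as `ApiContract` maps to `API` and `UIForms` maps to `Quality`. Raising on an unmapped category is a design choice for the sketch: it would surface any gap in coverage instead of silently assigning a default backbone.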