Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Pith reviewed 2026-05-18 01:59 UTC · model grok-4.3
The pith
Build-bench shows language models can repair up to 63.19 percent of real cross-ISA build failures through iterative tool use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Build-bench establishes the first architecture-aware benchmark for LLM-based software build and repair by collecting 268 real-world failed packages and integrating four auxiliary tools—Structure Extraction, File Content Extraction, Content Modification, and Build Verification—into an iterative loop where models receive updated build logs and prior outcomes to refine repairs, yielding a maximum success rate of 63.19 percent.
What carries the argument
Build-bench benchmark consisting of 268 failed packages paired with an iterative tool-augmented reasoning loop using four auxiliary tools that supply project structure, file contents, modification capabilities, and build verification.
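The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ToyPackage`, `repair_loop`, `toy_model`, the log format, and the rule that the flag `-msse2` breaks the build are all hypothetical stand-ins chosen so the example runs without a real build environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyPackage:
    """In-memory stand-in for a failed package (hypothetical format)."""
    files: Dict[str, str]

    def structure(self) -> List[str]:
        # Structure Extraction: list the project layout.
        return sorted(self.files)

    def read(self, path: str) -> str:
        # File Content Extraction: return one file's text.
        return self.files[path]

    def write(self, path: str, content: str) -> None:
        # Content Modification: overwrite one file.
        self.files[path] = content

    def build(self) -> Tuple[bool, str]:
        # Build Verification: a toy rule replaces the real target build.
        for path in self.structure():
            if "-msse2" in self.files[path]:
                return False, f"{path}: flag -msse2 is not supported on this target"
        return True, "build succeeded"


Model = Callable[[List[str], str, List[str], Callable[[str], str]], Tuple[str, str]]


def repair_loop(pkg: ToyPackage, model: Model, max_iters: int = 3) -> Tuple[bool, List[str]]:
    """Build, and on failure feed the new log plus prior outcomes back to the model."""
    history: List[str] = []
    ok, log = pkg.build()
    for _ in range(max_iters):
        if ok:
            break
        path, content = model(pkg.structure(), log, history, pkg.read)
        pkg.write(path, content)
        ok, log = pkg.build()
        history.append(f"edited {path}: {'ok' if ok else 'failed'}")
    return ok, history


def toy_model(structure, log, history, read):
    # Hypothetical "model": take the file named at the start of the log
    # and strip the offending flag from it.
    path = log.split(":", 1)[0]
    return path, read(path).replace("-msse2", "").strip()


pkg = ToyPackage({"Makefile": "CFLAGS = -O2 -msse2", "src/lib.mk": "CFLAGS += -msse2"})
ok, history = repair_loop(pkg, toy_model)
```

In this toy run the first edit fixes one file, the rebuilt log points at the second, and the second iteration succeeds, which is the feedback pattern the benchmark relies on.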
If this is right
- Models can address complex dependencies and heterogeneous toolchains in migration tasks when given iterative feedback from build logs.
- Different models exhibit distinct tool usage patterns that influence their overall repair effectiveness.
- Repeated attempts informed by previous outcomes and error messages raise the chance of producing a working build.
- Verifiable outcomes from actual build environments enable direct, reproducible evaluation of model performance.
Where Pith is reading between the lines
- Benchmarks of this form could be extended to other migration-related tasks such as performance tuning or security updates across architectures.
- Future models might achieve higher success by combining the existing tools with more advanced planning or external documentation retrieval.
- The 63 percent ceiling points to specific remaining obstacles like very long build logs or rare dependency conflicts that could be targeted for improvement.
Load-bearing premise
The 268 selected failed packages and the four auxiliary tools represent the typical challenges of real cross-ISA migration and support fair comparisons across models.
What would settle it
Testing the same models on a fresh set of 100 previously unseen failed packages drawn from additional open-source projects and observing success rates drop below 30 percent would indicate the current collection does not capture the full range of migration difficulties.
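To see why such a drop would be informative, one can ask how likely it would be by chance. The sketch below assumes, purely for illustration, that each of the 100 fresh packages is an independent trial with the headline success probability of 0.6319; the function name `binom_cdf` is our own, and real packages are unlikely to be fully independent.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# "Below 30 percent" on 100 packages means at most 29 successes.
tail = binom_cdf(29, 100, 0.6319)
print(f"P(at most 29 of 100 succeed | p = 0.6319) = {tail:.3e}")
```

Under these assumptions the tail probability is far below one in a million, so a result under 30 percent could not plausibly be attributed to sampling noise alone.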
Original abstract
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation across the studied models, Build-bench reveals that current models achieve a maximum build success rate of 63.19% and tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Build-bench, an end-to-end benchmark for evaluating LLMs on repairing build failures during cross-ISA migrations (e.g., x86_64 to aarch64). It collects 268 real-world failed packages, integrates four auxiliary tools (Structure Extraction, File Content Extraction, Content Modification, Build Verification), and runs an iterative tool-augmented repair loop that feeds updated build logs and previous repair outcomes back to the model. Comparative experiments across models report a maximum success rate of 63.19% and significant differences in tool-usage patterns, positioning Build-bench as the first architecture-aware benchmark for LLM-based software build and repair.
Significance. If the 268-package corpus and tool interface prove representative, the work supplies a concrete, externally verifiable benchmark that moves beyond synthetic coding tasks to real build environments and executable outcomes. It supplies the first quantitative baseline for cross-ISA repair success and documents model-specific tool-use differences, which could guide future architecture-aware agent design.
major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): the claim that the 268 packages constitute a representative sample of cross-ISA failures is load-bearing for both the 63.19% headline rate and the cross-model comparisons, yet no sampling frame, source repositories, failure-mode taxonomy, build-system distribution, or diversity statistics are supplied; without these the reported success rate cannot be generalized.
- [§4] §4 (Experimental Setup): the four auxiliary tools are presented as sufficient for autonomous repair, but the manuscript provides no ablation or coverage analysis showing that Structure Extraction + Content Modification + Build Verification together address the dominant failure modes (e.g., ISA-specific compiler flags, dependency resolution across heterogeneous toolchains); this directly affects the validity of the iterative-loop results.
minor comments (2)
- [Results] Table 1 or equivalent results table: report per-model success rates with confidence intervals or exact binomial tests rather than a single aggregate maximum; the current presentation makes it difficult to assess whether the 63.19% maximum is statistically distinguishable from the other models' rates.
- [Appendix] The prompt templates and exact tool-calling format used in the iterative loop are not reproduced in an appendix; reproducibility of the reported tool-usage patterns would be improved by including them.
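As an illustration of the interval reporting requested in the first minor comment, the sketch below computes a Wilson score interval for a binomial success rate. The function name and the counts (60 and 50 successes out of 100) are hypothetical; this page does not give per-model success counts, so none of the numbers here come from the paper.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts for two models on the same 100 packages.
for name, k in [("model A", 60), ("model B", 50)]:
    lo, hi = wilson_interval(k, 100)
    print(f"{name}: {k}/100, 95% interval [{lo:.3f}, {hi:.3f}]")
```

With counts this small the two intervals overlap substantially, which is the kind of ambiguity a single headline maximum hides.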
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript on Build-bench. We address each major comment point by point below, providing the strongest honest defense of the work while making revisions where the comments identify clear gaps in the original submission.
- Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): the claim that the 268 packages constitute a representative sample of cross-ISA failures is load-bearing for both the 63.19% headline rate and the cross-model comparisons, yet no sampling frame, source repositories, failure-mode taxonomy, build-system distribution, or diversity statistics are supplied; without these the reported success rate cannot be generalized.
  Authors: We agree that the representativeness of the 268-package corpus is important for interpreting the headline success rate and model comparisons. The packages were drawn from real-world cross-ISA build failures encountered during porting efforts on Linux distributions, but the original manuscript did not supply the requested metadata. In the revised manuscript we have expanded §3 with a new subsection that documents the sampling frame (packages selected from Debian and Fedora aarch64 porting queues with build logs from the last 18 months), source repositories, a failure-mode taxonomy (ISA-specific flags, toolchain heterogeneity, dependency resolution, and configuration issues), build-system distribution statistics, and basic diversity metrics such as package size and dependency count ranges. These additions allow readers to evaluate generalizability directly.
  revision: yes
- Referee: [§4] §4 (Experimental Setup): the four auxiliary tools are presented as sufficient for autonomous repair, but the manuscript provides no ablation or coverage analysis showing that Structure Extraction + Content Modification + Build Verification together address the dominant failure modes (e.g., ISA-specific compiler flags, dependency resolution across heterogeneous toolchains); this directly affects the validity of the iterative-loop results.
  Authors: We acknowledge that an explicit ablation or coverage analysis of the tool suite would strengthen claims about the iterative repair loop. The four tools were chosen to support the core operations observed in cross-ISA failures (structure inspection, file reading, targeted edits, and outcome verification). In the revised §4 we have added a qualitative coverage analysis that maps each tool to the dominant failure modes in our taxonomy, with examples showing how Structure Extraction and Content Modification handle ISA-specific compiler flags and how Build Verification confirms toolchain compatibility. A quantitative ablation study across tool subsets is noted as future work because it would require substantial additional compute; the current revision improves justification for the reported results without overclaiming sufficiency.
  revision: partial
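For a sense of the size of the ablation mentioned as future work, the sketch below enumerates candidate tool configurations. It assumes, as our own simplification, that Build Verification is always kept because outcomes cannot be scored without it; the function name `tool_subsets` is hypothetical and nothing here reflects an experiment the authors ran.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple

TOOLS = (
    "Structure Extraction",
    "File Content Extraction",
    "Content Modification",
    "Build Verification",
)


def tool_subsets(
    tools: Sequence[str] = TOOLS,
    required: Sequence[str] = ("Build Verification",),
) -> Iterator[Tuple[str, ...]]:
    """Yield every tool subset that contains all required tools."""
    optional = [t for t in tools if t not in required]
    for r in range(len(optional) + 1):
        for combo in combinations(optional, r):
            # Preserve the original tool order within each configuration.
            yield tuple(t for t in tools if t in required or t in combo)


configs = list(tool_subsets())
for cfg in configs:
    print(" + ".join(cfg))
```

Under this assumption there are eight configurations per model, each needing a full benchmark run, which is consistent with the rebuttal's remark about compute cost.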
Circularity Check
No circularity: benchmark success measured by external verifiable build outcomes
The paper defines Build-bench explicitly via collection of 268 real-world failed packages plus four named auxiliary tools (Structure Extraction, File Content Extraction, Content Modification, Build Verification) and an iterative loop that feeds updated logs back to the model. The reported 63.19% maximum success rate is obtained from direct execution in real build environments with verifiable outcomes, not from any internal model score, fitted parameter, or self-referential definition. No equations appear; the 'first architecture-aware benchmark' claim rests on a literature positioning statement rather than a uniqueness theorem imported from the authors' prior work. The central result therefore remains independent of its own inputs and does not reduce by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.