pith. machine review for the scientific record.

arxiv: 2511.00780 · v2 · submitted 2025-11-02 · 💻 cs.SE

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

Pith reviewed 2026-05-18 01:59 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM benchmark · build repair · cross-ISA migration · software engineering · language models · build failures · architecture migration

The pith

Build-bench shows language models can repair up to 63.19 percent of real cross-ISA build failures through iterative tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Build-bench as a benchmark to measure how well language models can fix software build problems that occur when moving code between different processor architectures such as x86_64 and aarch64. It supplies 268 real failed packages along with tools that let models inspect project structure, read file contents, edit code, and verify builds. Models operate in repeated cycles where each failed attempt returns fresh error logs for the next try. The strongest result reaches 63.19 percent successful repairs, indicating that current models can manage some but not all of the dependencies and toolchain issues in such migrations.

Core claim

Build-bench is presented as the first architecture-aware benchmark for LLM-based software build and repair. It collects 268 real-world failed packages and integrates four auxiliary tools (Structure Extraction, File Content Extraction, Content Modification, and Build Verification) into an iterative loop in which models receive updated build logs and prior outcomes to refine their repairs. The maximum reported success rate is 63.19 percent.

What carries the argument

Build-bench benchmark consisting of 268 failed packages paired with an iterative tool-augmented reasoning loop using four auxiliary tools that supply project structure, file contents, modification capabilities, and build verification.
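
The iterative loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `repair_loop`, `Attempt`, `toy_model`, and `toy_build` are hypothetical, the model and the build environment are replaced by toy callables, and the iteration budget of 3 is an arbitrary choice. Only the overall shape (propose edits, apply them, verify the build, feed the new log and prior outcomes back) comes from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Files = Dict[str, str]  # path -> file contents


@dataclass
class Attempt:
    iteration: int
    edits: Files
    build_log: str
    succeeded: bool


def repair_loop(
    files: Files,
    failed_log: str,
    propose_edits: Callable[[Files, str, List[Attempt]], Files],
    build: Callable[[Files], Tuple[bool, str]],
    max_iterations: int = 3,  # assumed budget, not taken from the paper
) -> List[Attempt]:
    """Propose -> modify -> verify until the build passes or the budget ends."""
    history: List[Attempt] = []
    files = dict(files)  # work on a copy so the caller's package is untouched
    log = failed_log
    for i in range(1, max_iterations + 1):
        # Model step: sees the current files, the latest log, and prior outcomes.
        edits = propose_edits(files, log, history)
        files.update(edits)      # stand-in for Content Modification
        ok, log = build(files)   # stand-in for Build Verification
        history.append(Attempt(i, edits, log, ok))
        if ok:
            break
    return history


# Toy stand-ins (hypothetical): the build fails while the spec pins x86_64.
def toy_build(files: Files) -> Tuple[bool, str]:
    if "ExclusiveArch: x86_64" in files["pkg.spec"]:
        return False, "toy error: aarch64 is excluded by the spec"
    return True, "toy build ok"


def toy_model(files: Files, log: str, history: List[Attempt]) -> Files:
    # The first attempt changes nothing; the second reacts to the fresh log.
    if not history:
        return {}
    return {"pkg.spec": files["pkg.spec"].replace("ExclusiveArch: x86_64\n", "")}


history = repair_loop(
    {"pkg.spec": "Name: demo\nExclusiveArch: x86_64\n"},
    "toy error: aarch64 is excluded by the spec",
    toy_model,
    toy_build,
)
print([(a.iteration, a.succeeded) for a in history])  # [(1, False), (2, True)]
```

Keeping the full attempt history, rather than only the latest log, mirrors the description that models receive both updated build logs and previous repair outcomes.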

If this is right

  • Models can address complex dependencies and heterogeneous toolchains in migration tasks when given iterative feedback from build logs.
  • Different models exhibit distinct tool usage patterns that influence their overall repair effectiveness.
  • Repeated attempts informed by previous outcomes and error messages raise the chance of producing a working build.
  • Verifiable outcomes from actual build environments enable direct, reproducible evaluation of model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Benchmarks of this form could be extended to other migration-related tasks such as performance tuning or security updates across architectures.
  • Future models might achieve higher success by combining the existing tools with more advanced planning or external documentation retrieval.
  • The 63.19 percent maximum points to remaining obstacles, such as very long build logs or rare dependency conflicts, that could be targeted for improvement.

Load-bearing premise

The 268 selected failed packages and the four auxiliary tools represent the typical challenges of real cross-ISA migration and support fair comparisons across models.

What would settle it

Testing the same models on a fresh set of 100 previously unseen failed packages drawn from additional open-source projects and observing success rates drop below 30 percent would indicate the current collection does not capture the full range of migration difficulties.
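
To give a sense of how decisive such a test would be, the sketch below computes the chance of seeing fewer than 30 repairs out of 100 if the true per-package success probability really were 0.6319. It rests on two assumptions that the paper does not state: that packages are independent trials, and that the best reported rate applies uniformly to the fresh set. The helper name `binom_cdf` is ours.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# "Below 30 percent" on 100 packages means at most 29 successful repairs.
tail = binom_cdf(29, 100, 0.6319)
print(f"P(at most 29 of 100 repaired | p = 0.6319) = {tail:.3e}")
```

Under those assumptions the probability is vanishingly small, so a result below 30 percent would be hard to attribute to sampling noise alone.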

Figures

Figures reproduced from arXiv: 2511.00780 by Chaoyun Zhang, Chenyu Zhao, Chetan Bansal, Dan Pei, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Shenglin Zhang, Weilin Jin, Yongqian Sun, Zeshun Huang.

Figure 1: Comparison of different large language models (LLMs) in cross-ISA build repair tasks. (a) shows the success rates (%) achieved …
Figure 2: The automatic cross-ISA repair and build pipeline of …
Figure 3: Comparison of tool invocation behavior across LLMs. The bars represent the total number of invocations for each tool per …
Figure 4: Comparison of two repair strategies (Full File Generation vs. Patch Generation) across six LLMs and two architecture migration directions. The upper row reports the Build Success Rate, while the lower row presents Efficiency in terms of Average Repair Time (min) and Average Token Consumption (K).
Figure 5: Iterative repair process of the texmath package migration. Second Iteration: adding ldconfig scriptlets. In the second iteration, GPT-5 continues the repair process using the updated build log. It first calls the Structure Extraction tool to inspect the package layout, confirming the presence of the specification file, source archive (texmath-0.13.tar.gz), and the latest failed log. Then, the File Content E…
Original abstract

Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation across the studied models, Build-bench reveals that current models achieve a maximum build success rate of 63.19% and tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Build-bench, an end-to-end benchmark for evaluating LLMs on repairing build failures during cross-ISA migrations (e.g., x86_64 to aarch64). It collects 268 real-world failed packages, integrates four auxiliary tools (Structure Extraction, File Content Extraction, Content Modification, Build Verification), and runs an iterative tool-augmented repair loop that feeds updated build logs back to the model. Comparative experiments across models report a maximum success rate of 63.19% and tool-usage patterns that differ significantly across models. The paper positions Build-bench as the first architecture-aware benchmark for LLM-based software build and repair.

Significance. If the 268-package corpus and tool interface prove representative, the work supplies a concrete, externally verifiable benchmark that moves beyond synthetic coding tasks to real build environments and executable outcomes. It supplies the first quantitative baseline for cross-ISA repair success and documents model-specific tool-use differences, which could guide future architecture-aware agent design.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Benchmark Construction): the claim that the 268 packages constitute a representative sample of cross-ISA failures is load-bearing for both the 63.19% headline rate and the cross-model comparisons, yet no sampling frame, source repositories, failure-mode taxonomy, build-system distribution, or diversity statistics are supplied; without these the reported success rate cannot be generalized.
  2. [§4] §4 (Experimental Setup): the four auxiliary tools are presented as sufficient for autonomous repair, but the manuscript provides no ablation or coverage analysis showing that Structure Extraction + Content Modification + Build Verification together address the dominant failure modes (e.g., ISA-specific compiler flags, dependency resolution across heterogeneous toolchains); this directly affects the validity of the iterative-loop results.
minor comments (2)
  1. [Results] Table 1 or equivalent results table: report per-model success rates with confidence intervals or exact binomial tests rather than a single aggregate maximum; the current presentation makes it difficult to assess whether 63.19% is statistically distinguishable from other models.
  2. [Appendix] The prompt templates and exact tool-calling format used in the iterative loop are not reproduced in an appendix; reproducibility of the reported tool-usage patterns would be improved by including them.
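
The first minor comment asks for per-model uncertainty estimates. One common choice is the Wilson score interval, sketched below as an illustration. The function name `wilson_interval`, the model labels, and the counts are all hypothetical placeholders; they are not results from the paper, whose per-model counts are not reproduced on this page.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts, for illustration only.
results = {"model_a": (50, 100), "model_b": (40, 100)}
for name, (k, n) in results.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

Overlapping intervals between two models would suggest that their headline rates cannot be cleanly separated at the given sample size, which is the concern the referee raises.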

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript on Build-bench. We address each major comment point by point below, providing the strongest honest defense of the work while making revisions where the comments identify clear gaps in the original submission.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): the claim that the 268 packages constitute a representative sample of cross-ISA failures is load-bearing for both the 63.19% headline rate and the cross-model comparisons, yet no sampling frame, source repositories, failure-mode taxonomy, build-system distribution, or diversity statistics are supplied; without these the reported success rate cannot be generalized.

    Authors: We agree that the representativeness of the 268-package corpus is important for interpreting the headline success rate and model comparisons. The packages were drawn from real-world cross-ISA build failures encountered during porting efforts on Linux distributions, but the original manuscript did not supply the requested metadata. In the revised manuscript we have expanded §3 with a new subsection that documents the sampling frame (packages selected from Debian and Fedora aarch64 porting queues with build logs from the last 18 months), source repositories, a failure-mode taxonomy (ISA-specific flags, toolchain heterogeneity, dependency resolution, and configuration issues), build-system distribution statistics, and basic diversity metrics such as package size and dependency count ranges. These additions allow readers to evaluate generalizability directly. revision: yes

  2. Referee: [§4] §4 (Experimental Setup): the four auxiliary tools are presented as sufficient for autonomous repair, but the manuscript provides no ablation or coverage analysis showing that Structure Extraction + Content Modification + Build Verification together address the dominant failure modes (e.g., ISA-specific compiler flags, dependency resolution across heterogeneous toolchains); this directly affects the validity of the iterative-loop results.

    Authors: We acknowledge that an explicit ablation or coverage analysis of the tool suite would strengthen claims about the iterative repair loop. The four tools were chosen to support the core operations observed in cross-ISA failures (structure inspection, file reading, targeted edits, and outcome verification). In the revised §4 we have added a qualitative coverage analysis that maps each tool to the dominant failure modes in our taxonomy, with examples showing how Structure Extraction and Content Modification handle ISA-specific compiler flags and how Build Verification confirms toolchain compatibility. A quantitative ablation study across tool subsets is noted as future work because it would require substantial additional compute; the current revision improves justification for the reported results without overclaiming sufficiency. revision: partial
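
To make the compute argument concrete, the sketch below enumerates the tool subsets a quantitative ablation would need to cover. It assumes, as our own simplification, that Build Verification stays enabled in every configuration because a run cannot be scored without it; the helper `ablation_configs` is hypothetical. The model count of six is taken from the Figure 4 caption and the package count of 268 from the abstract.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

TOOLS = (
    "Structure Extraction",
    "File Content Extraction",
    "Content Modification",
    "Build Verification",
)


def ablation_configs(
    tools: Sequence[str] = TOOLS,
    always_on: Sequence[str] = ("Build Verification",),  # assumption, see lead-in
) -> List[Tuple[str, ...]]:
    """Every tool subset that keeps the always-on tools enabled."""
    optional = [t for t in tools if t not in always_on]
    configs = []
    for r in range(len(optional) + 1):
        for subset in combinations(optional, r):
            configs.append(tuple(always_on) + subset)
    return configs


configs = ablation_configs()
n_models, n_packages = 6, 268
total_runs = len(configs) * n_models * n_packages
print(len(configs), total_runs)  # 8 12864
```

Each of those runs is itself a multi-iteration repair with real builds, which is consistent with the authors' point that a full ablation would require substantial additional compute.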

Circularity Check

0 steps flagged

No circularity: benchmark success measured by external verifiable build outcomes

Full rationale

The paper defines Build-bench explicitly via collection of 268 real-world failed packages plus four named auxiliary tools (Structure Extraction, File Content Extraction, Content Modification, Build Verification) and an iterative loop that feeds updated logs back to the model. The reported 63.19% maximum success rate is obtained from direct execution in real build environments with verifiable outcomes, not from any internal model score, fitted parameter, or self-referential definition. No equations appear; the 'first architecture-aware benchmark' claim rests on a literature positioning statement rather than a uniqueness theorem imported from the authors' prior work. The central result therefore remains independent of its own inputs and does not reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; no free parameters, mathematical axioms, or newly postulated entities are introduced.



