
arxiv: 2510.01379 · v2 · submitted 2025-10-01 · 💻 cs.SE

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

Pith reviewed 2026-05-18 10:18 UTC · model grok-4.3

classification 💻 cs.SE
keywords multi-LLM orchestration · code generation · model routing · complementary strengths · agent-based systems · benchmark generalization · performance optimization

The pith

Multi-LLM routing via a static memory matrix outperforms any single model on code generation across languages and unseen benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large language models possess stable complementary strengths that vary by programming language, problem category, and development stage. Rather than relying on one model or simple voting, the authors build PerfOrch, a four-agent system that uses per-stage routing guided by a precomputed ranking matrix. This matrix is derived once from profiling on HumanEval-X and then applied without re-profiling to the separate EffiBench-X benchmark. The approach yields higher pass rates than the strongest single-model baseline and reduces execution time for most solved problems, while using roughly half the tokens of exhaustive multi-model evaluation.

Core claim

PerfOrch decomposes code generation into categorization, generation, debugging, and refinement agents. Each agent consults a Memory module containing a ranking matrix indexed by language and problem category. The matrix is built from offline profiling on HumanEval-X and selects the best-suited model for the current subtask. This structured collaboration produces average pass@1 rates of 97.19 percent on HumanEval-X and 95.83 percent on EffiBench-X, exceeding the strongest single-model pipeline by 1.22 to 14.58 percentage points across languages, while also delivering execution-time speedups on 61 to 90 percent of solved problems.
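The four-stage flow described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Memory` layout keyed by stage, the `pick` and `orchestrate` helpers, the injected `categorize`, `call_model`, and `run_tests` callables, and the policy of walking down the ranking during debugging are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical Memory layout: (stage, language, category) -> models ranked best-first.
Memory = Dict[Tuple[str, str, str], List[str]]


def pick(memory: Memory, stage: str, language: str, category: str) -> List[str]:
    """Return the ranked model list for a subtask, or a placeholder if unprofiled."""
    return memory.get((stage, language, category), ["default-model"])


def orchestrate(
    problem: str,
    language: str,
    memory: Memory,
    categorize: Callable[[str], str],
    call_model: Callable[[str, str, str], str],
    run_tests: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Run categorization -> generation -> debugging -> refinement."""
    category = categorize(problem)

    # Generation: use the top-ranked model for this language and category.
    gen_model = pick(memory, "generation", language, category)[0]
    code = call_model(gen_model, "generation", problem)
    passed = run_tests(code)

    # Debugging (assumed policy): on failure, try ranked models in order.
    if not passed:
        for model in pick(memory, "debugging", language, category):
            code = call_model(model, "debugging", code)
            passed = run_tests(code)
            if passed:
                break

    # Refinement: ask for an optimized version, keep it only if it still passes.
    if passed:
        ref_model = pick(memory, "refinement", language, category)[0]
        candidate = call_model(ref_model, "refinement", code)
        if run_tests(candidate):
            code = candidate
    return code, passed
```

Passing the model client and test runner in as callables keeps the routing logic testable without any real LLM API.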

What carries the argument

The Memory module: a static ranking matrix indexed by programming language and problem category, built once from HumanEval-X profiling and consulted at runtime to route each subtask to the strongest model.
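One way such a ranking matrix could be derived from offline profiling is sketched below. The record format, the use of raw pass rate as the ranking score, the alphabetical tie-break, and the `fallback` argument are assumptions made for illustration; the summary does not specify the paper's actual scoring or tie-breaking rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One hypothetical profiling record: (model, language, category, passed).
Record = Tuple[str, str, str, bool]
Matrix = Dict[Tuple[str, str], List[str]]


def build_ranking_matrix(records: Iterable[Record]) -> Matrix:
    """Rank models per (language, category) cell by observed pass rate."""
    # cell -> model -> [passes, attempts]
    counts: Dict[Tuple[str, str], Dict[str, List[int]]] = defaultdict(
        lambda: defaultdict(lambda: [0, 0])
    )
    for model, language, category, passed in records:
        tally = counts[(language, category)][model]
        tally[0] += int(passed)
        tally[1] += 1

    matrix: Matrix = {}
    for cell, per_model in counts.items():
        # Highest pass rate first; ties broken by model name (an assumption).
        matrix[cell] = sorted(
            per_model,
            key=lambda m: (-per_model[m][0] / per_model[m][1], m),
        )
    return matrix


def route(matrix: Matrix, language: str, category: str, fallback: str) -> str:
    """Look up the top-ranked model, falling back when the cell is unprofiled."""
    ranked = matrix.get((language, category))
    return ranked[0] if ranked else fallback
```

Because the matrix is built once and only read at runtime, routing costs a dictionary lookup rather than an extra model call.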

If this is right

  • Structured per-stage routing lifts correctness beyond what majority voting or single-model selection achieves.
  • A one-time profiling pass on a single benchmark produces rankings that transfer to a different benchmark distribution.
  • Execution time improves for 61 to 90 percent of solved problems, with mean speedups of 4.7 to 29.9 percent.
  • Token usage is roughly half that of exhaustive multi-model evaluation while matching its refinement coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model strengths may be sufficiently intrinsic that similar static routing tables could be reused across entirely new task families.
  • The approach reduces the practical cost of maintaining multiple models by turning offline profiling into a reusable asset.
  • If the stability assumption holds, teams could maintain one shared memory matrix instead of retraining or re-profiling per application domain.

Load-bearing premise

The relative strengths of the different LLMs are stable properties of the models themselves rather than artifacts of any particular problem set.

What would settle it

Re-profiling the ranking matrix directly on EffiBench-X and observing whether the selected models or overall performance change substantially from the HumanEval-X-derived matrix.
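The comparison proposed here could be quantified along the following lines. This sketch assumes both matrices map a (language, category) cell to a best-first list of the same models with no ties; the function names and the choice of top-1 agreement plus Kendall's tau as stability measures are illustrative, not taken from the paper.

```python
from typing import Dict, List, Tuple

Matrix = Dict[Tuple[str, str], List[str]]


def top1_agreement(a: Matrix, b: Matrix) -> float:
    """Fraction of shared cells where both matrices select the same top model."""
    shared = [cell for cell in a if cell in b]
    if not shared:
        raise ValueError("no shared cells to compare")
    same = sum(1 for cell in shared if a[cell][0] == b[cell][0])
    return same / len(shared)


def kendall_tau(rank_a: List[str], rank_b: List[str]) -> float:
    """Kendall's tau between two tie-free rankings of the same models."""
    n = len(rank_a)
    if n < 2 or set(rank_a) != set(rank_b):
        raise ValueError("need two rankings of the same >= 2 models")
    pos_b = {model: i for i, model in enumerate(rank_b)}
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # rank_a places model i ahead of model j; check whether rank_b agrees.
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

High top-1 agreement with a tau near 1 across cells would be consistent with the stability premise; frequent top-1 changes would suggest the HumanEval-X rankings are partly benchmark-specific.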

Figures

Figures reproduced from arXiv: 2510.01379 by Haotang Li, Hong Chen, Huashan Chen, In Kee Kim, Jinfu Chen, Kebin Peng, Kyu Hyung Lee, Sen He, Weiyi Shang, Zhenyu Qi.

Figure 1: Comparison between GPT-4o and PerfOrch on Rust across HumanEval-X and EffiBench-X benchmarks,
Figure 2: An example problem from HumanEval-X in Go, including prompt, test cases, and canonical solution.
Figure 3: Flowchart of multi-stage performance-guided LLM orchestration framework.
Figure 4: The design of PerfOrch, an LLM agent for automated performant code generation.
Figure 5: Execution time optimization from PerfOrch refinement across languages and benchmarks.
Figure 6: Three solutions of HumanEval-X C++/16, including canonical solution, Claude, and PerfOrch.
Abstract

Large Language Models (LLMs) have become central to automated code generation, yet existing approaches operate within a single-LLM paradigm: one model is selected and applied throughout the entire generation process. We observe that different LLMs exhibit complementary strengths: no single model dominates across all programming languages, algorithmic problem categories, or development stages. Multi-LLM collaboration, structured as per-stage, per-category routing rather than majority voting, produces higher-quality code than any individual model. Based on this observation, we propose PerfOrch, a multi-agent orchestration system that decomposes code generation into four collaborative agents: categorization, generation, debugging, and refinement. Each agent maintains a Memory module: a ranking matrix indexed by programming language and problem category, constructed from offline profiling and consulted at runtime to select the most suitable model for each task. We evaluate PerfOrch on two benchmarks, HumanEval-X and EffiBench-X, totaling 2,500 problems across five languages (Python, Java, C++, Go, and Rust). PerfOrch achieves average pass@1 rates of 97.19% on HumanEval-X and 95.83% on EffiBench-X, improving over the strongest single-model pipeline by 1.22-14.58 percentage points across languages. Notably, Memory rankings constructed solely from HumanEval-X profiling generalize to the entirely unseen EffiBench-X benchmark without re-profiling, demonstrating that the complementary-strength patterns PerfOrch exploits are properties of the models rather than artifacts of a specific problem distribution. Beyond correctness, PerfOrch improves execution time for 61-90% of solved problems with mean speedups of 4.7-29.9%, matching the refinement coverage of exhaustive multi-model evaluation at roughly half the token cost.
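For readers unfamiliar with the metric, the pass@1 figures above can be related to the widely used unbiased pass@k estimator introduced with HumanEval, which for n samples per problem with c correct gives 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the abstract does not state how many samples per problem PerfOrch draws, and the helper names are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Sequence[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k in percent over (n, c) pairs, one pair per problem."""
    if not results:
        raise ValueError("no problems supplied")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1), pass@1 reduces to the plain fraction of problems whose one generated solution passes all tests.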

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PerfOrch, a multi-agent orchestration system for code generation that decomposes the process into four agents (categorization, generation, debugging, refinement). Each agent consults a static Memory module—a ranking matrix indexed by language and problem category—constructed via offline profiling on HumanEval-X. The system is evaluated on HumanEval-X and the unseen EffiBench-X (total 2500 problems across Python, Java, C++, Go, Rust), reporting average pass@1 of 97.19% and 95.83% respectively, with gains of 1.22–14.58 pp over the strongest single-model baseline, plus execution-time speedups on 61–90% of solved problems at roughly half the token cost of exhaustive multi-model evaluation. The central claim is that per-stage per-category routing exploits stable complementary model strengths that transfer without re-profiling.

Significance. If the generalization result holds, the work provides concrete evidence that model strengths can be captured once in a static ranking matrix and applied to new problem distributions, offering a more efficient alternative to majority voting or per-benchmark retraining in multi-LLM code generation. The reported pass@1 numbers on two public benchmarks and the efficiency gains constitute falsifiable, reproducible empirical support for structured routing; this could inform practical multi-agent LLM pipelines in software engineering.

major comments (1)
  1. [§4 (Evaluation on EffiBench-X)] §4 (Evaluation on EffiBench-X) and the generalization paragraph in the abstract: the claim that Memory rankings capture model properties rather than benchmark artifacts rests on EffiBench-X representing a distinct distribution. Both benchmarks use the identical five languages and the manuscript provides no category-frequency statistics, problem-type overlap analysis, or distributional comparison; without these data the observed transfer could arise from similar category distributions, weakening the isolation of the per-category routing benefit.
minor comments (2)
  1. [Results] Table 1 or the results section: include per-language pass@1 breakdowns and statistical significance tests (e.g., McNemar or bootstrap intervals) alongside the reported averages to allow readers to assess consistency of gains.
  2. [§3.2] §3.2 (Memory construction): clarify whether the ranking matrix is strictly static after HumanEval-X profiling or admits any online update rule; the current description leaves open whether runtime adaptation occurs.
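The significance test suggested in the first minor comment could look like the following exact McNemar test on paired per-problem outcomes. This is a generic sketch using only the standard library; the input format (two aligned lists of pass/fail booleans, one for the baseline and one for the orchestrated system) is an assumption.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[bool], system: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    if len(baseline) != len(system):
        raise ValueError("outcome lists must be paired")
    # Discordant pairs: problems solved by exactly one of the two pipelines.
    b = sum(1 for x, y in zip(baseline, system) if x and not y)
    c = sum(1 for x, y in zip(baseline, system) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Running this per language would show whether the smaller gains near the 1.22 pp end are distinguishable from noise.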

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address the single major comment below and outline the changes we will make to strengthen the generalization argument.

  1. Referee: [§4 (Evaluation on EffiBench-X)] §4 (Evaluation on EffiBench-X) and the generalization paragraph in the abstract: the claim that Memory rankings capture model properties rather than benchmark artifacts rests on EffiBench-X representing a distinct distribution. Both benchmarks use the identical five languages and the manuscript provides no category-frequency statistics, problem-type overlap analysis, or distributional comparison; without these data the observed transfer could arise from similar category distributions, weakening the isolation of the per-category routing benefit.

    Authors: We agree that the manuscript would benefit from explicit evidence that EffiBench-X constitutes a meaningfully distinct distribution from HumanEval-X. The current text describes EffiBench-X as an unseen benchmark drawn from different problem sources, but does not supply the quantitative comparisons requested. In the revised manuscript we will add (1) category-frequency histograms for both benchmarks, (2) a problem-type overlap analysis based on the categorization taxonomy used by the Categorization Agent, and (3) a distributional divergence measure (e.g., Jensen-Shannon divergence over category distributions together with average embedding similarity of problem statements). These additions will allow readers to assess the degree to which the observed transfer can be attributed to stable model strengths rather than shared category distributions. revision: yes
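The Jensen-Shannon divergence mentioned in the response can be computed over category labels as sketched below. The label lists and function names are hypothetical; the sketch uses base-2 logarithms, so the result lies between 0 (identical category distributions) and 1 (disjoint ones).

```python
from collections import Counter
from math import log2
from typing import Dict, Iterable


def distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Turn a list of category labels into a normalized frequency table."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels supplied")
    return {label: count / total for label, count in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two category distributions."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of categories.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: Dict[str, float]) -> float:
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(v * log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

A value close to 0 between the two benchmarks' category distributions would support the referee's concern that transfer may reflect shared category mix rather than model-intrinsic strengths.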

Circularity Check

0 steps flagged

No significant circularity; generalization tested via direct execution on held-out benchmark


The core derivation proceeds from empirical observation of complementary model strengths (via offline profiling on HumanEval-X) to construction of a static Memory ranking matrix, followed by runtime routing in PerfOrch and direct measurement of pass@1 and execution-time improvements on the entirely unseen EffiBench-X benchmark. These performance numbers are obtained by executing generated code against ground-truth test cases, not by algebraic reduction or re-use of fitted quantities inside the same derivation. No load-bearing uniqueness theorem, self-citation chain, or ansatz is invoked to justify the routing policy; the generalization claim rests on the empirical transfer result itself rather than on any definitional equivalence. The paper therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the empirical stability of model strengths across problem distributions and on the assumption that a single offline profiling pass produces a reusable ranking matrix.

free parameters (1)
  • model selection rule from ranking matrix
    The exact threshold or tie-breaking rule used to pick the top-ranked model for a given language-category pair is not specified in the abstract.
axioms (1)
  • domain assumption: Different LLMs exhibit stable complementary strengths across languages, algorithmic categories, and development stages that can be captured by a static ranking matrix.
    This premise is invoked to justify building the Memory module from offline profiling and reusing it at runtime.
invented entities (1)
  • Memory module (ranking matrix indexed by language and problem category) · no independent evidence
    purpose: To enable runtime selection of the most suitable LLM for each agent task without exhaustive evaluation.
    The matrix is constructed from offline profiling and consulted at runtime; no independent evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5891 in / 1436 out tokens · 35063 ms · 2026-05-18T10:18:50.518590+00:00 · methodology


