pith. machine review for the scientific record.

arxiv: 2604.19201 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code editing, large language models, small language models, model cascade, edit sketches, efficiency, software development

The pith

Decomposing code editing into large-model sketch generation and small-model application cuts token use while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that code editing can be split into two stages to gain both effectiveness and speed. A large model first creates concise sketches that capture only the required changes from a natural-language request. A smaller model then merges those sketches into the original code to produce the final result. This split matters because full-file generation by large models repeats most of the unchanged code, wasting time and cost, while small models alone cannot reliably track long contexts or cross-file links. If the smaller model can be strengthened for the application step, the cascade delivers edited code comparable to what a large model produces alone, at a fraction of the output cost.
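To make the control flow concrete, here is a minimal sketch of the cascade, assuming a generic call_model helper plus hypothetical model names and prompt wording; none of it is drawn from the paper's actual implementation.

```python
# Minimal sketch of the two-stage cascade described above. `call_model`,
# the model names, and the prompt wording are illustrative assumptions,
# not the paper's implementation.

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to any LLM provider."""
    raise NotImplementedError

def cascaded_edit(original_code: str, request: str) -> str:
    # Stage 1: the large model emits only a concise edit sketch,
    # i.e. the changed regions plus minimal anchoring context.
    sketch = call_model(
        "large-model",
        f"Requirement: {request}\n"
        f"Code:\n{original_code}\n"
        "Return only an edit sketch of the required changes.",
    )
    # Stage 2: the smaller model merges the sketch into the full file
    # and reproduces the unchanged code itself.
    return call_model(
        "small-model",
        f"Original code:\n{original_code}\n"
        f"Edit sketch:\n{sketch}\n"
        "Apply the sketch and return the complete edited file.",
    )
```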

Core claim

The authors claim that code editing decomposes naturally into edit sketch generation, where a large model produces compact outlines of the needed modifications, and edit sketch application, where a smaller model inserts those outlines into the full original codebase. The large model therefore outputs far fewer tokens, improving efficiency, while the smaller model performs the bulk of the reconstruction once the hard reasoning is complete.
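As one concrete, deliberately simplified illustration of the interface between the two stages, an edit sketch could take a search/replace form; the format and the deterministic applier below are assumptions for exposition only, since the paper assigns application to a small language model rather than to string splicing.

```python
# Illustrative only: a "search/replace" style edit sketch and a naive
# deterministic applier. The paper's actual sketch format may differ.

ORIGINAL = """\
def total(prices):
    subtotal = sum(prices)
    return subtotal
"""

# The large model would emit only this compact sketch, not the whole file.
SKETCH = [
    {"search": "def total(prices):",
     "replace": "def total(prices, tax_rate=0.0):"},
    {"search": "    return subtotal",
     "replace": "    return subtotal * (1 + tax_rate)"},
]

def apply_sketch(original: str, sketch: list[dict]) -> str:
    """Splice each hunk into the file; the unchanged bulk of the code
    never has to be regenerated by the expensive model."""
    edited = original
    for hunk in sketch:
        if hunk["search"] not in edited:
            raise ValueError(f"anchor not found: {hunk['search']!r}")
        edited = edited.replace(hunk["search"], hunk["replace"], 1)
    return edited

print(apply_sketch(ORIGINAL, SKETCH))
```

In realistic multi-file edits the sketch may omit detail that a literal splice cannot recover, which is why the paper gives the application step to a small model and why that model's long-context and cross-file abilities become the load-bearing premise.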

What carries the argument

The two-stage cascade of edit sketch generation by a large model followed by sketch application by a smaller model.

If this is right

  • Large models generate only the concise sketches rather than entire modified files.
  • The smaller model handles the majority of token output, lowering overall generation cost and latency (a rough back-of-envelope cost sketch follows this list).
  • Effectiveness holds only if the smaller model receives targeted improvements for long-context and cross-file reasoning.
  • The final edited code matches large-model quality provided the application stage succeeds.
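A back-of-envelope comparison of output-token cost, using made-up file sizes and per-token prices purely to show the shape of the saving (the paper reports its own measurements):

```python
# Hypothetical numbers only: why shifting most output tokens from a large
# to a small model cuts cost even though total output stays about the same.

FILE_TOKENS = 2_000     # size of the edited file, in output tokens (assumed)
SKETCH_TOKENS = 150     # concise sketch emitted by the large model (assumed)

LARGE_PRICE = 0.010     # assumed $ per 1K output tokens, large model
SMALL_PRICE = 0.001     # assumed $ per 1K output tokens, small model

direct_cost = FILE_TOKENS * LARGE_PRICE / 1_000
cascade_cost = (SKETCH_TOKENS * LARGE_PRICE + FILE_TOKENS * SMALL_PRICE) / 1_000

print(f"direct full-file editing: ${direct_cost:.4f}")   # $0.0200
print(f"cascaded editing:         ${cascade_cost:.4f}")  # $0.0035
```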

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sketch-plus-application split could reduce large-model usage in other code tasks that separate reasoning from implementation.
  • Specialized training of small models for sketch application might further shrink the role of large models in routine edits.

Load-bearing premise

Smaller models can be enhanced enough to apply the sketches accurately inside long code contexts and across multiple files without adding more errors than a large model would produce alone.

What would settle it

On a benchmark of multi-file code edits, if the cascaded outputs contain substantially more incorrect changes than a single large model generating full files, the claim of maintained effectiveness fails.
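Sketched below is one way such a comparison could be scored, assuming whitespace-normalized exact match against a reference edit as a crude proxy for "incorrect changes"; the paper's own metrics (Figure 3 mentions FM and EM) are not reproduced here.

```python
# Hedged sketch of the decisive experiment: score cascaded vs. direct
# outputs on the same multi-file edit benchmark and compare accuracy.

def exact_match(predicted: str, reference: str) -> bool:
    """Whitespace-normalized exact match; a crude proxy for a correct edit."""
    def norm(s: str) -> str:
        return "\n".join(line.rstrip() for line in s.strip().splitlines())
    return norm(predicted) == norm(reference)

def accuracy(outputs: list[str], references: list[str]) -> float:
    return sum(exact_match(p, r) for p, r in zip(outputs, references)) / len(references)

# The effectiveness claim would fail if accuracy(cascaded_outputs, refs) fell
# substantially below accuracy(direct_large_model_outputs, refs).
```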

Figures

Figures reproduced from arXiv: 2604.19201 by Chaozheng Wang, Cuiyun Gao, Hailiang Huang, Michael R. Lyu, Shuzheng Gao, Ting Peng, Yichen Li, Yuetang Deng, Zezhou Yang, Zongjie Li.

Figure 1: An example comparing direct editing (a) and cascaded code editing (b).
Figure 2: A motivating example about the challenge that small models face in precise sketch application.
Figure 3: Model performance (FM and EM) on sketch application tasks by token range complexity.
read the original abstract

Code editing constitutes a fundamental practice in software development, wherein developers modify existing codebases according to natural language requirements. Accurate code editing necessitates a comprehensive understanding of both the existing codebase and the modification requirements. Although large language models (LLMs) have demonstrated promising performance in code editing tasks, they suffer from substantial inefficiency by generating entire modified files that largely consist of unchanged code. While smaller models could potentially address this inefficiency, they typically lack the capacity to effectively comprehend long code contexts required for accurate editing. To ensure both effectiveness and efficiency, we propose to decompose code editing into a two-stage cascade: edit sketch generation, wherein a large model first produces concise sketches representing the requisite modifications (the more challenging phase), and edit sketch application, wherein a smaller model integrates these sketches into the original code to produce the final output edited code (the simpler phase). This cascaded design reduces the number of tokens generated by the large model, as the majority of the output is handled by the smaller, more efficient model, thereby enhancing overall efficiency. However, the effectiveness of this approach is constrained by current small models' limited capabilities in handling long-context scenarios and cross-file dependencies, which are essential for accurate sketch application in real-world codebases. To address these limitations and enhance smaller models' sketch application capabilities, ...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Cascaded Code Editing, a two-stage framework for LLM-based code editing. A large model first generates concise 'edit sketches' capturing the required modifications (claimed to be the harder phase), after which a smaller model integrates those sketches into the original (potentially multi-file) codebase to produce the final edited code (claimed to be the simpler phase). The design reduces large-model token generation for efficiency while aiming to preserve accuracy; the abstract explicitly notes that current small models lack capacity for long contexts and cross-file dependencies and states that enhancements are proposed to address this.

Significance. If the proposed enhancements allow the small model to apply sketches with accuracy comparable to direct large-model editing, the approach would offer a practical way to improve efficiency in real-world code editing without sacrificing effectiveness. It provides a concrete decomposition that could influence hybrid LLM pipelines in software engineering.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): the central effectiveness claim rests on the assertion that sketch application is the 'simpler phase' once enhancements are applied, yet the abstract itself states that small models currently cannot handle the required long-context and cross-file scenarios. No ablation or quantitative comparison (e.g., cascade accuracy vs. direct large-model editing on multi-file benchmarks) is referenced in the provided text to show that the enhancements close this gap; without such evidence the joint effectiveness-efficiency guarantee is unverified.
  2. [§4] §4 (Experiments): if results are present, they must include controls that isolate whether small-model sketch application maintains parity with large-model baselines on tasks involving cross-file dependencies; otherwise the decomposition's load-bearing assumption remains untested.
minor comments (2)
  1. [Abstract] The abstract is truncated mid-sentence ('To address these limitations...'); the full description of the enhancements should be moved or summarized earlier for readability.
  2. [Introduction] Notation for 'edit sketch' is introduced without a formal definition or example in the opening paragraphs; a small illustrative figure or pseudocode would clarify the interface between the two stages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of our effectiveness claims and experimental design. We address each major comment below, clarifying the manuscript's contributions while acknowledging where additional evidence or controls would strengthen the presentation. We are prepared to revise accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): the central effectiveness claim rests on the assertion that sketch application is the 'simpler phase' once enhancements are applied, yet the abstract itself states that small models currently cannot handle the required long-context and cross-file scenarios. No ablation or quantitative comparison (e.g., cascade accuracy vs. direct large-model editing on multi-file benchmarks) is referenced in the provided text to show that the enhancements close this gap; without such evidence the joint effectiveness-efficiency guarantee is unverified.

    Authors: We agree that the abstract explicitly notes current small-model limitations in long-context and cross-file handling, and that the effectiveness of the cascade depends on the proposed enhancements closing this gap. Section 3 describes concrete enhancements (context compression, dependency-aware prompting, and sketch-specific fine-tuning) intended to address these issues. The experiments in §4 report overall cascade accuracy comparable to direct large-model editing on multi-file benchmarks while achieving substantial token savings. However, we acknowledge that an explicit ablation isolating the contribution of the enhancements (e.g., small-model application with vs. without enhancements versus direct large-model editing) is not separately tabulated. We will add this ablation to the revised manuscript to make the load-bearing assumption directly verifiable. revision: yes

  2. Referee: [§4] §4 (Experiments): if results are present, they must include controls that isolate whether small-model sketch application maintains parity with large-model baselines on tasks involving cross-file dependencies; otherwise the decomposition's load-bearing assumption remains untested.

    Authors: Section 4 already evaluates the full cascade on benchmarks containing cross-file dependencies and reports accuracy parity with large-model baselines alongside efficiency gains. To more rigorously isolate the sketch-application stage, we will add targeted controls in the revision: (1) small-model application accuracy with and without the §3 enhancements, and (2) direct comparison of those results against large-model editing on the same cross-file subsets. These controls will be presented in a new table or subsection to confirm that the decomposition's assumption holds under the proposed enhancements. revision: yes

Circularity Check

0 steps flagged

No circularity: methodological proposal with no derivations or self-referential reductions

full rationale

The paper proposes a two-stage cascade for code editing (large model for sketch generation, small model for application) as a design choice to balance effectiveness and efficiency. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The abstract acknowledges small-model limitations on long contexts and cross-file dependencies, then states an intent to address them, but this is an explicit assumption and enhancement plan rather than a circular reduction of any claimed result to its inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The central claim remains a self-contained methodological suggestion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the approach assumes large models are strong at sketch generation and that small models can be improved for sketch application, but no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5569 in / 1128 out tokens · 33243 ms · 2026-05-10T02:33:08.073123+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Tushar Aggarwal, Swayam Singh, Abhijeet Awasthi, Aditya Kanade, and Nagarajan Natarajan. 2025. NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits. InForty-second International Conference on Machine Learning. https://openreview.net/forum?id=3B6fF1PxYD

  2. [2]

    Aider. 2025. Aider LLM Leaderboards. https://aider.chat/docs/leaderboards/

  3. [3]

    Aider. 2025. Aider’s polyglot benchmark. https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark

  4. [4]

    Anthropic. 2025. Introducing Claude 4. https://www.anthropic.com/news/claude-4

  5. [5]

    Nazmus Ashrafi, Salah Bouktif, and Mohammed Mediani. 2025. Enhancing llm code generation: A systematic evaluation of multi-agent collaboration and runtime debugging for improved accuracy, reliability, and latency.arXiv preprint arXiv:2505.02133(2025)

  6. [6]

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning.Proc. ACM Softw. Eng.1, FSE (2024), 675–698

  7. [7]

    Rajiv D Banker, Gordon B Davis, and Sandra A Slaughter. 1998. Software development practices, software complexity, and software maintenance performance: A field study.Management science44, 4 (1998), 433–450

  8. [8]

    Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, et al . 2024. Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions. InFirst Conference on Language Modeling

  9. [9]

    Saikat Chakraborty, Yangruibo Ding, Miltiadis Allamanis, and Baishakhi Ray. 2022. CODIT: Code Editing With Tree-Based Neural Models.IEEE Trans. Software Eng.48, 4 (2022), 1385–1399

  10. [10]

    Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu K. Lahiri, and Nikhil Swamy. 2025. Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1755–1767

  11. [11]

    Saikat Chakraborty and Baishakhi Ray. 2021. On Multi-Modal Learning of Editing Source Code. In36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021, Melbourne, Australia, November 15-19, 2021. IEEE, 443–455

  12. [12]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174(2016)

  13. [13]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self- Debug. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

  14. [14]

    Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. 2024. ChatUniTest: A Framework for LLM-Based Test Generation. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, Porto de Galinhas, Brazil, July 15-19, 2024, Marcelo d’Amorim (Ed.). ACM, 572–576

  15. [15]

    Clang LLVM. 2025. ClangFormat. https://clang.llvm.org/docs/ClangFormat.html

  16. [16]

    Cursor. 2025. The AI Code Editor. https://cursor.com/en

  17. [17]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems35 (2022), 16344–16359

  18. [18]

    DeepInfra. 2025. Simple Pricing, Deep Infrastructure. https://deepinfra.com/pricing

  19. [19]

    Google. 2025. Gemini 2.5 Pro. https://deepmind.google/models/gemini/pro/

  20. [20]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

  21. [21]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)

  23. [23]

    Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Xianzhen Luo, Qiufeng Wang, YuanTao Fan, Qingfu Zhu, Zhaoxiang Zhang, Yang Gao, Jie Fu, Qian Liu, Houyi Li, Ge Zhang, Yuan Qi, Xu Yinghui, Wei Chu, and Zili Wang. 2025. OpenCoder: The Open Cookboo...

  24. [24]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024)

  25. [25]

    Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. International Conference on Learning Representations, ICLR (2018)

  26. [26]

    Hamed Jelodar, Mohammad Meymani, and Roozbeh Razavi-Far. 2025. Large language models (llms) for source code analysis: applications, models and datasets.arXiv preprint arXiv:2503.17502(2025)

  27. [27]

    Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, and Shuai Wang. 2025. Measuring and Augmenting Large Language Models for Solving Offensive Security Challenges. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS 2025, Taipei, Taiwan, October 13-17, 2025

  28. [28]

    Uttamjit Kaur and Gagandeep Singh. 2015. A review on software maintenance issues and how to reduce maintenance efforts.International Journal of Computer Applications118, 1 (2015), 6–11

  29. [29]

    Tobias Kuipers. 2016. Why you need to know about code maintainability. https://www.oreilly.com/content/why-you- need-to-know-about-code-maintainability/

  30. [30]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

  31. [31]

    Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. CodeEditor: Learning to Edit Source Code with Pre-trained Models.ACM Trans. Softw. Eng. Methodol.32, 6 (2023), 143:1–143:22

  32. [32]

    Kaixin Li, Qisheng Hu, James Xu Zhao, Hui Chen, Yuxi Xie, Tiedong Liu, Michael Shieh, and Junxian He. 2024. InstructCoder: Instruction Tuning Large Language Models for Code Editing, Xiyan Fu and Eve Fleisig (Eds.)

  33. [33]

    Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, and Cuiyun Gao. 2023. CCTEST: Testing and Repairing Code Completion Systems. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1238–1250

  34. [34]

    Zongjie Li, Daoyuan Wu, Shuai Wang, and Zhendong Su. 2025. Api-guided dataset synthesis to finetune large code models.Proceedings of the ACM on Programming Languages9, OOPSLA1 (2025), 786–815

  35. [35]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)

  36. [36]

    Chenyan Liu, Yufan Cai, Yun Lin, Yuhuan Huang, Yunrui Pei, Bo Jiang, Ping Yang, Jin Song Dong, and Hong Mei. 2024. CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria,...

  37. [37]

    Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic Gradient Descent with Warm Restarts. InInternational Conference on Learning Representations

  38. [38]

    Hafedh Mili, Fatma Mili, and Ali Mili. 2002. Reusing software: Issues and research directions.IEEE transactions on Software Engineering21, 6 (2002), 528–562

  39. [39]

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following

  40. [40]

    Daye Nam, Ahmed Omran, Ambar Murillo, Saksham Thakur, Abner Araujo, Marcel Blistein, Alexander Frömmgen, Vincent J. Hellendoorn, and Satish Chandra. 2025. Prompting LLMs for Code Editing: Struggles and Remedies.CoRR abs/2504.20196 (2025)

  41. [41]

    OpenAI. 2025. Introducing gpt-oss. https://openai.com/index/introducing-gpt-oss/

  42. [42]

    Qwen3. 2025. Qwen3 Coder - Agentic Coding Adventure. https://qwen3lm.com/

  43. [43]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

  44. [44]

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. {Zero-offload}: Democratizing {billion-scale} model training. In2021 USENIX Annual Technical Conference (USENIX ATC 21). 551–564

  45. [45]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis.arXiv preprint arXiv:2009.10297 (2020)

  46. [47]

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al . 2023. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950 (2023)

  47. [48]

    Caitlin Sadowski and Greg Levin. 2007. Simhash: Hash-based similarity detection. Technical report, Google

  48. [49]

    ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. 2025. Seed-Coder: Let the Code Model Curate Data for Itself.arXiv preprint arXiv:2506.03524 (2025)

  49. [50]

    Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. 2025. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193(2025)

  50. [51]

    Yida Tao, Yingnong Dang, Tao Xie, Dongmei Zhang, and Sunghun Kim. 2012. How do software engineers understand code changes? An exploratory study in industry. InProceedings of the ACM SIGSOFT 20th International symposium on the foundations of software engineering. 1–11

  51. [52]

    Chaozheng Wang, Jia Feng, Shuzheng Gao, Cuiyun Gao, Zongjie Li, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu. 2025. Beyond PEFT: Layer-Wise Optimization for More Effective and Efficient Large Code Model Tuning. Proceedings of the ACM on Software Engineering2, FSE (2025), 1567–1590

  52. [53]

    Chaozheng Wang, Shuzheng Gao, Cuiyun Gao, Wenxuan Wang, Chun Yong Chong, Shan Gao, and Michael R Lyu. A systematic evaluation of large code models in API suggestion: When, which, and how. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 281–293

  54. [55]

    Chaozheng Wang, Zongjie Li, Cuiyun Gao, Wenxuan Wang, Ting Peng, Hailiang Huang, Yuetang Deng, Shuai Wang, and Michael R Lyu. 2024. Exploring Multi-Lingual Bias of Large Code Models in Code Generation.arXiv preprint arXiv:2404.19368(2024)

  55. [56]

    Chaozheng Wang, Zezhou Yang, Shuzheng Gao, Cuiyun Gao, Ting Peng, Hailiang Huang, Yuetang Deng, and Michael Lyu. 2025. RAG or Fine-tuning? A Comparative Study on LCMs-based Code Completion in Industry. InProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 93–104

  56. [57]

    Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, and Zibin Zheng. 2025. Beyond functional correctness: Investigating coding style inconsistencies in large language models.Proceedings of the ACM on Software Engineering2, FSE (2025), 690–712

  57. [58]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120(2023)

  58. [59]

    Wai Kin Wong, Daoyuan Wu, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. DecLLM: LLM-Augmented Recompilable Decompilation for Enabling Programmatic Use of Decompiled Code. In Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis

  60. [61]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

  61. [62]

    Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2024. WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5140–5153

  62. [63]

    Kunpeng Zhang, Zongjie Li, Daoyuan Wu, Shuai Wang, and Xin Xia. 2025. Low-Cost and Comprehensive Non-textual Input Fuzzing with LLM-Synthesized Input Generators.arXiv preprint arXiv:2501.19282(2025)

  63. [64]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37 (2024), 62557–62583