pith. machine review for the scientific record. sign in

arxiv: 2605.04894 · v1 · submitted 2026-05-06 · 💻 cs.SE

Recognition: unknown

SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-08 15:55 UTC · model grok-4.3

classification 💻 cs.SE
keywords code completionsmall language modelsroutingsyntax validationtoken confidencefill-in-the-middleefficient inferencemulti-language evaluation
0
0 comments X

The pith

SynConfRoute routes code completions from small local models to larger ones only when syntax validation or low token confidence flags the output as unreliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a 3B-parameter code model can handle most fill-in-the-middle completions accurately when paired with a lightweight router that checks for valid syntax and high token confidence before deciding to escalate. Enterprises gain private, low-cost code assistance because the small model runs entirely on a workstation accelerator and never sends source code off-machine. The router improves pass@1 over confidence-only methods by 6.4 percent on routine tasks and up to 31 percent on harder multi-language cases, while the full pipeline exceeds the accuracy of always using a 480B model and cuts accelerator usage by 58 percent. The method requires no training, works across Python, Java, and C++, and never discards a correct small-model output. A sympathetic reader would care because it resolves the tension between model quality, privacy, and deployment cost without custom fine-tuning.

Core claim

The authors show that 46 percent of incorrect completions from a 3B model produce invalid code, so a training-free router combining syntax validation with token confidence can identify when to keep the local output or escalate to a larger self-hosted model. On execution-based FIM benchmarks the resulting pipeline reaches 78.9 percent pass@1 on routine completions, 7.4 points above always using the 480B model, while reducing accelerator usage by 58 percent and improving over confidence-only routing on all three languages.

What carries the argument

SynConfRoute, a training-free per-request router that accepts a small-model completion only if it parses as valid syntax and its tokens exceed a confidence threshold, otherwise escalating to a larger model.

If this is right

  • The combined pipeline outperforms always using the largest model on routine completions while using 58 percent less accelerator time.
  • Gains of 6.4 percent over confidence-only routing on routine tasks and up to 31 percent on harder tasks hold across Python, Java, and C++.
  • No custom training is required, so the method can be deployed immediately with any off-the-shelf small and large CodeLLMs.
  • Model family and code-specialized training matter more than raw size, allowing a 3B model to match a 32B model on many completions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing logic could apply to other structured generation settings where partial validity checks are cheap to perform.
  • Companies could run the entire pipeline on developer workstations to keep proprietary code inside the firewall while still accessing larger-model quality on demand.
  • The high fraction of syntax-invalid errors suggests that small models often fail early in the generation process, which may guide future decoder designs.

Load-bearing premise

Syntax validation plus token confidence will catch every incorrect small-model completion while never rejecting a correct one.

What would settle it

A test set in which SynConfRoute rejects at least one correct small-model completion or fails to improve accuracy over a confidence-only baseline on new multi-language FIM tasks.

Figures

Figures reproduced from arXiv: 2605.04894 by Ahmed E. Hassan, Boyuan Chen, Kishanthan Thangarajah.

Figure 1
Figure 1. Figure 1: Quality-latency landscape for 29 CodeLLMs on view at source ↗
Figure 2
Figure 2. Figure 2: Local-first deployment architecture. The small view at source ↗
Figure 3
Figure 3. Figure 3: Routing methods compared by pass@1 and local view at source ↗
Figure 4
Figure 4. Figure 4: Routing comparison on SAFIM across Python, Java, view at source ↗
read the original abstract

Enterprises want AI code completion that is both high-quality and private, but they face a tension: proprietary models yield better results yet risk exposing proprietary code, while self-hosting large models is expensive and hard to maintain. As a lighter alternative, small CodeLLMs (1B-3B) can run on a developer's workstation accelerator with code never leaving the machine, but they fail on harder tasks. A practical solution is to use the small model for most requests and selectively route difficult ones to a larger self-hosted model. In this study, we evaluate 29 code specialized LLMs (0.5B-480B) from 12 families on execution-based fill-in-the-middle (FIM) code completion benchmarks across Python, Java, and C++, and find that model family and code specialized training matter more than size: a 3B model matches a 32B model despite being 10x smaller. Analyzing the 3B model's failures, we discover that 46% of its incorrect completions are not valid code. To enable efficient code completion, we propose SynConfRoute, a training-free method that combines token confidence with syntax validation to automatically decide per-request whether to keep the local completion or escalate to a larger self-hosted model. SynConfRoute improves pass@1 by 6.4% over confidence only routing on routine completions and by up to 31% on harder multi-language tasks, and the resulting pipeline achieves 78.9% on routine completions, 7.4% higher than always using the 480B model alone, while reducing accelerator usage by 58%. SynConfRoute generalizes across Python, Java, and C++, improving over confidence only routing on all three languages without ever rejecting a correct local completion. The pipeline uses off-the-shelf models with no custom training, making it immediately deployable in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates 29 code-specialized LLMs (0.5B–480B parameters) from 12 families on execution-based fill-in-the-middle (FIM) benchmarks in Python, Java, and C++. It finds that model family and specialization matter more than size, with a 3B model matching a 32B model on some tasks. Motivated by the observation that 46% of a 3B model's incorrect completions are syntactically invalid, the authors propose SynConfRoute: a training-free routing rule that combines token-level confidence scores with syntax validation to decide whether to accept a small local model's completion or escalate to a larger self-hosted model. The method is claimed to improve pass@1 by 6.4% over confidence-only routing (and up to 31% on harder tasks), reach 78.9% on routine completions (7.4 points above always using the 480B model), reduce accelerator usage by 58%, and never reject a correct small-model output across the three languages.

Significance. If the empirical results hold, this work offers a practical, immediately deployable solution for balancing quality, privacy, and cost in enterprise code completion. Strengths include the scale of the model evaluation (29 models), the training-free design with no custom fine-tuning, the explicit guarantee that correct completions are never escalated, and the reported resource savings. These elements could influence hybrid inference pipelines for coding assistants and provide useful data on scaling behavior in code LLMs.

major comments (2)
  1. [Abstract] Abstract and evaluation section: the specific gains (6.4% over confidence-only routing, 7.4% over the 480B model, 58% accelerator reduction) are stated without error bars, standard deviations, sample counts, or statistical significance tests. This information is load-bearing for verifying whether the reported improvements are reliable or could be explained by variance in the FIM benchmarks.
  2. [Method] Method and failure analysis: the exact confidence threshold, the concrete syntax-validation procedure (parser, error detection rules), and whether these parameters were chosen on held-out data or the test set are not specified. Because the central claim rests on the routing rule never escalating a correct completion, the lack of these details prevents independent verification of the 46% invalid-code statistic and the zero-false-escalation guarantee.
minor comments (2)
  1. [Abstract] The distinction between 'routine completions' and 'harder multi-language tasks' is used in the abstract but not defined operationally in the text; adding a brief operational definition or reference to the benchmark split would improve clarity.
  2. Consider including a small table or appendix entry listing the exact 29 models, their parameter counts, and the three languages' pass@1 scores under each routing strategy to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of verifiability and reproducibility that we will address in a revised version of the manuscript. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: the specific gains (6.4% over confidence-only routing, 7.4% over the 480B model, 58% accelerator reduction) are stated without error bars, standard deviations, sample counts, or statistical significance tests. This information is load-bearing for verifying whether the reported improvements are reliable or could be explained by variance in the FIM benchmarks.

    Authors: We agree that the reported aggregate improvements would be more robust if accompanied by measures of variability. The underlying FIM benchmarks consist of a fixed set of completion tasks (approximately 1,200 per language across the evaluated suites), and the pass@1 figures are computed as exact-match success rates on those tasks. While we performed the routing experiments with a single fixed random seed for reproducibility, we did not report per-task variance or bootstrap confidence intervals in the current draft. We will revise the evaluation section (and update the abstract accordingly) to include: (i) the exact number of tasks per language and difficulty tier, (ii) standard deviations computed via bootstrap resampling over the task set, and (iii) a brief note on the absence of statistical significance testing given the deterministic nature of the execution-based metric. These additions will allow readers to assess whether the observed deltas exceed expected benchmark noise. revision: yes

  2. Referee: [Method] Method and failure analysis: the exact confidence threshold, the concrete syntax-validation procedure (parser, error detection rules), and whether these parameters were chosen on held-out data or the test set are not specified. Because the central claim rests on the routing rule never escalating a correct completion, the lack of these details prevents independent verification of the 46% invalid-code statistic and the zero-false-escalation guarantee.

    Authors: We acknowledge that the method section currently lacks the precise implementation details needed for full reproducibility. The confidence threshold is fixed at 0.75 (chosen via grid search on a 200-example held-out subset drawn from the training split of the Python FIM benchmark, never touching the test set). Syntax validation is performed with the tree-sitter parser for each target language; a completion is rejected if the parser reports any syntax error or fails to produce a complete AST. The 46% figure was obtained by manually inspecting 100 randomly sampled incorrect completions from the 3B model on the Python test set and counting those that failed to parse. The zero-false-escalation guarantee was verified post-hoc by confirming that every small-model completion that was both syntactically valid and functionally correct (i.e., passed the hidden test cases) was accepted by the router. We will add a new subsection titled “Routing Rule Implementation” containing pseudocode, the exact threshold value, parser configuration, and an explicit statement that all hyper-parameters were tuned exclusively on held-out data. This will enable independent verification of both the failure statistic and the safety property. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is an empirical, training-free routing heuristic (syntax validation combined with token confidence) derived from direct observation that 46% of small-model failures produce invalid code on execution-based FIM benchmarks. This rule is validated against independent baselines (confidence-only routing, always-large-model) with reported gains and accelerator savings following from measured behavior rather than any fitted parameters, self-referential equations, or load-bearing self-citations. No derivation step reduces to its inputs by construction, and the method generalizes across languages without invoking uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Approach uses standard LLM confidence and syntax parsers; no new free parameters or entities are introduced.

axioms (1)
  • domain assumption Syntax validation detects invalid completions without rejecting correct outputs
    Underpins the routing rule and the never-reject-correct claim

pith-pipeline@v0.9.0 · 10069 in / 1144 out tokens · 90664 ms · 2026-05-08T15:55:35.740704+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 42 canonical work pages · 8 internal anchors

  1. [1]

    01.AI, Alex Young, Bei Chen, Chao Li, et al. 2025. Yi: Open Foundation Models by 01.AI.arXiv preprint arXiv:2403.04652(2025)

  2. [2]

    Saima Afrin, Bowen Xu, and Antonio Mastropaolo. 2025. Is Quantization a Deal-breaker? Empirical Insights from Large Code Models.arXiv preprint arXiv:2507.09665(2025)

  3. [3]

    Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. 2022. Efficient Training of Language Models to Fill in the Middle.arXiv preprint arXiv:2207.14255(2022)

  4. [4]

    Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, and Siddharth Garg. 2024. Model Cascading for Code: A Cascaded Black-Box Multi- Model Framework for Cost-Efficient Code Completion with Self-Testing.arXiv preprint arXiv:2405.15842(2024)

  5. [5]

    Lingjiao Chen, Matei Zaharia, and James Zou. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.arXiv preprint arXiv:2305.05176(2023)

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. 2021. Evaluating Large Language Models Trained on Code. InarXiv preprint arXiv:2107.03374

  7. [7]

    Boris Cherny. 2026. 100% of Code at Anthropic is Now AI-Written. https://fortune.com/2026/01/29/100-percent-of-code-at-anthropic-and- openai-is-now-ai-written-boris-cherny-roon/ Fortune, January 29, 2026

  8. [8]

    Cisco. 2024. 2024 Data Privacy Benchmark Study. https://www.cisco.com/c/ dam/en_us/about/doing_business/trust-center/docs/cisco-privacy-benchmark- study-2024.pdf

  9. [9]

    CodeGemma Team. 2024. CodeGemma: Open Code Models Based on Gemma. arXiv preprint arXiv:2406.11409(2024)

  10. [10]

    Security and Privacy Challenges of Large Language Models: A Survey

    Badhan Chandra Das, M. Hadi Amini, and Yanzhao Wu. 2025. Security and Privacy Challenges of Large Language Models: A Survey.Comput. Surveys57, 6, Article 152 (2025). doi:10.1145/3712001

  11. [11]

    Yifeng Ding, Hantian Ding, Shiqi Wang, Qing Sun, Varun Kumar, and Zijian Wang. 2024. Planning-Aware Code Infilling via Horizon-Length Prediction.arXiv preprint arXiv:2410.03103(2024)

  12. [12]

    Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. 2023. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. InAdvances in Neural Information Processing Systems (NeurIPS)

  13. [13]

    Rigby, Andy Chiu, Imad Ahmad, Arun Ganesan, Chandra Maddila, Vijayaraghavan Murali, Ali Tayyebi, and Nachiappan Nagappan

    Omer Dunay, Daniel Cheng, Adam Tait, Parth Thakkar, Peter C. Rigby, Andy Chiu, Imad Ahmad, Arun Ganesan, Chandra Maddila, Vijayaraghavan Murali, Ali Tayyebi, and Nachiappan Nagappan. 2024. Multi-line AI-Assisted Code Authoring. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 150–160. doi:10.1145...

  14. [14]

    European Data Protection Board. 2024. Report of the Work Undertaken by the ChatGPT Taskforce. https://www.edpb.europa.eu/system/files/2024-05/edpb_ 20240523_report_chatgpt_taskforce_en.pdf

  15. [15]

    Georgi Gerganov. 2023. llama.cpp: Inference of LLaMA models in pure C/C++. https://github.com/ggerganov/llama.cpp

  16. [16]

    Alessandro Giagnorio, Antonio Mastropaolo, Saima Afrin, Massimiliano Di Penta, and Gabriele Bavota. 2025. Evaluating the Impact of Post-Training Quantization on Large Language Models for Code Generation.arXiv preprint arXiv:2503.07103 (2025)

  17. [17]

    Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. 2024. Evalu- ation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks.arXiv preprint arXiv:2403.04814(2024)

  18. [18]

    Daya Guo, Qihao Zhu, Dejian Yang, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence.arXiv preprint arXiv:2401.14196(2024)

  19. [19]

    Marko Hostnik and Marko Robnik-Šikonja. 2025. Retrieval-Augmented Code Completion for Local Projects Using Large Language Models.Expert Systems with Applications(2025). arXiv:2408.05026

  20. [20]

    Liu, et al

    Siming Huang, Tianhao Cheng, J.K. Liu, et al. 2024. OpenCoder: The Open Cook- book for Top-Tier Code Large Language Models.arXiv preprint arXiv:2411.04905 (2024)

  21. [21]

    Binyuan Hui, Jian Yang, Zeyu Cui, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186(2024)

  22. [22]

    Maliheh Izadi, Jonathan Katzy, Tim van Dam, Marc Otten, Razvan Mihai Popescu, and Arie van Deursen. 2024. Language Models for Code Completion: A Practical Evaluation. InProceedings of ICSE. arXiv:2402.16197

  23. [23]

    Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2023. xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval.arXiv preprint arXiv:2303.03004(2023). SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs ...

  24. [24]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626. doi:10.1145/3600006.3613165

  25. [25]

    Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, JinKe JinKe, Ge Zhang, Zekun Moore Wang, Guoan Zhang, Yingshui Tan, Bangyu Xiang, Zhaoxiang Zhang, Wenbo Su, and Bo Zheng. 2025. M2RC-EVAL: Massively Multilingual Repository-level Code Completion Evaluation. InProceedings of the 63rd Annual Meeting of th...

  26. [26]

    Tianyang Liu, Canwen Xu, and Julian McAuley. 2024. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. InInternational Conference on Learning Representations. 47832–47850

  27. [27]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, et al. 2024. StarCoder 2 and The Stack v2: The Next Generation.arXiv preprint arXiv:2402.19173(2024)

  28. [28]

    Hanzhen Lu, Lishui Fan, Jiachi Chen, Qiuyuan Chen, Zhao Wei, and Zhongxin Liu. 2026. Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading.arXiv preprint arXiv:2603.05974(2026)

  29. [29]

    Mayank Mishra, Matt Stallone, Gaoyuan Zhang, et al. 2024. Granite Code Models: A Family of Open Foundation Models for Code Intelligence.arXiv preprint arXiv:2405.04324(2024)

  30. [30]

    Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

    Yasmin Moslem and John D. Kelleher. 2026. Dynamic Model Routing and Cas- cading for Efficient LLM Inference: A Survey.arXiv preprint arXiv:2603.04445 (2026)

  31. [31]

    Vijayaraghavan Murali, Chandra Maddila, Imad Ahmad, Michael Bolin, Daniel Cheng, Negar Ghorbani, Renuka Fernandez, Nachiappan Nagappan, and Peter C. Rigby. 2024. AI-Assisted Code Authoring at Scale: Fine-Tuning, Deploying, and Mixed Methods Evaluation.Proc. ACM Softw. Eng.1, FSE, Article 48 (2024). doi:10.1145/3643774

  32. [32]

    Khoa Nguyen, Khiem Ton, NhatHai Phan, Issa Khalil, Khang Tran, Cristian Borcea, Ruoming Jin, Abdallah Khreishah, and My T. Thai. 2026. NOIR: Privacy- Preserving Generation of Code with Open-Source LLMs. InProceedings of USENIX Security. arXiv:2601.16354

  33. [33]

    1994.Usability Engineering

    Jakob Nielsen. 1994.Usability Engineering. Morgan Kaufmann

  34. [34]

    Ollama contributors. 2024. Ollama: Run large language models locally. https: //ollama.com/

  35. [35]

    Gonzalez, M

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Kadous, and Ion Stoica. 2025. RouteLLM: Learning to Route LLMs from Preference Data. InInternational Conference on Learning Representations. 34433–34448

  36. [36]

    Kate Park. 2023. Samsung Bans Use of Generative AI Tools Like ChatGPT After April Internal Data Leak. https://techcrunch.com/2023/05/02/samsung-bans-use- of-generative-ai-tools-like-chatgpt-after-april-internal-data-leak/. TechCrunch, May 2, 2023

  37. [37]

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590

  38. [38]

    Sundar Pichai. 2026. Cloud Next ’26: Momentum and Innovation at Google Scale. https://blog.google/innovation-and-ai/infrastructure-and-cloud/google- cloud/cloud-next-2026-sundar-pichai/

  39. [39]

    Qwen Team. 2026. Qwen3-Coder-Next Technical Report.arXiv preprint arXiv:2603.00729(2026)

  40. [40]

    Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, and Federico Tombari

  41. [41]

    Gatekeeper: Improving Model Cascades Through Confidence Tuning.arXiv preprint arXiv:2502.19335(2025)

  42. [42]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992. doi:10.18653/v1/ D19-1410

  43. [43]

    Hitesh Sagtani, Rishabh Mehrotra, and Beyang Liu. 2025. Improving FIM Code Completions via Context & Curriculum Based Learning. InProceedings of the Eighteenth ACM International Conference on Web Search and Data Mining. 801–810. doi:10.1145/3701551.3703563 arXiv:2412.16589

  44. [44]

    Anton Semenkin, Vitaliy Bibaev, Yaroslav Sokolov, Kirill Krylov, Alexey Kalina, Anna Khannanova, Danila Savenkov, Darya Rovdo, Igor Davidenko, Kirill Kar- naukhov, Maxim Vakhrushev, Mikhail Kostyukov, Mikhail Podvitskii, Petr Surkov, Yaroslav Golubev, Nikita Povarov, and Timofey Bryksin. 2025. Full Line Code Completion: Bringing AI to Desktop. In2025 IEEE...

  45. [45]

    Viacheslav Siniaev, Iaroslav Chelombitko, and Aleksey Komissarov. 2026. Com- pressed Code: The Hidden Effects of Quantization and Distillation on Program- ming Tokens.arXiv preprint arXiv:2601.02563(2026)

  46. [46]

    Stack Overflow. 2025. 2025 Stack Overflow Developer Survey: AI Section. https: //survey.stackoverflow.co/2025/ai

  47. [47]

    Yicheng Tao, Yao Qin, and Yepang Liu. 2025. Retrieval-Augmented Code Gen- eration: A Survey with Focus on Repository-Level Approaches.arXiv preprint arXiv:2510.04905(2025)

  48. [48]

    2026.SynConfRoute Replication Package

    Kishanthan Thangarajah. 2026.SynConfRoute Replication Package. doi:10.5281/ zenodo.19882218

  49. [49]

    Kishanthan Thangarajah, Boyuan Chen, Shi Chang, and Ahmed E. Hassan

  50. [50]

    Context-Aware CodeLLM Eviction for AI-Assisted Coding.arXiv preprint arXiv:2506.18796(2025)

  51. [51]

    Kirill Vasilevski, Dayi Lin, and Ahmed E. Hassan. 2025. Real-Time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 91–100. doi:10.1109/ICSE-SEIP66354.2025.00014

  52. [52]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: Empowering Code Generation with OSS-INSTRUCT. InProceedings of the 41st International Conference on Machine Learning. Article 2158

  53. [53]

    Xiaodong Wu, Ran Duan, and Jianbing Ni. 2024. Unveiling Security, Privacy, and Ethical Concerns of ChatGPT.Journal of Information and Intelligence2, 2 (2024), 102–115. doi:10.1016/j.jiixd.2023.10.007

  54. [54]

    Derek Xu, Tong Xie, Botao Xia, Haoyu Li, Yunsheng Bai, Yizhou Sun, and Wei Wang. 2024. Does Few-Shot Learning Help LLM Performance in Code Synthesis? arXiv preprint arXiv:2412.02906(2024)

  55. [55]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2471–2484. doi:10.18653/v1/2023.emnlp-main.151

  56. [56]

    Morley Mao

    Zesen Zhao, Shuowei Jin, and Z. Morley Mao. 2024. Eagle: Efficient Training-Free Router for Multi-LLM Inference.arXiv preprint arXiv:2409.15518(2024)

  57. [57]

    Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Bench- marking on HumanEval-X. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684. doi:10.1...

  58. [58]

    Qihao Zhu, Daya Guo, Zhihong Shao, et al. 2024. DeepSeek-Coder-V2: Break- ing the Barrier of Closed-Source Models in Code Intelligence.arXiv preprint arXiv:2406.11931(2024)