arxiv: 2506.03535 · v2 · submitted 2025-06-04 · 💻 cs.SE

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation

Pith reviewed 2026-05-19 11:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords cross-lingual code generation · retrieval-augmented generation · programming language transfer · multilingual LLMs · code migration · knowledge transfer · RACG

The pith

Retrieval-augmented code generation transfers knowledge across programming languages unevenly, with success tied to language similarity and pretraining diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether retrieval-augmented code generation can move useful code knowledge from one programming language to another without heavy retraining. Researchers built a new dataset of nearly 14,000 examples spanning 13 languages to run controlled tests. Direct injection of retrieved code from a different language produces gains but remains difficult. Transfer performs better between languages that share structural traits and when the underlying model encountered many languages during pretraining. Systems using code-focused retrievers draw little additional value from natural language comments inside the source code.
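
As a rough illustration of the setup described above, the sketch below retrieves a snippet written in one language and places it directly into a prompt that asks for a solution in another. Everything in it is an assumption made for illustration: the Jaccard token-overlap `retrieve` function is a toy stand-in for a code-specific retriever, the prompt wording in `build_prompt` is hypothetical, and the two-entry corpus is invented. It is not the paper's pipeline, retriever, or data.

```python
import re
from typing import List, Set, Tuple

Snippet = Tuple[str, str]  # (language name, source code)


def tokenize(text: str) -> Set[str]:
    """Return the set of lowercase identifier-like tokens in the text."""
    return {t.lower() for t in re.findall(r"[A-Za-z_]\w*", text)}


def retrieve(query: str, corpus: List[Snippet], k: int = 1) -> List[Snippet]:
    """Rank snippets by Jaccard token overlap with the query.

    Toy stand-in for a code-specific retriever (an assumption, not the
    retriever used in the paper).
    """
    q = tokenize(query)

    def score(item: Snippet) -> float:
        toks = tokenize(item[1])
        union = q | toks
        return len(q & toks) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(task: str, target_lang: str, retrieved: List[Snippet]) -> str:
    """Place retrieved cross-language code directly ahead of the task text."""
    parts = [f"Reference implementation in {lang}:\n{code}\n" for lang, code in retrieved]
    parts.append(f"Task: {task}\nWrite the solution in {target_lang}.")
    return "\n".join(parts)


# Invented two-snippet corpus, for illustration only.
corpus = [
    ("Java", "int sum(int[] values) { int total = 0; for (int v : values) total += v; return total; }"),
    ("Go", "func reverse(s string) string { r := []rune(s); return string(r) }"),
]
task = "Compute the sum of integer values"
top = retrieve(task, corpus, k=1)
prompt = build_prompt(task, "Python", top)
print(prompt)
```

The resulting prompt would then be passed to whichever LLM is under evaluation. Swapping the corpus language or the retriever is how one would vary the language pair and the retrieval setting in a study of this kind.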

Core claim

Our experiments reveal three key insights: (1) Knowledge transfer in RACG across PLs is non-trivial even using direct injection. (2) RACG exhibits unequal cross-lingual knowledge transfer, and its efficacy depends on linguistic affinity of PL pair and diversity of LLM pretraining corpus. (3) RACG shows limited reliance on natural language information embedded in code when equipped with a code-specific retriever. These findings provide practical guidance for designing effective multilingual RACG systems.

What carries the argument

A newly constructed dataset of nearly 14K instances across 13 programming languages that enables controlled measurement of cross-lingual knowledge transfer in retrieval-augmented code generation.

If this is right

  • Multilingual RACG systems should prioritize language pairs that share structural similarities for higher transfer success.
  • Greater diversity in an LLM's pretraining corpus improves cross-lingual code generation performance.
  • Code-specific retrievers can be used without heavy dependence on natural language comments inside retrieved snippets.
  • Direct injection of cross-language retrieved code offers measurable but limited gains for code migration tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether adding targeted pretraining on low-resource programming languages closes the observed transfer gaps.
  • Hybrid retrievers that blend code structure with selected natural language signals might yield further gains in languages with richer documentation.
  • The results point toward practical design rules for retrieval-augmented tools that developers could apply when porting code between specific language pairs.

Load-bearing premise

The newly constructed dataset of nearly 14K instances across 13 programming languages faithfully represents the distribution and difficulty of real-world cross-lingual code generation and migration tasks that developers encounter.

What would settle it

Running the same RACG experiments on an independently gathered set of real code-migration tasks from open-source repositories and finding that the three reported insights do not appear.

Figures

Figures reproduced from arXiv: 2506.03535 by Hongyu Lin, Jialun Cao, Le Sun, Qiming Zhu, Shing-Chi Cheung, WeiLi Zhang, Xianpei Han, Xuanang Chen, Yaojie Lu.

Figure 1: Pipeline construction and four experimental settings for ...
Figure 2: Venn diagrams illustrating the distribution of cases with ...
Original abstract

Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) has largely focused on single-language settings, leaving their cross-lingual effectiveness underexplored. Multilingual RACG systems are increasingly important for migrating and reusing code across programming languages (PLs), a common yet challenging task in modern software development. To systematically study cross-lingual code knowledge transfer in RACG, we construct a dataset covering 13 PLs with nearly 14K instances. Our experiments reveal three key insights: (1) Knowledge transfer in RACG across PLs is non-trivial even using direct injection. (2) RACG exhibits unequal cross-lingual knowledge transfer, and its efficacy depends on linguistic affinity of PL pair and diversity of LLM pretraining corpus. (3) RACG shows limited reliance on natural language information embedded in code when equipped with a code-specific retriever. These findings provide practical guidance for designing effective multilingual RACG systems. https://github.com/icip-cas/Cross-Lingual-RACG

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs a new dataset of nearly 14K instances across 13 programming languages to empirically study cross-lingual retrieval-augmented code generation (RACG). Experiments on this dataset yield three insights: (1) knowledge transfer across PLs remains non-trivial even with direct injection, (2) transfer efficacy is unequal and depends on linguistic affinity between PL pairs as well as diversity of the LLM pretraining corpus, and (3) RACG exhibits limited reliance on natural language information in code when paired with a code-specific retriever. The work concludes with practical guidance for designing multilingual RACG systems.

Significance. If the central empirical patterns hold, the study provides a valuable contribution by filling a gap in multilingual RACG research, which has been underexplored relative to single-language settings. The newly constructed cross-lingual dataset and the three concrete insights on transfer behaviors offer actionable implications for code migration and reuse tasks. The public release of the dataset and code (via the linked GitHub repository) is a clear strength that supports reproducibility and enables follow-on work.

major comments (2)
  1. [§3, Dataset Construction] All three reported insights rest on the newly constructed ~14K-instance dataset. The manuscript provides insufficient detail on the exact filtering rules, pairing criteria for cross-lingual instances, and any external anchoring against real-world migration PRs or production codebases. This leaves open whether the observed unequal transfer and limited NL reliance are general properties of RACG or artifacts of the curation process.
  2. [§5, Experimental Results] The second insight on unequal cross-lingual transfer would be strengthened by reporting statistical significance tests (e.g., p-values or confidence intervals) for performance differences across language pairs; without them, it is difficult to rule out that some observed disparities arise from experimental variance rather than systematic linguistic or pretraining effects.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly quantify the dataset scale and language coverage in the opening sentences to improve immediate clarity for readers.
  2. [Figures/Tables] Figure captions and table headers would benefit from additional detail on the exact metrics plotted (e.g., which retrieval-augmented vs. baseline configurations are compared) to aid quick interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive feedback. The comments on dataset transparency and statistical rigor are well-taken. We address each point below and will revise the manuscript accordingly to strengthen clarity and evidence.

Point-by-point responses
  1. Referee: [§3, Dataset Construction] All three reported insights rest on the newly constructed ~14K-instance dataset. The manuscript provides insufficient detail on the exact filtering rules, pairing criteria for cross-lingual instances, and any external anchoring against real-world migration PRs or production codebases. This leaves open whether the observed unequal transfer and limited NL reliance are general properties of RACG or artifacts of the curation process.

    Authors: We appreciate the referee’s call for greater methodological transparency. Section 3 of the original manuscript already describes the high-level construction pipeline, but we agree it can be expanded. In the revision we will add an explicit subsection detailing: (1) the precise filtering rules (e.g., minimum token length, removal of duplicates via exact and semantic similarity thresholds, and exclusion of trivial or malformed snippets); (2) the pairing criteria for cross-lingual instances, which rely on functional equivalence verified through unit-test execution and semantic similarity computed via code embeddings; and (3) our rationale for the synthetic construction approach, which prioritizes controlled isolation of linguistic factors over direct anchoring to specific GitHub PRs. While we did not perform an exhaustive audit against production migration datasets, the controlled design allows us to isolate the effects of linguistic affinity and pretraining diversity—the core phenomena under study. These additions will help readers evaluate the scope of our claims. revision: yes

  2. Referee: [§5, Experimental Results] The second insight on unequal cross-lingual transfer would be strengthened by reporting statistical significance tests (e.g., p-values or confidence intervals) for performance differences across language pairs; without them, it is difficult to rule out that some observed disparities arise from experimental variance rather than systematic linguistic or pretraining effects.

    Authors: We fully agree that statistical tests would increase confidence in the unequal-transfer finding. In the revised Section 5 we will report: (a) 95% confidence intervals around the Pass@1 and Pass@10 metrics for each language pair, and (b) p-values from paired statistical tests (Wilcoxon signed-rank for non-normal distributions and paired t-tests where appropriate) comparing performance differences across pairs. These results will be presented both in the main text and in an expanded appendix table. This addition directly addresses the concern that observed disparities might reflect experimental variance rather than systematic effects of linguistic affinity or pretraining corpus diversity. revision: yes
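
The reporting promised in the second response can be sketched in a few lines. The block below computes the standard unbiased per-problem pass@k estimate, 1 - C(n-c, k) / C(n, k), and a percentile-bootstrap 95% confidence interval over problems. To stay dependency-free it uses a paired bootstrap on per-problem differences in place of the Wilcoxon signed-rank and paired t-tests named in the response, so it is a substitute technique, not the authors' analysis. The `(n, c)` counts for the two language pairs are invented for illustration and are not results from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # math.comb(n - c, k) is 0 when fewer than k samples failed.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_ci(
    scores: Sequence[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of the scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def paired_bootstrap_diff(
    a: Sequence[float], b: Sequence[float], **kwargs
) -> Tuple[float, float]:
    """Interval for the mean per-problem difference a - b (same problems, same order)."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return bootstrap_ci([x - y for x, y in zip(a, b)], **kwargs)


# Invented (n, c) counts per problem for two hypothetical language pairs.
pair_a: List[Tuple[int, int]] = [(10, 8), (10, 6), (10, 9), (10, 7), (10, 5), (10, 8)]
pair_b: List[Tuple[int, int]] = [(10, 3), (10, 4), (10, 2), (10, 5), (10, 1), (10, 3)]

scores_a = [pass_at_k(n, c, 1) for n, c in pair_a]
scores_b = [pass_at_k(n, c, 1) for n, c in pair_b]

ci_a = bootstrap_ci(scores_a)
diff_lo, diff_hi = paired_bootstrap_diff(scores_a, scores_b)
print("pass@1 CI, pair A:", ci_a)
print("CI for mean difference A - B:", (diff_lo, diff_hi))
```

If the interval for the mean difference excludes zero, the gap between the two pairs is unlikely to be resampling noise alone. With only a handful of problems per pair the interval would be wide, which is the variance concern the referee raises.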

Circularity Check

0 steps flagged

Empirical measurements on newly constructed dataset yield independent insights

Full rationale

The paper's central claims consist of three empirical insights derived from experiments performed on a freshly constructed dataset of nearly 14K instances spanning 13 programming languages. No mathematical derivation, parameter fitting, or self-referential definition is present; the reported patterns on knowledge transfer, linguistic affinity effects, and retriever behavior are direct outputs of the experimental measurements rather than quantities defined in terms of themselves or prior self-citations. The construction and evaluation steps remain self-contained against external benchmarks because they introduce new data and report observable results without reducing any prediction or uniqueness claim to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard empirical assumptions about dataset representativeness and evaluation metrics rather than introducing new free parameters or postulated entities.

axioms (1)
  • Domain assumption: The constructed dataset instances accurately reflect realistic cross-lingual code generation scenarios.
    The paper uses this dataset to draw conclusions about knowledge transfer; the assumption is invoked when generalizing experimental results to practical multilingual RACG systems.

pith-pipeline@v0.9.0 · 5740 in / 1239 out tokens · 64140 ms · 2026-05-19T11:48:26.172645+00:00 · methodology
