pith. machine review for the scientific record.

arxiv: 2604.25960 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.LG · cs.PL

Recognition: unknown

Large Language Models for Multilingual Code Intelligence: A Survey

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 02:41 UTC · model grok-4.3

classification 💻 cs.SE · cs.LG · cs.PL
keywords large language models · multilingual code generation · code translation · polyglot systems · software engineering · cross-language generalization · benchmarks · evaluation metrics

The pith

Large language models must overcome bias toward high-resource languages like Python to support reliable code intelligence in polyglot systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how large language models handle programming tasks across multiple languages. It establishes that real-world software systems typically combine several languages at once, so models need strong performance not just in Python but also in lower-resource languages such as Rust and OCaml. The survey organizes its review around two concrete tasks: generating code in different languages from the same natural-language requirements and translating code from one language to another while preserving its original meaning and behavior. It catalogs existing methods, benchmarks, and evaluation metrics, then points to open challenges in achieving consistent cross-language results.
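
As a hedged illustration of the second task (not drawn from the paper itself), behavior preservation in code translation is commonly checked by differential testing: run the source program and its translation on the same inputs and compare the observable outputs. The sketch below uses two Python callables as stand-ins for programs in different languages; the function names and the toy factorial task are ours, not the survey's.

```python
# Minimal differential-testing sketch (illustrative only): two callables stand
# in for a source program and its translation into another language; a real
# harness would compile and execute each in its own toolchain.
from typing import Callable, Iterable, List, Tuple


def behaviors_match(
    source_fn: Callable[[int], int],
    translated_fn: Callable[[int], int],
    inputs: Iterable[int],
) -> Tuple[bool, List[int]]:
    """Return (all inputs agree, list of inputs where outputs differ)."""
    mismatches = [x for x in inputs if source_fn(x) != translated_fn(x)]
    return (not mismatches, mismatches)


def source_factorial(n: int) -> int:
    # "Source language" reference implementation.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


def translated_factorial(n: int) -> int:
    # Purported translation; recursion instead of a loop, same behavior.
    return 1 if n <= 1 else n * translated_factorial(n - 1)


if __name__ == "__main__":
    ok, bad = behaviors_match(source_factorial, translated_factorial, range(15))
    print(f"agree on all sampled inputs: {ok}; mismatches: {bad}")
```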

Core claim

Current research on large language models for code remains heavily biased toward high-resource languages such as Python, with noticeably weaker performance on languages like Rust and OCaml. Because real-world systems are inherently polyglot, the survey centers on two key tasks: multilingual code generation from shared natural-language requirements and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics while highlighting challenges and opportunities for trustworthy cross-language generalization.

What carries the argument

The two primary tasks, multilingual code generation from shared natural-language requirements and semantics-preserving multilingual code translation, organize the review of methods, benchmarks, and metrics for cross-language capabilities.

If this is right

  • Better multilingual generation would let developers write one natural-language specification and receive correct implementations in several languages.
  • Reliable semantic-preserving translation would reduce the cost and risk of porting existing codebases between languages.
  • Metrics that directly measure semantic equivalence across languages would give clearer signals for model improvement than current proxies (one hedged sketch of such a signal follows this list).
  • Overcoming generalization gaps would make AI coding assistants practical for the mixed-language projects that dominate industry codebases.
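
A hedged sketch of the kind of direct signal the third point gestures at, not a metric taken from the paper: for each benchmark problem, generate solutions in several languages, run them on shared test inputs, and count the fraction of problems on which every language variant agrees. The data and names below are invented for illustration.

```python
# Hypothetical cross-language consistency metric (illustrative, not from the
# survey): a problem counts as "consistent" only if the outputs produced by
# every language variant match on every shared test input.
from typing import Dict, List


def cross_language_consistency(
    # problem id -> language -> outputs on the shared test inputs
    outputs: Dict[str, Dict[str, List[str]]],
) -> float:
    """Fraction of problems where all language variants produced identical outputs."""
    consistent = 0
    for per_language in outputs.values():
        distinct = {tuple(outs) for outs in per_language.values()}
        consistent += int(len(distinct) == 1)
    return consistent / len(outputs) if outputs else 0.0


if __name__ == "__main__":
    fake_outputs = {
        "sum_list": {"python": ["6", "0"], "rust": ["6", "0"], "ocaml": ["6", "0"]},
        "parse_date": {"python": ["2024-01-02"], "rust": ["2024-01-02"], "ocaml": ["error"]},
    }
    print(f"cross-language consistency: {cross_language_consistency(fake_outputs):.2f}")
```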

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training strategies that treat programming languages more symmetrically with natural language could reduce the current resource imbalance.
  • Insights from this code-focused survey may transfer to improving multilingual capabilities in other structured output domains such as formal specifications.
  • Deployment testing on actual mixed-language repositories would be needed to confirm whether the reviewed methods scale beyond isolated benchmarks.

Load-bearing premise

The representative methods, benchmarks, and metrics chosen for the survey adequately capture the current state of the field and the core difficulties of maintaining semantics across programming languages.

What would settle it

A new evaluation on a benchmark that includes low-resource languages such as OCaml or Rust, in which an off-the-shelf model matches its own Python performance without any cross-language training or adaptation, would undermine the surveyed claim that focused multilingual research is required.
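
A hedged sketch of how such a settling evaluation could be tallied, assuming an execution-based multilingual benchmark that reports per-problem pass/fail: compute pass@1 per language for the off-the-shelf model and inspect the gap to Python. All result records below are hypothetical; a near-zero gap would be the undermining outcome described above, while a large gap supports the surveyed claim.

```python
# Hypothetical tally for the settling experiment: per-language pass@1 for an
# off-the-shelf model on a multilingual benchmark, plus the gap to the Python
# subset. All numbers are invented for illustration.
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_at_1(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (language, passed) per problem -> pass@1 per language."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)
    return {lang: passes[lang] / totals[lang] for lang in totals}


def gaps_to_reference(rates: Dict[str, float], reference: str = "python") -> Dict[str, float]:
    """Absolute pass@1 gap of each language relative to the reference language."""
    return {lang: rates[reference] - rate for lang, rate in rates.items() if lang != reference}


if __name__ == "__main__":
    fake_results = (
        [("python", True)] * 85 + [("python", False)] * 15
        + [("rust", True)] * 61 + [("rust", False)] * 39
        + [("ocaml", True)] * 44 + [("ocaml", False)] * 56
    )
    rates = pass_at_1(fake_results)
    print({k: round(v, 2) for k, v in rates.items()})
    print({k: round(v, 2) for k, v in gaps_to_reference(rates).items()})
```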

Figures

Figures reproduced from arXiv: 2604.25960 by Chao Jiang, Cheng Wen, Dugang Liu, Hua Zheng, Jawwad Ahmed Shamsi, Muhammad Sadiq, Shengchao Qin, Zhiwu Xu, Zhong Ming.

Figure 1: Statistical Data on the Collected Benchmarks
Original abstract

Large language models have transformed AI-assisted software engineering, but current research remains biased toward high-resource languages such as Python, with weaker performance in languages like Rust and OCaml. Since real-world systems are inherently polyglot, robust multilingual code intelligence is crucial. This survey focuses on two key tasks: multilingual code generation from shared natural-language requirements, and multilingual code translation that preserves semantics across languages. It reviews representative methods, benchmarks, and evaluation metrics, and highlights challenges and opportunities for trustworthy cross-language generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript is a survey on large language models for multilingual code intelligence. It notes the bias of current research toward high-resource languages such as Python and weaker performance on languages like Rust and OCaml. The survey centers on two tasks—multilingual code generation from shared natural-language requirements and semantics-preserving code translation across languages—while reviewing representative methods, benchmarks, and metrics and discussing challenges and opportunities for trustworthy cross-language generalization.

Significance. A well-executed survey in this area would be useful for directing research on polyglot code intelligence, given that real-world software systems are inherently multilingual. By synthesizing methods, benchmarks, and metrics for the two focal tasks, the paper could help identify gaps in cross-language semantic preservation and generalization.

minor comments (2)
  1. [Abstract] The claim that the survey reviews 'representative methods, benchmarks, and evaluation metrics' would be strengthened by an explicit statement of selection criteria, search strategy, or inclusion thresholds (e.g., publication venues, time window, or minimum citation count).
  2. [Abstract] The abstract frames the two tasks clearly but does not indicate the approximate number of papers or systems covered; adding this information would help readers gauge the survey's breadth without needing to consult the full reference list.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our survey and for recommending minor revision. We are encouraged that the work is viewed as a useful synthesis for directing research on polyglot code intelligence, and we will incorporate minor revisions in the updated manuscript.

Circularity Check

0 steps flagged

No significant circularity: survey reviews external literature without internal derivations

full rationale

This is a literature survey paper whose scope is to review representative methods, benchmarks, and metrics for multilingual code generation and translation tasks from existing work. No equations, fitted parameters, predictions, or first-principles derivations are present in the abstract or stated claims. The central content consists of citations to external papers, with no self-citation chains used to justify uniqueness theorems or ansatzes that reduce to the survey's own inputs. The selection of reviewed items is presented as representative rather than derived from any internal model, satisfying the condition for a self-contained survey with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the domain assumption that real-world software is polyglot and that current LLM performance is biased toward high-resource languages; no free parameters, new entities, or ad-hoc axioms are introduced beyond standard background knowledge in the field.

axioms (1)
  • domain assumption: Real-world systems are inherently polyglot
    Invoked in the abstract as the motivation for focusing on multilingual code intelligence.

pith-pipeline@v0.9.0 · 5397 in / 1125 out tokens · 39047 ms · 2026-05-08T02:41:29.087380+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 21 canonical work pages · 2 internal anchors

  1. [1]

    When large language models meet formal theorem proving: A survey

    Junjie Hu, Cheng Wen, Jialun Cao, Yikun Hu, Dugang Liu, Zhi Ma, Zhiwu Xu, and Shengchao Qin. When large language models meet formal theorem proving: A survey. In International Conference on Knowledge Science, Engineering and Management. Springer, 2026

  2. [2]

    A survey on static code analysis with large language models

    Cheng Wen, Yuandao Cai, Hua Zheng, Bin Yu, Dugang Liu, Zhiwu Xu, Kuanishbay Sadatdiynov, and Shengchao Qin. A survey on static code analysis with large language models. In International Conference on Knowledge Science, Engineering and Management. Springer, 2026

  3. [3]

    Big code models leaderboard

    BigCode Project. Big code models leaderboard, 2026. https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard

  4. [4]

    Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization

    Qiwei Peng, Yekun Chai, and Xuhong Li. Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization. arXiv preprint arXiv:2402.16694, 2024

  5. [5]

    Mceval: Massively multilingual code evaluation

    Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. Mceval: Massively multilingual code evaluation. arXiv preprint arXiv:2406.07436, 2024

  6. [6]

    Xlcost: A benchmark dataset for cross-lingual code intelligence

    Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K Reddy. Xlcost: A benchmark dataset for cross-lingual code intelligence. arXiv preprint arXiv:2206.08474, 2022

  7. [7]

    Repotransbench: A real-world benchmark for repository-level code translation

    Yanli Wang, Yanlin Wang, et al. Repotransbench: A real-world benchmark for repository-level code translation. arXiv preprint arXiv:2412.17744, 2024

  8. [8]

    Smartc2rust: Iterative, feedback-driven c-to-rust translation via large language models for safety and equivalence

    Momoko Shiraishi, Yinzhi Cao, and Takahiro Shinagawa. Smartc2rust: Iterative, feedback-driven c-to-rust translation via large language models for safety and equivalence. 2026

  9. [9]

    Evaluating large language models for code translation: Effects of prompt language and prompt design

    Aamer Aljagthami, Mohammed Banabila, Musab Alshehri, Mohammed Kabini, and Mohammad D Alahmadi. Evaluating large language models for code translation: Effects of prompt language and prompt design. arXiv preprint arXiv:2509.12973, 2025

  10. [10]

    Isolating language-coding from problem-solving: Benchmarking llms with pseudo-eval

    Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, and Shing-Chi Cheung. Isolating language-coding from problem-solving: Benchmarking llms with pseudo-eval. arXiv preprint arXiv:2502.19149, 2025

  11. [11]

    mhumaneval - a multilingual benchmark to evaluate large language models for code generation

    Md Nishat Raihan, Antonios Anastasopoulos, and Marcos Zampieri. mhumaneval - a multilingual benchmark to evaluate large language models for code generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11432–11461, 2025

  12. [12]

    Nl in the middle: Code translation with llms and intermediate representations

    Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, and Alexander Wong. Nl in the middle: Code translation with llms and intermediate representations. arXiv preprint arXiv:2507.08627, 2025

  13. [13]

    Llm-assisted translation of legacy fortran codes to c++: A cross-platform study

    Nishath Rajiv Ranasinghe, Shawn M Jones, Michal Kucer, Ayan Biswas, Daniel O’Malley, Alexander Most, Selma Liliane Wanna, and Ajay Sreekumar. Llm-assisted translation of legacy fortran codes to c++: A cross-platform study. In Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities, pages 58–69, 2025

  14. [14]

    Intertrans: Leveraging transitive intermediate translations to enhance llm-based code translation

    Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe R Cogo, and Bram Adams. Intertrans: Leveraging transitive intermediate translations to enhance llm-based code translation. arXiv preprint arXiv:2411.01063, 2024

  15. [15]

    A comparative study of code generation using chatgpt 3.5 across 10 programming languages

    Alessio Buscemi. A comparative study of code generation using chatgpt 3.5 across 10 programming languages. arXiv preprint arXiv:2308.04477, 2023

  16. [16]

    Enhancing code generation for low-resource languages: No silver bullet

    Alessandro Giagnorio, Alberto Martin-Lopez, and Gabriele Bavota. Enhancing code generation for low-resource languages: No silver bullet. arXiv preprint arXiv:2501.19085, 2025

  17. [17]

    Ircoder: Intermediate representations make language models robust multilingual code generators

    Indraneil Paul, Goran Glavaš, and Iryna Gurevych. Ircoder: Intermediate representations make language models robust multilingual code generators. arXiv preprint arXiv:2403.03894, 2024

  18. [18]

    Magicoder: Empowering code generation with oss-instruct

    Yuxiang Wei, Zhe Wang, Jiawei Liu, et al. Magicoder: Empowering code generation with oss-instruct. arXiv preprint arXiv:2312.02120, 2023

  19. [19]

    Octopack: Instruction tuning code large language models

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following, 2023

  20. [20]

    Beyond language barriers: Multi-agent coordination for multi-language code generation

    Micheline Bénédicte Moumoula, Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Kader Kabore, Jacques Klein, and Tegawendé F Bissyande. Beyond language barriers: Multi-agent coordination for multi-language code generation. arXiv preprint arXiv:2509.19918, 2025

  21. [21]

    Unipar: A unified llm-based framework for parallel and accelerated code translation in hpc

    Tomer Bitan, Tal Kadosh, Erel Kaplan, Shira Meiri, Le Chen, Peter Morales, Niranjan Hasabnis, and Gal Oren. Unipar: A unified llm-based framework for parallel and accelerated code translation in hpc. In 2025 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2025

  22. [22]

    Matchfixagent: Language-agnostic autonomous repository-level code translation validation and repair

    Ali Reza Ibrahimzada, Brandon Paulsen, Reyhaneh Jabbarvand, Joey Dodds, and Daniel Kroening. Matchfixagent: Language-agnostic autonomous repository-level code translation validation and repair. arXiv preprint arXiv:2509.16187, 2025

  23. [23]

    Evoc2rust: A skeleton-guided framework for project-level c-to-rust translation

    Chaofan Wang, Tingrui Yu, Chen Xie, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. Evoc2rust: A skeleton-guided framework for project-level c-to-rust translation. arXiv preprint arXiv:2508.04295, 2025

  24. [24]

    Retrieval-augmented code generation: A survey with focus on repository-level approaches

    Yicheng Tao, Yao Qin, and Yepang Liu. Retrieval-augmented code generation: A survey with focus on repository-level approaches. arXiv preprint arXiv:2510.04905, 2025

  25. [25]

    Arcs: Agentic retrieval-augmented code synthesis with iterative refinement

    Manish Bhattarai, Miguel Cordova, Minh Vu, Javier Santos, Ismael Boureima, and Dan O’Malley. Arcs: Agentic retrieval-augmented code synthesis with iterative refinement. arXiv preprint arXiv:2504.20434, 2025

  26. [26]

    Enhancing code translation in language models with few-shot learning via retrieval-augmented generation

    Manish Bhattarai, Javier E Santos, Shawn Jones, Ayan Biswas, Boian Alexandrov, and Daniel O’Malley. Enhancing code translation in language models with few-shot learning via retrieval-augmented generation. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE, 2024

  27. [27]

    Integrating ensemble learning and large language models for efficient formal verification of ip-based aerospace systems

    Zhi Ma, Cheng Wen, Bin Yu, and Jie Su. Integrating ensemble learning and large language models for efficient formal verification of ip-based aerospace systems. Information Fusion, 125:103466, 2026

  28. [28]

    Automated ltl specification generation from industrial aerospace requirements

    Zhi Ma, Xiao Liang, Cheng Wen, Rui Chen, Bin Gu, Shengchao Qin, Cong Tian, and Mengfei Yang. Automated ltl specification generation from industrial aerospace requirements. In Proceedings of the 27th International Symposium on Formal Methods (FM), 2026

  29. [29]

    Bridging natural language and formal specification - automated translation of software requirements to ltl via hierarchical semantics decomposition using llms

    Zhi Ma, Cheng Wen, Zhexin Su, Xiao Liang, Cong Tian, Shengchao Qin, and Mengfei Yang. Bridging natural language and formal specification - automated translation of software requirements to ltl via hierarchical semantics decomposition using llms. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering, 2025

  30. [30]

    Xcodeeval: An execution-based large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval

    Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. Xcodeeval: An execution-based large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...

  31. [31]

    Codetransocean: A comprehensive multilingual benchmark for code translation

    Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. Codetransocean: A comprehensive multilingual benchmark for code translation. arXiv preprint arXiv:2310.04951, 2023

  32. [32]

    On the evaluation of neural code translation: Taxonomy and benchmark

    Mingsheng Jiao, Tingrui Yu, Xuan Li, Guanjie Qiu, Xiaodong Gu, and Beijun Shen. On the evaluation of neural code translation: Taxonomy and benchmark. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1529–1541. IEEE, 2023

  33. [33]

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021

  34. [34]

    Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks

    Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655, 2021

  35. [35]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021