Multi-lingual Evaluation of Code Generation Models

Arash Farahani; Baishakhi Ray; Ben Athiwaratkun; Bing Xiang; Dan Roth; Haifeng Qian; Hantian Ding; Ming Tan; Mingyue Shang; Murali Krishna Ramanathan

arxiv: 2210.14868 · v3 · pith:XZHAFFC6new · submitted 2022-10-26 · 💻 cs.LG · cs.CL

Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang

keywords code, models, generation, languages, benchmarks, datasets, language, multi-lingual

We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities even in mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.
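
To make the idea of transpiling prompts and test cases more concrete, here is a minimal sketch that converts a Python-style function signature and one test case into JavaScript source text. It is an illustration only, not the authors' conversion framework: the helper names (`to_js_literal`, `convert_prompt`, `convert_test_case`) and the choice of JavaScript as the target are assumptions made for this example, and it handles only JSON-compatible values.

```python
import json


def to_js_literal(value):
    """Render a JSON-compatible Python value as a JavaScript literal.

    Hypothetical helper: json.dumps maps True/False/None to
    true/false/null and lists/dicts to array/object literals.
    """
    return json.dumps(value)


def convert_prompt(func_name, params, docstring):
    """Build a JavaScript function stub whose doc comment carries the docstring."""
    doc_lines = "\n".join(f" * {line}" for line in docstring.strip().splitlines())
    param_src = ", ".join(params)
    return f"/**\n{doc_lines}\n */\nfunction {func_name}({param_src}) {{\n"


def convert_test_case(func_name, args, expected):
    """Turn `assert func(*args) == expected` into a JavaScript assertion.

    Comparing JSON.stringify output gives structural equality for
    arrays and objects, which `===` alone would not.
    """
    arg_src = ", ".join(to_js_literal(a) for a in args)
    return (
        f"console.assert(JSON.stringify({func_name}({arg_src})) === "
        f"JSON.stringify({to_js_literal(expected)}));"
    )


if __name__ == "__main__":
    print(convert_prompt("add", ["a", "b"], "Add two numbers."))
    print(convert_test_case("add", [1, 2], 3))
```

A real converter would also need to handle naming conventions, static types for languages such as Java, and values that have no JSON form; this sketch leaves all of that out.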

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs
cs.AI 2026-06 unverdicted novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
cs.SE 2025-04 accept novelty 7.0

OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
cs.SE 2025-04 unverdicted novelty 7.0

Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.

SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code
cs.SE 2026-06 unverdicted novelty 6.0

New benchmark for LLM performance on incremental AWS CDK infrastructure code edits shows top models succeed in 34% of cases.

CodegenBench: Can LLMs Write Efficient Code Across Architectures?
cs.SE 2026-06 unverdicted novelty 6.0

CodegenBench shows LLMs generate optimized code well for x86_64 but exhibit significant performance degradation on Sunway and Kunpeng due to limited documentation and training data.

Evaluating LLM-Generated Code: A Benchmark and Developer Study
cs.SE 2026-05 unverdicted novelty 6.0

Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.

MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
cs.SE 2026-04 unverdicted novelty 6.0

MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
cs.SE 2026-04 unverdicted novelty 6.0

RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...

A Taxonomy of Programming Languages for Code Generation
cs.CL 2026-03 accept novelty 6.0

The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
cs.CL 2025-08 unverdicted novelty 6.0

Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.

Mutation-Guided Unit Test Generation with a Large Language Model
cs.SE 2025-06 conditional novelty 6.0

MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.

A Study of LLMs' Preferences for Libraries and Programming Languages
cs.SE 2025-03 unverdicted novelty 6.0

Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Evaluating LLM-Generated Code: A Benchmark and Developer Study
cs.SE 2026-05 unverdicted novelty 5.0

A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
cs.SE 2026-04 conditional novelty 5.0

RealBench is a repo-level code generation benchmark pairing UML diagrams with natural language requirements, revealing that LLMs perform significantly worse on realistic repo-level tasks than existing benchmarks suggest.

StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
cs.AI 2026-06 unverdicted novelty 4.0

Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.

LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
cs.SE 2026-02 unverdicted novelty 3.0

A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...

A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...