arxiv: 2210.14868 · v3 · pith:XZHAFFC6new · submitted 2022-10-26 · 💻 cs.LG · cs.CL

Multi-lingual Evaluation of Code Generation Models

classification 💻 cs.LG cs.CL
keywords code, models, generation, languages, benchmarks, datasets, language, multi-lingual
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and find that language models generalize to out-of-domain languages, that multi-lingual models have advantages over mono-lingual ones, that few-shot prompting can teach a model new languages, and that models show zero-shot translation abilities even in mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.
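
To make the idea of transpiling test cases concrete, here is a minimal sketch of how one Python assertion might be rewritten as a JavaScript check. It is not the authors' conversion framework: the function name `python_assert_to_js`, the restriction to `assert f(args) == expected` with literal arguments, and the choice of `JSON.stringify` for deep comparison are all assumptions made for illustration.

```python
import ast
import json


def python_assert_to_js(assert_src: str) -> str:
    """Translate `assert f(args) == expected` into a JavaScript check.

    Hypothetical helper: only handles a call to a plain function name
    with literal arguments, compared for equality to a literal value.
    """
    node = ast.parse(assert_src.strip()).body[0]
    if not isinstance(node, ast.Assert):
        raise ValueError("expected an assert statement")

    test = node.test
    supported = (
        isinstance(test, ast.Compare)
        and len(test.ops) == 1
        and isinstance(test.ops[0], ast.Eq)
        and isinstance(test.left, ast.Call)
        and isinstance(test.left.func, ast.Name)
    )
    if not supported:
        raise ValueError("only `assert f(args) == expected` is supported")

    func_name = test.left.func.id
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    args = [ast.literal_eval(arg) for arg in test.left.args]
    expected = ast.literal_eval(test.comparators[0])

    # JSON is a subset of JavaScript literal syntax, so json.dumps renders
    # numbers, strings, booleans, None, lists, and string-keyed dicts.
    js_args = ", ".join(json.dumps(arg) for arg in args)
    js_expected = json.dumps(expected)

    # Compare serialized forms so that arrays and objects are checked by value.
    return (
        f"console.assert(JSON.stringify({func_name}({js_args})) "
        f"=== JSON.stringify({js_expected}));"
    )


if __name__ == "__main__":
    print(python_assert_to_js("assert add(1, 2) == 3"))
```

A real converter would also have to map function signatures, naming conventions, and types into each target language, which this sketch leaves out.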

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

    cs.AI 2026-06 unverdicted novelty 7.0

    Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

  2. Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

    cs.SE 2026-04 conditional novelty 7.0

    Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

  3. OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

    cs.SE 2025-04 accept novelty 7.0

    OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.

  4. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

    cs.SE 2025-04 unverdicted novelty 7.0

    Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.

  5. SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

    cs.SE 2026-06 unverdicted novelty 6.0

    New benchmark for LLM performance on incremental AWS CDK infrastructure code edits shows top models succeed in 34% of cases.

  6. CodegenBench: Can LLMs Write Efficient Code Across Architectures?

    cs.SE 2026-06 unverdicted novelty 6.0

    CodegenBench shows LLMs generate optimized code well for x86_64 but exhibit significant performance degradation on Sunway and Kunpeng due to limited documentation and training data.

  7. Evaluating LLM-Generated Code: A Benchmark and Developer Study

    cs.SE 2026-05 unverdicted novelty 6.0

    Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.

  8. MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.

  9. RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

    cs.SE 2026-04 unverdicted novelty 6.0

    RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...

  10. A Taxonomy of Programming Languages for Code Generation

    cs.CL 2026-03 accept novelty 6.0

    The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.

  11. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

    cs.CL 2025-08 unverdicted novelty 6.0

    Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.

  12. Mutation-Guided Unit Test Generation with a Large Language Model

    cs.SE 2025-06 conditional novelty 6.0

    MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.

  13. A Study of LLMs' Preferences for Libraries and Programming Languages

    cs.SE 2025-03 unverdicted novelty 6.0

    Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.

  14. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  15. Evaluating LLM-Generated Code: A Benchmark and Developer Study

    cs.SE 2026-05 unverdicted novelty 5.0

    A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.

  16. RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

    cs.SE 2026-04 conditional novelty 5.0

    RealBench is a repo-level code generation benchmark pairing UML diagrams with natural language requirements, revealing that LLMs perform significantly worse on realistic repo-level tasks than existing benchmarks suggest.

  17. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  18. Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

    cs.AI 2026-06 unverdicted novelty 4.0

    Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.

  19. LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review

    cs.SE 2026-02 unverdicted novelty 3.0

    A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...

  20. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...