pith. sign in

arxiv: 2606.20517 · v1 · pith:RJ4LX4Q3new · submitted 2026-06-18 · 💻 cs.AI · cs.PL

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Pith reviewed 2026-06-26 17:36 UTC · model grok-4.3

classification 💻 cs.AI cs.PL
keywords multi-language benchmarkcode generationLLM evaluationLiveCodeBenchcompetitive programmingcross-language generalizationcontamination controlPython overfitting
0
0 comments X

The pith

Multi-LCB converts LiveCodeBench problems to twelve languages to evaluate LLM code generation beyond Python.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates Multi-LCB by translating LiveCodeBench's Python competitive programming tasks into equivalent problems in eleven other languages while retaining the original contamination controls based on release dates. The benchmark stays fully compatible with LiveCodeBench so that future updates carry over automatically. A sympathetic reader cares because real software engineering requires competence in many languages and Python-only tests may hide gaps in model ability. Evaluation of twenty-four LLMs on the new benchmark uncovers Python overfitting, language-specific contamination, and wide performance differences across languages.

Core claim

Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. Evaluations of twenty-four LLMs uncover evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance.

What carries the argument

Translation of Python competitive programming tasks into equivalent tasks in other languages that preserves contamination controls and the original evaluation protocol.

If this is right

  • LLMs display Python overfitting in code generation performance.
  • Language-specific contamination affects evaluation results for different languages.
  • Performance exhibits substantial disparities across the twelve programming languages.
  • Multi-LCB remains compatible with ongoing LiveCodeBench updates for continued tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for LLMs may need more balanced exposure to non-Python languages to reduce the observed gaps.
  • Future benchmarks could benefit from including problems written natively in each target language rather than relying solely on translations.
  • The size of the performance gaps may indicate limits in how current model architectures share representations across languages.

Load-bearing premise

Translating Python competitive programming tasks into other languages preserves the original difficulty, contamination controls, and evaluation validity without introducing language-specific artifacts or changes in problem semantics.

What would settle it

A side-by-side comparison in which human programmers solve both the original Python problems and their translations in another language at materially different rates would show that equivalence does not hold.

Figures

Figures reproduced from arXiv: 2606.20517 by Adamenko Pavel, Alexey Kutalev, Dmitrii Babaev, Ivan Lopatin, Ivan Petrov, Maria Ivanova, Pavel Zadorozhny, Rodion Levichev.

Figure 1
Figure 1. Figure 1: Multi-LCB overview. Top: LCB natural-language problem descriptions are wrapped into prompts specifying the target programming language and passed to the LLM for STDIN/STDOUT code generation. AtCoder and Codeforces problem tests are passed directly to the execution stage. Bottom: LeetCode problem tests are transformed through a dedicated test converter to produce equivalent STDIN/STDOUT inputs. The generate… view at source ↗
Figure 2
Figure 2. Figure 2: Top-10 models by Pass@1 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Scatter of Python vs. Average Pass@1 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pass@1 distribution across 12 languages [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Monthly Pass@1 trends averaged across all [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Multi-LCB web interface showing tasks released between January 2024 to Decem [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Monthly distribution of Tasks by Difficulty. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Monthly distribution of LCB tasks by platform. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Monthly task distribution by I/O data dimensionality (LeetCode Functional format). [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Code generation performance heatmap by platform for Python, C++, Java, and C#. [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Code generation performance heatmap by platform for Ruby, PHP, Kotlin, and [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Code generation performance heatmap by platform for TypeScript, Go, Rust, and Scala. [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Code generation performance heatmap by difficulty level for Python, C++, Java, and C#. [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Code generation performance heatmap by difficulty level for Ruby, PHP, Kotlin, and [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Code generation performance heatmap by difficulty level for TypeScript, Go, Rust, and [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Monthly Pass@1 trends for Python [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗
Figure 18
Figure 18. Figure 18: Monthly Pass@1 trends for C# [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗
Figure 21
Figure 21. Figure 21: Monthly Pass@1 trends for [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Monthly Pass@1 trends for Kotlin [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗
Figure 25
Figure 25. Figure 25: Monthly Pass@1 trends for Rust [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗
Figure 27
Figure 27. Figure 27: Monthly Pass@1 trends for Type [PITH_FULL_IMAGE:figures/full_fig_p031_27.png] view at source ↗
Figure 29
Figure 29. Figure 29: DeepSeek-R1-0528* [PITH_FULL_IMAGE:figures/full_fig_p032_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Devstral-Small-2505 [PITH_FULL_IMAGE:figures/full_fig_p033_30.png] view at source ↗
Figure 33
Figure 33. Figure 33: OpenCodeReasoning-Nemotron [PITH_FULL_IMAGE:figures/full_fig_p033_33.png] view at source ↗
Figure 35
Figure 35. Figure 35: Qwen2.5-Coder-14B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p033_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Qwen2.5-Coder-32B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p034_36.png] view at source ↗
Figure 38
Figure 38. Figure 38: Qwen3-14B* [PITH_FULL_IMAGE:figures/full_fig_p034_38.png] view at source ↗
Figure 41
Figure 41. Figure 41: Qwen3-235B-A22B-Thinking [PITH_FULL_IMAGE:figures/full_fig_p034_41.png] view at source ↗
Figure 42
Figure 42. Figure 42: Qwen3-Coder-30B-A3B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p035_42.png] view at source ↗
read the original abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Multi-LCB, an extension of LiveCodeBench to twelve programming languages obtained by translating Python competitive-programming tasks while claiming to preserve LCB's contamination controls and evaluation protocol. It reports an evaluation of 24 LLMs that uncovers Python overfitting, language-specific contamination, and cross-language performance disparities, positioning Multi-LCB as a rigorous multilingual benchmark that will automatically track future LCB updates.

Significance. If the translation procedure can be shown to maintain problem difficulty, semantics, and contamination status, the resulting benchmark would supply a needed contamination-aware instrument for measuring cross-language code generation and would enable systematic tracking of multilingual competence beyond Python-only evaluation.

major comments (3)
  1. [Abstract] Abstract and translation-procedure section: the central claim that translated tasks remain 'equivalent' and preserve 'evaluation validity' rests on an untested invariance assumption; no quantitative checks (human solve-rate parity, test-case coverage equivalence, or matched-pair performance correlation) are supplied to confirm that language-specific idioms and library differences do not alter difficulty or semantics.
  2. [Evaluation] Evaluation section (24-model results): reported evidence of Python overfitting and language disparities cannot be assessed for robustness because the manuscript supplies no controls or ablation that isolate translation-induced artifacts from genuine model limitations.
  3. [Benchmark construction] Benchmark-construction description: the statement that Multi-LCB 'automatically track[s] future LCB updates' presupposes that the translation pipeline can be reapplied without reintroducing contamination or difficulty drift; no protocol or validation for this ongoing process is described.
minor comments (2)
  1. [Abstract] The abstract states 'twelve programming languages, including Python' but does not enumerate the languages; a table or explicit list would improve clarity.
  2. [Introduction] Consider adding a short related-work paragraph contrasting Multi-LCB with existing multilingual code benchmarks to situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the manuscript requires strengthening and outlining specific revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract and translation-procedure section: the central claim that translated tasks remain 'equivalent' and preserve 'evaluation validity' rests on an untested invariance assumption; no quantitative checks (human solve-rate parity, test-case coverage equivalence, or matched-pair performance correlation) are supplied to confirm that language-specific idioms and library differences do not alter difficulty or semantics.

    Authors: We agree that the current description relies on an untested assumption of equivalence. The translations were performed manually by experienced programmers following semantic-preserving guidelines, but no quantitative validation (such as human solve-rate parity or performance correlation) was reported. In revision we will add a dedicated subsection on the translation procedure that includes: (1) explicit handling rules for language-specific idioms and libraries, (2) a small-scale human study comparing perceived difficulty on a random subset of 50 problems across languages, and (3) test-case coverage statistics before and after translation. These additions will directly test the invariance claim. revision: yes

  2. Referee: [Evaluation] Evaluation section (24-model results): reported evidence of Python overfitting and language disparities cannot be assessed for robustness because the manuscript supplies no controls or ablation that isolate translation-induced artifacts from genuine model limitations.

    Authors: The referee correctly notes the absence of controls for translation artifacts. We will add two analyses in the revised evaluation section: (1) an ablation translating a subset of Python problems back into Python and measuring performance drop relative to the original, and (2) correlation of per-problem scores across language pairs for models with documented strong multilingual training. These controls will help separate translation effects from intrinsic model limitations and increase confidence in the reported overfitting and disparity findings. revision: yes

  3. Referee: [Benchmark construction] Benchmark-construction description: the statement that Multi-LCB 'automatically track[s] future LCB updates' presupposes that the translation pipeline can be reapplied without reintroducing contamination or difficulty drift; no protocol or validation for this ongoing process is described.

    Authors: We acknowledge that the claim of automatic tracking requires an explicit, repeatable protocol. The revised manuscript will include a new subsection that formalizes the pipeline: (a) reuse of LCB's release-date filtering to maintain contamination controls, (b) mandatory equivalence checks (test-case coverage and semantic review) on every new batch, and (c) a schedule for periodic human validation to detect difficulty drift. This protocol will be presented as part of the benchmark release so that future updates remain rigorous. revision: yes

Circularity Check

0 steps flagged

No circularity detected; benchmark construction is self-contained

full rationale

The manuscript describes the creation of Multi-LCB via translation of existing LCB Python problems into twelve languages while asserting preservation of contamination controls and evaluation protocol. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on the independent translation process rather than reducing to self-citations, self-definitions, or renamed inputs. The central premise does not invoke load-bearing prior results by the same authors that would create a circular chain; external LCB is treated as given input without the paper's own results feeding back into its validity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or required for the high-level benchmark extension claim.

pith-pipeline@v0.9.1-grok · 5784 in / 1020 out tokens · 27388 ms · 2026-06-26T17:36:58.271893+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 20 canonical work pages · 12 internal anchors

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    Evaluating Large Language Models Trained on Code

    Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

  5. [5]

    arXiv preprint arXiv:2401.08500 , year=

    Code generation with alphacodium: From prompt engineering to flow engineering , author=. arXiv preprint arXiv:2401.08500 , year=

  6. [6]

    StarCoder 2 and The Stack v2: The Next Generation

    Starcoder 2 and the stack v2: The next generation , author=. arXiv preprint arXiv:2402.19173 , year=

  7. [7]

    Code Llama: Open Foundation Models for Code

    Code llama: Open foundation models for code , author=. arXiv preprint arXiv:2308.12950 , year=

  8. [8]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    Codegen: An open large language model for code with multi-turn program synthesis , author=. arXiv preprint arXiv:2203.13474 , year=

  9. [9]

    Science , volume=

    Competition-level code generation with alphacode , author=. Science , volume=. 2022 , publisher=

  10. [10]

    Program Synthesis with Large Language Models

    Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

  11. [11]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

  12. [12]

    Gemini 2.5: Our most intelligent AI model , year =

  13. [13]

    DeepSeek-R1-0528 Release , year =

  14. [14]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. arXiv preprint arXiv:2403.07974 , year=

  15. [15]

    arXiv preprint arXiv:2506.11928 , year=

    LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? , author=. arXiv preprint arXiv:2506.11928 , year=

  16. [16]

    Proceedings of the 29th symposium on operating systems principles , pages=

    Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th symposium on operating systems principles , pages=

  17. [17]

    Advances in neural information processing systems , volume=

    Sglang: Efficient execution of structured language model programs , author=. Advances in neural information processing systems , volume=

  18. [18]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

  19. [19]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  20. [20]

    DeepSeek-V3 Technical Report

    Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

  21. [21]

    arXiv preprint arXiv:2404.00599 , year=

    Evocodebench: An evolving code generation benchmark aligned with real-world code repositories , author=. arXiv preprint arXiv:2404.00599 , year=

  22. [22]

    arXiv preprint arXiv:2508.04865 , year=

    Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment , author=. arXiv preprint arXiv:2508.04865 , year=

  23. [23]

    arXiv preprint arXiv:2303.03004 , year=

    xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval , author=. arXiv preprint arXiv:2303.03004 , year=

  24. [24]

    arXiv preprint arXiv:2406.07436 , year=

    Mceval: Massively multilingual code evaluation , author=. arXiv preprint arXiv:2406.07436 , year=

  25. [25]

    IEEE Transactions on Software Engineering , volume=

    Multipl-e: A scalable and polyglot approach to benchmarking neural code generation , author=. IEEE Transactions on Software Engineering , volume=. 2023 , publisher=

  26. [26]

    arXiv preprint arXiv:2402.16694 , year=

    Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization , author=. arXiv preprint arXiv:2402.16694 , year=

  27. [27]

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    Codexglue: A machine learning benchmark dataset for code understanding and generation , author=. arXiv preprint arXiv:2102.04664 , year=

  28. [28]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions , author=. arXiv preprint arXiv:2406.15877 , year=

  29. [29]

    arXiv preprint arXiv:2210.14868 , year=

    Multi-lingual evaluation of code generation models , author=. arXiv preprint arXiv:2210.14868 , year=

  30. [30]

    Information Discovery and Delivery , volume=

    Introducing mathqa: a math-aware question answering system , author=. Information Discovery and Delivery , volume=. 2018 , publisher=