arxiv: 2604.12911 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Pith reviewed 2026-05-10 16:26 UTC · model grok-4.3

keywords: multilingual evaluation · round-trip translation · language proficiency · AI benchmarks · frontier models · Lost in Translation · LMArena

The pith

Frontier multilingual benchmarks test mathematical reasoning and factual recall instead of language proficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multilingual benchmarks for frontier AI models follow the same structure as reasoning and knowledge tests but apply them across languages. This causes them to measure mathematical reasoning and factual recall far more than actual multilingual proficiency. Thinking variants of models outperform instruct variants on these benchmarks yet often underperform on real-world multilingual tasks such as LMArena user ratings. The paper shows that round-trip translation, by exposing semantic gaps after sending text to another language and back, provides a stronger signal of multilingual generation capability. This method correlates at 0.94 with user ratings, needs no human reference translations, and requires no judge stronger than the models being tested.

Core claim

Multilingual benchmarks measure mathematical reasoning and factual recall rather than multilingual proficiency, since thinking variants dramatically outperform instruct variants on them but perform worse on real tasks; round-trip translation reveals failures in multilingual generation by measuring semantic preservation after translation to a target language and back, correlating at ρ = 0.94 with LMArena ratings without human references or a stronger judge, and forms the basis of the new Lost in Translation benchmark.

What carries the argument

Round-trip translation, which sends source text to a target language and back while checking for semantic differences that indicate multilingual generation failures.
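
The mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `round_trip_score`, `token_jaccard`, and `toy_translate` are hypothetical names, the translator is a stand-in for a call to the model under test, and the token-overlap similarity is a placeholder for whatever semantic metric the paper actually uses (this page does not specify it).

```python
from typing import Callable, Sequence

# (text, source_lang, target_lang) -> translated text; hypothetical interface
Translate = Callable[[str, str, str], str]
# (original, reconstruction) -> similarity in [0, 1]; hypothetical interface
Similarity = Callable[[str, str], float]


def round_trip_score(texts: Sequence[str], src: str, tgt: str,
                     translate: Translate, similarity: Similarity) -> float:
    """Mean similarity between each source text and its round-trip reconstruction."""
    if not texts:
        raise ValueError("need at least one source text")
    total = 0.0
    for text in texts:
        forward = translate(text, src, tgt)   # source -> target
        back = translate(forward, tgt, src)   # target -> source
        total += similarity(text, back)
    return total / len(texts)


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator: loses the word 'quickly' when writing into 'xx'."""
    if tgt == "xx":
        return " ".join(w for w in text.split() if w != "quickly")
    return text


score = round_trip_score(["the cat ran quickly"], "en", "xx",
                         toy_translate, token_jaccard)
print(score)  # 0.75: three of the four original tokens survive the round trip
```

Passing the translator and the similarity function as arguments keeps the two sources of error separable, which is the distinction the load-bearing premise below depends on.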

If this is right

  • Current benchmarks reward models that excel at reasoning across languages more than models that excel at fluent generation.
  • Round-trip translation enables evaluation of multilingual capability without human reference translations.
  • No stronger multilingual judge is needed to assess the tested models.
  • The Lost in Translation benchmark spans widely spoken languages and offers a more realistic test of frontier model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives might shift toward improving cross-language generation fidelity once round-trip translation becomes a standard metric.
  • Evaluations could separate reasoning skill from language skill to avoid over-optimizing one at the expense of the other.
  • Low-resource languages could be assessed more fairly without depending on scarce high-quality human translations.

Load-bearing premise

Semantic preservation after round-trip translation primarily reflects the tested model's multilingual generation proficiency rather than the quality of the translation systems or other factors.

What would settle it

A collection of models that achieve high round-trip translation scores but low LMArena user ratings, or low round-trip scores but high user ratings, would disprove the claimed correlation.
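
A check of this kind could look like the sketch below, which computes a Spearman rank correlation and lists the models whose two ranks disagree most. All numbers are made up for illustration, and the helper names (`average_ranks`, `spearman`, `discordant_models`) and the rank-gap threshold are assumptions, not anything taken from the paper.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def discordant_models(names, x, y, max_rank_gap):
    """Models whose rank under x differs from their rank under y by more than the gap."""
    rx, ry = average_ranks(x), average_ranks(y)
    return [n for n, a, b in zip(names, rx, ry) if abs(a - b) > max_rank_gap]


# Hypothetical data: round-trip scores and arena ratings for four models.
names = ["A", "B", "C", "D"]
round_trip = [0.9, 0.8, 0.7, 0.6]
arena = [1300, 1250, 1200, 1260]

rho = spearman(round_trip, arena)
outliers = discordant_models(names, round_trip, arena, max_rank_gap=1)
print(round(rho, 3), outliers)  # 0.4 ['D']
```

In this toy data, model D has the lowest round-trip score but the second-highest rating, which is the kind of case that would count against the claimed correlation if it appeared among real models.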

Figures

Figures reproduced from arXiv: 2604.12911 by Ameya Prabhu, Matthias Bethge, Ronald Skorobogat.

Figure 1: Multilingual benchmarks correlate poorly with human preferences. We show benchmark scores against LMArena Elo ratings for six frontier open-source models, each in Thinking and Non-Thinking variants. (a) MT-AIME24 (Son et al., 2025) shows near-zero correlation (ρ = −0.09), with thinking variants dramatically outperforming on the benchmark but perform no better on LMArena; (b) INCLUDE (Romanou et al., 2025) …
Figure 2: Multilingual benchmarks track English reasoning performance, and models reason in English regardless of input language. (a) MT-AIME24 scores correlate strongly with English AIME25 performance (ρ = 0.94, one-sided permutation test p=0.008), indicating the benchmark primarily measures mathematical reasoning ability. (b) INCLUDE scores correlate strongly with English MMLU-Pro (ρ = 0.83, one-sided permutation…
Figure 3: Errors on multilingual benchmarks stem from reasoning and knowledge gaps, not …
Figure 4: Qwen-3 and GPT-OSS models default to English reasoning on MT-AIME24. We …
Figure 5: GLM-4.7 and MiMo-V2-Flash show contrasting reasoning language patterns on …
Figure 6: Qwen-3 and GPT-OSS models reason almost entirely in English on INCLUDE …
Figure 7: GLM-4.7 reasons in English while MiMo-V2-Flash shows mixed reasoning patterns …
Figure 8: Error analysis across four additional models confirms MT-AIME24 errors are logi…
Figure 9: Error analysis on INCLUDE confirms errors are factual and knowledge-based, …
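
The Figure 2 caption reports a one-sided permutation test on a rank correlation. The sketch below shows one way such a p-value can be computed by exact enumeration. It is an illustration on made-up data: whether the paper enumerated all permutations or sampled them is not stated on this page, and the helper names are hypothetical.

```python
from itertools import permutations


def ranks(values):
    """Ranks starting at 1; this sketch assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_no_ties(x, y):
    """Spearman's rho via the sum of squared rank differences (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_one_sided_p(x, y):
    """Share of all orderings of y whose rho is at least as large as the observed one."""
    observed = spearman_no_ties(x, y)
    hits = total = 0
    for perm in permutations(y):
        total += 1
        if spearman_no_ties(x, perm) >= observed - 1e-12:
            hits += 1
    return observed, hits / total


# Hypothetical scores for six models: the two orderings differ by one adjacent swap.
benchmark = [10, 20, 30, 40, 50, 60]
english = [1, 2, 4, 3, 5, 6]

rho, p = exact_one_sided_p(benchmark, english)
print(round(rho, 3), round(p, 4))  # 0.943 0.0083
```

With six points, only the identical ordering and the five adjacent swaps reach a correlation this high, so the exact p-value is 6/720. Enumeration is feasible only for small samples; larger ones would need random sampling of permutations.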

Original abstract

Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly (\r{ho} = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that standard multilingual benchmarks for frontier models primarily test mathematical reasoning and factual recall rather than multilingual proficiency, as shown by thinking variants outperforming instruct variants on benchmarks yet underperforming on real-world tasks like LMArena. It proposes round-trip translation (source to target and back, with semantic similarity) as a better alternative, reporting a near-perfect correlation (ρ = 0.94) with LMArena user ratings; the method requires no human references and no stronger multilingual judge. The authors introduce the Lost in Translation (LiT) benchmark spanning widely spoken languages.

Significance. If the round-trip method can be shown to isolate multilingual generation failures beyond general model capabilities, this could meaningfully shift evaluation practices toward scalable, reference-free proxies that better align with user preferences. Strengths include the high reported correlation, the absence of human references, and the concrete LiT benchmark resource; these are genuine contributions if the central proxy assumption holds after appropriate controls.

major comments (2)
  1. [Abstract and results section on LMArena correlation] The core claim that round-trip translation specifically measures multilingual generation proficiency (rather than general capability or translation-model quality) is load-bearing but not yet established. The reported ρ = 0.94 correlation with LMArena could be driven by shared dependence on model scale, training compute, or English-centric performance; without controls such as partial correlations or regressions that residualize out English benchmark scores or parameter count, the distinction from math/reasoning benchmarks remains unproven. (This directly engages the skeptic concern and the weakest assumption noted in the review.)
  2. [§4 and §5] §4 (LiT benchmark construction) and §5 (experimental results): the manuscript provides insufficient detail on the translation models used for the round-trip process, the exact semantic-similarity metric, language/text selection criteria, and any statistical controls for translation quality. These omissions make it impossible to verify that semantic gaps primarily reflect the tested models' multilingual generation rather than the auxiliary translation step itself.
minor comments (3)
  1. [Abstract] The LaTeX macro in the abstract appears as “r{ho}”; this should be corrected to the standard ρ symbol.
  2. [Method description] Clarify whether the same frontier model is used for both directions of the round-trip or whether a separate translation model is employed; this choice affects the interpretation of the “no more capable judge” claim.
  3. [§4] Add a table or appendix listing the exact languages, domains, and number of examples in the LiT benchmark for reproducibility.
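
The control requested in major comment 1 can be illustrated with a first-order partial correlation, which is equivalent to correlating the residuals left after regressing both variables on the control. The sketch below uses made-up numbers constructed so that the control explains all of the shared variance. The function names and the choice of a single control variable are assumptions; applying the same formula to rank-transformed data would give a partial Spearman correlation.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly explained by z")
    return (rxy - rxz * ryz) / denom


# Hypothetical data for four models:
#   z = general capability (for example, an English benchmark score)
#   x = round-trip score, y = user rating; each equals z plus unrelated noise.
z = [1, 2, 3, 4]
x = [2, 1, 2, 5]
y = [0, 5, 0, 5]

raw = pearson(x, y)
controlled = partial_corr(x, y, z)
print(round(raw, 3), round(abs(controlled), 3))  # 0.333 0.0
```

Here the raw correlation of about 0.33 disappears once the control is removed. The referee's question is whether the reported ρ = 0.94 would survive the same treatment.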

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps clarify the scope of our claims and the presentation of our methods. We address each major comment below and outline revisions that will strengthen the manuscript while preserving its core contributions.

  1. Referee: [Abstract and results section on LMArena correlation] The core claim that round-trip translation specifically measures multilingual generation proficiency (rather than general capability or translation-model quality) is load-bearing but not yet established. The reported ρ = 0.94 correlation with LMArena could be driven by shared dependence on model scale, training compute, or English-centric performance; without controls such as partial correlations or regressions that residualize out English benchmark scores or parameter count, the distinction from math/reasoning benchmarks remains unproven. (This directly engages the skeptic concern and the weakest assumption noted in the review.)

    Authors: We agree that the specificity of round-trip translation to multilingual generation (as opposed to general capabilities) requires stronger evidence to support our central claim. Our current results highlight that thinking variants excel on standard multilingual benchmarks yet lag on LMArena, with round-trip translation showing high correlation to LMArena; however, we recognize that shared variance with scale or English performance could contribute to the observed ρ = 0.94. In the revised manuscript, we will add partial correlation and regression analyses that residualize out English benchmark scores and parameter counts. These controls will test whether the correlation with LMArena persists after accounting for general capability, thereby addressing the concern directly and clarifying the distinction from reasoning-focused benchmarks. revision: yes

  2. Referee: [§4 and §5] §4 (LiT benchmark construction) and §5 (experimental results): the manuscript provides insufficient detail on the translation models used for the round-trip process, the exact semantic-similarity metric, language/text selection criteria, and any statistical controls for translation quality. These omissions make it impossible to verify that semantic gaps primarily reflect the tested models' multilingual generation rather than the auxiliary translation step itself.

    Authors: We acknowledge that greater methodological detail is necessary for reproducibility and to isolate the contribution of the tested models' multilingual generation. In the revised manuscript, we will expand §4 to explicitly describe the translation models used in the round-trip process, the precise semantic-similarity metric (including the embedding model and computation method), the language and text selection criteria (e.g., speaker population thresholds and data availability), and any quality controls or filters applied to the auxiliary translations. In §5, we will include additional analyses quantifying the auxiliary step's impact, such as sensitivity checks across different translation models. These expansions will enable verification that semantic gaps primarily capture the evaluated models' capabilities. revision: yes
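
One simple form the promised sensitivity check could take is to score the same models under two auxiliary configurations and measure how often the two configurations order a pair of models the same way. The sketch below is an assumption about what such a check might look like, with made-up scores and a hypothetical `pairwise_agreement` helper; it is not drawn from the manuscript.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of model pairs ranked in the same order under both configurations.

    A pair that is tied under either configuration counts as a disagreement.
    """
    models = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models scored under both configurations")
    agree = 0
    for m1, m2 in pairs:
        diff_a = scores_a[m1] - scores_a[m2]
        diff_b = scores_b[m1] - scores_b[m2]
        if diff_a * diff_b > 0:
            agree += 1
    return agree / len(pairs)


# Hypothetical round-trip scores under two different auxiliary setups.
config_a = {"m1": 0.90, "m2": 0.80, "m3": 0.70}
config_b = {"m1": 0.85, "m2": 0.60, "m3": 0.65}

agreement = pairwise_agreement(config_a, config_b)
print(round(agreement, 3))  # 0.667: the two setups disagree only on m2 versus m3
```

Agreement near 1.0 across configurations would suggest the ranking reflects the evaluated models. Low agreement would point to the auxiliary step as a source of the semantic gaps.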

Circularity Check

0 steps flagged

No significant circularity; derivation relies on empirical contrasts and external correlation

The paper's chain proceeds from observed performance gaps (thinking variants outperforming instruct variants on existing multilingual benchmarks but underperforming on LMArena) to the proposal of round-trip translation, validated by a reported ρ=0.94 correlation with LMArena user ratings. No step reduces by construction to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation; the central claim that round-trip exposes multilingual generation failures is supported by independent external ratings rather than internal renaming or ansatz smuggling. The derivation remains self-contained against the provided benchmarks and ratings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that round-trip semantic similarity is a valid proxy for multilingual generation capability, and introduces the LiT benchmark as a new evaluation resource.

axioms (1)
  • domain assumption: Semantic gaps in round-trip translation primarily expose failures in multilingual generation capabilities
    Core justification for using round-trip as the evaluation method
invented entities (1)
  • Lost in Translation (LiT) benchmark (no independent evidence)
    purpose: Challenging round-trip translation benchmark spanning widely spoken languages
    Newly proposed evaluation resource

pith-pipeline@v0.9.0 · 5476 in / 1197 out tokens · 75976 ms · 2026-05-10T16:26:19.771935+00:00 · methodology
