Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Ameya Prabhu; Matthias Bethge; Ronald Skorobogat

arXiv: 2604.12911 · v1 · submitted 2026-04-14 · cs.CL · cs.AI

Pith reviewed 2026-05-10 16:26 UTC · model grok-4.3

keywords: multilingual evaluation, round-trip translation, language proficiency, AI benchmarks, frontier models, Lost in Translation, LMArena

The pith

Frontier multilingual benchmarks test mathematical reasoning and factual recall instead of language proficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multilingual benchmarks for frontier AI models follow the same structure as reasoning and knowledge tests but apply them across languages. This causes them to measure mathematical reasoning and factual recall far more than actual multilingual proficiency. Thinking variants of models outperform instruct variants on these benchmarks yet often underperform on real-world multilingual tasks such as LMArena user ratings. The paper shows that round-trip translation, by exposing semantic gaps after sending text to another language and back, provides a stronger signal of multilingual generation capability. This method correlates at 0.94 with user ratings, needs no human reference translations, and requires no judge stronger than the models being tested.

Core claim

Multilingual benchmarks measure mathematical reasoning and factual recall rather than multilingual proficiency: thinking variants dramatically outperform instruct variants on them yet often perform worse on real-world tasks. Round-trip translation reveals failures in multilingual generation by measuring how well meaning is preserved after translating to a target language and back. It correlates at ρ = 0.94 with LMArena ratings without human references or a stronger judge, and it forms the basis of the new Lost in Translation benchmark.

What carries the argument

Round-trip translation, which sends source text to a target language and back while checking for semantic differences that indicate multilingual generation failures.
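To make the mechanism concrete, here is a minimal Python sketch of a round-trip score. It is an illustration under stated assumptions, not the paper's implementation: the `translate` callable, the toy dictionary translators, and the token-overlap F1 used as the similarity measure are all hypothetical stand-ins, since this page does not specify the paper's similarity metric or whether one model handles both directions.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
# In a real evaluation this would wrap a call to the model under test.
Translate = Callable[[str, str, str], str]


def token_f1(original: str, reconstruction: str) -> float:
    """Toy similarity: token-overlap F1 (a stand-in for a semantic metric)."""
    ref = Counter(original.lower().split())
    hyp = Counter(reconstruction.lower().split())
    overlap = sum((ref & hyp).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def round_trip_score(
    texts: Sequence[str],
    translate: Translate,
    source_lang: str,
    target_lang: str,
    similarity: Callable[[str, str], float] = token_f1,
) -> float:
    """Mean similarity between each source text and its round-trip result."""
    if not texts:
        raise ValueError("need at least one source text")
    total = 0.0
    for text in texts:
        forward = translate(text, source_lang, target_lang)  # source -> target
        back = translate(forward, target_lang, source_lang)  # target -> source
        total += similarity(text, back)
    return total / len(texts)


def dictionary_translator(
    tables: Dict[Tuple[str, str], Dict[str, str]]
) -> Translate:
    """Build a toy word-for-word translator, for demonstration only."""
    def translate(text: str, source_lang: str, target_lang: str) -> str:
        table = tables[(source_lang, target_lang)]
        return " ".join(table.get(word, word) for word in text.lower().split())
    return translate


# Two toy "models": one preserves meaning, one distorts a word on the way back.
faithful = dictionary_translator({
    ("en", "de"): {"the": "der", "dog": "hund", "runs": "rennt"},
    ("de", "en"): {"der": "the", "hund": "dog", "rennt": "runs"},
})
lossy = dictionary_translator({
    ("en", "de"): {"the": "der", "dog": "hund", "runs": "rennt"},
    ("de", "en"): {"der": "the", "hund": "dog", "rennt": "walks"},
})

print(round_trip_score(["The dog runs"], faithful, "en", "de"))
print(round_trip_score(["The dog runs"], lossy, "en", "de"))
```

The similarity function is passed in as a parameter so that the toy token overlap can be swapped for whatever semantic measure an evaluation actually uses. The lossy translator scores lower because one word fails to survive the return trip.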

If this is right

Current benchmarks reward models that excel at reasoning across languages more than models that excel at fluent generation.
Round-trip translation enables evaluation of multilingual capability without human reference translations.
No stronger multilingual judge is needed to assess the tested models.
The Lost in Translation benchmark spans widely spoken languages and offers a more realistic test of frontier model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives might shift toward improving cross-language generation fidelity once round-trip translation becomes a standard metric.
Evaluations could separate reasoning skill from language skill to avoid over-optimizing one at the expense of the other.
Low-resource languages could be assessed more fairly without depending on scarce high-quality human translations.

Load-bearing premise

Semantic preservation after round-trip translation primarily reflects the tested model's multilingual generation proficiency rather than the quality of the translation systems or other factors.

What would settle it

A collection of models that achieve high round-trip translation scores but low LMArena user ratings, or low round-trip scores but high user ratings, would disprove the claimed correlation.
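A check of this kind comes down to a rank correlation between per-model round-trip scores and user ratings, plus a significance test. The sketch below is a plain-Python illustration, assuming Spearman's ρ and a one-sided permutation test (the figure captions mention such a test for other comparisons). The function names and the example numbers are hypothetical placeholders, not data from the paper.

```python
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(
    x: Sequence[float], y: Sequence[float],
    n_resamples: int = 10_000, seed: int = 0,
) -> float:
    """One-sided p-value: how often shuffled data reaches the observed rho."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        if spearman(x, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one correction avoids p = 0


# Invented placeholder values for five unnamed models (NOT from the paper).
round_trip_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
arena_ratings = [1180, 1225, 1150, 1260, 1210]
print(spearman(round_trip_scores, arena_ratings))
print(permutation_p_value(round_trip_scores, arena_ratings))
```

Models with high round-trip scores but low ratings (or the reverse) would show up here as a drop in ρ and a rise in the p-value.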

Figures

Figures reproduced from arXiv: 2604.12911 by Ameya Prabhu, Matthias Bethge, Ronald Skorobogat.

**Figure 1.** Multilingual benchmarks correlate poorly with human preferences. We show benchmark scores against LMArena Elo ratings for six frontier open-source models, each in Thinking and Non-Thinking variants. (a) MT-AIME24 (Son et al., 2025) shows near-zero correlation (ρ = −0.09), with thinking variants dramatically outperforming on the benchmark but performing no better on LMArena; (b) INCLUDE (Romanou et al., 2025) …

**Figure 2.** Multilingual benchmarks track English reasoning performance, and models reason in English regardless of input language. (a) MT-AIME24 scores correlate strongly with English AIME25 performance (ρ = 0.94, one-sided permutation test p=0.008), indicating the benchmark primarily measures mathematical reasoning ability. (b) INCLUDE scores correlate strongly with English MMLU-Pro (ρ = 0.83, one-sided permutation…

**Figure 3.** Errors on multilingual benchmarks stem from reasoning and knowledge gaps, not …

**Figure 4.** Qwen-3 and GPT-OSS models default to English reasoning on MT-AIME24. We …

**Figure 5.** GLM-4.7 and MiMo-V2-Flash show contrasting reasoning language patterns on …

**Figure 6.** Qwen-3 and GPT-OSS models reason almost entirely in English on INCLUDE …

**Figure 7.** GLM-4.7 reasons in English while MiMo-V2-Flash shows mixed reasoning patterns …

**Figure 8.** Error analysis across four additional models confirms MT-AIME24 errors are logi…

**Figure 9.** Error analysis on INCLUDE confirms errors are factual and knowledge-based, …

Original abstract

Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly (\r{ho} = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Round-trip translation is a simple, reference-free idea worth testing as a complement to current multilingual evals, but the paper still needs to show it measures language proficiency rather than general model strength.

The main thing to take from this paper is that many multilingual benchmarks appear to reward reasoning and recall more than actual cross-language generation, and round-trip translation is offered as a practical alternative that lines up closely with real user ratings on LMArena. They back this with the observation that thinking variants beat instruct variants on the usual tests yet often underperform on arena-style tasks, plus a new LiT benchmark that spans many languages and reports a 0.94 correlation without needing human references or stronger judges. That setup is straightforward and avoids some of the usual translation pitfalls. The paper does a solid job flagging a mismatch that many practitioners have noticed anecdotally, and the reference-free nature of round-trip makes it easy to apply at scale. The correlation number is the strongest empirical hook they have.

The soft spot is the same one the stress-test note flags: it is not yet clear whether round-trip success tracks multilingual generation specifically or simply tracks overall capability. If the same models that excel at math and English tasks also preserve meaning better when translating back and forth, then the high correlation with LMArena could be driven by shared variance in scale and training rather than by isolating the proficiency the authors claim. The abstract gives little detail on controls for translation model quality, the exact semantic similarity measure, or whether round-trip retains its predictive edge after accounting for English performance or model size. Without those, the claim that standard benchmarks miss multilingual proficiency rests on an assumption that still needs direct testing.

This work is aimed at people who build or evaluate multilingual systems and want benchmarks that better match downstream use. Readers who care about practical evaluation gaps will get something concrete from it.

The idea is clear enough and the initial numbers interesting enough that it deserves a serious referee rather than a desk reject, even if the current version will likely come back with requests for more controls and ablations. I would send it out for review.

Referee Report

2 major / 3 minor

Summary. The paper claims that standard multilingual benchmarks for frontier models primarily test mathematical reasoning and factual recall rather than multilingual proficiency, as shown by thinking variants outperforming instruct variants on benchmarks yet underperforming on real-world tasks like LMArena. It proposes round-trip translation (source to target and back, with semantic similarity) as a better alternative, reporting a near-perfect correlation (ρ = 0.94) with LMArena user ratings; the method requires no human references and no stronger multilingual judge. The authors introduce the Lost in Translation (LiT) benchmark spanning widely spoken languages.

Significance. If the round-trip method can be shown to isolate multilingual generation failures beyond general model capabilities, this could meaningfully shift evaluation practices toward scalable, reference-free proxies that better align with user preferences. Strengths include the high reported correlation, the absence of human references, and the concrete LiT benchmark resource; these are genuine contributions if the central proxy assumption holds after appropriate controls.

major comments (2)

[Abstract and results section on LMArena correlation] The core claim that round-trip translation specifically measures multilingual generation proficiency (rather than general capability or translation-model quality) is load-bearing but not yet established. The reported ρ = 0.94 correlation with LMArena could be driven by shared dependence on model scale, training compute, or English-centric performance; without controls such as partial correlations or regressions that residualize out English benchmark scores or parameter count, the distinction from math/reasoning benchmarks remains unproven. (This directly engages the skeptic concern and the weakest assumption noted in the review.)
[§4 and §5] §4 (LiT benchmark construction) and §5 (experimental results): the manuscript provides insufficient detail on the translation models used for the round-trip process, the exact semantic-similarity metric, language/text selection criteria, and any statistical controls for translation quality. These omissions make it impossible to verify that semantic gaps primarily reflect the tested models' multilingual generation rather than the auxiliary translation step itself.
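The control requested in the first major comment, residualizing out English benchmark scores before correlating, can be sketched as a partial correlation. This is a minimal illustration assuming a single control variable and ordinary least squares. The function names are hypothetical, the toy vectors are constructed for the demonstration, and nothing here reproduces the paper's analysis.

```python
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def residualize(target: Sequence[float], control: Sequence[float]) -> List[float]:
    """Residuals of a least-squares fit of target on control, with intercept."""
    n = len(target)
    mean_t, mean_c = sum(target) / n, sum(control) / n
    var_c = sum((c - mean_c) ** 2 for c in control)
    if var_c == 0:
        raise ValueError("control variable must not be constant")
    slope = sum((c - mean_c) * (t - mean_t) for t, c in zip(target, control)) / var_c
    return [(t - mean_t) - slope * (c - mean_c) for t, c in zip(target, control)]


def partial_correlation(
    x: Sequence[float], y: Sequence[float], control: Sequence[float]
) -> float:
    """Correlation between x and y after removing what control explains."""
    return pearson(residualize(x, control), residualize(y, control))


# Constructed toy data: x and y are related only through the control variable.
english = [-3.0, -1.0, 1.0, 3.0]   # stand-in for an English benchmark score
noise_a = [1.0, -1.0, -1.0, 1.0]   # orthogonal to `english`
noise_b = [-1.0, 3.0, -3.0, 1.0]   # orthogonal to `english` and `noise_a`
round_trip = [e + a for e, a in zip(english, noise_a)]
arena = [e + b for e, b in zip(english, noise_b)]

print(pearson(round_trip, arena))                       # raw correlation
print(partial_correlation(round_trip, arena, english))  # after the control
```

In the toy data the raw correlation is clearly positive while the partial correlation vanishes, which is the pattern the referee worries about. A round-trip metric that isolates multilingual proficiency should instead keep a substantial partial correlation. With only a handful of models, such estimates are noisy and should be read cautiously.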

minor comments (3)

[Abstract] The LaTeX macro in the abstract appears as “r{ho}”; this should be corrected to the standard ρ symbol.
[Method description] Clarify whether the same frontier model is used for both directions of the round-trip or whether a separate translation model is employed; this choice affects the interpretation of the “no more capable judge” claim.
[§4] Add a table or appendix listing the exact languages, domains, and number of examples in the LiT benchmark for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps clarify the scope of our claims and the presentation of our methods. We address each major comment below and outline revisions that will strengthen the manuscript while preserving its core contributions.

Referee: [Abstract and results section on LMArena correlation] The core claim that round-trip translation specifically measures multilingual generation proficiency (rather than general capability or translation-model quality) is load-bearing but not yet established. The reported ρ = 0.94 correlation with LMArena could be driven by shared dependence on model scale, training compute, or English-centric performance; without controls such as partial correlations or regressions that residualize out English benchmark scores or parameter count, the distinction from math/reasoning benchmarks remains unproven. (This directly engages the skeptic concern and the weakest assumption noted in the review.)

Authors: We agree that the specificity of round-trip translation to multilingual generation (as opposed to general capabilities) requires stronger evidence to support our central claim. Our current results highlight that thinking variants excel on standard multilingual benchmarks yet lag on LMArena, with round-trip translation showing high correlation to LMArena; however, we recognize that shared variance with scale or English performance could contribute to the observed ρ = 0.94. In the revised manuscript, we will add partial correlation and regression analyses that residualize out English benchmark scores and parameter counts. These controls will test whether the correlation with LMArena persists after accounting for general capability, thereby addressing the concern directly and clarifying the distinction from reasoning-focused benchmarks. revision: yes
Referee: [§4 and §5] §4 (LiT benchmark construction) and §5 (experimental results): the manuscript provides insufficient detail on the translation models used for the round-trip process, the exact semantic-similarity metric, language/text selection criteria, and any statistical controls for translation quality. These omissions make it impossible to verify that semantic gaps primarily reflect the tested models' multilingual generation rather than the auxiliary translation step itself.

Authors: We acknowledge that greater methodological detail is necessary for reproducibility and to isolate the contribution of the tested models' multilingual generation. In the revised manuscript, we will expand §4 to explicitly describe the translation models used in the round-trip process, the precise semantic-similarity metric (including the embedding model and computation method), the language and text selection criteria (e.g., speaker population thresholds and data availability), and any quality controls or filters applied to the auxiliary translations. In §5, we will include additional analyses quantifying the auxiliary step's impact, such as sensitivity checks across different translation models. These expansions will enable verification that semantic gaps primarily capture the evaluated models' capabilities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on empirical contrasts and external correlation

The paper's chain proceeds from observed performance gaps (thinking variants outperforming instruct variants on existing multilingual benchmarks but underperforming on LMArena) to the proposal of round-trip translation, validated by a reported ρ=0.94 correlation with LMArena user ratings. No step reduces by construction to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation; the central claim that round-trip exposes multilingual generation failures is supported by independent external ratings rather than internal renaming or ansatz smuggling. The derivation remains self-contained against the provided benchmarks and ratings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the domain assumption that round-trip semantic similarity is a valid proxy for multilingual generation capability, and introduces the LiT benchmark as a new evaluation resource.

axioms (1)

domain assumption: Semantic gaps in round-trip translation primarily expose failures in multilingual generation capabilities
Core justification for using round-trip as the evaluation method

invented entities (1)

Lost in Translation (LiT) benchmark · no independent evidence
purpose: Challenging round-trip translation benchmark spanning widely spoken languages
Newly proposed evaluation resource
