Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss
Pith reviewed 2026-05-10 16:26 UTC · model grok-4.3
The pith
Frontier multilingual benchmarks test mathematical reasoning and factual recall instead of language proficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multilingual benchmarks measure mathematical reasoning and factual recall rather than multilingual proficiency: thinking variants dramatically outperform instruct variants on them, yet often perform worse on real-world multilingual tasks. Round-trip translation exposes failures in multilingual generation by measuring how well meaning is preserved after translating text to a target language and back. It correlates at ρ = 0.94 with LMArena ratings, needs neither human references nor a stronger judge, and forms the basis of the new Lost in Translation benchmark.
What carries the argument
Round-trip translation, which sends source text to a target language and back while checking for semantic differences that indicate multilingual generation failures.
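The round-trip procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `translate` callable, the `toy_translate` stand-in, and the word-overlap similarity are all hypothetical placeholders, since the summary does not specify which translation calls or semantic-similarity metric the authors use.

```python
from typing import Callable

# Hypothetical interface: translate(text, source_lang, target_lang) -> text.
# In a real evaluation this would call the model under test.
Translator = Callable[[str, str, str], str]


def token_jaccard(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of lowercased word sets.

    This is an assumption for illustration only; the paper's actual
    semantic-similarity metric is not given in this summary.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def round_trip_score(
    text: str,
    source_lang: str,
    target_lang: str,
    translate: Translator,
    similarity: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Translate source -> target -> source and compare with the original."""
    forward = translate(text, source_lang, target_lang)
    back = translate(forward, target_lang, source_lang)
    return similarity(text, back)


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Stand-in 'model' that drops the final word when translating out of English."""
    words = text.split()
    if source_lang == "en" and len(words) > 1:
        words = words[:-1]
    return " ".join(words)


score = round_trip_score("the cat sat on the mat", "en", "de", toy_translate)
print(f"round-trip similarity: {score:.2f}")
```

In this toy run the forward step loses one word, so the returned text no longer matches the original and the score falls below 1. A faithful translator would score 1 under this placeholder metric.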
If this is right
- Current benchmarks reward models that excel at reasoning across languages more than models that excel at fluent generation.
- Round-trip translation enables evaluation of multilingual capability without human reference translations.
- No stronger multilingual judge is needed to assess the tested models.
- The Lost in Translation benchmark spans widely spoken languages and offers a more realistic test of frontier model performance.
Where Pith is reading between the lines
- Training objectives might shift toward improving cross-language generation fidelity once round-trip translation becomes a standard metric.
- Evaluations could separate reasoning skill from language skill to avoid over-optimizing one at the expense of the other.
- Low-resource languages could be assessed more fairly without depending on scarce high-quality human translations.
Load-bearing premise
Semantic preservation after round-trip translation primarily reflects the tested model's multilingual generation proficiency rather than the quality of the translation systems or other factors.
What would settle it
A collection of models that achieve high round-trip translation scores but low LMArena user ratings, or low round-trip scores but high user ratings, would disprove the claimed correlation.
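To make the falsification test concrete, the sketch below computes a Spearman rank correlation by hand and shows how a few discordant models would pull it down. All scores are invented for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Invented scores for five hypothetical models that agree in rank order.
round_trip = [0.91, 0.85, 0.78, 0.70, 0.62]
arena = [1400, 1380, 1350, 1300, 1250]
rho_concordant = spearman(round_trip, arena)

# Add two hypothetical counterexamples: high round-trip score with a low
# user rating, and low round-trip score with a high user rating.
rho_with_outliers = spearman(round_trip + [0.95, 0.55], arena + [1200, 1420])

print(rho_concordant, rho_with_outliers)
```

With perfectly concordant rankings the coefficient is 1. Adding the two counterexample models turns it negative in this toy data, which is the kind of evidence the paragraph above says would undercut the claimed correlation.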
Original abstract
Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard multilingual benchmarks for frontier models primarily test mathematical reasoning and factual recall rather than multilingual proficiency, as shown by thinking variants outperforming instruct variants on benchmarks yet underperforming on real-world tasks like LMArena. It proposes round-trip translation (source to target and back, with semantic similarity) as a better alternative, reporting a near-perfect correlation (ρ = 0.94) with LMArena user ratings; the method requires no human references and no stronger multilingual judge. The authors introduce the Lost in Translation (LiT) benchmark spanning widely spoken languages.
Significance. If the round-trip method can be shown to isolate multilingual generation failures beyond general model capabilities, this could meaningfully shift evaluation practices toward scalable, reference-free proxies that better align with user preferences. Strengths include the high reported correlation, the absence of human references, and the concrete LiT benchmark resource; these are genuine contributions if the central proxy assumption holds after appropriate controls.
major comments (2)
- [Abstract and results section on LMArena correlation] The core claim that round-trip translation specifically measures multilingual generation proficiency (rather than general capability or translation-model quality) is load-bearing but not yet established. The reported ρ = 0.94 correlation with LMArena could be driven by shared dependence on model scale, training compute, or English-centric performance; without controls such as partial correlations or regressions that residualize out English benchmark scores or parameter count, the distinction from math/reasoning benchmarks remains unproven. (This directly engages the skeptic concern and the weakest assumption noted in the review.)
- [§4 and §5] §4 (LiT benchmark construction) and §5 (experimental results): the manuscript provides insufficient detail on the translation models used for the round-trip process, the exact semantic-similarity metric, language/text selection criteria, and any statistical controls for translation quality. These omissions make it impossible to verify that semantic gaps primarily reflect the tested models' multilingual generation rather than the auxiliary translation step itself.
minor comments (3)
- [Abstract] The LaTeX macro in the abstract appears as “r{ho}”; this should be corrected to the standard ρ symbol.
- [Method description] Clarify whether the same frontier model is used for both directions of the round-trip or whether a separate translation model is employed; this choice affects the interpretation of the “no more capable judge” claim.
- [§4] Add a table or appendix listing the exact languages, domains, and number of examples in the LiT benchmark for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which helps clarify the scope of our claims and the presentation of our methods. We address each major comment below and outline revisions that will strengthen the manuscript while preserving its core contributions.
Point-by-point responses
- Referee: [Abstract and results section on LMArena correlation] The core claim that round-trip translation specifically measures multilingual generation proficiency (rather than general capability or translation-model quality) is load-bearing but not yet established. The reported ρ = 0.94 correlation with LMArena could be driven by shared dependence on model scale, training compute, or English-centric performance; without controls such as partial correlations or regressions that residualize out English benchmark scores or parameter count, the distinction from math/reasoning benchmarks remains unproven. (This directly engages the skeptic concern and the weakest assumption noted in the review.)
  Authors: We agree that the specificity of round-trip translation to multilingual generation (as opposed to general capabilities) requires stronger evidence to support our central claim. Our current results highlight that thinking variants excel on standard multilingual benchmarks yet lag on LMArena, with round-trip translation showing high correlation to LMArena; however, we recognize that shared variance with scale or English performance could contribute to the observed ρ = 0.94. In the revised manuscript, we will add partial correlation and regression analyses that residualize out English benchmark scores and parameter counts. These controls will test whether the correlation with LMArena persists after accounting for general capability, thereby addressing the concern directly and clarifying the distinction from reasoning-focused benchmarks.
  Revision: yes
- Referee: [§4 and §5] §4 (LiT benchmark construction) and §5 (experimental results): the manuscript provides insufficient detail on the translation models used for the round-trip process, the exact semantic-similarity metric, language/text selection criteria, and any statistical controls for translation quality. These omissions make it impossible to verify that semantic gaps primarily reflect the tested models' multilingual generation rather than the auxiliary translation step itself.
  Authors: We acknowledge that greater methodological detail is necessary for reproducibility and to isolate the contribution of the tested models' multilingual generation. In the revised manuscript, we will expand §4 to explicitly describe the translation models used in the round-trip process, the precise semantic-similarity metric (including the embedding model and computation method), the language and text selection criteria (e.g., speaker population thresholds and data availability), and any quality controls or filters applied to the auxiliary translations. In §5, we will include additional analyses quantifying the auxiliary step's impact, such as sensitivity checks across different translation models. These expansions will enable verification that semantic gaps primarily capture the evaluated models' capabilities.
  Revision: yes
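The partial-correlation control discussed in the first exchange can be illustrated with the standard first-order formula applied to ranks. This is a sketch under stated assumptions: the function names are hypothetical, the data are invented, and an English benchmark score is used as the single control variable only because the referee names it as one option. It is not the analysis the authors will run.

```python
from math import sqrt


def simple_ranks(values):
    """1-based ranks; assumes no tied values, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for z.

    Uses r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    Undefined (division by zero) if z is perfectly rank-correlated with x or y.
    """
    rx, ry, rz = simple_ranks(x), simple_ranks(y), simple_ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Invented data for five hypothetical models.
round_trip = [0.62, 0.70, 0.78, 0.85, 0.91]  # round-trip similarity
arena = [1250, 1300, 1350, 1380, 1400]       # user-preference rating
english = [60, 75, 70, 90, 85]               # English benchmark score (control)

print(partial_spearman(round_trip, arena, english))
```

If the partial coefficient stays high, the round-trip score carries information about user ratings beyond what the control explains. If it collapses toward zero, the raw correlation was mostly shared dependence on general capability, which is the referee's concern.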
Circularity Check
No significant circularity; derivation relies on empirical contrasts and external correlation
full rationale
The paper's chain proceeds from observed performance gaps (thinking variants outperforming instruct variants on existing multilingual benchmarks but underperforming on LMArena) to the proposal of round-trip translation, validated by a reported ρ=0.94 correlation with LMArena user ratings. No step reduces by construction to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation; the central claim that round-trip exposes multilingual generation failures is supported by independent external ratings rather than internal renaming or ansatz smuggling. The derivation remains self-contained against the provided benchmarks and ratings.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Semantic gaps in round-trip translation primarily expose failures in multilingual generation capabilities.
invented entities (1)
- Lost in Translation (LiT) benchmark: no independent evidence