The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

Fengming Liu; Handi Li; Zezheng Lin

arXiv: 2605.07093 · v1 · submitted 2026-05-08 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-11 00:51 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: translation tax · multilingual benchmarks · English-to-Chinese · cue inheritance · LLM evaluation · benchmark validity · counterfactual audit · residue effects

The pith

Translated benchmarks do not carry a single translation tax but a set of estimator- and item-dependent validity risks from English cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper audits the assumption that translated benchmarks inflate LLM scores through a fixed translation tax by preserving English-source cues. Three proxy estimators applied to English-to-Chinese data produce inconsistent pictures: back-translation gaps are small and sensitive to parsing choices, cue-score calibration fails to predict item gains, and native-language controls point to model-family differences instead of uniform benchmark effects. A same-item naturalization test that rewrites only the Chinese surface form while keeping answers and content fixed reveals, after prompt correction, a residue dose-response where high-residue items gain but low-residue items do not. A reader would care because many cross-lingual evaluations depend on these translated tests, and a single correction factor cannot address validity risks that differ by both measurement method and specific item.

Core claim

What carries the argument

The LLM-naturalization stress test that holds answer, options, and content fixed while rewriting only Chinese surface form to isolate residue-dependent cue effects.
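The invariants named above can be partly checked mechanically. The following is a minimal sketch, not the authors' protocol: the field names (`question`, `options`, `answer`) and the toy item are hypothetical, and because the summary does not say whether option wording may be rewritten, the sketch checks only the option count. Content fidelity still needs human review.

```python
def check_naturalization(original, rewritten):
    """Return a list of invariant violations for one rewritten item.

    Both arguments are dicts with hypothetical keys:
      'question' -- the Chinese question text
      'options'  -- list of answer options
      'answer'   -- the gold answer key, e.g. 'A'
    """
    problems = []
    if original["answer"] != rewritten["answer"]:
        problems.append("answer key changed")
    if len(original["options"]) != len(rewritten["options"]):
        problems.append("option count changed")
    if original["question"] == rewritten["question"]:
        problems.append("question text was not rewritten")
    return problems


# Toy item (illustrative only, not taken from any benchmark).
orig_item = {
    "question": "以下哪一个是法国的首都?",
    "options": ["巴黎", "伦敦", "柏林", "罗马"],
    "answer": "A",
}
natural_item = {
    "question": "法国的首都是哪座城市?",
    "options": ["巴黎", "伦敦", "柏林", "罗马"],
    "answer": "A",
}
violations = check_naturalization(orig_item, natural_item)
```

An empty list means only that these surface invariants hold. It says nothing about semantic drift, which is why the audit also relies on human QC.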

If this is right

Back-translation gaps remain small and sensitive to parser choices across the tested items.
Cue-score calibration shows no reliable link to observed item-level performance gains.
Native-language model comparisons reveal family-specific patterns instead of uniform translation effects.
After prompt correction the naturalization test yields gains only for high-residue items and none for low-residue items.
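To make the last point concrete, a residue dose-response can be read off by averaging the per-item gain (naturalized minus original correctness) within each residue bin. This is a minimal sketch under assumptions: the bin labels, the field names, and the toy data are hypothetical, and the paper's actual residue scoring is not described here.

```python
from collections import defaultdict


def gain_by_residue_bin(items):
    """Mean accuracy gain (naturalized minus original) per residue bin.

    Each item is a dict with hypothetical keys:
      'residue'    -- bin label such as 'low' or 'high' (assumed pre-assigned)
      'orig_ok'    -- 1 if the translated item was answered correctly, else 0
      'natural_ok' -- 1 if the naturalized rewrite was answered correctly, else 0
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for it in items:
        sums[it["residue"]] += it["natural_ok"] - it["orig_ok"]
        counts[it["residue"]] += 1
    return {label: sums[label] / counts[label] for label in counts}


# Toy data (illustrative only).
toy = [
    {"residue": "high", "orig_ok": 0, "natural_ok": 1},
    {"residue": "high", "orig_ok": 1, "natural_ok": 1},
    {"residue": "low", "orig_ok": 1, "natural_ok": 1},
    {"residue": "low", "orig_ok": 0, "natural_ok": 0},
]
gains = gain_by_residue_bin(toy)
```

A pattern like the one the summary describes would show a positive mean gain in the high-residue bin and a gain near zero in the low-residue bin.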

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations using translated benchmarks should report per-item residue levels rather than a single aggregate adjustment.
The naturalization protocol could be extended to other language pairs to test whether the item-dependent pattern generalizes.
Benchmark papers would benefit from adopting the released reporting checklist to catch prompt-construction artifacts.

Load-bearing premise

The LLM-naturalization stress test, after prompt bug correction, isolates English-source cue inheritance effects without introducing new artifacts from the rewriting process or prompt variations.

What would settle it

Re-running the naturalization contrast with an independent rewriting model or additional prompt variants that eliminates the differential gain between high-residue and low-residue items would falsify the claim of item-dependent validity risks.

Figures

Figures reproduced from arXiv: 2605.07093 by Fengming Liu, Handi Li, Zezheng Lin.

**Figure 2.** Strict-QC TTback across six (model × benchmark) cells. Circles: MMMLU; squares: Belebele. Point estimates with 95% paired-bootstrap CIs. Five of six cells are positive and one is exactly zero; most individual CIs contain zero. Effect sizes are small in magnitude (range 0.000–0.047). The largest cell (MMMLU × gpt-5.4-mini) is parser-fragile.
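The caption mentions 95% paired-bootstrap confidence intervals. A generic percentile version of that procedure might look like the sketch below. The exact definition of TTback and the authors' resampling settings are not given here, so the function simply takes two aligned lists of per-item correctness and resamples items with replacement. The name `paired_bootstrap_ci` and its defaults are assumptions.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference mean(a - b).

    a, b -- equal-length lists of per-item scores (e.g. 0/1 correctness)
            for the same items under two conditions.
    Returns (point_estimate, lower, upper).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    point = sum(diffs) / n
    stats = []
    for _ in range(n_boot):
        # Resample items, keeping each item's pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper


# Toy per-item correctness under two conditions (illustrative only).
cond_a = [1, 1, 0, 1, 0, 1, 1, 0]
cond_b = [1, 0, 0, 1, 0, 1, 0, 0]
estimate, ci_low, ci_high = paired_bootstrap_ci(cond_a, cond_b)
```

Resampling the paired differences, rather than each condition separately, keeps item difficulty matched across conditions. That matters when effects are as small as those reported in the caption.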

**Figure 3.** After correcting the prompt-construction bug, E4 no longer supports a stable model-family …

Original abstract

The Translation Tax is often treated as a scalar: translated benchmarks are assumed to inflate scores by preserving English-source cues. We audit this claim in an English-to-Chinese setting. Three proxy estimators disagree: back-translation gaps are small and parser-fragile; cue-score calibration does not predict item-level gains; and a six-model native-control comparison shows model-family rather than uniform benchmark effects. We add a same-item LLM-naturalization stress test that holds answer, options, and content fixed while rewriting Chinese surface form. After correcting a prompt-construction bug, this contrast no longer supports a model-family interaction, but it preserves a residue dose-response: high-residue items benefit while low-residue items do not. The result is not a single Translation Tax, but a set of estimator- and item-dependent validity risks. We release per-cell evidence, the naturalization protocol, human QC, and a reporting checklist for translated multilingual benchmark papers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript audits the assumption that the 'Translation Tax' in English-to-Chinese multilingual benchmarks is a uniform scalar due to preserved English-source cues. Using three proxy estimators (back-translation gaps, cue-score calibration, and six-model native controls) that yield inconsistent results, plus an LLM-naturalization stress test that rewrites surface form while holding answers/options/content fixed, the authors find estimator-dependent effects and, after prompt-bug correction, a residue dose-response where high-residue items benefit from naturalization but low-residue items do not. They conclude that validity risks are item- and estimator-dependent rather than scalar, and release per-cell data, the naturalization protocol, human QC, and a reporting checklist.

Significance. If the central findings hold, the work is significant for multilingual NLP evaluation by shifting focus from scalar translation penalties to nuanced, item-specific validity risks, with direct implications for benchmark construction and interpretation. The disagreement among proxies and the preserved dose-response after bug correction provide a useful counter to oversimplified views. Credit is due for the release of per-cell evidence, the naturalization protocol, human QC, and a reporting checklist, which support reproducibility and offer practical guidance for future translated-benchmark papers.

major comments (2)

[LLM-naturalization stress test] (post prompt-bug correction): The item-dependent validity-risk claim rests on the residue dose-response, but this requires that the rewriting alters only surface form without introducing differential semantic drift, new English cues, or difficulty shifts for high- vs. low-residue items. The paper mentions human QC but should supply quantitative checks (e.g., pre/post semantic similarity scores, option-difficulty ratings, or inter-annotator agreement) to rule out artifacts that could spuriously produce the observed benefit for high-residue items.
[Results section] Results for proxy estimators and naturalization contrast: The abstract and main text provide limited detail on statistical controls, data exclusion rules, and error bars for the back-translation gaps, cue-score calibration, six-model comparison, and residue dose-response. Explicit reporting of confidence intervals, p-values, or robustness checks is needed to substantiate the claims of 'disagreement' among estimators and the dose-response pattern.

minor comments (2)

[Abstract] The abstract is dense; a brief expansion on the direction and magnitude of each proxy estimator's result would improve standalone readability without lengthening the paper.
[Data availability] Ensure repository links or DOIs for the released per-cell evidence, protocol, and checklist are explicitly stated in the main text and data-availability statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where additional methodological transparency and quantitative validation will strengthen our audit of non-scalar translation effects in English-to-Chinese benchmarks. We address each major comment below and commit to revisions that enhance the rigor of our claims without altering the core findings.

Referee: [LLM-naturalization stress test] (post prompt-bug correction): The item-dependent validity-risk claim rests on the residue dose-response, but this requires that the rewriting alters only surface form without introducing differential semantic drift, new English cues, or difficulty shifts for high- vs. low-residue items. The paper mentions human QC but should supply quantitative checks (e.g., pre/post semantic similarity scores, option-difficulty ratings, or inter-annotator agreement) to rule out artifacts that could spuriously produce the observed benefit for high-residue items.

Authors: We agree that explicit quantitative checks are necessary to confirm the naturalization process preserves semantics and difficulty without differential artifacts. Our human QC protocol used multiple expert annotators to verify semantic fidelity, absence of new English cues, and option integrity, but aggregate metrics were not reported in the initial submission. In the revision, we will add pre- and post-naturalization semantic similarity scores (via multilingual sentence embeddings), inter-annotator agreement statistics (e.g., Fleiss' kappa), and difficulty preservation checks where benchmark metadata permits. These will be presented alongside the dose-response results to rule out confounds and support the item-dependent validity-risk interpretation.

Revision: yes
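For reference, Fleiss' kappa, the agreement statistic named in the response above, can be computed from a table of per-item category counts. The sketch below implements the standard formula. The toy ratings and the two-category reading ("fidelity preserved" vs. "not preserved") are hypothetical and do not come from the paper's QC data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] -- number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    Returns NaN when all ratings fall in one category (kappa is undefined).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])
    # Overall proportion of ratings falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(per_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    if p_exp == 1.0:
        return float("nan")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy table: 4 items, 3 annotators, categories = [preserved, not preserved].
toy_counts = [[3, 0], [3, 0], [2, 1], [0, 3]]
kappa = fleiss_kappa(toy_counts)
```

Reporting kappa separately for high- and low-residue items would speak directly to the referee's concern about differential artifacts.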
Referee: [Results section] Results for proxy estimators and naturalization contrast: The abstract and main text provide limited detail on statistical controls, data exclusion rules, and error bars for the back-translation gaps, cue-score calibration, six-model comparison, and residue dose-response. Explicit reporting of confidence intervals, p-values, or robustness checks is needed to substantiate the claims of 'disagreement' among estimators and the dose-response pattern.

Authors: We acknowledge that the Results section would benefit from expanded statistical detail to better substantiate estimator disagreement and the residue dose-response. The manuscript emphasizes observed patterns across proxies, but we will revise to include confidence intervals and error bars on all key comparisons and figures. Data exclusion criteria (e.g., parsing failures in back-translation, incomplete items) will be explicitly documented. We will report p-values for the dose-response analysis and add robustness checks, including bootstrap resampling and sensitivity to model subsets. These enhancements will appear in the main Results section and supplementary materials.

Revision: yes
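One simple form of the "sensitivity to model subsets" check mentioned above is a leave-one-model-out recomputation of the pooled estimate. The sketch below assumes a per-model gap estimate is already available. The model names and values are placeholders, not results from the paper.

```python
def leave_one_model_out(gaps):
    """Mean gap across models, recomputed with each model dropped in turn.

    gaps -- dict mapping a model name to its gap estimate.
    Returns a dict mapping the dropped model to the mean of the rest.
    """
    if len(gaps) < 2:
        raise ValueError("need at least two models")
    total = sum(gaps.values())
    n = len(gaps)
    return {name: (total - g) / (n - 1) for name, g in gaps.items()}


# Placeholder per-model gaps (illustrative only).
toy_gaps = {"model_a": 0.02, "model_b": 0.00, "model_c": 0.04}
loo = leave_one_model_out(toy_gaps)
spread = max(loo.values()) - min(loo.values())
```

If the sign or rough size of the pooled gap changes when a single model is dropped, the aggregate is being driven by that model rather than by the benchmark.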

Circularity Check

0 steps flagged

No circularity: purely empirical audit with direct experimental contrasts

The paper performs an empirical audit via multiple proxy estimators, native-control comparisons, and an LLM-naturalization stress test on translated benchmarks. All load-bearing claims derive from observable experimental outcomes (e.g., back-translation gaps, residue dose-response after bug correction, model-family effects) rather than any equations, fitted parameters, self-definitions, or self-citation chains that reduce findings to inputs by construction. Data, protocol, and checklist are released for external verification, confirming the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the three proxy estimators and the naturalization protocol as measures of English cue inheritance; no explicit free parameters are described as fitted to data.

axioms (1)

Domain assumption: LLM responses to benchmarks primarily reflect underlying capabilities rather than surface-form artifacts. Invoked when interpreting score differences in the native-control comparison and naturalization stress test.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Translation artifacts in cross-lingual transfer learning. In EMNLP, 2020.

[2] Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In ACL, 2024.

[3] Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. TACL, 8:454–470, 2020.

[4] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In EMNLP, 2018.

[5] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021.

[6] Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? A call for multi-prompt LLM evaluation. TACL, 12:933–949, 2024.

[7] Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. XCOPA: A multilingual dataset for causal commonsense reasoning. In EMNLP, 2020.

[8] Angelika Romanou, Negar Foroutan, Anna Sotnikova, Sree Harsha Tanneru, Zeming Chen, Antoine Bosselut, and Syrielle Montariol. INCLUDE: Evaluating multilingual language understanding with regional knowledge. In ICLR, 2025.

[9] Sunayana Sitaram. Coverage, representativeness, trust and scientific rigor in multilingual benchmark evaluation. Invited talk, NeurIPS 2025 Workshop on Centering Low Resource Languages and Cultures.

[10] Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. The bitter lesson learned from 2,000+ multilingual benchmarks. arXiv:2504.15521, 2025.

[11] Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. ProSA: Assessing and understanding the prompt sensitivity of LLMs. In EMNLP Findings, 2024.
