arxiv: 2604.14934 · v2 · submitted 2026-04-16 · 💻 cs.CL

XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

Pith reviewed 2026-05-10 10:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords machine translation evaluation · cross-lingual bias · translation metrics · parallel quality dataset · multilingual evaluation · automatic evaluation

The pith

A dataset of parallel-quality translations across languages shows that metrics assign different scores to equally good outputs depending on the language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work builds XQ-MEval by injecting controlled errors into gold translations and using native-speaker filters to create instances where translation quality is matched across nine directions. Experiments with nine common metrics then demonstrate that simple averaging of scores fails to align with human judgments because the metrics themselves produce systematically different scores for translations of identical quality. The authors also derive a normalization step from the dataset that brings score distributions into better alignment across languages. A sympathetic reader would care because multilingual translation systems are judged by these averaged numbers, so hidden language bias can distort which systems appear strongest and slow progress on fair evaluation.

Core claim

XQ-MEval supplies source-reference-pseudo-translation triplets in which quality is held constant across languages through MQM error injection followed by native-speaker validation. When nine representative metrics are run on these triplets, their scores diverge across languages even though human quality is the same, producing the first direct empirical evidence of cross-lingual scoring bias. The same data yields a normalization procedure that equalizes score distributions and thereby improves the reliability of multilingual metric comparisons.
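One way to make "scores diverge across languages at equal quality" concrete is to group metric scores by quality level and compare the per-direction means. The sketch below is only an illustration under assumptions: the record layout, the function name `bias_spread`, the direction labels, and all numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

def bias_spread(records):
    """Spread of per-direction mean scores at each quality level.

    records: iterable of (direction, quality_level, metric_score).
    Returns {quality_level: max minus min of the per-direction means}.
    A metric with no cross-lingual bias would give a spread near 0
    at every level, because quality is matched by construction.
    """
    buckets = defaultdict(list)
    for direction, level, score in records:
        buckets[(level, direction)].append(score)

    per_level = defaultdict(dict)
    for (level, direction), scores in buckets.items():
        per_level[level][direction] = mean(scores)

    return {
        level: max(means.values()) - min(means.values())
        for level, means in per_level.items()
    }

# Made-up scores for two hypothetical directions and two quality levels.
toy_records = [
    ("dir_a", 1, 0.80), ("dir_a", 1, 0.90),
    ("dir_b", 1, 0.60), ("dir_b", 1, 0.70),
    ("dir_a", 2, 0.50), ("dir_b", 2, 0.50),
]
spread = bias_spread(toy_records)
```

A large spread at a given level only indicates metric bias if the quality-parity premise holds for that level.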

What carries the argument

The XQ-MEval dataset itself, produced by automatic MQM error injection into gold translations, native-speaker filtering for reliability, and controlled merging of errors to generate pseudo translations whose quality is matched across language pairs.
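The construction steps can be pictured as a small data structure plus a merge function. This is a minimal sketch, assuming each validated error is a simple span replacement in the reference and that the spans do not overlap. The class names, fields, and merge rule are hypothetical, and the paper's actual merging procedure may differ.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class InjectedError:
    error_type: str   # MQM category, e.g. "omission" or "mistranslation"
    original: str     # span as it appears in the gold reference
    corrupted: str    # span after injection ("" when the span is deleted)

@dataclass(frozen=True)
class Triplet:
    source: str
    reference: str
    pseudo_translation: str
    num_errors: int   # stands in for the controlled quality level

def merge_errors(source, reference, errors, k):
    """Yield one Triplet for each way of applying k validated errors.

    Assumes the error spans are disjoint, so the order of application
    does not matter.
    """
    for combo in combinations(errors, k):
        text = reference
        for err in combo:
            text = text.replace(err.original, err.corrupted, 1)
        # Collapse whitespace left behind by deletions.
        yield Triplet(source, reference, " ".join(text.split()), k)

# Toy example with invented sentences.
validated = [
    InjectedError("omission", "red", ""),
    InjectedError("mistranslation", "cat", "dog"),
]
two_error = list(merge_errors("<source sentence>", "The cat sat on the red mat", validated, 2))
one_error = list(merge_errors("<source sentence>", "The cat sat on the red mat", validated, 1))
```

Using the number of merged errors as the quality level is a simplification; a severity-weighted score would be a natural alternative.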

If this is right

  • Averaging metric scores across languages without correction misrepresents overall system quality.
  • Metrics must be adjusted or normalized before cross-lingual comparisons can be trusted.
  • The normalization derived from XQ-MEval produces more stable rankings that better match human judgments.
  • Metric developers should test new scores on parallel-quality data to detect and reduce language bias.
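A toy calculation shows how an uncorrected average can misrank systems. The sketch below uses per-language z-score standardization as a stand-in normalization. That choice is an assumption for illustration and is not claimed to be the paper's procedure; the language labels, system names, and scores are all made up.

```python
from statistics import mean, pstdev

# Hypothetical calibration scores: one metric's outputs on items of the SAME
# three quality levels (low, mid, high) in two made-up languages.
calibration = {
    "lang_a": [0.2, 0.5, 0.8],
    "lang_b": [0.6, 0.7, 0.8],  # same quality levels, compressed score range
}

# Hypothetical per-language scores for two systems.
systems = {
    "system_x": {"lang_a": 0.8, "lang_b": 0.6},  # high level in A, low in B
    "system_y": {"lang_a": 0.5, "lang_b": 0.8},  # mid level in A, high in B
}

def raw_average(scores):
    """Plain mean of the metric scores across languages."""
    return mean(list(scores.values()))

def normalized_average(scores, calibration):
    """Mean of per-language z-scores, using calibration mean and std."""
    stats = {lang: (mean(v), pstdev(v)) for lang, v in calibration.items()}
    return mean([(s - stats[lang][0]) / stats[lang][1] for lang, s in scores.items()])

raw = {name: raw_average(s) for name, s in systems.items()}
norm = {name: normalized_average(s, calibration) for name, s in systems.items()}
```

With these invented numbers the raw average ranks `system_x` first, while the standardized average ranks `system_y` first, which agrees with the underlying quality levels (high plus low versus mid plus high).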

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction method could be applied to additional language pairs to test whether bias patterns generalize.
  • Metric training pipelines might incorporate parallel-quality examples as a regular calibration step.
  • Evaluation protocols for multilingual systems could shift from raw averages to normalized or per-language reporting.
  • The dataset offers a reusable test bed for checking whether newly proposed metrics still carry the same cross-lingual skew.

Load-bearing premise

The semi-automatically created pseudo translations really do have equivalent quality across languages once native speakers have filtered them.

What would settle it

A blind human evaluation in which raters compare the actual quality of the pseudo translations across the nine directions and find consistent quality gaps would show that the dataset does not hold quality constant.
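Such an evaluation could be analyzed with a permutation test on the human ratings. The following is a sketch under assumptions: the test statistic (largest gap between per-direction mean ratings), the function names, and the ratings are illustrative choices, not a protocol described in the paper.

```python
import random
from statistics import mean

def max_mean_gap(groups):
    """Largest difference between any two group means."""
    means = [mean(g) for g in groups]
    return max(means) - min(means)

def permutation_p_value(ratings_by_direction, n_perm=2000, seed=0):
    """P-value for 'direction labels are unrelated to human ratings'.

    ratings_by_direction: {direction: [human rating, ...]}.
    Shuffles ratings across directions and counts how often the
    shuffled gap is at least as large as the observed gap.
    """
    groups = list(ratings_by_direction.values())
    sizes = [len(g) for g in groups]
    pooled = [r for g in groups for r in g]
    observed = max_mean_gap(groups)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if max_mean_gap(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Invented ratings: one case with matched quality, one with a clear gap.
p_matched = permutation_p_value({"dir_a": [3, 3, 3, 3, 3], "dir_b": [3, 3, 3, 3, 3]})
p_gap = permutation_p_value({"dir_a": [1, 1, 1, 1, 1], "dir_b": [5, 5, 5, 5, 5]})
```

A small p-value would point to quality gaps between directions. A large one does not by itself establish equivalence, so an equivalence-style analysis with a pre-set tolerance would be the stronger check.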

Figures

Figures reproduced from arXiv: 2604.14934 by Hidetaka Kamigaito, Jingxuan Liu, Jin Tei, Lemao Liu, Taro Watanabe, Zhi Qu.

Figure 1: A clue of this study, showing the inconsistency
Figure 2: The illustration of our pipeline. Specifically, stages from (a) to (c) show the data construction and reveal
Figure 3: Visualization of nine metric scores across nine directions at varying translation quality levels.
Figure 4: The illustration of COMET score distribution
Figure 5: Visualization of three metrics scores across
Figure 6: The prompt for different error types to guide GPT-4o to introduce errors to references.
Figure 7: Visualization of LLM-based evaluation scores across three directions at varying translation quality levels.
Figure 8: Visualization of nine metrics scores under the LGN strategy across nine directions at varying translation

Original abstract

Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces XQ-MEval, a semi-automatically constructed dataset spanning nine translation directions that supplies parallel-quality instances for benchmarking automatic translation metrics. Pseudo translations are generated by automatically injecting MQM-defined errors into gold references, followed by native-speaker filtering and error merging to produce controllable quality levels. Experiments with nine representative metrics on this dataset demonstrate inconsistencies between simple averaging of scores and human judgments, supply the first empirical evidence of cross-lingual scoring bias, and motivate a normalization procedure derived from the dataset to align score distributions across languages.

Significance. If the quality-parity assumption holds, the dataset would fill a clear gap in multilingual MT evaluation by enabling controlled tests of metric fairness; the normalization strategy could then improve the reliability of cross-lingual comparisons. The work correctly identifies that current averaging practices rest on an untested assumption and supplies a concrete resource for future metric development.

major comments (1)
  1. [Method (error injection and filtering procedure)] The central claim that XQ-MEval exposes cross-lingual scoring bias rests on the unverified premise that the generated pseudo translations possess equivalent quality across the nine directions. The method (automatic MQM-error injection into gold references followed by native-speaker filtering) is described at a high level, yet no quantitative validation—such as cross-lingual human quality ratings, inter-annotator agreement statistics, or calibration experiments comparing error severity across language pairs—is reported. Because error impact is language-dependent (e.g., word-order perturbations affect agglutinative versus analytic languages differently), residual quality differences could explain observed metric-score divergences rather than bias in the metrics themselves.
minor comments (2)
  1. [Abstract] The abstract refers to “nine representative metrics” without naming them; an explicit list (or a pointer to the table that enumerates them) would improve readability.
  2. [Introduction / Dataset description] The precise set of language pairs and directions covered by the nine translation directions should be stated explicitly in the introduction or dataset section to allow readers to assess coverage.
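The inter-annotator agreement statistic requested in the major comment could be reported as Cohen's kappa over the keep/reject decisions of two native-speaker annotators. The sketch below assumes a two-annotator setup with categorical labels. The label names and decisions are invented, and the paper does not state which agreement measure, if any, was used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented filtering decisions on four injected errors.
annotator_1 = ["keep", "keep", "reject", "reject"]
annotator_2 = ["keep", "reject", "reject", "reject"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Reporting kappa separately for each of the nine directions would also show whether filtering was equally reliable across languages, which bears directly on the quality-parity premise.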

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point by point below, providing clarifications on our methodology while acknowledging areas where additional details will strengthen the manuscript.

point-by-point responses
  1. Referee: The central claim that XQ-MEval exposes cross-lingual scoring bias rests on the unverified premise that the generated pseudo translations possess equivalent quality across the nine directions. The method (automatic MQM-error injection into gold references followed by native-speaker filtering) is described at a high level, yet no quantitative validation—such as cross-lingual human quality ratings, inter-annotator agreement statistics, or calibration experiments comparing error severity across language pairs—is reported. Because error impact is language-dependent (e.g., word-order perturbations affect agglutinative versus analytic languages differently), residual quality differences could explain observed metric-score divergences rather than bias in the metrics themselves.

    Authors: We agree that validating quality equivalence across languages is essential to support our claims about cross-lingual scoring bias. The native-speaker filtering was designed to enforce consistency by having annotators verify that the injected MQM errors matched the target severity levels, using standardized error definitions to minimize subjectivity. However, we acknowledge that the initial submission did not report quantitative measures such as inter-annotator agreement or cross-lingual calibration ratings. In the revised manuscript, we will add inter-annotator agreement statistics from the filtering stage and expand the method section to discuss potential language-specific error impacts (e.g., word order in different language families). We will also include a brief analysis of quality consistency where possible. This addresses the concern without altering the core findings, as the controllable error merging still enables parallel-quality comparisons within the limits of the semi-automatic construction. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper constructs XQ-MEval via automatic MQM error injection into gold references followed by native-speaker filtering and error merging to produce pseudo-translations of controllable quality levels. It then performs direct empirical comparisons of nine metrics' scores against these levels across nine directions, identifies inconsistencies with averaging, and derives a post-hoc normalization from the resulting score distributions. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the chain; the central claims rest on independent empirical outcomes from the newly generated data rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that MQM error injection plus native-speaker filtering yields reliable parallel-quality examples; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption MQM-defined errors can be automatically injected into gold translations to create controllable quality levels
    Core step in dataset construction described in abstract
  • domain assumption Native speaker filtering ensures the pseudo translations have reliable quality labels
    Required for the dataset to serve as a valid benchmark

pith-pipeline@v0.9.0 · 5514 in / 1191 out tokens · 32495 ms · 2026-05-10T10:41:26.181591+00:00 · methodology
