M etric X -24: The G oogle Submission to the WMT 2024 Metrics Shared Task

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, Markus Freitag · 2024 · DOI 10.18653/v1/2024.wmt-1.35

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

open at publisher browse 6 citing papers

representative citing papers

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data.

Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

cs.LG · 2026-05-06 · unverdicted · novelty 4.0

Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.

citing papers explorer

Showing 6 of 6 citing papers.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations cs.CL · 2026-05-13 · unverdicted · none · ref 39
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs cs.CL · 2025-12-18 · unverdicted · none · ref 45
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation cs.CL · 2026-05-09 · unverdicted · none · ref 12
Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data.
Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains cs.CL · 2026-04-19 · unverdicted · none · ref 23
Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization cs.LG · 2026-05-06 · unverdicted · none · ref 71
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures cs.CL · 2026-05-04 · unverdicted · none · ref 175
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.

M etric X -24: The G oogle Submission to the WMT 2024 Metrics Shared Task

fields

years

verdicts

representative citing papers

citing papers explorer