Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.
Are LLM s Breaking MT Metrics? Results of the WMT 24 Metrics Shared Task
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.
citing papers explorer
-
Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.
-
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
-
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.