Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data.
Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.
citing papers explorer
-
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation
Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data.
-
CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs
Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.
-
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
-
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.