Findings of the WMT 25 Shared Task on Automated Translation Evaluation Systems: Linguistic Diversity is Challenging and References Still Help

Alon Lavie, Greg Hanneman, Sweta Agrawal, Diptesh Kanojia, Chi-Kiu Lo · 2025 · DOI 10.18653/v1/2025.wmt-1.24

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Explicit purpose instructions improve LLM translation adaptedness across 50 languages and 8 domains, with larger gains on informal text, while standard metrics often penalize the adapted outputs.

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Dynamic Meta-Metrics learns source-sentence conditioned combinations of MT metrics, with MLP-based and soft-conditioned versions showing gains over linear and GP ensembles on WMT data.

Misaligned by Reward: Socially Undesirable Preferences in LLMs

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

Reward models for LLMs frequently select socially undesirable options across four social domains, show no overall best performer, and exhibit a bias-avoidance versus context-sensitivity trade-off.

Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

LLMs generate Xiaohongshu-style posts that elicit social comparison but show stable failures in prompt-based detection of the same reader-grounded signal.

Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

cs.CL · 2026-07-02 · unverdicted · novelty 5.0

Meta-analysis of 33 ACL papers shows inconsistent LLM-as-a-Judge results, overtrust, and single-model reliance in multilingual/low-resource settings, with recommendations for better practice.

A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models

cs.CL · 2026-06-02 · unverdicted · novelty 5.0

Lexical richness is a robust linguistic signal for AI-generated text detection across models and domains, while most other features are context-dependent.

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

cs.CL · 2026-05-15 · unverdicted · novelty 5.0

Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

cs.CL · 2026-06-07 · unverdicted · novelty 4.0

HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.

Model-Based Quality Assessment for Massively Multilingual Parallel Data

cs.CL · 2026-05-29 · unverdicted · novelty 4.0

Large-scale benchmarks of multilingual embeddings and QE models show no universal performer; direction-aware routing and calibration recommended for parallel data assessment.

Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish

cs.CL · 2026-05-11 · unverdicted · novelty 4.0

Cross-lingual transfer and language-specific data efforts are interdependent and complementary for effective low-resource NLP, as demonstrated through Luxembourgish case studies and synthesis.

LLM Consumer Behavior Theory: Foundations of a Novel Research Field

cs.AI · 2026-06-16 · unverdicted · novelty 3.0

Introduces LLM Consumer Behavior Theory to analyze consumer behavior when LLMs serve as autonomous decision-making agents in markets.

ROC Analysis for Evaluating Translation Quality Estimation Systems

cs.CL · 2026-05-23 · unverdicted · novelty 3.0

ROC analysis is proposed for evaluating translation quality estimation systems, claimed to match existing methods while providing actionable business insights.

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

cs.CL · 2026-05-21 · unverdicted · novelty 3.0

Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.

FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

cs.CL · 2026-05-05 · unverdicted · novelty 3.0

A feature-based decision tree with parsing-derived signals and heuristics detects LLM-generated code in a lightweight, CPU-only setup for SemEval-2026 Task 13.

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

cs.CL · 2025-04-02 · unverdicted · novelty 3.0

A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

cs.CL · 2026-04-20

citing papers explorer

Showing 17 of 17 citing papers after filters.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations · cs.CL · 2026-05-13 · unverdicted · none · ref 108
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs · cs.CL · 2025-12-18 · unverdicted · none · ref 55
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent · cs.CL · 2026-06-02 · unverdicted · none · ref 17
Explicit purpose instructions improve LLM translation adaptedness across 50 languages and 8 domains, with larger gains on informal text, while standard metrics often penalize the adapted outputs.
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation · cs.CL · 2026-05-09 · unverdicted · none · ref 28
Dynamic Meta-Metrics learns source-sentence conditioned combinations of MT metrics, with MLP-based and soft-conditioned versions showing gains over linear and GP ensembles on WMT data.
Misaligned by Reward: Socially Undesirable Preferences in LLMs · cs.CL · 2026-05-06 · unverdicted · none · ref 155
Reward models for LLMs frequently select socially undesirable options across four social domains, show no overall best performer, and exhibit a bias-avoidance versus context-sensitivity trade-off.
Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect · cs.CL · 2026-05-01 · unverdicted · none · ref 155
LLMs generate Xiaohongshu-style posts that elicit social comparison but show stable failures in prompt-based detection of the same reader-grounded signal.
Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages · cs.CL · 2026-07-02 · unverdicted · none · ref 4
Meta-analysis of 33 ACL papers shows inconsistent LLM-as-a-Judge results, overtrust, and single-model reliance in multilingual/low-resource settings, with recommendations for better practice.
A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models · cs.CL · 2026-06-02 · unverdicted · none · ref 155
Lexical richness is a robust linguistic signal for AI-generated text detection across models and domains, while most other features are context-dependent.
CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs · cs.CL · 2026-05-15 · unverdicted · none · ref 27
Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.
HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task · cs.CL · 2026-06-07 · unverdicted · none · ref 18
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
Model-Based Quality Assessment for Massively Multilingual Parallel Data · cs.CL · 2026-05-29 · unverdicted · none · ref 58
Large-scale benchmarks of multilingual embeddings and QE models show no universal performer; direction-aware routing and calibration recommended for parallel data assessment.
Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish · cs.CL · 2026-05-11 · unverdicted · none · ref 167
Cross-lingual transfer and language-specific data efforts are interdependent and complementary for effective low-resource NLP, as demonstrated through Luxembourgish case studies and synthesis.
LLM Consumer Behavior Theory: Foundations of a Novel Research Field · cs.AI · 2026-06-16 · unverdicted · none · ref 272
Introduces LLM Consumer Behavior Theory to analyze consumer behavior when LLMs serve as autonomous decision-making agents in markets.
ROC Analysis for Evaluating Translation Quality Estimation Systems · cs.CL · 2026-05-23 · unverdicted · none · ref 13
ROC analysis is proposed for evaluating translation quality estimation systems, claimed to match existing methods while providing actionable business insights.
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild · cs.CL · 2026-05-21 · unverdicted · none · ref 68
Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.
FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals · cs.CL · 2026-05-05 · unverdicted · none · ref 239
A feature-based decision tree with parsing-derived signals and heuristics detects LLM-generated code in a lightweight, CPU-only setup for SemEval-2026 Task 13.
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation · cs.CL · 2025-04-02 · unverdicted · none · ref 174
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
