Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
Findings of the WMT 25 Shared Task on Automated Translation Evaluation Systems: Linguistic Diversity is Challenging and References Still Help
18 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
Explicit purpose instructions improve LLM translation adaptedness across 50 languages and 8 domains, with larger gains on informal text, while standard metrics often penalize the adapted outputs.
Dynamic Meta-Metrics learns source-sentence conditioned combinations of MT metrics, with MLP-based and soft-conditioned versions showing gains over linear and GP ensembles on WMT data.
Reward models for LLMs frequently select socially undesirable options across four social domains, show no overall best performer, and exhibit a bias-avoidance versus context-sensitivity trade-off.
LLMs generate Xiaohongshu-style posts that elicit social comparison but show stable failures in prompt-based detection of the same reader-grounded signal.
Meta-analysis of 33 ACL papers shows inconsistent LLM-as-a-Judge results, overtrust, and single-model reliance in multilingual/low-resource settings, with recommendations for better practice.
Lexical richness is a robust linguistic signal for AI-generated text detection across models and domains, while most other features are context-dependent.
Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
Large-scale benchmarks of multilingual embeddings and QE models show no universal performer; direction-aware routing and calibration recommended for parallel data assessment.
Cross-lingual transfer and language-specific data efforts are interdependent and complementary for effective low-resource NLP, as demonstrated through Luxembourgish case studies and synthesis.
Introduces LLM Consumer Behavior Theory to analyze consumer behavior when LLMs serve as autonomous decision-making agents in markets.
ROC analysis is proposed for evaluating translation quality estimation systems, claimed to match existing methods while providing actionable business insights.
Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.
A feature-based decision tree with parsing-derived signals and heuristics detects LLM-generated code in a lightweight, CPU-only setup for SemEval-2026 Task 13.
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
citing papers explorer
-
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
-
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
-
Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent
Explicit purpose instructions improve LLM translation adaptedness across 50 languages and 8 domains, with larger gains on informal text, while standard metrics often penalize the adapted outputs.
-
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation
Dynamic Meta-Metrics learns source-sentence conditioned combinations of MT metrics, with MLP-based and soft-conditioned versions showing gains over linear and GP ensembles on WMT data.
-
Misaligned by Reward: Socially Undesirable Preferences in LLMs
Reward models for LLMs frequently select socially undesirable options across four social domains, show no overall best performer, and exhibit a bias-avoidance versus context-sensitivity trade-off.
-
Psychologically Potent, Computationally Invisible: LLMs Generate Social-Comparison-Eliciting Posts They Fail to Detect
LLMs generate Xiaohongshu-style posts that elicit social comparison but show stable failures in prompt-based detection of the same reader-grounded signal.
-
Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages
Meta-analysis of 33 ACL papers shows inconsistent LLM-as-a-Judge results, overtrust, and single-model reliance in multilingual/low-resource settings, with recommendations for better practice.
-
A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models
Lexical richness is a robust linguistic signal for AI-generated text detection across models and domains, while most other features are context-dependent.
-
CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs
Small open-source LLMs achieve competitive system-level correlations with human judgments in machine translation quality estimation, outperforming traditional neural metrics and fine-tuned models via single-pass multi-output prompting.
-
HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
-
Model-Based Quality Assessment for Massively Multilingual Parallel Data
Large-scale benchmarks of multilingual embeddings and QE models show no universal performer; direction-aware routing and calibration recommended for parallel data assessment.
-
Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish
Cross-lingual transfer and language-specific data efforts are interdependent and complementary for effective low-resource NLP, as demonstrated through Luxembourgish case studies and synthesis.
-
LLM Consumer Behavior Theory: Foundations of a Novel Research Field
Introduces LLM Consumer Behavior Theory to analyze consumer behavior when LLMs serve as autonomous decision-making agents in markets.
-
ROC Analysis for Evaluating Translation Quality Estimation Systems
ROC analysis is proposed for evaluating translation quality estimation systems, claimed to match existing methods while providing actionable business insights.
-
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild
Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.
-
FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals
A feature-based decision tree with parsing-derived signals and heuristics detects LLM-generated code in a lightweight, CPU-only setup for SemEval-2026 Task 13.
-
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.