Pitfalls and Outlooks in Using COMET

Chen, Pinzhen, Lam, Tsz Kin, Moghe, Nikita, Haddow, Barry · 2024 · DOI 10.18653/v1/2024.wmt-1.121

6 Pith papers cite this work.

representative citing papers

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

MADE is a new multilingual agentic diagnosing engine that produces higher-quality diagnostic reports (47% better than baseline) on a large-scale evaluation substrate covering 33 model families and 26 languages.

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

cs.CL · 2026-06-07 · unverdicted · novelty 4.0

HydraQE is a new end-to-end speech translation QE system that outperforms cascaded baselines. It combines a Qwen3-ASR backbone, sparsemax layer mixing, a bidirectional Transformer, and multi-task curriculum training on human and pseudo labels.

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

cs.LG · 2026-05-06 · unverdicted · novelty 4.0

Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.

citing papers explorer

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
cs.CL · 2025-12-18 · unverdicted · none · ref 110

Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding
cs.CL · 2026-06-21 · unverdicted · none · ref 262

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights
cs.CL · 2026-06-05 · unverdicted · none · ref 59

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task
cs.CL · 2026-06-07 · unverdicted · none · ref 33

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
cs.LG · 2026-05-06 · unverdicted · none · ref 157

SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
cs.CL · 2026-05-04 · unverdicted · none · ref 261
