Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascaded systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but the most recent SpeechLLMs can match or even outperform cascades in various settings, while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

fields

cs.CL 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Pearmut: Human Evaluation of Translation Made Trivial

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

Pearmut is a platform that makes end-to-end human evaluation of translations as easy as automatic metrics by supporting DA, ESA, and MQM, along with features like document context and attention checks.

citing papers explorer

Showing 1 of 1 citing paper.

  • Pearmut: Human Evaluation of Translation Made Trivial cs.CL · 2026-01-06 · unverdicted · none · ref 6 · internal anchor
