arxiv: 2510.03093 · v2 · submitted 2025-10-03 · 💻 cs.CL · cs.SD

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando

Pith reviewed 2026-05-18 10:24 UTC · model grok-4.3

keywords: speech-to-text translation · direct prompting · chain-of-thought · LLM-based S2TT · scaling behavior · pseudo-labeling · speech LLMs

The pith

Direct prompting improves more consistently than CoT as S2TT data scales up.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares direct prompting against Chain-of-Thought prompting in LLM-based speech-to-text translation systems. CoT has been favored because it can draw on large ASR and text-translation datasets by first transcribing then translating. To test behavior at scale, the authors pseudo-label an ASR corpus into six European languages to create S2TT training sets of increasing size. Experiments show that direct prompting, which maps speech straight to target text, gains more steadily as data volume grows. This indicates that larger dedicated S2TT resources could make direct prompting the stronger choice.

Core claim

Direct prompting for LLM-based speech-to-text translation improves more consistently than CoT prompting as the amount of S2TT data increases; CoT benefits less from additional end-to-end translation examples. The evidence comes from training on pseudo-labeled data, built by translating the transcriptions of an ASR corpus into six European languages.

What carries the argument

Comparison of direct versus CoT prompting strategies in LLM-based S2TT models trained on increasing volumes of pseudo-labeled speech translation data.
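To make the two strategies concrete, the sketch below builds the training target for one pseudo-labeled sample under each regime. The `Sample` schema, the prompt wording, and the `Transcript:` / `Translation:` layout are illustrative assumptions; the paper's actual templates are not given on this page.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One pseudo-labeled S2TT example (hypothetical schema)."""
    audio_path: str   # path to the source speech
    transcript: str   # ASR reference transcription
    translation: str  # machine-translated target (the pseudo-label)
    src_lang: str
    tgt_lang: str


def direct_example(s: Sample) -> dict:
    # Direct: the model maps speech straight to target-language text.
    prompt = f"Translate the {s.src_lang} speech into {s.tgt_lang}."
    return {"audio": s.audio_path, "prompt": prompt, "target": s.translation}


def cot_example(s: Sample) -> dict:
    # CoT: the model first emits the transcript, then the translation.
    prompt = f"Transcribe the {s.src_lang} speech, then translate it into {s.tgt_lang}."
    target = f"Transcript: {s.transcript}\nTranslation: {s.translation}"
    return {"audio": s.audio_path, "prompt": prompt, "target": target}


# Made-up sample for illustration.
sample = Sample("clip_0001.wav", "bon dia a tothom", "good morning everyone", "ca", "en")
direct = direct_example(sample)
cot = cot_example(sample)
```

Both examples come from the same sample, so the only difference between the regimes in this sketch is whether the transcript appears in the target sequence.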

If this is right

Direct prompting may become the preferred method once larger S2TT datasets exist.
End-to-end S2TT models could avoid explicit transcription steps without losing accuracy.
System design may shift toward collecting more direct speech-translation pairs rather than separate ASR and T2TT resources.
Scaling laws for speech translation may favor simpler direct mappings at high data regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The advantage of direct prompting could reduce the need for explicit reasoning chains when sufficient paired data is available.
The pattern might appear in other speech tasks where end-to-end supervision can replace cascaded steps.
Real human S2TT corpora would provide a cleaner test of whether the scaling benefit holds beyond pseudo-labels.

Load-bearing premise

The pseudo-labeling process that translates ASR transcriptions into target languages produces training data whose quality and distribution do not systematically favor one prompting strategy over the other.

What would settle it

Train both prompting strategies on a large collection of human-annotated real S2TT data at multiple scales and check whether direct prompting still shows stronger and more consistent gains.
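One simple way to quantify "more consistent gains" is the fraction of data-scale increases that actually improve the score. The sketch below assumes one quality score per data scale for each strategy; the function names are hypothetical and the numbers are placeholders, not results from the paper.

```python
def gains(scores):
    """Score change between consecutive data scales."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def consistency(scores):
    """Fraction of scale increases that improved the score."""
    deltas = gains(scores)
    return sum(d > 0 for d in deltas) / len(deltas)


# Placeholder scores at four increasing data scales.
# Illustration only: these are NOT values reported in the paper.
direct_scores = [20.0, 21.0, 22.5, 23.0]
cot_scores = [22.0, 22.4, 22.1, 22.6]

direct_consistency = consistency(direct_scores)  # every step improves
cot_consistency = consistency(cot_scores)        # one step regresses
```

A measure like this captures monotonicity only; the size of each gain and the run-to-run variance would still need to be reported separately.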

Original abstract

Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.
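The data construction described in the abstract can be sketched as two steps: translate each ASR transcription into the other languages, then draw training sets of increasing size. The language codes are the six listed in the paper's setup; everything else here is an assumption, including the `translate` callback (a stand-in for whatever MT system is used), the record layout, and the choice to make the scales nested.

```python
import random
from typing import Callable

# Catalan, German, English, Spanish, French, Italian.
LANGS = ["ca", "de", "en", "es", "fr", "it"]


def pseudo_label(asr_corpus, translate: Callable[[str, str, str], str]):
    """Turn (audio, transcript, src_lang) triples into pseudo-labeled S2TT records."""
    records = []
    for audio, transcript, src in asr_corpus:
        for tgt in LANGS:
            if tgt == src:
                continue  # skip translating into the source language
            records.append({
                "audio": audio,
                "src": src,
                "tgt": tgt,
                "transcript": transcript,
                "translation": translate(transcript, src, tgt),
            })
    return records


def nested_subsets(data, sizes, seed=0):
    """Shuffle once and take prefixes, so each smaller scale is contained in the larger ones."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


def fake_translate(text, src, tgt):
    # Stand-in translator; a real pipeline would call an MT model here.
    return f"[{src}->{tgt}] {text}"


corpus = [(f"clip_{i}.wav", f"utterance {i}", "en") for i in range(4)]
pairs = pseudo_label(corpus, fake_translate)  # 4 clips x 5 target languages
scales = nested_subsets(pairs, sizes=[5, 10, 20])
```

Nesting the subsets keeps the comparison across scales from being confounded by which examples happen to be sampled, though the paper may have sampled its scales differently.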

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Direct prompting scales more consistently than CoT as S2TT data grows, but the pseudo-labeling step could be tilting the comparison.

The main point is that this paper shows direct prompting improving more steadily than CoT prompting once you scale up the amount of speech-to-text translation data. They take an ASR corpus, generate pseudo-labels by translating the transcripts into six European languages, and then train the same LLM-based models under both prompting styles at several data sizes. The result is that direct catches up and becomes more reliable as the S2TT volume increases, which runs against the usual story that CoT wins by breaking the task into ASR then translation steps that can use separate large datasets.

Referee Report

2 major / 1 minor

Summary. The paper claims that Direct prompting scales better than Chain-of-Thought (CoT) prompting for LLM-based Speech-to-Text Translation (S2TT) as the volume of S2TT training data grows. The evidence is a data-scaling experiment that pseudo-labels an ASR corpus by machine-translating its transcriptions into six European languages and then trains Speech LLM systems under both prompting regimes at multiple data scales.

Significance. If the central empirical observation holds, the work provides evidence that end-to-end Direct modeling may eventually surpass the modular CoT approach once sufficiently large native S2TT corpora exist, reducing reliance on separate ASR and T2TT resources. The systematic scaling design itself is a positive contribution, as it directly tests the data-efficiency hypothesis rather than reporting single-point comparisons.

major comments (2)

[Data preparation / experimental setup] The pseudo-labeling procedure used to create the S2TT training targets (ASR transcriptions translated into target languages) receives no description of the translation model, filtering criteria, or error-rate validation against gold S2TT references. Because CoT explicitly factors the task into transcription followed by translation, any systematic domain shift or noise pattern in the generated targets could penalize CoT more than Direct, undermining the scaling comparison.
[Results] The results section reports that Direct 'improves more consistently' with data scale but supplies no concrete metrics (e.g., BLEU or COMET), model sizes, exact data-scale points, variance across runs, or statistical significance tests for the observed trends. These omissions make it impossible to assess whether the claimed differential scaling is robust or merely suggestive.

minor comments (1)

[Abstract] The abstract and title would benefit from naming the base Speech LLM and the six target languages to allow readers to gauge the scope of the claimed generalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for the constructive major comments. We address each point below and will revise the manuscript to improve transparency and completeness.

Referee: [Data preparation / experimental setup] The pseudo-labeling procedure used to create the S2TT training targets (ASR transcriptions translated into target languages) receives no description of the translation model, filtering criteria, or error-rate validation against gold S2TT references. Because CoT explicitly factors the task into transcription followed by translation, any systematic domain shift or noise pattern in the generated targets could penalize CoT more than Direct, undermining the scaling comparison.

Authors: We agree that the manuscript would benefit from explicit details on the pseudo-labeling pipeline. In the revision we will add a dedicated subsection specifying the machine translation system used to generate the target-language labels from the ASR transcriptions, the filtering criteria applied to retain high-quality pseudo-labels, and any quantitative validation (e.g., BLEU or error rates) performed against available gold S2TT references. On the potential bias concern: both Direct and CoT models are trained on exactly the same pseudo-labeled corpus, so any systematic noise or domain shift is shared. Nevertheless, we acknowledge that CoT's explicit transcription step could interact differently with noisy targets; we will add a brief discussion of this possibility and its implications for interpreting the scaling results. revision: yes
Referee: [Results] The results section reports that Direct 'improves more consistently' with data scale but supplies no concrete metrics (e.g., BLEU or COMET), model sizes, exact data-scale points, variance across runs, or statistical significance tests for the observed trends. These omissions make it impossible to assess whether the claimed differential scaling is robust or merely suggestive.

Authors: We accept that the current presentation relies primarily on qualitative descriptions of the scaling curves and would be strengthened by quantitative detail. In the revised manuscript we will insert a table reporting BLEU and COMET scores for both prompting regimes at every data scale examined, state the exact model sizes and architectures, list the precise data volumes (e.g., 10k, 50k, 100k, … examples), and include error bars or run-to-run variance where multiple seeds were used. If formal statistical significance tests were not conducted, we will either perform them on the existing results or explicitly note this limitation. These additions will allow readers to evaluate the robustness of the differential scaling claim directly. revision: yes
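The significance testing discussed in this exchange could take the form of paired bootstrap resampling over the test set. The sketch below is a generic version of that technique, not the authors' procedure: it assumes segment-level scores for two systems on the same test segments (for a corpus-level metric such as BLEU, the metric would instead be recomputed on each resample), and the scores shown are synthetic.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A's mean score beats system B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Sample segment indices with replacement; use the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins += mean_a > mean_b
    return wins / n_resamples


# Synthetic segment-level scores, for illustration only.
sys_a = [0.80, 0.75, 0.90, 0.85, 0.70, 0.95]
sys_b = [0.78, 0.70, 0.88, 0.80, 0.69, 0.90]
win_rate = paired_bootstrap(sys_a, sys_b)
```

A win rate near 1.0 (or near 0.0) suggests the difference is stable under resampling of the test set; values near 0.5 suggest the two systems cannot be separated on this data.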

Circularity Check

0 steps flagged

No circularity: empirical scaling comparison is self-contained

The paper reports experimental results from training Speech LLM systems with Direct and CoT prompting on pseudo-labeled S2TT data at multiple scales. The central claim—that Direct prompting scales more consistently—is an observation from these runs rather than any derivation, equation, or self-referential definition. No load-bearing steps reduce to fitted inputs, self-citations, or ansatzes; the work is a standard empirical comparison without mathematical reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the domain assumption that pseudo-labeled data is sufficiently clean for fair comparison and that the observed scaling trends generalize beyond the tested European languages and model family.

axioms (1)

domain assumption: Pseudo-labeling ASR transcriptions yields S2TT training data whose noise characteristics do not bias the direct versus CoT comparison.
Invoked when the authors create additional S2TT data by translating existing ASR transcripts.

pith-pipeline@v0.9.0 · 5705 in / 1113 out tokens · 37941 ms · 2026-05-18T10:24:54.580061+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 4 internal anchors

[1] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting? Introduction. End-to-End (E2E) S2TT is increasingly preferred over the traditional cascaded pipeline of ASR followed by T2TT. Unlike cascaded approaches, E2E systems avoid error propagation across modules and can in principle exploit acoustic and prosodic cues that are lost in intermediate transcripts. Recent studies show that the performance of E2E ...

[2] Methodology. We systematically compare how CoT and Direct prompting strategies scale in S2TT. To do so, we generate pseudo-labeled S2TT data (S2TT_pl) and evaluate a series of models trained with varying amounts of it. 2.1. Translation pseudo-labeling. We generate pseudo-labeled data by translating the transcriptions of an ASR dataset (Figure 1). We use...

[3] Experimental setup. 3.1. Data. Our training framework includes four types of datasets: ASR, T2TT, S2TT and pseudo-labeled S2TT (S2TT_pl) in six languages: Catalan (ca), German (de), English (en), Spanish (es), French (fr), and Italian (it). Table 1 reports the amount of data per language. For ASR, we use the train splits of Common Voice 21.0 [17] and M...

[4] Results. Initial CoT_BASE superiority. We first verify that our baselines reproduce trends reported in prior work [4]. As expected, Table 2 shows that CoT_BASE outperforms Direct_BASE across all languages, confirming the benefit of decomposing the task into transcription and translation in this scenario. The average baseline gap is approximat...

[5] Conclusions and future directions. In this work, we systematically compare CoT and Direct prompting strategies for S2TT using pseudo-labeled data, which has become the standard approach for scaling this task. Our experiments show that CoT struggles to improve with more data, regardless of whether the transcription step is trained or not. In contrast, DIREC...

[6] Acknowledgements. This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia – Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje. FC and CEB acknowledge their AI4S fellowship within the “Generación D” initiative b...

[7] Y. Du, Y. Pan, Z. Ma, et al., “Making LLMs better many-to-many speech-to-text translators with curriculum learning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds., Vienna, Austria, July 2025, pp. 12466–12478, Association for Co...

[8] J. Wu, Y. Gaur, Z. Chen, et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.

[9] X. Chen, S. Zhang, Q. Bai, et al., “LLaST: Improved end-to-end speech translation system leveraged by large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds., Bangkok, Thailand, Aug. 2024, pp. 6976–6987, Association for Computational Linguistics.

[10] K. Hu, Z. Chen, C.-H. H. Yang, et al., “Chain-of-thought prompting for speech translation,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[11] Z. Huang, R. Ye, T. Ko, et al., “Speech Translation with Large Language Models: An Industrial Practice,” Dec. 2023, arXiv:2312.13585 [cs].

[12] Y. Du, Z. Ma, Y. Yang, et al., “CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought,” Sept. 2024, arXiv:2409.19510 [cs].

[13] J. Pino, L. Puzon, J. Gu, et al., “Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade,” in Proceedings of the 16th International Conference on Spoken Language Translation, J. Niehues, R. Cattoni, S. Stüker, et al., Eds., Hong Kong, Nov. 2-3 2019, Association for Computational Linguistics.

[14] C. Wang, J. Pino, and J. Gu, “Improving cross-lingual transfer learning for end-to-end speech recognition with speech translation,” in Interspeech 2020, 2020, pp. 4731–4735.

[15] A. Y. Hannun, C. Case, J. Casper, et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.

[16] D. Dale and M. R. Costa-jussà, “BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds., Miami, Florida, USA, Nov. 2024, pp. 16075–16085, Association for Computational Linguistics.

[17] P.-A. Duquenne, H. Schwenk, and B. Sagot, “SONAR: sentence-level multimodal and language-agnostic representations,” 2023.

[18] A. H. Kargaran, A. Imani, F. Yvon, and H. Schütze, “GlotLID: Language identification for low-resource languages,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[19] D. Zhang, S. Li, X. Zhang, et al., “SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds., Singapore, Dec. 2023, pp. 15757–15773, Association for Computational Linguistics.

[20] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, et al., “AudioPaLM: A Large Language Model That Can Speak and Listen,” June 2023, arXiv:2306.12925 [cs].

[21] M. Hassid, T. Remez, T. A. Nguyen, et al., “Textually Pretrained Speech Language Models,” in Thirty-seventh Conference on Neural Information Processing Systems, Nov. 2023.

[22] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, et al., “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, pp. 3451–3460, Oct. 2021.

[23] R. Ardila, M. Branson, K. Davis, et al., “Common Voice: A Massively-Multilingual Speech Corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, N. Calzolari, F. Béchet, P. Blache, et al., Eds., Marseille, France, May 2020, pp. 4218–4222, European Language Resources Association.

[24] V. Pratap, Q. Xu, A. Sriram, et al., “MLS: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020.

[25] J. Tiedemann, “Parallel data, tools and interfaces in OPUS,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, et al., Eds., Istanbul, Turkey, May 2012, European Language Resources Association (ELRA).

[26] J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, et al., “Europarl-ST: A multilingual corpus for speech translation of parliamentary debates,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8229–8233.

[27] C. Wang, A. Wu, J. Gu, and J. Pino, “CoVoST 2 and Massively Multilingual Speech Translation,” in Proc. Interspeech 2021, 2021, pp. 2247–2251.

[28] A. Conneau, M. Ma, S. Khanuja, et al., “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), Jan. 2023, pp. 798–805.

[29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting of the ACL, P. Isabelle, E. Charniak, and D. Lin, Eds., Philadelphia, Pennsylvania, USA, July 2002, pp. 311–318, Association for Computational Linguistics.

[30] M. Post, “A Call for Clarity in Reporting BLEU Scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers, O. Bojar, R. Chatterjee, C. Federmann, et al., Eds., Brussels, Belgium, Oct. 2018, pp. 186–191, Association for Computational Linguistics.

[31] N. M. Guerreiro, R. Rei, D. v. Stigt, et al., “xcomet: Transparent machine translation evaluation through fine-grained error detection,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 979–995, 2024.