Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

Chenghan Lin; Chenrui Cui; Chunyu Qiang; Guochen Yu; Jianwu Dang; Longbiao Wang; Tianrui Wang; Xie Chen; Xingyu Ma; Xuanchen Li

arxiv: 2606.10368 · v1 · pith:Z5FEAEY7new · submitted 2026-06-09 · 💻 cs.SD · cs.AI

Xuanchen Li, Tianrui Wang, Yuheng Lu, Zikang Huang, Yu Jiang, Chenghan Lin, Chenrui Cui, Ziyang Ma, Xingyu Ma, Chunyu Qiang, Guochen Yu, Xie Chen, Longbiao Wang, Jianwu Dang

Pith reviewed 2026-06-27 12:01 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords speech recognition · speech translation · continuous diffusion · flow matching · audio conditioning · latent space · error analysis

The pith

A single continuous-target diffusion model handles both speech recognition and speech translation, and its errors in both tasks trace to the same cause: close-distance confusion in the continuous latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts a pre-trained continuous language model to accept speech audio and generate text representations in continuous space for both recognition and translation. A frozen audio encoder feeds into a single linear projector that conditions the flow-matching process on noisy text latents, with added training steps that force attention to the audio input. On standard benchmarks the resulting model reaches competitive accuracy while revealing that surface-level differences between recognition and translation errors mask a shared cause: the model confuses nearby points in the continuous latent space. This observation supports the view that recognition and translation draw on one underlying semantic mapping process.

Core claim

ELF-S2T prepends audio-conditioned vectors to noisy text latents and performs flow-matching denoising inside the pre-trained continuous representation space. Audio forcing during training plus classifier-free guidance at inference keep the model from ignoring the speech input. Experiments on LibriSpeech and CoVoST2 yield competitive word-error and BLEU scores. Error analysis then shows that mistakes in both tasks arise from the same mechanism: close-distance confusion between points in the continuous latent space, indicating a common semantic mapping beneath recognition and translation.
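
As a rough illustration of the conditioning layout described above, the NumPy sketch below projects stand-in audio features with a single linear layer, forms a noisy text latent on a straight line between Gaussian noise (t = 0) and the clean latent (t = 1), and prepends the audio condition along the sequence axis. All sizes, the random stand-in tensors, and the function names are assumptions made for illustration; none of them come from the paper or its released code, and the linear interpolation path is a common flow-matching choice rather than a confirmed detail of ELF.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
T_AUDIO, D_AUDIO = 50, 1280   # audio frames, encoder feature width
T_TEXT, D_MODEL = 12, 512     # text positions, backbone latent width

def project_audio(feats, W, b):
    """Single linear projector: (T_audio, D_audio) -> (T_audio, D_model)."""
    return feats @ W + b

def noisy_text_latent(x1, t, rng):
    """Linear path from Gaussian noise (t=0) to the clean latent x1 (t=1)."""
    x0 = rng.standard_normal(x1.shape)
    return (1.0 - t) * x0 + t * x1, x0

def build_input(audio_cond, x_t):
    """Prepend the audio condition to the noisy text latent along time."""
    return np.concatenate([audio_cond, x_t], axis=0)

# Random stand-ins for the frozen encoder output and the clean text latent.
audio_feats = rng.standard_normal((T_AUDIO, D_AUDIO))
W = rng.standard_normal((D_AUDIO, D_MODEL)) * 0.02
b = np.zeros(D_MODEL)
x1 = rng.standard_normal((T_TEXT, D_MODEL))

x_t, x0 = noisy_text_latent(x1, t=0.3, rng=rng)
seq = build_input(project_audio(audio_feats, W, b), x_t)
print(seq.shape)  # (62, 512)
```

In this layout the backbone sees one sequence whose first 50 positions carry audio and whose last 12 carry the noisy text, so only the projector has to learn the mapping from audio features into the backbone's latent width.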

What carries the argument

Audio-conditioned flow-matching denoising of continuous text latents, driven by a linear projector on frozen Whisper features and enforced by audio forcing plus classifier-free guidance.
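
The inference side of this mechanism can be sketched as a guided Euler integration from t = 0 to t = 1. The snippet assumes one common classifier-free guidance convention (unconditional prediction plus w times the conditional offset) and a hypothetical `velocity_fn(x, t, cond)` model interface; the paper's exact parameterisation of the guidance scale w and its sampler are not stated in this summary, so treat both as placeholders. The toy velocity field exists only to make the example runnable.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. w = 1 recovers v_cond."""
    return v_uncond + w * (v_cond - v_uncond)

def euler_sample(velocity_fn, x0, audio_cond, w=2.0, K=32):
    """Integrate dx/dt = v from t=0 (noise) to t=1 with K Euler steps.

    velocity_fn(x, t, cond) is a hypothetical model interface; passing
    cond=None stands for the audio-dropped (unconditional) branch.
    """
    x = x0.copy()
    dt = 1.0 / K
    for k in range(K):
        t = k * dt
        v_c = velocity_fn(x, t, audio_cond)
        v_u = velocity_fn(x, t, None)
        x = x + dt * guided_velocity(v_c, v_u, w)
    return x

# Toy velocity field: the conditional branch points at a known target,
# the unconditional branch points at the origin.
target = np.array([1.0, -2.0, 0.5])

def toy_velocity(x, t, cond):
    goal = cond if cond is not None else np.zeros_like(x)
    return (goal - x) / (1.0 - t)

x_final = euler_sample(toy_velocity, np.zeros(3), target, w=1.0, K=32)
```

Each step calls the model twice, once with and once without the audio condition, so the cost of sampling grows with both K and the use of guidance.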

If this is right

The same model architecture reaches competitive accuracy on both ASR and S2TT benchmarks.
Errors in the two tasks share a single root: close-distance confusion inside the continuous latent space.
The continuous generation paradigm aligns with one semantic mapping process that serves both recognition and translation.
Audio forcing and classifier-free guidance successfully shift reliance from text pre-training to the audio condition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Methods that explicitly enlarge distances between nearby latent points could reduce errors across both tasks at once.
The same conditioning pattern might transfer to other speech tasks that map audio to semantic output.
Testing whether discrete-token models show analogous shared error patterns would clarify whether the finding is specific to continuous spaces.
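
One way to make "close-distance confusion" concrete is to look at the gap between the closest and second-closest token embeddings when a denoised latent is mapped back to a token. The sketch below assumes nearest-neighbour unembedding over a toy embedding table; the paper's actual unembedding step may differ (for example, a learned readout), and the table, latents, and function name are hypothetical.

```python
import numpy as np

def decode_with_margin(latents, embedding_table):
    """Nearest-neighbour unembedding plus a per-position confusion margin.

    latents:         (T, D) denoised continuous vectors
    embedding_table: (V, D) one row per token
    Returns the nearest token ids and the gap between the closest and
    second-closest squared distances; a small gap flags a position where
    a nearby token could easily be picked instead.
    """
    diffs = latents[:, None, :] - embedding_table[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)          # (T, V) squared distances
    order = np.argsort(d2, axis=1)
    rows = np.arange(latents.shape[0])
    best, second = order[:, 0], order[:, 1]
    margin = d2[rows, second] - d2[rows, best]
    return best, margin

# Toy table: tokens 1 and 2 sit very close together, token 3 is isolated.
table = np.array([[0.0, 0.0], [1.0, 0.0], [1.1, 0.0], [0.0, 5.0]])
latents = np.array([[1.04, 0.0], [0.1, 4.8]])
ids, margin = decode_with_margin(latents, table)
```

Here the first latent falls between two neighbouring tokens and gets a tiny margin, while the second lands near an isolated token and gets a large one. Under the stated assumption, enlarging such margins is what "enlarging distances between nearby latent points" would amount to.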

Load-bearing premise

The pre-trained text backbone plus one linear audio projector, audio forcing, and classifier-free guidance together suffice to make the model depend on the speech input rather than defaulting to its text-only training.

What would settle it

If error analysis after training shows that ASR and S2TT mistakes arise from qualitatively different causes in the latent space, or if removing audio forcing leaves performance unchanged, the shared-cause claim would not hold.
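
The ablation half of this test comes down to comparing an error metric with and without a usable audio condition. A minimal sketch for the ASR side follows, using the standard word-level edit distance to compute corpus WER; the example strings are made up for illustration and are not model outputs, and the helper names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, h in enumerate(hyp_words, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete a reference word
                curr[j - 1] + 1,         # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute or match
            )
        prev = curr
    return prev[-1]

def corpus_wer(references, hypotheses):
    """Total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return edits / max(words, 1)

# Made-up strings, not model outputs.
refs = ["the cat sat on the mat"]
with_audio = ["the cat sat on a mat"]
audio_replaced_by_noise = ["a dog ran"]

# A gap near zero would mean the output barely depends on the audio.
gap = corpus_wer(refs, audio_replaced_by_noise) - corpus_wer(refs, with_audio)
```

A large positive gap is consistent with the model relying on the speech input; it does not by itself show that the latent-space error patterns are audio-driven, which is the separate question the error analysis has to answer.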

Figures

Figures reproduced from arXiv: 2606.10368 by Chenghan Lin, Chenrui Cui, Chunyu Qiang, Guochen Yu, Jianwu Dang, Longbiao Wang, Tianrui Wang, Xie Chen, Xingyu Ma, Xuanchen Li, Yuheng Lu, Yu Jiang, Zikang Huang, Ziyang Ma.

**Figure 1.** ELF-S2T casts speech-to-text as audio-conditioned generation in a continuous text space. Starting from Gaussian noise at t = 0, the text latent is denoised toward the target under the audio condition, and tokens are unembedded only at the final step t = 1.

**Figure 2.** Overview of ELF-S2T. A frozen Whisper encoder and a single projector turn speech into an audio condition that is…

**Figure 3.** Sweeps of the audio-guidance scale w (a) and the sampler-step count K (b) on ELF-B under the audio-forcing recipe, for ASR (blue, WER) and S2TT (red, BLEU). WER axes are inverted so that up is better on both curves. In (b), K ∈ {32, 64, 128} is plotted against relative inference cost, normalised to K = 32.

Original abstract

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adapts ELF to audio-conditioned continuous generation for ASR and S2TT with an error analysis on shared latent confusion, but the abstract supplies no metrics and the conditioning strength is unproven.

The main contribution is extending the pre-trained ELF flow model to speech input by prepending a linear projection of frozen Whisper features to the noisy text latents, then using audio forcing in training and classifier-free guidance at inference to generate continuous targets for both recognition and translation. They also run an error analysis concluding that surface differences in ASR and S2TT mistakes trace back to the same close-distance confusion in the continuous space.

The work does release code and models, and it evaluates on the usual LibriSpeech and CoVoST2 sets. That combination of continuous-target modeling plus a concrete error analysis across the two tasks is not routine, and the public artifacts make it possible to check the claims.

The soft spots are straightforward. The abstract contains no WER, BLEU, or other numbers, no baseline comparisons, and no description of the error-analysis procedure, so the competitive-performance and shared-cause statements cannot be assessed from the given text. The architecture depends on a single linear projector plus the forcing and guidance tricks to prevent the model from ignoring the audio and defaulting to its text pre-training; if those mechanisms fall short, the latent-space error patterns would reflect the ELF text prior rather than an audio-driven mapping. The stress-test note identifies this exact risk, and nothing in the abstract rules it out.

This is for speech-processing researchers already following continuous or flow-based generation work. A reader in that niche could extract the conditioning recipe and the error-analysis framing, but the paper needs the quantitative details and ablations before the central argument can be taken as settled.

I would send it to peer review because the direction is coherent and the artifacts are available, even though the current write-up leaves the key claims without visible support.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ELF-S2T, an audio-conditioned continuous-target generative model for speech recognition (ASR) and speech-to-text translation (S2TT). It builds on the pre-trained Embedded Language Flows (ELF) backbone, conditions on speech via a frozen Whisper encoder plus single linear projector prepended to noisy text latents, and employs flow-matching denoising with audio forcing during training and classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 are reported to achieve competitive performance; error analysis concludes that surface-different ASR and S2TT errors share the same root cause of close-distance confusion in the continuous latent space, implying a common semantic mapping process.

Significance. If the performance claims and error analysis hold after verification that the model relies on audio conditioning, the work would indicate that continuous-target diffusion can unify ASR and S2TT under a shared latent-space mechanism, extending the continuous representation paradigm beyond discrete-token approaches. Public release of code and pretrained models is a clear strength supporting reproducibility.

major comments (2)

[Abstract and §3] Abstract and §3 (architecture/training): The central claim that ASR/S2TT errors arise from close-distance confusion inside an audio-conditioned continuous latent space (rather than the text prior) is load-bearing on the model actually using the Whisper+linear audio condition. The description of prepending the projector output, audio forcing, and CFG does not include ablations (e.g., WER/BLEU drop or error-pattern change when audio input is removed or replaced by noise) or diagnostics (attention maps, conditioning strength metrics) showing that these mechanisms override the ELF text backbone. Without such evidence the error analysis cannot substantiate the claimed audio-driven shared semantic mapping.
[§4] §4 (experiments): The abstract states 'competitive' performance on LibriSpeech and CoVoST2 yet supplies no WER, BLEU, baseline tables, statistical significance, or description of the error-analysis procedure. If these details exist later in the manuscript they must be explicitly cross-referenced to the abstract claim; otherwise the quantitative support for both the performance and the error-cause conclusion remains unverifiable.

minor comments (1)

[§3] Notation for the linear projector and the exact form of the audio-forcing loss should be defined with an equation in §3 to avoid ambiguity when readers attempt to reproduce the conditioning mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important points for strengthening the claims regarding audio conditioning and quantitative support. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additions.

Referee: [Abstract and §3] Abstract and §3 (architecture/training): The central claim that ASR/S2TT errors arise from close-distance confusion inside an audio-conditioned continuous latent space (rather than the text prior) is load-bearing on the model actually using the Whisper+linear audio condition. The description of prepending the projector output, audio forcing, and CFG does not include ablations (e.g., WER/BLEU drop or error-pattern change when audio input is removed or replaced by noise) or diagnostics (attention maps, conditioning strength metrics) showing that these mechanisms override the ELF text backbone. Without such evidence the error analysis cannot substantiate the claimed audio-driven shared semantic mapping.

Authors: We agree that the error analysis claim requires explicit evidence that the audio conditioning is actively used rather than being overridden by the pre-trained ELF text backbone. The manuscript describes the Whisper encoder, linear projector, prepending mechanism, audio forcing during training, and classifier-free guidance at inference as the means to incorporate audio. However, we acknowledge that no ablations or conditioning diagnostics are currently included. In the revised manuscript, we will add ablations (e.g., performance and error-pattern changes when audio is removed or replaced by noise) and any feasible diagnostics to demonstrate that the audio condition drives the shared semantic mapping. revision: yes
Referee: [§4] §4 (experiments): The abstract states 'competitive' performance on LibriSpeech and CoVoST2 yet supplies no WER, BLEU, baseline tables, statistical significance, or description of the error-analysis procedure. If these details exist later in the manuscript they must be explicitly cross-referenced to the abstract claim; otherwise the quantitative support for both the performance and the error-cause conclusion remains unverifiable.

Authors: The quantitative results (WER/BLEU scores, baselines, statistical details) and the error-analysis procedure are presented in §4. We agree that the abstract claim would benefit from explicit cross-references to these sections. In the revision, we will add direct references from the abstract to the relevant parts of §4 to ensure the support for competitive performance and the error analysis is immediately verifiable. revision: yes

Circularity Check

0 steps flagged

Empirical error analysis on public benchmarks exhibits no circular reduction

The paper constructs ELF-S2T by prepending a linear projection of frozen Whisper features to noisy ELF text latents and applies audio forcing plus CFG to encourage audio conditioning. The central claim—that ASR and S2TT errors share a close-distance confusion cause in continuous latent space—is obtained from post-training error inspection on LibriSpeech and CoVoST2 outputs rather than any equation that equates a reported quantity to a fitted parameter or prior self-citation. No self-definitional loop, fitted-input prediction, or load-bearing uniqueness theorem appears in the architecture or analysis; the derivation chain therefore remains independent of its own outputs and is validated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard flow-matching and pre-trained models without introducing new free parameters or entities; the core assumption is that the continuous latent space supports the claimed semantic mapping.

axioms (1)

domain assumption: Flow-matching denoising can be applied to continuous text latents when conditioned on audio embeddings from a separate encoder.
This is the central architectural choice described for ELF-S2T.
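
For readers who want the assumption in symbols, a standard conditional flow-matching objective on a linear noise-to-data path can be written as follows. The notation is an assumption introduced here for illustration: x_1 is the clean text latent, x_0 is Gaussian noise, c is the projected audio condition, and v_θ is the network's velocity prediction. The paper may use a different parameterisation or loss weighting, and the audio-forcing term is not specified in this summary, so it is not shown.

```latex
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1],

\mathcal{L}(\theta)
  = \mathbb{E}_{t,\;x_0,\;(x_1,\,c)}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2 .
```

With this convention t = 0 is pure noise and t = 1 is the target, matching the direction given in the Figure 1 caption.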

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous Language Diffusion as a Decoder-Interface Problem
cs.CL 2026-06 unverdicted novelty 7.0

Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated a...

Reference graph

Works this paper leans on

25 extracted references · 9 linked inside Pith · cited by 1 Pith paper

[1]

Austin, J.; Johnson, D. D.; Ho, J.; Tarlow, D.; and van den Berg, R. 2021. Structured denoising diffusion models in discrete state-spaces. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713845393

2021
[2]

Baas, M.; Eloff, K.; and Kamper, H. 2022. TransFusion: Transcribing Speech with Multinomial Diffusion. arXiv:2210.07677

arXiv 2022
[3]

Fathullah, Y.; Wu, C.; Lakomkin, E.; Jia, J.; Shangguan, Y.; Li, K.; Guo, J.; Xiong, W.; Mahadeokar, J.; Kalinli, O.; Fuegen, C.; and Seltzer, M. 2023. Prompting Large Language Models with Speech Recognition Abilities. arXiv:2307.11795

arXiv 2023
[4]

Guo, H.; Zhao, Q.; Zhao, Y.; Nie, S.; Zhu, R.; Guo, Q.; Wang, F.; Yang, T.; Zhao, H.; Wei, G.; and Zeng, Y. 2026. Continuous Latent Diffusion Language Model. arXiv:2605.06548

Pith/arXiv arXiv 2026
[5]

Ho, J.; and Salimans, T. 2022. Classifier-Free Diffusion Guidance. arXiv:2207.12598

Pith/arXiv arXiv 2022
[6]

Hu, K.; Qiu, L.; Lu, Y.; Zhao, H.; Li, T.; Kim, Y.; Andreas, J.; and He, K. 2026. ELF: Embedded Language Flows. arXiv:2605.10938

Pith/arXiv arXiv 2026
[7]

Kwon, T.; Ahn, J.; Yun, T.; Jwa, H.; Choi, Y.; Park, S.; Kim, N.-J.; Kim, J.; Ryu, H. G.; and Lee, H.-J. 2025. Whisfusion: Parallel ASR Decoding via a Diffusion Transformer. arXiv:2508.07048

Pith/arXiv arXiv 2025
[8]

Leng, S.; Xing, Y.; Cheng, Z.; Zhou, Y.; Zhang, H.; Li, X.; Zhao, D.; Lu, S.; Miao, C.; and Bing, L. 2024. The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio. arXiv:2410.12787

arXiv 2024
[9]

Lipman, Y.; Chen, R. T. Q.; Ben-Hamu, H.; Nickel, M.; and Le, M. 2023. Flow Matching for Generative Modeling. arXiv:2210.02747

Pith/arXiv arXiv 2023
[10]

Lou, A.; Meng, C.; and Ermon, S. 2024. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

2024
[11]

Ma, Z.; Yang, G.; Yang, Y.; Gao, Z.; Wang, J.; Du, Z.; Yu, F.; Chen, Q.; Zheng, S.; Zhang, S.; and Chen, X. 2024. An Embarrassingly Simple Approach for LLM with Strong ASR Capacity. arXiv:2402.08846

arXiv 2024
[12]

Ma, Z.; Yang, G.; Yang, Y.; Gao, Z.; Wang, J.; Du, Z.; Yu, F.; Chen, Q.; Zheng, S.; Zhang, S.; and Chen, X. 2025. Speech recognition meets large language model: benchmarking, models, and exploration. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intellig...

2025
[13]

Panayotov, V.; Chen, G.; Povey, D.; and Khudanpur, S. 2015. Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206--5210

2015
[14]

Radford, A.; Kim, J. W.; Xu, T.; Brockman, G.; McLeavey, C.; and Sutskever, I. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356

Pith/arXiv arXiv 2022
[15]

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2023. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683

Pith/arXiv arXiv 2023
[16]

Sahoo, S. S.; Arriola, M.; Schiff, Y.; Gokaslan, A.; Marroquin, E.; Chiu, J. T.; Rush, A.; and Kuleshov, V. 2024. Simple and effective masked diffusion language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24. Red Hook, NY, USA: Curran Associates Inc. ISBN 9798331314385

2024
[17]

Seamless Communication ; Barrault, L.; et al. 2023. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv:2308.11596

arXiv 2023
[18]

Tang, C.; Yu, W.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; and Zhang, C. 2024. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In The Twelfth International Conference on Learning Representations

2024
[19]

Wang, C.; Wu, A.; and Pino, J. 2020. CoVoST 2 and Massively Multilingual Speech-to-Text Translation. arXiv:2007.10310

arXiv 2020
[20]

Wang, D.; Li, J.; Cui, M.; Yang, D.; Chen, X.; and Meng, H. 2025. Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs. arXiv:2508.17863

arXiv 2025
[21]

Wu, H.; Tang, M.; Zheng, X.; and Jiang, H. 2025. When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models. arXiv:2508.10552

arXiv 2025
[22]

Xu, J.; Guo, Z.; He, J.; Hu, H.; He, T.; Bai, S.; Chen, K.; Wang, J.; Fan, Y.; Dang, K.; Zhang, B.; Wang, X.; Chu, Y.; and Lin, J. 2025 a . Qwen2.5-Omni Technical Report. arXiv:2503.20215

Pith/arXiv arXiv 2025
[23]

Xu, J.; Guo, Z.; Hu, H.; Chu, Y.; Wang, X.; He, J.; Wang, Y.; Shi, X.; He, T.; Zhu, X.; Lv, Y.; Wang, Y.; Guo, D.; Wang, H.; Ma, L.; Zhang, P.; Zhang, X.; Hao, H.; Guo, Z.; Yang, B.; Zhang, B.; Ma, Z.; Wei, X.; Bai, S.; Chen, K.; Liu, X.; Wang, P.; Yang, M.; Liu, D.; Ren, X.; Zheng, B.; Men, R.; Zhou, F.; Yu, B.; Yang, J.; Yu, L.; Zhou, J.; and Lin, J. 20...

Pith/arXiv arXiv 2025
[24]

Xu, Y.; Zhang, S.-X.; Yu, J.; Wu, Z.; and Yu, D. 2024. Comparing Discrete and Continuous Space LLMs for Speech Recognition. arXiv:2409.00800

arXiv 2024
[25]

Yu, W.; Tang, C.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; and Zhang, C. 2023. Connecting Speech Encoder and Large Language Model for ASR. arXiv:2309.13963

arXiv 2023
