Encoder-dominated ASR models using text-only data via modality matching and downsampling achieve comparable performance to larger-decoder models on LibriSpeech, with simple random duration approaches proving effective.
What language model architecture and pretraining objective works best for zero-shot generalization?
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
citing papers explorer
-
Text-Utilization for Encoder-dominated Speech Recognition Models
Encoder-dominated ASR models using text-only data via modality matching and downsampling achieve comparable performance to larger-decoder models on LibriSpeech, with simple random duration approaches proving effective.
-
LLMs and Speech: Integration vs. Combination
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.