SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
arXiv preprint arXiv:2509.15845
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
citing papers explorer
- Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
  SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.