VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
WavAlign introduces an adaptive hybrid post-training recipe that makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring and regulating their mixture to yield better semantic quality and expressiveness.
citing papers explorer
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
-
WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
WavAlign introduces an adaptive hybrid post-training recipe that makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring and regulating their mixture to yield better semantic quality and expressiveness.