Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Ailin Huang et al. · 2025 · arXiv 2506.08967

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

cs.SD · 2026-04-20 · unverdicted · novelty 6.0

Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

WavAlign introduces an adaptive hybrid post-training recipe that makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring and regulating their mixture to yield better semantic quality and expressiveness.

citing papers explorer

Showing 4 of 4 citing papers.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing · cs.CL · 2026-05-07 · unverdicted · none · ref 14
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints · cs.SD · 2026-04-20 · unverdicted · none · ref 20
Step-Audio 2 Technical Report · cs.CL · 2025-07-22 · unverdicted · none · ref 31
WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training · cs.AI · 2026-04-16 · unverdicted · none · ref 2
