UniMoE-Audio: Unified speech and music generation with dynamic-capacity MoE. arXiv preprint arXiv:2510.13344, 2025d
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
AudioX-Turbo distills a Multimodal Diffusion Transformer into a 4-step student model for efficient multimodal anything-to-audio generation, trained on a new 9.2M-sample dataset IF-caps-Pro.
Citing papers explorer
- AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
  AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
- AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation
  AudioX-Turbo distills a Multimodal Diffusion Transformer into a 4-step student model for efficient multimodal anything-to-audio generation, trained on a new 9.2M-sample dataset IF-caps-Pro.