OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
Son- icvisionlm: Playing sound with vision language models
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SD 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.