Analyzable chain-of-musical-thought prompting for high-fidelity music generation. arXiv preprint arXiv:2503.19611
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.SD (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
representative citing papers
- JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions
  JenBridge pretrains a flow-matching Transformer on text-audio data, then adapts it with video conditioning and an LLM director that selects transitions, claiming better coherence than prior methods on a new LVS benchmark.
- UniVocal: Unified Speech-Singing Code-Switching Synthesis
  UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
- LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training
  LeVo 2 presents a hierarchical LLM-Diffusion model with progressive post-training stages to generate full-length songs that balance semantic planning, track-specific acoustics, and musicality.
- SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling
  SketchSong uses temporal sketch planning with high-level tokens and explicit modeling of four tracks (vocals, bass, drums, other) to generate more coherent songs than baselines.