pith. sign in

MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it
abstract

While end-to-end lyrics-to-song models offer convenience for casual users, professional songwriters require score-to-song systems that allow them to retain authorship over the core melody. However, existing score-to-song methods are limited to short-form snippets and fail to maintain coherence in long-form generation, particularly during vocal-silent sections like intros and bridges. To address this long-form bottleneck, we propose MIDI-informed singing accompaniment generation (MIDI-SAG). Unlike conventional audio-only models, MIDI-SAG utilizes symbolic timing and chord information derived from the vocal MIDI to provide a stable musical roadmap. By incorporating structure planning, which defines temporal boundaries and semantic labels, our framework facilitates consistent generation across both vocal and non-vocal sections. We demonstrate the feasibility of this compositional pipeline by leveraging specialized pre-trained modules, enabling data-efficient training on a single GPU. Our experiments show the potential of this approach for both professional score-to-song and general lyrics-to-song tasks. While an early exploration, MIDI-SAG suggests a promising direction for structured, long-form music synthesis. Audio demos are available, and the code will be open-sourced at https://composerflow.github.io/web_revealed/.

citation-role summary

background 1

citation-polarity summary

fields

cs.SD 1

years

2026 1

verdicts

UNVERDICTED 1

roles

background 1

polarities

background 1

representative citing papers

citing papers explorer

Showing 1 of 1 citing paper.

  • LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation cs.SD · 2026-04-13 · unverdicted · none · ref 46 · internal anchor

    LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.