{"paper":{"title":"Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A fine-tuned text-to-audio model converts high-resolution drum MIDI into matching audio while adopting a reference timbre.","cross_cats":["cs.AI"],"primary_cat":"cs.SD","authors_text":"Chihiro Nagashima, Christian Simon, Junghyun Koo, Keisuke Toyama, Kin Wai Cheuk, Qiyu Wu, Shusuke Takahashi, Shuyang Cui, Woosung Choi, Yukara Ikemiya, Zachary Novack, Zhi Zhong","submitted_at":"2026-05-14T08:32:38Z","abstract_excerpt":"Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial efforts of creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing ``Break-the-Beat!,'' a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The proposed content encoder and hybrid conditioning mechanism, when applied via fine-tuning to a pre-trained text-to-audio model, can effectively handle polyphonic percussive drum synthesis using the constructed paired dataset.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Break-the-Beat! 
renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A fine-tuned text-to-audio model converts high-resolution drum MIDI into matching audio while adopting a reference timbre.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5c47a84b88b58717031dffda80cc4541fe70f28ceb974413200107daa6b04b2b"},"source":{"id":"2605.14555","kind":"arxiv","version":1},"verdict":{"id":"18c5acda-e369-4b32-8593-f0934f88f390","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T01:23:43.769584Z","strongest_claim":"our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity.","one_line_summary":"Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The proposed content encoder and hybrid conditioning mechanism, when applied via fine-tuning to a pre-trained text-to-audio model, can effectively handle polyphonic percussive drum synthesis using the constructed paired dataset.","pith_extraction_headline":"A fine-tuned text-to-audio model converts high-resolution drum MIDI into matching audio while adopting a reference timbre."},"references":{"count":45,"sample":[{"doi":"","year":null,"title":"INTRODUCTION In digital music production, drums play a foundational role in shaping the rhythm, energy, and overall character of a composition. 
Conventional workflows for creating expressive drum ","work_id":"f486bd62-b2e8-4def-a208-a348ed4f9f9d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis","work_id":"8b9ee716-c084-427f-8765-cb578f3a0d00","ref_index":2,"cited_arxiv_id":"2605.14555","is_internal_anchor":true},{"doi":"","year":null,"title":"1 shows the overview of our proposed method","work_id":"c5c17aa4-ca53-4c29-932d-d831dc3cf80b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2048,"title":"EXPERIMENTS 4.1. Data We train and evaluate our approach on two variations of the Groove MIDI Dataset (GMD)[30], which consists of 1059 unique human-performed MIDI drum sequences aligned with corresp","work_id":"3640e885-d350-4d88-83b9-b344cb0b9578","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"RESULTS our model’s key capabilities are evaluated in this section. 5.1. Temporal Granularity We train our proposed method with drum MIDI representations of different temporal resolutions. As expected","work_id":"782e4b3b-d4b3-44de-9e8e-1061e792d275","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":45,"snapshot_sha256":"801aac662602df17890034c2beb70220dcb3aa0d7ee430a9b58bafa00204e875","internal_anchors":4},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}