Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Cheng-Zhi Anna Huang; Chinmay Talegaonkar; Haven Kim; Hugo Flores García; Julian McAuley; Nithya Shikarpur; Stephen Brade; Suwan Kim; Taylor Berg-Kirkpatrick; Valerie K. Chen

arXiv: 2605.22717 · v1 · pith:FOUBZ55Xnew · submitted 2026-05-21 · 💻 cs.SD · cs.AI · cs.LG · cs.MM

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

Pith reviewed 2026-05-22 03:21 UTC · model grok-4.3

classification 💻 cs.SD cs.AI cs.LG cs.MM

keywords: live music generation, diffusion models, interactive streaming, KV caching, post-training alignment, autoregressive music models, real-time synthesis, consumer hardware deployment

The pith

Live music diffusion models recover the inference speed of discrete autoregressive models through block-wise KV caching while adding stable alignment via ARC-Forcing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard audio diffusion models, which are bidirectional and non-streaming, can be turned into interactive generators suitable for live performance. A simple change to the diffusion process using block-wise KV caching brings their computational cost down to match or beat discrete Live Music Models. The same models then support a new post-training step called ARC-Forcing that reduces error buildup during long generations without any reinforcement learning or reward models. This combination lets the system run on ordinary consumer laptops and supports real-time uses such as text prompts, sketch input, jamming, and live collaboration with musicians.

Core claim

Live Music Diffusion Models modify the generative diffusion process with block-wise KV Caching to recover and then outperform the inference complexity of discrete Live Music Models, while the novel ARC-Forcing paradigm enables stable post-training alignment that reduces error accumulation without explicit RL or reward models.

What carries the argument

Block-wise KV caching applied inside the diffusion denoising loop, paired with ARC-Forcing for alignment during post-training.
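
The caching idea can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: `ToyLayer`, `encode_context`, `denoise_forward`, and `masked_full_forward` are hypothetical names, the layers are single-head attention with shared projections (the paper describes separate projections for context and target), and timestep conditioning is omitted. It shows that when clean context frames are masked from attending to noisy frames, their keys and values can be computed once per block and reused across denoising steps with identical output.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class ToyLayer:
    """Single-head attention layer with a residual connection (illustrative only)."""
    def __init__(self, dim, rng):
        self.wq, self.wk, self.wv = (
            rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
        )

    def kv(self, h):
        return h @ self.wk, h @ self.wv

    def attend(self, h, k, v, mask=0.0):
        scores = (h @ self.wq) @ k.T / np.sqrt(k.shape[-1]) + mask
        return h + softmax(scores) @ v

def encode_context(layers, ctx):
    """Run the clean context once; context frames attend only to context frames."""
    cache, h = [], ctx
    for layer in layers:
        k, v = layer.kv(h)
        cache.append((k, v))          # per-layer keys/values, fixed for the block
        h = layer.attend(h, k, v)
    return cache

def denoise_forward(layers, noisy, cache):
    """One network call on the noisy block, reusing cached context keys/values."""
    h = noisy
    for layer, (k_ctx, v_ctx) in zip(layers, cache):
        k_blk, v_blk = layer.kv(h)
        k = np.concatenate([k_ctx, k_blk])
        v = np.concatenate([v_ctx, v_blk])
        h = layer.attend(h, k, v)     # bidirectional within the block, plus context
    return h

def masked_full_forward(layers, ctx, noisy):
    """Reference path: recompute everything, masking context queries from noisy keys."""
    n_ctx = len(ctx)
    h = np.concatenate([ctx, noisy])
    mask = np.zeros((len(h), len(h)))
    mask[:n_ctx, n_ctx:] = -np.inf
    for layer in layers:
        k, v = layer.kv(h)
        h = layer.attend(h, k, v, mask)
    return h[n_ctx:]

rng = np.random.default_rng(0)
layers = [ToyLayer(8, rng) for _ in range(2)]
ctx = rng.standard_normal((6, 8))
cache = encode_context(layers, ctx)           # computed once per block
for step in range(4):                         # stand-in for denoising steps
    noisy = rng.standard_normal((4, 8))
    assert np.allclose(denoise_forward(layers, noisy, cache),
                       masked_full_forward(layers, ctx, noisy))
```

The equality check only holds because of the mask: with full bidirectional attention the context activations would change with every noisy input, and nothing could be cached.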

If this is right

LMDMs run locally on consumer gaming laptops for real-time text-conditioned music generation.
The same models support sketch-based synthesis and live jamming sessions.
ARC-Forcing allows continued post-training that keeps generations stable over long streams.
Artists can treat the model as a generative delay effect that transforms live improvisation in real time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The caching trick may transfer to other diffusion tasks that need streaming output, such as video or speech.
Removing the need for RL in alignment could make interactive music tools easier for open-source groups to maintain.
Live use as a generative instrument suggests diffusion models can serve as real-time co-creation partners rather than offline generators only.

Load-bearing premise

Block-wise KV caching can be inserted into the bidirectional diffusion process without creating new artifacts or requiring extra work that would erase the claimed speed gains.
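
The premise can be made concrete with a rough attention-cost count. This is a back-of-envelope sketch, not the paper's analysis: the symbols C (context frames), B (target frames per block), and S (denoising steps per block) are assumptions introduced here, the numeric values are illustrative, and only pairwise attention scores per layer are counted.

```latex
\begin{align*}
\text{recompute every step:}\quad
  & N_{\text{full}} = S\,(C+B)^2 \\
\text{context cached once per block:}\quad
  & N_{\text{cache}} = C^2 + S\,B\,(C+B) \\
\text{illustrative values } C=192,\ B=48,\ S=8:\quad
  & N_{\text{full}} = 8 \cdot 240^2 = 460{,}800 \\
  & N_{\text{cache}} = 192^2 + 8 \cdot 48 \cdot 240 = 129{,}024 \\
  & N_{\text{full}} / N_{\text{cache}} \approx 3.6
\end{align*}
```

Under these assumptions the saving grows with S and with the ratio C/B. It applies only if context frames never attend to noisy frames, which is the condition the premise depends on.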

What would settle it

A side-by-side latency and quality measurement on identical hardware showing that LMDMs either run slower than discrete LMMs during streaming or produce audible artifacts that grow over time.
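
Such a measurement could be set up along the following lines. This is a hedged sketch: `real_time_factor` and the `generate_block` callable are hypothetical names, and the harness assumes each call produces one block of `block_seconds` of audio. It reports the ratio of compute time to audio duration, so values below 1.0 mean the model keeps up with playback.

```python
import statistics
import time

def real_time_factor(generate_block, block_seconds, n_blocks=20, warmup=3):
    """Return (median, worst) compute-time / audio-duration ratio per block."""
    for _ in range(warmup):
        generate_block()              # discard warm-up calls (caches, lazy init)
    ratios = []
    for _ in range(n_blocks):
        start = time.perf_counter()
        generate_block()
        ratios.append((time.perf_counter() - start) / block_seconds)
    return statistics.median(ratios), max(ratios)
```

Running the same harness over an LMDM and a discrete LMM on identical hardware, and pairing it with a quality metric tracked over long streams, would cover both halves of the test described above. The worst-case ratio matters as much as the median for live use, since a single slow block causes an audible dropout.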

Figures

Figures reproduced from arXiv: 2605.22717 by Cheng-Zhi Anna Huang, Chinmay Talegaonkar, Haven Kim, Hugo Flores García, Julian McAuley, Nithya Shikarpur, Stephen Brade, Suwan Kim, Taylor Berg-Kirkpatrick, Valerie K. Chen, Zachary Novack.

**Figure 1.** Live Music Diffusion Models. Standard block-AR diffusion (top left) concatenates clean context with noisy states over all frames with full bidirectional attention, leaving no way to cache the context encoding despite it staying fixed for each block. LMDMs (right side) route the clean context and noisy target frames through separate projections and utilize custom attention masks to ensure clean context enco…

**Figure 2.** Difference in initial computation graph between standard block-AR diffusion (left) and …

**Figure 3.** ARC-Forcing. Gϕ is post-trained by generating AR rollouts across time with KV Caching, and then passing in the rollouts along with real music (with the same starting context and conditions) into the bidirectional Dψ, which uses a relativistic objective. Dψ is also trained with an auxiliary contrastive loss on real music with matching vs. mismatched captions to encourage text following. where P is a random …

**Figure 4.** Global Text-Conditioned metrics over time. In both Enc-Dec and Block-Causal LMDMs, …

**Figure 5.** Prompt Transitions using Enc-Dec LMDMs. the strong conditioning from past generations, and normal CFG resulted in severe over-saturation artifacts at the low step count used. To remedy this, we found two solutions: (1) whenever one prompt crosses a dominant weight over the other prompt, we drop out the first d = 180 frames of context, reducing the signal of past audio, and (2) we adapt CFG++ (Chung et al.,…

**Figure 6.** Relative CoCoLA score for Accompaniment LMDMs at variable tf. Given our best Enc-Dec setting from the text-conditioned experiments, we next test LMDMs for stem-conditioned accompaniment. In this accompaniment-like case, we mainly investigate how LMDMs perform with different amounts of future visibility: in settings where tf ≥ 0, the model can condition on stem audio that directly corresponds to the targ…

**Figure 7.** User interface of the system, built in JUCE, used in the user studies and performances.
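
The two discriminator signals named in the Figure 3 caption can be sketched as follows. This is an illustrative assumption, not the paper's objective: it uses a generic relativistic pairing loss with a softplus link, and the function names and the pairing of scores are hypothetical. `d_real` and `d_fake` stand for discriminator scores on real music and on generated rollouts that share the same starting context and conditions.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def relativistic_losses(d_real, d_fake):
    """Paired relativistic losses: each real score is compared with its rollout."""
    gap = np.asarray(d_real, dtype=float) - np.asarray(d_fake, dtype=float)
    d_loss = softplus(-gap).mean()    # discriminator wants real scored above fake
    g_loss = softplus(gap).mean()     # generator wants its rollouts scored above real
    return d_loss, g_loss

def contrastive_caption_loss(d_match, d_mismatch):
    """Auxiliary loss: real audio with its own caption should outscore a wrong caption."""
    gap = np.asarray(d_match, dtype=float) - np.asarray(d_mismatch, dtype=float)
    return softplus(-gap).mean()
```

Because only score differences enter the loss, no separate reward model or absolute quality scale is needed, which is consistent with the claim that alignment proceeds without explicit RL.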

Original abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

LMDMs show a workable path to efficient interactive diffusion music via block-wise KV caching and ARC-Forcing, though the caching mechanics need explicit validation against the bidirectional process.

The main point is that this paper demonstrates how to adapt open diffusion models for live music generation on consumer hardware. They use block-wise KV caching to bring inference cost down to or below discrete autoregressive live music models, and they introduce ARC-Forcing as a post-training alignment step that avoids explicit RL or reward models. That combination is the actual new piece beyond routine extensions of prior outpainting work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Live Music Diffusion Models (LMDMs) to adapt bidirectional audio diffusion models for interactive, streaming music generation. It identifies inefficiencies in block-wise outpainting diffusion and introduces a simple modification using block-wise KV caching to recover and surpass the inference complexity of discrete autoregressive Live Music Models (LMMs). Additionally, it presents the ARC-Forcing paradigm for stable post-training alignment without requiring explicit reinforcement learning or reward models. The work includes demonstrations in text-conditioned generation, sketch-based synthesis, jamming, and a live artist-AI collaboration running on consumer hardware.

Significance. If the efficiency claims hold and the method avoids introducing artifacts in the generated audio, this work would be significant for bridging the gap between widely available diffusion models and the interactive capabilities currently dominated by discrete AR models. The ability to perform post-training alignment without RL could simplify the development pipeline for aligned generative music systems. The real-world demonstration on consumer hardware strengthens the practical impact.

major comments (2)

§3.2 (Block-wise KV Caching): The description of the block-wise KV caching modification does not include equations or pseudocode specifying cache maintenance, invalidation logic, or masking adjustments across denoising timesteps while preserving the bidirectional attention required by the diffusion process. This detail is load-bearing for the central claim that the approach recovers and outperforms discrete LMM inference complexity without reintroducing quadratic costs or artifacts.
§4.1 (ARC-Forcing): The claim that ARC-Forcing enables stable post-training alignment by reducing error accumulation is presented without quantitative ablations, comparisons to RL baselines, or metrics on alignment stability. This is central to the assertion that the method works without explicit RL or reward models.

minor comments (2)

Abstract: The statement that standard block-wise outpainting diffusion has 'strictly worse computational efficiency' than discrete LMMs lacks specific metrics, FLOPs counts, or baseline references.
§5 (Applications): The description of the live jamming and artist collaboration would benefit from explicit timing measurements or latency figures to support the consumer-hardware claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of Live Music Diffusion Models for bridging diffusion and interactive generation. We address each major comment below with specific plans for revision where appropriate.

Referee: §3.2 (Block-wise KV Caching): The description of the block-wise KV caching modification does not include equations or pseudocode specifying cache maintenance, invalidation logic, or masking adjustments across denoising timesteps while preserving the bidirectional attention required by the diffusion process. This detail is load-bearing for the central claim that the approach recovers and outperforms discrete LMM inference complexity without reintroducing quadratic costs or artifacts.

Authors: We agree that formalizing the cache mechanics will improve clarity and reproducibility. In the revised manuscript we will add a dedicated subsection with equations defining the block-wise KV update (including how keys/values from prior blocks are cached and reused across denoising steps) and a pseudocode algorithm that explicitly shows cache maintenance, selective invalidation for the current block, and the masking scheme that preserves intra-block bidirectionality while preventing quadratic recomputation over the full sequence. These additions will directly support the efficiency claims without altering the core method.
revision: yes

Referee: §4.1 (ARC-Forcing): The claim that ARC-Forcing enables stable post-training alignment by reducing error accumulation is presented without quantitative ablations, comparisons to RL baselines, or metrics on alignment stability. This is central to the assertion that the method works without explicit RL or reward models.

Authors: We acknowledge that additional quantitative support would strengthen the section. The current manuscript demonstrates stability through successful deployment in live jamming and artist collaboration without RL, but we will add an ablation study in the revision reporting error accumulation curves (e.g., divergence from conditioning over successive blocks) with and without ARC-Forcing, plus a limited comparison against a supervised fine-tuning baseline. Full RL comparisons are resource-intensive and outside the paper's scope, yet the qualitative and practical results already illustrate the advantage of avoiding reward model training.
revision: partial
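
An error accumulation curve of the kind proposed here could be computed roughly as follows. This is a sketch under assumptions: `drift_curve` is a hypothetical helper, and the embeddings are presumed to come from some joint audio-text embedding model that is not specified in the source. It tracks the cosine similarity between each generated block and the conditioning, and summarizes drift as the slope of a least-squares line.

```python
import numpy as np

def drift_curve(block_embeddings, cond_embedding):
    """Cosine similarity of each generated block to the conditioning, plus its trend."""
    blocks = np.asarray(block_embeddings, dtype=float)   # shape (n_blocks, dim)
    cond = np.asarray(cond_embedding, dtype=float)       # shape (dim,)
    sims = blocks @ cond / (np.linalg.norm(blocks, axis=1) * np.linalg.norm(cond))
    slope = np.polyfit(np.arange(len(sims)), sims, deg=1)[0]
    return sims, slope
```

A slope near zero would indicate stable adherence to the conditioning over a long stream, while a clearly negative slope would indicate drift. Comparing the curves with and without ARC-Forcing on identical prompts is the ablation the response describes.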

Circularity Check

0 steps flagged

No significant circularity; derivations build on independent adaptations of diffusion and caching

The paper's core claims rest on proposing block-wise KV caching as a modification to bidirectional diffusion for interactive music generation and introducing ARC-Forcing for post-training alignment. These are presented as novel engineering adaptations rather than reductions to prior fitted parameters or self-referential definitions. No equations or steps in the abstract or described pipeline equate a 'prediction' directly to an input by construction, nor does the central result depend on a load-bearing self-citation chain. The derivation chain remains self-contained against external benchmarks like discrete LMM inference complexity, with the modifications grounded in standard diffusion mechanics and caching techniques without circular redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions about diffusion model behavior in outpainting settings and introduces new techniques without listing explicit free parameters or additional axioms beyond domain knowledge of audio generation.

axioms (1)

domain assumption: Block-wise outpainting diffusion pipelines can be made efficient for streaming via caching without quality degradation.
Invoked when identifying inefficiencies and proposing KV caching as the fix.

invented entities (1)

ARC-Forcing paradigm (no independent evidence)
purpose: Stable post-training alignment to reduce error accumulation without RL or reward models.
Newly introduced method described as novel in the abstract.

pith-pipeline@v0.9.0 · 5845 in / 1302 out tokens · 58929 ms · 2026-05-22T03:21:23.796841+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

[1]

MusicLM: Generating Music From Text

Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. MusicLM: Generating music from text. arXiv:2301.11325,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

The mtg-jamendo dataset for automatic music tagging

Bogdanov, D., Won, M., Tovstogan, P., Porter, A., and Serra, X. The mtg-jamendo dataset for automatic music tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States,

work page 2019
[3]

ISBN 9798400719844

Association for Computing Machinery. ISBN 9798400719844. doi: 10.1145/3742413.3789104. URL https://doi.org/10.1145/3742413.3789104. Cachay, S. R., Aittala, M., Kreis, K., Brenowitz, N., Vahdat, A., Mardani, M., and Yu, R. Elucidated rolling diffusion models for probabilistic forecasting of complex dynamics. arXiv preprint arXiv:2506.20024,

work page doi:10.1145/3742413.3789104
[4]

RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

Caillon, A. and Esling, P. RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv:2111.05011,

work page arXiv
[5]

Prompt jockeys (2024) - the rise of djing with a neural network

Carr, C. and Zukowski, Z. Prompt jockeys (2024) - the rise of djing with a neural network,

work page 2024
[6]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

URL https://www.youtube.com/watch?v=_fpnAHoRSqU. Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., and Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024a. Chen, K., Wu, Y., Liu, H., Nezhurina, M., Berg-Kirkpatrick, T., and Dubnov, S. M...

work page arXiv
[7]

Cocola: Coherence-oriented contrastive learning of musical audio representations

Ciranni, R., Mariani, G., Mancusi, M., Postolache, E., Fabbro, G., Rodolà, E., and Cosmo, L. Cocola: Coherence-oriented contrastive learning of musical audio representations. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025
[8]

Long-form music generation with latent diffusion

Evans, Z., Parker, J., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Long-form music generation with latent diffusion. arXiv:2404.10301, 2024a. Evans, Z., Parker, J. D., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. arXiv:2407.14358, 2024b. Fitzgerald, J., Moore, G. R. D., Shirken, B., Glass, P., Zananiri, E., Novack, Z., McAuley, ...

work page arXiv
[9]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Huang, X., Li, Z., He, G., Zhou, M., and Shechtman, E. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP

Karchkhadze, T. and Dubnov, S. Towards real-time human-ai musical co-performance: Accompaniment generation with latent diffusion models and max/msp. arXiv preprint arXiv:2604.07612,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Amuse: Human-ai collaborative songwriting with multimodal inspirations

Kim, Y., Lee, S.-J., and Donahue, C. Amuse: Human-ai collaborative songwriting with multimodal inspirations. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems,

work page 2025
[12]

A design space for live music agents

Kim, Y., Brade, S., Wang, A., Zhou, D., Kim, H., Wang, B., Lee, S.-J., Flores Garcia, H. F., Huang, A., and Donahue, C. A design space for live music agents. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pp. 1–36,

work page 2026
[13]

Efficient training of audio transformers with patchout

Koutini, K., Schlüter, J., Eghbal-Zadeh, H., and Widmer, G. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069,

work page arXiv
[14]

Exploring the Needs of Practising Musicians in Co-Creative AI Through Co-Design

Krol, S. J., Llano Rodriguez, M. T., and Loor Paredes, M. J. Exploring the Needs of Practising Musicians in Co-Creative AI Through Co-Design. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–13,

work page 2025
[15]

High fidelity text-guided music editing via single-stage flow matching

Lan, G. L., Shi, B., Ni, Z., Srinivasan, S., Kumar, A., Ellis, B., Kant, D., Nagaraja, V., Chang, E., Hsu, W.-N., et al. High fidelity text-guided music editing via single-stage flow matching. arXiv:2407.03648,

work page arXiv
[16]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv:2209.03003,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

The song describer dataset: a corpus of audio captions for music-and-language evaluation

Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., Wu, Y., Chen, K., Tovstogan, P., Benetos, E., et al. The song describer dataset: a corpus of audio captions for music-and-language evaluation. arXiv:2311.10057,

work page arXiv
[18]

Improving musical accompaniment co-creation via diffusion transformers

Nistal, J., Pasini, M., and Lattner, S. Improving musical accompaniment co-creation via diffusion transformers. arXiv:2410.23005,

work page arXiv
[19]

Presto! distilling steps and layers for accelerating music generation

Workshop: AI for Music, 2025b. Novack, Z., Zhu, G., Casebeer, J., McAuley, J., Berg-Kirkpatrick, T., and Bryan, N. J. Presto! distilling steps and layers for accelerating music generation. In ICLR, 2025c. Pasini, M., Nistal, J., Lattner, S., and Fazekas, G. Continuous autoregressive models with noise augmentation avoid error accumulation. arXiv preprint arX...

work page arXiv
[20]

Diffusion beats autoregressive in data-constrained settings

Prabhudesai, M., Wu, M., Zadeh, A., Fragkiadaki, K., and Pathak, D. Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857,

work page arXiv
[21]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URL https://arxiv.org/abs/1910.10683. Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. Musdb18-hq - an uncompressed version of musdb18, August

work page internal anchor Pith review Pith/arXiv arXiv 1910
[22]

High-resolution image synthesis with latent diffusion models

URL https://doi.org/10.5281/zenodo.3338373. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In CVPR,

work page doi:10.5281/zenodo.3338373
[23]

Continuous audio language models

Rouard, S., Orsini, M., Roebel, A., Zeghidour, N., and Défossez, A. Continuous audio language models. arXiv preprint arXiv:2509.06926,

work page arXiv
[24]

The watkins marine mammal sound database: an online, freely accessible resource

URL https://arxiv.org/abs/2510.02110. Sayigh, L., Daher, M. A., Allen, J., Gordon, H., Joyce, K., Stuhlmann, C., and Tyack, P. The watkins marine mammal sound database: an online, freely accessible resource. In Proceedings of Meetings on Acoustics, volume 27, pp. 040013. Acoustical Society of America,

work page arXiv
[25]

History-Guided Video Diffusion

Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., and Sitzmann, V. History-guided video diffusion. arXiv preprint arXiv:2502.06764,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Auto-regressive vs flow-matching: a comparative study of modeling paradigms for text-to-music generation

Tal, O., Kreuk, F., and Adi, Y. Auto-regressive vs flow-matching: a comparative study of modeling paradigms for text-to-music generation. arXiv preprint arXiv:2506.08570,

work page arXiv
[27]

Live music models

Team, L., Caillon, A., McWilliams, B., Tarakajian, C., Simon, I., Manco, I., Engel, J., Constant, N., Li, P., Denk, T. I., et al. Live music models. arXiv preprint arXiv:2508.04651,

work page arXiv
[28]

Musecontrollite: Multifunctional music generation with lightweight conditioners

Tsai, F.-D., Wu, S.-L., Lee, W., Yang, S.-P., Chen, B.-R., Cheng, H.-C., and Yang, Y.-H. Musecontrollite: Multifunctional music generation with lightweight conditioners. arXiv preprint arXiv:2506.18729,

work page arXiv
[29]

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Wu, Y., Brade, S., Ma, A. T., Fowler, T.-J., Yang, E., Banar, B., Courville, A., Jaques, N., and Huang, C.-Z. A. Generative adversarial post-training mitigates reward hacking in live human-ai music interaction. arXiv preprint arXiv:2511.17879, 2025a. Wu, Y., Cooijmans, T., Kastner, K., Roberts, A., Simon, I., Scarlatos, A., Donahue, C., Tarakajian, C., O...

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Yue: Scaling open foundation models for long-form music generation

Yuan, R., Lin, H., Guo, S., Zhang, G., Pan, J., Zang, Y., Liu, H., Liang, Y., Ma, W., Du, X., et al. Yue: Scaling open foundation models for long-form music generation. arXiv:2503.08638,

work page arXiv
[31]

Contributions and Acknowledgments
Zachary Novack – Project Lead, Algorithmic Methodology, Text- and Stem-Conditioned Model Development
Stephen Brade – Project Co-Lead, Sketch-Conditioned Model Development, On-Device Model Wrangling, Live API Development, Artist Collaboration
Haven Kim – Data collection and Pre-processing, Evaluation Design and Development...

work page 2025
[32]

as the reference and control source, with captions from MusicCaps (Agostinelli et al., 2023).
A.2 Training and Inference Setup
All models are finetuned from the base version of Stable Audio Open Small (Novack et al., 2025a), a 340M parameter DiT originally trained on ≈12 s of latent audio from Freesound.
A.2.1 Text-Conditioned Generation
LMDMs are trained o...

work page 2023
[33]

We refer to these settings as primed and text-only hereafter

We report results for two inference settings on 47 s clips: audio-primed generation, where the model is given a caption and the first s frames of the corresponding ground-truth track as a prefix, and text-only generation, where only the caption is provided. We refer to these settings as primed and text-only hereafter. To observe drift over time, each generation...

work page 2023
[34]

electric bass

with a weight of 0.7. A.2.2 Accompaniment Generation Following Wu et al. (2025c), we finetune and post-train Enc-Dec LMDMs on the Slakh MIDI dataset of synthesized stems (Manilow et al., 2019), where stems from the same piece are randomly sampled as context and target. In this setup, ARC-Forcing only occurs for 8k steps due to observed faster convergence....

work page 2019
[35]

consistency-style

| Method | Architecture | Dataset | Block Size | +AF? | Eff. BS | Steps |
|---|---|---|---|---|---|---|
| LMDM (ED) | Enc-Dec | FSD50k | 208/47 | ✗ | 128 | 120k |
| LMDM (ED) | Enc-Dec | Humpback whale | 208/47 | ✗ | 128 | 10k |
| LMDM (ED) | Enc-Dec | Jamendo | 192/48 | ✗ | 128 | 130k |
| LMDM (ED) | Enc-Dec | Jamendo | 192/48 | ✓ | 288 | 4.3k |
| LMDM (ED) | Enc-Dec | Jamendo | 230/10 | ✗ | 128 | 140k |
| LMDM (ED) | Enc-Dec | Jamendo | 230/10 | ✓ | 288 | 3.0k |
| LMDM (Bidir) | Bidirectional | Jamendo | 240/– | ✗ | 1... | |

work page 2023
[36]

(2023) captions, sampled from a pool of 256 distinct captions

Both endpoints are drawn from the Song Describer Dataset (SDD) Manco et al. (2023) captions, sampled from a pool of 256 distinct captions. Table 4: Prompt transition pairs used in the cross-prompt continuity evaluation. Each row contains a source prompt A and a target prompt B.
# | Prompt A | Prompt B
1 | Driving, energetic and positive rock song (male voice) perfect...

work page 2023

[1] [1]

MusicLM: Generating Music From Text

Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. MusicLM: Generating music from text.arXiv:2301.11325,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

The mtg-jamendo dataset for automatic music tagging

Bogdanov, D., Won, M., Tovstogan, P., Porter, A., and Serra, X. The mtg-jamendo dataset for automatic music tagging. InMachine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States,

work page 2019

[3] [3]

ISBN 9798400719844

Association for Computing Machinery. ISBN 9798400719844. doi: 10.1145/3742413.3789104. URL https://doi.org/10.1145/3742413. 3789104. Cachay, S. R., Aittala, M., Kreis, K., Brenowitz, N., Vahdat, A., Mardani, M., and Yu, R. Eluci- dated rolling diffusion models for probabilistic forecasting of complex dynamics.arXiv preprint arXiv:2506.20024,

work page doi:10.1145/3742413.3789104

[4] [4]

Rave:Avariationalautoencoderforfastandhigh-qualityneuralaudiosynthesis

Caillon, A. and Esling, P. RA VE: A variational autoencoder for fast and high-quality neural audio synthesis.arXiv:2111.05011,

work page arXiv

[5] [5]

and Zukowski, Z

Carr, C. and Zukowski, Z. Prompt jockeys (2024) - the rise of djing with a neural network,

work page 2024

[6] [6]

Chen, B., Martí Monsó, D., Du, Y ., Simchowitz, M., Tedrake, R., and Sitzmann, V

URLhttps://www.youtube.com/watch?v=_fpnAHoRSqU. Chen, B., Martí Monsó, D., Du, Y ., Simchowitz, M., Tedrake, R., and Sitzmann, V . Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024a. Chen, K., Wu, Y ., Liu, H., Nezhurina, M., Berg-Kirkpatrick, T., and Dubnov, S. M...

work page arXiv

[7] [7]

Cocola: Coherence-oriented contrastive learning of musical audio representations

Ciranni, R., Mariani, G., Mancusi, M., Postolache, E., Fabbro, G., Rodolà, E., and Cosmo, L. Cocola: Coherence-oriented contrastive learning of musical audio representations. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025

[8] [8]

Long-form music generation with latent diffusion.arXiv:2404.10301, 2024a

Evans, Z., Parker, J., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Long-form music generation with latent diffusion.arXiv:2404.10301, 2024a. Evans, Z., Parker, J. D., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. arXiv:2407.14358, 2024b. Fitzgerald, J., Moore, G. R. D., Shirken, B., Glass, P., Zananiri, E., Novack, Z., McAuley, ...

work page arXiv

[9]

Huang, X., Li, Z., He, G., Zhou, M., and Shechtman, E. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009,

[10]

Karchkhadze, T. and Dubnov, S. Towards real-time human-ai musical co-performance: Accompaniment generation with latent diffusion models and max/msp. arXiv preprint arXiv:2604.07612,

[11]

Kim, Y., Lee, S.-J., and Donahue, C. Amuse: Human-ai collaborative songwriting with multimodal inspirations. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025.

[12]

Kim, Y., Brade, S., Wang, A., Zhou, D., Kim, H., Wang, B., Lee, S.-J., Flores Garcia, H. F., Huang, A., and Donahue, C. A design space for live music agents. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pp. 1–36, 2026.

[13]

Koutini, K., Schlüter, J., Eghbal-Zadeh, H., and Widmer, G. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069,

[14]

Krol, S. J., Llano Rodriguez, M. T., and Loor Paredes, M. J. Exploring the Needs of Practising Musicians in Co-Creative AI Through Co-Design. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–13, 2025.

[15]

Lan, G. L., Shi, B., Ni, Z., Srinivasan, S., Kumar, A., Ellis, B., Kant, D., Nagaraja, V., Chang, E., Hsu, W.-N., et al. High fidelity text-guided music editing via single-stage flow matching. arXiv:2407.03648,

[16]

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv:2209.03003,

[17]

Manco, I., Weck, B., Doh, S., Won, M., Zhang, Y., Bogdanov, D., Wu, Y., Chen, K., Tovstogan, P., Benetos, E., et al. The song describer dataset: a corpus of audio captions for music-and-language evaluation. arXiv:2311.10057,

[18]

Nistal, J., Pasini, M., and Lattner, S. Improving musical accompaniment co-creation via diffusion transformers. arXiv:2410.23005,

[19]

Workshop: AI for Music, 2025b.

Novack, Z., Zhu, G., Casebeer, J., McAuley, J., Berg-Kirkpatrick, T., and Bryan, N. J. Presto! distilling steps and layers for accelerating music generation. In ICLR, 2025c.

Pasini, M., Nistal, J., Lattner, S., and Fazekas, G. Continuous autoregressive models with noise augmentation avoid error accumulation. arXiv preprint arX...

[20]

Prabhudesai, M., Wu, M., Zadeh, A., Fragkiadaki, K., and Pathak, D. Diffusion beats autoregressive in data-constrained settings. arXiv preprint arXiv:2507.15857,

[21]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URL https://arxiv.org/abs/1910.10683.

Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. Musdb18-hq - an uncompressed version of musdb18, August

[22]

URL https://doi.org/10.5281/zenodo.3338373.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In CVPR,

[23]

Rouard, S., Orsini, M., Roebel, A., Zeghidour, N., and Défossez, A. Continuous audio language models. arXiv preprint arXiv:2509.06926,

[24]

URL https://arxiv.org/abs/2510.02110.

Sayigh, L., Daher, M. A., Allen, J., Gordon, H., Joyce, K., Stuhlmann, C., and Tyack, P. The watkins marine mammal sound database: an online, freely accessible resource. In Proceedings of Meetings on Acoustics, volume 27, pp. 040013. Acoustical Society of America,

[25]

Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., and Sitzmann, V. History-guided video diffusion. arXiv preprint arXiv:2502.06764,

[26]

Tal, O., Kreuk, F., and Adi, Y. Auto-regressive vs flow-matching: a comparative study of modeling paradigms for text-to-music generation. arXiv preprint arXiv:2506.08570,

[27]

Team, L., Caillon, A., McWilliams, B., Tarakajian, C., Simon, I., Manco, I., Engel, J., Constant, N., Li, P., Denk, T. I., et al. Live music models. arXiv preprint arXiv:2508.04651,

[28]

Tsai, F.-D., Wu, S.-L., Lee, W., Yang, S.-P., Chen, B.-R., Cheng, H.-C., and Yang, Y.-H. Musecontrollite: Multifunctional music generation with lightweight conditioners. arXiv preprint arXiv:2506.18729,

[29]

Wu, Y., Brade, S., Ma, A. T., Fowler, T.-J., Yang, E., Banar, B., Courville, A., Jaques, N., and Huang, C.-Z. A. Generative adversarial post-training mitigates reward hacking in live human-ai music interaction. arXiv preprint arXiv:2511.17879, 2025a.

Wu, Y., Cooijmans, T., Kastner, K., Roberts, A., Simon, I., Scarlatos, A., Donahue, C., Tarakajian, C., O...

[30]

Yuan, R., Lin, H., Guo, S., Zhang, G., Pan, J., Zang, Y., Liu, H., Liang, Y., Ma, W., Du, X., et al. Yue: Scaling open foundation models for long-form music generation. arXiv:2503.08638,

[31]

Contributions and Acknowledgments

Zachary Novack – Project Lead, Algorithmic Methodology, Text- and Stem-Conditioned Model Development

Stephen Brade – Project Co-Lead, Sketch-Conditioned Model Development, On-Device Model Wrangling, Live API Development, Artist Collaboration

Haven Kim – Data collection and Pre-processing, Evaluation Design and Development...

[32]

as the reference and control source, with captions from MusicCaps (Agostinelli et al., 2023).

A.2 Training and Inference Setup

All models are finetuned from the base version of Stable Audio Open Small (Novack et al., 2025a), a 340M parameter DiT originally trained on ≈12 s of latent audio from Freesound.

A.2.1 Text-Conditioned Generation

LMDMs are trained o...

[33]

We report results for two inference settings on 47 s clips: audio-primed generation, where the model is given a caption and the first s frames of the corresponding ground-truth track as a prefix, and text-only generation, where only the caption is provided. We refer to these settings as primed and text-only hereafter. To observe drift over time, each generation...

[34]

with a weight of 0.7.

A.2.2 Accompaniment Generation

Following Wu et al. (2025c), we finetune and post-train Enc-Dec LMDMs on the Slakh MIDI dataset of synthesized stems (Manilow et al., 2019), where stems from the same piece are randomly sampled as context and target. In this setup, ARC-Forcing only occurs for 8k steps due to observed faster convergence....

[35]

Method        Architecture   Dataset         Block Size  +AF?  Eff. BS  Steps
LMDM (ED)     Enc-Dec        FSD50k          208/47      ✗     128      120k
LMDM (ED)     Enc-Dec        Humpback whale  208/47      ✗     128      10k
LMDM (ED)     Enc-Dec        Jamendo         192/48      ✗     128      130k
LMDM (ED)     Enc-Dec        Jamendo         192/48      ✓     288      4.3k
LMDM (ED)     Enc-Dec        Jamendo         230/10      ✗     128      140k
LMDM (ED)     Enc-Dec        Jamendo         230/10      ✓     288      3.0k
LMDM (Bidir)  Bidirectional  Jamendo         240/–       ✗     1...

[36]

Both endpoints are drawn from Song Describer Dataset (SDD; Manco et al., 2023) captions, sampled from a pool of 256 distinct captions.

Table 4: Prompt transition pairs used in the cross-prompt continuity evaluation. Each row contains a source prompt A and a target prompt B.

#  Prompt A  Prompt B
1  Driving, energetic and positive rock song (male voice) perfect...