Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069
7 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 7
citing papers explorer
- Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
  Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training alignment without RL.
- SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
  SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
- Omni2Sound: Towards Unified Video-Text-to-Audio Generation
  A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
- ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
  ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
  MMHNet enables video-to-audio models trained on short clips to generalize and generate audio for videos over 5 minutes long.
- BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing
  Unsupervised VQ-VAE training on PaSST embeddings discovers repeatable discrete acoustic tokens in honey bee buzzing that separate queenright from queenless conditions and identify three stable sub-states in queenless hives.
- MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
  MMAudioSep adapts a pretrained video-to-audio model via fine-tuning for video/text-queried sound separation, outperforming baselines while preserving generation ability.