DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
hub Canonical reference
MusicLM: Generating Music From Text
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training alignment without RL.
MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters
DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
Black-box membership inference on text-to-music models reaches up to 98.6% accuracy by training an auditor on semantic alignment patterns extracted from shadow-model generations.
A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.
S2Accompanist is a 402M-parameter semantic-aware diffusion model that achieves SOTA on the ATTM Grand Challenge benchmark for music accompaniment generation via automated data processing and structure-guided VAE fine-tuning.
ARIA decomposes music training data attribution into musical aspects and supplies reliability diagnostics from similarity metrics and score matrix analysis, with validation on symbolic models using counterfactual retraining.
Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.
Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.
citing papers explorer
-
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
-
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training alignment without RL.
-
MusicDET: Zero-Shot AI-Generated Music Detection
MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.
-
Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
-
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
-
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
Latent Fourier Transform
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
-
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
-
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
-
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
-
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
-
Steering Autoregressive Music Generation with Recursive Feature Machines
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
-
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms
Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters
-
DASB - Discrete Audio and Speech Benchmark
DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Auditing Training Data in Generative Music Models via Black-Box Membership Inference
Black-box membership inference on text-to-music models reaches up to 98.6% accuracy by training an auditor on semantic alignment patterns extracted from shadow-model generations.
-
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.
-
S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation
S2Accompanist is a 402M-parameter semantic-aware diffusion model that achieves SOTA on the ATTM Grand Challenge benchmark for music accompaniment generation via automated data processing and structure-guided VAE fine-tuning.
-
ARIA: A Diagnostic Framework for Music Training Data Attribution
ARIA decomposes music training data attribution into musical aspects and supplies reliability diagnostics from similarity metrics and score matrix analysis, with validation on symbolic models using counterfactual retraining.
-
Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering
Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.
-
Communicating Sound Through Natural Language
Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.
-
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
-
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
-
MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention
MindMelody combines real-time EEG emotion decoding with an LLM for intervention planning and a hierarchical controller for generating affect-aware music in a continuous feedback loop.
-
Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.
-
Language-Guided Multimodal Texture Authoring via Generative Models
A language-driven system generates semantically consistent multimodal textures from text prompts by linking autoregressive haptic models and diffusion-based visuals through a shared latent representation.
-
TADA! Tuning Audio Diffusion Models through Activation Steering
Activation steering at a semantic bottleneck in audio diffusion models achieves state-of-the-art control over musical attributes such as instruments, vocals, and genres.
-
One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization
LLMs using in-context learning and fine-tuning on listener experiment data generate equalization settings that align better with population preferences than random sampling or static presets.
-
Two-Dimensional Quantization for Geometry-Aware Audio Coding
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
-
Investigating Modality Contribution in Audio LLMs for Music
Adapts MM-SHAP to quantify modality contributions in two Audio LLMs on MuChoMusic, showing text dominance alongside limited audio localization of key events.
-
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
-
VideoPoet: A Large Language Model for Zero-Shot Video Generation
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
-
SALMONN: Towards Generic Hearing Abilities for Large Language Models
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
-
AudioPaLM: A Large Language Model That Can Speak and Listen
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
-
Shap-E: Generating Conditional 3D Implicit Functions
Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
-
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
-
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling
AuDirector proposes a self-reflective closed-loop multi-agent framework with identity-aware pre-production, collaborative synthesis-correction, and human-guided refinement for coherent immersive audio storytelling.
-
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.
-
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.
-
Woosh: A Sound Effects Foundation Model
Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
-
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
-
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.
-
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model
The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.
-
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy guidance.
-
Qwen2-Audio Technical Report
Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
-
Toward Native Multimodal Modeling: A Roadmap
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.