MusicLM: Generating Music From Text

2023 · cs.SD · arXiv 2301.11325

Canonical reference. 75% of citing Pith papers cite this work as background.

73 Pith papers citing it

Background 75% of classified citations

abstract

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

citation-role summary

background 11 · dataset 4 · method 1

citation-polarity summary

background 12 · use dataset 3 · use method 1

representative citing papers

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

What's a Credit Worth? A Market Framework for Attribution-Aware Compensation in Generative Music

cs.CY · 2026-07-01 · conditional · novelty 7.0

Proposes an attribution-aware compensation framework for generative music that derives closed-form payments from catalog-level attribution informativeness and quantifies welfare effects under competition.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

cs.SD · 2026-06-05 · unverdicted · novelty 7.0

UniSinger unifies speaker-cloned song generation and accompaniment co-generation SVC in one multimodal diffusion transformer model trained with curriculum learning via task-specific modality masking.

Exploring LLMs for South Asian Music Understanding and Generation

cs.SD · 2026-06-03 · unverdicted · novelty 7.0

This paper introduces a 504-question benchmark for South Asian music understanding and a controlled prompting framework for generation, reporting frontier LLMs at 85-90% on understanding but only 40% stylistic faithfulness on generation.

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

cs.SD · 2026-05-21 · unverdicted · novelty 7.0

Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training alignment without RL.

MusicDET: Zero-Shot AI-Generated Music Detection

cs.SD · 2026-05-18 · unverdicted · novelty 7.0

MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

cs.MM · 2026-05-11 · unverdicted · novelty 7.0

FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

cs.SD · 2026-04-22 · unverdicted · novelty 7.0

ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.

Latent Fourier Transform

cs.SD · 2026-04-20 · unverdicted · novelty 7.0

LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.

ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.

MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

cs.SD · 2026-02-24 · unverdicted · novelty 7.0

MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.

Steering Autoregressive Music Generation with Recursive Feature Machines

cs.LG · 2025-10-21 · unverdicted · novelty 7.0

MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.

AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

cs.SD · 2025-09-28 · unverdicted · novelty 7.0

AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

cs.SD · 2024-11-24 · conditional · novelty 7.0

Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters.

DASB - Discrete Audio and Speech Benchmark

cs.SD · 2024-06-20 · unverdicted · novelty 7.0

DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

cs.CV · 2023-10-09 · unverdicted · novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers

eess.AS · 2026-07-02 · unverdicted · novelty 6.0

A single neural audio codec can operate at multiple token temporal resolutions by generating TTR-dependent convolutional kernels from shared parameters while adjusting kernel size and stride.

citing papers explorer

Showing 50 of 73 citing papers.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues · cs.AI · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
What's a Credit Worth? A Market Framework for Attribution-Aware Compensation in Generative Music · cs.CY · 2026-07-01 · conditional · none · ref 3 · internal anchor
Proposes an attribution-aware compensation framework for generative music that derives closed-form payments from catalog-level attribution informativeness and quantifies welfare effects under competition.
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis · eess.AS · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation · cs.SD · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
UniSinger unifies speaker-cloned song generation and accompaniment co-generation SVC in one multimodal diffusion transformer model trained with curriculum learning via task-specific modality masking.
Exploring LLMs for South Asian Music Understanding and Generation · cs.SD · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
This paper introduces a 504-question benchmark for South Asian music understanding and a controlled prompting framework for generation, reporting frontier LLMs at 85-90% on understanding but only 40% stylistic faithfulness on generation.
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators · cs.SD · 2026-05-21 · unverdicted · none · ref 1 · internal anchor
Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training alignment without RL.
MusicDET: Zero-Shot AI-Generated Music Detection · cs.SD · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.
Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation · cs.SD · 2026-05-15 · unverdicted · none · ref 5 · internal anchor
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries · cs.MM · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration · cs.SD · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization · cs.LG · 2026-05-07 · unverdicted · none · ref 1 · 2 links · internal anchor
PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence · cs.SD · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
Latent Fourier Transform · cs.SD · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics · cs.SD · 2026-04-17 · unverdicted · none · ref 14 · internal anchor
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline · cs.SD · 2026-02-24 · unverdicted · none · ref 1 · internal anchor
MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models · cs.SD · 2026-01-06 · unverdicted · none · ref 1 · internal anchor
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
Omni2Sound: Towards Unified Video-Text-to-Audio Generation · cs.SD · 2026-01-06 · unverdicted · none · ref 40 · internal anchor
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
Steering Autoregressive Music Generation with Recursive Feature Machines · cs.LG · 2025-10-21 · unverdicted · none · ref 1 · internal anchor
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance · cs.SD · 2025-09-28 · unverdicted · none · ref 1 · internal anchor
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models · cs.SD · 2025-07-10 · unverdicted · none · ref 3 · internal anchor
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms · cs.SD · 2024-11-24 · conditional · none · ref 10 · internal anchor
Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters.
DASB - Discrete Audio and Speech Benchmark · cs.SD · 2024-06-20 · unverdicted · none · ref 20 · internal anchor
DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation · cs.CV · 2023-10-09 · unverdicted · none · ref 258 · internal anchor
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers · eess.AS · 2026-07-02 · unverdicted · none · ref 9 · internal anchor
A single neural audio codec can operate at multiple token temporal resolutions by generating TTR-dependent convolutional kernels from shared parameters while adjusting kernel size and stride.
A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models · cs.SD · 2026-07-01 · conditional · none · ref 1 · internal anchor
A text-to-procedural-audio system using LLMs to emit controllable categorical configurations, with live crossfading generator and three interchangeable backends for uninterrupted performance.
ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation · eess.AS · 2026-06-16 · unverdicted · none · ref 36 · internal anchor
ELSA introduces an event-level semantic alignment metric for reference-free text-to-audio evaluation that reports higher correlation with human ratings than CLAP-based baselines across four benchmarks.
DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment · eess.AS · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
DeRA-MOS decouples music impression and text alignment training with listwise ranking and score-anchored modality alignment losses, reporting gains on MusicEval ranking metrics.
FIGMA: Towards FIne-Grained Music retrievAl · cs.SD · 2026-06-04 · unverdicted · none · ref 1 · internal anchor
FIGMA proposes a multi-view contrastive architecture plus the FGMCaps dataset to retrieve music from fine-grained textual descriptions of musical attributes, reporting up to 73.3% relative gains over CLAP baselines.
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation · cs.SD · 2026-06-02 · unverdicted · none · ref 54 · internal anchor
Foley-Omni extends isolated audio synthesis to joint generation of full video soundtracks across speech, effects, and music, with a new V2ST-Bench for evaluation showing competitive single-task results and gains in mixed-track consistency.
JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions · cs.SD · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
JenBridge pretrains a flow-matching Transformer on text-audio data then adapts it with video conditioning and an LLM director to select transitions, claiming better coherence than prior methods on a new LVS benchmark.
Auditing Training Data in Generative Music Models via Black-Box Membership Inference · cs.LG · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
Black-box membership inference on text-to-music models reaches up to 98.6% accuracy by training an auditor on semantic alignment patterns extracted from shadow-model generations.
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio · cs.LG · 2026-05-25 · unverdicted · none · ref 1 · internal anchor
A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.
Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation · cs.SD · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Caption poisoning attacks can steer retrieval-augmented text-to-music generation toward attacker-chosen targets by injecting crafted captions into the knowledge database.
S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation · eess.AS · 2026-05-17 · unverdicted · none · ref 10 · internal anchor
S2Accompanist is a 402M-parameter semantic-aware diffusion model that achieves SOTA on the ATTM Grand Challenge benchmark for music accompaniment generation via automated data processing and structure-guided VAE fine-tuning.
ARIA: A Diagnostic Framework for Music Training Data Attribution · cs.SD · 2026-05-15 · unverdicted · none · ref 1 · internal anchor
ARIA decomposes music training data attribution into musical aspects and supplies reliability diagnostics from similarity metrics and score matrix analysis, with validation on symbolic models using counterfactual retraining.
Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music · cs.SD · 2026-05-14 · unverdicted · none · ref 3 · internal anchor
Introduces the first large-scale Persian music dataset and shows fine-tuned MusicGen produces compositions more aligned with Persian stylistic conventions via tag-based evaluation.
Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering · cs.SD · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.
Communicating Sound Through Natural Language · cs.LG · 2026-05-09 · unverdicted · none · ref 3 · 2 links · internal anchor
Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation · eess.AS · 2026-05-09 · unverdicted · none · ref 1 · internal anchor
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions · eess.AS · 2026-05-06 · unverdicted · none · ref 46 · internal anchor
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
PHALAR: Phasors for Learned Musical Audio Representations · cs.SD · 2026-05-05 · unverdicted · none · ref 34 · 3 links · internal anchor
PHALAR achieves up to 70% relative accuracy gain in stem retrieval over prior art using under half the parameters and 7x faster training by enforcing musical equivariances via spectral pooling and complex heads.
MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention · cs.SD · 2026-05-02 · unverdicted · none · ref 15 · 2 links · internal anchor
MindMelody combines real-time EEG emotion decoding with an LLM for intervention planning and a hierarchical controller for generating affect-aware music in a continuous feedback loop.
Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP · cs.SD · 2026-04-08 · unverdicted · none · ref 12 · internal anchor
A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.
Language-Guided Multimodal Texture Authoring via Generative Models · cs.HC · 2026-04-07 · unverdicted · none · ref 68 · internal anchor
A language-driven system generates semantically consistent multimodal textures from text prompts by linking autoregressive haptic models and diffusion-based visuals through a shared latent representation.
TADA! Tuning Audio Diffusion Models through Activation Steering · cs.SD · 2026-02-12 · unverdicted · none · ref 2 · internal anchor
Activation steering at a semantic bottleneck in audio diffusion models achieves state-of-the-art control over musical attributes such as instruments, vocals, and genres.
One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization · cs.SD · 2026-01-14 · unverdicted · none · ref 14 · internal anchor
LLMs using in-context learning and fine-tuning on listener experiment data generate equalization settings that align better with population preferences than random sampling or static presets.
Two-Dimensional Quantization for Geometry-Aware Audio Coding · cs.SD · 2025-12-01 · unverdicted · none · ref 5 · internal anchor
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
Investigating Modality Contribution in Audio LLMs for Music · cs.LG · 2025-09-25 · unverdicted · none · ref 17 · internal anchor
Adapts MM-SHAP to quantify modality contributions in two Audio LLMs on MuChoMusic, showing text dominance alongside limited audio localization of key events.
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound · cs.SD · 2025-02-07 · unverdicted · none · ref 50 · internal anchor
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
VideoPoet: A Large Language Model for Zero-Shot Video Generation · cs.CV · 2023-12-21 · unverdicted · none · ref 1 · internal anchor
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
