Jukebox: A Generative Model for Music
arXiv preprint arXiv:2005.00341 (2020)
23 papers cite this work.
abstract
We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, which we model with autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non-cherry-picked samples at https://jukebox.openai.com, along with model weights and code at https://github.com/openai/jukebox.
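The compression step the abstract describes — mapping continuous encoder outputs to the nearest entries of a learned codebook, yielding discrete tokens for an autoregressive model — can be sketched as follows. This is a toy NumPy illustration of the generic VQ-VAE bottleneck, not Jukebox's actual multi-scale architecture; the sizes and names are made up for the example.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry
    (Euclidean distance), as in a VQ-VAE bottleneck."""
    # z: (T, D) encoder latents; codebook: (K, D) learned code vectors
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K) distances
    codes = d.argmin(axis=1)   # discrete token per timestep
    z_q = codebook[codes]      # quantized latents fed to the decoder
    return codes, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # toy codebook: K=8 codes of dim 4
z = rng.normal(size=(16, 4))         # toy encoder output: 16 timesteps
codes, z_q = vector_quantize(z, codebook)
print(codes.shape, z_q.shape)        # (16,) (16, 4)
```

The resulting `codes` sequence is what a Transformer would model autoregressively; in Jukebox this quantization is applied at several temporal resolutions to handle raw audio's long context.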
citing papers
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
From Daily Song to Daily Self: Supporting Reflective Songwriting of Deaf and Hard-of-Hearing Individuals through Generative Music AI
SoulNote enables multi-session GenAI songwriting for DHH users, producing measurable gains in self-insight, emotion regulation, and self-care attitudes.
-
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
-
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.
-
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
-
Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.
-
Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions
An inference-time optimization using a control-energy objective on pretrained diffusion models enables coherent long-range human motion generation with explicit domain transitions.
-
High Fidelity Neural Audio Compression
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Scaling Laws for Autoregressive Generative Modeling
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
-
Is Conditional Generative Modeling all you need for Decision-Making?
Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
-
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.
-
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.
-
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
-
Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning
Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while maintaining naturalness and style.
-
Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
No Language Left Behind: Scaling Human-Centered Machine Translation
A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety evaluation.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
VideoGPT: Video Generation using VQ-VAE and Transformers
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
-
Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems
Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.