{"total":35,"items":[{"citing_arxiv_id":"2605.13789","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ENSEMBITS: an alphabet of protein conformational ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-13T17:08:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Descriptors are standardized to zero mean and unit variance per feature using statistics computed on the training split; the same (µ, σ) are bundled with the model checkpoint for downstream inference. During training, the number of input frames is sampled uniformly from peff ∼ U{1, ..., 10} at each step (the variable-P schedule used by SFTD), so the encoder sees every P ∈ [1,10]. Hardware and runtime. Training was performed on a single NVIDIA H200 and converged in approximately 7.2 hours (∼25,964 seconds for 195 epochs over 6,557,466 training residues / 719,507 validation residues). 
Final codebook utilization.At convergence, the L1 = 2048 -code primary codebook reaches 96.3% utilization on validation set (1,973 unique codes assigned at least once; perplexity ≈1114 );"},{"citing_arxiv_id":"2605.09971","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation","primary_cat":"cs.HC","submitted_at":"2026-05-11T04:26:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models,\"IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 11, pp. 7327- 7347, 2021. [44] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, \"Jukebox: A generative model for music,\"arXiv preprint arXiv:2005.00341, 2020. [45] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Shar- ifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchiet al., \"Audi- olm: a language modeling approach to audio generation,\"IEEE/ACM transactions on audio, speech, and language processing, vol. 31, pp. 2523-2533, 2023. [46] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. 
Yan, and L."},{"citing_arxiv_id":"2605.06870","ref_index":2,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse","primary_cat":"cs.LG","submitted_at":"2026-05-07T19:13:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03929","ref_index":39,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PHALAR: Phasors for Learned Musical Audio Representations","primary_cat":"cs.SD","submitted_at":"2026-05-05T16:19:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23077","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems","primary_cat":"cs.IR","submitted_at":"2026-04-25T00:09:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"incorporate embeddings into a recommender system by using them as 
frozen item embeddings with a learned transformation over them. However, it is also possible to try other approaches, like predicting collaborative embeddings using content information [ 89], using content embedding as a regularization on collaborative one [44], or doing the parameter-efficient transfer learning [21]. We also did not try to retrain the content models as part of a recommendation model from scratch for two reasons. Firstly, training some content models is very costly. We could not possibly retrain Jukebox or MuQ, which is reported to have been trained for two weeks using 32 GPUs. Secondly, there is ambiguity about where to draw the line between building on existing models"},{"citing_arxiv_id":"2604.22209","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions","primary_cat":"eess.AS","submitted_at":"2026-04-24T04:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18489","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints","primary_cat":"cs.SD","submitted_at":"2026-04-20T16:40:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16254","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics","primary_cat":"cs.SD","submitted_at":"2026-04-17T17:14:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15196","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization","primary_cat":"cs.CV","submitted_at":"2026-04-16T16:24:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10490","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning","primary_cat":"cs.HC","submitted_at":"2026-04-12T06:45:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while 
maintaining naturalness and style.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07612","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP","primary_cat":"cs.SD","submitted_at":"2026-04-08T21:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03310","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions","primary_cat":"cs.CV","submitted_at":"2026-03-31T11:35:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An inference-time optimization using a control-energy objective on pretrained diffusion models enables coherent long-range human motion generation with explicit domain transitions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"In contrast, our work focuses on exploiting classifier-free guidance within a single diffusion model trained on multiple movement classes to address the problem of coherent domain transitions. Long-range motion generation.EDGE [29] extends diffusion-based motion gener- ation to long-range music-conditioned dance synthesis using a Transformer back- bone and audio features extracted from Jukebox [5]. 
Conditioning is incorporated through feature-wise modulation using FiLM-style affine transformations [24], enabling expressive choreography aligned with music over extended time horizons. From a stochastic optimal control perspective, such modulation mechanisms can be interpreted as injecting control signals into the generative dynamics, where modulation coefficients determine how strongly conditional information perturbs"},{"citing_arxiv_id":"2603.07956","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Daily Song to Daily Self: Supporting Reflective Songwriting of Deaf and Hard-of-Hearing Individuals through Generative Music AI","primary_cat":"cs.HC","submitted_at":"2026-03-09T04:44:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SoulNote enables multi-session GenAI songwriting for DHH users, producing measurable gains in self-insight, emotion regulation, and self-care attitudes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This lyrics-music mapping encouraged participants to experiment with musical features they might otherwise avoid, enabling them to discover more fitting emotional expressions or encounter novel affective effects. These observations align with prior research emphasizing the role of musical characteristics in shaping emotional experience and regulation [32, 41]. 9 Limitations SoulNote shows preliminary promise in improving access to songwriting for DHH individuals and supporting beneficial emotional experiences. However, several limitations should be noted. First, the sample was relatively small and drawn from a specific DHH subgroup, primarily based on residual hearing. 
The study does not fully represent born Deaf individuals, who may face greater chal-"},{"citing_arxiv_id":"2603.03190","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity","primary_cat":"cs.AI","submitted_at":"2026-03-03T17:47:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22029","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline","primary_cat":"cs.SD","submitted_at":"2026-02-24T06:43:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03612","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias","primary_cat":"cs.LG","submitted_at":"2026-01-07T05:40:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Smart Embedding reduces parameters by 48.3 percent in polyphonic music models with information-theoretic loss bounds under 0.153 bits and tighter generalization via Rademacher 
complexity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01537","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Two-Dimensional Quantization for Geometry-Aware Audio Coding","primary_cat":"cs.SD","submitted_at":"2025-12-01T11:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.24437","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization","primary_cat":"cs.SD","submitted_at":"2025-05-30T10:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.08203","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not that Groove: Zero-Shot Symbolic Music Editing","primary_cat":"cs.SD","submitted_at":"2025-05-13T03:33:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% 
success across eight models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.18309","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation","primary_cat":"cs.GR","submitted_at":"2025-02-25T15:53:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GCDance is a text-and-music-conditioned diffusion framework that generates genre-consistent 3D dance sequences and reports better results than prior methods on FineDance and AIST++.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.02612","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot","primary_cat":"cs.CL","submitted_at":"2024-12-03T17:41:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.15505","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Finite Scalar Quantization: VQ-VAE Made Simple","primary_cat":"cs.CV","submitted_at":"2023-09-27T09:13:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an 
equivalent-sized implicit codebook with competitive performance and no collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.02463","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Shap-E: Generating Conditional 3D Implicit Functions","primary_cat":"cs.CV","submitted_at":"2023-05-03T23:59:13+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2301.11325","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MusicLM: Generating Music From Text","primary_cat":"cs.SD","submitted_at":"2023-01-26T18:58:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.15657","ref_index":195,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Is Conditional Generative Modeling all you need for Decision-Making?","primary_cat":"cs.LG","submitted_at":"2022-11-28T18:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill 
composition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.15089","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continuous diffusion for categorical data","primary_cat":"cs.CL","submitted_at":"2022-11-28T06:08:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.13438","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"High Fidelity Neural Audio Compression","primary_cat":"eess.AS","submitted_at":"2022-10-24T17:52:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.05221","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Models (Mostly) Know What They Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get 
larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.04672","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"No Language Left Behind: Scaling Human-Centered Machine Translation","primary_cat":"cs.CL","submitted_at":"2022-07-11T07:33:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01068","ref_index":145,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPT: Open Pre-trained Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2022-05-02T17:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00861","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model 
size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2105.05233","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion Models Beat GANs on Image Synthesis","primary_cat":"cs.LG","submitted_at":"2021-05-11T17:50:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The helmholtz machine. Neural computation, 7(5):889-904, 1995. [11] Harm de Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron Courville. Modulating early visual processing by language. arXiv:1707.00683, 2017. [12] DeepMind. Biggan-deep 128x128 on tensorflow hub. https://tfhub.dev/deepmind/biggan-deep-128/1, 2018. 13 [13] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv:2005.00341, 2020. [14] Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv:1907.02544, 2019. [15] Yilun Du and Igor Mordatch. 
Implicit generation and generalization in energy-based models."},{"citing_arxiv_id":"2104.10157","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoGPT: Video Generation using VQ-VAE and Transformers","primary_cat":"cs.CV","submitted_at":"2021-04-20T17:58:03+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2102.01293","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws for Transfer","primary_cat":"cs.LG","submitted_at":"2021-02-02T04:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2010.14701","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Laws for Autoregressive Generative Modeling","primary_cat":"cs.LG","submitted_at":"2020-10-28T02:17:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}