Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing

Hao Fu, Chunyuan Li, XiaoDong Liu, Jianfeng Gao, Asli Celikyilmaz, Lawrence Carin

classification cs.LG cs.AI cs.CL cs.CV stat.ML

keywords annealing beta cyclical language codes decoder latent schedule

Variational autoencoders (VAEs) with an auto-regressive decoder have been applied to many natural language processing (NLP) tasks. The VAE objective consists of two terms, (i) reconstruction and (ii) KL regularization, balanced by a weighting hyper-parameter \beta. One notorious training difficulty is that the KL term tends to vanish. In this paper we study scheduling schemes for \beta, and show that KL vanishing is caused by the lack of good latent codes in training the decoder at the beginning of optimization. To remedy this, we propose a cyclical annealing schedule, which repeats the process of increasing \beta multiple times. This new procedure allows the progressive learning of more meaningful latent codes, by leveraging the informative representations of previous cycles as warm re-starts. The effectiveness of cyclical annealing is validated on a broad range of NLP tasks, including language modeling, dialog response generation, and unsupervised language pre-training.
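
The schedule described in the abstract, in which \beta is increased repeatedly over the course of training, might look like the minimal Python sketch below. The abstract does not give the shape of the increase, the number of cycles, or how long \beta is held at its maximum, so the linear ramp and the parameters `n_cycles`, `ramp_fraction`, and `beta_max` are illustrative assumptions. The function name `cyclical_beta` is hypothetical.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_fraction=0.5, beta_max=1.0):
    """Return the KL weight beta for a given training step.

    Assumptions (not taken from the abstract): training is split into
    n_cycles equal cycles; within each cycle beta rises linearly from 0
    to beta_max over the first ramp_fraction of the cycle and is then
    held at beta_max until the cycle ends.
    """
    if total_steps <= 0 or n_cycles <= 0:
        raise ValueError("total_steps and n_cycles must be positive")
    if not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("ramp_fraction must be in (0, 1]")

    cycle_len = total_steps / n_cycles
    # Position inside the current cycle, in [0, 1).
    position = (step % cycle_len) / cycle_len
    # Linear ramp, capped at beta_max once the ramp is complete.
    return beta_max * min(1.0, position / ramp_fraction)


if __name__ == "__main__":
    # Print beta at a few steps of a 100-step run with 4 cycles.
    for s in (0, 5, 12, 13, 24, 25, 30):
        print(s, cyclical_beta(s, total_steps=100))
```

In a training loop, the returned value would weight the KL term, so that the loss at each step is the reconstruction term plus `cyclical_beta(step, total_steps)` times the KL term. Each return of \beta to zero at a cycle boundary corresponds to the warm re-start the abstract mentions.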

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
stat.ML 2026-04 unverdicted novelty 6.0

Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.

From Unsupervised to Guided Clustering: A Variational Implementation
stat.ME 2026-04 unverdicted novelty 6.0

GCVAE is a variational autoencoder that structures its latent space as a Gaussian mixture and optimizes a variational objective to make the representation maximally informative about a user-chosen guiding variable, en...