Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning

Shuai Zheng, James T. Kwok

classification cs.LG math.OC stat.ML

keywords adaptivity, blockwise, adaptive, generalization, gradient, coordinate-wise, descent, faster

Stochastic methods with coordinate-wise adaptive stepsizes (such as RMSprop and Adam) are widely used to train deep neural networks. Despite their fast convergence, they can generalize worse than stochastic gradient descent. In this paper, by revisiting the design of Adagrad, we propose to split the network parameters into blocks and use a blockwise adaptive stepsize. Intuitively, blockwise adaptivity is less aggressive than adaptivity to individual coordinates, and can strike a better balance between adaptivity and generalization. We show theoretically that the proposed blockwise adaptive gradient descent has a convergence rate comparable to that of its counterpart with coordinate-wise adaptive stepsizes, but is faster up to some constant. We also study its uniform stability and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity. Experimental results show that blockwise adaptive gradient descent converges faster and generalizes better than Nesterov's accelerated gradient and Adam.
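
To make the contrast between the two kinds of adaptivity concrete, here is a minimal NumPy sketch of an Adagrad-style step with one accumulator per coordinate next to a variant with one scalar accumulator per block. The abstract does not state the update rule, so several choices here are assumptions made for illustration only: each parameter array is treated as one block, the block accumulator is the running sum of the mean squared gradient over the block, and the function names, `lr`, and `eps` are hypothetical.

```python
import numpy as np


def adagrad_coordinatewise(params, grads, state, lr=0.1, eps=1e-8):
    """Adagrad-style step with one accumulator per coordinate.

    params, grads: lists of float arrays with matching shapes.
    state: list of arrays (same shapes) holding summed squared gradients.
    Parameters are updated in place.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        state[i] = state[i] + g ** 2                # per-coordinate history
        p -= lr * g / (np.sqrt(state[i]) + eps)     # each coordinate has its own stepsize
    return params, state


def adagrad_blockwise(params, grads, state, lr=0.1, eps=1e-8):
    """Blockwise variant: one scalar accumulator per block.

    Assumption: each array in `params` is one block, and the block
    statistic is the mean squared gradient over that block.
    state: list of scalars, one per block.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        state[i] = state[i] + np.mean(g ** 2)       # one number per block
        p -= lr * g / (np.sqrt(state[i]) + eps)     # whole block shares one stepsize
    return params, state


# Usage: one block with two coordinates and unequal gradient magnitudes.
w_cw = [np.array([1.0, 2.0])]
w_bw = [np.array([1.0, 2.0])]
g = [np.array([3.0, 4.0])]

adagrad_coordinatewise(w_cw, g, [np.zeros(2)])
adagrad_blockwise(w_bw, g, [0.0])
```

In this sketch the first coordinate-wise step moves every coordinate by roughly `lr` regardless of gradient magnitude, while the blockwise step stays parallel to the gradient within the block, which is one way to read the abstract's remark that blockwise adaptivity is less aggressive.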

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
cs.LG 2026-05 unverdicted novelty 6.0

PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...