CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

Alexandra Volkova; Andrei Panferov; Dan Alistarh; Mher Safaryan; Soroush Tabesh

arxiv: 2510.18784 · v3 · pith:U72GGMCAnew · submitted 2025-10-21 · 💻 cs.LG

Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, Dan Alistarh

keywords cage, accuracy, curvature-aware, gradient, implementation, loss, method, prior

Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4 bits (W4A4) with the prior best method. The official implementation can be found at https://github.com/IST-DASLab/CAGE .
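To make the abstract's description concrete, the following is a minimal pure-Python sketch of a single straight-through-estimator update on a toy quadratic loss, with an optional extra term added to the STE gradient. Everything here is an assumption for illustration: the function names (`quantize`, `loss_grad`, `ste_step`), the uniform rounding grid, the toy loss, and in particular the form of the correction `lam * curvature * (w - Q(w))`, which is a generic placeholder and not the formula derived in the paper. The official repository linked above is the reference for the actual method.

```python
def quantize(w, step=0.5):
    """Round each weight to the nearest multiple of `step` (uniform grid)."""
    return [step * round(x / step) for x in w]


def loss_grad(w, target):
    """Gradient of the toy loss 0.5 * sum((w_i - target_i)^2)."""
    return [x - t for x, t in zip(w, target)]


def ste_step(w, target, lr=0.1, lam=0.0, curvature=None, step=0.5):
    """One update of the latent (full-precision) weights `w`.

    STE part: the gradient is evaluated at the quantized point Q(w) and
    applied to w as if the quantizer were the identity map.
    Correction part (hypothetical, NOT the paper's formula): a term
    lam * h_i * (w_i - Q(w)_i) that pulls w toward its quantized value,
    scaled by a per-coordinate curvature proxy h_i.
    """
    q = quantize(w, step)
    g = loss_grad(q, target)
    if curvature is None:
        curvature = [1.0] * len(w)  # assumed proxy; all ones by default
    corr = [lam * h * (x - qx) for h, x, qx in zip(curvature, w, q)]
    return [x - lr * (gi + ci) for x, gi, ci in zip(w, g, corr)]


if __name__ == "__main__":
    w, target = [0.3], [1.0]
    print(ste_step(w, target))           # plain STE update
    print(ste_step(w, target, lam=1.0))  # STE plus the placeholder correction
```

With `lam=0.0` the update reduces to plain STE. With `lam > 0` the extra term moves the latent weight toward its quantized value, which is one simple way to express a quantization constraint alongside loss minimization. The abstract mentions that the authors' implementation leverages Adam statistics; the `curvature` argument here is only a stand-in for that idea.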

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zero-Shot Quantization via Weight-Space Arithmetic
cs.CV 2026-04 unverdicted novelty 8.0

A quantization vector derived from a donor model via weight-space arithmetic can be added to a receiver model to improve post-PTQ Top-1 accuracy by up to 60 points in 3-bit settings without receiver-side QAT or data.

WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points
cs.LG 2026-05 unverdicted novelty 6.0

WinQ accelerates quantization-aware training up to 4x and improves sub-4-bit accuracy up to 8.8% by weight interpolation resets and noise-regularized gradients that increase Hessian eigenvalue magnitudes around saddle points.

TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference
cs.CV 2026-04 unverdicted novelty 4.0

Tiny NeRV models using capacity scaling, frequency-aware distillation, and low-precision quantization achieve favorable quality-efficiency trade-offs with far fewer parameters and lower computational costs than standard NeRV.