Model-Preserving Adaptive Rounding
read the original abstract
The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network's output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves a lower error than quantization aware training. This translates to state of the art performance on downstream tasks, all while adding no inference overhead.
This paper has not been read by Pith yet.
Forward citations
Cited by 5 Pith papers
-
High-Rate Quantized Matrix Multiplication II
Waterfilling rate allocation makes quantized matrix multiplication for LLMs near information-theoretically optimal, with WaterSIC being basis-free and within 0.25 bits per entry of the limit.
-
Statistically-Lossless Quantization of Large Language Models
SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.
-
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.
-
IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.
-
High-Rate Quantized Matrix Multiplication I
High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.