Model-Preserving Adaptive Rounding

Albert Tseng; Christopher De Sa; Zhaofeng Sun

arxiv: 2505.22988 · v3 · pith:VVBDNYOYnew · submitted 2025-05-29 · 💻 cs.LG · cs.AI

keywords: error, quantization, YAQA, adaptive, algorithms, end-to-end, Hessian, rounding

The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network's output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves a lower error than quantization-aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
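To make the "cosine similarity to the true Hessian" and "Kronecker-factored approximation" ideas concrete, here is a minimal NumPy sketch. It is not YAQA's sketching procedure, which the abstract does not describe. It assumes a toy positive semidefinite matrix as a stand-in for the true Hessian and swaps in the classical nearest-Kronecker-product rearrangement (Van Loan and Pitsianis) to fit the factors. The names `nearest_kronecker`, `cosine_similarity`, and `H_true` are hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_kronecker(H, m, n):
    """Best Frobenius-norm fit H ~= kron(A, B), with A (m x m) and B (n x n).

    Assumption: this uses the classical rearrangement + rank-1 SVD trick,
    not the Hessian sketches from the paper.
    """
    # View H as an m x m grid of n x n blocks and flatten each block into a row.
    # Under this rearrangement, kron(A, B) becomes the rank-1 matrix vec(A) vec(B)^T.
    R = H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    scale = np.sqrt(S[0])
    A = scale * U[:, 0].reshape(m, m)
    B = scale * Vt[0].reshape(n, n)
    return A, B


def cosine_similarity(H, H_approx):
    """Cosine of the angle between two matrices under the Frobenius inner product."""
    return float(np.sum(H * H_approx) / (np.linalg.norm(H) * np.linalg.norm(H_approx)))


rng = np.random.default_rng(0)
m, n = 3, 4  # toy sizes; a real layer would be far larger

# Toy PSD stand-in for the true Hessian (assumption: not derived from any model).
G = rng.standard_normal((m * n, 2 * m * n))
H_true = G @ G.T / G.shape[1]

A, B = nearest_kronecker(H_true, m, n)
cos_sim = cosine_similarity(H_true, np.kron(A, B))
print(f"cosine similarity of Kronecker fit: {cos_sim:.3f}")
```

A value near 1 means the Kronecker-factored matrix points in nearly the same direction as the reference Hessian, which is the quantity the abstract says controls the end-to-end error bound. Storing `A` and `B` costs m² + n² numbers instead of (mn)², which is why such factorizations are attractive for large layers.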

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

High-Rate Quantized Matrix Multiplication II
cs.LG 2026-05 unverdicted novelty 6.0

Waterfilling rate allocation makes quantized matrix multiplication for LLMs near information-theoretically optimal, with WaterSIC being basis-free and within 0.25 bits per entry of the limit.

Statistically-Lossless Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.

CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
cs.LG 2026-02 unverdicted novelty 6.0

CoreQ delivers adaptive mismatch correction via a closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
cs.LG 2026-05 unverdicted novelty 5.0

IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.

High-Rate Quantized Matrix Multiplication I
cs.IT 2026-01 unverdicted novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.