Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

Hanyang Li; Jianhao Ma; Ying Cui

arxiv: 2606.09012 · v1 · pith:HKT7AFGVnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

Hanyang Li , Jianhao Ma , Ying Cui This is my paper

Pith reviewed 2026-06-27 17:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML

keywords quantization-aware trainingpost-training quantizationstraight-through estimatorloss landscapebasin geometrylow-bit quantizationgradient bias

0 comments

The pith

Straight-through estimator in QAT evaluates gradients at quantized weights, creating an inward bias that steers iterates back into the low-loss basin where local PTQ can exit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models full-precision training as following a low-loss river inside a wider valley whose normal neighborhood forms a nearly flat basin. When the quantization grid size becomes comparable to basin width, standard PTQ objectives such as rounding or Hessian-based reconstruction can select a high-loss deployed point outside the basin even when low-loss quantized points lie nearby. Straight-through-estimator QAT evaluates gradients at the deployed quantized weights while updating latent full-precision weights, so the gradient senses the valley wall and acquires an inward component that pulls subsequent quantized points back inside. The authors formalize the landscape geometry, construct the PTQ failure mode, and prove finite-time recovery under local quantizer-compatibility assumptions. Experiments on vision and language models across quantization schemes match the predicted basin-crossing behavior.

Core claim

We model full-precision training as following a low-loss river inside a wider valley: a normal neighborhood of the river forms a nearly flat basin, while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire

What carries the argument

The local landscape model of a low-loss river inside a valley with a flat basin, together with the straight-through estimator that evaluates gradients at deployed quantized weights.

If this is right

PTQ can geometrically fail by selecting high-loss quantized points outside the basin even when low-loss ones exist nearby.
STE-based QAT acquires an inward gradient component from the valley wall that steers quantized iterates back inside the basin.
Finite-time recovery to the basin is provable under the stated local quantizer-compatibility assumptions.
The same geometric mechanism appears across vision and language models under multiple quantization schemes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The basin model suggests testing whether explicit penalties on basin exit during QAT further improve low-bit accuracy.
The framework could be extended to multi-basin landscapes to predict when QAT recovery breaks down.
It raises the question of whether other gradient estimators or quantization schedules inherit similar inward bias properties.

Load-bearing premise

Full-precision training follows a low-loss river inside a wider valley whose normal neighborhood forms a nearly flat basin, and the quantizer satisfies the local compatibility assumptions needed for the finite-time recovery proof.

What would settle it

A controlled low-dimensional loss surface where the quantization step equals basin width, the full-precision optimum lies inside the basin, PTQ rounds to a point outside the basin, and QAT with STE either fails to recover or the measured gradient at the outside point has no inward component.

Figures

Figures reproduced from arXiv: 2606.09012 by Hanyang Li, Jianhao Ma, Ying Cui.

**Figure 2.** Figure 2: Local landscape diagnostics around pretrained checkpoints. Each panel plots the loss along [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: ResNet/CIFAR-10 and DeiT/ImageNet landscape diagnostics under 2-bit quantization. For ResNet-56, one-dimensional loss profiles from the FP checkpoint toward (a) the AdaRound anchor and (b) the QAT anchor; (c) Two-dimensional loss contour in the plane spanned by the FP→QAT direction and a randomly generated orthogonal direction. For DeiT-Tiny, (d)–(f) show the same diagnostics. 4.1 ResNet and DeiT on Image … view at source ↗

**Figure 4.** Figure 4: reports one-dimensional loss profiles from the converged full-precision checkpoint toward the GPTQ and QAT anchors, together with a two-dimensional contour plot around the FP→QAT direction. The QAT displacement is much shorter than the GPTQ displacement in parameter-space Euclidean norm. Moreover, the two displacements are nearly orthogonal, forming an angle of approximately 92◦ . Therefore, QAT is not a s… view at source ↗

**Figure 3.** Figure 3: W2: FP-180 → RTN/QAT 1D profile W2: FP-180 → AdaRound/QAT 1D profile W2: FP-200 → AdaRound W3: FP-180 → RTN/QAT 1D profile W3: FP-180 → AdaRound/QAT 1D profile W3: FP-200 → AdaRound W4: FP-180 → RTN/QAT 1D profile W4: FP-180 → AdaRound/QAT 1D profile W4: FP-200 → AdaRound [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 5.** Figure 5: Additional ResNet-20 on CIFAR-10 interpolation diagnostics. W3/W4 seed-1 interpolation [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Additional ResNet-56 on CIFAR-10 interpolation diagnostics. W3/W4 seed-1 interpolation [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: DeiT/ImageNet interpolation diagnostics. Rows correspond to W3/W4. The first column [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: shows one-dimensional loss profiles along these two directions. The FP→PTQ profile rises smoothly through the PTQ anchor at t = 1 to its f = 0.7502 value, while the FP→QAT profile reaches a lower f = 0.2543 at t = 1; both rise monotonically because Xfp is the global minimum, but the slope along ∆QAT is markedly gentler, indicating a basin-aligned displacement compared to the rounding direction [PITH_FULL_… view at source ↗

**Figure 9.** Figure 9: FP/PTQ/QAT 2D loss landscape for the high-dimensional Matrix Factorization simulation. Loss on the affine plane X(a, b) = Xfp + a ∆PTQ + b ∆QAT through the three anchors (FP at (0, 0), PTQ at (1, 0), QAT at (0, 1)). Axes are in Frobenius distance units (∥∆PTQ∥ = 0.474, ∥∆QAT∥ = 0.594, angle 95.3 ◦ ). PTQ sits on a steeper part of the surface than QAT; the QAT anchor lands inside the warm-color (low-loss) b… view at source ↗

**Figure 10.** Figure 10: River-cross 3D loss landscape for the high-dimensional Matrix Factorization simulation. Loss f on the plane spanned by ∆FP→QAT = Q(XQAT) − Xfp and a random direction projected orthogonal to it and rescaled to ∥∆FP→QAT∥, averaged pointwise over 5 random seeds (same construction as the ResNet/DeiT/Llama river-cross plots in the main text). The valley along b = 0 is narrow and deep: loss stays low along ∆FP→… view at source ↗

read the original abstract

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss \emph{river} inside a wider \emph{valley}: a normal neighborhood of the river forms a nearly flat \emph{basin}, while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a geometric story for the PTQ-QAT gap via a river-valley-basin loss landscape, but the model and proof assumptions appear constructed to produce the desired bias result.

read the letter

The main thing to know is that the authors model full-precision training as following a low-loss river inside a wider valley, with a nearly flat basin around it, and claim that when the quantization grid is roughly the basin width, PTQ objectives like rounding or Hessian reconstruction can land outside the basin even when better quantized points exist nearby. Straight-through-estimator QAT then gets an inward gradient component because it evaluates at the deployed quantized weights, and they prove finite-time recovery under local quantizer-compatibility assumptions, backed by experiments on vision and language models.

What is actually new is the explicit construction of the basin-crossing failure for PTQ and the finite-time recovery theorem for QAT. The unified geometric framing is not the usual empirical comparison in the quantization literature. The experiments are set up to check the predicted behaviors across multiple schemes, which gives the claims some grounding.

The soft spots are in the foundational pieces. The river-valley-basin geometry is introduced as a local landscape model rather than derived from the actual loss surfaces of the trained networks. The proof then adds quantizer-compatibility assumptions that enable the inward bias to appear. If those do not hold for the models and quantizers in practice, the mechanistic account loses its force even when the accuracy gap between PTQ and QAT remains. The abstract does not show independent checks on the basin shape or sensitivity to the assumptions.

This is for people working on quantization methods who want a mechanistic account beyond accuracy tables. A reader looking for geometric intuition on why QAT helps at aggressive bitwidths could get value from the framing, though they would need to test the model themselves.

It deserves a serious referee. The formal attempt and the experiments are enough to warrant review, even if the referee will have to press on whether the landscape model and assumptions are realistic or tailored.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified geometric framework explaining PTQ failure and QAT recovery. Full-precision training is modeled as following a low-loss river inside a wider valley whose normal neighborhood forms a nearly flat basin. When the quantization grid scale is comparable to basin width, local PTQ objectives (rounding, Hessian-based reconstruction) can select high-loss deployed points outside the basin even when low-loss quantized points exist nearby. STE-based QAT evaluates gradients at the deployed quantized weights, imparting an inward bias that steers iterates back into the basin. The paper constructs the geometric PTQ failure mode, proves finite-time QAT recovery under local quantizer-compatibility assumptions, and presents corroborating experiments on vision and language models under multiple quantization schemes.

Significance. If the geometric mechanism and recovery proof hold, the work supplies a principled explanation for the empirical superiority of QAT over PTQ at aggressive bit-widths and could guide algorithm design. The formal finite-time recovery result and the breadth of experimental corroboration across models and schemes are explicit strengths.

major comments (2)

[Abstract and local landscape model section] Abstract and local landscape model section: the PTQ failure construction and the QAT inward-bias mechanism are derived only after positing the specific river-valley-basin geometry together with the local quantizer-compatibility assumptions; these modeling choices are load-bearing for the central claim yet receive no independent empirical verification (e.g., no landscape visualizations or basin-width measurements) that would confirm the geometry holds for the networks and quantizers studied.
[Finite-time recovery theorem] Finite-time recovery theorem: the proof invokes additional local quantizer-compatibility assumptions whose restrictiveness is not quantified; without showing that these assumptions are satisfied by the uniform and other quantizers used in the experiments, the theorem does not yet establish that the claimed steering mechanism operates in the reported settings.

minor comments (2)

[Experiments section] Experiments section: while the accuracy-gap results are consistent with the predicted basin-crossing behavior, adding controls that directly probe the posited geometry (e.g., loss-surface slices around quantized points) would tighten the link between theory and observation.
[Notation] Notation: ensure that the distinction between latent full-precision weights, deployed quantized weights, and the STE gradient is denoted uniformly across all equations and algorithm boxes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the geometric framework and finite-time recovery result. We address the two major comments point by point below, proposing targeted revisions to strengthen the presentation of modeling assumptions and their relation to the experiments.

read point-by-point responses

Referee: [Abstract and local landscape model section] Abstract and local landscape model section: the PTQ failure construction and the QAT inward-bias mechanism are derived only after positing the specific river-valley-basin geometry together with the local quantizer-compatibility assumptions; these modeling choices are load-bearing for the central claim yet receive no independent empirical verification (e.g., no landscape visualizations or basin-width measurements) that would confirm the geometry holds for the networks and quantizers studied.

Authors: We agree that the river-valley-basin geometry is a modeling choice introduced to explain the observed PTQ failure mode and QAT recovery. The manuscript provides indirect support through experiments showing that PTQ exhibits the predicted basin-crossing failures and QAT recovers under the same conditions across vision and language models with multiple quantization schemes. Direct landscape visualizations or basin-width measurements are not included, as they are computationally prohibitive at the scale of the studied networks. In revision we will add a dedicated paragraph in the local landscape model section that explicitly discusses the alignment between the posited geometry and the experimental outcomes, while acknowledging the lack of direct geometric measurements as a limitation. revision: partial
Referee: [Finite-time recovery theorem] Finite-time recovery theorem: the proof invokes additional local quantizer-compatibility assumptions whose restrictiveness is not quantified; without showing that these assumptions are satisfied by the uniform and other quantizers used in the experiments, the theorem does not yet establish that the claimed steering mechanism operates in the reported settings.

Authors: The local quantizer-compatibility assumptions are stated explicitly in the theorem (Section 4). We will revise the manuscript to add a new subsection that quantifies the restrictiveness of these assumptions (e.g., by bounding the deviation from uniformity near basin boundaries) and verifies that they hold for the uniform quantizer and the other schemes employed in the experiments under the local approximation. This verification will be performed analytically for the uniform case and numerically for the reported bit-widths. revision: yes

Circularity Check

0 steps flagged

No significant circularity: model and proof are explicit modeling choices with stated assumptions, not reductions by construction.

full rationale

The paper introduces its local landscape model (river/valley/basin geometry) and quantizer-compatibility assumptions directly in the text as a proposed framework, then derives the PTQ failure construction and QAT recovery theorem from them. No equations or claims reduce to their inputs by definition, no parameters are fitted then relabeled as predictions, and no self-citations are invoked as load-bearing uniqueness results. The derivation chain is self-contained as a hypothesis plus conditional proof plus experiments; this matches the default non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework introduces one invented geometric construct and one set of ad-hoc assumptions required for the proof; no free parameters are stated in the abstract.

axioms (1)

ad hoc to paper local quantizer-compatibility assumptions
Required to prove finite-time QAT recovery in the local landscape model.

invented entities (1)

low-loss river inside a wider valley with nearly flat basin no independent evidence
purpose: Models the loss landscape geometry that produces PTQ basin-crossing failure and QAT inward bias
Constructed as part of the local landscape model; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5823 in / 1412 out tokens · 22475 ms · 2026-06-27T17:32:37.489264+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 4 linked inside Pith

[1]

Proxquant: Quantized neural networks via proximal operators

Yu Bai, Yu-Xiang Wang, and Edo Liberty. Proxquant: Quantized neural networks via proximal operators. InInternational Conference on Learning Representations, 2018

2018
[2]

Post training 4-bit quantization of convolu- tional networks for rapid-deployment

Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolu- tional networks for rapid-deployment. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[3]

Gradient extremals, talwegs, valleys, and directional alignment for generic gradient descent.arXiv preprint arXiv:2604.11213, 2026

Pascal Bégout, Jérôme Bolte, Thomas Mariotti, and Francisco Silva. Gradient extremals, talwegs, valleys, and directional alignment for generic gradient descent.arXiv preprint arXiv:2604.11213, 2026

Pith/arXiv arXiv 2026
[4]

Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

Pith/arXiv arXiv 2013
[5]

Training dynamics impact post- training quantization robustness

Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post- training quantization robustness. InInternational Conference on Learning Representations, 2026. 12

2026
[6]

Unveiling the basin-like loss landscape in large language models

Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Unveiling the basin-like loss landscape in large language models. InInternational Conference on Learning Representations, 2026

2026
[7]

BinaryConnect: Training deep neural networks with binary weights during propagations

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. InAdvances in Neural Information Processing Systems, volume 28, pages 3123–3131, 2015

2015
[8]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Mathematical Programming, pages 1–66, 2025

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Mathematical Programming, pages 1–66, 2025

2025
[9]

Demystify- ing and generalizing binaryconnect

Tim Dockhorn, Yaoliang Yu, Eyyüb Sari, Mahdi Zolnouri, and Vahid Partovi Nia. Demystify- ing and generalizing binaryconnect. InAdvances in Neural Information Processing Systems, volume 34, pages 13202–13216, 2021

2021
[10]

Compute- Optimal Quantization-Aware Training

Aleksandr Dremov, David Grangier, Angelos Katharopoulos, and Awni Hannun. Compute- Optimal Quantization-Aware Training. InInternational Conference on Learning Representations, 2026

2026
[11]

GPTQ: Accurate post- training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post- training quantization for generative pre-trained transformers. InInternational Conference on Learning Representations, 2023

2023
[12]

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. InInternational Conference on Learning Representations, 2016

2016
[13]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

2016
[14]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018

2018
[15]

Beyond discreteness: Finite-sample analysis of straight-through estimator for quantization.arXiv preprint arXiv:2505.18113, 2025

Halyun Jeong, Jack Xin, and Penghang Yin. Beyond discreteness: Finite-sample analysis of straight-through estimator for quantization.arXiv preprint arXiv:2505.18113, 2025

Pith/arXiv arXiv 2025
[16]

PARQ: Piecewise-affine regularized quantization

Lisa Jin, Jianhao Ma, Zechun Liu, Andrey Gromov, Aaron Defazio, and Lin Xiao. PARQ: Piecewise-affine regularized quantization. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 28044–28062. PMLR, 2025

2025
[17]

Quantizing deep convolutional networks for efficient inference: A whitepaper.arXiv preprint arXiv:1806.08342, 2018

Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper.arXiv preprint arXiv:1806.08342, 2018

Pith/arXiv arXiv 2018
[18]

Training quantized nets: A deeper understanding

Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. InAdvances in Neural Information Processing Systems, volume 30, 2017. 13

2017
[19]

ParetoQ: Improving scaling laws in extremely low-bit LLM quantization

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, et al. ParetoQ: Improving scaling laws in extremely low-bit LLM quantization. InAdvances in Neural Information Processing Systems, 2025

2025
[20]

Beyond the quadratic approximation: The multiscale structure of neural network loss landscapes.arXiv preprint arXiv:2204.11326, 2022

Chao Ma, Daniel Kunin, Lei Wu, and Lexing Ying. Beyond the quadratic approximation: The multiscale structure of neural network loss landscapes.arXiv preprint arXiv:2204.11326, 2022

arXiv 2022
[21]

Up or down? adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. InInternational Conference on Machine Learning, pages 7197–7206, 2020

2020
[22]

Loss aware post-training quantization.Machine Learning, 110(11):3245–3262, 2021

Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Alex M Bronstein, and Avi Mendelson. Loss aware post-training quantization.Machine Learning, 110(11):3245–3262, 2021

2021
[23]

Efficient processing of deep neural networks: A tutorial and survey.Proceedings of the IEEE, 105(12):2295–2329, 2017

Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey.Proceedings of the IEEE, 105(12):2295–2329, 2017

2017
[24]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021

2021
[25]

Understanding warmup-stable-decay learning rates: A river valley loss landscape view

Kaiyue Wen, Zhiyuan Li, Jason S Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In International Conference on Learning Representations, 2025

2025
[26]

APHQ- ViT: Post-training quantization with average perturbation Hessian based reconstruction for vision transformers

Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, and Yunhong Wang. APHQ- ViT: Post-training quantization with average perturbation Hessian based reconstruction for vision transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9686–9695, 2025

2025
[27]

Understanding straight-through estimator in training activation quantized neural nets.International Conference on Learning Representations, 2019

P Yin, J Lyu, S Zhang, S Osher, YY Qi, and J Xin. Understanding straight-through estimator in training activation quantized neural nets.International Conference on Learning Representations, 2019

2019
[28]

PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization

Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization. InEuropean Conference on Computer Vision, pages 191–207. Springer, 2022. 14 A Experimental Details ResNet/CIFAR-10.Each full-precision ResNet is trained for 200 epochs with SGD, momentum 0.9, batch si...

2022
[29]

We next control the error caused by replacingCP with the frozen metricC0

Applying the moving-metric estimate with threshold2R/ √ 5gives ⟨∇Zf(P, Z), ZC P ⟩ D0(Z) ≥ 16 25 c∗. We next control the error caused by replacingCP with the frozen metricC0. OnU, ∥∇Zf(P, Z)∥ F ≤4∥Z∥ F ∥P∥ 2 op + 4∥Z∥3 F ≤4R U(M 2 +R 2 U)≜G. SinceD 0(Z)≥ √λf ∥Z∥F, we have ⟨∇Zf(P, Z), Z(C 0 −C P )⟩ D0(Z) ≤ ∥∇Zf(P, Z)∥ F ∥Z∥F ∥C0 −C P ∥op D0(Z) ≤ GλsδH√λf . ...

[1] [1]

Proxquant: Quantized neural networks via proximal operators

Yu Bai, Yu-Xiang Wang, and Edo Liberty. Proxquant: Quantized neural networks via proximal operators. InInternational Conference on Learning Representations, 2018

2018

[2] [2]

Post training 4-bit quantization of convolu- tional networks for rapid-deployment

Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolu- tional networks for rapid-deployment. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[3] [3]

Gradient extremals, talwegs, valleys, and directional alignment for generic gradient descent.arXiv preprint arXiv:2604.11213, 2026

Pascal Bégout, Jérôme Bolte, Thomas Mariotti, and Francisco Silva. Gradient extremals, talwegs, valleys, and directional alignment for generic gradient descent.arXiv preprint arXiv:2604.11213, 2026

Pith/arXiv arXiv 2026

[4] [4]

Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

Pith/arXiv arXiv 2013

[5] [5]

Training dynamics impact post- training quantization robustness

Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post- training quantization robustness. InInternational Conference on Learning Representations, 2026. 12

2026

[6] [6]

Unveiling the basin-like loss landscape in large language models

Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Unveiling the basin-like loss landscape in large language models. InInternational Conference on Learning Representations, 2026

2026

[7] [7]

BinaryConnect: Training deep neural networks with binary weights during propagations

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. InAdvances in Neural Information Processing Systems, volume 28, pages 3123–3131, 2015

2015

[8] [8]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Mathematical Programming, pages 1–66, 2025

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Mathematical Programming, pages 1–66, 2025

2025

[9] [9]

Demystify- ing and generalizing binaryconnect

Tim Dockhorn, Yaoliang Yu, Eyyüb Sari, Mahdi Zolnouri, and Vahid Partovi Nia. Demystify- ing and generalizing binaryconnect. InAdvances in Neural Information Processing Systems, volume 34, pages 13202–13216, 2021

2021

[10] [10]

Compute- Optimal Quantization-Aware Training

Aleksandr Dremov, David Grangier, Angelos Katharopoulos, and Awni Hannun. Compute- Optimal Quantization-Aware Training. InInternational Conference on Learning Representations, 2026

2026

[11] [11]

GPTQ: Accurate post- training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post- training quantization for generative pre-trained transformers. InInternational Conference on Learning Representations, 2023

2023

[12] [12]

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. InInternational Conference on Learning Representations, 2016

2016

[13] [13]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

2016

[14] [14]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018

2018

[15] [15]

Beyond discreteness: Finite-sample analysis of straight-through estimator for quantization.arXiv preprint arXiv:2505.18113, 2025

Halyun Jeong, Jack Xin, and Penghang Yin. Beyond discreteness: Finite-sample analysis of straight-through estimator for quantization.arXiv preprint arXiv:2505.18113, 2025

Pith/arXiv arXiv 2025

[16] [16]

PARQ: Piecewise-affine regularized quantization

Lisa Jin, Jianhao Ma, Zechun Liu, Andrey Gromov, Aaron Defazio, and Lin Xiao. PARQ: Piecewise-affine regularized quantization. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 28044–28062. PMLR, 2025

2025

[17] [17]

Quantizing deep convolutional networks for efficient inference: A whitepaper.arXiv preprint arXiv:1806.08342, 2018

Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper.arXiv preprint arXiv:1806.08342, 2018

Pith/arXiv arXiv 2018

[18] [18]

Training quantized nets: A deeper understanding

Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. InAdvances in Neural Information Processing Systems, volume 30, 2017. 13

2017

[19] [19]

ParetoQ: Improving scaling laws in extremely low-bit LLM quantization

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, et al. ParetoQ: Improving scaling laws in extremely low-bit LLM quantization. InAdvances in Neural Information Processing Systems, 2025

2025

[20] [20]

Beyond the quadratic approximation: The multiscale structure of neural network loss landscapes.arXiv preprint arXiv:2204.11326, 2022

Chao Ma, Daniel Kunin, Lei Wu, and Lexing Ying. Beyond the quadratic approximation: The multiscale structure of neural network loss landscapes.arXiv preprint arXiv:2204.11326, 2022

arXiv 2022

[21] [21]

Up or down? adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. InInternational Conference on Machine Learning, pages 7197–7206, 2020

2020

[22] [22]

Loss aware post-training quantization.Machine Learning, 110(11):3245–3262, 2021

Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Alex M Bronstein, and Avi Mendelson. Loss aware post-training quantization.Machine Learning, 110(11):3245–3262, 2021

2021

[23] [23]

Efficient processing of deep neural networks: A tutorial and survey.Proceedings of the IEEE, 105(12):2295–2329, 2017

Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey.Proceedings of the IEEE, 105(12):2295–2329, 2017

2017

[24] [24]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021

2021

[25] [25]

Understanding warmup-stable-decay learning rates: A river valley loss landscape view

Kaiyue Wen, Zhiyuan Li, Jason S Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In International Conference on Learning Representations, 2025

2025

[26] [26]

APHQ- ViT: Post-training quantization with average perturbation Hessian based reconstruction for vision transformers

Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, and Yunhong Wang. APHQ- ViT: Post-training quantization with average perturbation Hessian based reconstruction for vision transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9686–9695, 2025

2025

[27] [27]

Understanding straight-through estimator in training activation quantized neural nets.International Conference on Learning Representations, 2019

P Yin, J Lyu, S Zhang, S Osher, YY Qi, and J Xin. Understanding straight-through estimator in training activation quantized neural nets.International Conference on Learning Representations, 2019

2019

[28] [28]

PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization

Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization. InEuropean Conference on Computer Vision, pages 191–207. Springer, 2022. 14 A Experimental Details ResNet/CIFAR-10.Each full-precision ResNet is trained for 200 epochs with SGD, momentum 0.9, batch si...

2022

[29] [29]

We next control the error caused by replacingCP with the frozen metricC0

Applying the moving-metric estimate with threshold2R/ √ 5gives ⟨∇Zf(P, Z), ZC P ⟩ D0(Z) ≥ 16 25 c∗. We next control the error caused by replacingCP with the frozen metricC0. OnU, ∥∇Zf(P, Z)∥ F ≤4∥Z∥ F ∥P∥ 2 op + 4∥Z∥3 F ≤4R U(M 2 +R 2 U)≜G. SinceD 0(Z)≥ √λf ∥Z∥F, we have ⟨∇Zf(P, Z), Z(C 0 −C P )⟩ D0(Z) ≤ ∥∇Zf(P, Z)∥ F ∥Z∥F ∥C0 −C P ∥op D0(Z) ≤ GλsδH√λf . ...