Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

Alexandru Drimbarean; Colm O'Riordan; James McDermott; Róisín Luo

arxiv: 2408.00923 · v2 · submitted 2024-08-01 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-23 22:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords low-bit quantization · quantization residual · low-rank adaptation · architecture search · convolutional neural networks · post-training quantization · ImageNet

The pith

CoRa recovers quantization residuals in ConvNets by searching low-rank adapter architectures in a space orders of magnitude smaller than the weight space, matching 4-bit and 3-bit baselines with under 250 iterations on 1600 images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the information lost when quantizing weights to 4 bits or fewer can be reclaimed by inserting low-rank adapters whose architectures are optimized through search rather than by directly tuning the quantized weights. This reframes the problem from weight optimization in huge spaces to architecture search in much smaller ones. The resulting CoRa method reaches accuracy levels comparable to existing quantization-aware training and post-training quantization approaches on ImageNet while requiring far fewer optimization steps and a tiny calibration set. A reader would care because the approach promises to make low-bit quantization of large models feasible with limited compute and data.

Core claim

CoRa frames optimal low-bit quantization as an architecture search problem for low-rank adapters that approximate the quantization residual weights. This reclaims critical residual knowledge with only infinitesimal extra parameters. On multiple pre-trained ConvNets evaluated on ImageNet, the method achieves performance comparable to state-of-the-art quantization-aware training and post-training quantization baselines in both 4-bit and 3-bit settings, using fewer than 250 iterations on a calibration set of 1600 images, thereby establishing a new state-of-the-art in optimization efficiency.

What carries the argument

Low-rank adapters whose architectures are searched to approximate quantization residual weights.
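
To make the mechanism concrete, here is a minimal NumPy sketch of residual recovery under several assumptions that are not taken from the paper: the convolution kernel is flattened to a 2-D matrix, quantization is a symmetric per-tensor uniform scheme, and a truncated SVD stands in for CoRa's searched convolutional adapters. The function names `quantize_uniform` and `low_rank_factors` are hypothetical.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric per-tensor uniform quantization (assumed scheme, for illustration only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_factors(residual, rank):
    """Best rank-`rank` factors (B, A) of `residual` in Frobenius norm, via truncated SVD."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # shape (out, rank)
    a = vt[:rank, :]             # shape (rank, in)
    return b, a

rng = np.random.default_rng(0)
# Hypothetical layer: 64 filters over 64 input channels with 3x3 kernels, flattened.
w_fp = rng.standard_normal((64, 576))
w_q = quantize_uniform(w_fp, bits=4)
residual = w_fp - w_q            # the quantization residual W_fp - W_q

def relative_error(rank):
    """Residual norm left after adding a rank-`rank` adapter, relative to no adapter."""
    b, a = low_rank_factors(residual, rank)
    return np.linalg.norm(w_fp - (w_q + b @ a)) / np.linalg.norm(residual)

errors = [relative_error(r) for r in (0, 8, 32, 64)]
print(errors)
```

Rank 0 leaves the whole residual in place and the full rank of 64 removes it up to floating-point error. Random Gaussian weights are used only to make the sketch runnable; their residual is expected to be close to full rank, so how quickly the error falls on real pre-trained layers is an open question here.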

If this is right

Comparable accuracy to quantization-aware training and post-training quantization baselines at 4-bit and 3-bit precision.
Optimization finishes in fewer than 250 iterations.
Only 1600 calibration images are required.
New state-of-the-art reported for optimization efficiency in low-bit quantization of ConvNets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same residual-reclamation idea could be tested on transformer architectures beyond the ConvNets studied here.
Fewer iterations might translate into lower energy cost when compressing models at scale.
Architecture search over residuals could be adapted to recover information lost in other compression techniques such as pruning.

Load-bearing premise

The information lost in quantization can be recovered sufficiently well by low-rank adapters whose structures are found through search in a space much smaller than the original weight space.

What would settle it

A direct ImageNet measurement showing that, after 250 iterations on 1600 calibration images, CoRa-quantized models fall more than a few percentage points below BRECQ or similar baselines would falsify the comparable-performance claim.

Figures

Figures reproduced from arXiv: 2408.00923 by Alexandru Drimbarean, Colm O'Riordan, James McDermott, Róisín Luo.

**Figure 1.** CoRa framework: searching for the optimal adapters, reclaiming the quantization residual knowledge, instead of the optimal quantized weights. The low-rank convolutional adapter at the $l$-th layer, $B^{(l)}_{r_l} \circledast A^{(l)}_{r_l}$, is determined by a discrete integer $r_l$. QAT methods seek the optimal quantized weights during the training process to minimize performance degradation. Despite their promising performance, the…

**Figure 2.** Differentiable thresholding with a high-order normalized Butterworth kernel.

**Figure 3.** Optimization iterations and solution. The experiments are with…

**Figure 5.** Optimization efficiency on ImageNet. Results are in a logarithmic scale.

**Figure 6.** Performance scalability with respect to ConvNet sizes.

**Figure 9.** Top-1 accuracy of multiple vision architectures on…

**Figure 10.** Solution for resnet18.

**Figure 11.** Solution for resnet34.

**Figure 12.** Solution for resnet50.

**Figure 13.** Solution for inception.

**Figure 14.** Solution for wide_resnet50_2.
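
The Figure 2 caption mentions differentiable thresholding with a normalized Butterworth kernel (NBK), but the page does not reproduce the paper's formula. As a rough illustration only, the sketch below uses the standard Butterworth low-pass magnitude response, $1/\sqrt{1 + (i/c)^{2k}}$, as a soft mask over rank indices. The function name `butterworth_mask`, the use of rank indices as the input, and the parameter names `cutoff` and `order` are assumptions, not the paper's definitions.

```python
import numpy as np

def butterworth_mask(size, cutoff, order):
    """Soft step over indices 0..size-1: near 1 below `cutoff`, near 0 above it.

    This is the textbook Butterworth magnitude response, used here as an
    assumed stand-in for the paper's normalized Butterworth kernel.
    """
    idx = np.arange(size, dtype=float)
    return 1.0 / np.sqrt(1.0 + (idx / cutoff) ** (2 * order))

soft = butterworth_mask(16, cutoff=8.0, order=2)    # gentle roll-off
sharp = butterworth_mask(16, cutoff=8.0, order=10)  # closer to a hard threshold
print(np.round(soft, 3))
print(np.round(sharp, 3))
```

Raising `order` pushes the mask toward a hard cut at `cutoff` while keeping it smooth in `cutoff`, which is presumably why a high-order kernel is attractive when an integer rank has to be optimized with gradients.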

Original abstract

This paper explores a novel paradigm in low-bit (i.e. 4-bits or lower) quantization, differing from existing state-of-the-art methods, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets). Our framework, dubbed **CoRa** (Optimal Quantization Residual **Co**nvolutional Operator Low-**Ra**nk Adaptation), is motivated by two key aspects. Firstly, quantization residual knowledge, i.e. the lost information between floating-point weights and quantized weights, has long been neglected by the research community. Reclaiming the critical residual knowledge, with an infinitesimal extra parameter cost, can reverse performance degradation without training. Secondly, state-of-the-art quantization frameworks search for optimal quantized weights to address the performance degradation. Yet, the vast search spaces in weight optimization pose a challenge for the efficient optimization in large models. For example, state-of-the-art BRECQ necessitates $2 \times 10^4$ iterations to quantize models. Fundamentally differing from existing methods, **CoRa** searches for the optimal architectures of low-rank adapters, reclaiming critical quantization residual knowledge, within the search spaces smaller compared to the weight spaces, by many orders of magnitude. The low-rank adapters approximate the quantization residual weights, discarded in previous methods. We evaluate our approach over multiple pre-trained ConvNets on ImageNet. **CoRa** achieves comparable performance against both state-of-the-art quantization-aware training and post-training quantization baselines, in $4$-bit and $3$-bit quantization, by using less than $250$ iterations on a small calibration set with $1600$ images. Thus, **CoRa** establishes a new state-of-the-art in terms of the optimization efficiency in low-bit quantization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoRa's main hook is turning residual recovery into low-rank adapter architecture search to cut iterations from 20k to under 250, but the low-rank assumption on residuals lacks supporting checks.

The paper's core move is to treat quantization residual recovery as an architecture search over low-rank adapters rather than direct weight optimization. This shrinks the search space by orders of magnitude compared to BRECQ-style methods and lets them run on just 1600 calibration images with under 250 iterations while claiming comparable 3-bit and 4-bit accuracy on ImageNet ConvNets. That efficiency angle is the clearest practical difference from prior post-training quantization work. The abstract does a clean job of spelling out the contrast in search spaces and the motivation around neglected residual knowledge. If the numbers hold in the full experiments, the iteration reduction could matter for people who actually compress models for edge use.

The soft spots sit around verification. There are no error bars, no ablations on adapter rank, and no direct measurement of how low-rank the actual quantization residuals are in the layers they test. The stress-test concern lands: if the residual matrices have effective rank close to their dimensions, low-rank adapters will leave recoverable error on the table and the claimed efficiency gain could come at an accuracy cost that the abstract does not address. Without rank analysis or approximation-error plots, the central claim stays plausible but unanchored.

This is aimed at researchers who care about practical PTQ workflows and search-cost reduction more than marginal accuracy lifts. A reader already working on efficient deployment would find the iteration numbers worth looking at even if the accuracy story is only parity. It deserves a serious referee because the efficiency reframing is distinct enough to test, though the review would need to press on the residual-rank evidence and add the missing controls.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CoRa, a novel paradigm for low-bit (3- and 4-bit) quantization of ConvNets that reframes the problem as an architecture search over low-rank adapters to recover quantization residuals (W_fp - W_q), rather than directly optimizing quantized weights. It claims this yields accuracy comparable to SOTA QAT and PTQ baselines on ImageNet while requiring <250 iterations on a 1600-image calibration set, due to search spaces orders of magnitude smaller than weight spaces.

Significance. If the low-rank recovery of residuals proves reliable and the efficiency gains are reproducible, the work would offer a meaningful advance in post-training quantization efficiency for large ConvNets. The conceptual shift from weight optimization to adapter-architecture search is distinctive and could reduce iteration counts substantially, but the absence of residual-rank analysis or approximation-error quantification leaves the central efficiency claim unverified.

major comments (3)

[Abstract] The assertion that CoRa searches 'within the search spaces smaller compared to the weight spaces, by many orders of magnitude' is not accompanied by any quantitative comparison of search-space cardinalities or explicit verification that this reduction produces the reported <250 iterations versus BRECQ's 2×10^4.
[Abstract] No analysis, table, or figure reports the effective rank of the quantization residual matrices for the evaluated ConvNet layers, leaving untested the load-bearing assumption that low-rank adapters can recover residuals without substantial approximation error.
[Abstract] Results are presented without error bars, without ablations on adapter rank or search-space size, and without confirmation that the claimed search-space reduction is what enables the reported iteration counts and accuracy parity.
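
The effective-rank check requested in the second comment could be sketched as below. This is one common definition, the smallest number of singular values that captures a chosen share of the squared Frobenius norm, and it is an assumption rather than anything the paper or the referee specifies. The function name `effective_rank` and the 0.9 energy threshold are hypothetical.

```python
import numpy as np

def effective_rank(matrix, energy=0.9):
    """Smallest k whose top-k singular values hold `energy` of the squared Frobenius norm."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))

rng = np.random.default_rng(0)
# Synthetic stand-ins for a residual matrix; real residuals would come from W_fp - W_q.
low_rank = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 128))
dense = rng.standard_normal((64, 128))

print(effective_rank(low_rank, 0.99))  # at most 3 by construction
print(effective_rank(dense, 0.9))      # much larger for unstructured noise
```

Running this per layer on actual residuals would indicate whether they look more like the structured case or the noise case, which is the evidence the report says is missing.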

minor comments (1)

[Abstract] The phrasing 'infinitesimal extra parameter cost' is imprecise; the manuscript should state the exact additional parameter count relative to the base quantized model.
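
For a sense of what an exact count might look like, here is a small arithmetic sketch. The adapter layout is an assumption made for illustration: `A` is taken to be a k×k convolution from `c_in` to `rank` channels and `B` a 1×1 convolution from `rank` to `c_out` channels. The layer sizes and the rank of 4 are hypothetical, and storage precision is ignored.

```python
def conv_params(c_out, c_in, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_out * c_in * k * k

def adapter_params(c_out, c_in, k, rank):
    """Weight count of the assumed adapter pair: A (k x k, c_in -> rank) and B (1 x 1, rank -> c_out)."""
    return rank * c_in * k * k + c_out * rank

base = conv_params(256, 256, 3)
extra = adapter_params(256, 256, 3, rank=4)
overhead = extra / base

print(base, extra, f"{overhead:.2%}")
```

Under these assumptions the adapter adds 10,240 weights to a 589,824-weight layer, roughly 1.7%. Reporting this kind of figure per model would replace "infinitesimal" with a number.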

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of our efficiency claims and supporting analyses. We address each major comment below and will incorporate revisions to provide the requested quantitative details, analyses, and ablations.

Referee: [Abstract] The assertion that CoRa searches 'within the search spaces smaller compared to the weight spaces, by many orders of magnitude' is not accompanied by any quantitative comparison of search-space cardinalities or explicit verification that this reduction produces the reported <250 iterations versus BRECQ's 2×10^4.

Authors: We agree that explicit quantification would better support the claim. In the revised manuscript, we will add a dedicated section or table that computes and compares the cardinalities of CoRa's low-rank adapter architecture search space (discrete choices over ranks, placements, and configurations per layer) against the continuous weight spaces optimized in BRECQ and similar methods. We will also include a brief analysis linking the reduced cardinality to the observed iteration counts on the 1600-image calibration set. Revision: yes.

Referee: [Abstract] No analysis, table, or figure reports the effective rank of the quantization residual matrices for the evaluated ConvNet layers, leaving untested the load-bearing assumption that low-rank adapters can recover residuals without substantial approximation error.

Authors: We acknowledge this gap. The revised version will include a new analysis (with accompanying table or figure) reporting the effective ranks of the quantization residual matrices (W_fp - W_q) across layers of the evaluated ConvNets. This will quantify the approximation error when using low-rank adapters of varying ranks and confirm the suitability of the low-rank assumption for residual recovery. Revision: yes.

Referee: [Abstract] Results are presented without error bars, without ablations on adapter rank or search-space size, and without confirmation that the claimed search-space reduction is what enables the reported iteration counts and accuracy parity.

Authors: We will revise the experimental section to include error bars from multiple independent runs, ablations varying adapter rank and search-space size, and additional discussion or controlled experiments that isolate the contribution of search-space reduction to the iteration efficiency and accuracy results. These additions will directly address the request for confirmation. Revision: yes.
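
The cardinality comparison promised in the first response could start from a back-of-the-envelope count like the one below. It rests on simplifying assumptions: the adapter search picks one integer rank per layer, and the weight side is counted as the number of distinct b-bit assignments, although BRECQ-style methods actually optimize continuous rounding variables. The layer count, maximum rank, and weight count are hypothetical values, not figures from the paper.

```python
import math

def log10_rank_configs(max_ranks):
    """log10 of the number of ways to pick one rank in {0, ..., R_l} for every layer."""
    return sum(math.log10(r + 1) for r in max_ranks)

def log10_weight_configs(num_weights, bits):
    """log10 of the number of distinct b-bit assignments to `num_weights` weights."""
    return num_weights * bits * math.log10(2)

max_ranks = [64] * 20        # hypothetical: 20 layers, each with ranks 0..64
num_weights = 11_000_000     # hypothetical weight count for a mid-sized ConvNet

print(f"rank search:   about 10^{log10_rank_configs(max_ranks):.1f} configurations")
print(f"4-bit weights: about 10^{log10_weight_configs(num_weights, 4):.3g} configurations")
```

Working in log10 avoids overflow, since the weight-side count is far too large to represent directly. A count like this supports the "orders of magnitude" wording but does not by itself show that the smaller space is what drives the lower iteration count.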

Circularity Check

0 steps flagged

No circularity; empirical architecture search is independent of its reported outcomes.

The provided text frames CoRa as an empirical search over low-rank adapter architectures to approximate quantization residuals, with performance measured directly against QAT/PTQ baselines on ImageNet using <250 iterations on 1600 images. No equations, fitted parameters, or self-citations are shown that reduce the claimed accuracy or efficiency metrics to quantities defined by the method itself. The central premise is a reframing of quantization as architecture search rather than a self-referential derivation or renamed known result. This matches the expectation of self-contained empirical work with no load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that quantization residuals admit low-rank approximations whose architectures can be searched efficiently; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Quantization residuals between floating-point and low-bit weights can be recovered by low-rank adapters without retraining the base ConvNet. (Stated as the first key motivation in the abstract.)
domain assumption: The architecture search space over low-rank adapters is smaller than the weight optimization space by many orders of magnitude. (Explicit contrast drawn with BRECQ in the abstract.)

pith-pipeline@v0.9.0 · 5874 in / 1454 out tokens · 21603 ms · 2026-05-23T22:24:19.470741+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 9 internal anchors

[1]

A survey of convolutional neural networks: analysis, applications, and prospects

Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems, 2021.

[2]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

[3]

An overview of neural network compression

James O'Neill. An overview of neural network compression. arXiv preprint arXiv:2006.03669.

[4]

A survey on deep neural network compression: Challenges, overview, and solutions

Rahul Mishra, Hari Prabhat Gupta, and Tanima Dutta. A survey on deep neural network compression: Challenges, overview, and solutions. arXiv preprint arXiv:2010.03954.

[5]

Low-bit quantization of neural networks for efficient inference

Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3009–3018. IEEE.

[6]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

[7]

A White Paper on Neural Network Quantization

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295.

[8]

A Survey on Methods and Theories of Quantized Neural Networks

Yunhui Guo. A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752.

[9]

PACT: Parameterized Clipping Activation for Quantized Neural Networks

Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.

[10]

Learning low-precision neural networks without Straight-Through Estimator (STE)

Zhi-Gang Liu and Matthew Mattina. Learning low-precision neural networks without straight-through estimator (STE). arXiv preprint arXiv:1903.01061.

[11]

Learned step size quantization

Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153.

[12]

BRECQ: Pushing the limit of post-training quantization by block reconstruction

Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. BRECQ: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021.

[13]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

[14]

Accelerating very deep convolutional networks for classification and detection

Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955.

[15]

Speeding up Convolutional Neural Networks with Low Rank Expansions

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.

[16]

Convolution meets LoRA: Parameter efficient finetuning for segment anything model

Zihan Zhong, Zhiqiang Tang, Tong He, Haoyang Fang, and Chun Yuan. Convolution meets LoRA: Parameter efficient finetuning for segment anything model. arXiv preprint arXiv:2401.17868.

[17]

DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.

[18]

Improving post training neural quantization: Layer-wise calibration and integer programming

Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Improving post training neural quantization: Layer-wise calibration and integer programming. arXiv preprint arXiv:2006.10518.

[19]

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342.

[20]

Suppose $\lfloor \cdot \rceil : \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N} \mapsto \mathbb{Z}^{I_1 \times I_2 \times \cdots \times I_N}$ (14) is an element-wise rounding operator in tensor space such as round(·), floor(·) or ceil(·) in PyTorch (Ketkar et al., 2021)

Appendix A. Uniform quantization. We formally introduce uniform quantization, which refers to the integer representations of floating-point tensors by taking the quantization intervals uniformly (Gholami et al., 2022; Guo, 2018). Suppose $\lfloor \cdot \rceil : \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N} \mapsto \mathbb{Z}^{I_1 \times I_2 \times \cdots \times I_N}$ (14) is an element-wi...
