DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1% top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.
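The stochastic gradient quantization mentioned in the abstract can be illustrated with a small sketch. This is not the released DoReFa-Net code: the function names `quantize_k` and `stochastic_quantize_gradient`, the scaling by twice the largest absolute gradient, and the uniform dithering noise are assumptions chosen to show the general idea of rounding gradients onto 2^k evenly spaced levels with random noise so that the rounding is unbiased on average. The paper's exact formulation may differ.

```python
import random


def quantize_k(r, k):
    """Round r in [0, 1] to the nearest of 2**k evenly spaced levels in [0, 1]."""
    n = 2 ** k - 1
    return round(r * n) / n


def stochastic_quantize_gradient(grads, k, rng=random):
    """Quantize a list of gradient values to k bits with random dithering.

    Assumed scheme (illustrative only): shift and scale the gradients into
    [0, 1], add uniform noise of at most half a quantization step, round to
    a k-bit level, then map back to the original range.
    """
    n = 2 ** k - 1
    scale = 2 * max(abs(g) for g in grads)
    if scale == 0:
        # All gradients are zero; nothing to quantize.
        return list(grads)
    out = []
    for g in grads:
        noise = rng.uniform(-0.5, 0.5) / n      # dithering, at most half a step
        r = g / scale + 0.5 + noise             # roughly in [0, 1]
        r = min(1.0, max(0.0, r))               # clip after adding noise
        out.append(scale * (quantize_k(r, k) - 0.5))
    return out


if __name__ == "__main__":
    random.seed(0)
    example = [0.02, -0.5, 0.13, 0.0, 0.5]
    print(stochastic_quantize_gradient(example, k=6))
```

With `k=6`, matching the 6-bit gradients in the abstract's AlexNet example, every output lands on one of 64 levels spanning the range of the input gradients. Adding the noise before rounding makes a value round up with probability equal to its fractional position between two levels, which is why this kind of rounding preserves the gradient in expectation away from the clipped ends.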
Forward citations
Cited by 17 Pith papers
- FTerViT: Fully Ternary Vision Transformer
  FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.
- Training single-electron and single-photon stochastic physical neural networks
  Single-electron and single-photon stochastic physical neural networks achieve over 97% MNIST test accuracy when trained with empirical outputs in the backward pass using few trials per layer.
- DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
  DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x thr...
- Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization
  CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.
- Mixed Precision Training
  Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
- QuantSR+: Pushing the Limit of Quantized Image Super-Resolution Networks
  QuantSR+ introduces RBD, QSA, and SFD techniques to achieve state-of-the-art accuracy-efficiency trade-offs in 2-4 bit quantized image super-resolution networks, with reported PSNR gains like 0.29 dB on Urban100 for SwinIR-S.
- Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers
  A modular framework decomposes Transformer nonlinearities into spike-compatible primitives realized via LIF population coding and bit-shift scaling, supporting Softmax, SiLU, and normalization with under 1% accuracy d...
- SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
  SURGE introduces a dual-path gradient compensator and adaptive scaler to improve surrogate gradient estimation in binarized neural network training.
- SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
  SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.
- DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
  DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.
- FP8 Formats for Deep Learning
  FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.
- Multibit neural inference in a N-ary crossbar architecture
  Simulation of 4-state MTJ crossbars achieves 94.48% MNIST accuracy for neural inference, close to 97.56% software baseline, with analysis showing quantization as primary error and an optimal number of states per cell.
- Design and Implementation of BNN-Based Object Detection on FPGA
  A BNN-based YOLOv3-tiny-like object detector with 1-bit weights and 8-bit activations is implemented in Verilog on FPGA, achieving 39.6% mAP50 on VOC and 0.999964 correlation with the ONNX model in RTL simulation.
- Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
  CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
- Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression
  The prune-quantize-distill ordering produces a better accuracy-size-latency frontier on CIFAR-10/100 than any single technique or other orderings, with INT8 QAT providing the main runtime gain.
- Weight Normalization based Quantization for Deep Neural Network Compression
  WNQ uses weight normalization to reshape weight distributions and reduce quantization error, outperforming baselines on CIFAR-100 and ImageNet.
- Design and Implementation of BNN-Based Object Detection on FPGA
  A BNN-based YOLOv3-tiny object detector is implemented on FPGA achieving 39.6% mAP50 on VOC dataset with 0.098 GFLOPs and near-exact match to ONNX model in RTL simulation.