Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Huizi Mao; Song Han; William J. Dally

arxiv: 1510.00149 · v5 · submitted 2015-10-01 · 💻 cs.CV · cs.NE


Song Han, Huizi Mao, William J. Dally

Pith reviewed 2026-05-12 15:53 UTC · model grok-4.3

classification 💻 cs.CV cs.NE

keywords deep compression · network pruning · trained quantization · Huffman coding · model compression · AlexNet · VGG-16 · embedded deployment


The pith

A three-stage pipeline of pruning, trained quantization and Huffman coding reduces neural network storage by 35x to 49x without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to make large neural networks small enough to run on devices with tight memory limits. It does this through a pipeline that first prunes away unimportant connections, then forces the remaining weights to share a small set of quantized values, and finally encodes them with Huffman coding. Retraining after pruning and quantization restores performance. On ImageNet, this shrinks AlexNet from 240 MB to 6.9 MB and VGG-16 from 552 MB to 11.3 MB while keeping accuracy the same. The smaller models fit in fast on-chip memory and run with better speed and energy use on CPU, GPU, and mobile hardware.

Core claim

Pruning reduces connections by 9x to 13x, trained quantization drops each weight from 32 bits to 5 bits through weight sharing, and Huffman coding adds further lossless compression; together these steps cut storage by 35x for AlexNet and 49x for VGG-16 on ImageNet with no accuracy loss after retraining the pruned and quantized network.
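The headline ratios can be checked directly from the reported model sizes. The sketch below does that arithmetic; the `naive_ratio` helper is a hypothetical estimate introduced here for illustration, which multiplies the pruning factor by the bit-width reduction and ignores any storage for sparse indices or the shared-value codebook, so it is expected to overshoot the reported figures.

```python
# Model sizes in MB as reported in the abstract: (original, compressed).
reported = {
    "AlexNet": (240.0, 6.9),
    "VGG-16": (552.0, 11.3),
}

def compression_ratio(original_mb, compressed_mb):
    """Return how many times smaller the compressed model is."""
    return original_mb / compressed_mb

def naive_ratio(prune_factor, bits_before=32, bits_after=5):
    """Hypothetical estimate from pruning and quantization alone.

    Assumption: no overhead for sparse indices or the codebook.
    """
    return prune_factor * bits_before / bits_after

ratios = {name: compression_ratio(a, b) for name, (a, b) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print(f"naive estimate at 9x pruning: {naive_ratio(9):.1f}x")
```

The reported sizes round to 35x and 49x. The naive estimate is larger than 35x even at the low end of the pruning range, which suggests that bookkeeping overhead not modeled here takes up a meaningful share of the compressed file.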

What carries the argument

Deep compression pipeline that sequences connection pruning, trained quantization with weight sharing, Huffman coding, and retraining after the first two stages.
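The last stage, Huffman coding, is lossless and gains the most when some shared-weight indices occur far more often than others. Below is a minimal sketch of Huffman code lengths over a stream of cluster indices. The function names and the toy index stream are assumptions made for illustration, not taken from the paper.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie breaker, {symbol: depth so far}).
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def average_bits(symbols):
    """Average number of bits per symbol under the Huffman code."""
    lengths = huffman_code_lengths(symbols)
    counts = Counter(symbols)
    return sum(counts[s] * lengths[s] for s in counts) / len(symbols)

# Hypothetical, skewed stream of 2-bit cluster indices.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
print(average_bits(indices))  # 1.75 bits versus 2 bits fixed-width
```

With a uniform index distribution the average would stay at the fixed width, so the benefit of this stage depends on how skewed the quantized weights turn out to be.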

If this is right

Compressed models fit into on-chip SRAM cache instead of off-chip DRAM memory.
The networks run 3x to 4x faster layerwise on CPU, GPU, and mobile GPU.
Energy efficiency improves 3x to 7x across the same hardware platforms.
Complex networks become feasible in mobile apps limited by storage size and download bandwidth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same stages could be tested on recurrent or transformer models to check whether similar compression ratios hold outside convolutional networks.
Pairing the compressed weights with dedicated accelerators might multiply the observed speed and energy gains.
If the quantized centroids remain stable, the approach could support on-device fine-tuning with minimal extra memory.

Load-bearing premise

Retraining after pruning and quantization fully recovers any accuracy lost from removing connections and forcing weight sharing.

What would settle it

Running the three-stage pipeline on AlexNet and measuring lower top-1 or top-5 accuracy on the ImageNet validation set than the original uncompressed model.

Original abstract

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline of pruning, trained quantization and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing. Finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows you can prune, quantize to 5 bits with learned centroids, retrain, and Huffman-code AlexNet and VGG-16 to 35-49x smaller size on ImageNet with no accuracy loss.


The main takeaway is that this three-stage pipeline delivers real compression on two standard ImageNet models without hurting accuracy. Pruning cuts the number of connections by 9-13x, trained quantization drops each weight to 5 bits by clustering the weights and fine-tuning the centroids, and Huffman coding squeezes the resulting indices further. The end result is AlexNet at 6.9 MB and VGG-16 at 11.3 MB, both small enough to fit in on-chip SRAM, plus measured speed and energy gains on CPU, GPU, and mobile GPU.
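The clustering step in the letter can be sketched as one-dimensional k-means over the surviving weights, with 2^5 = 32 centroids so that each weight is stored as a 5-bit index. This is an illustrative sketch only: the choice of k-means, the evenly spaced initialization, the `kmeans_1d` helper, and the random test weights are all assumptions, and the fine-tuning of centroids during retraining is not modeled.

```python
import random

def kmeans_1d(values, k, iters=10):
    """Cluster scalar values into k centroids; return (centroids, indices)."""
    lo, hi = min(values), max(values)
    # Assumption: centroids start evenly spaced across the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # empty clusters keep their previous centroid
                centroids[c] = sum(members) / len(members)
    assign = [nearest(v) for v in values]
    return centroids, assign

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical layer

bits = 5
centroids, indices = kmeans_1d(weights, 2 ** bits)

# Each weight is replaced by its shared centroid value.
quantized = [centroids[i] for i in indices]
print(len(set(quantized)), "distinct values after weight sharing")
```

After this step only the 32 centroid values and one small index per weight need to be stored, which is where the drop from 32 bits to 5 bits per connection comes from.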

Referee Report

2 major / 1 minor

Summary. The manuscript introduces 'deep compression', a three-stage pipeline of pruning (reducing connections by 9x-13x), trained quantization (to 5 bits with learned centroids), and Huffman coding. It claims this achieves 35x compression for AlexNet (240MB to 6.9MB) and 49x for VGG-16 (552MB to 11.3MB) on ImageNet with no accuracy loss, plus 3x-4x layerwise speedup and 3x-7x energy efficiency gains on CPU/GPU/mobile GPU after retraining the pruned and quantized network.

Significance. If the accuracy preservation and compression ratios hold under the reported conditions, the work is significant for enabling deployment of large DNNs on memory-constrained embedded and mobile devices. The empirical results on standard ImageNet models provide concrete evidence of practical utility for model compression techniques.

major comments (2)

[Abstract] The claim of no accuracy loss after pruning and quantization depends entirely on the subsequent retraining step to 'fine tune the remaining connections and the quantized centroids,' but the manuscript provides no quantification of the accuracy drop prior to retraining, no details on the retraining protocol (epochs, learning rates, or convergence criteria), and no evidence that recovery is robust rather than specific to the chosen hyperparameters.
[Abstract] The reported compression ratios and accuracy preservation lack error bars, variance across multiple runs, or a full experimental protocol (e.g., pruning threshold selection, quantization bit-width tuning, or dataset splits), which undermines assessment of whether the 35x-49x gains are reproducible and generalizable.

minor comments (1)

The abstract references benchmarking results for speedup and energy efficiency but does not indicate where in the manuscript the detailed tables, figures, or methodology for these measurements appear, reducing clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the manuscript to improve clarity and completeness where the concerns are valid.


Referee: [Abstract] The claim of no accuracy loss after pruning and quantization depends entirely on the subsequent retraining step to 'fine tune the remaining connections and the quantized centroids,' but the manuscript provides no quantification of the accuracy drop prior to retraining, no details on the retraining protocol (epochs, learning rates, or convergence criteria), and no evidence that recovery is robust rather than specific to the chosen hyperparameters.

Authors: We agree that the abstract and main text would benefit from greater transparency on the retraining step. In the revised manuscript we will add a quantification of the accuracy drop immediately after pruning and quantization (before retraining), include the specific retraining protocol (number of epochs, learning-rate schedule, and convergence criteria), and provide evidence of robustness by reporting results across a small range of hyperparameter choices.

Revision: yes

Referee: [Abstract] The reported compression ratios and accuracy preservation lack error bars, variance across multiple runs, or a full experimental protocol (e.g., pruning threshold selection, quantization bit-width tuning, or dataset splits), which undermines assessment of whether the 35x-49x gains are reproducible and generalizable.

Authors: We acknowledge the value of additional experimental detail for reproducibility. We will expand the methods and experimental sections to document the exact procedures for selecting pruning thresholds, tuning quantization bit-widths, and the dataset splits employed. While the primary results are reported from single executions (standard practice for these large-scale ImageNet experiments), we will add a sensitivity analysis with respect to the main hyperparameters to support generalizability.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical pipeline with measured outcomes on public benchmarks


The paper describes an algorithmic three-stage compression procedure (pruning, trained quantization, Huffman coding) followed by retraining, then reports directly measured storage reductions (35x-49x) and accuracy on ImageNet for AlexNet and VGG-16. No first-principles derivations, predictions, or uniqueness theorems are claimed; compression factors follow arithmetically from the observed connection counts and bit widths after pruning/quantization, and accuracy is an external empirical outcome rather than a quantity fitted or defined inside the same experiment. Self-citations, if present, support prior algorithmic components but are not load-bearing for the reported gains.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The method depends on choices of pruning threshold and number of quantization levels that are selected or tuned per network; these act as free parameters.

free parameters (2)

pruning threshold
Determines which connections are removed; value is chosen to achieve target sparsity while allowing recovery on retraining.
quantization bit width
Set to 5 bits; controls the number of shared weight values.
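The two free parameters can be made concrete with a small sketch. It assumes that the pruning threshold is applied to weight magnitude, which the ledger does not state explicitly, and the helper names and toy weights are hypothetical. The second parameter is simpler: a bit width of b allows 2^b shared values.

```python
def threshold_for_sparsity(weights, keep_fraction):
    """Pick the magnitude threshold that keeps about keep_fraction of weights."""
    ranked = sorted((abs(w) for w in weights), reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[keep - 1]

def prune_by_magnitude(weights, threshold):
    """Zero every weight whose absolute value is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def shared_values(bit_width):
    """Number of distinct shared weight values a given index width allows."""
    return 2 ** bit_width

# Hypothetical toy layer with ten weights.
weights = [0.9, -0.05, 0.3, -0.7, 0.02, 0.1, -0.4, 0.01, 0.6, -0.03]

t = threshold_for_sparsity(weights, keep_fraction=0.3)
pruned = prune_by_magnitude(weights, t)

print("threshold:", t)
print("surviving weights:", [w for w in pruned if w != 0.0])
print("shared values at 5 bits:", shared_values(5))
```

In practice the threshold would be tuned per network, and likely per layer, against the accuracy recovered on retraining; the sketch only shows how a target sparsity maps to a threshold.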



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
cs.LG 2026-05 conditional novelty 8.0

INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
Federated Learning: Strategies for Improving Communication Efficiency
cs.LG 2016-10 conditional novelty 8.0

Structured updates (low-rank or masked) and sketched updates (quantized, rotated, subsampled) reduce uplink communication in federated learning by up to two orders of magnitude on convolutional and recurrent networks.
Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection
cs.CV 2026-05 unverdicted novelty 7.0

PDI-Net integrates physics-aware priors into a dual network that shares semi-reconstruction features with a YOLO detector, cutting inference time 84% while raising mAP 5% on low-SNR M3FD data.
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
cs.CV 2026-05 unverdicted novelty 7.0

The paper presents AIGaitor, a privacy-preserving on-device monocular motion analysis system that performs end-to-end pose estimation and deep learning gait analysis on consumer smartphones.
When Bits Break Recourse: Counterfactual-Faithful Quantization
cs.LG 2026-05 unverdicted novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
cs.LG 2026-05 unverdicted novelty 7.0

QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
Winning Lottery Tickets in Neural Networks via a Quantum-Inspired Classical Algorithm
quant-ph 2026-05 conditional novelty 7.0

A classical polynomial-time algorithm for optimized sampling of lottery tickets in neural networks removes the exponential dependence on data dimension from prior classical approaches.
Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
cs.LG 2026-05 unverdicted novelty 7.0

SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
cs.CR 2026-05 unverdicted novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
cs.LG 2026-05 unverdicted novelty 7.0

INT4 quantization recovers forgotten data in unlearned LLMs up to 22x, exposing a trilemma with no existing method solving forgetting, utility, and robustness together; a new sharpness-aware method achieves cross-prec...
On the Decompositionality of Neural Networks
cs.LO 2026-04 unverdicted novelty 7.0

Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization
cs.CV 2024-08 unverdicted novelty 7.0

CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
cs.CV 2017-04 accept novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems
cs.LG 2026-05 unverdicted novelty 6.0

AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS...
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
eess.SP 2026-05 unverdicted novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
cs.LG 2026-05 conditional novelty 6.0

ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 conditional novelty 6.0

DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
Theory-optimal Quantization Based on Flatness
cs.LG 2026-05 unverdicted novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning
cs.LG 2026-05 unverdicted novelty 6.0

Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on m...
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
Homodyne Photonic Tensor Processor exceeds 1,000-TOPS
cs.ET 2026-04 unverdicted novelty 6.0

A homodyne photonic tensor processor using TFLN transmitters and Si/SiN circuits demonstrates 1,000-6,000 TOPS throughput with 6-7 bit accuracy at up to 120 Gbaud/s clock rates.
UCCL-Zip: Lossless Compression Supercharged GPU Communication
cs.DC 2026-04 unverdicted novelty 6.0

UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition
cs.AR 2026-04 unverdicted novelty 6.0

A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
cs.CL 2026-04 unverdicted novelty 6.0

Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization
cs.CV 2026-04 unverdicted novelty 6.0

DeFakeQ introduces an adaptive bidirectional quantization method tailored for deepfake detectors that maintains detection accuracy while enabling real-time performance on resource-constrained edge devices.
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models
cs.LG 2025-12 unverdicted novelty 6.0

A post-training 1-bit quantization method for LLMs that fixes error accumulation and anisotropic representation distortion to outperform prior weight-driven and naive output-driven baselines.
LILogic Net: Compact Logic Gate Networks with Learnable Connectivity for Efficient Hardware Deployment
cs.LG 2025-11 unverdicted novelty 6.0

LILogicNet trains compact logic-gate networks with learnable sparse connectivity via Top-K selection, reaching 98.45% MNIST accuracy with 8k gates and 60.98% CIFAR-10 accuracy with 256k gates while using far fewer gat...
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
cs.LG 2025-06 unverdicted novelty 6.0

MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
Poisoning with A Pill: Circumventing Detection in Federated Learning
cs.LG 2024-07 unverdicted novelty 6.0

A three-stage pill-based augmentation makes existing FL poisoning attacks evade popular defenses while raising error rates up to 7x on both IID and non-IID data.
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
cs.LG 2023-10 conditional novelty 6.0

SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
cs.LG 2023-06 unverdicted novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
cs.LG 2023-05 accept novelty 6.0

FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.
Memory- and Communication-Aware Model Compression for Distributed Deep Learning Inference on IoT
stat.ML 2019-07 unverdicted novelty 6.0

NoNN partitions a teacher model into disjoint compressed students via network science for distributed IoT inference, matching teacher accuracy with far lower per-device memory and communication.
Open DNN Box by Power Side-Channel Attack
cs.CR 2019-07 unverdicted novelty 6.0

Power side-channel analysis recovers DNN architecture and parameters at 96.5% average accuracy on real embedded devices.
A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
cs.DC 2019-07 unverdicted novelty 6.0

A unified IR plus ML-based scheduling for CNN inference on multi-vendor integrated GPUs matches or exceeds vendor libraries (up to 1.62x) on image models while supporting more models.
COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning
cs.CV 2019-06 unverdicted novelty 6.0

COP prunes CNN filters using correlation-based importance with global normalization and dual regularization on parameter quantity and FLOPs to enable customized compression.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
cs.CL 2016-09 accept novelty 6.0

GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human e...
SGDR: Stochastic Gradient Descent with Warm Restarts
cs.LG 2016-08 accept novelty 6.0

SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
On the Stability of Growth in Structural Plasticity
cs.LG 2026-05 unverdicted novelty 5.0

Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in compl...
Multibit neural inference in a N-ary crossbar architecture
cs.AR 2026-04 unverdicted novelty 5.0

Simulation of 4-state MTJ crossbars achieves 94.48% MNIST accuracy for neural inference, close to 97.56% software baseline, with analysis showing quantization as primary error and an optimal number of states per cell.
FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices
cs.LG 2026-04 unverdicted novelty 5.0

Fed-FSTQ reduces uplink traffic by 46x and improves time-to-accuracy by 52% in federated LLM fine-tuning using Fisher-guided token quantization and selection.
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models
cs.LG 2025-12 unverdicted novelty 5.0

A post-training quantization technique for 1-bit LLMs that corrects layer-wise error accumulation and anisotropic representation distortion to preserve output behavior more effectively than existing methods.
Vanishing Contributions: A Unified Framework for Smooth and Iterative Model Compression
cs.LG 2025-10 unverdicted novelty 5.0

VCON is a unified framework for smooth iterative DNN compression that uses parallel execution and an affine combination to progressively replace the original model with its compressed form during fine-tuning.
AutoSculpt: A Pattern-based Model Auto-pruning Framework Using Reinforcement Learning and Graph Learning
cs.AI 2024-12 unverdicted novelty 5.0

AutoSculpt models DNNs as graphs, embeds pruning patterns, and uses deep reinforcement learning to reach up to 90% pruning and 18% better FLOPs reduction than baselines on ResNet, MobileNet, VGG, and Vision Transformers.
Neuron ranking -- an informed way to condense convolutional neural networks architecture
cs.LG 2019-07 unverdicted novelty 5.0

Shapley value and variational importance switch methods produce consistent rankings of filter importance in CNNs, enabling compression and interpretability.
One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-off in Machine Learning Cloud Service APIs via Tolerance Tiers
cs.LG 2019-06 unverdicted novelty 5.0

Proposes Tolerance Tiers architecture for MLaaS to let consumers select accuracy-latency trade-offs, shown to outperform single-version deployment on ASR and vision workloads.
New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms
cs.CV 2019-06 unverdicted novelty 5.0

Replacing pointwise convolutions with DWHT yields a model with 79.1% fewer parameters, 48.4% fewer FLOPs, and 1.49% higher accuracy than MobileNet-V1 on CIFAR-100.
MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization
cs.AR 2026-05 unverdicted novelty 4.0

MASQ claims up to 16.06x speedup and 4.18x energy gain over A100 for masked diffusion via stage-wise multi-precision quantization and specialized hardware units while preserving quality.
GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection
cs.CV 2026-05 unverdicted novelty 4.0

GSA-YOLO modifies YOLOv8n with structured sparsity via Group Lasso and Sparse Structure Selection plus Adaptive Knowledge Distillation, reporting 189.62 FPS and mAP50:95 gains of 2.4% and 1.8% on HiXray and PIDray datasets.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
cs.CL 2026-05 unverdicted novelty 4.0

m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
Trajectory-Aware Adaptive Inference in Object Detection Models
cs.CV 2026-05 unverdicted novelty 4.0

Introduces an early-exit mechanism in YOLOv8 that uses inter-vessel distance and closing speed from trajectories to adapt computation depth per frame in maritime scenes.
Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
cs.CV 2026-05 unverdicted novelty 4.0

A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
cs.AR 2026-04 unverdicted novelty 4.0

Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
minAction.net: Energy-First Neural Architecture Design -- From Biological Principles to Systematic Validation
cs.LG 2026-04 conditional novelty 4.0

Large-scale experiments show architecture performance depends on task type, not universality, and a single-parameter energy penalty reduces computational energy by ~1000x with negligible accuracy cost.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 62 Pith papers · 1 internal anchor

[1]

Fixed point optimization of deep convolutional neural networks for object recognition

Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convolutional neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on , pp. 1131–1135. IEEE,

work page 2015
[2]

Provable bounds for learning some deep representations

Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. Provable bounds for learning some deep representations. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, pp. 584–592,

work page 2014
[3]

Caffe model zoo

BVLC. Caffe model zoo. URL http://caffe.berkeleyvision.org/model_zoo. Chen, Wenlin, Wilson, James T., Tyree, Stephen, Weinberger, Kilian Q., and Chen, Yixin. Compress- ing neural networks with the hashing trick. arXiv preprint arXiv:1504.04788,

work page arXiv
[4]

Memory bounded deep convolutional networks

Collins, Maxwell D and Kohli, Pushmeet. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442,

work page arXiv
[5]

Fast R-CNN

Girshick, Ross. Fast r-cnn. arXiv preprint arXiv:1504.08083,

work page Pith review arXiv
[6]

Compressing Deep Convolutional Networks using Vector Quantization

Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115,

work page Pith review arXiv
[7]

EIE: Efficient inference engine on compressed deep neural network

Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A, and Dally, William J. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint arXiv:1602.01528,

work page arXiv
[8]

Comparing biases for minimal network construction with back-propagation

Hanson, Stephen José and Pratt, Lorien Y. Comparing biases for minimal network construction with back-propagation. In Advances in neural information processing systems, pp. 177–185,

work page 2016
[9]

In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6. IEEE,

work page 2014
[10]

Caffe: Convolutional Architecture for Fast Feature Embedding

Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093,

work page Pith review arXiv
[11]

Network In Network

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv:1312.4400,

work page Pith review arXiv
[12]

Very Deep Convolutional Networks for Large-Scale Image Recognition

NVIDIA. Technical brief: NVIDIA jetson TK1 development kit bringing GPU-accelerated computing to embedded systems, a. URL http://www.nvidia.com. NVIDIA. Whitepaper: GPU-based deep learning inference: A performance and power analysis, b. URL http://www.nvidia.com/object/white-papers.html. Simonyan, Karen and Zisserman, Andrew. Very deep convolutional netwo...

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Going Deeper with Convolutions

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. arXiv preprint arXiv:1409.4842,

work page Pith review arXiv
[14]

Cross-domain synthesis of medical images using efficient location-sensitive deep network

Van Nguyen, Hien, Zhou, Kevin, and Vemulapalli, Raviteja. Cross-domain synthesis of medical images using efficient location-sensitive deep network. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pp. 677–684. Springer,

work page 2015
[15]

Deep fried convnets

Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. arXiv preprint arXiv:1412.7149,

work page arXiv
[16]

To avoid variance, we measured the time spent on each layer for 4096 input samples, and averaged the time regarding each input sample

Published as a conference paper at ICLR 2016. Appendix A: Detailed timing / power reports of dense & sparse network layers. Table 8: Average time on different layers. To avoid variance, we measured the time spent on each layer for 4096 input samples, and averaged the time regarding each input sample. For GPU, the time consumed by cudaMalloc and cudaMem...

work page 2016
