Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, William J. Dally · 2015 · cs.CV · arXiv 1510.00149

Canonical reference. 83% of citing Pith papers cite this work as background.

72 Pith papers citing it

abstract

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing. Finally, we apply Huffman coding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU, and mobile GPU, the compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.
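The three stages named in the abstract can be illustrated with a toy sketch on a flat list of weights. This is not the authors' implementation: it assumes simple magnitude thresholding for pruning, plain 1-D k-means with linearly spaced starting centroids for weight sharing, and a standard Huffman code over the resulting cluster indices. It skips the retraining step and the storage of sparse indices entirely. The function names (prune, kmeans_1d, huffman_code_lengths) and all parameter values are hypothetical, chosen only for illustration.

```python
import heapq
import random
from collections import Counter


def prune(weights, threshold):
    """Magnitude pruning (assumed criterion): zero weights with |w| < threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means (k >= 2) with linearly spaced initial centroids.

    Returns (centroids, assignment), where assignment[i] is the index of the
    centroid shared by values[i].
    """
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - centroids[j]))

    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(v)].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids, [nearest(v) for v in values]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy data: 2000 Gaussian weights standing in for one layer (an assumption).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]

pruned = prune(weights, threshold=1.0)
survivors = [w for w in pruned if w != 0.0]

# 8 shared values means 3 bits per surviving weight with a fixed-length code.
centroids, assign = kmeans_1d(survivors, k=8)
fixed_bits = 3 * len(assign)

lengths = huffman_code_lengths(assign)
huffman_bits = sum(lengths[a] for a in assign)

print("weights kept:", len(survivors), "of", len(weights))
print("index bits, fixed-length:", fixed_bits)
print("index bits, Huffman:", huffman_bits)
```

Because a fixed 3-bit code is itself a valid prefix code for up to 8 symbols, the Huffman total can never exceed the fixed-length total; the gap grows as the cluster usage becomes more skewed.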

citation-role summary

background: 10 · method: 2

citation-polarity summary

background: 10 · use method: 2

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

cs.LG · 2026-05-04 · conditional · novelty 8.0 · 2 refs

INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

Federated Learning: Strategies for Improving Communication Efficiency

cs.LG · 2016-10-18 · conditional · novelty 8.0

Structured updates (low-rank or masked) and sketched updates (quantized, rotated, subsampled) reduce uplink communication in federated learning by up to two orders of magnitude on convolutional and recurrent networks.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

AIGaitor is claimed to be the first end-to-end, on-device pipeline for monocular motion capture and deep-learning gait analysis demonstrated on consumer smartphones.

When Bits Break Recourse: Counterfactual-Faithful Quantization

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.

Winning Lottery Tickets in Neural Networks via a Quantum-Inspired Classical Algorithm

quant-ph · 2026-05-13 · conditional · novelty 7.0

A classical polynomial-time algorithm for optimized sampling of lottery tickets in neural networks removes the exponential dependence on data dimension from prior classical approaches.

Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

On the Decompositionality of Neural Networks

cs.LO · 2026-04-09 · unverdicted · novelty 7.0

Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

cs.CV · 2024-08-01 · unverdicted · novelty 7.0

CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

cs.CV · 2017-04-17 · accept · novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.

The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks

cs.RO · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.

When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs

cs.SE · 2026-06-26 · unverdicted · novelty 6.0

Experiments across code LLMs show that training with no review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance, as proven via gated distributional reweighting and spectral analysis.

CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

CascadeFormer tapers Transformer width with depth based on gradient fan-in asymmetry to match uniform baselines in perplexity while cutting latency 8.6%.

Neural Network Compression by Approximate Differential Equivalence

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Neural networks are compressed by lumping neurons with approximately matching dynamics in a polynomial ODE encoding, yielding substantial size reduction with preserved accuracy on synthetic and regression tasks.

Growing a Neural Network in Breadth, Depth, and Time

q-bio.NC · 2026-05-24 · unverdicted · novelty 6.0

Recurrent CNNs are trained with joint task and resource costs on breadth, depth, and time, yielding organic growth in all three dimensions that trades off for accuracy and matches human reaction times on object recognition.

Motion-Compensated Weight Compression

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.

AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS methods.

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.

ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

cs.LG · 2026-05-12 · conditional · novelty 6.0

ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.

ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.

citing papers explorer

Federated Learning: Strategies for Improving Communication Efficiency

cs.LG · 2016-10-18 · conditional · none · ref 10 · internal anchor

Structured updates (low-rank or masked) and sketched updates (quantized, rotated, subsampled) reduce uplink communication in federated learning by up to two orders of magnitude on convolutional and recurrent networks.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

cs.CL · 2016-09-26 · accept · none · ref 20 · internal anchor

GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.

SGDR: Stochastic Gradient Descent with Warm Restarts

cs.LG · 2016-08-13 · accept · none · ref 4 · internal anchor

SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
