HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing. Finally, we apply Huffman coding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU, and mobile GPU, the compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.
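The three stages named in the abstract can be illustrated with a toy sketch on a single random weight matrix. This is not the authors' implementation: it assumes plain magnitude thresholding for pruning, a basic k-means with linearly spaced initial centroids for weight sharing, and it skips the retraining steps and the sparse index storage entirely. The function names (`prune_by_magnitude`, `share_weights`, `huffman_code_lengths`) and the 10% keep fraction are hypothetical choices made for illustration.

```python
import heapq
from collections import Counter

import numpy as np


def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but (roughly) the largest-magnitude fraction of weights."""
    k = max(1, int(round(weights.size * keep_fraction)))
    threshold = np.sort(np.abs(weights).ravel())[-k]
    mask = np.abs(weights) >= threshold
    return weights * mask, mask


def share_weights(weights, mask, bits, iters=20):
    """Cluster surviving weights into 2**bits shared values (plain k-means)."""
    values = weights[mask]
    # Assumption: linearly spaced initial centroids over the value range.
    centroids = np.linspace(values.min(), values.max(), 2 ** bits)
    for _ in range(iters):
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(centroids.size):
            members = values[codes == c]
            if members.size:  # leave empty clusters where they are
                centroids[c] = members.mean()
    codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return codes, centroids


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, tie-breaker, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every merge adds one bit to each member's code
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths


# Toy run on random weights (illustrative only, not a trained network).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

pruned, mask = prune_by_magnitude(w, keep_fraction=0.1)
codes, centroids = share_weights(pruned, mask, bits=5)

symbol_counts = Counter(codes.tolist())
lengths = huffman_code_lengths(codes.tolist())
huffman_bits = sum(lengths[s] * n for s, n in symbol_counts.items())
fixed_bits = codes.size * 5  # 5 bits per surviving weight without Huffman

print(f"kept {int(mask.sum())} of {w.size} weights")
print(f"index bits: fixed={fixed_bits}, huffman={huffman_bits}")
```

Because a Huffman code is an optimal prefix code, the Huffman-coded index stream is never longer than the fixed 5-bit encoding here. How much it saves depends on how unevenly the cluster indices are distributed.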
representative citing papers
INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
Structured updates (low-rank or masked) and sketched updates (quantized, rotated, subsampled) reduce uplink communication in federated learning by up to two orders of magnitude on convolutional and recurrent networks.
Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.
Introduces zero-inflated Gaussian distributions for EDAs to jointly optimize sparsity patterns and active parameter values without bi-level schemes or custom operators.
AIGaitor is the first claimed end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
A classical polynomial-time algorithm for optimized sampling of lottery tickets in neural networks removes the exponential dependence on data dimension from prior classical approaches.
SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
DiT-Pruning introduces an energy-based saliency metric balancing weights and activations plus clustering-aware granularity for post-training pruning of DiTs, showing near-zero CLIP score degradation at 50% sparsity on FLUX.1-dev.
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
Experiments across code LLMs show no-review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance as proven via gated distributional reweighting and spectral analysis.
CascadeFormer tapers Transformer width with depth based on gradient fan-in asymmetry to match uniform baselines in perplexity while cutting latency 8.6%.
Compositionality emerges in neural networks only in a narrow depth-connectivity regime, with gradient descent converging to fractured solutions outside it.
Operator Boosting constructs compact neural-operator PDE surrogates by sequential residual learning with validation-selected shrinkage, yielding 72-95% parameter reduction and accuracy gains on 21 of 30 dataset-architecture pairs.
Neural networks are compressed by lumping neurons with approximately matching dynamics in a polynomial ODE encoding, yielding substantial size reduction with preserved accuracy on synthetic and regression tasks.
Recurrent CNNs are trained with joint task and resource costs on breadth, depth, and time, yielding organic growth in all three dimensions that trades off for accuracy and matches human reaction times on object recognition.
MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.
citing papers explorer
- AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
  AIGaitor is the first claimed end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.
- Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization
  CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
  MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
- Post-Training Pruning for Diffusion Transformers
  DiT-Pruning introduces an energy-based saliency metric balancing weights and activations plus clustering-aware granularity for post-training pruning of DiTs, showing near-zero CLIP score degradation at 50% sparsity on FLUX.1-dev.
- Motion-Compensated Weight Compression
  MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.
- DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization
  DeFakeQ introduces an adaptive bidirectional quantization method tailored for deepfake detectors that maintains detection accuracy while enabling real-time performance on resource-constrained edge devices.
- COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning
  COP prunes CNN filters using correlation-based importance with global normalization and dual regularization on parameter quantity and FLOPs to enable customized compression.
- Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception
  Edge-TSR shows benchmark evaluations overestimate real-world edge inference performance by 20-30% and uses temporal stabilization to recover up to 10.16% classification accuracy in sustained roadside perception deployments.
- Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection
  PDI-Net integrates a semi-U-Net encoder with YOLO detection using a physics-aware PALS-Bridge and optical simulation to deliver 84% faster inference and 5% higher mAP than pruned reconstruction-plus-detection on low-SNR M3FD infrared data.
- New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms
  Replacing pointwise convolutions with DWHT yields a model with 79.1% fewer parameters, 48.4% fewer FLOPs, and 1.49% higher accuracy than MobileNet-V1 on CIFAR-100.
- GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection
  GSA-YOLO modifies YOLOv8n with structured sparsity via Group Lasso and Sparse Structure Selection plus Adaptive Knowledge Distillation, reporting 189.62 FPS and mAP50:95 gains of 2.4% and 1.8% on HiXray and PIDray datasets.
- Trajectory-Aware Adaptive Inference in Object Detection Models
  Introduces an early-exit mechanism in YOLOv8 that uses inter-vessel distance and closing speed from trajectories to adapt computation depth per frame in maritime scenes.
- Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
  A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
- A Targeted Acceleration and Compression Framework for Low bit Neural Networks
  TAC framework separates optimization of convolutional and fully connected layers in 1-bit DNNs to improve accuracy while maintaining efficiency.
- GAN-Knowledge Distillation for one-stage Object Detection
  A GAN-based adversarial training method distills knowledge from teacher to student networks by treating their feature maps as real and fake samples to boost one-stage object detector performance.
- Edge-Constrained UAV Small-Object Detection with P2 Enhancement and Quantum-Inspired Lightweight Structure Search
  Adding a P2 branch to YOLOX-Nano raises small-object AP by 31.10% on VisDrone; QIEA screens structures balancing accuracy, FLOPs, latency, memory and recall.