FPGA probabilistic computers speed up sampling for neural quantum states, delivering accurate energies on 80x80 Ising lattices and training deep models on 30x30 systems.
hub
The state of sparsity in deep neural networks.ArXiv, abs/1902.09574
10 Pith papers cite this work. Polarity classification is still indexing.
abstract
We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Additionally, we replicate the experiments performed by (Frankle & Carbin, 2018) and (Liu et al., 2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.
hub tools
representative citing papers
EMP maps importance scores to effective sample size N_eff and prunes the lowest N - N_eff components, with a derived lower bound on retained effective mass and upper bound on loss increase.
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Overprovisioned KANs with sparsification, deep supervision, and depth selection under differentiable MDL yield smaller models with competitive accuracy on benchmarks.
Pruning small-magnitude weights from pre-trained LLMs causes monotonic irreversible performance degradation on difficult downstream tasks, supporting the Junk DNA Hypothesis that these weights hold essential knowledge.
Arbitrary heterogeneous fan-in profiles in sparse networks match uniform random accuracy at high sparsity, but initializing RigL dynamic sparse training with equilibrium-matched lognormal profiles improves performance by up to 0.49% on classification tasks.
citing papers explorer
-
Probabilistic Computers for Neural Quantum States
FPGA probabilistic computers speed up sampling for neural quantum states, delivering accurate energies on 80x80 Ising lattices and training deep models on 30x30 systems.
-
Effective Model Pruning: Measure The Redundancy of Model Components
EMP maps importance scores to effective sample size N_eff and prunes the lowest N - N_eff components, with a derived lower bound on retained effective mass and upper bound on loss increase.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
-
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
-
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Optimized Architectures for Kolmogorov-Arnold Networks
Overprovisioned KANs with sparsification, deep supervision, and depth selection under differentiable MDL yield smaller models with competitive accuracy on benchmarks.
-
Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
Pruning small-magnitude weights from pre-trained LLMs causes monotonic irreversible performance degradation on difficult downstream tasks, supporting the Junk DNA Hypothesis that these weights hold essential knowledge.
-
Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria
Arbitrary heterogeneous fan-in profiles in sparse networks match uniform random accuracy at high sparsity, but initializing RigL dynamic sparse training with equilibrium-matched lognormal profiles improves performance by up to 0.49% on classification tasks.