hub
SliceGPT: Compress large language models by deleting rows and columns
29 Pith papers cite this work. Polarity classification is still indexing.
roles
background 4
polarities
background 4
representative citing papers
citing papers explorer
-
Tapered Language Models
Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.
-
End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference
L2A trains one LLM with input-and-budget-conditioned gates to adapt sparsity across layers, heads, and tokens, tracing the compute-accuracy frontier while staying within 0.6% of dense performance at 34% layer sparsity on tested models.
-
Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT
Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.
-
HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models
HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.
-
DOT-MoE: Differentiable Optimal Transport for MoEfication
DOT-MoE uses differentiable optimal transport and straight-through estimators to partition FFN layers into capacity-constrained experts, outperforming heuristic baselines in retaining 90% performance at 50% active parameters.
-
ProactiveLLM: Learning Active Interaction for Streaming Large Language Models
ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.
-
SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models
SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
-
Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning
Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
-
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
-
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis
An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.
-
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
A bidirectional optimization method using parameterized transformations enables near-zero loss barriers for linear mode connectivity in medium-scale language models and small barriers in billion-parameter transformers.
-
Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs
AIR augments activation-aware SVD compression of LLMs with an influence metric and a closed-form ALS update, claiming >18% perplexity improvement at 60% parameter retention and 90% less calibration data than SVD-LLM(W).
-
Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression
Tensor decompositions face practical limits in large-scale LLM compression due to mismatch between assumed shared subspaces and heterogeneous model representations.
-
Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction
Compression of LLMs often decouples accuracy from uncertainty, with larger models absorbing the effect better and inflation occurring in a threshold-like manner.
-
CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search
CRePE enhances relative importance-based post-training pruning for LLMs with 2D convolution-aware context and adaptive coefficients, paired with PHO for fast hyperparameter search that generalizes across models.
-
A general tensor-structured compression scheme for efficient large language models
MixT compresses Transformer LLMs by substituting targeted linear projections with tensor-operator mixtures, preserving MMLU accuracy up to model-specific boundaries where parameter count drops 47.5% and inference memory 60.4% on LLaMA2-7B.
-
Prune, Update and Trim: Robust Structured Pruning for Large Language Models
Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.
-
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability
Task-aware pruning improves OOD model performance by realigning distorted OOD layerwise norm and pairwise-distance profiles with the task-adapted geometry observed on ID inputs.
-
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
-
EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.
-
On the Limits of Layer Pruning for Generative Reasoning in Large Language Models
Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.
-
RAP: Runtime Adaptive Pruning for LLM Inference
RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.