SliceGPT: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman · 2024 · arXiv 2401.15024

29 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Tapered Language Models

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.

Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

cs.CL · 2025-12-27 · unverdicted · novelty 7.0

Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.

End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference

cs.IR · 2026-06-26 · unverdicted · novelty 6.0

L2A trains one LLM with input-and-budget-conditioned gates to adapt sparsity across layers, heads, and tokens, tracing the compute-accuracy frontier while staying within 0.6% of dense performance at 34% layer sparsity on tested models.

Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.

DOT-MoE: Differentiable Optimal Transport for MoEfication

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

DOT-MoE uses differentiable optimal transport and straight-through estimators to partition FFN layers into capacity-constrained experts, outperforming heuristic baselines in retaining 90% performance at 50% active parameters.

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.

Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

cs.LG · 2025-06-15 · unverdicted · novelty 6.0

MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

cs.LG · 2025-02-06 · unverdicted · novelty 6.0

An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.

Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

cs.LG · 2026-06-22 · unverdicted · novelty 5.0

A bidirectional optimization method using parameterized transformations enables near-zero loss barriers for linear mode connectivity in medium-scale language models and small barriers in billion-parameter transformers.

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

cs.LG · 2026-06-18 · unverdicted · novelty 5.0

AIR augments activation-aware SVD compression of LLMs with an influence metric and a closed-form ALS update, claiming >18% perplexity improvement at 60% parameter retention and 90% less calibration data than SVD-LLM(W).

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

Tensor decompositions face practical limits in large-scale LLM compression due to mismatch between assumed shared subspaces and heterogeneous model representations.

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

Compression of LLMs often decouples accuracy from uncertainty, with larger models absorbing the effect better and inflation occurring in a threshold-like manner.

CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

CRePE enhances relative importance-based post-training pruning for LLMs with 2D convolution-aware context and adaptive coefficients, paired with PHO for fast hyperparameter search that generalizes across models.

A general tensor-structured compression scheme for efficient large language models

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

MixT compresses Transformer LLMs by substituting targeted linear projections with tensor-operator mixtures, preserving MMLU accuracy up to model-specific boundaries where parameter count drops 47.5% and inference memory 60.4% on LLaMA2-7B.

Prune, Update and Trim: Robust Structured Pruning for Large Language Models

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Task-aware pruning improves OOD model performance by realigning distorted OOD layerwise norm and pairwise-distance profiles with the task-adapted geometry observed on ID inputs.

citing papers explorer

Showing 28 of 28 citing papers after filters.

Tapered Language Models cs.LG · 2026-06-22 · unverdicted · none · ref 1
Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions cs.CL · 2026-05-08 · unverdicted · none · ref 46
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs cs.AR · 2026-03-28 · unverdicted · none · ref 4
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2 cs.CL · 2025-12-27 · unverdicted · none · ref 2
Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.
End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference cs.IR · 2026-06-26 · unverdicted · none · ref 1
L2A trains one LLM with input-and-budget-conditioned gates to adapt sparsity across layers, heads, and tokens, tracing the compute-accuracy frontier while staying within 0.6% of dense performance at 34% layer sparsity on tested models.
Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT cs.CL · 2026-06-25 · unverdicted · none · ref 9
Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.
HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models cs.LG · 2026-06-22 · unverdicted · none · ref 3
HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.
DOT-MoE: Differentiable Optimal Transport for MoEfication cs.LG · 2026-06-01 · unverdicted · none · ref 14
DOT-MoE uses differentiable optimal transport and straight-through estimators to partition FFN layers into capacity-constrained experts, outperforming heuristic baselines in retaining 90% performance at 50% active parameters.
ProactiveLLM: Learning Active Interaction for Streaming Large Language Models cs.CL · 2026-05-30 · unverdicted · none · ref 63
ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.
SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models cs.LG · 2026-05-18 · unverdicted · none · ref 2
SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning cs.LG · 2026-05-09 · unverdicted · none · ref 15
Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models cs.CL · 2026-05-09 · unverdicted · none · ref 38
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression cs.LG · 2026-05-09 · unverdicted · none · ref 57
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models cs.LG · 2026-04-06 · unverdicted · none · ref 12
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs cs.LG · 2025-06-15 · unverdicted · none · ref 2
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis cs.LG · 2025-02-06 · unverdicted · none · ref 26
An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers cs.LG · 2026-06-22 · unverdicted · none · ref 18
A bidirectional optimization method using parameterized transformations enables near-zero loss barriers for linear mode connectivity in medium-scale language models and small barriers in billion-parameter transformers.
Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs cs.LG · 2026-06-18 · unverdicted · none · ref 166
AIR augments activation-aware SVD compression of LLMs with an influence metric and a closed-form ALS update, claiming >18% perplexity improvement at 60% parameter retention and 90% less calibration data than SVD-LLM(W).
Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression cs.LG · 2026-06-02 · unverdicted · none · ref 2
Tensor decompositions face practical limits in large-scale LLM compression due to mismatch between assumed shared subspaces and heterogeneous model representations.
Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction cs.AI · 2026-06-01 · unverdicted · none · ref 15
Compression of LLMs often decouples accuracy from uncertainty, with larger models absorbing the effect better and inflation occurring in a threshold-like manner.
CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search cs.LG · 2026-06-01 · unverdicted · none · ref 2
CRePE enhances relative importance-based post-training pruning for LLMs with 2D convolution-aware context and adaptive coefficients, paired with PHO for fast hyperparameter search that generalizes across models.
A general tensor-structured compression scheme for efficient large language models cs.CL · 2026-05-25 · unverdicted · none · ref 13
MixT compresses Transformer LLMs by substituting targeted linear projections with tensor-operator mixtures, preserving MMLU accuracy up to model-specific boundaries where parameter count drops 47.5% and inference memory 60.4% on LLaMA2-7B.
Prune, Update and Trim: Robust Structured Pruning for Large Language Models cs.LG · 2026-05-18 · unverdicted · none · ref 21
Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability cs.LG · 2026-05-14 · unverdicted · none · ref 76
Task-aware pruning improves OOD model performance by realigning distorted OOD layerwise norm and pairwise-distance profiles with the task-adapted geometry observed on ID inputs.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 25 · 2 links
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer cs.CL · 2026-05-03 · unverdicted · none · ref 3
EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.
On the Limits of Layer Pruning for Generative Reasoning in Large Language Models cs.LG · 2026-02-02 · unverdicted · none · ref 1
Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.
RAP: Runtime Adaptive Pruning for LLM Inference cs.LG · 2025-05-22 · unverdicted · none · ref 3
RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.
