SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman · 2024 · arXiv 2401.15024

25 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

cs.CL · 2025-12-27 · unverdicted · novelty 7.0

Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.

End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference

cs.IR · 2026-06-26 · unverdicted · novelty 6.0

L2A trains one LLM with input-and-budget-conditioned gates to adapt sparsity across layers, heads, and tokens, tracing the compute-accuracy frontier while staying within 0.6% of dense performance at 34% layer sparsity on tested models.

Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.

DOT-MoE: Differentiable Optimal Transport for MoEfication

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

DOT-MoE uses differentiable optimal transport and straight-through estimators to partition FFN layers into capacity-constrained experts, outperforming heuristic baselines in retaining 90% performance at 50% active parameters.

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.

Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream tasks.

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

cs.LG · 2025-06-15 · unverdicted · novelty 6.0

MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

cs.LG · 2025-02-06 · unverdicted · novelty 6.0

An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

Tensor decompositions face practical limits in large-scale LLM compression due to mismatch between assumed shared subspaces and heterogeneous model representations.

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

Compression of LLMs often decouples accuracy from uncertainty, with larger models absorbing the effect better and inflation occurring in a threshold-like manner.

CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

CRePE enhances relative importance-based post-training pruning for LLMs with 2D convolution-aware context and adaptive coefficients, paired with PHO for fast hyperparameter search that generalizes across models.

A general tensor-structured compression scheme for efficient large language models

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

MixT compresses Transformer LLMs by substituting targeted linear projections with tensor-operator mixtures, preserving MMLU accuracy up to model-specific boundaries where parameter count drops 47.5% and inference memory 60.4% on LLaMA2-7B.

Prune, Update and Trim: Robust Structured Pruning for Large Language Models

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Task-aware pruning improves OOD model performance by realigning distorted OOD layerwise norm and pairwise-distance profiles with the task-adapted geometry observed on ID inputs.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

cs.LG · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

cs.CL · 2026-05-03 · unverdicted · novelty 5.0

EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

cs.LG · 2026-02-02 · unverdicted · novelty 5.0

Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.

RAP: Runtime Adaptive Pruning for LLM Inference

cs.LG · 2025-05-22 · unverdicted · novelty 5.0

RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.
