A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Representative citing papers
SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.
citing papers explorer
- Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy
A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.
- From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.