DASH discovers stronger hybrid attention architectures for LLMs via minutes-scale differentiable search, outperforming selector baselines and Jet-Nemotron on RULER while using 0.006% of prior search tokens.
arXiv preprint arXiv:2402.02834
10 Pith papers cite this work. Polarity classification is still indexing.
roles: background 1
polarities: background 1
representative citing papers
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
Visual token pruning in MLLMs fails on complex reasoning because of Relevant Visual Information Shift during decoding; the training-free DSTP framework fixes this across models.
Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.
TALE selectively prunes task-detrimental layers in LLMs at inference time to match or exceed baseline performance with lower computational cost across multiple models and tasks.