
Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods

Shortened LLaMA: A simple depth pruning for large language models · 2024 · arXiv 2402.02834

10 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

DASH discovers stronger hybrid attention architectures for LLMs via minutes-scale differentiable search, outperforming selector baselines and Jet-Nemotron on RULER while using 0.006% of prior search tokens.

Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% of the original performance at 25% pruning on LLaMA2-7B.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning because of a Relevant Visual Information Shift during decoding; the proposed DSTP framework addresses this without training and works across models.

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

cs.CL · 2025-12-27 · unverdicted · novelty 7.0

Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

Ghosted Layers recovers accuracy in layer-pruned LLMs via a closed-form unconstrained linear operator that aligns boundary activations using a small calibration set.

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Task-aware pruning improves OOD performance by removing layers that distort task-adapted representation profiles, realigning OOD inputs with the geometry observed on ID data.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

cs.LG · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Pruning pretrained MoE models outperforms training from scratch under a fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

cs.LG · 2026-02-02 · unverdicted · novelty 5.0

Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.

TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination

cs.LG · 2025-10-26 · unverdicted · novelty 5.0

TALE selectively prunes task-detrimental layers in LLMs at inference time to match or exceed baseline performance with lower computational cost across multiple models and tasks.

citing papers explorer

Showing 10 of 10 citing papers.

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU cs.LG · 2026-05-20 · unverdicted · none · ref 25
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions cs.CL · 2026-05-08 · unverdicted · none · ref 54
SimDiff: Depth Pruning via Similarity and Difference cs.AI · 2026-04-21 · unverdicted · none · ref 12
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 18
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2 cs.CL · 2025-12-27 · unverdicted · none · ref 7
Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs cs.LG · 2026-05-15 · unverdicted · none · ref 17
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability cs.LG · 2026-05-14 · unverdicted · none · ref 22
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 37 · 2 links
On the Limits of Layer Pruning for Generative Reasoning in Large Language Models cs.LG · 2026-02-02 · unverdicted · none · ref 16
TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination cs.LG · 2025-10-26 · unverdicted · none · ref 9
