STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
citing papers explorer
- SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
SpenseGPT introduces a hybrid sparse-dense weight format and one-shot pruning that delivers 1.2x end-to-end LLM decoding speedup on B200 GPUs with FP8 while preserving accuracy on Qwen3-32B and Seed-OSS-36B.
- TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts
TENP applies trapezoidal expert-neuron pruning to MoE models, retaining key experts while pruning others via projected neuron contribution, yielding only a 1-point accuracy drop at 40% sparsity on DeepSeek with a 10% code-generation gain.