CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into weights, retaining 83.27% Top-1 accuracy on DeiT-Huge after 50% pruning.
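The compensation idea in the summary above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not CORP's actual implementation: the function name `fold_ridge_compensation`, the single linear layer `y = X @ W + c`, and the regularization strength `lam` are all hypothetical. Removed input features are fit as an affine function of the retained ones by closed-form ridge regression on calibration activations, and the fit is then folded into the retained weights and bias.

```python
import numpy as np

def fold_ridge_compensation(X, W, c, keep, lam=1e-3):
    """Hypothetical sketch: prune input features of y = X @ W + c.

    X    : (n, d) calibration activations
    W    : (d, m) weights of the layer that consumes X
    c    : (m,)   bias of that layer
    keep : indices of the retained features
    lam  : ridge regularization strength (assumed value)
    """
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(X.shape[1]), keep)
    Xs, Xr = X[:, keep], X[:, drop]

    # Center both blocks so the intercept is handled separately.
    mu_s, mu_r = Xs.mean(axis=0), Xr.mean(axis=0)
    Xs_c, Xr_c = Xs - mu_s, Xr - mu_r

    # Closed-form ridge solution for Xr ~= Xs @ A + b.
    G = Xs_c.T @ Xs_c + lam * np.eye(len(keep))
    A = np.linalg.solve(G, Xs_c.T @ Xr_c)
    b = mu_r - mu_s @ A

    # Fold the affine map into the retained weights and the bias:
    # Xs @ W_s + (Xs @ A + b) @ W_r + c = Xs @ (W_s + A @ W_r) + (c + b @ W_r)
    W_new = W[keep] + A @ W[drop]
    c_new = c + b @ W[drop]
    return W_new, c_new
```

After folding, the pruned layer computes `X[:, keep] @ W_new + c_new`, so no extra operation is needed at inference time. The approximation is only as good as the affine fit on the calibration data.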
Optimal brain compression: A framework for accurate post-training quantization and pruning
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.
FADE adaptively compensates for quantization errors layer-by-layer in ASR models using diagnostic scores from weight geometry and calibration data, yielding lower word error rates at 3- and 4-bit precision.
citing papers explorer
- Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models
FADE adaptively compensates for quantization errors layer-by-layer in ASR models using diagnostic scores from weight geometry and calibration data, yielding lower word error rates at 3- and 4-bit precision.