Recognition: no theorem link
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Pith reviewed 2026-05-16 18:52 UTC · model grok-4.3
The pith
MAW-guided width pruning of Llama-3.2 models reduces parametric knowledge while substantially improving instruction-following performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Structured width pruning guided by the Maximum Absolute Weight criterion in the GLU-MLP layers of Llama-3.2 models produces a systematic dichotomy: performance on parametric knowledge tasks degrades while instruction-following capabilities improve substantially and multi-step reasoning remains robust. The expansion ratio serves as a critical architectural parameter that selectively modulates these capabilities rather than causing uniform degradation.
What carries the argument
The Maximum Absolute Weight (MAW) criterion, applied to select which weights to prune inside the expansion layers of GLU-MLP blocks. On the paper's account, it acts as a selective filter that removes connections linked to factual recall while preserving those supporting instruction alignment.
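The review summarizes the criterion only at this level; a minimal sketch of how a MAW-style score could rank and prune expansion neurons in a Llama-style GLU-MLP block follows. All shapes, the gate/up reading of the score, and the pruning fraction are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GLU-MLP block: hidden size 8, expansion size 32 (ratio 4.0),
# mirroring the gate/up/down projections of a Llama-style MLP.
hidden, expansion = 8, 32
gate_proj = rng.normal(size=(expansion, hidden))   # W_gate
up_proj   = rng.normal(size=(expansion, hidden))   # W_up
down_proj = rng.normal(size=(hidden, expansion))   # W_down

# MAW (Maximum Absolute Weight): score each expansion neuron by the largest
# absolute weight among its incoming gate/up rows (one plausible reading).
maw_score = np.maximum(np.abs(gate_proj).max(axis=1),
                       np.abs(up_proj).max(axis=1))

# Prune the 25% of expansion neurons with the lowest MAW scores,
# shrinking the expansion ratio from 4.0 to 3.0.
keep = np.sort(np.argsort(maw_score)[int(0.25 * expansion):])
gate_proj, up_proj = gate_proj[keep], up_proj[keep]
down_proj = down_proj[:, keep]

print(gate_proj.shape, down_proj.shape, len(keep) / hidden)
```

Structured width pruning of this kind keeps the weight matrices dense (only narrower), which is why it can translate directly into latency and energy changes, unlike unstructured sparsity.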
Load-bearing premise
The measured gains in instruction-following result from the pruning step itself rather than from unmeasured differences in training, evaluation prompts, or benchmark formatting.
What would settle it
Retrain the pruned models from scratch using identical data and training procedures, then re-measure IFEval and MMLU scores to check whether the instruction-following advantage disappears.
Original abstract
Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
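The abstract's efficiency claim pairs an energy win with a latency penalty; a back-of-envelope sketch with made-up numbers (not the paper's measurements) shows how both can hold at once when power drops faster than single-request throughput.

```python
# Illustrative arithmetic only: hypothetical power draw and decode speed
# for an unpruned baseline vs. a pruned configuration.
base   = {"power_w": 250.0, "tok_per_s": 40.0}   # unpruned baseline
pruned = {"power_w": 180.0, "tok_per_s": 37.5}   # hypothetical pruned model

def joules_per_token(cfg):
    # Energy per token = power (J/s) divided by throughput (tokens/s).
    return cfg["power_w"] / cfg["tok_per_s"]

savings = 1 - joules_per_token(pruned) / joules_per_token(base)
latency_penalty = base["tok_per_s"] / pruned["tok_per_s"] - 1
print(f"energy savings: {savings:.0%}, latency penalty: {latency_penalty:.1%}")
```

With these invented numbers, J/token falls by about 23% while per-request latency rises by under 7%; batch workloads amortize the latency cost, which is consistent with the abstract's claim that batch processing benefits uniformly.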
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that MAW-guided width pruning of GLU-MLP layers in Llama-3.2-1B and 3B models produces a capability dichotomy: predictable degradation on parametric knowledge tasks (MMLU, GSM8K) and perplexity as the expansion ratio is reduced, contrasted with substantial gains in instruction-following (+46% to +75% on IFEval) and preserved multi-step reasoning (MUSR). It further reports a robust inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2, which the authors interpret as pruning acting as a selective filter that reduces parametric knowledge while enhancing behavioral alignment, alongside efficiency trade-offs in energy and latency.
Significance. If the causal attribution to pruning holds after controls, the result would be significant for showing that targeted architectural compression can improve alignment metrics by reducing parametric knowledge, linking pruning literature to truthfulness research, and providing a practical lever (expansion ratio) for capability trade-offs rather than uniform degradation.
major comments (2)
- [Abstract] The reported IFEval gains (+46% to +75%) and the selective-preservation claim are presented without any information on whether the seven pruned configurations received fine-tuning steps, learning-rate schedules, or prompt templates identical to those of the unpruned baselines; this control is load-bearing for attributing the dichotomy to width reduction rather than to training or evaluation differences.
- [Abstract] The inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2 is stated without the number of data points (expansion ratios), error bars, or regression details, weakening the statistical support for the knowledge-truthfulness link.
minor comments (1)
- The abstract refers to 'seven expansion ratio configurations' but does not enumerate them or reference a table; listing the exact ratios tested would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key methodological and statistical aspects of our work. We address each point below and will revise the manuscript accordingly to strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] The reported IFEval gains (+46% to +75%) and the selective-preservation claim are presented without any information on whether the seven pruned configurations received fine-tuning steps, learning-rate schedules, or prompt templates identical to those of the unpruned baselines; this control is load-bearing for attributing the dichotomy to width reduction rather than to training or evaluation differences.
Authors: We agree that explicit details on evaluation controls are necessary to support causal attribution to pruning. All seven pruned configurations and the unpruned baselines were evaluated under identical conditions: the same prompt templates, zero-shot inference settings, and benchmark implementations, with no post-pruning fine-tuning or additional training steps applied to any model. Pruning was performed directly on the pre-trained weights, and evaluation followed standard protocols uniformly across all expansion ratios. To address the concern, we will revise the abstract and expand the Methods section to explicitly document this uniform evaluation protocol, including confirmation that no differential fine-tuning occurred. Revision: yes.
- Referee: [Abstract] The inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2 is stated without the number of data points (expansion ratios), error bars, or regression details, weakening the statistical support for the knowledge-truthfulness link.
Authors: The reported correlation uses the seven expansion ratio configurations tested on the Llama-3.2-3B model (n = 7 data points). The full manuscript includes a corresponding scatter plot (Figure 3) showing the regression. We will update the abstract, main text, and figure caption to state the number of points explicitly, include error bars (standard deviation from repeated evaluations where available), and provide full regression details such as the exact computation of the p-value and slope. This will strengthen the statistical presentation without altering the underlying result. Revision: yes.
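The statistics at issue here are a standard Pearson correlation over seven points. A minimal sketch with invented (MMLU, TruthfulQA-MC2) pairs, NOT the paper's data, shows how an n = 7 correlation and its two-sided p-value are computed:

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative only: seven (MMLU, TruthfulQA-MC2) score pairs, one per
# expansion ratio configuration. Made-up values chosen to mimic the
# reported inverse relationship, not the authors' measurements.
mmlu       = np.array([0.56, 0.54, 0.51, 0.47, 0.44, 0.40, 0.36])
truthfulqa = np.array([0.38, 0.37, 0.40, 0.41, 0.43, 0.42, 0.46])

# pearsonr returns the correlation coefficient and a two-sided p-value
# computed from a t statistic with n - 2 degrees of freedom.
r, p = pearsonr(mmlu, truthfulqa)
print(f"r = {r:.3f}, p = {p:.4f}")
```

With only n = 7, the coefficient must be large in magnitude (roughly |r| > 0.75) to reach p < 0.05, which is why the referee's request for the point count and regression details is material.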
Circularity Check
No significant circularity; results are direct benchmark measurements
full rationale
The paper reports empirical outcomes from MAW-guided width pruning experiments on Llama-3.2 models, with all key claims (IFEval gains, MMLU degradation, inverse correlation r=-0.864, energy trade-offs) presented as direct measurements on standard public benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described content. The analysis identifies the expansion ratio as a modulator based on observed patterns rather than any self-definitional or self-citation load-bearing step. The derivation chain is self-contained against external benchmarks and does not reduce any result to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: maximum absolute weight is a reliable proxy for weight importance in GLU-MLP layers.
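To make this assumption concrete, the sketch below contrasts MAW with the two alternative criteria named in the paper's Table 9 (VOW, PON). The per-neuron reading of each criterion is a plausible guess for illustration, not the authors' definition.

```python
import numpy as np

rng = np.random.default_rng(1)

# One expansion neuron = one row of the gate projection plus the matching
# row of the up projection (hypothetical shapes).
hidden, expansion = 8, 32
gate = rng.normal(size=(expansion, hidden))
up   = rng.normal(size=(expansion, hidden))
w    = np.concatenate([gate, up], axis=1)  # per-neuron incoming weights

maw = np.abs(w).max(axis=1)   # MAW: Maximum Absolute Weight per neuron
vow = w.var(axis=1)           # VOW: Variance of Weights per neuron
pon = np.linalg.norm(gate, axis=1) * np.linalg.norm(up, axis=1)  # PON: Product of Norms

# The criteria need not agree on which neurons matter least, which is the
# point of the ledger entry: the dichotomy is conditional on MAW being the
# right importance proxy.
for name, score in [("MAW", maw), ("VOW", vow), ("PON", pon)]:
    print(name, "would prune first:", np.argsort(score)[:3])
```

Table 9's perplexity blow-ups under VOW and PON (relative to MAW) are the paper's empirical support for this axiom.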
Reference graph
Works this paper leans on
- [1] URL: http://arxiv.org/abs/2509.14230. arXiv:2509.14230 [cs]. Meta AI. llama-models/models/llama3_2/MODEL_card.md at main · meta-llama/llama-models.
- [2] URL: http://arxiv.org/abs/2401.15024. arXiv:2401.15024. Benoît Courty, Victor Schmidt, Goyal-Kamal, inimaz, MarionCoutarel, Luis Blanche, Boris Feld, Jérémy Lecourt, LiamConnell, Amine Saboni, SabAmine, supatomic, Patrick LLORET, Mathilde Léval, Alexis Cruveiller, ouminasara, Franklin Zhao, Christian Bauer, Aditya Joshi, Jerry Laruba Festus, Alexis Bogrof...
- [3] URL: https://arxiv.org/abs/2509.00096v2. Sia Gholami and Marwan Omar. Can pruning make Large Language Models more efficient?, October.
- [4] URL: http://arxiv.org/abs/2310.04573. arXiv:2310.04573 [cs]. Zhiyu Guo, Hidetaka Kamigaito, and Taro Watanabe. Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models, October.
- [5] URL: http://arxiv.org/abs/2405.01943. arXiv:2405.01943 [cs]. Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, and Tao Lei. Instruction-Following Pruning for Large Language Models, June.
- [6] URL: http://arxiv.org/abs/2501.02086. arXiv:2501.02086 [cs]. Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods, June.
- [7] URL: http://arxiv.org/abs/2402.02834. arXiv:2402.02834. Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, and Zining Zhu. Truth Neurons, July.
- [8] URL: http://arxiv.org/abs/2505.12182. arXiv:2505.12182 [cs]. Pere Martra. Exploring GLU expansion ratios: Structured pruning in Llama-3.2 models, December 2024a. URL: https://osf.io/qgxea_v1/. Pere Martra. optipfair: A Library for Structured Pruning and Bias Visualization of Large Language Models, 2024b. URL: https://github.com/peremartra/optipfair. Version 0.2...
- [9] URL: http://arxiv.org/abs/2504.21174. arXiv:2504.21174. Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact Language Models via Pruning and Knowledge Distillation, July.
- [10] URL: http://arxiv.org/abs/2407.14679. arXiv:2407.14679. Waleed Reda, Abhinav Jangda, and Krishna Chintalapudi. How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve, October.
- [11] URL: http://arxiv.org/abs/2505.18350. arXiv:2505.18350 [cs]. Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction, 2023. URL: https://pratyushasharma.github.io/laser/. Noam Shazeer. GLU Variants Improve Transformer, February.
- [12] URL: http://arxiv.org/abs/2002.05202. arXiv:2002.05202 [cs]. Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A Simple and Effective Pruning Approach for Large Language Models, May.
- [13] A Simple and Effective Pruning Approach for Large Language Models. URL: http://arxiv.org/abs/2306.11695. arXiv:2306.11695. Lintang Sutawika, Hailey Schoelkopf, Leo Gao, Baber Abbasi, Stella Biderman, Jonathan Tow, ben fattori, Charles Lovering, farzanehnakhaee70, Jason Phang, Anish Thite, Fazz, Aflah, Niklas Muennighoff, Thomas Wang, sdtblck, nopperl, gakada, tttyuntian, researcher2, Julen Etxaniz, Chris, Hanwool Albe...
- [14] Aalbers, et al. Axfoundation/strax: v1.6.4 (2024). URL: https://zenodo.org/doi/10.5281/zenodo.12608602. Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, and Shiwei Liu. When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs, October.
- [15] URL: http://arxiv.org/abs/2510.22228. arXiv:2510.22228. Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications, October.
- [16] URL: http://arxiv.org/abs/2402.05162. arXiv:2402.05162 [cs]. Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, April.
- [17] URL: http://arxiv.org/abs/2310.06694. arXiv:2310.06694 [cs]. Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A Survey of Large...
- [18] A Survey of Large Language Models. URL: http://arxiv.org/abs/2303.18223. arXiv:2303.18223 [cs]. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A Survey on Model Compression for Large Language Models, July.
- [19] URL: http://arxiv.org/abs/2308.07633. arXiv:2308.07633.

Appendix A, Complete Benchmark Results (Base Models): provides the complete results of the 13 benchmarks from the evaluation suite for all expansion ratio configurations of the Llama-3.2-1B and Llama-3.2-3B base models. All scores are Accuracy or Acc-Norm (higher is better), except WikiText and Lambada, which are Perplexity (lower is better). Data is extracted from the project results files llama_1b_complete_results_latest.json and llama_3b_complete_results_latest.jso...

Table 9: Catastrophic Collapse of Alternative Pruning Methods (10% Pruning, Llama-3.2-1B)

| Selection Criteria | Lambada (PPL) | Δ vs. Base | WikiText-2 (PPL) | Δ vs. Base |
|---|---|---|---|---|
| Baseline (0%) | 5.75 | – | 11.57 | – |
| MAW (Maximum Absolute Weight) | 20.59 | +259% | 17.45 | +51% |
| VOW (Variance of Weights) | 532.36 | +9,207% | 50.56 | +337% |
| PON (Product of Norms) | 2032.80 | +35,440% | 72.52 | +527% |
discussion (0)