Pith · machine review for the scientific record

arXiv:2512.22671 · v2 · submitted 2025-12-27 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: width pruning · instruction following · Llama-3.2 · parametric knowledge · GLU-MLP · model compression · expansion ratio · truthfulness

The pith

MAW-guided width pruning of Llama-3.2 models reduces parametric knowledge while substantially improving instruction-following performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that structured width pruning of GLU-MLP layers, selected by the Maximum Absolute Weight criterion, produces a consistent split in model behavior. Tasks that depend on stored facts, such as MMLU and GSM8K, lose accuracy in a predictable way, yet instruction-following scores on IFEval rise sharply by 46 to 75 percent and multi-step reasoning on MUSR stays intact. The expansion ratio therefore functions as a tunable architectural lever rather than a simple compression knob. The study also reports a strong negative correlation between factual knowledge scores and truthfulness metrics, indicating that reduced knowledge can reduce the model's tendency to repeat misconceptions. These results reframe pruning as a selective filter that trims parametric storage while protecting or strengthening behavioral alignment.

Core claim

Structured width pruning guided by the Maximum Absolute Weight criterion in the GLU-MLP layers of Llama-3.2 models produces a systematic dichotomy: performance on parametric knowledge tasks degrades while instruction-following capabilities improve substantially and multi-step reasoning remains robust. The expansion ratio serves as a critical architectural parameter that selectively modulates these capabilities rather than causing uniform degradation.
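
To make the lever concrete: the expansion ratio is the GLU-MLP intermediate (expansion) width divided by the model's hidden width, so removing expansion neurons lowers it directly. A minimal numeric sketch, assuming the standard Hugging Face configuration for Llama-3.2-1B (hidden size 2048, intermediate size 8192):

```python
# Expansion ratio of a GLU-MLP block = intermediate width / hidden width.
hidden_size, intermediate_size = 2048, 8192        # assumed Llama-3.2-1B config
print(intermediate_size / hidden_size)             # 4.0 at the unpruned baseline

# Width pruning removes expansion neurons; keeping 60% of them
# lowers the ratio to 0.6 * 4.0 = 2.4.
print(0.6 * intermediate_size / hidden_size)       # 2.4
```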

What carries the argument

The Maximum Absolute Weight (MAW) criterion, applied to select which weights to prune inside the expansion layers of GLU-MLP blocks. It acts as a filter that removes connections linked to factual recall while preserving those that support instruction alignment.
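
As a rough illustration of the mechanism, the sketch below scores each expansion neuron of a Llama-style GLU-MLP block by its maximum absolute weight and rebuilds narrower projections from the surviving neurons. The layer names follow the Hugging Face Llama convention (gate_proj, up_proj, down_proj); the score aggregation and the keep_ratio parameter are illustrative assumptions, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

def maw_prune_glu_mlp(gate_proj: nn.Linear, up_proj: nn.Linear,
                      down_proj: nn.Linear, keep_ratio: float = 0.8):
    """Width-prune one GLU-MLP block using a maximum-absolute-weight score.

    Each expansion neuron is scored by the largest |w| among the weights
    that feed it (gate/up rows) and the weights it feeds (down columns);
    the lowest-scoring neurons are removed, shrinking the intermediate width.
    """
    inter = gate_proj.out_features                   # current intermediate width
    n_keep = max(1, int(round(inter * keep_ratio)))  # neurons to retain

    # MAW score per expansion neuron (shape: [inter]).
    score = torch.maximum(
        gate_proj.weight.abs().max(dim=1).values,
        up_proj.weight.abs().max(dim=1).values,
    )
    score = torch.maximum(score, down_proj.weight.abs().max(dim=0).values)

    keep = score.topk(n_keep).indices.sort().values  # keep original neuron order

    def _slice(linear: nn.Linear, rows=None, cols=None) -> nn.Linear:
        w = linear.weight.data
        w = w[rows] if rows is not None else w
        w = w[:, cols] if cols is not None else w
        new = nn.Linear(w.shape[1], w.shape[0], bias=linear.bias is not None)
        new.weight.data.copy_(w)
        if linear.bias is not None:
            b = linear.bias.data
            new.bias.data.copy_(b[rows] if rows is not None else b)
        return new

    # Remove the same neurons from all three projections of the block.
    return (_slice(gate_proj, rows=keep),
            _slice(up_proj, rows=keep),
            _slice(down_proj, cols=keep))
```

Because the same index set is removed from the gate, up, and down projections, the block stays functionally consistent while its intermediate width, and hence the expansion ratio, shrinks; in a real model the config's intermediate_size would also need updating after swapping the pruned layers back in.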

Load-bearing premise

The measured gains in instruction-following result from the pruning step itself rather than from unmeasured differences in training, evaluation prompts, or benchmark formatting.

What would settle it

Retrain the pruned models from scratch using identical data and training procedures, then re-measure IFEval and MMLU scores to check whether the instruction-following advantage disappears.

Figures

Figures reproduced from arXiv:2512.22671 by Pere Martra.

Figure 1. GLU (Gated Linear Unit) architecture within the MLP block of Llama-3.2-1B.
Figure 2. Llama-3.2-1B Benchmarks. Panel A (Fragile Capabilities) shows the predictable collapse of …
Figure 3. Llama-3.2-3B Benchmarks. Panel A (Fragile Capabilities) demonstrates monotonic degradation of …
Figure 4. The Truthfulness Paradox. Divergent trajectories of factual knowledge (MMLU, blue dashed) …
Figure 5. Efficiency Trade-offs: Single-Request vs Batch Processing. Panel A (Llama-3.2-1B) and Panel …
read the original abstract

Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that MAW-guided width pruning of GLU-MLP layers in Llama-3.2-1B and 3B models produces a capability dichotomy: predictable degradation on parametric knowledge tasks (MMLU, GSM8K) and perplexity as the expansion ratio is reduced, contrasted with substantial gains in instruction-following (+46% to +75% on IFEval) and preserved multi-step reasoning (MUSR), plus a robust inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2; the authors interpret this as pruning acting as a selective filter that reduces knowledge while enhancing behavioral alignment, with additional efficiency trade-offs in energy and latency.

Significance. If the causal attribution to pruning holds after controls, the result would be significant for showing that targeted architectural compression can improve alignment metrics by reducing parametric knowledge, linking pruning literature to truthfulness research, and providing a practical lever (expansion ratio) for capability trade-offs rather than uniform degradation.

major comments (2)
  1. [Abstract] The reported IFEval gains (+46% to +75%) and the selective-preservation claim are presented without any information on whether the seven pruned configurations received the same fine-tuning steps, learning-rate schedules, and prompt templates as the unpruned baselines; this control is load-bearing for attributing the dichotomy to width reduction rather than to training or evaluation differences.
  2. [Abstract] The inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2 is stated without the number of data points (expansion ratios), error bars, or regression details, which weakens the statistical support for the knowledge-truthfulness link.
minor comments (1)
  1. The abstract refers to 'seven expansion ratio configurations' but does not enumerate them or reference a table; listing the exact ratios tested would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key methodological and statistical aspects of our work. We address each point below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The reported IFEval gains (+46% to +75%) and the selective-preservation claim are presented without any information on whether the seven pruned configurations received the same fine-tuning steps, learning-rate schedules, and prompt templates as the unpruned baselines; this control is load-bearing for attributing the dichotomy to width reduction rather than to training or evaluation differences.

    Authors: We agree that explicit details on evaluation controls are necessary to support causal attribution to pruning. All seven pruned configurations and the unpruned baselines were evaluated under identical conditions: the same prompt templates, zero-shot inference settings, and benchmark implementations, with no post-pruning fine-tuning or additional training steps applied to any model. Pruning was performed directly on the pre-trained weights, and evaluation followed standard protocols uniformly across all expansion ratios. To address the concern, we will revise the abstract and expand the Methods section to explicitly document this uniform evaluation protocol, including confirmation that no differential fine-tuning occurred. revision: yes

  2. Referee: [Abstract] The inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2 is stated without the number of data points (expansion ratios), error bars, or regression details, which weakens the statistical support for the knowledge-truthfulness link.

    Authors: The reported correlation uses the seven expansion ratio configurations tested on the Llama-3.2-3B model (n=7 data points). The full manuscript includes a corresponding scatter plot (Figure 3) showing the regression. We will update the abstract, main text, and figure caption to state the number of points explicitly, include error bars (standard deviation from repeated evaluations where available), and provide full regression details such as the exact computation of the p-value and slope. This will strengthen the statistical presentation without altering the underlying result. revision: yes
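
As a quick consistency check on this exchange, the reported r and p can be cross-checked against n = 7 with the standard Pearson-to-t conversion; this verifies only the arithmetic, not the underlying benchmark scores:

```python
from math import sqrt
from scipy import stats

# Values taken from the abstract: r = -0.864 over n = 7 expansion-ratio configurations.
r, n = -0.864, 7

# A Pearson correlation maps to a t statistic with n - 2 degrees of freedom.
t = r * sqrt((n - 2) / (1 - r**2))
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"t = {t:.2f}, two-sided p = {p:.3f}")   # roughly t = -3.84, p ≈ 0.012
```

The back-computed value lands at the reported p ≈ 0.012, consistent with a correlation taken over exactly seven configurations.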

Circularity Check

0 steps flagged

No significant circularity; results are direct benchmark measurements

full rationale

The paper reports empirical outcomes from MAW-guided width pruning experiments on Llama-3.2 models, with all key claims (IFEval gains, MMLU degradation, inverse correlation r=-0.864, energy trade-offs) presented as direct measurements on standard public benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described content. The analysis identifies the expansion ratio as a modulator based on observed patterns rather than on any self-definitional or self-citation load-bearing step. The evidential chain is grounded in external benchmarks and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical benchmark results obtained with a standard weight-magnitude pruning rule; no new free parameters are fitted and no new entities are postulated.

axioms (1)
  • domain assumption: Maximum absolute weight is a reliable proxy for weight importance in GLU-MLP layers.
    Invoked to justify the pruning criterion without additional validation in the provided abstract.

pith-pipeline@v0.9.0 · 5597 in / 1149 out tokens · 51861 ms · 2026-05-16T18:52:20.901247+00:00 · methodology

