Pith · machine review for the scientific record

arXiv:2512.22671 · v2 · submitted 2025-12-27 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: width pruning · instruction following · Llama-3.2 · parametric knowledge · GLU-MLP · model compression · expansion ratio · truthfulness

The pith

MAW-guided width pruning of Llama-3.2 models reduces parametric knowledge while substantially improving instruction-following performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that structured width pruning of GLU-MLP layers, selected by the Maximum Absolute Weight criterion, produces a consistent split in model behavior. Tasks that depend on stored facts, such as MMLU and GSM8K, lose accuracy in a predictable way, yet instruction-following scores on IFEval rise sharply by 46 to 75 percent and multi-step reasoning on MUSR stays intact. The expansion ratio therefore functions as a tunable architectural lever rather than a simple compression knob. The study also reports a strong negative correlation between factual knowledge scores and truthfulness metrics, indicating that reduced knowledge can reduce the model's tendency to repeat misconceptions. These results reframe pruning as a selective filter that trims parametric storage while protecting or strengthening behavioral alignment.

Core claim

Structured width pruning guided by the Maximum Absolute Weight criterion in the GLU-MLP layers of Llama-3.2 models produces a systematic dichotomy: performance on parametric knowledge tasks degrades while instruction-following capabilities improve substantially and multi-step reasoning remains robust. The expansion ratio serves as a critical architectural parameter that selectively modulates these capabilities rather than causing uniform degradation.
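
To make the lever concrete: the expansion ratio is the GLU-MLP intermediate (expansion) width divided by the model's hidden width, so removing expansion neurons lowers it directly. A minimal numeric sketch, assuming the standard Hugging Face configuration for Llama-3.2-1B (hidden size 2048, intermediate size 8192):

```python
# Expansion ratio of a GLU-MLP block = intermediate width / hidden width.
hidden_size, intermediate_size = 2048, 8192        # assumed Llama-3.2-1B config
print(intermediate_size / hidden_size)             # 4.0 at the unpruned baseline

# Width pruning removes expansion neurons; keeping 60% of them
# lowers the ratio to 0.6 * 4.0 = 2.4.
print(0.6 * intermediate_size / hidden_size)       # 2.4
```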

What carries the argument

The Maximum Absolute Weight (MAW) criterion, applied to select which weights to prune inside the expansion layers of GLU-MLP blocks. It acts as a filter that removes connections linked to factual recall while preserving those that support instruction alignment.
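
As a rough illustration of the mechanism, the sketch below scores each expansion neuron of a Llama-style GLU-MLP block by its maximum absolute weight and rebuilds narrower projections from the surviving neurons. The layer names follow the Hugging Face Llama convention (gate_proj, up_proj, down_proj); the score aggregation and the keep_ratio parameter are illustrative assumptions, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

def maw_prune_glu_mlp(gate_proj: nn.Linear, up_proj: nn.Linear,
                      down_proj: nn.Linear, keep_ratio: float = 0.8):
    """Width-prune one GLU-MLP block using a maximum-absolute-weight score.

    Each expansion neuron is scored by the largest |w| among the weights
    that feed it (gate/up rows) and the weights it feeds (down columns);
    the lowest-scoring neurons are removed, shrinking the intermediate width.
    """
    inter = gate_proj.out_features                   # current intermediate width
    n_keep = max(1, int(round(inter * keep_ratio)))  # neurons to retain

    # MAW score per expansion neuron (shape: [inter]).
    score = torch.maximum(
        gate_proj.weight.abs().max(dim=1).values,
        up_proj.weight.abs().max(dim=1).values,
    )
    score = torch.maximum(score, down_proj.weight.abs().max(dim=0).values)

    keep = score.topk(n_keep).indices.sort().values  # keep original neuron order

    def _slice(linear: nn.Linear, rows=None, cols=None) -> nn.Linear:
        w = linear.weight.data
        w = w[rows] if rows is not None else w
        w = w[:, cols] if cols is not None else w
        new = nn.Linear(w.shape[1], w.shape[0], bias=linear.bias is not None)
        new.weight.data.copy_(w)
        if linear.bias is not None:
            b = linear.bias.data
            new.bias.data.copy_(b[rows] if rows is not None else b)
        return new

    # Remove the same neurons from all three projections of the block.
    return (_slice(gate_proj, rows=keep),
            _slice(up_proj, rows=keep),
            _slice(down_proj, cols=keep))
```

Because the same index set is removed from the gate, up, and down projections, the block stays functionally consistent while its intermediate width, and hence the expansion ratio, shrinks; in a real model the config's intermediate_size would also need updating after swapping the pruned layers back in.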

Load-bearing premise

The measured gains in instruction-following result from the pruning step itself rather than from unmeasured differences in training, evaluation prompts, or benchmark formatting.

What would settle it

Retrain the pruned models from scratch using identical data and training procedures, then re-measure IFEval and MMLU scores to check whether the instruction-following advantage disappears.

Figures

Figures reproduced from arXiv:2512.22671 by Pere Martra.

Figure 1. GLU (Gated Linear Unit) architecture within the MLP block of Llama-3.2-1B.
Figure 2. Llama-3.2-1B Benchmarks. Panel A (Fragile Capabilities) shows the predictable collapse of …
Figure 3. Llama-3.2-3B Benchmarks. Panel A (Fragile Capabilities) demonstrates monotonic degradation of …
Figure 4. The Truthfulness Paradox. Divergent trajectories of factual knowledge (MMLU, blue dashed) …
Figure 5. Efficiency Trade-offs: Single-Request vs Batch Processing. Panel A (Llama-3.2-1B) and Panel …
read the original abstract

Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that MAW-guided width pruning of GLU-MLP layers in Llama-3.2-1B and 3B models produces a capability dichotomy: predictable degradation on parametric knowledge tasks (MMLU, GSM8K) and perplexity as the expansion ratio is reduced, contrasted with substantial gains in instruction-following (+46% to +75% on IFEval) and preserved multi-step reasoning (MUSR), plus a robust inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2; the authors interpret this as pruning acting as a selective filter that reduces knowledge while enhancing behavioral alignment, with additional efficiency trade-offs in energy and latency.

Significance. If the causal attribution to pruning holds after controls, the result would be significant for showing that targeted architectural compression can improve alignment metrics by reducing parametric knowledge, linking pruning literature to truthfulness research, and providing a practical lever (expansion ratio) for capability trade-offs rather than uniform degradation.

major comments (2)
  1. [Abstract] The reported IFEval gains (+46% to +75%) and the selective-preservation claim are presented without any information on whether the seven pruned configurations received the same fine-tuning steps, learning-rate schedules, and prompt templates as the unpruned baselines; this control is load-bearing for attributing the dichotomy to width reduction rather than to training or evaluation differences.
  2. [Abstract] The inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2 is stated without the number of data points (expansion ratios), error bars, or regression details, which weakens the statistical support for the knowledge-truthfulness link.
minor comments (1)
  1. The abstract refers to 'seven expansion ratio configurations' but does not enumerate them or reference a table; listing the exact ratios tested would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key methodological and statistical aspects of our work. We address each point below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The reported IFEval gains (+46% to +75%) and the selective-preservation claim are presented without any information on whether the seven pruned configurations received the same fine-tuning steps, learning-rate schedules, and prompt templates as the unpruned baselines; this control is load-bearing for attributing the dichotomy to width reduction rather than to training or evaluation differences.

    Authors: We agree that explicit details on evaluation controls are necessary to support causal attribution to pruning. All seven pruned configurations and the unpruned baselines were evaluated under identical conditions: the same prompt templates, zero-shot inference settings, and benchmark implementations, with no post-pruning fine-tuning or additional training steps applied to any model. Pruning was performed directly on the pre-trained weights, and evaluation followed standard protocols uniformly across all expansion ratios. To address the concern, we will revise the abstract and expand the Methods section to explicitly document this uniform evaluation protocol, including confirmation that no differential fine-tuning occurred. revision: yes

  2. Referee: [Abstract] The inverse correlation (r = -0.864, p = 0.012) between MMLU and TruthfulQA-MC2 is stated without the number of data points (expansion ratios), error bars, or regression details, which weakens the statistical support for the knowledge-truthfulness link.

    Authors: The reported correlation uses the seven expansion ratio configurations tested on the Llama-3.2-3B model (n=7 data points). The full manuscript includes a corresponding scatter plot (Figure 3) showing the regression. We will update the abstract, main text, and figure caption to state the number of points explicitly, include error bars (standard deviation from repeated evaluations where available), and provide full regression details such as the exact computation of the p-value and slope. This will strengthen the statistical presentation without altering the underlying result. revision: yes
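
As a quick consistency check on this exchange, the reported r and p can be cross-checked against n = 7 with the standard Pearson-to-t conversion; this verifies only the arithmetic, not the underlying benchmark scores:

```python
from math import sqrt
from scipy import stats

# Values taken from the abstract: r = -0.864 over n = 7 expansion-ratio configurations.
r, n = -0.864, 7

# A Pearson correlation maps to a t statistic with n - 2 degrees of freedom.
t = r * sqrt((n - 2) / (1 - r**2))
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"t = {t:.2f}, two-sided p = {p:.3f}")   # roughly t = -3.84, p ≈ 0.012
```

The back-computed value lands at the reported p ≈ 0.012, consistent with a correlation taken over exactly seven configurations.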

Circularity Check

0 steps flagged

No significant circularity; results are direct benchmark measurements

full rationale

The paper reports empirical outcomes from MAW-guided width pruning experiments on Llama-3.2 models, with all key claims (IFEval gains, MMLU degradation, inverse correlation r=-0.864, energy trade-offs) presented as direct measurements on standard public benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described content. The analysis identifies the expansion ratio as a modulator based on observed patterns rather than on any self-definitional or self-citation load-bearing step. The evidential chain is grounded in external benchmarks and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical benchmark results obtained with a standard weight-magnitude pruning rule; no new free parameters are fitted and no new entities are postulated.

axioms (1)
  • domain assumption: Maximum absolute weight is a reliable proxy for weight importance in GLU-MLP layers.
    Invoked to justify the pruning criterion without additional validation in the provided abstract.

pith-pipeline@v0.9.0 · 5597 in / 1149 out tokens · 51861 ms · 2026-05-16T18:52:20.901247+00:00 · methodology

