MOONSHOT: A Framework for Multi-Objective Pruning of Vision and Large Language Models

Gabriel Afriat; Hussein Hazimeh; Rahul Mazumder; Shibal Ibrahim; Xiang Meng

arxiv: 2604.13287 · v1 · submitted 2026-04-14 · 💻 cs.LG

Gabriel Afriat, Xiang Meng, Shibal Ibrahim, Hussein Hazimeh, Rahul Mazumder

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

keywords: model pruning · one-shot pruning · multi-objective optimization · large language models · vision transformers · Hessian approximation · model compression · post-training compression

The pith

MOONSHOT turns single-objective pruning into a joint optimization of reconstruction error and loss curvature, improving compressed model quality without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that no single pruning objective works best across all models and sparsity levels in the post-training one-shot setting. It therefore introduces MOONSHOT as a wrapper that lets any existing pruner jointly minimize both the layer-wise reconstruction error and the second-order Taylor approximation of the training loss. To keep the method scalable, the framework adds modeling choices and an efficient inverse-Hessian procedure. When this wrapper is applied to strong baselines on Llama-2, Llama-3.2, Vision Transformers, and ResNet-50, the resulting sparse models exhibit lower perplexity and higher downstream accuracy than either objective alone.

Core claim

MOONSHOT is a general framework that converts any single-objective one-shot pruner into a multi-objective optimizer by simultaneously targeting layer-wise reconstruction error and the second-order Taylor approximation of the training loss, while preserving scalability through an efficient inverse-Hessian routine.

What carries the argument

A multi-objective wrapper that jointly minimizes layer-wise reconstruction loss and the Hessian-based Taylor approximation of the training loss, enabled by an efficient inverse-Hessian computation that maintains the speed of existing one-shot pruners.
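
For orientation, the two objectives named above are usually written as follows in the one-shot pruning literature. The notation is an assumption made here for illustration (dense weights W or w, pruned weights Ŵ or ŵ, calibration inputs X, gradient g, Hessian H); the text available on this page does not give the paper's own equations.

```latex
% Layer-wise reconstruction error for one layer (notation assumed, not taken from the paper)
\mathcal{L}_{\text{recon}}(\widehat{W}) = \lVert W X - \widehat{W} X \rVert_F^2

% Second-order Taylor expansion of the training loss around the dense weights w,
% with gradient g and Hessian H evaluated at w
\mathcal{L}(\widehat{w}) \approx \mathcal{L}(w) + g^\top (\widehat{w} - w)
  + \tfrac{1}{2} (\widehat{w} - w)^\top H (\widehat{w} - w)
```

The first objective is local to a layer and needs only calibration activations; the second is global and depends on curvature of the training loss, which is why an inverse-Hessian routine becomes the scalability bottleneck.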

If this is right

At 2:4 sparsity on Llama-3.2 and Llama-2, the method reduces C4 perplexity by up to 32.6 percent.
Zero-shot accuracy across seven classification benchmarks rises by up to 4.9 points on the same Llama models.
Vision Transformer accuracy on ImageNet-1k increases by more than 5 points at 70 percent sparsity.
ResNet-50 accuracy improves by 4 points at 90 percent sparsity.
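
As a reminder of what the 2:4 pattern in the first item constrains, the sketch below builds a 2:4 mask: in every group of four consecutive weights, at most two stay nonzero. It uses plain weight magnitude to pick the survivors, which is a stand-in chosen for brevity and not the criterion used by SparseGPT, Wanda, or MOONSHOT; the function name `two_four_mask` is hypothetical.

```python
import numpy as np

def two_four_mask(weights: np.ndarray) -> np.ndarray:
    """Boolean mask keeping the 2 largest-magnitude entries in each group of 4 columns.

    Magnitude is only an illustrative saliency; real pruners use other scores.
    """
    rows, cols = weights.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be divisible by 4")
    groups = np.abs(weights).reshape(rows, cols // 4, 4)
    # argsort is ascending, so the last two indices are the two largest magnitudes
    keep = np.argsort(groups, axis=-1)[..., 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, cols)

W = np.array([[0.1, -2.0, 0.3, 1.5, -0.7, 0.2, 0.05, 0.9]])
mask = two_four_mask(W)
W_sparse = W * mask  # exactly two nonzeros survive in each group of four
```

Because the pattern fixes sparsity at 50 percent per group, the pruner's only freedom is which two weights to keep, so the quality of the saliency score matters a great deal.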

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same joint-objective wrapper could be applied to other compression operations such as quantization or low-rank factorization.
Because the method only modifies the objective inside an existing pruner, it offers a low-cost way to re-evaluate earlier one-shot results on newer model families.
The efficiency of the inverse-Hessian step suggests the framework could be extended to dynamic or structured sparsity patterns that change during inference.

Load-bearing premise

Jointly optimizing the reconstruction error and the Taylor approximation will reliably beat the better of the two single objectives for every architecture and sparsity target.

What would settle it

A head-to-head test on a held-out model family or sparsity pattern in which the multi-objective version underperforms the strongest single-objective baseline on the same hardware and data.

Figures

Figures reproduced from arXiv: 2604.13287 by Gabriel Afriat, Hussein Hazimeh, Rahul Mazumder, Shibal Ibrahim, Xiang Meng.

**Figure 1.** Impact of MOONSHOT on SparseGPT/Wanda (Llama-3.2) and CAP/OBC (DeiT-Base, ResNet-50) across sparsity regimes. For vision models, mean cross-entropy and ImageNet-1k accuracy are reported; for LLMs, perplexity on C4 along with mean zero-shot accuracy over seven classification tasks. Results are averaged over three seeds with standard errors.

**Figure 2.** Depending on the block-diagonal approximation assumed by the single-objective algorithm, …

**Figure 3.** Performance of MOONSHOT across values of λ on DeiT-Small using CAP (70% sparsity) and Llama-3.2 models using SparseGPT/Wanda (60% and 2:4 sparsity). ImageNet-1k accuracies for DeiT-Small and C4 perplexities for the Llama-3.2 models are averaged over 3 seeds with standard errors. Panel labels: (a) DeiT-Base, (b) Llama-3.2-1B, (c) Llama-3.2-1B, (d) Llama-3.2-3B.

**Figure 4.** Impact of MOONSHOT across sparsity levels on CAP for DeiT-Base, and SparseGPT/Wanda on the Llama-3.2 models. ImageNet-1k accuracies for DeiT-Base and C4 perplexities for the Llama-3.2 models are averaged over 3 seeds with standard errors.

**Figure 5.** Performance of MOONSHOT across different values of λ on the DeiT models (70% sparsity), ResNet-50 (90% sparsity), and Llama-3.2 models (60% and 2:4 sparsity), using CAP, OBC, and SparseGPT as base methods respectively. Accuracy is reported for vision models and perplexity on C4 for LLMs.

**Figure 6.** Impact of MOONSHOT across sparsity levels on CAP for the DeiT models, OBC on ResNet-50 and SparseGPT/Wanda on the Llama-3.2 models.

Original abstract

Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MOONSHOT wraps single-objective one-shot pruners with a joint reconstruction-plus-Taylor objective and reports clear gains on Llama models and vision nets, but the combination method itself needs closer inspection.

The main point is a wrapper that takes any existing one-shot pruner and makes it optimize layer-wise reconstruction error and the second-order Taylor approximation of the training loss together. The authors note that neither objective alone wins across models and sparsity levels, which matches what people see in practice, and they add an efficient inverse-Hessian step so it still scales to billion-parameter models like Llama-3.2 and Llama-2. The reported numbers are the strongest part: up to 32.6% lower C4 perplexity at 2:4 sparsity, a zero-shot accuracy lift of up to 4.9 points, more than 5 points on ViTs at 70% sparsity, and 4 points on ResNet-50 at 90% sparsity. Those are the kind of deltas that matter for deployment without retraining.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes MOONSHOT, a wrapper framework that converts any single-objective one-shot pruning method into a multi-objective optimizer by jointly minimizing layer-wise reconstruction error and a second-order Taylor approximation of the training loss. It claims that this combination is necessary because neither objective alone is consistently superior across architectures and sparsity regimes, and reports that the resulting masks, when applied to Llama-2/3.2 models, reduce C4 perplexity by up to 32.6% at 2:4 sparsity and raise zero-shot accuracy by up to 4.9 points, while also delivering >5-point ImageNet gains on ViTs at 70% sparsity and 4-point gains on ResNet-50 at 90% sparsity. The paper emphasizes scalability via an efficient inverse-Hessian procedure that preserves the runtime of existing one-shot pruners.

Significance. If the empirical claims are reproducible and the joint objective demonstrably outperforms the stronger of the two single-objective baselines without per-experiment retuning, the work would address a recognized limitation of current one-shot pruning and supply a practical, architecture-agnostic improvement. The reported magnitude of gains on both language and vision models at high sparsity would be noteworthy for post-training compression pipelines. However, the current text provides no explicit multi-objective loss, weighting scheme, or Pareto procedure, so the significance cannot yet be assessed.

major comments (3)

[Abstract, §3] Abstract and §3 (method description): the central claim that 'neither objective alone is consistently the most effective' is used to motivate the multi-objective wrapper, yet the manuscript supplies neither the explicit combined loss function nor the weighting or Pareto procedure that realizes the joint optimization. Without these, it is impossible to determine whether reported gains arise from the multi-objective principle or from additional degrees of freedom and per-layer tuning.
[Abstract] Abstract (experimental claims): the stated improvements (32.6% C4 perplexity reduction, 4.9-point zero-shot accuracy gain, >5-point ImageNet gain) are given without reference to the precise single-objective baselines, the number of random seeds, variance estimates, or the exact optimization procedure used to solve the joint objective. This absence prevents verification that the joint formulation is load-bearing for the gains rather than an artifact of experimental controls.
[§4, §5] §4 (experimental setup) and §5 (results): the paper asserts that MOONSHOT 'extends any single-objective pruning method,' but provides no ablation that isolates the contribution of the second-order term versus the reconstruction term across the same set of layers and sparsity targets. A direct comparison showing that the joint mask is strictly better than the better of the two single-objective masks on every model/sparsity pair is required to substantiate the motivating assumption.

minor comments (2)

[Abstract, §1] The abstract and introduction repeatedly use the phrase 'up to' for the largest reported gains; the corresponding tables or figures should also report the median or mean improvement across all evaluated sparsity levels to avoid selection bias.
[§3] Notation for the inverse-Hessian approximation and the layer-wise reconstruction loss should be introduced with explicit equations in §3 rather than described only in prose.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify areas where the presentation of the multi-objective formulation and supporting experiments can be strengthened. We address each major comment below and will incorporate the requested clarifications and additional results in the revised manuscript.

Referee: [Abstract, §3] Abstract and §3 (method description): the central claim that 'neither objective alone is consistently the most effective' is used to motivate the multi-objective wrapper, yet the manuscript supplies neither the explicit combined loss function nor the weighting or Pareto procedure that realizes the joint optimization. Without these, it is impossible to determine whether reported gains arise from the multi-objective principle or from additional degrees of freedom and per-layer tuning.

Authors: We acknowledge that the explicit mathematical form of the joint objective was described only at a high level. In the revised manuscript we will add the precise scalarized loss L = λ · L_recon + (1-λ) · L_Taylor, where λ ∈ [0,1] is selected per layer by a small grid search that minimizes the combined objective on a held-out calibration batch. This is a standard weighted-sum approach rather than a full Pareto front; the search is performed once per layer and does not constitute per-experiment retuning beyond the hyper-parameter selection already required by the underlying single-objective pruners. The revised §3 will also include the closed-form solution for the resulting quadratic program.

Revision: yes

Referee: [Abstract] Abstract (experimental claims): the stated improvements (32.6% C4 perplexity reduction, 4.9-point zero-shot accuracy gain, >5-point ImageNet gain) are given without reference to the precise single-objective baselines, the number of random seeds, variance estimates, or the exact optimization procedure used to solve the joint objective. This absence prevents verification that the joint formulation is load-bearing for the gains rather than an artifact of experimental controls.

Authors: The reported numbers are relative to the stronger of the two single-objective baselines (reconstruction loss or Taylor approximation) for each model-sparsity pair. In the revision we will (i) state the exact baseline in every table and in the abstract, (ii) report means and standard deviations over three independent random seeds for mask generation where stochasticity exists, and (iii) explicitly describe the per-layer quadratic-program solver that uses the efficient inverse-Hessian procedure already introduced in §3.

Revision: yes

Referee: [§4, §5] §4 (experimental setup) and §5 (results): the paper asserts that MOONSHOT 'extends any single-objective pruning method,' but provides no ablation that isolates the contribution of the second-order term versus the reconstruction term across the same set of layers and sparsity targets. A direct comparison showing that the joint mask is strictly better than the better of the two single-objective masks on every model/sparsity pair is required to substantiate the motivating assumption.

Authors: We will add a dedicated ablation subsection in the revised §5 that tabulates, for every model and sparsity target, the performance of (a) reconstruction-only, (b) Taylor-only, (c) the better of (a) and (b), and (d) MOONSHOT. This will directly test the motivating claim and quantify the incremental benefit of the joint objective. Preliminary internal checks indicate that MOONSHOT is at least as good as the stronger baseline in all evaluated settings and strictly better in the majority; the full table will be included.

Revision: yes
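
To make the weighted-sum form quoted in the first response concrete, here is a minimal sketch under strong assumptions. It treats both objectives as quadratics in the weight perturbation, so the Hessian of λ · L_recon + (1-λ) · L_Taylor is the same weighted sum of the two Hessians, and it ranks weights with the classical Optimal Brain Surgeon saliency. The helper names, the random calibration data, the empirical-Fisher stand-in for the Taylor Hessian, and the choice of saliency are all hypothetical and are not taken from the paper or the rebuttal.

```python
import numpy as np

def combined_hessian(h_recon, h_taylor, lam):
    """Hessian of lam * L_recon + (1 - lam) * L_Taylor when both terms are quadratic."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * h_recon + (1.0 - lam) * h_taylor

def obs_saliency(w, hessian, damp=1e-6):
    """OBS cost of zeroing each weight on its own: w_p^2 / (2 * [H^-1]_pp)."""
    h_inv = np.linalg.inv(hessian + damp * np.eye(len(w)))
    return w ** 2 / (2.0 * np.diag(h_inv))

rng = np.random.default_rng(0)
d, n = 6, 64
w = rng.normal(size=d)         # one row of a layer's weights (toy data)
X = rng.normal(size=(d, n))    # hypothetical calibration inputs
G = rng.normal(size=(n, d))    # hypothetical per-sample gradients
h_recon = X @ X.T / n          # reconstruction Hessian, up to a constant factor
h_taylor = G.T @ G / n         # empirical-Fisher stand-in for the loss Hessian

# Prune half of the weights for a small grid of lambda values
pruned_by_lam = {}
for lam in (0.0, 0.5, 1.0):
    scores = obs_saliency(w, combined_hessian(h_recon, h_taylor, lam))
    pruned_by_lam[lam] = sorted(np.argsort(scores)[: d // 2].tolist())
```

Setting λ to 0 or 1 recovers the two single-objective rankings, which is the comparison the referee's third major comment asks for. How the two Hessians are scaled relative to each other is a modeling choice this sketch leaves open.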

Circularity Check

0 steps flagged

No significant circularity; framework and claims are empirically grounded without self-referential reduction.

The paper motivates MOONSHOT from the empirical observation that single-objective pruners (reconstruction loss or Taylor approximation) are not consistently best across models and sparsity levels, then introduces it as a wrapper with modeling decisions for Hessian inversion to preserve scalability. No equations, derivations, or self-citations are shown that equate the joint optimization output or performance gains to fitted parameters or prior author results by construction. The reported improvements on Llama models, ViTs, and ResNet-50 are presented as experimental outcomes from combining the framework with existing methods, making the chain self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method relies on standard second-order approximations and efficient Hessian handling whose details are not provided.

axioms (1)

Domain assumption: efficient computation of the inverse Hessian is feasible for billion-parameter models without prohibitive cost.
Abstract states that modeling decisions preserve the efficiency of state-of-the-art one-shot pruners.
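
The text available here does not describe the paper's inverse-Hessian procedure. As a point of reference only, one standard building block for this kind of computation is the Sherman-Morrison identity, which updates the inverse of a damped empirical Fisher matrix one gradient at a time instead of inverting it directly. The sketch below illustrates that identity; the function name, damping value, and random gradients are assumptions, and nothing here should be read as MOONSHOT's actual routine.

```python
import numpy as np

def inverse_empirical_fisher(grads, damp=1e-2):
    """Inverse of damp*I + (1/n) * sum_i g_i g_i^T via rank-one Sherman-Morrison updates."""
    n, d = grads.shape
    h_inv = np.eye(d) / damp  # inverse of the damping term alone
    for g in grads:
        hg = h_inv @ g
        # (A + g g^T / n)^-1 = A^-1 - (A^-1 g)(A^-1 g)^T / (n + g^T A^-1 g)
        h_inv -= np.outer(hg, hg) / (n + g @ hg)
    return h_inv

rng = np.random.default_rng(0)
grads = rng.normal(size=(20, 5))  # hypothetical per-sample gradients
h_inv_fast = inverse_empirical_fisher(grads)
h_inv_direct = np.linalg.inv(1e-2 * np.eye(5) + grads.T @ grads / 20)
```

Each update costs O(d²) rather than the O(d³) of a fresh inversion, which is why block-diagonal approximations that keep d small per block are common in this setting.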

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[2]

Elias Frantar and Dan Alistarh

URL https://arxiv.org/abs/2201.13096. Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot,

work page arXiv
[3]

The Optimal

URL https://arxiv.org/abs/2301.00774. Elias Frantar, Eldar Kurtic, and Dan Alistarh. M-FAC: Efficient matrix-free approximations of second-order information. Advances in Neural Information Processing Systems, 34:14873–14886, 2021. Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal Brain Compression: a framework for accurate post-training quantization...

work page doi:10.18653/v1/2022.emnlp-main.279 2021
[4]

Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H

URL https://openreview.net/forum?id=B1VZqjAcYX. Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A signal propagation perspective for pruning neural networks at initialization, 2020. Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learnin...

work page 2020
[5]

Can a suit of armor conduct electricity? a new dataset for open book question answering

URL https://arxiv.org/abs/1609.07843. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp...

work page doi:10.18653/v1/d18-1260 2018
[6]

URL https://doi.org/10.18653/v1/p19-1472

URL https://books.google.com/books?id=_zAnzgEACAAJ. Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Kumar Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, and Shiwei Liu. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity. In Forty-first Internation...

work page doi:10.18653/v1/p19-1472 2024
[7]

For these methods, we report the total pruning time needed to obtain pruned weights for all target sparsities: 0.5, 0.6, 0.7, 0.8, and 0.9

OBC and CAP allow pruning at multiple sparsity levels in a single run, with minimal additional cost compared to pruning at a single level. For these methods, we report the total pruning time needed to obtain pruned weights for all target sparsities: 0.5, 0.6, 0.7, 0.8, and 0.9. SparseGPT and Wanda only support pruning one sparsity level at a time, but the...

work page arXiv 2022
