LLM Compression by Block Removal with Constrained Binary Optimization

Ali Hashemi; David Jansen; David Montero; Román Orús; Roman Rausch

arXiv: 2602.00161 · v2 · pith:SWNYXZTZnew · submitted 2026-01-29 · cs.LG · cs.AI · cs.CL · quant-ph

keywords: compression, b-instruct, block, llama-3, problem, binary, block-removal, constrained

Abstract

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations, yielding many high-quality, non-trivial solutions beyond those that remove only consecutive regions. Our method performs strongly in the deep compression regime, such as 50% compression of Llama-3.3-70B-Instruct, where we achieve an increase of almost 23 percentage points on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), and Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that good heuristic solvers for the CBO problem provide solutions that perform well on downstream tasks in negligible runtime when it is infeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA on AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.
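
To make the idea of ranking block-removal configurations by an energy more concrete, here is a minimal sketch of a generic constrained binary optimization. It is not the authors' formulation: the quadratic energy form, the coefficient names `h` and `J`, the function names, and the random toy values are all assumptions made for illustration, since the abstract does not state how the energies are built. Each binary variable marks a block as removed (1) or kept (0), the constraint fixes the number of removed blocks at `k`, and the candidates are ranked by exhaustive enumeration, which is only practical for small problems.

```python
import itertools
import numpy as np

def energy(x, h, J):
    """Quadratic energy of a binary removal mask x (1 = block removed).

    h (per-block cost) and J (pairwise coupling) are hypothetical
    coefficients; the abstract does not specify how they are obtained.
    """
    # Use only the upper triangle so each pair (i, j) is counted once.
    return float(h @ x + x @ np.triu(J, k=1) @ x)

def rank_configurations(h, J, k):
    """Enumerate every mask removing exactly k blocks, lowest energy first."""
    n = len(h)
    results = []
    for removed in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(removed)] = 1.0
        results.append((energy(x, h, J), removed))
    results.sort(key=lambda item: item[0])
    return results

# Toy instance with made-up coefficients: 8 blocks, remove 3.
rng = np.random.default_rng(0)
n_blocks, n_remove = 8, 3
h = rng.normal(size=n_blocks)
J = rng.normal(size=(n_blocks, n_blocks))
J = (J + J.T) / 2  # symmetric couplings

ranking = rank_configurations(h, J, n_remove)
for e, removed in ranking[:5]:
    print(f"energy={e:+.3f}  removed blocks={removed}")
```

Because the enumeration does not favour adjacent indices, low-energy candidates can remove non-consecutive blocks. The number of candidates grows combinatorially with the number of blocks, which is the situation where the abstract points to heuristic solvers instead of exact ones.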

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
cs.LG 2026-04 unverdicted novelty 7.0

Calibration objectives influence redundant layer identification in LLM depth pruning more than search algorithms do, with different objectives producing different layer rankings.

Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
cs.LG 2026-04 unverdicted novelty 6.0

Different calibration objectives produce distinct layer pruning patterns in LLMs, while search algorithms converge to similar solutions under a fixed objective.