PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Jiaqi Zhao; Kaihao Zhang; Miao Zhang; Ming Wang; Min Zhang; Weili Guan; Yaowei Wang; Yuzhang Shang

arxiv: 2502.13179 · v2 · pith:VECKID5Cnew · submitted 2025-02-18 · 💻 cs.LG · cs.AI

PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Jiaqi Zhao , Miao Zhang , Ming Wang , Yuzhang Shang , Kaihao Zhang , Weili Guan , Yaowei Wang , Min Zhang This is my paper

classification 💻 cs.LG cs.AI

keywords quantizationextremelylow-bitweightptq1calledchannelsfirst

0 comments

read the original abstract

Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mix-precision scheme by leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, while which introduces an extra 1-bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask with negligibly additional 0.0002-bit per weight based on input activations from the perspective of reducing the upper bound of quantization error to allocate corresponding salient weight channels to 4-bit. For non-salient channels binarization, an efficient block-wise scaling factors optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty in per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Codes are available at https://github.com/zjq0455/PTQ1.61.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
cs.LG 2026-06 unverdicted novelty 6.0

LiftQuant uses dimensional lifting of weights to a higher-dimensional 1-bit lattice followed by projection to achieve tunable continuous bit-widths in LLM quantization while remaining hardware-friendly.
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
cs.LG 2026-06 unverdicted novelty 6.0

LiftQuant enables continuous bit-width LLM quantization via dimensional lifting and projection from a 1-bit lattice, allowing 2.4-bit compression of 70B models that outperforms fixed 2-bit baselines on identical hardware.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
cs.LG 2026-04 unverdicted novelty 6.0

LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existin...