Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
1 Pith paper cites this work.
abstract
A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, which quantize values in small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the "better" among two or more 4-bit grids, marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid, and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation quantization. Source code is available at https://github.com/IST-DASLab/GridGames.
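The per-group grid selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `quantize_group` and `quantize_multi_grid` are hypothetical, the second grid (`ALT_GRID`, a symmetric INT4-style grid) is only a stand-in for the learned or shifted grids the abstract mentions, and the shared scale is kept as a plain float rather than the low-precision scale that real microscaled formats use.

```python
# Signed FP4 (E2M1) element values: magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = [-m for m in reversed(E2M1[1:])] + E2M1

# Assumption: an illustrative alternative grid (uniform, INT4-style).
# The paper's grids (PO2(NF4), MPO2, Split87, SFP4) are not reproduced here.
ALT_GRID = [float(v) for v in range(-7, 8)]


def quantize_group(group, grid):
    """Round one group to `grid` with a shared absmax scale.

    Returns (dequantized values, squared reconstruction error).
    """
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return [0.0] * len(group), 0.0
    scale = amax / max(abs(g) for g in grid)
    deq = [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in group]
    err = sum((v - d) ** 2 for v, d in zip(group, deq))
    return deq, err


def quantize_multi_grid(values, grids, group_size=16):
    """For each group, keep the grid with the lowest squared error.

    `choices[i]` is the index of the grid picked for group i; in a real
    format this selector would be stored in spare bits of the scale.
    """
    out, choices = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        results = [quantize_group(group, g) for g in grids]
        best = min(range(len(grids)), key=lambda i: results[i][1])
        out.extend(results[best][0])
        choices.append(best)
    return out, choices
```

Because the selection is made independently per group, the total squared error with several grids can never exceed the error of any single grid in the set under this scheme; how much it improves depends on the grids and the group size.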
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models
SOP post-training quantization for LLMs reports lower weight reconstruction error than per-layer FP8 at a 1.5 bpw lower cost, using per-layer codebook search and hardware-aware formats.