GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Alireza Dadgarnia; Dan Alistarh; Eldar Kurtic; Mahdi Nikdan; Maximilian Kleinegger; Michael Helcig; Soroush Tabesh

arxiv: 2604.18556 · v2 · pith:B3J5DKC6new · submitted 2026-04-20 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-19 17:56 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords quantization · large language models · low-precision inference · scalar quantization · Gumbel-Softmax · post-training optimization

The pith

Gumbel-Softmax relaxation of discrete grid choices lets scalar quantization recover most accuracy of vector methods at 2-3 bits while staying kernel-compatible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the accuracy gap between basic scalar quantization and advanced vector or trellis methods is unavoidable or stems from suboptimal grid selection. It answers by relaxing the assignment of each weight to a quantization level into a continuous problem that can be optimized end-to-end with Gumbel-Softmax sampling. The relaxation size is deliberately kept small to match the few levels available at low bit widths, allowing joint learning of assignments and per-group scales. If the approach works, simple scalar formats can deliver accuracy close to far more complex methods without forcing changes to inference engines or hardware support. The technique is demonstrated on standard language models and shown to scale to trillion-parameter-scale mixture-of-experts architectures.

Core claim

GSQ is a post-training scalar quantization procedure that treats the choice of quantization level for each coordinate as a discrete assignment problem and relaxes it via Gumbel-Softmax sampling. By setting the cardinality of the relaxed distribution equal to the small number of levels at the target bit width, the method makes gradient-based optimization of both assignments and group-wise scales tractable. The resulting symmetric scalar grids achieve accuracy that closes most of the gap to vector-quantized baselines while remaining directly usable by existing scalar inference kernels.
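The relaxation described above can be illustrated with a forward-pass-only sketch. This is not the authors' implementation: the function names, the ternary grid, the logits, and the scale value are all hypothetical, and a real optimizer would backpropagate through the soft weight using an autodiff framework.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample over the grid levels (standard Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_weight(logits, levels, scale, tau, rng):
    """Relaxed weight: group scale times the probability-weighted average of grid levels."""
    probs = gumbel_softmax(logits, tau, rng)
    return scale * sum(p * q for p, q in zip(probs, levels))

def hard_weight(logits, levels, scale):
    """Hardened weight: group scale times the level with the largest logit."""
    k = max(range(len(logits)), key=lambda i: logits[i])
    return scale * levels[k]

# Hypothetical ternary example: one weight coordinate, one group scale.
LEVELS = [-1.0, 0.0, 1.0]
rng = random.Random(0)
logits = [0.2, 1.5, -0.3]   # assumed trainable logits, one per grid level
scale = 0.05                # assumed per-group scale
w_soft = soft_weight(logits, LEVELS, scale, tau=1.0, rng=rng)
w_hard = hard_weight(logits, LEVELS, scale)
```

Because there is only one logit per grid level, the per-coordinate parameter count equals the number of levels, which is what keeps the relaxation small at low bit widths.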

What carries the argument

Gumbel-Softmax relaxation of the discrete grid-assignment problem, sized to the number of quantization levels so that joint optimization of assignments and per-group scales becomes feasible.

If this is right

Scalar quantization pipelines already in use can be upgraded to higher accuracy at 2-3 bits without altering deployment formats or kernels.
The same assignment optimization can be applied to refine existing quantized checkpoints and then written back into the original format.
The approach remains practical for trillion-parameter-scale Mixture-of-Experts models, where vector methods become difficult to apply.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same relaxation technique could be tested on other discrete decisions inside model compression or training pipelines.
Wider adoption might reduce pressure to develop specialized hardware or kernels for advanced quantization formats.

Load-bearing premise

The Gumbel-Softmax relaxation converges to discrete assignments that preserve model accuracy without introducing bias or instability from the continuous approximation.
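A small sketch of why annealing matters for this premise: as the temperature drops, a softmax over the level logits concentrates on a single level, so the soft assignment approaches a hard one. The linear schedule, its endpoints, and the logits below are illustrative assumptions, and the Gumbel noise is omitted to keep the example deterministic.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax over the candidate grid levels."""
    scaled = [v / tau for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_tau(step, total_steps, tau_start=2.0, tau_end=0.05):
    """Linearly interpolate the temperature from tau_start to tau_end (assumed schedule)."""
    frac = step / max(total_steps - 1, 1)
    return tau_start + frac * (tau_end - tau_start)

logits = [0.1, 0.6, 0.4, -0.2]   # hypothetical logits for four levels, as at 2 bits
peaks = []
for step in range(5):
    tau = linear_tau(step, 5)
    # The largest probability measures how close the soft assignment is to one-hot.
    peaks.append(max(softmax(logits, tau)))
```

The sketch only shows that the distribution sharpens. Whether the level it sharpens onto is the one that preserves model accuracy is exactly the premise the experiments have to support.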

What would settle it

Apply the method to produce a 2-bit or 3-bit quantized model, run it on held-out language-modeling and downstream benchmarks, and check whether accuracy remains within a few percent of the best vector-quantized result at the same bit width.

Figures

Figures reproduced from arXiv: 2604.18556 by Alireza Dadgarnia, Dan Alistarh, Eldar Kurtic, Mahdi Nikdan, Maximilian Kleinegger, Michael Helcig, Soroush Tabesh.

**Figure 1.** Local-shift parameterization at higher bit-widths. Each row shows, for a single weight coordinate, the logit distribution over candidate grid points before and after training. The red bar and dot mark the GPTQ-initialized grid point used to warm-start the logits; the green bar and dot mark the grid point selected by GSQ after training. Top (naive): placing one trainable logit on every grid point costs 2 b …

Original abstract

Quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier but are notoriously hard to implement and to scale. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized $\textit{scalar}$ quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus remains compatible with existing scalar inference kernels. We further show that the same discrete-assignment optimization can be applied to practical GGUF K-Quant checkpoints: starting from publicly released GGUF models, GSQ improves accuracy while projecting the result back into the same deployment format. Finally, GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply. The source code is publicly available at https://github.com/IST-DASLab/GSQ.
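For orientation, the kind of deployment format the abstract refers to (a symmetric scalar grid with one scale per group) can be sketched with a plain round-to-nearest baseline. This is the starting point that GSQ is said to improve on, not GSQ itself; the absmax scale rule, the toy weights, and the group size of 4 are assumptions chosen for readability.

```python
import math

def quantize_groupwise(weights, levels, group_size):
    """Round-to-nearest onto a symmetric grid, with one absmax-derived scale per group."""
    max_level = max(abs(q) for q in levels)
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        # Assumed scale rule: map the largest magnitude in the group onto the outermost level.
        scale = absmax / max_level if absmax > 0 else 1.0
        for w in group:
            q = min(levels, key=lambda lv: abs(w - scale * lv))
            out.append(scale * q)
    return out

TERNARY = [-1.0, 0.0, 1.0]
weights = [0.9, -0.1, 0.4, -1.0, 0.015, 0.03, -0.04, 0.01]   # toy values, two groups of 4
dequant = quantize_groupwise(weights, TERNARY, group_size=4)
# A ternary grid carries log2(3) bits of information per weight, before scale overhead.
bits_ternary = math.log2(len(TERNARY))
```

The second group has much smaller magnitudes than the first and still receives usable levels because it has its own scale. Round-to-nearest fixes each assignment independently of the model's output, which is the degree of freedom a learned assignment can exploit.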

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GSQ shows that Gumbel-Softmax can optimize scalar grids well enough to close most of the accuracy gap to vector methods like QTIP at 2-3 bits while staying kernel-compatible.

GSQ shows that a scalar quantizer can recover most of the accuracy gap to vector methods like QTIP at 2-3 bits on Llama-3.1 models by using Gumbel-Softmax to jointly learn per-coordinate grid assignments and per-group scales. The key move is matching the relaxation cardinality exactly to the small number of levels (3-8) so the optimization stays tractable, then hardening to a symmetric scalar grid that works with existing inference kernels. They also demonstrate it can refine released GGUF checkpoints and scale to large MoE models where vector approaches get impractical. The public code is a clear plus for anyone who wants to test the claims directly.

The approach is new in its specific tailoring of the relaxation to this per-coordinate, low-cardinality setting rather than a generic discrete optimizer. On the soft side, the abstract gives no numbers, baselines, or ablations, so the size of the actual gains and their consistency across tasks are hard to judge from the summary alone. The stress-test point about possible residual bias or incomplete hardening in the Gumbel-Softmax step is worth checking in the full experiments; if the straight-through estimator or annealing leaves systematic sub-optimal assignments, the reported improvements could shrink. Nothing in the description looks circular or self-referential, which keeps the central claim clean.

This paper is for people who need better low-bit scalar quantization for deployment on memory-limited hardware without custom kernels. A practitioner or researcher working on post-training compression would get concrete value from the method and the scaling results. It has enough technical novelty and practical scope to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GSQ, a post-training scalar quantization method for LLMs that employs a Gumbel-Softmax relaxation to jointly optimize per-coordinate grid assignments and per-group scales for low bit-widths (2-3 bits). It claims that this approach closes most of the accuracy gap between conventional scalar quantizers (e.g., GPTQ, AWQ) and vector/trellis methods such as QTIP on Llama-3.1-8B/70B models, improves existing GGUF K-Quant checkpoints while projecting back to the same format, and scales to large MoE models like Kimi-K2.5, all while preserving compatibility with standard scalar inference kernels. The source code is released publicly.

Significance. If the reported empirical gains are robust, GSQ would be a meaningful contribution to practical LLM quantization by demonstrating that carefully optimized scalar methods can approach the accuracy of more complex vector approaches without requiring new inference kernels or sacrificing deployability. The public code release supports reproducibility and is a clear strength. This could meaningfully impact local inference pipelines that rely on scalar quantization formats.

major comments (2)

[§3.2] Gumbel-Softmax relaxation and annealing: The central claim that the continuous relaxation yields high-quality discrete grid assignments after hardening rests on the assumption that temperature annealing and the straight-through estimator produce solutions free of systematic bias or instability at 2-3 bits (3-8 levels). No direct comparison to exhaustive discrete search or standard rounding on identical groups is described, which is load-bearing for asserting that GSQ recovers most of the QTIP gap rather than benefiting from relaxation artifacts.
[Experimental results, Tables 1-2 and associated figures] The reported perplexity and downstream task gains on Llama-3.1-8B/70B at 2 and 3 bits lack error bars, multiple random seeds, or ablations isolating the contribution of the Gumbel-Softmax schedule versus the symmetric grid choice; without these, the magnitude of the claimed gap closure relative to GPTQ/AWQ baselines cannot be fully assessed.

minor comments (2)

[§5] The description of how GSQ is applied to pre-existing GGUF K-Quant checkpoints and projected back into the original format would benefit from an explicit algorithmic outline or pseudocode to clarify the exact steps.
[§3] Notation for the symmetric scalar grid and group-wise scale factors could be formalized with an equation early in the method section for clarity.
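One possible way to write the notation requested in the second minor comment, offered as a sketch and not as the paper's own formulation. The symbols are assumed names: $q_k$ for the $K$ grid levels, $s_{g(i)}$ for the scale of the group containing coordinate $i$, $a_i$ for the discrete assignment, $\ell_{i,k}$ for trainable logits, $\gamma_{i,k}$ for Gumbel noise, and $\tau$ for the temperature. The softmax form is the standard Gumbel-Softmax relaxation.

```latex
\begin{align}
  \hat{w}_i   &= s_{g(i)}\, q_{a_i}, \qquad a_i \in \{1, \dots, K\} \\
  \tilde{w}_i &= s_{g(i)} \sum_{k=1}^{K} \pi_{i,k}\, q_k \\
  \pi_{i,k}   &= \frac{\exp\big((\ell_{i,k} + \gamma_{i,k})/\tau\big)}
                      {\sum_{j=1}^{K} \exp\big((\ell_{i,j} + \gamma_{i,j})/\tau\big)},
  \qquad \gamma_{i,k} \sim \mathrm{Gumbel}(0, 1)
\end{align}
```

Under this notation, $\tilde{w}_i$ approaches $\hat{w}_i$ with $a_i = \arg\max_k (\ell_{i,k} + \gamma_{i,k})$ as $\tau \to 0$.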

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments identify important areas for strengthening the empirical validation of the Gumbel-Softmax approach. We address each major comment below and will revise the manuscript to incorporate additional comparisons, ablations, and statistical reporting. We believe these changes will make the contribution clearer while preserving the core claims supported by the existing results.

Referee: [§3.2] Gumbel-Softmax relaxation and annealing: The central claim that the continuous relaxation yields high-quality discrete grid assignments after hardening rests on the assumption that temperature annealing and the straight-through estimator produce solutions free of systematic bias or instability at 2-3 bits (3-8 levels). No direct comparison to exhaustive discrete search or standard rounding on identical groups is described, which is load-bearing for asserting that GSQ recovers most of the QTIP gap rather than benefiting from relaxation artifacts.

Authors: We agree that direct validation against simpler baselines on identical groups would strengthen the argument. Exhaustive enumeration of all grid assignments is intractable even for small groups (combinatorial explosion with 4–8 levels), which is why we adopted the relaxation. In the revision we will add a controlled comparison on the same per-group weight tensors: (i) standard symmetric rounding, (ii) k-means clustering into the target number of levels, and (iii) the GSQ-optimized assignments after hardening. We will report both per-group quantization MSE and the resulting model perplexity. We will also include a short analysis of the annealing schedule (temperature decay and straight-through estimator) with plots showing convergence of the relaxed and hardened objectives, addressing potential bias concerns.

Revision: yes

Referee: [Experimental results, Tables 1-2 and associated figures] The reported perplexity and downstream task gains on Llama-3.1-8B/70B at 2 and 3 bits lack error bars, multiple random seeds, or ablations isolating the contribution of the Gumbel-Softmax schedule versus the symmetric grid choice; without these, the magnitude of the claimed gap closure relative to GPTQ/AWQ baselines cannot be fully assessed.

Authors: We acknowledge the absence of variance estimates and targeted ablations. Because the Gumbel-Softmax procedure is stochastic, we will re-run the 2-bit and 3-bit experiments on Llama-3.1-8B and 70B with at least three independent random seeds, reporting mean and standard deviation for both perplexity and downstream task scores. We will also add an explicit ablation that fixes the grid to a symmetric uniform spacing and compares (a) standard rounding versus (b) Gumbel-Softmax optimization of the per-coordinate assignments while keeping the same group-wise scales. These results will be inserted into the revised Tables 1–2 and a new ablation table, allowing readers to isolate the contribution of the learned discrete assignments.

Revision: yes
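The first two arms of the controlled comparison proposed in the rebuttal, (i) symmetric rounding and (ii) k-means into the same number of levels, can be sketched on a single synthetic group as follows. The third arm (GSQ-optimized assignments) is not reproduced here. The Gaussian weights, the group size of 128, the uniform grid spanning the group's absmax, and all function names are assumptions for illustration.

```python
import random

def mse(xs, ys):
    """Mean squared error between original and dequantized weights."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def nearest(value, centers):
    """Return the center closest to value."""
    return min(centers, key=lambda c: abs(value - c))

def uniform_symmetric_centers(group, num_levels):
    """Evenly spaced levels from -absmax to +absmax (assumed symmetric uniform grid)."""
    absmax = max(abs(w) for w in group)
    step = 2.0 * absmax / (num_levels - 1)
    return [-absmax + k * step for k in range(num_levels)]

def kmeans_1d(group, init_centers, iters=25):
    """Lloyd's algorithm in one dimension, warm-started from the given centers."""
    centers = list(init_centers)
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for w in group:
            idx = min(range(len(centers)), key=lambda k: abs(w - centers[k]))
            buckets[idx].append(w)
        # Move each center to the mean of its bucket; keep it in place if the bucket is empty.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

rng = random.Random(0)
group = [rng.gauss(0.0, 1.0) for _ in range(128)]   # synthetic stand-in for one weight group
grid = uniform_symmetric_centers(group, num_levels=4)
mse_round = mse(group, [nearest(w, grid) for w in group])
centers = kmeans_1d(group, grid)
mse_kmeans = mse(group, [nearest(w, centers) for w in group])
```

Two caveats apply to any comparison of this kind. The k-means centers are generally neither symmetric nor uniformly spaced, so that arm gives up the kernel compatibility the paper emphasizes. Per-group weight MSE is also only a proxy, which is presumably why the authors commit to reporting model perplexity alongside it.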

Circularity Check

0 steps flagged

GSQ introduces independent Gumbel-Softmax optimization with no reduction to prior fits or self-citations

The paper proposes GSQ as a new post-training method that applies Gumbel-Softmax relaxation to jointly learn discrete grid assignments and group-wise scales for scalar quantization. This is an algorithmic optimization procedure based on standard relaxation techniques and temperature annealing, not a derivation that reduces predictions or results to fitted inputs by construction. No self-definitional equations, fitted-input predictions, load-bearing self-citations, or uniqueness theorems imported from prior author work appear in the abstract or described approach. The central claim of closing most of the scalar-to-QTIP gap rests on empirical outcomes from this independent optimization, which remains self-contained and externally falsifiable on held-out model accuracy.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The approach relies on the standard Gumbel-Softmax relaxation from prior literature and common assumptions in post-training quantization such as group-wise scaling.

pith-pipeline@v0.9.0 · 5945 in / 1268 out tokens · 57350 ms · 2026-05-19T17:56:27.981350+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: GSQ ... jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3–8 levels for ternary and 3 bpp, respectively)

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: the Gumbel-Softmax relaxation ... as the temperature is annealed, the soft assignments collapse onto hard grid points

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attention Sinks and Outliers in Attention Residuals
cs.LG 2026-05 unverdicted novelty 4.0

OASIS mitigates attention sinks and outliers in AttnResidual models via Softmax1 null space and inter-layer signals, reporting norm and kurtosis reductions plus large gains in quantized perplexity and task accuracy.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1]

Db-llm: Accurate dual-binarization for efficient llms

Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, et al. Db-llm: Accurate dual-binarization for efficient llms. arXiv preprint arXiv:2402.11960, 2024a. Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware t...

work page arXiv
[2]

Symbolic discovery of optimization algorithms

URL https://arxiv.org/abs/2302.06675. Yuanteng Chen, Yuantian Shao, Peisong Wang, and Jian Cheng. Eac-moe: Expert-selection aware compressor for mixture-of-experts large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12942–12963, 2025b. Krishna Teja Chitty-Venkata, ...

work page arXiv
[3]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Differentiable model compression via pseudo quantization noise.arXiv preprint arXiv:2104.09987,

Alexandre Défossez, Yossi Adi, and Gabriel Synnaeve. Differentiable model compression via pseudo quantization noise.arXiv preprint arXiv:2104.09987,

work page arXiv
[5]

Dettmers, M

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization.arXiv preprint arXiv:2110.02861,

work page arXiv
[6]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale.arXiv preprint arXiv:2208.07339,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Spqr: A sparse-quantized representation for near-lossless llm weight compression,

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078,

work page arXiv
[8]

Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qi fei Liu, Yi-Ting Guo, and Xiaowen Chu

Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. Stbllm: Breaking the 1-bit barrier with structured binary llms. arXiv preprint arXiv:2408.01803,

work page arXiv
[9]

Elias Frantar and Dan Alistarh

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118,

work page arXiv
[10]

Learned step size quantization

Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153,

work page arXiv 1902
[11]

Router choice matters: Rank-aware post-training quantization for moe models

Yi-Zeng Fang and Juinn-Dar Huang. Router choice matters: Rank-aware post-training quantization for moe models. Elias Frantar and Dan Alistarh. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795,

work page arXiv
[12]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization.arXiv preprint arXiv:2506.13329,

Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, and Yunhe Wang. Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization.arXiv preprint arXiv:2506.13329,

work page arXiv
[14]

Jamie Hayes, Ilia Shumailov, and Itay Yona

URL https://zenodo.org/records/10256836. Georgi Gerganov and contributors. llama.cpp: Inference of LLaMA models in pure C/C++. https://github.com/ggerganov/llama.cpp,

work page arXiv
[15]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

OpenThoughts: Data Recipes for Reasoning Models

URL https://arxiv.org/abs/2506.04178. Yufei Guo, Zecheng Hao, Jiahang Shao, Jie Zhou, Xiaode Liu, Xin Tong, Yuhan Zhang, Yuanpei Chen, Weihang Peng, and Zhe Ma. Pt-bitnet: Scaling up the 1-bit large language model with post-training quantization. Neural Networks, page 107855,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance.arXiv preprint arXiv:2505.03804, 2025

Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu, Chen Xu, Zhihang Yuan, Sifan Zhou, and Jiangyong Yu. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance.arXiv preprint arXiv:2505.03804,

work page arXiv
[18]

Tequila: Trapping-free ternary quantization for large language models

Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, and Dapeng Wu. Tequila: Trapping-free ternary quantization for large language models. arXiv preprint arXiv:2509.23809,

work page arXiv
[19]

Billm: Pushing the limit of post-training quantization for llms

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291,

work page arXiv
[20]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

URL https://github.com/inclusionAI/humming. Open-source library for vLLM-integrated weight-only quantization kernels supporting integer bitwidths 4–8. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large langua...

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Arb-llm: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, Xiaokang Yang, et al. Arb-llm: Alternating refined binarizations for large language models. arXiv preprint arXiv:2410.03129,

work page arXiv
[23]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu. Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

HIGGS: Pushing the limits of large language model quantization via the linearity theorem

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. HIGGS: Pushing the limits of large language model quantization via the linearity theorem. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics, volume 1, pages 10857–10886. Association for Computat...

work page 2025
[26]

Pb-llm: Partially binarized large language models

Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong. Pb-llm: Partially binarized large language models.arXiv preprint arXiv:2310.00034,

work page arXiv
[27]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization.arXiv preprint arXiv:2410.09426,

work page arXiv
[28]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024a. Albert Tseng, Qingyao Sun, David Hou, and Christopher M De Sa. Qtip: Quantization with trellises and incoherence processing. Advances in Neural Informa...

work page arXiv
[31]

Vodrahalli, S

Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, et al. Michelangelo: Long context evaluations beyond haystacks via latent structure queries. arXiv preprint arXiv:2409.12640,

work page arXiv
[32]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Ptqtp: Post-training quantization to trit-planes for large language models.arXiv preprint arXiv:2509.16989,

He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, and Ngai Wong. Ptqtp: Post-training quantization to trit-planes for large language models. arXiv preprint arXiv:2509.16989,

work page arXiv
[34]

Pt2-llm: Post-training ternarization for large language models.arXiv preprint arXiv:2510.03267,

Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, and Yulun Zhang. Pt2-llm: Post-training ternarization for large language models. arXiv preprint arXiv:2510.03267,

work page arXiv
[35]

American invitational mathematics examination (aime) 2025,

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025,

work page 2025
[36]

Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, and Min Zhang. Ptq1. 61: Push the real limit of extremely low-bit post-training quantization methods for large language models.arXiv preprint arXiv:2502.13179,

work page arXiv
[37]

Moqa: Rethinking moe quantization with multi-stage data-model distribution awareness.arXiv preprint arXiv:2503.21135,

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, and Xiang Chen. Moqa: Rethinking moe quantization with multi-stage data-model distribution awareness.arXiv preprint arXiv:2503.21135,

work page arXiv
[38]

Table 5: Training hyperparameters for the Llama experiments.

| Bit-width | Logits lr | Group scales lr | Weight decay | Betas | Epochs | # Seqs. | Seq. len. | Batch size | Group size | τ schedule | κ schedule | α | std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.58-bit | 1e-4 | 5e-5 | 1.0 | (0.9, 0.95) | 20 | 4096 | 4096 | 64 | 128 | linear: 2→0.05 | linear: 100→500 | 3 | 0.01 |
| 2/3-bit | 1e-4 | 5e-5 | 1.0 | (0.9, 0.95) | 20 | 4096 | 4096 | 64 | 128 | linear: 2→0.05 | linear: 100→500 | 6 | 0.01 |

Ta...


[11] [11]

Router choice matters: Rank-aware post-training quantization for moe models

Yi-Zeng Fang and Juinn-Dar Huang. Router choice matters: Rank-aware post-training quantization for moe models. Elias Frantar and Dan Alistarh. Qmoe: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795,

work page arXiv

[12] [12]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization.arXiv preprint arXiv:2506.13329,

Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, and Yunhe Wang. Eaquant: Enhancing post-training quantization for moe models via expert-aware optimization.arXiv preprint arXiv:2506.13329,

work page arXiv

[14] [14]

Jamie Hayes, Ilia Shumailov, and Itay Yona

URLhttps://zenodo.org/records/10256836. Georgi Gerganov and contributors. llama.cpp: Inference of LLaMA models in pure C/C++. https: //github.com/ggerganov/llama.cpp,

work page arXiv

[15] [15]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

OpenThoughts: Data Recipes for Reasoning Models

URLhttps://arxiv.org/abs/2506.04178. Yufei Guo, Zecheng Hao, Jiahang Shao, Jie Zhou, Xiaode Liu, Xin Tong, Yuhan Zhang, Yuanpei Chen, Weihang Peng, and Zhe Ma. Pt-bitnet: Scaling up the 1-bit large language model with post-training quantization.Neural Networks, page 107855,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance.arXiv preprint arXiv:2505.03804, 2025

Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu, Chen Xu, Zhihang Yuan, Sifan Zhou, and Jiangyong Yu. Moequant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance.arXiv preprint arXiv:2505.03804,

work page arXiv

[18] [18]

Tequila: Trapping-free ternary quantization for large language models

15 Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, and Dapeng Wu. Tequila: Trapping-free ternary quantization for large language models. arXiv preprint arXiv:2509.23809,

work page arXiv

[19] [19]

Billm: Pushing the limit of post-training quantization for llms

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291,

work page arXiv

[20] [20]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

URL https://github.com/inclusionAI/humming. Open-source library for vLLM-integrated weight-only quantization kernels supporting integer bitwidths 4–8. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large langua...

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Arb-llm: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, Xiaokang Yang, et al. Arb-llm: Alternating refined binarizations for large language models. arXiv preprint arXiv:2410.03129,

work page arXiv

[23] [23]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krish- namoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations.arXiv preprint arXiv:2405.16406,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

URL https://huggingface.co/datasets/ HuggingFaceFW/fineweb-edu. Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables.arXiv preprint arXiv:1611.00712,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

HIGGS: Pushing the limits of large language model quantization via the linearity theorem

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. HIGGS: Pushing the limits of large language model quantization via the linearity theorem. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Compu- tational Linguistics, volume 1, pages 10857–10886. Association for Computat...

work page 2025

[26] [26]

Pb-llm: Partially binarized large language models

Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong. Pb-llm: Partially binarized large language models.arXiv preprint arXiv:2310.00034,

work page arXiv

[27] [27]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization.arXiv preprint arXiv:2410.09426,

work page arXiv

[28] [28]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han

Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks.arXiv preprint arXiv:2402.04396, 2024a. Albert Tseng, Qingyao Sun, David Hou, and Christopher M De Sa. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Informa...

work page arXiv

[31] [31]

Vodrahalli, S

Kiran V odrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, et al. Michelangelo: Long context evaluations beyond haystacks via latent structure queries.arXiv preprint arXiv:2409.12640,

work page arXiv

[32] [32]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Ptqtp: Post-training quantization to trit-planes for large language models.arXiv preprint arXiv:2509.16989,

He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, and Ngai Wong. Ptqtp: Post-training quantization to trit-planes for large language models. arXiv preprint arXiv:2509.16989,

work page arXiv

[34] [34]

Pt2-llm: Post-training ternarization for large language models.arXiv preprint arXiv:2510.03267,

17 Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, and Yulun Zhang. Pt2-llm: Post-training ternarization for large language models. arXiv preprint arXiv:2510.03267,

work page arXiv

[35] [35]

American invitational mathematics examination (aime) 2025,

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025,

work page 2025

[36] [36]

Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, and Min Zhang. Ptq1. 61: Push the real limit of extremely low-bit post-training quantization methods for large language models.arXiv preprint arXiv:2502.13179,

work page arXiv

[37] [37]

Moqa: Rethinking moe quantization with multi-stage data-model distribution awareness.arXiv preprint arXiv:2503.21135,

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, and Xiang Chen. Moqa: Rethinking moe quantization with multi-stage data-model distribution awareness.arXiv preprint arXiv:2503.21135,

work page arXiv

[38] [38]

Bit-widthLogits lrGroup scales lrWeight decayBetas Epochs# Seqs.Seq

18 Table 5: Training hyperparameters for the Llama experiments. Bit-widthLogits lrGroup scales lrWeight decayBetas Epochs# Seqs.Seq. len.Batch sizeGroup sizeτschedule κschedule α std 1.58-bit1e-4 5e-5 1.0 (0.9,0.95)20 4096 4096 64 128 linear:2→0.05linear:100→5003 0.01 2/3-bit 1e-4 5e-5 1.0 (0.9,0.95)20 4096 4096 64 128 linear:2→0.05linear:100→5006 0.01 Ta...

work page 2024