GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
Pith reviewed 2026-05-10 05:26 UTC · model grok-4.3
The pith
GSQ uses Gumbel-Softmax to jointly optimize scalar grid assignments and scales, closing most of the accuracy gap between scalar and vector quantization for LLMs at 2-3 bits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSQ is a post-training scalar quantization technique that applies a Gumbel-Softmax relaxation to jointly learn per-coordinate grid assignments and per-group scales. By matching the cardinality of the relaxation to the small number of levels available in the 2-3 bit regime, it achieves accuracy close to the QTIP frontier on standard Llama-3.1-8B/70B-Instruct models. Because it uses a symmetric scalar grid with group-wise quantization, it runs on existing scalar inference kernels. The same method scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5.
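The scalar format the claim rests on, a symmetric grid with per-group scales, can be sketched as follows. This is a minimal illustration: the absmax scaling and round-to-nearest assignment here are stand-ins for GSQ's learned scales and assignments, not the paper's method.

```python
import numpy as np

def quantize_groupwise(w, bits=3, group_size=128):
    """Symmetric scalar quantization with per-group scales (sketch).

    The absmax scale and round-to-nearest assignment below are
    illustrative placeholders, not GSQ's learned parameters.
    """
    max_level = 2 ** (bits - 1) - 1            # 3-bit symmetric grid: {-3, ..., 3}
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / max_level
    codes = np.clip(np.round(groups / scale), -max_level, max_level)
    return (codes * scale).reshape(w.shape)    # dequantized weights

w = np.random.default_rng(0).standard_normal(4 * 256)
w_hat = quantize_groupwise(w, bits=3, group_size=128)
```

Each group of 128 weights shares one scale, so the stored representation is one integer code per weight plus one scale per group, which is exactly the layout existing scalar kernels consume.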
What carries the argument
Gumbel-Softmax relaxation of the discrete grid assignments, with its number of samples set equal to the small number of quantization levels (3-8 for 2-3 bits).
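A minimal sketch of this relaxation, under assumed shapes: per-coordinate logits over the K grid levels are the learnable parameters. The parameterization here is illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau):
    """One Gumbel-Softmax sample: a soft one-hot over K grid levels."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())                  # numerically stable softmax
    return y / y.sum()

# 3-bit symmetric grid: 7 levels, matching the relaxation's cardinality
grid = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
logits = np.zeros(7)                         # learnable per-coordinate scores
probs = gumbel_softmax(logits, tau=0.5)
w_relaxed = probs @ grid                     # differentiable quantized value
```

Because `w_relaxed` is a smooth function of the logits (and of any scale multiplying `grid`), assignments and scales can be optimized jointly by gradient descent, which is the mechanism the claim leans on.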
If this is right
- GSQ closes most of the accuracy gap to QTIP at 2 and 3 bits on Llama-3.1-8B and 70B Instruct models.
- The quantized models use only symmetric scalar grids and group-wise quantization, so they run directly on existing scalar inference kernels.
- The method scales to trillion-parameter Mixture-of-Experts models where vector-quantized approaches are difficult to apply.
- Careful optimization of scalar quantizers can recover most performance that previously required vector or trellis methods.
Where Pith is reading between the lines
- The long-standing gap between scalar and vector quantization may have been driven more by optimization difficulty than by any hard limit of the scalar format.
- Widespread use could let high-accuracy low-bit LLMs run on a broader range of standard hardware without custom vector kernels.
- The same style of relaxation could be tested on other discrete choices in model compression such as pruning masks or per-layer bit allocation.
Load-bearing premise
The Gumbel-Softmax relaxation stays close enough to the true discrete grid assignment problem and the joint optimization remains tractable even when the number of levels is very small.
What would settle it
Applying GSQ to Llama-3.1-8B at 2 bits and finding that its perplexity or task accuracy stays at the level of GPTQ or AWQ instead of approaching QTIP would show the gap is not closed.
Original abstract
Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GSQ, a post-training scalar quantization technique for LLMs that employs a Gumbel-Softmax relaxation to jointly optimize per-coordinate grid assignments and per-group scales. It claims that, by matching the relaxation cardinality to the small number of levels at 2-3 bpp (3-8 levels), GSQ recovers most of the accuracy gap between standard scalar methods (GPTQ/AWQ) and vector/trellis methods (QTIP) on Llama-3.1-8B/70B-Instruct while remaining fully compatible with existing scalar inference kernels; it further reports scaling to trillion-parameter MoE models such as Kimi-K2.5.
Significance. If the empirical claims hold with the reported margins, the result would be significant: it would indicate that the scalar-to-vector quantization gap is largely an optimization artifact rather than a fundamental limit, enabling high-accuracy low-bit deployment without the implementation and scaling difficulties of vector methods. The explicit compatibility with scalar kernels and the MoE scaling demonstration are practical strengths.
major comments (2)
- [Abstract] The central claim that GSQ 'closes most of the gap' to the QTIP frontier on Llama-3.1-8B/70B at 2 and 3 bits is stated without quantitative numbers, baselines, error bars, or ablation details, preventing verification of the performance assertion from the provided text.
- [Method] Gumbel-Softmax relaxation: the paper asserts that matching the categorical cardinality to 3-8 levels makes the relaxation tight and the joint optimization of assignments and scales tractable, yet it supplies no analysis of the final rounding gap, the temperature-annealing schedule, or straight-through estimator bias; if this discrepancy is non-negligible, the reported accuracy gains would not be reproducible from the learned soft parameters.
minor comments (1)
- [Abstract] The experimental protocol (datasets, calibration data, group size, exact bit-width configurations, and comparison models) is not summarized; a methods paper should sketch this even in the abstract.
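The rounding-gap concern raised in the method comment can be probed empirically. This toy sketch (a hypothetical probe, not the paper's analysis) measures the mean distance between the relaxed value used in training and the hard argmax value kept after discretization, on a ternary grid at two temperatures:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_hard_gap(logits, tau, n_samples=2000):
    """Mean |relaxed value - hard-rounded value| over Gumbel draws."""
    grid = np.array([-1.0, 0.0, 1.0])        # ternary grid (3 levels)
    total = 0.0
    for _ in range(n_samples):
        g = -np.log(-np.log(rng.uniform(size=logits.shape)))
        y = (logits + g) / tau
        p = np.exp(y - y.max())
        p /= p.sum()
        total += abs(p @ grid - grid[np.argmax(p)])
    return total / n_samples

logits = np.array([2.0, 0.0, -1.0])          # hypothetical learned scores
gap_hot = soft_hard_gap(logits, tau=2.0)     # loose relaxation early in annealing
gap_cold = soft_hard_gap(logits, tau=0.05)   # near-discrete at the end
```

If the analogous gap on real learned logits does not shrink under the annealing schedule, the accuracy of the relaxed model would not transfer to the rounded one, which is exactly the reproducibility risk the referee flags.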
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments. We address each major comment below and outline the revisions we plan to make to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The central claim that GSQ 'closes most of the gap' to the QTIP frontier on Llama-3.1-8B/70B at 2 and 3 bits is stated without quantitative numbers, baselines, error bars, or ablation details, preventing verification of the performance assertion from the provided text.
Authors: We agree that the abstract would benefit from key quantitative results that make the claims verifiable. In the revised manuscript, we will update the abstract to include specific performance numbers, such as perplexity or accuracy improvements on Llama-3.1 models at 2 and 3 bits relative to GPTQ, AWQ, and QTIP, along with a mention of the evaluation setup. This gives readers immediate evidence of the claimed gap closure without requiring them to read the full paper. Revision: yes.
Referee: [Method] Gumbel-Softmax relaxation: the paper asserts that matching the categorical cardinality to 3-8 levels makes the relaxation tight and the joint optimization of assignments and scales tractable, yet it supplies no analysis of the final rounding gap, the temperature-annealing schedule, or straight-through estimator bias; if this discrepancy is non-negligible, the reported accuracy gains would not be reproducible from the learned soft parameters.
Authors: The referee raises a valid point regarding the lack of detailed analysis of the relaxation's tightness. While the manuscript explains that matching the cardinality to the small number of levels (3-8) makes the Gumbel-Softmax approximation effective, we acknowledge that additional analysis would strengthen the method section. We will add a subsection discussing the temperature-annealing schedule used, empirical observations on the rounding gap after discretization, and the use of the straight-through estimator, including any bias-mitigation strategies. If space permits, we will include a small ablation or a theoretical bound on the approximation error. Revision: yes.
Circularity Check
No significant circularity; GSQ introduces an independent optimization procedure.
Full rationale
The paper defines GSQ as a new post-training method that applies Gumbel-Softmax relaxation to jointly optimize discrete grid assignments and group scales for scalar quantization. This procedure is evaluated empirically on held-out model performance (Llama-3.1-8B/70B and Kimi-K2.5) against external baselines such as GPTQ, AWQ, and QTIP. No step reduces by construction to a fitted parameter, self-citation chain, or renamed input; the relaxation cardinality matching and annealing are standard techniques applied to the quantization objective rather than tautological redefinitions. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Gumbel-Softmax provides a differentiable relaxation of discrete categorical sampling that becomes exact in the low-temperature limit.
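This low-temperature exactness can be checked numerically, independently of the paper: with the Gumbel noise fixed, lowering the temperature drives the softmax sample toward the one-hot argmax of the perturbed logits.

```python
import numpy as np

logits = np.array([0.5, 1.5, -0.2])
gumbel = np.array([0.1, -0.3, 0.8])          # one fixed Gumbel(0,1) draw
samples = {}
for tau in (1.0, 0.1, 0.01):
    y = (logits + gumbel) / tau
    p = np.exp(y - y.max())                  # stable softmax
    samples[tau] = p / p.sum()
# argmax of logits + gumbel is index 1; as tau -> 0 the sample
# concentrates all mass there, recovering the discrete sample
```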