pith. sign in

arxiv: 2605.08894 · v2 · pith:46ATTTLZnew · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Pith reviewed 2026-05-19 15:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords quantizationlarge language modelssmoothnesspost-training quantizationquantization-aware trainingtoken predictiongeneration qualitydecoding tree
0
0 comments X

The pith

Extremely quantized LLMs lose generation quality from smoothness degradation in token predictions, beyond numerical accuracy loss alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that quantizing large language models to very low bit widths causes a systematic loss of smoothness in their prediction distributions, separate from the usual numerical errors. This smoothness loss shrinks the number of viable next-token choices around each prediction, which sparsifies the decoding tree and lowers output quality. The authors introduce a straightforward smoothness-preserving rule applied during both post-training quantization and quantization-aware training, showing measurable improvements on top of accuracy-focused methods. A sympathetic reader would care because this points to a new design axis that could make extreme quantization more practical for deploying capable models with limited hardware.

Core claim

Extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy the degradation grows worse as bit width falls. Sequence neighborhood modeling reveals that quantized models quickly reduce the count of effective token candidates inside each prediction neighborhood, directly producing a sparser decoding tree and lower generation quality. Applying a simple smoothness-preserving principle inside post-training quantization and quantization-aware training yields additional performance gains beyond what numerical accuracy improvements alone deliver.

What carries the argument

The smoothness proxy based on sequence neighborhood modeling that tracks the shrinkage of effective token candidates and the resulting sparsity of the decoding tree.

If this is right

  • Quantization algorithms for extreme low bits must treat smoothness preservation as a first-class objective alongside numerical accuracy.
  • The observed drop in effective token candidates directly explains why generation quality falls faster than accuracy metrics predict.
  • Both post-training quantization and quantization-aware training pipelines gain from the same smoothness-preserving rule.
  • Future extreme quantization work should monitor decoding-tree sparsity as an additional diagnostic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smoothness degradation may appear under other compression techniques such as pruning or knowledge distillation and could be checked with the same neighborhood proxy.
  • A direct correlation between the proxy and human preference scores on generated text would strengthen the case that smoothness is the causal driver.
  • The principle might extend naturally to multimodal models where output distributions over tokens or patches also need to remain smooth.

Load-bearing premise

The chosen smoothness proxy correctly measures the degradation that actually drives poorer generation quality rather than reflecting an unrelated side effect of the measurement itself.

What would settle it

An experiment in which models quantized with the smoothness principle show the same number of effective tokens and the same generation quality as accuracy-only baselines, or in which the effective-token count stays high even at extremely low bit widths.

Figures

Figures reproduced from arXiv: 2605.08894 by Pengzhan Li, Wanxiang Che, Xu Han, Yuxuan Li, Yuzhuang Xu.

Figure 1
Figure 1. Figure 1: Smoothness degradation in extreme quantization. Smoothness score distributions of GPTQ [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Approximate Lipschitz constant, i.e. expected input gradient [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Key definitions in sequence neighborhood modeling. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Results of rPPL-1 to rPPL-40 on original and quantized LLaMA-2-7B models. For GPTQ, [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustrative decoding trees of original and quantized models. For any given node, darker [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The row-wise forward pass and the column-wise backward pass [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: To investigate this phenomenon, we compare layer-wise gradients between a BF16 model [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Forward / backward preservation. accuracy smoothness 4-bit 3-bit 2-bit [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of fitting optimization and smoothness optimization. Given a prompt, the [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Evolution of Cavg during training across different models and settings. We use 0.4B model in this experiment. b) Is smoother training always better? In QAT, maximizing smoothness is not unconditionally beneficial. Moreover, it is unrealistic to expect a ternary model to achieve parity with a full-precision model in both smoothness and performance simultaneously. The core idea of this work is to identify t… view at source ↗
Figure 12
Figure 12. Figure 12: Evolution of Cavg during training when integrate RMSNorm or freeze word embedding. abnormal gradient channel [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Important/abnormal gradient channels and relaxed optimization. [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Comparison of zero-shot accuracy under different values of the coefficient [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
read the original abstract

Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy based on sequence neighborhood modeling, they observe that this degradation worsens as bit-width decreases, causing a rapid reduction in effective token candidates, a sparser decoding tree, and degraded generation quality. They introduce a simple smoothness-preserving principle for both post-training quantization and quantization-aware training, demonstrating additional gains beyond numerical accuracy, and position smoothness preservation as an important design consideration for future extreme quantization methods. Code is released.

Significance. If the claims hold after addressing the proxy validation issues, the work would usefully highlight a distinct mechanism (smoothness loss) affecting generation in low-bit LLMs, separate from standard numerical accuracy concerns. The empirical gains and open code would support its relevance for efficient LLM deployment research.

major comments (2)
  1. [Section 3] Section 3: The smoothness proxy defined via local sequence perturbations does not include controls for entropy or top-k mass concentration. Without reporting whether the metric remains predictive after such controls, it is unclear if the proxy isolates smoothness degradation or largely tracks the known effect of quantization concentrating probability mass on fewer tokens.
  2. [Sections 4 and 5] Sections 4 and 5: Experiments demonstrate gains from the smoothness-preserving principle, but lack an ablation that restores numerical accuracy while leaving the proxy unchanged. This omission leaves open the possibility that reported improvements stem from better quantization overall rather than smoothness preservation per se.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'smoothness proxy' is used without a one-sentence definition or explicit cross-reference to its formal definition in Section 3, which would improve immediate clarity for readers.
  2. [Figures] Figure captions (e.g., those illustrating neighborhood modeling): Captions could more explicitly state how the effective token candidate count is computed from the proxy to aid interpretation without requiring reference to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and describe the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Section 3] Section 3: The smoothness proxy defined via local sequence perturbations does not include controls for entropy or top-k mass concentration. Without reporting whether the metric remains predictive after such controls, it is unclear if the proxy isolates smoothness degradation or largely tracks the known effect of quantization concentrating probability mass on fewer tokens.

    Authors: We thank the referee for this observation. The proxy is constructed from local sequence neighborhood modeling to quantify degradation in output consistency across perturbed inputs, which is conceptually distinct from global entropy or top-k concentration. To directly address the concern, we will add controlled analyses in the revised Section 3, including partial correlations of the proxy with generation metrics after conditioning on entropy and top-k mass, as well as stratified reporting across bins of these quantities. These additions will clarify the unique contribution of the smoothness measure. revision: yes

  2. Referee: [Sections 4 and 5] Sections 4 and 5: Experiments demonstrate gains from the smoothness-preserving principle, but lack an ablation that restores numerical accuracy while leaving the proxy unchanged. This omission leaves open the possibility that reported improvements stem from better quantization overall rather than smoothness preservation per se.

    Authors: We agree that isolating the smoothness effect requires such an ablation. Our current experiments already compare against strong baselines with comparable numerical accuracy, yet we acknowledge the value of an explicit control. In the revised Sections 4 and 5 we will add an ablation that restores numerical accuracy (via post-hoc logit adjustment or temperature scaling to match perplexity) while disabling the smoothness-preserving component, and will report the resulting generation metrics. This will help demonstrate that the observed gains are attributable to the change in the smoothness proxy rather than numerical fidelity alone. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation relies on independent empirical validation

full rationale

The paper defines a smoothness proxy via local sequence perturbations and neighborhood modeling in Section 3, then applies a smoothness-preserving principle in PTQ and QAT to report gains beyond numerical accuracy in Sections 4 and 5. No equation or claim reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing step depend on self-citation chains or imported uniqueness theorems. The central distinction between numerical precision loss and smoothness degradation is tested through separate ablations and external benchmarks rather than presupposed in the definitions themselves. The derivation chain remains self-contained against observable generation quality metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the smoothness proxy and neighborhood modeling are treated as measurement tools rather than new postulates requiring independent evidence.

pith-pipeline@v0.9.0 · 5708 in / 1136 out tokens · 48688 ms · 2026-05-19T15:12:38.831731+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 8 internal anchors

  1. [1]

    MathQA: Towards interpretable math word problem solving with operation-based formalisms

    A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y . Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. InProceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2357–2367, 2019. URL ...

  2. [2]

    37 Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang

    D. Bahri, H. Mobahi, and Y . Tay. Sharpness-aware minimization improves language model generalization. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7360–7371, 2022. URL https://doi.org/10.18653/v1/2022. acl-long.508

  3. [3]

    P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in neural information processing systems (NeurIPS), 30:6240–6249, 2017. URL https://proceedings.neurips.cc/paper_files/paper/ 2017/hash/b22b257ad0519d4500539da3c8bcf4dd-Abstract.html

  4. [4]

    Y . Bisk, R. Zellers, J. Gao, Y . Choi, et al. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence (AAAI), pages 7432–7439, 2020. URLhttps://doi.org/10.48550/arXiv.1911.11641

  5. [5]

    A. Chan, Y . Tay, and Y .-S. Ong. What it thinks is important is important: Robustness transfers through input gradients. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 332–341, 2020. URL https://openaccess. thecvf.com/content_CVPR_2020/html/Chan_What_It_Thinks_Is_Important_Is_ Important_Robustness_Transf...

  6. [6]

    B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2924–2936, 2019. URL https://doi. org/10.186...

  7. [7]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457v1, 2018. URLhttps://doi.org/10.48550/arXiv.1803.05457

  8. [8]

    Certified Adversarial Robustness via Randomized Smoothing

    J. Cohen, E. Rosenfeld, and Z. Kolter. Certified adversarial robustness via randomized smooth- ing. InProceedings of International Conference on Machine Learning (ICML), pages 1310– 1320, 2019. URLhttps://doi.org/10.48550/arXiv.1902.02918

  9. [9]

    Dettmers, R

    T. Dettmers, R. Svirschevski, V . Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. SpQR: A sparse-quantized representation for near- lossless LLM weight compression. InProceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id= Q1u25ahSuy. 11

  10. [10]

    P. Dong, L. Li, Y . Zhong, D. Du, R. FAN, Y . Chen, Z. Tang, Q. Wang, W. Xue, Y . Guo, et al. STBLLM: Breaking the 1-bit barrier with structured binary llms. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=6XUSDvBFkV

  11. [11]

    Dwivedi, B

    P. Dwivedi, B. Islam, and M. Kajal. Smooth gradient loss: a loss function for gradient regularization in deep learning optimization.The Journal of Supercomputing, 81:1–45, 2025. URLhttps://doi.org/10.1007/s11227-025-07954-9

  12. [12]

    Egiazarian, A

    V . Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh. Ex- treme compression of large language models via additive quantization. InProceedings of International Conference on Machine Learning (ICML), pages 12284–12303, 2024. URL https://proceedings.mlr.press/v235/egiazarian24a.html

  13. [13]

    N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney. Lipschitz recurrent neural networks. InProceedings of the Ninth International Conference on Learning Representations (ICLR), 2021. URLhttps://openreview.net/forum?id=-N7PBXqOUJZ

  14. [14]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323v2, 2022. URL https://doi.org/10.48550/arXiv.2210.17323

  15. [15]

    H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity.Machine Learning, 110:393–416, 2021. URL https://doi. org/10.1007/s10994-020-05929-w

  16. [16]

    Huang, Y

    W. Huang, Y . Liu, H. Qin, Y . Li, S. Zhang, X. Liu, M. Magno, and X. Qi. BiLLM: Pushing the limit of post-training quantization for LLMs. InProceedings of International Conference on Machine Learning (ICML), pages 20023–20042, 2024. URL https://proceedings.mlr. press/v235/huang24q.html

  17. [17]

    Khromov and S

    G. Khromov and S. P. Singh. Some fundamental aspects about lipschitz continuity of neural networks. InProceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URLhttps://openreview.net/forum?id=5jWsW08zUh

  18. [18]

    H. Kim, G. Papamakarios, and A. Mnih. The lipschitz constant of self-attention. InProceedings of International Conference on Machine Learning (ICML), pages 5562–5571, 2021. URL https://proceedings.mlr.press/v139/kim21i.html

  19. [19]

    H. Lee, S. Cho, and C. Kim. Indirect gradient matching for adversarial robust distillation. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR),

  20. [20]

    URLhttps://doi.org/10.48550/arXiv.2312.03286

  21. [21]

    T. Li, P. Zhou, Z. He, X. Cheng, and X. Huang. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 5631–5640, 2024. URL https://openaccess.thecvf.com/content/CVPR2024/ html/Li_Friendly_Sharpness-Aware_Minimization_CVPR_2024_paper.html

  22. [22]

    Z. Li, X. Yan, T. Zhang, H. Qin, D. Xie, J. Tian, L. Kong, Y . Zhang, X. Yang, et al. ARB- LLM: Alternating refined binarizations for large language models. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https: //openreview.net/forum?id=ZU8OdDLTts

  23. [23]

    Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y . Mehdad, Y . Shi, R. Krishnamoorthi, and V . Chandra. LLM-QAT: Data-free quantization aware training for large language models. In Findings of the Association for Computational Linguistics (ACL Findings), pages 467–484,

  24. [24]

    URLhttps://doi.org/10.18653/v1/2024.findings-acl.26

  25. [25]

    Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V . Chandra, Y . Tian, and T. Blankevoort. SpinQuant: LLM quantization with learned rotations. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://doi.org/10.48550/arXiv.2405.16406. 12

  26. [26]

    K. Lyu, Z. Li, and S. Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction.Advances in Neural Information Processing Systems (NeurIPS), 35: 34689–34708, 2022. URL https://papers.nips.cc/paper_files/paper/2022/hash/ dffd1c523512e557f4e75e8309049213-Abstract-Conference.html

  27. [27]

    S. Ma, H. Wang, L. Ma, L. Wang, W. Wang, S. Huang, L. Dong, R. Wang, J. Xue, and F. Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits.arXiv preprint arXiv:2402.17764, 2024. URLhttps://doi.org/10.48550/arXiv.2402.17764

  28. [28]

    Merity, C

    S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. InProceedings of the Fifth International Conference on Learning Representations (ICLR), 2017. URL https: //openreview.net/forum?id=Byj72udxe

  29. [29]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2381–2391, 2018. URL https://doi.org/10.48550/arXiv.1809.02789

  30. [30]

    Pauli, A

    P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer. Training robust neural networks using lipschitz bounds.IEEE Control Systems Letters, 6:121–126, 2021. URL https://doi. org/10.1109/LCSYS.2021.3050444

  31. [31]

    X. Qi, J. Wang, Y . Chen, Y . Shi, and L. Zhang. LipsFormer: Introducing lipschitz continuity to vision transformers. InProceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023. URLhttps://openreview.net/forum?id=cHf1DcCwcH3

  32. [32]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.The Journal of Machine Learning Research, 21:5485–5551, 2020. URL https://jmlr.org/papers/ volume21/20-074/20-074.pdf

  33. [33]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y . Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021. URL https: //doi.org/10.48550/arXiv.1907.10641

  34. [34]

    W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y . Qiao, and P. Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. InProceed- ings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=8Wuvhh0LYW

  35. [35]

    Z. Shi, Y . Wang, H. Zhang, J. Z. Kolter, and C.-J. Hsieh. Efficiently com- puting local lipschitz constants of neural networks via bound propagation.Ad- vances in Neural Information Processing Systems (NeurIPS), 35:2350–2364, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/ 0ff54b4ec4f70b3ae12c8621ca8a49f4-Abstract-Conference.html

  36. [36]

    Singla and S

    S. Singla and S. Feizi. Skew orthogonal convolutions. InProceedings of International Confer- ence on Machine Learning (ICML), pages 9756–9766, 2021. URL https://proceedings. mlr.press/v139/singla21a.html

  37. [37]

    Tseng, J

    A. Tseng, J. Chee, Q. Sun, V . Kuleshov, and C. De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. InProceedings of International Conference on Machine Learning (ICML), pages 48630–48656, 2024. URL https://proceedings.mlr. press/v235/tseng24a.html

  38. [38]

    Virmaux and K

    A. Virmaux and K. Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation.Advances in Neural Information Processing Systems (NeurIPS), 31:3835–3844, 2018. URL https://proceedings.neurips.cc/paper_files/paper/ 2018/hash/d54e99a6c03704e95e6965532dec148b-Abstract.html

  39. [39]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018. URLhttps://doi.org/10.18653/v1/W18-5446. 13

  40. [40]

    R. Wang, Y . Gong, X. Liu, G. Zhao, Z. Yang, B. Guo, Z.-J. Zha, and P. CHENG. Op- timizing large language model training using FP4 quantization. InProceedings of In- ternational Conference on Machine Learning (ICML), pages 62937–62957, 2025. URL https://proceedings.mlr.press/v267

  41. [41]

    Z. Wang, G. Prakriya, and S. Jha. A quantitative geometric approach to neural-network smoothness.Advances in Neural Information Processing Systems (NeurIPS), 35:34201– 34215, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/ hash/dd1322ce23cbbdd9d7ebb0ad1223c27a-Abstract-Conference.html

  42. [42]

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of In- ternational Conference on Machine Learning (ICML), pages 38087–38099, 2023. URL https://proceedings.mlr.press/v202/xiao23c.html

  43. [43]

    Y . Xu, X. Han, Z. Yang, S. Wang, Q. Zhu, Z. Liu, W. Liu, and W. Che. OneBit: Towards extremely low-bit large language models.Advances in Neural Information Processing System (NeurIPS), 37:66357–66382, 2024. URLhttps://doi.org/10.52202/079017-2122

  44. [44]

    Y . Xu, S. Ji, Q. Zhu, and W. Che. CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13:1488–1506, 2025. URLhttps://doi.org/10.1162/TACL.a.45

  45. [45]

    Zellers, A

    R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4791–4800, 2019. URL https://doi.org/10. 18653/v1/P19-1472

  46. [46]

    Gradient Ridge

    Z. Zhou, J. Liang, Y . Song, L. Yu, H. Wang, W. Zhang, Y . Yu, and Z. Zhang. Lipschitz generative adversarial nets. InProceedings of International Conference on Machine Learning (ICML), pages 7584–7593, 2019. URLhttps://proceedings.mlr.press/v97/zhou19c.html. A Appendix A.1 Smoothness in Training Process In this section, we discuss several topics related ...