Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Pengzhan Li; Wanxiang Che; Xu Han; Yuxuan Li; Yuzhuang Xu

arxiv: 2605.08894 · v2 · pith:46ATTTLZnew · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Yuzhuang Xu , Xu Han , Yuxuan Li , Pengzhan Li , Wanxiang Che This is my paper

Pith reviewed 2026-05-19 15:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords quantizationlarge language modelssmoothnesspost-training quantizationquantization-aware trainingtoken predictiongeneration qualitydecoding tree

0 comments

The pith

Extremely quantized LLMs lose generation quality from smoothness degradation in token predictions, beyond numerical accuracy loss alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that quantizing large language models to very low bit widths causes a systematic loss of smoothness in their prediction distributions, separate from the usual numerical errors. This smoothness loss shrinks the number of viable next-token choices around each prediction, which sparsifies the decoding tree and lowers output quality. The authors introduce a straightforward smoothness-preserving rule applied during both post-training quantization and quantization-aware training, showing measurable improvements on top of accuracy-focused methods. A sympathetic reader would care because this points to a new design axis that could make extreme quantization more practical for deploying capable models with limited hardware.

Core claim

Extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy the degradation grows worse as bit width falls. Sequence neighborhood modeling reveals that quantized models quickly reduce the count of effective token candidates inside each prediction neighborhood, directly producing a sparser decoding tree and lower generation quality. Applying a simple smoothness-preserving principle inside post-training quantization and quantization-aware training yields additional performance gains beyond what numerical accuracy improvements alone deliver.

What carries the argument

The smoothness proxy based on sequence neighborhood modeling that tracks the shrinkage of effective token candidates and the resulting sparsity of the decoding tree.

If this is right

Quantization algorithms for extreme low bits must treat smoothness preservation as a first-class objective alongside numerical accuracy.
The observed drop in effective token candidates directly explains why generation quality falls faster than accuracy metrics predict.
Both post-training quantization and quantization-aware training pipelines gain from the same smoothness-preserving rule.
Future extreme quantization work should monitor decoding-tree sparsity as an additional diagnostic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smoothness degradation may appear under other compression techniques such as pruning or knowledge distillation and could be checked with the same neighborhood proxy.
A direct correlation between the proxy and human preference scores on generated text would strengthen the case that smoothness is the causal driver.
The principle might extend naturally to multimodal models where output distributions over tokens or patches also need to remain smooth.

Load-bearing premise

The chosen smoothness proxy correctly measures the degradation that actually drives poorer generation quality rather than reflecting an unrelated side effect of the measurement itself.

What would settle it

An experiment in which models quantized with the smoothness principle show the same number of effective tokens and the same generation quality as accuracy-only baselines, or in which the effective-token count stays high even at extremely low bit widths.

Figures

Figures reproduced from arXiv: 2605.08894 by Pengzhan Li, Wanxiang Che, Xu Han, Yuxuan Li, Yuzhuang Xu.

**Figure 2.** Figure 2: Approximate Lipschitz constant, i.e. expected input gradient [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Key definitions in sequence neighborhood modeling. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Results of rPPL-1 to rPPL-40 on original and quantized LLaMA-2-7B models. For GPTQ, [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Illustrative decoding trees of original and quantized models. For any given node, darker [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: The row-wise forward pass and the column-wise backward pass [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: To investigate this phenomenon, we compare layer-wise gradients between a BF16 model [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Forward / backward preservation. accuracy smoothness 4-bit 3-bit 2-bit [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 10.** Figure 10: Comparison of fitting optimization and smoothness optimization. Given a prompt, the [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Evolution of Cavg during training across different models and settings. We use 0.4B model in this experiment. b) Is smoother training always better? In QAT, maximizing smoothness is not unconditionally beneficial. Moreover, it is unrealistic to expect a ternary model to achieve parity with a full-precision model in both smoothness and performance simultaneously. The core idea of this work is to identify t… view at source ↗

**Figure 12.** Figure 12: Evolution of Cavg during training when integrate RMSNorm or freeze word embedding. abnormal gradient channel [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗

**Figure 13.** Figure 13: Important/abnormal gradient channels and relaxed optimization. [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

**Figure 14.** Figure 14: Comparison of zero-shot accuracy under different values of the coefficient [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

read the original abstract

Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags smoothness loss in extreme quantization as something beyond standard numerical fixes, with a simple preservation rule showing modest extra gains, but the proxy needs better separation from known distribution effects.

read the letter

The main point is that this paper argues extreme quantization causes smoothness degradation in LLMs that goes beyond just numerical inaccuracy, shrinking the pool of likely next tokens and hurting output quality. They back it with a proxy and test a preservation principle that adds gains on top of accuracy-focused methods. What stands out as new is treating smoothness as its own thing rather than folding it into accuracy metrics. The sequence neighborhood modeling to measure effective token candidates gives a practical handle on the issue. Showing benefits in both post-training and aware training setups is solid, and the code link lets others dig in. The softer part is the proxy itself. It might be picking up on distribution sharpening from low bits, which is already known, instead of a distinct continuity loss. Without checks that separate it from entropy changes or mass concentration, the claim that it's 'beyond numerical precision loss' needs more support. The experiments show gains, but an ablation fixing accuracy while changing smoothness would help clarify if the principle is doing something unique. This is for the quantization and efficient inference crowd. Someone building or optimizing low-bit models could pick up a useful angle here, even if the broader impact is limited to that niche. It deserves a serious referee because the observation is testable and the proposed fix is lightweight. I'd recommend sending it to review, mainly to get feedback on validating the proxy against standard quant effects.

Referee Report

2 major / 2 minor

Summary. The paper claims that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy based on sequence neighborhood modeling, they observe that this degradation worsens as bit-width decreases, causing a rapid reduction in effective token candidates, a sparser decoding tree, and degraded generation quality. They introduce a simple smoothness-preserving principle for both post-training quantization and quantization-aware training, demonstrating additional gains beyond numerical accuracy, and position smoothness preservation as an important design consideration for future extreme quantization methods. Code is released.

Significance. If the claims hold after addressing the proxy validation issues, the work would usefully highlight a distinct mechanism (smoothness loss) affecting generation in low-bit LLMs, separate from standard numerical accuracy concerns. The empirical gains and open code would support its relevance for efficient LLM deployment research.

major comments (2)

[Section 3] Section 3: The smoothness proxy defined via local sequence perturbations does not include controls for entropy or top-k mass concentration. Without reporting whether the metric remains predictive after such controls, it is unclear if the proxy isolates smoothness degradation or largely tracks the known effect of quantization concentrating probability mass on fewer tokens.
[Sections 4 and 5] Sections 4 and 5: Experiments demonstrate gains from the smoothness-preserving principle, but lack an ablation that restores numerical accuracy while leaving the proxy unchanged. This omission leaves open the possibility that reported improvements stem from better quantization overall rather than smoothness preservation per se.

minor comments (2)

[Abstract] Abstract: The phrase 'smoothness proxy' is used without a one-sentence definition or explicit cross-reference to its formal definition in Section 3, which would improve immediate clarity for readers.
[Figures] Figure captions (e.g., those illustrating neighborhood modeling): Captions could more explicitly state how the effective token candidate count is computed from the proxy to aid interpretation without requiring reference to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and describe the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Section 3] Section 3: The smoothness proxy defined via local sequence perturbations does not include controls for entropy or top-k mass concentration. Without reporting whether the metric remains predictive after such controls, it is unclear if the proxy isolates smoothness degradation or largely tracks the known effect of quantization concentrating probability mass on fewer tokens.

Authors: We thank the referee for this observation. The proxy is constructed from local sequence neighborhood modeling to quantify degradation in output consistency across perturbed inputs, which is conceptually distinct from global entropy or top-k concentration. To directly address the concern, we will add controlled analyses in the revised Section 3, including partial correlations of the proxy with generation metrics after conditioning on entropy and top-k mass, as well as stratified reporting across bins of these quantities. These additions will clarify the unique contribution of the smoothness measure. revision: yes
Referee: [Sections 4 and 5] Sections 4 and 5: Experiments demonstrate gains from the smoothness-preserving principle, but lack an ablation that restores numerical accuracy while leaving the proxy unchanged. This omission leaves open the possibility that reported improvements stem from better quantization overall rather than smoothness preservation per se.

Authors: We agree that isolating the smoothness effect requires such an ablation. Our current experiments already compare against strong baselines with comparable numerical accuracy, yet we acknowledge the value of an explicit control. In the revised Sections 4 and 5 we will add an ablation that restores numerical accuracy (via post-hoc logit adjustment or temperature scaling to match perplexity) while disabling the smoothness-preserving component, and will report the resulting generation metrics. This will help demonstrate that the observed gains are attributable to the change in the smoothness proxy rather than numerical fidelity alone. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation relies on independent empirical validation

full rationale

The paper defines a smoothness proxy via local sequence perturbations and neighborhood modeling in Section 3, then applies a smoothness-preserving principle in PTQ and QAT to report gains beyond numerical accuracy in Sections 4 and 5. No equation or claim reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing step depend on self-citation chains or imported uniqueness theorems. The central distinction between numerical precision loss and smoothness degradation is tested through separate ablations and external benchmarks rather than presupposed in the definitions themselves. The derivation chain remains self-contained against observable generation quality metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the smoothness proxy and neighborhood modeling are treated as measurement tools rather than new postulates requiring independent evidence.

pith-pipeline@v0.9.0 · 5708 in / 1136 out tokens · 48688 ms · 2026-05-19T15:12:38.831731+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We define the smoothness proxy... as expected input gradient C_avg = E_x∈S ||∇x f_θ||_2 ... joint objective min ||WX-ŴX||_F + ||W⊤G-Ŵ⊤G||_F
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

sequence neighborhood modeling... rPPL-k... sparser decoding tree

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 8 internal anchors

[1]

MathQA: Towards interpretable math word problem solving with operation-based formalisms

A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y . Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. InProceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2357–2367, 2019. URL ...

work page doi:10.18653/v1/n19-1245 2019
[2]

37 Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang

D. Bahri, H. Mobahi, and Y . Tay. Sharpness-aware minimization improves language model generalization. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7360–7371, 2022. URL https://doi.org/10.18653/v1/2022. acl-long.508

work page doi:10.18653/v1/2022 2022
[3]

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in neural information processing systems (NeurIPS), 30:6240–6249, 2017. URL https://proceedings.neurips.cc/paper_files/paper/ 2017/hash/b22b257ad0519d4500539da3c8bcf4dd-Abstract.html

work page 2017
[4]

Y . Bisk, R. Zellers, J. Gao, Y . Choi, et al. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence (AAAI), pages 7432–7439, 2020. URLhttps://doi.org/10.48550/arXiv.1911.11641

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1911.11641 2020
[5]

A. Chan, Y . Tay, and Y .-S. Ong. What it thinks is important is important: Robustness transfers through input gradients. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 332–341, 2020. URL https://openaccess. thecvf.com/content_CVPR_2020/html/Chan_What_It_Thinks_Is_Important_Is_ Important_Robustness_Transf...

work page 2020
[6]

B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2924–2936, 2019. URL https://doi. org/10.186...

work page doi:10.18653/v1/n19-1300 2019
[7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457v1, 2018. URLhttps://doi.org/10.48550/arXiv.1803.05457

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1803.05457 2018
[8]

Certified Adversarial Robustness via Randomized Smoothing

J. Cohen, E. Rosenfeld, and Z. Kolter. Certified adversarial robustness via randomized smooth- ing. InProceedings of International Conference on Machine Learning (ICML), pages 1310– 1320, 2019. URLhttps://doi.org/10.48550/arXiv.1902.02918

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1902.02918 2019
[9]

Dettmers, R

T. Dettmers, R. Svirschevski, V . Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. SpQR: A sparse-quantized representation for near- lossless LLM weight compression. InProceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id= Q1u25ahSuy. 11

work page 2024
[10]

P. Dong, L. Li, Y . Zhong, D. Du, R. FAN, Y . Chen, Z. Tang, Q. Wang, W. Xue, Y . Guo, et al. STBLLM: Breaking the 1-bit barrier with structured binary llms. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=6XUSDvBFkV

work page 2025
[11]

Dwivedi, B

P. Dwivedi, B. Islam, and M. Kajal. Smooth gradient loss: a loss function for gradient regularization in deep learning optimization.The Journal of Supercomputing, 81:1–45, 2025. URLhttps://doi.org/10.1007/s11227-025-07954-9

work page doi:10.1007/s11227-025-07954-9 2025
[12]

Egiazarian, A

V . Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh. Ex- treme compression of large language models via additive quantization. InProceedings of International Conference on Machine Learning (ICML), pages 12284–12303, 2024. URL https://proceedings.mlr.press/v235/egiazarian24a.html

work page 2024
[13]

N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney. Lipschitz recurrent neural networks. InProceedings of the Ninth International Conference on Learning Representations (ICLR), 2021. URLhttps://openreview.net/forum?id=-N7PBXqOUJZ

work page 2021
[14]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323v2, 2022. URL https://doi.org/10.48550/arXiv.2210.17323

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.17323 2022
[15]

H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity.Machine Learning, 110:393–416, 2021. URL https://doi. org/10.1007/s10994-020-05929-w

work page doi:10.1007/s10994-020-05929-w 2021
[16]

Huang, Y

W. Huang, Y . Liu, H. Qin, Y . Li, S. Zhang, X. Liu, M. Magno, and X. Qi. BiLLM: Pushing the limit of post-training quantization for LLMs. InProceedings of International Conference on Machine Learning (ICML), pages 20023–20042, 2024. URL https://proceedings.mlr. press/v235/huang24q.html

work page 2024
[17]

Khromov and S

G. Khromov and S. P. Singh. Some fundamental aspects about lipschitz continuity of neural networks. InProceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URLhttps://openreview.net/forum?id=5jWsW08zUh

work page 2024
[18]

H. Kim, G. Papamakarios, and A. Mnih. The lipschitz constant of self-attention. InProceedings of International Conference on Machine Learning (ICML), pages 5562–5571, 2021. URL https://proceedings.mlr.press/v139/kim21i.html

work page 2021
[19]

H. Lee, S. Cho, and C. Kim. Indirect gradient matching for adversarial robust distillation. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR),

work page
[20]

URLhttps://doi.org/10.48550/arXiv.2312.03286

work page doi:10.48550/arxiv.2312.03286
[21]

T. Li, P. Zhou, Z. He, X. Cheng, and X. Huang. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 5631–5640, 2024. URL https://openaccess.thecvf.com/content/CVPR2024/ html/Li_Friendly_Sharpness-Aware_Minimization_CVPR_2024_paper.html

work page 2024
[22]

Z. Li, X. Yan, T. Zhang, H. Qin, D. Xie, J. Tian, L. Kong, Y . Zhang, X. Yang, et al. ARB- LLM: Alternating refined binarizations for large language models. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https: //openreview.net/forum?id=ZU8OdDLTts

work page 2025
[23]

Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y . Mehdad, Y . Shi, R. Krishnamoorthi, and V . Chandra. LLM-QAT: Data-free quantization aware training for large language models. In Findings of the Association for Computational Linguistics (ACL Findings), pages 467–484,

work page
[24]

URLhttps://doi.org/10.18653/v1/2024.findings-acl.26

work page doi:10.18653/v1/2024.findings-acl.26 2024
[25]

Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V . Chandra, Y . Tian, and T. Blankevoort. SpinQuant: LLM quantization with learned rotations. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://doi.org/10.48550/arXiv.2405.16406. 12

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.16406 2025
[26]

K. Lyu, Z. Li, and S. Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction.Advances in Neural Information Processing Systems (NeurIPS), 35: 34689–34708, 2022. URL https://papers.nips.cc/paper_files/paper/2022/hash/ dffd1c523512e557f4e75e8309049213-Abstract-Conference.html

work page 2022
[27]

S. Ma, H. Wang, L. Ma, L. Wang, W. Wang, S. Huang, L. Dong, R. Wang, J. Xue, and F. Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits.arXiv preprint arXiv:2402.17764, 2024. URLhttps://doi.org/10.48550/arXiv.2402.17764

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.17764 2024
[28]

Merity, C

S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. InProceedings of the Fifth International Conference on Learning Representations (ICLR), 2017. URL https: //openreview.net/forum?id=Byj72udxe

work page 2017
[29]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2381–2391, 2018. URL https://doi.org/10.48550/arXiv.1809.02789

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1809.02789 2018
[30]

Pauli, A

P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer. Training robust neural networks using lipschitz bounds.IEEE Control Systems Letters, 6:121–126, 2021. URL https://doi. org/10.1109/LCSYS.2021.3050444

work page doi:10.1109/lcsys.2021.3050444 2021
[31]

X. Qi, J. Wang, Y . Chen, Y . Shi, and L. Zhang. LipsFormer: Introducing lipschitz continuity to vision transformers. InProceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023. URLhttps://openreview.net/forum?id=cHf1DcCwcH3

work page 2023
[32]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.The Journal of Machine Learning Research, 21:5485–5551, 2020. URL https://jmlr.org/papers/ volume21/20-074/20-074.pdf

work page 2020
[33]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y . Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021. URL https: //doi.org/10.48550/arXiv.1907.10641

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1907.10641 2021
[34]

W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y . Qiao, and P. Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. InProceed- ings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=8Wuvhh0LYW

work page 2024
[35]

Z. Shi, Y . Wang, H. Zhang, J. Z. Kolter, and C.-J. Hsieh. Efficiently com- puting local lipschitz constants of neural networks via bound propagation.Ad- vances in Neural Information Processing Systems (NeurIPS), 35:2350–2364, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/ 0ff54b4ec4f70b3ae12c8621ca8a49f4-Abstract-Conference.html

work page 2022
[36]

Singla and S

S. Singla and S. Feizi. Skew orthogonal convolutions. InProceedings of International Confer- ence on Machine Learning (ICML), pages 9756–9766, 2021. URL https://proceedings. mlr.press/v139/singla21a.html

work page 2021
[37]

Tseng, J

A. Tseng, J. Chee, Q. Sun, V . Kuleshov, and C. De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. InProceedings of International Conference on Machine Learning (ICML), pages 48630–48656, 2024. URL https://proceedings.mlr. press/v235/tseng24a.html

work page 2024
[38]

Virmaux and K

A. Virmaux and K. Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation.Advances in Neural Information Processing Systems (NeurIPS), 31:3835–3844, 2018. URL https://proceedings.neurips.cc/paper_files/paper/ 2018/hash/d54e99a6c03704e95e6965532dec148b-Abstract.html

work page 2018
[39]

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018. URLhttps://doi.org/10.18653/v1/W18-5446. 13

work page doi:10.18653/v1/w18-5446 2018
[40]

R. Wang, Y . Gong, X. Liu, G. Zhao, Z. Yang, B. Guo, Z.-J. Zha, and P. CHENG. Op- timizing large language model training using FP4 quantization. InProceedings of In- ternational Conference on Machine Learning (ICML), pages 62937–62957, 2025. URL https://proceedings.mlr.press/v267

work page 2025
[41]

Z. Wang, G. Prakriya, and S. Jha. A quantitative geometric approach to neural-network smoothness.Advances in Neural Information Processing Systems (NeurIPS), 35:34201– 34215, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/ hash/dd1322ce23cbbdd9d7ebb0ad1223c27a-Abstract-Conference.html

work page 2022
[42]

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of In- ternational Conference on Machine Learning (ICML), pages 38087–38099, 2023. URL https://proceedings.mlr.press/v202/xiao23c.html

work page 2023
[43]

Y . Xu, X. Han, Z. Yang, S. Wang, Q. Zhu, Z. Liu, W. Liu, and W. Che. OneBit: Towards extremely low-bit large language models.Advances in Neural Information Processing System (NeurIPS), 37:66357–66382, 2024. URLhttps://doi.org/10.52202/079017-2122

work page doi:10.52202/079017-2122 2024
[44]

Y . Xu, S. Ji, Q. Zhu, and W. Che. CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13:1488–1506, 2025. URLhttps://doi.org/10.1162/TACL.a.45

work page doi:10.1162/tacl.a.45 2025
[45]

Zellers, A

R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4791–4800, 2019. URL https://doi.org/10. 18653/v1/P19-1472

work page 2019
[46]

Gradient Ridge

Z. Zhou, J. Liang, Y . Song, L. Yu, H. Wang, W. Zhang, Y . Yu, and Z. Zhang. Lipschitz generative adversarial nets. InProceedings of International Conference on Machine Learning (ICML), pages 7584–7593, 2019. URLhttps://proceedings.mlr.press/v97/zhou19c.html. A Appendix A.1 Smoothness in Training Process In this section, we discuss several topics related ...

work page 2019

[1] [1]

MathQA: Towards interpretable math word problem solving with operation-based formalisms

A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y . Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. InProceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2357–2367, 2019. URL ...

work page doi:10.18653/v1/n19-1245 2019

[2] [2]

37 Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang

D. Bahri, H. Mobahi, and Y . Tay. Sharpness-aware minimization improves language model generalization. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7360–7371, 2022. URL https://doi.org/10.18653/v1/2022. acl-long.508

work page doi:10.18653/v1/2022 2022

[3] [3]

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks.Advances in neural information processing systems (NeurIPS), 30:6240–6249, 2017. URL https://proceedings.neurips.cc/paper_files/paper/ 2017/hash/b22b257ad0519d4500539da3c8bcf4dd-Abstract.html

work page 2017

[4] [4]

Y . Bisk, R. Zellers, J. Gao, Y . Choi, et al. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence (AAAI), pages 7432–7439, 2020. URLhttps://doi.org/10.48550/arXiv.1911.11641

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1911.11641 2020

[5] [5]

A. Chan, Y . Tay, and Y .-S. Ong. What it thinks is important is important: Robustness transfers through input gradients. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 332–341, 2020. URL https://openaccess. thecvf.com/content_CVPR_2020/html/Chan_What_It_Thinks_Is_Important_Is_ Important_Robustness_Transf...

work page 2020

[6] [6]

B ool Q : Exploring the Surprising Difficulty of Natural Yes/No Questions

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2924–2936, 2019. URL https://doi. org/10.186...

work page doi:10.18653/v1/n19-1300 2019

[7] [7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457v1, 2018. URLhttps://doi.org/10.48550/arXiv.1803.05457

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1803.05457 2018

[8] [8]

Certified Adversarial Robustness via Randomized Smoothing

J. Cohen, E. Rosenfeld, and Z. Kolter. Certified adversarial robustness via randomized smooth- ing. InProceedings of International Conference on Machine Learning (ICML), pages 1310– 1320, 2019. URLhttps://doi.org/10.48550/arXiv.1902.02918

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1902.02918 2019

[9] [9]

Dettmers, R

T. Dettmers, R. Svirschevski, V . Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. SpQR: A sparse-quantized representation for near- lossless LLM weight compression. InProceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id= Q1u25ahSuy. 11

work page 2024

[10] [10]

P. Dong, L. Li, Y . Zhong, D. Du, R. FAN, Y . Chen, Z. Tang, Q. Wang, W. Xue, Y . Guo, et al. STBLLM: Breaking the 1-bit barrier with structured binary llms. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=6XUSDvBFkV

work page 2025

[11] [11]

Dwivedi, B

P. Dwivedi, B. Islam, and M. Kajal. Smooth gradient loss: a loss function for gradient regularization in deep learning optimization.The Journal of Supercomputing, 81:1–45, 2025. URLhttps://doi.org/10.1007/s11227-025-07954-9

work page doi:10.1007/s11227-025-07954-9 2025

[12] [12]

Egiazarian, A

V . Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh. Ex- treme compression of large language models via additive quantization. InProceedings of International Conference on Machine Learning (ICML), pages 12284–12303, 2024. URL https://proceedings.mlr.press/v235/egiazarian24a.html

work page 2024

[13] [13]

N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney. Lipschitz recurrent neural networks. InProceedings of the Ninth International Conference on Learning Representations (ICLR), 2021. URLhttps://openreview.net/forum?id=-N7PBXqOUJZ

work page 2021

[14] [14]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323v2, 2022. URL https://doi.org/10.48550/arXiv.2210.17323

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.17323 2022

[15] [15]

H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity.Machine Learning, 110:393–416, 2021. URL https://doi. org/10.1007/s10994-020-05929-w

work page doi:10.1007/s10994-020-05929-w 2021

[16] [16]

Huang, Y

W. Huang, Y . Liu, H. Qin, Y . Li, S. Zhang, X. Liu, M. Magno, and X. Qi. BiLLM: Pushing the limit of post-training quantization for LLMs. InProceedings of International Conference on Machine Learning (ICML), pages 20023–20042, 2024. URL https://proceedings.mlr. press/v235/huang24q.html

work page 2024

[17] [17]

Khromov and S

G. Khromov and S. P. Singh. Some fundamental aspects about lipschitz continuity of neural networks. InProceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URLhttps://openreview.net/forum?id=5jWsW08zUh

work page 2024

[18] [18]

H. Kim, G. Papamakarios, and A. Mnih. The lipschitz constant of self-attention. InProceedings of International Conference on Machine Learning (ICML), pages 5562–5571, 2021. URL https://proceedings.mlr.press/v139/kim21i.html

work page 2021

[19] [19]

H. Lee, S. Cho, and C. Kim. Indirect gradient matching for adversarial robust distillation. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR),

work page

[20] [20]

URLhttps://doi.org/10.48550/arXiv.2312.03286

work page doi:10.48550/arxiv.2312.03286

[21] [21]

T. Li, P. Zhou, Z. He, X. Cheng, and X. Huang. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 5631–5640, 2024. URL https://openaccess.thecvf.com/content/CVPR2024/ html/Li_Friendly_Sharpness-Aware_Minimization_CVPR_2024_paper.html

work page 2024

[22] [22]

Z. Li, X. Yan, T. Zhang, H. Qin, D. Xie, J. Tian, L. Kong, Y . Zhang, X. Yang, et al. ARB- LLM: Alternating refined binarizations for large language models. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https: //openreview.net/forum?id=ZU8OdDLTts

work page 2025

[23] [23]

Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y . Mehdad, Y . Shi, R. Krishnamoorthi, and V . Chandra. LLM-QAT: Data-free quantization aware training for large language models. In Findings of the Association for Computational Linguistics (ACL Findings), pages 467–484,

work page

[24] [24]

URLhttps://doi.org/10.18653/v1/2024.findings-acl.26

work page doi:10.18653/v1/2024.findings-acl.26 2024

[25] [25]

Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V . Chandra, Y . Tian, and T. Blankevoort. SpinQuant: LLM quantization with learned rotations. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://doi.org/10.48550/arXiv.2405.16406. 12

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.16406 2025

[26] [26]

K. Lyu, Z. Li, and S. Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction.Advances in Neural Information Processing Systems (NeurIPS), 35: 34689–34708, 2022. URL https://papers.nips.cc/paper_files/paper/2022/hash/ dffd1c523512e557f4e75e8309049213-Abstract-Conference.html

work page 2022

[27] [27]

S. Ma, H. Wang, L. Ma, L. Wang, W. Wang, S. Huang, L. Dong, R. Wang, J. Xue, and F. Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits.arXiv preprint arXiv:2402.17764, 2024. URLhttps://doi.org/10.48550/arXiv.2402.17764

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.17764 2024

[28] [28]

Merity, C

S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. InProceedings of the Fifth International Conference on Learning Representations (ICLR), 2017. URL https: //openreview.net/forum?id=Byj72udxe

work page 2017

[29] [29]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2381–2391, 2018. URL https://doi.org/10.48550/arXiv.1809.02789

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1809.02789 2018

[30] [30]

Pauli, A

P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer. Training robust neural networks using lipschitz bounds.IEEE Control Systems Letters, 6:121–126, 2021. URL https://doi. org/10.1109/LCSYS.2021.3050444

work page doi:10.1109/lcsys.2021.3050444 2021

[31] [31]

X. Qi, J. Wang, Y . Chen, Y . Shi, and L. Zhang. LipsFormer: Introducing lipschitz continuity to vision transformers. InProceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023. URLhttps://openreview.net/forum?id=cHf1DcCwcH3

work page 2023

[32] [32]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.The Journal of Machine Learning Research, 21:5485–5551, 2020. URL https://jmlr.org/papers/ volume21/20-074/20-074.pdf

work page 2020

[33] [33]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y . Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021. URL https: //doi.org/10.48550/arXiv.1907.10641

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1907.10641 2021

[34] [34]

W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y . Qiao, and P. Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. InProceed- ings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=8Wuvhh0LYW

work page 2024

[35] [35]

Z. Shi, Y . Wang, H. Zhang, J. Z. Kolter, and C.-J. Hsieh. Efficiently com- puting local lipschitz constants of neural networks via bound propagation.Ad- vances in Neural Information Processing Systems (NeurIPS), 35:2350–2364, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/ 0ff54b4ec4f70b3ae12c8621ca8a49f4-Abstract-Conference.html

work page 2022

[36] [36]

Singla and S

S. Singla and S. Feizi. Skew orthogonal convolutions. InProceedings of International Confer- ence on Machine Learning (ICML), pages 9756–9766, 2021. URL https://proceedings. mlr.press/v139/singla21a.html

work page 2021

[37] [37]

Tseng, J

A. Tseng, J. Chee, Q. Sun, V . Kuleshov, and C. De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. InProceedings of International Conference on Machine Learning (ICML), pages 48630–48656, 2024. URL https://proceedings.mlr. press/v235/tseng24a.html

work page 2024

[38] [38]

Virmaux and K

A. Virmaux and K. Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation.Advances in Neural Information Processing Systems (NeurIPS), 31:3835–3844, 2018. URL https://proceedings.neurips.cc/paper_files/paper/ 2018/hash/d54e99a6c03704e95e6965532dec148b-Abstract.html

work page 2018

[39] [39]

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018. URLhttps://doi.org/10.18653/v1/W18-5446. 13

work page doi:10.18653/v1/w18-5446 2018

[40] [40]

R. Wang, Y . Gong, X. Liu, G. Zhao, Z. Yang, B. Guo, Z.-J. Zha, and P. CHENG. Op- timizing large language model training using FP4 quantization. InProceedings of In- ternational Conference on Machine Learning (ICML), pages 62937–62957, 2025. URL https://proceedings.mlr.press/v267

work page 2025

[41] [41]

Z. Wang, G. Prakriya, and S. Jha. A quantitative geometric approach to neural-network smoothness.Advances in Neural Information Processing Systems (NeurIPS), 35:34201– 34215, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/ hash/dd1322ce23cbbdd9d7ebb0ad1223c27a-Abstract-Conference.html

work page 2022

[42] [42]

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of In- ternational Conference on Machine Learning (ICML), pages 38087–38099, 2023. URL https://proceedings.mlr.press/v202/xiao23c.html

work page 2023

[43] [43]

Y . Xu, X. Han, Z. Yang, S. Wang, Q. Zhu, Z. Liu, W. Liu, and W. Che. OneBit: Towards extremely low-bit large language models.Advances in Neural Information Processing System (NeurIPS), 37:66357–66382, 2024. URLhttps://doi.org/10.52202/079017-2122

work page doi:10.52202/079017-2122 2024

[44] [44]

Y . Xu, S. Ji, Q. Zhu, and W. Che. CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13:1488–1506, 2025. URLhttps://doi.org/10.1162/TACL.a.45

work page doi:10.1162/tacl.a.45 2025

[45] [45]

Zellers, A

R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4791–4800, 2019. URL https://doi.org/10. 18653/v1/P19-1472

work page 2019

[46] [46]

Gradient Ridge

Z. Zhou, J. Liang, Y . Song, L. Yu, H. Wang, W. Zhang, Y . Yu, and Z. Zhang. Lipschitz generative adversarial nets. InProceedings of International Conference on Machine Learning (ICML), pages 7584–7593, 2019. URLhttps://proceedings.mlr.press/v97/zhou19c.html. A Appendix A.1 Smoothness in Training Process In this section, we discuss several topics related ...

work page 2019