pith. sign in

arxiv: 2606.22249 · v1 · pith:XWMJ4DGYnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI

On the Expressive Power of Weight Quantization in Large Language Models

Pith reviewed 2026-06-26 11:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords weight quantizationexpressive poweruniversal approximationlarge language modelsmodel compressionbit precisionexpressive degradationquantized networks
0
0 comments X

The pith

Weight quantization in large language models loses universal approximation ability below 1.58 bits per weight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how encoding large language model weights in fewer bits affects their ability to represent complex functions. It shows that quantized models retain universal approximation when using more than 1.58 bits but collapse in expressive power below that threshold. The loss of capacity occurs polynomially as bit count drops. These results frame quantization limits in terms of scaling and compression tradeoffs. Readers interested in model efficiency would see a concrete bound on how far compression can go before core capabilities erode.

Core claim

The paper establishes that 1.58-bit is the limiting precision for weight quantization by proving universal approximation holds for weight-quantized models above this level and expressive collapse occurs below it, while also showing that expressive capacity degrades polynomially with decreasing bit count.

What carries the argument

Restriction of network weights to finite discrete sets whose cardinality is governed by the bit precision, combined with analysis of the resulting function class's approximation properties.

If this is right

  • Models using fewer than 1.58 bits per weight cannot serve as universal approximators regardless of width or depth.
  • Expressive capacity scales polynomially downward with each reduction in bit precision.
  • Quantization-aware scaling laws must incorporate this precision-dependent degradation term.
  • Compression and acceleration techniques gain a theoretical floor below which further bit reduction yields qualitatively different models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 1.58-bit threshold aligns with ternary weight sets such as {-1, 0, 1}, suggesting those representations sit at the boundary between full and collapsed expressivity.
  • Empirical tests measuring approximation error on simple function classes could directly observe the predicted polynomial rate.
  • Similar discrete-set arguments might apply to activation quantization or mixed-precision schemes.
  • The framework could be extended to quantify how quantization interacts with specific architectural choices like attention heads.

Load-bearing premise

Expressive power is defined such that the number of distinct quantization levels directly determines whether universal approximation is possible, with a sharp change at exactly three levels.

What would settle it

A construction of a 1-bit quantized network family that can still approximate any continuous function on a compact domain to arbitrary accuracy would disprove the claimed collapse threshold.

Figures

Figures reproduced from arXiv: 2606.22249 by Shao-Qun Zhang.

Figure 1
Figure 1. Figure 1: Illustrations of expressive power and the corresponding expressive gap of real-valued, floating, [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A schematic diagram of LLM architectures that approximate four hierarchical functions. [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The example of approximating f(x1, x2) = x1x2 + x 2 1 for illustrating the width computing of weight-quantized MLPs based on the Drawer principle. Theorem 4 shows that the weight-quantized ATTs with a linear architecture complexity maintain the universal approximation property. The proof idea of Theorem 4 is similar to that of Theorem 3, necessitating the proof of the following lemma. Lemma 2 If one equips… view at source ↗
Figure 4
Figure 4. Figure 4: The approximation gap bars of weight-quantized MLP within (a) 2-norm and (b) [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The relation curves between the ratio ln(ϵ/δ) and the number of quantization bits n within (a) 2-norm and (b) ∞-norm. addition, it is also challenging to compute the weight gap δ layer by layer; in other words, δ is unknown. Thus, the results of Theorem 7 become the relation among n, ϵ, and model complexity. Here, we replace ϵ by classification accuracy and present an indicator ln(accuracy/model complexity… view at source ↗
Figure 6
Figure 6. Figure 6: (a) The accuracy bars of conducted models with respect to the number of bits [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The relation curves (a-b) between the model complexity and [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
read the original abstract

In recent years, weight quantization that encodes the learnable parameters of large language models in an $n$-bit format has garnered significant attention due to its potential for model compression and inference acceleration. Many practical techniques have been developed; however, the theoretical understanding of many aspects, especially the approximation and degradation of expressive power as the number of quantization bits decreases, remains unclear. In this paper, we provide a theoretical investigation into the expressive capability of large language models relative to the number of quantization bits. We argue that 1.58-bit is the limiting precision for weight quantization by establishing the universal approximation and expressive collapse properties of weight-quantized models with respect to the number of quantization bits. Additionally, we confirm that weight quantization leads to expressive degradation, in which the expressive capacity of weight-quantized models degrades polynomially as the number of quantization bits decreases. These theoretical findings provide a solid foundation for advancing weight quantization in the context of scaling laws and shed insights for future research in model compression and inference acceleration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that 1.58 bits is the limiting precision for weight quantization in large language models. It establishes universal approximation and expressive collapse properties with respect to the number of quantization bits, and shows that expressive capacity degrades polynomially as the number of bits decreases. These results are positioned as providing a theoretical foundation for quantization in scaling laws.

Significance. If the central claims are rigorously derived, the identification of a sharp threshold at log2(3) bits and the polynomial degradation rate would supply a concrete theoretical limit that could inform practical quantization choices and scaling analyses for model compression.

major comments (2)
  1. [Abstract] Abstract: the claim of a sharp 1.58-bit threshold for universal approximation and expressive collapse is load-bearing for the entire contribution, yet no definition of expressive power, no statement of the network class (e.g., ReLU networks with finite weights), and no proof sketch or cardinality argument are supplied; without these the threshold cannot be verified and may be an artifact of an implicit modeling choice that equates |Q| = 2^b directly with functional span.
  2. [Abstract] Abstract: the polynomial degradation statement is presented as a derived result, but no functional form, no dependence on network depth or width, and no supporting derivation or theorem statement appear; this prevents assessment of whether the rate is independent of other modeling assumptions.
minor comments (1)
  1. [Abstract] Abstract: the term 'large language models' is used throughout, but the stated properties appear to apply to general feed-forward networks; clarify whether the results rely on transformer-specific structure or hold more broadly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments on the abstract. We address each point below, noting that the abstract summarizes results whose details appear in the full manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of a sharp 1.58-bit threshold for universal approximation and expressive collapse is load-bearing for the entire contribution, yet no definition of expressive power, no statement of the network class (e.g., ReLU networks with finite weights), and no proof sketch or cardinality argument are supplied; without these the threshold cannot be verified and may be an artifact of an implicit modeling choice that equates |Q| = 2^b directly with functional span.

    Authors: The abstract is a concise summary. The manuscript defines expressive power via the universal approximation property for continuous functions on compact sets and specifies the network class as ReLU networks whose weights are drawn from a finite quantization set Q with cardinality 2^b. The threshold at log2(3) follows from a cardinality argument establishing that the representable function class becomes strictly smaller than the target class for b < log2(3), producing collapse; this is shown in the main theorems and is independent of equating |Q| directly with functional span. We will expand the abstract with a one-sentence definition of expressive power and the network class. revision: yes

  2. Referee: [Abstract] Abstract: the polynomial degradation statement is presented as a derived result, but no functional form, no dependence on network depth or width, and no supporting derivation or theorem statement appear; this prevents assessment of whether the rate is independent of other modeling assumptions.

    Authors: The polynomial degradation result is derived in the manuscript, with the functional form depending polynomially on 2^b and with explicit dependence on depth and width appearing in the exponent. The supporting theorem and proof are given in the body. We will add a parenthetical reference to the relevant theorem in the abstract to improve traceability. revision: yes

Circularity Check

0 steps flagged

No circularity; claims presented as derived theoretical properties without reduction to inputs by construction

full rationale

The abstract frames the 1.58-bit limit and polynomial expressive degradation as results obtained by establishing universal approximation and expressive collapse properties of weight-quantized models. No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would make any central claim equivalent to its inputs by construction. The derivation is positioned as independent from external approximation theory benchmarks, consistent with a self-contained analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; the central claims rest on unspecified definitions of expressive power, universal approximation for quantized networks, and the precise mapping from quantization cardinality to bit count. No free parameters, invented entities, or explicit axioms are listed in the visible text.

axioms (1)
  • domain assumption Expressive power of a neural network can be rigorously quantified such that quantization to a discrete weight set of cardinality 3 preserves universal approximation while cardinality 2 triggers collapse.
    This modeling choice is required for the 1.58-bit threshold to emerge; it is invoked implicitly when the paper states the limiting precision.

pith-pipeline@v0.9.1-grok · 5697 in / 1391 out tokens · 29444 ms · 2026-06-26T11:51:44.725659+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 21 canonical work pages · 10 internal anchors

  1. [1]

    Aftabi, N

    N. Aftabi, N. Moradi, and F. Mahroo. Feed-forward neural networks as a mixed-integer program. arXiv:2402.06697, 2024

  2. [2]

    Llama 3 model card

    AI@Meta. Llama 3 model card. URL https://github.com/meta-llama/llama3/blob/ main/MODEL CARD.md., 2024

  3. [3]

    A. G. Anderson and C. P. Berg. The high-dimensional geometry of binary neural networks. arXiv:1705.07199, 2017

  4. [4]

    Y. Bisk, R. Zellers, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 7432–7439, 2020

  5. [5]

    R. R. Bunel, I. Turkaslan, P. Torr, P. Kohli, and P. K. Mudigonda. A unified view of piecewise linear neural network verification. In Advances in Neural Information Processing Systems 31, pages 4795–4804, 2018

  6. [6]

    Chatterjee and L

    A. Chatterjee and L. R. Varshney. Towards optimal quantization of neural networks. In Pro- ceedings of the 2017 IEEE International Symposium on Information Theory, pages 1162–1166, 2017

  7. [7]

    J. Chen, C. Wu, S.-Q. Zhang, N. Li, L. Zhang, and Q. Zhang. Efficient ternary weight embedding model: Bridging scalability and performance. arXiv:2411.15438, 2024

  8. [8]

    Cheng, T

    J. Cheng, T. Lin, Z. Shen, and Q. Li. A unified framework for establishing the universal approxi- mation of transformer-type architectures. In Advances in Neural Information Processing Systems 38, 2025

  9. [9]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv:1905.10044, 2019

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457, 2018

  11. [11]

    Courbariaux, Y

    M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pages 3123–3131, 2015

  12. [12]

    Courbariaux, Y

    M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pages 3123–3131, 2015. 30

  13. [13]

    J. Deng, W. Dong, S. Richard, L.-J. Li, K. Li, and F.-F. Li. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  14. [14]

    Y. Ding, J. Liu, J. Xiong, and Y. Shi. On the universal approximability and complexity bounds of quantized ReLU neural networks. In Proceedings of the 7-th International Conference on Learning Representations, 2019

  15. [15]

    Mahoney, and Kurt Keutzer

    A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference. arxiv:2103.13630, 2021

  16. [16]

    Gonon, N

    A. Gonon, N. Brisebarre, R. Gribonval, and E. Riccietti. Approximation speed of quantized versus unquantized ReLU neural networks and beyond. IEEE Transactions on Information Theory, 69 (6):3960–3977, 2023

  17. [17]

    D. Gope, G. Dasika, and M. Mattina. Ternary hybrid neural-tree networks for highly constrained IoT applications. Proceedings of Machine Learning and Systems, 1:190–200, 2019

  18. [18]

    Y. Guo. A survey on methods and theories of quantized neural networks. arXiv:1808.04752, 2018

  19. [19]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  20. [20]

    Hornik, M

    K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approx- imators. Neural Networks, 2(5):359–366, 1989

  21. [21]

    E. B. Hunt. Artificial Intelligence. Academic Press, 2014

  22. [22]

    F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenext: Hardware-aware neural network design. arXiv:1803.10615, 2018

  23. [23]

    Kidger and T

    P. Kidger and T. Lyons. Universal approximation with deep narrow networks. In Proceedings of the 33rd Annual Conference on Learning Theory, pages 2306–2327, 2020

  24. [24]

    F. Li, B. Liu, X. Wang, B. Zhang, and J. Yan. Ternary weight networks. arXiv:1605.04711, 2016

  25. [25]

    H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems 30, 2017

  26. [26]

    Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stocke, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra. LLM-QAT: Data-free quantization aware training for large language models. arXiv:2305.17888, 2023. 31

  27. [27]

    Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong, E. Chang, Y. Shi, R. Kr- ishnamoorthi, L. Lai, and V. Chandra. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. In Proceedings of the 41st International Conference on Machine Learning, pages 32431–32454, 2024

  28. [28]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017

  29. [29]

    N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the 2018 European Conference on Computer Vision, pages 116–131, 2018

  30. [30]

    S. Ma, H. Wang, S. Huang, X. Zhang, Y. Hu, T. Song, Y. Xia, and F. Wei. Bitnet b1.58 2B4T technical report. arXiv:2504.12285, 2025

  31. [31]

    Pointer Sentinel Mixture Models

    S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv:1609.07843, 2016

  32. [32]

    Mertens and A

    S. Mertens and A. Engel. Vapnik-chervonenkis dimension of neural networks with binary weights. Physical Review E, 55:4478–4488, 1997

  33. [33]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv:1809.02789, 2018

  34. [34]

    Ouyang, T

    X. Ouyang, T. Ge, T. Hartvigsen, Z. Zhang, H. Mi, and D. Yu. Low-bit quantization favors un- dertrained LLMs: Scaling laws for quantized LLMs with 100T training tokens. arXiv:2411.17691, 2024

  35. [35]

    Sakaguchi, R

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  36. [36]

    M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv:1904.09728, 2019

  37. [37]

    X. Shen, P. Dong, L. Lu, Z. Kong, Z. Li, M. Lin, C. Wu, and Y. Wang. Agile-quant: Activation- guided quantization for faster inference of LLMs on the edge. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, pages 18944–18951, 2024

  38. [38]

    Szegedy, V

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016

  39. [39]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polo- sukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, 2017. 32

  40. [40]

    Voigtlaender

    F. Voigtlaender. The universal approximation theorem for complex-valued neural networks. Ap- plied and Computational Harmonic Analysis, 64:33–61, 2023

  41. [41]

    X. Wu, S. Huang, W. Wang, T. Song, L. Dong, Y. Xia, and F. Wei. Bitnet distillation. arXiv:2510.13998, 2025

  42. [42]

    Z. Yang, Y. Wang, K. Han, C. Xu, C. Xu, D. Tao, and C. Xu. Searching for low-bit weights in quantized neural networks. In Advances in Neural Information Processing Systems 33, pages 4091–4102, 2020

  43. [43]

    Yayla, M

    M. Yayla, M. Günzel, B. Ramosaj, and J. J. Chen. Universal approximation theorems of fully connected binarized neural networks. arXiv:2102.02631, 2021

  44. [44]

    C. Yun, S. Bhojanapalli, A. S. Rawat, S. Reddi, and S. Kumar. Are transformers universal approx- imators of sequence-to-sequence functions? In Proceedings of the 7th International Conference on Learning Representations, 2019

  45. [45]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? arXiv:1905.07830, 2019

  46. [46]

    Zhang, W.-C

    S.-H. Zhang, W.-C. Tang, C. Wu, P. Hu, N. Li, L.-J. Zhang, Q. Zhang, and S.-Q. Zhang. TernaryCLIP: Efficiently compressing vision-language models with ternary weights and distilled knowledge. arXiv:2510.21879, 2025

  47. [47]

    Zhang and Z.-H

    S.-Q. Zhang and Z.-H. Zhou. Theoretically provable spiking neural networks. In Advances in Neural Information Processing Systems 35, pages 19345–19356, 2022

  48. [48]

    A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv:1702.03044, 2017. 33