On the Expressive Power of Weight Quantization in Large Language Models

Shao-Qun Zhang

arxiv: 2606.22249 · v1 · pith:XWMJ4DGYnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-26 11:51 UTC · model grok-4.3


keywords: weight quantization · expressive power · universal approximation · large language models · model compression · bit precision · expressive degradation · quantized networks


The pith

Weight-quantized large language models lose universal approximation ability below 1.58 bits per weight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how encoding large language model weights in fewer bits affects the class of functions the model can represent. It argues that quantized models retain universal approximation above 1.58 bits per weight (about log2(3), the information content of a three-level weight) and collapse in expressive power below that threshold. It also claims that expressive capacity degrades polynomially as the bit count drops. These results frame quantization limits in terms of scaling and compression tradeoffs, and they give readers interested in model efficiency a concrete bound on how far compression can go before core capabilities erode.

Core claim

The paper establishes that 1.58-bit is the limiting precision for weight quantization by proving universal approximation holds for weight-quantized models above this level and expressive collapse occurs below it, while also showing that expressive capacity degrades polynomially with decreasing bit count.

What carries the argument

Restriction of network weights to finite discrete sets whose cardinality is governed by the bit precision, combined with analysis of the resulting function class's approximation properties.
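
As a rough illustration of what restricting weights to a finite discrete set means, the following Python sketch maps real-valued weights to the nearest value of a small codebook. The uniform codebook on [-1, 1], the nearest-value rounding rule, and the names `make_levels` and `quantize` are assumptions made for this example; the page does not state which quantizer the paper analyzes.

```python
def make_levels(num_levels, w_max=1.0):
    """Return num_levels evenly spaced values in [-w_max, w_max].

    Hypothetical uniform codebook, chosen only for illustration.
    """
    if num_levels < 2:
        raise ValueError("need at least two levels")
    step = 2.0 * w_max / (num_levels - 1)
    return [-w_max + i * step for i in range(num_levels)]


def quantize(weights, levels):
    """Map each weight to the nearest value in the finite set `levels`."""
    return [min(levels, key=lambda q: abs(w - q)) for w in weights]


weights = [0.83, -0.12, 0.40, -0.95]
binary = quantize(weights, make_levels(2))   # two levels: {-1, 1}
ternary = quantize(weights, make_levels(3))  # three levels: {-1, 0, 1}
print(binary)
print(ternary)
```

With two levels every weight is forced to a sign, while the three-level codebook can also switch a weight off by sending it to zero.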

If this is right

Models using fewer than 1.58 bits per weight cannot serve as universal approximators regardless of width or depth.
Expressive capacity scales polynomially downward with each reduction in bit precision.
Quantization-aware scaling laws must incorporate this precision-dependent degradation term.
Compression and acceleration techniques gain a theoretical floor below which further bit reduction yields qualitatively different models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The 1.58-bit threshold aligns with ternary weight sets such as {-1, 0, 1}, suggesting those representations sit at the boundary between full and collapsed expressivity.
Empirical tests measuring approximation error on simple function classes could directly observe the predicted polynomial rate.
Similar discrete-set arguments might apply to activation quantization or mixed-precision schemes.
The framework could be extended to quantify how quantization interacts with specific architectural choices like attention heads.
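
The link between the 1.58-bit figure and a three-value weight set is simple arithmetic, sketched below in Python. The helper names `bits_per_weight` and `packed_bits_per_weight` are hypothetical, and the packing scheme (several ternary weights sharing one byte) is an assumption added for illustration, not something the page attributes to the paper.

```python
import math


def bits_per_weight(num_levels):
    """Information content, in bits, of one weight drawn from num_levels values."""
    return math.log2(num_levels)


def packed_bits_per_weight(num_levels, group_size, container_bits):
    """Bits per weight when group_size weights share one container_bits-wide word."""
    if num_levels ** group_size > 2 ** container_bits:
        raise ValueError("group does not fit in the container")
    return container_bits / group_size


for levels in (2, 3, 4):
    print(levels, "levels:", round(bits_per_weight(levels), 3), "bits")

# Five ternary weights have 3**5 = 243 joint states, which fit in one byte (256 states).
print(packed_bits_per_weight(3, 5, 8))  # 1.6
```

Two levels cost exactly 1 bit, three levels cost log2(3), roughly 1.585 bits, and four levels cost 2 bits, so a fractional bit count such as 1.58 only makes sense as a statement about the number of levels.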

Load-bearing premise

Expressive power is defined such that the number of distinct quantization levels directly determines whether universal approximation is possible, with a sharp change at exactly three levels.

What would settle it

A construction of a 1-bit quantized network family that can still approximate any continuous function on a compact domain to arbitrary accuracy would disprove the claimed collapse threshold.

Figures

Figures reproduced from arXiv: 2606.22249 by Shao-Qun Zhang.

**Figure 2.** A schematic diagram of LLM architectures that approximate four hierarchical functions.

**Figure 3.** The example of approximating $f(x_1, x_2) = x_1 x_2 + x_1^2$ for illustrating the width computing of weight-quantized MLPs based on the Drawer principle. Theorem 4 shows that the weight-quantized ATTs with a linear architecture complexity maintain the universal approximation property. The proof idea of Theorem 4 is similar to that of Theorem 3, necessitating the proof of the following lemma. Lemma 2 If one equips…

**Figure 4.** The approximation gap bars of weight-quantized MLP within (a) 2-norm and (b) …

**Figure 5.** The relation curves between the ratio ln(ϵ/δ) and the number of quantization bits n within (a) 2-norm and (b) ∞-norm. In addition, it is also challenging to compute the weight gap δ layer by layer; in other words, δ is unknown. Thus, the results of Theorem 7 become the relation among n, ϵ, and model complexity. Here, we replace ϵ by classification accuracy and present an indicator ln(accuracy/model complexity…

**Figure 6.** (a) The accuracy bars of conducted models with respect to the number of bits …

**Figure 7.** The relation curves (a-b) between the model complexity and …

Original abstract

In recent years, weight quantization that encodes the learnable parameters of large language models in an $n$-bit format has garnered significant attention due to its potential for model compression and inference acceleration. Many practical techniques have been developed; however, the theoretical understanding of many aspects, especially the approximation and degradation of expressive power as the number of quantization bits decreases, remains unclear. In this paper, we provide a theoretical investigation into the expressive capability of large language models relative to the number of quantization bits. We argue that 1.58-bit is the limiting precision for weight quantization by establishing the universal approximation and expressive collapse properties of weight-quantized models with respect to the number of quantization bits. Additionally, we confirm that weight quantization leads to expressive degradation, in which the expressive capacity of weight-quantized models degrades polynomially as the number of quantization bits decreases. These theoretical findings provide a solid foundation for advancing weight quantization in the context of scaling laws and shed insights for future research in model compression and inference acceleration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract claims a 1.58-bit limit for LLM weight quantization, with universal approximation above it, expressive collapse below it, and polynomial degradation of capacity as the bit count decreases. No proofs or definitions are visible, so the argument cannot be checked.


The one thing to know is that the abstract states 1.58 bits (ternary quantization) as the hard lower limit for weight-quantized LLMs to retain universal approximation, with expressive capacity degrading polynomially as bits drop further. That is the central result being offered.

The paper applies standard approximation-theory ideas to the quantized case and tries to tie the cardinality of the weight set directly to bit precision. It does connect this to practical concerns like scaling laws and inference hardware, which is a reasonable framing even if the execution details are not shown here.

The main soft spot is exactly the one the stress test flags: the threshold depends on defining expressive power so that the functional span is governed by |Q| = 2^b with a sharp cutoff at three levels. If the paper uses a different measure (Rademacher complexity, non-uniform codebooks, or something else), the collapse point moves. Without the lemmas or the precise statement of what counts as approximation for these networks, it is impossible to tell whether the argument holds or just restates a modeling choice. Soundness cannot be assessed from the abstract alone.

The work is aimed at researchers who care about theoretical bounds on model compression rather than immediate engineering tricks. A reader who already follows quantization theory might extract a useful bound if the full derivations are clean and the definitions are stated explicitly.

It deserves a serious referee because the claim is specific enough to be checked and the topic matters for efficient LLMs. I would send it out for review once the full text is in hand, with the expectation that the definitions and proof steps will need close scrutiny.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that 1.58 bits is the limiting precision for weight quantization in large language models. It establishes universal approximation and expressive collapse properties with respect to the number of quantization bits, and shows that expressive capacity degrades polynomially as the number of bits decreases. These results are positioned as providing a theoretical foundation for quantization in scaling laws.

Significance. If the central claims are rigorously derived, the identification of a sharp threshold at log2(3) bits and the polynomial degradation rate would supply a concrete theoretical limit that could inform practical quantization choices and scaling analyses for model compression.

major comments (2)

[Abstract] The claim of a sharp 1.58-bit threshold for universal approximation and expressive collapse is load-bearing for the entire contribution, yet no definition of expressive power, no statement of the network class (e.g., ReLU networks with finite weights), and no proof sketch or cardinality argument are supplied; without these the threshold cannot be verified and may be an artifact of an implicit modeling choice that equates |Q| = 2^b directly with functional span.
[Abstract] The polynomial degradation statement is presented as a derived result, but no functional form, no dependence on network depth or width, and no supporting derivation or theorem statement appear; this prevents assessment of whether the rate is independent of other modeling assumptions.

minor comments (1)

[Abstract] The term 'large language models' is used throughout, but the stated properties appear to apply to general feed-forward networks; clarify whether the results rely on transformer-specific structure or hold more broadly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments on the abstract. We address each point below, noting that the abstract summarizes results whose details appear in the full manuscript.


Referee: [Abstract] The claim of a sharp 1.58-bit threshold for universal approximation and expressive collapse is load-bearing for the entire contribution, yet no definition of expressive power, no statement of the network class (e.g., ReLU networks with finite weights), and no proof sketch or cardinality argument are supplied; without these the threshold cannot be verified and may be an artifact of an implicit modeling choice that equates |Q| = 2^b directly with functional span.

Authors: The abstract is a concise summary. The manuscript defines expressive power via the universal approximation property for continuous functions on compact sets and specifies the network class as ReLU networks whose weights are drawn from a finite quantization set Q with cardinality 2^b. The threshold at log2(3) follows from a cardinality argument establishing that the representable function class becomes strictly smaller than the target class for b < log2(3), producing collapse; this is shown in the main theorems and is independent of equating |Q| directly with functional span. We will expand the abstract with a one-sentence definition of expressive power and the network class. Revision: yes.

Referee: [Abstract] The polynomial degradation statement is presented as a derived result, but no functional form, no dependence on network depth or width, and no supporting derivation or theorem statement appear; this prevents assessment of whether the rate is independent of other modeling assumptions.

Authors: The polynomial degradation result is derived in the manuscript, with the functional form depending polynomially on 2^b and with explicit dependence on depth and width appearing in the exponent. The supporting theorem and proof are given in the body. We will add a parenthetical reference to the relevant theorem in the abstract to improve traceability. Revision: yes.
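
The counting idea mentioned in the first response can be made concrete on a toy model. The Python sketch below is not the paper's proof, which is not visible here; it only shows that a fixed architecture with P weights drawn from a finite set Q realizes at most |Q|^P functions, and that a three-level set reaches functions a two-level set cannot. The tiny network (scalar input, two hidden ReLU units, integer weights), the evaluation grid, and the name `distinct_functions` are all assumptions for this illustration.

```python
from itertools import product


def relu(z):
    return z if z > 0 else 0


def distinct_functions(levels, hidden=2, grid=(-2, -1, 0, 1, 2)):
    """Enumerate every weight setting of a tiny ReLU net and collect its outputs on grid.

    The net computes sum_j a[j] * relu(w[j] * x + b[j]) with all of w, b, a in `levels`.
    """
    seen = set()
    for params in product(levels, repeat=3 * hidden):
        w = params[:hidden]
        b = params[hidden:2 * hidden]
        a = params[2 * hidden:]
        outputs = tuple(
            sum(a[j] * relu(w[j] * x + b[j]) for j in range(hidden))
            for x in grid
        )
        seen.add(outputs)
    return seen


binary = distinct_functions((-1, 1))
ternary = distinct_functions((-1, 0, 1))
print(len(binary), "distinct binary functions, upper bound", 2 ** 6)
print(len(ternary), "distinct ternary functions, upper bound", 3 ** 6)

# relu(x) on the grid needs one unit switched off, which requires a zero weight.
plain_relu = (0, 0, 0, 1, 2)
print(plain_relu in binary, plain_relu in ternary)
```

Every binary configuration is also a ternary one, so the binary function set is contained in the ternary set; the zero level is what lets the ternary net drop a unit or a bias. This toy count says nothing about what happens as width and depth grow, which is the regime the paper's theorems address.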

Circularity Check

0 steps flagged

No circularity; claims presented as derived theoretical properties without reduction to inputs by construction


The abstract frames the 1.58-bit limit and polynomial expressive degradation as results obtained by establishing universal approximation and expressive collapse properties of weight-quantized models. No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would make any central claim equivalent to its inputs by construction. The derivation is positioned as independent from external approximation theory benchmarks, consistent with a self-contained analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the central claims rest on unspecified definitions of expressive power, universal approximation for quantized networks, and the precise mapping from quantization cardinality to bit count. No free parameters, invented entities, or explicit axioms are listed in the visible text.

axioms (1)

Domain assumption: Expressive power of a neural network can be rigorously quantified such that quantization to a discrete weight set of cardinality 3 preserves universal approximation while cardinality 2 triggers collapse.
This modeling choice is required for the 1.58-bit threshold to emerge; it is invoked implicitly when the paper states the limiting precision.


