A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

Lei Dong

arxiv: 2605.18933 · v1 · pith:VJ2QILQHnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3

classification 💻 cs.LG

keywords ternary quantization · ReLU · RMSNorm · sign-magnitude asymmetry · weight perturbations · pre-norm transformers · geometric analysis · Bussgang constant

The pith

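In a two-layer ReLU + RMSNorm block with i.i.d. Gaussian weights, sign flips produce about 2.75 times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, in the limit of small flip rate.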

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sign-magnitude decomposition of weight perturbations reveals a directional asymmetry created by ReLU in a two-layer model with Gaussian weights. RMSNorm then selectively projects this asymmetry into transverse output energy, making sign flips far costlier than magnitude adjustments as the flip rate approaches zero. Sign-quantization error aligns with the less damaging sign-preserving type and passes through ReLU with almost no change in its radial fraction. This geometry accounts for the observed tolerance of pre-norm Transformers to ternary quantization, while real-model deviations trace to outlier features that concentrate energy and violate the delocalized-entry assumption.

Core claim

Sign-flips produce π/(π-2) ≈ 2.75 times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm as the flip rate p → 0. Sign-quantization error is a sign-preserving perturbation whose angular alignment satisfies cos² → 2/π; after ReLU its radial fraction remains 0.365, matching the pre-ReLU value 1-2/π to within 0.4 percent, rendering ReLU approximately transparent to the error.
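The two constants in this claim can be checked numerically. The sketch below is a minimal illustration, not the paper's code: it assumes i.i.d. standard Gaussian weights, quantizes them to `c * sign(w)` with the least-squares scale `c = mean(|w|)`, and measures the squared cosine between the weights and their signs along with the radial fraction of the quantization error. The function name and sample size are hypothetical.

```python
import numpy as np

def sign_quantization_alignment(n=1_000_000, seed=0):
    """Monte Carlo estimate of cos^2(w, sign(w)) and of the radial
    fraction of the sign-quantization error e = w - c * sign(w)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)        # assumed i.i.d. Gaussian weights
    s = np.sign(w)
    c = np.abs(w).mean()              # least-squares scale for c * sign(w)
    e = w - c * s                     # quantization error
    cos2_ws = (w @ s) ** 2 / ((w @ w) * (s @ s))
    radial_e = (e @ w) ** 2 / ((e @ e) * (w @ w))
    return cos2_ws, radial_e

cos2_ws, radial_e = sign_quantization_alignment()
print(f"cos^2(w, sign w)      = {cos2_ws:.4f}  (2/pi     = {2 / np.pi:.4f})")
print(f"error radial fraction = {radial_e:.4f}  (1 - 2/pi = {1 - 2 / np.pi:.4f})")
```

With the least-squares scale the error is orthogonal to `sign(w)`, so the two fractions sum to one; this is why the pre-ReLU radial fraction of the error is 1 − 2/π.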

What carries the argument

ReLU-induced hidden-space directional asymmetry between sign-flip and magnitude perturbations, selectively exposed by the transverse-projection Fréchet derivative of RMSNorm.
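
For orientation, the transverse projection named here can be written out for a simplified RMSNorm with no learned gain and no ε term. The symbols d (hidden width), h, Δ, and ĥ are notation assumed for this sketch, not taken from the paper:

```latex
\begin{align*}
\mathrm{RMSNorm}(h) &= \sqrt{d}\,\frac{h}{\lVert h\rVert},
  \qquad \hat{h} = \frac{h}{\lVert h\rVert}, \\
D\,\mathrm{RMSNorm}(h)[\Delta] &= \frac{\sqrt{d}}{\lVert h\rVert}\,
  \bigl(I - \hat{h}\hat{h}^{\top}\bigr)\,\Delta, \\
\bigl\lVert D\,\mathrm{RMSNorm}(h)[\Delta]\bigr\rVert^{2} &=
  \frac{d}{\lVert h\rVert^{2}}\,
  \Bigl(\lVert\Delta\rVert^{2} - \bigl(\hat{h}^{\top}\Delta\bigr)^{2}\Bigr).
\end{align*}
```

To first order the radial component of Δ is removed, so only the transverse energy of a hidden-space perturbation reaches the normalized output.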

If this is right

Ternary quantization error stays small because it behaves like a sign-preserving perturbation that ReLU largely passes through unchanged.
A single sign-flip on an input feature of amplitude α amplifies post-ReLU energy by a factor R ≈ n α² relative to a delocalized entry.
Count-matched negative-log-likelihood leverage on TinyLlama stabilizes near 10× at low flip rates, matching the per-entry prediction.
The all-column NLL ratio reaches 5.0×, within the bound R_col ≤ 19, while the much larger 67× perplexity gap is attributed to metric nonlinearity.
Heavy-tailed outlier amplitudes (median 0.024, max 0.26 at layer 12) explain why real models deviate from the delocalized two-layer prediction.
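
The leverage scaling in this list can be tabulated directly. The sketch below is illustrative only: `leverage_leading` is the R ≈ nα² form stated above, while `leverage_corrected` is an assumed reconstruction that divides the outlier energy 2α²(1 − 4α/(3π)) quoted in the appendix excerpts by a leading-order delocalized energy of 2/n. Both function names are hypothetical, and n = 2048 is the TinyLlama hidden dimension given in the source.

```python
import math

def leverage_leading(alpha, n):
    """Leading-order leverage R ~ n * alpha^2 from the text."""
    return n * alpha ** 2

def leverage_corrected(alpha, n):
    """Assumed reconstruction: outlier energy 2*alpha^2*(1 - 4*alpha/(3*pi))
    divided by a delocalized per-entry energy of roughly 2/n."""
    return n * alpha ** 2 * (1.0 - 4.0 * alpha / (3.0 * math.pi))

n = 2048                       # TinyLlama hidden dimension (from the source)
alpha_c = 1.0 / math.sqrt(n)   # crossover scale where R is about 1
print(f"alpha_c = {alpha_c:.4f}")
for alpha in (0.05, 0.10, 0.20, 0.30, 0.50):
    print(f"alpha={alpha:.2f}  n*alpha^2={leverage_leading(alpha, n):7.2f}  "
          f"corrected={leverage_corrected(alpha, n):7.2f}")
```

Under these assumptions the corrected column comes out close to the "R via Eq. (28)" values in the excerpted Table 20 for small α, which suggests, but does not confirm, that Eq. (28) has this form.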

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ReLU-RMSNorm geometry could be tested on other activation-normalization combinations to predict their quantization sensitivity.
Tracking how the 2.75 factor compounds or saturates across many layers would clarify scaling behavior beyond the two-layer case.
The Bussgang constant and half-space structure suggest analogous sign-magnitude analyses for binary or low-bit quantization in attention blocks.

Load-bearing premise

Weights are i.i.d. Gaussian and entries remain delocalized without dominant outliers.

What would settle it

Measure the ratio of transverse output energies produced by sign-flip versus magnitude perturbations of equal Frobenius norm in a two-layer ReLU-RMSNorm network initialized with i.i.d. Gaussian weights, at small flip rates.
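
A rough version of this measurement can be sketched in NumPy. The code below is not the paper's experiment: it assumes a single square first-layer matrix with N(0, 1/n) entries, a random unit input, Bernoulli(p) sign flips, and a magnitude perturbation proportional to sign(W) rescaled to the same Frobenius norm. Transverse energy is measured in hidden space relative to the unperturbed vector, as a stand-in for RMSNorm's derivative; the second layer is omitted. All function names are hypothetical.

```python
import numpy as np

def transverse_energy(delta, ref):
    """Energy of delta orthogonal to ref (the part a gain-free
    RMSNorm derivative passes, up to an overall scale)."""
    u = ref / np.linalg.norm(ref)
    perp = delta - (delta @ u) * u
    return float(perp @ perp)

def one_trial(n, p, rng):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # i.i.d. Gaussian weights
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                         # delocalized unit input
    mask = rng.random((n, n)) < p
    dW_flip = -2.0 * W * mask                      # flips the sign of masked entries
    dW_mag = np.sign(W)                            # sign-preserving direction
    dW_mag *= np.linalg.norm(dW_flip) / np.linalg.norm(dW_mag)  # match Frobenius norm
    z = W @ x
    a = np.maximum(z, 0.0)
    out = {}
    for name, dW in (("flip", dW_flip), ("mag", dW_mag)):
        dz = dW @ x
        da = np.maximum(z + dz, 0.0) - a
        out[name] = (transverse_energy(dz, z), transverse_energy(da, a))
    return out

def energy_ratio(n=1024, p=0.01, trials=8, seed=0):
    """Returns (pre-ReLU ratio, post-ReLU ratio) of flip to magnitude
    transverse energy, accumulated over independent trials."""
    rng = np.random.default_rng(seed)
    sums = np.zeros((2, 2))    # rows: flip, mag; columns: pre-ReLU, post-ReLU
    for _ in range(trials):
        r = one_trial(n, p, rng)
        sums[0] += r["flip"]
        sums[1] += r["mag"]
    return sums[0] / sums[1]

pre_ratio, post_ratio = energy_ratio()
print(f"pre-ReLU  transverse ratio: {pre_ratio:.3f}")
print(f"post-ReLU transverse ratio: {post_ratio:.3f}")
print(f"pi / (pi - 2)             : {np.pi / (np.pi - 2):.3f}")
```

In this simplified setup the pre-ReLU ratio should sit near (1 − p)·π/(π − 2), because the flip perturbation is almost entirely transverse while the magnitude perturbation keeps a radial fraction of about 2/π. The post-ReLU figure is printed for comparison only; whether it reproduces the paper's output-level 2.75 depends on model details not reproduced here.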

Figures

Figures reproduced from arXiv: 2605.18933 by Lei Dong.

**Figure 2.** Multi-layer compounding refutation (p=0.01). Dashed line: the refuted hypothesis C_L = (π/(π−2))^L, which predicts exponential growth to ∼437 at L=6. All three measured architectures, V2 (i.i.d., no residual), Exp B (trained TinyLlama, ReLU), and V3 (i.i.d., with residual), show flat or declining C_L, with observed C_6 ∈ [1.3, 3.1]. Shaded regions: 95% CI. Residual connection dilution. Comparing V2 and V3 at L …

read the original abstract

Pre-norm Transformers with RMSNorm tolerate ternary {-1,0,+1} weight quantization with surprisingly small loss (Ma et al., 2024). We give a geometric explanation via sign-magnitude decomposition of weight perturbations. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce $\pi/(\pi-2) \approx 2.75$ times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, as the flip rate $p \to 0$ (Theorem 3). The mechanism: ReLU creates a hidden-space directional asymmetry between the two perturbation types, which RMSNorm's transverse-projection Fr\'echet derivative selectively exposes. Sign-quantization error is itself a sign-preserving perturbation with angular alignment $\cos^2 \to 2/\pi$ (Theorem 4); its post-ReLU radial fraction ($0.365$) matches the pre-ReLU value $1-2/\pi$ within $0.4\%$, so ReLU is approximately transparent to ternary error. Multi-layer compounding of the $2.75\times$ factor is not experimentally supported; the gap to real-model sign sensitivity arises from outlier features violating delocalization. For an input dimension with amplitude $\alpha$, a single sign-flip produces post-ReLU energy amplified by $R \approx n\alpha^2$ relative to a delocalized entry. On TinyLlama-1.1B, at linear response ($p \leq 0.5\%$), count-matched NLL leverage stabilizes at $\sim 10\times \approx n\mathbb{E}[\alpha^2]$, matching the per-entry theory; the all-column NLL ratio of $5.0\times$ falls within $R_{\mathrm{col}} \leq 19$ ($67\times$ PPL gap reflects metric nonlinearity). Measured outlier $\alpha$ at layer 12 (median $0.024$, max $0.26$) confirms heavy-tailed concentration. The Bussgang constant $2/\pi$, RMSNorm geometry, and ReLU half-space structure together explain sign-magnitude asymmetry in pre-norm models, with $R \propto n\alpha^2$ accounting for real-model deviations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies explicit geometric theorems explaining why sign-preserving ternary errors cause less damage than sign flips in a ReLU+RMSNorm block, with a partial match to TinyLlama data, but the multi-layer extension via outliers lacks propagation analysis.

The main point is that this work gives a geometric account of sign-magnitude asymmetry under ternary quantization in pre-norm blocks. In the two-layer i.i.d. Gaussian case, sign flips produce roughly 2.75 times more transverse output energy than equal-norm magnitude changes, and ReLU passes the sign-preserving quantization error through with a radial fraction that matches the pre-ReLU value within 0.4 percent. That is the core new material: Theorems 3 and 4 formalize the transverse energy ratio from ReLU directional asymmetry and the RMSNorm projection, plus the angular alignment of the error term tied to the Bussgang constant.

The TinyLlama section adds a concrete check by linking measured outlier amplitudes to the observed NLL leverage at low flip rates, which lines up with the per-entry amplification factor the author derives. Those pieces are useful and go beyond the earlier empirical tolerance reports.

The soft spots sit in the bridge to real models. The derivations stay inside the simplified two-layer setting, and the paper itself flags that multi-layer compounding of the 2.75 factor is not experimentally supported. Real deviations are attributed to outlier features that produce an R ≈ nα² boost, with median and max α values reported from one layer. Yet there is no derivation showing how these local effects accumulate across stacked layers, or why the column NLL ratio sits at 5× while the PPL gap reaches 67×. The outlier story explains the gap after the fact but does not come with a propagation argument, so it remains the least secure link when moving from the theorems to practical quantization tolerance.

This paper is for readers working on theoretical quantization or geometric analyses of transformer blocks. Someone focused on why certain weight perturbations are tolerated in efficient inference will find the theorems and the radial-fraction match worth their time. It has enough formal content and partial empirical grounding to merit a serious referee, though the multi-layer and outlier parts would need more work in revision. I would send it out for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that pre-norm Transformers tolerate ternary {-1,0,+1} weight quantization with small loss due to sign-magnitude asymmetry arising from ReLU + RMSNorm geometry. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-Frobenius-norm sign-preserving magnitude perturbations as p → 0 (Theorem 3). Sign-quantization error is a sign-preserving perturbation with cos² → 2/π alignment (Theorem 4), and ReLU is approximately transparent to it (post-ReLU radial fraction 0.365 matches pre-ReLU 1-2/π within 0.4%). Real-model deviations (e.g., TinyLlama-1.1B) are attributed to outlier features with amplitude α producing amplification R ≈ nα², which matches observed ~10× NLL leverage at low p and explains the 5.0× column NLL ratio.

Significance. If the two-layer geometric results hold and the outlier mechanism is placed on firmer footing, the work supplies a concrete geometric account of why RMSNorm + ReLU blocks are unusually tolerant to sign errors under ternary quantization. The explicit derivation of the π/(π-2) energy ratio, the Bussgang alignment, and the measured α statistics on TinyLlama constitute useful, falsifiable contributions that could inform quantization design for pre-norm models.

major comments (2)

[Theorem 3] Theorem 3 and the subsequent multi-layer discussion: the π/(π-2) transverse-energy ratio is derived under i.i.d. Gaussian weights and delocalized entries in a two-layer model, yet the manuscript invokes this ratio to explain tolerance in deep Transformers without a propagation analysis showing how the local 2.75× asymmetry compounds across layers or interacts with residual connections.
[TinyLlama analysis] TinyLlama analysis (NLL leverage and R ≈ nα²): the claim that outlier amplitude α (median 0.024, max 0.26 at layer 12) produces R ≈ nE[α²] that accounts for the observed ~10× per-entry NLL leverage and 5.0× column NLL ratio rests on post-hoc attribution; no derivation or simulation demonstrates that the measured α distribution suffices to close the gap from the two-layer theory to the 67× PPL discrepancy.

minor comments (2)

[Abstract] Abstract: the definition of R_col ≤ 19 and its relation to the per-column NLL ratio should be stated explicitly rather than introduced only in the experimental paragraph.
Notation: the symbol R is used both for the per-entry amplification and for the column-level bound; a brief clarifying sentence would prevent confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for recognizing the geometric contributions of the local analysis. We respond to each major comment below.

Referee: [Theorem 3] Theorem 3 and the subsequent multi-layer discussion: the π/(π-2) transverse-energy ratio is derived under i.i.d. Gaussian weights and delocalized entries in a two-layer model, yet the manuscript invokes this ratio to explain tolerance in deep Transformers without a propagation analysis showing how the local 2.75× asymmetry compounds across layers or interacts with residual connections.

Authors: Theorem 3 derives the local 2.75× transverse-energy ratio under the two-layer i.i.d. Gaussian setting with delocalized entries. The manuscript invokes this ratio to identify the geometric origin of sign-magnitude asymmetry within the ReLU + RMSNorm block itself, which forms the core computational unit of pre-norm Transformers. The text already states that multi-layer compounding is not experimentally supported and that real-model gaps arise from outlier features that violate delocalization. A full propagation analysis through residual streams would be a natural extension but is outside the scope of the present local geometric study. We will add a clarifying sentence in the discussion to emphasize that the ratio characterizes the local tolerance mechanism rather than supplying a global multi-layer prediction.

revision: partial

Referee: [TinyLlama analysis] TinyLlama analysis (NLL leverage and R ≈ nα²): the claim that outlier amplitude α (median 0.024, max 0.26 at layer 12) produces R ≈ nE[α²] that accounts for the observed ~10× per-entry NLL leverage and 5.0× column NLL ratio rests on post-hoc attribution; no derivation or simulation demonstrates that the measured α distribution suffices to close the gap from the two-layer theory to the 67× PPL discrepancy.

Authors: The scaling R ≈ nα² follows directly from the effect of a localized sign-flip on a single input coordinate of amplitude α: the post-ReLU energy contribution is amplified by that factor relative to a delocalized entry. The measured α distribution at layer 12 yields nE[α²] ≈ 10, which matches the observed per-entry NLL leverage at linear response. The column-wise NLL ratio of 5.0× lies inside the bound R_col ≤ 19 obtained from the maximum observed α. The paper attributes the remaining 67× PPL discrepancy to the nonlinear mapping from small NLL changes to perplexity at low p. While an end-to-end simulation of the full PPL gap is not included, the derived scaling together with the empirical α statistics provides a quantitative bridge from the delocalized two-layer theory to the measured sensitivities. We will include the explicit derivation of R in the appendix.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; geometric claims derive independently

The paper establishes Theorems 3 and 4 through direct geometric analysis of a two-layer ReLU + RMSNorm block with i.i.d. Gaussian weights, yielding the π/(π-2) transverse energy ratio for sign-flips and the 2/π angular alignment for quantization error from first-principles expectations over ReLU half-spaces and RMSNorm projections. These constants are standard in Gaussian signal processing (e.g., Bussgang theorem) and do not rely on fitting to the paper's target results or self-referential definitions. The outlier amplification factor R ≈ nα² is introduced as an explanatory mechanism to bridge to real models like TinyLlama, where it is confirmed by measured α distributions rather than used to derive or fit the core asymmetry. No self-citations are load-bearing for the central claims, no ansatzes are smuggled, and the derivation chain remains self-contained under the stated model assumptions without reducing predictions to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Central claims rest on the i.i.d. Gaussian weight assumption and two-layer simplification for the energy ratios; real-model validation incorporates measured outlier amplitudes rather than additional fitted parameters for the main geometric result.

free parameters (1)

outlier amplitude α
Measured median 0.024 and max 0.26 at layer 12 on TinyLlama to compute amplification R ≈ nα² for explaining deviations from basic theory.

axioms (2)

domain assumption: Weights are i.i.d. Gaussian random variables
Invoked to derive expected transverse output energy ratios for sign-flip versus magnitude perturbations in the two-layer model (Theorem 3).
domain assumption: Analysis restricted to the linear-response regime with small flip rate p ≤ 0.5%
Used to observe stabilization of count-matched NLL leverage at ~10×, matching per-entry theory.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 9 internal anchors

[1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450,

[2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. Qwen technical report. arXiv:2309.16609,

[3]

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

arXiv:1511.00363. Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS,

[4]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

arXiv:1803.03635. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations (ICLR),

[5]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

arXiv:2210.17323. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30,

[6]

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

arXiv:1609.07061. Michel Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society,

[7]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

arXiv:2306.00978. Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv:2402.17764,

[8]

Robert Price

arXiv:2504.05357. Robert Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72,

[9]

Disentangling direction and magnitude in transformer representations: A double dissociation through L2-matched perturbation analysis

Mangadoddi Srikar Vardhan and Lekkala Sai Teja. Disentangling direction and magnitude in transformer representations: A double dissociation through L2-matched perturbation analysis. arXiv:2602.11169,

[10]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit Transformers for large language models. arXiv:2310.11453,

[11]

Smoothquant: Accurate and efficient post-training quantization for large language models

arXiv:2211.10438. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning (ICML), pages 10524–10533,

[12]

Zhang, B

arXiv:2002.04745. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2 technical report. arXiv:2407.10671,

[13]

Root Mean Square Layer Normalization

arXiv:1910.07467. Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In NeurIPS,
[14]

$= \rho\sqrt{1-\rho^2} = 2\sqrt{p} - 5p^{3/2} + O(p^{5/2})$, $\arctan(a) = \pi/2 - \arctan(1/a) = \pi/2 - 2\sqrt{p} - p^{3/2}/3 + O(p^{5/2})$. Substituting: the constant terms cancel ($1/4 - \frac{1}{2\pi}\cdot\frac{\pi}{2} = 1/4 - 1/4 = 0$), the $\sqrt{p}$ terms cancel, and collecting $p^{3/2}$ coefficients gives $5/(2\pi) + 1/(6\pi) = 16/(6\pi) = 8/(3\pi)$: $\beta_{\mathrm{flip}} = \frac{8p^{3/2}}{3\pi} + O(p^{5/2})$. (23) Footnote 3: Strictly, the uniformity of CLT convergence in $p$ (i.e., $\sup_{p\in(\epsilon,1/2)} |\beta_{\mathrm{flip}}(p,$...

[15]

Here $\mathbb{E}[a_i] = \mathbb{E}[\mathrm{ReLU}(z_i)] > 0$, so the uncentered and Pearson correlations differ; the relevant quantity for the vector-level $\cos^2(\Delta a, a)$ is the uncentered version. Lemma A.1 (Bussgang-ReLU correlation). Under assumptions (A1)–(A2), let $z_i = W_{1,i}\hat{x}$, $\Delta z^{\mathrm{mag}}_i = \delta\cdot\mathrm{sign}(W_{1,i})\cdot\hat{x}$ with $\delta^2 = 4p/n$, $a_i = \mathrm{ReLU}(z_i)$, and $\Delta a_i = \mathrm{ReLU}(z_i + \Delta z_i) - \mathrm{ReLU}(z_i)$. Then as $n \to \infty$: $\mathrm{Corr}(\Delta a_i, a_i)$...

[16]

D Additional Experimental Data. D.1 Multi-Layer Detailed Results (V2, p=0.01, 20 seeds). Table 15 shows per-layer $C_L$ values for the V2 architecture (i.i.d. weights, no residual connection). The monotone decline contradicts exponential compounding. Table 15: V2 architecture per-layer results. $C_L$ shows monotone decline, contradicting the exponential compoun...

[17]

Table 18: Measured sign-perturbation radial fractions $R_{\mathrm{sign}}$. Theory predicts $R_{\mathrm{sign}} \approx p$; convergence improves with dimension.

Dim     p=0.01   p=0.02   p=0.05   p=0.10
256     0.0217   0.0308   0.0565   0.0969
512     0.0156   0.0248   0.0509   0.0917
1024    0.0125   0.0218   0.0481   0.0891
2048    0.0110   0.0202   0.0465   0.0876

Table 19: Measured magnitude-perturbation radial fractions $R_{\mathrm{mag}}$. Theory p...

[18]

Consistency with Theorem 3. Main text Step A6 gives total post-ReLU energy $E(p) = 2p\,(1 - 4\sqrt{p}/(3\pi) + O(p))$ on unit-variance scale. Setting $p = \alpha^2$: $E_{\mathrm{outlier}} = 2\alpha^2(1 - 4\alpha/(3\pi))$, matching $f(1 - 2\alpha^2)$. E.6 Numerical Predictions. Table 20 gives predictions for $n = 2048$ (TinyLlama hidden dimension). Table 20: Predicted outlier leverage ratio $R(\alpha, 2048)$ from Theorem

[19]

α       Energy α²   P_flip   R (exact)   R via Eq. (28)
0.05    0.25%       3.2%     5.0         5.0
0.10    1.0%        6.4%     19.5        19.6
0.20    4.0%        12.8%    73          75
0.30    9.0%        19.5%    162         161
0.50    25%         33.3%    404         404

E.7 Phase Crossover: Perturbative to Non-Perturbative. The crossover occurs at the crossover scale $\alpha_c = 1/\sqrt{n}$, where a single outlier flip has the same impact as a single non-outlier flip ($R(\alpha_c, n) \approx 1$). Fo...

[20]

• Outlier regime ($\alpha \gg \alpha_c$): Single-entry flip perturbs $z_i$ by $O(\alpha/\sqrt{n}) \sim \sigma_z$. Gate-flip probability $\sim (2/\pi)\arcsin\alpha = O(1)$. Gate-flip energy dominates smooth energy. The transition is a smooth crossover (no critical exponent or discontinuity): $R = n\alpha^2$ grows continuously from $R = 1$ at $\alpha = \alpha_c$ to $R \sim n$ at $\alpha \sim 1$. E.8 Connection to Experiments. Theory vs. observation. Theorem 6 predicts per-en...
