A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3
The pith
In ReLU + RMSNorm blocks, sign flips produce about 2.75 times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sign-flips produce π/(π-2) ≈ 2.75 times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm as the flip rate p → 0. Sign-quantization error is a sign-preserving perturbation whose angular alignment satisfies cos² → 2/π; after ReLU its radial fraction remains 0.365, matching the pre-ReLU value 1-2/π to within 0.4 percent, rendering ReLU approximately transparent to the error.
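The constants in this claim can be checked numerically. The sketch below is not the paper's code: it assumes standard Gaussian weights and estimates cos² between a weight vector and its sign vector by Monte Carlo, which should approach the Bussgang constant 2/π. It also prints 1 - 2/π and π/(π-2). Reading 2.75 as 1/(1 - 2/π), that is, a nearly fully transverse sign-flip perturbation against a transverse fraction of 1 - 2/π for the sign-preserving one, is an interpretation and is not spelled out in this summary. The function name `sign_alignment_cos2` is hypothetical.

```python
import math
import random

def sign_alignment_cos2(n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of cos^2 between a Gaussian vector w and sign(w)."""
    rng = random.Random(seed)
    dot = 0.0       # <w, sign(w)> = sum of |w_i|
    norm_w2 = 0.0   # ||w||^2
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)  # assumption: i.i.d. N(0, 1) weights
        dot += abs(w)
        norm_w2 += w * w
    # ||sign(w)||^2 = n, since w == 0 has probability zero.
    return dot * dot / (norm_w2 * n)

bussgang = 2.0 / math.pi
cos2 = sign_alignment_cos2()
ratio = 1.0 / (1.0 - bussgang)  # algebraically equal to pi / (pi - 2)

print(f"cos^2 estimate: {cos2:.4f} (2/pi = {bussgang:.4f})")
print(f"1 - 2/pi      : {1.0 - bussgang:.4f}")
print(f"pi / (pi - 2) : {math.pi / (math.pi - 2.0):.4f}")
```

The estimate depends only on the Gaussian assumption, so it says nothing about trained weights with outliers.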
What carries the argument
ReLU-induced hidden-space directional asymmetry between sign-flip and magnitude perturbations, selectively exposed by the transverse-projection Fréchet derivative of RMSNorm.
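For readers unfamiliar with the phrase "transverse-projection Fréchet derivative", the following worked equation sketches what it means for a gain-free RMSNorm with no epsilon term. These simplifications, and the symbols $h$ (pre-norm output), $\Delta$ (perturbation of $h$), and $n$ (output dimension), are assumptions made for illustration and may differ from the paper's notation.

```latex
\begin{align}
\mathrm{RMSNorm}(h) &= \sqrt{n}\,\frac{h}{\lVert h\rVert},
\qquad \hat{h} = \frac{h}{\lVert h\rVert}, \\
D\,\mathrm{RMSNorm}(h)[\Delta] &= \frac{\sqrt{n}}{\lVert h\rVert}
\bigl(I - \hat{h}\hat{h}^{\top}\bigr)\Delta, \\
\bigl\lVert D\,\mathrm{RMSNorm}(h)[\Delta]\bigr\rVert^{2} &=
\frac{n}{\lVert h\rVert^{2}}
\Bigl(\lVert\Delta\rVert^{2} - \langle\Delta,\hat{h}\rangle^{2}\Bigr).
\end{align}
```

To first order, only the component of $\Delta$ orthogonal to $h$ survives normalization, so two perturbations with equal total energy can produce different output changes if one is more radially aligned than the other.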
If this is right
- Ternary quantization error stays small because it behaves like a sign-preserving perturbation that ReLU largely passes through unchanged.
- A single sign-flip on an input feature of amplitude α amplifies post-ReLU energy by a factor R ≈ n α² relative to a delocalized entry.
- Count-matched negative-log-likelihood leverage on TinyLlama stabilizes near 10× at low flip rates, matching the per-entry prediction.
- The all-column NLL ratio reaches 5.0× while the perplexity gap is 67×, which the paper attributes to metric nonlinearity.
- Heavy-tailed outlier amplitudes (median 0.024, max 0.26 at layer 12) explain why real models deviate from the delocalized two-layer prediction.
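The leverage scaling in the list above is simple enough to evaluate directly. This is a minimal sketch, assuming n = 2048 (the TinyLlama hidden dimension quoted in the appendix excerpt) and the leading-order form R ≈ nα². The optional correction factor 1 - 4α/(3π) is an assumption borrowed from the appendix expression E_outlier = 2α²(1 - 4α/(3π)); whether it is exactly the paper's Eq. (28) is not confirmed here. The function name `outlier_leverage` is hypothetical.

```python
import math

def outlier_leverage(alpha: float, n: int, corrected: bool = False) -> float:
    """Leading-order leverage R ~ n * alpha^2 of one sign flip on an outlier input."""
    r = n * alpha ** 2
    if corrected:
        # Assumed correction, taken from E_outlier = 2 alpha^2 (1 - 4 alpha / (3 pi)).
        r *= 1.0 - 4.0 * alpha / (3.0 * math.pi)
    return r

N_HIDDEN = 2048  # assumption: TinyLlama hidden dimension

# Measured layer-12 amplitudes quoted in the summary.
for label, alpha in [("median", 0.024), ("max", 0.26)]:
    lead = outlier_leverage(alpha, N_HIDDEN)
    corr = outlier_leverage(alpha, N_HIDDEN, corrected=True)
    print(f"{label:6s} alpha={alpha:.3f}  n*alpha^2={lead:7.2f}  corrected={corr:7.2f}")

# Crossover amplitude where n * alpha^2 = 1.
print(f"alpha_c = 1/sqrt(n) = {1.0 / math.sqrt(N_HIDDEN):.4f}")
```

Under these assumptions the median amplitude sits near the crossover where R ≈ 1, while the maximum lies well above it, which is consistent with the heavy-tailed picture described above.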
Where Pith is reading between the lines
- The same ReLU-RMSNorm geometry could be tested on other activation-normalization combinations to predict their quantization sensitivity.
- Tracking how the 2.75 factor compounds or saturates across many layers would clarify scaling behavior beyond the two-layer case.
- The Bussgang constant and half-space structure suggest analogous sign-magnitude analyses for binary or low-bit quantization in attention blocks.
Load-bearing premise
Weights are i.i.d. Gaussian and entries remain delocalized without dominant outliers.
What would settle it
Measure the ratio of transverse output energies produced by sign-flip versus magnitude perturbations of equal Frobenius norm in a two-layer ReLU-RMSNorm network initialized with i.i.d. Gaussian weights, at small flip rates.
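The measurement described above can be sketched in a few dozen lines. This is an illustrative setup, not the paper's experiment. It assumes square n×n layers with N(0, 1/n) weights, a unit-norm Gaussian input, perturbation of the first layer only, and a sign-preserving perturbation that pushes every entry away from zero by a common δ chosen to match the Frobenius norm of the flips. Transverse energy is taken as the part of the pre-RMSNorm output change orthogonal to the unperturbed output; the common RMSNorm scale factor cancels in the ratio. All function names are hypothetical.

```python
import math
import random

def matvec(m, v):
    """Dense matrix-vector product for a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, t) for t in v]

def transverse_energy(delta, h):
    """Energy of delta orthogonal to h: ||delta||^2 - <delta, h>^2 / ||h||^2."""
    dot = sum(a * b for a, b in zip(delta, h))
    return sum(t * t for t in delta) - dot * dot / sum(t * t for t in h)

def one_trial(n, p, rng):
    """Return (flip, magnitude) transverse energies for one random network."""
    sigma = 1.0 / math.sqrt(n)  # assumed weight scale
    w1 = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]
    w2 = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(sum(t * t for t in x))
    x = [t / scale for t in x]  # unit-norm input

    # Sign-flip perturbation: negate each first-layer entry with probability p.
    w1_flip = [[-w if rng.random() < p else w for w in row] for row in w1]
    fro2 = sum((a - b) ** 2 for ra, rb in zip(w1_flip, w1) for a, b in zip(ra, rb))

    # Sign-preserving perturbation with the same Frobenius norm.
    delta = math.sqrt(fro2 / (n * n))
    w1_mag = [[w + math.copysign(delta, w) for w in row] for row in w1]

    h = matvec(w2, relu(matvec(w1, x)))  # unperturbed pre-RMSNorm output

    def output_change(w1_pert):
        h_pert = matvec(w2, relu(matvec(w1_pert, x)))
        return [a - b for a, b in zip(h_pert, h)]

    return (transverse_energy(output_change(w1_flip), h),
            transverse_energy(output_change(w1_mag), h))

def pooled_ratio(n=128, p=0.01, trials=8, seed=0):
    """Ratio of summed flip to summed magnitude transverse energy over trials."""
    rng = random.Random(seed)
    flip_total = mag_total = 0.0
    for _ in range(trials):
        e_flip, e_mag = one_trial(n, p, rng)
        flip_total += e_flip
        mag_total += e_mag
    return flip_total / mag_total

print(f"flip / magnitude transverse energy: {pooled_ratio():.2f}")
```

At this small width the estimate is noisy and carries finite-size corrections, so it should only be compared loosely with π/(π-2); larger n, smaller p, and more trials would be needed for a real test of the claim.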
Original abstract
Pre-norm Transformers with RMSNorm tolerate ternary {-1,0,+1} weight quantization with surprisingly small loss (Ma et al., 2024). We give a geometric explanation via sign-magnitude decomposition of weight perturbations. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce $\pi/(\pi-2) \approx 2.75$ times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, as the flip rate $p \to 0$ (Theorem 3). The mechanism: ReLU creates a hidden-space directional asymmetry between the two perturbation types, which RMSNorm's transverse-projection Fr\'echet derivative selectively exposes. Sign-quantization error is itself a sign-preserving perturbation with angular alignment $\cos^2 \to 2/\pi$ (Theorem 4); its post-ReLU radial fraction ($0.365$) matches the pre-ReLU value $1-2/\pi$ within $0.4\%$, so ReLU is approximately transparent to ternary error. Multi-layer compounding of the $2.75\times$ factor is not experimentally supported; the gap to real-model sign sensitivity arises from outlier features violating delocalization. For an input dimension with amplitude $\alpha$, a single sign-flip produces post-ReLU energy amplified by $R \approx n\alpha^2$ relative to a delocalized entry. On TinyLlama-1.1B, at linear response ($p \leq 0.5\%$), count-matched NLL leverage stabilizes at $\sim 10\times \approx n\mathbb{E}[\alpha^2]$, matching the per-entry theory; the all-column NLL ratio of $5.0\times$ falls within $R_{\mathrm{col}} \leq 19$ ($67\times$ PPL gap reflects metric nonlinearity). Measured outlier $\alpha$ at layer 12 (median $0.024$, max $0.26$) confirms heavy-tailed concentration. The Bussgang constant $2/\pi$, RMSNorm geometry, and ReLU half-space structure together explain sign-magnitude asymmetry in pre-norm models, with $R \propto n\alpha^2$ accounting for real-model deviations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pre-norm Transformers tolerate ternary {-1,0,+1} weight quantization with small loss due to sign-magnitude asymmetry arising from ReLU + RMSNorm geometry. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-Frobenius-norm sign-preserving magnitude perturbations as p → 0 (Theorem 3). Sign-quantization error is a sign-preserving perturbation with cos² → 2/π alignment (Theorem 4), and ReLU is approximately transparent to it (post-ReLU radial fraction 0.365 matches pre-ReLU 1-2/π within 0.4%). Real-model deviations (e.g., TinyLlama-1.1B) are attributed to outlier features with amplitude α producing amplification R ≈ nα², which matches observed ~10× NLL leverage at low p and explains the 5.0× column NLL ratio.
Significance. If the two-layer geometric results hold and the outlier mechanism is placed on firmer footing, the work supplies a concrete geometric account of why RMSNorm + ReLU blocks are unusually tolerant to sign errors under ternary quantization. The explicit derivation of the π/(π-2) energy ratio, the Bussgang alignment, and the measured α statistics on TinyLlama constitute useful, falsifiable contributions that could inform quantization design for pre-norm models.
major comments (2)
- [Theorem 3] Theorem 3 and the subsequent multi-layer discussion: the π/(π-2) transverse-energy ratio is derived under i.i.d. Gaussian weights and delocalized entries in a two-layer model, yet the manuscript invokes this ratio to explain tolerance in deep Transformers without a propagation analysis showing how the local 2.75× asymmetry compounds across layers or interacts with residual connections.
- [TinyLlama analysis] TinyLlama analysis (NLL leverage and R ≈ nα²): the claim that outlier amplitude α (median 0.024, max 0.26 at layer 12) produces R ≈ nE[α²] that accounts for the observed ~10× per-entry NLL leverage and 5.0× column NLL ratio rests on post-hoc attribution; no derivation or simulation demonstrates that the measured α distribution suffices to close the gap from the two-layer theory to the 67× PPL discrepancy.
minor comments (2)
- [Abstract] Abstract: the definition of R_col ≤ 19 and its relation to the per-column NLL ratio should be stated explicitly rather than introduced only in the experimental paragraph.
- Notation: the symbol R is used both for the per-entry amplification and for the column-level bound; a brief clarifying sentence would prevent confusion.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for recognizing the geometric contributions of the local analysis. We respond to each major comment below.
-
Referee: [Theorem 3] Theorem 3 and the subsequent multi-layer discussion: the π/(π-2) transverse-energy ratio is derived under i.i.d. Gaussian weights and delocalized entries in a two-layer model, yet the manuscript invokes this ratio to explain tolerance in deep Transformers without a propagation analysis showing how the local 2.75× asymmetry compounds across layers or interacts with residual connections.
Authors: Theorem 3 derives the local 2.75× transverse-energy ratio under the two-layer i.i.d. Gaussian setting with delocalized entries. The manuscript invokes this ratio to identify the geometric origin of sign-magnitude asymmetry within the ReLU + RMSNorm block itself, which forms the core computational unit of pre-norm Transformers. The text already states that multi-layer compounding is not experimentally supported and that real-model gaps arise from outlier features that violate delocalization. A full propagation analysis through residual streams would be a natural extension but is outside the scope of the present local geometric study. We will add a clarifying sentence in the discussion to emphasize that the ratio characterizes the local tolerance mechanism rather than supplying a global multi-layer prediction. revision: partial
-
Referee: [TinyLlama analysis] TinyLlama analysis (NLL leverage and R ≈ nα²): the claim that outlier amplitude α (median 0.024, max 0.26 at layer 12) produces R ≈ nE[α²] that accounts for the observed ~10× per-entry NLL leverage and 5.0× column NLL ratio rests on post-hoc attribution; no derivation or simulation demonstrates that the measured α distribution suffices to close the gap from the two-layer theory to the 67× PPL discrepancy.
Authors: The scaling R ≈ nα² follows directly from the effect of a localized sign-flip on a single input coordinate of amplitude α: the post-ReLU energy contribution is amplified by that factor relative to a delocalized entry. The measured α distribution at layer 12 yields nE[α²] ≈ 10, which matches the observed per-entry NLL leverage at linear response. The column-wise NLL ratio of 5.0× lies inside the bound R_col ≤ 19 obtained from the maximum observed α. The paper attributes the remaining 67× PPL discrepancy to the nonlinear mapping from small NLL changes to perplexity at low p. While an end-to-end simulation of the full PPL gap is not included, the derived scaling together with the empirical α statistics provides a quantitative bridge from the delocalized two-layer theory to the measured sensitivities. We will include the explicit derivation of R in the appendix. revision: partial
Circularity Check
No significant circularity; geometric claims derive independently
The paper establishes Theorems 3 and 4 through direct geometric analysis of a two-layer ReLU + RMSNorm block with i.i.d. Gaussian weights, yielding the π/(π-2) transverse energy ratio for sign-flips and the 2/π angular alignment for quantization error from first-principles expectations over ReLU half-spaces and RMSNorm projections. These constants are standard in Gaussian signal processing (e.g., Bussgang theorem) and do not rely on fitting to the paper's target results or self-referential definitions. The outlier amplification factor R ≈ nα² is introduced as an explanatory mechanism to bridge to real models like TinyLlama, where it is confirmed by measured α distributions rather than used to derive or fit the core asymmetry. No self-citations are load-bearing for the central claims, no ansatzes are smuggled, and the derivation chain remains self-contained under the stated model assumptions without reducing predictions to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- outlier amplitude α
axioms (2)
- domain assumption: weights are i.i.d. Gaussian random variables
- domain assumption: analysis restricted to the linear response regime with small flip rate p ≤ 0.5%
Reference graph
Works this paper leans on
-
[1]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450,
-
[2]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. Qwen technical report. arXiv:2309.16609,
-
[3]
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
arXiv:1511.00363. Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS,
-
[4]
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
arXiv:1803.03635. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations (ICLR),
-
[5]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
arXiv:2210.17323. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30,
-
[6]
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
arXiv:1609.07061. Michel Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society,
-
[7]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv:2306.00978. Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv:2402.17764,
-
[8]
arXiv:2504.05357. Robert Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72,
-
[9]
Mangadoddi Srikar Vardhan and Lekkala Sai Teja. Disentangling direction and magnitude in transformer representations: A double dissociation through L2-matched perturbation analysis. arXiv:2602.11169,
-
[10]
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit Transformers for large language models. arXiv:2310.11453,
-
[11]
Smoothquant: Accurate and efficient post-training quantization for large language models
arXiv:2211.10438. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning (ICML), pages 10524–10533,
- [12]
-
[13]
Root Mean Square Layer Normalization
arXiv:1910.07467. Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In NeurIPS,
-
[14]
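$= \rho\sqrt{1-\rho^2} = 2\sqrt{p} - 5p^{3/2} + O(p^{5/2})$, $\arctan(a) = \pi/2 - \arctan(1/a) = \pi/2 - 2\sqrt{p} - p^{3/2}/3 + O(p^{5/2})$. Substituting: the constant terms cancel ($1/4 - \frac{1}{2\pi}\cdot\frac{\pi}{2} = 1/4 - 1/4 = 0$), the $\sqrt{p}$ terms cancel, and collecting $p^{3/2}$ coefficients gives $5/(2\pi) + 1/(6\pi) = 16/(6\pi) = 8/(3\pi)$: $\beta_{\mathrm{flip}} = \frac{8p^{3/2}}{3\pi} + O(p^{5/2})$. (23) Footnote 3: Strictly, the uniformity of CLT convergence in $p$ (i.e., $\sup_{p\in(\epsilon,1/2)} |\beta_{\mathrm{flip}}(p,$...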
-
[15]
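Here $\mathbb{E}[a_i] = \mathbb{E}[\mathrm{ReLU}(z_i)] > 0$, so the uncentered and Pearson correlations differ; the relevant quantity for the vector-level $\cos^2(\Delta a, a)$ is the uncentered version. Lemma A.1 (Bussgang-ReLU correlation). Under assumptions (A1)–(A2), let $z_i = W_{1,i}\hat{x}$, $\Delta z_i^{\mathrm{mag}} = \delta \cdot \mathrm{sign}(W_{1,i}) \cdot \hat{x}$ with $\delta^2 = 4p/n$, $a_i = \mathrm{ReLU}(z_i)$, and $\Delta a_i = \mathrm{ReLU}(z_i + \Delta z_i) - \mathrm{ReLU}(z_i)$. Then as $n \to \infty$: $\mathrm{Corr}(\Delta a_i, a_i)$...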
-
[16]
weights, no residual connection)
D Additional Experimental Data. D.1 Multi-Layer Detailed Results (V2, p = 0.01, 20 seeds). Table 15 shows per-layer CL values for the V2 architecture (i.i.d. weights, no residual connection). The monotone decline contradicts exponential compounding. Table 15: V2 architecture per-layer results. CL shows monotone decline, contradicting the exponential compoun...
-
[17]
Theory predicts $R_{\mathrm{sign}} \approx p$; convergence improves with dimension
Table 18: Measured sign-perturbation radial fractions $R_{\mathrm{sign}}$. Theory predicts $R_{\mathrm{sign}} \approx p$; convergence improves with dimension.

Dim    p=0.01   p=0.02   p=0.05   p=0.10
256    0.0217   0.0308   0.0565   0.0969
512    0.0156   0.0248   0.0509   0.0917
1024   0.0125   0.0218   0.0481   0.0891
2048   0.0110   0.0202   0.0465   0.0876

Table 19: Measured magnitude-perturbation radial fractions $R_{\mathrm{mag}}$. Theory p...
-
[18]
Setting $p = \alpha^2$: $E_{\mathrm{outlier}} = 2\alpha^2(1 - 4\alpha/(3\pi))$, matching $f(1 - 2\alpha^2)$
Consistency with Theorem 3. Main text Step A6 gives total post-ReLU energy $E(p) = 2p(1 - 4\sqrt{p}/(3\pi) + O(p))$ on unit-variance scale. Setting $p = \alpha^2$: $E_{\mathrm{outlier}} = 2\alpha^2(1 - 4\alpha/(3\pi))$, matching $f(1 - 2\alpha^2)$. E.6 Numerical Predictions. Table 20 gives predictions for $n = 2048$ (TinyLlama hidden dimension). Table 20: Predicted outlier leverage ratio $R(\alpha, 2048)$ from Theorem
-
[19]
α      Energy α²   P_flip   R (exact)   R via Eq. (28)
0.05   0.25%       3.2%     5.0         5.0
0.10   1.0%        6.4%     19.5        19.6
0.20   4.0%        12.8%    73          75
0.30   9.0%        19.5%    162         161
0.50   25%         33.3%    404         404

E.7 Phase Crossover: Perturbative to Non-Perturbative. The crossover occurs at the crossover scale $\alpha_c = 1/\sqrt{n}$, where a single outlier flip has the same impact as a single non-outlier flip ($R(\alpha_c, n) \approx 1$). Fo...
-
[20]
Gate-flip probability $\sim (2/\pi)\arcsin\alpha = O(1)$
• Outlier regime ($\alpha \gg \alpha_c$): Single-entry flip perturbs $z_i$ by $O(\alpha/\sqrt{n}) \sim \sigma_z$. Gate-flip probability $\sim (2/\pi)\arcsin\alpha = O(1)$. Gate-flip energy dominates smooth energy. The transition is a smooth crossover (no critical exponent or discontinuity): $R = n\alpha^2$ grows continuously from $R = 1$ at $\alpha = \alpha_c$ to $R \sim n$ at $\alpha \sim 1$. E.8 Connection to Experiments. Theory vs. observation. Theorem 6 predicts per-en...