
arxiv: 2605.18933 · v1 · pith:VJ2QILQHnew · submitted 2026-05-18 · 💻 cs.LG

A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

Pith reviewed 2026-05-20 12:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords: ternary quantization, ReLU, RMSNorm, sign-magnitude asymmetry, weight perturbations, pre-norm transformers, geometric analysis, Bussgang constant

The pith

In ReLU + RMSNorm blocks, sign flips produce 2.75 times more transverse output energy than equal-magnitude sign-preserving changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sign-magnitude decomposition of weight perturbations reveals a directional asymmetry created by ReLU in a two-layer model with Gaussian weights. RMSNorm then selectively projects this asymmetry into transverse output energy, making sign flips far costlier than magnitude adjustments as the flip rate approaches zero. Sign-quantization error aligns with the less damaging sign-preserving type and passes through ReLU with almost no change in its radial fraction. This geometry accounts for the observed tolerance of pre-norm Transformers to ternary quantization, while real-model deviations trace to outlier features that concentrate energy and violate the delocalized-entry assumption.

Core claim

Sign-flips produce π/(π-2) ≈ 2.75 times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm as the flip rate p → 0. Sign-quantization error is a sign-preserving perturbation whose angular alignment satisfies cos² → 2/π; after ReLU its radial fraction remains 0.365, matching the pre-ReLU value 1-2/π to within 0.4 percent, rendering ReLU approximately transparent to the error.
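
The constants in this claim can be sanity-checked numerically. The following is a minimal sketch, not the paper's code: it assumes the sign quantizer is `c * sign(w)` with `c = mean(|w|)` and that alignment is measured as a squared cosine against the original weights. The function names and the choice of `c` are illustrative assumptions.

```python
import numpy as np

def cos2(u, v):
    """Squared cosine of the angle between two vectors."""
    return float(np.dot(u, v) ** 2 / (np.dot(u, u) * np.dot(v, v)))

def sign_quantization_alignment(n_weights=1_000_000, seed=0):
    """Monte Carlo estimate of the Bussgang alignment for Gaussian weights."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n_weights)
    # Assumed quantizer: q keeps the sign of every weight, with scale c = mean|w|.
    q = np.mean(np.abs(w)) * np.sign(w)
    err = w - q  # sign-quantization error
    return cos2(q, w), cos2(err, w)

# Closed-form constants quoted in the claim.
ENERGY_RATIO = np.pi / (np.pi - 2.0)      # about 2.752
PRE_RELU_RADIAL = 1.0 - 2.0 / np.pi       # about 0.3634

if __name__ == "__main__":
    aligned, radial = sign_quantization_alignment()
    print(f"cos^2(q, w)   = {aligned:.4f}  (2/pi     = {2 / np.pi:.4f})")
    print(f"cos^2(err, w) = {radial:.4f}  (1 - 2/pi = {PRE_RELU_RADIAL:.4f})")
    print(f"pi/(pi-2)     = {ENERGY_RATIO:.4f}")
```

With this choice of `c`, the two squared cosines sum to one, so the radial fraction of the error is the complement of the alignment; the quoted post-ReLU value 0.365 sits within half a percent of `PRE_RELU_RADIAL`.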

What carries the argument

ReLU-induced hidden-space directional asymmetry between sign-flip and magnitude perturbations, selectively exposed by the transverse-projection Fréchet derivative of RMSNorm.
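
To make "transverse-projection Fréchet derivative" concrete, here is a small sketch assuming RMSNorm without a learned gain or epsilon, so that the map is `y / sqrt(mean(y**2))`. Under that assumption the Jacobian is a scaled projector onto the subspace orthogonal to `y`: it annihilates radial perturbations and passes transverse ones. The function names are illustrative.

```python
import numpy as np

def rmsnorm(y):
    """RMSNorm without gain or epsilon (an assumption): y / sqrt(mean(y**2))."""
    return y / np.sqrt(np.mean(y ** 2))

def rmsnorm_jacobian(y):
    """Jacobian of rmsnorm at y: (sqrt(d) / ||y||) * (I - y_hat y_hat^T)."""
    d = y.size
    norm = np.linalg.norm(y)
    y_hat = y / norm
    return (np.sqrt(d) / norm) * (np.eye(d) - np.outer(y_hat, y_hat))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal(16)
    jac = rmsnorm_jacobian(y)
    # A purely radial perturbation (along y) produces no first-order output change.
    print("radial response norm:", np.linalg.norm(jac @ y))
    # A generic small perturbation matches the finite difference to first order.
    dy = 1e-6 * rng.standard_normal(16)
    print("linearization error:", np.linalg.norm(rmsnorm(y + dy) - rmsnorm(y) - jac @ dy))
```

Because only the component of a perturbation orthogonal to `y` survives, any asymmetry in how much of a perturbation is radial versus transverse translates directly into an asymmetry in output energy.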

If this is right

  • Ternary quantization error stays small because it behaves like a sign-preserving perturbation that ReLU largely passes through unchanged.
  • A single sign-flip on an input feature of amplitude α amplifies post-ReLU energy by a factor R ≈ n α² relative to a delocalized entry.
  • Count-matched negative-log-likelihood leverage on TinyLlama stabilizes near 10× at low flip rates, matching the per-entry prediction.
  • The all-column NLL ratio reaches 5.0× while the perplexity gap is 67×, a difference the paper attributes to metric nonlinearity.
  • Heavy-tailed outlier amplitudes (median 0.024, max 0.26 at layer 12) explain why real models deviate from the delocalized two-layer prediction.
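
The leading-order scaling R ≈ n α² in the bullets above is easy to tabulate. This sketch assumes n = 2048 (the TinyLlama hidden dimension quoted in the excerpted appendix) and uses only the leading-order formula; the helper names are hypothetical, and the exact values in the paper's Table 20 are somewhat smaller at larger α (for example 19.5 at α = 0.10).

```python
import math

def outlier_leverage(alpha, n):
    """Leading-order amplification R ≈ n * alpha**2 for a single sign-flip."""
    return n * alpha ** 2

def crossover_amplitude(n):
    """Amplitude at which the leading-order leverage equals 1: alpha_c = 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)

if __name__ == "__main__":
    n = 2048  # assumed hidden dimension
    for label, alpha in [("median", 0.024), ("max", 0.26), ("alpha=0.05", 0.05), ("alpha=0.10", 0.10)]:
        print(f"{label:>10}: alpha={alpha:.3f}  R ~ {outlier_leverage(alpha, n):.2f}")
    print(f"crossover alpha_c ~ {crossover_amplitude(n):.4f}")
```

On this reading, the measured median amplitude sits close to the crossover scale while the maximum lies far above it, which is one way to picture the heavy-tailed concentration the bullets describe.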

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ReLU-RMSNorm geometry could be tested on other activation-normalization combinations to predict their quantization sensitivity.
  • Tracking how the 2.75 factor compounds or saturates across many layers would clarify scaling behavior beyond the two-layer case.
  • The Bussgang constant and half-space structure suggest analogous sign-magnitude analyses for binary or low-bit quantization in attention blocks.

Load-bearing premise

Weights are i.i.d. Gaussian and entries remain delocalized without dominant outliers.

What would settle it

Measure the ratio of transverse output energies produced by sign-flip versus magnitude perturbations of equal Frobenius norm in a two-layer ReLU-RMSNorm network initialized with i.i.d. Gaussian weights, at small flip rates.
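
A rough version of this measurement can be sketched in a few lines of NumPy. Several choices here are assumptions rather than the paper's protocol: only the first-layer weights are perturbed, all widths equal `n`, the magnitude perturbation is `delta * sign(W1)` (the form that appears in the excerpted Lemma A.1) rescaled to match the sign-flip's Frobenius norm, and transverse energy is taken from the linearized RMSNorm, that is, the part of the pre-norm output change orthogonal to the unperturbed output.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def transverse_energy(dy, y):
    """Energy of dy orthogonal to y, normalized by mean(y**2) (linearized RMSNorm)."""
    y_hat = y / np.linalg.norm(y)
    dy_perp = dy - np.dot(dy, y_hat) * y_hat
    return float(np.dot(dy_perp, dy_perp) / np.mean(y ** 2))

def flip_vs_magnitude_ratio(n=512, p=0.01, trials=8, seed=0):
    """Ratio of transverse output energies: sign-flip over magnitude perturbation."""
    rng = np.random.default_rng(seed)
    e_flip = e_mag = 0.0
    for _ in range(trials):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)                         # delocalized unit-norm input
        w1 = rng.standard_normal((n, n)) / np.sqrt(n)  # i.i.d. Gaussian first layer
        w2 = rng.standard_normal((n, n)) / np.sqrt(n)  # i.i.d. Gaussian second layer

        # Sign-flip perturbation: flip each entry of w1 independently with probability p.
        mask = rng.random((n, n)) < p
        d_flip = np.where(mask, -2.0 * w1, 0.0)

        # Sign-preserving magnitude perturbation with the same Frobenius norm:
        # ||delta * sign(w1)||_F = delta * n for an n-by-n matrix.
        delta = np.linalg.norm(d_flip) / n
        d_mag = delta * np.sign(w1)

        a = relu(w1 @ x)
        y = w2 @ a
        e_flip += transverse_energy(w2 @ (relu((w1 + d_flip) @ x) - a), y)
        e_mag += transverse_energy(w2 @ (relu((w1 + d_mag) @ x) - a), y)
    return e_flip / e_mag

if __name__ == "__main__":
    print("measured ratio:", flip_vs_magnitude_ratio())
    print("pi/(pi-2):     ", np.pi / (np.pi - 2.0))
```

If the claim holds, the measured ratio should sit in the neighbourhood of π/(π−2) for small `p` and large `n`, with finite-size and finite-`p` corrections shifting it somewhat; a ratio near 1 under these assumptions would count against the claim.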

Figures

Figures reproduced from arXiv: 2605.18933 by Lei Dong.

Figure 1: Theorem 3 verification. Blue curve: leading-order theoretical prediction
Figure 2: Multi-layer compounding refutation ($p=0.01$). Dashed line: the refuted hypothesis $C_L = (\pi/(\pi-2))^L$, which predicts exponential growth to $\sim 437$ at $L=6$. All three measured architectures (V2: i.i.d., no residual; Exp B: trained TinyLlama, ReLU; V3: i.i.d., with residual) show flat or declining $C_L$, with observed $C_6 \in [1.3, 3.1]$. Shaded regions: 95% CI. Residual connection dilution. Comparing V2 and V3 at L …
Original abstract

Pre-norm Transformers with RMSNorm tolerate ternary {-1,0,+1} weight quantization with surprisingly small loss (Ma et al., 2024). We give a geometric explanation via sign-magnitude decomposition of weight perturbations. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce $\pi/(\pi-2) \approx 2.75$ times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, as the flip rate $p \to 0$ (Theorem 3). The mechanism: ReLU creates a hidden-space directional asymmetry between the two perturbation types, which RMSNorm's transverse-projection Fr\'echet derivative selectively exposes. Sign-quantization error is itself a sign-preserving perturbation with angular alignment $\cos^2 \to 2/\pi$ (Theorem 4); its post-ReLU radial fraction ($0.365$) matches the pre-ReLU value $1-2/\pi$ within $0.4\%$, so ReLU is approximately transparent to ternary error. Multi-layer compounding of the $2.75\times$ factor is not experimentally supported; the gap to real-model sign sensitivity arises from outlier features violating delocalization. For an input dimension with amplitude $\alpha$, a single sign-flip produces post-ReLU energy amplified by $R \approx n\alpha^2$ relative to a delocalized entry. On TinyLlama-1.1B, at linear response ($p \leq 0.5\%$), count-matched NLL leverage stabilizes at $\sim 10\times \approx n\mathbb{E}[\alpha^2]$, matching the per-entry theory; the all-column NLL ratio of $5.0\times$ falls within $R_{\mathrm{col}} \leq 19$ ($67\times$ PPL gap reflects metric nonlinearity). Measured outlier $\alpha$ at layer 12 (median $0.024$, max $0.26$) confirms heavy-tailed concentration. The Bussgang constant $2/\pi$, RMSNorm geometry, and ReLU half-space structure together explain sign-magnitude asymmetry in pre-norm models, with $R \propto n\alpha^2$ accounting for real-model deviations.
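
One heuristic way to read the constant in the abstract, offered as a sketch under stated assumptions and not necessarily the route taken in the paper's proof: assume linear response through ReLU ($\Delta a_i \approx \mathbf{1}[z_i > 0]\,\Delta z_i$), equal post-ReLU energy for the two perturbation types at leading order, and a Bussgang-type alignment $\cos^2 \to 2/\pi$ between the magnitude perturbation and the activation vector.

```latex
% Assumptions: n \to \infty, p \to 0, linear response through ReLU,
% equal post-ReLU energy for both perturbation types at leading order.
\begin{align*}
\text{magnitude:}\quad
  & \cos^2(\Delta a^{\mathrm{mag}}, a) \to \frac{2}{\pi}
  \;\Longrightarrow\;
  \text{transverse fraction} \to 1 - \frac{2}{\pi} = \frac{\pi - 2}{\pi}, \\
\text{sign-flip:}\quad
  & \cos^2(\Delta a^{\mathrm{flip}}, a) \to 0
  \;\Longrightarrow\;
  \text{transverse fraction} \to 1, \\
\text{ratio:}\quad
  & \frac{1}{(\pi - 2)/\pi} = \frac{\pi}{\pi - 2} \approx 2.752 .
\end{align*}
```

Since RMSNorm's derivative passes only the transverse part, the ratio of transverse fractions becomes the ratio of output energies in this reading.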

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that pre-norm Transformers tolerate ternary {-1,0,+1} weight quantization with small loss due to sign-magnitude asymmetry arising from ReLU + RMSNorm geometry. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-Frobenius-norm sign-preserving magnitude perturbations as p → 0 (Theorem 3). Sign-quantization error is a sign-preserving perturbation with cos² → 2/π alignment (Theorem 4), and ReLU is approximately transparent to it (post-ReLU radial fraction 0.365 matches pre-ReLU 1-2/π within 0.4%). Real-model deviations (e.g., TinyLlama-1.1B) are attributed to outlier features with amplitude α producing amplification R ≈ nα², which matches observed ~10× NLL leverage at low p and explains the 5.0× column NLL ratio.

Significance. If the two-layer geometric results hold and the outlier mechanism is placed on firmer footing, the work supplies a concrete geometric account of why RMSNorm + ReLU blocks are unusually tolerant to sign errors under ternary quantization. The explicit derivation of the π/(π-2) energy ratio, the Bussgang alignment, and the measured α statistics on TinyLlama constitute useful, falsifiable contributions that could inform quantization design for pre-norm models.

major comments (2)
  1. [Theorem 3] Theorem 3 and the subsequent multi-layer discussion: the π/(π-2) transverse-energy ratio is derived under i.i.d. Gaussian weights and delocalized entries in a two-layer model, yet the manuscript invokes this ratio to explain tolerance in deep Transformers without a propagation analysis showing how the local 2.75× asymmetry compounds across layers or interacts with residual connections.
  2. [TinyLlama analysis] TinyLlama analysis (NLL leverage and R ≈ nα²): the claim that outlier amplitude α (median 0.024, max 0.26 at layer 12) produces R ≈ nE[α²] that accounts for the observed ~10× per-entry NLL leverage and 5.0× column NLL ratio rests on post-hoc attribution; no derivation or simulation demonstrates that the measured α distribution suffices to close the gap from the two-layer theory to the 67× PPL discrepancy.
minor comments (2)
  1. [Abstract] Abstract: the definition of R_col ≤ 19 and its relation to the per-column NLL ratio should be stated explicitly rather than introduced only in the experimental paragraph.
  2. Notation: the symbol R is used both for the per-entry amplification and for the column-level bound; a brief clarifying sentence would prevent confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for recognizing the geometric contributions of the local analysis. We respond to each major comment below.

  1. Referee: [Theorem 3] Theorem 3 and the subsequent multi-layer discussion: the π/(π-2) transverse-energy ratio is derived under i.i.d. Gaussian weights and delocalized entries in a two-layer model, yet the manuscript invokes this ratio to explain tolerance in deep Transformers without a propagation analysis showing how the local 2.75× asymmetry compounds across layers or interacts with residual connections.

    Authors: Theorem 3 derives the local 2.75× transverse-energy ratio under the two-layer i.i.d. Gaussian setting with delocalized entries. The manuscript invokes this ratio to identify the geometric origin of sign-magnitude asymmetry within the ReLU + RMSNorm block itself, which forms the core computational unit of pre-norm Transformers. The text already states that multi-layer compounding is not experimentally supported and that real-model gaps arise from outlier features that violate delocalization. A full propagation analysis through residual streams would be a natural extension but is outside the scope of the present local geometric study. We will add a clarifying sentence in the discussion to emphasize that the ratio characterizes the local tolerance mechanism rather than supplying a global multi-layer prediction. revision: partial

  2. Referee: [TinyLlama analysis] TinyLlama analysis (NLL leverage and R ≈ nα²): the claim that outlier amplitude α (median 0.024, max 0.26 at layer 12) produces R ≈ nE[α²] that accounts for the observed ~10× per-entry NLL leverage and 5.0× column NLL ratio rests on post-hoc attribution; no derivation or simulation demonstrates that the measured α distribution suffices to close the gap from the two-layer theory to the 67× PPL discrepancy.

    Authors: The scaling R ≈ nα² follows directly from the effect of a localized sign-flip on a single input coordinate of amplitude α: the post-ReLU energy contribution is amplified by that factor relative to a delocalized entry. The measured α distribution at layer 12 yields nE[α²] ≈ 10, which matches the observed per-entry NLL leverage at linear response. The column-wise NLL ratio of 5.0× lies inside the bound R_col ≤ 19 obtained from the maximum observed α. The paper attributes the remaining 67× PPL discrepancy to the nonlinear mapping from small NLL changes to perplexity at low p. While an end-to-end simulation of the full PPL gap is not included, the derived scaling together with the empirical α statistics provides a quantitative bridge from the delocalized two-layer theory to the measured sensitivities. We will include the explicit derivation of R in the appendix. revision: partial

Circularity Check

0 steps flagged

No significant circularity; geometric claims derive independently


The paper establishes Theorems 3 and 4 through direct geometric analysis of a two-layer ReLU + RMSNorm block with i.i.d. Gaussian weights, yielding the π/(π-2) transverse energy ratio for sign-flips and the 2/π angular alignment for quantization error from first-principles expectations over ReLU half-spaces and RMSNorm projections. These constants are standard in Gaussian signal processing (e.g., Bussgang theorem) and do not rely on fitting to the paper's target results or self-referential definitions. The outlier amplification factor R ≈ nα² is introduced as an explanatory mechanism to bridge to real models like TinyLlama, where it is confirmed by measured α distributions rather than used to derive or fit the core asymmetry. No self-citations are load-bearing for the central claims, no ansatzes are smuggled, and the derivation chain remains self-contained under the stated model assumptions without reducing predictions to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Central claims rest on the i.i.d. Gaussian weight assumption and two-layer simplification for the energy ratios; real-model validation incorporates measured outlier amplitudes rather than additional fitted parameters for the main geometric result.

free parameters (1)
  • outlier amplitude α
    Measured median 0.024 and max 0.26 at layer 12 on TinyLlama to compute amplification R ≈ nα² for explaining deviations from basic theory.
axioms (2)
  • domain assumption: Weights are i.i.d. Gaussian random variables
    Invoked to derive expected transverse output energy ratios for sign-flip versus magnitude perturbations in the two-layer model (Theorem 3).
  • domain assumption: Analysis restricted to the linear-response regime with small flip rate p ≤ 0.5%
    Used to observe stabilization of count-matched NLL leverage at ~10× matching per-entry theory.

pith-pipeline@v0.9.0 · 5952 in / 1808 out tokens · 83119 ms · 2026-05-20T12:59:51.557040+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450,

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. Qwen technical report. arXiv:2309.16609,

  3. [3]

    BinaryConnect: Training Deep Neural Networks with binary weights during propagations

    arXiv:1511.00363. Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS,

  4. [4]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    arXiv:1803.03635. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations (ICLR),

  5. [5]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    arXiv:2210.17323. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30,

  6. [6]

    Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

    arXiv:1609.07061. Michel Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society,

  7. [7]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    arXiv:2306.00978. Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv:2402.17764,

  8. [8]

    Robert Price

    arXiv:2504.05357. Robert Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72,

  9. [9]

    Disentangling direction and magnitude in transformer representations: A double dissociation through L2-matched perturbation analysis

    Mangadoddi Srikar Vardhan and Lekkala Sai Teja. Disentangling direction and magnitude in transformer representations: A double dissociation through L2-matched perturbation analysis. arXiv:2602.11169,

  10. [10]

    BitNet: Scaling 1-bit Transformers for Large Language Models

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit Transformers for large language models. arXiv:2310.11453,

  11. [11]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    arXiv:2211.10438. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning (ICML), pages 10524–10533,

  12. [12]

    Zhang, B

    arXiv:2002.04745. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2 technical report. arXiv:2407.10671,

  13. [13]

    Root Mean Square Layer Normalization

    arXiv:1910.07467. Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In NeurIPS,

  14. [14]

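    $= \rho\sqrt{1-\rho^2} = 2\sqrt{p} - 5p^{3/2} + O(p^{5/2})$, $\arctan(a) = \pi/2 - \arctan(1/a) = \pi/2 - 2\sqrt{p} - p^{3/2}/3 + O(p^{5/2})$. Substituting: the constant terms cancel ($1/4 - \frac{1}{2\pi}\cdot\frac{\pi}{2} = 1/4 - 1/4 = 0$), the $\sqrt{p}$ terms cancel, and collecting $p^{3/2}$ coefficients gives $5/(2\pi) + 1/(6\pi) = 16/(6\pi) = 8/(3\pi)$: $\beta_{\mathrm{flip}} = \frac{8p^{3/2}}{3\pi} + O(p^{5/2})$. (23) Footnote 3: Strictly, the uniformity of CLT convergence in $p$ (i.e., $\sup_{p \in (\epsilon, 1/2)} |\beta_{\mathrm{flip}}(p,$...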

  15. [15]

    Lemma A.1 (Bussgang-ReLU correlation)

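    Here $\mathbb{E}[a_i] = \mathbb{E}[\mathrm{ReLU}(z_i)] > 0$, so the uncentered and Pearson correlations differ; the relevant quantity for the vector-level $\cos^2(\Delta a, a)$ is the uncentered version. Lemma A.1 (Bussgang-ReLU correlation). Under assumptions (A1)–(A2), let $z_i = W_{1,i}\hat{x}$, $\Delta z_i^{\mathrm{mag}} = \delta \cdot \mathrm{sign}(W_{1,i}) \cdot \hat{x}$ with $\delta^2 = 4p/n$, $a_i = \mathrm{ReLU}(z_i)$, and $\Delta a_i = \mathrm{ReLU}(z_i + \Delta z_i) - \mathrm{ReLU}(z_i)$. Then as $n \to \infty$: $\mathrm{Corr}(\Delta a_i, a_i)$...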

  16. [16]

    weights, no residual connection)

    D Additional Experimental Data. D.1 Multi-Layer Detailed Results (V2, $p=0.01$, 20 seeds). Table 15 shows per-layer $C_L$ values for the V2 architecture (i.i.d. weights, no residual connection). The monotone decline contradicts exponential compounding. Table 15: V2 architecture per-layer results. $C_L$ shows monotone decline, contradicting the exponential compoun...

  17. [17]

    Theory predicts $R_{\mathrm{sign}} \approx p$; convergence improves with dimension

    Table 18: Measured sign-perturbation radial fractions $R_{\mathrm{sign}}$. Theory predicts $R_{\mathrm{sign}} \approx p$; convergence improves with dimension.

    Dim     p=0.01    p=0.02    p=0.05    p=0.10
    256     0.0217    0.0308    0.0565    0.0969
    512     0.0156    0.0248    0.0509    0.0917
    1024    0.0125    0.0218    0.0481    0.0891
    2048    0.0110    0.0202    0.0465    0.0876

    Table 19: Measured magnitude-perturbation radial fractions $R_{\mathrm{mag}}$. Theory p...

  18. [18]

    Setting $p = \alpha^2$: $E_{\mathrm{outlier}} = 2\alpha^2(1 - 4\alpha/(3\pi))$, matching $f(1 - 2\alpha^2)$

    Consistency with Theorem 3. Main text Step A6 gives total post-ReLU energy $E(p) = 2p(1 - 4\sqrt{p}/(3\pi) + O(p))$ on unit-variance scale. Setting $p = \alpha^2$: $E_{\mathrm{outlier}} = 2\alpha^2(1 - 4\alpha/(3\pi))$, matching $f(1 - 2\alpha^2)$. E.6 Numerical Predictions. Table 20 gives predictions for $n = 2048$ (TinyLlama hidden dimension). Table 20: Predicted outlier leverage ratio $R(\alpha, 2048)$ from Theorem

  19. [19]

    α       Energy α²   P_flip   R (exact)   R via Eq. (28)
    0.05    0.25%       3.2%     5.0         5.0
    0.10    1.0%        6.4%     19.5        19.6
    0.20    4.0%        12.8%    73          75
    0.30    9.0%        19.5%    162         161
    0.50    25%         33.3%    404         404

    E.7 Phase Crossover: Perturbative to Non-Perturbative. The crossover occurs at the crossover scale $\alpha_c = 1/\sqrt{n}$, where a single outlier flip has the same impact as a single non-outlier flip ($R(\alpha_c, n) \approx 1$). Fo...

  20. [20]

    Gate-flip probability $\sim (2/\pi)\arcsin\alpha = O(1)$

    • Outlier regime ($\alpha \gg \alpha_c$): Single-entry flip perturbs $z_i$ by $O(\alpha/\sqrt{n}) \sim \sigma_z$. Gate-flip probability $\sim (2/\pi)\arcsin\alpha = O(1)$. Gate-flip energy dominates smooth energy. The transition is a smooth crossover (no critical exponent or discontinuity): $R = n\alpha^2$ grows continuously from $R = 1$ at $\alpha = \alpha_c$ to $R \sim n$ at $\alpha \sim 1$. E.8 Connection to Experiments. Theory vs. observation. Theorem 6 predicts per-en...