
arxiv: 2605.19561 · v1 · pith:H7MFXKAPnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

Pith reviewed 2026-05-20 06:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords MXFP4 quantization · activation quantization · post-training quantization · orthogonal rotation · LLM inference · Schur-Horn theorem · codebook utilization · quantization error analysis

The pith

Two-level orthogonal rotations correct MXFP4 activation imbalances and close most of the accuracy gap to BF16 in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that MXFP4 quantization of activations fails due to extreme variance differences between blocks and poor use of the available codebook values inside each block. TORQ counters this with a training-free method that first applies orthogonal rotations across blocks to spread activation energy more evenly using the Schur-Horn theorem, then applies further rotations inside blocks to maximize entropy and fill the codebook more completely. A sympathetic reader would care because the resulting 4-bit models recover most of the accuracy lost to direct quantization, as measured by lower perplexity and higher downstream task scores on models such as Qwen3-32B and LLaMA3.

Core claim

Direct MXFP4 activation quantization suffers from two structural imbalances between activation distributions and the block floating-point format: extreme inter-block variance imbalance that forces shared scales to favor high-magnitude blocks, and intra-block codebook utilization imbalance that leaves many representable values unused. TORQ corrects both by macroscopic inter-block orthogonal rotation that redistributes activation energy according to the Schur-Horn theorem and microscopic intra-block rotation guided by a maximum-entropy objective that alleviates codebook collapse.
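
To make the scale problem concrete, here is a minimal NumPy sketch of MXFP4-style quantize-dequantize. It assumes the OCP Microscaling layout (32-element blocks, one shared power-of-two scale per block, FP4 E2M1 elements with magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6); the function name `mxfp4_fake_quant` and the toy data are illustrative and do not come from the paper.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; assumed from the OCP Microscaling format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x, block=32):
    """Quantize-dequantize a 1-D array block by block (illustrative only)."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared power-of-two scale: floor(log2(amax)) minus E2M1's largest exponent (2).
    exp = np.floor(np.log2(np.where(amax > 0, amax, 1.0))) - 2
    scale = 2.0 ** exp
    y = np.clip(np.abs(x) / scale, 0.0, FP4_GRID[-1])
    # Round each scaled magnitude to the nearest codeword.
    idx = np.abs(y[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(x) * FP4_GRID[idx] * scale).reshape(-1), idx

rng = np.random.default_rng(0)
calm = rng.normal(0.0, 1.0, 32)
spiky = calm.copy()
spiky[0] = 200.0  # one large element sets the scale for the whole block

report = {}
for name, blk in [("calm", calm), ("spiky", spiky)]:
    deq, idx = mxfp4_fake_quant(blk)
    mse_small = np.mean((deq[1:] - blk[1:]) ** 2)  # error on the other 31 elements
    report[name] = (mse_small, len(np.unique(idx)), deq)
    print(name, "mse on small elements:", mse_small, "codewords used:", report[name][1])
```

In the spiky block the shared scale becomes 32, so every unit-scale element rounds to zero and only two of the eight magnitudes are used. That is the kind of scale domination and codebook collapse the core claim describes.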

What carries the argument

Two-level orthogonal rotation (TORQ), which uses inter-block rotations derived from the Schur-Horn theorem to equalize variance across blocks and intra-block rotations chosen to maximize entropy, thereby reshaping activation geometry for compatibility with MXFP4 scaling and codebooks.
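
The two-level structure can be sketched as follows. The layout is an assumption made for illustration: a hidden vector of length B·K is viewed as B blocks of K elements, the inter-block rotation mixes the same position across blocks, and the intra-block rotation mixes elements inside each block. Random orthogonal matrices stand in for TORQ's calibrated rotations, and one inter-block matrix is shared across positions for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K = 8, 4                      # toy sizes; the paper's example uses B=128, K=32
x = rng.normal(size=B * K)

# Random orthogonal stand-ins (Q factor of a QR decomposition).
R_inter, _ = np.linalg.qr(rng.normal(size=(B, B)))
R_intra, _ = np.linalg.qr(rng.normal(size=(K, K)))

Xb = x.reshape(B, K)             # row b = block b, column k = position k in the block
Y = R_inter @ Xb                 # inter-block step: about B*B*K multiply-adds
Z = Y @ R_intra.T                # intra-block step: about B*K*K multiply-adds

# The same map written as one dense (B*K) x (B*K) orthogonal matrix.
R_full = np.kron(R_inter, R_intra)
z_dense = R_full @ x

print(np.allclose(Z.reshape(-1), z_dense))                  # two-level == dense
print(np.isclose(np.linalg.norm(Z), np.linalg.norm(x)))     # energy is preserved
```

Under this layout the combined transform is a Kronecker product of two orthogonal matrices. It stays orthogonal while costing B²K + BK² multiply-adds instead of (BK)² for a dense rotation.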

If this is right

  • On Qwen3-32B the WikiText perplexity reaches 8.43, close to the 7.61 achieved by BF16.
  • Average task accuracy on Qwen3-32B rises from 38.40 percent with direct RTN to 73.63 percent, versus 74.82 percent for BF16.
  • The method works without retraining and therefore applies directly to already-trained models such as LLaMA3 and Qwen3.
  • High-variance blocks no longer dominate shared scaling factors, preserving precision for small-magnitude activation elements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-level rotation idea could be tested on other block-floating-point or low-precision formats to see whether the imbalance diagnosis generalizes.
  • Combining TORQ with 4-bit weight quantization would produce a fully 4-bit model whose total compression and accuracy trade-off could be measured directly.
  • Because the rotations are linear and fixed, they might be fused into preceding linear layers at deployment time to keep inference cost unchanged.

Load-bearing premise

Precomputed orthogonal rotations can be applied to activations at inference time without changing the model's output behavior or adding unacceptable overhead.
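
The unquantized half of this premise is easy to check numerically. The sketch below assumes a single linear layer `y = W x` and uses a random orthogonal matrix as a stand-in for a calibrated TORQ rotation; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 16
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Any orthogonal matrix works for this check; here one is drawn via QR.
R, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))

W_fused = W @ R.T            # inverse rotation folded into the weights once, offline
y_ref = W @ x                # original layer
y_rot = W_fused @ (R @ x)    # rotated activation through the fused weights

print(np.max(np.abs(y_ref - y_rot)))  # floating-point round-off only
```

The identity holds only before quantization. Once `R @ x` is quantized, the two outputs differ by the quantization error, and the forward rotation `R @ x` still has to be computed online or folded into an earlier layer, which is where the overhead question sits.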

What would settle it

Apply the TORQ rotations to an untested model such as a new 70B-class LLM and measure WikiText perplexity and average task accuracy; if accuracy stays near the direct-RTN level (38.40 percent on Qwen3-32B) instead of approaching the BF16 level, the claim that the rotations preserve end-to-end accuracy would not hold.

Figures

Figures reproduced from arXiv: 2605.19561 by Dawei Yang, Xing Hu, Zukang Xu.

Figure 1: Performance of Different Quantization Methods on LLaMA3-8B and LLaMA3-70B. Despite the theoretical advantages of MXFP4, our empirical analysis reveals that directly applying it to LLM activation quantization encounters severe accuracy bottlenecks. State-of-the-art quantization methods, such as SingleQuant (Xiao et al., 2025), DARTQuant (Shao et al., 2025a), FlatQuant (Sun et al., 2024), and OSTQuant (Hu…

Figure 2: MXFP4 quantization of each micro block's variance distribution: (a) variance distribution of the original data without TORQ, (b) variance distribution after TORQ processing.

Figure 3: Histogram statistics of each codeword quantized by FP4 (E2M1): (a) codeword occupancy of the original data without TORQ, (b) codeword occupancy after TORQ processing.
micro-codebook alignment, it achieves a systematic minimization of quantization error. • We establish a solid theoretical foundation for two-level rotation. We prove that under a convex error model, the variance equilibrium configuration is ap…

Figure 4: TORQ algorithm's complete process framework.

Algorithm 1: Givens Rotation for Variance Equalization
  1: Input: Covariance matrix $\Sigma \in \mathbb{R}^{B \times B}$, convergence threshold $\epsilon$
  2: Output: Orthogonal matrix $R \in O(B)$ such that $\operatorname{diag}(R \Sigma R^\top) \approx c\mathbf{1}$
  3: Initialize $R = I_B$, calculate target $c = \operatorname{tr}(\Sigma)/B$
  4: while $\max_i |\sigma_{ii} - c| > \epsilon$ do
  5:   Select block pair $(i, j)$ with opposite variance deviation signs, i.e., $(\sigma_{ii} - c)(\sigma_{jj} - c) < 0$
  6:   …
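
Algorithm 1 is cut off at step 6 in this excerpt, so the following is only a sketch of how such a loop can be completed. The pair selection (largest positive and largest negative deviation) and the closed-form angle, which sets one diagonal entry exactly to the target `c`, are assumptions introduced here and are not taken from the paper.

```python
import numpy as np

def equalize_variance(sigma, eps=1e-8, max_iter=10_000):
    """Find orthogonal R with diag(R @ sigma @ R.T) close to tr(sigma)/B."""
    sigma = np.array(sigma, dtype=np.float64)
    B = sigma.shape[0]
    R = np.eye(B)
    c = np.trace(sigma) / B
    for _ in range(max_iter):
        dev = np.diag(sigma) - c
        if np.max(np.abs(dev)) <= eps:
            break
        # Deviations sum to zero, so the max is positive and the min is negative.
        i, j = int(np.argmax(dev)), int(np.argmin(dev))
        a, d, b = sigma[i, i], sigma[j, j], sigma[i, j]
        # Solve (d - c) t^2 + 2 b t + (a - c) = 0 for t = tan(theta);
        # the discriminant is positive because (a - c)(d - c) < 0.
        disc = np.sqrt(b * b - (a - c) * (d - c))
        t = ((-b + disc) if b >= 0 else (-b - disc)) / (d - c)
        cos = 1.0 / np.sqrt(1.0 + t * t)
        sin = t * cos
        G = np.eye(B)
        G[i, i], G[i, j], G[j, i], G[j, j] = cos, sin, -sin, cos
        sigma = G @ sigma @ G.T   # entry (i, i) now equals c
        R = G @ R
    return R, sigma

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8)) * np.linspace(0.1, 3.0, 8)[:, None]  # unequal row scales
sigma0 = A @ A.T
R, sigma_rot = equalize_variance(sigma0)
print(np.diag(sigma0).round(3))
print(np.diag(R @ sigma0 @ R.T).round(3))
```

Each rotation pins one diagonal entry at `c` and never disturbs entries outside the chosen pair, so in exact arithmetic the loop ends after at most B−1 rotations. A full Givens matrix is built for readability; a practical version would update only the two affected rows and columns.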
Original abstract

As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In this paper, we theoretically analyze the error structure of MXFP4 activation quantization, revealing that the root cause of this performance drop lies in two structural imbalances between activation distributions and the MXFP4 block floating-point format: (1) extreme inter-block variance imbalance and (2) intra-block codebook utilization imbalance. To address these challenges, we propose TORQ (Two-level Orthogonal Rotation for MXFP4 Quantization), a training-free Post-Training Quantization (PTQ) framework designed to reshape the geometric properties of the activation space through optimal coordinate transformations. At the macroscopic level, TORQ leverages the Schur-Horn theorem to redistribute activation energy via inter-block orthogonal rotation, preventing high-variance blocks from driving up shared scaling factors and thereby preserving the precision of small-magnitude elements. At the microscopic level, TORQ employs maximum-entropy-guided intra-block rotation to alleviate codebook collapse and maximize the MXFP4 codebook's information capacity. Experiments on mainstream LLMs such as LLaMA3 and Qwen3 show that TORQ significantly improves the accuracy of MXFP4 activation quantization compared to existing methods: on Qwen3-32B, the perplexity on WikiText is reduced to 8.43 (vs. 7.61 for BF16), and the average accuracy increases from 38.40% with direct RTN to 73.63% (vs. 74.82% for BF16), substantially narrowing the gap between 4-bit floating-point quantization and full-precision inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes TORQ, a training-free post-training quantization framework for MXFP4 activation quantization in large language models. It theoretically identifies two structural imbalances—extreme inter-block variance imbalance and intra-block codebook utilization imbalance—as the root causes of accuracy degradation. TORQ addresses these through inter-block orthogonal rotations based on the Schur-Horn theorem to redistribute activation energy and intra-block rotations guided by a maximum-entropy objective to maximize codebook utilization. Experimental results on models including LLaMA3 and Qwen3 demonstrate substantial improvements, such as reducing WikiText perplexity to 8.43 on Qwen3-32B (compared to 7.61 for BF16) and increasing average accuracy to 73.63% (from 38.40% with direct RTN, versus 74.82% for BF16).

Significance. Should the method's correctness and the reported gains be confirmed, this work is significant for enabling more accurate low-bit inference with MXFP4, a format valued for its hardware efficiency. The training-free aspect and use of established mathematical tools like the Schur-Horn theorem represent strengths, as does the explicit targeting of activation quantization challenges. The results substantially narrow the performance gap to full-precision models, which could have practical implications for deploying large models on resource-constrained hardware.

major comments (2)
  1. Theoretical analysis paragraph: The link between the identified imbalances and MXFP4 error structure is presented, but the manuscript does not include a derivation or explicit verification that the Schur-Horn inter-block rotation (and subsequent intra-block rotation) remains exactly invertible with weight compensation while preserving alignment to MXFP4 per-block scaling boundaries after the global transform; this is load-bearing for attributing the reported perplexity reduction (8.43 vs. 7.61) and accuracy lift (73.63% vs. 38.40%) solely to reduced quantization error rather than implementation artifacts.
  2. Experiments section (Qwen3-32B results): The central accuracy claims rest on the assumption that the two-level rotations can be applied at inference without numerical drift or block misalignment, yet no ablation isolates the contribution of each rotation level, and no check is shown for whether post-rotation activations maintain the original MXFP4 block definitions used for scaling.
minor comments (3)
  1. Abstract: The claim of 'substantially narrowing the gap' would benefit from stating the precise remaining gap to BF16 (e.g., 8.43 vs. 7.61 perplexity) alongside the RTN baseline for direct comparison.
  2. Method description: The distinction between inter-block and intra-block rotation matrices would be clearer with explicit subscript notation or separate equations to avoid potential reader confusion when describing the two-level process.
  3. Related work: Direct RTN and other baselines should include precise citations and implementation details to ensure the comparisons are fully reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of major revision. The comments correctly identify areas where additional theoretical detail and experimental controls would strengthen the manuscript. We address each point below and will revise accordingly.

  1. Referee: Theoretical analysis paragraph: The link between the identified imbalances and MXFP4 error structure is presented, but the manuscript does not include a derivation or explicit verification that the Schur-Horn inter-block rotation (and subsequent intra-block rotation) remains exactly invertible with weight compensation while preserving alignment to MXFP4 per-block scaling boundaries after the global transform; this is load-bearing for attributing the reported perplexity reduction (8.43 vs. 7.61) and accuracy lift (73.63% vs. 38.40%) solely to reduced quantization error rather than implementation artifacts.

    Authors: We agree that an explicit derivation is needed to rigorously support the claims. The inter-block rotation is an orthogonal matrix obtained via the Schur-Horn theorem, which is invertible by definition (R^{-1} = R^T). Weight compensation is performed by applying the transposed rotation to the weight matrix, yielding exact equivalence Wx = (WR^T)(Rx) with no approximation. For MXFP4 block alignment, the rotation is constructed block-wise on the variance vector so that the per-block scaling groups remain invariant under the transform; intra-block rotations are applied independently within each original block. In the revised manuscript we will add a new subsection containing (i) the full proof of invertibility and equivalence, (ii) a lemma establishing preservation of MXFP4 block boundaries, and (iii) numerical verification that post-rotation activations produce identical block indices and scaling factors to the unrotated case (within floating-point precision). revision: yes

  2. Referee: Experiments section (Qwen3-32B results): The central accuracy claims rest on the assumption that the two-level rotations can be applied at inference without numerical drift or block misalignment, yet no ablation isolates the contribution of each rotation level, and no check is shown for whether post-rotation activations maintain the original MXFP4 block definitions used for scaling.

    Authors: We accept that isolating the two rotation stages and verifying block integrity would improve experimental transparency. In the revised Experiments section we will add (1) an ablation table on Qwen3-32B that reports WikiText perplexity and average accuracy when using only inter-block rotation, only intra-block rotation, and both combined, and (2) a verification plot and table measuring the maximum absolute deviation in block indices and scaling factors before versus after the full two-level transform across all layers and tokens. These additions will directly confirm absence of drift or misalignment while quantifying the incremental benefit of each level. revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation relies on external Schur-Horn theorem and max-entropy objective


The paper's core chain begins with an error-structure analysis of MXFP4 that invokes the standard Schur-Horn theorem (external linear-algebra result) to justify inter-block orthogonal rotations and a maximum-entropy objective (standard information-theoretic criterion) to guide intra-block rotations. Neither step reduces to a fitted parameter taken from the target accuracy metric, nor does any load-bearing premise rest on a self-citation whose content is itself unverified. The reported perplexity and accuracy numbers are obtained from direct PTQ experiments on LLaMA3/Qwen3 models and are therefore falsifiable against external baselines rather than being tautological with the rotation construction itself. The derivation therefore remains self-contained against external mathematical facts and empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the applicability of the Schur-Horn theorem to activation covariance matrices and on the assumption that maximum-entropy rotation inside blocks improves codebook utilization without side effects on model outputs.

axioms (2)
  • standard math Schur-Horn theorem can be used to redistribute activation energy across blocks via orthogonal transformation
    Invoked for the macroscopic inter-block rotation step
  • domain assumption Maximum-entropy rotation inside each block maximizes MXFP4 codebook information capacity
    Used to derive the microscopic rotation objective
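
For reference, the condition behind the first axiom can be written out as below. This is the standard real-symmetric statement of the Schur-Horn theorem, not a quotation from the paper; $\Sigma$, $B$, and $c$ follow the notation of Algorithm 1, while $\lambda$ and $d$ are symbols introduced here.

```latex
% With eigenvalues \lambda_1 \ge \dots \ge \lambda_B of \Sigma and a target
% d_1 \ge \dots \ge d_B, some orthogonal R achieves
% diag(R \Sigma R^\top) = d if and only if
\[
\sum_{i=1}^{k} d_i \le \sum_{i=1}^{k} \lambda_i \quad (1 \le k < B),
\qquad
\sum_{i=1}^{B} d_i = \sum_{i=1}^{B} \lambda_i .
\]
% Flat target d_i = c = tr(\Sigma)/B: the k largest eigenvalues
% average at least the overall mean c, so
\[
\sum_{i=1}^{k} \lambda_i \;\ge\; k \cdot \frac{1}{B} \sum_{i=1}^{B} \lambda_i \;=\; k c \;=\; \sum_{i=1}^{k} d_i .
\]
```

An equalizing rotation therefore exists for any covariance matrix. The theorem does not say which of the many equalizing rotations is best for quantization.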

pith-pipeline@v0.9.0 · 5878 in / 1442 out tokens · 35988 ms · 2026-05-20T06:46:38.544829+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    Orthogonality: $R_{\mathrm{inter}} \in O(B)$ and $R_{\mathrm{intra}} \in O(K)$

  2. [2]

    Training-free: No model fine-tuning is required

  3. [3]

    Appendix A.3: Inference Process We present the online inference process of TORQ in Algorithm 3

    Data-efficiency: Low calibration data requirement. Appendix A.3: Inference Process We present the online inference process of TORQ in Algorithm 3. The core advantage of TORQ is that the computational overhead of the inverse transformation can be absorbed into the linear layers, resulting in zero overhead during online inference. Appendix A.4: Exact Solver...

  4. [4]

    Let the vectors for the selected columns be $u = X_{\mathrm{norm}}[:, p]$ and $v = X_{\mathrm{norm}}[:, q]$

    Critical Angle Set Construction. Consider the normalized activation matrix $X_{\mathrm{norm}} = S^{-1} X$. Let the vectors for the selected columns be $u = X_{\mathrm{norm}}[:, p]$ and $v = X_{\mathrm{norm}}[:, q]$. For each sample $b \in \{1, \dots, B\}$, we define its 2D polar coordinates: $r_b = \sqrt{u_b^2 + v_b^2}$, $\varphi_b = \operatorname{atan2}(v_b, u_b) \in (-\pi, \pi]$ (13). Under a Givens rotation $G^{(p,q)}(\theta)$, the magnitude of the projected com...

  5. [5]

    Therefore, we only need to evaluate the loss at the midpoint of each interval: $\tilde{\theta}_m = \frac{\tau_m + \tau_{m+1}}{2}$, $m = 1,$

    Optimal Search via Finite Enumeration. Sort the unique angles in $\mathcal{T}$ as $0 \le \tau_1 < \tau_2 < \dots < \tau_M < 2\pi$. Key Property: Within any open interval $(\tau_m, \tau_{m+1})$, the quantization bin assignment for all samples remains constant, implying the loss function $\mathcal{L}_{\mathrm{code}}$ is constant. Therefore, we only need to evaluate the loss at the midpoint of each interval: $\tilde{\theta}_m = \frac{\tau_m + \tau_{m+1}}{2}$ ...

  6. [6]

    Column Imbalance Score. Given the normalized activation matrix $Z = S^{-1} X R$, let $z^{(k)}$ denote the $k$-th column vector. We first construct the codebook histogram for each column: $N_j^{(k)} = \sum_{b=1}^{B} \mathbb{1}(|z_{b,k}| \in I_j)$, $\hat{p}_j^{(k)} = N_j^{(k)} / \sum_{l=1}^{J} N_l^{(k)}$ (20). We define the Imbalance Score $h_k$ as the mean squared distance between the empirical distribution and the uniform di...

  7. [7]

    We introduce Complementarity to measure the statistical difference between two columns

    Pair Complementarity. Simply selecting the two columns with the highest $h_k$ is suboptimal if they share similar skewness (e.g., both biased towards large values). We introduce Complementarity to measure the statistical difference between two columns. We define the "anti-correlation" $c_{k,l}$ between column $k$ and column $l$ as: $c_{k,l} = -\sum_{j=1}^{J} \left(\hat{p}_j^{(k)} - \frac{1}{J}\right)\left(\hat{p}_j^{(l)} - \frac{1}{J}\right)$...

  8. [8]

    Selection Workflow In each R-step iteration, the selection proceeds as follows:

  9. [9]

    2. Pairing: Compute $H_{k,l}$ for all pairs within the candidate pool

    1. Filtering: Calculate $h_k$ for all columns and select the top $K_{\mathrm{top}}$ (e.g., $K/2$) most imbalanced columns as the candidate pool.
    2. Pairing: Compute $H_{k,l}$ for all pairs within the candidate pool.
    3. Execution: Select the top $P$ non-overlapping pairs with the highest $H_{k,l}$ scores.
    4. Optimization: Apply the exact angle solver (Appendix A.4) to these $P$ pairs to update $R_{\mathrm{intra}}$. ...

  10. [10]

    Inter-block Rotation Construction: • Covariance Estimation: Computing covariance matrices for all $K$ positions requires $O(T B^2 K)$

    Offline Calibration Phase. The calibration process involves constructing rotation matrices using a small set of calibration data (typically $T = 128$ samples). Inter-block Rotation Construction:
    • Covariance Estimation: Computing covariance matrices for all $K$ positions requires $O(T B^2 K)$.
    • Givens Iterations (Algorithm 1): Each position requires $O(B^3)$ for converg...

  11. [11]

    • Intra-block: $O(B K^2)$

    Online Inference Phase. Forward Rotation: Applying the rotations requires standard matrix multiplications:
    • Inter-block: $O(B^2 K)$ (block-diagonal multiplication).
    • Intra-block: $O(B K^2)$.
    Inverse Transformation (Zero Overhead): Crucially, the inverse transformations ($R_{\mathrm{intra}}^\top$ and $R_{\mathrm{inter}}^\top$) are fused into the weights of the subsequent linear layer offline (as de...

  12. [12]

    For a configuration of $B = 128$, $K = 32$, this amounts to approximately 17k parameters, which is negligible compared to LLM weights

    Practical Deployment Suggestions. • Memory Overhead: Storing the rotation matrices requires $B^2$ parameters for inter-block (per position, actually decomposed into block-diagonal) and $K^2$ parameters for intra-block. For a configuration of $B = 128$, $K = 32$, this amounts to approximately 17k parameters...
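
The column-selection scores described in items 6 and 7 above can be sketched as follows. Several details are assumptions because the excerpts are truncated: the bins $I_j$ are taken to be nearest-codeword cells on the FP4 (E2M1) magnitude grid, $h_k$ is taken as the mean over the $J$ bins of the squared deviation from uniform, and $c_{k,l}$ is taken as the negated inner product of those deviations. The pair score $H_{k,l}$ is not defined in the excerpt and is not implemented here.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
EDGES = (FP4_GRID[:-1] + FP4_GRID[1:]) / 2   # assumed bin edges: midpoints between codewords
J = len(FP4_GRID)

def column_histograms(Z):
    """Empirical codeword distribution p_hat[k, j] for each column k of Z."""
    codes = np.digitize(np.abs(Z), EDGES)            # bin index in 0..J-1
    counts = np.stack([np.bincount(codes[:, k], minlength=J)
                       for k in range(Z.shape[1])])
    return counts / counts.sum(axis=1, keepdims=True)

def imbalance_scores(p_hat):
    """h_k: mean squared distance of each column's histogram from uniform."""
    return np.mean((p_hat - 1.0 / J) ** 2, axis=1)

def complementarity(p_hat):
    """c[k, l]: positive when two columns are skewed in opposite directions."""
    D = p_hat - 1.0 / J
    return -(D @ D.T)

rng = np.random.default_rng(0)
n = 4096
Z = np.column_stack([
    0.01 * rng.normal(size=n),         # collapsed onto the zero codeword
    0.02 * rng.normal(size=n),         # collapsed the same way
    5.5 + 0.05 * rng.normal(size=n),   # piled onto the largest codeword
    rng.uniform(0.0, 7.0, size=n),     # spread over many codewords
])
p_hat = column_histograms(Z)
h = imbalance_scores(p_hat)
C = complementarity(p_hat)
print(h.round(4))
print(C[0, 1], C[0, 2])
```

Columns 0 and 1 are equally imbalanced but skewed the same way, so their complementarity is negative. Columns 0 and 2 are skewed toward opposite ends of the codebook and score positive, which is the kind of pair a rotation can usefully mix.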