Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
Pith reviewed 2026-05-09 19:47 UTC · model grok-4.3
The pith
ARHQ splits LLM weights using activation residual Hessians to reduce error propagation in low-bit quantization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARHQ is a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals (G_x), ARHQ analytically identifies and isolates error-sensitive weight directions into a high-precision low-rank branch via a closed-form truncated SVD on the scaled weight matrix W G_x^{1/2}. Experimental results on Qwen3-4B-Thinking-2507 demonstrate that ARHQ significantly improves layer-wise SNR and preserves downstream reasoning performance on ZebraLogic even under aggressive quantization.
What carries the argument
Input-side residual Hessian G_x from activation quantization residuals, used with closed-form truncated SVD on W G_x^{1/2} to isolate and split error-sensitive weight directions into a high-precision low-rank branch.
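The construction described above can be sketched in a few lines of linear algebra. This is a minimal illustration assembled from the abstract's description, not the authors' reference implementation (the repository at the provided GitHub link is authoritative); the quantizer, calibration data, and rank choice here are toy assumptions.

```python
import numpy as np

def arhq_split(W, X, Xq, rank, n_bits=2):
    """Sketch of an ARHQ-style weight split (assumed from the abstract,
    not the authors' code).

    W    : (d_out, d_in) weight matrix
    X    : (n, d_in) full-precision calibration activations
    Xq   : (n, d_in) quantized activations
    rank : size of the high-precision low-rank branch
    """
    # Input-side residual Hessian G_x from activation quantization residuals.
    R = X - Xq
    G = R.T @ R / R.shape[0]                       # (d_in, d_in), PSD

    # Symmetric square root G_x^{1/2} via eigendecomposition.
    evals, evecs = np.linalg.eigh(G)
    evals = np.clip(evals, 0.0, None)
    G_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

    # Closed-form truncated SVD on the scaled weight matrix W G_x^{1/2}.
    U, s, Vt = np.linalg.svd(W @ G_half, full_matrices=False)

    # Undo the scaling to map the top directions back to weight space:
    # the high-precision low-rank branch holds the error-sensitive part.
    G_half_pinv = np.linalg.pinv(G_half)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank] @ G_half_pinv

    # Quantize the remainder at low precision (toy symmetric uniform quant).
    residual = W - L
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.abs(residual).max(), 1e-12) / qmax
    Wq = np.clip(np.round(residual / scale), -qmax, qmax) * scale
    return L, Wq
```

At inference the layer would compute `x @ (L + Wq).T`, keeping `L` in high precision; how ARHQ actually fuses the two branches is not specified in the provided text.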
If this is right
- Layer-wise signal-to-noise ratio rises under aggressive low-bit settings.
- Downstream reasoning performance on ZebraLogic stays intact for the tested Qwen3-4B model.
- Error propagation between activation and weight quantization is reduced.
- Sensitive weight directions are identified analytically without retraining or iterative search.
Where Pith is reading between the lines
- The splitting approach may generalize to other large language models and bit widths beyond the single model tested.
- Layer-specific choice of SVD truncation rank could improve the accuracy-efficiency trade-off further.
- Similar residual-Hessian analysis might extend to related compression methods such as pruning or knowledge distillation.
- The technique could be paired with existing quantization libraries to lower memory needs for on-device inference.
Load-bearing premise
Constructing the input-side residual Hessian from activation quantization residuals and running closed-form truncated SVD on the scaled weight matrix reliably isolates the error-sensitive directions without introducing new inaccuracies or requiring model-specific tuning.
What would settle it
A falsifying test: apply ARHQ to Qwen3-4B-Thinking-2507 at the reported aggressive quantization levels; if layer-wise SNR shows no gain over standard quantization, or ZebraLogic accuracy drops, the method does not deliver its claimed benefit.
original abstract
We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals (G_x), ARHQ analytically identifies and isolates error-sensitive weight directions into a high-precision low-rank branch. This is achieved via a closed-form truncated SVD on the scaled weight matrix W G_x^{1/2}. Experimental results on Qwen3-4B-Thinking-2507 demonstrate that ARHQ significantly improves layer-wise SNR and preserves downstream reasoning performance on ZebraLogic even under aggressive quantization. The code is available at https://github.com/BeautMoonQ/ARHQ.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method for low-bit LLM quantization. It constructs an input-side residual Hessian G_x from activation quantization residuals and performs a closed-form truncated SVD on the scaled weight matrix W G_x^{1/2} to analytically isolate error-sensitive weight directions into a high-precision low-rank branch, with the remainder quantized at low precision. Experiments on Qwen3-4B-Thinking-2507 report improved layer-wise SNR and preserved reasoning performance on ZebraLogic under aggressive quantization, with code released at the provided GitHub link.
Significance. If the residual-Hessian construction and truncated SVD reliably extract the dominant error-propagation directions without hidden tuning or new artifacts, ARHQ could provide a practical analytical alternative to iterative or learned quantization splits, reducing reliance on extensive hyperparameter search for LLM deployment. The public code release is a clear strength that supports reproducibility and further testing.
major comments (2)
- [Abstract] The claim that ARHQ 'significantly improves layer-wise SNR' and 'preserves downstream reasoning performance' is presented without any numerical SNR deltas, baseline comparisons (e.g., to standard low-bit methods), error bars, or specification of the exact bit-widths and layers evaluated on Qwen3-4B-Thinking-2507. This absence prevents assessment of effect size or statistical reliability.
- [Method] Central construction: The isolation of error-sensitive directions via G_x (built from activation residuals) and truncated SVD on W G_x^{1/2} is asserted to be analytical and closed-form, yet no error bound on the discarded singular components, no ablation on SVD rank selection, and no control experiment (e.g., random low-rank splits of matching dimension) are reported. Without these, it remains unclear whether the SNR gains arise from the specific residual-Hessian mechanism or from model-specific statistics or implicit rank heuristics.
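The control experiment suggested in the second comment can be sketched numerically: compare the reconstruction SNR of a rank-r truncated-SVD split against a random column-space split of the same rank. The matrices below are synthetic toys, not the paper's setup; by the Eckart-Young-Mirsky theorem the SVD split can never do worse.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 64, 8

# Toy "scaled weight" matrix with a geometrically decaying spectrum.
U, _ = np.linalg.qr(rng.standard_normal((d_out, d_out)))
V, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
s = 2.0 ** -np.arange(d_out)
A = U @ np.diag(s) @ V[:, :d_out].T

def snr_db(A, A_hat):
    """Reconstruction SNR in dB."""
    err = np.linalg.norm(A - A_hat) ** 2
    return 10 * np.log10(np.linalg.norm(A) ** 2 / err)

# Rank-r truncated SVD split (the "analytical" branch).
Us, ss, Vts = np.linalg.svd(A, full_matrices=False)
A_svd = (Us[:, :r] * ss[:r]) @ Vts[:r]

# Random rank-r split of matching dimension (the control).
Q, _ = np.linalg.qr(rng.standard_normal((d_out, r)))
A_rand = Q @ (Q.T @ A)   # projection onto a random r-dim column space

print(snr_db(A, A_svd), snr_db(A, A_rand))
```

If the paper's gains survive this kind of control at matched rank, the residual-Hessian mechanism, rather than generic low-rank structure, is doing the work.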
minor comments (2)
- [Abstract] The abstract refers to 'aggressive quantization' without defining the target bit-width or activation/weight precision pair; adding this detail would improve clarity for readers.
- [Abstract] The model identifier 'Qwen3-4B-Thinking-2507' is non-standard; confirming the exact checkpoint or providing a reference would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights opportunities to strengthen the quantitative presentation and methodological justification in our technical report. We address each major comment below and will incorporate revisions to improve clarity and rigor without altering the core contributions.
point-by-point responses
- Referee: [Abstract] The claim that ARHQ 'significantly improves layer-wise SNR' and 'preserves downstream reasoning performance' is presented without any numerical SNR deltas, baseline comparisons (e.g., to standard low-bit methods), error bars, or specification of the exact bit-widths and layers evaluated on Qwen3-4B-Thinking-2507. This absence prevents assessment of effect size or statistical reliability.
Authors: We agree that the abstract would be more informative with explicit quantitative details. The full experimental section reports layer-wise SNR values and ZebraLogic accuracies, but these were not summarized numerically in the abstract. In the revised version, we will update the abstract to include specific SNR deltas (e.g., average improvement in dB relative to uniform low-bit baselines), direct comparisons to standard methods such as GPTQ and AWQ, mention of variability across layers, and precise specifications of bit-widths (e.g., 2-bit weights with 4-bit activations) along with the evaluated layers on Qwen3-4B-Thinking-2507. This will allow readers to assess effect sizes directly. revision: yes
- Referee: [Method] Central construction: The isolation of error-sensitive directions via G_x (built from activation residuals) and truncated SVD on W G_x^{1/2} is asserted to be analytical and closed-form, yet no error bound on the discarded singular components, no ablation on SVD rank selection, and no control experiment (e.g., random low-rank splits of matching dimension) are reported. Without these, it remains unclear whether the SNR gains arise from the specific residual-Hessian mechanism or from model-specific statistics or implicit rank heuristics.
Authors: We appreciate this point on strengthening the analytical claims. The truncated SVD follows directly from the Eckart-Young-Mirsky theorem, which guarantees that the Frobenius-norm approximation error equals the root sum of squares of the discarded singular values of W G_x^{1/2}; we will state this bound and its derivation explicitly in the revised method section. To address rank selection, we will add an ablation varying the retained rank and reporting the corresponding SNR and downstream task metrics. We will also include a control experiment with random low-rank splits of identical dimensions to isolate the contribution of the residual-Hessian scaling. These additions will be presented in the next manuscript version to clarify that the gains stem from the proposed mechanism. revision: yes
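For reference, the bound the authors invoke is the Eckart-Young-Mirsky theorem. Writing \(\sigma_1 \ge \sigma_2 \ge \dots\) for the singular values of \(A = W G_x^{1/2}\) and \(A_r\) for its rank-\(r\) SVD truncation:

```latex
% Eckart--Young--Mirsky: optimal rank-r approximation error of A = W G_x^{1/2}
\| A - A_r \|_F = \Big( \sum_{i > r} \sigma_i^2 \Big)^{1/2},
\qquad
\| A - A_r \|_2 = \sigma_{r+1},
```

and no rank-\(r\) matrix achieves a smaller error in either norm, which is what makes the truncated-SVD split closed-form optimal in the scaled metric.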
Circularity Check
ARHQ is a constructive post-training procedure whose derivation does not circularly reduce to its own inputs
full rationale
The paper defines ARHQ explicitly as the construction of an input-side residual Hessian G_x from activation quantization residuals followed by closed-form truncated SVD on W G_x^{1/2} to isolate a high-precision low-rank branch. This is a method specification, not a claim that some quantity is predicted or derived from first principles that turns out to be identical to the construction itself. Experimental results on layer-wise SNR and ZebraLogic performance for Qwen3-4B are reported as validation of the procedure rather than tautological outputs. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems imported from prior author work appear in the provided text that would make the central claim circular. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: truncated SVD on the scaled weight matrix W G_x^{1/2} isolates the error-sensitive weight directions.
Reference graph
Works this paper leans on
- [1] ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv:2312.05821.
- [2] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-bit Diffusion Models. arXiv:2411.05007.
- [3] SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression. arXiv:2403.07378.
- [4] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
- [5] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
- [6] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438.
- [7] SERQ: Saliency-aware Low-Rank Error Reconstruction for LLM Quantization. arXiv:2603.08185.
- [8] QuaRot: Outlier-Free 4-bit Inference in Rotated LLMs. arXiv:2404.00456.
- [9] SpinQuant: LLM Quantization with Learned Rotations. arXiv:2405.16406.
discussion (0)