Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs

Bei Liu; Binxing Xu; Chao Li; Hao Gu; Hao Wang; Jiacheng Liu; Lujun Li; Qiyuan Zhu; Sirui Han; Xintong Yang; Yike Guo

arxiv: 2604.07888 · v1 · submitted 2026-04-09 · 💻 cs.LG

Binxing Xu, Hao Gu, Lujun Li, Hao Wang, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Xintong Yang, Chao Li, Sirui Han, Yike Guo

Pith reviewed 2026-05-10 16:55 UTC · model grok-4.3

keywords LLM quantization · low-bit training · QAT · outlier channel splitting · progressive precision reduction · W2A2 · model compression · inference speedup

The pith

Progressive training with outlier splitting enables stable 2-bit LLM quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that direct 2-bit quantization-aware training of LLMs leads to instability from error buildup, but reducing precision in stages across blocks while splitting outlier channels succeeds in keeping training stable. A sympathetic reader would care because this would let large models run on far less memory and power while retaining most accuracy. The nested grid design also means one training run can support several bit widths at deployment time. Results on Llama2 and Llama3 show the method beats earlier low-bit techniques under W2A2 settings.

Core claim

The authors claim that block-wise progressive precision reduction from higher bits to 2 bits, using nested integer quantization grids, combined with rounding-aware outlier channel splitting, stabilizes quantization-aware training. This produces W2A2 models whose WikiText2 perplexity is only 2.25 higher than full-precision versions on Llama2 and Llama3, while also supplying custom kernels for large speed gains.
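
For readers less familiar with the metric, the gap quoted above is an additive difference in perplexity, where perplexity is the exponential of the mean per-token negative log-likelihood. The following is a minimal sketch of that arithmetic; the helper names `perplexity` and `ppl_gap` and the toy inputs are illustrative assumptions, not values or code from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_gap(quantized_nlls, full_precision_nlls):
    """Additive perplexity gap between a quantized and a full-precision model."""
    return perplexity(quantized_nlls) - perplexity(full_precision_nlls)

# Toy input only: a model that is uniformly uncertain over 4 tokens
# has a per-token NLL of log(4), hence a perplexity of 4.
uniform_nll = [math.log(4.0)] * 10
print(perplexity(uniform_nll))
```

Because the gap is additive, the same 2.25 difference is a larger relative degradation for a model with a low full-precision perplexity than for one with a high baseline.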

What carries the argument

Rounding-aware outlier channel splitting, which identifies heavy-tailed channels, divides them into multiple lower-range channels, and applies a rounding rule that makes the split act as an identity transform on the quantized outputs.
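
The idea can be sketched in NumPy under stated assumptions. The first helper shows why splitting a channel is an identity transform in full precision: the input channel is duplicated and its weights are halved. The second shows one possible rounding rule (a floor/ceil split of integer codes) that keeps the two halves summing to the original code. This rule is an assumption chosen for illustration and is not necessarily the paper's exact rule; the function names and the single-sample outlier score are also hypothetical.

```python
import numpy as np

def split_outlier_channel(W, x, j):
    """Duplicate input channel j and halve its weights (full-precision split).

    W: (out_features, in_features), x: (in_features,).
    Returns (W_split, x_split) with one extra input channel appended.
    """
    half = W[:, j:j + 1] / 2.0
    W_split = np.concatenate([W, half], axis=1)
    W_split[:, j] = half[:, 0]
    x_split = np.concatenate([x, x[j:j + 1]])
    return W_split, x_split

def rounding_aware_halves(q):
    """Split integer codes q into two parts that sum back to q exactly."""
    q = np.asarray(q, dtype=np.int64)
    low = np.floor_divide(q, 2)   # rounds toward -inf
    high = q - low                # absorbs the odd remainder
    return low, high

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
# Single-sample stand-in for the ||x||2 * max|w| score per input channel.
j = int(np.argmax(np.abs(x) * np.abs(W).max(axis=0)))

W_s, x_s = split_outlier_channel(W, x, j)
assert np.allclose(W @ x, W_s @ x_s)              # identity in full precision

q = np.array([-3, -1, 0, 1, 3])
low, high = rounding_aware_halves(q)
assert np.array_equal(low + high, q)              # exact after splitting
naive = 2 * np.round(q / 2.0).astype(np.int64)    # round each half independently
print(naive - q)                                  # nonzero entries show the drift
```

Rounding each half independently can shift odd codes by one level, which is the error a rounding-aware rule is meant to avoid.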

If this is right

A single training produces a model that can be deployed at any supported bit width via the nested grids.
Custom W2A2 and W2A16 kernels deliver up to 11 times speedup over BF16.
The method follows microscaling groups with E4M3 scales to match current hardware standards.
It outperforms BitDistiller and EfficientQAT on Llama2 and Llama3 under W2A2 conditions.
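
The "train once, deploy at several bit widths" point relies on lower-bit levels being a subset of the higher-bit grid. A minimal sketch of that nesting, assuming signed symmetric codes and a truncating arithmetic right shift (the paper's exact rounding and offsets are not given on this page, and `requantize_by_shift` is a hypothetical helper):

```python
import numpy as np

def requantize_by_shift(q_high, scale_high, bits_high, bits_low):
    """Derive lower-bit codes from higher-bit codes by an arithmetic right shift."""
    shift = bits_high - bits_low
    if shift < 0:
        raise ValueError("bits_low must not exceed bits_high")
    q_low = np.right_shift(q_high.astype(np.int32), shift)  # floors toward -inf
    scale_low = scale_high * (1 << shift)                   # coarser step size
    return q_low, scale_low

q8 = np.array([-128, -65, -1, 0, 63, 127], dtype=np.int8)
s8 = 0.01  # illustrative 8-bit step size

for bits in (4, 2):
    q, s = requantize_by_shift(q8, s8, 8, bits)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    assert q.min() >= lo and q.max() <= hi
    # Nesting: each low-bit level maps back to a valid 8-bit code.
    codes8 = q * (1 << (8 - bits))
    assert codes8.min() >= -128 and codes8.max() <= 127
    assert np.allclose(q * s, codes8 * s8)
    print(bits, q.tolist())
```

Under this assumption a deployment at 4 or 2 bits only needs the stored higher-bit codes and a shift, with no retraining step.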

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The progressive schedule could be tuned to quantize models larger than those tested here with similar stability.
Native 2-bit hardware would amplify the reported speedups beyond the custom-kernel gains.
The outlier-splitting idea might extend to handling activation outliers during standard training.
Similar staged reduction could help other compression methods such as pruning or distillation.

Load-bearing premise

That gradually stepping down precision block by block will stop quantization errors from accumulating enough to cause training divergence at 2 bits.

What would settle it

Training the same Llama models directly at 2 bits without the progressive stages and checking whether perplexity stays near 2.25 or the run fails to converge.

Figures

Figures reproduced from arXiv: 2604.07888 by Bei Liu, Binxing Xu, Chao Li, Hao Gu, Hao Wang, Jiacheng Liu, Lujun Li, Qiyuan Zhu, Sirui Han, Xintong Yang, Yike Guo.

**Figure 1.** Loss landscapes under different precisions. The vertical axis denotes the loss, the horizontal axes (α, β) represent random directions in parameter space.

**Figure 2.** Analysis of QAT challenges. (a) Training loss curve of direct QAT, exhibiting a prominent loss spike. (b) Layer-wise reconstruction loss and relative error across Transformer blocks, illustrating significant error accumulation in deeper layers. (c) Comparison of training budgets; our method (Bit-by-Bit) achieves a 3600× reduction in token requirements compared to ParetoQ.

**Figure 3.** Value distributions of various group granularities showing (a) Low-bit values are nested in the high-bit grid, (b) lower bits collapse representations; larger groups improve dynamic-range.

**Figure 4.** (a) Progressive Bit-by-Bit QAT: Direct 2-bit QAT drives weights into coarse clusters under a non-smooth loss landscape, progressive schedule that lowers precision stage-by-stage, using the higher-precision phase to stabilize and initialize the next stage. (b) Rounding-aware outlier channel splitting: detect outlier channels via metric ||x||2 · max|w|, then apply identical, rounding-aware halving that keeps…

**Figure 5.** Bit-shifting from higher to lower precision.

**Figure 6.** Kernel latency of BIT-BY-BIT relative to native PyTorch and Marlin of Wup, Wdown.

Original abstract

Training LLMs at ultra-low precision remains a formidable challenge. Direct low-bit QAT often suffers from convergence instability and substantial training costs, exacerbated by quantization noise from heavy-tailed outlier channels and error accumulation across layers. To address these issues, we present Bit-by-Bit, a progressive QAT framework with outlier channel splitting. Our approach integrates three key components: (1) block-wise progressive training that reduces precision stage by stage, ensuring stable initialization for low-bit optimization; (2) nested structure of integer quantization grids to enable a "train once, deploy any precision" paradigm, allowing a single model to support multiple bit-widths without retraining; (3) rounding-aware outlier channel splitting, which mitigates quantization error while acting as an identity transform that preserves the quantized outputs. Furthermore, we follow microscaling groups with E4M3 scales, capturing dynamic activation ranges in alignment with OCP/NVIDIA standards. To address the lack of efficient 2-bit kernels, we developed custom operators for both W2A2 and W2A16 configurations, achieving up to 11× speedup over BF16. Under W2A2 settings, Bit-by-Bit significantly outperforms baselines like BitDistiller and EfficientQAT on both Llama2/3, achieving a loss of only 2.25 WikiText2 PPL compared to full-precision models.
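
The microscaling component mentioned in the abstract can be illustrated with a small NumPy sketch: values are quantized in fixed-size groups, each with its own scale, and the scale is rounded to the FP8 E4M3 grid. Several things here are assumptions rather than details from the paper: the group size of 32, the symmetric absmax scaling rule, the software simulation of E4M3 rounding, and all function names.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format

def round_to_e4m3(x):
    """Simulate rounding non-negative scales to FP8 E4M3 (3 mantissa bits)."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, E4M3_MAX)
    _, e = np.frexp(x)                                  # x = m * 2**e, m in [0.5, 1)
    # Step size within the binade; exponents below -6 fall into the subnormal range.
    quantum = np.ldexp(1.0, np.maximum(e - 1, -6) - 3)
    return np.minimum(np.round(x / quantum) * quantum, E4M3_MAX)

def quantize_groups(w, bits=2, group_size=32):
    """Symmetric per-group integer quantization, one simulated-E4M3 scale per group."""
    qmax = (1 << (bits - 1)) - 1                        # 1 for 2-bit codes in [-2, 1]
    qmin = -(1 << (bits - 1))
    groups = w.reshape(-1, group_size)
    absmax = np.abs(groups).max(axis=1, keepdims=True)
    scale = round_to_e4m3(absmax / max(qmax, 1))
    scale = np.where(scale == 0.0, 1.0, scale)          # guard all-zero groups
    q = np.clip(np.round(groups / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize_groups(q, scale, shape):
    """Map integer codes back to real values using the per-group scales."""
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))
q, scale = quantize_groups(w, bits=2, group_size=32)
w_hat = dequantize_groups(q, scale, w.shape)
print(q.shape, scale.shape, float(np.abs(w - w_hat).mean()))
```

Smaller groups give each scale fewer values to cover, which is the usual motivation for group-wise scaling when a few large values would otherwise dominate the range.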

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Bit-by-Bit combines progressive block-wise QAT, nested grids, and outlier splitting to reach stable W2A2 LLMs with a multi-precision deployment option and 11x custom-kernel speedups, but end-to-end results without component ablations leave the progressive schedule's specific contribution unclear.

The paper's main contribution is a training recipe that moves LLMs from higher to lower precision in blocks while using nested integer grids and rounding-aware outlier splitting. This setup aims to avoid the usual convergence crashes at 2 bits and produces one set of weights that can be deployed at several bit widths without retraining. They also add microscaling with E4M3 and ship custom W2A2 and W2A16 kernels that deliver up to 11x speedup over BF16. On Llama-2 and Llama-3 the W2A2 version shows only 2.25 higher WikiText-2 perplexity than full precision and beats the cited baselines. That combination of ideas and the reported numbers are the concrete advance. The engineering focus on standards-compliant scaling and fast kernels is useful for anyone who actually ships quantized models.

The soft spot is the missing isolation of the progressive schedule. The abstract and stress-test note give only the full pipeline results, so it is not possible to tell whether the block-wise reduction, the outlier split, or simply the initialization and kernels are doing the heavy lifting on stability. No error bars, training curves, or ablation tables are referenced, which keeps the causal story thin.

The work is aimed at practitioners who need low-bit inference with acceptable accuracy loss and who value the multi-precision flexibility. A reader already working on QAT will pick up the custom kernels and the nested-grid trick, but will still want to see the full experimental section and code before treating the 2.25 PPL number as settled. It is worth sending to peer review because the problem is real, the speedups are concrete, and the framework is described clearly enough that referees can ask for the missing ablations and reproducibility details.

Referee Report

2 major / 1 minor

Summary. The paper presents Bit-by-Bit, a progressive quantization-aware training (QAT) framework for low-bit large language models (LLMs). It combines block-wise progressive precision reduction, nested integer quantization grids for multi-precision support, and rounding-aware outlier channel splitting to achieve stable training at ultra-low bits like W2A2. The approach includes microscaling with E4M3 scales and custom kernels for up to 11x speedup, claiming superior performance over baselines such as BitDistiller and EfficientQAT on Llama2 and Llama3 models with only 2.25 perplexity degradation on WikiText2 compared to full-precision models.

Significance. Should the empirical results prove robust, this contribution would be significant for enabling efficient, low-memory deployment of LLMs by making 2-bit quantization practical with minimal accuracy loss and hardware support, addressing key challenges in quantization noise and convergence instability.

major comments (2)

[Experimental Evaluation] The central claim of significant outperformance under W2A2 settings with 2.25 WikiText2 PPL relies on end-to-end results without ablations that remove the block-wise progressive training while keeping outlier splitting and nested grids fixed; this leaves the causal role of the progressive schedule unisolated, as noted in the lack of training curves or component-wise comparisons.
[Methodology] Insufficient details are provided on baseline reproduction, hyperparameter selection, and error bars across multiple runs, which is critical given the sensitivity of low-bit QAT to initialization and optimization choices.

minor comments (1)

[Abstract] The abstract mentions 'up to 11× speedup' but does not specify the hardware or exact configurations for the W2A2 and W2A16 cases.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and will revise the manuscript to strengthen the experimental evidence and reproducibility details.

Referee: [Experimental Evaluation] The central claim of significant outperformance under W2A2 settings with 2.25 WikiText2 PPL relies on end-to-end results without ablations that remove the block-wise progressive training while keeping outlier splitting and nested grids fixed; this leaves the causal role of the progressive schedule unisolated, as noted in the lack of training curves or component-wise comparisons.

Authors: We agree that the current end-to-end comparisons, while demonstrating overall superiority of Bit-by-Bit over baselines such as BitDistiller and EfficientQAT, do not fully isolate the contribution of the block-wise progressive training schedule. To address this, the revised manuscript will include new ablation experiments that disable the progressive precision reduction (while retaining outlier channel splitting and nested integer grids) and report the resulting performance degradation. We will also add training curves comparing convergence behavior with and without the progressive schedule under W2A2 settings to better illustrate its role in stability.

revision: yes

Referee: [Methodology] Insufficient details are provided on baseline reproduction, hyperparameter selection, and error bars across multiple runs, which is critical given the sensitivity of low-bit QAT to initialization and optimization choices.

Authors: We appreciate this point on reproducibility, which is especially relevant for low-bit QAT. In the revised version, we will expand the experimental setup section with full details on baseline reproduction (including exact hyperparameter values, learning rate schedules, and optimization choices for BitDistiller and EfficientQAT), as well as our own hyperparameter selection process. Additionally, we will report error bars (standard deviation across three independent runs with different random seeds) for the primary W2A2 results on Llama2/3 models to quantify sensitivity to initialization.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training recipe with independent benchmark support

The paper describes an empirical QAT framework (block-wise progressive precision reduction, nested integer grids, rounding-aware outlier splitting, and microscaling) whose claims are validated through end-to-end benchmark results on Llama2/3 models rather than any mathematical derivation. No equations, uniqueness theorems, or self-citations are invoked to force the method's components; each element is introduced as a practical design choice and evaluated via reported PPL and speedup numbers. The derivation chain is therefore self-contained against external benchmarks and does not reduce any prediction or result to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical training heuristics rather than formal axioms or new theoretical entities; the only potential free parameters are the choice of precision-reduction schedule and the microscaling group size, both selected by the authors to match observed activation ranges.

free parameters (2)

precision reduction schedule
The sequence of bit-width stages (e.g., 16->8->4->2) is chosen by the authors to ensure stable initialization; no automatic derivation is given.
microscaling group size
Group size for E4M3 scales is set to capture dynamic activation ranges per the OCP/NVIDIA standard but remains a design choice.
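
To make the first free parameter concrete, the sketch below enumerates a block-wise progressive plan for a given bit-width schedule. It only fixes an ordering; the default schedule reuses the 16->8->4->2 example above, while the choice to finish every block at one precision before lowering it, and the names `Step` and `progressive_plan`, are assumptions for illustration rather than the authors' procedure.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence

@dataclass(frozen=True)
class Step:
    stage: int   # index into the bit-width schedule
    bits: int    # target precision for this stage
    block: int   # Transformer block being trained

def progressive_plan(num_blocks: int,
                     schedule: Sequence[int] = (16, 8, 4, 2)) -> Iterator[Step]:
    """Yield steps that finish all blocks at one precision before lowering it."""
    if any(a <= b for a, b in zip(schedule, schedule[1:])):
        raise ValueError("schedule must be strictly decreasing")
    for stage, bits in enumerate(schedule):
        for block in range(num_blocks):
            yield Step(stage, bits, block)

for step in progressive_plan(num_blocks=2):
    # A real loop would initialize this block from the previous stage's
    # weights and run QAT at `step.bits`; here we only print the order.
    print(step)
```

Skipping stages (for example, jumping straight from 16 to 2) corresponds to the direct low-bit QAT setting that the paper reports as unstable.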
