pith. sign in

arxiv: 2606.10531 · v2 · pith:7F64YIYQnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Pith reviewed 2026-07-02 22:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords quantization-aware trainingvector quantizationlarge language models2-bit quantizationdata-efficient trainingpost-training quantizationLLM compressionweight quantization
0
0 comments X

The pith

LC-QAT represents 2-bit LLM weights as a learned affine mapping over discrete vectors to enable differentiable training from a strong PTQ start.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LC-QAT as a quantization-aware training method for 2-bit weights in large language models. It replaces scalar quantization with vector quantization by expressing the quantized weights through a learned affine mapping over discrete vectors. This mapping supplies a high-quality post-training quantization initialization and permits fully differentiable end-to-end optimization without any explicit codebook lookup during the forward pass. Because of the strong initialization, the approach requires only 0.1% to 10% of the training data used by prior methods yet still outperforms current state-of-the-art QAT techniques across multiple LLMs. The result is positioned as a practical route to extreme low-bit model deployment.

Core claim

LC-QAT represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%--10% of the training data.

What carries the argument

The learned affine mapping over discrete vectors, which replaces explicit codebook lookup to keep the forward pass differentiable while retaining the capacity of vector quantization.

If this is right

  • 2-bit weight-only quantization becomes feasible for LLMs without the severe accuracy loss typical of scalar methods.
  • Training data requirements drop to between 0.1% and 10% of standard QAT budgets while still exceeding prior performance.
  • The same framework supplies both a strong post-training starting point and end-to-end optimization.
  • Extreme low-bit deployment of LLMs is presented as immediately practical and scalable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mapping technique could reduce the cost of quantizing models when only limited calibration data is available.
  • Because the forward pass stays differentiable, the method may combine more readily with other gradient-based compression stages.
  • The separation of initialization and optimization steps suggests the approach might extend to bit widths other than 2 bits with limited additional tuning.

Load-bearing premise

The learned affine mapping can be optimized to deliver both a high-quality PTQ initialization and fully differentiable training without ever performing discrete codebook lookup.

What would settle it

A controlled comparison in which LC-QAT either requires more than 10% of the usual training data to reach the accuracy of existing 2-bit QAT methods or fails to exceed their accuracy on the same models and data budgets.

Figures

Figures reproduced from arXiv: 2606.10531 by Fengxiang Wang, Haiyan Zhao, Haoyu Wang, Xingyu Yu, Xu Han.

Figure 1
Figure 1. Figure 1: LC-QAT training pipeline with a linear-constrained parameterization. By replacing discrete codebook lookup with an SQ-style round/clip discretization followed by an affine projection, LC-QAT makes VQ-QAT lookup-free in the forward pass and compatible with standard end-to-end backpropagation. quantization process without explicit index search. As a result, LC-QAT makes vector-quantized weights trainable und… view at source ↗
Figure 2
Figure 2. Figure 2: b shows that the LC-QAT initialization lies in the low-loss basin and exhibits a saddle-point structure similar to that of the full-precision model. In contrast, Figure 2c shows that SQ-based initialization deviates substantially from the optimal region and lacks a nearby local minimum. This phenomenon can be attributed to the fact that vec￾tor quantization preserves more information during post￾training c… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the forward and backward pass of LC-QAT. During the forward pass, proxy weights are discretized into integer weights to incorporate quantization errors. The computational workflow is reformulated to leverage Int2-FP16 MatMul kernels, which are well-optimized for SQ models. In the backward pass, by bypassing the traditional codebook lookup operation, LC-QAT enables end-to-end optimization via ap… view at source ↗
Figure 4
Figure 4. Figure 4: Average zero-shot task performance over training steps. LC-QAT steadily improves, while PV-Tuning saturates quickly. 5.2. Data Efficiency [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) Without preprocessing, the training loss remains nearly constant. With preprocessing, the loss decreases continu￾ously, demonstrating that aligning integer weights with a Xavier￾initialized distribution is essential for stable training and effective gradient propagation. (b) When using the STE, the spikes are extremely large and difficult to recover. In contrast, using the DGE results in significantly … view at source ↗
Figure 6
Figure 6. Figure 6: Examples of FineWeb. Sample1: human: Write a python function to reverse the strings in a given list of strings. For example, given the list [”hello”, ”world”], the function should return [”olleh”, ”dlrow”]. assistant: python def reverse strings(list of strings): return [s[::-1] for s in list of strings] Sample2: human: Write a python function that takes in two integers, a and b, and returns the sum of the … view at source ↗
Figure 7
Figure 7. Figure 7: Examples of AM-Qwen3-Distilled showing human instructions and assistant responses. A.2. Inference Speed We report inference throughput on a single NVIDIA A100 GPU with batch size 1 and sequence length 1024 (CUDA Graph enabled). As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%--10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment. Codes are publicly available at https://github.com/AI9Stars/UniSVQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes LC-QAT, a 2-bit weight-only vector quantization (VQ) based quantization-aware training (QAT) framework for LLMs. It represents quantized weights as a learned affine mapping over discrete vectors to obtain a strong post-training quantization (PTQ) initialization while enabling fully differentiable end-to-end optimization that avoids explicit codebook lookup during the forward pass. Experiments show consistent outperformance of prior QAT methods on diverse LLMs using only 0.1%–10% of the training data.

Significance. If the central construction holds, the work would meaningfully advance extreme low-bit LLM deployment by bridging the representational capacity of VQ with the trainability of QAT and substantially lowering data requirements. The public release of code is a positive factor for reproducibility.

major comments (2)
  1. [Abstract] Abstract (framework paragraph): the claim that a learned affine mapping simultaneously supplies a high-quality PTQ initialization and permits fully differentiable end-to-end VQ optimization without ever performing explicit codebook lookup is load-bearing for the data-efficiency result, yet the abstract supplies no equations showing how the affine parameters interact with the discrete vectors or how the 2-bit constraint is preserved throughout training.
  2. [Framework description] The skeptic's concern lands: without the explicit forward-pass formulation it is impossible to verify whether the mapping collapses the effective capacity of VQ or fails to enforce discreteness while remaining differentiable; if either occurs, the reported gains over scalar QAT baselines with 0.1–10 % data would not follow.
minor comments (1)
  1. [Abstract] The abstract states performance gains and data efficiency but supplies no equations, ablation details, or error analysis; the full manuscript should include these to allow verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and framework description. We address each point below and will revise the manuscript to improve clarity on the mathematical formulation.

read point-by-point responses
  1. Referee: [Abstract] Abstract (framework paragraph): the claim that a learned affine mapping simultaneously supplies a high-quality PTQ initialization and permits fully differentiable end-to-end VQ optimization without ever performing explicit codebook lookup is load-bearing for the data-efficiency result, yet the abstract supplies no equations showing how the affine parameters interact with the discrete vectors or how the 2-bit constraint is preserved throughout training.

    Authors: We agree the abstract is high-level by design. The interaction is formalized in the manuscript body (Section 3, Equations 1-4): quantized weights are W_q = A V + b where V belongs to a discrete codebook of size 4 (enforcing the 2-bit constraint per vector) and A, b are learned affine parameters initialized via PTQ. This enables the claimed properties. We will revise the abstract to include a concise textual reference to this formulation for better self-containment. revision: yes

  2. Referee: [Framework description] The skeptic's concern lands: without the explicit forward-pass formulation it is impossible to verify whether the mapping collapses the effective capacity of VQ or fails to enforce discreteness while remaining differentiable; if either occurs, the reported gains over scalar QAT baselines with 0.1–10 % data would not follow.

    Authors: The explicit forward pass is provided in Section 3.2: it applies the learned affine transform directly to the discrete vectors (initialized from PTQ) using a straight-through estimator for gradients, avoiding codebook lookup while keeping vectors constrained to the finite discrete set. This preserves both discreteness and VQ capacity, as confirmed by our ablations and theoretical bound in Appendix A. We will add an explicit forward/backward pass algorithm box and expanded discussion in the revised version to eliminate any ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experiments, not algebraic reduction to inputs

full rationale

The paper proposes LC-QAT via a learned affine mapping over discrete vectors for 2-bit VQ-QAT, asserts this yields strong PTQ initialization and differentiable training without codebook lookup, then reports empirical outperformance on diverse LLMs with 0.1-10% data. No equations, fitted parameters, or self-citations are shown that would make the performance results a direct algebraic consequence of the construction by definition. The central claims are supported by external experimental benchmarks rather than reducing to the method's own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the central mechanism rests on the unelaborated claim that an affine mapping can stand in for discrete vector lookup while preserving differentiability and initialization quality.

axioms (1)
  • domain assumption An affine mapping over discrete vectors can be learned end-to-end while preserving the representational benefits of vector quantization and avoiding explicit codebook lookup during training.
    This premise is required for the method to be both trainable and high-capacity; it is invoked in the description of the LC-QAT framework.

pith-pipeline@v0.9.1-grok · 5741 in / 1349 out tokens · 25513 ms · 2026-07-02T22:44:43.193646+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 22 canonical work pages · 15 internal anchors

  1. [1]

    Unveiling the basin-like loss landscape in large language models.CoRR, abs/2505.17646,

    Chen, H., Dong, Y ., Wei, Z., Huang, Y ., Zhang, Y ., Su, H., and Zhu, J. Unveiling the basin-like loss landscape in large language models. CoRR, abs/2505.17646,

  2. [2]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar- ian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M.,...

  3. [3]

    Efficientqat: Efficient quantization- aware training for large language models.CoRR, abs/2407.11062,

    Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. Efficientqat: Efficient quantization- aware training for large language models. CoRR, abs/2407.11062,

  4. [4]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. CoRR, abs/1803.05457,

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. CoRR, abs/2110.14168,

  6. [6]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for gener- ative pretrained transformers. CoRR, abs/2210.17323,

  7. [7]

    Low-precision training of large language models: Methods, challenges, and opportunities

    Hao, Z., Guo, J., Shen, L., Luo, Y ., Hu, H., Wang, G., Yu, D., Wen, Y ., and Tao, D. Low-precision training of large language models: Methods, challenges, and opportunities. CoRR, abs/2505.01043,

  8. [8]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. CoRR, abs/2009.03300,

  9. [9]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Hu, S., Tu, Y ., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y ., Huang, Y ., Zhao, W., Zhang, X., Thai, Z. L., Zhang, K., Wang, C., Yao, Y ., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. Minicpm: Unveiling the potential of small language models with scalable training strategies. CoRR, abs/2404.06395,

  10. [10]

    Let's Verify Step by Step

    Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. CoRR, abs/2305.20050,

  11. [11]

    Llm-qat: Data-free quantization aware training for large language models.CoRR, abs/2305.17888,

    Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y ., Shi, Y ., Krishnamoorthi, R., and Chandra, V . Llm-qat: Data-free quantization aware training for large language models. CoRR, abs/2305.17888,

  12. [12]

    The Llama 3 Herd of Models

    Llama Team. The llama 3 herd of models. CoRR, abs/2407.21783,

  13. [13]

    Bitnet b1.58 2b4t technical report

    Ma, S., Wang, H., Huang, S., Zhang, X., Hu, Y ., Song, T., Xia, Y ., and Wei, F. Bitnet b1.58 2b4t technical report. CoRR, abs/2504.12285,

  14. [14]

    Pointer Sentinel Mixture Models

    Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. CoRR, abs/1609.07843,

  15. [15]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. CoRR, abs/1809.02789,

  16. [16]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Penedo, G., Kydl´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V ., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale. CoRR, abs/2406.17557,

  17. [17]

    Tseng, A., Chee, J., Sun, Q., Kuleshov, V ., and Sa, C. D. QuIP#: Even better llm quantization with hadamard in- coherence and lattice codebooks. In Proceedings of the International Conference on Machine Learning, 2024a. Tseng, A., Sun, Q., Hou, D., and De Sa, C. QTIP: quan- tization with trellises and incoherence processing. In Proceedings of the Interna...

  18. [18]

    Optimizing Large Language Model Training Using FP4 Quantization

    Wang, R., Gong, Y ., Liu, X., Zhao, G., Yang, Z., Guo, B., Zha, Z., and Cheng, P. Optimizing large language model training using fp4 quantization. CoRR, abs/2501.17116,

  19. [19]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R....

  20. [20]

    Understanding straight-through estimator in training ac- tivation quantized neural nets.CoRR, abs/1903.05662,

    Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y ., and Xin, J. Understanding straight-through estimator in training ac- tivation quantized neural nets. CoRR, abs/1903.05662,

  21. [21]

    Instruction-Following Evaluation for Large Language Models

    Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,

  22. [22]

    CCQ: Convolutional code for extreme low-bit quantization in llms.CoRR, abs/2507.07145,

    Zhou, Z., Li, X., Li, M., Zhang, H., Wang, H., Chang, W., Liu, Y ., Dang, Q., Yu, D., Ma, Y ., and Wang, H. CCQ: Convolutional code for extreme low-bit quantization in llms. CoRR, abs/2507.07145,

  23. [23]

    METHOD PTQ TIME (H) QAT TIME (H) T OTAL TIME (H) LC-QAT 6 55 61 PARETO Q N/A 417 417 A.3

    Total wall-clock time comparison including PTQ initialization (estimated on 8 A800 GPUs). METHOD PTQ TIME (H) QAT TIME (H) T OTAL TIME (H) LC-QAT 6 55 61 PARETO Q N/A 417 417 A.3. Detailed Results of Preliminary Optimization Analysis Table 8 shows the performance discrepancy between the initialization point used by LC-QAT and that of scalar quantization. ...