
arXiv: 2506.12040 · v2 · submitted 2025-05-24 · 💻 cs.LG · cs.AI · cs.CV

BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

Pith reviewed 2026-05-19 13:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords LLM quantization · sub-1-bit compression · binary codebook · learnable transformation · model compression · efficient inference · weight clustering · extreme quantization

The pith

Learnable transformation plus binary codebook lets LLMs run at 0.8 bits with 3.1 percent zero-shot accuracy loss and 1.6x speedup over FP16.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BTC-LLM as a sub-1-bit quantization method for large language models that avoids the drawbacks of earlier sparsity-based binarization approaches. It introduces a learnable transformation to shrink outliers and align sign patterns across weights, paired with a binary codebook that groups similar weight vectors into short indices. These steps remove the need for sparse masks, so the quantized model runs on ordinary hardware. Tests on LLaMA, Qwen and other families show compression down to 0.7-1.11 bits while keeping most task performance. At 0.8 bits on LLaMA-2-13B the accuracy drop stays near 3 percent and inference speeds up by 1.6 times compared with full-precision weights.

Core claim

The authors show that a learnable linear transformation followed by binary pattern clustering can compress LLM weights below one bit per parameter by replacing repeated vectors with compact codebook indices and by removing the requirement for explicit sparsity masks, thereby delivering both memory reduction and standard-hardware compatibility while limiting accuracy loss on zero-shot benchmarks to a few percent.
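
To make the "below one bit" arithmetic concrete: if each group of v binary weights is replaced by an index into a codebook of K entries, the index costs log2(K)/v bits per weight, and the shared codebook adds K·v bits amortized over the whole layer. The sketch below is a hedged illustration of that accounting. The function name, the layer shape, the vector length and the codebook size are all assumptions chosen for the example, not the paper's reported configuration, and per-group scales and biases are ignored.

```python
import math

def bits_per_weight(num_weights: int, vec_len: int, codebook_size: int) -> float:
    """Average storage cost per weight for index-coded binary vectors.

    Hypothetical helper: counts index bits plus the amortized cost of storing
    the codebook itself (one bit per codebook element). Scales, biases, and
    transformation parameters are not counted.
    """
    num_vectors = math.ceil(num_weights / vec_len)
    index_bits = num_vectors * math.ceil(math.log2(codebook_size))
    codebook_bits = codebook_size * vec_len
    return (index_bits + codebook_bits) / num_weights

# Illustrative numbers only: a 4096 x 4096 layer, vectors of 20 signs,
# and a codebook of 2**16 entries (16-bit indices, so 16 / 20 = 0.8 index bits).
example = bits_per_weight(4096 * 4096, vec_len=20, codebook_size=2**16)
```

With these assumed values the index cost alone is 0.8 bits per weight; the codebook overhead pushes the total somewhat higher for a single layer and shrinks as the number of weights sharing one codebook grows.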

What carries the argument

Binary Codebook that clusters recurring weight vectors into compact indices using custom distance metrics and sign-based updates; paired with a Learnable Transformation that reduces outliers and promotes shared sign patterns.
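
A minimal sketch of the clustering idea, assuming a k-means-style loop over ±1 vectors: assignment by Hamming distance, centroid update by taking the sign of each cluster's per-coordinate sum, and an early exit when the distinct vectors already fit in the codebook. The paper's "custom distance metrics" are not specified in this summary, so Hamming distance, the random initialization and every name below are assumptions rather than the authors' implementation.

```python
import random

def hamming(a, b):
    """Number of positions where two ±1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def sign(x):
    # Ties (x == 0) are broken toward +1 so centroids stay in {-1, +1}.
    return 1 if x >= 0 else -1

def binary_codebook(vectors, k, iters=10, seed=0):
    """Cluster ±1 vectors into at most k binary centroids (illustrative sketch)."""
    unique = sorted(set(map(tuple, vectors)))
    if len(unique) <= k:
        # Early termination: every distinct vector gets its own entry.
        index = {v: i for i, v in enumerate(unique)}
        return [list(c) for c in unique], [index[tuple(v)] for v in vectors]

    rng = random.Random(seed)
    codebook = [list(c) for c in rng.sample(unique, k)]
    for _ in range(iters):
        # Assignment step: nearest centroid under Hamming distance.
        assign = [min(range(k), key=lambda j: hamming(v, codebook[j])) for v in vectors]
        # Update step: sign of the per-coordinate sum of each cluster's members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # empty clusters keep their previous centroid
                codebook[j] = [sign(sum(col)) for col in zip(*members)]
    # Final assignment so the indices match the updated centroids.
    assign = [min(range(k), key=lambda j: hamming(v, codebook[j])) for v in vectors]
    return codebook, assign
```

Taking the sign of the sum is equivalent to taking the sign of the mean, which keeps every centroid binary without a separate projection step.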

If this is right

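  • LLMs become deployable with a small fraction of the FP16 weight footprint (0.8 bits against 16 bits is a nominal 20x reduction, before scales, codebooks, and any unquantized layers are counted) while retaining most of the original zero-shot accuracy.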
  • Inference no longer requires custom sparse-matrix kernels or mask storage, allowing use on standard GPUs and CPUs.
  • Models from multiple families (LLaMA, Qwen, FBI-LLM) reach compression ratios between 0.7 and 1.11 bits with consistent speed gains.
  • The 1.6x wall-clock improvement over FP16 at 0.8 bits scales directly with reduced data movement.
  • Elimination of per-weight masks removes a source of runtime overhead that previously limited extreme binarization.
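
A back-of-envelope check of the memory implication, for weight storage alone. The parameter count is the nominal 13B figure, and the helper name is hypothetical; scales, codebooks, embeddings, activations and the KV cache are deliberately left out, so real footprints will be larger than this sketch suggests.

```python
def weight_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 13e9  # nominal parameter count for a 13B model (assumption)
fp16_gb = weight_storage_gb(params, 16)   # full-precision baseline
sub1_gb = weight_storage_gb(params, 0.8)  # the 0.8-bit setting quoted above
ratio = fp16_gb / sub1_gb                 # nominal reduction factor
```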

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transformation-plus-codebook pattern may extend to other neural-network families beyond transformers if the sign-pattern clustering generalizes.
  • Further bit reduction below 0.7 bits could be tested by increasing codebook size or adding a second transformation stage.
  • Energy cost per token should drop proportionally with memory bandwidth, which matters for battery-powered or edge devices.
  • The approach might combine with post-training methods such as knowledge distillation to recover any remaining accuracy gap.

Load-bearing premise

The learnable transformation reliably reduces outliers and creates shared sign patterns across weights without introducing hidden failure modes that standard zero-shot tests would miss.
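
One way to probe this premise is to compare outlier statistics before and after the transformation. The sketch below uses a fixed 4x4 Hadamard rotation as a stand-in. The paper's transformation is learned and, per the Figure 4 caption, also includes diagonal scaling, so this only illustrates why an orthogonal map can shrink the largest magnitude while leaving the layer output unchanged. The vectors and helper names are illustrative assumptions.

```python
H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def rotate(vec):
    """Apply the orthonormal 4x4 Hadamard matrix (H4 / 2) to a length-4 vector."""
    return [sum(h * x for h, x in zip(row, vec)) / 2 for row in H4]

def max_abs(vec):
    return max(abs(x) for x in vec)

def l2_norm(vec):
    return sum(x * x for x in vec) ** 0.5

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = [8.0, 0.1, -0.2, 0.1]   # one weight row with a dominant outlier
x = [1.0, -2.0, 0.5, 3.0]   # an arbitrary input vector

w_rot = rotate(w)           # outlier energy is spread over all coordinates
x_rot = rotate(x)           # the same rotation is applied to the input

# The rotation preserves the norm and the inner product, so the layer
# output w . x is unchanged while the largest weight magnitude drops.
same_output = abs(dot(w_rot, x_rot) - dot(w, x)) < 1e-9
```

A real validation would run the same before-and-after comparison on the learned transformation and on held-out inputs, which is the kind of evidence the premise needs.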

What would settle it

An experiment that measures accuracy collapse or loss of speedup on a held-out task suite or on a different hardware platform after the same transformation and codebook training.

Figures

Figures reproduced from arXiv: 2506.12040 by Bei Liu, Hao Gu, Hao Wang, Jiacheng Liu, Lei Wang, Lujun Li, Qiyuan Zhu, Sirui Han, Yike Guo, Zheyu Wang.

Figure 1: Binary vector distribution (length 10) from …
Figure 2: Activation distributions for the self_attn.k_proj layer in the LLaMA-2-7B model: (a) Original FP16 (max abs: 8), (b) BiLLM (max abs: 15), (c) ARB-LLM (max abs: 10), and (d) our proposed BTC-LLM (max abs: 0.4). Binary quantization [30] represents the most aggressive quantization approach, converting floating-point weights to binary values (±1) to reduce memory requirements by over 32× [19]. For instance, Bi…
Figure 3: Perplexity of LLaMA-2-7B on WikiText2. Our BTC-LLM outperforms 2-bit methods at 0.9-bit. Our comprehensive evaluations of the LLaMA family of models (7B to 65B parameters) demonstrate the superior performance of BTC-LLM in multiple bit width settings, as shown in …
Figure 4: Overall architecture of BTC-LLM. (a) Sub-bit pipeline: the ARB quantizer transforms full-precision weights into binary form with associated scale and bias, followed by binary codebook representation and index assignment. (b) Structure of transformed attention (b1) and FFN (b2) blocks. The diagonal scaling and orthogonal transformation are merged into the weights to ensure computational equivalence and effi…
Figure 5: Trade-offs of runtime, throughput, memory usage, and accuracy under sub-1-bit quantization
Figure 6: Extension of BTC-LLM for activation and kv cache quantization
Figure 7: Visualizations comparing of the weight relative quantize error of LLaMA-2-7B with …
Figure 8: Visualizations comparing of the weight relative quantize error of LLaMA-2-7B with …
Figure 9: Visualizations of the activation distribution of different layers in LLaMA-2-7B before and …
Figure 10: Visualizations of the activation distribution of different layers in LLaMA-2-7B before and …
Original abstract

Binary quantization represents the most extreme form of compression, reducing weights to +/-1 for maximal memory and computational efficiency. While recent sparsity-aware binarization achieves sub-1-bit compression via weight pruning, it faces critical challenges: performance degradation, mask-management overhead, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages binary pattern clustering and weight transformation to overcome these limitations. Our approach incorporates two key innovations: (1) a Binary Codebook that clusters recurring vectors into compact indices using custom distance metrics and sign-based updates; (2) a Learnable Transformation that reduces outliers and promotes shared sign patterns among binary weights. This eliminates sparse masks, enabling efficient inference on standard hardware. Extensive evaluations across LLaMA, Qwen, and FBI-LLM families demonstrate that BTC-LLM achieves state-of-the-art results in extreme compression (1.11-0.7 bits). Notably, BTC-LLM compressed to 0.8 bits on LLaMA-2-13B maintains high performance, with only a 3.1 percent accuracy drop in zero-shot benchmarks, while delivering a 1.6x speedup over FP16.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents BTC-LLM, a sub-1-bit LLM quantization framework that uses a Binary Codebook to cluster recurring weight vectors via custom distance metrics and sign-based updates, combined with a Learnable Transformation to reduce outliers and promote shared sign patterns. This design is claimed to eliminate sparse masks and enable efficient inference on standard hardware. Evaluations across LLaMA, Qwen, and FBI-LLM families report state-of-the-art results in the 0.7–1.11 bit range, with the specific result that LLaMA-2-13B at 0.8 bits incurs only a 3.1% drop in zero-shot accuracy while achieving a 1.6x speedup over FP16.

Significance. If the central claims are substantiated, the work would advance extreme quantization by addressing mask overhead and hardware compatibility, offering practical value for resource-constrained LLM deployment. The multi-family evaluation and concrete speedup number are positive features. However, the reliance on learned transformation parameters and codebook choices fitted to observed weight distributions creates moderate risk that the reported accuracy and speedup are not fully independent of those modeling decisions.

major comments (2)
  1. [Abstract] The claim that the Learnable Transformation 'eliminates sparse masks' for standard-hardware inference is load-bearing for both the 0.8-bit accuracy result and the 1.6x speedup, yet the manuscript provides no direct quantitative validation such as pre/post-transformation outlier norms, sign-pattern entropy, or measured kernel latency without custom masks; zero-shot accuracy alone does not rule out new failure modes introduced by the binary codebook clustering step.
  2. [Abstract] The 3.1% accuracy drop on LLaMA-2-13B at 0.8 bits and the reported speedup rest on the assumption that codebook size, distance metric, and learnable transformation parameters are chosen independently of the final benchmark numbers; because these are free parameters explicitly fitted to weight distributions, the evaluation risks circularity that must be addressed with explicit ablation or held-out validation.
minor comments (1)
  1. Consider adding a consolidated table of bit-width, accuracy drop, and speedup across all model families to improve readability of the experimental claims.
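
The first major comment asks for sign-pattern entropy without defining it. One plausible reading, offered here as an assumption rather than the referee's or the authors' definition, is the Shannon entropy of the empirical distribution of sign patterns over consecutive weight groups. Lower entropy means fewer distinct patterns carry most of the mass, which is the regime a small codebook can exploit.

```python
import math
from collections import Counter

def sign_pattern_entropy(weights, vec_len):
    """Shannon entropy (in bits) of sign patterns of consecutive weight groups.

    Hypothetical metric: weights are split into non-overlapping groups of
    vec_len, each group is reduced to its ±1 sign pattern (zero counts as +1),
    and the entropy of the pattern frequencies is returned.
    """
    patterns = Counter(
        tuple(1 if w >= 0 else -1 for w in weights[i:i + vec_len])
        for i in range(0, len(weights) - vec_len + 1, vec_len)
    )
    total = sum(patterns.values())
    return -sum((c / total) * math.log2(c / total) for c in patterns.values())
```

Reporting this number before and after the transformation, per layer, would give a direct measure of whether sign patterns actually become more shared.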

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commitments to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] The claim that the Learnable Transformation 'eliminates sparse masks' for standard-hardware inference is load-bearing for both the 0.8-bit accuracy result and the 1.6x speedup, yet the manuscript provides no direct quantitative validation such as pre/post-transformation outlier norms, sign-pattern entropy, or measured kernel latency without custom masks; zero-shot accuracy alone does not rule out new failure modes introduced by the binary codebook clustering step.

    Authors: We agree that direct supporting metrics would make the claim more robust. The reported 1.6x speedup was measured using standard GPU kernels that perform index-based lookups from the binary codebook, which by design requires no sparse masks. In the revised manuscript we will add: (i) pre- and post-transformation outlier-norm statistics, (ii) sign-pattern entropy before and after the transformation, and (iii) a latency breakdown isolating the contribution of the codebook lookup. These additions will also help rule out new failure modes introduced by clustering. revision: yes

  2. Referee: [Abstract] The 3.1% accuracy drop on LLaMA-2-13B at 0.8 bits and the reported speedup rest on the assumption that codebook size, distance metric, and learnable transformation parameters are chosen independently of the final benchmark numbers; because these are free parameters explicitly fitted to weight distributions, the evaluation risks circularity that must be addressed with explicit ablation or held-out validation.

    Authors: Codebook size and transformation parameters are determined from the observed weight statistics of each model during quantization, which is the standard practice for learned quantization methods. We already report ablations over codebook sizes and transformation strengths in the experimental section. To further address the circularity concern, we will include additional results on held-out model families and benchmark suites in the revision. revision: yes
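
The first response above describes inference as index-based lookups into the binary codebook, with no sparse masks. A plain-Python sketch of that idea follows. It is not the GPU kernel the authors measured; the per-row scale is an assumption suggested by the Figure 4 caption, the bias term is omitted, and all names and values are illustrative.

```python
def decode_row(indices, codebook, scale):
    """Rebuild one weight row from codebook indices and a per-row scale."""
    row = []
    for idx in indices:
        row.extend(scale * s for s in codebook[idx])
    return row

def matvec(index_rows, scales, codebook, x):
    """y = W x, where each row of W is stored as codebook indices plus a scale."""
    out = []
    for indices, scale in zip(index_rows, scales):
        row = decode_row(indices, codebook, scale)
        out.append(sum(w * xi for w, xi in zip(row, x)))
    return out

codebook = [[1, 1], [1, -1], [-1, 1], [-1, -1]]  # 4 binary entries of length 2
index_rows = [[0, 3], [1, 2]]                    # two weight rows, two indices each
scales = [0.5, 2.0]                              # one scale per row (assumed layout)
y = matvec(index_rows, scales, codebook, [1.0, 2.0, 3.0, 4.0])
```

Every stored row is dense after decoding, which is why no mask bookkeeping is needed. A latency breakdown of the kind the authors promise would show how much of the runtime the lookup itself costs.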

Circularity Check

0 steps flagged

No significant circularity detected in method design or claims

full rationale

The paper introduces BTC-LLM as an empirical quantization framework relying on a learnable transformation and binary codebook, with performance validated through zero-shot benchmarks and hardware speedup measurements on LLaMA and other models. No derivation chain is presented that reduces claimed outcomes (e.g., outlier reduction or mask elimination) to tautological redefinitions or fitted parameters renamed as independent predictions. Design choices are motivated by observed weight statistics but are not asserted as first-principles results forced by prior self-citations or internal equations; external benchmark results provide independent falsifiability. This is the expected outcome for an applied ML compression paper without mathematical derivation claims.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of the binary codebook clustering and the learnable transformation; both are introduced as new components whose parameters are optimized on the target model weights.

free parameters (2)
  • codebook size and distance metric
    Chosen to cluster recurring binary vectors; the specific metric and number of entries are design choices fitted to observed weight statistics.
  • learnable transformation parameters
    Learned jointly to reduce outliers and align sign patterns; these are optimized during the quantization process rather than derived from first principles.
axioms (2)
  • domain assumption: Binary weights can be clustered into a compact codebook using custom distance metrics without destroying downstream task performance.
    Invoked when the authors state that the Binary Codebook clusters recurring vectors into compact indices.
  • domain assumption: A learnable transformation exists that simultaneously reduces outliers and promotes shared sign patterns among binary weights.
    Central premise stated in the description of the second innovation.
invented entities (2)
  • Binary Codebook (no independent evidence)
    purpose: Compact index-based representation of recurring binary weight vectors
    New data structure introduced to replace sparse masks.
  • Learnable Transformation (no independent evidence)
    purpose: Weight adjustment that reduces outliers and aligns sign patterns
    New module whose parameters are optimized to enable the codebook approach.



Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 10 internal anchors

  1. [1]

    Quarot: Outlier-free 4-bit inference in rotated llms

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024. 3, 4

  2. [2]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 1

  3. [3]

    Figurative language in recognizing textual entailment

    Tuhin Chakrabarty, Debanjan Ghosh, Adam Poliak, and Smaranda Muresan. Figurative language in recognizing textual entailment. arXiv preprint arXiv:2106.01195, 2021. 7

  4. [4]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, ACL, pages 2924–2936, 2019. 7

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 7

  6. [6]

    Stbllm: Breaking the 1-bit barrier with structured binary llms

    Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. Stbllm: Breaking the 1-bit barrier with structured binary llms. In ICLR, 2025. 2, 3, 5, 7

  7. [7]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6

  8. [8]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. 3

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

  10. [10]

    Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting

    Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. arXiv preprint arXiv:2501.13987,

  11. [11]

    Billm: Pushing the limit of post-training quantization for llms

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. ICML,

  12. [12]

    Billm: Pushing the limit of post-training quantization for llms

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024. 3, 4, 7

  13. [13]

    Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams

    Lei Jiang, Minje Kim, Wujie Wen, and Danghui Wang. Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams. In 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 1–6. IEEE,

  14. [14]

    Arb-llm: Alternating refined binarizations for large language models

    Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, Xiaokang Yang, et al. Arb-llm: Alternating refined binarizations for large language models. In ICLR, 2025. 2, 3, 4, 7

  15. [15]

    Duquant: Distributing outliers via dual transformation makes stronger quantized llms

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. In NeurIPS, 2024. 1

  16. [16]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024. 3

  17. [17]

    Vptq: Extreme low-bit vector post-training quantization for large language models

    Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. arXiv preprint arXiv:2409.17066, 2024. 3, 5, 7

  18. [18]

    Llm-qat: Data-free quantization aware training for large language models

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. In ACL, 2024. 3

  19. [19]

    Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm

    Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, 2018. 2, 3

  20. [20]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024. 4

  21. [21]

    Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation

    Liqun Ma, Mingjie Sun, and Zhiqiang Shen. Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation. arXiv preprint arXiv:2407.07093, 2024. 8

  22. [22]

    Affinequant: Affine transformation quantization for large language models

    Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024. 1

  23. [23]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR, 2017. 7

  24. [24]

    Bitblas: A high-performance BLAS library for quantized matrix multiplication

    Microsoft. Bitblas: A high-performance BLAS library for quantized matrix multiplication. https://github.com/microsoft/BitBLAS, 2023. Accessed: 2024-03-01. 9

  25. [25]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018. 7

  26. [26]

    Hello GPT-4o, 2024

    OpenAI. Hello GPT-4o, 2024. 1

  27. [27]

    Xnor-popcount, an alternative solution to the accumulation multiplication method for approximate computations, to improve latency and power efficiency

    Van-Khoa Pham, Lai Le, Thanh-Kieu Tran Thi, et al. Xnor-popcount, an alternative solution to the accumulation multiplication method for approximate computations, to improve latency and power efficiency. Journal of Technical Education Science, 20(01):12–20, 2025. 6

  28. [28]

    XNOR-popcount-GEMM-PyTorch-CPU-CUDA: A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear PyTorch extension

    Tairen Piao. XNOR-popcount-GEMM-PyTorch-CPU-CUDA: A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear PyTorch extension. https://github.com/tairenpiao/XNOR-popcount-GEMM-PyTorch-CPU-CUDA, 2022. Accessed: 2025-05-15. 6

  29. [29]

    Xnor-net: Imagenet classification using binary convolutional neural networks

    Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 3

  30. [30]

    Xnor-net: Imagenet classification using binary convolutional neural networks

    Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 2

  31. [31]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020. 7

  32. [32]

    Omniquant: Omnidirectionally calibrated quantization for large language models

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. In ICLR2024 Spotlight, 2023. 1, 4

  33. [33]

    Flatquant: Flatness matters for llm quantization

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426, 2024. 4

  34. [34]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 6

  35. [35]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 6

  36. [36]

    Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024. 3, 7

  37. [37]

    Gptvq: The blessing of dimensionality for llm quantization

    Mart Van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality for llm quantization. arXiv preprint arXiv:2402.15319, 2024. 3, 5, 7

  38. [38]

    Claq: Pushing the limits of low-bit post-training quantization for llms

    Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, and Yanmin Qian. Claq: Pushing the limits of low-bit post-training quantization for llms. arXiv preprint arXiv:2405.17233, 2024. 1

  39. [39]

    BitNet: Scaling 1-bit Transformers for Large Language Models

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023. 2, 3, 8

  40. [40]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. 1

  41. [41]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022. 1

  42. [42]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023. 3, 4

  43. [43]

    Qwen2.5 technical report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. 9

  44. [44]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019. 7

  45. [45]

    Early termination: For cases where the number of unique vectors is less than or equal to the codebook size, we achieve perfect reconstruction with exact vector matching in a single iteration

  46. [46]

    Efficient centroid updates: Unlike traditional k-means requiring reconstruction for each update, our method directly computes means and applies the sign function to maintain binary constraints

  47. [47]

    Vectorized operations: We leverage PyTorch's efficient tensor operations like scatter_add_ and bincount to accelerate cluster assignment and centroid updates

  48. [48]

    Binary-specific distance metric: Distance calculations between binary vectors utilize squared Euclidean distance, which is more efficient than computing full reconstruction error.
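
A short worked identity helps explain why squared Euclidean distance is a natural choice in item 48. For ±1 vectors of length v, each coordinate where the two vectors disagree contributes (±2)² = 4 and every other coordinate contributes 0, so the squared distance is a fixed multiple of the Hamming distance d_H and a decreasing function of the inner product. The notation below is introduced for illustration and does not come from the paper.

```latex
\[
\lVert a - b \rVert_2^2
  = \sum_{i=1}^{v} (a_i - b_i)^2
  = 4\, d_H(a, b)
  = 2v - 2\, a^{\top} b ,
  \qquad a, b \in \{-1, +1\}^{v}.
\]
```

Under this identity, the nearest codebook entry by squared Euclidean distance is also the nearest by Hamming distance and the one with the largest inner product, so any of the three can be used for cluster assignment.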