BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

Bei Liu; Hao Gu; Hao Wang; Jiacheng Liu; Lei Wang; Lujun Li; Qiyuan Zhu; Sirui Han; Yike Guo; Zheyu Wang

arxiv: 2506.12040 · v2 · submitted 2025-05-24 · 💻 cs.LG · cs.AI· cs.CV

BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

Hao Gu , Lujun Li , Hao Wang , Lei Wang , Zheyu Wang , Bei Liu , Jiacheng Liu , Qiyuan Zhu

show 2 more authors

Sirui Han Yike Guo

This is my paper

Pith reviewed 2026-05-19 13:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords LLM quantizationsub-1-bit compressionbinary codebooklearnable transformationmodel compressionefficient inferenceweight clusteringextreme quantization

0 comments

The pith

Learnable transformation plus binary codebook lets LLMs run at 0.8 bits with 3.1 percent zero-shot accuracy loss and 1.6x speedup over FP16.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BTC-LLM as a sub-1-bit quantization method for large language models that avoids the drawbacks of earlier sparsity-based binarization approaches. It introduces a learnable transformation to shrink outliers and align sign patterns across weights, paired with a binary codebook that groups similar weight vectors into short indices. These steps remove the need for sparse masks, so the quantized model runs on ordinary hardware. Tests on LLaMA, Qwen and other families show compression down to 0.7-1.11 bits while keeping most task performance. At 0.8 bits on LLaMA-2-13B the accuracy drop stays near 3 percent and inference speeds up by 1.6 times compared with full-precision weights.

Core claim

The authors show that a learnable linear transformation followed by binary pattern clustering can compress LLM weights below one bit per parameter by replacing repeated vectors with compact codebook indices and by removing the requirement for explicit sparsity masks, thereby delivering both memory reduction and standard-hardware compatibility while limiting accuracy loss on zero-shot benchmarks to a few percent.

What carries the argument

Binary Codebook that clusters recurring weight vectors into compact indices using custom distance metrics and sign-based updates; paired with a Learnable Transformation that reduces outliers and promotes shared sign patterns.

If this is right

LLMs become deployable with roughly one-eighth the memory footprint of FP16 while retaining near-original zero-shot accuracy.
Inference no longer requires custom sparse-matrix kernels or mask storage, allowing use on standard GPUs and CPUs.
Models from multiple families (LLaMA, Qwen, FBI-LLM) reach compression ratios between 0.7 and 1.11 bits with consistent speed gains.
The 1.6x wall-clock improvement over FP16 at 0.8 bits scales directly with reduced data movement.
Elimination of per-weight masks removes a source of runtime overhead that previously limited extreme binarization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transformation-plus-codebook pattern may extend to other neural-network families beyond transformers if the sign-pattern clustering generalizes.
Further bit reduction below 0.7 bits could be tested by increasing codebook size or adding a second transformation stage.
Energy cost per token should drop proportionally with memory bandwidth, which matters for battery-powered or edge devices.
The approach might combine with post-training methods such as knowledge distillation to recover any remaining accuracy gap.

Load-bearing premise

The learnable transformation reliably reduces outliers and creates shared sign patterns across weights without introducing hidden failure modes that standard zero-shot tests would miss.

What would settle it

An experiment that measures accuracy collapse or loss of speedup on a held-out task suite or on a different hardware platform after the same transformation and codebook training.

Figures

Figures reproduced from arXiv: 2506.12040 by Bei Liu, Hao Gu, Hao Wang, Jiacheng Liu, Lei Wang, Lujun Li, Qiyuan Zhu, Sirui Han, Yike Guo, Zheyu Wang.

**Figure 2.** Figure 2: Activation distributions for the self_attn.k_proj layer in the LLaMA-2-7B model: (a) Original FP16 (max abs: 8), (b) BiLLM (max abs: 15), (c) ARB-LLM (max abs: 10), and (d) our proposed BTC-LLM (max abs: 0.4). Binary quantization [30] represents the most aggressive quantization approach, converting floatingpoint weights to binary values (±1) to reduce memory requirements by over 32× [19]. For instance, Bi… view at source ↗

**Figure 3.** Figure 3: Perplexity of LLaMA-2-7B on WikiText2. Our BTC-LLM outperforms 2-bit methods at 0.9-bit. Our comprehensive evaluations of the LLaMA family of models (7B to 65B parameters) demonstrate the superior performance of BTC-LLM in multiple bit width settings, as shown in [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Overall architecture of BTC-LLM. (a) Sub-bit pipeline: the ARB quantizer transforms full-precision weights into binary form with associated scale and bias, followed by binary codebook representation and index assignment. (b) Structure of transformed attention (b1) and FFN (b2) blocks. The diagonal scaling and orthogonal transformation are merged into the weights to ensure computational equivalence and effi… view at source ↗

**Figure 5.** Figure 5: Trade-offs of runtime, throughput, memory usage, and accuracy under sub-1-bit quantization [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Extension of BTC-LLM for activation and kv cache quantization [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Visualizations comparing of the weight relative quantize error of LLaMA-2-7B with [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Visualizations comparing of the weight relative quantize error of LLaMA-2-7B with [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Visualizations of the activation distribution of different layers in LLaMA-2-7B before and [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Visualizations of the activation distribution of different layers in LLaMA-2-7B before and [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

read the original abstract

Binary quantization represents the most extreme form of compression, reducing weights to +/-1 for maximal memory and computational efficiency. While recent sparsity-aware binarization achieves sub-1-bit compression via weight pruning, it faces critical challenges: performance degradation, mask-management overhead, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages binary pattern clustering and weight transformation to overcome these limitations. Our approach incorporates two key innovations: (1) a Binary Codebook that clusters recurring vectors into compact indices using custom distance metrics and sign-based updates; (2) a Learnable Transformation that reduces outliers and promotes shared sign patterns among binary weights. This eliminates sparse masks, enabling efficient inference on standard hardware. Extensive evaluations across LLaMA, Qwen, and FBI-LLM families demonstrate that BTC-LLM achieves state-of-the-art results in extreme compression (1.11-0.7 bits). Notably, BTC-LLM compressed to 0.8 bits on LLaMA-2-13B maintains high performance, with only a 3.1 percent accuracy drop in zero-shot benchmarks, while delivering a 1.6x speedup over FP16.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BTC-LLM pairs a custom binary codebook with a learnable transformation to reach dense 0.8-bit quantization on 13B models with a 3% accuracy drop, but the transformation's claimed benefits rest on indirect accuracy numbers rather than direct measurements.

read the letter

The main thing here is that BTC-LLM reaches 0.8 bits on LLaMA-2-13B with a 3.1% zero-shot drop and 1.6x speedup over FP16, while avoiding sparse masks so it can run on standard hardware. The method clusters weight vectors into a binary codebook using custom distances and sign-based updates, then applies a learnable transformation to cut outliers and encourage uniform sign patterns across weights. This is the concrete step beyond prior sparsity-aware binarization that still needed masks and extra overhead. The paper reports results on LLaMA, Qwen, and FBI-LLM families down to 0.7 bits, which gives a reasonable sense of how the approach scales across model sizes. Those numbers are the strongest part of the work so far. The experiments appear to use standard zero-shot benchmarks, and the claim of state-of-the-art performance in the sub-1-bit range is at least plausible given the setup. The soft spots sit in the validation of the learnable transformation. The abstract states that it eliminates sparse masks by promoting shared sign patterns, yet the reported results give only final accuracy and speedup. There are no pre/post numbers on outlier norms, sign-pattern entropy, or actual kernel latency measurements without custom code. That leaves open whether the transformation is doing the work or whether the codebook is simply fitting the observed distributions well. The codebook size and distance metric are also free choices that could be tuned after seeing the data, which adds some circularity risk even if the overall framework is not purely post-hoc. This paper is for researchers focused on extreme quantization and edge deployment of LLMs. A reader who needs practical sub-1-bit options without mask management would get usable ideas from the codebook-plus-transformation pairing. It deserves a serious referee to examine the full ablations, implementation details, and any hardware timing data that may be in the supplement.

Referee Report

2 major / 1 minor

Summary. The manuscript presents BTC-LLM, a sub-1-bit LLM quantization framework that uses a Binary Codebook to cluster recurring weight vectors via custom distance metrics and sign-based updates, combined with a Learnable Transformation to reduce outliers and promote shared sign patterns. This design is claimed to eliminate sparse masks and enable efficient inference on standard hardware. Evaluations across LLaMA, Qwen, and FBI-LLM families report state-of-the-art results in the 0.7–1.11 bit range, with the specific result that LLaMA-2-13B at 0.8 bits incurs only a 3.1% drop in zero-shot accuracy while achieving a 1.6x speedup over FP16.

Significance. If the central claims are substantiated, the work would advance extreme quantization by addressing mask overhead and hardware compatibility, offering practical value for resource-constrained LLM deployment. The multi-family evaluation and concrete speedup number are positive features. However, the reliance on learned transformation parameters and codebook choices fitted to observed weight distributions creates moderate risk that the reported accuracy and speedup are not fully independent of those modeling decisions.

major comments (2)

[Abstract] Abstract: The claim that the Learnable Transformation 'eliminates sparse masks' for standard-hardware inference is load-bearing for both the 0.8-bit accuracy result and the 1.6x speedup, yet the manuscript provides no direct quantitative validation such as pre/post-transformation outlier norms, sign-pattern entropy, or measured kernel latency without custom masks; zero-shot accuracy alone does not rule out new failure modes introduced by the binary codebook clustering step.
[Abstract] Abstract: The 3.1% accuracy drop on LLaMA-2-13B at 0.8 bits and the reported speedup rest on the assumption that codebook size, distance metric, and learnable transformation parameters are chosen independently of the final benchmark numbers; because these are free parameters explicitly fitted to weight distributions, the evaluation risks circularity that must be addressed with explicit ablation or held-out validation.

minor comments (1)

Consider adding a consolidated table of bit-width, accuracy drop, and speedup across all model families to improve readability of the experimental claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commitments to strengthen the presentation of our results.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the Learnable Transformation 'eliminates sparse masks' for standard-hardware inference is load-bearing for both the 0.8-bit accuracy result and the 1.6x speedup, yet the manuscript provides no direct quantitative validation such as pre/post-transformation outlier norms, sign-pattern entropy, or measured kernel latency without custom masks; zero-shot accuracy alone does not rule out new failure modes introduced by the binary codebook clustering step.

Authors: We agree that direct supporting metrics would make the claim more robust. The reported 1.6x speedup was measured using standard GPU kernels that perform index-based lookups from the binary codebook, which by design requires no sparse masks. In the revised manuscript we will add: (i) pre- and post-transformation outlier-norm statistics, (ii) sign-pattern entropy before and after the transformation, and (iii) a latency breakdown isolating the contribution of the codebook lookup. These additions will also help rule out new failure modes introduced by clustering. revision: yes
Referee: [Abstract] Abstract: The 3.1% accuracy drop on LLaMA-2-13B at 0.8 bits and the reported speedup rest on the assumption that codebook size, distance metric, and learnable transformation parameters are chosen independently of the final benchmark numbers; because these are free parameters explicitly fitted to weight distributions, the evaluation risks circularity that must be addressed with explicit ablation or held-out validation.

Authors: Codebook size and transformation parameters are determined from the observed weight statistics of each model during quantization, which is the standard practice for learned quantization methods. We already report ablations over codebook sizes and transformation strengths in the experimental section. To further address the circularity concern, we will include additional results on held-out model families and benchmark suites in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in method design or claims

full rationale

The paper introduces BTC-LLM as an empirical quantization framework relying on a learnable transformation and binary codebook, with performance validated through zero-shot benchmarks and hardware speedup measurements on LLaMA and other models. No derivation chain is presented that reduces claimed outcomes (e.g., outlier reduction or mask elimination) to tautological redefinitions or fitted parameters renamed as independent predictions. Design choices are motivated by observed weight statistics but are not asserted as first-principles results forced by prior self-citations or internal equations; external benchmark results provide independent falsifiability. This is the expected outcome for an applied ML compression paper without mathematical derivation claims.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of the binary codebook clustering and the learnable transformation; both are introduced as new components whose parameters are optimized on the target model weights.

free parameters (2)

codebook size and distance metric
Chosen to cluster recurring binary vectors; the specific metric and number of entries are design choices fitted to observed weight statistics.
learnable transformation parameters
Learned jointly to reduce outliers and align sign patterns; these are optimized during the quantization process rather than derived from first principles.

axioms (2)

domain assumption Binary weights can be clustered into a compact codebook using custom distance metrics without destroying downstream task performance.
Invoked when the authors state that the Binary Codebook clusters recurring vectors into compact indices.
domain assumption A learnable transformation exists that simultaneously reduces outliers and promotes shared sign patterns among binary weights.
Central premise stated in the description of the second innovation.

invented entities (2)

Binary Codebook no independent evidence
purpose: Compact index-based representation of recurring binary weight vectors
New data structure introduced to replace sparse masks.
Learnable Transformation no independent evidence
purpose: Weight adjustment that reduces outliers and aligns sign patterns
New module whose parameters are optimized to enable the codebook approach.

pith-pipeline@v0.9.0 · 5783 in / 1641 out tokens · 48672 ms · 2026-05-19T13:57:07.085875+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 10 internal anchors

[1]

Quarot: Outlier-free 4-bit inference in rotated llms

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems , 37:100213– 100240, 2024. 3, 4

work page 2024
[2]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Figurative language in recognizing textual entailment

Tuhin Chakrabarty, Debanjan Ghosh, Adam Poliak, and Smaranda Muresan. Figurative language in recognizing textual entailment. arXiv preprint arXiv:2106.01195, 2021. 7

work page arXiv 2021
[4]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, ACL, pages 2924–2936, 2019. 7

work page 2019
[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Stbllm: Breaking the 1-bit barrier with structured binary llms

Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. Stbllm: Breaking the 1-bit barrier with structured binary llms. In ICLR, 2025. 2, 3, 5, 7

work page 2025
[7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

arXiv preprint arXiv:2501.13987

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. arXiv preprint arXiv:2501.13987 ,

work page arXiv
[11]

Billm: Pushing the limit of post-training quantization for llms

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. ICML,

work page
[12]

Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291,

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024. 3, 4, 7

work page arXiv 2024
[13]

Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams

Lei Jiang, Minje Kim, Wujie Wen, and Danghui Wang. Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams. In 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) , pages 1–6. IEEE,

work page 2017
[14]

Arb-llm: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, Xiaokang Yang, et al. Arb-llm: Alternating refined binarizations for large language models. In ICLR, 2025. 2, 3, 4, 7

work page 2025
[15]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. In NeurIPS, 2024. 1 10

work page 2024
[16]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024. 3

work page 2024
[17]

2409.17066 , archivePrefix=

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. arXiv preprint arXiv:2409.17066, 2024. 3, 5, 7

work page arXiv 2024
[18]

Llm-qat: Data-free quantiza- tion aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantiza- tion aware training for large language models. In ACL, 2024. 3

work page 2024
[19]

Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm

Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, 2018. 2, 3

work page 2018
[20]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation

Liqun Ma, Mingjie Sun, and Zhiqiang Shen. Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation. arXiv preprint arXiv:2407.07093, 2024. 8

work page arXiv 2024
[22]

Affinequant: Affine transformation quantization for large language models.arXiv preprint arXiv:2403.12544, 2024

Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024. 1

work page arXiv 2024
[23]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR, 2017. 7

work page 2017
[24]

Bitblas: A high-performance BLAS library for quantized matrix multiplication

Microsoft. Bitblas: A high-performance BLAS library for quantized matrix multiplication. https://github.com/microsoft/BitBLAS, 2023. Accessed: 2024-03-01. 9

work page 2023
[25]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018. 7

work page 2018
[26]

Hello GPT-4o, 2024

OpenAI. Hello GPT-4o, 2024. 1

work page 2024
[27]

Xnor-popcount, an alternative solution to the accumulation multiplication method for approximate computations, to improve latency and power efficiency

Van-Khoa Pham, Lai Le, Thanh-Kieu Tran Thi, et al. Xnor-popcount, an alternative solution to the accumulation multiplication method for approximate computations, to improve latency and power efficiency. Journal of Technical Education Science, 20(01):12–20, 2025. 6

work page 2025
[28]

XNOR-popcount-GEMM-PyTorch-CPU-CUDA: A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear PyTorch extension

Tairen Piao. XNOR-popcount-GEMM-PyTorch-CPU-CUDA: A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear PyTorch extension. https://github.com/ tairenpiao/XNOR-popcount-GEMM-PyTorch-CPU-CUDA , 2022. Accessed: 2025-05-15. 6

work page 2022
[29]

Xnor-net: Imagenet classification using binary convolutional neural networks

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 3

work page 2016
[30]

Xnor-net: Imagenet classification using binary convolutional neural networks

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Joseph. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 2

work page 2016
[31]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020. 7

work page 2020
[32]

Omniquant: Omnidirectionally calibrated quantiza- tion for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantiza- tion for large language models. In ICLR2024 Spotlight, 2023. 1, 4

work page 2023
[33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426, 2024. 4 11

work page arXiv 2024
[34]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han

Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024. 3, 7

work page arXiv 2024
[37]

Gptvq: The blessing of dimensionality for llm quantization

Mart Van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality for llm quantization. arXiv preprint arXiv:2402.15319, 2024. 3, 5, 7

work page arXiv 2024
[38]

Claq: Pushing the limits of low-bit post-training quantization for llms

Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, and Yanmin Qian. Claq: Pushing the limits of low-bit post-training quantization for llms. arXiv preprint arXiv:2405.17233, 2024. 1

work page arXiv 2024
[39]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023. 2, 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. 1

work page 2022
[41]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022. 1

work page 2022
[42]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning , pages 38087–38099. PMLR, 2023. 3, 4

work page 2023
[43]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Hellaswag: Can a machine really finish your sentence? In ACL, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019. 7 12 Appendix In the appendix, we include further discussions on the broader implications of our work, additional experimental results, implementation details, and pseudocode to facilitate reproducibility. A Extended ...

work page 2019
[45]

Early termination: For cases where the number of unique vectors is less than or equal to the codebook size, we achieve perfect reconstruction with exact vector matching in a single iteration

work page
[46]

Efficient centroid updates: Unlike traditional k-means requiring reconstruction for each update, our method directly computes means and applies the sign function to maintain binary constraints

work page
[47]

Vectorized operations : We leverage PyTorch’s efficient tensor operations like scatter_add_ and bincount to accelerate cluster assignment and centroid updates

work page
[48]

Binary-specific distance metric : Distance calculations between binary vectors utilize squared Euclidean distance, which is more efficient than computing full reconstruction error. C.4 Complete Binary Transformation and Compression Our complete binary transformation and compression (BTC) approach combines learned transforma- tions with binary codebook com...

work page

[1] [1]

Quarot: Outlier-free 4-bit inference in rotated llms

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems , 37:100213– 100240, 2024. 3, 4

work page 2024

[2] [2]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Figurative language in recognizing textual entailment

Tuhin Chakrabarty, Debanjan Ghosh, Adam Poliak, and Smaranda Muresan. Figurative language in recognizing textual entailment. arXiv preprint arXiv:2106.01195, 2021. 7

work page arXiv 2021

[4] [4]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, ACL, pages 2924–2936, 2019. 7

work page 2019

[5] [5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Stbllm: Breaking the 1-bit barrier with structured binary llms

Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. Stbllm: Breaking the 1-bit barrier with structured binary llms. In ICLR, 2025. 2, 3, 5, 7

work page 2025

[7] [7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

arXiv preprint arXiv:2501.13987

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. arXiv preprint arXiv:2501.13987 ,

work page arXiv

[11] [11]

Billm: Pushing the limit of post-training quantization for llms

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. ICML,

work page

[12] [12]

Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291,

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024. 3, 4, 7

work page arXiv 2024

[13] [13]

Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams

Lei Jiang, Minje Kim, Wujie Wen, and Danghui Wang. Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams. In 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) , pages 1–6. IEEE,

work page 2017

[14] [14]

Arb-llm: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, Xiaokang Yang, et al. Arb-llm: Alternating refined binarizations for large language models. In ICLR, 2025. 2, 3, 4, 7

work page 2025

[15] [15]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. In NeurIPS, 2024. 1 10

work page 2024

[16] [16]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024. 3

work page 2024

[17] [17]

2409.17066 , archivePrefix=

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. arXiv preprint arXiv:2409.17066, 2024. 3, 5, 7

work page arXiv 2024

[18] [18]

Llm-qat: Data-free quantiza- tion aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantiza- tion aware training for large language models. In ACL, 2024. 3

work page 2024

[19] [19]

Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm

Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, 2018. 2, 3

work page 2018

[20] [20]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation

Liqun Ma, Mingjie Sun, and Zhiqiang Shen. Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation. arXiv preprint arXiv:2407.07093, 2024. 8

work page arXiv 2024

[22] [22]

Affinequant: Affine transformation quantization for large language models.arXiv preprint arXiv:2403.12544, 2024

Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024. 1

work page arXiv 2024

[23] [23]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR, 2017. 7

work page 2017

[24] [24]

Bitblas: A high-performance BLAS library for quantized matrix multiplication

Microsoft. Bitblas: A high-performance BLAS library for quantized matrix multiplication. https://github.com/microsoft/BitBLAS, 2023. Accessed: 2024-03-01. 9

work page 2023

[25] [25]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018. 7

work page 2018

[26] [26]

Hello GPT-4o, 2024

OpenAI. Hello GPT-4o, 2024. 1

work page 2024

[27] [27]

Xnor-popcount, an alternative solution to the accumulation multiplication method for approximate computations, to improve latency and power efficiency

Van-Khoa Pham, Lai Le, Thanh-Kieu Tran Thi, et al. Xnor-popcount, an alternative solution to the accumulation multiplication method for approximate computations, to improve latency and power efficiency. Journal of Technical Education Science, 20(01):12–20, 2025. 6

work page 2025

[28] [28]

XNOR-popcount-GEMM-PyTorch-CPU-CUDA: A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear PyTorch extension

Tairen Piao. XNOR-popcount-GEMM-PyTorch-CPU-CUDA: A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear PyTorch extension. https://github.com/ tairenpiao/XNOR-popcount-GEMM-PyTorch-CPU-CUDA , 2022. Accessed: 2025-05-15. 6

work page 2022

[29] [29]

Xnor-net: Imagenet classification using binary convolutional neural networks

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 3

work page 2016

[30] [30]

Xnor-net: Imagenet classification using binary convolutional neural networks

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Joseph. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016. 2

work page 2016

[31] [31]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020. 7

work page 2020

[32] [32]

Omniquant: Omnidirectionally calibrated quantiza- tion for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantiza- tion for large language models. In ICLR2024 Spotlight, 2023. 1, 4

work page 2023

[33] [33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quantization. arXiv preprint arXiv:2410.09426, 2024. 4 11

work page arXiv 2024

[34] [34]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han

Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024. 3, 7

work page arXiv 2024

[37] [37]

Gptvq: The blessing of dimensionality for llm quantization

Mart Van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality for llm quantization. arXiv preprint arXiv:2402.15319, 2024. 3, 5, 7

work page arXiv 2024

[38] [38]

Claq: Pushing the limits of low-bit post-training quantization for llms

Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, and Yanmin Qian. Claq: Pushing the limits of low-bit post-training quantization for llms. arXiv preprint arXiv:2405.17233, 2024. 1

work page arXiv 2024

[39] [39]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023. 2, 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. 1

work page 2022

[41] [41]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022. 1

work page 2022

[42] [42]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning , pages 38087–38099. PMLR, 2023. 3, 4

work page 2023

[43] [43]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

Hellaswag: Can a machine really finish your sentence? In ACL, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019. 7 12 Appendix In the appendix, we include further discussions on the broader implications of our work, additional experimental results, implementation details, and pseudocode to facilitate reproducibility. A Extended ...

work page 2019

[45] [45]

Early termination: For cases where the number of unique vectors is less than or equal to the codebook size, we achieve perfect reconstruction with exact vector matching in a single iteration

work page

[46] [46]

Efficient centroid updates: Unlike traditional k-means requiring reconstruction for each update, our method directly computes means and applies the sign function to maintain binary constraints

work page

[47] [47]

Vectorized operations : We leverage PyTorch’s efficient tensor operations like scatter_add_ and bincount to accelerate cluster assignment and centroid updates

work page

[48] [48]

Binary-specific distance metric : Distance calculations between binary vectors utilize squared Euclidean distance, which is more efficient than computing full reconstruction error. C.4 Complete Binary Transformation and Compression Our complete binary transformation and compression (BTC) approach combines learned transforma- tions with binary codebook com...

work page