Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Cuong Nguyen; Cuong Pham; Dung Anh Hoang; Jianfei Cai; Thanh-Toan Do; Trung Le

arxiv: 2512.21651 · v3 · pith:F4MGJBXUnew · submitted 2025-12-25 · 💻 cs.LG

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Dung Anh Hoang , Cuong Pham , Cuong Nguyen , Trung le , Jianfei Cai , Thanh-Toan Do This is my paper

Pith reviewed 2026-05-21 16:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords post-training quantization1-bit quantizationlarge language modelsoutput alignmenterror accumulationanisotropic distortionmodel compressionLLM deployment

0 comments

The pith

Output alignment for 1-bit LLM quantization succeeds only after correcting layer error accumulation and anisotropic representation distortion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

For 1-bit post-training quantization of large language models, simply minimizing output discrepancies on calibration data often underperforms even weight-matching approaches. The paper identifies error accumulation through successive layers and anisotropic distortion of the activation space as the root causes. It introduces a method that explicitly compensates for these effects during alignment. A reader would care because successful 1-bit models could slash memory and compute needs for deployment on edge devices while keeping most of the original capability.

Core claim

The discovery is that the failure of naive output-driven 1-bit PTQ arises from two fundamental issues—error accumulation across layers and anisotropic distortion of the representation space—and that a novel method addressing both while staying computationally efficient consistently outperforms prior 1-bit PTQ techniques.

What carries the argument

An output-alignment procedure augmented with explicit fixes for inter-layer error propagation and restoration of isotropic properties in the distorted feature space.

If this is right

Quantized models preserve output behavior more faithfully on tasks beyond the calibration data.
Computational efficiency is maintained since no retraining is required.
The approach highlights the need to consider representation geometry in quantization design.
Experiments show consistent gains over existing methods in the 1-bit regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar distortion issues may appear in other extreme compression techniques like pruning or low-rank adaptation.
Practitioners could adapt the corrections to multi-bit or mixed-precision settings for further gains.
The method suggests exploring calibration-free alternatives if the distortions can be modeled analytically.

Load-bearing premise

The calibration dataset sufficiently represents the data distributions encountered during actual use so that the derived corrections generalize without creating new performance issues.

What would settle it

Run the proposed quantized model and a naive output-aligned version on a new task whose inputs differ substantially in distribution from the calibration set; a reversal or disappearance of the reported accuracy advantage would falsify the generalization of the fixes.

Figures

Figures reproduced from arXiv: 2512.21651 by Cuong Nguyen, Cuong Pham, Dung Anh Hoang, Jianfei Cai, Thanh-Toan Do, Trung Le.

**Figure 2.** Figure 2: Accumulated quantization error in LLaMA-2-7B under ARB-X. The top plot reports [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Block-wise MSE reconstruction error between quantized and full-precision attention score [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

read the original abstract

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration, enabling low-cost deployment. Recent advances for post-training quantization have demonstrated that even near 4-bit methods can maintain most of the original model performance. However, 1-bit quantization remains particularly challenging. A common strategy in 1-bit quantization is to determine binary weights by matching full-precision parameters, following a weight-driven criterion. However, this objective is not directly aligned with the quantized model's objective, which is to preserve the model's output behavior under the impact of quantization. A natural alternative is to adopt output-driven criteria that minimize discrepancies in model outputs using calibration data. Surprisingly, naive output-driven approaches often perform even worse in the 1-bit regime. In this paper, we show that this failure arises from two fundamental issues: error accumulation across layers and, more critically, \emph{anisotropic distortion} of the representation space. Based on these insights, we propose a novel PTQ method for 1-bit LLMs that explicitly addresses these issues while maintaining computational efficiency. Extensive experiments demonstrate that our approach consistently outperforms existing 1-bit PTQ methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags error accumulation and anisotropic distortion as reasons naive output alignment fails in 1-bit PTQ, but the abstract gives almost no mechanics or numbers to judge the fix.

read the letter

The main point is that the authors trace the poor results of output-driven 1-bit quantization to two problems: errors piling up layer by layer and an uneven stretching of the representation space they call anisotropic distortion. They claim a new method corrects both without losing efficiency and beats prior 1-bit approaches on experiments. That diagnosis lines up with the observation that weight-matching objectives miss the real goal of keeping model outputs stable. Spotting the anisotropic issue as a distinct failure mode looks like a step beyond the usual 1-bit PTQ papers that focus only on weight or simple output matching. The shift to output alignment itself is a reasonable reframing even if it is not entirely new. The paper earns credit for laying out why naive output calibration can actually hurt performance in the 1-bit regime. The soft spots sit in the missing details. The abstract states the fixes but shows no equations, no description of the correction terms, and no ablation tables that would let a reader see how much each issue contributes or how the method removes it. Without those, it is hard to tell whether the reported gains come from the proposed changes or from careful tuning on the calibration set. The stress-test concern about calibration data missing the distortions that matter for downstream tasks is worth checking; short calibration sequences often sample only a narrow slice of the space, and any method that learns corrections from them risks overfitting in ways that do not show up until held-out evaluation. This work is for people already following extreme quantization for LLMs and edge deployment. A reader who tracks PTQ papers could pick up the problem framing and the experimental headline claims, but would need the full methods section to decide whether the approach is ready to build on. I would send it to peer review so the authors can supply the derivations, controls, and numbers that are absent from the abstract.

Referee Report

3 major / 2 minor

Summary. The paper argues that naive output-driven criteria for 1-bit post-training quantization of LLMs fail due to error accumulation across layers and, more critically, anisotropic distortion of the representation space. It proposes a new PTQ method that explicitly corrects for these two issues while preserving computational efficiency, and reports consistent outperformance over prior 1-bit PTQ baselines in extensive experiments.

Significance. If the corrections are shown to generalize beyond the calibration set and the reported gains are not artifacts of calibration-data overfitting, the work could meaningfully improve practical 1-bit quantization for resource-constrained LLM deployment. The explicit diagnosis of anisotropic distortion supplies a concrete direction for future output-alignment research.

major comments (3)

[§3.2] §3.2 (Anisotropic distortion correction): the manuscript must supply the precise mathematical definition and update rule for the correction term. Without an explicit equation showing how the distortion metric is computed from calibration activations and how it is subtracted or projected, it is impossible to verify that the fix targets the directions that affect downstream task performance rather than merely reducing calibration-set output discrepancy.
[§4.1–4.3] §4.1–4.3 (Experimental protocol): all reported accuracy gains are measured on the same calibration sequences used to derive the output-alignment corrections. The paper should add a held-out calibration-set ablation or cross-domain calibration experiment to demonstrate that the anisotropic correction does not overfit to the narrow subspace sampled by the calibration data, directly addressing the central generalization concern.
[§4.4] §4.4 (Error accumulation analysis): the claim that the new method mitigates layer-wise error accumulation is load-bearing for the overall narrative, yet no layer-wise activation or output-error curves are shown. Adding such diagnostics would allow readers to confirm that the observed gains arise from the proposed mechanism rather than from incidental hyper-parameter tuning.

minor comments (2)

[Abstract] Abstract: the phrase 'extensive experiments' should be accompanied by at least the model sizes and task categories evaluated so that readers can immediately gauge the scope of the claimed improvements.
[§2] Notation in §2: the symbol used for the anisotropic distortion metric should be introduced with a short parenthetical definition on first use to avoid ambiguity with standard layer-wise quantization error terms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These points help improve the clarity of our technical presentation and the robustness of our experimental claims. We respond to each major comment below and indicate the specific revisions planned for the next manuscript version.

read point-by-point responses

Referee: [§3.2] §3.2 (Anisotropic distortion correction): the manuscript must supply the precise mathematical definition and update rule for the correction term. Without an explicit equation showing how the distortion metric is computed from calibration activations and how it is subtracted or projected, it is impossible to verify that the fix targets the directions that affect downstream task performance rather than merely reducing calibration-set output discrepancy.

Authors: We agree that an explicit equation is required for full reproducibility and to allow readers to verify the targeted effect on downstream performance. The current manuscript describes the anisotropic correction in prose within §3.2 but does not present the closed-form definition or update rule. In the revised manuscript we will insert a new Equation (3) that defines the distortion metric as the sum of squared deviations along the top principal components of the calibration activation covariance and states the correction as a projection subtraction applied to the quantized activations. This formulation directly addresses the referee’s concern by making the mechanism verifiable. revision: yes
Referee: [§4.1–4.3] §4.1–4.3 (Experimental protocol): all reported accuracy gains are measured on the same calibration sequences used to derive the output-alignment corrections. The paper should add a held-out calibration-set ablation or cross-domain calibration experiment to demonstrate that the anisotropic correction does not overfit to the narrow subspace sampled by the calibration data, directly addressing the central generalization concern.

Authors: The referee correctly identifies a potential generalization gap. Our current protocol follows the standard PTQ practice of using the same small calibration set for both correction derivation and reporting, which is consistent with prior 1-bit PTQ literature. To directly address the overfitting concern we will add two new experiments in the revised §4: (1) a held-out calibration ablation that splits the original calibration sequences into disjoint derivation and evaluation subsets, and (2) a cross-domain calibration study that derives corrections on general-text data and evaluates on code and mathematical reasoning benchmarks. These additions will demonstrate that the reported gains are not artifacts of calibration-set overfitting. revision: yes
Referee: [§4.4] §4.4 (Error accumulation analysis): the claim that the new method mitigates layer-wise error accumulation is load-bearing for the overall narrative, yet no layer-wise activation or output-error curves are shown. Adding such diagnostics would allow readers to confirm that the observed gains arise from the proposed mechanism rather than from incidental hyper-parameter tuning.

Authors: We accept that layer-wise diagnostics are necessary to substantiate the error-accumulation narrative. Although the manuscript discusses the phenomenon conceptually, it does not include the corresponding plots. In the revised manuscript we will add a new figure in §4.4 that plots per-layer output error (and optionally activation error) for our method against the strongest baselines. The figure will show that error growth is visibly attenuated across depth, supporting that the gains arise from the proposed corrections rather than hyper-parameter effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent corrections validated on benchmarks

full rationale

The paper identifies two issues (error accumulation and anisotropic distortion) as root causes for naive output-driven PTQ failure, then proposes explicit fixes while using standard calibration data for the quantization process. No equations or steps reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The central claims rest on experimental outperformance against prior 1-bit methods on held-out tasks, with the calibration set serving its conventional role rather than creating a tautological fit. This keeps the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the assumption that a small calibration set can diagnose and correct representation-space distortions that affect downstream tasks. No free parameters, axioms, or invented entities are explicitly listed in the abstract.

axioms (1)

domain assumption A small calibration dataset is sufficient to measure and correct anisotropic distortion and error accumulation that generalize to the full evaluation distribution.
Invoked when the method uses calibration data to align outputs; if false, the corrections would not transfer.

pith-pipeline@v0.9.0 · 5814 in / 1401 out tokens · 32204 ms · 2026-05-21T16:38:02.248999+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we modify the optimization objective... L(X,l)=||XW−bXcW||_F = Tr[(XW−bXcW)(XW−bXcW)^T]
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Attention Matrix Preservation (AMP) ... max L_AMP = ||(bXcW cW^T bX^T) ⊙ (XW W^T X^T)||

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 11 internal anchors

[1]

Language Models are Few-Shot Learners

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical com- monsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34(05), pp. 7432–7439, 2020a. Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v34i05.6239 2005
[2]

Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qi fei Liu, Yi-Ting Guo, and Xiaowen Chu

URLhttps://api.semanticscholar.org/ CorpusID:52967399. Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qi fei Liu, Yi-Ting Guo, and Xiaowen Chu. Stbllm: Breaking the 1-bit barrier with struc- tured binary llms.ArXiv, abs/2408.01803,

work page arXiv
[3]

Network sketching: Exploiting bi- nary structure in deep cnns.2017 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pp

Yiwen Guo, Anbang Yao, Hao Zhao, and Yurong Chen. Network sketching: Exploiting bi- nary structure in deep cnns.2017 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pp. 4040–4048,

work page 2017
[4]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

URLhttps://api.semanticscholar.org/ CorpusID:11244259. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Billm: Pushing the limit of post-training quantization for llms

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291,

work page arXiv
[7]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding.ArXiv, abs/2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[8]

Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426,

URLhttps://api. semanticscholar.org/CorpusID:204960716. Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426,

work page arXiv
[9]

Arb-llm: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Arb-llm: Alternating refined bina- rizations for large language models.ArXiv, abs/2410.03129,

work page arXiv
[10]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

URLhttps://api. semanticscholar.org/CorpusID:273163233. 11 Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration.arXiv preprint arXiv:2306.00978,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm

Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, W. Liu, and K. Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm.ArXiv, abs/1808.00278,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Pointer Sentinel Mixture Models

URL https://aclanthology.org/J93-2004/. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[13]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.),Proceedings of the 2018 Conference on Empir- ical Methods in Natural Language Processing, pp. 2381–2391, Brussels, Belgium, Oct...

work page 2018
[14]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Association for Computational Linguistics. doi: 10.18653/v1/D18-1260. URL https://aclanthology.org/D18-1260/. Denis Paperno, Germ ´an Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern ´andez. The lambada dataset: Word prediction requiring a broad discourse context.arXiv preprin...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/d18-1260
[15]

Liu, and Heng Tao Shen

Fumin Shen, Chunhua Shen, W. Liu, and Heng Tao Shen. Supervised discrete hashing.2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 37–45,

work page 2015
[16]

BitNet: Scaling 1-bit Transformers for Large Language Models

12 Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models.ArXiv, abs/2310.11453,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Alternating Multi-bit Quantization for Recurrent Neural Networks

Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. Alternating multi-bit quantization for recurrent neural networks. InInternational Con- ference on Learning Representations, volume abs/1802.00150,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization.ArXiv, abs/1912.08777,

work page arXiv 1912
[19]

URLhttps:// api.semanticscholar.org/CorpusID:209405420. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christo- pher Dewan, Mona Diab, Xian Li, Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transforme...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

The quantization block size is fixed at 128 following ARB (Li et al.,

and BiLLM (Huang et al., 2024), we use the C4 dataset with a sequence length of 2048 as calibration data to enable fair comparison. The quantization block size is fixed at 128 following ARB (Li et al.,

work page 2024
[21]

As observed, output alignment is most effective and consistent when applied to the final layer of each block, for both Llama and OPT models. Inference and Storage Overhead Analysis.Our method introducesno additional inference or storage overhead, as it does not add any new quantization parameters and leaves both the model architecture and forward-pass com...

work page 2024
[22]

Quantization Overhead.We provide in detail the quantization time of our method, compared to ARB-X and ARB-RCLi et al

and4.4–5.1×faster than the full-precision model, hence these perfor- mance gains also apply to our method. Quantization Overhead.We provide in detail the quantization time of our method, compared to ARB-X and ARB-RCLi et al. (2024), across architecture. While our method incurs slightly higher overhead than ARB-RC due to the additional closed-form computat...

work page 2024

[1] [1]

Language Models are Few-Shot Learners

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical com- monsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34(05), pp. 7432–7439, 2020a. Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v34i05.6239 2005

[2] [2]

Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qi fei Liu, Yi-Ting Guo, and Xiaowen Chu

URLhttps://api.semanticscholar.org/ CorpusID:52967399. Peijie Dong, Lujun Li, Dayou Du, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Wenhan Luo, Qi fei Liu, Yi-Ting Guo, and Xiaowen Chu. Stbllm: Breaking the 1-bit barrier with struc- tured binary llms.ArXiv, abs/2408.01803,

work page arXiv

[3] [3]

Network sketching: Exploiting bi- nary structure in deep cnns.2017 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pp

Yiwen Guo, Anbang Yao, Hao Zhao, and Yurong Chen. Network sketching: Exploiting bi- nary structure in deep cnns.2017 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pp. 4040–4048,

work page 2017

[4] [4]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

URLhttps://api.semanticscholar.org/ CorpusID:11244259. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Billm: Pushing the limit of post-training quantization for llms

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291,

work page arXiv

[7] [7]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding.ArXiv, abs/2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[8] [8]

Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426,

URLhttps://api. semanticscholar.org/CorpusID:204960716. Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426,

work page arXiv

[9] [9]

Arb-llm: Alternating refined binarizations for large language models

Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Arb-llm: Alternating refined bina- rizations for large language models.ArXiv, abs/2410.03129,

work page arXiv

[10] [10]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

URLhttps://api. semanticscholar.org/CorpusID:273163233. 11 Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration.arXiv preprint arXiv:2306.00978,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm

Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, W. Liu, and K. Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm.ArXiv, abs/1808.00278,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Pointer Sentinel Mixture Models

URL https://aclanthology.org/J93-2004/. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[13] [13]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.),Proceedings of the 2018 Conference on Empir- ical Methods in Natural Language Processing, pp. 2381–2391, Brussels, Belgium, Oct...

work page 2018

[14] [14]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Association for Computational Linguistics. doi: 10.18653/v1/D18-1260. URL https://aclanthology.org/D18-1260/. Denis Paperno, Germ ´an Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern ´andez. The lambada dataset: Word prediction requiring a broad discourse context.arXiv preprin...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/d18-1260

[15] [15]

Liu, and Heng Tao Shen

Fumin Shen, Chunhua Shen, W. Liu, and Heng Tao Shen. Supervised discrete hashing.2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 37–45,

work page 2015

[16] [16]

BitNet: Scaling 1-bit Transformers for Large Language Models

12 Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models.ArXiv, abs/2310.11453,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Alternating Multi-bit Quantization for Recurrent Neural Networks

Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. Alternating multi-bit quantization for recurrent neural networks. InInternational Con- ference on Learning Representations, volume abs/1802.00150,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization.ArXiv, abs/1912.08777,

work page arXiv 1912

[19] [19]

URLhttps:// api.semanticscholar.org/CorpusID:209405420. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christo- pher Dewan, Mona Diab, Xian Li, Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transforme...

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

The quantization block size is fixed at 128 following ARB (Li et al.,

and BiLLM (Huang et al., 2024), we use the C4 dataset with a sequence length of 2048 as calibration data to enable fair comparison. The quantization block size is fixed at 128 following ARB (Li et al.,

work page 2024

[21] [21]

As observed, output alignment is most effective and consistent when applied to the final layer of each block, for both Llama and OPT models. Inference and Storage Overhead Analysis.Our method introducesno additional inference or storage overhead, as it does not add any new quantization parameters and leaves both the model architecture and forward-pass com...

work page 2024

[22] [22]

Quantization Overhead.We provide in detail the quantization time of our method, compared to ARB-X and ARB-RCLi et al

and4.4–5.1×faster than the full-precision model, hence these perfor- mance gains also apply to our method. Quantization Overhead.We provide in detail the quantization time of our method, compared to ARB-X and ARB-RCLi et al. (2024), across architecture. While our method incurs slightly higher overhead than ARB-RC due to the additional closed-form computat...

work page 2024