Recognition: 2 theorem links · Lean Theorem
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
Pith reviewed 2026-05-13 17:22 UTC · model grok-4.3
The pith
Binary weights and ternary activations let Transformers reach within 3.5 percent of full-precision accuracy while running 16 to 24 times faster on GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BWTA projects tiny values to zero while binarizing weights and ternarizing activations, trains the model with Smooth Multi-Stage Quantization (which combines a Levelwise Degradation Strategy with a Magnitude-Alignment Projection Factor), and supplies a bit-packed BWTA MatMul CUDA kernel. Together these keep the average GLUE drop for BERT at 3.5 percent, maintain comparable perplexity for LLMs, and deliver a 16-24x kernel-level speedup over FP16.
What carries the argument
The BWTA scheme that binarizes weights and ternarizes activations by projecting tiny values to zero, supported by Smooth Multi-Stage Quantization training and a custom instruction-level-parallel CUDA MatMul kernel.
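The quantization scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the zero-projection threshold and the scaling rules are assumptions standing in for the Magnitude-Alignment Projection Factor, whose exact form the review does not reproduce.

```python
import numpy as np

def binarize_weights(w):
    """Binary weights: sign of each entry, scaled by the mean magnitude
    (a common binarization scale; the paper's exact scale may differ)."""
    scale = np.abs(w).mean()
    return np.where(w >= 0, 1.0, -1.0) * scale

def ternarize_activations(x, threshold=0.05):
    """Ternary activations: entries with magnitude below the (assumed)
    threshold are projected to zero; survivors keep their sign, scaled
    by the mean magnitude of the surviving entries."""
    mask = np.abs(x) > threshold
    scale = np.abs(x[mask]).mean() if mask.any() else 0.0
    return np.sign(x) * mask * scale
```

The point of the zero projection is visible even in this toy form: near-zero activations, which sign-only binarization would force to ±1 and thereby distort, simply drop out of the ternary representation.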
If this is right
- BERT models under BWTA show an average 3.5 percent GLUE drop and less than 2 percent drop on five additional tasks.
- Large language models quantized with BWTA retain comparable perplexity and task accuracy to their full-precision versions.
- The custom CUDA kernel delivers 16 to 24 times speedup over FP16 at the matrix-multiplication level and 216 to 330 tokens per second end-to-end prefill on LLMs.
- Memory footprint is reduced because weights and activations use only one or two bits per value.
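The kernel-level speedup rests on the fact that a binary-times-ternary dot product reduces to bitwise operations. The sketch below shows the standard XOR-and-popcount identity such a kernel exploits; the paper's actual CUDA tiling, packing layout, and instruction selection are not reproduced here.

```python
def pack_bits(signs):
    """Pack a sign vector (True = +1, False = -1) into an int bitset."""
    v = 0
    for i, positive in enumerate(signs):
        if positive:
            v |= 1 << i
    return v

def binary_ternary_dot(w_signs, x_signs, x_mask):
    """Dot product of binary weights (+-1) with ternary activations
    (+-1 on active positions, 0 elsewhere): on active positions,
    agreeing signs contribute +1 and disagreeing signs -1, so
    dot = nnz - 2 * popcount(disagreements)."""
    disagree = (w_signs ^ x_signs) & x_mask
    nnz = bin(x_mask).count("1")
    return nnz - 2 * bin(disagree).count("1")
```

On a GPU the popcounts map to hardware `__popc`-style instructions over 32- or 64-bit packed words, which is where the 16-24x advantage over FP16 multiply-accumulate comes from.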
Where Pith is reading between the lines
- The same projection-and-kernel pattern could be tested on other attention-based architectures such as vision transformers to check whether the accuracy preservation holds beyond language.
- If the zero-projection rule generalizes, similar custom kernels might be written for additional low-bit formats on the same GPU hardware without waiting for new instruction sets.
- The reported token-per-second numbers suggest BWTA could be used to serve larger models on existing server GPUs before new accelerator hardware arrives.
Load-bearing premise
Projecting tiny values to zero together with the Smooth Multi-Stage Quantization procedure will preserve accuracy across Transformer models and tasks without needing architecture-specific retuning.
What would settle it
Running the published BWTA procedure on a standard BERT-base model and observing an average GLUE score drop larger than 5 percent compared with the full-precision baseline would falsify the near-full-precision claim.
Original abstract
Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Binary Weights & Ternary Activations (BWTA) quantization for Transformer models, which projects tiny values to zero to mitigate zero-point distortion in binarization. Training uses Smooth Multi-Stage Quantization combining Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor for stable convergence. Inference relies on a custom BWTA MatMul CUDA kernel with bit-packing for linear and attention operators. Experiments claim near full-precision results: 3.5% average GLUE drop for BERT-base, <2% on five tasks, comparable LLM perplexity/accuracy, plus 16-24x kernel speedup over FP16 and 216-330 tokens/s end-to-end prefill.
Significance. If the empirical claims hold with proper validation, the work would demonstrate a practical algorithm-hardware co-design for ultra-low-bit Transformer inference that achieves substantial efficiency gains while preserving model quality, addressing key barriers to deploying binarized/ternary models on GPUs.
major comments (3)
- [Experiments] Experimental section: the headline accuracy claims (3.5% GLUE drop, <2% on five tasks, comparable LLM metrics) are presented without error bars, standard deviations across runs, or ablation studies isolating the zero-projection step versus the Smooth Multi-Stage Quantization components; this makes it impossible to assess whether the reported drops are statistically stable or architecture-specific.
- [§3.2] §3.2 (Smooth Multi-Stage Quantization): the Magnitude-Alignment Projection Factor is introduced as a free hyper-parameter without a scaling analysis or bound showing when the induced distortion remains negligible as model depth or width increases; the assumption that it generalizes across BERT and LLMs without per-architecture retuning is load-bearing for the central accuracy claim but unsupported by the provided evidence.
- [Inference kernel] Kernel implementation (BWTA MatMul CUDA kernel): the 16-24x speedup and seamless integration claims rest on an unverified custom kernel; no micro-benchmark tables compare against cuBLAS/FP16 baselines under identical batch/size conditions, nor is there confirmation that the bit-packing preserves numerical equivalence to the quantized forward pass.
minor comments (2)
- [Abstract] Abstract and §4: the phrase 'approaches full-precision performance' is used without a precise definition (e.g., within X% of FP32 on all tasks); a table summarizing per-task deltas would improve clarity.
- [§3] Notation: the Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor are described at a high level; explicit pseudocode or equations for the projection threshold and degradation schedule would aid reproducibility.
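As an illustration of the pseudocode the minor comments request, a levelwise degradation schedule could take a shape like the following. The stage count and starting level are assumptions for illustration; only the constraint that the final stage is binary (L_l = 1) comes from the paper's {L0, L1, ..., Ll} notation.

```python
def levelwise_schedule(num_stages, start_levels=16):
    """Hypothetical levelwise degradation schedule: the number of
    quantization levels shrinks geometrically across training stages,
    ending at the fully binary stage L_l = 1."""
    levels = []
    l = start_levels
    for _ in range(num_stages):
        levels.append(max(1, l))
        l //= 2
    levels[-1] = 1  # final stage is binary by construction
    return levels
```

A schedule of this shape is what "gradual introduction of quantization constraints" would mean operationally: each stage fine-tunes at the current level count before the next halving.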
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on experimental rigor, the analysis of the projection factor, and kernel verification. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Experiments] Experimental section: the headline accuracy claims (3.5% GLUE drop, <2% on five tasks, comparable LLM metrics) are presented without error bars, standard deviations across runs, or ablation studies isolating the zero-projection step versus the Smooth Multi-Stage Quantization components; this makes it impossible to assess whether the reported drops are statistically stable or architecture-specific.
Authors: We agree that error bars and component ablations would improve statistical assessment. In the revised manuscript we will report all GLUE and LLM metrics as means over at least three random seeds with standard deviations. We will also add a dedicated ablation subsection that isolates the zero-projection step from the Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor, quantifying their individual and joint contributions to final accuracy.
Revision: yes
Referee: [§3.2] §3.2 (Smooth Multi-Stage Quantization): the Magnitude-Alignment Projection Factor is introduced as a free hyper-parameter without a scaling analysis or bound showing when the induced distortion remains negligible as model depth or width increases; the assumption that it generalizes across BERT and LLMs without per-architecture retuning is load-bearing for the central accuracy claim but unsupported by the provided evidence.
Authors: The factor is fixed at 0.1 for all models without retuning. While we do not derive a closed-form bound on distortion scaling, we empirically validate the choice across BERT-base/large and LLMs up to 7B parameters. In revision we will add a sensitivity study varying the factor on models of different widths and depths, together with an empirical analysis of activation magnitude distributions showing why the induced distortion stays small.
Revision: partial
Referee: [Inference kernel] Kernel implementation (BWTA MatMul CUDA kernel): the 16-24x speedup and seamless integration claims rest on an unverified custom kernel; no micro-benchmark tables compare against cuBLAS/FP16 baselines under identical batch/size conditions, nor is there confirmation that the bit-packing preserves numerical equivalence to the quantized forward pass.
Authors: We will add a micro-benchmark table comparing the BWTA MatMul kernel against cuBLAS FP16 for the exact matrix shapes and batch/sequence configurations used in the BERT and LLM experiments. We will also include a numerical equivalence verification section showing that the bit-packed kernel matches a reference quantized implementation to within 1e-6 maximum absolute error.
Revision: yes
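The equivalence check the authors promise has a simple harness shape, sketched below. The kernel itself is not available here, so an integer-arithmetic matmul stands in for the bit-packed path; only the comparison structure and the 1e-6 tolerance mirror the rebuttal.

```python
import numpy as np

def reference_quantized_matmul(Wb, Xt):
    """Reference path: plain float matmul of already-quantized operands."""
    return Wb @ Xt

def intpath_quantized_matmul(Wb, Xt):
    """Stand-in for the bit-packed kernel: the same product accumulated
    in integer arithmetic, then cast back to float."""
    return (Wb.astype(np.int32) @ Xt.astype(np.int32)).astype(np.float64)

def max_abs_error(Wb, Xt):
    """Maximum elementwise deviation between the two paths."""
    return float(np.max(np.abs(
        reference_quantized_matmul(Wb, Xt) - intpath_quantized_matmul(Wb, Xt))))
```

Because binary/ternary operands and their small integer sums are exactly representable in float64, a correct packed kernel should drive this error to zero; the 1e-6 bound leaves headroom for the final scaling step.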
Circularity Check
No circularity; empirical co-design validated by experiments
Full rationale
The paper presents BWTA as an algorithm-hardware co-design consisting of a zero-projection binarization scheme, Smooth Multi-Stage Quantization training (Levelwise Degradation + Magnitude-Alignment Projection Factor), and a custom CUDA MatMul kernel. No equations, derivations, or predictions are shown that reduce by construction to fitted parameters, self-defined quantities, or self-citation chains. Performance numbers (3.5% GLUE drop, 16-24x speedup) are reported as direct experimental outcomes on BERT and LLMs rather than outputs forced from the method's own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- Magnitude-Alignment Projection Factor
axioms (1)
- Domain assumption: gradual introduction of quantization constraints via levelwise degradation enables stable training of low-bit models.
invented entities (1)
- BWTA MatMul CUDA kernel (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "propose Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero... Levelwise Degradation Strategy... Magnitude-Alignment Projection Factor"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "Smooth Multi-Stage Quantization... Levelwise Degradation Strategy... {L0, L1, ..., Ll} with Ll=1"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.