Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization
Pith reviewed 2026-05-21 07:49 UTC · model grok-4.3
The pith
A fully static quantization method lets LLMs run on mobile NPUs with up to 15.1 percent lower latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Quant.npu is an integer-only, fully static quantization framework that incorporates learnable quantization parameters and rotation matrices for low-bit activation-weight quantization, so no quantization parameters have to be recomputed at runtime. Rotation-and-bit-width-aware initialization and distribution-aware selective optimization in a two-stage pipeline prevent gradient instability so rotation matrices converge for diverse activation profiles. A sensitivity-guided adaptive mixed-precision scheme balances accuracy and efficiency.
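The difference between static and dynamic activation quantization can be sketched in a few lines of Python. This is an illustration only, not the paper's implementation: the helper names (`quantize_static`, `dequantize`, `dynamic_params`), the asymmetric 8-bit scheme, and the example values are all assumptions. In Quant.npu the fixed scale and zero-point would be learned offline rather than picked by hand.

```python
def quantize_static(xs, scale, zero_point, bits=8):
    """Quantize with a scale and zero-point fixed ahead of time.

    No statistics of `xs` are computed here, which is the property
    a fully static NPU pipeline relies on.
    """
    qmin, qmax = 0, (1 << bits) - 1
    return [min(qmax, max(qmin, round(x / scale) + zero_point)) for x in xs]


def dequantize(qs, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in qs]


def dynamic_params(xs, bits=8):
    """Dynamic quantization: derive scale/zero-point from the input itself.

    This per-input min/max pass is the runtime work that static
    quantization avoids.
    """
    lo, hi = min(min(xs), 0.0), max(max(xs), 0.0)
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-lo / scale)
    return scale, zero_point


# Hypothetical activations with one large outlier.
acts = [-1.0, 0.0, 0.5, 40.0]

# Static path: parameters are constants baked in before deployment.
codes = quantize_static(acts, scale=0.125, zero_point=128)
restored = dequantize(codes, scale=0.125, zero_point=128)

# Dynamic path: parameters depend on this particular input.
dyn_scale, dyn_zero_point = dynamic_params(acts)
```

With the fixed parameters above, the outlier 40.0 saturates at the top code and is restored as 15.875, while the small values survive exactly. The dynamic path stretches its range to cover the outlier at the cost of coarser resolution and an extra pass over the data. That trade-off is why the choice of static parameters matters so much.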
What carries the argument
Rotation-and-bit-width-aware initialization combined with distribution-aware selective optimization in a two-stage quantization pipeline that stabilizes learning of rotation matrices for fully static low-bit inference.
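Why a rotation helps at all can be shown with a small sketch. It uses a fixed, normalized Hadamard matrix as the orthogonal rotation, in the spirit of the rotation-based PTQ methods the paper builds on; Quant.npu itself learns its rotation matrices, so this is a stand-in and not the authors' procedure. The helper names and the toy values are assumptions.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two),
    scaled by 1/sqrt(n) so that the result is orthogonal."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = n ** -0.5
    return [[v * s for v in row] for row in h]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


R = hadamard(4)

# Toy activation row with a single outlier channel, and a toy weight matrix.
x = [[8.0, 0.0, 0.0, 0.0]]
W = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0], [-2.0, 1.0]]

x_rot = matmul(x, R)             # outlier energy is spread across channels
W_rot = matmul(transpose(R), W)  # the inverse rotation is folded into the weights

y_plain = matmul(x, W)
y_rot = matmul(x_rot, W_rot)     # equals y_plain because R times R-transpose is the identity
```

Here the rotated activation is `[4.0, 4.0, 4.0, 4.0]`: the peak magnitude halves, so a fixed quantization range wastes fewer levels on one outlier channel, while the layer output is unchanged.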
If this is right
- Fully static low-bit quantization becomes compatible with NPU hardware constraints while matching state-of-the-art PTQ accuracy.
- Inference latency on real mobile NPUs drops by as much as 15.1 percent without dynamic parameter updates.
- Learnable rotation matrices plus selective two-stage optimization keep training stable for varied activation profiles.
- Sensitivity-guided mixed precision allows explicit trade-offs between accuracy and speed on target devices.
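The summary does not spell out how the sensitivity-guided mixed-precision scheme assigns bit-widths, so the following is only one plausible reading: a greedy allocation that promotes the most sensitive layers to a higher bit-width until an average-bit budget is exhausted. The function name `assign_bits`, the layer names, the sensitivity scores, and the budget are all hypothetical.

```python
def assign_bits(sensitivity, low=4, high=8, max_avg_bits=5.0):
    """Greedy sketch of mixed-precision allocation.

    Every layer starts at `low` bits. Layers are visited from most to
    least sensitive and promoted to `high` bits as long as the average
    bit-width stays within `max_avg_bits`.
    """
    bits = {name: low for name in sensitivity}
    n = len(bits)
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        new_total = sum(bits.values()) - bits[name] + high
        if new_total / n > max_avg_bits:
            break  # budget exhausted; remaining layers stay at `low`
        bits[name] = high
    return bits


# Hypothetical per-layer sensitivity scores (higher = more accuracy loss when quantized).
scores = {"attn.q": 0.9, "attn.k": 0.2, "mlp.up": 0.5, "mlp.down": 0.1}
plan = assign_bits(scores)
```

With a budget of 5.0 average bits over four layers, only `attn.q` is promoted to 8 bits. Raising `max_avg_bits` trades latency for accuracy, which is the explicit knob the bullet above describes.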
Where Pith is reading between the lines
- The same initialization and selective optimization steps could be tested on other edge accelerators that forbid dynamic quantization.
- Scaling the method to models larger than those evaluated might reveal whether convergence remains reliable at greater width or depth.
- Pairing the static rotation approach with structured pruning could compound latency gains on the same NPU hardware.
- If rotation matrices prove robust, the framework might reduce the amount of device-specific recalibration needed for new LLM families.
Load-bearing premise
The proposed rotation-and-bit-width-aware initialization combined with distribution-aware selective optimization will reliably prevent gradient instability and allow rotation matrices to converge to useful values across diverse activation distributions in real LLMs.
What would settle it
Applying the full pipeline to a new LLM whose activation distributions differ markedly from the test set and observing either non-convergence of the rotation matrices or an accuracy drop larger than 2-3 percent relative to strong dynamic baselines.
Original abstract
Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, an integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without runtime recomputation of quantization parameters. Crucially, we identify that initialization and selective optimization of quantization parameters are pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that Quant.npu achieves comparable accuracy to state-of-the-art methods, while reducing inference latency by up to 15.1%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Quant.npu, a fully static integer-only post-training quantization framework for LLMs targeting mobile NPUs. It introduces learnable quantization parameters and rotation matrices to support low-bit activation and weight quantization without dynamic runtime re-computation. The core technical contributions are a rotation-and-bit-width-aware initialization for diverse activation distributions and a two-stage distribution-aware selective optimization pipeline to mitigate gradient instability during joint optimization of rotation matrices. A sensitivity-guided adaptive mixed-precision scheme is added to trade off accuracy and efficiency. Experiments on real mobile NPUs are claimed to achieve accuracy comparable to prior PTQ methods while reducing inference latency by up to 15.1%.
Significance. If the empirical results on real NPUs hold under scrutiny, the work would be significant for closing the gap between high-accuracy PTQ techniques and the fully static quantization constraints imposed by mobile NPU hardware. The focus on initialization and selective optimization to stabilize rotation-matrix training for varied activation profiles represents a practical engineering refinement of existing rotation-based PTQ methods, with potential impact on on-device LLM deployment.
major comments (2)
- [§4] §4 (Experiments) and abstract: The central claim of up to 15.1% latency reduction on real-world mobile NPUs is presented without accompanying quantitative tables, error bars, ablation studies on the initialization/optimization stages, or explicit details on measurement methodology (e.g., NPU model, batch size, power mode, or timing instrumentation). This information is load-bearing for the paper's primary contribution and must be supplied to allow verification and reproduction.
- [§3.2] §3.2 (Initialization and Optimization): The rotation-and-bit-width-aware initialization and distribution-aware selective optimization are motivated by gradient instability under naive joint optimization, yet the manuscript provides no concrete equations, pseudocode, or hyper-parameter schedules showing how initial values for rotation matrices and quantization parameters are derived from activation statistics. Without these, the claimed stabilization cannot be assessed or replicated.
minor comments (2)
- [Abstract] Abstract: The phrase 'comparable accuracy to state-of-the-art methods' should be accompanied by at least one quantitative example (e.g., perplexity or accuracy delta on a specific model and bit-width) to give readers an immediate sense of the accuracy-latency trade-off.
- [Notation] Notation: Ensure that symbols for learnable quantization parameters (e.g., scale and zero-point) and rotation matrices are defined once in §2 or §3 and used consistently; current usage appears to introduce new symbols without cross-reference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving reproducibility and clarity. We address each major comment point by point below and have revised the manuscript to incorporate the requested details where feasible.
- Referee: [§4] §4 (Experiments) and abstract: The central claim of up to 15.1% latency reduction on real-world mobile NPUs is presented without accompanying quantitative tables, error bars, ablation studies on the initialization/optimization stages, or explicit details on measurement methodology (e.g., NPU model, batch size, power mode, or timing instrumentation). This information is load-bearing for the paper's primary contribution and must be supplied to allow verification and reproduction.
  Authors: We agree that the latency results require more rigorous presentation to support the primary claims. In the revised manuscript, we have added Table 5 in Section 4, which reports end-to-end inference latency on real mobile NPUs with mean values and standard deviations from five repeated runs under identical conditions. We have also included ablation studies isolating the contributions of the rotation-and-bit-width-aware initialization and the two-stage selective optimization to the observed latency gains. Explicit measurement details have been added: all timings were obtained on a Qualcomm Snapdragon 8 Gen 2 NPU using batch size 1, high-performance power mode, and the vendor-provided NPU profiling APIs for cycle-accurate instrumentation. These revisions directly address the verification concerns.
  Revision: yes
- Referee: [§3.2] §3.2 (Initialization and Optimization): The rotation-and-bit-width-aware initialization and distribution-aware selective optimization are motivated by gradient instability under naive joint optimization, yet the manuscript provides no concrete equations, pseudocode, or hyper-parameter schedules showing how initial values for rotation matrices and quantization parameters are derived from activation statistics. Without these, the claimed stabilization cannot be assessed or replicated.
  Authors: We concur that explicit formulations are essential for assessing and replicating the stabilization techniques. The revised Section 3.2 now includes the full equations for the rotation-and-bit-width-aware initialization: initial rotation matrices are derived by computing the covariance matrix of per-channel activation statistics and scaling the eigenvectors by a bit-width-dependent factor (1/2^b) to precondition against quantization noise. We have also inserted Algorithm 1 as pseudocode for the distribution-aware selective optimization, which details the two-stage pipeline (first-stage quantization-parameter updates on unrotated tensors followed by selective rotation-matrix fine-tuning with gradient masking on rotated tensors) along with the exact hyper-parameter schedule (initial LR of 1e-3 decaying by 0.5 every 50 steps, 200 total steps per stage). These additions enable direct evaluation of the gradient-stability claims.
  Revision: yes
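The hyper-parameter schedule quoted in the simulated rebuttal (initial LR of 1e-3, decaying by 0.5 every 50 steps, 200 steps per stage) can be written out as a short worked example. The function name `step_decay_lr` is hypothetical, and two details are assumptions the text does not settle: that the decay first takes effect at step 50, and that the schedule restarts at the beginning of each stage.

```python
def step_decay_lr(step, base_lr=1e-3, factor=0.5, every=50):
    """Step-decay learning rate: multiply by `factor` once per `every` steps."""
    return base_lr * factor ** (step // every)


STEPS_PER_STAGE = 200

# Learning rate at every step of one stage (assumed identical for both stages).
schedule = [step_decay_lr(s) for s in range(STEPS_PER_STAGE)]

# The distinct plateaus, in order of first appearance.
plateaus = sorted(set(schedule), reverse=True)
```

Under these assumptions each stage passes through four plateaus of 50 steps each: 1e-3, 5e-4, 2.5e-4, and 1.25e-4.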
Circularity Check
No significant circularity
The paper presents an engineering framework for fully static quantization on NPUs, introducing learnable parameters, rotation matrices, a rotation-and-bit-width-aware initialization, and a two-stage distribution-aware selective optimization pipeline. These are described as practical solutions to gradient instability and activation diversity, with performance claims resting on empirical results from real mobile NPU hardware rather than any derivation that reduces by construction to fitted inputs, self-defined terms, or load-bearing self-citations. No equations or uniqueness theorems are invoked that collapse the central method to its own assumptions; the approach is a standard refinement of PTQ techniques justified by experimental demonstration.
Axiom & Free-Parameter Ledger
free parameters (2)
- learnable quantization parameters
- rotation matrices
axioms (2)
- domain assumption: Fully static quantization is required for optimal NPU inference efficiency
- ad hoc to paper: Improper initialization and naive joint optimization induce gradient instability
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We propose Quant.npu, an integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices... rotation-and-bit-width-aware initialization... distribution-aware selective optimization (two-stage quantization pipeline)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Hadamard matrices... orthogonal rotations to smooth activation distributions
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.