arxiv: 2506.01352 · v2 · submitted 2025-06-02 · 💻 cs.LG

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

Pith reviewed 2026-05-19 11:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords: activation quantization, pipeline parallelism, distributed training, large language models, convergence rate, quantization error, Hadamard transform, adaptive bit allocation

The pith

TAH-Quant enables 3-4 bit activation quantization in pipeline parallelism while preserving the convergence rate of standard SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models trained across distributed devices hit a communication bottleneck when pipeline stages exchange activations over slow networks. TAH-Quant addresses this by quantizing those activations to 3-4 bits using entropy-guided, tile-wise adaptive bit allocation, combined with a Hadamard transformation and pivot swapping to suppress outliers. The paper proves that training with this quantization keeps the O(1/sqrt(T)) convergence rate of vanilla stochastic gradient descent. If the approach holds up, it allows much higher throughput in resource-pooled decentralized training without AQ-SGD's activation-cache overhead or loss of model quality.

Core claim

Pipeline parallel training equipped with TAH-Quant maintains a convergence rate of O(1/sqrt(T)), matching that of vanilla stochastic gradient descent, while achieving an aggressive activation quantization ratio of 3-4 bits and up to 4.3x throughput speedup.
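
For orientation, an O(1/sqrt(T)) rate for smooth nonconvex objectives is conventionally stated as a bound on the average expected squared gradient norm. The display below is that generic textbook form, not the paper's theorem: the symbols f, x_t, and the constant C are notation assumed here, and the paper's exact statement and constants are not reproduced on this page.

```latex
\[
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\bigl\|\nabla f(x_t)\bigr\|^2
\;\le\; \frac{C}{\sqrt{T}}
\]
```

Read this way, "matching vanilla SGD" means quantization may change the constant C but not the 1/sqrt(T) dependence on the number of steps T.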

What carries the argument

Tile-wise Adaptive Hadamard Quantization, which applies fine-grained quantization to small channel windows per token with entropy-guided bit allocation and outlier suppression through Hadamard-based transformation and pivot swapping.
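
The outlier-suppression idea can be illustrated with a minimal sketch, assuming an orthonormal Walsh-Hadamard transform applied to one tile. This is not the authors' implementation: the tile values are made up, the helper names (`fwht`, `peak`, `l2_norm`) are hypothetical, and pivot swapping is not modeled because this page does not describe it. The sketch only shows that the rotation spreads a single large channel across the tile, lowering the peak magnitude while preserving the L2 norm.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal and self-inverse
    return [v * scale for v in out]

def l2_norm(values):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def peak(values):
    """Largest absolute value, which sets the range a min-max quantizer must cover."""
    return max(abs(v) for v in values)

# Hypothetical 8-channel tile with a single outlier channel.
tile = [0.1, -0.2, 0.05, 8.0, -0.1, 0.15, -0.05, 0.2]
rotated = fwht(tile)

print("peak before:", peak(tile), "after:", round(peak(rotated), 3))
print("norm before:", round(l2_norm(tile), 6), "after:", round(l2_norm(rotated), 6))
```

A smaller peak lets a fixed-bit quantizer use a finer step over the tile. Whether that lowers the final error depends on the data and the quantizer, so the sketch makes no claim about error.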

If this is right

  • Pipeline parallel training can achieve up to 4.3 times higher throughput compared to full-precision FP32.
  • Convergence rate remains equivalent to vanilla stochastic gradient descent.
  • The method avoids activation-cache overhead present in prior quantization approaches.
  • Communication volume for intermediate activations drops significantly under limited network bandwidth.
  • Performance holds across various training scenarios and model scales.
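
A back-of-the-envelope sketch shows why the payload reduction from lower precision is larger than the end-to-end speedup. It assumes, as a simplification, that compute and communication do not overlap within a step. All numbers and the function names `transfer_seconds` and `step_speedup` are hypothetical and are not taken from the paper's experiments.

```python
def transfer_seconds(num_elements, bits_per_element, bandwidth_bits_per_s):
    """Time to send one activation tensor at a given precision."""
    return num_elements * bits_per_element / bandwidth_bits_per_s

def step_speedup(num_elements, compute_s, bandwidth_bits_per_s, quant_bits, base_bits=32):
    """Speedup of one step, assuming compute and communication do not overlap."""
    base = compute_s + transfer_seconds(num_elements, base_bits, bandwidth_bits_per_s)
    quant = compute_s + transfer_seconds(num_elements, quant_bits, bandwidth_bits_per_s)
    return base / quant

# Hypothetical setting: 1M activation values per micro-batch, 100 Mbps link, 0.1 s compute.
print(step_speedup(1_000_000, 0.1, 100e6, quant_bits=4))  # compute time caps the gain
print(step_speedup(1_000_000, 0.0, 100e6, quant_bits=4))  # communication-only limit
```

In this toy model the 32-bit to 4-bit payload ratio is reached only when compute time is zero. With nonzero compute the step-level gain is smaller, which is one reason a measured throughput speedup can sit well below the raw compression ratio.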

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar tile-wise adaptive quantization could reduce communication in other distributed setups such as data or tensor parallelism.
  • Combining the approach with gradient compression techniques might yield additional speedups in fully decentralized systems.
  • The outlier suppression mechanism may prove useful for quantizing other tensors like weights in bandwidth-constrained environments.

Load-bearing premise

The combination of entropy-guided tile-wise bit allocation, Hadamard transformation, and pivot swapping keeps quantization errors small enough that they do not introduce bias or increased variance capable of breaking the standard SGD convergence bound.
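
The bias concern in this premise can be made concrete with a small sketch comparing two rounding rules on a uniform grid. Stochastic rounding is used here only as a standard example of an unbiased quantizer. This page does not say which rounding rule TAH-Quant uses, so treat that choice, and the function names, as assumptions.

```python
import math
import random

def round_nearest(x, step):
    """Deterministic round-to-nearest onto a grid with the given step."""
    return round(x / step) * step

def round_stochastic(x, step, rng):
    """Round up with probability equal to the fractional position, so E[result] = x."""
    lower = math.floor(x / step)
    frac = x / step - lower
    return (lower + (1 if rng.random() < frac else 0)) * step

rng = random.Random(0)  # fixed seed so the run is repeatable
x, step, trials = 0.3, 1.0, 20000
mean_stochastic = sum(round_stochastic(x, step, rng) for _ in range(trials)) / trials

print("nearest:", round_nearest(x, step))
print("stochastic mean:", mean_stochastic)
```

Round-to-nearest always maps 0.3 to the same grid point, so its error has a fixed sign. The stochastic rule averages back to the input at the cost of extra variance, and it is that variance the convergence bound has to absorb.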

What would settle it

A training run on a standard language model benchmark where loss fails to decrease at the expected rate or diverges under TAH-Quant would show the convergence guarantee does not hold.

Figures

Figures reproduced from arXiv: 2506.01352 by Binhang Yuan, Guangxin He, Kai Chen, Kun Yuan, Tianyi Bai, Yuan Cao, Yutong He.

Figure 1: Empirical justification of Assumption 4.
Figure 2: The training convergence for each task (loss vs. steps).
Figure 3: End-to-end training performance over different networks.
Figure 5: Ablation study for Hadamard transform. Secondly, to examine the effectiveness of the entropy-guided adaptive bit allocation, we compare TAH-QUANT with adaptive bit allocation enabled against a variant without adaptive allocation.

Original abstract

Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants, but is often bottlenecked by network communication, particularly under pipeline parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided tile-wise adaptive bit allocation for optimal bit usage, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. Compared with token-level allocation, the tile-wise allocator assigns precision at the granularity of small channel windows within each token, reducing quantization error under the same bit budget. We prove that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of O(1/sqrt(T)), matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant achieves an aggressive activation quantization ratio of 3-4 bits, providing up to 4.3x throughput speedup over uncompressed FP32 and up to 1.33x wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.
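
The abstract does not give the allocation rule, so the following is only a hypothetical sketch of what entropy-guided tile-wise bit allocation could look like. It estimates a histogram entropy per tile and gives the higher bit width to the highest-entropy tiles, subject to an average bit budget. The histogram estimator, the two-level 3/4-bit choice, the ranking rule, and every name in the code are assumptions made for illustration.

```python
import math

def histogram_entropy(values, bins=16):
    """Shannon entropy (in bits) of a fixed-width histogram over the tile's range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def allocate_bits(tiles, low_bits=3, high_bits=4, avg_budget=3.5):
    """Hypothetical allocator: highest-entropy tiles get high_bits within an average budget."""
    n = len(tiles)
    # Number of tiles that may receive high_bits without exceeding the average budget.
    n_high = int((avg_budget - low_bits) / (high_bits - low_bits) * n + 1e-9)
    n_high = max(0, min(n, n_high))
    order = sorted(range(n), key=lambda i: histogram_entropy(tiles[i]), reverse=True)
    bits = [low_bits] * n
    for i in order[:n_high]:
        bits[i] = high_bits
    return bits

# Four made-up tiles of eight channels each, from flat to widely spread.
tiles = [
    [1.0] * 8,
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
print(allocate_bits(tiles))
```

The design point this illustrates is granularity: the decision is made per small channel window, not per token, so a token containing both flat and spread regions does not have to pay the higher bit width everywhere.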

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TAH-Quant, a tile-wise adaptive Hadamard quantization method for activations in pipeline-parallel LLM training over slow networks. It integrates fine-grained tile-wise quantization, entropy-guided adaptive bit allocation, and Hadamard transformation with pivot swapping to suppress outliers. The central claims are a convergence rate of O(1/sqrt(T)) matching vanilla SGD, 3-4 bit quantization, up to 4.3x throughput speedup over FP32, and 1.33x wall-clock speedup over AQ-SGD while avoiding activation-cache overhead.

Significance. If the convergence analysis holds for the adaptive components, the work addresses a key bottleneck in decentralized pipeline-parallel training by reducing activation communication volume without degrading the SGD rate. The explicit convergence guarantee and comparison to AQ-SGD are positive features; empirical generalization across scenarios strengthens the practical contribution.

major comments (2)
  1. [Convergence analysis / Theorem 1] Convergence analysis (Theorem on O(1/sqrt(T)) rate): The proof must explicitly bound the second moment of quantization noise under entropy-guided tile-wise adaptive allocation. If the analysis invokes a generic fixed-bit or non-adaptive quantization lemma, the data-dependent bit decisions could correlate noise with activation statistics across pipeline stages, potentially inflating variance or introducing bias that violates the stated rate. The manuscript should derive or cite the precise assumption (e.g., unbiasedness or Lipschitz bound) that remains valid after pivot swapping and Hadamard transform.
  2. [Experiments / Figure 3] Experimental validation of convergence (Section 5 / Figure on loss curves): The reported preservation of convergence is shown only for selected models and bit budgets; it is unclear whether the adaptive allocator was ablated against fixed-bit baselines to isolate its effect on the observed O(1/sqrt(T)) behavior. Without variance estimates or multiple random seeds for the adaptive entropy estimator, it is difficult to confirm that the speedup does not trade off against hidden convergence degradation.
minor comments (2)
  1. [Method / §3.2] Notation for tile size and entropy estimator: Define the tile dimensions (channel window size) and the exact entropy computation formula early in Section 3; current description leaves ambiguity whether entropy is computed per-token or per-layer.
  2. [Experiments / §5.1] Baseline implementation details: Clarify whether AQ-SGD's activation-cache overhead is measured under identical pipeline stage counts and network bandwidth; the 1.33x wall-clock claim would be stronger with explicit bandwidth numbers.
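
The assumption requested in major comment 1 is often written in the following generic form for an unbiased compression operator. This is a common convention in the compressed-SGD literature, not a formula taken from the paper: the operator Q, the input x, and the variance constant \omega are notation assumed here.

```latex
\[
\mathbb{E}\bigl[Q(x)\mid x\bigr] = x,
\qquad
\mathbb{E}\bigl[\|Q(x)-x\|^2 \mid x\bigr] \le \omega\,\|x\|^2
\]
```

Under data-dependent bit allocation, the referee's point amounts to asking whether both conditions still hold conditionally on x once the bit widths themselves are functions of x, and whether a single \omega covers every allocation the scheme can produce.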

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating where revisions will be made to clarify the analysis and strengthen the experiments.

Point-by-point responses
  1. Referee: [Convergence analysis / Theorem 1] Convergence analysis (Theorem on O(1/sqrt(T)) rate): The proof must explicitly bound the second moment of quantization noise under entropy-guided tile-wise adaptive allocation. If the analysis invokes a generic fixed-bit or non-adaptive quantization lemma, the data-dependent bit decisions could correlate noise with activation statistics across pipeline stages, potentially inflating variance or introducing bias that violates the stated rate. The manuscript should derive or cite the precise assumption (e.g., unbiasedness or Lipschitz bound) that remains valid after pivot swapping and Hadamard transform.

    Authors: We thank the referee for this precise observation on the proof structure. Theorem 1 establishes the O(1/sqrt(T)) rate by first applying the Hadamard transform with pivot swapping, which renders the per-tile quantization noise unbiased, with a second moment bounded as a function of the local bit width; the entropy-guided allocator then distributes the global bit budget to minimize aggregate variance while preserving the unbiasedness property. The Lipschitz continuity of the loss (assumed in the standard SGD analysis) continues to hold after the orthogonal transform. To make the adaptive case fully explicit and rule out correlation-induced bias across pipeline stages, we will insert a dedicated supporting lemma in the revised Section 4 that directly bounds the second moment under entropy-guided allocation. revision: yes

  2. Referee: [Experiments / Figure 3] Experimental validation of convergence (Section 5 / Figure on loss curves): The reported preservation of convergence is shown only for selected models and bit budgets; it is unclear whether the adaptive allocator was ablated against fixed-bit baselines to isolate its effect on the observed O(1/sqrt(T)) behavior. Without variance estimates or multiple random seeds for the adaptive entropy estimator, it is difficult to confirm that the speedup does not trade off against hidden convergence degradation.

    Authors: We agree that isolating the contribution of the adaptive allocator and providing statistical robustness would improve clarity. The main experiments already compare TAH-Quant against fixed-bit quantization baselines (Section 5.2), but we will add an explicit ablation subsection that varies only the allocator while holding total bits constant. In addition, we will rerun the entropy-estimation experiments with three independent random seeds, report mean and standard deviation on the loss curves, and extend Figure 3 plus the appendix with results for two further model scales and bit budgets. revision: yes
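
The multi-seed reporting promised in this response could be computed along the following lines. This is a minimal sketch with made-up loss values and a hypothetical helper name `summarize_runs`; it uses the sample standard deviation from Python's `statistics` module and assumes every run logs the same steps.

```python
import statistics

def summarize_runs(loss_curves):
    """Per-step mean and sample standard deviation across runs (one curve per seed)."""
    steps = len(loss_curves[0])
    if any(len(curve) != steps for curve in loss_curves):
        raise ValueError("all runs must have the same number of steps")
    summary = []
    for values in zip(*loss_curves):
        summary.append((statistics.mean(values), statistics.stdev(values)))
    return summary

# Made-up loss values for three seeds, for illustration only.
runs = [
    [2.0, 1.5, 1.2],
    [2.2, 1.6, 1.1],
    [2.1, 1.4, 1.3],
]
for step, (mean, std) in enumerate(summarize_runs(runs)):
    print(step, round(mean, 3), round(std, 3))
```

With only three seeds the standard deviation is a rough indicator, so overlapping bands between the adaptive and fixed-bit variants would be weak evidence in either direction.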

Circularity Check

0 steps flagged

No significant circularity; convergence claim references external SGD benchmark independently

The paper introduces TAH-Quant with novel components including tile-wise quantization, entropy-guided adaptive bit allocation, Hadamard transformation, and pivot swapping. It explicitly claims a proof that pipeline parallel training equipped with TAH-Quant maintains the O(1/sqrt(T)) convergence rate of vanilla stochastic gradient descent. This is framed as an analysis of the impact of quantization error, not as a self-definitional loop, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The derivation chain remains self-contained against the external vanilla SGD benchmark, and no quoted reduction shows the claimed rate or error bounds collapsing to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the convergence claim likely rests on standard stochastic optimization assumptions not detailed here.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

    cs.DC 2026-05 unverdicted novelty 7.0

    NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

  2. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

  3. Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

    cs.LG 2026-04 unverdicted novelty 3.0

    A retrospective survey and empirical evaluation of deep learning optimization algorithms that identifies trends, design trade-offs, and future directions.
