TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
Pith reviewed 2026-05-08 01:31 UTC · model grok-4.3
The pith
TACO compresses intermediate tensors in tensor-parallel LLM training to FP8, delivering up to 1.87X end-to-end throughput gains with near-lossless accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TACO is an FP8-based framework for compressing tensor-parallel intermediate tensors. It first applies data-driven reshaping together with an Adaptive Scale-Hadamard Transform to enable high-fidelity quantization, then uses Dual-Scale Quantization to preserve numerical stability across training steps, and finally employs a highly fused compression operator that cuts memory traffic and kernel launch overhead while allowing overlap with communication. Integrated with existing data- and pipeline-parallelism methods, this yields a compression-enabled 3D-parallel training system; experiments on GPT and Qwen models report up to 1.87X end-to-end throughput improvement at near-lossless accuracy.
What carries the argument
Data-driven reshaping combined with Adaptive Scale-Hadamard Transform and Dual-Scale Quantization inside a fused FP8 compression operator applied to tensor-parallel intermediate tensors.
Load-bearing premise
The data-driven reshaping and Dual-Scale Quantization will remain stable and will not introduce convergence issues or accuracy degradation when applied across the full range of tensor distributions encountered in long training runs on diverse model architectures.
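The pipeline this premise rests on (an orthogonal rotation to spread outliers, then FP8 quantization under a combined coarse/fine scale) can be sketched in NumPy. This is a minimal simulation, not the paper's operator: the E4M3 rounding model, the block shape, and the `coarse_scale` choice below are all assumptions made for illustration.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_fp8_e4m3(x):
    """Crude FP8 E4M3 model: clamp to +-448, round the mantissa to 3 bits."""
    x = np.clip(x, -448.0, 448.0)
    exp = np.floor(np.log2(np.abs(x), out=np.zeros_like(x), where=x != 0))
    step = 2.0 ** (exp - 3)                 # spacing of representable values
    return np.where(x == 0, 0.0, np.round(x / step) * step)

def compress(t, coarse_scale):
    """Dual-scale FP8 round-trip for one tensor (illustrative only)."""
    H = hadamard(t.shape[-1])
    rotated = t @ H                         # spread outliers before quantizing
    fine_scale = float(np.abs(rotated).max()) / 448.0
    scale = max(fine_scale, coarse_scale)   # fresh fine scale + slow coarse scale
    return fake_fp8_e4m3(rotated / scale), scale, H

def decompress(q, scale, H):
    return (q * scale) @ H.T                # H is orthonormal, so H.T inverts it

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)) * 0.01    # dense, near-zero intermediate tensor
x[0, 0] = 5.0                              # one activation outlier
q, s, H = compress(x, coarse_scale=1e-4)
max_err = float(np.abs(decompress(q, s, H) - x).max())
```

Without the rotation, the single outlier would force `fine_scale` up by roughly 8x (a factor of √64) and crush the near-zero entries onto a few quantization steps; spreading the outlier's energy across 64 coordinates first is the fidelity argument the Hadamard step carries.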
What would settle it
Run a complete training epoch of a GPT-scale or Qwen model with TACO enabled and measure final accuracy or perplexity against an otherwise identical uncompressed baseline; significant degradation or throughput gains below 1.5X would falsify the central effectiveness claim.
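As a worked sketch of this falsification criterion (every number below is a hypothetical placeholder, not a value from the paper):

```python
# Hypothetical measurements; the paper's actual metrics would replace these.
baseline_ppl, taco_ppl = 12.40, 12.45      # validation perplexity
baseline_tps, taco_tps = 1.00e5, 1.71e5    # end-to-end tokens/sec

ppl_delta = (taco_ppl - baseline_ppl) / baseline_ppl   # relative degradation
speedup = taco_tps / baseline_tps

near_lossless = ppl_delta < 0.01   # e.g. under 1% perplexity increase
effective = speedup >= 1.5         # the falsification threshold named above
claim_supported = near_lossless and effective
```

The 1.5X threshold is taken from this criterion; the 1% perplexity tolerance is an assumption, since "near-lossless" would need the paper's own definition.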
Original abstract
Handling communication overhead in large-scale tensor-parallel training remains a critical challenge due to the dense, near-zero distributions of intermediate tensors, which exacerbate errors under frequent communication and introduce significant computational overhead during compression. To this end, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, we employ a data-driven reshaping strategy combined with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while its Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator to reduce memory traffic and kernel launch overhead, allowing efficient overlap with communication. Finally, we integrate TACO with existing state-of-the-art methods for Data and Pipeline Parallelism to develop a compression-enabled 3D-parallel training framework. Detailed experiments on GPT models and Qwen model demonstrate up to 1.87X end-to-end throughput improvement while maintaining near-lossless accuracy, validating the effectiveness and efficiency of TACO in large-scale training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TACO, an FP8-based framework for compressing intermediate tensors during tensor-parallel LLM training. It introduces a data-driven reshaping strategy with an Adaptive Scale-Hadamard Transform and Dual-Scale Quantization for high-fidelity compression, fused operators to minimize overhead and enable overlap with communication, and integration into a 3D-parallel (tensor, data, pipeline) training system. Experiments on GPT and Qwen models report up to 1.87X end-to-end throughput gains while claiming near-lossless accuracy preservation.
Significance. If the empirical results are robust, TACO addresses a key scalability bottleneck in large-scale distributed training by reducing communication volume for dense intermediate tensors without degrading convergence. The combination of quantization techniques tailored to tensor distributions and kernel-level optimizations could enable faster training or larger models on existing hardware clusters, with direct relevance to production LLM systems.
Major comments (3)
- [Abstract] Abstract and experimental claims: the reported 'near-lossless accuracy' and 1.87X throughput are presented without quantitative details on the exact accuracy metric (e.g., validation perplexity delta, downstream task scores), baseline configurations, or statistical measures such as standard deviation across runs, which are load-bearing for validating the stability assertion.
- [Method] Description of Dual-Scale Quantization and data-driven reshaping: the manuscript states these ensure numerical stability and high-fidelity FP8 quantization but provides no ablation isolating their individual contributions or analysis of error accumulation over full training trajectories on diverse tensor distributions, undermining the robustness claim for long runs.
- [Experiments] Integration and end-to-end results: while throughput improvements are claimed when combined with data and pipeline parallelism, there is no explicit verification (e.g., loss curves or gradient statistics) that the compression does not introduce convergence issues across the full range of tensor shapes and training steps encountered in the GPT/Qwen experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights opportunities to strengthen the clarity of our claims and the robustness of our evaluations. We address each major comment below and will revise the manuscript to incorporate the requested details and analyses.
Point-by-point responses
Referee: [Abstract] Abstract and experimental claims: the reported 'near-lossless accuracy' and 1.87X throughput are presented without quantitative details on the exact accuracy metric (e.g., validation perplexity delta, downstream task scores), baseline configurations, or statistical measures such as standard deviation across runs, which are load-bearing for validating the stability assertion.
Authors: We agree that the abstract would benefit from more precise quantitative details. In the revised manuscript, we will update the abstract to explicitly state the accuracy metrics employed (validation perplexity and downstream task scores), the baseline configurations (standard 3D-parallel training without compression), the observed perplexity deltas, and standard deviations across repeated runs. These details are already reported in the experiments section and will be summarized concisely in the abstract. revision: yes
Referee: [Method] Description of Dual-Scale Quantization and data-driven reshaping: the manuscript states these ensure numerical stability and high-fidelity FP8 quantization but provides no ablation isolating their individual contributions or analysis of error accumulation over full training trajectories on diverse tensor distributions, undermining the robustness claim for long runs.
Authors: We will add a new ablation subsection to the method and experiments sections that isolates the contributions of the data-driven reshaping strategy, Adaptive Scale-Hadamard Transform, and Dual-Scale Quantization. This will include quantitative comparisons of quantization error and end-to-end accuracy with and without each component. We will also include analysis of error accumulation by tracking per-step quantization error and its effect on gradient norms over full training trajectories for the diverse tensor distributions encountered in the GPT and Qwen models. revision: yes
Referee: [Experiments] Integration and end-to-end results: while throughput improvements are claimed when combined with data and pipeline parallelism, there is no explicit verification (e.g., loss curves or gradient statistics) that the compression does not introduce convergence issues across the full range of tensor shapes and training steps encountered in the GPT/Qwen experiments.
Authors: We will expand the experiments section to include explicit loss curves and gradient norm statistics for the integrated 3D-parallel (tensor + data + pipeline) training runs with TACO. These plots will cover the full range of tensor shapes and training steps from the GPT and Qwen experiments, directly comparing convergence behavior against the uncompressed baseline to confirm the absence of degradation. revision: yes
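The verification the referee asks for amounts to logging per-step compression error alongside gradient statistics and checking that the error stays flat over the run. A minimal mock, with a generic 8-bit round-trip standing in for TACO's fused FP8 operator (the quantizer, tensor size, and gradient schedule below are all placeholders):

```python
import numpy as np

def quantize_roundtrip(t, bits=8):
    """Generic symmetric uniform quantizer; a stand-in, not TACO's operator."""
    scale = max(float(np.abs(t).max()), 1e-12) / (2 ** (bits - 1) - 1)
    return np.round(t / scale) * scale

rng = np.random.default_rng(1)
log = []
for step in range(100):
    grad = rng.standard_normal(4096) / (1 + step)   # decaying fake gradients
    deq = quantize_roundtrip(grad)
    log.append({
        "step": step,
        "grad_norm": float(np.linalg.norm(grad)),
        "rel_quant_err": float(np.linalg.norm(deq - grad) / np.linalg.norm(grad)),
    })

# A flat relative-error curve (no upward drift as gradients shrink) is the
# stability signature the referee's comment asks the authors to demonstrate.
drift = log[-1]["rel_quant_err"] - log[0]["rel_quant_err"]
```

Because the quantizer rescales per step, the relative error here is scale-free by construction; the interesting question for TACO is whether its coarse (slowly updated) scale preserves this property when gradient magnitudes shift over a long run.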
Circularity Check
No significant circularity in empirical engineering contribution
Full rationale
The manuscript presents TACO as a practical FP8 compression framework for tensor-parallel training, relying on data-driven reshaping, the Adaptive Scale-Hadamard Transform, Dual-Scale Quantization, and fused kernels. All load-bearing claims rest on concrete throughput and accuracy measurements on GPT and Qwen models rather than on any closed-form derivation or prediction. No equations, fitted parameters, or self-citation chains reduce the reported results to quantities defined within the paper itself; the claims are grounded in external benchmarks rather than in internally defined quantities.