pith. machine review for the scientific record. sign in

arxiv: 2604.24088 · v1 · submitted 2026-04-27 · 💻 cs.DC · cs.AI

Recognition: unknown

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

Authors on Pith no claims yet

Pith reviewed 2026-05-08 01:31 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords tensor parallelismcommunication compressionFP8 quantizationLLM trainingadaptive scaling3D parallelismfused operator
0
0 comments X

The pith

TACO compresses intermediate tensors in tensor-parallel LLM training to FP8, delivering up to 1.87X end-to-end throughput gains with near-lossless accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the communication bottleneck in tensor-parallel training of large language models, where dense near-zero intermediate tensors create both high data volume and quantization errors. It introduces TACO, a framework that reshapes tensors adaptively, applies a scale-Hadamard transform, and uses dual-scale quantization to enable stable and accurate FP8 representation. A fused operator then reduces the cost of this process and permits overlap with communication. When combined with data and pipeline parallelism into a 3D setup, the method is tested on GPT and Qwen models and shows substantial speedups while accuracy stays close to the uncompressed baseline.

Core claim

TACO is an FP8-based framework for compressing tensor-parallel intermediate tensors that first applies data-driven reshaping together with an Adaptive Scale-Hadamard Transform for high-fidelity quantization, then uses Dual-Scale Quantization to preserve numerical stability across training steps, and finally employs a highly fused compression operator to cut memory traffic and kernel overhead while allowing overlap with communication. Integrated with existing data and pipeline parallelism methods, this produces a compression-enabled 3D-parallel training system whose experiments on GPT and Qwen models report up to 1.87X end-to-end throughput improvement at near-lossless accuracy.

What carries the argument

Data-driven reshaping combined with Adaptive Scale-Hadamard Transform and Dual-Scale Quantization inside a fused FP8 compression operator applied to tensor-parallel intermediate tensors.

Load-bearing premise

The data-driven reshaping and Dual-Scale Quantization will remain stable and will not introduce convergence issues or accuracy degradation when applied across the full range of tensor distributions encountered in long training runs on diverse model architectures.

What would settle it

Run a complete training epoch of a GPT-scale or Qwen model with TACO enabled and measure final accuracy or perplexity against an otherwise identical uncompressed baseline; significant degradation or throughput gains below 1.5X would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2604.24088 by Bing Lu, Dingwen Tao, Guangming Tan, Hairui Zhao, Man Liu, Shengkay Lyu, Shengquan Yin, Wenjing Huang, Xingchen Liu, Xingjian Tian, Zheng Wei.

Figure 1
Figure 1. Figure 1: Communication overhead and impact of quantization on view at source ↗
Figure 3
Figure 3. Figure 3: FP8 Format Specifications: E4M3 vs. E5M2. E4M3 provides view at source ↗
Figure 5
Figure 5. Figure 5: Data distribution characteristics of INT8 and FP8. view at source ↗
Figure 6
Figure 6. Figure 6: Quantization errors of INT8 and FP8. error is bounded by the unit in the last place (ULP): |𝑒 FP8 𝑖 | ≤ ULP(𝑥𝑖) = 2 ⌊log2 |𝑥𝑖 | ⌋−𝑚, 𝑥𝑖 ∈ 𝑋0, (3) where 𝑚 is the mantissa width. This implies that smaller values (lower exponent) are represented with finer precision and denser quantization points, closely matching the dense, small-magnitude peak observed in TP intermediate tensors. For large-magnitude values … view at source ↗
Figure 7
Figure 7. Figure 7: Overview of TACO. TP intermediate tensors are compressed via adaptive rescaling, the ASH transform, and dual-scale quantization, with view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of tensors before and after Hadamard-based view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of Naïve Multi-Kernel Execution vs. TACO view at source ↗
Figure 10
Figure 10. Figure 10: Throughput comparison. Left: Performance of different view at source ↗
Figure 11
Figure 11. Figure 11: Ablation of TACO components on convergence stability. view at source ↗
Figure 13
Figure 13. Figure 13: Validation and test loss for baseline, standard Hadamard, view at source ↗
Figure 15
Figure 15. Figure 15: End-to-end training throughput (TFLOPS) under different view at source ↗
Figure 17
Figure 17. Figure 17: Validation loss of GPT-6.7B under full 3D parallelism. view at source ↗
read the original abstract

Handling communication overhead in large-scale tensor-parallel training remains a critical challenge due to the dense, near-zero distributions of intermediate tensors, which exacerbate errors under frequent communication and introduce significant computational overhead during compression. To this end, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, we employ a data-driven reshaping strategy combined with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while its Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator to reduce memory traffic and kernel launch overhead, allowing efficient overlap with communication. Finally, we integrate TACO with existing state-of-the-art methods for Data and Pipeline Parallelism to develop a compression-enabled 3D-parallel training framework. Detailed experiments on GPT models and Qwen model demonstrate up to 1.87X end-to-end throughput improvement while maintaining near-lossless accuracy, validating the effectiveness and efficiency of TACO in large-scale training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes TACO, a FP8-based framework for compressing intermediate tensors during tensor-parallel LLM training. It introduces a data-driven reshaping strategy with Adaptive Scale-Hadamard Transform and Dual-Scale Quantization for high-fidelity compression, fused operators to minimize overhead and enable overlap with communication, and integration into a 3D-parallel (tensor, data, pipeline) training system. Experiments on GPT and Qwen models report up to 1.87X end-to-end throughput gains while claiming near-lossless accuracy preservation.

Significance. If the empirical results are robust, TACO addresses a key scalability bottleneck in large-scale distributed training by reducing communication volume for dense intermediate tensors without degrading convergence. The combination of quantization techniques tailored to tensor distributions and kernel-level optimizations could enable faster training or larger models on existing hardware clusters, with direct relevance to production LLM systems.

major comments (3)
  1. [Abstract] Abstract and experimental claims: the reported 'near-lossless accuracy' and 1.87X throughput are presented without quantitative details on the exact accuracy metric (e.g., validation perplexity delta, downstream task scores), baseline configurations, or statistical measures such as standard deviation across runs, which are load-bearing for validating the stability assertion.
  2. [Method] Description of Dual-Scale Quantization and data-driven reshaping: the manuscript states these ensure numerical stability and high-fidelity FP8 quantization but provides no ablation isolating their individual contributions or analysis of error accumulation over full training trajectories on diverse tensor distributions, undermining the robustness claim for long runs.
  3. [Experiments] Integration and end-to-end results: while throughput improvements are claimed when combined with data and pipeline parallelism, there is no explicit verification (e.g., loss curves or gradient statistics) that the compression does not introduce convergence issues across the full range of tensor shapes and training steps encountered in the GPT/Qwen experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the clarity of our claims and the robustness of our evaluations. We address each major comment below and will revise the manuscript to incorporate the requested details and analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental claims: the reported 'near-lossless accuracy' and 1.87X throughput are presented without quantitative details on the exact accuracy metric (e.g., validation perplexity delta, downstream task scores), baseline configurations, or statistical measures such as standard deviation across runs, which are load-bearing for validating the stability assertion.

    Authors: We agree that the abstract would benefit from more precise quantitative details. In the revised manuscript, we will update the abstract to explicitly state the accuracy metrics employed (validation perplexity and downstream task scores), the baseline configurations (standard 3D-parallel training without compression), the observed perplexity deltas, and standard deviations across repeated runs. These details are already reported in the experiments section and will be summarized concisely in the abstract. revision: yes

  2. Referee: [Method] Description of Dual-Scale Quantization and data-driven reshaping: the manuscript states these ensure numerical stability and high-fidelity FP8 quantization but provides no ablation isolating their individual contributions or analysis of error accumulation over full training trajectories on diverse tensor distributions, undermining the robustness claim for long runs.

    Authors: We will add a new ablation subsection to the method and experiments sections that isolates the contributions of the data-driven reshaping strategy, Adaptive Scale-Hadamard Transform, and Dual-Scale Quantization. This will include quantitative comparisons of quantization error and end-to-end accuracy with and without each component. We will also include analysis of error accumulation by tracking per-step quantization error and its effect on gradient norms over full training trajectories for the diverse tensor distributions encountered in the GPT and Qwen models. revision: yes

  3. Referee: [Experiments] Integration and end-to-end results: while throughput improvements are claimed when combined with data and pipeline parallelism, there is no explicit verification (e.g., loss curves or gradient statistics) that the compression does not introduce convergence issues across the full range of tensor shapes and training steps encountered in the GPT/Qwen experiments.

    Authors: We will expand the experiments section to include explicit loss curves and gradient norm statistics for the integrated 3D-parallel (tensor + data + pipeline) training runs with TACO. These plots will cover the full range of tensor shapes and training steps from the GPT and Qwen experiments, directly comparing convergence behavior against the uncompressed baseline to confirm the absence of degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical engineering contribution

full rationale

The manuscript presents TACO as a practical FP8 compression framework for tensor-parallel training, relying on data-driven reshaping, Adaptive Scale-Hadamard Transform, Dual-Scale Quantization, and fused kernels. All load-bearing claims are supported by concrete throughput and accuracy measurements on GPT and Qwen models rather than any closed-form derivation or prediction. No equations, fitted parameters, or self-citation chains reduce reported results to quantities defined within the paper itself; the work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit mathematical axioms, free parameters, or invented entities are stated. The data-driven reshaping and adaptive scales are mentioned but not quantified.

pith-pipeline@v0.9.0 · 5513 in / 1114 out tokens · 37154 ms · 2026-05-08T01:31:52.752291+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 33 canonical work pages · 9 internal anchors

  1. [1]

    Li, Ryota Tomioka, and Milan Vojnovic

    Dan Alistarh, Demjan Grubic, Jerry Z. Li, Ryota Tomioka, and Milan Vojnovic

  2. [2]

    InProceedings of the 31st International Conference on Neural Information Processing Systems(Long Beach, California, USA)(NIPS’17)

    QSGD: communication-efficient SGD via gradient quantization and en- coding. InProceedings of the 31st International Conference on Neural Information Processing Systems(Long Beach, California, USA)(NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 1707–1718

  3. [3]

    Quentin Anthony, Benjamin Michalowicz, Jacob Hatef, Lang Xu, Mustafa Abdul- jabbar, Aamir Shafi, Hari Subramoni, and Dhabaleswar Panda. 2024. Demysti- fying the Communication Characteristics for Distributed Transformer Models. arXiv:2408.10197 [cs.DC] https://arxiv.org/abs/2408.10197

  4. [4]

    Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, and Dan Alistarh. 2024. QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models. InProceedings of the 2024 Con- ference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Asso...

  5. [5]

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

  6. [6]

    Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems37 (2024), 100213–100240

  7. [7]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

  8. [8]

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. 2023. Quip: 2-bit quantization of large language models with guarantees.Advances in Neural Information Processing Systems36 (2023), 4396–4429

  9. [9]

    Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Viji Srinivasan, Wei Zhang, et al . 2020. Scalecom: Scalable sparsified gradient compression for communication-efficient distributed training.Advances in Neural Information Processing Systems33 (2020), 13551–13563

  10. [10]

    Mahoney, and Joseph E

    Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael W. Mahoney, and Joseph E. Gonzalez. 2021. ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training. arXiv:2104.14129 [cs.LG] https://arxiv.org/abs/2104.14129

  11. [11]

    Shiyang Chen, Da Zheng, Caiwen Ding, Chengying Huan, Yuede Ji, and Hang Liu. 2023. TANGO: re-thinking quantization for graph neural network training on GPUs. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(Denver, CO, USA)(SC ’23). As- sociation for Computing Machinery, New York, NY, USA, Arti...

  12. [12]

    Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, and Jingren Zhou. 2024. EE- LLM: large-scale training and inference of early-exit large language models with 3D parallelism. InProceedings of the 41st International Conference on Machine Learning (ICML’24). JMLR.org, Vienna, Austria, Article 277, 27 pages

  13. [13]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Se- bastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research24, 240 (2023), 1–113

  14. [14]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems36 (2023), 10088–10115

  15. [15]

    Harry Dong, Tyler Johnson, Minsik Cho, and Emad Soroush. 2024. Towards Low- bit Communication for Tensor Parallel LLM Inference. arXiv:2411.07942 [cs.AI] https://arxiv.org/abs/2411.07942

  16. [16]

    arXiv preprint arXiv:2509.23202 , year=

    Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, El- dar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2026. Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization. arXiv:2509.23202 [cs.LG] https://arxiv.org/abs/2509.23202

  17. [17]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG] https://arxiv.org/abs/2210.17323

  18. [18]

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia

  19. [19]

    arXiv:2211.15841 [cs.LG]https://arxiv.org/abs/2211.15841

    MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. arXiv:2211.15841 [cs.LG] https://arxiv.org/abs/2211.15841

  20. [20]

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs.CL] https://arxiv.org/abs/2101.00027

  21. [21]

    Aaron Grattafiori et al . 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

  22. [22]

    Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kun Yuan, and Binhang Yuan

  23. [23]

    TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

    TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network. arXiv:2506.01352 [cs.LG] https://arxiv.org/abs/2506.01352

  24. [24]

    Le, Yonghui Wu, and Zhifeng Chen

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. InProceedings of the 33rd International Conference on Neural Informa- tion Processing Systems. Curran Associates Inc., ...

  25. [25]

    Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, et al. 2024. Sdp4bit: Toward 4-bit communi- cation quantization in sharded data parallelism for LLM training.Advances in Neural Information Processing Systems37 (2024), 8734–8759

  26. [26]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

  27. [27]

    Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. 2022. Fp8 quantization: The power of the exponent.Advances in Neural Information Processing Systems35 (2022), 14651–14662

  28. [28]

    Itay Lamprecht, Asaf Karnieli, Yair Hanani, Niv Giladi, and Daniel Soudry. 2025. Tensor-Parallelism with Partially Synchronized Activations. NeurIPS 2025 Poster. https://openreview.net/forum?id=fyeSq3m8CY Accepted as NeurIPS 2025 Poster

  29. [29]

    Qingyuan Li, Bo Zhang, Liang Ye, Yifan Zhang, Wei Wu, Yerui Sun, Lin Ma, and Yuchen Xie. 2024. Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference. arXiv:2412.04964 [cs.AI] https://arxiv.org/abs/2412.04964

  30. [30]

    Shigang Li and Torsten Hoefler. 2022. Near-optimal sparse allreduce for dis- tributed deep learning. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming(Seoul, Republic of Korea) (PPoPP ’22). Association for Computing Machinery, New York, NY, USA, 135–149. doi:10.1145/3503221.3508399

  31. [31]

    Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2023. Sequence Parallelism: Long Sequence Training from System Perspective. InPro- ceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Toronto, Canada, 2391–2404

  32. [32]

    Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. 2020. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. arXiv:1712.01887 [cs.CV] https://arxiv.org/abs/1712.01887

  33. [33]

    Xingchen Liu, Haoran Kong, Hairui Zhao, Shengkai Lyu, Zheng Wei, Man Liu, Xingjian Tian, Liyang Zhao, Zhuohan Chen, Fakang Wang, Zizhong Chen, Zhan Wang, Guangming Tan, and Dingwen Tao. 2026. COCCL: A Collective Commu- nication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training. InProceedings of the 3...

  34. [34]

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv:2305.17888 [cs.CL] https://arxiv.org/abs/2305.17888

  35. [35]

    Ilia Markov, Adrian Vladu, Qi Guo, and Dan Alistarh. 2023. Quantized distributed training of large models with convergence guarantees. InProceedings of the 40th International Conference on Machine Learning(Honolulu, Hawaii, USA)(ICML’23). JMLR.org, Honolulu, Hawaii, USA, Article 1001, 25 pages

  36. [36]

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. arXiv:1710.03740 [cs.AI] https://arxiv.org/abs/1710.03740

  37. [37]

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. 2022. FP8 Formats for Deep Learning. arXiv:2209.05433 [cs.LG] https://arxiv.org/abs/2209.05433

  38. [38]

    Devanur, Gregory R

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. InProceedings of the 27th ACM Symposium on Operating Systems Principles(Huntsville, Ontario, Canada)(SOSP ’19). Association for Computing Machiner...

  39. [39]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using megatron- LM. InProceedings of the International Conference for High ...

  40. [40]

    Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. 2023. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. HPDC ’26, July 13–16, 2026, Cleveland, OH, USA Liu et al. arXiv:2310.06786 [cs.AI] https://arxiv.org/abs/2310.06786

  41. [41]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-pe...

  42. [42]

    Igor Polyakov, Alexey Dukhanov, and Egor Spirin. 2025. TAGC: Optimizing Gradient Communication in Distributed Transformer Training. InProceedings of the 5th Workshop on Machine Learning and Systems(World Trade Center, Rotterdam, Netherlands)(EuroMLSys ’25). Association for Computing Machinery, New York, NY, USA, 254–260. doi:10.1145/3721146.3721946

  43. [43]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054 [cs.LG] https://arxiv.org/abs/1910.02054

  44. [44]

    M. I. Rudakov, A. N. Beznosikov, Ya. A. Kholodov, and A. V. Gasnikov. 2023. Activations and Gradients Compression for Model-Parallel Training.Doklady Mathematics108, S2 (Dec. 2023), S272–S281. doi:10.1134/s1064562423701314

  45. [45]

    2025.Quantization Methods for Matrix Multiplication and Ef- ficient Transformers

    Semyon Savkin. 2025.Quantization Methods for Matrix Multiplication and Ef- ficient Transformers. Ph. D. Dissertation. MASSACHUSETTS INSTITUTE OF TECHNOLOGY

  46. [46]

    Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, and Mengni Wang. 2024. Efficient post-training quantization with fp8 formats.Proceedings of Machine Learning and Systems6 (2024), 483–498

  47. [47]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] https: //arxiv.org/abs/1909.08053

  48. [48]

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatr...

  49. [49]

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. 2025. FlatQuant: Flatness Matters for LLM Quantization. arXiv:2410.09426 [cs.CL] https://arxiv.org/abs/2410.09426

  50. [50]

    Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm. github.io/blog/qwen2.5/

  51. [51]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guil- laume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] https://arxiv.org/abs/2302.13971

  52. [52]

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024. QuIP#: even better LLM quantization with hadamard incoherence and lattice codebooks. InProceedings of the 41st International Conference on Machine Learning(Vienna, Austria)(ICML’24). JMLR.org, Vienna, Austria, Article 1987, 27 pages

  53. [53]

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. 2019. PowerSGD: prac- tical low-rank gradient compression for distributed optimization. InProceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 1278, 10 pages

  54. [54]

    Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, and Cheng Li. 2024. Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution. arXiv:2411.15871 [cs.DC] https://arxiv.org/abs/ 2411.15871

  55. [55]

    Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christo- pher Ré, and Ce Zhang. 2022. Fine-tuning language models over slow networks using activation quantization with guarantees.Advances in Neural Information Processing Systems35 (2022), 19215–19230

  56. [56]

    BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, et al

  57. [57]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100 [cs.CL] https://arxiv.org/abs/2211.05100

  58. [58]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han

  59. [59]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438 [cs.CL] https://arxiv.org/abs/2211.10438

  60. [60]

    Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha Gulhane, Aamir Shafi, Hari Subramoni, and Dhabaleswar K DK Panda. 2024. Accelerating large language model training with hybrid gpu-based compression. In2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, Philadelphia, PA, USA, 196–205

  61. [61]

    Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao, Shiyou Qian, Dingyu Yang, Jian Cao, Jinghua Tang, Yinghao Yu, Chenzhi Liao, Kangjin Wang, and Liping Zhang. 2025. EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training. arXiv:2511.10333 [cs.LG] https://arxiv.org/abs/2511. 10333

  62. [62]

    Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. 2023. Eval- uation and optimization of gradient compression for distributed deep learning. In2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS). IEEE, Hong Kong, China, 361–371

  63. [63]

    Hao Zheng, Peng Liang, Yu Tang, Yanqi Shi, Linbo Qiao, and Dongsheng Li. 2024. 3D Parallelism for transformers via Integer programming. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Seoul, Korea, 6440–6444