A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
Pith reviewed 2026-05-14 01:50 UTC · model grok-4.3
The pith
Switch-centric architecture reduces All-Reduce latency in LLM inference by letting the switch directly access attached accelerator memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCIN is the first switch-centric in-network architecture for multi-accelerator shared-memory networks. It introduces an in-switch accelerator capable of directly accessing memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLink SHARP (NVLS) by eliminating redundant data movement. Moreover, SCIN enables in-network quantization (INQ) for All-Reduce, reducing its precision to 8 bits and nearly doubling effective bandwidth with negligible accuracy loss.
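The abstract does not specify the INQ scheme. A minimal sketch of symmetric per-tensor int8 quantization shows why an 8-bit payload halves wire bytes relative to fp16; the scaling rule and function names here are assumptions for illustration, not the paper's design:

```python
import numpy as np

def quantize_int8(x_fp16):
    # Symmetric per-tensor quantization (illustrative; SCIN's actual INQ
    # operator is not described in the abstract).
    scale = max(float(np.abs(x_fp16).max()), 1e-8) / 127.0
    q = np.clip(np.round(x_fp16.astype(np.float32) / scale), -127, 127)
    return q.astype(np.int8), scale

def dequantize_int8(q, scale):
    # Reconstruct an approximation of the original tensor after reduction.
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float16)
q, s = quantize_int8(x)
assert q.nbytes * 2 == x.nbytes          # int8 is half the fp16 wire bytes
err = np.abs(dequantize_int8(q, s) - x.astype(np.float32)).max()
assert err <= s                          # rounding error bounded by one step
```

Whether this per-element error is "negligible" end to end is exactly the accuracy question the referee raises below; it depends on the model and the actual quantization scheme.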
What carries the argument
The in-switch accelerator (ISA) that directly accesses memory regions in attached accelerators, supported by a co-designed communication fabric with negligible protocol overhead.
If this is right
- All-Reduce completes without the extra transfer of reduced data back to the initiating GPU.
- In-network quantization becomes feasible for All-Reduce, allowing 8-bit precision and nearly doubled effective bandwidth.
- All-Reduce accelerates up to 8.7x for small messages and 3.8x for large messages in an 8-GPU system.
- End-to-end LLM inference gains up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.
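These bullets are consistent with a back-of-envelope byte count. The toy traffic model below is my own assumption, not the paper's accounting: NVLS pays an extra round-trip through the initiating GPU, SCIN reduces and broadcasts in the switch, and 8-bit INQ halves wire bytes. The resulting ratios are illustrative only and do not reproduce the paper's 8.7x/3.8x figures, which also include latency effects.

```python
def nvls_style_bytes(m):
    # Crude model of the NVLS limitation described above: reduce in the
    # switch, return the result to the initiating GPU, then send it back out.
    return 3 * m

def scin_style_bytes(m, bits=16):
    # Switch reduces and broadcasts directly; INQ at 8 bits halves the
    # bytes on the wire relative to 16-bit payloads.
    return 2 * m * bits // 16

m = 64 * 2**20  # 64 MiB message (illustrative size, not from the paper)
print(nvls_style_bytes(m) / scin_style_bytes(m, bits=8))  # -> 3.0
```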
Where Pith is reading between the lines
- The same direct-access mechanism could support other collectives such as All-Gather if the ISA is extended.
- Power and energy savings in data-center racks would follow from the reduced total bytes moved across the network.
- Commercial switch ASICs could incorporate similar ISAs if the protocol overhead stays low at larger cluster sizes.
- Accuracy impact of 8-bit INQ may differ across model families, requiring per-model validation beyond LLaMA-2.
Load-bearing premise
The in-switch accelerator can directly access memory regions in attached accelerators with negligible protocol overhead, and the design scales to 8-GPU systems without hidden hardware or synchronization costs.
What would settle it
Measure All-Reduce latency on an 8-GPU hardware prototype running SCIN versus NVLS, and check end-to-end inference accuracy on LLaMA-2 after forcing All-Reduce to 8-bit INQ.
Original abstract
In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for multi-accelerator shared-memory networks, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of directly accessing the memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLS by eliminating redundant data movement. Moreover, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a multi-FPGA prototype of SCIN to validate its feasibility and effectiveness. Simulation results for an 8-GPU system show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SCIN, a switch-centric in-network architecture for accelerating All-Reduce operations in LLM inference on shared-memory multi-accelerator networks. It introduces an In-Switch Accelerator (ISA) that directly accesses memory regions of attached accelerators via a co-designed fabric, addressing two limitations of NVLink SHARP (NVLS): redundant data movement from GPU-triggered reductions and inability to offload non-memory-semantic operators such as the proposed In-Network Quantization (INQ). SCIN enables 8-bit INQ for All-Reduce with claimed negligible accuracy loss, and evaluations include a multi-FPGA prototype plus simulations for an 8-GPU system reporting up to 8.7x All-Reduce latency reduction for small messages, 3.8x for large messages, 1.74x TTFT speedup, and 1.34x TPOT speedup on LLaMA-2 models.
Significance. If the assumptions on negligible protocol overhead for direct memory access hold, SCIN could meaningfully advance in-network computing for distributed LLM inference by cutting communication latency and nearly doubling effective bandwidth via quantization. The multi-FPGA prototype and 8-GPU simulations provide concrete evidence of feasibility beyond pure simulation, strengthening the architectural contribution relative to prior switch-offload work.
Major comments (3)
- [§4] §4 (prototype description): the claim that the co-designed communication fabric enables direct ISA access to accelerator memory regions 'with negligible protocol overhead' is load-bearing for both the latency reduction and INQ offload arguments, yet no cycle-accurate accounting of coherence traffic, address translation, or barrier synchronization costs is supplied; without these measurements the elimination of redundant NVLS round-trips cannot be verified.
- [§5] §5 (simulation results): the reported 8.7x and 3.8x All-Reduce speedups for an 8-GPU system lack error bars, detailed methodology, and full validation data against NVLS baselines, making it impossible to assess whether the gains are robust or sensitive to the unverified direct-access assumption.
- [§3.3] §3.3 (INQ operator): the assertion that 8-bit INQ yields 'negligible accuracy loss' while nearly doubling bandwidth is central to the high-bandwidth claim, but the manuscript provides no quantitative accuracy metrics, quantization scheme details, or per-layer error analysis on the LLaMA-2 models used in the TTFT/TPOT experiments.
Minor comments (2)
- [Abstract] The abstract states speedups 'up to 1.74x TTFT' and '1.34x TPOT' but does not specify the exact baseline (e.g., NVLS configuration, number of GPUs, or message sizes) against which these end-to-end gains are measured.
- [§3] Notation for the ISA memory-access protocol and INQ bit-width reduction should be introduced with a small diagram or pseudocode in §3 to improve readability for readers unfamiliar with NVLS internals.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will revise the manuscript to incorporate the requested details and clarifications.
Point-by-point responses
Referee: [§4] §4 (prototype description): the claim that the co-designed communication fabric enables direct ISA access to accelerator memory regions 'with negligible protocol overhead' is load-bearing for both the latency reduction and INQ offload arguments, yet no cycle-accurate accounting of coherence traffic, address translation, or barrier synchronization costs is supplied; without these measurements the elimination of redundant NVLS round-trips cannot be verified.
Authors: We agree that a cycle-accurate accounting of these overhead components would strengthen the presentation. The co-designed fabric uses a lightweight custom protocol with pre-registered memory regions and offset-based addressing to avoid full coherence traffic and complex translation, while synchronization relies on a dedicated in-switch barrier. We will expand §4 with detailed measurements from the multi-FPGA prototype, including breakdowns of coherence, translation, and barrier costs, to verify the overhead is negligible and that redundant NVLS round-trips are eliminated. revision: yes
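The pre-registered-region mechanism the authors describe can be sketched as a table lookup plus an add and a bounds check on the data path; the class, field names, and addresses below are illustrative assumptions, not SCIN's actual protocol:

```python
class RegionTable:
    """Toy model of pre-registered memory regions with offset-based
    addressing, as described in the rebuttal."""

    def __init__(self):
        self._regions = {}  # region_id -> (base_addr, length)

    def register(self, region_id, base_addr, length):
        # Registration happens once, before any collective; the switch then
        # needs no page-table walks or coherence traffic on the data path.
        self._regions[region_id] = (base_addr, length)

    def resolve(self, region_id, offset):
        # Data-path translation: one lookup, one add, one bounds check.
        base, length = self._regions[region_id]
        if not 0 <= offset < length:
            raise ValueError("offset outside registered region")
        return base + offset

table = RegionTable()
table.register(region_id=7, base_addr=0x4000_0000, length=1 << 20)
print(hex(table.resolve(7, 0x100)))  # -> 0x40000100
```

The promised §4 measurements would need to show that this cheap translation, plus the in-switch barrier, really keeps the data path free of coherence and translation stalls.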
Referee: [§5] §5 (simulation results): the reported 8.7x and 3.8x All-Reduce speedups for an 8-GPU system lack error bars, detailed methodology, and full validation data against NVLS baselines, making it impossible to assess whether the gains are robust or sensitive to the unverified direct-access assumption.
Authors: We acknowledge that additional statistical and methodological details are needed for full assessment. We will revise §5 to include error bars on the speedup results, a complete description of the simulation methodology and parameters for the 8-GPU system, and expanded validation data with direct head-to-head comparisons against NVLS to demonstrate robustness. revision: yes
Referee: [§3.3] §3.3 (INQ operator): the assertion that 8-bit INQ yields 'negligible accuracy loss' while nearly doubling bandwidth is central to the high-bandwidth claim, but the manuscript provides no quantitative accuracy metrics, quantization scheme details, or per-layer error analysis on the LLaMA-2 models used in the TTFT/TPOT experiments.
Authors: We agree that quantitative support for the accuracy claim is required. We will update §3.3 to include the specific quantization scheme details, quantitative accuracy metrics on the LLaMA-2 models from the TTFT/TPOT experiments, and per-layer error analysis to substantiate the negligible loss. revision: yes
Circularity Check
No circularity: claims rest on architectural description and empirical simulation results
Full rationale
The paper proposes a new switch-centric architecture (SCIN) with an in-switch accelerator (ISA) and co-designed fabric, validated via a multi-FPGA prototype and 8-GPU simulations. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. Performance numbers (e.g., 8.7x All-Reduce speedup, 1.74x TTFT) are presented as direct simulation outputs rather than results derived by construction from inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The evidence chain rests on comparison against an external baseline (NVLS) rather than on self-referential derivation.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The shared-memory network topology permits direct memory access from the switch to attached accelerators with negligible protocol overhead.
Invented entities (2)
- In-Switch Accelerator (ISA): no independent evidence
- In-Network Quantization (INQ): no independent evidence