pith. machine review for the scientific record.

arxiv: 2603.28239 · v3 · submitted 2026-03-30 · 💻 cs.AR

Recognition: no theorem link

A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 01:50 UTC · model grok-4.3

classification 💻 cs.AR
keywords in-network computing · All-Reduce · LLM inference · switch-centric architecture · in-network quantization · shared-memory network · multi-GPU communication · collective operations

The pith

Switch-centric architecture reduces All-Reduce latency in LLM inference by letting the switch directly access attached accelerator memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing in-network approaches like NVLS still force results from switch reductions to travel back to the initiating GPU, creating redundant transfers that slow collective operations. SCIN instead places an in-switch accelerator that reads and writes directly into the memory of connected accelerators through a co-designed fabric. This change removes the return trip for reduced data and simultaneously supports in-network quantization that drops All-Reduce precision to 8 bits. Simulations on 8-GPU systems report up to 8.7x faster All-Reduce for small messages and measurable end-to-end gains in time-to-first-token and time-per-output-token for LLaMA-2 models.
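
A minimal alpha-beta cost sketch of the two data-movement patterns, in Python. The hop counts follow the description above (accelerator-centric: operands up, result back to the initiating GPU, then redistribution; switch-centric: operands up, reduced result written directly back); the latency and bandwidth constants are invented for illustration, and none of this reproduces the paper's simulator.

    # Crude cost model: time = hops * alpha + bytes_on_wire * beta.
    # alpha (per-hop latency) and beta (seconds per byte) are assumed values.
    def allreduce_time(msg_bytes, alpha, beta, scheme, wire_bits=16):
        bytes_on_wire = msg_bytes * wire_bits / 16   # FP16 is the baseline width
        if scheme == "accelerator_centric":
            hops = 3   # push operands, return to initiating GPU, redistribute
        elif scheme == "switch_centric":
            hops = 2   # push operands, switch writes results directly to all
        else:
            raise ValueError(scheme)
        return hops * alpha + bytes_on_wire * beta

    alpha, beta = 1e-6, 1 / 400e9        # hypothetical: 1 us/hop, 400 GB/s links
    small = 4 * 1024                     # latency-bound message
    print(allreduce_time(small, alpha, beta, "accelerator_centric"))
    print(allreduce_time(small, alpha, beta, "switch_centric", wire_bits=8))

In this toy model, removing a hop buys at most 1.5x on small messages; the paper's 8.7x figure comes from its own cycle-level simulation, not from arithmetic like this.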

Core claim

SCIN is the first switch-centric in-network architecture for multi-accelerator shared-memory networks. It introduces an in-switch accelerator capable of directly accessing memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLS by eliminating redundant data movement. Moreover, SCIN enables in-network quantization (INQ) for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss.
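
A sketch of what the INQ step plausibly computes, based on the paper's description: block-wise quantization with one scale factor per block (Figure 7), using each block's maximum absolute value as the clipping range, and a single quantization step rather than the N-1 of ring-based schemes. The block size of 128 and the NumPy rendering are our assumptions, not the paper's hardware datapath.

    import numpy as np

    def quantize_blockwise(x, block=128):
        # One scale per block, max-abs clipping; length must divide by block.
        blocks = x.astype(np.float32).reshape(-1, block)
        scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
        scale[scale == 0] = 1.0                      # guard all-zero blocks
        q = np.clip(np.round(blocks / scale), -127, 127).astype(np.int8)
        return q, scale.astype(np.float16)           # int8 payload + FP16 scales

    def dequantize_blockwise(q, scale):
        return (q.astype(np.float32) * scale).reshape(-1)

    x = np.random.randn(4096).astype(np.float16)     # stand-in activation shard
    q, s = quantize_blockwise(x)
    wire = q.nbytes + s.nbytes                       # 4096 + 64 bytes
    # vs. x.nbytes == 8192 for FP16: roughly half the traffic, which is the
    # "nearly doubling bandwidth" in the claim above.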

What carries the argument

The in-switch accelerator (ISA) that directly accesses memory regions in attached accelerators, supported by a co-designed communication fabric with negligible protocol overhead.
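
The synchronization handshake this depends on (accelerators Arrive at a switch-side counter, the ISA Waits for all participants, executes, then releases per-accelerator flags; see Figures 4 and 5 below) can be mimicked in software. A toy threading analogue, with names borrowed from the paper's pseudocode and all concurrency details ours:

    import threading

    class SwitchBarrier:
        """Toy stand-in for SCIN's counter-and-flag synchronization."""
        def __init__(self, participants):
            self.participants = participants
            self.counter = 0
            self.cv = threading.Condition()
            self.flags = [threading.Event() for _ in range(participants)]

        def arrive(self, rank):
            with self.cv:                     # Arrive(): atomic increment
                self.counter += 1
                self.cv.notify_all()
            self.flags[rank].wait()           # Wait(): poll the local flag

        def isa_run(self, execute):
            with self.cv:                     # ISA waits for all participants
                self.cv.wait_for(lambda: self.counter == self.participants)
            execute()                         # in-switch All-Reduce
            for flag in self.flags:           # release after write responses
                flag.set()

In hardware the counter lives in the switch and the "flags" are writes into each accelerator's memory, so the release costs one network hop rather than a thread wakeup.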

If this is right

  • All-Reduce completes without the extra transfer of reduced data back to the initiating GPU.
  • In-network quantization becomes feasible for All-Reduce, allowing 8-bit precision and nearly doubled effective bandwidth.
  • All-Reduce accelerates up to 8.7x for small messages and 3.8x for large messages in an 8-GPU system.
  • End-to-end LLM inference gains up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same direct-access mechanism could support other collectives such as All-Gather if the ISA is extended.
  • Power and energy savings in data-center racks would follow from the reduced total bytes moved across the network.
  • Commercial switch ASICs could incorporate similar ISAs if the protocol overhead stays low at larger cluster sizes.
  • Accuracy impact of 8-bit INQ may differ across model families, requiring per-model validation beyond LLaMA-2.

Load-bearing premise

The in-switch accelerator can directly access memory regions in attached accelerators with negligible protocol overhead, and the design scales to 8-GPU systems without hidden hardware or synchronization costs.

What would settle it

Measure All-Reduce latency on an 8-GPU hardware prototype running SCIN versus NVLS, and check end-to-end inference accuracy on LLaMA-2 after forcing All-Reduce to 8-bit INQ.
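
The accuracy half of that test can be rehearsed offline before any hardware exists: emulate the 8-bit All-Reduce on captured partial sums and compare against the FP32 reduction. A sketch reusing quantize_blockwise/dequantize_blockwise from the earlier snippet; the Gaussian stand-in data is an assumption, and real LLaMA-2 activations would replace it.

    import numpy as np

    rng = np.random.default_rng(0)
    parts = [rng.standard_normal(4096).astype(np.float16) for _ in range(8)]

    exact = np.sum([p.astype(np.float32) for p in parts], axis=0)
    approx = np.zeros_like(exact)
    for p in parts:
        q, s = quantize_blockwise(p)          # one quantization step per input
        approx += dequantize_blockwise(q, s)

    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    # Sweep rel_err per layer on real activations; end-to-end task metrics on
    # LLaMA-2 remain the decisive accuracy check, and the latency side still
    # requires the prototype measurement described above.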

Figures

Figures reproduced from arXiv: 2603.28239 by Aojie Jiang, Juntao Liu, Kang Zhu, Li Du, Yuan Du, Zhengxu Su, Zhiheng Zhang.

Figure 1. Two architectures for in-network computing: accelerator-centric architecture in NVLS (left) and the proposed switch-centric architecture in SCIN (right). view at source ↗

Figure 3. Communication and computation time breakdown. view at source ↗

Figure 4. Switch microarchitecture for SCIN, with the accompanying accelerator-side and switch-side pseudocode:

    Accelerator_TP_Inference() {
        Attention();   // perform attention block
        Arrive();      // atomically increment the switch's synchronization counter
        Wait();        // poll the local flag until it becomes 1
        MLP();         // perform MLP block
    }

    // one network hop
    Switch_All_Reduce() {
        Wait();        // poll the local counter until it equals the number of participants
        Execute();     // execute the All-Reduce operation
        …
    }

view at source ↗

Figure 5. Synchronization mechanism in SCIN: when executing the All-Reduce operation, the ISA first polls the local synchronization counter. Once the counter reaches the number of participants, the ISA knows that all accelerators have arrived at the synchronization point and can safely initiate the All-Reduce operation. After the All-Reduce completes and all write responses are received, the reduced… view at source ↗

Figure 7. Block-wise quantization with one scale factor for each block. view at source ↗

Figure 8. Photograph of the SCIN prototype. view at source ↗

Figure 9. All-Reduce latency: FPGA-based prototype vs. … view at source ↗

Figure 10. Simulated All-Reduce performance with 16 waves. view at source ↗

Figure 11. Simulation results with and without wave regulation. view at source ↗

Figure 12. TTFT and TPOT speedup of SCIN over the software ring All-Reduce algorithm for LLaMA-2 models with TP = 8. view at source ↗
original abstract

In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for multi-accelerator shared-memory networks, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of directly accessing the memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLS by eliminating redundant data movement. Moreover, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a multi-FPGA prototype of SCIN to validate its feasibility and effectiveness. Simulation results for an 8-GPU system show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SCIN, a switch-centric in-network architecture for accelerating All-Reduce operations in LLM inference on shared-memory multi-accelerator networks. It introduces an In-Switch Accelerator (ISA) that directly accesses memory regions of attached accelerators via a co-designed fabric, addressing two limitations of NVLink SHARP (NVLS): redundant data movement from GPU-triggered reductions and inability to offload non-memory-semantic operators such as the proposed In-Network Quantization (INQ). SCIN enables 8-bit INQ for All-Reduce with claimed negligible accuracy loss, and evaluations include a multi-FPGA prototype plus simulations for an 8-GPU system reporting up to 8.7x All-Reduce latency reduction for small messages, 3.8x for large messages, 1.74x TTFT speedup, and 1.34x TPOT speedup on LLaMA-2 models.

Significance. If the assumptions on negligible protocol overhead for direct memory access hold, SCIN could meaningfully advance in-network computing for distributed LLM inference by cutting communication latency and nearly doubling effective bandwidth via quantization. The multi-FPGA prototype and 8-GPU simulations provide concrete evidence of feasibility beyond pure simulation, strengthening the architectural contribution relative to prior switch-offload work.

major comments (3)
  1. [§4] Prototype description: the claim that the co-designed communication fabric enables direct ISA access to accelerator memory regions 'with negligible protocol overhead' is load-bearing for both the latency-reduction and INQ-offload arguments, yet no cycle-accurate accounting of coherence traffic, address translation, or barrier synchronization costs is supplied; without these measurements the elimination of redundant NVLS round trips cannot be verified.
  2. [§5] Simulation results: the reported 8.7x and 3.8x All-Reduce speedups for an 8-GPU system lack error bars, detailed methodology, and full validation data against NVLS baselines, making it impossible to assess whether the gains are robust or sensitive to the unverified direct-access assumption.
  3. [§3.3] INQ operator: the assertion that 8-bit INQ yields 'negligible accuracy loss' while nearly doubling bandwidth is central to the high-bandwidth claim, but the manuscript provides no quantitative accuracy metrics, quantization-scheme details, or per-layer error analysis on the LLaMA-2 models used in the TTFT/TPOT experiments.
minor comments (2)
  1. [Abstract] The abstract states speedups 'up to 1.74x TTFT' and '1.34x TPOT' but does not specify the exact baseline (e.g., NVLS configuration, number of GPUs, or message sizes) against which these end-to-end gains are measured.
  2. [§3] Notation for the ISA memory-access protocol and INQ bit-width reduction should be introduced with a small diagram or pseudocode in §3 to improve readability for readers unfamiliar with NVLS internals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will revise the manuscript to incorporate the requested details and clarifications.

point-by-point responses
  1. Referee: [§4] Prototype description: the claim that the co-designed communication fabric enables direct ISA access to accelerator memory regions 'with negligible protocol overhead' is load-bearing for both the latency-reduction and INQ-offload arguments, yet no cycle-accurate accounting of coherence traffic, address translation, or barrier synchronization costs is supplied; without these measurements the elimination of redundant NVLS round trips cannot be verified.

    Authors: We agree that a cycle-accurate accounting of these overhead components would strengthen the presentation. The co-designed fabric uses a lightweight custom protocol with pre-registered memory regions and offset-based addressing to avoid full coherence traffic and complex translation, while synchronization relies on a dedicated in-switch barrier (a toy sketch of this addressing scheme follows these responses). We will expand §4 with detailed measurements from the multi-FPGA prototype, including breakdowns of coherence, translation, and barrier costs, to verify the overhead is negligible and that redundant NVLS round trips are eliminated. revision: yes

  2. Referee: [§5] Simulation results: the reported 8.7x and 3.8x All-Reduce speedups for an 8-GPU system lack error bars, detailed methodology, and full validation data against NVLS baselines, making it impossible to assess whether the gains are robust or sensitive to the unverified direct-access assumption.

    Authors: We acknowledge that additional statistical and methodological details are needed for full assessment. We will revise §5 to include error bars on the speedup results, a complete description of the simulation methodology and parameters for the 8-GPU system, and expanded validation data with direct head-to-head comparisons against NVLS to demonstrate robustness. revision: yes

  3. Referee: [§3.3] INQ operator: the assertion that 8-bit INQ yields 'negligible accuracy loss' while nearly doubling bandwidth is central to the high-bandwidth claim, but the manuscript provides no quantitative accuracy metrics, quantization-scheme details, or per-layer error analysis on the LLaMA-2 models used in the TTFT/TPOT experiments.

    Authors: We agree that quantitative support for the accuracy claim is required. We will update §3.3 to include the specific quantization scheme details, quantitative accuracy metrics on the LLaMA-2 models from the TTFT/TPOT experiments, and per-layer error analysis to substantiate the negligible loss. revision: yes
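
Response 1's "pre-registered memory regions and offset-based addressing" is the crux of the negligible-overhead claim. A toy rendering of that addressing scheme; the table layout and field names are ours, not the fabric's actual encoding:

    class RegionTable:
        """Pre-registered (base, length) entries; requests carry only
        (region_id, offset), so no page-table walk on the data path."""
        def __init__(self):
            self.regions = {}

        def register(self, region_id, base_addr, length):
            # done once at setup, off the critical path
            self.regions[region_id] = (base_addr, length)

        def resolve(self, region_id, offset):
            base_addr, length = self.regions[region_id]
            if offset >= length:
                raise ValueError("offset outside registered region")
            return base_addr + offset          # one add per access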

Circularity Check

0 steps flagged

No circularity: claims rest on architectural description and empirical simulation results

full rationale

The paper proposes a new switch-centric architecture (SCIN) with an in-switch accelerator (ISA) and co-designed fabric, validated via multi-FPGA prototype and 8-GPU simulations. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. Performance numbers (e.g., 8.7x All-Reduce speedup, 1.74x TTFT) are presented as direct simulation outputs rather than results derived by construction from inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The chain of evidence rests on external benchmarks rather than on the paper's own derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims depend on the feasibility of a new in-switch accelerator and co-designed fabric whose overhead and scaling behavior are postulated rather than derived from prior independent evidence.

axioms (1)
  • domain assumption The shared-memory network topology permits direct memory access from the switch to attached accelerators with negligible protocol overhead.
    Invoked to justify the low-latency access and INQ capabilities of the proposed ISA.
invented entities (2)
  • In-Switch Accelerator (ISA) no independent evidence
    purpose: Perform in-network processing by directly accessing accelerator memory regions.
    New hardware component introduced to overcome NVLS limitations.
  • In-Network Quantization (INQ) no independent evidence
    purpose: Reduce All-Reduce precision to 8 bits inside the switch.
    New operator enabled by the switch-centric design.

pith-pipeline@v0.9.0 · 5669 in / 1276 out tokens · 58961 ms · 2026-05-14T01:50:07.655516+00:00 · methodology

discussion (0)

