pith. machine review for the scientific record.

arxiv: 2602.22457 · v3 · submitted 2026-02-25 · 💻 cs.DC · cs.ET

Recognition: no theorem link

CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:56 UTC · model grok-4.3

classification 💻 cs.DC cs.ET
keywords CXL · GPU collectives · memory pooling · RDMA · LLM training · collective communication · InfiniBand · cross-node GPU

The pith

CCCL uses CXL shared memory pooling to deliver faster node-spanning GPU collectives than RDMA over 200 Gbps InfiniBand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CCCL, a collective communication library built on CXL memory pooling that runs cross-node GPU collectives such as AllGather, Broadcast, Gather, and Scatter without traditional RDMA networking. It focuses on the synchronization, data-interleaving, and parallelization issues that arise when the shared CXL pool carries these operations. On multi-node hardware with a TITAN-II CXL switch and Micron CZ120 cards, CCCL shows average speedups of 1.34× for AllGather, 1.84× for Broadcast, 1.94× for Gather, and 1.04× for Scatter versus the RDMA baseline. In an LLM training case, it achieves a 1.11× overall speedup while cutting hardware costs by 2.75×. The work aims to show that memory-centric CXL designs can improve performance and reduce over-provisioning in distributed GPU systems.

Core claim

CCCL enables efficient node-spanning GPU collectives by leveraging the CXL shared memory pool for synchronization, data interleaving, and parallelized communication, achieving measured performance gains over RDMA-based InfiniBand implementations in standard collective benchmarks and in an LLM training workload.

What carries the argument

The CCCL library, which implements custom mechanisms for synchronization, data interleaving, and communication parallelization on top of CXL memory pooling to replace RDMA for cross-node GPU collectives.
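
The page does not reproduce CCCL's code, but the text accompanying Figure 4 below describes the primitive's shape: the pool is mapped and registered in the CUDA address space, and data moves with cudaMemcpy using the cudaMemcpyDeviceToHost flag. A minimal, hedged sketch of a one-shot Broadcast staged through such a pool follows; the DAX mapping path, the flag-based synchronization, and all names are illustrative assumptions rather than CCCL's API, and plain stores are assumed visible across hosts (the coherence premise examined below).

    // Hedged sketch: a one-shot Broadcast staged through a CXL shared
    // memory pool, following the pipeline the page quotes around Figure 4.
    // The /dev/dax mapping, the flag-based synchronization, and every name
    // here are illustrative assumptions, not CCCL's actual API.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    struct CxlPool {
        std::atomic<uint64_t>* seq;  // publish flag, resident in the pool
        char* payload;               // staging buffer, resident in the pool
    };

    // Map the pool (assumed to be exposed as a DAX device) and register it
    // with CUDA so cudaMemcpy can target it directly, as the paper describes.
    CxlPool cxl_map(const char* dax_path, size_t bytes) {
        int fd = open(dax_path, O_RDWR);
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        cudaHostRegister(p, bytes, cudaHostRegisterDefault);
        CxlPool pool;
        pool.seq = reinterpret_cast<std::atomic<uint64_t>*>(p);
        pool.payload = static_cast<char*>(p) + 4096;  // flag page, then data
        return pool;
    }

    // Root stages its GPU buffer into the pool and publishes a round number;
    // every other node polls the flag, then copies the payload to its GPU.
    // Assumes stores to the pool are coherently visible to all hosts.
    void broadcast(CxlPool& pool, void* gpu_buf, size_t bytes,
                   bool is_root, uint64_t round) {
        if (is_root) {
            cudaMemcpy(pool.payload, gpu_buf, bytes, cudaMemcpyDeviceToHost);
            pool.seq->store(round, std::memory_order_release);
        } else {
            while (pool.seq->load(std::memory_order_acquire) < round) { /* spin */ }
            cudaMemcpy(gpu_buf, pool.payload, bytes, cudaMemcpyHostToDevice);
        }
    }

Gather and AllGather follow the same shape, with each rank writing to its own offset in the pooled staging region instead of a single payload slot; the data-interleaving and parallelization machinery the paper describes is what makes those offsets and copies efficient.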

If this is right

  • CCCL can serve as a drop-in alternative for common collectives with measured speedups over high-speed RDMA.
  • Hardware costs for multi-node LLM training setups drop by a factor of 2.75 while retaining or improving runtime.
  • Resource utilization improves because the CXL pool reduces the need for over-provisioned per-node GPU memory.
  • Collective communication becomes memory-centric rather than network-centric, changing how interconnects are sized in GPU clusters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If CXL pools scale beyond the tested node count, cluster designs could shift away from expensive high-bandwidth fabrics for many workloads.
  • The same CXL-based approach could extend to other distributed GPU patterns such as parameter sharding or gradient aggregation.
  • Existing GPU collective libraries might incorporate CXL backends as an optional path when the hardware is present.
  • Further hardware tuning of CXL switches could amplify the reported gains in latency-sensitive phases of training.

Load-bearing premise

The CXL hardware delivers low-latency coherent access that supports collective synchronization and data movement without hidden scale-dependent bottlenecks.

What would settle it

Performance measurements on a larger number of nodes or with different workload sizes that show CCCL falling below InfiniBand speeds or introducing higher latency than reported would disprove the central performance claims.
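
Short of that re-evaluation, the premise underneath it is cheap to probe. A minimal sketch, assuming the pool is exposed as a mappable device as in the broadcast sketch above: measure dependent-load latency on the mapped CXL region and on local DRAM, and compare; the paper's own characterization (Figure 3) is the thorough version of this.

    // Hedged sketch: pointer-chase latency probe for the premise above.
    // Run it once on a local-DRAM buffer and once on a buffer mmap'd from
    // the CXL pool; the setup and sizes are illustrative assumptions.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    // Walk a random cyclic permutation so each load depends on the previous
    // one, exposing access latency rather than bandwidth.
    double chase_ns(uint64_t* buf, size_t n, size_t steps) {
        std::vector<uint64_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
        for (size_t i = 0; i < n; ++i)
            buf[order[i]] = order[(i + 1) % n];  // close the cycle
        uint64_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < steps; ++i) idx = buf[idx];  // dependent loads
        auto t1 = std::chrono::steady_clock::now();
        if (idx == ~0ull) return -1.0;  // defeat dead-code elimination
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    }

A pooled-to-local latency ratio that grows with node count or pool occupancy would be early evidence of exactly the scale-dependent bottleneck the premise rules out.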

Figures

Figures reproduced from arXiv:2602.22457 by Dengcheng Zhu (3), Dong Li (1), Dong Xu (1), Fei Liu (3), Han Meng (1), Henry Hu (3), Hui Zhang (3), Jianping Jiang (4), Liguang Xie (3), Rui Shi (3), Wei Tang (3), Wu Xiang (3), Xinyu Chen (2), Yue Li (3) ((1) UC Merced, (2) Zhejiang University, (3) Bytedance, (4) Xconn-tech).

Figure 1: Architecture of the CXL shared memory pool.
Figure 2: Sequentially stacked memory address space.
Figure 3: Performance characterization of the CXL shared memory pool. X-axis represents the transferred data volume; Y-axis…
Figure 4: The traditional copy-RDMA communication pipeline in NCCL. Adjoining text from the paper: Listing 2 presents the core structure of a communication primitive. The shared memory pool is mapped and registered in the CUDA address space, enabling direct memory transfers between the node and CXL device. Execution begins by writing data from GPU memory to the memory pool using cudaMemcpy with the flag cudaMemcpyDeviceToHost. After the write co…
Figure 5: An example: ReduceScatter with four GPUs via a CXL shared memory pool.
Figure 6: An example of spreading data across multiple CXL devices.
Figure 7: Communication overlapping. Adjoining text from the paper (§4.5, Lightweight Locking Mechanism): To enable synchronization across nodes, we must establish a lock mechanism. Existing inter-node locking mechanisms, such as centralized lock services [68, 69] and lease-based locks [39], provide mutual exclusion for shared data, but they rely on inter-node messaging in the critical path. This dependency introduces high latency and easily becomes…
Figure 8: The workflow of the locking mechanism in CXL-CCL; a hedged sketch of a pool-resident lock in this spirit follows the figure list.
Figure 9: CXL-CCL performance using the CXL shared memory pool.
Figure 10: Scalability evaluation using the CXL shared memory pool.
Figure 11: Sensitivity study for end-to-end latency with re…
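
Figure 8's protocol is available here only as an image, but the passage quoted under Figure 7 states the design goal: mutual exclusion over shared data without inter-node messaging on the critical path. A minimal sketch of that idea, assuming coherent cross-host loads and stores to the pool (the load-bearing premise above); the layout, owner encoding, and backoff policy are invented for illustration, not CXL-CCL's actual protocol.

    // Hedged sketch: a lock resident in the CXL pool itself, so acquire and
    // release are atomic operations on pooled memory rather than messages.
    // All details here are illustrative; Figure 8 shows the real workflow.
    #include <atomic>
    #include <cstdint>
    #include <thread>

    // One cache line in the pool holds the lock word: 0 means free,
    // otherwise it stores the owner's node id plus one.
    struct alignas(64) CxlLock {
        std::atomic<uint32_t> word;
    };

    void lock_acquire(CxlLock* l, uint32_t node_id) {
        uint32_t expected = 0;
        while (!l->word.compare_exchange_weak(expected, node_id + 1,
                                              std::memory_order_acquire)) {
            expected = 0;
            std::this_thread::yield();  // naive backoff; real designs throttle
        }
    }

    void lock_release(CxlLock* l) {
        l->word.store(0, std::memory_order_release);
    }

The point of the sketch is the locality of the critical path: acquisition and release touch one pooled cache line and involve no message round-trip, which is what the quoted passage says centralized and lease-based locks cannot avoid.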
read the original abstract

Large language models (LLMs) training or inference across multiple nodes introduces significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose \name, a collective communication library, leveraging the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges on synchronization, data interleaving, and communication parallelization faced by using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that \name achieves highly efficient collective operations across hosts, demonstrating CXL's potential for scalable, memory-centric GPU communication. Our evaluation demonstrates that \name achieves average performance improvements of 1.34$\times$ for AllGather, 1.84$\times$ for Broadcast, 1.94$\times$ for Gather, and 1.04$\times$ for Scatter, compared to the original RDMA-based implementation over 200 Gbps InfiniBand. \textcolor{dong}{In addition, the evaluation with a case of LLM training shows 1.11$\times$ speedup compared with the InfiniBand while saving production cost by $2.75\times$ in hardware.}

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CCCL, a collective communication library that leverages CXL shared memory pooling to support cross-node GPU collectives without RDMA networking. It describes design solutions for synchronization, data interleaving, and parallelization, then evaluates on a TITAN-II CXL switch with six Micron CZ120 cards, reporting average speedups of 1.34× (AllGather), 1.84× (Broadcast), 1.94× (Gather), and 1.04× (Scatter) versus a 200 Gbps InfiniBand RDMA baseline, plus 1.11× speedup and 2.75× hardware cost savings in an LLM training case.

Significance. If the empirical results hold, the work provides concrete evidence that CXL memory pools can enable efficient memory-centric GPU collectives, reducing interconnect bandwidth pressure and hardware over-provisioning in multi-node LLM training. The evaluation rests on direct hardware measurements rather than fitted parameters or self-referential derivations, which is a strength.

major comments (2)
  1. [Evaluation] Evaluation section: the reported speedups (1.34–1.94×) and LLM-training result lack methodological details on exact message sizes, node count, run-to-run variance, timing methodology, and precise RDMA baseline configuration, leaving the central performance claims only partially verifiable.
  2. [Design and Evaluation] Design and Evaluation: the assumption that the CXL pool delivers low-latency coherent access sufficient for synchronization and interleaving is demonstrated only on a small-scale TITAN-II + 6-card setup; any directory overhead, contention, or ordering costs that emerge at larger node counts would directly undermine the reported speedups and cost-saving claims.
minor comments (1)
  1. [Abstract] Abstract contains unresolved LaTeX commands (e.g., “\name” and “\textcolor{dong}”) that impair readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve verifiability and transparency.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported speedups (1.34–1.94×) and LLM-training result lack methodological details on exact message sizes, node count, run-to-run variance, timing methodology, and precise RDMA baseline configuration, leaving the central performance claims only partially verifiable.

    Authors: We agree that the Evaluation section requires additional methodological details for full verifiability. In the revised manuscript we will explicitly report: the exact message sizes tested for each collective (ranging from 256 KB to 4 GB), the node count (six nodes connected via the TITAN-II switch), run-to-run variance with standard deviations from at least ten repetitions per data point, the timing methodology (CUDA events for GPU-side operations combined with high-resolution host timers), and the precise RDMA baseline configuration including the MPI library version, InfiniBand driver settings, and queue-pair parameters. revision: yes · a hedged sketch of this timing loop follows these responses

  2. Referee: [Design and Evaluation] Design and Evaluation: the assumption that the CXL pool delivers low-latency coherent access sufficient for synchronization and interleaving is demonstrated only on a small-scale TITAN-II + 6-card setup; any directory overhead, contention, or ordering costs that emerge at larger node counts would directly undermine the reported speedups and cost-saving claims.

    Authors: We acknowledge that all empirical results are obtained on a small-scale TITAN-II + six-card configuration. While we cannot provide new measurements at larger scales, the design of synchronization and interleaving primitives is grounded in CXL 2.0 coherence semantics that are architecturally intended to scale. In the revision we will add a dedicated scalability discussion subsection that analytically examines directory overhead, contention, and ordering costs using CXL protocol specifications, and we will explicitly list the current scale as a limitation with suggested directions for future larger-scale validation. revision: partial
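
The timing methodology promised in response 1 is concrete enough to sketch. A minimal, hedged version of that loop: CUDA events around the GPU-side operation, ten repetitions per point, message sizes swept from 256 KB to 4 GB, mean and standard deviation reported. The collective_under_test stand-in is hypothetical; a real measurement would call CCCL's collective and, separately, the RDMA baseline on a device buffer large enough for the biggest message.

    // Hedged sketch of the measurement loop the rebuttal commits to:
    // CUDA events, ten repetitions per message size, 256 KB to 4 GB.
    #include <cuda_runtime.h>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for the operation under test; replace with the
    // CCCL collective (and separately the RDMA baseline) being measured.
    void collective_under_test(void* gpu_buf, size_t bytes) {
        cudaMemset(gpu_buf, 0, bytes);  // placeholder GPU-side work
    }

    void benchmark(void* gpu_buf) {  // gpu_buf must hold at least 4 GB
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        for (size_t bytes = 256ull << 10; bytes <= 4ull << 30; bytes <<= 1) {
            std::vector<float> ms(10);
            for (int rep = 0; rep < 10; ++rep) {
                cudaEventRecord(start);
                collective_under_test(gpu_buf, bytes);
                cudaEventRecord(stop);
                cudaEventSynchronize(stop);
                cudaEventElapsedTime(&ms[rep], start, stop);
            }
            float mean = 0.0f, var = 0.0f;
            for (float m : ms) mean += m;
            mean /= ms.size();
            for (float m : ms) var += (m - mean) * (m - mean);
            // mean and run-to-run deviation, as the response promises
            std::printf("%zu bytes: %.3f ms +/- %.3f\n",
                        bytes, mean, std::sqrt(var / ms.size()));
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

Host-side wall-clock timers around the same region, as the response also promises, would catch any CPU-side staging cost that CUDA events on the default stream miss.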

Circularity Check

0 steps flagged

No circularity: empirical hardware measurements only

full rationale

The paper proposes CCCL for CXL-based GPU collectives and reports speedups (1.34× AllGather etc.) solely from direct benchmarks on a TITAN-II + 6-card setup versus 200 Gbps InfiniBand. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described design. Claims rest on external hardware measurements, not on any chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on empirical hardware evaluation rather than derivation.

pith-pipeline@v0.9.0 · 5647 in / 1142 out tokens · 33250 ms · 2026-05-15T18:56:47.785411+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

    cs.DC 2026-04 unverdicted novelty 6.0

    CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

  2. TierBPF: Page Migration Admission Control for Tiered Memory via eBPF

    cs.OS 2026-04 unverdicted novelty 6.0

    TierBPF uses lightweight eBPF hooks for custom page admission control in tiered memory, delivering up to 17.7% geomean and 75% peak throughput gains across 17 workloads on three systems.

  3. Hybrid Adaptive Tuning for Tiered Memory Systems

    cs.OS 2026-04 unverdicted novelty 6.0

    PTMT is a lightweight framework that automates parameter tuning for memory tiering via hybrid offline database building and online customized reinforcement learning, delivering 14-30% gains over defaults and 32% over ...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 3 Pith papers · 4 internal anchors

  1. [1]

    Buddy and Slab Allocators

    2020. Buddy and Slab Allocators. https://students.mimuw.edu.pl/ZSO/Wyklady/06_memory2/BuddySlabAllocator.pdf

  2. [2]

    Compute Express Link (CXL)

    2026. Compute Express Link (CXL). https://computeexpresslink.org/

  3. [3]

    2026. PyTorch. https://pytorch.org/

  4. [4]

    TensorFlow

    2026. TensorFlow. https://www.tensorflow.org/

  5. [5]

    Neha Agarwal and Thomas F. Wenisch. 2017. Thermostat: Application-transparent Page Management for Two-tiered Main Memory. In Proceedings of the Twenty-Second International Conference on Architectural Suppor...

  6. [6]

    Minseon Ahn, Andrew Chang, Donghun Lee, Jongmin Gim, Jungmin Kim, Jaemin Jung, Oliver Rebholz, Vincent Pham, Krishna Malladi, and Yang Seok Ki. 2022. Enabling CXL Memory Expansion for In-Memory Database Management Systems. In International Workshop on Data Management on New Hardware.

  7. [7]

    Moiz Arif, Kevin Assogba, M Mustafa Rafique, and Sudharshan Vazhkudai. 2022. Exploiting CXL-based memory for distributed deep learning. In Proceedings of the 51st International Conference on Parallel Processing. 1–11.

  8. [8]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).

  9. [9]

    Weilin Cai, Le Qin, and Jiayi Huang. 2025. MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’25). ACM, 655–671. https://doi.org/10.1145/3676641.3716006

  10. [10]

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. 2025. MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Rotterdam...

  11. [11]

    Jonathan Corbet. 2023. Weighted interleaving for memory tiering. https://lwn.net/Articles/948037/

  12. [13]

    arXiv:EECS Technical report

    Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. arXiv:EECS Technical report. University of California, Merced

  13. [14]

    Wikimedia Foundation. [n. d.]. Wikimedia Downloads. https://dumps.wikimedia.org

  14. [15]

    Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung. 2022. Direct access, high-performance memory disaggregation with DirectCXL. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 287–294.

  15. [16]

    Yunyan Guo and Guoliang Li. 2024. A CXL-Powered Database System: Opportunities and Challenges. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 5593–5604.

  16. [17]

    Taekyung Heo, Yang Wang, Wei Cui, Jaehyuk Huh, and Lintao Zhang. 2022. Adaptive Page Migration Policy With Huge Pages in Tiered Memory Systems. IEEE Trans. Comput. 71, 1 (2022), 53–68. https://doi.org/10.1109/TC.2020.3036686

  17. [18]

    Yibo Huang, Haowei Chen, Newton Ni, Vijay Chidambaram, Dixin Tang, Emmett Witchel, Zhiting Zhu, and Zhipeng Jia. 2025. Tigon: A distributed database for a CXL pod. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), Boston, MA.

  18. [19]

    Yingchao Huang and Dong Li. 2017. Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems. In IEEE International Conference on Cluster Computing.

  19. [20]

    Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. 2023. MEMTIS: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 17–34. https://doi.org/10.1145/3600006.3613167

  20. [21]

    Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. 2023. Pond: CXL-based memory pooling systems for cloud platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2...

  21. [22]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024).

  22. [23]

    Haifeng Liu, Long Zheng, Yu Huang, Jingyi Zhou, Chaoqiang Liu, Runze Wang, Xiaofei Liao, Hai Jin, and Jingling Xue. 2024. Enabling efficient large recommendation model training with near CXL memory processing. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 382–395.

  23. [24]

    Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S Berger, Marie Nguyen, Xun Jian, Sam H Noh, and Huaicheng Li. 2025. Systematic CXL memory characterization and performance analysis at scale. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1203–1217.

  24. [25]

    Jiawen Liu, Jie Ren, Roberto Gioiosa, Dong Li, and Jiajia Li. 2021. Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Virtual Event, Republic of Korea) (PPoPP ’21). Association for Computing Machinery, New York, NY, US...

  25. [26]

    LWN.net. [n. d.]. AutoNUMA Balancing. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-auto_numa_balancing

  26. [27]

    Adnan Maruf, Ashikee Ghosh, Janki Bhimani, Daniela Campello, Andy Rudoff, and Raju Rangaswami. 2022. MULTI-CLOCK: Dynamic Tiering for Hybrid Memory Systems. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (2022), 925–937. https://api.semanticscholar.org/CorpusID:248865268

  27. [28]

    Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. 2023. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Ope...

  28. [29]

    Siyuan Mu and Sen Lin. 2026. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications. arXiv:2503.07137 [cs.LG] https://arxiv.org/abs/2503.07137

  29. [30]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High ...

  30. [31]

    NVIDIA. [n. d.]. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

  31. [32]

    Introduction to InfiniBand

    NVIDIA Corporation. 2021. Introduction to InfiniBand. White Paper. NVIDIA. https://network.nvidia.com/pdf/whitepapers/IB_Intro_WP_190.pdf

  32. [33]

    PyTorch. 2022. Fully sharded data parallelism. https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/

  33. [34]

    Amanda Raybuck, Tim Stamler, Wei Zhang, Mattan Erez, and Simon Peter. 2021. HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (2021). https://api.semanticscholar.org/CorpusID:239029009

  34. [35]

    Jie Ren, Jiaolin Luo, Ivy Peng, Kai Wu, and Dong Li. 2021. Optimizing large-scale plasma simulations on persistent memory-based heterogeneous memory with effective data placement across memory hierarchy. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA...

  35. [36]

    Jie Ren, Jiaolin Luo, Kai Wu, Minjia Zhang, Hyeran Jeon, and Dong Li. 2021. Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 598–611.

  36. [37]

    ZeRO-Offload: Democratizing Billion-Scale model training

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 551–564.

  37. [38]

    Jie Ren, Dong Xu, Junhee Ryu, Kwangsik Shin, Daewoo Kim, and Dong Li. 2024. MTM: Rethinking Memory Profiling and Migration for Multi-Tiered Large Memory. In Proceedings of the Nineteenth European Conference on Computer Systems (Athens, Greece) (EuroSys ’24). Association for Computing Machinery, New...

  38. [39]

    Jie Ren, Minjia Zhang, and Dong Li. 2020. HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory. In Conference on Neural Information Processing Systems (NeurIPS).

  39. [40]

    Andre Rodriguez and William Osborn. 2025. Distributed Locking: Performance Analysis and Optimization Strategies. arXiv:2504.03073 [cs.DC] https://arxiv.org/abs/2504.03073

  40. [41]

    Joshua Romero, Junqi Yin, Nouamane Laanait, Bing Xie, M. Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, and Michael Matheson. 2022. Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Ass...

  41. [42]

    Jee Ho Ryoo, Lizy K. John, and Arkaprava Basu. 2018. A Case for Granularity Aware Page Migration. In Proceedings of the 2018 International Conference on Supercomputing (Beijing, China) (ICS ’18). Association for Computing Machinery, New York, NY, USA, 352–362. https://doi.org/10.1145/3205289.3208064

  42. [43]

    Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, Akiyuki Kaneko, and Tatsuo Shiozawa

  43. [44]

    In Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis

    GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis. 962–972.

  44. [45]

    Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lu...

  45. [46]

    arXiv:2510.20171 [cs.DC] https://arxiv.org/abs/2510.20171

    Collective Communication for 100k+ GPUs. arXiv:2510.20171 [cs.DC] https://arxiv.org/abs/2510.20171

  46. [47]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).

  47. [48]

    Vishal Verma. 2022. Tiering-0.8. https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=tiering-0.8

  48. [49]

    Xi Wang, Jie Liu, Jianbo Wu, Shuangyan Yang, Jie Ren, Bhanu Shankar, and Dong Li. 2024. Exploring and evaluating real-world CXL: use cases and system adoption. arXiv preprint arXiv:2405.14209 (2024).

  49. [50]

    Xi Wang, Bin Ma, Jongryool Kim, Byungil Koh, Hoshik Kim, and Dong Li. 2025. cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2216–2232.

  50. [51]

    Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, and Huatao Wu. 2024. Rcmp: Reconstructing RDMA-based memory disaggregation via CXL. ACM Transactions on Architecture and Code Optimization 21, 1 (2024), 1–26.

  51. [52]

    Xingda Wei, Haotian Wang, Tianxia Wang, Rong Chen, Jinyu Gu, Pengfei Zuo, and Haibo Chen. 2023. Transactional indexes on (RDMA or CXL-based) disaggregated memory with repairable transaction. arXiv preprint arXiv:2308.02501 (2023).

  52. [53]

    Bryan Woolley. 2015. NCCL: Multi-GPU Collective Communication Library. https://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf

  53. [54]

    K. Wu, Y. Huang, and D. Li. 2017. Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory. In International Conference for High Performance Computing, Networking, Storage and Analysis.

  54. [55]

    Kai Wu, Jie Ren, Ivy Peng, and Dong Li. 2021. ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory. In USENIX Conference on File and Storage Technologies.

  55. [56]

    Kai Wu, Jie Ren, and Dong Li. 2018. Runtime Data Management on Non-Volatile Memory-based Heterogeneous Memory for Task-Parallel Programs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 401–413. https://doi.org/10.1109/SC.2018.00034

  56. [57]

    Panruo Wu, Dong Li, Zizhong Chen, Jeffrey Vetter, and Sparsh Mittal. 2016. Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory. In ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC).

  57. [58]

    Xconn. 2025. Xconn Technologies. https://www.xconn-tech.com/

  58. [59]

    Zhen Xie, Wenqian Dong, Jie Liu, Ivy Peng, Yanbao Ma, and Dong Li. 2021. MD-HM: memoization-based molecular dynamics simulations on big memory system. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA, 215–226. https://doi.org/10.1145/3447818.3460365

  59. [60]

    Zhen Xie, Jie Liu, Jiajia Li, and Dong Li. 2023. Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada)...

  60. [61]

    Dong Xu, Yuan Feng, Kwangsik Shin, Daewoo Kim, Hyeran Jeon, and Dong Li

  61. [62]

    Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–18. https://doi.org/10.1109/SC41406.2024.00100

  62. [63]

    Dong Xu, Junhee Ryu, Jinho Baek, Kwangsik Shin, Pengfei Su, and Dong Li

  63. [64]

    In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC ’24)

    FlexMem: adaptive page profiling and migration for tiered memory. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC ’24). USENIX Association, USA, Article 50, 17 pages.

  64. [65]

    Zi Yan, Daniel Lustig, David W. Nellans, and Abhishek Bhattacharjee. 2019. Nimble Page Management for Tiered Memory Systems. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019). https://api.semanticscholar.org/CorpusID:102348046

  65. [66]

    Shuo Yang, Kai Wu, Yifan Qiao, Dong Li, and Jidong Zhai. 2017. Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC. In IEEE International Conference on Cluster Computing.

  66. [67]

    Shuangyan Yang, Minjia Zhang, Wenqian Dong, and Dong Li. 2023. Betty: Enabling Large-Scale GNN Training with Batch-Level Graph Partitioning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New...

  67. [69]

    Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yicong Zhu, Yuqi Zhou, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, and Qiang Liu. 2025. Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management. arXiv:2511.20172 [cs.DC] https://arxiv.org/abs/2511.20172

  68. [70]

    Xinjun Yang, Yingqiang Zhang, Hao Chen, Feifei Li, Gerry Fan, Yang Kong, Bo Wang, Jing Fang, Yuhui Wang, Tao Huang, et al. 2025. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data. 689–702.

  69. [71]

    Xinjun Yang, Yingqiang Zhang, Hao Chen, Feifei Li, Gerry Fan, Yang Kong, Bo Wang, Jing Fang, Yuhui Wang, Tao Huang, Wenpu Hu, Jim Kao, and Jianping Jiang. 2025. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data (Berlin, Germany) (SIGMOD/PODS ’25). Associa...

  70. [72]

    Dongha Yoon, Younghoon Min, Hoshik Kim, Sam H. Noh, and Jongryool Kim

  71. [73]

    arXiv:2512.18194 [cs.DC] https://arxiv.org/abs/2512.18194

    TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale. arXiv:2512.18194 [cs.DC] https://arxiv.org/abs/2512.18194

  72. [74]

    Dong Young Yoon, Mosharaf Chowdhury, and Barzan Mozafari. 2018. Distributed Lock Management with RDMA: Decentralization without Starvation. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 1571–1586. https://doi.org/10.1145/3183713.3196890

  73. [75]

    Zhuolong Yu, Yiwen Zhang, Vladimir Braverman, Mosharaf Chowdhury, and Xin Jin. 2020. NetLock: Fast, Centralized Lock Management Using Programmable Switches. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication (Virtual Event,...

  74. [76]

    Mingxing Zhang, Teng Ma, Jinqi Hua, Zheng Liu, Kang Chen, Ning Ding, Fan Du, Jinlei Jiang, Tao Ma, and Yongwei Wu. 2023. Partial failure resilient memory management system for (CXL-based) distributed shared memory. In Proceedings of the 29th Symposium on Operating Systems Principles. 658–674.