pith. machine review for the scientific record.

arxiv: 2602.22457 · v3 · submitted 2026-02-25 · 💻 cs.DC · cs.ET

Recognition: no theorem link

CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:56 UTC · model grok-4.3

classification 💻 cs.DC cs.ET
keywords CXL · GPU collectives · memory pooling · RDMA · LLM training · collective communication · InfiniBand · cross-node GPU

The pith

CCCL uses CXL shared memory pooling to deliver faster node-spanning GPU collectives than RDMA over 200 Gbps InfiniBand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CCCL, a collective communication library built on CXL memory pooling that runs cross-node GPU collectives such as AllGather, Broadcast, Gather, and Scatter without traditional RDMA networking. It focuses on the synchronization, data-interleaving, and parallelization issues that arise when the shared CXL pool carries these operations. On multi-node hardware with a TITAN-II CXL switch and Micron CZ120 cards, CCCL shows average speedups of 1.34× for AllGather, 1.84× for Broadcast, 1.94× for Gather, and 1.04× for Scatter versus the RDMA baseline. In an LLM training case, it achieves a 1.11× overall speedup while cutting hardware costs by 2.75×. The work aims to show that memory-centric CXL designs can improve performance and reduce over-provisioning in distributed GPU systems.

Core claim

CCCL enables efficient node-spanning GPU collectives by leveraging the CXL shared memory pool for synchronization, data interleaving, and parallelized communication, achieving measured performance gains over RDMA-based InfiniBand implementations in standard collective benchmarks and in an LLM training workload.

What carries the argument

The CCCL library, which implements custom mechanisms for synchronization, data interleaving, and communication parallelization on top of CXL memory pooling to replace RDMA for cross-node GPU collectives.
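
The page does not reproduce CCCL's code, but the text accompanying Figure 4 below describes the primitive's shape: the pool is mapped and registered in the CUDA address space, and data moves with cudaMemcpy using the cudaMemcpyDeviceToHost flag. A minimal, hedged sketch of a one-shot Broadcast staged through such a pool follows; the DAX mapping path, the flag-based synchronization, and all names are illustrative assumptions rather than CCCL's API, and plain stores are assumed visible across hosts (the coherence premise examined below).

    // Hedged sketch: a one-shot Broadcast staged through a CXL shared
    // memory pool, following the pipeline the page quotes around Figure 4.
    // The /dev/dax mapping, the flag-based synchronization, and every name
    // here are illustrative assumptions, not CCCL's actual API.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    struct CxlPool {
        std::atomic<uint64_t>* seq;  // publish flag, resident in the pool
        char* payload;               // staging buffer, resident in the pool
    };

    // Map the pool (assumed to be exposed as a DAX device) and register it
    // with CUDA so cudaMemcpy can target it directly, as the paper describes.
    CxlPool cxl_map(const char* dax_path, size_t bytes) {
        int fd = open(dax_path, O_RDWR);
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        cudaHostRegister(p, bytes, cudaHostRegisterDefault);
        CxlPool pool;
        pool.seq = reinterpret_cast<std::atomic<uint64_t>*>(p);
        pool.payload = static_cast<char*>(p) + 4096;  // flag page, then data
        return pool;
    }

    // Root stages its GPU buffer into the pool and publishes a round number;
    // every other node polls the flag, then copies the payload to its GPU.
    // Assumes stores to the pool are coherently visible to all hosts.
    void broadcast(CxlPool& pool, void* gpu_buf, size_t bytes,
                   bool is_root, uint64_t round) {
        if (is_root) {
            cudaMemcpy(pool.payload, gpu_buf, bytes, cudaMemcpyDeviceToHost);
            pool.seq->store(round, std::memory_order_release);
        } else {
            while (pool.seq->load(std::memory_order_acquire) < round) { /* spin */ }
            cudaMemcpy(gpu_buf, pool.payload, bytes, cudaMemcpyHostToDevice);
        }
    }

Gather and AllGather follow the same shape, with each rank writing to its own offset in the pooled staging region instead of a single payload slot; the data-interleaving and parallelization machinery the paper describes is what makes those offsets and copies efficient.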

If this is right

  • CCCL can serve as a drop-in alternative for common collectives with measured speedups over high-speed RDMA.
  • Hardware costs for multi-node LLM training setups drop by a factor of 2.75 while retaining or improving runtime.
  • Resource utilization improves because the CXL pool reduces the need for over-provisioned per-node GPU memory.
  • Collective communication becomes memory-centric rather than network-centric, changing how interconnects are sized in GPU clusters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If CXL pools scale beyond the tested node count, cluster designs could shift away from expensive high-bandwidth fabrics for many workloads.
  • The same CXL-based approach could extend to other distributed GPU patterns such as parameter sharding or gradient aggregation.
  • Existing GPU collective libraries might incorporate CXL backends as an optional path when the hardware is present.
  • Further hardware tuning of CXL switches could amplify the reported gains in latency-sensitive phases of training.

Load-bearing premise

The CXL hardware delivers low-latency coherent access that supports collective synchronization and data movement without hidden scale-dependent bottlenecks.

What would settle it

Performance measurements on a larger number of nodes or with different workload sizes that show CCCL falling below InfiniBand speeds or introducing higher latency than reported would disprove the central performance claims.
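
Short of that re-evaluation, the premise underneath it is cheap to probe. A minimal sketch, assuming the pool is exposed as a mappable device as in the broadcast sketch above: measure dependent-load latency on the mapped CXL region and on local DRAM, and compare; the paper's own characterization (Figure 3) is the thorough version of this.

    // Hedged sketch: pointer-chase latency probe for the premise above.
    // Run it once on a local-DRAM buffer and once on a buffer mmap'd from
    // the CXL pool; the setup and sizes are illustrative assumptions.
    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <vector>

    // Walk a random cyclic permutation so each load depends on the previous
    // one, exposing access latency rather than bandwidth.
    double chase_ns(uint64_t* buf, size_t n, size_t steps) {
        std::vector<uint64_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
        for (size_t i = 0; i < n; ++i)
            buf[order[i]] = order[(i + 1) % n];  // close the cycle
        uint64_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < steps; ++i) idx = buf[idx];  // dependent loads
        auto t1 = std::chrono::steady_clock::now();
        if (idx == ~0ull) return -1.0;  // defeat dead-code elimination
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    }

A pooled-to-local latency ratio that grows with node count or pool occupancy would be early evidence of exactly the scale-dependent bottleneck the premise rules out.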

Figures

Figures reproduced from arXiv:2602.22457 by Dengcheng Zhu (3), Dong Li (1), Dong Xu (1), Fei Liu (3), Han Meng (1), Henry Hu (3), Hui Zhang (3), Jianping Jiang (4), Liguang Xie (3), Rui Shi (3), Wei Tang (3), Wu Xiang (3), Xinyu Chen (2), Yue Li (3) ((1) UC Merced, (2) Zhejiang University, (3) Bytedance, (4) Xconn-tech).

Figure 1: Architecture of the CXL shared memory pool.
Figure 2: Sequentially stacked memory address space.
Figure 3: Performance characterization of the CXL shared memory pool. X-axis represents the transferred data volume; Y-axis…
Figure 4: The traditional copy-RDMA communication pipeline in NCCL. Adjoining text from the paper: Listing 2 presents the core structure of a communication primitive. The shared memory pool is mapped and registered in the CUDA address space, enabling direct memory transfers between the node and CXL device. Execution begins by writing data from GPU memory to the memory pool using cudaMemcpy with the flag cudaMemcpyDeviceToHost. After the write co…
Figure 5: An example: ReduceScatter with four GPUs via a CXL shared memory pool.
Figure 6: An example of spreading data across multiple CXL devices.
Figure 7: Communication overlapping. Adjoining text from the paper (§4.5, Lightweight Locking Mechanism): To enable synchronization across nodes, we must establish a lock mechanism. Existing inter-node locking mechanisms, such as centralized lock services [68, 69] and lease-based locks [39], provide mutual exclusion for shared data, but they rely on inter-node messaging in the critical path. This dependency introduces high latency and easily becomes…
Figure 8: The workflow of the locking mechanism in CXL-CCL; a hedged sketch of a pool-resident lock in this spirit follows the figure list.
Figure 9: CXL-CCL performance using the CXL shared memory pool.
Figure 10: Scalability evaluation using the CXL shared memory pool.
Figure 11: Sensitivity study for end-to-end latency with re…
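
Figure 8's protocol is available here only as an image, but the passage quoted under Figure 7 states the design goal: mutual exclusion over shared data without inter-node messaging on the critical path. A minimal sketch of that idea, assuming coherent cross-host loads and stores to the pool (the load-bearing premise above); the layout, owner encoding, and backoff policy are invented for illustration, not CXL-CCL's actual protocol.

    // Hedged sketch: a lock resident in the CXL pool itself, so acquire and
    // release are atomic operations on pooled memory rather than messages.
    // All details here are illustrative; Figure 8 shows the real workflow.
    #include <atomic>
    #include <cstdint>
    #include <thread>

    // One cache line in the pool holds the lock word: 0 means free,
    // otherwise it stores the owner's node id plus one.
    struct alignas(64) CxlLock {
        std::atomic<uint32_t> word;
    };

    void lock_acquire(CxlLock* l, uint32_t node_id) {
        uint32_t expected = 0;
        while (!l->word.compare_exchange_weak(expected, node_id + 1,
                                              std::memory_order_acquire)) {
            expected = 0;
            std::this_thread::yield();  // naive backoff; real designs throttle
        }
    }

    void lock_release(CxlLock* l) {
        l->word.store(0, std::memory_order_release);
    }

The point of the sketch is the locality of the critical path: acquisition and release touch one pooled cache line and involve no message round-trip, which is what the quoted passage says centralized and lease-based locks cannot avoid.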
read the original abstract

Large language models (LLMs) training or inference across multiple nodes introduces significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose \name, a collective communication library, leveraging the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges on synchronization, data interleaving, and communication parallelization faced by using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that \name achieves highly efficient collective operations across hosts, demonstrating CXL's potential for scalable, memory-centric GPU communication. Our evaluation demonstrates that \name achieves average performance improvements of 1.34$\times$ for AllGather, 1.84$\times$ for Broadcast, 1.94$\times$ for Gather, and 1.04$\times$ for Scatter, compared to the original RDMA-based implementation over 200 Gbps InfiniBand. \textcolor{dong}{In addition, the evaluation with a case of LLM training shows 1.11$\times$ speedup compared with the InfiniBand while saving production cost by $2.75\times$ in hardware.}

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CCCL, a collective communication library that leverages CXL shared memory pooling to support cross-node GPU collectives without RDMA networking. It describes design solutions for synchronization, data interleaving, and parallelization, then evaluates on a TITAN-II CXL switch with six Micron CZ120 cards, reporting average speedups of 1.34× (AllGather), 1.84× (Broadcast), 1.94× (Gather), and 1.04× (Scatter) versus a 200 Gbps InfiniBand RDMA baseline, plus 1.11× speedup and 2.75× hardware cost savings in an LLM training case.

Significance. If the empirical results hold, the work provides concrete evidence that CXL memory pools can enable efficient memory-centric GPU collectives, reducing interconnect bandwidth pressure and hardware over-provisioning in multi-node LLM training. The evaluation rests on direct hardware measurements rather than fitted parameters or self-referential derivations, which is a strength.

major comments (2)
  1. [Evaluation] Evaluation section: the reported speedups (1.34–1.94×) and LLM-training result lack methodological details on exact message sizes, node count, run-to-run variance, timing methodology, and precise RDMA baseline configuration, leaving the central performance claims only partially verifiable.
  2. [Design and Evaluation] Design and Evaluation: the assumption that the CXL pool delivers low-latency coherent access sufficient for synchronization and interleaving is demonstrated only on a small-scale TITAN-II + 6-card setup; any directory overhead, contention, or ordering costs that emerge at larger node counts would directly undermine the reported speedups and cost-saving claims.
minor comments (1)
  1. [Abstract] Abstract contains unresolved LaTeX commands (e.g., “\name” and “\textcolor{dong}”) that impair readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve verifiability and transparency.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported speedups (1.34–1.94×) and LLM-training result lack methodological details on exact message sizes, node count, run-to-run variance, timing methodology, and precise RDMA baseline configuration, leaving the central performance claims only partially verifiable.

    Authors: We agree that the Evaluation section requires additional methodological details for full verifiability. In the revised manuscript we will explicitly report: the exact message sizes tested for each collective (ranging from 256 KB to 4 GB), the node count (six nodes connected via the TITAN-II switch), run-to-run variance with standard deviations from at least ten repetitions per data point, the timing methodology (CUDA events for GPU-side operations combined with high-resolution host timers), and the precise RDMA baseline configuration including the MPI library version, InfiniBand driver settings, and queue-pair parameters. revision: yes · a hedged sketch of this timing loop follows these responses

  2. Referee: [Design and Evaluation] Design and Evaluation: the assumption that the CXL pool delivers low-latency coherent access sufficient for synchronization and interleaving is demonstrated only on a small-scale TITAN-II + 6-card setup; any directory overhead, contention, or ordering costs that emerge at larger node counts would directly undermine the reported speedups and cost-saving claims.

    Authors: We acknowledge that all empirical results are obtained on a small-scale TITAN-II + six-card configuration. While we cannot provide new measurements at larger scales, the design of synchronization and interleaving primitives is grounded in CXL 2.0 coherence semantics that are architecturally intended to scale. In the revision we will add a dedicated scalability discussion subsection that analytically examines directory overhead, contention, and ordering costs using CXL protocol specifications, and we will explicitly list the current scale as a limitation with suggested directions for future larger-scale validation. revision: partial
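
The timing methodology promised in response 1 is concrete enough to sketch. A minimal, hedged version of that loop: CUDA events around the GPU-side operation, ten repetitions per point, message sizes swept from 256 KB to 4 GB, mean and standard deviation reported. The collective_under_test stand-in is hypothetical; a real measurement would call CCCL's collective and, separately, the RDMA baseline on a device buffer large enough for the biggest message.

    // Hedged sketch of the measurement loop the rebuttal commits to:
    // CUDA events, ten repetitions per message size, 256 KB to 4 GB.
    #include <cuda_runtime.h>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for the operation under test; replace with the
    // CCCL collective (and separately the RDMA baseline) being measured.
    void collective_under_test(void* gpu_buf, size_t bytes) {
        cudaMemset(gpu_buf, 0, bytes);  // placeholder GPU-side work
    }

    void benchmark(void* gpu_buf) {  // gpu_buf must hold at least 4 GB
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        for (size_t bytes = 256ull << 10; bytes <= 4ull << 30; bytes <<= 1) {
            std::vector<float> ms(10);
            for (int rep = 0; rep < 10; ++rep) {
                cudaEventRecord(start);
                collective_under_test(gpu_buf, bytes);
                cudaEventRecord(stop);
                cudaEventSynchronize(stop);
                cudaEventElapsedTime(&ms[rep], start, stop);
            }
            float mean = 0.0f, var = 0.0f;
            for (float m : ms) mean += m;
            mean /= ms.size();
            for (float m : ms) var += (m - mean) * (m - mean);
            // mean and run-to-run deviation, as the response promises
            std::printf("%zu bytes: %.3f ms +/- %.3f\n",
                        bytes, mean, std::sqrt(var / ms.size()));
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

Host-side wall-clock timers around the same region, as the response also promises, would catch any CPU-side staging cost that CUDA events on the default stream miss.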

Circularity Check

0 steps flagged

No circularity: empirical hardware measurements only

full rationale

The paper proposes CCCL for CXL-based GPU collectives and reports speedups (1.34× AllGather etc.) solely from direct benchmarks on a TITAN-II + 6-card setup versus 200 Gbps InfiniBand. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described design. Claims rest on external hardware measurements, not on any chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on empirical hardware evaluation rather than derivation.

pith-pipeline@v0.9.0 · 5647 in / 1142 out tokens · 33250 ms · 2026-05-15T18:56:47.785411+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

    cs.DC 2026-04 unverdicted novelty 6.0

    CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

  2. TierBPF: Page Migration Admission Control for Tiered Memory via eBPF

    cs.OS 2026-04 unverdicted novelty 6.0

    TierBPF uses lightweight eBPF hooks for custom page admission control in tiered memory, delivering up to 17.7% geomean and 75% peak throughput gains across 17 workloads on three systems.

  3. Hybrid Adaptive Tuning for Tiered Memory Systems

    cs.OS 2026-04 unverdicted novelty 6.0

    PTMT is a lightweight framework that automates parameter tuning for memory tiering via hybrid offline database building and online customized reinforcement learning, delivering 14-30% gains over defaults and 32% over ...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 3 Pith papers · 4 internal anchors

  1. [1]

    Buddy and Slab Allocators

    2020. Buddy and Slab Allocators. https://students.mimuw.edu.pl/ZSO/Wyklady/06_memory2/BuddySlabAllocator.pdf

  2. [2]

    Compute Express Link (CXL)

    2026. Compute Express Link (CXL). https://computeexpresslink.org/

  3. [3]

    2026. PyTorch. https://pytorch.org/

  4. [4]

    TensorFlow

    2026. TensorFlow. https://www.tensorflow.org/

  5. [5]

    Neha Agarwal and Thomas F. Wenisch. 2017. Thermostat: Application-transparent Page Management for Two-tiered Main Memory. In Proceedings of the Twenty-Second International Conference on Architectural Suppor...

  6. [6]

    Minseon Ahn, Andrew Chang, Donghun Lee, Jongmin Gim, Jungmin Kim, Jaemin Jung, Oliver Rebholz, Vincent Pham, Krishna Malladi, and Yang Seok Ki. 2022. Enabling CXL Memory Expansion for In-Memory Database Management Systems. In International Workshop on Data Management on New Hardware.

  7. [7]

    Moiz Arif, Kevin Assogba, M Mustafa Rafique, and Sudharshan Vazhkudai. 2022. Exploiting CXL-based memory for distributed deep learning. In Proceedings of the 51st International Conference on Parallel Processing. 1–11.

  8. [8]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).

  9. [9]

    Weilin Cai, Le Qin, and Jiayi Huang. 2025. MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’25). ACM, 655–671. https://doi.org/10.1145/3676641.3716006

  10. [10]

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. 2025. MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Rotterdam...

  11. [11]

    Jonathan Corbet. 2023. Weighted interleaving for memory tiering. https://lwn.net/Articles/948037/

  12. [13]

    arXiv:EECS Technical report

    Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. arXiv:EECS Technical report. University of California, Merced

  13. [14]

    Wikimedia Foundation. [n. d.]. Wikimedia Downloads. https://dumps.wikimedia.org

  14. [15]

    Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung. 2022. Direct access, high-performance memory disaggregation with DirectCXL. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 287–294.

  15. [16]

    Yunyan Guo and Guoliang Li. 2024. A CXL-Powered Database System: Opportunities and Challenges. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 5593–5604.

  16. [17]

    Taekyung Heo, Yang Wang, Wei Cui, Jaehyuk Huh, and Lintao Zhang. 2022. Adaptive Page Migration Policy With Huge Pages in Tiered Memory Systems. IEEE Trans. Comput. 71, 1 (2022), 53–68. https://doi.org/10.1109/TC.2020.3036686

  17. [18]

    Yibo Huang, Haowei Chen, Newton Ni, Vijay Chidambaram, Dixin Tang, Emmett Witchel, Zhiting Zhu, and Zhipeng Jia. 2025. Tigon: A distributed database for a CXL pod. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), Boston, MA.

  18. [19]

    Yingchao Huang and Dong Li. 2017. Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems. In IEEE International Conference on Cluster Computing.

  19. [20]

    Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. 2023. MEMTIS: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 17–34. https://doi.org/10.1145/3600006.3613167

  20. [21]

    Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. 2023. Pond: CXL-based memory pooling systems for cloud platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2...

  21. [22]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024).

  22. [23]

    Haifeng Liu, Long Zheng, Yu Huang, Jingyi Zhou, Chaoqiang Liu, Runze Wang, Xiaofei Liao, Hai Jin, and Jingling Xue. 2024. Enabling efficient large recommendation model training with near CXL memory processing. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 382–395.

  23. [24]

    Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S Berger, Marie Nguyen, Xun Jian, Sam H Noh, and Huaicheng Li. 2025. Systematic CXL memory characterization and performance analysis at scale. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1203–1217.

  24. [25]

    Jiawen Liu, Jie Ren, Roberto Gioiosa, Dong Li, and Jiajia Li. 2021. Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Virtual Event, Republic of Korea) (PPoPP ’21). Association for Computing Machinery, New York, NY, US...

  25. [26]

    LWN.net. [n. d.]. AutoNUMA Balancing. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-auto_numa_balancing

  26. [27]

    Adnan Maruf, Ashikee Ghosh, Janki Bhimani, Daniela Campello, Andy Rudoff, and Raju Rangaswami. 2022. MULTI-CLOCK: Dynamic Tiering for Hybrid Memory Systems. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (2022), 925–937. https://api.semanticscholar.org/CorpusID:248865268

  27. [28]

    Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. 2023. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Ope...

  28. [29]

    Siyuan Mu and Sen Lin. 2026. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications. arXiv:2503.07137 [cs.LG] https://arxiv.org/abs/2503.07137

  29. [30]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High ...

  30. [31]

    NVIDIA. [n. d.]. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

  31. [32]

    Introduction to InfiniBand

    NVIDIA Corporation. 2021. Introduction to InfiniBand. White Paper. NVIDIA. https://network.nvidia.com/pdf/whitepapers/IB_Intro_WP_190.pdf

  32. [33]

    PyTorch. 2022. Fully sharded data parallelism. https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/

  33. [34]

    Amanda Raybuck, Tim Stamler, Wei Zhang, Mattan Erez, and Simon Peter. 2021. HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (2021). https://api.semanticscholar.org/CorpusID:239029009

  34. [35]

    Jie Ren, Jiaolin Luo, Ivy Peng, Kai Wu, and Dong Li. 2021. Optimizing large-scale plasma simulations on persistent memory-based heterogeneous memory with effective data placement across memory hierarchy. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA...

  35. [36]

    Jie Ren, Jiaolin Luo, Kai Wu, Minjia Zhang, Hyeran Jeon, and Dong Li. 2021. Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 598–611.

  36. [37]

    ZeRO-Offload: Democratizing Billion-Scale model training

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 551–564.

  37. [38]

    Jie Ren, Dong Xu, Junhee Ryu, Kwangsik Shin, Daewoo Kim, and Dong Li. 2024. MTM: Rethinking Memory Profiling and Migration for Multi-Tiered Large Memory. In Proceedings of the Nineteenth European Conference on Computer Systems (Athens, Greece) (EuroSys ’24). Association for Computing Machinery, New...

  38. [39]

    Jie Ren, Minjia Zhang, and Dong Li. 2020. HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory. In Conference on Neural Information Processing Systems (NeurIPS).

  39. [40]

    Andre Rodriguez and William Osborn. 2025. Distributed Locking: Performance Analysis and Optimization Strategies. arXiv:2504.03073 [cs.DC] https://arxiv.org/abs/2504.03073

  40. [41]

    Joshua Romero, Junqi Yin, Nouamane Laanait, Bing Xie, M. Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, and Michael Matheson. 2022. Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Ass...

  41. [42]

    Jee Ho Ryoo, Lizy K. John, and Arkaprava Basu. 2018. A Case for Granularity Aware Page Migration. In Proceedings of the 2018 International Conference on Supercomputing (Beijing, China) (ICS ’18). Association for Computing Machinery, New York, NY, USA, 352–362. https://doi.org/10.1145/3205289.3208064

  42. [43]

    Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, Akiyuki Kaneko, and Tatsuo Shiozawa

  43. [44]

    In Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis

    GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis. 962–972.

  44. [45]

    Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lu...

  45. [46]

    arXiv:2510.20171 [cs.DC] https://arxiv.org/abs/2510.20171

    Collective Communication for 100k+ GPUs. arXiv:2510.20171 [cs.DC] https://arxiv.org/abs/2510.20171

  46. [47]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).

  47. [48]

    Vishal Verma. 2022. Tiering-0.8. https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=tiering-0.8

  48. [49]

    Xi Wang, Jie Liu, Jianbo Wu, Shuangyan Yang, Jie Ren, Bhanu Shankar, and Dong Li. 2024. Exploring and evaluating real-world CXL: use cases and system adoption. arXiv preprint arXiv:2405.14209 (2024).

  49. [50]

    Xi Wang, Bin Ma, Jongryool Kim, Byungil Koh, Hoshik Kim, and Dong Li. 2025. cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2216–2232.

  50. [51]

    Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, and Huatao Wu. 2024. Rcmp: Reconstructing RDMA-based memory disaggregation via CXL. ACM Transactions on Architecture and Code Optimization 21, 1 (2024), 1–26.

  51. [52]

    Xingda Wei, Haotian Wang, Tianxia Wang, Rong Chen, Jinyu Gu, Pengfei Zuo, and Haibo Chen. 2023. Transactional indexes on (RDMA or CXL-based) disaggregated memory with repairable transaction. arXiv preprint arXiv:2308.02501 (2023).

  52. [53]

    Bryan Woolley. 2015. NCCL: Multi-GPU Collective Communication Library. https://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf

  53. [54]

    K. Wu, Y. Huang, and D. Li. 2017. Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory. In International Conference for High Performance Computing, Networking, Storage and Analysis.

  54. [55]

    Kai Wu, Jie Ren, Ivy Peng, and Dong Li. 2021. ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory. In USENIX Conference on File and Storage Technologies.

  55. [56]

    Kai Wu, Jie Ren, and Dong Li. 2018. Runtime Data Management on Non-Volatile Memory-based Heterogeneous Memory for Task-Parallel Programs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 401–413. https://doi.org/10.1109/SC.2018.00034

  56. [57]

    Panruo Wu, Dong Li, Zizhong Chen, Jeffrey Vetter, and Sparsh Mittal. 2016. Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory. In ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC).

  57. [58]

    Xconn. 2025. Xconn Technologies. https://www.xconn-tech.com/

  58. [59]

    Zhen Xie, Wenqian Dong, Jie Liu, Ivy Peng, Yanbao Ma, and Dong Li. 2021. MD-HM: memoization-based molecular dynamics simulations on big memory system. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA, 215–226. https://doi.org/10.1145/3447818.3460365

  59. [60]

    Zhen Xie, Jie Liu, Jiajia Li, and Dong Li. 2023. Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada)...

  60. [61]

    Dong Xu, Yuan Feng, Kwangsik Shin, Daewoo Kim, Hyeran Jeon, and Dong Li

  61. [62]

    Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–18. https://doi.org/10.1109/SC41406.2024.00100

  62. [63]

    Dong Xu, Junhee Ryu, Jinho Baek, Kwangsik Shin, Pengfei Su, and Dong Li

  63. [64]

    In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC ’24)

    FlexMem: adaptive page profiling and migration for tiered memory. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC ’24). USENIX Association, USA, Article 50, 17 pages.

  64. [65]

    Zi Yan, Daniel Lustig, David W. Nellans, and Abhishek Bhattacharjee. 2019. Nimble Page Management for Tiered Memory Systems. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019). https://api.semanticscholar.org/CorpusID:102348046

  65. [66]

    Shuo Yang, Kai Wu, Yifan Qiao, Dong Li, and Jidong Zhai. 2017. Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC. In IEEE International Conference on Cluster Computing.

  66. [67]

    Shuangyan Yang, Minjia Zhang, Wenqian Dong, and Dong Li. 2023. Betty: Enabling Large-Scale GNN Training with Batch-Level Graph Partitioning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New...

  67. [69]

    Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yicong Zhu, Yuqi Zhou, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, and Qiang Liu. 2025. Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management. arXiv:2511.20172 [cs.DC] https://arxiv.org/abs/2511.20172

  68. [70]

    Xinjun Yang, Yingqiang Zhang, Hao Chen, Feifei Li, Gerry Fan, Yang Kong, Bo Wang, Jing Fang, Yuhui Wang, Tao Huang, et al. 2025. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data. 689–702.

  69. [71]

    Xinjun Yang, Yingqiang Zhang, Hao Chen, Feifei Li, Gerry Fan, Yang Kong, Bo Wang, Jing Fang, Yuhui Wang, Tao Huang, Wenpu Hu, Jim Kao, and Jianping Jiang. 2025. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data (Berlin, Germany) (SIGMOD/PODS ’25). Associa...

  70. [72]

    Dongha Yoon, Younghoon Min, Hoshik Kim, Sam H. Noh, and Jongryool Kim

  71. [73]

    arXiv:2512.18194 [cs.DC] https://arxiv.org/abs/2512.18194

    TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale. arXiv:2512.18194 [cs.DC] https://arxiv.org/abs/2512.18194

  72. [74]

    Dong Young Yoon, Mosharaf Chowdhury, and Barzan Mozafari. 2018. Distributed Lock Management with RDMA: Decentralization without Starvation. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 1571–1586. https://doi.org/10.1145/3183713.3196890

  73. [75]

    Zhuolong Yu, Yiwen Zhang, Vladimir Braverman, Mosharaf Chowdhury, and Xin Jin. 2020. NetLock: Fast, Centralized Lock Management Using Programmable Switches. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication (Virtual Event,...

  74. [76]

    Mingxing Zhang, Teng Ma, Jinqi Hua, Zheng Liu, Kang Chen, Ning Ding, Fan Du, Jinlei Jiang, Tao Ma, and Yongwei Wu. 2023. Partial failure resilient memory management system for (CXL-based) distributed shared memory. In Proceedings of the 29th Symposium on Operating Systems Principles. 658–674.