Recognition: no theorem link
CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling
Pith reviewed 2026-05-15 18:56 UTC · model grok-4.3
The pith
CCCL uses CXL shared memory pooling to deliver faster node-spanning GPU collectives than RDMA over 200 Gbps InfiniBand.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CCCL enables efficient node-spanning GPU collectives by leveraging the CXL shared memory pool for synchronization, data interleaving, and parallelized communication, achieving measured performance gains over RDMA-based InfiniBand implementations in standard collective benchmarks and in an LLM training workload.
What carries the argument
The CCCL library, which implements custom mechanisms for synchronization, data interleaving, and communication parallelization on top of CXL memory pooling to replace RDMA for cross-node GPU collectives.
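The paper's exact mechanisms are not reproduced here, but the pool-mediated pattern they imply can be sketched in a few lines. The toy below simulates an AllGather through a shared buffer standing in for the CXL pool, with per-rank ready flags standing in for CCCL's synchronization primitives; every name and layout choice is illustrative, not CCCL's actual design.

```python
# Toy AllGather through a shared pool (single-process simulation).
# The bytearray stands in for the CXL shared memory region; the flags
# stand in for whatever synchronization primitive CCCL actually uses.

def pool_allgather(shards: list[bytes]) -> list[bytes]:
    nranks = len(shards)
    shard_len = len(shards[0])
    assert all(len(s) == shard_len for s in shards)

    pool = bytearray(nranks * shard_len)   # "CXL pool" region
    ready = [False] * nranks               # per-rank ready flags

    # Phase 1: each rank writes its shard at its own offset, then sets
    # its flag (on real hardware the data write must be made visible to
    # peers before the flag, e.g. via an ordering fence).
    for rank, shard in enumerate(shards):
        off = rank * shard_len
        pool[off:off + shard_len] = shard
        ready[rank] = True

    # Phase 2: once all flags are set, every rank reads the full pool.
    assert all(ready)
    return [bytes(pool) for _ in range(nranks)]
```

In the real system the two phases run concurrently on different hosts, and the interleaving and parallelization the paper describes would split each shard across pool devices and issue the copies in parallel; this sketch only captures the write/flag/read ordering.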
If this is right
- CCCL can serve as a drop-in alternative for common collectives with measured speedups over high-speed RDMA.
- Hardware costs for multi-node LLM training setups drop by a factor of 2.75 while retaining or improving runtime.
- Resource utilization improves because the CXL pool reduces the need for over-provisioned per-node GPU memory.
- Collective communication becomes memory-centric rather than network-centric, changing how interconnects are sized in GPU clusters.
Where Pith is reading between the lines
- If CXL pools scale beyond the tested node count, cluster designs could shift away from expensive high-bandwidth fabrics for many workloads.
- The same CXL-based approach could extend to other distributed GPU patterns such as parameter sharding or gradient aggregation.
- Existing GPU collective libraries might incorporate CXL backends as an optional path when the hardware is present.
- Further hardware tuning of CXL switches could amplify the reported gains in latency-sensitive phases of training.
Load-bearing premise
The CXL hardware delivers low-latency coherent access that supports collective synchronization and data movement without hidden scale-dependent bottlenecks.
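To see why this premise is load-bearing, a first-order cost model helps: if transfer time is modeled as latency plus size over bandwidth, a latency advantage only wins below some crossover message size. All numbers below are placeholder assumptions, not measurements from the paper.

```python
# First-order link model: transfer time = latency + size / bandwidth.
# Every parameter value is an illustrative assumption, not a measurement.

CXL_LAT, CXL_BW = 600e-9, 20e9   # assumed: ~600 ns access latency, 20 GB/s
IB_LAT, IB_BW = 3e-6, 25e9       # assumed: ~3 us round trip, 200 Gbps = 25 GB/s

def transfer_time(size_bytes, latency_s, bandwidth_Bps):
    return latency_s + size_bytes / bandwidth_Bps

def cxl_wins(size_bytes):
    """True when the low-latency path beats the high-bandwidth path."""
    return (transfer_time(size_bytes, CXL_LAT, CXL_BW)
            < transfer_time(size_bytes, IB_LAT, IB_BW))

# Crossover size where the two paths tie (~240 KB with these numbers):
crossover = (IB_LAT - CXL_LAT) / (1 / CXL_BW - 1 / IB_BW)
```

Under these assumed numbers the latency-favored path wins below roughly 240 KB per transfer; if the pool's effective latency grows with node count (the hidden scale-dependent bottleneck named in the premise), the crossover shrinks and the premise weakens.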
What would settle it
Performance measurements at larger node counts or with different workload sizes that show CCCL falling below InfiniBand speeds, or introducing higher latency than reported, would disprove the central performance claims.
Original abstract
Large language models (LLMs) training or inference across multiple nodes introduces significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose \name, a collective communication library, leveraging the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges on synchronization, data interleaving, and communication parallelization faced by using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that \name achieves highly efficient collective operations across hosts, demonstrating CXL's potential for scalable, memory-centric GPU communication. Our evaluation demonstrates that \name achieves average performance improvements of 1.34$\times$ for AllGather, 1.84$\times$ for Broadcast, 1.94$\times$ for Gather, and 1.04$\times$ for Scatter, compared to the original RDMA-based implementation over 200 Gbps InfiniBand. \textcolor{dong}{In addition, the evaluation with a case of LLM training shows 1.11$\times$ speedup compared with the InfiniBand while saving production cost by $2.75\times$ in hardware.}
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CCCL, a collective communication library that leverages CXL shared memory pooling to support cross-node GPU collectives without RDMA networking. It describes design solutions for synchronization, data interleaving, and parallelization, then evaluates on a TITAN-II CXL switch with six Micron CZ120 cards, reporting average speedups of 1.34× (AllGather), 1.84× (Broadcast), 1.94× (Gather), and 1.04× (Scatter) versus a 200 Gbps InfiniBand RDMA baseline, plus 1.11× speedup and 2.75× hardware cost savings in an LLM training case.
Significance. If the empirical results hold, the work provides concrete evidence that CXL memory pools can enable efficient memory-centric GPU collectives, reducing interconnect bandwidth pressure and hardware over-provisioning in multi-node LLM training. The evaluation rests on direct hardware measurements rather than fitted parameters or self-referential derivations, which is a strength.
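One caveat for readers: the abstract calls the per-collective numbers "average" improvements without stating the aggregation. A quick check of how arithmetic versus geometric averaging would summarize the four reported figures:

```python
import math

# Reported per-collective average speedups over the 200 Gbps InfiniBand
# RDMA baseline, as given in the abstract.
speedups = {"AllGather": 1.34, "Broadcast": 1.84, "Gather": 1.94, "Scatter": 1.04}

vals = list(speedups.values())
arith = sum(vals) / len(vals)             # arithmetic mean: 1.54
geo = math.prod(vals) ** (1 / len(vals))  # geometric mean: ~1.49
```

Either way the cross-collective picture is a roughly 1.5× aggregate gain dominated by Broadcast and Gather, with Scatter near parity; the manuscript should state which mean it uses.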
major comments (2)
- [Evaluation] Evaluation section: the reported speedups (1.34–1.94×) and LLM-training result lack methodological details on exact message sizes, node count, run-to-run variance, timing methodology, and precise RDMA baseline configuration, leaving the central performance claims only partially verifiable.
- [Design and Evaluation] Design and Evaluation: the assumption that the CXL pool delivers low-latency coherent access sufficient for synchronization and interleaving is demonstrated only on a small-scale TITAN-II + 6-card setup; any directory overhead, contention, or ordering costs that emerge at larger node counts would directly undermine the reported speedups and cost-saving claims.
minor comments (1)
- [Abstract] Abstract contains unresolved LaTeX commands (e.g., "\name" and "\textcolor{dong}") that impair readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve verifiability and transparency.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: the reported speedups (1.34–1.94×) and LLM-training result lack methodological details on exact message sizes, node count, run-to-run variance, timing methodology, and precise RDMA baseline configuration, leaving the central performance claims only partially verifiable.
Authors: We agree that the Evaluation section requires additional methodological details for full verifiability. In the revised manuscript we will explicitly report: the exact message sizes tested for each collective (ranging from 256 KB to 4 GB), the node count (six nodes connected via the TITAN-II switch), run-to-run variance with standard deviations from at least ten repetitions per data point, the timing methodology (CUDA events for GPU-side operations combined with high-resolution host timers), and the precise RDMA baseline configuration including the MPI library version, InfiniBand driver settings, and queue-pair parameters. revision: yes
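The promised timing methodology can be sketched with host-side timers alone; on the actual system each timed region would additionally be bracketed by CUDA events and a device synchronization. The helper name and defaults below are illustrative.

```python
import statistics
import time

def time_collective(op, warmup=3, reps=10):
    """Host-timer measurement: discard warmup runs, then report the
    mean and standard deviation over `reps` timed runs. On GPU, each
    timed region would be bracketed by CUDA events plus a device
    synchronization rather than perf_counter alone."""
    for _ in range(warmup):
        op()
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        op()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)
```

Reporting the standard deviation alongside each mean, as the authors promise, is what makes the 1.04× Scatter result interpretable: a gain that small is meaningless without run-to-run variance.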
- Referee: [Design and Evaluation] Design and Evaluation: the assumption that the CXL pool delivers low-latency coherent access sufficient for synchronization and interleaving is demonstrated only on a small-scale TITAN-II + 6-card setup; any directory overhead, contention, or ordering costs that emerge at larger node counts would directly undermine the reported speedups and cost-saving claims.
Authors: We acknowledge that all empirical results are obtained on a small-scale TITAN-II + six-card configuration. While we cannot provide new measurements at larger scales, the design of synchronization and interleaving primitives is grounded in CXL 2.0 coherence semantics that are architecturally intended to scale. In the revision we will add a dedicated scalability discussion subsection that analytically examines directory overhead, contention, and ordering costs using CXL protocol specifications, and we will explicitly list the current scale as a limitation with suggested directions for future larger-scale validation. revision: partial
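The promised analytic scalability discussion might take a form like the following: a per-node coherence or contention term added to the CXL-path cost, with the RDMA baseline held flat at a fixed message size. Every coefficient is an assumption chosen for illustration, not a measurement from the paper.

```python
# Toy scalability model. Both coefficients are assumptions chosen for
# illustration; neither comes from the paper.

def cxl_time(n, base=1.0, per_node=0.05):
    """Collective cost on the CXL path with a linear per-node
    directory/contention term (normalized units)."""
    return base + per_node * n

def projected_speedup(n, ib_time=1.34):
    """Projected speedup over a node-count-flat RDMA baseline."""
    return ib_time / cxl_time(n)

# Under these assumptions a 1.34x edge erodes as nodes are added and
# disappears near n = (ib_time - base) / per_node = 6.8.
```

A revision along these lines would at least make explicit which coefficient values keep the reported speedups positive at target cluster sizes, even before larger-scale validation exists.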
Circularity Check
No circularity: empirical hardware measurements only
Full rationale
The paper proposes CCCL for CXL-based GPU collectives and reports speedups (1.34× AllGather etc.) solely from direct benchmarks on a TITAN-II + 6-card setup versus 200 Gbps InfiniBand. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described design. Claims rest on external hardware measurements, not on any chain that reduces to its own inputs by construction.
Forward citations
Cited by 3 Pith papers
- CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
  CoCoDiff achieves 3.6× average and 8.4× peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
- TierBPF: Page Migration Admission Control for Tiered Memory via eBPF
  TierBPF uses lightweight eBPF hooks for custom page admission control in tiered memory, delivering up to 17.7% geomean and 75% peak throughput gains across 17 workloads on three systems.
- Hybrid Adaptive Tuning for Tiered Memory Systems
  PTMT is a lightweight framework that automates parameter tuning for memory tiering via hybrid offline database building and online customized reinforcement learning, delivering 14-30% gains over defaults and 32% over ...