Octopus: Enhancing CXL Memory Pods via Sparse Topology

Daniel S. Berger; Fiodar Kazhamiaka; Mark D. Hill; Pantea Zardoshti; Rodrigo Fonseca; Shuwei Teng; Yuhong Zhong

arxiv: 2501.09020 · v3 · submitted 2025-01-15 · 💻 cs.AR

Octopus: Enhancing CXL Memory Pods via Sparse Topology

Yuhong Zhong , Fiodar Kazhamiaka , Pantea Zardoshti , Shuwei Teng , Rodrigo Fonseca , Mark D. Hill , Daniel S. Berger This is my paper

Pith reviewed 2026-05-23 05:10 UTC · model grok-4.3

classification 💻 cs.AR

keywords CXLmemory poolingsparse topologyisland groupingRPC latencyserver cost savingsCXL podsinterconnect design

0 comments

The pith

Octopus uses sparse CXL connections and server islands to scale memory pods without switches while cutting RPC latency and costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing CXL memory pod designs stay small or rely on expensive switches because they assume every server must connect to every pooling device. Octopus breaks this by building a sparse topology where each low-port pooling device links only to a chosen subset of servers. Servers are grouped into islands so that communication stays fast inside each island while overlap between islands supports efficient memory sharing across the pod. The design respects physical limits like 1.5-meter copper cables and scales to 96 servers. Hardware measurements show RPCs run 3.2 times faster than in-rack RDMA and 2.4 times faster than CXL switches, and simulations report 3 to 5.4 percent net server cost savings.

Core claim

Octopus constructs a sparse CXL topology in which each pooling device connects to a carefully chosen subset of servers grouped into islands; this arrangement delivers low-latency intra-island communication, sufficient inter-island overlap for pooling, and scalability to large pods without switches while staying inside 1.5 m cable constraints.

What carries the argument

Island grouping that creates low-latency clusters while allowing controlled overlap between clusters for memory pooling.

If this is right

RPCs complete 3.2 times faster than in-rack RDMA on the hardware prototype.
RPCs complete 2.4 times faster than through CXL switches on the hardware prototype.
Simulated 96-server pods deliver 3 to 5.4 percent net server cost savings.
CXL-switch designs produce a net cost increase in the same simulations.
The topology respects 1.5 m copper cable limits while scaling to 96 servers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparse-island pattern could be tested on other disaggregated interconnects such as PCIe or NVLink to see if similar latency and cost gains appear.
Dynamic adjustment of island boundaries at runtime might further improve overlap for changing workload patterns.
Reducing reliance on switches could lower overall rack power draw even if the paper does not measure power directly.
Standards bodies might consider specifying lower port counts on future CXL devices if sparse topologies prove reliable at scale.

Load-bearing premise

A carefully chosen sparse subset of connections per pooling device combined with island grouping can satisfy low-latency intra-island communication, sufficient inter-island pooling overlap, and physical cable length limits without hidden performance or scalability penalties.

What would settle it

Running the three-server hardware prototype at larger scale, such as 16 servers, and checking whether RPC latency remains 3.2 times faster than RDMA and net cost savings stay positive under measured device characteristics and 1.5 m cables.

Figures

Figures reproduced from arXiv: 2501.09020 by Daniel S. Berger, Fiodar Kazhamiaka, Mark D. Hill, Pantea Zardoshti, Rodrigo Fonseca, Shuwei Teng, Yuhong Zhong.

**Figure 2.** Figure 2: Multi-ported device with two CXL ports. Each port offers x8 CXL lanes. to one PD (minimally-connected Octopus) or multiple PDs (redundantly-connected Octopus). This paper makes the following technical contributions. • First paper to question the assumption that pods must fully-connect hosts and pooling devices • Propose Octopus pod designs with bounded connections and provide algorithmic constructions for… view at source ↗

**Figure 1.** Figure 1: Conventional CXL pod designs assume that all 16 hosts (𝐻0 to 𝐻15) connect to all pooling devices, which requires a still-expensive 16-ported device. Octopus introduces minimally-connected pod designs based on near-commodity 4-port pooling devices, which cuts pooling device cost in half. than the overall pod size. Similarly, we find that communication largely requires pair-wise memory sharing. We propose … view at source ↗

**Figure 3.** Figure 3: Larger CXL pods lead to higher DRAM savings. Besides PDs, another approach to build a pool is a switch. Switches offer multiple CXL ports and can forward CXL packets between them. An example is the XC50256 CXL switch offered by XConn [110]. A CXL pod comprises multiple hosts using multiple PDs and/or switches. In principle, hosts can concurrently access the memory behind a PD or behind a switch. However, w… view at source ↗

**Figure 4.** Figure 4: Die area estimates for PDs with different numbers of CXL ports and DDR5 channels. Note that for visual simplicity we show the logic and network-on-chip (NOC) area as a single block. PD Size 𝑁 = 2 𝑁 = 4 𝑁 = 8 𝑁 = 16 DDR5 channels 2 4 8 12 References [40, 54, 77] [38] [81] Overall die area 14 𝑚𝑚2 30 𝑚𝑚2 69 𝑚𝑚2 181 𝑚𝑚2 Dead silicon 0 𝑚𝑚2 2 𝑚𝑚2 12 𝑚𝑚2 77 𝑚𝑚2 Wafer cost 70% 80% 100% 150% Estimated cost𝑎 $260 $… view at source ↗

**Figure 5.** Figure 5: A 3-rack Octopus configuration. 5.2 Physical Layout The physical layout of an Octopus topology is largely driven by CXL cable lengths and cable routing. CXL 2.0 (PCIe5) systems are typically constrained by insertion loss, which admits a simplified analysis. Under a 36 dB insertion-loss budget at 16 GHz, we allocate 8 dB for pad-to-pin (root complex) and 10 dB for the CXL pooling device. We conservatively … view at source ↗

**Figure 6.** Figure 6: Host software view in fully-connected and Octopus pods. H1 P1 P2... H2 H3 PX+1 H1 P1 P2... H2 H3 PX+1 [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Pair-wise message passing and shuffle 6.1 Software CXL Pod API In a fully-connected topology, memory is interleaved across CXL PDs by default. Specifically, CXL specifies hardware interleaving at 256 Byte granularity across CXL devices connected to a CPU socket [20].2 Figure 6a shows the host memory map in a typical fully-connected topology. Octopus makes individual CXL ports visible to the Operating Sy… view at source ↗

**Figure 9.** Figure 9: Fully-connected CXL pod topologies define a Pareto frontier where cost rises super-linearly with CXL pod size. For medium to large pod sizes, Octopus can achieve much larger pod sizes at equal or lower cost. D [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Additional memory required by Octopus relative to a FC design with the same host count for three production workloads. Note that a need for 5% more memory comparable favorably to enabling 4.5× larger host counts ( [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 12.** Figure 12: Distribution of RPC round-trip latency for small and large messages. CXL’s latency is 3.3× and 1.5× lower than RDMA, and 10× lower than user-space networking. RPC services see median request sizes under 1500 Bytes with resonses under 315 Bytes [94]. At Meta, almost all KVS request sizes fit into 128 Bytes and response sizes are typically around 1-10kB [15]. For small messages, a minimally-connected Octop… view at source ↗

**Figure 13.** Figure 13: Minimally-connected Octopus topologies for the common case of 𝑋 = 8 host ports. (a) visualizes design #2 in [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

**Figure 15.** Figure 15: Redundantly-connected Octopus topology for 𝑁 = 4-ported PDs and the common case of 𝑋 = 8 host ports (design #6 in [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗

**Figure 14.** Figure 14: Minimally-connected Octopus topology for 𝑁 = 2-ported PDs and the common case of 𝑋 = 8 host ports (design #1 in [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 16.** Figure 16: Assuming that wafers cost half as much as in [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Assuming that wafers cost twice as much as in [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗

read the original abstract

The Compute Express Link (CXL) interconnect enables compute "pods" that pool memory across servers to reduce cost and improve efficiency. These pods also facilitate pairwise communication whose needs conflict with pooling. Importantly, existing pod designs are small or require indirection through expensive switches. These conventional designs implicitly assume that pods must fully connect all servers to all CXL pooling devices. This paper breaks with this conventional wisdom by introducing Octopus pods. Octopus directly connects servers to low-port-count CXL pooling devices (e.g., 4 ports) yet scales to large pods without switches by constructing a sparse CXL topology in which each pooling device connects to a carefully chosen subset of servers. Octopus explicitly balances "overlap", where two servers connect to the same pooling device: overlap reduces pooling efficiency but enables low-latency communication. Octopus resolves this tension by grouping servers into "islands" with low-latency intra-island communication and interconnecting islands to favor pooling. We build a three-server CXL pod prototype and simulate scaled pods with 96 servers under measured device characteristics and physical constraints (1.5 m copper cables). On hardware, Octopus RPCs are 3.2x faster than in-rack RDMA and 2.4x faster than CXL switches. In simulation, Octopus achieves net server cost savings of 3-5.4% whereas CXL switches result in a net cost increase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that conventional CXL memory pods require full connectivity or expensive switches, but Octopus enables scalable pods by using low-port-count pooling devices in a sparse topology: servers are grouped into islands for low-latency intra-island communication while islands are interconnected to maintain sufficient pooling overlap, all subject to 1.5 m cable limits. A 3-server hardware prototype demonstrates 3.2x faster RPCs than in-rack RDMA and 2.4x faster than CXL switches; 96-server simulations under measured device traits report 3-5.4% net server cost savings (versus net increases with switches).

Significance. If the central claims hold, the work could meaningfully reduce the cost and complexity of large CXL pods by eliminating switches while preserving both pooling efficiency and low-latency pairwise communication. Strengths include the hardware prototype driven by real device characteristics and physical cable constraints, plus the explicit framing of the overlap-versus-pooling tension. The island-based sparse construction is a concrete alternative to the full-connectivity assumption in prior designs.

major comments (2)

[§4] §4 (Hardware Prototype): The three-server prototype permits near-full connectivity within the 1.5 m cable limit, so the measured 3.2x RPC speedup does not exercise the island-grouping or sparse-subset tension that the 96-server simulation claims to resolve; this leaves the simulation cost savings dependent on an unvalidated extrapolation.
[§5] §5 (Simulation Results): The algorithm or procedure used to select the sparse per-device subsets and to define island boundaries is not described with sufficient detail (no pseudocode, parameters, or validation against cable and overlap constraints), so the reported 3-5.4% savings cannot be independently checked for robustness under the stated physical limits.

minor comments (2)

[Abstract] Abstract: omits any mention of how sparse subsets or island boundaries are chosen, which is load-bearing for the claimed benefits.
[§3] Figure captions and §3 notation for overlap and pooling metrics could be made more precise to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [§4] §4 (Hardware Prototype): The three-server prototype permits near-full connectivity within the 1.5 m cable limit, so the measured 3.2x RPC speedup does not exercise the island-grouping or sparse-subset tension that the 96-server simulation claims to resolve; this leaves the simulation cost savings dependent on an unvalidated extrapolation.

Authors: We agree with the referee that the three-server prototype, given its small scale, permits near-full connectivity within the cable length limit and therefore does not fully exercise the sparse topology or island-grouping mechanism central to the 96-server simulation. The prototype primarily serves to validate the performance of direct CXL connections using real hardware and measured device characteristics under physical constraints. The simulation then applies these traits to larger sparse configurations. This represents a genuine limitation in hardware validation of the scaling claims. In the revised manuscript, we will update §4 to clearly distinguish the prototype's contributions from the simulation, explicitly note the extrapolation involved, and discuss potential avenues for further validation. We will also add a brief sensitivity analysis in the simulation section if space permits. revision: partial
Referee: [§5] §5 (Simulation Results): The algorithm or procedure used to select the sparse per-device subsets and to define island boundaries is not described with sufficient detail (no pseudocode, parameters, or validation against cable and overlap constraints), so the reported 3-5.4% savings cannot be independently checked for robustness under the stated physical limits.

Authors: The referee is correct that the current manuscript lacks detailed description of the algorithm for selecting sparse subsets and defining island boundaries. No pseudocode or explicit parameters are provided, which hinders reproducibility and independent verification of the results under the cable and overlap constraints. We will revise the paper by adding a new subsection in §5 (or an appendix) that includes: (1) pseudocode for the island construction and subset selection procedure, (2) all relevant parameters and thresholds used (e.g., overlap targets, cable length enforcement), and (3) validation checks ensuring the generated topologies respect the 1.5 m limit and maintain required pooling overlap. This will allow readers to reproduce and assess the robustness of the 3-5.4% cost savings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results from hardware and external measurements

full rationale

The paper reports latency and cost numbers directly from a 3-server hardware prototype and from simulations driven by measured device characteristics plus physical cable constraints (1.5 m). No equations, fitted parameters, or topology-construction procedure are shown to be self-definitional or to rename a prediction as a result. The central design choice (sparse island-based topology) is presented as an engineering tradeoff rather than a derived theorem, and quantitative claims do not reduce to the inputs by construction. This is the normal case of a measurement-driven systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on domain assumptions about CXL device port counts and physical cable constraints rather than new fitted parameters or invented entities.

axioms (2)

domain assumption CXL pooling devices have low port counts (e.g., 4 ports)
Explicitly used as the starting point for the sparse topology construction.
domain assumption Physical cable length is limited to 1.5 m copper
Invoked as a constraint in the 96-server simulations.

pith-pipeline@v0.9.0 · 5812 in / 1287 out tokens · 62227 ms · 2026-05-23T05:10:02.529857+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Proxics: an efficient programming model for far memory accelerators
cs.OS 2026-04 conditional novelty 6.0

Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensiv...

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages · cited by 1 Pith paper

[1]

Mpi allgather utilizing cxl shared memory pool in multi-node computing systems

Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shiny- oung Ahn, Tu Tran, Bharath Ramesh, Hari Subramoni, and Dha- baleswar K Panda. Mpi allgather utilizing cxl shared memory pool in multi-node computing systems. In 2024 IEEE International Conference on Big Data (BigData) , pages 332–337. IEEE, 2024

work page 2024
[2]

An examination of cxl memory use cases for in-memory database management systems using sap hana

Minseon Ahn, Thomas Willhalm, Norman May, Donghun Lee, Suprasad Mutalik Desai, Daniel Booss, Jungmin Kim, Navneet Singh, Daniel Ritter, and Oliver Rebholz. An examination of cxl memory use cases for in-memory database management systems using sap hana. Proceedings of the VLDB Endowment , 17(12):3827–3840, 2024

work page 2024
[3]

Logical memory pools: Flexible and local disaggregated memory

Emmanuel Amaro, Stephanie Wang, Aurojit Panda, and Marcos K Aguilera. Logical memory pools: Flexible and local disaggregated memory. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 25–32, 2023

work page 2023
[4]

AMD EPYC 9124 Specifications

AMD. AMD EPYC 9124 Specifications. https://www.techpowerup. com/cpu-specs/epyc-9124.c2917. Accessed: 2025-01-15

work page 2025
[5]

Astera Labs

Inc. Astera Labs. Aries pcie ®/cxl® smart cable modules™ . https: //www.asteralabs.com/products/aries-smart-cable-modules/ , 2024. Product Brief

work page 2024
[6]

Hardware support for cloud database systems in the post-moore’s law era (dagstuhl seminar 24162)

David F Bacon, Carsten Binnig, David Patterson, and Margo Seltzer. Hardware support for cloud database systems in the post-moore’s law era (dagstuhl seminar 24162). Dagstuhl Reports, 14(4):54–84, 2024

work page 2024
[7]

Global combine on mesh architectures with wormhole rout- ing

Michael Barnett, Rick Littlefield, David G Payne, and Robert van de Geijn. Global combine on mesh architectures with wormhole rout- ing. In [1993] Proceedings Seventh International Parallel Processing Symposium, pages 156–162. IEEE, 1993

work page 1993
[8]

So far and yet so near-accelerating distributed joins with cxl

Alexander Baumstark, Marcus Paradies, Kai-Uwe Sattler, Steffen Kläbe, and Stephan Baumann. So far and yet so near-accelerating distributed joins with cxl. In Proceedings of the 20th International Workshop on Data Management on New Hardware , pages 1–9, 2024

work page 2024
[9]

Design tradeoffs in cxl-based memory pools for public cloud platforms

Daniel S Berger, Daniel Ernst, Huaicheng Li, Pantea Zardoshti, Mon- ish Shah, Samir Rajadnya, Scott Lee, Lisa Hsu, Ishwar Agarwal, Mark D Hill, et al. Design tradeoffs in cxl-based memory pools for public cloud platforms. IEEE Micro, 43(2):30–38, 2023

work page 2023
[10]

Slim fly: A cost effective low- diameter network topology

Maciej Besta and Torsten Hoefler. Slim fly: A cost effective low- diameter network topology. In SC’14: proceedings of the international conference for high performance computing, networking, storage and analysis, pages 348–359. IEEE, 2014

work page 2014
[11]

Design Theory, volume 1

Thomas Beth, Dieter Jungnickel, and Hanfried Lenz. Design Theory, volume 1. Cambridge University Press, Cambridge, UK, 2nd edition,

work page
[12]

Covers constructions of combinatorial designs, including BIBDs 12 Octopus: Scalable Low-Cost CXL Memory Pooling using finite projective planes

work page
[13]

A survey of research and practices of network-on-chip

Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of network-on-chip. ACM Computing Surveys (CSUR), 38(1):1–es, 2006

work page 2006
[14]

A {High-Performance} design, implementation, deployment, and evaluation of the slim fly network

Nils Blach, Maciej Besta, Daniele De Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, et al. A {High-Performance} design, implementation, deployment, and evaluation of the slim fly network. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1025–1044, 2024

work page 2024
[15]

R. C. Bose. On balanced incomplete block designs. Annals of Human Genetics, 9:353–399, 1939

work page 1939
[16]

Char- acterizing, modeling, and benchmarking {RocksDB} {Key-Value} workloads at facebook

Zhichao Cao, Siying Dong, Sagar Vemuri, and David HC Du. Char- acterizing, modeling, and benchmarking {RocksDB} {Key-Value} workloads at facebook. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 209–223, 2020

work page 2020
[17]

Hyperscale tiered memory expander spec- ification for compute express link

Prakash Chauhan, Chris Petersen, Brian Morris, and Jerome Glisse. Hyperscale tiered memory expander spec- ification for compute express link. Available at https: //www.opencompute.org/documents/hyperscale-tiered-memory- expander-specification-for-compute-express-link-cxl-1-pdf , 2023. Open Compute Project, Revision 1, Effective October 27, 2023

work page 2023
[18]

The mpi mes- sage passing interface standard

Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The mpi mes- sage passing interface standard. In Programming Environments for Massively Parallel Distributed Systems: Working Conference of the IFIP WG 10.3, April 25–29, 1994 , pages 213–218. Springer, 1994

work page 1994
[19]

Dictionary based cache line compression

Daniel Cohen, Sarel Cohen, Dalit Naor, Daniel Waddington, and Moshik Hershcovitch. Dictionary based cache line compression. In Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, pages 8–14, 2024

work page 2024
[20]

Colbourn and Jeffrey H

Charles J. Colbourn and Jeffrey H. Dinitz, editors. Handbook of Combinatorial Designs. Chapman and Hall/CRC, 2nd edition, 2006

work page 2006
[21]

Compute Express Link (CXL) Specification, Revision 2.0, November 2020

Compute Express Link Consortium. Compute Express Link (CXL) Specification, Revision 2.0, November 2020. Accessed: 2025-03-11

work page 2020
[22]

Algorithmic techniques for the genera- tion and analysis of strongly regular graphs and other combinatorial configurations

DG Corneil and RA Mathon. Algorithmic techniques for the genera- tion and analysis of strongly regular graphs and other combinatorial configurations. In Annals of Discrete Mathematics, volume 2, pages 1–32. Elsevier, 1978

work page 1978
[23]

Mscclang: Microsoft collective communication language

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. Mscclang: Microsoft collective communication language. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 502–514, 2023

work page 2023
[24]

William J. Dally. Performance analysis of k-ary n-cube intercon- nection networks. IEEE transactions on Computers , 39(6):775–785, 1990

work page 1990
[25]

An introduction to the compute express link (cxl) interconnect

Debendra Das Sharma, Robert Blankenship, and Daniel Berger. An introduction to the compute express link (cxl) interconnect. ACM Computing Surveys, 56(11):1–37, 2024

work page 2024
[26]

Seagate Composable Memory Appliance (CMA) Architecture

Mohamad El-Batal. Seagate Composable Memory Appliance (CMA) Architecture. https://www.youtube.com/watch?v=KCgE0WejXl0, June 2024

work page 2024
[27]

Disaggregated memory in the datacenter: A survey

Mohammad Ewais and Paul Chow. Disaggregated memory in the datacenter: A survey. IEEE Access, 11:20688–20712, 2023

work page 2023
[28]

Power provisioning for a warehouse-sized computer

Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. Power provisioning for a warehouse-sized computer. ACM SIGARCH com- puter architecture news, 35(2):13–23, 2007

work page 2007
[29]

R. A. Fisher and F. Yates. The construction of balanced incomplete block designs. Annals of Human Genetics, 9:30–43, 1938

work page 1938
[30]

Fugaku: Japan’s super- computer successor to the k computer

Riken Center for Computational Science. Fugaku: Japan’s super- computer successor to the k computer. Riken Press Release , 2020. Fugaku employs the Tofu interconnect, maintaining a 6D mesh/torus topology for scalability and performance

work page 2020
[31]

Mak- ing kernel bypass practical for the cloud with junction

Joshua Fried, Gohar Irfan Chaudhry, Enrique Saurez, Esha Choukse, Íñigo Goiri, Sameh Elnikety, Rodrigo Fonseca, and Adam Belay. Mak- ing kernel bypass practical for the cloud with junction. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 55–73, 2024

work page 2024
[32]

K computer: The world’s first 10-petaflop super- computer

Fujitsu and RIKEN. K computer: The world’s first 10-petaflop super- computer. Fujitsu Technical Journal, 2011. The K Computer uses the Tofu interconnect, a proprietary 6D mesh/torus network topology

work page 2011
[33]

Acid support for compute express link memory transactions

Ellis Giles and Peter Varman. Acid support for compute express link memory transactions. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 982–995. IEEE, 2024

work page 2024
[34]

Efficient memory disaggregation with infiniswap

Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G Shin. Efficient memory disaggregation with infiniswap. In 14th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 17), pages 649–667, 2017

work page 2017
[35]

Bcube: a high performance, server-centric network architecture for modular data centers

Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. Bcube: a high performance, server-centric network architecture for modular data centers. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, page 63–74, New York, NY, USA, 2009. Association for Computing Machinery

work page 2009
[36]

A cxl-powered database system: Op- portunities and challenges

Yunyan Guo and Guoliang Li. A cxl-powered database system: Op- portunities and challenges. In 2024 IEEE 40th International Conference on Data Engineering (ICDE) , pages 5593–5604. IEEE, 2024

work page 2024
[37]

Dynamic capacity service for improving cxl pooled memory efficiency

Minho Ha, Junhee Ryu, Jungmin Choi, Kwangjin Ko, Sunwoong Kim, Sungwoo Hyun, Donguk Moon, Byungil Koh, Hokyoon Lee, Myoungseo Kim, et al. Dynamic capacity service for improving cxl pooled memory efficiency. IEEE Micro, 43(2):39–47, 2023

work page 2023
[38]

Protean: {VM} allocation service at scale

Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, et al. Protean: {VM} allocation service at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 845–861, 2020

work page 2020
[39]

Design Considerations for CXL Device Hardware Coherency (HDM-DB)

David Hawkins and Matt Bromage. Design Considerations for CXL Device Hardware Coherency (HDM-DB). https://www.youtube.com/ watch?v=2ktM7dPcmqI, 2024

work page 2024
[40]

How generative ai and accelerated compute is creating the next generation liquid cooled data centers with focus on chal- lenges, opportunities and the road ahead

Ali Heydari. How generative ai and accelerated compute is creating the next generation liquid cooled data centers with focus on chal- lenges, opportunities and the road ahead. In 2024 IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Elec- tronic Systems (ITHERM) , Gaylord Rockies, CO, May 2024. Keynote presentation

work page 2024
[41]

Txcocket: an innovative solution for efficient cross-node data transmission enabled by cxl-based shared memory

Tao Huang, Yonggui Liang, Shubao Yu, and Kexin Chen. Txcocket: an innovative solution for efficient cross-node data transmission enabled by cxl-based shared memory. CCF Transactions on High Performance Computing, January 2025. Regular Paper, Published: 22 January 2025

work page 2025
[42]

Pasha: An efficient, scalable database architecture for cxl pods

Yibo Huang, Newton Ni, Vijay Chidambaram, Emmett Witchel, and Dixin Tang. Pasha: An efficient, scalable database architecture for cxl pods. In 15th Annual Conference on Innovative Data Systems Research (CIDR ’25), Amsterdam, The Netherlands, 2025. The University of Texas at Austin. Published under the Creative Commons Attribution 4.0 International (CC-BY ...

work page 2025
[43]

P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait- free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 145–158. USENIX Association, 2010

work page 2010
[44]

The quest for bandwidth and capacity: Memory edition, 2023

Ronen Hyatt. The quest for bandwidth and capacity: Memory edition, 2023. https://www.hpcuserforum.com/wp-content/uploads/ 2023/09/Ronen-Hyatt_UnifabriX_The-Quest-for-Bandwidth-and- Capacity-Memory-Edition_Sept-2023-HPC-UF.pdf

work page 2023
[45]

Towards uni- versally accessible SAT technology

Alexey Ignatiev, Zi Li Tan, and Christos Karamanos. Towards uni- versally accessible SAT technology. In SAT, pages 4:1–4:11, 2024

work page 2024
[46]

CXL Switch for Scalable & Composable Memory Pool- ing/Sharing

JP Jiang. CXL Switch for Scalable & Composable Memory Pool- ing/Sharing. FMS presentation available at https://www.xconn- 13 Berger et al. tech.com/products, 2024

work page 2024
[47]

F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. InProceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) , pages 245–256. IEEE Computer Society, 2011

work page 2011
[48]

New supermicro x14 systems, 2024

Michael Kalodrich. New supermicro x14 systems, 2024. Accessed: 2024-12-18

work page 2024
[49]

CXL 2.0 Switch for a Composable Memory Sys- tem

Jim Kao. CXL 2.0 Switch for a Composable Memory Sys- tem. https://computeexpresslink.org/wp-content/uploads/ 2024/09/Xconn_CXL-2.0-Switch-for-a-Composable-Memory- System_FMS-2024_FINAL.pdf, October 2024

work page 2024
[50]

Lenovo has a cxl memory monster with 128x 128gb ddr5 dimms, 2024

Patrick Kennedy. Lenovo has a cxl memory monster with 128x 128gb ddr5 dimms, 2024. Accessed: 2024-12-18

work page 2024
[51]

Technology- driven, highly-scalable dragonfly topology.ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

John Kim, Wiliam J Dally, Steve Scott, and Dennis Abts. Technology- driven, highly-scalable dragonfly topology.ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

work page 2008
[52]

Flattened butterfly: a cost-efficient topology for high-radix networks

John Kim, William J Dally, and Dennis Abts. Flattened butterfly: a cost-efficient topology for high-radix networks. In Proceedings of the 34th annual international symposium on Computer architecture , pages 126–137, 2007

work page 2007
[53]

Microar- chitecture of a high radix router

John Kim, William J Dally, Brian Towles, and Amit K Gupta. Microar- chitecture of a high radix router. In 32nd International Symposium on Computer Architecture (ISCA’05), pages 420–431. IEEE, 2005

work page 2005
[54]

Dense server design for immersion cooling

Milin Kodnongbua, Zachary Englhardt, Ricardo Bianchini, Rodrigo Fonseca, Alvin Lebeck, Daniel S Berger, Vikram Iyer, Fiodar Kazhami- aka, and Adriana Schulz. Dense server design for immersion cooling. ACM Transactions on Graphics (TOG) , 43(6):1–20, 2024

work page 2024
[55]

Leo cxl smart memory controllers

Astera Labs. Leo cxl smart memory controllers. Available at https://www.asteralabs.com/products/leo-cxl-smart-memory- controllers/, December 2023. Product Brief

work page 2023
[56]

Polarfly: a cost-effective and flexible low-diameter topology

Kartik Lakhotia, Maciej Besta, Laura Monroe, Kelly Isham, Patrick Iff, Torsten Hoefler, and Fabrizio Petrini. Polarfly: a cost-effective and flexible low-diameter topology. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis , pages 1–15. IEEE, 2022

work page 2022
[57]

The part-time parliament

Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, May 1998

work page 1998
[58]

The evolution of compet- itive advantage in the worldwide semiconductor industry

Richard Langlois and Edward Steinmueller. The evolution of compet- itive advantage in the worldwide semiconductor industry. Sources of industrial leadership. Cambridge University Press, Cambridge, UK , 1999

work page 1999
[59]

Elastic use of far memory for in-memory data- base management systems

Donghun Lee, Thomas Willhalm, Minseon Ahn, Suprasad Muta- lik Desai, Daniel Booss, Navneet Singh, Daniel Ritter, Jungmin Kim, and Oliver Rebholz. Elastic use of far memory for in-memory data- base management systems. In Proceedings of the 19th International Workshop on Data Management on New Hardware , pages 35–43, 2023

work page 2023
[60]

Dram scaling challenges and solu- tions

Donghyuk Lee and Onur Mutlu. Dram scaling challenges and solu- tions. IEEE Micro, 42(2):14–25, 2022

work page 2022
[61]

Memtis: Efficient memory tiering with dynamic page clas- sification and page size determination

Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. Memtis: Efficient memory tiering with dynamic page clas- sification and page size determination. In Proceedings of the 29th Symposium on Operating Systems Principles , pages 17–34, 2023

work page 2023
[62]

Fat-trees: Universal networks for hardware- efficient supercomputing

Charles E Leiserson. Fat-trees: Universal networks for hardware- efficient supercomputing. IEEE transactions on Computers , 100(10):892–901, 1985

work page 1985
[63]

Lenovo thinksystem sr860 v3 server, 2024

Lenovo. Lenovo thinksystem sr860 v3 server, 2024. Accessed: 2024- 12-18

work page 2024
[64]

Cxl and the return of scale-up database engines

Alberto Lerner and Gustavo Alonso. Cxl and the return of scale-up database engines. arXiv preprint arXiv:2401.01150, 2024

work page arXiv 2024
[65]

A case against cxl memory pooling

Philip Levis, Kun Lin, and Amy Tai. A case against cxl memory pooling. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 18–24, 2023

work page 2023
[66]

Berger, Lisa Hsu, Daniel Ernst, Pantea Zar- doshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D

Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zar- doshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bian- chini. Pond: CXL-Based Memory Pooling Systems for Cloud Plat- forms. In ASPLOS, 2023

work page 2023
[67]

System-level implications of disaggregated memory

Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F Wenisch. System-level implications of disaggregated memory. In IEEE Inter- national Symposium on High-Performance Comp Architecture , pages 1–12. IEEE, 2012

work page 2012
[68]

Perftest: Infiniband verbs performance tests, 2025

linux rdma. Perftest: Infiniband verbs performance tests, 2025. Ac- cessed: 2025-03-12

work page 2025
[69]

Liskov and J

B. Liskov and J. Cowling. Viewstamped replication revisited. Tech- nical Report MIT-CSAIL-TR-2012-021, MIT, July 2012

work page 2012
[70]

The primary-backup approach to fault-tolerant distributed systems

Brian Liskov and Robert Scheifler. The primary-backup approach to fault-tolerant distributed systems. In Proceedings of the 7th ACM Sym- posium on Operating Systems Principles (SOSP) , pages 48–55. ACM, 1979

work page 1979
[71]

Berger, Marie Nguyen, Xun Jian, Sam H

Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger, Marie Nguyen, Xun Jian, Sam H. Noh, and Huaicheng Li. Systematic CXL memory characterization and performance analysis at scale. In Pro- ceedings of the 2025 ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASP- LOS’25), 2025

work page 2025
[72]

Nvidia collective communications library (nccl).https: //developer.nvidia.com/nccl, 2016

Nathan Luehr. Nvidia collective communications library (nccl).https: //developer.nvidia.com/nccl, 2016. Accessed 1/1/2025

work page 2016
[73]

Hyrax: {Fail-in-Place} server operation in cloud platforms

Jialun Lyu, Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, Savyasachi Samal, Ioannis Manousakis, Lisa Hsu, et al. Hyrax: {Fail-in-Place} server operation in cloud platforms. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) , pages 287–304, 2023

work page 2023
[74]

{HydraRPC}:{RPC} in the {CXL} era

Teng Ma, Zheng Liu, Chengkun Wei, Jialiang Huang, Youwei Zhuo, Haoyu Li, Ning Zhang, Yijin Guan, Dimin Niu, Mingxing Zhang, et al. {HydraRPC}:{RPC} in the {CXL} era. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 387–395, 2024

work page 2024
[75]

You don’t know ’jack’: Cxl fabric orchestration and management

Grant Mackey. You don’t know ’jack’: Cxl fabric orchestration and management. Available at https://files.futurememorystorage.com/ proceedings/2024/20240806_CXLT-102-1_Mackey.pdf, 2024. Pre- sented by Jackrabbit Labs

work page 2024
[76]

Telepathic datacenters: Fast rpcs using shared cxl memory, 2024

Suyash Mahar, Ehsan Hajyjasini, Seungjin Lee, Zifeng Zhang, Mingyao Shen, and Steven Swanson. Telepathic datacenters: Fast rpcs using shared cxl memory, 2024

work page 2024
[77]

Tpp: Trans- parent page placement for cxl-enabled tiered-memory

Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. Tpp: Trans- parent page placement for cxl-enabled tiered-memory. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating...

work page 2023
[78]

Marvell Technology

Inc. Marvell Technology. Structera a 2504 memory-expansion controller. Available at https://www.marvell.com/content/dam/ marvell/en/public-collateral/assets/marvell-structera-a-2504-near- memory-accelerator-product-brief.pdf , 2024. Product Brief, P/N MV-SLA25041-A0-HF350AA-C000

work page 2024
[79]

Memverge unveils world’s first cxl-based multi-server shared memory at isc

MemVerge. Memverge unveils world’s first cxl-based multi-server shared memory at isc. Press release, May 2023. International Super- computing Conference (ISC), Hamburg, Germany

work page 2023
[80]

The power of two choices in randomized load balancing

Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems , 12(10):1094–1104, 2001

work page 2001

Showing first 80 references.

[1] [1]

Mpi allgather utilizing cxl shared memory pool in multi-node computing systems

Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shiny- oung Ahn, Tu Tran, Bharath Ramesh, Hari Subramoni, and Dha- baleswar K Panda. Mpi allgather utilizing cxl shared memory pool in multi-node computing systems. In 2024 IEEE International Conference on Big Data (BigData) , pages 332–337. IEEE, 2024

work page 2024

[2] [2]

An examination of cxl memory use cases for in-memory database management systems using sap hana

Minseon Ahn, Thomas Willhalm, Norman May, Donghun Lee, Suprasad Mutalik Desai, Daniel Booss, Jungmin Kim, Navneet Singh, Daniel Ritter, and Oliver Rebholz. An examination of cxl memory use cases for in-memory database management systems using sap hana. Proceedings of the VLDB Endowment , 17(12):3827–3840, 2024

work page 2024

[3] [3]

Logical memory pools: Flexible and local disaggregated memory

Emmanuel Amaro, Stephanie Wang, Aurojit Panda, and Marcos K Aguilera. Logical memory pools: Flexible and local disaggregated memory. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 25–32, 2023

work page 2023

[4] [4]

AMD EPYC 9124 Specifications

AMD. AMD EPYC 9124 Specifications. https://www.techpowerup. com/cpu-specs/epyc-9124.c2917. Accessed: 2025-01-15

work page 2025

[5] [5]

Astera Labs

Inc. Astera Labs. Aries pcie ®/cxl® smart cable modules™ . https: //www.asteralabs.com/products/aries-smart-cable-modules/ , 2024. Product Brief

work page 2024

[6] [6]

Hardware support for cloud database systems in the post-moore’s law era (dagstuhl seminar 24162)

David F Bacon, Carsten Binnig, David Patterson, and Margo Seltzer. Hardware support for cloud database systems in the post-moore’s law era (dagstuhl seminar 24162). Dagstuhl Reports, 14(4):54–84, 2024

work page 2024

[7] [7]

Global combine on mesh architectures with wormhole rout- ing

Michael Barnett, Rick Littlefield, David G Payne, and Robert van de Geijn. Global combine on mesh architectures with wormhole rout- ing. In [1993] Proceedings Seventh International Parallel Processing Symposium, pages 156–162. IEEE, 1993

work page 1993

[8] [8]

So far and yet so near-accelerating distributed joins with cxl

Alexander Baumstark, Marcus Paradies, Kai-Uwe Sattler, Steffen Kläbe, and Stephan Baumann. So far and yet so near-accelerating distributed joins with cxl. In Proceedings of the 20th International Workshop on Data Management on New Hardware , pages 1–9, 2024

work page 2024

[9] [9]

Design tradeoffs in cxl-based memory pools for public cloud platforms

Daniel S Berger, Daniel Ernst, Huaicheng Li, Pantea Zardoshti, Mon- ish Shah, Samir Rajadnya, Scott Lee, Lisa Hsu, Ishwar Agarwal, Mark D Hill, et al. Design tradeoffs in cxl-based memory pools for public cloud platforms. IEEE Micro, 43(2):30–38, 2023

work page 2023

[10] [10]

Slim fly: A cost effective low- diameter network topology

Maciej Besta and Torsten Hoefler. Slim fly: A cost effective low- diameter network topology. In SC’14: proceedings of the international conference for high performance computing, networking, storage and analysis, pages 348–359. IEEE, 2014

work page 2014

[11] [11]

Design Theory, volume 1

Thomas Beth, Dieter Jungnickel, and Hanfried Lenz. Design Theory, volume 1. Cambridge University Press, Cambridge, UK, 2nd edition,

work page

[12] [12]

Covers constructions of combinatorial designs, including BIBDs 12 Octopus: Scalable Low-Cost CXL Memory Pooling using finite projective planes

work page

[13] [13]

A survey of research and practices of network-on-chip

Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of network-on-chip. ACM Computing Surveys (CSUR), 38(1):1–es, 2006

work page 2006

[14] [14]

A {High-Performance} design, implementation, deployment, and evaluation of the slim fly network

Nils Blach, Maciej Besta, Daniele De Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, et al. A {High-Performance} design, implementation, deployment, and evaluation of the slim fly network. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1025–1044, 2024

work page 2024

[15] [15]

R. C. Bose. On balanced incomplete block designs. Annals of Human Genetics, 9:353–399, 1939

work page 1939

[16] [16]

Char- acterizing, modeling, and benchmarking {RocksDB} {Key-Value} workloads at facebook

Zhichao Cao, Siying Dong, Sagar Vemuri, and David HC Du. Char- acterizing, modeling, and benchmarking {RocksDB} {Key-Value} workloads at facebook. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 209–223, 2020

work page 2020

[17] [17]

Hyperscale tiered memory expander spec- ification for compute express link

Prakash Chauhan, Chris Petersen, Brian Morris, and Jerome Glisse. Hyperscale tiered memory expander spec- ification for compute express link. Available at https: //www.opencompute.org/documents/hyperscale-tiered-memory- expander-specification-for-compute-express-link-cxl-1-pdf , 2023. Open Compute Project, Revision 1, Effective October 27, 2023

work page 2023

[18] [18]

The mpi mes- sage passing interface standard

Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The mpi mes- sage passing interface standard. In Programming Environments for Massively Parallel Distributed Systems: Working Conference of the IFIP WG 10.3, April 25–29, 1994 , pages 213–218. Springer, 1994

work page 1994

[19] [19]

Dictionary based cache line compression

Daniel Cohen, Sarel Cohen, Dalit Naor, Daniel Waddington, and Moshik Hershcovitch. Dictionary based cache line compression. In Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, pages 8–14, 2024

work page 2024

[20] [20]

Colbourn and Jeffrey H

Charles J. Colbourn and Jeffrey H. Dinitz, editors. Handbook of Combinatorial Designs. Chapman and Hall/CRC, 2nd edition, 2006

work page 2006

[21] [21]

Compute Express Link (CXL) Specification, Revision 2.0, November 2020

Compute Express Link Consortium. Compute Express Link (CXL) Specification, Revision 2.0, November 2020. Accessed: 2025-03-11

work page 2020

[22] [22]

Algorithmic techniques for the genera- tion and analysis of strongly regular graphs and other combinatorial configurations

DG Corneil and RA Mathon. Algorithmic techniques for the genera- tion and analysis of strongly regular graphs and other combinatorial configurations. In Annals of Discrete Mathematics, volume 2, pages 1–32. Elsevier, 1978

work page 1978

[23] [23]

Mscclang: Microsoft collective communication language

Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. Mscclang: Microsoft collective communication language. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 502–514, 2023

work page 2023

[24] [24]

William J. Dally. Performance analysis of k-ary n-cube intercon- nection networks. IEEE transactions on Computers , 39(6):775–785, 1990

work page 1990

[25] [25]

An introduction to the compute express link (cxl) interconnect

Debendra Das Sharma, Robert Blankenship, and Daniel Berger. An introduction to the compute express link (cxl) interconnect. ACM Computing Surveys, 56(11):1–37, 2024

work page 2024

[26] [26]

Seagate Composable Memory Appliance (CMA) Architecture

Mohamad El-Batal. Seagate Composable Memory Appliance (CMA) Architecture. https://www.youtube.com/watch?v=KCgE0WejXl0, June 2024

work page 2024

[27] [27]

Disaggregated memory in the datacenter: A survey

Mohammad Ewais and Paul Chow. Disaggregated memory in the datacenter: A survey. IEEE Access, 11:20688–20712, 2023

work page 2023

[28] [28]

Power provisioning for a warehouse-sized computer

Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. Power provisioning for a warehouse-sized computer. ACM SIGARCH com- puter architecture news, 35(2):13–23, 2007

work page 2007

[29] [29]

R. A. Fisher and F. Yates. The construction of balanced incomplete block designs. Annals of Human Genetics, 9:30–43, 1938

work page 1938

[30] [30]

Fugaku: Japan’s super- computer successor to the k computer

Riken Center for Computational Science. Fugaku: Japan’s super- computer successor to the k computer. Riken Press Release , 2020. Fugaku employs the Tofu interconnect, maintaining a 6D mesh/torus topology for scalability and performance

work page 2020

[31] [31]

Mak- ing kernel bypass practical for the cloud with junction

Joshua Fried, Gohar Irfan Chaudhry, Enrique Saurez, Esha Choukse, Íñigo Goiri, Sameh Elnikety, Rodrigo Fonseca, and Adam Belay. Mak- ing kernel bypass practical for the cloud with junction. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 55–73, 2024

work page 2024

[32] [32]

K computer: The world’s first 10-petaflop super- computer

Fujitsu and RIKEN. K computer: The world’s first 10-petaflop super- computer. Fujitsu Technical Journal, 2011. The K Computer uses the Tofu interconnect, a proprietary 6D mesh/torus network topology

work page 2011

[33] [33]

Acid support for compute express link memory transactions

Ellis Giles and Peter Varman. Acid support for compute express link memory transactions. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 982–995. IEEE, 2024

work page 2024

[34] [34]

Efficient memory disaggregation with infiniswap

Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G Shin. Efficient memory disaggregation with infiniswap. In 14th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 17), pages 649–667, 2017

work page 2017

[35] [35]

Bcube: a high performance, server-centric network architecture for modular data centers

Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. Bcube: a high performance, server-centric network architecture for modular data centers. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, page 63–74, New York, NY, USA, 2009. Association for Computing Machinery

work page 2009

[36] [36]

A cxl-powered database system: Op- portunities and challenges

Yunyan Guo and Guoliang Li. A cxl-powered database system: Op- portunities and challenges. In 2024 IEEE 40th International Conference on Data Engineering (ICDE) , pages 5593–5604. IEEE, 2024

work page 2024

[37] [37]

Dynamic capacity service for improving cxl pooled memory efficiency

Minho Ha, Junhee Ryu, Jungmin Choi, Kwangjin Ko, Sunwoong Kim, Sungwoo Hyun, Donguk Moon, Byungil Koh, Hokyoon Lee, Myoungseo Kim, et al. Dynamic capacity service for improving cxl pooled memory efficiency. IEEE Micro, 43(2):39–47, 2023

work page 2023

[38] [38]

Protean: {VM} allocation service at scale

Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, et al. Protean: {VM} allocation service at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 845–861, 2020

work page 2020

[39] [39]

Design Considerations for CXL Device Hardware Coherency (HDM-DB)

David Hawkins and Matt Bromage. Design Considerations for CXL Device Hardware Coherency (HDM-DB). https://www.youtube.com/ watch?v=2ktM7dPcmqI, 2024

work page 2024

[40] [40]

How generative ai and accelerated compute is creating the next generation liquid cooled data centers with focus on chal- lenges, opportunities and the road ahead

Ali Heydari. How generative ai and accelerated compute is creating the next generation liquid cooled data centers with focus on chal- lenges, opportunities and the road ahead. In 2024 IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Elec- tronic Systems (ITHERM) , Gaylord Rockies, CO, May 2024. Keynote presentation

work page 2024

[41] [41]

Txcocket: an innovative solution for efficient cross-node data transmission enabled by cxl-based shared memory

Tao Huang, Yonggui Liang, Shubao Yu, and Kexin Chen. Txcocket: an innovative solution for efficient cross-node data transmission enabled by cxl-based shared memory. CCF Transactions on High Performance Computing, January 2025. Regular Paper, Published: 22 January 2025

work page 2025

[42] [42]

Pasha: An efficient, scalable database architecture for cxl pods

Yibo Huang, Newton Ni, Vijay Chidambaram, Emmett Witchel, and Dixin Tang. Pasha: An efficient, scalable database architecture for cxl pods. In 15th Annual Conference on Innovative Data Systems Research (CIDR ’25), Amsterdam, The Netherlands, 2025. The University of Texas at Austin. Published under the Creative Commons Attribution 4.0 International (CC-BY ...

work page 2025

[43] [43]

P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait- free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 145–158. USENIX Association, 2010

work page 2010

[44] [44]

The quest for bandwidth and capacity: Memory edition, 2023

Ronen Hyatt. The quest for bandwidth and capacity: Memory edition, 2023. https://www.hpcuserforum.com/wp-content/uploads/ 2023/09/Ronen-Hyatt_UnifabriX_The-Quest-for-Bandwidth-and- Capacity-Memory-Edition_Sept-2023-HPC-UF.pdf

work page 2023

[45] [45]

Towards uni- versally accessible SAT technology

Alexey Ignatiev, Zi Li Tan, and Christos Karamanos. Towards uni- versally accessible SAT technology. In SAT, pages 4:1–4:11, 2024

work page 2024

[46] [46]

CXL Switch for Scalable & Composable Memory Pool- ing/Sharing

JP Jiang. CXL Switch for Scalable & Composable Memory Pool- ing/Sharing. FMS presentation available at https://www.xconn- 13 Berger et al. tech.com/products, 2024

work page 2024

[47] [47]

F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. InProceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) , pages 245–256. IEEE Computer Society, 2011

work page 2011

[48] [48]

New supermicro x14 systems, 2024

Michael Kalodrich. New supermicro x14 systems, 2024. Accessed: 2024-12-18

work page 2024

[49] [49]

CXL 2.0 Switch for a Composable Memory Sys- tem

Jim Kao. CXL 2.0 Switch for a Composable Memory Sys- tem. https://computeexpresslink.org/wp-content/uploads/ 2024/09/Xconn_CXL-2.0-Switch-for-a-Composable-Memory- System_FMS-2024_FINAL.pdf, October 2024

work page 2024

[50] [50]

Lenovo has a cxl memory monster with 128x 128gb ddr5 dimms, 2024

Patrick Kennedy. Lenovo has a cxl memory monster with 128x 128gb ddr5 dimms, 2024. Accessed: 2024-12-18

work page 2024

[51] [51]

Technology- driven, highly-scalable dragonfly topology.ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

John Kim, Wiliam J Dally, Steve Scott, and Dennis Abts. Technology- driven, highly-scalable dragonfly topology.ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

work page 2008

[52] [52]

Flattened butterfly: a cost-efficient topology for high-radix networks

John Kim, William J Dally, and Dennis Abts. Flattened butterfly: a cost-efficient topology for high-radix networks. In Proceedings of the 34th annual international symposium on Computer architecture , pages 126–137, 2007

work page 2007

[53] [53]

Microar- chitecture of a high radix router

John Kim, William J Dally, Brian Towles, and Amit K Gupta. Microar- chitecture of a high radix router. In 32nd International Symposium on Computer Architecture (ISCA’05), pages 420–431. IEEE, 2005

work page 2005

[54] [54]

Dense server design for immersion cooling

Milin Kodnongbua, Zachary Englhardt, Ricardo Bianchini, Rodrigo Fonseca, Alvin Lebeck, Daniel S Berger, Vikram Iyer, Fiodar Kazhami- aka, and Adriana Schulz. Dense server design for immersion cooling. ACM Transactions on Graphics (TOG) , 43(6):1–20, 2024

work page 2024

[55] [55]

Leo cxl smart memory controllers

Astera Labs. Leo cxl smart memory controllers. Available at https://www.asteralabs.com/products/leo-cxl-smart-memory- controllers/, December 2023. Product Brief

work page 2023

[56] [56]

Polarfly: a cost-effective and flexible low-diameter topology

Kartik Lakhotia, Maciej Besta, Laura Monroe, Kelly Isham, Patrick Iff, Torsten Hoefler, and Fabrizio Petrini. Polarfly: a cost-effective and flexible low-diameter topology. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis , pages 1–15. IEEE, 2022

work page 2022

[57] [57]

The part-time parliament

Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, May 1998

work page 1998

[58] [58]

The evolution of compet- itive advantage in the worldwide semiconductor industry

Richard Langlois and Edward Steinmueller. The evolution of compet- itive advantage in the worldwide semiconductor industry. Sources of industrial leadership. Cambridge University Press, Cambridge, UK , 1999

work page 1999

[59] [59]

Elastic use of far memory for in-memory data- base management systems

Donghun Lee, Thomas Willhalm, Minseon Ahn, Suprasad Muta- lik Desai, Daniel Booss, Navneet Singh, Daniel Ritter, Jungmin Kim, and Oliver Rebholz. Elastic use of far memory for in-memory data- base management systems. In Proceedings of the 19th International Workshop on Data Management on New Hardware , pages 35–43, 2023

work page 2023

[60] [60]

Dram scaling challenges and solu- tions

Donghyuk Lee and Onur Mutlu. Dram scaling challenges and solu- tions. IEEE Micro, 42(2):14–25, 2022

work page 2022

[61] [61]

Memtis: Efficient memory tiering with dynamic page clas- sification and page size determination

Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. Memtis: Efficient memory tiering with dynamic page clas- sification and page size determination. In Proceedings of the 29th Symposium on Operating Systems Principles , pages 17–34, 2023

work page 2023

[62] [62]

Fat-trees: Universal networks for hardware- efficient supercomputing

Charles E Leiserson. Fat-trees: Universal networks for hardware- efficient supercomputing. IEEE transactions on Computers , 100(10):892–901, 1985

work page 1985

[63] [63]

Lenovo thinksystem sr860 v3 server, 2024

Lenovo. Lenovo thinksystem sr860 v3 server, 2024. Accessed: 2024- 12-18

work page 2024

[64] [64]

Cxl and the return of scale-up database engines

Alberto Lerner and Gustavo Alonso. Cxl and the return of scale-up database engines. arXiv preprint arXiv:2401.01150, 2024

work page arXiv 2024

[65] [65]

A case against cxl memory pooling

Philip Levis, Kun Lin, and Amy Tai. A case against cxl memory pooling. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 18–24, 2023

work page 2023

[66] [66]

Berger, Lisa Hsu, Daniel Ernst, Pantea Zar- doshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D

Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zar- doshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bian- chini. Pond: CXL-Based Memory Pooling Systems for Cloud Plat- forms. In ASPLOS, 2023

work page 2023

[67] [67]

System-level implications of disaggregated memory

Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F Wenisch. System-level implications of disaggregated memory. In IEEE Inter- national Symposium on High-Performance Comp Architecture , pages 1–12. IEEE, 2012

work page 2012

[68] [68]

Perftest: Infiniband verbs performance tests, 2025

linux rdma. Perftest: Infiniband verbs performance tests, 2025. Ac- cessed: 2025-03-12

work page 2025

[69] [69]

Liskov and J

B. Liskov and J. Cowling. Viewstamped replication revisited. Tech- nical Report MIT-CSAIL-TR-2012-021, MIT, July 2012

work page 2012

[70] [70]

The primary-backup approach to fault-tolerant distributed systems

Brian Liskov and Robert Scheifler. The primary-backup approach to fault-tolerant distributed systems. In Proceedings of the 7th ACM Sym- posium on Operating Systems Principles (SOSP) , pages 48–55. ACM, 1979

work page 1979

[71] [71]

Berger, Marie Nguyen, Xun Jian, Sam H

Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger, Marie Nguyen, Xun Jian, Sam H. Noh, and Huaicheng Li. Systematic CXL memory characterization and performance analysis at scale. In Pro- ceedings of the 2025 ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASP- LOS’25), 2025

work page 2025

[72] [72]

Nvidia collective communications library (nccl).https: //developer.nvidia.com/nccl, 2016

Nathan Luehr. Nvidia collective communications library (nccl).https: //developer.nvidia.com/nccl, 2016. Accessed 1/1/2025

work page 2016

[73] [73]

Hyrax: {Fail-in-Place} server operation in cloud platforms

Jialun Lyu, Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, Savyasachi Samal, Ioannis Manousakis, Lisa Hsu, et al. Hyrax: {Fail-in-Place} server operation in cloud platforms. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) , pages 287–304, 2023

work page 2023

[74] [74]

{HydraRPC}:{RPC} in the {CXL} era

Teng Ma, Zheng Liu, Chengkun Wei, Jialiang Huang, Youwei Zhuo, Haoyu Li, Ning Zhang, Yijin Guan, Dimin Niu, Mingxing Zhang, et al. {HydraRPC}:{RPC} in the {CXL} era. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 387–395, 2024

work page 2024

[75] [75]

You don’t know ’jack’: Cxl fabric orchestration and management

Grant Mackey. You don’t know ’jack’: Cxl fabric orchestration and management. Available at https://files.futurememorystorage.com/ proceedings/2024/20240806_CXLT-102-1_Mackey.pdf, 2024. Pre- sented by Jackrabbit Labs

work page 2024

[76] [76]

Telepathic datacenters: Fast rpcs using shared cxl memory, 2024

Suyash Mahar, Ehsan Hajyjasini, Seungjin Lee, Zifeng Zhang, Mingyao Shen, and Steven Swanson. Telepathic datacenters: Fast rpcs using shared cxl memory, 2024

work page 2024

[77] [77]

Tpp: Trans- parent page placement for cxl-enabled tiered-memory

Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. Tpp: Trans- parent page placement for cxl-enabled tiered-memory. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating...

work page 2023

[78] [78]

Marvell Technology

Inc. Marvell Technology. Structera a 2504 memory-expansion controller. Available at https://www.marvell.com/content/dam/ marvell/en/public-collateral/assets/marvell-structera-a-2504-near- memory-accelerator-product-brief.pdf , 2024. Product Brief, P/N MV-SLA25041-A0-HF350AA-C000

work page 2024

[79] [79]

Memverge unveils world’s first cxl-based multi-server shared memory at isc

MemVerge. Memverge unveils world’s first cxl-based multi-server shared memory at isc. Press release, May 2023. International Super- computing Conference (ISC), Hamburg, Germany

work page 2023

[80] [80]

The power of two choices in randomized load balancing

Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems , 12(10):1094–1104, 2001

work page 2001