arxiv: 2501.09020 · v3 · submitted 2025-01-15 · 💻 cs.AR

Octopus: Enhancing CXL Memory Pods via Sparse Topology

Pith reviewed 2026-05-23 05:10 UTC · model grok-4.3

classification 💻 cs.AR
keywords CXL · memory pooling · sparse topology · island grouping · RPC latency · server cost savings · CXL pods · interconnect design

The pith

Octopus uses sparse CXL connections and server islands to scale memory pods without switches while cutting RPC latency and costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing CXL memory pod designs stay small or rely on expensive switches because they assume every server must connect to every pooling device. Octopus breaks this by building a sparse topology where each low-port pooling device links only to a chosen subset of servers. Servers are grouped into islands so that communication stays fast inside each island while overlap between islands supports efficient memory sharing across the pod. The design respects physical limits like 1.5-meter copper cables and scales to 96 servers. Hardware measurements show RPCs run 3.2 times faster than in-rack RDMA and 2.4 times faster than CXL switches, and simulations report 3 to 5.4 percent net server cost savings.

Core claim

Octopus constructs a sparse CXL topology in which each pooling device connects to a carefully chosen subset of servers grouped into islands; this arrangement delivers low-latency intra-island communication, sufficient inter-island overlap for pooling, and scalability to large pods without switches while staying inside 1.5 m cable constraints.

What carries the argument

Island grouping that creates low-latency clusters while allowing controlled overlap between clusters for memory pooling.

If this is right

  • RPCs complete 3.2 times faster than in-rack RDMA on the hardware prototype.
  • RPCs complete 2.4 times faster than through CXL switches on the hardware prototype.
  • Simulated 96-server pods deliver 3 to 5.4 percent net server cost savings.
  • CXL-switch designs produce a net cost increase in the same simulations.
  • The topology respects 1.5 m copper cable limits while scaling to 96 servers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-island pattern could be tested on other disaggregated interconnects such as PCIe or NVLink to see if similar latency and cost gains appear.
  • Dynamic adjustment of island boundaries at runtime might further improve overlap for changing workload patterns.
  • Reducing reliance on switches could lower overall rack power draw even if the paper does not measure power directly.
  • Standards bodies might consider specifying lower port counts on future CXL devices if sparse topologies prove reliable at scale.

Load-bearing premise

A carefully chosen sparse subset of connections per pooling device combined with island grouping can satisfy low-latency intra-island communication, sufficient inter-island pooling overlap, and physical cable length limits without hidden performance or scalability penalties.

What would settle it

Scaling the hardware prototype from three servers to a larger pod, such as 16 servers, and checking whether RPC latency remains 3.2 times faster than RDMA and whether net cost savings stay positive under measured device characteristics and 1.5 m cables.

Figures

Figures reproduced from arXiv: 2501.09020 by Daniel S. Berger, Fiodar Kazhamiaka, Mark D. Hill, Pantea Zardoshti, Rodrigo Fonseca, Shuwei Teng, Yuhong Zhong.

Figure 2: Multi-ported device with two CXL ports. Each port offers x8 CXL lanes.
    … to one PD (minimally-connected Octopus) or multiple PDs (redundantly-connected Octopus). This paper makes the following technical contributions. • First paper to question the assumption that pods must fully-connect hosts and pooling devices • Propose Octopus pod designs with bounded connections and provide algorithmic constructions for…
Figure 1: Conventional CXL pod designs assume that all 16 hosts (𝐻0 to 𝐻15) connect to all pooling devices, which requires a still-expensive 16-ported device. Octopus introduces minimally-connected pod designs based on near-commodity 4-port pooling devices, which cuts pooling device cost in half.
    … than the overall pod size. Similarly, we find that communication largely requires pair-wise memory sharing. We propose …
Figure 3: Larger CXL pods lead to higher DRAM savings.
    Besides PDs, another approach to build a pool is a switch. Switches offer multiple CXL ports and can forward CXL packets between them. An example is the XC50256 CXL switch offered by XConn [110]. A CXL pod comprises multiple hosts using multiple PDs and/or switches. In principle, hosts can concurrently access the memory behind a PD or behind a switch. However, w…
Figure 4: Die area estimates for PDs with different numbers of CXL ports and DDR5 channels. Note that for visual simplicity we show the logic and network-on-chip (NOC) area as a single block.
    PD size           𝑁 = 2    𝑁 = 4    𝑁 = 8    𝑁 = 16
    DDR5 channels     2        4        8        12
    References        [40, 54, 77] [38] [81]
    Overall die area  14 mm²   30 mm²   69 mm²   181 mm²
    Dead silicon      0 mm²    2 mm²    12 mm²   77 mm²
    Wafer cost        70%      80%      100%     150%
    Estimated cost    $260 $…
Figure 5: A 3-rack Octopus configuration.
    5.2 Physical Layout. The physical layout of an Octopus topology is largely driven by CXL cable lengths and cable routing. CXL 2.0 (PCIe5) systems are typically constrained by insertion loss, which admits a simplified analysis. Under a 36 dB insertion-loss budget at 16 GHz, we allocate 8 dB for pad-to-pin (root complex) and 10 dB for the CXL pooling device. We conservatively …
Figure 6: Host software view in fully-connected and Octopus pods.
Figure 7: Pair-wise message passing and shuffle
    6.1 Software CXL Pod API. In a fully-connected topology, memory is interleaved across CXL PDs by default. Specifically, CXL specifies hardware interleaving at 256 Byte granularity across CXL devices connected to a CPU socket [20]. Figure 6a shows the host memory map in a typical fully-connected topology. Octopus makes individual CXL ports visible to the Operating Sy…
Figure 9: Fully-connected CXL pod topologies define a Pareto frontier where cost rises super-linearly with CXL pod size. For medium to large pod sizes, Octopus can achieve much larger pod sizes at equal or lower cost.
Figure 10: Additional memory required by Octopus relative to a FC design with the same host count for three production workloads. Note that a need for 5% more memory compares favorably to enabling 4.5× larger host counts (…
Figure 12: Distribution of RPC round-trip latency for small and large messages. CXL’s latency is 3.3× and 1.5× lower than RDMA, and 10× lower than user-space networking.
    RPC services see median request sizes under 1500 Bytes with responses under 315 Bytes [94]. At Meta, almost all KVS request sizes fit into 128 Bytes and response sizes are typically around 1-10kB [15]. For small messages, a minimally-connected Octop…
Figure 13: Minimally-connected Octopus topologies for the common case of 𝑋 = 8 host ports. (a) visualizes design #2 in …
Figure 15: Redundantly-connected Octopus topology for 𝑁 = 4-ported PDs and the common case of 𝑋 = 8 host ports (design #6 in …
Figure 14: Minimally-connected Octopus topology for 𝑁 = 2-ported PDs and the common case of 𝑋 = 8 host ports (design #1 in …
Figure 16: Assuming that wafers cost half as much as in …
Figure 17: Assuming that wafers cost twice as much as in …
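
The Figure 7 excerpt mentions hardware interleaving at 256 Byte granularity across the CXL devices attached to a socket. The sketch below illustrates what such interleaving means for address placement. It assumes a simple round-robin (modulo) mapping, which is a simplification for illustration only; the function name `device_for_address` is hypothetical and real CXL address decoding is configured in hardware decoders rather than computed like this.

```python
INTERLEAVE_BYTES = 256  # granularity quoted in the Figure 7 excerpt


def device_for_address(addr, num_devices, granularity=INTERLEAVE_BYTES):
    """Map a byte offset in an interleaved range to (device index, offset on that device).

    Assumption: consecutive 256-byte chunks rotate round-robin over the devices.
    """
    chunk = addr // granularity          # which 256-byte chunk the address falls in
    device = chunk % num_devices         # chunks rotate across devices
    local_chunk = chunk // num_devices   # how many chunks this device already holds
    return device, local_chunk * granularity + addr % granularity


# Example with four pooling devices: neighbouring chunks land on different devices.
for addr in (0, 255, 256, 1024, 1030):
    print(addr, device_for_address(addr, num_devices=4))
```

Under this model a large contiguous buffer is spread over every device, which is why exposing individual ports to the operating system, as the excerpt describes for Octopus, changes how software can place memory on a specific pooling device.
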
Original abstract

The Compute Express Link (CXL) interconnect enables compute "pods" that pool memory across servers to reduce cost and improve efficiency. These pods also facilitate pairwise communication whose needs conflict with pooling. Importantly, existing pod designs are small or require indirection through expensive switches. These conventional designs implicitly assume that pods must fully connect all servers to all CXL pooling devices. This paper breaks with this conventional wisdom by introducing Octopus pods. Octopus directly connects servers to low-port-count CXL pooling devices (e.g., 4 ports) yet scales to large pods without switches by constructing a sparse CXL topology in which each pooling device connects to a carefully chosen subset of servers. Octopus explicitly balances "overlap", where two servers connect to the same pooling device: overlap reduces pooling efficiency but enables low-latency communication. Octopus resolves this tension by grouping servers into "islands" with low-latency intra-island communication and interconnecting islands to favor pooling. We build a three-server CXL pod prototype and simulate scaled pods with 96 servers under measured device characteristics and physical constraints (1.5 m copper cables). On hardware, Octopus RPCs are 3.2x faster than in-rack RDMA and 2.4x faster than CXL switches. In simulation, Octopus achieves net server cost savings of 3-5.4% whereas CXL switches result in a net cost increase.
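
To make the notion of "overlap" concrete, the following sketch builds one small sparse topology and counts, for every pair of servers, how many pooling devices they share. It is an illustration only: it uses a textbook perfect difference set modulo 13 (a combinatorial design of the kind covered by the design-theory references in this paper's bibliography), not the construction the authors describe, and the names `build_topology`, `overlap`, and `ports_per_host` are hypothetical.

```python
from itertools import combinations

# Assumption: a textbook perfect difference set modulo 13 (projective plane of
# order 3), used only as an example of a sparse design with 4-port devices.
NUM_HOSTS = 13
DIFFERENCE_SET = (0, 1, 3, 9)  # four offsets, so each device has four ports


def build_topology(num_hosts=NUM_HOSTS, offsets=DIFFERENCE_SET):
    """Return {device id: set of host ids attached to that device}."""
    return {d: {(d + k) % num_hosts for k in offsets} for d in range(num_hosts)}


def overlap(topology, a, b):
    """Number of pooling devices that hosts a and b are both attached to."""
    return sum(1 for hosts in topology.values() if a in hosts and b in hosts)


def ports_per_host(topology, num_hosts=NUM_HOSTS):
    """Number of CXL ports each host consumes in this topology."""
    return [sum(1 for hosts in topology.values() if h in hosts)
            for h in range(num_hosts)]


topology = build_topology()
pair_overlaps = {overlap(topology, a, b)
                 for a, b in combinations(range(NUM_HOSTS), 2)}
print(pair_overlaps)             # distinct overlap values over all host pairs
print(ports_per_host(topology))  # ports used per host
```

In this example every pair of the 13 servers shares exactly one 4-port device while each server uses only four ports, whereas full connectivity would need 13-port devices. Octopus, as the abstract describes it, deliberately varies overlap instead of keeping it uniform: higher inside an island for communication, lower across islands for pooling.
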

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that conventional CXL memory pods require full connectivity or expensive switches, but Octopus enables scalable pods by using low-port-count pooling devices in a sparse topology: servers are grouped into islands for low-latency intra-island communication while islands are interconnected to maintain sufficient pooling overlap, all subject to 1.5 m cable limits. A 3-server hardware prototype demonstrates 3.2x faster RPCs than in-rack RDMA and 2.4x faster than CXL switches; 96-server simulations under measured device traits report 3-5.4% net server cost savings (versus net increases with switches).

Significance. If the central claims hold, the work could meaningfully reduce the cost and complexity of large CXL pods by eliminating switches while preserving both pooling efficiency and low-latency pairwise communication. Strengths include the hardware prototype driven by real device characteristics and physical cable constraints, plus the explicit framing of the overlap-versus-pooling tension. The island-based sparse construction is a concrete alternative to the full-connectivity assumption in prior designs.

major comments (2)
  1. [§4] §4 (Hardware Prototype): The three-server prototype permits near-full connectivity within the 1.5 m cable limit, so the measured 3.2x RPC speedup does not exercise the island-grouping or sparse-subset tension that the 96-server simulation claims to resolve; this leaves the simulation cost savings dependent on an unvalidated extrapolation.
  2. [§5] §5 (Simulation Results): The algorithm or procedure used to select the sparse per-device subsets and to define island boundaries is not described with sufficient detail (no pseudocode, parameters, or validation against cable and overlap constraints), so the reported 3-5.4% savings cannot be independently checked for robustness under the stated physical limits.
minor comments (2)
  1. [Abstract] Abstract: omits any mention of how sparse subsets or island boundaries are chosen, which is load-bearing for the claimed benefits.
  2. [§3] Figure captions and §3 notation for overlap and pooling metrics could be made more precise to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Hardware Prototype): The three-server prototype permits near-full connectivity within the 1.5 m cable limit, so the measured 3.2x RPC speedup does not exercise the island-grouping or sparse-subset tension that the 96-server simulation claims to resolve; this leaves the simulation cost savings dependent on an unvalidated extrapolation.

    Authors: We agree with the referee that the three-server prototype, given its small scale, permits near-full connectivity within the cable length limit and therefore does not fully exercise the sparse topology or island-grouping mechanism central to the 96-server simulation. The prototype primarily serves to validate the performance of direct CXL connections using real hardware and measured device characteristics under physical constraints. The simulation then applies these traits to larger sparse configurations. This represents a genuine limitation in hardware validation of the scaling claims. In the revised manuscript, we will update §4 to clearly distinguish the prototype's contributions from the simulation, explicitly note the extrapolation involved, and discuss potential avenues for further validation. We will also add a brief sensitivity analysis in the simulation section if space permits. revision: partial

  2. Referee: [§5] §5 (Simulation Results): The algorithm or procedure used to select the sparse per-device subsets and to define island boundaries is not described with sufficient detail (no pseudocode, parameters, or validation against cable and overlap constraints), so the reported 3-5.4% savings cannot be independently checked for robustness under the stated physical limits.

    Authors: The referee is correct that the current manuscript lacks detailed description of the algorithm for selecting sparse subsets and defining island boundaries. No pseudocode or explicit parameters are provided, which hinders reproducibility and independent verification of the results under the cable and overlap constraints. We will revise the paper by adding a new subsection in §5 (or an appendix) that includes: (1) pseudocode for the island construction and subset selection procedure, (2) all relevant parameters and thresholds used (e.g., overlap targets, cable length enforcement), and (3) validation checks ensuring the generated topologies respect the 1.5 m limit and maintain required pooling overlap. This will allow readers to reproduce and assess the robustness of the 3-5.4% cost savings. revision: yes
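
The validation checks promised in response 2 could take roughly the following shape. This is a sketch under stated assumptions, not the authors' procedure: the Manhattan-distance cable model, the coordinate inputs, the `min_overlap` threshold, and the function names are all hypothetical. Only the 1.5 m limit comes from the paper.

```python
from itertools import combinations

MAX_CABLE_M = 1.5  # copper cable limit stated in the paper


def cable_length(host_pos, device_pos):
    """Assumed cable model: Manhattan distance between (x, y, z) positions in metres."""
    return sum(abs(h - d) for h, d in zip(host_pos, device_pos))


def validate(links, host_pos, device_pos, min_overlap=0):
    """Return a list of violations for a candidate topology.

    links: iterable of (host, device) pairs.
    min_overlap: hypothetical lower bound on shared devices for every host pair.
    """
    problems = []
    attached = {}
    for host, dev in links:
        attached.setdefault(host, set()).add(dev)
        length = cable_length(host_pos[host], device_pos[dev])
        if length > MAX_CABLE_M:
            problems.append(f"{host}-{dev}: cable {length:.2f} m exceeds {MAX_CABLE_M} m")
    for a, b in combinations(sorted(host_pos), 2):
        shared = len(attached.get(a, set()) & attached.get(b, set()))
        if shared < min_overlap:
            problems.append(f"{a}/{b}: overlap {shared} below {min_overlap}")
    return problems


# Made-up positions: H2 sits too far from P0 for a 1.5 m cable.
host_pos = {"H0": (0.0, 0.0, 0.0), "H1": (0.0, 0.0, 0.5), "H2": (0.0, 0.0, 2.5)}
device_pos = {"P0": (0.2, 0.0, 0.25)}
links = [("H0", "P0"), ("H1", "P0"), ("H2", "P0")]
problems = validate(links, host_pos, device_pos, min_overlap=1)
print(problems)
```

A check of this kind would let a reader confirm that each generated topology is physically buildable before trusting the cost figures derived from it.
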

Circularity Check

0 steps flagged

No significant circularity; empirical results from hardware and external measurements

full rationale

The paper reports latency and cost numbers directly from a 3-server hardware prototype and from simulations driven by measured device characteristics plus physical cable constraints (1.5 m). No equations, fitted parameters, or topology-construction procedure are shown to be self-definitional or to rename a prediction as a result. The central design choice (sparse island-based topology) is presented as an engineering tradeoff rather than a derived theorem, and quantitative claims do not reduce to the inputs by construction. This is the normal case of a measurement-driven systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on domain assumptions about CXL device port counts and physical cable constraints rather than new fitted parameters or invented entities.

axioms (2)
  • domain assumption CXL pooling devices have low port counts (e.g., 4 ports)
    Explicitly used as the starting point for the sparse topology construction.
  • domain assumption Physical cable length is limited to 1.5 m copper
    Invoked as a constraint in the 96-server simulations.

pith-pipeline@v0.9.0 · 5812 in / 1287 out tokens · 62227 ms · 2026-05-23T05:10:02.529857+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Proxics: an efficient programming model for far memory accelerators

    cs.OS 2026-04 conditional novelty 6.0

    Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensiv...

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages · cited by 1 Pith paper

  1. [1]

    Mpi allgather utilizing cxl shared memory pool in multi-node computing systems

    Hooyoung Ahn, Seonyoung Kim, Yoomi Park, Woojong Han, Shinyoung Ahn, Tu Tran, Bharath Ramesh, Hari Subramoni, and Dhabaleswar K Panda. Mpi allgather utilizing cxl shared memory pool in multi-node computing systems. In 2024 IEEE International Conference on Big Data (BigData), pages 332–337. IEEE, 2024

  2. [2]

    An examination of cxl memory use cases for in-memory database management systems using sap hana

    Minseon Ahn, Thomas Willhalm, Norman May, Donghun Lee, Suprasad Mutalik Desai, Daniel Booss, Jungmin Kim, Navneet Singh, Daniel Ritter, and Oliver Rebholz. An examination of cxl memory use cases for in-memory database management systems using sap hana. Proceedings of the VLDB Endowment , 17(12):3827–3840, 2024

  3. [3]

    Logical memory pools: Flexible and local disaggregated memory

    Emmanuel Amaro, Stephanie Wang, Aurojit Panda, and Marcos K Aguilera. Logical memory pools: Flexible and local disaggregated memory. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 25–32, 2023

  4. [4]

    AMD EPYC 9124 Specifications

    AMD. AMD EPYC 9124 Specifications. https://www.techpowerup.com/cpu-specs/epyc-9124.c2917. Accessed: 2025-01-15

  5. [5]

    Astera Labs

    Astera Labs, Inc. Aries pcie®/cxl® smart cable modules™. https://www.asteralabs.com/products/aries-smart-cable-modules/, 2024. Product Brief

  6. [6]

    Hardware support for cloud database systems in the post-moore’s law era (dagstuhl seminar 24162)

    David F Bacon, Carsten Binnig, David Patterson, and Margo Seltzer. Hardware support for cloud database systems in the post-moore’s law era (dagstuhl seminar 24162). Dagstuhl Reports, 14(4):54–84, 2024

  7. [7]

    Global combine on mesh architectures with wormhole routing

    Michael Barnett, Rick Littlefield, David G Payne, and Robert van de Geijn. Global combine on mesh architectures with wormhole routing. In [1993] Proceedings Seventh International Parallel Processing Symposium, pages 156–162. IEEE, 1993

  8. [8]

    So far and yet so near-accelerating distributed joins with cxl

    Alexander Baumstark, Marcus Paradies, Kai-Uwe Sattler, Steffen Kläbe, and Stephan Baumann. So far and yet so near-accelerating distributed joins with cxl. In Proceedings of the 20th International Workshop on Data Management on New Hardware , pages 1–9, 2024

  9. [9]

    Design tradeoffs in cxl-based memory pools for public cloud platforms

    Daniel S Berger, Daniel Ernst, Huaicheng Li, Pantea Zardoshti, Monish Shah, Samir Rajadnya, Scott Lee, Lisa Hsu, Ishwar Agarwal, Mark D Hill, et al. Design tradeoffs in cxl-based memory pools for public cloud platforms. IEEE Micro, 43(2):30–38, 2023

  10. [10]

    Slim fly: A cost effective low-diameter network topology

    Maciej Besta and Torsten Hoefler. Slim fly: A cost effective low-diameter network topology. In SC’14: proceedings of the international conference for high performance computing, networking, storage and analysis, pages 348–359. IEEE, 2014

  11. [11]

    Design Theory, volume 1

    Thomas Beth, Dieter Jungnickel, and Hanfried Lenz. Design Theory, volume 1. Cambridge University Press, Cambridge, UK, 2nd edition. Covers constructions of combinatorial designs, including BIBDs using finite projective planes

  13. [13]

    A survey of research and practices of network-on-chip

    Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of network-on-chip. ACM Computing Surveys (CSUR), 38(1):1–es, 2006

  14. [14]

    A {High-Performance} design, implementation, deployment, and evaluation of the slim fly network

    Nils Blach, Maciej Besta, Daniele De Sensi, Jens Domke, Hussein Harake, Shigang Li, Patrick Iff, Marek Konieczny, Kartik Lakhotia, Ales Kubicek, et al. A {High-Performance} design, implementation, deployment, and evaluation of the slim fly network. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1025–1044, 2024

  15. [15]

    R. C. Bose. On balanced incomplete block designs. Annals of Human Genetics, 9:353–399, 1939

  16. [16]

    Characterizing, modeling, and benchmarking {RocksDB} {Key-Value} workloads at facebook

    Zhichao Cao, Siying Dong, Sagar Vemuri, and David HC Du. Characterizing, modeling, and benchmarking {RocksDB} {Key-Value} workloads at facebook. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 209–223, 2020

  17. [17]

    Hyperscale tiered memory expander specification for compute express link

    Prakash Chauhan, Chris Petersen, Brian Morris, and Jerome Glisse. Hyperscale tiered memory expander specification for compute express link. Available at https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf, 2023. Open Compute Project, Revision 1, Effective October 27, 2023

  18. [18]

    The mpi message passing interface standard

    Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The mpi message passing interface standard. In Programming Environments for Massively Parallel Distributed Systems: Working Conference of the IFIP WG 10.3, April 25–29, 1994, pages 213–218. Springer, 1994

  19. [19]

    Dictionary based cache line compression

    Daniel Cohen, Sarel Cohen, Dalit Naor, Daniel Waddington, and Moshik Hershcovitch. Dictionary based cache line compression. In Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, pages 8–14, 2024

  20. [20]

    Handbook of Combinatorial Designs

    Charles J. Colbourn and Jeffrey H. Dinitz, editors. Handbook of Combinatorial Designs. Chapman and Hall/CRC, 2nd edition, 2006

  21. [21]

    Compute Express Link (CXL) Specification, Revision 2.0, November 2020

    Compute Express Link Consortium. Compute Express Link (CXL) Specification, Revision 2.0, November 2020. Accessed: 2025-03-11

  22. [22]

    Algorithmic techniques for the generation and analysis of strongly regular graphs and other combinatorial configurations

    DG Corneil and RA Mathon. Algorithmic techniques for the generation and analysis of strongly regular graphs and other combinatorial configurations. In Annals of Discrete Mathematics, volume 2, pages 1–32. Elsevier, 1978

  23. [23]

    Mscclang: Microsoft collective communication language

    Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. Mscclang: Microsoft collective communication language. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 502–514, 2023

  24. [24]

    William J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE transactions on Computers, 39(6):775–785, 1990

  25. [25]

    An introduction to the compute express link (cxl) interconnect

    Debendra Das Sharma, Robert Blankenship, and Daniel Berger. An introduction to the compute express link (cxl) interconnect. ACM Computing Surveys, 56(11):1–37, 2024

  26. [26]

    Seagate Composable Memory Appliance (CMA) Architecture

    Mohamad El-Batal. Seagate Composable Memory Appliance (CMA) Architecture. https://www.youtube.com/watch?v=KCgE0WejXl0, June 2024

  27. [27]

    Disaggregated memory in the datacenter: A survey

    Mohammad Ewais and Paul Chow. Disaggregated memory in the datacenter: A survey. IEEE Access, 11:20688–20712, 2023

  28. [28]

    Power provisioning for a warehouse-sized computer

    Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. Power provisioning for a warehouse-sized computer. ACM SIGARCH computer architecture news, 35(2):13–23, 2007

  29. [29]

    R. A. Fisher and F. Yates. The construction of balanced incomplete block designs. Annals of Human Genetics, 9:30–43, 1938

  30. [30]

    Fugaku: Japan’s supercomputer successor to the k computer

    Riken Center for Computational Science. Fugaku: Japan’s supercomputer successor to the k computer. Riken Press Release, 2020. Fugaku employs the Tofu interconnect, maintaining a 6D mesh/torus topology for scalability and performance

  31. [31]

    Making kernel bypass practical for the cloud with junction

    Joshua Fried, Gohar Irfan Chaudhry, Enrique Saurez, Esha Choukse, Íñigo Goiri, Sameh Elnikety, Rodrigo Fonseca, and Adam Belay. Making kernel bypass practical for the cloud with junction. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 55–73, 2024

  32. [32]

    K computer: The world’s first 10-petaflop supercomputer

    Fujitsu and RIKEN. K computer: The world’s first 10-petaflop supercomputer. Fujitsu Technical Journal, 2011. The K Computer uses the Tofu interconnect, a proprietary 6D mesh/torus network topology

  33. [33]

    Acid support for compute express link memory transactions

    Ellis Giles and Peter Varman. Acid support for compute express link memory transactions. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 982–995. IEEE, 2024

  34. [34]

    Efficient memory disaggregation with infiniswap

    Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G Shin. Efficient memory disaggregation with infiniswap. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 649–667, 2017

  35. [35]

    Bcube: a high performance, server-centric network architecture for modular data centers

    Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. Bcube: a high performance, server-centric network architecture for modular data centers. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM ’09, page 63–74, New York, NY, USA, 2009. Association for Computing Machinery

  36. [36]

    A cxl-powered database system: Opportunities and challenges

    Yunyan Guo and Guoliang Li. A cxl-powered database system: Opportunities and challenges. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 5593–5604. IEEE, 2024

  37. [37]

    Dynamic capacity service for improving cxl pooled memory efficiency

    Minho Ha, Junhee Ryu, Jungmin Choi, Kwangjin Ko, Sunwoong Kim, Sungwoo Hyun, Donguk Moon, Byungil Koh, Hokyoon Lee, Myoungseo Kim, et al. Dynamic capacity service for improving cxl pooled memory efficiency. IEEE Micro, 43(2):39–47, 2023

  38. [38]

    Protean: {VM} allocation service at scale

    Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, et al. Protean: {VM} allocation service at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 845–861, 2020

  39. [39]

    Design Considerations for CXL Device Hardware Coherency (HDM-DB)

    David Hawkins and Matt Bromage. Design Considerations for CXL Device Hardware Coherency (HDM-DB). https://www.youtube.com/watch?v=2ktM7dPcmqI, 2024

  40. [40]

    How generative ai and accelerated compute is creating the next generation liquid cooled data centers with focus on challenges, opportunities and the road ahead

    Ali Heydari. How generative ai and accelerated compute is creating the next generation liquid cooled data centers with focus on challenges, opportunities and the road ahead. In 2024 IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Gaylord Rockies, CO, May 2024. Keynote presentation

  41. [41]

    Txcocket: an innovative solution for efficient cross-node data transmission enabled by cxl-based shared memory

    Tao Huang, Yonggui Liang, Shubao Yu, and Kexin Chen. Txcocket: an innovative solution for efficient cross-node data transmission enabled by cxl-based shared memory. CCF Transactions on High Performance Computing, January 2025. Regular Paper, Published: 22 January 2025

  42. [42]

    Pasha: An efficient, scalable database architecture for cxl pods

    Yibo Huang, Newton Ni, Vijay Chidambaram, Emmett Witchel, and Dixin Tang. Pasha: An efficient, scalable database architecture for cxl pods. In 15th Annual Conference on Innovative Data Systems Research (CIDR ’25), Amsterdam, The Netherlands, 2025. The University of Texas at Austin. Published under the Creative Commons Attribution 4.0 International (CC-BY ...

  43. [43]

    P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 145–158. USENIX Association, 2010

  44. [44]

    The quest for bandwidth and capacity: Memory edition, 2023

    Ronen Hyatt. The quest for bandwidth and capacity: Memory edition, 2023. https://www.hpcuserforum.com/wp-content/uploads/2023/09/Ronen-Hyatt_UnifabriX_The-Quest-for-Bandwidth-and-Capacity-Memory-Edition_Sept-2023-HPC-UF.pdf

  45. [45]

    Towards universally accessible SAT technology

    Alexey Ignatiev, Zi Li Tan, and Christos Karamanos. Towards universally accessible SAT technology. In SAT, pages 4:1–4:11, 2024

  46. [46]

    CXL Switch for Scalable & Composable Memory Pooling/Sharing

    JP Jiang. CXL Switch for Scalable & Composable Memory Pooling/Sharing. FMS presentation available at https://www.xconn-tech.com/products, 2024

  47. [47]

    F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 245–256. IEEE Computer Society, 2011

  48. [48]

    New supermicro x14 systems, 2024

    Michael Kalodrich. New supermicro x14 systems, 2024. Accessed: 2024-12-18

  49. [49]

    CXL 2.0 Switch for a Composable Memory System

    Jim Kao. CXL 2.0 Switch for a Composable Memory System. https://computeexpresslink.org/wp-content/uploads/2024/09/Xconn_CXL-2.0-Switch-for-a-Composable-Memory-System_FMS-2024_FINAL.pdf, October 2024

  50. [50]

    Lenovo has a cxl memory monster with 128x 128gb ddr5 dimms, 2024

    Patrick Kennedy. Lenovo has a cxl memory monster with 128x 128gb ddr5 dimms, 2024. Accessed: 2024-12-18

  51. [51]

    Technology-driven, highly-scalable dragonfly topology

    John Kim, William J Dally, Steve Scott, and Dennis Abts. Technology-driven, highly-scalable dragonfly topology. ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

  52. [52]

    John Kim, William J Dally, and Dennis Abts. Flattened butterfly: a cost-efficient topology for high-radix networks. In Proceedings of the 34th annual international symposium on Computer architecture, pages 126–137, 2007

  53. [53]

    John Kim, William J Dally, Brian Towles, and Amit K Gupta. Microarchitecture of a high radix router. In 32nd International Symposium on Computer Architecture (ISCA’05), pages 420–431. IEEE, 2005

  54. [54]

    Milin Kodnongbua, Zachary Englhardt, Ricardo Bianchini, Rodrigo Fonseca, Alvin Lebeck, Daniel S Berger, Vikram Iyer, Fiodar Kazhamiaka, and Adriana Schulz. Dense server design for immersion cooling. ACM Transactions on Graphics (TOG), 43(6):1–20, 2024

  55. [55]

    Astera Labs. Leo cxl smart memory controllers. Available at https://www.asteralabs.com/products/leo-cxl-smart-memory-controllers/, December 2023. Product Brief

  56. [56]

    Kartik Lakhotia, Maciej Besta, Laura Monroe, Kelly Isham, Patrick Iff, Torsten Hoefler, and Fabrizio Petrini. Polarfly: a cost-effective and flexible low-diameter topology. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15. IEEE, 2022

  57. [57]

    Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, May 1998

  58. [58]

    Richard Langlois and Edward Steinmueller. The evolution of competitive advantage in the worldwide semiconductor industry. Sources of industrial leadership. Cambridge University Press, Cambridge, UK, 1999

  59. [59]

    Donghun Lee, Thomas Willhalm, Minseon Ahn, Suprasad Mutalik Desai, Daniel Booss, Navneet Singh, Daniel Ritter, Jungmin Kim, and Oliver Rebholz. Elastic use of far memory for in-memory database management systems. In Proceedings of the 19th International Workshop on Data Management on New Hardware, pages 35–43, 2023

  60. [60]

    Donghyuk Lee and Onur Mutlu. Dram scaling challenges and solutions. IEEE Micro, 42(2):14–25, 2022

  61. [61]

    Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. Memtis: Efficient memory tiering with dynamic page classification and page size determination. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 17–34, 2023

  62. [62]

    Charles E Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE transactions on Computers, 100(10):892–901, 1985

  63. [63]

    Lenovo. Lenovo thinksystem sr860 v3 server, 2024. Accessed: 2024-12-18

  64. [64]

    Alberto Lerner and Gustavo Alonso. Cxl and the return of scale-up database engines. arXiv preprint arXiv:2401.01150, 2024

  65. [65]

    Philip Levis, Kun Lin, and Amy Tai. A case against cxl memory pooling. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 18–24, 2023

  66. [66]

    Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, and Ricardo Bianchini. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In ASPLOS, 2023

  67. [67]

    Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F Wenisch. System-level implications of disaggregated memory. In IEEE International Symposium on High-Performance Comp Architecture, pages 1–12. IEEE, 2012

  68. [68]

    linux rdma. Perftest: Infiniband verbs performance tests, 2025. Accessed: 2025-03-12

  69. [69]

    B. Liskov and J. Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT, July 2012

  70. [70]

    Brian Liskov and Robert Scheifler. The primary-backup approach to fault-tolerant distributed systems. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP), pages 48–55. ACM, 1979

  71. [71]

    Jinshu Liu, Hamid Hadian, Yuyue Wang, Daniel S. Berger, Marie Nguyen, Xun Jian, Sam H. Noh, and Huaicheng Li. Systematic CXL memory characterization and performance analysis at scale. In Proceedings of the 2025 ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’25), 2025

  72. [72]

    Nathan Luehr. Nvidia collective communications library (nccl). https://developer.nvidia.com/nccl, 2016. Accessed 1/1/2025

  73. [73]

    Jialun Lyu, Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, Savyasachi Samal, Ioannis Manousakis, Lisa Hsu, et al. Hyrax: Fail-in-Place server operation in cloud platforms. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 287–304, 2023

  74. [74]

    Teng Ma, Zheng Liu, Chengkun Wei, Jialiang Huang, Youwei Zhuo, Haoyu Li, Ning Zhang, Yijin Guan, Dimin Niu, Mingxing Zhang, et al. HydraRPC: RPC in the CXL era. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 387–395, 2024

  75. [75]

    Grant Mackey. You don’t know ’jack’: Cxl fabric orchestration and management. Available at https://files.futurememorystorage.com/proceedings/2024/20240806_CXLT-102-1_Mackey.pdf, 2024. Presented by Jackrabbit Labs

  76. [76]

    Suyash Mahar, Ehsan Hajyjasini, Seungjin Lee, Zifeng Zhang, Mingyao Shen, and Steven Swanson. Telepathic datacenters: Fast rpcs using shared cxl memory, 2024

  77. [77]

    Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. Tpp: Transparent page placement for cxl-enabled tiered-memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating...

  78. [78]

    Marvell Technology, Inc. Structera a 2504 memory-expansion controller. Available at https://www.marvell.com/content/dam/marvell/en/public-collateral/assets/marvell-structera-a-2504-near-memory-accelerator-product-brief.pdf, 2024. Product Brief, P/N MV-SLA25041-A0-HF350AA-C000

  79. [79]

    MemVerge. Memverge unveils world’s first cxl-based multi-server shared memory at isc. Press release, May 2023. International Supercomputing Conference (ISC), Hamburg, Germany

  80. [80]

    Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001
