CXL-ClusterSim: Modeling CXL-based Disaggregated Memory Cluster for Pooling and Sharing using gem5 and SST

Hoa Nguyen; Jason Lowe-Power; Kaustav Goswami; Maryam Babaie; Venkatesh Akella

arxiv: 2605.27745 · v1 · pith:3QTA4JHKnew · submitted 2026-05-26 · 💻 cs.AR

Pith reviewed 2026-06-29 14:49 UTC · model grok-4.3

classification 💻 cs.AR

keywords: CXL, disaggregated memory, simulation, gem5, SST, memory pooling, computer architecture, full-system simulation

The pith

CXL-ClusterSim combines gem5 and SST to simulate CXL disaggregated memory clusters at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CXL-ClusterSim as a new simulation framework for modeling systems with disaggregated memory using the CXL protocol. It integrates the gem5 simulator, which provides detailed system modeling, with the Structural Simulation Toolkit for handling larger parallel simulations. This approach aims to help explore design choices in memory pooling and sharing without the need for physical hardware prototypes. Architects can use it to study performance tradeoffs in large AI and cloud computing setups where DRAM is often underutilized.

Core claim

CXL-ClusterSim is a full-system simulation framework that merges gem5 for high-fidelity modeling of individual nodes with SST for scalable parallel execution across clusters, enabling evaluation of CXL-based memory disaggregation for improved resource utilization.

What carries the argument

The CXL-ClusterSim framework integrates gem5 and SST to model CXL protocol behavior in disaggregated memory clusters.

Load-bearing premise

The integration of gem5 and SST can preserve modeling fidelity for CXL behavior and cluster interactions without major accuracy loss or slowdown.

What would settle it

Running the same workload on both CXL-ClusterSim and a real CXL-enabled hardware cluster and finding large discrepancies in measured latency or throughput.
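A comparison of this kind could be scored with a simple error metric. The sketch below is a minimal illustration, not a procedure taken from the paper: the function names, the latency pairs, and the 10% tolerance are all hypothetical placeholders chosen for the example.

```python
from statistics import mean


def relative_error(simulated: float, measured: float) -> float:
    """Signed relative error of a simulated value against a hardware measurement."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return (simulated - measured) / measured


def mean_absolute_percentage_error(pairs) -> float:
    """Mean absolute percentage error over (simulated, measured) pairs."""
    return 100.0 * mean(abs(relative_error(s, m)) for s, m in pairs)


# Hypothetical placeholder numbers, NOT results from the paper:
# (simulated, measured) remote-load latency in nanoseconds.
latency_pairs = [(250.0, 240.0), (410.0, 400.0), (640.0, 800.0)]

TOLERANCE_PERCENT = 10.0  # assumed acceptance threshold, for illustration only

mape = mean_absolute_percentage_error(latency_pairs)
verdict = "within" if mape <= TOLERANCE_PERCENT else "outside"
print(f"MAPE = {mape:.1f}% ({verdict} the assumed tolerance)")
```

An average can hide a single large outlier (the third pair here is off by 20%), so a real validation would probably report the per-workload errors alongside any aggregate.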

Figures

Figures reproduced from arXiv: 2605.27745 by Hoa Nguyen, Jason Lowe-Power, Kaustav Goswami, Maryam Babaie, Venkatesh Akella.

**Figure 2.** System-level view of the remote memory ranges.

**Figure 3.** Ideal simulation timeline. …

**Figure 4.** The simulation timeline for CXL-ClusterSim. …

**Figure 5.** Statistics used to validate the correctness of CXL …

**Figure 6.** The total bandwidth was measured at each of the memory controllers. There were 8 system nodes in the cluster …

**Figure 7.** The bandwidth reported by each of the STREAM …

**Figure 8.** Reported host system statistics when the number …

**Figure 9.** Bandwidth reported by the two system nodes and …

**Figure 10.** NPB workloads' relative IPC of NUMA-Local …

**Figure 11.** Share of retired memory instructions served by …

**Figure 12.** Reported IPC of GAPBS kernels in CXL-ClusterSim, when the same unweighted synthetic graph was shared across all the kernels. We compare the IPC of CXL-ClusterSim's experiments with a single system running one kernel without additional latencies.

… pressure and switch modeling, which likely understates tail-latency growth and interference under heavy co-tenancy; we leave detailed switch/fabric and credit-bas…

Original abstract

Large-scale AI training and inference require hundreds of gigabytes to terabytes of DRAM with high peak-to-average utilization ratios, resulting in overprovisioning. In cloud computing, DRAM constitutes a significant share of the cost. Yet, as shown by recent articles, DRAM is heavily underutilized. Memory disaggregation is a solution to both of these problems. With the advent of the CXL protocol, there is renewed interest in designing and optimizing computing systems with disaggregated memory. However, at present, there are limited simulation tools available for exploring the design space and evaluating the performance tradeoffs in computer systems with disaggregated memory. In this paper, we propose CXL-ClusterSim, a full-system modeling and simulation framework that combines the gem5 simulator, for fidelity, with the Structural Simulation Toolkit (SST), for parallel simulation. We outline the challenges in creating this simulation infrastructure and present a design that is scalable, flexible, and reasonably fast, to help computer architects explore the design space of CXL-based disaggregated memory and identify new opportunities for hardware/software codesign and performance optimization.
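The link the abstract draws between high peak-to-average ratios and overprovisioning can be made concrete with a little arithmetic. The sketch below uses made-up demand numbers (not data from the paper) and an idealized pool that ignores access latency, any local DRAM floor, and fabric limits. It compares sizing every node for its own peak against sizing one shared pool for the peak of the aggregate demand.

```python
# Hypothetical per-node DRAM demand in GiB (rows: nodes, columns: time steps).
# Each node peaks at a different time, which is what makes pooling pay off.
demand = [
    [100, 400, 120, 110],
    [380, 110, 100, 130],
    [120, 100, 390, 100],
    [110, 120, 110, 410],
]

# Without pooling, every node must be provisioned for its own peak.
per_node_provisioned = sum(max(node) for node in demand)

# With an idealized shared pool, capacity is sized for the peak of the sum.
aggregate = [sum(step) for step in zip(*demand)]
pooled_provisioned = max(aggregate)

savings = 1 - pooled_provisioned / per_node_provisioned
print(f"per-node provisioning: {per_node_provisioned} GiB")
print(f"pooled provisioning:   {pooled_provisioned} GiB")
print(f"capacity saved:        {savings:.0%}")
```

The saving shrinks as the per-node peaks become correlated in time, so the benefit depends on the workload mix rather than on pooling alone.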

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CXL-ClusterSim proposes a gem5-SST hybrid for CXL disaggregated memory but supplies no validation data or fidelity checks on the integration.

The main takeaway is that the authors have put together a new simulation framework called CXL-ClusterSim that pairs gem5's detailed modeling with SST's parallel capabilities to study CXL-based memory pools and sharing. That combination for this specific workload is the actual new element.

They do a reasonable job laying out the motivation around DRAM overprovisioning in AI systems and the current shortage of tools for exploring disaggregation. Naming the integration challenges and aiming for a scalable, flexible setup shows they have thought through the practical side of building such a thing.

The soft spots are clear and central. The text gives no numbers on simulation speed, no comparisons against standalone gem5, and no checks that CXL link behavior, coherence, or cluster-scale timing survive the coupling. The stress-test note is accurate here: any synchronization layer added for parallelism risks changing exactly the latency and contention effects the framework is meant to capture. Without those results the claim that it is reasonably fast and useful for design-space exploration stays unverified.
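The synchronization concern can be illustrated with a generic model of barrier-synchronized parallel simulation. This is a minimal sketch under stated assumptions, not the mechanism used by CXL-ClusterSim and not an SST or gem5 API: it assumes partitions exchange messages only at multiples of a fixed `quantum`, and the function names and tick values are hypothetical.

```python
def delivery_time(send_time: int, link_latency: int, quantum: int) -> int:
    """Earliest tick at which the receiver can observe a cross-partition message.

    Assumption: a message sent during the window [k*quantum, (k+1)*quantum)
    is handed over at the barrier that closes that window, so it cannot be
    observed before then.
    """
    ideal = send_time + link_latency
    next_barrier = (send_time // quantum + 1) * quantum
    return max(ideal, next_barrier)


def max_timing_error(link_latency: int, quantum: int, send_times) -> int:
    """Largest extra delay the barrier adds on top of the modeled link latency."""
    return max(
        delivery_time(t, link_latency, quantum) - (t + link_latency)
        for t in send_times
    )


LINK_LATENCY = 100              # hypothetical link latency in ticks
send_times = range(0, 1000, 7)  # arbitrary send instants

for quantum in (50, 100, 400):
    err = max_timing_error(LINK_LATENCY, quantum, send_times)
    print(f"quantum={quantum:>3} ticks -> max added latency {err} ticks")
```

In this model the barrier adds no error while the quantum does not exceed the link latency, which is the usual lookahead condition for conservative parallel simulation. A larger quantum speeds up the simulation but inflates the observed latency, which is the trade-off the letter asks the authors to quantify.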

This paper is aimed at computer architects who need infrastructure to model memory disaggregation. Someone looking for a ready tool with proven accuracy will not find it yet. It could still be worth a serious referee if the full manuscript adds implementation details and basic validation runs, because the underlying problem is real and the hybrid approach is a plausible direction even if the current writeup is early-stage.

I would send it out for review with the expectation that validation sections get added.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes CXL-ClusterSim, a full-system simulation framework that integrates the gem5 simulator (for modeling fidelity) with the Structural Simulation Toolkit (SST) (for parallel execution) to enable exploration of CXL-based disaggregated memory clusters for pooling and sharing. It outlines integration challenges and claims the resulting design is scalable, flexible, and reasonably fast for computer architects studying hardware/software co-design opportunities in memory disaggregation.

Significance. A validated implementation of this framework could address the current scarcity of simulation tools for CXL disaggregated memory systems and support design-space exploration at cluster scale. However, the manuscript supplies no validation data, performance measurements, fidelity comparisons to standalone gem5, or evidence that the gem5-SST coupling preserves CXL link, coherence, and memory-pool semantics, so the practical significance cannot yet be assessed.

major comments (2)

[Abstract / §1] Abstract and §1 (Introduction): The central claim that the framework is 'scalable, flexible, and reasonably fast' while preserving 'sufficient modeling fidelity for CXL protocol behavior and cluster-scale interactions' is unsupported; the text contains no simulation speed results, accuracy metrics versus a pure-gem5 baseline, or timing-accuracy data for CXL transactions.
[Design / Integration description] The integration layer between gem5 and SST is identified as the least-secured point for preserving transaction ordering, latency distributions, and bandwidth contention, yet no section demonstrates that the chosen synchronization or abstraction mechanisms maintain these properties at cluster scale.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript describing CXL-ClusterSim. The work focuses on the design of a gem5-SST integration framework for CXL disaggregated memory, including integration challenges. We acknowledge that the manuscript does not include quantitative validation, speed results, or fidelity metrics, as it is primarily a design and architecture paper rather than an evaluation study. We will revise the text to ensure claims are appropriately qualified.

Referee: [Abstract / §1] Abstract and §1 (Introduction): The central claim that the framework is 'scalable, flexible, and reasonably fast' while preserving 'sufficient modeling fidelity for CXL protocol behavior and cluster-scale interactions' is unsupported; the text contains no simulation speed results, accuracy metrics versus a pure-gem5 baseline, or timing-accuracy data for CXL transactions.

Authors: We agree that these claims lack supporting quantitative data in the manuscript. The paper's scope is the proposal of the framework architecture and discussion of integration challenges, not empirical evaluation. We will revise the abstract and §1 to remove or qualify the unsupported claims (e.g., stating design goals rather than demonstrated properties) and add a note that performance and fidelity evaluations are planned future work. revision: yes

Referee: [Design / Integration description] The integration layer between gem5 and SST is identified as the least-secured point for preserving transaction ordering, latency distributions, and bandwidth contention, yet no section demonstrates that the chosen synchronization or abstraction mechanisms maintain these properties at cluster scale.

Authors: The manuscript describes the integration mechanisms chosen to address ordering, latency, and contention. However, we acknowledge the absence of any demonstration or empirical evidence that these mechanisms preserve the required properties at cluster scale. We will expand the relevant design section with additional detail on the synchronization approach and its rationale, while explicitly noting the lack of validation and the assumptions involved. Full demonstration would require new experiments. revision: partial

Standing simulated objections (not resolved)

Providing empirical validation data, speed measurements, accuracy metrics, or cluster-scale demonstrations of semantic preservation, as no such experiments or results exist in the current work.

Circularity Check

0 steps flagged

No circularity: tool-construction paper with no derivations or predictions

The paper is a description of a proposed simulation framework (CXL-ClusterSim) that integrates two established external simulators (gem5 and SST). No equations, fitted parameters, predictions, or derivation chains exist in the text. The central claim is a design proposal for scalability and flexibility, not a result derived from inputs by construction. All load-bearing elements are engineering choices justified by reference to the capabilities of the base tools rather than self-referential reduction. This matches the default expectation of no circularity for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the established capabilities of two pre-existing simulators rather than new fitted parameters or invented entities.

axioms (2)

Domain assumption: gem5 provides high-fidelity modeling of individual CPU and memory systems. The abstract relies on gem5's known properties for the fidelity component of the hybrid framework.

Domain assumption: SST supports scalable parallel simulation of large systems. The abstract relies on SST's known properties for the parallel and scalability component.


[1] [1]

CXL®4.0 Specification

“CXL®4.0 Specification. ” [Online]. Available: https:// computeexpresslink.org/cxl-specification/"

[2] [2]

CXL®Specification

“CXL®Specification. ” [Online]. Available: https://computeexpresslink. org/cxl-specification/"

[3] [3]

GPT-4 architecture, datasets, costs and more leaked,

“GPT-4 architecture, datasets, costs and more leaked, ” 2020. [Online]. Available: https://the-decoder.com/ gpt-4-architecture-datasets-costs-and-more-leaked/"

2020

[4] [4]

Compute Express Link (CXL): All you need to know,

“Compute Express Link (CXL): All you need to know, ” 2024. [Online]. Available: https://www.rambus.com/blogs/compute-express-link/"

2024

[5] [5]

Memory disaggregation: why now and what are the challenges,

M. K. Aguilera, E. Amaro, N. Amit, E. Hunhoff, A. Yelam, and G. Zellweger, “Memory disaggregation: why now and what are the challenges, ”SIGOPS Oper. Syst. Rev., vol. 57, no. 1, p. 38–46, jun

[6] [6]

Available: https://doi.org/10.1145/3606557.3606563

[Online]. Available: https://doi.org/10.1145/3606557.3606563

work page doi:10.1145/3606557.3606563

[7] [7]

pd-gem5: Simulation Infrastructure for Parallel/Distributed Computer Systems ,

M. Alian, D. Kim, and N. Sung Kim, “ pd-gem5: Simulation Infrastructure for Parallel/Distributed Computer Systems , ”IEEE Computer Architecture Letters, vol. 15, no. 01, pp. 41–44, Jan. 2016. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/LCA. 2015.2438295

work page doi:10.1109/lca 2016

[8] [8]

Xerxes: Extensive exploration of scalable hardware systems with CXL-Based simulation framework,

Y. An, S. Yi, B. Mao, Q. Li, M. Zhang, D. Zhou, K. Zhou, N. Xiao, G. Sun, Y. Luo, and J. Zhang, “Xerxes: Extensive exploration of scalable hardware systems with CXL-Based simulation framework, ” in24th USENIX Conference on File and Storage Technologies (FAST 26). Santa Clara, CA: USENIX Association, Feb. 2026, pp. 329–345. [Online]. Available: https: //ww...

2026

[9] [9]

The NAS Parallel Benchmarks,

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiberet al., “The NAS Parallel Benchmarks, ”The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991

1991

[10] [10]

The GAP Benchmark Suite

S. Beamer, K. Asanović, and D. Patterson, “The gap benchmark suite, ” 2017. [Online]. Available: https://arxiv.org/abs/1508.03619

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [11]

Enabling reproducible and agile full-system simulation,

B. R. Bruce, A. Akram, H. Nguyen, K. Roarty, M. Samani, M. Friborz, T. Reddy, M. D. Sinclair, and J. Lowe-Power, “Enabling reproducible and agile full-system simulation, ” in2021 IEEE International Sym- posium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2021, pp. 183–193

2021

[12] [12]

Starnuma: Mitigating numa challenges with memory pooling,

A. Cho and A. Daglis, “Starnuma: Mitigating numa challenges with memory pooling, ” inProceedings of the 2024 57th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’24. IEEE Press, 2024, p. 997–1012. [Online]. Available: https: //doi.org/10.1109/MICRO61859.2024.00077

work page doi:10.1109/micro61859.2024.00077 2024

[13] [13]

Ai and memory wall,

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, “Ai and memory wall, ”IEEE Micro, 2024

2024

[14] [14]

Direct access, {High- Performance} memory disaggregation with {DirectCXL},

D. Gouk, S. Lee, M. Kwon, and M. Jung, “Direct access, {High- Performance} memory disaggregation with {DirectCXL}, ” in2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 287–294

2022

[15] [15]

Simulating DRAM controllers for future system architecture exploration,

A. Hansson, N. Agarwal, A. Kolli, T. F. Wenisch, and A. N. Udipi, “Simulating DRAM controllers for future system architecture exploration, ” in2014 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014, Monterey, CA, USA, March 23-25, 2014. IEEE Computer Society, 2014, pp. 201–210. [Online]. Available: https://doi.org/1...

work page doi:10.1109/ispass.2014.6844484 2014

[16] [16]

Sst + gem5 = a scalable simulation infrastructure for high performance computing,

M. Hsieh, K. Pedretti, J. Meng, A. Coskun, M. Levenhagen, and A. Rodrigues, “Sst + gem5 = a scalable simulation infrastructure for high performance computing, ” inProceedings of the 5th International ICST Conference on Simulation Tools and Techniques, ser. SIMUTOOLS ’12. Brussels, BEL: ICST (Institute for Computer Sciences, Social- Informatics and Telecom...

2012

[17] [17]

OpenCAPI (Open Coherent Accelerator Processor Interface),

IBM, “OpenCAPI (Open Coherent Accelerator Processor Interface), ”

[18] [18]

Available: https://docs.kernel.org/userspace-api/ accelerators/ocxl.html

[Online]. Available: https://docs.kernel.org/userspace-api/ accelerators/ocxl.html

[19] [19]

Gen-z zmmu and memory interleave„

G.-Z. Interconnect, “Gen-z zmmu and memory interleave„ ” Gen-Z Consortium, Tech. Rep., July 2017. [Online]. Available: https://genzconsortium.org/wp-content/uploads/2018/05/ Gen-Z-MMUand-Memory-Interleave.pdf

2017

[20] [20]

Pond: Cxl-based memory pooling systems for cloud platforms,

H. Li, D. S. Berger, L. Hsu, D. Ernst, P. Zardoshti, S. Novakovic, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura, and R. Bianchini, “Pond: Cxl-based memory pooling systems for cloud platforms, ” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser...

work page doi:10.1145/3575693.3578835 2023

[21] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch, “System-level implications of disaggregated memory,” in IEEE International Symposium on High-Performance Comp Architecture (HPCA). IEEE, 2012, pp. 1–12

[22] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj, G. Black, G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillon, L. Chen, N. Derumigny, S. Diestelhorst, W. Elsasser, C. Escuin, M. Fariborz, A. Farmahini-Farahani, P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, A. Gutierr...

2020

[23] S.-L. Lu, T. Karnik, G. Srinivasa, K.-Y. Chao, D. Carmean, and J. Held, “Scaling the "memory wall",” in Proceedings of the International Conference on Computer-Aided Design, ser. ICCAD ’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 271–272. [Online]. Available: https://doi.org/10.1145/2429384.2429437

[24] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, and P. Chauhan, “Tpp: Transparent page placement for cxl-enabled tiered-memory,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023. New...
doi:10.1145/3582016.3582063, 2023

[25] J. D. McCalpin, “Stream: Sustainable memory bandwidth in high performance computers,” University of Virginia, Charlottesville, Virginia, Tech. Rep., 1991-2007, a continually updated technical report. [Online]. Available: http://www.cs.virginia.edu/stream/

[26] ——, “Memory bandwidth and machine balance in current high performance computers,” IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995

[27] MICRON, “Famfs Shared Memory Filesystem Framework - User Space Repo,” Tech. Rep., 2024. [Online]. Available: https://github.com/cxl-micron-reskit/famfs

[28] A. Mohammad, U. Darbaz, G. Dozsa, S. Diestelhorst, D. Kim, and N. S. Kim, “dist-gem5: Distributed simulation of computer clusters,” in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2017, pp. 153–162

[29] H. Nguyen and J. Lowe-Power, “gem5/sst integration 2021: Scaling full-system simulations.” [Online]. Available: https://www.gem5.org/events/isca-2022#:~:text=gem5/SST%20Integration%202021%3A%20Scaling%20Full%2Dsystem%20Simulations

[30] A. Puri, K. Bellamkonda, K. Narreddy, J. Jose, V. Tamarapalli, and V. Narayanan, “Dracksim: Simulating cxl-enabled large-scale disaggregated memory systems,” in Proceedings of the 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, ser. SIGSIM-PADS ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 3–14. [Online]. ...
doi:10.1145/3615979.3656059, 2024

[31] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “The structural simulation toolkit,” SIGMETRICS Perform. Eval. Rev., vol. 38, no. 4, pp. 37–42, Mar. 2011. [Online]. Available: https://doi.org/10.1145/1964218.1964225

[32] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “Dramsim2: A cycle accurate memory system simulator,” IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, 2011

[33] D. D. Sharma, “Compute express link,” CXL Consortium White Paper, 2019

[34] ——, “Compute express link®: An open industry-standard interconnect enabling heterogeneous data-centric computing,” in 2022 IEEE Symposium on High-Performance Interconnects (HOTI), 2022, pp. 5–12

[35] D. D. Sharma, R. Blankenship, and D. S. Berger, “An introduction to the compute express link (cxl) interconnect,” 2024

[36] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, C. Song, J. Huang, H. Ji, S. Agarwal, J. Lou, I. Jeong, R. Wang, J. H. Ahn, T. Xu, and N. S. Kim, “Demystifying cxl memory with genuine cxl-ready systems and devices,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machi...
doi:10.1145/3613424.3614256, 2023

[37] N. Tampouratzis, I. Papaefstathiou, A. Nikitakis, A. Brokalakis, S. Andrianakis, A. Dollas, M. Marcon, and E. Plebani, “A novel, highly integrated simulator for parallel and distributed systems,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 17, no. 1, pp. 1–28, 2020

[38] L. Wang, X. Zhang, S. Wang, Z. Jiang, T. Lu, M. Chen, S. Luo, and K. Huang, “Asynchronous memory access unit: Exploiting massive parallelism for far memory access,” ACM Trans. Archit. Code Optim., vol. 21, no. 3, Sep. 2024. [Online]. Available: https://doi.org/10.1145/3663479

[39] Y. Wang, L. Wu, S. Gao, Y. Tang, J. Luo, Z. Wang, Y. Ou, D. Dong, N. Xiao, and M. Lai, “Cohet: A cxl-driven coherent heterogeneous computing framework with hardware-calibrated full-system simulation,” 2026. [Online]. Available: https://arxiv.org/abs/2511.23011

[40] Y. Wang, L. Wu, W. Hong, Y. Ou, Z. Wang, S. Gao, J. Zhang, S. Ma, D. Dong, X. Qi, M. Lai, and N. Xiao, “Cxl-dmsim: A full-system cxl disaggregated memory simulator with comprehensive silicon validation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 45, no. 4, pp. 1787–1801, 2026.