
arxiv: 2604.06668 · v1 · submitted 2026-04-08 · 💻 cs.AR · cs.DC

SwarmIO: Towards 100 Million IOPS SSD Emulation for Next-generation GPU-centric Storage Systems

Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords SSD emulator · GPU-initiated I/O · high IOPS storage · GPU-centric systems · storage performance modeling · vector search · I/O emulation

The pith

SwarmIO models IOPS-optimized SSDs at up to 40 MIOPS with 303.9x speedup for GPU-centric storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SwarmIO is proposed as an SSD emulator tailored for GPU-centric storage systems that use GPU-initiated I/O and target ultra-high random-read IOPS. Existing emulators struggle in three places: ingesting massive request streams from GPUs, emulating the GPU-initiated I/O control and data paths without high software overhead, and maintaining timing models at extreme I/O rates. SwarmIO targets all three and reports faithful modeling at up to 40 million IOPS (MIOPS), with a 303.9x speedup over the state-of-the-art baseline emulator. In a vector-search case study, scaling SSD IOPS from 2.5 to 40 MIOPS delivers an average end-to-end speedup of up to 9.7x.

Core claim

SwarmIO is an SSD emulator for massively parallel, GPU-centric storage that faithfully models IOPS-optimized SSDs at target performance levels of up to 40 MIOPS, achieving a 303.9x speedup over the state-of-the-art baseline SSD emulator under GPU-initiated I/O. It further demonstrates utility through a vector search case study showing that increasing SSD IOPS from 2.5 MIOPS to 40 MIOPS yields an average end-to-end speedup of up to 9.7x.
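The two case-study numbers can be put side by side with a little arithmetic. The sketch below is not from the paper: it assumes a simple Amdahl's-law model in which only an I/O-bound fraction `f` of end-to-end time benefits from the IOPS increase, and it treats the reported "up to 9.7x" figure as a single data point. The function names are illustrative.

```python
def io_bound_fraction(iops_scale: float, speedup: float) -> float:
    """Invert Amdahl's law, S = 1 / ((1 - f) + f / k), to solve for f.

    k is the IOPS scale factor, S the observed end-to-end speedup.
    ASSUMPTION: the workload fits this two-phase model; the paper does not say so.
    """
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / iops_scale)


def amdahl_speedup(iops_scale: float, f: float) -> float:
    """Forward form of Amdahl's law, used here as a round-trip check."""
    return 1.0 / ((1.0 - f) + f / iops_scale)


iops_scale = 40.0 / 2.5      # 2.5 MIOPS -> 40 MIOPS, from the reported claim
reported_speedup = 9.7       # "up to 9.7x", from the reported claim

f = io_bound_fraction(iops_scale, reported_speedup)
efficiency = reported_speedup / iops_scale

print(f"IOPS scale factor:          {iops_scale:.0f}x")
print(f"Implied I/O-bound fraction: {f:.3f}")
print(f"Speedup per unit of scale:  {efficiency:.3f}")
```

Under this assumed model, the reported speedup corresponds to a workload that is heavily but not entirely I/O-bound, which is why the end-to-end gain is well below the raw IOPS scale factor.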

What carries the argument

SwarmIO emulator architecture that improves frontend scalability for massive request streams, reduces software overhead in emulating GPU-initiated I/O control and data paths, and lowers timing-model maintenance overhead at high request rates.
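To make "timing-model maintenance overhead" concrete, here is a generic sketch of a rate-based timing model. It is not SwarmIO's design, which this page does not describe. The class name, the 10 µs base latency, and the batching scheme are all assumptions chosen for illustration. The point is only that updating shared timing state once per batch can yield the same target completion times as updating it once per request.

```python
class RateTimingModel:
    """Toy timing model: the emulated device serves one request every 1/IOPS seconds.

    HYPOTHETICAL illustration; names and parameters are not taken from the paper.
    Times are integer nanoseconds so the comparison below is exact.
    """

    def __init__(self, target_iops: int, base_latency_ns: int) -> None:
        self.service_ns = 1_000_000_000 // target_iops  # 25 ns at 40 MIOPS
        self.base_latency_ns = base_latency_ns
        self.next_free_ns = 0      # shared state: when the device is next idle
        self.state_updates = 0     # how often the shared state was written

    def submit_one(self, arrival_ns: int) -> int:
        """Per-request update: one write to shared state per I/O."""
        start = max(arrival_ns, self.next_free_ns)
        self.next_free_ns = start + self.service_ns
        self.state_updates += 1
        return self.next_free_ns + self.base_latency_ns

    def submit_batch(self, arrival_ns: int, count: int) -> list[int]:
        """Aggregated update: one write to shared state per batch of I/Os."""
        start = max(arrival_ns, self.next_free_ns)
        self.next_free_ns = start + count * self.service_ns
        self.state_updates += 1
        return [start + (i + 1) * self.service_ns + self.base_latency_ns
                for i in range(count)]


per_request = RateTimingModel(target_iops=40_000_000, base_latency_ns=10_000)
one_by_one = [per_request.submit_one(0) for _ in range(64)]

aggregated = RateTimingModel(target_iops=40_000_000, base_latency_ns=10_000)
batched = aggregated.submit_batch(0, 64)

print(one_by_one == batched)                                # same target completion times
print(per_request.state_updates, aggregated.state_updates)  # shared-state writes
```

In this toy model the batched path writes the shared state once for 64 requests instead of 64 times. Whether SwarmIO's aggregated updates work this way is not stated on this page.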

If this is right

  • End-to-end quantitative evaluation of IOPS-optimized GPU-centric storage systems is now possible without physical hardware.
  • Designers can measure the performance effects of scaling SSD IOPS from 2.5 MIOPS to 40 MIOPS in applications such as vector search.
  • The emulator supports exploration of next-generation storage architectures with even higher IOPS targets.
  • GPU storage system prototypes can be tested and iterated rapidly before hardware availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to emulating other massively parallel I/O patterns in non-GPU accelerators.
  • Faster emulation cycles could accelerate hardware-software co-design for data-center GPU storage.
  • Observed application speedups imply that real SSD IOPS gains would produce multiplicative benefits in GPU workloads.
  • The title's reference to 100 million IOPS suggests the current 40 MIOPS target is an intermediate step toward higher rates.

Load-bearing premise

The timing models and overhead reductions in SwarmIO accurately capture real GPU-initiated I/O behavior at high request rates without introducing significant emulation artifacts or inaccuracies.

What would settle it

Direct comparison of SwarmIO's latency and throughput predictions against measurements from physical high-IOPS SSD hardware running identical GPU-initiated I/O workloads near 40 MIOPS would confirm or refute the model's fidelity.

Figures

Figures reproduced from arXiv: 2604.06668 by Gwangoo Yeo, Hyeseong Kim, Minsoo Rhu.

Figure 1: GPU-initiated I/O.
  Adjacent extracted text: …of GPU-centric storage systems with GPU-initiated I/O. • We demonstrate, on real GPU systems, that SwarmIO scales to 40 MIOPS, achieving a 307.7× speedup over a state-of-the-art SSD emulation framework. • We enable end-to-end analysis of GPU-centric storage systems with future IOPS-optimized SSDs, and demonstrate through a vector-search case study that increasing SSD IOPS from 2.5 MIOPS…
Figure 2: (a) High-level overview of NVMeVirt and (b) an …
Figure 3: Frontend throughput of NVMeVirt under CPU-centric …
Figure 4: Dynamic address mapping/unmapping overhead ( …
Figure 5: (a) Average latency and (b) IOPS of NVMeVirt under …
Figure 7: An example asynchronous, batched data transfer work …
Figure 8: Effect of SwarmIO’s asynchronous copy offloading on …
Figure 9: Effect of DSA’s batch size (BS) on (a) total worker throughput and (b) total dispatcher throughput across four service units instantiated on a single DSA device.
  Adjacent extracted text: …4 KB transfer as 64 I/O requests) and seek to maximize total worker IOPS up to this threshold. Effect of asynchronous and batched offloading. Figure 8(a) shows the benefits of SwarmIO’s asynchronous copy offloading by showing total worker IOPS as…
Figure 10: Sustained IOPS under (a) CPU-centric I/O (fio) and …
Figure 11: Average target completion latency from the timing …
Figure 13: Effect of SwarmIO’s optimizations on baseline …
Figure 14: Effect of SwarmIO’s aggregated timing model updates …
Figure 15: Sensitivity of SwarmIO’s sustained IOPS to (a) the …
Figure 16: End-to-end performance of on-disk CAGRA search …
Original abstract

GPU-initiated I/O has emerged as a key mechanism for achieving high-throughput storage access by leveraging massive GPU thread-level parallelism, while recent industry trends point toward SSDs optimized for ultra-high random-read IOPS. Together, these trends are enabling the emergence of IOPS-optimized, GPU-centric storage systems. Despite this momentum, no existing framework enables quantitative end-to-end evaluation of storage systems optimized for GPU-initiated I/O. While conventional SSD emulators provide a promising path toward end-to-end modeling in traditional storage systems, they face three key challenges in this GPU-centric setting: limited frontend scalability for ingesting massive request streams, high software overhead in emulating GPU-initiated I/O control and data paths, and excessive timing-model maintenance overhead at extremely high I/O request rates. We propose SwarmIO, an SSD emulator for massively parallel, GPU-centric storage. SwarmIO faithfully models IOPS-optimized SSDs at target performance levels of up to 40 MIOPS, achieving a 303.9x speedup over the state-of-the-art baseline SSD emulator under GPU-initiated I/O. We further demonstrate its utility through a vector search case study, showing that increasing SSD IOPS from 2.5 MIOPS to 40 MIOPS yields an average end-to-end speedup of up to 9.7x.
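Little's law gives a rough sense of how large the request streams in the abstract must be: sustained throughput equals in-flight requests divided by latency. The sketch below applies it to the two IOPS levels named in the abstract. The per-request latencies are assumed values for illustration only and do not come from the paper.

```python
def required_in_flight(target_iops: int, latency_us: int) -> float:
    """Little's law: concurrency = throughput * latency.

    target_iops is in requests per second, latency_us in microseconds.
    """
    return target_iops * latency_us / 1_000_000


# ASSUMPTION: illustrative device latencies; the paper's values are not given here.
ASSUMED_LATENCIES_US = (10, 20, 50)

table = {
    iops: [required_in_flight(iops, lat) for lat in ASSUMED_LATENCIES_US]
    for iops in (2_500_000, 40_000_000)   # 2.5 MIOPS and 40 MIOPS, from the abstract
}

for iops, depths in table.items():
    cells = ", ".join(
        f"{lat} us -> {depth:.0f} in flight"
        for lat, depth in zip(ASSUMED_LATENCIES_US, depths)
    )
    print(f"{iops / 1e6:>4.1f} MIOPS: {cells}")
```

Whatever the true latency, the required concurrency grows linearly with the IOPS target, so a 16x higher target needs 16x more requests outstanding at once. That matches the abstract's emphasis on frontend scalability and massive GPU thread-level parallelism.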

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SwarmIO, an SSD emulator for GPU-centric storage systems supporting GPU-initiated I/O. It identifies three limitations in prior emulators (frontend scalability for massive request streams, software overhead on GPU I/O paths, and timing-model maintenance at high rates) and claims to address them, enabling faithful modeling of IOPS-optimized SSDs up to 40 MIOPS. Key results include a 303.9x speedup over the state-of-the-art baseline emulator under GPU-initiated I/O and a vector-search case study showing up to 9.7x end-to-end speedup when scaling SSD IOPS from 2.5 MIOPS to 40 MIOPS.

Significance. If the timing models prove accurate, SwarmIO would be a significant contribution by enabling quantitative end-to-end evaluation of emerging GPU-centric storage architectures at performance levels (40+ MIOPS) that are currently impractical to prototype or simulate with existing tools. This could accelerate design exploration in high-performance computing and AI workloads that rely on massive parallel random-read I/O.

major comments (2)
  1. [Results/Evaluation section] The central claim that SwarmIO 'faithfully models' IOPS-optimized SSDs at up to 40 MIOPS (including flash channel contention and controller effects under massive GPU thread parallelism) is load-bearing for the reported 303.9x speedup and 9.7x case-study gain, yet the manuscript provides no direct validation against physical SSD hardware at target rates with GPU-initiated I/O. If comparisons are limited to other emulators, synthetic traces, or lower-rate regimes, the speedup figures may reflect model simplifications rather than real-system fidelity.
  2. [Abstract and §1] The assertion of 'no existing framework' for quantitative end-to-end evaluation of GPU-initiated I/O storage systems requires explicit comparison to the closest prior GPU-aware or high-IOPS emulators; without this, it is unclear whether SwarmIO's overhead reductions are incremental or fundamentally new.
minor comments (2)
  1. [Title and Abstract] The title references '100 Million IOPS' while all concrete claims and results target 40 MIOPS; clarifying the gap between aspirational and demonstrated performance would improve precision.
  2. [Figures/Tables] Figure and table captions should explicitly state whether error bars or confidence intervals are shown and what baseline configuration (e.g., number of GPU threads, request pattern) was used for the 303.9x measurement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, proposing targeted revisions to strengthen the presentation of our contributions while maintaining accuracy.

  1. Referee: [Results/Evaluation section] The central claim that SwarmIO 'faithfully models' IOPS-optimized SSDs at up to 40 MIOPS (including flash channel contention and controller effects under massive GPU thread parallelism) is load-bearing for the reported 303.9x speedup and 9.7x case-study gain, yet the manuscript provides no direct validation against physical SSD hardware at target rates with GPU-initiated I/O. If comparisons are limited to other emulators, synthetic traces, or lower-rate regimes, the speedup figures may reflect model simplifications rather than real-system fidelity.

    Authors: We acknowledge the value of direct hardware validation at target rates. Such validation is inherently limited because IOPS-optimized SSDs with native GPU-initiated I/O support at 40 MIOPS scale are emerging technologies not yet available for comprehensive end-to-end benchmarking. Our timing models are derived from commercial SSD datasheets, flash channel specifications, and controller behaviors documented in prior literature. We have performed validation against available lower-rate physical SSDs and synthetic traces that reproduce known contention effects. In revision, we will expand the evaluation section with additional details on model derivation, low-rate hardware comparisons, and an explicit limitations discussion on high-rate fidelity. This is a partial revision as we cannot fabricate unavailable hardware data. revision: partial

  2. Referee: [Abstract and §1] The assertion of 'no existing framework' for quantitative end-to-end evaluation of GPU-initiated I/O storage systems requires explicit comparison to the closest prior GPU-aware or high-IOPS emulators; without this, it is unclear whether SwarmIO's overhead reductions are incremental or fundamentally new.

    Authors: We will revise the abstract and Section 1 to include an explicit comparison to the closest prior emulators (both high-IOPS and any GPU-aware variants). Our claim centers on the absence of any framework that simultaneously supports GPU-initiated I/O, scales to 40+ MIOPS, and maintains low software overhead on the GPU path. We will add a table or paragraph differentiating SwarmIO from related work on these axes to clarify the novelty. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims are measured speedups against external baseline


The paper's derivation chain consists of engineering solutions to three stated challenges (frontend scalability, GPU I/O path overhead, timing-model maintenance) followed by direct benchmarking of the resulting emulator against a state-of-the-art baseline. Reported figures (303.9x speedup, 9.7x end-to-end gain) are presented as empirical measurements on synthetic and case-study workloads, not as outputs of fitted parameters or self-referential equations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the modeling or evaluation sections; the timing model is described as an implementation artifact whose accuracy is asserted via comparison to the external baseline rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; SwarmIO is presented as an engineering artifact whose internal model details are not disclosed.

pith-pipeline@v0.9.0 · 5542 in / 1154 out tokens · 68089 ms · 2026-05-10T18:28:06.892941+00:00 · methodology


