Communication Offloading on SmartNIC DPUs: A Quantitative Approach
Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3
The pith
Offloading communication to SmartNIC DPUs speeds up host-dominated workloads by up to 1.55x when the memory-to-communication ratio is low.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The memory-to-communication ratio determines offloading benefit on SmartNIC DPUs; workloads dominated by host computation achieve up to 1.55x speedup when communication is moved to the DPU, yet the absence of Direct Cache Access produces a 625x increase in DRAM traffic.
What carries the argument
Buddy, the communication offloading engine that decouples message routing from the application process and runs on the DPU.
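To make the decoupling concrete, here is a minimal sketch of a fire-and-forget send path in which the application thread only enqueues a message and a separate worker routes it. This is not Buddy's actual design or API: the `RoutingEngine` class, its methods, and the in-memory `delivered` mailboxes are hypothetical stand-ins, and a Python thread stands in for the DPU core.

```python
import queue
import threading


class RoutingEngine:
    """Hypothetical stand-in for an offloaded message router (not Buddy's API)."""

    def __init__(self, num_ranks):
        self._outbox = queue.Queue()
        # In-memory mailboxes stand in for remote ranks in this sketch.
        self.delivered = {rank: [] for rank in range(num_ranks)}
        self._worker = threading.Thread(target=self._route_forever, daemon=True)
        self._worker.start()

    def send(self, dest, payload):
        """Fire-and-forget: enqueue and return at once, with no completion handle."""
        self._outbox.put((dest, payload))

    def _route_forever(self):
        # Runs off the application thread, as a DPU core would in the real system.
        while True:
            item = self._outbox.get()
            if item is None:  # shutdown sentinel
                self._outbox.task_done()
                break
            dest, payload = item
            self.delivered[dest].append(payload)
            self._outbox.task_done()

    def drain(self):
        """Block until every message enqueued so far has been routed."""
        self._outbox.join()

    def shutdown(self):
        self._outbox.put(None)
        self._worker.join()


def run_demo():
    engine = RoutingEngine(num_ranks=2)
    for i in range(4):
        engine.send(dest=i % 2, payload=i)  # returns without waiting for delivery
    engine.drain()
    engine.shutdown()
    return engine.delivered
```

The application-facing call is `send` alone, which is the property the review highlights: routing work can move to another processor without the caller waiting on it.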
If this is right
- Workloads whose compute time greatly exceeds communication time gain measurable speedup from DPU offloading.
- Future SmartNIC hardware must add Direct Cache Access to keep DRAM traffic from exploding.
- The fire-and-forget model can be supported on programmable DPUs without changing the host application interface.
- The memory-to-communication ratio supplies a simple static rule for deciding whether to offload a given communication service.
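The static rule in the last bullet could look like the sketch below. Everything specific here is an assumption: the page does not say how the ratio is computed, so the sketch defines it as memory traffic in bytes divided by communication volume in bytes, and `threshold` is a hypothetical tuning parameter with no value taken from the paper. The direction of the comparison follows the summary above, where a low ratio favors offloading.

```python
def memory_to_communication_ratio(memory_bytes, comm_bytes):
    """Assumed definition: memory traffic divided by communication volume."""
    if comm_bytes <= 0:
        raise ValueError("comm_bytes must be positive")
    return memory_bytes / comm_bytes


def should_offload(memory_bytes, comm_bytes, threshold):
    """Offload when the ratio falls below a caller-chosen (hypothetical) threshold."""
    return memory_to_communication_ratio(memory_bytes, comm_bytes) < threshold
```

A launcher could call `should_offload` once per job with values taken from a profiling run. The threshold would have to be calibrated per platform, which is exactly where the five-application sample size matters.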
Where Pith is reading between the lines
- If the ratio rule generalizes, schedulers could decide at launch time whether to place communication on the DPU.
- The DRAM-traffic penalty may restrict DPU use to latency-insensitive or bandwidth-rich networks until hardware changes.
- Repeating the experiments on a DPU that does support Direct Cache Access would isolate the hardware contribution from the software design.
Load-bearing premise
That the five tested applications represent the range of workloads for which the memory-to-communication ratio reliably forecasts offloading gains, and that running Buddy on the DPU adds no unmeasured costs beyond the reported DRAM traffic.
What would settle it
A workload with a low memory-to-communication ratio that shows no speedup, or even a slowdown, when communication is offloaded to the DPU; or a DPU with Direct Cache Access that does not produce the 625x DRAM traffic increase.
Original abstract
SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs such as the Nvidia BlueField-3 DPU and generic x86 CPUs. Our evaluation results in five applications identify the memory-to-communication ratio as a key predictor of the offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU, highlighting a critical need in future SmartNIC designs.
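As a reading aid for the headline numbers, the sketch below converts a speedup factor into the fraction of runtime saved. The function names are illustrative, and the only figure taken from the abstract is the 1.55x factor; no timings from the paper are assumed.

```python
def speedup(baseline_seconds, offloaded_seconds):
    """Speedup factor: baseline runtime divided by offloaded runtime."""
    return baseline_seconds / offloaded_seconds


def runtime_reduction(speedup_factor):
    """Fraction of the baseline runtime saved at a given speedup factor."""
    return 1.0 - 1.0 / speedup_factor


# A 1.55x speedup corresponds to roughly a 35% shorter runtime.
saved = runtime_reduction(1.55)
```

Read this way, the best reported case removes a little over a third of the runtime, while the 625x figure is a ratio of DRAM traffic volumes rather than of time.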
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Buddy, a communication offloading engine for asynchronous 'fire-and-forget' message routing that can run on SmartNIC DPUs (e.g., Nvidia BlueField-3) or x86 CPUs. Evaluation on five applications identifies the memory-to-communication ratio as a key predictor of offloading benefit; host-dominated workloads such as Quicksilver and Sparse Matrix Transpose achieve up to 1.55x speedup when communication is offloaded to the DPU. The work also reports a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU.
Significance. If the memory-to-communication ratio is shown to be a reliable, generalizable predictor, the results could offer actionable guidance for when offloading communication to DPUs yields net benefit and could inform hardware requirements for future SmartNIC designs. The explicit quantification of the DRAM traffic penalty provides a concrete data point on current DPU limitations.
major comments (2)
- [Abstract] Abstract: The central performance claims (1.55x speedup, 625x DRAM traffic increase) and the identification of the memory-to-communication ratio as predictor are stated without any description of the experimental setup, workload characteristics, how the ratio is computed, baseline configurations, or measurement methodology. This absence makes the numerical results unverifiable from the provided text.
- [Evaluation] Evaluation (implied by the five-application results): The claim that the memory-to-communication ratio is a 'key predictor' rests on only five applications. No evidence is given that the observed relationship holds beyond this small, potentially non-representative set, nor are controls shown for DPU-specific overheads other than the reported DRAM traffic increase.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address the concerns about the abstract and the scope of the evaluation below, providing clarifications from the manuscript while noting where revisions can strengthen the presentation.
- Referee: [Abstract] Abstract: The central performance claims (1.55x speedup, 625x DRAM traffic increase) and the identification of the memory-to-communication ratio as predictor are stated without any description of the experimental setup, workload characteristics, how the ratio is computed, baseline configurations, or measurement methodology. This absence makes the numerical results unverifiable from the provided text.
Authors: The abstract follows the conventional format of providing a concise summary of the problem, approach, and key findings without experimental details, which are instead fully described in the manuscript body. Section 3 details the Buddy design and DPU deployment on Nvidia BlueField-3; Section 4 describes the five applications (including Quicksilver and Sparse Matrix Transpose), how the memory-to-communication ratio is computed from application traces, the baseline configurations (host-only vs. DPU-offloaded), and the measurement methodology using hardware performance counters for speedup and DRAM traffic. We agree that a brief reference to the evaluation methodology could improve verifiability and will revise the abstract accordingly within length constraints.
Revision: partial
- Referee: [Evaluation] Evaluation (implied by the five-application results): The claim that the memory-to-communication ratio is a 'key predictor' rests on only five applications. No evidence is given that the observed relationship holds beyond this small, potentially non-representative set, nor are controls shown for DPU-specific overheads other than the reported DRAM traffic increase.
Authors: The manuscript selects the five applications specifically to span a range of memory-to-communication ratios and to include both host-dominated and communication-dominated workloads, allowing the ratio to be identified as a predictor from the measured speedups (up to 1.55x) and the 625x DRAM traffic increase due to missing Direct Cache Access. The paper does not assert that the relationship is proven for all possible workloads; it presents the ratio as an actionable predictor derived from these cases. We can expand the evaluation section with an explicit limitations paragraph discussing the sample size and the need for future validation across additional applications. Other DPU overheads were quantified via the same performance counters but were not dominant compared to the DRAM traffic penalty in the reported experiments.
Revision: partial
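The sample-size concern in the exchange above can be quantified without any data from the paper. The sketch below assumes a rank-based (Spearman) test is a reasonable proxy for "predictor", which the page does not state, and enumerates all orderings of five tie-free ranks to find the smallest two-sided permutation p-value that five applications can ever produce.

```python
from itertools import permutations
from math import factorial


def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for two tie-free rank sequences."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def smallest_two_sided_p(n):
    """Smallest exact two-sided permutation p-value attainable with n points."""
    base = tuple(range(1, n + 1))
    # Count orderings that are perfectly monotonic (rho = +1 or rho = -1).
    extreme = sum(
        1
        for perm in permutations(base)
        if abs(abs(spearman_rho(base, perm)) - 1.0) < 1e-12
    )
    return extreme / factorial(n)


p_min = smallest_two_sided_p(5)  # 2 of 120 orderings, about 0.0167
```

Even a perfectly monotonic relationship across five applications therefore cannot reach an exact two-sided p-value below 2/120, which is why validation on additional workloads carries real weight.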
Circularity Check
No circularity; empirical speedups rest on direct measurement
The paper reports measured speedups (up to 1.55x) and a DRAM traffic increase (625x) from running five applications with and without the Buddy offloader on DPU vs host. The memory-to-communication ratio is presented as an observed predictor derived from those measurements, not from any equation, fitted parameter, or self-citation chain that reduces the result to its inputs by construction. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the provided text. The central claims are therefore self-contained against external benchmarks (the actual runs).
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fire-and-forget messaging can be decoupled from the application process and executed on a separate DPU core without altering application semantics.
invented entities (1)
- Buddy communication offloading engine: no independent evidence
Forward citations
Cited by 1 Pith paper
- Post-Moore Technologies for Plasma Simulation: A Community Roadmap
No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures medium-term operator acceleration, and quantum com...