pith. sign in

arxiv: 2606.11158 · v1 · pith:M66SNEFFnew · submitted 2026-06-09 · 💻 cs.AR · cs.PL

Defeat the Heap: Zero-Copy Data Movement in AXI4MLIR

Pith reviewed 2026-06-27 11:08 UTC · model grok-4.3

classification 💻 cs.AR cs.PL
keywords zero-copydata movementAXI4MLIRDMA-mapped buffersMLIR dialectaccelerator utilizationmatrix multiplication
0
0 comments X

The pith

Allocating MLIR buffers directly in DMA-mapped memory removes a redundant staging copy and halves main memory data movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AXI4MLIR performs an unnecessary copy from heap-allocated buffers into contiguous DMA-mapped buffers before accelerator transfers. It removes this copy by extending the accel dialect with lowering support that allocates buffers straight into DMA-mapped memory. A sympathetic reader would care because the change reduces main memory data movement by up to 2x and lets the accelerator spend more time on computation rather than transfers.

Core claim

This work identifies the copy from heap-allocated memory buffers into contiguous DMA-mapped buffers as a redundant staging operation during host to accelerator transfers in AXI4MLIR. The optimization extends the accel dialect and implements lowering support that allocates buffers directly within DMA-mapped memory, omitting the staging copy. Evaluation with a configurable matrix-matrix multiplication accelerator shows the zero-copy scheme reduces main memory data movement by up to 2x and increases overall accelerator utilization.

What carries the argument

The lowering support added to the accel dialect that allocates buffers directly within DMA-mapped memory regions instead of using heap buffers followed by a copy.

If this is right

  • Main memory data movement is reduced by up to 2x.
  • Overall accelerator utilization increases as transfers take less time.
  • The redundant staging copy between heap and DMA buffers is eliminated for host-accelerator transfers.
  • The change applies to linear algebra kernels running on custom hardware accelerators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same direct-allocation pattern could be added to other MLIR dialects that generate accelerator driver code.
  • Runtime systems would need to guarantee safe DMA-mapped regions for the allocation change to remain zero-overhead.
  • The reduction in data movement could compound with other optimizations such as tiling or prefetching.

Load-bearing premise

Buffers can be allocated directly inside DMA-mapped memory regions without violating hardware constraints, OS policies, or introducing new overheads that offset the removed copy.

What would settle it

Measure the volume of main memory data transfers during matrix-matrix multiplication before and after switching to direct DMA-mapped buffer allocation.

Figures

Figures reproduced from arXiv: 2606.11158 by Antonino Tumeo, David Kaeli, Elam Cohavi, Jos\'e Cano, Jude Haris, Nicolas Bohm Agostini.

Figure 1
Figure 1. Figure 1: Zero-copy DMA-aware optimization: System view. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MLIR memref descriptor structure. A memref.subview produces a new descriptor that reuses the same allocated and aligned pointers from the original buffer, but updates the remaining fields to express the view: the offset is adjusted to the new starting ele￾ment, sizes describes the dimensions of the subview, and strides defines the distance (in number of elements) to step through the underlying buffer along… view at source ↗
Figure 3
Figure 3. Figure 3: IR comparison of explicit copy operations with [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Normalized execution time breakdown across stationary data flows for various dimensions, tile sizes. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
read the original abstract

As custom hardware accelerators become increasingly central to machine learning workloads, efficient data transfer is critical for maximizing accelerator performance on linear algebra kernels. AXI4MLIR, an extension of the Multi-Level Intermediate Representation (MLIR) compiler framework for automated generation of host-accelerator driver code, incurs significant runtime overhead due to non-zero-copy CPU-accelerator data movement. During transfers from the host to the accelerator, data is copied from heap-allocated memory buffers into contiguous Direct Memory Access (DMA)-mapped buffers. This work identifies this copy as a redundant staging operation and eliminates it through zero-copy data movement. The optimization extends accel, an MLIR dialect introduced by AXI4MLIR, and implements lowering support that allocates buffers directly within DMA-mapped memory, thereby omitting the staging copy. We evaluate the proposed scheme using a configurable matrix-matrix multiplication accelerator and show that the zero-copy optimization reduces main memory data movement by up to 2x, increasing overall accelerator utilization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper describes an extension to AXI4MLIR that eliminates a redundant heap-to-DMA staging copy during host-accelerator transfers by extending the accel dialect and its lowering pass to allocate MLIR buffers directly inside DMA-mapped regions. It evaluates the change on a configurable matrix-matrix multiplication accelerator and reports that the optimization reduces main-memory data movement by up to 2x while increasing accelerator utilization.

Significance. A verified zero-copy path that removes a staging copy without new overheads would be a useful, incremental improvement for MLIR-based accelerator flows that rely on DMA. The approach is conceptually straightforward and targets a known source of overhead, but the manuscript supplies no experimental details, baselines, or verification of the three necessary conditions (DMA-region allocation feasibility, OS cost parity, and absence of new coherence/TLB traffic), so the practical significance cannot yet be assessed.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation description: the headline claim of 'up to 2x' reduction in main-memory data movement is stated without any workload description, matrix sizes, baseline (non-zero-copy) measurements, error bars, or even the number of runs, so the quantitative result cannot be reproduced or verified from the text.
  2. [Implementation / Lowering] Lowering pass and allocation strategy: the central assumption that arbitrary-sized, arbitrarily-aligned buffers can be placed directly in DMA-mapped regions without padding, extra copies, or new TLB/cache/coherence costs is load-bearing for the net 2x win, yet the manuscript provides no evidence or measurement that any of these conditions were checked.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive criticism. We agree that the current manuscript lacks sufficient experimental detail and verification of key assumptions. We will revise the paper to include the requested information and measurements. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation description: the headline claim of 'up to 2x' reduction in main-memory data movement is stated without any workload description, matrix sizes, baseline (non-zero-copy) measurements, error bars, or even the number of runs, so the quantitative result cannot be reproduced or verified from the text.

    Authors: We agree that the abstract and evaluation sections do not provide enough detail for reproducibility. In the revised manuscript we will expand the evaluation section to describe the workloads (matrix-multiplication kernels), report the exact matrix dimensions tested, include the non-zero-copy baseline numbers, add error bars from multiple runs, and state the number of repetitions performed. These additions will make the 'up to 2x' claim verifiable from the text. revision: yes

  2. Referee: [Implementation / Lowering] Lowering pass and allocation strategy: the central assumption that arbitrary-sized, arbitrarily-aligned buffers can be placed directly in DMA-mapped regions without padding, extra copies, or new TLB/cache/coherence costs is load-bearing for the net 2x win, yet the manuscript provides no evidence or measurement that any of these conditions were checked.

    Authors: We acknowledge that the manuscript does not present explicit measurements confirming the absence of padding, extra copies, or new TLB/cache/coherence traffic. In the revision we will add a dedicated subsection that reports (1) the DMA-region allocation strategy and its feasibility for the tested sizes, (2) measured OS allocation cost parity, and (3) hardware performance-counter data showing no measurable increase in TLB misses or coherence traffic on the evaluated platform. If any of these conditions require additional platform-specific caveats, they will be stated explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical implementation change with direct measurement

full rationale

The paper presents an engineering optimization that removes a heap-to-DMA staging copy by allocating buffers directly in DMA-mapped regions. The central claim (up to 2x reduction in memory movement) is supported by runtime measurements on a matrix-multiplication accelerator, not by any derivation, fitted parameters, or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)
  • domain assumption DMA-mapped memory regions can be allocated directly from the compiler without additional runtime cost or hardware incompatibility
    The zero-copy scheme depends on this being feasible for the target accelerator platform.

pith-pipeline@v0.9.1-grok · 5718 in / 1114 out tokens · 24665 ms · 2026-06-27T11:08:18.220074+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages

  1. [1]

    Hirac: A hier- archical accelerator with sorting-based packing for spgemms in dnn applications,

    H. Shabani, A. Singh, B. Youhana, and X. Guo, “Hirac: A hier- archical accelerator with sorting-based packing for spgemms in dnn applications,” inIEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’23, 2023, pp. 247–258

  2. [2]

    Inca: Input-stationary dataflow at outside-the- box thinking about deep learning accelerators,

    B. Kim, S. Li, and H. Li, “Inca: Input-stationary dataflow at outside-the- box thinking about deep learning accelerators,” inIEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’23, 2023, pp. 29–41

  3. [3]

    Mp-rec: Hardware-software co-design to enable multi-path recommendation,

    S. Hsia, U. Gupta, B. Acun, N. Ardalani, P. Zhong, G.-Y . Wei, D. Brooks, and C.-J. Wu, “Mp-rec: Hardware-software co-design to enable multi-path recommendation,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023, 2023, p. 449–465

  4. [4]

    Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,

    F. Muñoz Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Aca- cio, and T. Krishna, “Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS’23, 2023, p. 252–265

  5. [5]

    AXI4MLIR: User-Driven Auto- matic Host Code Generation for Custom AXI-Based Accelerators,

    N. B. Agostini, J. Haris, P. Gibson, M. Jayaweera, N. Rubin, A. Tumeo, J. L. Abellán, J. Cano, and D. Kaeli, “AXI4MLIR: User-Driven Auto- matic Host Code Generation for Custom AXI-Based Accelerators,” in Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO ’24, 2024, p. 143–157

  6. [6]

    MLIR: Scaling Compiler Infrastructure for Domain Specific Computation,

    C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation,” in IEEE/ACM International Symposium on Code Generation and Optimiza- tion, ser. CGO’21, 2021, pp. 2–14

  7. [7]

    ’linalg’ Dialect,

    M. Developers, “’linalg’ Dialect,” 2020, online accessed on 11-04-2023. [Online]. Available: https://mlir.llvm.org/docs/Dialects/Linalg/

  8. [8]

    Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR,

    J. Haris, N. B. Agostini, A. Tumeo, D. Kaeli, and J. Cano, “Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR,”

  9. [9]

    Available: https://arxiv.org/abs/2402.19184

    [Online]. Available: https://arxiv.org/abs/2402.19184

  10. [10]

    Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,

    Y .-H. Chen, J. Emer, and V . Sze, “Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. New York, USA: IEEE Press, 2016, p. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40

  11. [11]

    DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,

    P. Gibson, J. Cano, E. Crowley, A. Storkey, and M. O’boyle, “DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,”ACM Transactions on Architecture and Code Optimization, 2025

  12. [12]

    Mlir-based ai engine toolchain and iron api,

    Xilinx/AMD, “Mlir-based ai engine toolchain and iron api,” https: //github.com/Xilinx/mlir-aie, 2025, accessed on December 5, 2025

  13. [13]

    Compiler-driven simulation of reconfigurable hardware accelerators,

    Z. Li, Y . Ye, S. Neuendorffer, and A. Sampso, “Compiler-driven simulation of reconfigurable hardware accelerators,” 2022. [Online]. Available: https://arxiv.org/abs/2202.00739