Defeat the Heap: Zero-Copy Data Movement in AXI4MLIR

Antonino Tumeo; David Kaeli; Elam Cohavi; Jos\'e Cano; Jude Haris; Nicolas Bohm Agostini

arxiv: 2606.11158 · v1 · pith:M66SNEFFnew · submitted 2026-06-09 · 💻 cs.AR · cs.PL

Defeat the Heap: Zero-Copy Data Movement in AXI4MLIR

Elam Cohavi , Nicolas Bohm Agostini , Jude Haris , Antonino Tumeo , David Kaeli , Jos\'e Cano This is my paper

Pith reviewed 2026-06-27 11:08 UTC · model grok-4.3

classification 💻 cs.AR cs.PL

keywords zero-copydata movementAXI4MLIRDMA-mapped buffersMLIR dialectaccelerator utilizationmatrix multiplication

0 comments

The pith

Allocating MLIR buffers directly in DMA-mapped memory removes a redundant staging copy and halves main memory data movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AXI4MLIR performs an unnecessary copy from heap-allocated buffers into contiguous DMA-mapped buffers before accelerator transfers. It removes this copy by extending the accel dialect with lowering support that allocates buffers straight into DMA-mapped memory. A sympathetic reader would care because the change reduces main memory data movement by up to 2x and lets the accelerator spend more time on computation rather than transfers.

Core claim

This work identifies the copy from heap-allocated memory buffers into contiguous DMA-mapped buffers as a redundant staging operation during host to accelerator transfers in AXI4MLIR. The optimization extends the accel dialect and implements lowering support that allocates buffers directly within DMA-mapped memory, omitting the staging copy. Evaluation with a configurable matrix-matrix multiplication accelerator shows the zero-copy scheme reduces main memory data movement by up to 2x and increases overall accelerator utilization.

What carries the argument

The lowering support added to the accel dialect that allocates buffers directly within DMA-mapped memory regions instead of using heap buffers followed by a copy.

If this is right

Main memory data movement is reduced by up to 2x.
Overall accelerator utilization increases as transfers take less time.
The redundant staging copy between heap and DMA buffers is eliminated for host-accelerator transfers.
The change applies to linear algebra kernels running on custom hardware accelerators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same direct-allocation pattern could be added to other MLIR dialects that generate accelerator driver code.
Runtime systems would need to guarantee safe DMA-mapped regions for the allocation change to remain zero-overhead.
The reduction in data movement could compound with other optimizations such as tiling or prefetching.

Load-bearing premise

Buffers can be allocated directly inside DMA-mapped memory regions without violating hardware constraints, OS policies, or introducing new overheads that offset the removed copy.

What would settle it

Measure the volume of main memory data transfers during matrix-matrix multiplication before and after switching to direct DMA-mapped buffer allocation.

Figures

Figures reproduced from arXiv: 2606.11158 by Antonino Tumeo, David Kaeli, Elam Cohavi, Jos\'e Cano, Jude Haris, Nicolas Bohm Agostini.

**Figure 2.** Figure 2: MLIR memref descriptor structure. A memref.subview produces a new descriptor that reuses the same allocated and aligned pointers from the original buffer, but updates the remaining fields to express the view: the offset is adjusted to the new starting element, sizes describes the dimensions of the subview, and strides defines the distance (in number of elements) to step through the underlying buffer along… view at source ↗

**Figure 3.** Figure 3: IR comparison of explicit copy operations with [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Normalized execution time breakdown across stationary data flows for various dimensions, tile sizes. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

read the original abstract

As custom hardware accelerators become increasingly central to machine learning workloads, efficient data transfer is critical for maximizing accelerator performance on linear algebra kernels. AXI4MLIR, an extension of the Multi-Level Intermediate Representation (MLIR) compiler framework for automated generation of host-accelerator driver code, incurs significant runtime overhead due to non-zero-copy CPU-accelerator data movement. During transfers from the host to the accelerator, data is copied from heap-allocated memory buffers into contiguous Direct Memory Access (DMA)-mapped buffers. This work identifies this copy as a redundant staging operation and eliminates it through zero-copy data movement. The optimization extends accel, an MLIR dialect introduced by AXI4MLIR, and implements lowering support that allocates buffers directly within DMA-mapped memory, thereby omitting the staging copy. We evaluate the proposed scheme using a configurable matrix-matrix multiplication accelerator and show that the zero-copy optimization reduces main memory data movement by up to 2x, increasing overall accelerator utilization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds direct DMA allocation to AXI4MLIR for zero-copy transfers and reports up to 2x less data movement, but the lack of experimental details makes the result hard to assess.

read the letter

The core change here is a new lowering pass in the accel dialect that allocates MLIR buffers straight into DMA-mapped regions instead of copying from the heap first. That removes the staging step they identified as redundant.

It does a clean job describing the existing AXI4MLIR flow and the exact modification needed to skip the copy. For anyone already using that framework to generate host-accelerator code, the idea is straightforward and targeted.

The evaluation is the thin part. The abstract states a 2x reduction in main-memory movement on a matrix-matrix multiply accelerator, yet it gives no workload sizes, baseline measurements, or confirmation that the DMA regions handled the buffers without added alignment or coherence costs. The stress-test note is on point: we have no evidence that the OS permitted the allocation at the same cost or that no new overheads appeared.

The work stays within one compiler extension and does not claim broader novelty beyond that. Citation pattern is normal, building directly on the prior AXI4MLIR papers.

This is for people working inside the AXI4MLIR ecosystem or similar MLIR accelerator flows. A reader outside that niche will not get much. It deserves peer review once the experiments are written out with enough detail to check the net gain.

Referee Report

2 major / 0 minor

Summary. The paper describes an extension to AXI4MLIR that eliminates a redundant heap-to-DMA staging copy during host-accelerator transfers by extending the accel dialect and its lowering pass to allocate MLIR buffers directly inside DMA-mapped regions. It evaluates the change on a configurable matrix-matrix multiplication accelerator and reports that the optimization reduces main-memory data movement by up to 2x while increasing accelerator utilization.

Significance. A verified zero-copy path that removes a staging copy without new overheads would be a useful, incremental improvement for MLIR-based accelerator flows that rely on DMA. The approach is conceptually straightforward and targets a known source of overhead, but the manuscript supplies no experimental details, baselines, or verification of the three necessary conditions (DMA-region allocation feasibility, OS cost parity, and absence of new coherence/TLB traffic), so the practical significance cannot yet be assessed.

major comments (2)

[Abstract / Evaluation] Abstract and evaluation description: the headline claim of 'up to 2x' reduction in main-memory data movement is stated without any workload description, matrix sizes, baseline (non-zero-copy) measurements, error bars, or even the number of runs, so the quantitative result cannot be reproduced or verified from the text.
[Implementation / Lowering] Lowering pass and allocation strategy: the central assumption that arbitrary-sized, arbitrarily-aligned buffers can be placed directly in DMA-mapped regions without padding, extra copies, or new TLB/cache/coherence costs is load-bearing for the net 2x win, yet the manuscript provides no evidence or measurement that any of these conditions were checked.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive criticism. We agree that the current manuscript lacks sufficient experimental detail and verification of key assumptions. We will revise the paper to include the requested information and measurements. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and evaluation description: the headline claim of 'up to 2x' reduction in main-memory data movement is stated without any workload description, matrix sizes, baseline (non-zero-copy) measurements, error bars, or even the number of runs, so the quantitative result cannot be reproduced or verified from the text.

Authors: We agree that the abstract and evaluation sections do not provide enough detail for reproducibility. In the revised manuscript we will expand the evaluation section to describe the workloads (matrix-multiplication kernels), report the exact matrix dimensions tested, include the non-zero-copy baseline numbers, add error bars from multiple runs, and state the number of repetitions performed. These additions will make the 'up to 2x' claim verifiable from the text. revision: yes
Referee: [Implementation / Lowering] Lowering pass and allocation strategy: the central assumption that arbitrary-sized, arbitrarily-aligned buffers can be placed directly in DMA-mapped regions without padding, extra copies, or new TLB/cache/coherence costs is load-bearing for the net 2x win, yet the manuscript provides no evidence or measurement that any of these conditions were checked.

Authors: We acknowledge that the manuscript does not present explicit measurements confirming the absence of padding, extra copies, or new TLB/cache/coherence traffic. In the revision we will add a dedicated subsection that reports (1) the DMA-region allocation strategy and its feasibility for the tested sizes, (2) measured OS allocation cost parity, and (3) hardware performance-counter data showing no measurable increase in TLB misses or coherence traffic on the evaluated platform. If any of these conditions require additional platform-specific caveats, they will be stated explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical implementation change with direct measurement

full rationale

The paper presents an engineering optimization that removes a heap-to-DMA staging copy by allocating buffers directly in DMA-mapped regions. The central claim (up to 2x reduction in memory movement) is supported by runtime measurements on a matrix-multiplication accelerator, not by any derivation, fitted parameters, or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

domain assumption DMA-mapped memory regions can be allocated directly from the compiler without additional runtime cost or hardware incompatibility
The zero-copy scheme depends on this being feasible for the target accelerator platform.

pith-pipeline@v0.9.1-grok · 5718 in / 1114 out tokens · 24665 ms · 2026-06-27T11:08:18.220074+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages

[1]

Hirac: A hier- archical accelerator with sorting-based packing for spgemms in dnn applications,

H. Shabani, A. Singh, B. Youhana, and X. Guo, “Hirac: A hier- archical accelerator with sorting-based packing for spgemms in dnn applications,” inIEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’23, 2023, pp. 247–258

2023
[2]

Inca: Input-stationary dataflow at outside-the- box thinking about deep learning accelerators,

B. Kim, S. Li, and H. Li, “Inca: Input-stationary dataflow at outside-the- box thinking about deep learning accelerators,” inIEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’23, 2023, pp. 29–41

2023
[3]

Mp-rec: Hardware-software co-design to enable multi-path recommendation,

S. Hsia, U. Gupta, B. Acun, N. Ardalani, P. Zhong, G.-Y . Wei, D. Brooks, and C.-J. Wu, “Mp-rec: Hardware-software co-design to enable multi-path recommendation,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023, 2023, p. 449–465

2023
[4]

Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,

F. Muñoz Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Aca- cio, and T. Krishna, “Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS’23, 2023, p. 252–265

2023
[5]

AXI4MLIR: User-Driven Auto- matic Host Code Generation for Custom AXI-Based Accelerators,

N. B. Agostini, J. Haris, P. Gibson, M. Jayaweera, N. Rubin, A. Tumeo, J. L. Abellán, J. Cano, and D. Kaeli, “AXI4MLIR: User-Driven Auto- matic Host Code Generation for Custom AXI-Based Accelerators,” in Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO ’24, 2024, p. 143–157

2024
[6]

MLIR: Scaling Compiler Infrastructure for Domain Specific Computation,

C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation,” in IEEE/ACM International Symposium on Code Generation and Optimiza- tion, ser. CGO’21, 2021, pp. 2–14

2021
[7]

’linalg’ Dialect,

M. Developers, “’linalg’ Dialect,” 2020, online accessed on 11-04-2023. [Online]. Available: https://mlir.llvm.org/docs/Dialects/Linalg/

2020
[8]

Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR,

J. Haris, N. B. Agostini, A. Tumeo, D. Kaeli, and J. Cano, “Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR,”
[9]

Available: https://arxiv.org/abs/2402.19184

[Online]. Available: https://arxiv.org/abs/2402.19184

work page arXiv
[10]

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,

Y .-H. Chen, J. Emer, and V . Sze, “Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. New York, USA: IEEE Press, 2016, p. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40

work page doi:10.1109/isca.2016.40 2016
[11]

DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,

P. Gibson, J. Cano, E. Crowley, A. Storkey, and M. O’boyle, “DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,”ACM Transactions on Architecture and Code Optimization, 2025

2025
[12]

Mlir-based ai engine toolchain and iron api,

Xilinx/AMD, “Mlir-based ai engine toolchain and iron api,” https: //github.com/Xilinx/mlir-aie, 2025, accessed on December 5, 2025

2025
[13]

Compiler-driven simulation of reconfigurable hardware accelerators,

Z. Li, Y . Ye, S. Neuendorffer, and A. Sampso, “Compiler-driven simulation of reconfigurable hardware accelerators,” 2022. [Online]. Available: https://arxiv.org/abs/2202.00739

work page arXiv 2022

[1] [1]

Hirac: A hier- archical accelerator with sorting-based packing for spgemms in dnn applications,

H. Shabani, A. Singh, B. Youhana, and X. Guo, “Hirac: A hier- archical accelerator with sorting-based packing for spgemms in dnn applications,” inIEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’23, 2023, pp. 247–258

2023

[2] [2]

Inca: Input-stationary dataflow at outside-the- box thinking about deep learning accelerators,

B. Kim, S. Li, and H. Li, “Inca: Input-stationary dataflow at outside-the- box thinking about deep learning accelerators,” inIEEE International Symposium on High-Performance Computer Architecture, ser. HPCA’23, 2023, pp. 29–41

2023

[3] [3]

Mp-rec: Hardware-software co-design to enable multi-path recommendation,

S. Hsia, U. Gupta, B. Acun, N. Ardalani, P. Zhong, G.-Y . Wei, D. Brooks, and C.-J. Wu, “Mp-rec: Hardware-software co-design to enable multi-path recommendation,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS 2023, 2023, p. 449–465

2023

[4] [4]

Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,

F. Muñoz Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Aca- cio, and T. Krishna, “Flexagon: A multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS’23, 2023, p. 252–265

2023

[5] [5]

AXI4MLIR: User-Driven Auto- matic Host Code Generation for Custom AXI-Based Accelerators,

N. B. Agostini, J. Haris, P. Gibson, M. Jayaweera, N. Rubin, A. Tumeo, J. L. Abellán, J. Cano, and D. Kaeli, “AXI4MLIR: User-Driven Auto- matic Host Code Generation for Custom AXI-Based Accelerators,” in Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO ’24, 2024, p. 143–157

2024

[6] [6]

MLIR: Scaling Compiler Infrastructure for Domain Specific Computation,

C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation,” in IEEE/ACM International Symposium on Code Generation and Optimiza- tion, ser. CGO’21, 2021, pp. 2–14

2021

[7] [7]

’linalg’ Dialect,

M. Developers, “’linalg’ Dialect,” 2020, online accessed on 11-04-2023. [Online]. Available: https://mlir.llvm.org/docs/Dialects/Linalg/

2020

[8] [8]

Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR,

J. Haris, N. B. Agostini, A. Tumeo, D. Kaeli, and J. Cano, “Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR,”

[9] [9]

Available: https://arxiv.org/abs/2402.19184

[Online]. Available: https://arxiv.org/abs/2402.19184

work page arXiv

[10] [10]

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,

Y .-H. Chen, J. Emer, and V . Sze, “Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. New York, USA: IEEE Press, 2016, p. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40

work page doi:10.1109/isca.2016.40 2016

[11] [11]

DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,

P. Gibson, J. Cano, E. Crowley, A. Storkey, and M. O’boyle, “DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,”ACM Transactions on Architecture and Code Optimization, 2025

2025

[12] [12]

Mlir-based ai engine toolchain and iron api,

Xilinx/AMD, “Mlir-based ai engine toolchain and iron api,” https: //github.com/Xilinx/mlir-aie, 2025, accessed on December 5, 2025

2025

[13] [13]

Compiler-driven simulation of reconfigurable hardware accelerators,

Z. Li, Y . Ye, S. Neuendorffer, and A. Sampso, “Compiler-driven simulation of reconfigurable hardware accelerators,” 2022. [Online]. Available: https://arxiv.org/abs/2202.00739

work page arXiv 2022