Lifting to tensors when compiling scientific computing workloads for AI Engines

Gabriel Rodriguez-Canal; Nick Brown

arxiv: 2605.03566 · v1 · submitted 2026-05-05 · 💻 cs.DC


Pith reviewed 2026-05-07 14:16 UTC · model grok-4.3


keywords: OpenMP · AI Engines · tensor lifting · scientific computing · compilation pipeline · heterogeneous execution · energy efficiency · kernel benchmarks


The pith

Lifting OpenMP loops to tensors maps scientific codes to AI Engines with minimal changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a compilation pipeline that lifts the semantics of general-purpose loops annotated with OpenMP into tensor representations. This tensor view supplies the structure needed to target AMD AI Engines without rewriting the original code. On six kernel benchmarks from AI and scientific computing, the AI Engine hardware performs comparably to a multicore CPU for 32-bit floating point, in every case at lower energy to solution. For two scientific kernels, splitting work across CPU and AI Engine yields up to 40% higher performance and 15% lower energy use than the CPU alone.

Core claim

Lifting the semantics of an application into tensors captures the intention of general purpose loops annotated with OpenMP and such high-level tensor information provides a richness that is effective when mapping to the AI Engines. Requiring only an OpenMP decorated loop, the approach significantly reduces code complexity when targeting the architecture.

What carries the argument

Tensor lifting of OpenMP loop semantics, which converts general loop structures into tensor forms that carry enough information for direct mapping to the AI Engine execution model.
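To make the idea of lifting concrete, here is a minimal sketch in plain Python. It is not the paper's MLIR pipeline: the loop stands in for an OpenMP-annotated loop, and the helper names (`tensor_scale`, `tensor_add`, `axpy_lifted`) are hypothetical stand-ins for whole-tensor operations. It only shows that an element-wise loop body can be rewritten as a composition of tensor operations with the same result.

```python
# Illustrative only: a plain Python loop stands in for an OpenMP-annotated loop,
# and list-based helpers stand in for whole-tensor operations.

def axpy_loop(alpha, x, y):
    """Loop form: conceptually a parallel-for over i with independent iterations."""
    out = [0.0] * len(x)
    for i in range(len(x)):  # affine index i, no cross-iteration dependence
        out[i] = alpha * x[i] + y[i]
    return out


def tensor_scale(alpha, t):
    """Whole-tensor op (hypothetical): multiply every element by a scalar."""
    return [alpha * v for v in t]


def tensor_add(a, b):
    """Whole-tensor op (hypothetical): element-wise add of equally shaped tensors."""
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return [u + v for u, v in zip(a, b)]


def axpy_lifted(alpha, x, y):
    """Lifted form: the loop body expressed as a composition of tensor ops."""
    return tensor_add(tensor_scale(alpha, x), y)


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 0.5, 0.5, 0.5]
print(axpy_loop(2.0, x, y) == axpy_lifted(2.0, x, y))
```

The lifted form no longer mentions the loop index, which is the kind of high-level structure a backend can map onto vector units or tiles.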

If this is right

Scientific and AI codes can target AI Engines using only standard OpenMP loop annotations instead of architecture-specific rewrites.
For float32 workloads the AI Engine delivers CPU-comparable speed at lower energy to solution.
Heterogeneous execution across CPU and AI Engine improves performance by up to 40% and reduces energy by 15% for selected scientific kernels.
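The heterogeneous result in the last item can be pictured with a toy work-splitting model. The sketch below is not the paper's scheduler. It assumes each device processes iterations at a fixed rate, ignores data movement and launch overheads, and uses hypothetical function names and invented rates that are not measurements from the paper.

```python
def split_iterations(n, cpu_rate, npu_rate):
    """Split n iterations in proportion to device throughput.

    Returns two half-open (start, stop) ranges: one for the CPU, one for the NPU.
    Rates are iterations per unit time and are assumed constant.
    """
    if n < 0 or cpu_rate <= 0 or npu_rate <= 0:
        raise ValueError("n must be >= 0 and rates must be positive")
    cpu_share = round(n * cpu_rate / (cpu_rate + npu_rate))
    return (0, cpu_share), (cpu_share, n)


def makespan(n, cpu_rate, npu_rate):
    """Time until both devices finish their share (they run concurrently)."""
    (c0, c1), (p0, p1) = split_iterations(n, cpu_rate, npu_rate)
    return max((c1 - c0) / cpu_rate, (p1 - p0) / npu_rate)


def speedup_over_cpu(n, cpu_rate, npu_rate):
    """Ratio of CPU-only time to co-execution time under this toy model."""
    return (n / cpu_rate) / makespan(n, cpu_rate, npu_rate)


# Invented rates, chosen only to exercise the model.
print(split_iterations(1400, 10, 4))
print(speedup_over_cpu(1400, 10, 4))
```

Under these assumptions the ideal speedup over the CPU alone is `1 + npu_rate / cpu_rate`; real gains also depend on the overheads this model leaves out.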

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lifting technique could be adapted for other integrated accelerators if equivalent high-level semantic mappings are defined.
Widespread use of OpenMP annotations in scientific codes would lower the barrier to exploiting future on-chip NPUs without expert porting effort.
The method highlights tensor representations as a potential bridge between ordinary loop-based code and domain-specific hardware.

Load-bearing premise

That converting OpenMP-annotated general-purpose loops into tensors preserves their original meaning and permits correct, high-performance execution on AI Engines for arbitrary scientific codes.

What would settle it

An OpenMP-annotated scientific loop whose tensor-lifted version produces wrong results or loses the claimed performance and energy advantages when run on the AI Engine hardware.
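One way to run the correctness half of such a test is an element-wise comparison between a CPU reference and the accelerator output. The sketch below assumes a tolerance-based check is acceptable for float32 results. The tolerances and the names `to_float32` and `outputs_match` are illustrative choices, not taken from the paper.

```python
import math
import struct


def to_float32(v):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def outputs_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """True if both outputs have the same length and agree element-wise.

    The tolerances are assumptions; a real check would pick them per kernel.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Reference computed in double precision; candidate rounded to float32
# to mimic a float32 device result.
reference = [0.1 * i for i in range(8)]
candidate = [to_float32(v) for v in reference]
print(outputs_match(reference, candidate))

# A deliberately corrupted element should be detected.
broken = list(candidate)
broken[3] += 1e-2
print(outputs_match(reference, broken))
```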

Figures

Figures reproduced from arXiv: 2605.03566 by Gabriel Rodriguez-Canal, Nick Brown.

**Figure 1.** Illustration of AMD’s Hawk Point NPU, comprising five columns

**Figure 2.** Illustration of our MLIR-based OpenMP loop compilation flow for the AI Engines.

Original abstract

It has been demonstrated that specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have the potential to deliver energy and performance advantages for scientific computing. Given the integration of AIEs into AMD's CPUs, this is an interesting potential avenue especially when executing on the edge or making better use of local compute constrained resources. However, a major challenge is in enabling existing codes to run on this architecture without extensive modification. Put simply, it requires significant expertise and time to port codes to the AIE's execution model. In this paper we explore a compilation pipeline for efficiently mapping loops in general purpose, scientific codes to AIEs. Lifting the semantics of an application into tensors, we demonstrate that this is able to capture the intention of general purpose loops annotated with OpenMP and such high-level tensor information provides a richness that is effective when mapping to the AIEs. Requiring only an OpenMP decorated loop, our approach significantly reduces code complexity when targeting the architecture. For six kernel benchmarks, representing AI and scientific computing, using our approach the NPU performs comparatively to the multicore CPU for float32, in all cases at reduced energy to solution. For two scientific computing kernels running across both the CPU and NPU together delivers up to a 40% improvement in performance and 15% reduction in energy usage compared to the CPU alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tensor lifting lets OpenMP loops target AIEs with less rewriting and shows decent kernel results, but the approach needs clearer limits on what loops it actually handles.


The paper's core idea is a compilation pipeline that lifts the semantics of OpenMP-annotated loops into tensors so they can run on AMD AI Engines without heavy manual porting. On the six kernels tested, the NPU performs comparably to the multicore CPU for float32 while using less energy, and hybrid CPU-NPU execution on two scientific kernels gives up to 40% better performance and 15% lower energy than the CPU alone. That is the practical takeaway worth noting first.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a compilation pipeline for mapping OpenMP-annotated loops from general-purpose scientific codes to AMD AI Engines by lifting their semantics to tensor representations. The authors argue that this high-level tensor information facilitates effective mapping to the AIE architecture, significantly reducing the code complexity required for targeting it. Empirical evaluation on six kernel benchmarks from AI and scientific computing domains shows that the NPU achieves performance comparable to a multicore CPU for float32 operations, with lower energy consumption. Additionally, for two scientific kernels, hybrid execution on CPU and NPU together provides up to 40% performance improvement and 15% energy reduction compared to CPU alone.

Significance. If the tensor lifting correctly preserves semantics and the empirical results hold under rigorous conditions, this approach could substantially lower the expertise barrier for porting scientific workloads to integrated AIE hardware, enabling energy-efficient execution on edge and resource-constrained systems. The hybrid CPU+NPU gains demonstrate practical value in co-execution strategies for heterogeneous architectures.

major comments (2)

[Abstract] The benchmark outcomes are reported without details on experimental setup, error bars, benchmark selection criteria, or verification of tensor lifting correctness. This omission is load-bearing for the central empirical claims of comparable float32 performance, reduced energy to solution, and hybrid improvements up to 40%/15%.
[Compilation Pipeline] The tensor lifting approach (compilation pipeline section): No formal semantics or supported OpenMP subset is defined. This is load-bearing for the claim that an OpenMP-decorated loop alone suffices for arbitrary scientific codes, as non-affine accesses, reductions, or irregular patterns common in scientific computing may not map to dense tensor abstractions without semantic loss or performance degradation.
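The distinction the second comment draws between affine and irregular accesses can be illustrated with a small check. This is only a sketch: a real compiler would analyse the index expression symbolically rather than by sampling, and `is_affine_on_range` is a hypothetical helper, not part of the paper's pipeline.

```python
def is_affine_on_range(index_fn, n):
    """True if index_fn(i) == base + stride * i for every i in range(n).

    This is a sampled check over one iteration range, not a proof about
    the index expression itself.
    """
    if n < 2:
        return True
    base = index_fn(0)
    stride = index_fn(1) - base
    return all(index_fn(i) == base + stride * i for i in range(n))


# Strided access a[2*i + 1]: affine, maps naturally to a dense tensor slice.
print(is_affine_on_range(lambda i: 2 * i + 1, 16))

# Indirect access a[idx[i]]: depends on run-time data, generally not affine.
idx = [3, 0, 2, 1]
print(is_affine_on_range(lambda i: idx[i], len(idx)))

# Quadratic access a[i*i]: not affine.
print(is_affine_on_range(lambda i: i * i, 16))
```

Loops of the first kind fit dense tensor abstractions directly, while the other two need gather-style handling or manual restructuring.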

minor comments (2)

[Abstract] The abstract would benefit from a brief quantitative measure (e.g., lines of code or porting effort) to support the claim of significantly reduced code complexity.
Consider adding a table in the evaluation section listing the six kernels with their key characteristics (iteration spaces, access patterns) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional clarity would strengthen the manuscript. We address each major comment below and commit to revisions that improve the presentation of experimental details and the scope of the compilation approach.


Referee: [Abstract] The benchmark outcomes are reported without details on experimental setup, error bars, benchmark selection criteria, or verification of tensor lifting correctness. This omission is load-bearing for the central empirical claims of comparable float32 performance, reduced energy to solution, and hybrid improvements up to 40%/15%.

Authors: We agree that the abstract would be strengthened by including these details. In the revised manuscript we will expand the abstract to briefly describe the experimental setup (AMD Ryzen AI processor with integrated AIEs), the benchmark selection (six kernels drawn from AI and scientific computing domains with regular affine access patterns), verification of tensor lifting correctness (output equivalence checks against CPU baselines), and the presence of error bars derived from multiple runs. The full methodology, including run counts and statistical methods, will be elaborated in the evaluation section with cross-references from the abstract.

revision: yes

Referee: [Compilation Pipeline] The tensor lifting approach (compilation pipeline section): No formal semantics or supported OpenMP subset is defined. This is load-bearing for the claim that an OpenMP-decorated loop alone suffices for arbitrary scientific codes, as non-affine accesses, reductions, or irregular patterns common in scientific computing may not map to dense tensor abstractions without semantic loss or performance degradation.

Authors: We acknowledge that the manuscript does not supply a formal semantics or an exhaustive enumeration of supported OpenMP constructs. We will revise the compilation pipeline section to explicitly define the supported OpenMP subset (parallel for loops with static scheduling, affine accesses, and no reductions or irregular control flow) and to provide an informal semantics description of how loop nests are lifted to tensor representations. We will also add a limitations paragraph clarifying that non-affine or reduction-heavy patterns fall outside the current scope and may require manual restructuring. These changes will accurately bound the applicability of the approach without altering the results for the evaluated kernels.

revision: partial
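The first response mentions error bars derived from multiple runs. A minimal sketch of how such figures are commonly produced is shown below, assuming wall-clock timing and a sample standard deviation. The run count, the toy workload, and the helper names are illustrative and do not describe the authors' methodology.

```python
import statistics
import time


def time_runs(fn, repeats=5):
    """Call fn repeatedly and return the wall-clock duration of each call."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation); the latter is 0.0 for one run."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


# Toy workload standing in for a kernel invocation.
samples = time_runs(lambda: sum(i * i for i in range(10_000)), repeats=5)
mean, stdev = summarize(samples)
print(f"mean = {mean:.6f} s, stdev = {stdev:.6f} s over {len(samples)} runs")
```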

Circularity Check

0 steps flagged

No circularity: empirical evaluation of tensor-lifting compilation pipeline


The paper presents an engineering compilation pipeline that lifts OpenMP-annotated loops to tensors for mapping onto AI Engines, with all load-bearing claims resting on direct performance and energy measurements across six kernels. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced; the lifting step is described as an implementation choice whose correctness and effectiveness are demonstrated by benchmark results rather than derived from prior self-referential definitions. The abstract and description contain no self-citations that serve as load-bearing premises, and the hybrid CPU+NPU gains are reported as observed outcomes, not predictions forced by construction. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail to identify specific free parameters or invented entities; the central approach assumes standard compiler semantics for OpenMP and tensor representations.

axioms (1)

domain assumption: OpenMP annotations on loops sufficiently capture high-level semantics for accurate tensor lifting without additional programmer input.
The pipeline requires only OpenMP decorated loops and relies on this to reduce code complexity while preserving intent.



