Lifting to tensors when compiling scientific computing workloads for AI Engines
Pith reviewed 2026-05-07 14:16 UTC · model grok-4.3
The pith
Lifting OpenMP loops to tensors maps scientific codes to AI Engines with minimal changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lifting the semantics of an application into tensors captures the intention of general purpose loops annotated with OpenMP and such high-level tensor information provides a richness that is effective when mapping to the AI Engines. Requiring only an OpenMP decorated loop, the approach significantly reduces code complexity when targeting the architecture.
What carries the argument
Tensor lifting of OpenMP loop semantics, which converts general loop structures into tensor forms that carry enough information for direct mapping to the AI Engine execution model.
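To make the idea concrete, the sketch below contrasts an element-at-a-time loop (standing in for an OpenMP `parallel for` loop, shown in the docstring) with a whole-array description of the same computation. It is an illustration only: `ElementwiseOp` and `lift_saxpy` are hypothetical names, not the paper's intermediate representation, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


def saxpy_loop(a: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Scalar loop, analogous to the C form:
        #pragma omp parallel for
        for (i = 0; i < n; i++) out[i] = a * x[i] + y[i];
    """
    out = [0.0] * len(x)
    for i in range(len(x)):  # each iteration is independent of the others
        out[i] = a * x[i] + y[i]
    return out


@dataclass
class ElementwiseOp:
    """Hypothetical tensor-level record: one function applied over a shape."""
    shape: tuple
    fn: Callable[..., float]

    def apply(self, *operands: Sequence[float]) -> List[float]:
        # No iteration order is implied here, so a backend would be free
        # to tile or vectorise the work as it sees fit.
        return [self.fn(*vals) for vals in zip(*operands)]


def lift_saxpy(a: float, n: int) -> ElementwiseOp:
    """Hand-written 'lifted' form of saxpy_loop for vectors of length n."""
    return ElementwiseOp(shape=(n,), fn=lambda xi, yi: a * xi + yi)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
lifted = lift_saxpy(2.0, len(x))
assert lifted.apply(x, y) == saxpy_loop(2.0, x, y)
```

The point of the second form is that the shape and the per-element function are explicit, whereas the loop version leaves both to be inferred from index arithmetic.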
If this is right
- Scientific and AI codes can target AI Engines using only standard OpenMP loop annotations instead of architecture-specific rewrites.
- For float32 workloads the AI Engine delivers CPU-comparable speed at lower energy to solution.
- Running two scientific computing kernels across the CPU and AI Engine together improves performance by up to 40% and reduces energy use by 15% relative to the CPU alone.
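This summary does not say how work is divided between the two devices in the heterogeneous case. As one possibility, a static split of the loop's iteration space could look like the sketch below; `split_range` and `npu_share` are hypothetical names, and the 25% share is an arbitrary value chosen for illustration.

```python
from typing import Tuple


def split_range(n: int, npu_share: float) -> Tuple[range, range]:
    """Statically split iterations [0, n) into a CPU part and an NPU part.

    npu_share is the fraction of iterations handed to the NPU (0.0 to 1.0).
    """
    if not 0.0 <= npu_share <= 1.0:
        raise ValueError("npu_share must be between 0 and 1")
    cut = n - int(n * npu_share)  # CPU takes [0, cut), NPU takes [cut, n)
    return range(0, cut), range(cut, n)


# Arbitrary example: give a quarter of 1000 iterations to the NPU.
cpu_part, npu_part = split_range(1000, 0.25)
assert len(cpu_part) + len(npu_part) == 1000
```

A split like this only pays off when both parts finish at about the same time, so in practice the share would need tuning per kernel.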
Where Pith is reading between the lines
- The same lifting technique could be adapted for other integrated accelerators if equivalent high-level semantic mappings are defined.
- Widespread use of OpenMP annotations in scientific codes would lower the barrier to exploiting future on-chip NPUs without expert porting effort.
- The method highlights tensor representations as a potential bridge between ordinary loop-based code and domain-specific hardware.
Load-bearing premise
That converting OpenMP-annotated general-purpose loops into tensors preserves their original meaning and permits correct, high-performance execution on AI Engines for arbitrary scientific codes.
What would settle it
An OpenMP-annotated scientific loop whose tensor-lifted version produces wrong results or loses the claimed performance and energy advantages when run on the AI Engine hardware.
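A correctness test of this kind amounts to comparing the lifted kernel's output against a reference run of the original loop. The sketch below shows one way to do that with a floating-point tolerance; `outputs_match` is a hypothetical helper, and the default tolerances are assumptions loosely suited to float32 data rather than values taken from the paper.

```python
import math
from typing import Sequence


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  rel_tol: float = 1e-5, abs_tol: float = 1e-6) -> bool:
    """True if both outputs have the same length and agree element-wise."""
    if len(reference) != len(candidate):
        return False
    return all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
               for r, c in zip(reference, candidate))


reference = [0.1 + 0.2, 1.0]   # carries a tiny rounding error
assert outputs_match(reference, [0.3, 1.0])       # within tolerance
assert not outputs_match(reference, [0.3, 1.1])   # a genuinely wrong value
```

Exact equality would be too strict here, because a different evaluation order on the accelerator can change the last bits of a floating-point result without the computation being wrong.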
Original abstract
It has been demonstrated that specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have the potential to deliver energy and performance advantages for scientific computing. Given the integration of AIEs into AMD's CPUs, this is an interesting potential avenue especially when executing on the edge or making better use of local compute constrained resources. However, a major challenge is in enabling existing codes to run on this architecture without extensive modification. Put simply, it requires significant expertise and time to port codes to the AIE's execution model. In this paper we explore a compilation pipeline for efficiently mapping loops in general purpose, scientific codes to AIEs. Lifting the semantics of an application into tensors, we demonstrate that this is able to capture the intention of general purpose loops annotated with OpenMP and such high-level tensor information provides a richness that is effective when mapping to the AIEs. Requiring only an OpenMP decorated loop, our approach significantly reduces code complexity when targeting the architecture. For six kernel benchmarks, representing AI and scientific computing, using our approach the NPU performs comparatively to the multicore CPU for float32, in all cases at reduced energy to solution. For two scientific computing kernels running across both the CPU and NPU together delivers up to a 40% improvement in performance and 15% reduction in energy usage compared to the CPU alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes a compilation pipeline for mapping OpenMP-annotated loops from general-purpose scientific codes to AMD AI Engines by lifting their semantics to tensor representations. The authors argue that this high-level tensor information facilitates effective mapping to the AIE architecture, significantly reducing the code complexity required for targeting it. Empirical evaluation on six kernel benchmarks from AI and scientific computing domains shows that the NPU achieves performance comparable to a multicore CPU for float32 operations, with lower energy consumption. Additionally, for two scientific kernels, hybrid execution on CPU and NPU together provides up to 40% performance improvement and 15% energy reduction compared to CPU alone.
Significance. If the tensor lifting correctly preserves semantics and the empirical results hold under rigorous conditions, this approach could substantially lower the expertise barrier for porting scientific workloads to integrated AIE hardware, enabling energy-efficient execution on edge and resource-constrained systems. The hybrid CPU+NPU gains demonstrate practical value in co-execution strategies for heterogeneous architectures.
Major comments (2)
- [Abstract] The benchmark outcomes are reported without details on experimental setup, error bars, benchmark selection criteria, or verification of tensor lifting correctness. This omission is load-bearing for the central empirical claims of comparable float32 performance, reduced energy to solution, and hybrid improvements up to 40%/15%.
- [Compilation Pipeline] No formal semantics or supported OpenMP subset is defined for the tensor lifting approach. This is load-bearing for the claim that an OpenMP-decorated loop alone suffices for arbitrary scientific codes, as non-affine accesses, reductions, or irregular patterns common in scientific computing may not map to dense tensor abstractions without semantic loss or performance degradation.
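The error bars requested in the first major comment are usually derived from repeated runs of each kernel. A minimal sketch of that calculation follows; `summarise_runs` is a hypothetical helper, and the timings are made-up values for illustration, not measurements from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarise_runs(timings_s: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of repeated run times."""
    if len(timings_s) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(timings_s), statistics.stdev(timings_s)


# Made-up timings in seconds, for illustration only.
runs = [1.02, 0.98, 1.00, 1.01, 0.99]
mean_s, stdev_s = summarise_runs(runs)
print(f"{mean_s:.3f} s +/- {stdev_s:.3f} s over {len(runs)} runs")
```

Reporting the run count alongside the spread matters, since a standard deviation over three runs says far less than one over thirty.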
Minor comments (2)
- [Abstract] The abstract would benefit from a brief quantitative measure (e.g., lines of code or porting effort) to support the claim of significantly reduced code complexity.
- Consider adding a table in the evaluation section listing the six kernels with their key characteristics (iteration spaces, access patterns) to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where additional clarity would strengthen the manuscript. We address each major comment below and commit to revisions that improve the presentation of experimental details and the scope of the compilation approach.
- Referee: [Abstract] The benchmark outcomes are reported without details on experimental setup, error bars, benchmark selection criteria, or verification of tensor lifting correctness. This omission is load-bearing for the central empirical claims of comparable float32 performance, reduced energy to solution, and hybrid improvements up to 40%/15%.
  Authors: We agree that the abstract would be strengthened by including these details. In the revised manuscript we will expand the abstract to briefly describe the experimental setup (AMD Ryzen AI processor with integrated AIEs), the benchmark selection (six kernels drawn from AI and scientific computing domains with regular affine access patterns), verification of tensor lifting correctness (output equivalence checks against CPU baselines), and the presence of error bars derived from multiple runs. The full methodology, including run counts and statistical methods, will be elaborated in the evaluation section with cross-references from the abstract.
  Revision: yes
- Referee: [Compilation Pipeline] No formal semantics or supported OpenMP subset is defined for the tensor lifting approach. This is load-bearing for the claim that an OpenMP-decorated loop alone suffices for arbitrary scientific codes, as non-affine accesses, reductions, or irregular patterns common in scientific computing may not map to dense tensor abstractions without semantic loss or performance degradation.
  Authors: We acknowledge that the manuscript does not supply a formal semantics or an exhaustive enumeration of supported OpenMP constructs. We will revise the compilation pipeline section to explicitly define the supported OpenMP subset (parallel for loops with static scheduling, affine accesses, and no reductions or irregular control flow) and to provide an informal semantics description of how loop nests are lifted to tensor representations. We will also add a limitations paragraph clarifying that non-affine or reduction-heavy patterns fall outside the current scope and may require manual restructuring. These changes will accurately bound the applicability of the approach without altering the results for the evaluated kernels.
  Revision: partial
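The distinction between affine and non-affine accesses in the second response can be illustrated with a small probe. The sketch below samples an index function and checks whether it has the form a*i + b; `looks_affine` is a hypothetical helper, and sampling is a stand-in for the symbolic analysis a compiler would perform, not a description of the paper's implementation.

```python
from typing import Callable


def looks_affine(index_fn: Callable[[int], int], n: int) -> bool:
    """Empirically test whether index_fn(i) == a*i + b for all i in [0, n).

    A sampling probe for illustration; a real compiler would analyse the
    index expression symbolically instead.
    """
    if n < 3:
        return True  # any two points lie on a line
    b = index_fn(0)
    a = index_fn(1) - b
    return all(index_fn(i) == a * i + b for i in range(2, n))


perm = [3, 0, 2, 1, 4, 7, 5, 6]  # indirection table, as used by a gather

assert looks_affine(lambda i: i, 8)             # x[i]
assert looks_affine(lambda i: 2 * i + 1, 8)     # strided access x[2*i + 1]
assert not looks_affine(lambda i: perm[i], 8)   # x[perm[i]] depends on data
assert not looks_affine(lambda i: i * i, 8)     # quadratic index
```

The first two patterns describe a regular slice of an array and fit a dense tensor view; the last two are the kind of access the authors place outside the current scope.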
Circularity Check
No circularity: empirical evaluation of tensor-lifting compilation pipeline
The paper presents an engineering compilation pipeline that lifts OpenMP-annotated loops to tensors for mapping onto AI Engines, with all load-bearing claims resting on direct performance and energy measurements across six kernels. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced; the lifting step is described as an implementation choice whose correctness and effectiveness are demonstrated by benchmark results rather than derived from prior self-referential definitions. The abstract and description contain no self-citations that serve as load-bearing premises, and the hybrid CPU+NPU gains are reported as observed outcomes, not predictions forced by construction. This is a standard non-circular empirical systems paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: OpenMP annotations on loops sufficiently capture high-level semantics for accurate tensor lifting without additional programmer input.