Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation

Ege Beysel; Jan Moritz Joseph; Maximilian Bartel

arxiv: 2605.12445 · v2 · pith:LVSFEATXnew · submitted 2026-05-12 · 💻 cs.PF

Pith reviewed 2026-05-20 21:23 UTC · model grok-4.3

keywords vector-length-agnostic · packed data layouts · scalable vectorization · ML code generation · tiling and fusion · compiler extensions · performance portability

The pith

Vector-length-aware packed layouts enable practical code generation for vector-length-agnostic ML execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that data layouts aware of vector length, together with matching compiler changes, make it feasible to produce vector-length-agnostic code inside an ML compilation pipeline. A sympathetic reader would care because hardware now supplies scalable vector instructions whose exact length is not known until runtime, so fixed tiling and layout choices no longer work. The method adjusts packing, tiling, fusion, and vectorization to operate with runtime lengths instead. When the overhead stays modest, the resulting code runs at least as fast as fixed-length versions and continues to improve as vector length grows on compute-bound tasks.

Core claim

The paper claims that vector-length-aware packed data layouts, paired with extensions to tiling, fusion, and vectorization, allow an end-to-end ML compiler to emit efficient code that runs correctly for any vector length at runtime, is competitive with fixed-vector baselines, and scales with longer vectors on compute-bound workloads.

What carries the argument

Vector-length-aware packed data layouts that arrange storage according to the actual vector length so that scalable vector operations can access data efficiently without compile-time commitment to a fixed length.
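As a rough illustration of what such a layout involves, the sketch below packs a small row-major matrix into column panels whose width equals the number of lanes in one vector. It is a plain-Python model under stated assumptions, not the paper's MLIR/IREE implementation: the helper names `lanes_per_vector` and `pack_columns`, the 32-bit element size, and the zero padding are hypothetical choices made for this example, and the vector length is passed in as an ordinary argument to stand in for a value read from hardware at run time.

```python
def lanes_per_vector(vl_bits, elem_bits=32):
    """Number of elements one vector register holds at this vector length."""
    return vl_bits // elem_bits


def pack_columns(matrix, nr, pad=0.0):
    """Pack a row-major matrix into column panels of width nr.

    Each panel is a flat list of rows * nr elements, so one row of a panel
    is one contiguous vector-sized chunk. The last panel is padded with
    `pad` when the column count is not a multiple of nr.
    (Hypothetical helper for illustration only.)
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    panels = []
    for c0 in range(0, cols, nr):
        panel = []
        for r in range(rows):
            for c in range(c0, c0 + nr):
                panel.append(matrix[r][c] if c < cols else pad)
        panels.append(panel)
    return panels


# A 2 x 6 matrix holding the values 0..11 in row-major order.
matrix = [[float(r * 6 + c) for c in range(6)] for r in range(2)]

# The same packing routine, driven by two simulated vector lengths.
packed_128 = pack_columns(matrix, lanes_per_vector(128))  # 4 lanes
packed_256 = pack_columns(matrix, lanes_per_vector(256))  # 8 lanes
```

With 128-bit vectors the six columns split into two panels, and with 256-bit vectors into one; only the panel width changes, while the packing code stays the same.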

If this is right

A single generated implementation can execute correctly on hardware with different vector lengths.
Performance remains competitive with or exceeds that of traditional fixed-length vector code.
The generated code exhibits continued improvement as vector length increases on compute-bound workloads.
Tiling, fusion, and vectorization passes can be extended to respect runtime vector lengths without losing their effectiveness.
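The first of these points can be pictured with a strip-mined loop whose step is the lane count and whose tail is masked by a predicate. The sketch below simulates that pattern in plain Python. `vla_axpy` and its `lanes` parameter are illustrative names, and the inner per-lane loop only models what a single predicated vector instruction would do; it is not generated code from the paper.

```python
def vla_axpy(a, x, y, lanes):
    """Compute a * x + y in chunks of `lanes` elements.

    `lanes` stands in for the hardware vector length, which is only known
    at run time. The tail is handled by a per-lane predicate instead of a
    separate scalar remainder loop.
    """
    n = len(x)
    out = [0.0] * n
    i = 0
    while i < n:
        # Predicate: lane k is active only if element i + k is in bounds.
        active = [i + k < n for k in range(lanes)]
        for k in range(lanes):
            if active[k]:
                out[i + k] = a * x[i + k] + y[i + k]
        i += lanes
    return out


x = [float(v) for v in range(10)]
y = [1.0] * 10

# One implementation, three simulated vector lengths.
results = [vla_axpy(2.0, x, y, lanes) for lanes in (4, 8, 16)]
```

All three runs produce the same values even though the input length of 10 is not a multiple of any of the lane counts, which is the correctness property the list above describes.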

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same layout technique could be tested on numerical kernels outside machine learning to check whether the overhead remains acceptable in other domains.
If the overhead proves consistently low, the approach would lower the cost of writing performance-portable code across future vector architectures.
Measuring cache-miss rates on memory-bound workloads would give a concrete test of where the packing cost becomes the dominant factor.

Load-bearing premise

The memory footprint and access overhead introduced by the vector-length-aware packed layouts remain small enough not to cancel the gains from scalable vectorization.
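One way to get a first-order feel for this premise is to count how much padding a tile-aligned layout adds as the tile grows with vector length. The sketch below does only that arithmetic. The 100 x 100 operand, the fixed tile height of 8, and the rule that tile width equals the fp32 lane count are assumptions chosen for illustration and are not taken from the paper. The model says nothing about access cost or cache behaviour, which would need measurement.

```python
def round_up(n, tile):
    """Smallest multiple of `tile` that is >= n."""
    return (n + tile - 1) // tile * tile


def padded_elements(rows, cols, mr, nr):
    """Element count after rounding each dimension up to a whole tile."""
    return round_up(rows, mr) * round_up(cols, nr)


def padding_overhead(rows, cols, mr, nr):
    """Extra storage as a fraction of the unpadded matrix."""
    return padded_elements(rows, cols, mr, nr) / (rows * cols) - 1.0


# Hypothetical operand shape and tile rule, for illustration only.
ROWS, COLS, MR = 100, 100, 8

overheads = {}
for vl_bits in (128, 256, 512, 1024, 2048):
    nr = vl_bits // 32  # fp32 lanes per vector at this length
    overheads[vl_bits] = padding_overhead(ROWS, COLS, MR, nr)
    print(f"VL={vl_bits:4d} bits  nr={nr:2d}  overhead={overheads[vl_bits]:.2%}")
```

In this toy model the overhead depends on how the operand shape divides by the tile, so it has to be judged per shape; whether it matters in practice is the question the premise leaves open.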

What would settle it

A direct comparison on the same ML workloads in which the packed-layout code runs slower than the fixed-vector baseline on the target hardware would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.12445 by Ege Beysel, Jan Moritz Joseph, Maximilian Bartel.

**Figure 2.** Speedups achieved with our IREE (SVE) code generation approach against (2a) the existing NEON pipeline in IREE, ...

**Figure 3.** Speedup of our scalable SVE code generation relative ...

Original abstract

Scalable vector instruction sets such as Arm SVE enable vector-length-agnostic (VLA) execution, allowing a single implementation to adapt across hardware with different vector lengths. However, they complicate compiler code generation, as tiling and data layout decisions can no longer be fixed at compile time. We present an approach for enabling VLA code generation in an end-to-end ML compilation pipeline through vector-length-aware packed data layouts and corresponding compiler extensions. We integrate these mechanisms into MLIR/IREE and extend tiling, fusion, and vectorization to operate with scalable vector lengths. Evaluated on real-world ML workloads on Arm CPUs, our approach generates SVE code that is competitive with, and often outperforms, existing NEON-based code generation within IREE, achieving up to $1.45\times$ speedup. We also outperform PyTorch ecosystem frameworks, including ExecuTorch, TorchInductor, and eager execution, demonstrating the effectiveness of scalable vectorization in a production compiler setting. A simulator-based study further shows that the generated code scales with increasing SVE vector length on compute-bound workloads, supporting performance portability across hardware configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds vector-length-aware packed layouts plus MLIR extensions for VLA ML codegen and reports up to 1.45x end-to-end speedups, but does not isolate layout overhead from the vectorization gains.

The core contribution is a set of packed data layouts that stay aware of runtime vector length, plus the changes to tiling, fusion, and vectorization passes inside MLIR/IREE so they can emit SVE code without fixing the length at compile time. That combination is new for ML workloads and lets a single binary adapt across different SVE hardware sizes. They also show the generated code scaling on a simulator as vector length grows on compute-bound kernels, which is useful evidence for portability claims.

The implementation sits inside a production compiler stack and is evaluated on real Arm CPUs against both their own NEON baseline and several PyTorch paths, which gives the results some practical weight. The 1.45x figure comes from actual runs rather than micro-benchmarks alone. That part is solid engineering work worth noting.

The evaluation still leaves the layout cost question open. Speedups are reported end-to-end, with no ablation that runs the same VLA vectorized kernels on conventional layouts to measure extra memory footprint, strided access penalties, or cache effects from the packing. On memory-bound kernels or at larger SVE lengths the padding or alignment rules could eat into the reported gains, and the abstract gives little detail on how layouts are chosen or how large those overheads actually are. The stress-test note on this point holds up from what is visible.

Compiler engineers working on MLIR, IREE, or Arm SVE targets will get the most out of it; people outside that niche will find the numbers interesting but not transformative. The work is grounded enough and the implementation concrete enough that it deserves a serious referee rather than a desk reject. I would send it for review but ask the authors to add overhead ablations and clearer measurements of layout-induced costs before final acceptance.

Referee Report

1 major / 2 minor

Summary. The manuscript presents vector-length-aware packed data layouts and corresponding compiler extensions integrated into MLIR/IREE to enable vector-length-agnostic (VLA) code generation for ML workloads on Arm SVE. It extends tiling, fusion, and vectorization to operate with scalable vector lengths, reports up to 1.45× speedup over NEON-based generation within IREE, outperforms PyTorch frameworks including ExecuTorch and TorchInductor, and includes a simulator study showing scaling with increasing SVE vector length on compute-bound workloads.

Significance. If the central performance claims hold after addressing evaluation gaps, the work would be significant for enabling performance-portable ML compilation on scalable vector ISAs. The end-to-end integration into a production compiler pipeline and the demonstration of scaling behavior are concrete strengths that could influence future VLA code generation techniques.

major comments (1)

Evaluation section: The reported end-to-end speedups (up to 1.45×) compare against NEON and PyTorch baselines but do not include an ablation isolating the memory footprint growth, strided access costs, or runtime overhead attributable to the vector-length-aware packed layouts themselves. Without such isolation (e.g., via cache-miss counters or comparisons of VLA vectorized kernels on conventional vs. packed layouts), it remains unclear whether packing costs offset gains on memory-bound kernels or at larger SVE lengths, which is load-bearing for the speedup claim.

minor comments (2)

Abstract: The 1.45× speedup figure lacks accompanying details on statistical significance, number of runs, or variance; adding these would strengthen the empirical claims.
The manuscript would benefit from a clearer description of the layout selection heuristics and any compile-time or runtime costs associated with packing decisions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the major comment on the evaluation below and will revise the manuscript accordingly to strengthen the presentation of our results.

Referee: Evaluation section: The reported end-to-end speedups (up to 1.45×) compare against NEON and PyTorch baselines but do not include an ablation isolating the memory footprint growth, strided access costs, or runtime overhead attributable to the vector-length-aware packed layouts themselves. Without such isolation (e.g., via cache-miss counters or comparisons of VLA vectorized kernels on conventional vs. packed layouts), it remains unclear whether packing costs offset gains on memory-bound kernels or at larger SVE lengths, which is load-bearing for the speedup claim.

Authors: We agree that the current evaluation would benefit from a dedicated ablation isolating the overheads of the vector-length-aware packed layouts. The manuscript focuses on end-to-end speedups and a simulator study showing scaling on compute-bound workloads, but does not provide the requested isolation of memory footprint growth, strided access costs, or runtime overheads via cache-miss counters or direct comparisons against conventional layouts. In the revised manuscript we will add such an ablation, including performance comparisons of VLA kernels on packed versus conventional layouts and relevant hardware counter data where available from our experimental setup. This will better quantify any costs and clarify the contribution to the reported speedups, especially for memory-bound cases and larger vector lengths.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; results from direct implementation and benchmarking

The manuscript describes an engineering approach for vector-length-aware packed layouts and compiler extensions in MLIR/IREE to support VLA code generation for ML workloads. Central claims rest on explicit integration of tiling/fusion/vectorization with scalable lengths, followed by end-to-end performance measurements against NEON baselines and PyTorch frameworks on Arm hardware plus simulator scaling studies. No equations, fitted parameters, or predictions are presented that reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained through concrete implementation details and external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard compiler infrastructure assumptions and hardware vector-length-agnostic features; no new free parameters, axioms, or invented entities are introduced or fitted in the reported work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We propose scalable packed layouts as an abstraction for representing data layouts parameterized by the hardware vector length... $m_r = f_m(\mathrm{VL})$, $n_r = f_n(\mathrm{VL})$, $k_r = f_k(\mathrm{VL})$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

We integrate these mechanisms into MLIR/IREE and extend tiling, fusion, and vectorization to operate with scalable vector lengths.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
