
arxiv: 2605.12445 · v2 · pith:LVSFEATXnew · submitted 2026-05-12 · 💻 cs.PF

Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation

Pith reviewed 2026-05-20 21:23 UTC · model grok-4.3

classification 💻 cs.PF
keywords vector-length-agnostic · packed data layouts · scalable vectorization · ML code generation · tiling and fusion · compiler extensions · performance portability

The pith

Vector-length-aware packed layouts enable practical code generation for vector-length-agnostic ML execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that data layouts aware of vector length, together with matching compiler changes, make it feasible to produce vector-length-agnostic code inside an ML compilation pipeline. This matters because hardware now supplies scalable vector instructions whose exact length is not known until runtime, so tiling and layout choices can no longer be fixed at compile time. The method adapts packing, tiling, fusion, and vectorization to operate on runtime vector lengths instead. Provided the layout overhead stays modest, the resulting code is competitive with, and often faster than, fixed-length versions, and it continues to improve as vector length grows on compute-bound tasks.

Core claim

The paper claims that vector-length-aware packed data layouts, paired with extensions to tiling, fusion, and vectorization, allow an end-to-end ML compiler to emit code that runs correctly and efficiently at any runtime vector length. The generated code is competitive with fixed-vector baselines and scales with longer vectors on compute-bound workloads.

What carries the argument

Vector-length-aware packed data layouts that arrange storage according to the actual vector length so that scalable vector operations can access data efficiently without compile-time commitment to a fixed length.
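
As a rough illustration of the idea (not the paper's actual MLIR/IREE implementation), the sketch below packs a row-major matrix into tiles whose inner dimension equals the number of vector lanes, a value that is fixed only once the hardware vector length is known. The names `lanes_for` and `pack_row_major`, the tile shape, and the zero padding are all assumptions made for this example.

```python
def lanes_for(vector_bits, element_bits=32):
    """Number of elements per vector register (hypothetical helper)."""
    return vector_bits // element_bits


def pack_row_major(matrix, rows, cols, tile_rows, lanes):
    """Pack a flat row-major rows x cols matrix into a flat buffer laid out as
    [row_tile][col_tile][tile_rows][lanes], zero-padding partial tiles.

    `lanes` is a runtime value here, standing in for a vector length that is
    unknown at compile time.
    """
    row_tiles = -(-rows // tile_rows)  # ceiling division
    col_tiles = -(-cols // lanes)
    packed = [0.0] * (row_tiles * col_tiles * tile_rows * lanes)
    for r in range(rows):
        for c in range(cols):
            rt, ri = divmod(r, tile_rows)  # which row tile, row inside the tile
            ct, ci = divmod(c, lanes)      # which column tile, lane inside the tile
            idx = ((rt * col_tiles + ct) * tile_rows + ri) * lanes + ci
            packed[idx] = matrix[r * cols + c]
    return packed, (row_tiles, col_tiles, tile_rows, lanes)
```

Calling `pack_row_major` with `lanes_for(128)` and then with `lanes_for(256)` yields two different buffers from the same source matrix, which is the sense in which the layout follows the vector length rather than being chosen ahead of time.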

If this is right

  • A single generated implementation can execute correctly on hardware with different vector lengths.
  • Performance remains competitive with or exceeds that of traditional fixed-length vector code.
  • The generated code exhibits continued improvement as vector length increases on compute-bound workloads.
  • Tiling, fusion, and vectorization passes can be extended to respect runtime vector lengths without losing their effectiveness.
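
The first point above can be pictured with a small simulation. This is a minimal sketch in plain Python, not SVE code and not output of the paper's compiler: the function `vla_axpy` is hypothetical, and `lanes` stands in for a hardware vector length discovered at runtime. The tail handling mimics the effect of a lane predicate such as the one SVE's `whilelt` instruction produces.

```python
def vla_axpy(a, x, y, lanes):
    """Return y + a * x, processing `lanes` elements per step.

    The loop never assumes a particular value of `lanes`; the final,
    possibly partial, step is handled by limiting the active lanes.
    """
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        # Active lanes are those whose index is still inside the array.
        active = min(lanes, n - i)
        for lane in range(active):
            out[i + lane] += a * x[i + lane]
        i += lanes
    return out
```

Running the same function with `lanes` set to 4, 8, or 16 gives identical results, which is the correctness half of the claim; the performance half depends on real hardware and is not captured by a simulation like this.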

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layout technique could be tested on numerical kernels outside machine learning to check whether the overhead remains acceptable in other domains.
  • If the overhead proves consistently low, the approach would lower the cost of writing performance-portable code across future vector architectures.
  • Measuring cache-miss rates on memory-bound workloads would give a concrete test of where the packing cost becomes the dominant factor.

Load-bearing premise

The memory footprint and access overhead introduced by the vector-length-aware packed layouts remain small enough not to cancel the gains from scalable vectorization.
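
One way to reason about the footprint part of this premise is to count padded elements. The sketch below assumes, hypothetically, that each matrix dimension is rounded up to a whole tile and that the inner tile width equals the number of vector lanes; the paper's real layout and padding policy may differ, and the function names are invented for illustration.

```python
def padded_elements(rows, cols, tile_rows, lanes):
    """Element count after rounding both dimensions up to whole tiles."""
    padded_rows = -(-rows // tile_rows) * tile_rows  # ceiling to a tile multiple
    padded_cols = -(-cols // lanes) * lanes
    return padded_rows * padded_cols


def footprint_overhead(rows, cols, tile_rows, vector_bits, element_bits=32):
    """Fractional growth in stored elements caused by padding alone."""
    lanes = vector_bits // element_bits
    return padded_elements(rows, cols, tile_rows, lanes) / (rows * cols) - 1.0
```

Under this toy model, a 100 × 100 matrix of 32-bit elements with `tile_rows=8` grows by 4% at 128-bit vectors, about 16.5% at 512 bits, and about 33% at 2048 bits. These are outputs of the assumed padding rule, not measurements from the paper, but they show why shapes that do not divide evenly into tiles put more pressure on the premise as vectors get longer.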

What would settle it

A direct comparison on the same ML workloads in which the packed-layout code runs slower than the fixed-vector baseline on the target hardware would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.12445 by Ege Beysel, Jan Moritz Joseph, Maximilian Bartel.

Figure 1: Representative transformation from a row-major …
Figure 2: Speedups achieved with our IREE (SVE) code generation approach against (2a) the existing NEON pipeline in IREE, …
Figure 3: Speedup of our scalable SVE code generation relative …
Original abstract

Scalable vector instruction sets such as Arm SVE enable vector-length-agnostic (VLA) execution, allowing a single implementation to adapt across hardware with different vector lengths. However, they complicate compiler code generation, as tiling and data layout decisions can no longer be fixed at compile time. We present an approach for enabling VLA code generation in an end-to-end ML compilation pipeline through vector-length-aware packed data layouts and corresponding compiler extensions. We integrate these mechanisms into MLIR/IREE and extend tiling, fusion, and vectorization to operate with scalable vector lengths. Evaluated on real-world ML workloads on Arm CPUs, our approach generates SVE code that is competitive with, and often outperforms, existing NEON-based code generation within IREE, achieving up to $1.45\times$ speedup. We also outperform PyTorch ecosystem frameworks, including ExecuTorch, TorchInductor, and eager execution, demonstrating the effectiveness of scalable vectorization in a production compiler setting. A simulator-based study further shows that the generated code scales with increasing SVE vector length on compute-bound workloads, supporting performance portability across hardware configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents vector-length-aware packed data layouts and corresponding compiler extensions integrated into MLIR/IREE to enable vector-length-agnostic (VLA) code generation for ML workloads on Arm SVE. It extends tiling, fusion, and vectorization to operate with scalable vector lengths, reports up to 1.45× speedup over NEON-based generation within IREE, outperforms PyTorch frameworks including ExecuTorch and TorchInductor, and includes a simulator study showing scaling with increasing SVE vector length on compute-bound workloads.

Significance. If the central performance claims hold after addressing evaluation gaps, the work would be significant for enabling performance-portable ML compilation on scalable vector ISAs. The end-to-end integration into a production compiler pipeline and the demonstration of scaling behavior are concrete strengths that could influence future VLA code generation techniques.

major comments (1)
  1. Evaluation section: The reported end-to-end speedups (up to 1.45×) compare against NEON and PyTorch baselines but do not include an ablation isolating the memory footprint growth, strided access costs, or runtime overhead attributable to the vector-length-aware packed layouts themselves. Without such isolation (e.g., via cache-miss counters or comparisons of VLA vectorized kernels on conventional vs. packed layouts), it remains unclear whether packing costs offset gains on memory-bound kernels or at larger SVE lengths, which is load-bearing for the speedup claim.
minor comments (2)
  1. Abstract: The 1.45× speedup figure lacks accompanying details on statistical significance, number of runs, or variance; adding these would strengthen the empirical claims.
  2. The manuscript would benefit from a clearer description of the layout selection heuristics and any compile-time or runtime costs associated with packing decisions.
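
The first minor comment could be addressed with a small reporting helper along these lines. This is a sketch under stated assumptions: `summarize_speedup` is a hypothetical function, the choice of median runtimes and coefficient of variation is one reasonable convention rather than the authors' methodology, and the inputs are lists of measured wall-clock times that the user supplies.

```python
import statistics


def summarize_speedup(baseline_times, candidate_times):
    """Summarize repeated timing runs for a baseline and a candidate.

    Reports the run counts, the ratio of median runtimes, and each
    configuration's coefficient of variation (sample stdev / mean).
    Each list needs at least two runs for the standard deviation.
    """
    base_med = statistics.median(baseline_times)
    cand_med = statistics.median(candidate_times)
    return {
        "runs": (len(baseline_times), len(candidate_times)),
        "speedup": base_med / cand_med,
        "baseline_cv": statistics.stdev(baseline_times) / statistics.mean(baseline_times),
        "candidate_cv": statistics.stdev(candidate_times) / statistics.mean(candidate_times),
    }
```

Reporting the run count and variation next to a headline ratio lets a reader judge whether a figure such as 1.45× sits well outside run-to-run noise.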

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the major comment on the evaluation below and will revise the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: Evaluation section: The reported end-to-end speedups (up to 1.45×) compare against NEON and PyTorch baselines but do not include an ablation isolating the memory footprint growth, strided access costs, or runtime overhead attributable to the vector-length-aware packed layouts themselves. Without such isolation (e.g., via cache-miss counters or comparisons of VLA vectorized kernels on conventional vs. packed layouts), it remains unclear whether packing costs offset gains on memory-bound kernels or at larger SVE lengths, which is load-bearing for the speedup claim.

    Authors: We agree that the current evaluation would benefit from a dedicated ablation isolating the overheads of the vector-length-aware packed layouts. The manuscript focuses on end-to-end speedups and a simulator study showing scaling on compute-bound workloads, but does not provide the requested isolation of memory footprint growth, strided access costs, or runtime overheads via cache-miss counters or direct comparisons against conventional layouts. In the revised manuscript we will add such an ablation, including performance comparisons of VLA kernels on packed versus conventional layouts and relevant hardware counter data where available from our experimental setup. This will better quantify any costs and clarify the contribution to the reported speedups, especially for memory-bound cases and larger vector lengths.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; results from direct implementation and benchmarking

Full rationale

The manuscript describes an engineering approach for vector-length-aware packed layouts and compiler extensions in MLIR/IREE to support VLA code generation for ML workloads. Central claims rest on explicit integration of tiling/fusion/vectorization with scalable lengths, followed by end-to-end performance measurements against NEON baselines and PyTorch frameworks on Arm hardware plus simulator scaling studies. No equations, fitted parameters, or predictions are presented that reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained through concrete implementation details and external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard compiler infrastructure assumptions and hardware vector-length-agnostic features; no new free parameters, axioms, or invented entities are introduced or fitted in the reported work.

pith-pipeline@v0.9.0 · 5732 in / 1134 out tokens · 40630 ms · 2026-05-20T21:23:54.185842+00:00 · methodology


