Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation
Pith reviewed 2026-05-20 21:23 UTC · model grok-4.3
The pith
Vector-length-aware packed layouts enable practical code generation for vector-length-agnostic ML execution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that vector-length-aware packed data layouts, paired with extensions to tiling, fusion, and vectorization, let an end-to-end ML compiler emit code that runs correctly for whatever vector length the hardware provides at runtime. The generated code is reported to be competitive with fixed-vector baselines and to scale with longer vectors on compute-bound workloads.
What carries the argument
Vector-length-aware packed data layouts that arrange storage according to the actual vector length so that scalable vector operations can access data efficiently without compile-time commitment to a fixed length.
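As an illustration of the idea, the sketch below packs a row-major matrix into column panels whose width equals a vector length supplied at run time. It is a minimal pure-Python model, not the paper's MLIR/IREE implementation: the names `pack_panels` and `packed_index`, the panel shape, and the zero padding are all assumptions made for this example.

```python
def pack_panels(matrix, vl_elems):
    """Pack a row-major matrix (list of rows) into column panels.

    Each panel is vl_elems columns wide and stores its rows contiguously,
    so one vector load of vl_elems elements reads one row of one panel.
    The last panel is zero-padded when cols is not a multiple of vl_elems.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    num_panels = -(-cols // vl_elems)  # ceiling division
    packed = []
    for p in range(num_panels):
        for r in range(rows):
            for j in range(vl_elems):
                c = p * vl_elems + j
                packed.append(matrix[r][c] if c < cols else 0)
    return packed


def packed_index(r, c, rows, vl_elems):
    """Offset of logical element (r, c) inside the packed buffer."""
    panel, lane = divmod(c, vl_elems)
    return (panel * rows + r) * vl_elems + lane


# The same source handles any vector length chosen at run time.
matrix = [[10 * r + c for c in range(6)] for r in range(2)]
packed_vl4 = pack_panels(matrix, 4)  # hypothetical 4-lane hardware
packed_vl8 = pack_panels(matrix, 8)  # hypothetical 8-lane hardware
```

The packing routine never hard-codes a lane count; only the value passed as `vl_elems` changes between the two calls, which is the property a vector-length-agnostic layout needs.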
If this is right
- A single generated implementation can execute correctly on hardware with different vector lengths.
- Performance remains competitive with or exceeds that of traditional fixed-length vector code.
- The generated code exhibits continued improvement as vector length increases on compute-bound workloads.
- Tiling, fusion, and vectorization passes can be extended to respect runtime vector lengths without losing their effectiveness.
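The first and last implications can be made concrete with a small emulation of a vector-length-agnostic loop. The sketch below strip-mines a loop by a lane count known only at run time and masks the final partial strip with a predicate, in the style of the SVE `WHILELT` instruction. The function names and the AXPY kernel are assumptions chosen for illustration; the paper's actual passes operate on MLIR, not Python.

```python
def whilelt(start, end, vl_elems):
    """Predicate with lane i active while start + i < end (models SVE WHILELT)."""
    return [start + i < end for i in range(vl_elems)]


def vla_axpy(a, x, y, vl_elems):
    """Compute a * x + y in strips of vl_elems lanes, masking the tail strip."""
    n = len(x)
    out = [0.0] * n
    for base in range(0, n, vl_elems):
        pred = whilelt(base, n, vl_elems)
        for lane, active in enumerate(pred):
            if active:  # inactive lanes neither read nor write memory
                out[base + lane] = a * x[base + lane] + y[base + lane]
    return out
```

Calling `vla_axpy` with `vl_elems` set to 4, 8, or 16 returns the same values for the same inputs; only the number of strips changes.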
Where Pith is reading between the lines
- The same layout technique could be tested on numerical kernels outside machine learning to check whether the overhead remains acceptable in other domains.
- If the overhead proves consistently low, the approach would lower the cost of writing performance-portable code across future vector architectures.
- Measuring cache-miss rates on memory-bound workloads would give a concrete test of where the packing cost becomes the dominant factor.
Load-bearing premise
The memory footprint and access overhead introduced by the vector-length-aware packed layouts remain small enough not to cancel the gains from scalable vectorization.
What would settle it
A direct comparison on the same ML workloads in which the packed-layout code runs slower than the fixed-vector baseline on the target hardware would show the central claim does not hold.
Original abstract
Scalable vector instruction sets such as Arm SVE enable vector-length-agnostic (VLA) execution, allowing a single implementation to adapt across hardware with different vector lengths. However, they complicate compiler code generation, as tiling and data layout decisions can no longer be fixed at compile time. We present an approach for enabling VLA code generation in an end-to-end ML compilation pipeline through vector-length-aware packed data layouts and corresponding compiler extensions. We integrate these mechanisms into MLIR/IREE and extend tiling, fusion, and vectorization to operate with scalable vector lengths. Evaluated on real-world ML workloads on Arm CPUs, our approach generates SVE code that is competitive with, and often outperforms, existing NEON-based code generation within IREE, achieving up to $1.45\times$ speedup. We also outperform PyTorch ecosystem frameworks, including ExecuTorch, TorchInductor, and eager execution, demonstrating the effectiveness of scalable vectorization in a production compiler setting. A simulator-based study further shows that the generated code scales with increasing SVE vector length on compute-bound workloads, supporting performance portability across hardware configurations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents vector-length-aware packed data layouts and corresponding compiler extensions integrated into MLIR/IREE to enable vector-length-agnostic (VLA) code generation for ML workloads on Arm SVE. It extends tiling, fusion, and vectorization to operate with scalable vector lengths, reports up to 1.45× speedup over NEON-based generation within IREE, outperforms PyTorch frameworks including ExecuTorch and TorchInductor, and includes a simulator study showing scaling with increasing SVE vector length on compute-bound workloads.
Significance. If the central performance claims hold after addressing evaluation gaps, the work would be significant for enabling performance-portable ML compilation on scalable vector ISAs. The end-to-end integration into a production compiler pipeline and the demonstration of scaling behavior are concrete strengths that could influence future VLA code generation techniques.
major comments (1)
- Evaluation section: The reported end-to-end speedups (up to 1.45×) compare against NEON and PyTorch baselines but do not include an ablation isolating the memory footprint growth, strided access costs, or runtime overhead attributable to the vector-length-aware packed layouts themselves. Without such isolation (e.g., via cache-miss counters or comparisons of VLA vectorized kernels on conventional vs. packed layouts), it remains unclear whether packing costs offset gains on memory-bound kernels or at larger SVE lengths, which is load-bearing for the speedup claim.
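One part of the footprint question can be estimated on paper before any measurement. The sketch below counts the padding elements a tiled layout adds when each dimension is rounded up to a tile multiple. The tile shape (8 rows by one vector of 32-bit elements) and the 100 by 100 matrix are assumptions for illustration, not values taken from the manuscript.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def padding_elems(rows, cols, tile_rows, tile_cols):
    """Extra elements stored when both dimensions are padded to tile multiples."""
    return round_up(rows, tile_rows) * round_up(cols, tile_cols) - rows * cols


def lanes(vector_bits, element_bits=32):
    """Number of elements of element_bits that fit in one vector register."""
    return vector_bits // element_bits


# Hypothetical 100 x 100 f32 matrix, tiles of 8 rows by one vector of columns.
for bits in (128, 512, 2048):
    extra = padding_elems(100, 100, 8, lanes(bits))
    print(f"{bits}-bit vectors: {extra} padding elements")
```

Under these assumptions the padding grows with vector length, which is why an ablation at larger SVE lengths would be informative for memory-bound kernels.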
minor comments (2)
- Abstract: The 1.45× speedup figure lacks accompanying details on statistical significance, number of runs, or variance; adding these would strengthen the empirical claims.
- The manuscript would benefit from a clearer description of the layout selection heuristics and any compile-time or runtime costs associated with packing decisions.
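The first minor comment asks for run counts and variance alongside a speedup figure. A minimal sketch of such a summary is shown below; the function name `summarize_speedup` and the timing values are placeholders invented for this example and are not measurements from the paper.

```python
import statistics


def summarize_speedup(baseline_times, candidate_times):
    """Report mean speedup plus run count and relative spread for each side."""
    base_mean = statistics.mean(baseline_times)
    cand_mean = statistics.mean(candidate_times)
    return {
        "speedup": base_mean / cand_mean,
        "runs": min(len(baseline_times), len(candidate_times)),
        # Coefficient of variation: sample standard deviation over the mean.
        "baseline_cv": statistics.stdev(baseline_times) / base_mean,
        "candidate_cv": statistics.stdev(candidate_times) / cand_mean,
    }


# Placeholder timings in seconds, not data from the manuscript.
result = summarize_speedup([2.0, 2.2, 1.8], [1.0, 1.1, 0.9])
```

Reporting the coefficient of variation next to the ratio lets a reader judge whether a speedup is larger than run-to-run noise.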
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address the major comment on the evaluation below and will revise the manuscript accordingly to strengthen the presentation of our results.
- Referee: Evaluation section: The reported end-to-end speedups (up to 1.45×) compare against NEON and PyTorch baselines but do not include an ablation isolating the memory footprint growth, strided access costs, or runtime overhead attributable to the vector-length-aware packed layouts themselves. Without such isolation (e.g., via cache-miss counters or comparisons of VLA vectorized kernels on conventional vs. packed layouts), it remains unclear whether packing costs offset gains on memory-bound kernels or at larger SVE lengths, which is load-bearing for the speedup claim.
  Authors: We agree that the current evaluation would benefit from a dedicated ablation isolating the overheads of the vector-length-aware packed layouts. The manuscript focuses on end-to-end speedups and a simulator study showing scaling on compute-bound workloads, but does not provide the requested isolation of memory footprint growth, strided access costs, or runtime overheads via cache-miss counters or direct comparisons against conventional layouts. In the revised manuscript we will add such an ablation, including performance comparisons of VLA kernels on packed versus conventional layouts and relevant hardware counter data where available from our experimental setup. This will better quantify any costs and clarify the contribution to the reported speedups, especially for memory-bound cases and larger vector lengths.
  Revision: yes
Circularity Check
No significant circularity; results from direct implementation and benchmarking
The manuscript describes an engineering approach for vector-length-aware packed layouts and compiler extensions in MLIR/IREE to support VLA code generation for ML workloads. Central claims rest on explicit integration of tiling/fusion/vectorization with scalable lengths, followed by end-to-end performance measurements against NEON baselines and PyTorch frameworks on Arm hardware plus simulator scaling studies. No equations, fitted parameters, or predictions are presented that reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained through concrete implementation details and external empirical validation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose scalable packed layouts as an abstraction for representing data layouts parameterized by the hardware vector length... $m_r = f_m(\mathrm{VL})$, $n_r = f_n(\mathrm{VL})$, $k_r = f_k(\mathrm{VL})$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We integrate these mechanisms into MLIR/IREE and extend tiling, fusion, and vectorization to operate with scalable vector lengths.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.