
arxiv: 2604.04599 · v1 · submitted 2026-04-06 · 💻 cs.DC · cs.CV · cs.LG

LP-GEMM: Integrating Layout Propagation into GEMM Operations

Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3

classification 💻 cs.DC · cs.CV · cs.LG
keywords: GEMM, BLAS, layout propagation, matrix multiplication, performance optimization, scientific computing, machine learning

The pith

LP-GEMM propagates internal packing layouts across chains of GEMM calls to skip repeated data repacking while preserving exact BLAS behavior at sequence boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sequences of dependent matrix multiplications in scientific computing and machine learning waste time on redundant packing and unpacking because each standard BLAS call starts from, and returns to, a canonical memory layout. LP-GEMM decomposes the GEMM kernel so that the packed layout produced by one multiplication is passed directly to the next dependent call. The decomposition removes the intermediate packing steps yet still delivers the numerical results and memory layout that a normal BLAS user would expect when the sequence finishes. The evaluation covers x86 with AVX-512 and RISC-V with RVV 1.0 on MLP-style and attention-style workloads; the reported average speedup of 2.25x over OpenBLAS is for sequential GEMMs on Intel x86. The same technique was used to build a standalone C++ Llama-3.2 inference path that issues only BLAS-level GEMM calls.
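The propagation idea can be illustrated with a toy model. The sketch below is not the paper's kernel: it assumes a single packing format (row panels of height `MR`, stored column by column inside each panel), and the names `pack_rows`, `gemm_packed`, and `unpack_rows` are hypothetical. It chains two multiplications, keeps the intermediate result packed, and unpacks only once at the end.

```python
MR = 2  # panel height; an illustrative value, not taken from the paper


def pack_rows(M, mr=MR):
    """Pack a row-major matrix into panels of mr rows, column by column within each panel."""
    rows, cols = len(M), len(M[0])
    packed = []
    for i0 in range(0, rows, mr):
        for j in range(cols):
            for i in range(i0, i0 + mr):
                packed.append(M[i][j] if i < rows else 0.0)  # zero-pad the last panel
    return packed


def unpack_rows(packed, rows, cols, mr=MR):
    """Inverse of pack_rows on the logical rows x cols elements."""
    M = [[0.0] * cols for _ in range(rows)]
    for idx, value in enumerate(packed):
        panel, rest = divmod(idx, cols * mr)
        j, r = divmod(rest, mr)
        i = panel * mr + r
        if i < rows:  # skip padding rows
            M[i][j] = value
    return M


def gemm_packed(a_packed, rows, k, B, mr=MR):
    """Multiply packed A (rows x k) by canonical B (k x n); C is written in the same packed layout."""
    n = len(B[0])
    panels = (rows + mr - 1) // mr
    c_packed = [0.0] * (panels * n * mr)
    for p in range(panels):
        for j in range(n):
            for r in range(mr):
                acc = 0.0
                for kk in range(k):
                    acc += a_packed[p * k * mr + kk * mr + r] * B[kk][j]
                c_packed[p * n * mr + j * mr + r] = acc
    return c_packed


def gemm_naive(A, B):
    """Reference product on canonical row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # 3 rows: not a multiple of MR
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

a_packed = pack_rows(A)                      # pack once at the start of the chain
h_packed = gemm_packed(a_packed, 3, 3, W1)   # intermediate result stays packed
y_packed = gemm_packed(h_packed, 3, 2, W2)   # consumer reads the packed layout directly
Y = unpack_rows(y_packed, 3, 3)              # unpack once at the end of the chain
reference = gemm_naive(gemm_naive(A, W1), W2)
```

In this toy setting `Y` and `reference` agree because both paths accumulate each dot product in the same order. Whether the real kernel preserves that property is exactly what the review below asks the authors to show.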

Core claim

LP-GEMM is a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries.

What carries the argument

A decomposed GEMM kernel that forwards the internal packed data layout from one operation directly into the next dependent operation.

If this is right

  • Redundant packing and unpacking disappear inside any chain of dependent GEMM operations.
  • BLAS semantic correctness remains unchanged at the start and end of each sequence.
  • Average speedups of 2.25 times over OpenBLAS appear on Intel x86 for MLP-like and attention-like patterns.
  • Performance stays competitive with vendor libraries such as Intel MKL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The propagation technique could be applied to other linear-algebra kernels that also incur repeated packing costs.
  • Code generators that emit long GEMM sequences could insert layout tracking automatically to capture the same savings.
  • Adapting the same decomposition to GPU or accelerator memory systems would require matching the packing strategy to each device's cache and vector layout rules.

Load-bearing premise

The internal layout changes produce exactly the same numerical results and output layout as a sequence of independent, fully packed GEMM calls.

What would settle it

Execute the same sequence of matrix multiplications once with ordinary BLAS GEMM calls and once with LP-GEMM calls on identical inputs, then verify that every element of the final output matrices matches within floating-point tolerance.
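A minimal sketch of such a check follows. The helper names and the tolerance values are assumptions chosen for illustration; the paper does not specify a tolerance, and an appropriate one depends on the data type and problem size.

```python
import math


def max_abs_diff(X, Y):
    """Largest elementwise absolute difference between two equally shaped matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def matrices_match(X, Y, rel_tol=1e-6, abs_tol=1e-9):
    """True when the shapes agree and every element pair is within tolerance."""
    if len(X) != len(Y) or any(len(rx) != len(ry) for rx, ry in zip(X, Y)):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for rx, ry in zip(X, Y)
        for x, y in zip(rx, ry)
    )


# Placeholder matrices standing in for the two outputs; these are not measured data.
reference = [[1.0, 2.0], [3.0, 4.0]]
candidate = [[1.0, 2.0 + 1e-12], [3.0, 4.0]]   # differs only by rounding noise
broken = [[1.0, 2.5], [3.0, 4.0]]              # differs by a real error
```

Here `matrices_match(reference, candidate)` passes and `matrices_match(reference, broken)` fails. Reporting `max_abs_diff` next to the pass/fail verdict shows how close the two paths actually are.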

Figures

Figures reproduced from arXiv: 2604.04599 by César Guedes Carneiro, Guido Araujo, Lucas Alvarenga, Sandro Rigo.

Figure 1: Comparison of OpenBLAS and LP-GEMM kernel

Figure 2: Visual representation of GotoBLAS approach [10]. (a) shows how matrices are tiled for the architecture’s memory

Figure 3: Overview of sequential GEMM operations using differ…

Figure 4: Micro-kernel Layout. Page text captured with the caption: "… the nc×mc dimension, which is the same dimension that will be used to calculate data in the case of a subsequent consumer GEMM operator. As data is not stored in its original layout, it is required that the following are true: (1) If C already contained data before the first GEMM call, and β is not zero; C must be packed into the layout of equation 3. (2) Calls to the µkernel need to t…"

Figure 5: Speedup of a single GEMM extracted from gemm

Figure 6: Speedup of Attention layer from LLaMA 3.2 model, with embedded dimension of 2048, and MLP weights with…

Figure 7: Usage comparison of LP-GEMM with FlashGEMM
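Condition (1) in the text captured with Figure 4 can be illustrated with a small sketch. It assumes a simplified packed layout (row panels of height `MR`, column-ordered inside each panel) in place of the paper's equation 3, which is not reproduced on this page, and the function names are hypothetical. The sketch shows that the β·C term is only correct when C is packed the same way as the product it is added to.

```python
MR = 2  # illustrative panel height; an assumption, not a value from the paper


def pack_rows(M, mr=MR):
    """Pack a row-major matrix into column-ordered row panels of height mr."""
    if len(M) % mr != 0:
        raise ValueError("this sketch skips edge-panel padding")
    return [M[i][j]
            for i0 in range(0, len(M), mr)
            for j in range(len(M[0]))
            for i in range(i0, i0 + mr)]


def scale_accumulate(alpha, ab_buffer, beta, c_buffer):
    """Elementwise alpha*AB + beta*C; only meaningful when both buffers share one layout."""
    return [alpha * p + beta * c for p, c in zip(ab_buffer, c_buffer)]


AB = [[1.0, 2.0], [3.0, 4.0]]       # product A*B, already computed for brevity
C = [[10.0, 20.0], [30.0, 40.0]]    # contents of C before the call
alpha, beta = 2.0, 0.5

# Canonical result alpha*AB + beta*C, then packed, as the ground truth.
expected = pack_rows([[alpha * p + beta * c for p, c in zip(rp, rc)]
                      for rp, rc in zip(AB, C)])

# Correct: C is packed into the same layout before accumulation.
with_packed_c = scale_accumulate(alpha, pack_rows(AB), beta, pack_rows(C))

# Incorrect: C is left in row-major order while AB is packed.
canonical_c = [x for row in C for x in row]
with_canonical_c = scale_accumulate(alpha, pack_rows(AB), beta, canonical_c)
```

Only `with_packed_c` equals `expected`. The mismatched variant pairs each product element with the wrong element of C, which is why a nonzero β forces C to be packed first.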
Original abstract

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces LP-GEMM, a decomposition of the GEMM kernel that propagates packing layouts across sequential dependent GEMM operations common in ML and scientific computing workloads. This eliminates redundant packing/unpacking steps required by the standard BLAS API while claiming to preserve full semantic correctness (including alpha/beta scaling, transposes, and strides) at operation boundaries. Evaluations on x86 (AVX-512) and RISC-V (RVV) report average speedups of 2.25x over OpenBLAS for sequential GEMMs with competitive results versus MKL; a standalone C++ Llama-3.2 inference path using only BLAS-level GEMM calls is provided as a practical demonstration.
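For reference, the semantic contract mentioned in the summary is the standard BLAS GEMM update, written here together with a two-step dependent chain. The chain notation (B_1, B_2, C_1, C_2) is introduced for illustration and is not the paper's.

```latex
% Standard BLAS GEMM update
C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta\,C,
\qquad \mathrm{op}(X) \in \{X,\; X^{T},\; X^{H}\}

% A dependent chain: the output of the first call is an input of the second
C_1 \leftarrow \alpha_1\,\mathrm{op}(A)\,\mathrm{op}(B_1) + \beta_1\,C_1,
\qquad
C_2 \leftarrow \alpha_2\,C_1\,\mathrm{op}(B_2) + \beta_2\,C_2
```

Layout propagation concerns how C_1 is stored between the two calls. The values of C_2 delivered to the caller must not depend on that choice.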

Significance. If the internal decomposition maintains bit-identical or numerically equivalent results to standard BLAS calls and the reported speedups are reproducible, the technique could meaningfully reduce memory-bandwidth overhead in chained linear-algebra kernels that dominate modern ML inference and training. The explicit Llama-3.2 implementation is a positive step toward reproducibility and shows the approach is not limited to micro-benchmarks.

major comments (3)
  1. [§3 and §4] §3 (LP-GEMM Decomposition) and §4 (Micro-kernel): the central claim that layout propagation preserves full BLAS semantic correctness requires an explicit argument or verification that the packing/unpacking steps are exact inverses and that the fused multiply-add sequence executed by the micro-kernel is identical regardless of the propagated memory layout. The abstract asserts this equivalence, but without a concrete walk-through of stride handling, alpha/beta application, or transpose flags inside the propagated path, the correctness contract remains unsubstantiated.
  2. [§5] §5 (Evaluation): reported speedups (2.25x over OpenBLAS, competitive with MKL) are given without workload matrix dimensions, number of sequential GEMM calls, number of repetitions, error bars, or any numerical verification (e.g., maximum absolute difference versus reference BLAS results). These omissions make it impossible to determine whether observed gains arise from layout propagation or from other implementation differences, directly undermining the performance claims.
  3. [§6] §6 (Llama-3.2 Implementation): while the standalone C++ inference path is a useful end-to-end demonstration, the manuscript must include a side-by-side numerical check (e.g., output logits or loss values) against an otherwise identical implementation that uses only standard BLAS GEMM calls with explicit repacking. Without this, the claim that “full BLAS semantic correctness at the boundaries” is maintained in a realistic workload is not demonstrated.
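The kind of evidence requested in the first major comment can be sketched as a property check. The code below assumes a simplified panel layout and hypothetical `pack` and `unpack` helpers. It folds the transpose flag and the leading dimension into the packing step, then tests that unpacking recovers every logical element. It illustrates the shape of the argument, not the paper's implementation.

```python
def pack(buf, rows, cols, ld, trans=False, mr=4):
    """Pack op(X) into row panels of height mr.

    X is row-major in buf with leading dimension ld. With trans=True,
    op(X) = X^T and (rows, cols) describe op(X), not X.
    """
    def at(i, j):
        return buf[j * ld + i] if trans else buf[i * ld + j]

    out = []
    for i0 in range(0, rows, mr):
        for j in range(cols):
            for i in range(i0, i0 + mr):
                out.append(at(i, j) if i < rows else 0.0)  # zero-pad edge panel
    return out


def unpack(packed, rows, cols, ld, mr=4):
    """Write a packed rows x cols matrix into a row-major buffer with leading dimension ld."""
    buf = [0.0] * (rows * ld)
    for idx, value in enumerate(packed):
        panel, rest = divmod(idx, cols * mr)
        j, r = divmod(rest, mr)
        i = panel * mr + r
        if i < rows:
            buf[i * ld + j] = value
    return buf


def logical(buf, rows, cols, ld):
    """View the first cols entries of each row, ignoring stride padding."""
    return [buf[i * ld: i * ld + cols] for i in range(rows)]


def round_trip_ok(rows, cols, pad=3, mr=4):
    """Check unpack(pack(X)) == X on the logical elements of a strided matrix."""
    ld = cols + pad
    buf = [float(i * ld + j) for i in range(rows) for j in range(ld)]
    back = unpack(pack(buf, rows, cols, ld, mr=mr), rows, cols, ld, mr=mr)
    return logical(back, rows, cols, ld) == logical(buf, rows, cols, ld)


shapes = [(1, 1), (3, 5), (4, 4), (7, 2), (9, 3)]
all_round_trips = all(round_trip_ok(r, c) for r, c in shapes)

# Transposed read: X is 2x3, op(X) = X^T is 3x2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xt = unpack(pack(x, 3, 2, 3, trans=True), 3, 2, 2)
```

A check like this covers the data-movement half of the correctness contract. The arithmetic half, that the micro-kernel performs the same multiply-add sequence in both paths, needs a separate argument.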
minor comments (3)
  1. [Abstract and §5] Abstract and §5: specify the exact matrix sizes, batch dimensions, and sequence lengths used for the MLP-like and Attention-like workloads so that the experiments can be reproduced.
  2. [§2 and §3] Notation: the manuscript should clarify whether “packing-layout propagation” modifies the internal representation only or also changes the user-visible output layout at the end of a sequence; a small diagram would help.
  3. [§1] References: add citations to prior work on fused or layout-aware GEMM kernels (e.g., papers on BLIS, libxsmm, or ML-specific GEMM fusions) to better situate the novelty of the propagation technique.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below. We agree that the manuscript requires additional substantiation for the correctness claims, more complete experimental details, and explicit numerical verification in the end-to-end demonstration. We will incorporate revisions to address these points.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (LP-GEMM Decomposition) and §4 (Micro-kernel): the central claim that layout propagation preserves full BLAS semantic correctness requires an explicit argument or verification that the packing/unpacking steps are exact inverses and that the fused multiply-add sequence executed by the micro-kernel is identical regardless of the propagated memory layout. The abstract asserts this equivalence, but without a concrete walk-through of stride handling, alpha/beta application, or transpose flags inside the propagated path, the correctness contract remains unsubstantiated.

    Authors: We acknowledge that the current manuscript does not provide a sufficiently explicit argument for semantic equivalence. In the revised version we will add a new subsection to §3 that walks through the packing and unpacking operations as exact inverses, including explicit handling of strides, alpha/beta scaling factors, and transpose flags. We will also include pseudocode in §4 comparing the fused multiply-add sequence in the standard BLAS path versus the layout-propagated path to demonstrate that the arithmetic operations remain identical. revision: yes

  2. Referee: [§5] §5 (Evaluation): reported speedups (2.25x over OpenBLAS, competitive with MKL) are given without workload matrix dimensions, number of sequential GEMM calls, number of repetitions, error bars, or any numerical verification (e.g., maximum absolute difference versus reference BLAS results). These omissions make it impossible to determine whether observed gains arise from layout propagation or from other implementation differences, directly undermining the performance claims.

    Authors: We agree that the evaluation section is missing critical details required for reproducibility and to isolate the source of the speedups. In the revision we will expand §5 with tables listing the exact matrix dimensions for each workload, the number of sequential GEMM calls per workload, the number of repetitions, and error bars computed across runs. We will also add a column or subsection reporting the maximum absolute difference between LP-GEMM outputs and reference BLAS results for each configuration. revision: yes

  3. Referee: [§6] §6 (Llama-3.2 Implementation): while the standalone C++ inference path is a useful end-to-end demonstration, the manuscript must include a side-by-side numerical check (e.g., output logits or loss values) against an otherwise identical implementation that uses only standard BLAS GEMM calls with explicit repacking. Without this, the claim that “full BLAS semantic correctness at the boundaries” is maintained in a realistic workload is not demonstrated.

    Authors: We recognize that the current Llama-3.2 demonstration lacks a direct numerical comparison. In the revised manuscript we will add to §6 a side-by-side comparison of output logits (or token probabilities) produced by the LP-GEMM-based inference path versus an otherwise identical implementation that uses standard BLAS GEMM calls with explicit repacking at each boundary. We will report the maximum absolute difference observed across the inference run to substantiate the claim of preserved semantic correctness. revision: yes
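The repetition counts and error bars promised in the second response could be summarized along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, the timings are paired per run, and the sample numbers are made-up placeholders rather than measurements from the paper.

```python
import statistics


def speedup_summary(baseline_times, candidate_times):
    """Return (mean, sample standard deviation) of per-run speedups for paired timings."""
    if len(baseline_times) != len(candidate_times) or len(baseline_times) < 2:
        raise ValueError("need at least two paired measurements")
    ratios = [b / c for b, c in zip(baseline_times, candidate_times)]
    return statistics.mean(ratios), statistics.stdev(ratios)


# Placeholder timings in seconds; NOT data from the paper.
baseline = [3.0, 4.0, 5.0]
candidate = [2.0, 2.0, 2.0]
mean_speedup, spread = speedup_summary(baseline, candidate)
```

Reporting the spread alongside the mean lets a reader judge whether a gap between two libraries exceeds run-to-run noise. Pairing each baseline run with a candidate run taken under the same conditions reduces the effect of machine drift.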

Circularity Check

0 steps flagged

No circularity: implementation technique validated by external benchmarks


The paper introduces LP-GEMM as a practical decomposition of the GEMM kernel to propagate packing layouts across sequential calls while claiming to preserve BLAS semantics at boundaries. No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the abstract or description. The central claim is an engineering assertion about exact inversion of packing/unpacking steps and micro-kernel equivalence, supported by performance measurements on x86 and RISC-V rather than any self-referential derivation or self-citation chain. The work is self-contained against external BLAS libraries and workloads; no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard BLAS interface semantics and architecture-specific vector instructions; no new free parameters, axioms, or invented entities are introduced in the abstract.



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    Physics-informed machine learning,

    G. Karniadakis, Y. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, “Physics-informed machine learning,” Nature Reviews Physics, pp. 1–19, 2021

  2. [2]

    Machine learning for chemistry: Basics and applications,

    Y.-F. Shi, Z.-X. Yang, S. Ma, P.-L. Kang, C. Shang, P. Hu, and Z.-P. Liu, “Machine learning for chemistry: Basics and applications,” Engineering, vol. 27, pp. 70–83, 2023

  3. [3]

    Accurate structure prediction of biomolecular interactions with alphafold 3,

    J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, S. Bodenstein, D. A. Evans, C.-C. Hung, M. O’Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, A. Žemgulytė, E. Arvaniti, C. Beattie, O. Bertolli, A. Bridgland, A. Cherepanov, M. Congreve, A. I. Cowen-Rivers, A. Cowie, M. Figurnov, F. ...

  4. [4]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS’17, 2017

  5. [5]

    The llama 3 herd of models,

    Meta’s Llama 3 team, “The llama 3 herd of models,” arXiv e-prints, 2024

  6. [6]

    BERT: pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT’2019, 2019. [Online]. Available: https://doi.org/10.18653/v1/n19-1423

  7. [7]

    Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration,

    H. Genc, S. Kim, A. Amid, A. Haj-Ali, V. Iyer, P. Prakash, J. Zhao, D. Grubb, H. Liew, H. Mao, A. J. Ou, C. Schmidt, S. Steffl, J. C. Wright, I. Stoica, J. Ragan-Kelley, K. Asanovic, B. Nikolic, and Y. S. Shao, “Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration,” in DAC’2021, 2021

  8. [8]

    Towards optimized tensor code generation for deep learning on sunway many-core processor,

    M. Li, C. Liu, J. Liao, X. Zheng, H. Yang, R. Sun, J. Xu, L. Gan, G. Yang, Z. Luan, and D. Qian, “Towards optimized tensor code generation for deep learning on sunway many-core processor,” Frontiers Comput. Sci., vol. 18, p. 182101, 2024

  9. [9]

    Exploiting intel advanced matrix extensions (AMX) for large language model inference,

    H. Kim, G. Ye, N. Wang, A. Yazdanbakhsh, and N. S. Kim, “Exploiting intel advanced matrix extensions (AMX) for large language model inference,” IEEE Comput. Archit. Lett., vol. 23, no. 1, pp. 117–120, 2024

  10. [10]

    Anatomy of high-performance matrix multiplication,

    K. Goto and R. A. van de Geijn, “Anatomy of high-performance matrix multiplication,” ACM Transactions on Mathematical Software (TOMS), vol. 34, pp. 1–25, 2008

  11. [11]

    Flashgemm: Optimizing sequences of matrix multiplication by exploiting data reuse on cpus,

    J. Zhang, W. Yang, J. Fang, D. Dong, and X. Chen, “Flashgemm: Optimizing sequences of matrix multiplication by exploiting data reuse on cpus,” ACM Trans. Archit. Code Optim., vol. 22, no. 4, Dec. 2025. [Online]. Available: https://doi.org/10.1145/3760784

  12. [12]

    More asymmetry yields faster matrix multiplication,

    J. Alman, R. Duan, V. Vassilevska Williams, Y. Xu, Z. Xu, and R. Zhou, “More asymmetry yields faster matrix multiplication,” in SODA’2025, Y. Azar and D. Panigrahi, Eds., 2025, pp. 2005–2039

  13. [13]

    Intel math kernel library (intel mkl),

    Intel Corporation, “Intel math kernel library (intel mkl),” https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html, 2024, accessed: 2025-12-10

  14. [14]

    cublas: Cuda basic linear algebra subroutines,

    NVIDIA Corporation, “cublas: Cuda basic linear algebra subroutines,” https://developer.nvidia.com/cublas, 2024, accessed: 2025-12-10

  15. [15]

    Hello sme! generating fast matrix multiplication kernels using the scalable matrix extension,

    S. Remke and A. Breuer, “Hello sme! generating fast matrix multiplication kernels using the scalable matrix extension,” in SC24-W:, 2024, pp. 1443–1454

  16. [16]

    The cache performance and optimizations of blocked algorithms,

    M. D. Lam, E. E. Rothberg, and M. E. Wolf, “The cache performance and optimizations of blocked algorithms,” ACM SIGOPS Operating Systems Review, vol. 25, no. Special Issue, pp. 63–74, 1991

  17. [17]

    M. J. Wolfe, C. Shanklin, and L. Ortega, High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., 1995

  18. [18]

    To pack or not to pack: A generalized packing analysis and transformation,

    C. S. Rohwedder, N. Henderson, J. P. L. De Carvalho, Y. Chen, and J. N. Amaral, “To pack or not to pack: A generalized packing analysis and transformation,” in CGO’23, 2023, pp. 14–27

  19. [19]

    J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. Morgan Kaufmann Publishers Inc., 2011

  20. [20]

    BLAS (Basic Linear Algebra Subprograms),

    The Netlib, “BLAS (Basic Linear Algebra Subprograms),” https://www.netlib.org/blas/, 2025, accessed: 2025-11-25

  21. [21]

    onednn graph compiler: A hybrid approach for high-performance deep learning compilation,

    J. Li, Z. Qin, Y. Mei, J. Cui, Y. Song, C. Chen, Y. Zhang, L. Du, X. Cheng, B. Jin, Y. Zhang, J. Ye, E. Lin, and D. Lavery, “onednn graph compiler: A hybrid approach for high-performance deep learning compilation,” in IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2024, Edinburgh, United Kingdom, March 2-6, 2024, T. Grosser, ...

  22. [22]

    Library liberation: Competitive performance matmul through compiler-composed nanokernels,

    A. Thangamani, M. A. A. Shahid, A. Siemieniuk, R. Morel, R. Golin, and A. Heinecke, “Library liberation: Competitive performance matmul through compiler-composed nanokernels,” arXiv e-prints, 2025

  23. [23]

    Optimizing CNN model inference on CPUs,

    Y. Liu, Y. Wang, R. Yu, M. Li, V. Sharma, and Y. Wang, “Optimizing CNN model inference on CPUs,” in USENIX ATC’19, 2019, pp. 1025–1040

  24. [24]

    BLIS: A framework for rapidly instantiating BLAS functionality,

    F. G. Van Zee and R. A. van de Geijn, “BLIS: A framework for rapidly instantiating BLAS functionality,” ACM Transactions on Mathematical Software, vol. 41, no. 3, pp. 14:1–14:33, June 2015. [Online]. Available: https://doi.acm.org/10.1145/2764454

  25. [25]

    GEMMBench: A lightweight benchmarking framework for evaluating custom GEMM microkernels

    Lucas Alvarenga, “GEMMBench: A lightweight benchmarking framework for evaluating custom GEMM microkernels.” https://github.com/LucasFernando-aes/gemmbench, 2025, accessed: 2025-11-25

  26. [26]

    Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

    M. Levy, A. Jacoby, and Y. Goldberg, “Same task, more tokens: the impact of input length on the reasoning performance of large language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar, Eds. Asso...

  27. [27]

    Context length alone hurts LLM performance despite perfect retrieval,

    Y. Du, M. Tian, S. Ronanki, S. Rongali, S. B. Bodapati, A. Galstyan, A. Wells, R. Schwartz, E. A. Huerta, and H. Peng, “Context length alone hurts LLM performance despite perfect retrieval,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Associatio...