LP-GEMM: Integrating Layout Propagation into GEMM Operations
Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3
The pith
LP-GEMM propagates internal packing layouts across chains of GEMM calls to skip repeated data repacking while preserving exact BLAS behavior at sequence boundaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LP-GEMM is a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries.
What carries the argument
A decomposed GEMM kernel that forwards the internal packed data layout from one operation directly into the next dependent operation.
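The forwarding idea can be sketched in a few lines. This is a simplified illustration, not the paper's kernel: the panel height `MR`, the helper names (`pack_rows`, `unpack_rows`, `gemm_packed`, `chain_propagated`), and the choice to pack only the left operand are assumptions made for this example. It computes (A·B)·C, packing A once on entry, keeping the intermediate product in the packed layout, and unpacking only the final result.

```python
MR = 4  # hypothetical micro-panel height; real kernels tune this per ISA


def pack_rows(M, mr=MR):
    """Pack a row-major m x k matrix into zero-padded row panels of height mr.

    Inside a panel, entry (r, l) lives at index l * mr + r.
    """
    m, k = len(M), len(M[0])
    panels = []
    for i0 in range(0, m, mr):
        panel = [0.0] * (mr * k)
        for l in range(k):
            for r in range(min(mr, m - i0)):
                panel[l * mr + r] = M[i0 + r][l]
        panels.append(panel)
    return panels


def unpack_rows(panels, m, k, mr=MR):
    """Inverse of pack_rows: restore the canonical row-major matrix."""
    return [[panels[i // mr][l * mr + i % mr] for l in range(k)]
            for i in range(m)]


def gemm_packed(Ap, B, mr=MR):
    """Multiply a packed left operand by row-major B; emit the result packed."""
    k, n = len(B), len(B[0])
    out = []
    for panel in Ap:
        c = [0.0] * (mr * n)
        for j in range(n):
            for r in range(mr):
                acc = 0.0
                for l in range(k):
                    acc += panel[l * mr + r] * B[l][j]
                c[j * mr + r] = acc
        out.append(c)
    return out


def gemm_ref(A, B):
    """Plain row-major GEMM with the same accumulation order."""
    k, n = len(B), len(B[0])
    C = []
    for row in A:
        out_row = []
        for j in range(n):
            acc = 0.0
            for l in range(k):
                acc += row[l] * B[l][j]
            out_row.append(acc)
        C.append(out_row)
    return C


def chain_propagated(A, B, C):
    """Compute (A @ B) @ C, packing once on entry and unpacking once on exit."""
    Ap = pack_rows(A)
    Tp = gemm_packed(Ap, B)   # intermediate never returns to row-major
    Dp = gemm_packed(Tp, C)   # consumes the packed intermediate directly
    return unpack_rows(Dp, len(A), len(C[0]))
```

Because `gemm_packed` writes its output in the same panel layout that it reads, the second call needs no repacking step. In this toy version the accumulation order matches `gemm_ref`, so the two paths agree exactly; whether that holds for the actual LP-GEMM kernels is what the paper has to establish.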
If this is right
- Redundant packing and unpacking disappear inside any chain of dependent GEMM operations.
- BLAS semantic correctness remains unchanged at the start and end of each sequence.
- Sequential GEMMs in MLP-like and attention-like workloads run on average 2.25x faster than with OpenBLAS on Intel x86.
- Performance stays competitive with vendor libraries such as Intel MKL.
Where Pith is reading between the lines
- The propagation technique could be applied to other linear-algebra kernels that also incur repeated packing costs.
- Code generators that emit long GEMM sequences could insert layout tracking automatically to capture the same savings.
- Adapting the same decomposition to GPU or accelerator memory systems would require matching the packing strategy to each device's cache and vector layout rules.
Load-bearing premise
The internal layout changes produce exactly the same numerical results and output layout as a sequence of independent, fully packed GEMM calls.
What would settle it
Execute the same sequence of matrix multiplications once with ordinary BLAS GEMM calls and once with LP-GEMM calls on identical inputs, then verify that every element of the final output matrices matches within floating-point tolerance.
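A minimal sketch of such a check follows, in pure Python. It has no access to LP-GEMM, so a GEMM with a blocked reduction (`k_block`) stands in for "an implementation whose internal ordering differs from the reference"; that stand-in, the helper names, and the tolerance values are all assumptions for illustration. The point is the shape of the comparison: run the same chain twice and compare every output element.

```python
import math


def matmul(A, B, k_block=None):
    """Row-major GEMM; k_block splits the reduction into partial sums,
    which changes the rounding order the way a different blocking might."""
    k, n = len(B), len(B[0])
    kb = k_block or k
    C = []
    for row in A:
        out_row = []
        for j in range(n):
            total = 0.0
            for l0 in range(0, k, kb):
                part = 0.0
                for l in range(l0, min(l0 + kb, k)):
                    part += row[l] * B[l][j]
                total += part
            out_row.append(total)
        C.append(out_row)
    return C


def max_abs_diff(X, Y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def matches(X, Y, rel_tol=1e-9, abs_tol=1e-9):
    """True if every element pair agrees within the given tolerances."""
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for rx, ry in zip(X, Y) for x, y in zip(rx, ry))
```

A typical use is `matches(matmul(matmul(A, B), C), matmul(matmul(A, B, 4), C, 4))`, with `max_abs_diff` reported alongside the pass/fail result. A tolerance is needed because floating-point addition is not associative: a reordered reduction can produce slightly different values that are still correct.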
Original abstract
In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LP-GEMM, a decomposition of the GEMM kernel that propagates packing layouts across sequential dependent GEMM operations common in ML and scientific computing workloads. This eliminates redundant packing/unpacking steps required by the standard BLAS API while claiming to preserve full semantic correctness (including alpha/beta scaling, transposes, and strides) at operation boundaries. Evaluations on x86 (AVX-512) and RISC-V (RVV) report average speedups of 2.25x over OpenBLAS for sequential GEMMs with competitive results versus MKL; a standalone C++ Llama-3.2 inference path using only BLAS-level GEMM calls is provided as a practical demonstration.
Significance. If the internal decomposition maintains bit-identical or numerically equivalent results to standard BLAS calls and the reported speedups are reproducible, the technique could meaningfully reduce memory-bandwidth overhead in chained linear-algebra kernels that dominate modern ML inference and training. The explicit Llama-3.2 implementation is a positive step toward reproducibility and shows the approach is not limited to micro-benchmarks.
major comments (3)
- [§3 and §4] §3 (LP-GEMM Decomposition) and §4 (Micro-kernel): the central claim that layout propagation preserves full BLAS semantic correctness requires an explicit argument or verification that the packing/unpacking steps are exact inverses and that the fused multiply-add sequence executed by the micro-kernel is identical regardless of the propagated memory layout. The abstract asserts this equivalence, but without a concrete walk-through of stride handling, alpha/beta application, or transpose flags inside the propagated path, the correctness contract remains unsubstantiated.
- [§5] §5 (Evaluation): reported speedups (2.25x over OpenBLAS, competitive with MKL) are given without workload matrix dimensions, number of sequential GEMM calls, number of repetitions, error bars, or any numerical verification (e.g., maximum absolute difference versus reference BLAS results). These omissions make it impossible to determine whether observed gains arise from layout propagation or from other implementation differences, directly undermining the performance claims.
- [§6] §6 (Llama-3.2 Implementation): while the standalone C++ inference path is a useful end-to-end demonstration, the manuscript must include a side-by-side numerical check (e.g., output logits or loss values) against an otherwise identical implementation that uses only standard BLAS GEMM calls with explicit repacking. Without this, the claim that “full BLAS semantic correctness at the boundaries” is maintained in a realistic workload is not demonstrated.
minor comments (3)
- [Abstract and §5] Abstract and §5: specify the exact matrix sizes, batch dimensions, and sequence lengths used for the MLP-like and Attention-like workloads so that the experiments can be reproduced.
- [§2 and §3] Notation: the manuscript should clarify whether “packing-layout propagation” modifies the internal representation only or also changes the user-visible output layout at the end of a sequence; a small diagram would help.
- [§1] References: add citations to prior work on fused or layout-aware GEMM kernels (e.g., papers on BLIS, libxsmm, or ML-specific GEMM fusions) to better situate the novelty of the propagation technique.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below. We agree that the manuscript requires additional substantiation for the correctness claims, more complete experimental details, and explicit numerical verification in the end-to-end demonstration. We will incorporate revisions to address these points.
Referee: [§3 and §4] §3 (LP-GEMM Decomposition) and §4 (Micro-kernel): the central claim that layout propagation preserves full BLAS semantic correctness requires an explicit argument or verification that the packing/unpacking steps are exact inverses and that the fused multiply-add sequence executed by the micro-kernel is identical regardless of the propagated memory layout. The abstract asserts this equivalence, but without a concrete walk-through of stride handling, alpha/beta application, or transpose flags inside the propagated path, the correctness contract remains unsubstantiated.
Authors: We acknowledge that the current manuscript does not provide a sufficiently explicit argument for semantic equivalence. In the revised version we will add a new subsection to §3 that walks through the packing and unpacking operations as exact inverses, including explicit handling of strides, alpha/beta scaling factors, and transpose flags. We will also include pseudocode in §4 comparing the fused multiply-add sequence in the standard BLAS path versus the layout-propagated path to demonstrate that the arithmetic operations remain identical. revision: yes
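For orientation, the boundary contract under discussion can be written out as a small reference model. The sketch below follows the standard GEMM definition, C = alpha * op(A) * op(B) + beta * C, on flat column-major buffers with leading dimensions. The function names `pack_op`, `unpack_scaled`, and `gemm` are hypothetical and do not come from the paper. The sketch shows one way transpose flags and strides can be absorbed entirely at pack time, and alpha/beta entirely at unpack time, so that the inner multiply-add loop never sees them.

```python
def pack_op(a, lda, rows, cols, trans):
    """Read op(A) (rows x cols) from a flat column-major buffer with leading
    dimension lda into a dense row-major list of lists. The transpose flag
    and the stride are consumed here, so later stages never see them."""
    if trans:
        # Stored matrix is cols x rows; op(A)[i][j] = A[j][i].
        return [[a[j + i * lda] for j in range(cols)] for i in range(rows)]
    return [[a[i + j * lda] for j in range(cols)] for i in range(rows)]


def unpack_scaled(P, c, ldc, alpha, beta):
    """Write alpha * P + beta * C into the strided column-major buffer c.
    When beta is zero, C is not read, matching the reference BLAS rule
    that C need not be set on input in that case."""
    for i, row in enumerate(P):
        for j, v in enumerate(row):
            idx = i + j * ldc
            c[idx] = alpha * v + (beta * c[idx] if beta != 0.0 else 0.0)


def gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference model: C = alpha * op(A) * op(B) + beta * C."""
    Ap = pack_op(a, lda, m, k, transa)
    Bp = pack_op(b, ldb, k, n, transb)
    P = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for l in range(k):
                acc += Ap[i][l] * Bp[l][j]  # identical for every flag combination
            P[i][j] = acc
    unpack_scaled(P, c, ldc, alpha, beta)
```

In a propagated chain, the open question raised by the referee is where the scaling and the unpack step move when an intermediate result is never written back in canonical layout; this model only fixes what the result must be at the sequence boundary.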
Referee: [§5] §5 (Evaluation): reported speedups (2.25x over OpenBLAS, competitive with MKL) are given without workload matrix dimensions, number of sequential GEMM calls, number of repetitions, error bars, or any numerical verification (e.g., maximum absolute difference versus reference BLAS results). These omissions make it impossible to determine whether observed gains arise from layout propagation or from other implementation differences, directly undermining the performance claims.
Authors: We agree that the evaluation section is missing critical details required for reproducibility and to isolate the source of the speedups. In the revision we will expand §5 with tables listing the exact matrix dimensions for each workload, the number of sequential GEMM calls per workload, the number of repetitions, and error bars computed across runs. We will also add a column or subsection reporting the maximum absolute difference between LP-GEMM outputs and reference BLAS results for each configuration. revision: yes
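The reporting promised here is simple to produce. The following sketch, with hypothetical helper names and using only the Python standard library, shows one way to collect repeated timings and turn them into a mean, an error bar, and a speedup ratio. It is an illustration of the bookkeeping, not the authors' benchmarking harness.

```python
import statistics
import time


def time_repeated(fn, reps=10):
    """Return wall-clock durations in seconds for reps calls to fn."""
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Mean and sample standard deviation (the error bar) of a timing series.
    Needs at least two samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def speedup(baseline, candidate):
    """Ratio of mean baseline time to mean candidate time."""
    return statistics.mean(baseline) / statistics.mean(candidate)
```

Reporting `summarize` output per matrix shape and chain length, next to the maximum absolute difference against the reference result, would cover the items the referee lists.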
Referee: [§6] §6 (Llama-3.2 Implementation): while the standalone C++ inference path is a useful end-to-end demonstration, the manuscript must include a side-by-side numerical check (e.g., output logits or loss values) against an otherwise identical implementation that uses only standard BLAS GEMM calls with explicit repacking. Without this, the claim that “full BLAS semantic correctness at the boundaries” is maintained in a realistic workload is not demonstrated.
Authors: We recognize that the current Llama-3.2 demonstration lacks a direct numerical comparison. In the revised manuscript we will add to §6 a side-by-side comparison of output logits (or token probabilities) produced by the LP-GEMM-based inference path versus an otherwise identical implementation that uses standard BLAS GEMM calls with explicit repacking at each boundary. We will report the maximum absolute difference observed across the inference run to substantiate the claim of preserved semantic correctness. revision: yes
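A comparison of this kind might look like the sketch below. The function names and the choice of metrics are assumptions for illustration; the logits would come from the two inference paths, which are not reproduced here. It reports the largest drift in raw logits, the largest drift in token probabilities after softmax, and whether the most likely token is the same in both runs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compare_logits(ref, cand):
    """Report elementwise drift and whether the top token is unchanged."""
    p_ref, p_cand = softmax(ref), softmax(cand)
    return {
        "max_abs_logit_diff": max(abs(a - b) for a, b in zip(ref, cand)),
        "max_abs_prob_diff": max(abs(a - b) for a, b in zip(p_ref, p_cand)),
        "same_argmax": ref.index(max(ref)) == cand.index(max(cand)),
    }
```

Checking the argmax as well as the numeric drift matters for greedy decoding, where a small logit difference between two near-tied tokens can change every token generated afterwards.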
Circularity Check
No circularity: implementation technique validated by external benchmarks
The paper introduces LP-GEMM as a practical decomposition of the GEMM kernel to propagate packing layouts across sequential calls while claiming to preserve BLAS semantics at boundaries. No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the abstract or description. The central claim is an engineering assertion about exact inversion of packing/unpacking steps and micro-kernel equivalence, supported by performance measurements on x86 and RISC-V rather than any self-referential derivation or self-citation chain. The work is self-contained against external BLAS libraries and workloads; no load-bearing step reduces to its own inputs by construction.