Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

Gaetan Hains; Lenore Mullin

arxiv: 2606.07713 · v1 · pith:YCQVTFUFnew · submitted 2026-06-05 · 💻 cs.LG · cs.AI· cs.PF

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

Lenore Mullin , Gaetan Hains This is my paper

Pith reviewed 2026-06-27 22:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.PF

keywords attention mechanismtransformermemory optimizationdata movementarray algebrasoftmaxkernel designperformance model

0 comments

The pith

A Mathematics of Arrays rewrite derives a normal form for attention that removes every intermediate array by algebraic construction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reformulates scaled dot-product attention and its stable softmax inside an array algebra so that all temporary storage disappears before any code is generated. This changes data movement from quadratic in sequence length to linear in the key and value dimensions. A reader cares because each DRAM access costs far more energy than an arithmetic operation, so the reduction directly improves speed and power on actual hardware. The memory bound is obtained as a theorem from the algebra rather than from measurement or hardware-specific tuning. Numerical checks against a standard library confirm the outputs match on concrete double-precision inputs.

Core claim

The Denotational Normal Form obtained from the Mathematics of Arrays treatment of attention and softmax eliminates the implicit transposed-key buffer, every softmax temporary, and all other intermediate arrays, yielding O(n_dk + n_dv) data movement where the standard form requires O(n^2 + n_dk + n_dv); memory minimality is therefore a theorem established prior to implementation, and the result is verified by exact numerical agreement with PyTorch at full double precision.

What carries the argument

The Denotational Normal Form produced by algebraic transformation of the attention expression inside the Mathematics of Arrays framework, which removes every intermediate array by construction.

If this is right

Predictive models derived from the algebra project 2-100x speedup and 2-50x energy reduction, with the gap increasing at larger scale.
The same algebraic pipeline supplies array fusion, shape-transformation correctness, and cost models from a single source rather than separate empirical steps.
A formally verified path exists from the Python-level specification through operational normal form to dimension-lifted hardware mapping.
The kernels are performance-portable across architectures because the optimality proof does not rely on hardware-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same algebraic elimination of temporaries could be applied to the feed-forward and layer-norm stages inside a full transformer block.
If the normal-form construction extends to other attention variants, low-memory implementations could become the default rather than special cases.
The predictive cost model supplies a way to compare attention formulations on new hardware before writing any kernel code.

Load-bearing premise

The algebraic rewrite produces exactly the same numerical results as the original attention computation for every input without depending on particular array shapes or floating-point rounding rules.

What would settle it

Execution on any concrete input sequence where the output tensors differ from the reference PyTorch implementation by more than machine epsilon at double precision.

Figures

Figures reproduced from arXiv: 2606.07713 by Gaetan Hains, Lenore Mullin.

**Figure 1.** Figure 1: The MoA psi-reduction pipeline for attention. Each Omega step eliminates one category of intermediate array (red, dashed). The final DNF and output access only original Q, K, V elements by multi-index, with no scratch memory. The key structural consequence is that the leading dimensions (B, h) are consumed by this outer Omega, partitioning the computation into independent 2-D sub-problems—one per batch ele… view at source ↗

**Figure 2.** Figure 2: Memory traffic ratio Mclassical/MMoA as a function of sequence length n for three head dimensions dk. The advantage grows linearly with n/dk, reaching 512× at n = 32,768, dk = 64. 3. Streaming access patterns. The DNF produces affine, stride-predictable index arithmetic (3), enabling hardware prefetching and eliminating cache thrashing from irregular accesses. 4. Fusion by construction. The entire pipelin… view at source ↗

read the original abstract

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadratic memory traffic in the sequence length~$n$, and DRAM accesses cost 100--1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis focused solely on FLOP counts fundamentally mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays -- including the implicit transposed-key buffer and every softmax temporary -- by algebraic construction rather than empirical tuning. The DNF achieves $O(n_{dk} + n{_{dv}})$ data movement versus $O(n^2 + n_{dk} + n_{dv})$ for the standard implementation, where $n$ is the sequence length, $dk$ is the key dimensionality and $dv$ the value dimensionality, and is verified numerically against PyTorch at full double-precision floating-point on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA simultaneously provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$--$100\times$ speedup and $2$--$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from Python specification through (ONF) Operational Normal Form, and dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MoA algebra yields a claimed memory-minimal DNF for attention with O(n_dk + n_dv) traffic as a theorem, but equivalence to stable softmax rests only on numerical matches rather than a general derivation.

read the letter

The paper's core move is to take scaled dot-product attention plus its stable softmax and rewrite it inside the Mathematics of Arrays framework so that the n-by-n score matrix, the transposed key buffer, and every softmax temporary are eliminated by construction. The resulting DNF is said to move only O(n_dk + n_dv) data, and the authors treat memory minimality as a theorem proved before any kernel is written.

What stands out is the consistent algebraic route. They start from a Python-level specification, move through an operational normal form, and end with a dimension-lifted mapping that also supplies a predictive cost model. That single framework is meant to give both correctness and performance numbers without hardware-specific tiling. The numerical check against PyTorch at double precision on concrete inputs is at least a concrete data point, and the emphasis on DRAM traffic rather than FLOP counts matches where the real energy cost lies.

The soft spot is exactly the one the stress-test flags. Equivalence is asserted via numerical agreement on specific inputs; there is no machine-checked proof or even a step-by-step derivation shown that the array identities preserve semantics for arbitrary shapes, under IEEE associativity, or when the implicit transpose is never materialized. If those identities have hidden shape or rounding constraints, the O(n_dk + n_dv) bound does not automatically follow for production use. The abstract's claim of a "formally verified pipeline" is therefore stronger than the visible evidence.

This is for kernel writers and people who build performance-portable libraries for transformers. A reader who already knows MoA or who needs algebraic rather than empirical fusion could extract useful ideas. It is worth sending to a serious referee because the target problem is real and the algebraic framing is distinct from FlashAttention-style work, even though the current manuscript would need the full derivation and a stronger equivalence argument before acceptance.

Referee Report

2 major / 1 minor

Summary. The paper claims a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and numerically stable softmax that derives a Denotational Normal Form (DNF) eliminating all intermediate arrays (including the n² score matrix, transposed-key buffer, and softmax temporaries) by algebraic construction. This yields an O(n_dk + n_dv) data-movement bound versus O(n² + n_dk + n_dv) for the standard implementation, with memory minimality asserted as a theorem established prior to code; equivalence is verified by numerical agreement with PyTorch at double precision on concrete inputs, and a predictive model projects 2–100× speedup and 2–50× energy reduction at scale.

Significance. If the algebraic equivalence and DNF construction hold exactly, the work would supply a single formal framework delivering array fusion, shape-transformation correctness, and predictive cost models for memory-optimal attention kernels. This approach is distinguished from empirical tiling methods such as FlashAttention by its theorem-based minimality and performance-portable pipeline from Python specification through Operational Normal Form to hardware mapping, with direct relevance to exascale and edge-deployment priorities.

major comments (2)

[Abstract] Abstract: the claim that the MoA DNF preserves exact semantics of scaled dot-product attention plus numerically stable softmax (and thereby realizes the O(n_dk + n_dv) bound as a theorem) rests solely on numerical agreement at double precision on concrete inputs. No explicit derivation steps, error analysis, or machine-checked equivalence argument is provided to show that array fusion and elimination of the n² matrix and all temporaries introduce no hidden shape or IEEE-754 rounding constraints for arbitrary n, dk, dv.
[Abstract] Abstract: the statement that 'memory minimality is a theorem established before any code is written' and that the pipeline is 'formally verified' is not supported by any shown algebraic identities or proof obligations; the numerical verification alone does not discharge the semantic-equivalence obligation required for the central data-movement claim.

minor comments (1)

[Abstract] Abstract: the notation 'O(n_{dk} + n{_{dv}})' contains a typographic error in the subscript for n_dv.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the abstract's strong claims about algebraic derivation and formal verification would benefit from more explicit supporting material in the manuscript. We address each point below and will revise to improve clarity and completeness.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the MoA DNF preserves exact semantics of scaled dot-product attention plus numerically stable softmax (and thereby realizes the O(n_dk + n_dv) bound as a theorem) rests solely on numerical agreement at double precision on concrete inputs. No explicit derivation steps, error analysis, or machine-checked equivalence argument is provided to show that array fusion and elimination of the n² matrix and all temporaries introduce no hidden shape or IEEE-754 rounding constraints for arbitrary n, dk, dv.

Authors: The manuscript's central contribution is the algebraic construction of the DNF via successive MoA transformations that eliminate intermediates by definition of the normal form; this is presented through the sequence of shape and operation rewrites in the body. Numerical agreement at double precision on the tested inputs serves as confirmation rather than the sole basis. We agree that the presentation would be strengthened by an expanded outline of the key identities, a short discussion of potential IEEE-754 effects for arbitrary dimensions, and an explicit statement that full machine-checked equivalence is beyond the current scope. We will add these elements in revision. revision: yes
Referee: [Abstract] Abstract: the statement that 'memory minimality is a theorem established before any code is written' and that the pipeline is 'formally verified' is not supported by any shown algebraic identities or proof obligations; the numerical verification alone does not discharge the semantic-equivalence obligation required for the central data-movement claim.

Authors: Memory minimality follows as a direct consequence of the DNF having no intermediate arrays under the MoA axioms; the pipeline is 'formally verified' in the sense that each step applies the framework's denotational rules from specification onward. The abstract summarizes rather than displays the identities. We accept that the current text does not make the proof obligations sufficiently visible. In the revision we will include representative algebraic steps (either inline or as a new figure) and clarify the precise scope of the formal claims versus the numerical checks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is algebraic and self-contained

full rationale

The paper derives the DNF and its O(n_dk + n_dv) data-movement bound directly from the Mathematics of Arrays algebraic identities, stating that memory minimality is a theorem established before any code is written and that elimination of intermediates occurs by algebraic construction rather than empirical tuning. Numerical agreement with PyTorch is presented only as post-derivation verification on concrete inputs, not as the source of the bound or theorem. No equations or steps in the provided text reduce a claimed result to a fitted parameter, self-definition, or load-bearing self-citation; the performance projections are described as arising from the same algebraic framework. The derivation chain from specification through ONF is therefore independent of the target metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability and correctness of the Mathematics of Arrays framework to attention and softmax; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption The Mathematics of Arrays framework can be used to derive denotational and operational normal forms that preserve the semantics of array operations including softmax.
This is the foundational premise invoked to establish the DNF and the memory-minimality theorem.

pith-pipeline@v0.9.1-grok · 5848 in / 1300 out tokens · 33639 ms · 2026-06-27T22:30:33.866014+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 7 canonical work pages · 2 internal anchors

[1]

FATHOM: Fast attention through optimizing memory

Binder, E., Sudarsanam, A., Sunkavalli, R., Low, T.M., 2025. FATHOM: Fast attention through optimizing memory. In: Proceedings of IEEE IPDPS, pp. 1166--1178

2025
[2]

MOSA: Matrix optimized self-attention hardware accelerator for mobile device

Chang, Y.-R., Chen, H.-F., Shih, H.-Y., 2025. MOSA: Matrix optimized self-attention hardware accelerator for mobile device. In: Proceedings of ICEIC, pp. 1--4

2025
[3]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao, T., 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Dao, T., Fu, D.Y., Ermon, S., Rudra, A., R\' e , C., 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv:2205.14135

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Hardware considerations for tensor implementation and analysis using the field programmable gate array

Grout, I.A., Mullin, L., 2018. Hardware considerations for tensor implementation and analysis using the field programmable gate array. Electronics 7 (11), 320. doi:10.3390/electronics7110320

work page doi:10.3390/electronics7110320 2018
[6]

Realizing mathematics of arrays operations as custom architecture hardware-software co-design solutions

Grout, I.A., Mullin, L., 2022a. Realizing mathematics of arrays operations as custom architecture hardware-software co-design solutions. Information 13 (11), 528. doi:10.3390/info13110528

work page doi:10.3390/info13110528
[7]

Processing in memory for mathematics of arrays operations

Grout, I.A., Mullin, L., 2022b. Processing in memory for mathematics of arrays operations. In: Proceedings of SAI Computing Conference, 2022

2022
[8]

A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick

Gao, Y., Song, Z., Wang, W., Yin, J., 2023. A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick. arXiv:2309.07418

work page arXiv 2023
[9]

Hains, G., 2026. From algorithm to code to hardware to execution: Why isn't computing performance an experimental science? In: Proceedings of PDP 2026, Euromicro Conference on Parallel, Distributed, and Network-Based Processing. Cluj-Napoca, Romania

2026
[10]

A ^3 : Accelerating attention mechanisms in neural networks with approximation

Ham, T.J., Jung, S.J., Kim, S., Oh, Y.H., Park, Y., Song, Y., Park, J.-H., Lee, S., Park, K., Lee, J.W., Jeong, D.-K., 2020. A ^3 : Accelerating attention mechanisms in neural networks with approximation. In: Proceedings of IEEE HPCA, pp. 328--341

2020
[11]

Acceleration of fully connected layers on FPGA using the Strassen matrix multiplication

Le\' o n-Vega, L.G., Chac\' o n-Rodr\' i guez, A., Salazar-Villalobos, E., Castro-God\' i nez, J., 2023. Acceleration of fully connected layers on FPGA using the Strassen matrix multiplication. In: Proceedings of IEEE BIP, pp. 1--6

2023
[12]

Design and implementation of an FPGA-based hardware accelerator for transformer

Li, R., Chen, S., 2025. Design and implementation of an FPGA-based hardware accelerator for transformer. arXiv:2503.16731

work page arXiv 2025
[13]

Research on matrix multiplication optimization and deployment method for heterogeneous platforms

Liu, J., Wu, C., Li, J., Chen, X., Wang, S., 2024. Research on matrix multiplication optimization and deployment method for heterogeneous platforms. In: Proceedings of ICCBD+AI, pp. 266--270

2024
[14]

Array access and performance regarding numerical algorithms

Markus, A., Mullin, L., 2026. Array access and performance regarding numerical algorithms. Transactions on Engineering on Computing Sciences 14 (2), 1--14

2026
[15]

A Mathematics of Arrays

Mullin, L.M., 1988. A Mathematics of Arrays. Ph.D. dissertation. Syracuse University, Syracuse, NY

1988
[16]

From array algebra to energy efficiency on GPUs

Mullin, L.M.R., 2023. From array algebra to energy efficiency on GPUs. arXiv:2306.11148

work page arXiv 2023
[17]

Towards automatic, predictable and high-performance parallel code generation

Mullin, L., Hains, G., 2025. Towards automatic, predictable and high-performance parallel code generation. In: Proceedings of CSCE/PDPTA2024, Springer Nature Switzerland, pp. 109--119

2025
[18]

New mathematics for computer performance: array algebra and cost functions

Mullin, L., Hains, G., 2026. New mathematics for computer performance: array algebra and cost functions. Technical Report. https://hal.science/hal-05442652

2026
[19]

OPTIMUS: Optimized matrix multiplication structure for transformer neural network accelerator

Park, J., Yoon, H., Ahn, D., Choi, J., Kim, J.-J., 2020. OPTIMUS: Optimized matrix multiplication structure for transformer neural network accelerator. In: Proceedings of MLSys, vol. 2, pp. 363--378

2020
[20]

FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction

Qin, Y., Wang, Y., Deng, D., Zhao, Z., Yang, X., Liu, L., Wei, S., Hu, Y., Yin, S., 2023. FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction. In: Proceedings of ISCA, pp. 1--14

2023
[21]

High-performance Gemmini-based matrix multiplication accelerator for deep learning workloads

Sharma, A., Krishna, L.H., Srinivasu, B., 2025. High-performance Gemmini-based matrix multiplication accelerator for deep learning workloads. IEEE Transactions on VLSI Systems 33 (12), 3276--3289

2025
[22]

Design and implementation of a BRAM-banked double-buffered matrix multiplication accelerator for transformer models on edge FPGAs

Tan, X., Teo, T.H., 2025. Design and implementation of a BRAM-banked double-buffered matrix multiplication accelerator for transformer models on edge FPGAs. In: Proceedings of IEEE MCSoC, pp. 582--585

2025
[23]

Improving the performance of DGEMM with MoA and cache-blocking

Thomas, S., Mullin, L., Swirydowicz, K., 2021a. Improving the performance of DGEMM with MoA and cache-blocking. In: Proceedings of ACM ARRAY'21
[24]

Threaded multicore GEMM with MoA and cache-blocking

Thomas, S., Mullin, L., Swirydowicz, K., Khan, R., 2021b. Threaded multicore GEMM with MoA and cache-blocking. In: Proceedings of CSC'21, CSCE'21
[25]

Attention is all you need

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, ., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS)

2017
[26]

Hardware friendly transformer optimization with dynamic attention matrix fusion

Yang, Q., Wang, X., Zhou, Y., Li, Q., Qiao, S., 2025. Hardware friendly transformer optimization with dynamic attention matrix fusion. In: Proceedings of IEEE ISCAS, pp. 1--5

2025

[1] [1]

FATHOM: Fast attention through optimizing memory

Binder, E., Sudarsanam, A., Sunkavalli, R., Low, T.M., 2025. FATHOM: Fast attention through optimizing memory. In: Proceedings of IEEE IPDPS, pp. 1166--1178

2025

[2] [2]

MOSA: Matrix optimized self-attention hardware accelerator for mobile device

Chang, Y.-R., Chen, H.-F., Shih, H.-Y., 2025. MOSA: Matrix optimized self-attention hardware accelerator for mobile device. In: Proceedings of ICEIC, pp. 1--4

2025

[3] [3]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao, T., 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Dao, T., Fu, D.Y., Ermon, S., Rudra, A., R\' e , C., 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv:2205.14135

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Hardware considerations for tensor implementation and analysis using the field programmable gate array

Grout, I.A., Mullin, L., 2018. Hardware considerations for tensor implementation and analysis using the field programmable gate array. Electronics 7 (11), 320. doi:10.3390/electronics7110320

work page doi:10.3390/electronics7110320 2018

[6] [6]

Realizing mathematics of arrays operations as custom architecture hardware-software co-design solutions

Grout, I.A., Mullin, L., 2022a. Realizing mathematics of arrays operations as custom architecture hardware-software co-design solutions. Information 13 (11), 528. doi:10.3390/info13110528

work page doi:10.3390/info13110528

[7] [7]

Processing in memory for mathematics of arrays operations

Grout, I.A., Mullin, L., 2022b. Processing in memory for mathematics of arrays operations. In: Proceedings of SAI Computing Conference, 2022

2022

[8] [8]

A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick

Gao, Y., Song, Z., Wang, W., Yin, J., 2023. A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick. arXiv:2309.07418

work page arXiv 2023

[9] [9]

Hains, G., 2026. From algorithm to code to hardware to execution: Why isn't computing performance an experimental science? In: Proceedings of PDP 2026, Euromicro Conference on Parallel, Distributed, and Network-Based Processing. Cluj-Napoca, Romania

2026

[10] [10]

A ^3 : Accelerating attention mechanisms in neural networks with approximation

Ham, T.J., Jung, S.J., Kim, S., Oh, Y.H., Park, Y., Song, Y., Park, J.-H., Lee, S., Park, K., Lee, J.W., Jeong, D.-K., 2020. A ^3 : Accelerating attention mechanisms in neural networks with approximation. In: Proceedings of IEEE HPCA, pp. 328--341

2020

[11] [11]

Acceleration of fully connected layers on FPGA using the Strassen matrix multiplication

Le\' o n-Vega, L.G., Chac\' o n-Rodr\' i guez, A., Salazar-Villalobos, E., Castro-God\' i nez, J., 2023. Acceleration of fully connected layers on FPGA using the Strassen matrix multiplication. In: Proceedings of IEEE BIP, pp. 1--6

2023

[12] [12]

Design and implementation of an FPGA-based hardware accelerator for transformer

Li, R., Chen, S., 2025. Design and implementation of an FPGA-based hardware accelerator for transformer. arXiv:2503.16731

work page arXiv 2025

[13] [13]

Research on matrix multiplication optimization and deployment method for heterogeneous platforms

Liu, J., Wu, C., Li, J., Chen, X., Wang, S., 2024. Research on matrix multiplication optimization and deployment method for heterogeneous platforms. In: Proceedings of ICCBD+AI, pp. 266--270

2024

[14] [14]

Array access and performance regarding numerical algorithms

Markus, A., Mullin, L., 2026. Array access and performance regarding numerical algorithms. Transactions on Engineering on Computing Sciences 14 (2), 1--14

2026

[15] [15]

A Mathematics of Arrays

Mullin, L.M., 1988. A Mathematics of Arrays. Ph.D. dissertation. Syracuse University, Syracuse, NY

1988

[16] [16]

From array algebra to energy efficiency on GPUs

Mullin, L.M.R., 2023. From array algebra to energy efficiency on GPUs. arXiv:2306.11148

work page arXiv 2023

[17] [17]

Towards automatic, predictable and high-performance parallel code generation

Mullin, L., Hains, G., 2025. Towards automatic, predictable and high-performance parallel code generation. In: Proceedings of CSCE/PDPTA2024, Springer Nature Switzerland, pp. 109--119

2025

[18] [18]

New mathematics for computer performance: array algebra and cost functions

Mullin, L., Hains, G., 2026. New mathematics for computer performance: array algebra and cost functions. Technical Report. https://hal.science/hal-05442652

2026

[19] [19]

OPTIMUS: Optimized matrix multiplication structure for transformer neural network accelerator

Park, J., Yoon, H., Ahn, D., Choi, J., Kim, J.-J., 2020. OPTIMUS: Optimized matrix multiplication structure for transformer neural network accelerator. In: Proceedings of MLSys, vol. 2, pp. 363--378

2020

[20] [20]

FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction

Qin, Y., Wang, Y., Deng, D., Zhao, Z., Yang, X., Liu, L., Wei, S., Hu, Y., Yin, S., 2023. FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction. In: Proceedings of ISCA, pp. 1--14

2023

[21] [21]

High-performance Gemmini-based matrix multiplication accelerator for deep learning workloads

Sharma, A., Krishna, L.H., Srinivasu, B., 2025. High-performance Gemmini-based matrix multiplication accelerator for deep learning workloads. IEEE Transactions on VLSI Systems 33 (12), 3276--3289

2025

[22] [22]

Design and implementation of a BRAM-banked double-buffered matrix multiplication accelerator for transformer models on edge FPGAs

Tan, X., Teo, T.H., 2025. Design and implementation of a BRAM-banked double-buffered matrix multiplication accelerator for transformer models on edge FPGAs. In: Proceedings of IEEE MCSoC, pp. 582--585

2025

[23] [23]

Improving the performance of DGEMM with MoA and cache-blocking

Thomas, S., Mullin, L., Swirydowicz, K., 2021a. Improving the performance of DGEMM with MoA and cache-blocking. In: Proceedings of ACM ARRAY'21

[24] [24]

Threaded multicore GEMM with MoA and cache-blocking

Thomas, S., Mullin, L., Swirydowicz, K., Khan, R., 2021b. Threaded multicore GEMM with MoA and cache-blocking. In: Proceedings of CSC'21, CSCE'21

[25] [25]

Attention is all you need

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, ., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS)

2017

[26] [26]

Hardware friendly transformer optimization with dynamic attention matrix fusion

Yang, Q., Wang, X., Zhou, Y., Li, Q., Qiao, S., 2025. Hardware friendly transformer optimization with dynamic attention matrix fusion. In: Proceedings of IEEE ISCAS, pp. 1--5

2025