pith. machine review for the scientific record.

arxiv: 2604.03446 · v1 · submitted 2026-04-03 · 💻 cs.AR

Recognition: 2 theorem links · Lean Theorem

Fast Cross-Operator Optimization of Attention Dataflow

Bo Yuan, Hailiang Hu, Haodong Chang, Jiang Hu, Rongjian Liang, Yu Gong, Zhenrui Wang, Zhexiang Tang

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:43 UTC · model grok-4.3

classification 💻 cs.AR
keywords attention · dataflow optimization · cross-operator · analytical performance model · matrix encoding · energy efficiency · latency reduction · accelerators

The pith

MMEE uses matrix encoding of dataflow solutions to optimize attention computations across operators, delivering up to 50% lower energy use and 69% lower latency while running 64 to 343 times faster than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MMEE, a method for optimizing dataflow in attention kernels used in transformers. It builds an analytical model that encodes many possible tiling, ordering, and buffer choices as matrix multiplications. This allows rapid evaluation and pruning of a huge space of solutions to find the best one for given hardware. A reader would care because attention dominates LLM workloads, so better dataflow directly improves efficiency on accelerators. The approach outperforms existing techniques in both quality and speed across tested cases.
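As an illustrative sketch of what matrix-encoded evaluation can look like (not the paper's actual encoding): if each candidate dataflow reduces to a vector of resource counts, a batch of candidates becomes a matrix, and scoring every candidate under a cost model is a single matrix-vector product. All names and numbers below are assumptions for illustration only.

```python
import numpy as np

# Illustrative only: each candidate dataflow is encoded as a row of
# resource counts -- [DRAM accesses, on-chip buffer accesses, MAC ops].
# The actual MMEE encoding and cost terms are defined in the paper.
candidates = np.array([
    [1.2e6, 8.0e6, 4.0e7],   # e.g. a large-tile, DRAM-heavy schedule
    [0.4e6, 2.1e7, 4.0e7],   # e.g. a fused schedule that re-reads buffers
    [0.6e6, 1.5e7, 4.0e7],   # e.g. an intermediate tiling choice
])

# Hypothetical per-event energy costs (pJ) and per-event latencies (ns).
energy_per_event = np.array([640.0, 6.0, 0.9])
latency_per_event = np.array([4.0, 0.2, 0.05])

# One matrix-vector product scores every candidate at once.
energy = candidates @ energy_per_event
latency = candidates @ latency_per_event

best_for_energy = int(np.argmin(energy))
best_for_latency = int(np.argmin(latency))
print(f"energy-optimal candidate: {best_for_energy}, energy = {energy[best_for_energy]:.3e} pJ")
print(f"latency-optimal candidate: {best_for_latency}, latency = {latency[best_for_latency]:.3e} ns")
```

The point of such an encoding is that adding a candidate adds one row, not one more pass through a per-candidate model; the paper's contribution is the encoding of tiling, ordering, and buffer decisions into this form, which the toy above does not attempt.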

Core claim

Cross-operator dataflow optimization for attention can be performed by constructing an analytical performance model that encodes candidate solutions via matrix multiplication, enabling exhaustive enumeration with effective pruning to identify optimal configurations for energy or latency objectives on accelerators.

What carries the argument

The analytical performance model that represents dataflow decisions through matrix encodings, supporting fast evaluation of many candidates for cross-operator optimization.
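The core claim also leans on pruning to keep exhaustive enumeration tractable. The paper's actual pruning rule is not reproduced in this review, so the sketch below only illustrates the generic pattern: discard candidates that violate a hard constraint (here, a hypothetical on-chip buffer capacity) before scoring the survivors with the same matrix product; all names and numbers are assumptions.

```python
import numpy as np

def enumerate_with_pruning(candidates, buffer_need, buffer_capacity, unit_cost):
    """Generic constraint-based pruning sketch, not the paper's actual rule.

    candidates      -- (N, K) matrix of per-candidate resource counts
    buffer_need     -- (N,) peak on-chip buffer requirement per candidate
    buffer_capacity -- scalar hardware limit (hypothetical)
    unit_cost       -- (K,) per-event cost vector (energy or latency)
    """
    feasible = buffer_need <= buffer_capacity        # prune infeasible rows
    if not np.any(feasible):
        return None, None
    costs = candidates[feasible] @ unit_cost         # score survivors in one product
    surviving_indices = np.flatnonzero(feasible)
    best = surviving_indices[np.argmin(costs)]
    return int(best), float(costs.min())
```

For the "optimal configurations" wording in the core claim to hold, any real pruning rule has to be conservative, removing only candidates that provably cannot be feasible or optimal.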

If this is right

  • For energy-driven optimization, energy consumption drops 48%-50% and latency drops 31%-69% versus state-of-the-art methods.
  • For latency-driven optimization, energy drops 40%-50% and latency drops 40%-69% simultaneously.
  • The optimization process runs 64× to 343× faster than previous works.
  • The gains hold across multiple accelerator configurations and attention test cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The enumeration speed could support runtime dataflow selection if the model can be recomputed quickly during inference.
  • Similar matrix encodings might apply to optimizing dataflow for other heavy kernels such as convolutions.
  • Reliable analytical models of this type could reduce the need for cycle-accurate simulations when exploring new accelerator designs.

Load-bearing premise

The analytical performance model accurately captures real hardware behavior across the tested accelerator configurations without requiring post-hoc calibration or detailed cycle-accurate simulation.

What would settle it

Deploying the MMEE-optimized dataflows on actual hardware accelerators and verifying whether measured energy consumption and latency match the model's predictions within a small error margin.
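A minimal version of that check, under the assumption that paired model predictions and hardware measurements exist for each deployed configuration (the field names and 10% tolerance below are placeholders, not values from the paper):

```python
def model_error_report(predicted, measured, tolerance=0.10):
    """Relative error of model predictions vs. hardware measurements.

    predicted, measured -- dicts mapping config name -> (energy, latency)
    tolerance           -- acceptable relative error, e.g. 10% (placeholder)
    """
    report = {}
    for name, (e_pred, l_pred) in predicted.items():
        e_meas, l_meas = measured[name]
        e_err = abs(e_pred - e_meas) / e_meas
        l_err = abs(l_pred - l_meas) / l_meas
        report[name] = {
            "energy_error": e_err,
            "latency_error": l_err,
            "within_tolerance": e_err <= tolerance and l_err <= tolerance,
        }
    return report
```

The important part is not the error metric but where the measurements come from: if both columns are produced by the same analytical model, the check is vacuous, which is exactly the referee's first major comment below.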

Figures

Figures reproduced from arXiv: 2604.03446 by Bo Yuan, Hailiang Hu, Haodong Chang, Jiang Hu, Rongjian Liang, Yu Gong, Zhenrui Wang, Zhexiang Tang.

Figure 1: Comparison of various cross-operator dataflow mappers.
Figure 2: Attention computation and accelerator architecture.
Figure 3: Fusion dataflow keeps intermediate results (…)
Figure 5: The impact of tiling. Each compute stage represents the multiplication …
Figure 6: The computation ordering on the left is valid, while the right case is …
Figure 8: Buffer management and its impact. (a) The dataflow follows Fig. 7(a). …
Figure 9: Pseudo nested loop example and components. Loop boundaries come …
Figure 10: (a) Specific numeric values are assigned to loop boundaries to illustrate the …
Figure 11: Example of buffer-size requirement and DRAM-access estimation.
Figure 12: Overview of the MMEE dataflow optimization framework. …
Figure 15: Fusing FFN networks of GPT-3-6.7B.
Figure 16: Fusing attention computation of GPT-3-6.7B.
Figure 17: Dataflow optimization results on energy and latency for Accel. 1. The overall energy is the sum of four components, and the overall latency is …
Figure 18: Dataflow optimization results on energy and latency for Accel. 2.
Figure 19: Compute utilization of TileFlow and MMEE.
Figure 20: Energy-latency trade-off of the attention fusion on Accel. 2.
Figure 23: Energy and latency trends with increasing sequence length for fused …
Figure 24: Analysis of decision elements. "TF", "TF+T", and "TF+T+BM" …
Figure 25: Comparison of Chimera, TileFlow, Orojenesis, MMEE* (MMEE …
Figure 26: A case study of MMEE* (MMEE without recomputation) and MMEE …
Figure 27: EDP-driven optimization of reconfigurable PE arrays. Fixed: 32×32 …
read the original abstract

Attention is a fundamental computational kernel that accounts for the majority of the workload in transformer and LLM computing. Optimizing dataflow is crucial for enhancing both performance and energy efficiency in attention computation. This optimization involves a range of decisions, such as tiling, computation ordering and buffer management, and can be applied at both intra-operator and inter-operator levels, resulting in a highly complex decision space. We propose a new approach to cross-operator dataflow optimization. Its centerpiece is an analytical performance model that spans a large decision space and enables matrix-based encoding of multiple candidate solutions. Built on this foundation, a vast number of solutions can be evaluated rapidly, and with the aid of an effective pruning technique, the optimal solution can be identified through exhaustive enumeration. We refer to our method as MMEE (Matrix Multiplication Encoded Enumeration). The ability to efficiently enumerate a large design space allows MMEE to deliver higher-quality solutions at a substantially faster speed compared to prior approaches. The MMEE approach is evaluated across various test cases for different accelerator configurations. For energy-driven optimization, MMEE reduces energy consumption by 48%-50% and latency by 31%-69%, compared to state-of-the-art methods. For latency-driven optimization, MMEE achieves simultaneous reductions of 40%-50% in energy consumption and 40%-69% in latency, respectively. Additionally, MMEE is $64\times$ to $343\times$ faster than previous works.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MMEE (Matrix Multiplication Encoded Enumeration), a method for cross-operator dataflow optimization of attention kernels. It encodes tiling, ordering, and buffer decisions via an analytical performance model as matrix operations, enabling rapid exhaustive enumeration with pruning to identify optimal solutions for energy- or latency-driven objectives on accelerators. The paper reports 48-50% energy and 31-69% latency reductions versus prior methods for energy-driven cases, 40-50% energy and 40-69% latency reductions for latency-driven cases, and 64x-343x faster optimization times.

Significance. If the analytical model proves accurate, the work offers a practical advance for efficient transformer/LLM accelerators by systematically exploring cross-operator dataflows that prior heuristics miss. The matrix-based encoding for fast enumeration is a clear technical contribution that could scale to larger design spaces.

major comments (2)
  1. [Evaluation section] The headline claims (48%-50% energy reduction, 31%-69% latency reduction, 64x-343x speedup) are produced entirely by the analytical model used both to guide search and to report final metrics. No calibration against RTL simulation, gate-level power analysis, or hardware measurements is described, so it is impossible to assess whether unmodeled effects (interconnect contention, bank conflicts) erode the reported gains.
  2. [Analytical performance model section] The model encodes dataflow decisions as matrix operations, yet the manuscript provides no equations or parameter derivation showing how latency and energy costs are computed from hardware characteristics. Without this, it is unclear whether the model systematically under-counts costs that grow with cross-operator fusion.
minor comments (1)
  1. [Abstract] The abstract and evaluation refer to 'various test cases' and 'different accelerator configurations' without listing the specific attention variants, tiling factors, or accelerator parameters used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline the changes we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation section] The headline claims (48%-50% energy reduction, 31%-69% latency reduction, 64x-343x speedup) are produced entirely by the analytical model used both to guide search and to report final metrics. No calibration against RTL simulation, gate-level power analysis, or hardware measurements is described, so it is impossible to assess whether unmodeled effects (interconnect contention, bank conflicts) erode the reported gains.

    Authors: We acknowledge that all reported metrics are generated by the analytical model. In the revised manuscript we will add a new subsection in Evaluation that explicitly discusses model assumptions, lists unmodeled effects such as interconnect contention and bank conflicts, and reports a cycle-accurate simulator validation on a representative subset of the evaluated configurations. Full RTL or gate-level power analysis remains outside the scope of this work, which focuses on rapid cross-operator enumeration; we will note this limitation clearly. revision: partial

  2. Referee: [Analytical performance model section] The model encodes dataflow decisions as matrix operations, yet the manuscript provides no equations or parameter derivation showing how latency and energy costs are computed from hardware characteristics. Without this, it is unclear whether the model systematically under-counts costs that grow with cross-operator fusion.

    Authors: We agree that the equations and parameter derivations were insufficiently detailed. In the revised Analytical Performance Model section we will insert the complete latency and energy equations, show how each term is derived from hardware parameters (memory bandwidth, compute throughput, buffer sizes, etc.), and explicitly derive the additional costs incurred by cross-operator fusion. This will make the model transparent and allow readers to assess potential under-counting. revision: yes
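Since the equations themselves are not reproduced in this review, the following is only a first-order sketch of how latency and energy terms are commonly assembled from hardware parameters such as memory bandwidth, compute throughput, and per-access energies; every name and unit below is a placeholder rather than the authors' model.

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    # Placeholder parameters; the revised manuscript derives its own.
    dram_bandwidth_gbps: float       # GB/s
    peak_macs_per_s: float           # MAC operations per second
    energy_per_dram_byte_pj: float   # pJ per DRAM byte
    energy_per_buffer_byte_pj: float # pJ per on-chip buffer byte
    energy_per_mac_pj: float         # pJ per MAC

def first_order_cost(hw: Hardware, dram_bytes: float, buffer_bytes: float, macs: float):
    """Roofline-style estimate: latency is the slower of memory and compute,
    energy is a weighted sum of access and compute counts."""
    memory_time_s = dram_bytes / (hw.dram_bandwidth_gbps * 1e9)
    compute_time_s = macs / hw.peak_macs_per_s
    latency_s = max(memory_time_s, compute_time_s)
    energy_pj = (dram_bytes * hw.energy_per_dram_byte_pj
                 + buffer_bytes * hw.energy_per_buffer_byte_pj
                 + macs * hw.energy_per_mac_pj)
    return latency_s, energy_pj
```

Whether the real model under-counts costs that grow with fusion, as the referee worries, hinges on terms a sketch like this omits, for example recomputation overhead, bank conflicts, and interconnect contention.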

Circularity Check

0 steps flagged

No circularity: analytical model provides independent scoring for enumeration

full rationale

The derivation relies on an analytical performance model that encodes tiling, ordering, and buffer decisions as matrix operations to enable rapid evaluation of candidate dataflows. This model is applied uniformly to score solutions during exhaustive enumeration with pruning, and the reported energy/latency reductions are direct outputs of selecting the minimum-cost solution under the model. No step reduces a claimed prediction to a fitted parameter or a self-citation by construction; the model is presented as a forward evaluator rather than a tautology. The approach is self-contained with respect to the model's own equations, and comparisons to prior methods are performed within the same framework. This yields a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on an analytical performance model whose internal parameters and accuracy assumptions are not detailed in the abstract; no explicit free parameters, axioms, or invented entities are named.

pith-pipeline@v0.9.0 · 5569 in / 1273 out tokens · 32861 ms · 2026-05-13T17:43:01.404176+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
