TileLoom: Automatic Dataflow Planning for Tile-Based Languages on Spatial Dataflow Accelerators
Pith reviewed 2026-05-16 21:53 UTC · model grok-4.3
The pith
TileLoom automatically maps tile-based programs like Triton kernels to spatial dataflow accelerators by planning data movement across on-chip networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TileLoom is an end-to-end framework that compiles tile-based programs onto spatial dataflow architectures by distributing tile instances across spatially distributed cores and exploiting the on-chip network and distributed memories to increase data reuse and reduce communication. It introduces a hardware representation that captures interconnect topology, memory hierarchy, and compute capabilities, enabling both architecture-specific optimizations and support for diverse spatial dataflow targets.
What carries the argument
A hardware representation that models interconnect topology, memory hierarchy, and compute capabilities to guide automatic distribution of tiles and data movement.
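To make the idea concrete, here is a minimal Python sketch of what such a hardware representation might record. It is not TileLoom's actual data structure or API: the class names (`MemoryLevel`, `HardwareModel`), the fields, and the 2D core grid with optional wraparound links are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MemoryLevel:
    """One level of the memory hierarchy (hypothetical fields)."""
    name: str                          # illustrative label, e.g. "L1" or "DRAM"
    capacity_bytes: int
    bandwidth_bytes_per_cycle: float


@dataclass(frozen=True)
class HardwareModel:
    """Toy description of a spatial accelerator: topology, memory, compute."""
    rows: int                          # core grid height
    cols: int                          # core grid width
    wraparound: bool                   # True if links wrap at the grid edges
    memories: Tuple[MemoryLevel, ...]  # ordered from closest to farthest
    macs_per_cycle: int                # per-core compute capability (assumed unit)

    @property
    def num_cores(self) -> int:
        return self.rows * self.cols

    def hops(self, src: Tuple[int, int], dst: Tuple[int, int]) -> int:
        """Shortest hop count between two cores given as (row, col)."""
        d_row = abs(src[0] - dst[0])
        d_col = abs(src[1] - dst[1])
        if self.wraparound:
            # With wraparound links, going "the other way" may be shorter.
            d_row = min(d_row, self.rows - d_row)
            d_col = min(d_col, self.cols - d_col)
        return d_row + d_col
```

A planner built on a model like this could query `hops` to prefer placements that keep producer and consumer tiles close together. The real framework may use a very different abstraction.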
If this is right
- Tile-based code written for GPUs can run on spatial accelerators without rewriting for each new hardware topology.
- Communication volume drops because tiles forward operands directly between nearby cores instead of using global memory.
- A single compiler framework supports multiple generations of spatial dataflow machines through updated hardware models.
- Users no longer need to rely exclusively on vendor-supplied hand-tuned libraries for good performance.
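The communication claim can be illustrated with a toy counting model. The sketch below assumes a tiled matrix multiply in which core (i, j) of a square grid computes output tile (i, j) and needs one tile of A and one tile of B at every reduction step. The function names and the counting scheme are assumptions for illustration, not the paper's cost model.

```python
def dram_tile_loads(grid: int, k_steps: int, forward: bool) -> int:
    """Operand tiles fetched from global memory for a grid x grid block of output tiles."""
    if forward:
        # Each A tile is fetched once per row and each B tile once per column;
        # the remaining cores receive them over the on-chip network.
        loads_per_step = 2 * grid
    else:
        # Every core fetches its own A tile and B tile from global memory.
        loads_per_step = 2 * grid * grid
    return loads_per_step * k_steps


def forwarded_tile_transfers(grid: int, k_steps: int) -> int:
    """Core-to-core tile deliveries needed when operands are forwarded."""
    # Each fetched tile must reach the other (grid - 1) cores in its row or column.
    return 2 * grid * (grid - 1) * k_steps
```

In this model the number of operand deliveries is unchanged; forwarding only shifts most of them from global memory to on-chip links, cutting global-memory loads by a factor equal to the grid side.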
Where Pith is reading between the lines
- The same planning approach could apply to other tile-based front ends beyond Triton, widening the set of languages that target spatial hardware.
- On larger chips with more cores, automatic mapping may outperform hand tuning because exhaustive manual placement becomes infeasible.
- Future spatial architectures could expose the same hardware representation interface, letting TileLoom serve as a portable backend.
Load-bearing premise
The hardware representation accurately captures the interconnect topology, memory hierarchy, and compute capabilities of the target spatial dataflow architectures.
What would settle it
Compile the same set of kernels with TileLoom and with vendor libraries, then run both on the same Tenstorrent hardware and compare execution time and output correctness.
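A comparison of this kind could be organized along the following lines. This is only a host-side sketch: `candidate` and `baseline` are hypothetical stand-ins for a TileLoom-compiled kernel and a vendor-library kernel, and a real experiment would launch device kernels and use device-side timers rather than `time.perf_counter`.

```python
import math
import time


def median_runtime(fn, args, repeats=5):
    """Median wall-clock time of fn(*args) over several runs, in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def compare(candidate, baseline, args, rel_tol=1e-6, repeats=5):
    """Check that two kernels agree on output, then compare their runtimes."""
    out_c = candidate(*args)
    out_b = baseline(*args)
    outputs_match = len(out_c) == len(out_b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-9)
        for x, y in zip(out_c, out_b)
    )
    t_c = median_runtime(candidate, args, repeats)
    t_b = median_runtime(baseline, args, repeats)
    # Guard against a zero reading from a coarse timer.
    slowdown = t_c / t_b if t_b > 0 else float("inf")
    return {
        "outputs_match": outputs_match,
        "candidate_s": t_c,
        "baseline_s": t_b,
        "slowdown": slowdown,  # below 1.0 means the candidate is faster
    }
```

Checking correctness before timing matters here: a mapping that is fast but produces wrong output would otherwise look like a win.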
Original abstract
Spatial dataflow accelerators are a promising direction for next-generation computer systems because they can reduce the memory bottlenecks of traditional von Neumann machines such as CPUs and GPUs. They organize computation around explicit, compiler-managed data movement over on-chip networks, allowing operands to be forwarded directly between processing elements and reducing reliance on high-latency, bandwidth-limited global shared memory. However, their performance depends strongly on how workloads are mapped to hardware. Naive mappings can perform poorly, and most users rely on hand-tuned vendor libraries. Thus, despite their potential for high performance, energy efficiency, and cost efficiency, limited programmability remains a major barrier to wider adoption. This paper presents TileLoom, an MLIR-based end-to-end framework that compiles tile-based programs, such as Triton kernels, onto spatial dataflow architectures. Unlike compiler frameworks that focus on optimizing code generation within a single tile, TileLoom distributes tile instances across spatially distributed cores and exploits the on-chip network and distributed memories to increase data reuse and reduce communication. TileLoom introduces a hardware representation that captures interconnect topology, memory hierarchy, and compute capabilities, enabling both architecture-specific optimizations and support for diverse spatial dataflow targets. In experiments on two generations of Tenstorrent systems, TileLoom achieves performance comparable to vendor libraries on various kernels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TileLoom, an MLIR-based end-to-end framework for compiling tile-based programs such as Triton kernels to spatial dataflow accelerators. It introduces a hardware representation that captures interconnect topology, memory hierarchy, and compute capabilities to automatically distribute tile instances across cores, exploit on-chip networks for data reuse, and generate mappings. Experiments on two generations of Tenstorrent systems report performance comparable to vendor libraries on various kernels.
Significance. If the results hold under rigorous validation, TileLoom would meaningfully advance programmability for spatial dataflow architectures by replacing hand-tuned libraries with automatic planning while preserving performance. Its reusable hardware-model abstraction supports multiple targets and integrates cleanly with MLIR; both are concrete strengths that could accelerate adoption in the field.
major comments (2)
- §4 (Hardware Representation): the central performance claim rests on this model correctly capturing interconnect topology, memory hierarchy, and compute capabilities so that generated mappings are both legal and near-optimal. The manuscript supplies no description of how topology or bandwidth parameters are obtained, no microbenchmark validation of modeled vs. measured latencies, and no sensitivity analysis showing that small modeling errors do not produce large performance deviations. This is load-bearing for the comparability result.
- §5 (Experimental Evaluation): the claim of 'performance comparable to vendor libraries' is stated without quantitative metrics, baseline details, absolute runtimes, or a clear description of the experimental methodology and kernel set. Without these, it is impossible to assess whether the result is robust or merely coincidental on the tested cases.
minor comments (2)
- Abstract: the performance claim would be more informative if it included at least one concrete metric (e.g., average speedup or range) rather than the qualitative statement 'comparable'.
- Notation (throughout): ensure consistent terminology between 'tile instances', 'dataflow planning', and 'mappings' across sections to avoid reader confusion.
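One way to produce the kind of concrete metric requested above is to summarize per-kernel runtimes as a geometric-mean speedup together with its range. The sketch below is an illustration only; the function name and the choice of geometric mean are assumptions, and no numbers from the paper are used.

```python
import math


def summarize_speedups(baseline_s, candidate_s):
    """Return (geometric mean, min, max) of per-kernel speedups.

    Speedup is baseline time divided by candidate time, so values above 1.0
    mean the candidate is faster than the baseline on that kernel.
    """
    if len(baseline_s) != len(candidate_s) or not baseline_s:
        raise ValueError("need two equal-length, non-empty runtime lists")
    ratios = [b / c for b, c in zip(baseline_s, candidate_s)]
    # The geometric mean is the exponential of the mean log-ratio.
    geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return geomean, min(ratios), max(ratios)
```

The geometric mean is the usual choice for ratios because a 2x speedup on one kernel and a 2x slowdown on another average out to 1.0, which an arithmetic mean would not report.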
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to incorporate additional details on the hardware model and experimental evaluation.
Point-by-point responses
- Referee: §4 (Hardware Representation): the central performance claim rests on this model correctly capturing interconnect topology, memory hierarchy, and compute capabilities so that generated mappings are both legal and near-optimal. The manuscript supplies no description of how topology or bandwidth parameters are obtained, no microbenchmark validation of modeled vs. measured latencies, and no sensitivity analysis showing that small modeling errors do not produce large performance deviations. This is load-bearing for the comparability result.
  Authors: We agree that the hardware representation is foundational and that the manuscript would be strengthened by explicit details on parameter acquisition and validation. In the revised version we will expand §4 with a description of how topology, bandwidth, and compute parameters are derived from Tenstorrent public hardware specifications and datasheets. We will also add microbenchmark results that compare modeled latencies against direct measurements on the target devices, and we will include a sensitivity study showing the effect of small parameter perturbations on final performance. These additions will be placed in §4 and will directly support the legality and near-optimality claims.
  Revision: yes
- Referee: §5 (Experimental Evaluation): the claim of 'performance comparable to vendor libraries' is stated without quantitative metrics, baseline details, absolute runtimes, or a clear description of the experimental methodology and kernel set. Without these, it is impossible to assess whether the result is robust or merely coincidental on the tested cases.
  Authors: We acknowledge that the current experimental section would benefit from greater quantitative transparency. In the revision we will augment §5 with explicit performance ratios and absolute runtimes for each kernel, precise identification of the vendor library baselines (including version and configuration), and a complete description of the experimental methodology, kernel set, input sizes, and hardware setups on both Tenstorrent generations. These changes will allow readers to evaluate the robustness of the comparability result.
  Revision: yes
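The sensitivity study promised in the §4 response could take roughly the following shape: perturb one model parameter and check whether the planner's preferred data-movement plan changes. The cost formula, the plan fields (`bytes`, `hops`), and all parameter names below are assumptions made for this sketch; the paper's performance model is not described in enough detail here to reproduce it.

```python
def plan_cost(plan, bandwidth, hop_latency):
    """Toy cost: transfer time plus a per-hop network latency term."""
    return plan["bytes"] / bandwidth + plan["hops"] * hop_latency


def best_plan(plans, bandwidth, hop_latency):
    """Name of the cheapest plan under the given model parameters."""
    return min(plans, key=lambda name: plan_cost(plans[name], bandwidth, hop_latency))


def bandwidth_sensitivity(plans, bandwidth, hop_latency, rel_errors=(-0.1, 0.0, 0.1)):
    """Map each relative bandwidth error to the plan the model would select."""
    return {
        err: best_plan(plans, bandwidth * (1.0 + err), hop_latency)
        for err in rel_errors
    }
```

If the selected plan flips under a small perturbation, as it can when two plans have nearly equal modeled cost, then the accuracy of that parameter is load-bearing and deserves microbenchmark validation.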
Circularity Check
No circularity; performance claims rest on implementation and external benchmarking
The paper describes an MLIR-based compiler framework (TileLoom) that introduces a hardware representation for spatial dataflow targets and reports experimental results on real Tenstorrent hardware. No equations, fitted parameters, or self-referential derivations are present in the provided text. The central performance claim is obtained by running generated code on physical systems and comparing against vendor libraries, which constitutes independent empirical validation rather than any reduction to inputs by construction. No load-bearing self-citations or ansatzes are invoked to force the result.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "TileLoom introduces a hardware representation that captures interconnect topology, memory hierarchy, and compute capabilities"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "performance model estimates the cost of different data-movement plans"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Fast Cross-Operator Optimization of Attention Dataflow
  MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.