pith. machine review for the scientific record.

arxiv: 2604.22297 · v2 · submitted 2026-04-24 · 💻 cs.AR


Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation


Pith reviewed 2026-05-08 09:26 UTC · model grok-4.3

classification 💻 cs.AR
keywords CGRA · polyhedral transformations · matrix multiplication · compiler optimization · kernel substitution · reconfigurable arrays · performance speedup · loop transformations

The pith

Polyhedral transformations detect hidden matrix multiplications in code and replace them with pre-optimized CGRA kernels for speedups up to 9.1x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that polyhedral transformations can systematically detect hidden matrix-matrix multiplication patterns in general source code through loop reordering and splitting. This matters because standard compilation for coarse-grained reconfigurable arrays struggles with the multi-dimensional structure of mmul, leading to poor parallelism extraction and lower runtime performance. If correct, the approach would allow substitution of these patterns with a specialized parametrizable mmul kernel schedule while compiling the remaining code independently. The result is mixed CGRA configurations that improve resource use and deliver concrete speedups across benchmarks and hardware sizes without requiring explicit mmul in the original program.
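The detection step described above can be made concrete with a toy example. The code below is a hypothetical illustration, not taken from the paper: a loop nest in which an mmul is hidden because it is written k-outermost and fused with an unrelated statement, and the split-and-reordered form whose isolated triple loop a pre-optimized kernel could replace.

```c
/* Hypothetical illustration of "hidden mmul" exposure via loop splitting
   and reordering; all names and shapes are invented for this sketch. */
#include <assert.h>
#include <string.h>

#define N 4

/* mmul hidden inside a k-outer nest, fused with an unrelated update */
static void hidden(int A[N][N], int B[N][N], int C[N][N], int bias[N]) {
    for (int k = 0; k < N; k++) {
        bias[k] += 1;                         /* non-mmul statement */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j]; /* mmul contribution */
    }
}

/* after loop splitting (fission) and interchange: the canonical i-j-k
   triple loop is now isolated and substitutable by a CGRA kernel */
static void transformed(int A[N][N], int B[N][N], int C[N][N], int bias[N]) {
    for (int k = 0; k < N; k++)
        bias[k] += 1;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```

Both versions compute the same result over integers (the reduction can be reassociated and the fissioned statement has no dependence on the mmul), which is the kind of legality property a polyhedral check must certify before substitution.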

Core claim

The central claim is that a novel compilation methodology adapts program representations using polyhedral transformations to expose hidden mmul operations, then substitutes them with optimized assembly from a pre-optimized, size-parametrizable CGRA kernel schedule. The generated configurations combine the pre-compiled kernels with independently compiled program sections, maximizing resource utilization and runtime performance even when mmul is not directly apparent in the source code.

What carries the argument

Polyhedral transformations that analyze complex patterns and perform loop reordering and splitting to expose hidden mmul operations, combined with a parametrizable specialized mmul CGRA kernel schedule for substitution.

Load-bearing premise

Polyhedral transformations will reliably and correctly identify hidden mmul patterns in arbitrary code without semantic errors or missed opportunities, and substituting pre-optimized kernels will always integrate cleanly with the independently compiled remaining sections.

What would settle it

A benchmark program containing a known hidden mmul where the polyhedral transformations fail to detect the pattern, produce an incorrect substitution, or yield no performance gain or erroneous results on a target CGRA.
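One such falsifier is easy to sketch (illustrative only, not a claim about the paper's implementation): if the output of the detected nest aliases an input, the nest is syntactically an mmul, yet substituting a kernel that assumes disjoint operands changes the result.

```c
/* Sketch of a concrete falsifier: an mmul-shaped nest whose output
   aliases an input. A detector that assumes disjoint C, A, B and
   substitutes a kernel (which effectively reads a snapshot of A)
   silently changes the result. Illustrative, invented example. */
#include <assert.h>

#define N 2

/* In-place accumulation: later reads of A observe earlier writes to C
   whenever C and A alias. */
static void mmul_accum(int *C, const int *A, const int *B) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}
```

Any sound detector must therefore prove, or conservatively assume away, aliasing between C, A, and B before substituting the pre-optimized kernel.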

Figures

Figures reproduced from arXiv: 2604.22297 by David Atienza, Fernando Castro, Giovanni Ansaloni, Katzalin Olcoz, María José Belda, Yuxuan Wang.

Figure 1. Overview of the proposed CGRA compilation flow integrating pre-compiled kernels for efficient hardware execution.

Figure 2. Execution flow of mmul based on CDFG compilation, highlighting the inefficiency resulting from only exploiting data parallelism in innermost loops.

Figure 3. Motivating example for mapping a benchmark with hidden mmul onto CGRA: baseline CDFG scheduling exploits operation parallelism only within a basic block, CDFG with MS pipelines iterations of the innermost loop, and the proposed kernel scheduling substitutes the entire mmul subspace with a highly parallel schedule.

Figure 4. Overview of the integrated CGRA compilation framework: front-end (C parsing and MLIR/affine lowering), middle-end (linearity-based fusion, statement reordering/loop splitting, and mmul kernel extraction with context generation), and back-end (assembly integration with kernel instantiation).

Figure 5. Abstract representation of the mmul kernel implementation structure, illustrating the sequence of steps and their organization within the execution flow.

Figure 6. Example mapping of the mmul kernel on a 4 × 4 CGRA with torus interconnections between PEs.

Figure 7. Target CGRA architecture. Table I: Benchmark characteristics.

Benchmark        #ops-CDFG  #ops-kernel-total  #ops-kernel-map  loop attributes
mmul             84         306                32               3-level nested
mmul_relu        85         338                64               3-level nested
mmul_batch       147        372                98               4-level nested
2mm              185        749                201              2x3-level nested
3mm              262        925                103              3x3-level nested
gemm             100        432                158              3-level nested
PCA              76         344                70               2-level + 3-level nested
Kalman_filter_1  85         348                74               2-lev…

Figure 8. Compilation time of benchmark applications on CGRAs with various sizes using the proposed kernel pre-compilation approach, compared with Compigra with MS optimization [23].

Figure 9. Execution cycle count (CC) comparison between Compigra and the proposed strategy with pre-compiled mmul kernels, across CGRA configurations and matrix sizes; numbers above the clipped bars give the speedup w.r.t. MS and unroll compilation (top and bottom values, respectively), with a broken y-axis for readability.

Figure 10. Normalized runtimes on a 4x4 OpenEdgeCGRA [29] compiled using kernel pre-compilation, e-GPU, and systolic array, w.r.t. CGRA using CDFG compilation with MS; lower is better.
Original abstract

Modern computing workloads commonly involve matrix-matrix multiplication (mmul) as a core computing pattern. Coarse-Grained Reconfigurable Arrays (CGRAs) can flexibly and efficiently support it, since they combine operation-level reconfigurability and high energy efficiency. However, mapping computational kernels that include mmul with state-of-the-art compilation strategies often leads to suboptimal results, since its multi-dimensional structure hampers the uncovering of its inherent parallelism and, ultimately, runtime performance. Here, we take a different position: we introduce a specialized mmul CGRA kernel schedule, parametrizable across different CGRA sizes. Then, we describe a novel compilation methodology that adapts program representations to effectively leverage it, employing polyhedral transformations to analyze complex computational patterns and expose hidden mmul operations through loop reordering and splitting. The identified patterns are then substituted with optimized assembly, while the remaining program sections are compiled independently. CGRA configurations are then generated, encompassing pre-compiled and compiled parts. Our strategy maximizes resource utilization and ultimately run-time performance, even when mmul is not directly apparent in the source code. The experimental results show speedups up to 9.1x across different benchmarks that contain hidden mmuls and CGRA instances of various sizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a compilation methodology for Coarse-Grained Reconfigurable Arrays (CGRAs) that uses polyhedral transformations (loop reordering and splitting) to detect hidden matrix-matrix multiplication (mmul) patterns in source code even when not syntactically apparent. These patterns are substituted with pre-optimized, parametrizable mmul kernel schedules while the remaining code sections are compiled independently; the resulting mixed CGRA configurations are claimed to maximize resource utilization and deliver speedups up to 9.1x on benchmarks containing hidden mmuls across varying CGRA sizes.

Significance. If the semantic preservation of the polyhedral detection and substitution steps holds and the performance results are reproducible, the work could meaningfully advance CGRA compilation by enabling automatic exploitation of specialized kernels for common linear-algebra patterns. The parametrizable mmul schedule and the idea of mixing pre-compiled kernels with independently compiled code are concrete strengths that address a practical bottleneck in reconfigurable architectures.

major comments (3)
  1. [Abstract and experimental results section] Abstract and experimental results section: the central claim of speedups up to 9.1x is presented without any description of the benchmarks used, the baseline CGRA compilers or mappings, error bars, statistical tests, CGRA instance sizes, or the method used to verify the presence and correct substitution of hidden mmuls, leaving the performance evaluation only partially supported.
  2. [Polyhedral transformation methodology] Polyhedral transformation methodology: the claim that loop reordering and splitting reliably expose hidden mmuls requires demonstration that the detected sub-computations preserve exact dataflow, aliasing behavior, and iteration-space semantics for non-obvious cases; no such formal argument, proof, or counter-example analysis is provided, which is load-bearing for the correctness of kernel substitution.
  3. [CGRA configuration generation] CGRA configuration generation: the integration of pre-compiled kernels with independently compiled sections does not address potential interface mismatches, data-transfer overheads, or boundary aliasing issues, which directly affects the asserted gains in resource utilization and run-time performance.
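The third point can be made concrete with a back-of-the-envelope cost model. This is purely illustrative; the review gives no details of how the paper accounts for boundary costs, so both the model and the numbers below are assumptions.

```c
/* Illustrative cost model (not from the paper): effective speedup of
   substituting a pre-compiled kernel, once data-transfer cycles at the
   kernel boundary are charged against the kernel's raw gain. */
#include <assert.h>

/* baseline_cc: cycles of the region under baseline CDFG/MS compilation
   kernel_cc:   cycles of the pre-compiled kernel alone
   xfer_cc:     cycles spent moving data across the kernel boundary    */
static double effective_speedup(long baseline_cc, long kernel_cc, long xfer_cc) {
    return (double)baseline_cc / (double)(kernel_cc + xfer_cc);
}
```

For example, a region whose kernel alone is 9.1x faster than the baseline drops to roughly 4.6x once boundary transfers cost as much as the kernel itself — which is why the referee asks for these overheads to be quantified.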
minor comments (2)
  1. [Kernel schedule description] The parametrization of the specialized mmul kernel schedule across different CGRA sizes would be clearer if accompanied by an explicit equation or small pseudocode example showing how the schedule parameters are derived.
  2. [Related work] A brief discussion of related polyhedral CGRA mapping techniques is missing, which would help readers assess the incremental contribution of the reordering-plus-substitution strategy.
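On the first minor comment, the kind of pseudocode the referee asks for might look like the following. This is a hypothetical reconstruction from the Figure 8 description (two PEs form a load-sharing group; C[i][j + u] for u = 1, …, U computed in parallel), not the authors' actual derivation; `group_size`, `U`, and `j_tiles` are invented names.

```c
/* Hypothetical derivation of mmul kernel schedule parameters from the
   CGRA dimensions (rows x cols) and matrix size n. Invented sketch. */
typedef struct {
    int group_size; /* PEs sharing loads from A and B          */
    int U;          /* parallel C[i][j+u] updates per row i    */
    int j_tiles;    /* column tiles needed to cover the matrix */
} KernelParams;

static KernelParams derive_params(int rows, int cols, int n) {
    KernelParams p;
    p.group_size = 2;                   /* per the Figure 8 description  */
    p.U = (rows * cols) / p.group_size; /* one update lane per PE group  */
    if (p.U > n)
        p.U = n;                        /* never unroll past matrix size */
    p.j_tiles = (n + p.U - 1) / p.U;    /* ceil(n / U)                   */
    return p;
}
```

On a 4 × 4 CGRA with n = 32 this would give U = 8 and four column tiles; the real schedule presumably also accounts for interconnect topology and per-PE register constraints.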

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, agreeing where revisions are warranted and outlining the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and experimental results section] Abstract and experimental results section: the central claim of speedups up to 9.1x is presented without any description of the benchmarks used, the baseline CGRA compilers or mappings, error bars, statistical tests, CGRA instance sizes, or the method used to verify the presence and correct substitution of hidden mmuls, leaving the performance evaluation only partially supported.

    Authors: We agree that the abstract is concise and that the experimental section would benefit from more explicit detail to fully support the performance claims. In the revised manuscript we will expand the experimental results section to describe the benchmarks (a collection of linear-algebra kernels containing non-obvious mmul patterns), the baseline (a state-of-the-art CGRA compiler without polyhedral transformations), the CGRA sizes evaluated, the inclusion of error bars and statistical tests on the reported speedups, and the verification procedure used to confirm the presence and correct substitution of hidden mmuls. We will also revise the abstract to briefly reference these experimental elements. revision: yes

  2. Referee: [Polyhedral transformation methodology] Polyhedral transformation methodology: the claim that loop reordering and splitting reliably expose hidden mmuls requires demonstration that the detected sub-computations preserve exact dataflow, aliasing behavior, and iteration-space semantics for non-obvious cases; no such formal argument, proof, or counter-example analysis is provided, which is load-bearing for the correctness of kernel substitution.

    Authors: We agree that an explicit formal argument is needed to substantiate the semantic preservation of the transformations. While the approach rests on the standard guarantees of the polyhedral model for affine loop nests, the current manuscript does not provide a dedicated proof or counter-example analysis. In the revision we will add a subsection containing a formal argument demonstrating preservation of dataflow, aliasing behavior, and iteration-space semantics, together with counter-example analysis for non-obvious cases. revision: yes

  3. Referee: [CGRA configuration generation] CGRA configuration generation: the integration of pre-compiled kernels with independently compiled sections does not address potential interface mismatches, data-transfer overheads, or boundary aliasing issues, which directly affects the asserted gains in resource utilization and run-time performance.

    Authors: We agree that the description of configuration generation is currently high-level and does not explicitly treat these integration concerns. In the revised manuscript we will expand the relevant methodology section to discuss how source-level substitution ensures interface compatibility, to quantify data-transfer overheads, and to explain how boundary aliasing is managed through the polyhedral analysis. We will also incorporate these factors into the experimental evaluation to show their impact on the reported resource-utilization and performance gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical methodology with independent experimental validation

Full rationale

The paper presents a compilation flow that applies polyhedral loop transformations to detect hidden mmul patterns, substitutes pre-optimized kernels, and generates mixed CGRA configurations. Performance claims (up to 9.1x speedups) are supported by benchmark experiments rather than by construction from the input program or fitted parameters. No self-definitional equations, fitted-input predictions, load-bearing self-citations, or smuggled ansatzes appear in the derivation. The approach is validated against external benchmarks and does not reduce its central result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard compiler assumptions about polyhedral analysis preserving program semantics and on the existence of beneficial pre-optimized kernels; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Polyhedral transformations can be applied to loop nests to expose matrix multiplication patterns without changing program semantics.
    Invoked when describing how loop reordering and splitting identify hidden mmul operations.

pith-pipeline@v0.9.0 · 5538 in / 1221 out tokens · 43545 ms · 2026-05-08T09:26:28.616646+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 2 canonical work pages

  1. [1] Elaye Karstadt and Oded Schwartz. Matrix multiplication, a little faster. Journal of the ACM, 67(1):1–31, 2020.
  2. [2] Alexander Suhre, Furkan Keskin, Tulin Ersahin, Rengul Cetin-Atalay, Rashid Ansari, and A. Enis Cetin. A multiplication-free framework for signal processing and applications in biomedical image analysis. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1123–1127. IEEE, 2013.
  3. [3] Yuxuan Wang, Lara Orlandic, Simone Machetti, Giovanni Ansaloni, and David Atienza. ACE: Automated optimization towards iterative classification in edge health monitors. IEEE Transactions on Biomedical Circuits and Systems, 19(1):82–92, 2025.
  4. [4] Pavel Mach and Zdenek Becvar. Mobile edge computing: A survey on architecture and computation offloading. IEEE Communications Surveys & Tutorials, 19(3):1628–1656, 2017.
  5. [5] Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram, and Muhammad Shafique. X-CGRA: An energy-efficient approximate coarse-grained reconfigurable architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(10):2558–2571, 2019.
  6. [6] S. Alexander Chin, Noriaki Sakamoto, Allan Rui, Jim Zhao, Jin Hee Kim, Yuko Hara-Azumi, and Jason Anderson. CGRA-ME: A unified framework for CGRA modelling and exploration. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 184–189, 2017.
  7. [8] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. MLIR: Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 2–14, 2021.
  8. [9] Cristian Tirelli, Lorenzo Ferretti, and Laura Pozzi. SAT-MapIt: A SAT-based modulo scheduling mapper for coarse grain reconfigurable architectures. In 2023 Design, Automation and Test in Europe Conference & Exhibition (DATE), pages 1–6, 2023.
  9. [10] Yuxuan Wang, Cristian Tirelli, Giovanni Ansaloni, Laura Pozzi, and David Atienza. An MLIR-based compilation framework for control flow management on CGRAs. arXiv preprint arXiv:2508.02167, 2025.
  10. [11] Dhananjaya Wijerathne, Zhaoying Li, Manupa Karunaratne, Li-Shiuan Peh, and Tulika Mitra. Morpher: An open-source integrated compilation and simulation framework for CGRA. In Fifth Workshop on Open-Source EDA Technology (WOSET), 2022.
  11. [12] Yixuan Luo, Cheng Tan, Nicolas Bohm Agostini, Ang Li, Antonino Tumeo, Nirav Dave, and Tong Geng. ML-CGRA: An integrated compilation framework to enable efficient machine learning acceleration on CGRAs. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1–6, 2023.
  12. [13] B. Mei, M. Berekovic, and J.-Y. Mignolet. ADRES & DRESC: Architecture and Compiler for Coarse-Grain Reconfigurable Processors, pages 255–297. Springer Netherlands, Dordrecht, 2007.
  13. [14] Tianyi Yu, Omar Ragheb, Stephen Wicklund, and Jason Anderson. MLIR-to-CGRA: A versatile MLIR-based compiler framework for CGRAs. In 2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 184–192, 2024.
  14. [15] Intel Corporation. Intel Math Kernel Library: Performance and Accuracy for Scientific Computing, 2014.
  15. [16] NVIDIA Corporation. NVIDIA cuBLAS Library, 2024.
  16. [17] Kevin J. M. Martin. Twenty years of automated methods for mapping applications on CGRA. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 679–686, 2022.
  17. [18] Zhongyuan Zhao, Weiguang Sheng, Qin Wang, Wenzhi Yin, Pengfei Ye, Jinchao Li, and Zhigang Mao. Towards higher performance and robust compilation for CGRA modulo scheduling. IEEE Transactions on Parallel and Distributed Systems, 31(9):2201–2219, 2020.
  18. [19] Chilankamol Sunny, Satyajit Das, Kevin J. M. Martin, and Philippe Coussy. CRePe: Concurrent reverse-modulo-scheduling and placement for CGRAs. IEEE Transactions on Parallel and Distributed Systems, 35(7):1293–1306, 2024.
  19. [20] Cristian Tirelli, Rodrigo Otoni, and Laura Pozzi. Monomorphism-based CGRA mapping via space and time decoupling. In 2025 Design, Automation & Test in Europe Conference (DATE), pages 1–7, 2025.
  20. [21] Cristian Tirelli, Juan Sapriza, Rubén Rodríguez Álvarez, Lorenzo Ferretti, Benoît Denkinger, Giovanni Ansaloni, José Miranda Calero, David Atienza, and Laura Pozzi. SAT-based exact modulo scheduling mapping for resource-constrained CGRAs. J. Emerg. Technol. Comput. Syst., 20(3), August 2024.
  21. [22] Graham Gobieski, Souradip Ghosh, Marijn Heule, Todd Mowry, Tony Nowatzki, Nathan Beckmann, and Brandon Lucia. RipTide: A programmable, energy-minimal dataflow compiler and architecture. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 546–564, 2022.
  22. [23] Yuxuan Wang, Cristian Tirelli, Lara Orlandic, Juan Sapriza, Rubén Rodríguez Álvarez, Giovanni Ansaloni, Laura Pozzi, and David Atienza. An MLIR-based compilation framework for CGRA application deployment. In International Symposium on Applied Reconfigurable Computing, pages 33–50. Springer Nature Switzerland, Cham, 2025.
  23. [24] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. SIGPLAN Not., 48(6):127–138, June 2013.
  24. [25] Weichuang Zhang, Jieru Zhao, Guan Shen, Quan Chen, Chen Chen, and Minyi Guo. An optimizing framework on MLIR for efficient FPGA-based accelerator generation. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 75–90. IEEE, 2024.
  25. [26] Dajiang Liu, Ting Liu, Xingyu Mo, Jiaxing Shang, and Shouyi Yin. Polyhedral-based pipelining of imperfectly-nested loop for CGRAs. In 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pages 1–9, 2021.
  26. [27] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. Int. J. Parallel Program., 34(3):261–317, June 2006.
  27. [28] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, pages 337–340, Berlin, Heidelberg, 2008.
  28. [29] Rubén Rodríguez Álvarez, Benoît Denkinger, Juan Sapriza, José Miranda Calero, Giovanni Ansaloni, and David Atienza Alonso. An open-hardware coarse-grained reconfigurable array for edge computing. In Proceedings of the 20th ACM International Conference on Computing Frontiers, CF '23, pages 391–392, New York, NY, USA, 2023. Association for Computing Machinery.
  29. [30] Louis-Noël Pouchet et al. PolyBench: The polyhedral benchmark suite. URL: http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
  30. [31] Andrzej Maćkiewicz and Waldemar Ratajczak. Principal components analysis (PCA). Computers & Geosciences, 19(3):303–342, 1993.
  31. [32/33] Greg Welch, Gary Bishop, et al. An introduction to the Kalman filter. University of North Carolina, Chapel Hill, USA.
  32. [34] Simone Machetti, Pasquale Davide Schiavone, Lara Orlandic, Darong Huang, Deniz Kasap, Giovanni Ansaloni, and David Atienza. e-GPU: An open-source and configurable RISC-V graphic processing unit for TinyAI applications. arXiv preprint arXiv:2505.08421, 2025.
  33. [35] Pedro Palacios, Rafael Medina, Jean-Luc Rouas, Giovanni Ansaloni, and David Atienza. Systolic arrays and structured pruning co-design for efficient transformers in edge systems. In Proceedings of the Great Lakes Symposium on VLSI 2025, GLSVLSI '25, pages 320–327, New York, NY, USA, 2025. Association for Computing Machinery.
  34. [36] Pasquale Davide Schiavone, Simone Machetti, Miguel Peón-Quirós, Jose Miranda, Benoît Denkinger, Thomas Christoph Müller, Ruben Rodríguez, Saverio Nasturzio, and David Atienza Alonso. X-HEEP: An open-source, configurable and extendible RISC-V microcontroller. In Proceedings of the 20th ACM International Conference on Computing Frontiers, CF '23, 2023.
    Pasquale Davide Schiavone, Simone Machetti, Miguel Pe ´on-Quir´os, Jose Miranda, Beno ˆıt Denkinger, Thomas Christoph M ¨uller, Ruben Rodr´ıguez, Saverio Nasturzio, and David Atienza Alonso. X-heep: An open-source, configurable and extendible risc-v microcontroller. In Proceedings of the 20th ACM International Conference on Computing Frontiers, CF ’23, pa...