MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

arXiv: 2604.09124 · v1 · submitted 2026-04-10 · 💻 cs.DC · cs.AR · cs.LG

Enrico Russo, Mohamed Amine Hamdi, Alessandro Ottaviano, Francesco Conti, Angelo Garofalo, Daniele Jahier Pagliari, Maurizio Palesi, Luca Benini, Alessio Burrello

Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3

classification 💻 cs.DC · cs.AR · cs.LG

keywords DNN deployment · heterogeneous accelerators · edge SoCs · constraint programming · MLPerf Tiny · inference scheduling · accelerator utilization

The pith

The MATCHA framework deploys deep neural networks on multi-accelerator edge SoCs with up to 35 percent lower inference latency than the prior MATCH compiler.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MATCHA as a deployment system that creates execution schedules in which different accelerators on a single chip run neural network layers at the same time. Most existing tools leave some accelerators idle during inference because they cannot coordinate heterogeneous hardware units or manage shared memory levels effectively. MATCHA applies pattern matching to break models into pieces that fit each accelerator, tiles operations to expose parallelism, maps pieces to specific hardware units, and solves a constraint problem to assign memory buffers at the L3 and L2 levels. On the MLPerf Tiny benchmark, running on a chip with two heterogeneous accelerators, the resulting schedules raise accelerator utilization and shorten end-to-end latency by as much as 35 percent versus the prior MATCH compiler.
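To make the idea of a concurrent schedule concrete, here is a minimal sketch in Python. It is not MATCHA's algorithm: it swaps in a plain greedy list scheduler, and the accelerator names (`acc_a`, `acc_b`), the task names, and all cycle counts are hypothetical values chosen for illustration. It compares running tiles one at a time against letting two accelerators work in parallel, and reports the resulting utilization.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cost: dict                      # accelerator name -> cycles (hypothetical numbers)
    deps: list = field(default_factory=list)


def sequential_latency(tasks):
    """One task at a time, each on its fastest supported accelerator."""
    return sum(min(t.cost.values()) for t in tasks)


def concurrent_schedule(tasks, accelerators):
    """Greedy list scheduling; tasks must be listed in topological order."""
    free_at = {a: 0 for a in accelerators}   # when each accelerator becomes idle
    finish = {}                              # task name -> finish time
    plan = []
    for t in tasks:
        ready = max((finish[d] for d in t.deps), default=0)
        best = None
        for acc, cycles in t.cost.items():
            start = max(ready, free_at[acc])
            end = start + cycles
            if best is None or end < best[2]:
                best = (acc, start, end)
        acc, start, end = best
        free_at[acc] = end
        finish[t.name] = end
        plan.append((t.name, acc, start, end))
    return plan, max(finish.values())


def utilization(plan, accelerators, makespan):
    """Fraction of the makespan during which each accelerator is busy."""
    busy = {a: 0 for a in accelerators}
    for _, acc, start, end in plan:
        busy[acc] += end - start
    return {a: busy[a] / makespan for a in accelerators}


ACCELERATORS = ["acc_a", "acc_b"]
TASKS = [
    Task("conv_tile0", {"acc_a": 40, "acc_b": 60}),
    Task("conv_tile1", {"acc_a": 40, "acc_b": 60}),
    Task("add", {"acc_b": 20}, deps=["conv_tile0", "conv_tile1"]),
]

plan, makespan = concurrent_schedule(TASKS, ACCELERATORS)
print(sequential_latency(TASKS))                  # 100
print(makespan)                                   # 80
print(utilization(plan, ACCELERATORS, makespan))  # {'acc_a': 0.5, 'acc_b': 1.0}
```

In this toy case the second tile runs on the slower accelerator while the first occupies the faster one, so the makespan drops from 100 to 80 cycles. A real deployment flow would also have to account for data movement and memory capacity, which this sketch ignores.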

Core claim

MATCHA is a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.

What carries the argument

MATCHA framework combining pattern matching, tiling, mapping, and constraint programming to produce concurrent schedules across heterogeneous accelerators while optimizing memory allocation.

If this is right

Edge devices with multiple accelerators can execute DNN inference with higher hardware utilization.
Inference latency drops measurably on standard tiny-ML benchmarks when schedules exploit accelerator parallelism.
Memory allocation at L3 and L2 levels becomes part of an automated optimization step rather than manual tuning.
Deployment pipelines gain the ability to target heterogeneous hardware without rewriting schedules for each new SoC design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scheduling approach could extend to SoCs containing three or more accelerator types if the constraint solver scales.
MATCHA might serve as a backend that other compilers call to handle the final mapping and memory decisions after front-end graph transformations.
Testing on real silicon rather than simulation would reveal whether the generated schedules also improve energy per inference.

Load-bearing premise

Pattern matching, tiling, mapping, and constraint programming will reliably generate concurrent schedules that keep heterogeneous accelerators busy on the target SoC.

What would settle it

Running MATCHA and the MATCH compiler on the same MLPerf Tiny models on the two-accelerator SoC and observing no reduction, or an increase, in measured inference latency.
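The comparison described above reduces to a simple relative-latency calculation. The sketch below shows one way to express it; the function name, the model labels, and the cycle counts are all hypothetical and are not measurements from the paper.

```python
def latency_reduction(baseline_cycles, candidate_cycles):
    """Fractional latency reduction versus the baseline (positive means faster)."""
    if baseline_cycles <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_cycles - candidate_cycles) / baseline_cycles


# Hypothetical (baseline, candidate) cycle counts, not results from the paper.
measurements = {
    "model_a": (1_000_000, 650_000),
    "model_b": (400_000, 400_000),
    "model_c": (250_000, 275_000),
}

verdicts = {}
for name, (baseline, candidate) in measurements.items():
    reduction = latency_reduction(baseline, candidate)
    verdicts[name] = "supports" if reduction > 0 else "contradicts"
    print(f"{name}: {reduction:+.1%} -> {verdicts[name]} the claim")
```

Under this reading, only a strictly positive reduction counts in favor of the claim; a zero or negative value on the same models and hardware would count against it.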

Figures

Figures reproduced from arXiv: 2604.09124 by Alessandro Ottaviano, Alessio Burrello, Angelo Garofalo, Daniele Jahier Pagliari, Enrico Russo, Francesco Conti, Luca Benini, Maurizio Palesi, Mohamed Amine Hamdi.

**Figure 3.** MATCHA’s tile-centric pattern matching and tiling

**Figure 4.** Scheduling and memory plan example with differ…

**Figure 5.** Use case Carfield HSoC considered for evaluation.

**Figure 6.** ResNet inference profiling timeline (left) and the…

**Figure 7.** FLOPS comparison for DNN benchmark blocks.

Original abstract

Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MATCHA extends MATCH with constraint programming for concurrent scheduling and memory allocation on heterogeneous edge accelerators, claiming up to 35% latency cuts on MLPerf Tiny.

MATCHA takes the existing MATCH compiler and layers on constraint programming to decide L3/L2 allocations and produce concurrent schedules across multiple heterogeneous accelerators in an edge SoC. Pattern matching, tiling, and per-unit mapping are used to keep the accelerators busy in parallel rather than running them one at a time. The headline result is a measured reduction in inference latency of up to 35% on MLPerf Tiny workloads versus the prior MATCH baseline on a two-accelerator SoC, along with higher reported utilization.

Referee Report

1 major / 0 minor

Summary. The manuscript presents MATCHA, a unified DNN deployment framework for System-on-Chips with multiple heterogeneous acceleration engines. It generates highly concurrent schedules for parallel heterogeneous accelerators by combining pattern matching, tiling, and mapping across individual hardware units with constraint programming to optimize L3/L2 memory allocation and scheduling. On the MLPerf Tiny benchmark using a SoC with two heterogeneous accelerators, MATCHA is reported to improve accelerator utilization and reduce inference latency by up to 35% relative to the state-of-the-art MATCH compiler.

Significance. If the empirical results hold under detailed scrutiny, the work offers a practical advance in exploiting hardware heterogeneity for edge DNN inference. The integration of established compiler passes with constraint programming for concurrent scheduling provides a concrete, testable improvement over prior compilers like MATCH, with direct relevance to MLPerf Tiny workloads on resource-constrained devices.

major comments (1)

The central performance claim (up to 35% latency reduction) is presented without accompanying details on the experimental setup, the exact MATCH baselines, error bars, or ablation studies isolating the contributions of pattern matching/tiling versus constraint programming; this makes it hard to evaluate the paper's weakest assumption, namely that these techniques reliably yield high utilization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical relevance of MATCHA for heterogeneous edge SoCs. We address the major comment below with a commitment to strengthen the evaluation.

Referee: The central performance claim (up to 35% latency reduction) is presented without accompanying details on the experimental setup, the exact MATCH baselines, error bars, or ablation studies isolating the contributions of pattern matching/tiling versus constraint programming; this makes it hard to evaluate the paper's weakest assumption, namely that these techniques reliably yield high utilization.

Authors: We agree that the current presentation of results would benefit from greater detail to allow readers to fully assess the claims. In the revised manuscript we will expand the evaluation section with: (1) a complete description of the experimental setup, including the precise SoC configuration, accelerator specifications, software versions, and latency measurement methodology; (2) the exact MATCH compiler configurations, flags, and versions used as baseline; (3) error bars computed from repeated inference runs to quantify variability; and (4) ablation experiments that separately measure the contributions of the pattern-matching/tiling pass versus the constraint-programming memory and scheduling optimizer. These additions will directly address concerns about the reliability of the reported utilization improvements and the 35% latency reduction.

revision: yes
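Item (3) of the response, error bars from repeated runs, could look roughly like the sketch below. It uses Python's standard `statistics` module; the run lists are hypothetical cycle counts invented for illustration, and the one-standard-deviation separation check is a crude assumption of this sketch, not a method the authors describe.

```python
import statistics


def summarize(samples):
    """Mean and sample standard deviation (n - 1 denominator) of repeated runs."""
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical latencies (in cycles) from five repeated inference runs each.
baseline_runs = [1000, 1010, 990, 1005, 995]
candidate_runs = [700, 690, 710, 705, 695]

base_mean, base_std = summarize(baseline_runs)
cand_mean, cand_std = summarize(candidate_runs)

# Crude check: do the one-standard-deviation error bars stay apart?
clearly_separated = (base_mean - base_std) > (cand_mean + cand_std)

print(f"baseline : {base_mean:.1f} +/- {base_std:.2f} cycles")
print(f"candidate: {cand_mean:.1f} +/- {cand_std:.2f} cycles")
print(f"error bars separated: {clearly_separated}")
```

Reporting the spread alongside the mean lets a reader judge whether a claimed reduction is larger than run-to-run variability. A formal significance test would be a stronger choice than the overlap check used here.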

Circularity Check

0 steps flagged

No significant circularity

The paper introduces MATCHA, a DNN deployment framework that applies pattern matching, tiling, mapping, and constraint programming to generate concurrent schedules on heterogeneous accelerators. Its central claim is an empirical result: up to 35% lower inference latency and higher utilization on MLPerf Tiny versus the prior MATCH compiler. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions. The approach relies on standard compiler passes whose outputs are directly measured against an external baseline, rendering the performance claims self-contained and falsifiable outside any internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of constraint programming for memory and scheduling decisions plus pattern matching for hardware mapping; these are standard techniques whose combination is presented as novel but whose success is assumed rather than derived from first principles.

axioms (1)

domain assumption: Constraint programming can efficiently solve the joint L3/L2 memory allocation and scheduling problem for DNNs on heterogeneous accelerators.
Invoked to generate the optimized concurrent schedules described in the abstract.
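As an illustration of the kind of problem this axiom refers to, the sketch below states a toy memory-allocation problem as constraints: buffers that are live at the same time must not overlap in address space, and every buffer must fit within the memory capacity. It is not the paper's formulation and does not use a constraint-programming solver; it substitutes a small backtracking search that only tries offsets at address 0 or directly after an already placed buffer, so it can miss solutions a real solver would find. The buffer names, sizes, lifetimes, and capacity are hypothetical.

```python
from typing import NamedTuple


class Buffer(NamedTuple):
    name: str
    size: int    # bytes
    start: int   # first time step at which the buffer is live
    end: int     # first time step at which it is no longer live


def lifetimes_overlap(a, b):
    return a.start < b.end and b.start < a.end


def ranges_overlap(off_a, size_a, off_b, size_b):
    return off_a < off_b + size_b and off_b < off_a + size_a


def allocate(buffers, capacity):
    """Return {name: offset} satisfying the constraints, or None if none is found."""
    placed = {}  # name -> (buffer, offset)

    def conflicts(buf, offset):
        return any(
            lifetimes_overlap(buf, other)
            and ranges_overlap(offset, buf.size, off, other.size)
            for other, off in placed.values()
        )

    def solve(i):
        if i == len(buffers):
            return True
        buf = buffers[i]
        # Candidate offsets: address 0 or just past any already placed buffer.
        candidates = sorted({0} | {off + other.size for other, off in placed.values()})
        for offset in candidates:
            if offset + buf.size <= capacity and not conflicts(buf, offset):
                placed[buf.name] = (buf, offset)
                if solve(i + 1):
                    return True
                del placed[buf.name]
        return False

    if solve(0):
        return {name: off for name, (_, off) in placed.items()}
    return None


BUFFERS = [
    Buffer("act0", 64, 0, 2),
    Buffer("act1", 32, 1, 3),
    Buffer("act2", 64, 2, 4),
]

print(allocate(BUFFERS, capacity=96))  # {'act0': 0, 'act1': 64, 'act2': 0}
print(allocate(BUFFERS, capacity=95))  # None
```

The three buffers total 160 bytes but fit in 96 because `act2` reuses the space freed by `act0`. Whether such problems remain tractable when allocation is solved jointly with scheduling on real networks is exactly what the axiom assumes.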

pith-pipeline@v0.9.0 · 5442 in / 1236 out tokens · 52189 ms · 2026-05-10T17:00:13.914008+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1] Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, et al.

[2] MLPerf Tiny Benchmark. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021).

[3] Michael Bayer. n.d. Mako Templates for Python. https://www.makotemplates.org/

[4] Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv. 52, 4, Article 65 (Aug. 2019), 43 pages. doi:10.1145/3320060

[5] Halima Bouzidi, Mohanad Odema, Hamza Ouarnoughi, Smail Niar, and Mohammad Abdullah Al Faruque. 2023. Map-and-conquer: Energy-efficient mapping of dynamic neural nets onto heterogeneous mpsocs. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.

[6] Alessio Burrello, Angelo Garofalo, Nazareno Bruschi, Giuseppe Tagliavini, Davide Rossi, and Francesco Conti. 2021. DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs. IEEE Trans. Comput. 70, 8 (2021), 1253–1268.

[7] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.

[8] Ismet Dagli, Alexander Cieslewicz, Jedidiah McClurg, and Mehmet E Belviranli.

[9] Axonn: Energy-aware execution of neural network inference on multi-accelerator heterogeneous socs. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 1069–1074.

[10] Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Tiezhen Wang, et al. 2021. Tensorflow lite micro: Embedded machine learning for tinyml systems. Proceedings of machine learning and systems 3 (2021), 800–811.

[11] Angelo Garofalo, Alessandro Ottaviano, Matteo Perotti, Thomas Benz, Yvan Tortorella, Robert Balas, Michael Rogenmoser, Chi Zhang, Luca Bertaccini, Nils Wistoff, Maicol Ciani, Cyril Koenig, Mattia Sinigaglia, Luca Valente, Paul Scheffler, Manuel Eggimann, Matheus Cavalcante, Francesco Restuccia, Alessandro Biondi, Francesco Conti, Frank K. Gurkaynak, David... doi:10.1109/tcsii.2025.3591225

[12] Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2021. Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via ... doi:10.1109/dac18074.2021.9586216

[13] Sukhpal Singh Gill, Muhammed Golec, Jianmin Hu, Minxian Xu, Junhui Du, Huaming Wu, Guneet Kaur Walia, Subramaniam Subramanian Murugesan, Babar Ali, Mohit Kumar, Kejiang Ye, Prabal Verma, Surendra Kumar, Felix Cuadrado, and Steve Uhlig. 2024. Edge AI: A Taxonomy, Systematic Review and Future Directions. Cluster Computing 28, 1 (Oct. 2024), 18. doi:10.1007/s1...

[14] Mohamed Amine Hamdi, Francesco Daghero, Giuseppe Maria Sarda, Josse Van Delm, Arne Symons, Luca Benini, Marian Verhelst, Daniele Jahier Pagliari, and Alessio Burrello. 2025. MATCH: Model-Aware TVM-Based Compilation for Heterogeneous Edge Devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2025), 1–1. doi:10.1109/TCAD.2025.3556967

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.

[16] Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, and Yakun Sophia Shao. 2021. Cosa: Scheduling by constrained optimization for spatial accelerators. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 554–566.

[17] Sheng-Chun Kao and Tushar Krishna. 2020. Gamma: Automating the hw mapping of dnn models on accelerators via genetic algorithm. In Proceedings of the 39th International Conference on Computer-Aided Design. 1–9.

[18] Mukul Lokhande, Gopal Raut, and Santosh Kumar Vishvakarma. 2025. Flex-PE: Flexible and SIMD Multiprecision Processing Element for AI Workloads. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 33, 6 (2025), 1610–1623. doi:10.1109/TVLSI.2025.3553069

[19] Martin Maas, Ulysse Beaugnon, Arun Chauhan, and Berkin Ilbeyi. 2022. Telamalloc: Efficient on-chip memory allocation for production machine learning accelerators. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 123–137.

[20] Linyan Mei, Pouya Houshmand, Vikram Jain, Sebastian Giraldo, and Marian Verhelst. 2021. ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators. IEEE Trans. Comput. 70, 8 (2021), 1160–1174.

[21] Alessandro Ottaviano, Thomas Benz, Paul Scheffler, and Luca Benini. 2023. Cheshire: A lightweight, linux-capable risc-v host platform for domain-specific accelerator plug-in. IEEE Transactions on Circuits and Systems II: Express Briefs 70, 10 (2023), 3777–3781.

[22] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W Keckler, and Joel Emer. 2019. Timeloop: A systematic approach to dnn accelerator evaluation. In 2019 IEEE international symposium on performance analysis of systems and software (ISPASS). IEEE, 304–315.

[23] Jeman Park, Misun Yu, Jinse Kwon, Junmo Park, Jemin Lee, and Yongin Kwon.

[24] NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators. ETRI Journal 46, 5 (2024), 851–864.

[25] Matteo Perotti, Samuel Riedel, Matheus Cavalcante, and Luca Benini. 2025. Spatz: Clustering compact RISC-V-based vector units to maximize computing efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2025).

[26] Michael Rogenmoser, Yvan Tortorella, Davide Rossi, Francesco Conti, and Luca Benini. 2025. Hybrid modular redundancy: Exploring modular redundancy approaches in RISC-V multi-core computing clusters for reliable processing in space. ACM Transactions on Cyber-Physical Systems 9, 1 (2025), 1–29.

[27] Enrico Russo, Maurizio Palesi, Giuseppe Ascia, Davide Patti, Salvatore Monteleone, and Vincenzo Catania. 2023. Memory-aware DNN algorithm-hardware mapping via integer linear programming. In Proceedings of the 20th ACM International Conference on Computing Frontiers. 134–143.

[28] Moritz Scherer, Luka Macan, Victor JB Jung, Philip Wiese, Luca Bompani, Alessio Burrello, Francesco Conti, and Luca Benini. 2024. Deeploy: Enabling energy-efficient deployment of small language models on heterogeneous microcontrollers. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 11 (2024), 4009–4020.

[29] Arne Symons, Linyan Mei, and Marian Verhelst. 2021. Loma: Fast auto-scheduling on dnn accelerators through loop-order-based memory allocation. In 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 1–4.

[30] Kodai Ueyoshi, Ioannis A. Papistas, Pouya Houshmand, Giuseppe M. Sarda, Vikram Jain, Man Shi, Qilin Zheng, Sebastian Giraldo, Peter Vrancx, Jonas Doevenspeck, Debjyoti Bhattacharjee, Stefan Cosemans, Arindam Mallik, Peter Debacker, Diederik Verkest, and Marian Verhelst. 2022. DIANA: An End-to-End Energy-Efficient Digital and ANAlog Hybrid Neural Networ... doi:10.1109/isscc42614.2022.9731716

[31] Josse Van Delm, Maarten Vandersteegen, Alessio Burrello, Giuseppe Maria Sarda, Francesco Conti, Daniele Jahier Pagliari, Luca Benini, and Marian Verhelst. 2023. HTVM: Efficient neural network deployment on heterogeneous TinyML platforms. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.

[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

[33] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1492–1500.

[34] Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. 2022. Bolt: Bridging the gap between auto-tuners and hardware-native performance. Proceedings of Machine Learning and Systems 4 (2022), 204–216.

[35] Xuan Yang, Mingyu Gao, Qiaoyi Liu, Jeff Setter, Jing Pu, Ankita Nayak, Steven Bell, Kaidi Cao, Heonjae Ha, Priyanka Raina, et al. 2020. Interstellar: Using halide’s scheduling language to analyze dnn accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 369–383.

[36] Size Zheng, Siyuan Chen, and Yun Liang. 2023. Memory and computation coordinated mapping of dnns onto complex heterogeneous soc. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
