MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs
Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3
The pith
The MATCHA framework deploys deep neural networks on multi-accelerator edge SoCs with up to 35% lower inference latency than the MATCH compiler.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MATCHA is a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.
What carries the argument
The MATCHA framework, which combines pattern matching, tiling, mapping, and constraint programming to produce concurrent schedules across heterogeneous accelerators while optimizing memory allocation.
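As a rough illustration of why concurrent schedules across two accelerators can lower latency, the sketch below exhaustively maps independent tiles onto two units with different per-tile costs and compares the best makespan against running everything on the single faster unit. This is not MATCHA's algorithm: the unit names, tile costs, and brute-force search are hypothetical stand-ins for the constraint-programming formulation, and data dependencies and memory transfers are ignored.

```python
from itertools import product

# Hypothetical per-tile latencies (in cycles) on two heterogeneous units.
# These numbers are illustrative only and do not come from the paper.
COST = {
    "acc_a": [40, 40, 60, 60],   # cost of tiles 0..3 on accelerator A
    "acc_b": [70, 50, 50, 90],   # cost of the same tiles on accelerator B
}
UNITS = tuple(COST)


def makespan(assignment):
    """Latency when each unit runs its assigned tiles back to back
    while both units execute concurrently."""
    busy = {unit: 0 for unit in UNITS}
    for tile, unit in enumerate(assignment):
        busy[unit] += COST[unit][tile]
    return max(busy.values())


def best_assignment():
    """Brute-force search over every tile-to-unit mapping."""
    n_tiles = len(COST[UNITS[0]])
    return min(product(UNITS, repeat=n_tiles), key=makespan)


# Baseline: all tiles on whichever single unit finishes them soonest.
single_unit = min(sum(costs) for costs in COST.values())
best = best_assignment()
concurrent = makespan(best)
print(best, concurrent, single_unit)
```

A real deployment would replace the exhaustive search with a solver, since the number of mappings grows exponentially with the tile count.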
If this is right
- Edge devices with multiple accelerators can execute DNN inference with higher hardware utilization.
- Inference latency drops measurably on standard tiny-ML benchmarks when schedules exploit accelerator parallelism.
- Memory allocation at L3 and L2 levels becomes part of an automated optimization step rather than manual tuning.
- Deployment pipelines gain the ability to target heterogeneous hardware without rewriting schedules for each new SoC design.
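The L3/L2 allocation point above interacts with tiling: a layer can only run on an accelerator once the working set of each tile fits in on-chip memory. The sketch below picks the largest row tile of a convolution that fits a given L2 budget. It is a simplified model under stated assumptions (int8 data, stride 1, horizontal padding so input and output widths match, no double buffering); the function names and layer dimensions are hypothetical and not taken from MATCHA.

```python
def tile_bytes(rows, width, c_in, c_out, kernel=3):
    """Bytes needed in L2 for one tile of a stride-1 conv layer (int8):
    the input slab including halo rows, the weights, and the output slab."""
    in_rows = rows + kernel - 1
    input_bytes = in_rows * width * c_in
    weight_bytes = kernel * kernel * c_in * c_out
    output_bytes = rows * width * c_out
    return input_bytes + weight_bytes + output_bytes


def largest_tile_rows(height, width, c_in, c_out, l2_budget, kernel=3):
    """Largest number of output rows per tile whose working set fits
    the budget. Returns 0 when even a single row does not fit."""
    best = 0
    for rows in range(1, height + 1):
        if tile_bytes(rows, width, c_in, c_out, kernel) <= l2_budget:
            best = rows
    return best


# Hypothetical layer: 32x32 output, 16 input channels, 32 output channels,
# with a 32 KiB L2 budget.
rows = largest_tile_rows(32, 32, 16, 32, 32 * 1024)
print(rows, tile_bytes(rows, 32, 16, 32))
```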
Where Pith is reading between the lines
- The same scheduling approach could extend to SoCs containing three or more accelerator types if the constraint solver scales.
- MATCHA might serve as a backend that other compilers call to handle the final mapping and memory decisions after front-end graph transformations.
- Testing on real silicon rather than simulation would reveal whether the generated schedules also improve energy per inference.
Load-bearing premise
Pattern matching, tiling, mapping, and constraint programming will reliably generate concurrent schedules that keep heterogeneous accelerators busy on the target SoC.
What would settle it
Running MATCHA and the MATCH compiler on the same MLPerf Tiny models on the same two-accelerator SoC and observing no reduction, or an increase, in measured inference latency.
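The comparison described here reduces to a relative latency difference per model. A minimal sketch, assuming latencies are already measured in milliseconds; the model names and numbers are placeholders, not data from the paper.

```python
def latency_reduction(baseline_ms, candidate_ms):
    """Fractional latency reduction of the candidate relative to the baseline.
    Positive means the candidate is faster; zero or negative would count
    against the claim."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_ms - candidate_ms) / baseline_ms


# Hypothetical per-model latencies in milliseconds (placeholders, not paper data).
measurements = {
    "model_a": {"match": 10.0, "matcha": 6.5},
    "model_b": {"match": 4.0, "matcha": 4.0},
}

for name, lat in measurements.items():
    r = latency_reduction(lat["match"], lat["matcha"])
    verdict = "supports" if r > 0 else "contradicts"
    print(f"{name}: {r:+.1%} ({verdict} the claim)")
```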
Original abstract
Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MATCHA, a unified DNN deployment framework for System-on-Chips with multiple heterogeneous acceleration engines. It generates highly concurrent schedules for parallel heterogeneous accelerators by combining pattern matching, tiling, and mapping across individual hardware units with constraint programming to optimize L3/L2 memory allocation and scheduling. On the MLPerf Tiny benchmark using a SoC with two heterogeneous accelerators, MATCHA is reported to improve accelerator utilization and reduce inference latency by up to 35% relative to the state-of-the-art MATCH compiler.
Significance. If the empirical results hold under detailed scrutiny, the work offers a practical advance in exploiting hardware heterogeneity for edge DNN inference. The integration of established compiler passes with constraint programming for concurrent scheduling provides a concrete, testable improvement over prior compilers like MATCH, with direct relevance to MLPerf Tiny workloads on resource-constrained devices.
major comments (1)
- The central performance claim (up to 35% latency reduction) is presented without accompanying details on experimental setup, exact MATCH baselines, error bars, or ablation studies isolating the contributions of pattern matching/tiling versus constraint programming; this makes it difficult to evaluate the weakest assumption, namely that these techniques reliably yield high utilization.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the practical relevance of MATCHA for heterogeneous edge SoCs. We address the major comment below with a commitment to strengthen the evaluation.
Referee: The central performance claim (up to 35% latency reduction) is presented without accompanying details on experimental setup, exact MATCH baselines, error bars, or ablation studies isolating the contributions of pattern matching/tiling versus constraint programming; this makes it difficult to evaluate the weakest assumption, namely that these techniques reliably yield high utilization.
Authors: We agree that the current presentation of results would benefit from greater detail to allow readers to fully assess the claims. In the revised manuscript we will expand the evaluation section with: (1) a complete description of the experimental setup, including the precise SoC configuration, accelerator specifications, software versions, and latency measurement methodology; (2) the exact MATCH compiler configurations, flags, and versions used as baseline; (3) error bars computed from repeated inference runs to quantify variability; and (4) ablation experiments that separately measure the contributions of the pattern-matching/tiling pass versus the constraint-programming memory and scheduling optimizer. These additions will directly address concerns about the reliability of the reported utilization improvements and the 35% latency reduction.
Revision: yes
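Item (3) of the response, error bars from repeated inference runs, could be computed along the following lines. This is a sketch using Python's standard `statistics` module; the run values are hypothetical placeholders rather than reported results, and the sample standard deviation is one reasonable choice of error bar among several.

```python
import statistics


def summarize(latencies_ms):
    """Mean and sample standard deviation of repeated latency measurements."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)


# Hypothetical repeated runs in milliseconds (placeholders, not reported results).
runs = {
    "MATCH": [10.1, 9.9, 10.0, 10.2, 9.8],
    "MATCHA": [6.6, 6.4, 6.5, 6.5, 6.5],
}

for name, values in runs.items():
    mean, std = summarize(values)
    print(f"{name}: {mean:.2f} ms +/- {std:.2f} ms over {len(values)} runs")
```

Reporting the spread alongside the mean makes it possible to judge whether a latency gap between two compilers exceeds run-to-run noise.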
Circularity Check
No significant circularity
The paper introduces MATCHA, a DNN deployment framework that applies pattern matching, tiling, mapping, and constraint programming to generate concurrent schedules on heterogeneous accelerators. Its central claim is an empirical result: up to 35% lower inference latency and higher utilization on MLPerf Tiny versus the prior MATCH compiler. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions. The approach relies on standard compiler passes whose outputs are directly measured against an external baseline, rendering the performance claims self-contained and falsifiable outside any internal construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Constraint programming can efficiently solve the joint L3/L2 memory allocation and scheduling problem for DNNs on heterogeneous accelerators.
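To make the memory-allocation half of this assumption concrete, the sketch below treats it as a small constraint-satisfaction problem: buffers that are live at the same schedule step must occupy disjoint address ranges inside a fixed-capacity memory. The `Buffer` type, the backtracking search, and the example sizes are hypothetical illustrations; the paper's actual constraint model and solver are not described in this summary.

```python
from typing import NamedTuple, Optional


class Buffer(NamedTuple):
    name: str
    size: int    # bytes
    start: int   # first schedule step at which the buffer is live
    end: int     # last schedule step at which the buffer is live (inclusive)


def lifetimes_overlap(a: Buffer, b: Buffer) -> bool:
    return a.start <= b.end and b.start <= a.end


def allocate(buffers, capacity, step=1) -> Optional[dict]:
    """Backtracking search for byte offsets such that buffers live at the
    same time occupy disjoint address ranges within `capacity`.
    Returns {name: offset}, or None if no placement exists on the offset grid."""
    placed = {}

    def conflicts(buf, offset):
        for other, other_off in placed.values():
            if lifetimes_overlap(buf, other):
                if offset < other_off + other.size and other_off < offset + buf.size:
                    return True
        return False

    def search(i):
        if i == len(buffers):
            return True
        buf = buffers[i]
        for offset in range(0, capacity - buf.size + 1, step):
            if not conflicts(buf, offset):
                placed[buf.name] = (buf, offset)
                if search(i + 1):
                    return True
                del placed[buf.name]
        return False

    if search(0):
        return {name: off for name, (_, off) in placed.items()}
    return None


# Three 4-byte buffers with staggered lifetimes: 12 bytes in total, yet they
# fit in 8 bytes because "a" and "c" are never live at the same time.
bufs = [Buffer("a", 4, 0, 1), Buffer("b", 4, 1, 2), Buffer("c", 4, 2, 3)]
print(allocate(bufs, capacity=8))
print(allocate(bufs, capacity=4))
```

Whether such a search stays tractable for full networks and a joint scheduling objective is exactly what the assumption takes for granted.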