
arXiv: 2604.09124 · v1 · submitted 2026-04-10 · 💻 cs.DC · cs.AR · cs.LG

MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3

classification 💻 cs.DC · cs.AR · cs.LG
keywords DNN deployment · heterogeneous accelerators · edge SoCs · constraint programming · MLPerf Tiny · inference scheduling · accelerator utilization

The pith

The MATCHA framework deploys deep neural networks on multi-accelerator edge SoCs with up to 35 percent lower inference latency than the MATCH compiler.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MATCHA as a deployment system that creates execution schedules allowing different accelerators on a single chip to run neural network layers at the same time. Most existing tools leave some accelerators idle during inference because they cannot coordinate heterogeneous hardware units or manage shared memory levels effectively. MATCHA applies pattern matching to break models into pieces that fit each accelerator, tiles operations for better parallelism, maps pieces to specific hardware units, and solves a constraint-programming problem that covers scheduling and buffer allocation in L3 and L2 memory. On the MLPerf Tiny benchmark running on a chip with two heterogeneous accelerators, the resulting schedules raise accelerator utilization and shorten inference latency by as much as 35 percent versus the prior MATCH compiler.
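
The idea of keeping both accelerators busy can be illustrated with a toy scheduler. The sketch below is not MATCHA's algorithm: it swaps the paper's constraint-programming formulation for a simple greedy list scheduler, and the `Tile` type, the accelerator names `acc0` and `acc1`, and all cycle counts are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    name: str
    deps: tuple   # names of tiles that must finish before this one starts
    cost: dict    # hypothetical cycles per accelerator; a missing key means unsupported


def greedy_schedule(tiles):
    """Place each tile (given in topological order) on the unit that finishes it earliest."""
    free_at = {}   # accelerator -> cycle at which it becomes idle
    finish = {}    # tile name -> finish cycle
    plan = []
    for tile in tiles:
        # A tile is ready once all of its producers have finished.
        ready = max((finish[d] for d in tile.deps), default=0)
        best = None
        for acc, cycles in tile.cost.items():
            start = max(ready, free_at.get(acc, 0))
            end = start + cycles
            if best is None or end < best[2]:
                best = (acc, start, end)
        acc, start, end = best
        free_at[acc] = end
        finish[tile.name] = end
        plan.append((tile.name, acc, start, end))
    return plan, max(finish.values())


# Illustrative workload: two independent convolution tiles feeding an add.
tiles = [
    Tile("conv_a", (), {"acc0": 4, "acc1": 6}),
    Tile("conv_b", (), {"acc0": 4, "acc1": 6}),
    Tile("add", ("conv_a", "conv_b"), {"acc1": 2}),
]
plan, makespan = greedy_schedule(tiles)

# Reference point: every tile runs back to back on its fastest unit.
serial = sum(min(t.cost.values()) for t in tiles)

for name, acc, start, end in plan:
    print(f"{name}: {acc} cycles {start}-{end}")
print("concurrent makespan:", makespan, "serial:", serial)
```

With these made-up costs the two convolutions run side by side and the plan finishes in 8 cycles, versus 10 if every tile ran back to back on its fastest unit. A greedy pass like this ignores memory capacity entirely, which is the part the paper hands to a constraint solver.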

Core claim

MATCHA is a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.

What carries the argument

The MATCHA framework combines pattern matching, tiling, mapping, and constraint programming to produce concurrent schedules across heterogeneous accelerators while optimizing memory allocation.

If this is right

  • Edge devices with multiple accelerators can execute DNN inference with higher hardware utilization.
  • Inference latency drops measurably on standard tiny-ML benchmarks when schedules exploit accelerator parallelism.
  • Memory allocation at L3 and L2 levels becomes part of an automated optimization step rather than manual tuning.
  • Deployment pipelines gain the ability to target heterogeneous hardware without rewriting schedules for each new SoC design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scheduling approach could extend to SoCs containing three or more accelerator types if the constraint solver scales.
  • MATCHA might serve as a backend that other compilers call to handle the final mapping and memory decisions after front-end graph transformations.
  • Testing on real silicon rather than simulation would reveal whether the generated schedules also improve energy per inference.

Load-bearing premise

Pattern matching, tiling, mapping, and constraint programming will reliably generate concurrent schedules that keep heterogeneous accelerators busy on the target SoC.

What would settle it

Running MATCHA and the MATCH compiler on the same MLPerf Tiny models on the two-accelerator SoC and finding no reduction, or an increase, in inference latency.
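
The check described above reduces to a relative latency comparison per model. A minimal sketch, assuming both toolchains report latency in the same unit; the function names and the sample numbers are placeholders, not measurements from the paper.

```python
def latency_reduction_percent(baseline, candidate):
    """Relative latency reduction of `candidate` versus `baseline`, in percent.

    Positive values mean the candidate is faster; negative values mean a regression.
    """
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return 100.0 * (baseline - candidate) / baseline


def claim_refuted(pairs):
    """True when no (baseline, candidate) pair shows a positive latency reduction."""
    return all(latency_reduction_percent(b, c) <= 0 for b, c in pairs)


# Placeholder latencies (arbitrary units), one (baseline, candidate) pair per model.
faster = latency_reduction_percent(200, 150)   # candidate is faster
slower = latency_reduction_percent(200, 220)   # candidate is slower
print(faster, slower)
print(claim_refuted([(200, 220), (100, 100)]))
```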

Figures

Figures reproduced from arXiv: 2604.09124 by Alessandro Ottaviano, Alessio Burrello, Angelo Garofalo, Daniele Jahier Pagliari, Enrico Russo, Francesco Conti, Luca Benini, Maurizio Palesi, Mohamed Amine Hamdi.

Figure 1: MATCHA deployment framework: inputs (left) and …
Figure 3: MATCHA’s tile-centric pattern matching and tiling …
Figure 4: Scheduling and memory plan example with differ…
Figure 5: Use case Carfield HSoC considered for evaluation.
Figure 6: ResNet inference profiling timeline (left) and the …
Figure 7: FLOPS comparison for DNN benchmark blocks.
Original abstract

Deploying DNNs on System-on-Chips (SoCs) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents MATCHA, a unified DNN deployment framework for System-on-Chips with multiple heterogeneous acceleration engines. It generates highly concurrent schedules for parallel heterogeneous accelerators by combining pattern matching, tiling, and mapping across individual hardware units with constraint programming to optimize L3/L2 memory allocation and scheduling. On the MLPerf Tiny benchmark using a SoC with two heterogeneous accelerators, MATCHA is reported to improve accelerator utilization and reduce inference latency by up to 35% relative to the state-of-the-art MATCH compiler.

Significance. If the empirical results hold under detailed scrutiny, the work offers a practical advance in exploiting hardware heterogeneity for edge DNN inference. The integration of established compiler passes with constraint programming for concurrent scheduling provides a concrete, testable improvement over prior compilers like MATCH, with direct relevance to MLPerf Tiny workloads on resource-constrained devices.

major comments (1)
  1. The central performance claim (up to 35% latency reduction) is presented without accompanying details on experimental setup, exact MATCH baselines, error bars, or ablation studies isolating the contributions of pattern matching/tiling versus constraint programming. This makes it hard to evaluate the weakest assumption, namely that these techniques reliably yield high utilization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical relevance of MATCHA for heterogeneous edge SoCs. We address the major comment below with a commitment to strengthen the evaluation.

Point-by-point responses
  1. Referee: The central performance claim (up to 35% latency reduction) is presented without accompanying details on experimental setup, exact MATCH baselines, error bars, or ablation studies isolating the contributions of pattern matching/tiling versus constraint programming. This makes it hard to evaluate the weakest assumption, namely that these techniques reliably yield high utilization.

    Authors: We agree that the current presentation of results would benefit from greater detail to allow readers to fully assess the claims. In the revised manuscript we will expand the evaluation section with: (1) a complete description of the experimental setup, including the precise SoC configuration, accelerator specifications, software versions, and latency measurement methodology; (2) the exact MATCH compiler configurations, flags, and versions used as baseline; (3) error bars computed from repeated inference runs to quantify variability; and (4) ablation experiments that separately measure the contributions of the pattern-matching/tiling pass versus the constraint-programming memory and scheduling optimizer. These additions will directly address concerns about the reliability of the reported utilization improvements and the 35% latency reduction.

    revision: yes
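
Error bars from repeated inference runs, as promised in the response, are usually reported as a mean with a sample standard deviation. A minimal sketch using Python's standard `statistics` module; the `summarize` helper and the latency values are hypothetical and do not come from the paper.

```python
import statistics


def summarize(latencies_ms):
    """Return (mean, sample standard deviation) for repeated latency measurements."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)


# Placeholder measurements from three hypothetical runs of the same model.
mean_ms, std_ms = summarize([10.0, 12.0, 14.0])
print(f"{mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```

Reporting the same pair for the baseline and for each ablated configuration makes it possible to tell whether a latency gap is larger than run-to-run noise.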

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper introduces MATCHA, a DNN deployment framework that applies pattern matching, tiling, mapping, and constraint programming to generate concurrent schedules on heterogeneous accelerators. Its central claim is an empirical result: up to 35% lower inference latency and higher utilization on MLPerf Tiny versus the prior MATCH compiler. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions. The approach relies on standard compiler passes whose outputs are directly measured against an external baseline, rendering the performance claims self-contained and falsifiable outside any internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of constraint programming for memory and scheduling decisions plus pattern matching for hardware mapping; these are standard techniques whose combination is presented as novel but whose success is assumed rather than derived from first principles.

axioms (1)
  • Domain assumption: Constraint programming can efficiently solve the joint L3/L2 memory allocation and scheduling problem for DNNs on heterogeneous accelerators.
    Invoked to generate the optimized concurrent schedules described in the abstract.
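
To make the memory-allocation half of this assumption concrete, the sketch below states it as a small constraint-satisfaction problem: buffers whose lifetimes overlap must occupy disjoint address ranges inside a fixed-capacity memory. It is an illustration only. The paper's actual formulation and solver are not described here, so an exhaustive backtracking search stands in for a real constraint-programming solver, and the `Buffer` type, the buffer names, sizes, lifetimes, and capacity are all hypothetical.

```python
from typing import NamedTuple, Optional


class Buffer(NamedTuple):
    name: str
    size: int    # size in abstract memory blocks
    first: int   # first schedule step at which the buffer is live
    last: int    # last schedule step at which the buffer is live (inclusive)


def lifetimes_overlap(a, b):
    return a.first <= b.last and b.first <= a.last


def allocate(buffers, capacity) -> Optional[dict]:
    """Return {name: offset} satisfying all constraints, or None if none exists."""
    placed = {}  # name -> (offset, Buffer)

    def fits(buf, offset):
        # Constraint: buffers that are live at the same time must not share addresses.
        for off, other in placed.values():
            if (lifetimes_overlap(buf, other)
                    and offset < off + other.size
                    and off < offset + buf.size):
                return False
        return True

    def search(i):
        if i == len(buffers):
            return True
        buf = buffers[i]
        # Try every offset that keeps the buffer inside the memory (exhaustive).
        for offset in range(capacity - buf.size + 1):
            if fits(buf, offset):
                placed[buf.name] = (offset, buf)
                if search(i + 1):
                    return True
                del placed[buf.name]
        return False

    if search(0):
        return {name: off for name, (off, _) in placed.items()}
    return None


# Three 4-block buffers in an 8-block memory: "in" and "out" are never live together.
buffers = [
    Buffer("in", 4, 0, 1),
    Buffer("mid", 4, 1, 2),
    Buffer("out", 4, 2, 3),
]
layout = allocate(buffers, capacity=8)
too_small = allocate(buffers, capacity=7)
print(layout, too_small)
```

In this toy case the 12 blocks of buffers fit into 8 blocks of memory because `out` reuses the space of `in`, while a 7-block memory admits no layout. Exhaustive search like this scales poorly, which is why whether a constraint solver stays efficient on real networks is listed as an assumption rather than a given.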

pith-pipeline@v0.9.0 · 5442 in / 1236 out tokens · 52189 ms · 2026-05-10T17:00:13.914008+00:00 · methodology

