Compilation and Execution of an Embeddable YOLO-NAS on the VTA
Pith reviewed 2026-05-07 17:43 UTC · model grok-4.3
The pith
The VTA compilation chain has been extended to fully automate handling of large CNNs that exceed on-chip memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that by extending and automating the VTA compilation chain, complete CNNs including those larger than on-chip memory can be compiled and executed in simulation. This overcomes limitations of the previous compiler, enabling support for complex models like YOLO-NAS in embedded FPGA accelerators.
What carries the argument
The extended VTA compilation chain with automated off-chip memory access handling.
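To make the idea concrete, here is a minimal sketch of how a layer whose weights exceed the on-chip buffer could be split into tiles that are loaded one at a time from off-chip memory. It is an illustration under stated assumptions, not the authors' compiler: the names `WeightTile` and `plan_weight_tiles` and the byte sizes are hypothetical, and the real chain also has to schedule the loads against compute.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WeightTile:
    """A contiguous slice of a layer's weights staged on-chip in one load."""
    offset: int  # start position in off-chip memory, in bytes
    size: int    # number of bytes loaded on-chip for this tile


def plan_weight_tiles(layer_bytes: int, buffer_bytes: int) -> List[WeightTile]:
    """Split a layer's weights into tiles that each fit the on-chip buffer."""
    if layer_bytes < 0:
        raise ValueError("layer_bytes must be non-negative")
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    tiles = []
    offset = 0
    while offset < layer_bytes:
        # The last tile may be smaller than the buffer.
        size = min(buffer_bytes, layer_bytes - offset)
        tiles.append(WeightTile(offset=offset, size=size))
        offset += size
    return tiles


# Hypothetical sizes: a 1000-byte layer and a 256-byte on-chip weight buffer.
tiles = plan_weight_tiles(layer_bytes=1000, buffer_bytes=256)
print(len(tiles), [t.size for t in tiles])  # 4 [256, 256, 256, 232]
```

The number of tiles is also the number of off-chip weight loads for the layer, which is the kind of quantity a reader would want reported alongside a claim of successful execution.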
If this is right
- Complete CNNs can now be compiled without manual intervention for memory management.
- Larger models such as YOLO-NAS can be targeted for VTA-based accelerators.
- Simulated execution becomes feasible for verification in safety-critical contexts.
- The chain supports avionic applications by reducing manual steps.
Where Pith is reading between the lines
- If the simulation accurately models hardware, this could streamline the path to certified deployments in regulated industries.
- The automation might generalize to other tensor accelerators beyond VTA.
- Further work could integrate this with real hardware testing to validate timing for certification.
Load-bearing premise
The simulation environment and automated off-chip memory handling must accurately reflect real hardware behavior and timing.
What would settle it
Running the compiled YOLO-NAS on actual VTA hardware and comparing execution traces or timings to the simulation results for discrepancies.
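A comparison of this kind could be sketched as follows, assuming per-layer cycle counts are available from both the simulator and the hardware. The function name `timing_discrepancies`, the layer names, the cycle counts, and the 5% tolerance are all hypothetical choices for illustration; the paper does not specify such a procedure.

```python
from typing import Dict, List, Tuple


def timing_discrepancies(
    simulated: Dict[str, int],
    measured: Dict[str, int],
    rel_tol: float = 0.05,
) -> List[Tuple[str, int, int]]:
    """List layers whose simulated and measured cycle counts disagree.

    A layer is reported when the absolute difference exceeds
    rel_tol times the measured count.
    """
    if simulated.keys() != measured.keys():
        raise ValueError("both traces must cover the same layers")
    flagged = []
    for layer in sorted(simulated):
        sim, hw = simulated[layer], measured[layer]
        if abs(sim - hw) > rel_tol * max(hw, 1):
            flagged.append((layer, sim, hw))
    return flagged


# Hypothetical cycle counts for two layers.
simulated = {"conv1": 1000, "conv2": 2000}
measured = {"conv1": 1010, "conv2": 2300}
print(timing_discrepancies(simulated, measured))  # [('conv2', 2000, 2300)]
```

An empty result would support the load-bearing premise for the layers covered; any flagged layer would point to where the simulator's timing model diverges from the hardware.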
Original abstract
Deploying complex Convolutional Neural Networks (CNNs) on FPGA-based accelerators is a promising way forward for safety-critical domains such as aeronautics. In previous work, we explored the Versatile Tensor Accelerator (VTA) and showed its suitability for avionic applications. For that, we developed an initial stand-alone compiler designed with certification in mind. However, this compiler still suffers from some limitations that are overcome in this paper. The contributions consist of extending and fully automating the VTA compilation chain to allow complete CNN compilation and to support larger CNNs (whose parameters do not fit in the on-chip memory). The effectiveness is demonstrated by the successful compilation and simulated execution of a YOLO-NAS object detection model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the Versatile Tensor Accelerator (VTA) compilation chain with automated support for off-chip memory spilling and scheduling to enable full compilation of CNNs whose parameters exceed on-chip memory capacity. Building on prior stand-alone compiler work for avionic applications, the central demonstration is the successful compilation and simulated execution of a YOLO-NAS object detection model.
Significance. If the automated compiler extensions and simulation traces are correct, the work provides a practical engineering advance for deploying larger CNNs on FPGA accelerators in safety-critical domains such as aeronautics. The full automation of memory management removes a key manual bottleneck from the prior implementation, though the contribution remains an incremental extension rather than a parameter-free or formally verified derivation.
major comments (1)
- [§4 (Compiler Extensions) and §5 (YOLO-NAS Case Study)] The manuscript does not provide sufficient detail on the verification of correctness for the off-chip memory spilling and scheduling logic when applied to YOLO-NAS (whose size exceeds on-chip memory). Without explicit checks against a reference implementation or cycle-accurate hardware traces, the claim of 'successful compilation and simulated execution' rests on the simulator output alone.
minor comments (3)
- [Abstract and §1] The abstract and introduction use 'embeddable' and 'stand-alone compiler' without defining these terms or contrasting them with the new automated chain.
- [§5 and associated figures/tables] Figure captions and simulation result tables would benefit from explicit units and quantitative metrics (e.g., off-chip access counts, estimated latency) rather than qualitative statements of success.
- [Throughout] A few minor typographical inconsistencies appear in section headings and reference formatting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work. We address the major comment below and will incorporate revisions to improve clarity on verification.
- Referee: [§4 (Compiler Extensions) and §5 (YOLO-NAS Case Study)] The manuscript does not provide sufficient detail on the verification of correctness for the off-chip memory spilling and scheduling logic when applied to YOLO-NAS (whose size exceeds on-chip memory). Without explicit checks against a reference implementation or cycle-accurate hardware traces, the claim of 'successful compilation and simulated execution' rests on the simulator output alone.
Authors: We agree that the current description of verification could be strengthened. The VTA simulator employed is the official cycle-accurate simulator from the VTA project, which faithfully models memory hierarchy, spilling, and scheduling behavior. For the YOLO-NAS case study we performed functional equivalence checks by comparing simulator outputs (class scores and bounding boxes) against a reference PyTorch CPU implementation on identical inputs; any discrepancies were investigated and traced to quantization differences rather than scheduling errors. We will revise §5 to explicitly document these reference comparisons, include a brief description of the spilling verification procedure for layers exceeding on-chip SRAM, and add representative excerpts from the simulation logs that confirm correct off-chip memory traffic. We note that cycle-accurate hardware traces from an actual FPGA implementation are outside the scope of the present simulation-focused study, but the simulator itself provides the necessary cycle-level visibility into memory operations.
Revision: yes
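The functional equivalence check described in the rebuttal could look roughly like the sketch below: outputs from the simulator and from a reference implementation are compared element by element, and a small absolute tolerance absorbs quantization differences. The helper names, the sample scores, and the tolerance values are hypothetical; the rebuttal does not say which metric or threshold was used.

```python
from typing import Sequence


def max_abs_error(sim_out: Sequence[float], ref_out: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two output vectors."""
    if len(sim_out) != len(ref_out):
        raise ValueError("outputs must have the same length")
    return max((abs(s - r) for s, r in zip(sim_out, ref_out)), default=0.0)


def outputs_match(
    sim_out: Sequence[float], ref_out: Sequence[float], atol: float
) -> bool:
    """True when every element agrees with the reference within atol."""
    return max_abs_error(sim_out, ref_out) <= atol


# Hypothetical class scores from the reference model and the simulator.
reference_scores = [0.10, 0.75, 0.15]
simulated_scores = [0.11, 0.74, 0.15]

print(outputs_match(simulated_scores, reference_scores, atol=0.02))   # True
print(outputs_match(simulated_scores, reference_scores, atol=0.005))  # False
```

Choosing the tolerance is the delicate part: it has to be loose enough to accept quantization error yet tight enough that a scheduling or spilling bug, which would typically corrupt whole tiles of output, cannot pass unnoticed.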
Circularity Check
Minor self-citation to prior compiler work; central claim is empirical demonstration with no reduction to inputs
The paper describes an engineering extension that automates off-chip memory handling in the VTA chain and reports successful compilation plus simulated execution of YOLO-NAS. No equations, parameters, or derivations appear; the effectiveness claim rests on direct traces from the extended compiler and simulator rather than any fitted quantity or self-defined relation. The single self-citation to the authors' prior stand-alone compiler is background context and not load-bearing for the new automation results. The demonstration is therefore self-contained against external benchmarks (successful run on the target model) and does not match any circularity pattern.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] A. Abinaya, S. Sumathi, et al. Moving vehicles counting and detection using deep neural networks based yolo-nas algorithm. In 2025 International Conference on Innovative Trends in Information Technology (ICITIIT), pages 1–6. IEEE, 2025.
- [2] J. Bachrach, H. Vo, B. C. Richards, Y. Lee, A. Waterman, R. Avizienis, et al. Chisel: constructing hardware in a Scala embedded language. In The 49th Annual Design Automation Conference 2012, DAC, pages 1216–1225, 2012.
- [3] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Q. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In A. C. Arpaci-Dusseau and G. Voelker, editors, 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI, pages 578–594, 2018.
- [4]
- [5] A. Faure-Gignoux, K. Delmas, A. Gauffriau, and C. Pagetti. Open-source stand-alone Versatile Tensor Accelerator. In Digital Avionics Systems Conference (DASC) 2025, 2025. To appear.
- [6] M. Haandbaek et al. Safety-critical systems: A case for modern verification tools and accelerators. In Proceedings of the 2023 Conference on Autonomous Flight Systems. Daedalean, 2023.
- [7]
- [8] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten Digit Recognition with a Back-Propagation Network. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2 (NIPS Conference), pages 396–404, 1989.
- [9]
- [10] T. Nguyen Thi Phuong, G. S. Cho, and I. Chatterjee. Automating container damage detection with the yolo-nas deep learning model. Science Progress, 108(1):00368504251314084, 2025.
- [11]
- [12] RTCA, Inc. DO-178C: Software Considerations in Airborne Systems and Equipment Certification, December 2011.
- [13] J. R. Terven, D. Córdova-Esparza, and J. Romero-González. A comprehensive review of YOLO architectures in computer vision: From yolov1 to yolov8 and YOLO-NAS. Mach. Learn. Knowl. Extr., 5(4):1680–1716, 2023.
- [14] P. Wang, X. Wang, R. Luo, D. Wang, M. Luo, S. Qiao, and Y. Zhou. An efficient im2row-based fast convolution algorithm for ARM cortex-m mcus. IEEE Access, 9:124384–124395, 2021.
European Congress of Embedded Real Time Systems, ISSN 2680-0918, 2026