μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

Jinming Zhuang; Peipei Zhou; Shixin Ji; Wei Zhang; Xingzhen Chen; Zhuoping Yang

arxiv: 2605.17683 · v2 · pith:XDHKGM5Inew · submitted 2026-05-17 · 💻 cs.AR

Shixin Ji, Jinming Zhuang, Zhuoping Yang, Xingzhen Chen, Wei Zhang, Peipei Zhou

Pith reviewed 2026-05-19 21:49 UTC · model grok-4.3

classification 💻 cs.AR

keywords: ACAP · DNN inference · microsecond latency · AIE array · jet tagging · DeepSets · cascade connection · accelerator framework

The pith

μ-ORCA achieves 0.93 μs DNN inference latency on ACAP by direct AIE array communication

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that a customized framework can meet the demanding 1 microsecond latency target for small DNN models in jet-tagging applications where existing ACAP frameworks fail due to communication bottlenecks. μ-ORCA does so by allowing direct communication between layers on the AIE array using high-bandwidth cascade connections and by incorporating an overhead-aware performance model to guide optimizations. This results in support for models like DeepSets with operations such as ReLU and global aggregation directly on the array. A reader would care if this approach proves viable because it could make reconfigurable hardware practical for real-time, ultra-low-latency scientific computing tasks that demand both flexibility and speed.
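
For readers unfamiliar with the model family, the following is a minimal pure-Python sketch of a DeepSets-style forward pass built from the operations the paper lists (bias, ReLU, global aggregation). It is illustrative only: the function names, the choice of sum as the aggregation, the toy weights, and the ReLU on the final layer are assumptions made here, not details taken from the paper or its repository.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape [out_features][in_features]


def dense_relu(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Matrix-vector product, bias add, then ReLU."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]


def global_sum(features: List[Vector]) -> Vector:
    """Element-wise sum over set members (sum is an assumed aggregation)."""
    return [sum(col) for col in zip(*features)]


def deepsets_forward(particles, phi_w, phi_b, rho_w, rho_b) -> Vector:
    embedded = [dense_relu(p, phi_w, phi_b) for p in particles]  # per-element layer
    pooled = global_sum(embedded)                                 # order-independent
    return dense_relu(pooled, rho_w, rho_b)                       # post-aggregation layer


# Hypothetical toy inputs: two particles with two features each.
particles = [[1.0, 2.0], [3.0, -1.0]]
phi_w, phi_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
rho_w, rho_b = [[1.0, 1.0]], [-1.0]

out = deepsets_forward(particles, phi_w, phi_b, rho_w, rho_b)
out_reversed = deepsets_forward(particles[::-1], phi_w, phi_b, rho_w, rho_b)
print(out, out == out_reversed)  # [5.0] True
```

The second call shows why the aggregation step matters: reordering the set members leaves the output unchanged, which is the property that makes the aggregation a natural synchronization point between the per-element layers and the layers that follow.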

Core claim

μ-ORCA enables direct inter-layer communication between DNN layers on the AIE array instead of routing data through shared memory tiles or the FPGA fabric, and it uses a 512-bit/cycle cascade connection in place of a 32-bit/cycle DMA connection. An overhead-aware performance model that adapts to different NN layer sizes guides design space exploration to optimize end-to-end latency for MLP and DeepSets models, including non-MM kernels (bias, ReLU, and global aggregation) on AIE. On the AMD ACAP VEK280 platform, this yields average latency reductions of >1.70× and >1.83× compared with different state-of-the-art ACAP frameworks, and 0.93 μs latency for a 6-layer real-world DeepSets model.

What carries the argument

Direct inter-layer communication on the AIE array with 512-bit/cycle cascade connections that bypass shared memory and DMA for reduced latency in small models.
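
The bandwidth gap between the two link types can be made concrete with a short calculation. The sketch below counts the cycles needed to move one activation tensor at 32 bits/cycle and at 512 bits/cycle. The tensor size (64 values of 8 bits) is a hypothetical example, and the calculation ignores setup and synchronization costs, which is exactly the part the load-bearing premise depends on.

```python
import math


def transfer_cycles(num_bits: int, bits_per_cycle: int) -> int:
    """Cycles to move num_bits over a link, ignoring any setup overhead."""
    return math.ceil(num_bits / bits_per_cycle)


# Hypothetical activation tensor: 64 values of 8 bits each (not from the paper).
activation_bits = 64 * 8

dma_cycles = transfer_cycles(activation_bits, 32)       # 32-bit/cycle DMA link
cascade_cycles = transfer_cycles(activation_bits, 512)  # 512-bit/cycle cascade link

print(dma_cycles, cascade_cycles, dma_cycles / cascade_cycles)  # 16 1 16.0
```

The 16× ratio is the ideal case, where the payload is a multiple of 512 bits. A smaller payload still costs one full cascade cycle, so the advantage shrinks as tensors get smaller.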

Load-bearing premise

Direct inter-layer communication on the AIE array can be realized with negligible additional synchronization or routing overhead for the small problem sizes typical of jet-tagging models.

What would settle it

A test run on the VEK280 platform where the 6-layer DeepSets model using μ-ORCA's direct communication method exhibits latency above 1 μs due to unaccounted synchronization costs.
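
To see how little slack such a test leaves, the budget can be restated in clock cycles. The sketch below does that conversion with integer arithmetic. The clock frequency is a placeholder chosen for round numbers, not a VEK280 specification, and the helper name is hypothetical.

```python
def budget_cycles(latency_ns: int, clock_mhz: int) -> int:
    """Whole clock cycles that fit into latency_ns at clock_mhz."""
    return latency_ns * clock_mhz // 1000


CLOCK_MHZ = 1000  # hypothetical clock, chosen only to keep the arithmetic simple

budget = budget_cycles(1000, CLOCK_MHZ)   # the 1 μs latency budget
reported = budget_cycles(930, CLOCK_MHZ)  # the reported 0.93 μs result
headroom = budget - reported

print(budget, reported, headroom)  # 1000 930 70
```

Under this assumed clock, the reported result leaves 70 cycles of headroom across all six layers, so a small unmodeled synchronization cost per layer boundary would be enough to exceed the budget.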

Figures

Figures reproduced from arXiv: 2605.17683 by Jinming Zhuang, Peipei Zhou, Shixin Ji, Wei Zhang, Xingzhen Chen, Zhuoping Yang.

**Figure 1.** Data movement methods among AIE tiles: (a) baseline DMA-based data movement; (b) μ-ORCA cascade-enabled data movement. Legend: from/to PLIO or shared memory tile, from last layer, to next layer, input, weight, output, partial results, DMA, cascade, parameter (pre-loaded).

**Figure 6.** Cascade connection for inter-layer data movement.

**Figure 5.** Inter-layer intermediate activation data communi…

**Figure 7.** Global aggregation layer design. Adjacent text captured from the paper: "…other, as most inter-layer data is generated and processed when AIE1 is still computing. This enables µ-ORCA to reduce almost all the communication latency except for the last j loop." Section 4.3.1 (Global aggregation layers): "µ-ORCA supports global aggregation layers to support handling DeepSets models fully within the AIE array instead of…"

**Figure 8.** µ-ORCA design space exploration. Adjacent text captured from the paper, Section 5.1.3 (Inter-layer communication latency): "The communication latency between layers depends on the method used. When the DMA connection is applied, the communication happens sequentially with the computation. That is, only after the producer kernel releases the output buffer (typically when the kernel finishes) can the DMA communication begin. Then the producer kernel canno…"

**Figure 9.** Normalized single AIE measured latency and esti…

**Figure 10.** Latency comparison of synthetic MLP workloads with various layer shapes and number of layers.

**Figure 11.** Latency comparison on realistic workloads.
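
The text captured with Figures 7 and 8 contrasts two behaviors: DMA transfers that start only after the producer kernel finishes, and cascade transfers that overlap with computation so that only the last loop iteration's data is exposed. The sketch below is a simplified model of that contrast, not the paper's performance model. The function names, the chunking parameter, and the cycle counts are all hypothetical.

```python
import math


def sequential_latency(compute_cycles: int, payload_bits: int,
                       bits_per_cycle: int = 32) -> int:
    """DMA-style: the whole transfer starts after the producer finishes."""
    return compute_cycles + math.ceil(payload_bits / bits_per_cycle)


def overlapped_latency(compute_cycles: int, payload_bits: int, num_chunks: int,
                       bits_per_cycle: int = 512) -> int:
    """Cascade-style: chunks stream out during compute; only the last is exposed."""
    chunk_bits = math.ceil(payload_bits / num_chunks)
    return compute_cycles + math.ceil(chunk_bits / bits_per_cycle)


# Hypothetical layer: 100 compute cycles producing 2048 bits in 4 chunks.
dma = sequential_latency(100, 2048)
cascade = overlapped_latency(100, 2048, num_chunks=4)

print(dma, cascade)  # 164 101
```

In this toy setting the communication cost drops from 64 cycles to 1. The model deliberately omits any per-chunk synchronization term. Adding one is the simplest way to probe when the overlap stops paying off.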

Original abstract

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory tiles or FPGA fabric. Moreover, a 512-bit/cycle cascade connection is applied instead of a 32-bit/cycle DMA connection. μ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation on AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves average latency reduction of >1.70× and >1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. We open source μ-ORCA at https://github.com/arc-research-lab/u-ORCA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

μ-ORCA hits 0.93 μs on real ACAP hardware for a jet-tagging DeepSets model via direct AIE cascades, but the overhead assumptions need clearer checks.

The main thing to know is that μ-ORCA achieves 0.93 μs latency for a 6-layer DeepSets model on the AMD VEK280 platform, which fits inside the 1-μs budget needed for jet tagging in high-energy physics. It also reports average latency reductions of more than 1.70× and 1.83× compared with other ACAP frameworks.

What the work actually adds is a combination of two things. First, layer-to-layer data movement bypasses shared memory and DMA in favor of wide 512-bit cascade connections directly on the AIE array. Second, an overhead-aware performance model factors in synchronization and VLIW prologue costs when exploring designs, including non-matrix-multiply operations such as ReLU and global aggregation. The authors evaluate on real hardware and provide open-source code, which lets others reproduce the latency numbers.

One area that feels under-supported is the claim that these direct links incur negligible additional overhead for the small problem sizes in jet-tagging models. The stress test points out that there is no ablation study isolating the communication method and no per-component latency breakdown to check the model's predictions against actual measurements. This makes it harder to judge how much of the gain comes purely from the new communication path and how much from other tuning. That said, since the final numbers come from physical runs rather than simulation, the practical result still holds up.

This kind of paper is useful for researchers focused on low-latency inference accelerators for scientific applications, especially those already looking at AMD ACAP or similar heterogeneous platforms. A reader who needs concrete implementation ideas for microsecond-scale DNNs on reconfigurable hardware will get something out of the framework description and the results. It deserves a serious referee because the hardware validation is there and the targeted problem is specific enough that detailed feedback on the model and experiments would strengthen it without requiring a complete overhaul.

Referee Report

1 major / 2 minor

Summary. The paper proposes μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency DNN inference on AMD ACAP platforms. It enables direct inter-layer communication on the AIE array using 512-bit/cycle cascade connections rather than shared memory or DMA, provides an overhead-aware performance model that adapts to different layer sizes for design space exploration, and supports MLP and DeepSets models including non-MM kernels like bias, ReLU, and global aggregation. Evaluation on the VEK280 platform shows that μ-ORCA achieves 0.93 μs latency for a 6-layer real-world DeepSets model, meeting the 1-μs budget, along with average latency reductions of over 1.70× and 1.83× compared to state-of-the-art ACAP frameworks.

Significance. This work addresses an important gap in deploying DNNs for microsecond-scale inference on reconfigurable hardware, particularly for high-energy physics applications such as jet tagging. The hardware measurements on real ACAP hardware provide concrete evidence for the latency claims, and the open-sourcing of the implementation is a strength that supports reproducibility. The overhead-aware model is a useful contribution for accurate performance estimation in accelerator design.

major comments (1)

Framework design and evaluation sections: the 0.93 μs latency claim and reported speedups rest on direct AIE-array inter-layer communication (via 512-bit/cycle cascade) incurring negligible synchronization/routing overhead for the small problem sizes typical of jet-tagging models. No ablation isolating the communication method, no per-component latency breakdown, and no explicit validation of the overhead-aware model predictions against measured overheads are described, leaving this assumption as the least-secured link.

minor comments (2)

Abstract: the specific state-of-the-art ACAP frameworks used for the >1.70× and >1.83× comparisons should be named or cited for better context.
Notation and figures: ensure consistent use of terms such as 'cascade connection' versus 'DMA connection' across text and diagrams.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance, hardware evaluation, and open-sourcing. We address the single major comment below and will revise the manuscript accordingly.

Referee: Framework design and evaluation sections: the 0.93 μs latency claim and reported speedups rest on direct AIE-array inter-layer communication (via 512-bit/cycle cascade) incurring negligible synchronization/routing overhead for the small problem sizes typical of jet-tagging models. No ablation isolating the communication method, no per-component latency breakdown, and no explicit validation of the overhead-aware model predictions against measured overheads are described, leaving this assumption as the least-secured link.

Authors: We appreciate the referee highlighting this point. The overhead-aware performance model explicitly incorporates synchronization, routing, and VLIW prologue costs and is calibrated using hardware measurements on the VEK280 for the evaluated layer sizes; the reported 0.93 μs end-to-end latency and speedups are obtained from direct hardware timing rather than model predictions alone. For the small problem sizes characteristic of jet-tagging models, the 512-bit/cycle AIE cascade connections are architecturally designed to incur lower overhead than DMA or shared-memory alternatives, consistent with AMD AIE documentation. Nevertheless, we agree that an explicit ablation isolating the cascade communication method, a per-component latency breakdown, and direct validation plots of model-predicted versus measured overheads would strengthen the presentation. We will add these elements to the revised manuscript, including a table or figure showing component-wise timings and a comparison of end-to-end latency with and without direct cascade links where implementation constraints allow.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; latency claims derive from hardware execution

The paper's central claims (0.93 μs latency for the 6-layer DeepSets model and >1.70×/>1.83× speedups) are obtained from direct physical measurements on the AMD ACAP VEK280 platform after implementing the accelerator. The overhead-aware performance model is used only for design-space exploration to select configurations; the reported end-to-end numbers are not produced by that model or by any fitted parameter renamed as a prediction. No self-definitional equations, load-bearing self-citations, or ansatz smuggling appear in the derivation of the final results, which remain externally falsifiable via the open-sourced code and real hardware.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on hardware-specific assumptions about ACAP AIE array capabilities and the accuracy of its performance model for small layers; no new physical entities are postulated.

free parameters (1)

layer-size adaptation parameters
The overhead-aware performance model adapts to different NN layer sizes, implying a small number of tuned or measured constants for synchronization and prologue costs.
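
A minimal sketch of what such a model could look like, and how it would feed design space exploration, is given below. This is not the paper's model: the additive form, the assumption that synchronization cost grows with the number of tiles, the function names, and every constant are hypothetical choices made for illustration.

```python
import math


def layer_latency(macs: int, tiles: int, macs_per_cycle: int,
                  prologue: int, sync_per_tile: int) -> int:
    """Estimated cycles: fixed prologue + per-tile sync + evenly split compute."""
    compute = math.ceil(math.ceil(macs / tiles) / macs_per_cycle)
    return prologue + sync_per_tile * tiles + compute


def best_tile_count(macs: int, candidates, **costs) -> int:
    """Exhaustive search over candidate tile counts for the lowest estimate."""
    return min(candidates, key=lambda t: layer_latency(macs, t, **costs))


# Hypothetical constants; a real model would calibrate them on hardware.
costs = dict(macs_per_cycle=256, prologue=20, sync_per_tile=4)
macs = 1 * 64 * 64  # one input vector through a 64x64 layer

for t in (1, 2, 4, 8):
    print(t, layer_latency(macs, t, **costs))  # 40, 36, 40, 54 cycles
print("best:", best_tile_count(macs, (1, 2, 4, 8), **costs))  # best: 2
```

With these placeholder numbers the fixed overhead already exceeds the compute time on a single tile, and adding tiles beyond two makes the estimate worse. That is the qualitative behavior an overhead-aware model is meant to capture for small layers, and it is why the calibrated constants count as free parameters.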

axioms (1)

Domain assumption: direct inter-layer communication via 512-bit cascade on AIE array incurs negligible extra overhead compared with DMA for the target small problem sizes.
Invoked to justify replacing shared-memory or FPGA-fabric communication.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "µ-ORCA enables direct inter-layer communication between DNN layers on the AIE array... a 512-bit/cycle cascade connection is applied instead of a 32-bit/cycle DMA connection... overhead-aware performance model that adapts to different NN layer sizes"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We evaluate µ-ORCA on the AMD ACAP VEK280 platform... achieves 0.93 µs latency for a 6-layer real-world DeepSets model"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.