Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator
Pith reviewed 2026-05-19 06:13 UTC · model grok-4.3
The pith
A systolic FPGA accelerator produces heterogeneity-aware schedules for stochastic workloads in near real time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Stannic, by inheriting a schedule-centric abstraction on FPGA hardware, reduces latency per computation iteration by 7.5 times and increases the supported size of the target heterogeneous system by 14 times compared with prior hardware acceleration, while still generating schedules that achieve efficient machine utilization and low average job latency under stochastic conditions.
What carries the argument
The Stannic systolic accelerator, which uses a schedule-centric abstraction to parallelize the computation of heterogeneity-aware schedules.
Load-bearing premise
The hardware correctly executes the full stochastic scheduling logic and produces exactly the same schedule decisions as a correct software implementation.
What would settle it
A side-by-side run on identical stochastic job traces where the hardware-generated schedules show higher average job latency or lower overall machine utilization than the original software algorithm.
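The comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation harness: the `ScheduledJob` record, the function names, and the choice to measure the makespan from time zero are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledJob:
    # Hypothetical record of one job in a finished schedule.
    release: float  # time the job arrived
    start: float    # time it began running (non-preemptive)
    finish: float   # time it completed
    machine: int    # index of the machine that ran it


def average_job_latency(jobs):
    """Mean of (finish - release) over all jobs."""
    return sum(j.finish - j.release for j in jobs) / len(jobs)


def machine_utilization(jobs, num_machines):
    """Total busy time divided by (machines x makespan), makespan measured from t=0."""
    makespan = max(j.finish for j in jobs)
    busy = sum(j.finish - j.start for j in jobs)
    return busy / (num_machines * makespan)


def hardware_is_worse(sw_jobs, hw_jobs, num_machines, tol=1e-9):
    """True if the hardware schedule has higher latency or lower utilization."""
    return (
        average_job_latency(hw_jobs) > average_job_latency(sw_jobs) + tol
        or machine_utilization(hw_jobs, num_machines)
        < machine_utilization(sw_jobs, num_machines) - tol
    )


# Same two jobs, scheduled on two machines (software) or serialized on one (hardware).
sw = [ScheduledJob(0, 0, 2, 0), ScheduledJob(0, 0, 4, 1)]
hw = [ScheduledJob(0, 0, 2, 0), ScheduledJob(0, 2, 6, 0)]
print(hardware_is_worse(sw, hw, num_machines=2))
```

In a real comparison both schedules would come from the same job trace, and the check would be repeated over many traces, since a single stochastic run says little about the distribution of outcomes.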
Original abstract
Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts.
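For readers unfamiliar with the problem class, the sketch below shows one well-known style of non-preemptive stochastic online scheduling: a greedy policy that places each arriving job on the machine with the smallest expected increase in total weighted completion time, and keeps each machine's queue in decreasing weight-to-expected-time order. This is an illustration from the general literature, not a reconstruction of what Hercules or Stannic compute. The function names are hypothetical, and the sketch ignores the remaining time of any job already running.

```python
def expected_increase(queue, weight, exp_time):
    """Expected rise in total weighted completion time if a new job joins `queue`.

    `queue` holds (weight, expected_time) pairs already waiting on one machine.
    """
    ratio = weight / exp_time
    cost = weight * exp_time              # the new job's own processing
    for w_k, p_k in queue:
        if w_k / p_k >= ratio:
            cost += weight * p_k          # job k runs first and delays the new job
        else:
            cost += w_k * exp_time        # the new job runs first and delays job k
    return cost


def assign_job(queues, weight, exp_times):
    """Place the job on the machine with the smallest expected increase.

    `exp_times[i]` is the job's expected processing time on machine i,
    which is where machine heterogeneity enters.
    """
    costs = [expected_increase(q, weight, p) for q, p in zip(queues, exp_times)]
    best = min(range(len(queues)), key=costs.__getitem__)
    queues[best].append((weight, exp_times[best]))
    # Keep the queue ordered by decreasing weight / expected time.
    queues[best].sort(key=lambda job: job[0] / job[1], reverse=True)
    return best


queues = [[], []]
first = assign_job(queues, 1.0, [4.0, 2.0])    # machine 1 is faster for this job
second = assign_job(queues, 1.0, [3.0, 2.0])   # machine 1 is busy, so machine 0 wins
third = assign_job(queues, 2.0, [10.0, 1.0])   # heavy, short job jumps machine 1's queue
print(first, second, third, queues)
```

The per-machine cost evaluations are independent of one another, which is the kind of structure a spatial or systolic design can evaluate in parallel.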
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents two FPGA-based hardware accelerators, Hercules and Stannic, for non-preemptive stochastic online scheduling in heterogeneous HPC systems. Hercules uses a task-centric abstraction while Stannic adopts a schedule-centric abstraction with systolic parallelism and pre-calculation. The authors report a speedup of up to 1060x over a C/C++ baseline with Hercules, and with Stannic a 7.5x reduction in latency per iteration plus a 14x increase in supported system size, while claiming efficient machine utilization and low average job latency for stochastic workloads.
Significance. If the hardware faithfully reproduces the software scheduler's decisions, the work could enable near-real-time heterogeneity-aware scheduling for large stochastic workloads in shared HPC clusters, where software overhead currently limits adaptability. The shift to schedule-centric systolic design and the reported scaling improvements would be a notable engineering contribution for hardware-accelerated resource management.
major comments (2)
- [Abstract] The performance claims (1060x speedup, 7.5x latency reduction, 14x system-size increase) are stated without any description of experimental methodology, workload generation model, number of trials, baseline C/C++ implementation details, or hardware resource counts, leaving the central speedup results unsupported.
- [Evaluation] No quantitative equivalence data (machine utilization, average job latency, or schedule-quality metrics) is provided comparing hardware outputs to the software reference implementation. Without this, it is impossible to confirm that fixed-point arithmetic, pseudo-random generation, or spatial memory access in the systolic design preserve the stochastic decision distribution.
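The fixed-point concern in the second comment can be made concrete with a small experiment. The sketch below assumes an unsigned fixed-point format with truncation and a "pick the cheapest machine" decision. Both are illustrative choices made here, and the page does not state which number format the accelerators actually use.

```python
def to_fixed(value, frac_bits):
    """Quantize a non-negative real to an unsigned fixed-point integer by truncation."""
    return int(value * (1 << frac_bits))


def pick_float(costs):
    """Index of the cheapest machine using floating-point costs."""
    return min(range(len(costs)), key=costs.__getitem__)


def pick_fixed(costs, frac_bits):
    """Index of the cheapest machine after quantizing every cost."""
    quantized = [to_fixed(c, frac_bits) for c in costs]
    return min(range(len(quantized)), key=quantized.__getitem__)


def agreement_rate(cost_rows, frac_bits):
    """Fraction of decisions on which fixed-point and floating-point agree."""
    same = sum(pick_float(row) == pick_fixed(row, frac_bits) for row in cost_rows)
    return same / len(cost_rows)


# Each row is one scheduling decision: the candidate cost on each machine.
rows = [[1.30, 1.26], [2.0, 3.0]]
print(agreement_rate(rows, frac_bits=2))  # coarse format: near-ties collapse
print(agreement_rate(rows, frac_bits=8))  # finer format: decisions match
```

With two fractional bits, the costs 1.30 and 1.26 quantize to the same integer, so the tie-break rather than the cost decides the machine. That is the kind of divergence an equivalence table would need to rule out.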
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and agree that greater detail on methodology and validation will strengthen the work. Revisions have been prepared to incorporate these suggestions.
- Referee: [Abstract] The performance claims (1060x speedup, 7.5x latency reduction, 14x system-size increase) are stated without any description of experimental methodology, workload generation model, number of trials, baseline C/C++ implementation details, or hardware resource counts, leaving the central speedup results unsupported.
  Authors: We agree that the abstract would benefit from additional context to support the performance claims. In the revised version we will add a concise description of the workload generation model, number of trials performed, key aspects of the C/C++ baseline implementation, and hardware resource counts. Full experimental details remain in the Evaluation section, but the abstract update will make the central results more self-contained.
  Revision: yes
- Referee: [Evaluation] No quantitative equivalence data (machine utilization, average job latency, or schedule-quality metrics) is provided comparing hardware outputs to the software reference implementation. Without this, it is impossible to confirm that fixed-point arithmetic, pseudo-random generation, or spatial memory access in the systolic design preserve the stochastic decision distribution.
  Authors: The Evaluation section reports machine utilization and average job latency results for the hardware accelerators on stochastic workloads. We acknowledge, however, that direct quantitative equivalence metrics comparing hardware schedule quality and decision distributions to the software reference are not explicitly tabulated. In the revision we will add these comparisons, including schedule-quality scores, statistical similarity measures between hardware and software decisions, and targeted checks on the fixed-point and pseudo-random components to confirm preservation of the stochastic behavior.
  Revision: yes
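The rebuttal promises "statistical similarity measures" without naming one. One plausible candidate, chosen here purely for illustration, is the two-sample Kolmogorov-Smirnov statistic applied to per-job latencies from the hardware and software runs. The function name and the sample data below are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up latency samples, only to exercise the function.
software_latencies = [1, 2, 3, 4]
hardware_latencies = [3, 4, 5, 6]
print(ks_statistic(software_latencies, hardware_latencies))
```

A value near zero suggests the two latency distributions are similar, and a value near one suggests they barely overlap. Turning the statistic into a significance test would additionally require the sample sizes and a chosen threshold.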
Circularity Check
No circularity: speedups derived from direct hardware-vs-software timing measurements
The paper's core claims are empirical speedups (1060x for Hercules, 7.5x latency reduction and 14x scale increase for Stannic) obtained by comparing FPGA hardware execution time against a baseline C/C++ software implementation of the same non-preemptive stochastic online scheduler. No equations, fitted parameters, or self-citations are used to derive the reported performance numbers; the results follow from direct benchmarking of the implemented microarchitectures. The schedule-quality statements are presented as experimental observations rather than predictions forced by construction. The derivation chain is therefore self-contained against external timing measurements and does not reduce to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: job arrivals and processing times in shared HPC clusters are inherently stochastic.