arxiv: 2507.01113 · v3 · submitted 2025-07-01 · 💻 cs.DC · cs.SY · eess.SY

Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator

Pith reviewed 2026-05-19 06:13 UTC · model grok-4.3

classification 💻 cs.DC · cs.SY · eess.SY
keywords stochastic scheduling · FPGA accelerator · heterogeneous computing · online scheduling · HPC · systolic array · workload balancing

The pith

A systolic FPGA accelerator produces heterogeneity-aware schedules for stochastic workloads in near real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Stannic as a systolic microarchitecture that accelerates a non-preemptive stochastic online scheduling algorithm for heterogeneous systems. It shifts from a task-centric approach in an earlier design to a schedule-centric abstraction that exploits parallelism, pre-calculation, and spatial memory access. This produces schedules that balance unpredictable job arrivals and processing times across machines of varying capabilities. A reader would care because software schedulers in shared HPC clusters often cannot adapt fast enough to stochastic conditions without excessive overhead.

Core claim

The paper claims that Stannic, by adopting a schedule-centric abstraction on FPGA hardware, achieves a 7.5x reduction in latency per computation iteration and a 14x increase in the supported size of the target heterogeneous system relative to the earlier Hercules design, while still generating schedules that achieve efficient machine utilization and low average job latency under stochastic conditions.

What carries the argument

Stannic systolic accelerator that uses a schedule-centric abstraction to parallelize the computation of heterogeneity-aware schedules.

Load-bearing premise

The hardware correctly executes the full stochastic scheduling logic and produces exactly the same schedule decisions as a correct software implementation.

What would settle it

A side-by-side run on identical stochastic job traces where the hardware-generated schedules show higher average job latency or lower overall machine utilization than the original software algorithm.
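The side-by-side comparison described above could be scored along the following lines. This is a minimal sketch, not the paper's evaluation harness: the `JobRecord` fields, the definition of utilization as busy time over the trace horizon, and the `compare` helper are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """One completed job in a trace (all field names are illustrative)."""
    arrival: float   # time the job entered the system
    start: float     # time it began running on its machine
    finish: float    # time it completed
    machine: int     # index of the machine that ran it


def average_latency(records):
    """Mean time from arrival to completion."""
    return sum(r.finish - r.arrival for r in records) / len(records)


def utilization(records, num_machines):
    """Fraction of the trace horizon that each machine spent busy."""
    horizon = max(r.finish for r in records) - min(r.arrival for r in records)
    busy = [0.0] * num_machines
    for r in records:
        busy[r.machine] += r.finish - r.start
    return [b / horizon for b in busy]


def compare(hw, sw, num_machines):
    """Return (latency gap, per-machine utilization gap), hardware minus software."""
    lat_gap = average_latency(hw) - average_latency(sw)
    util_gap = [h - s for h, s in zip(utilization(hw, num_machines),
                                      utilization(sw, num_machines))]
    return lat_gap, util_gap
```

A positive latency gap on identical traces would mean the hardware schedule is slower on average than the software one, which is the outcome that would count against the load-bearing premise.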

Figures

Figures reproduced from arXiv: 2507.01113 by Adam H. Ross, Debjit Pal, Vairavan Palaniappan.

Figure 1. Algorithmic flow for stochastic online scheduling. Phase I prepares a job for the scheduler; Phase II and Phase III show the steps involved in scheduling the job.
Figure 2. Top-level block diagram of the HERCULES scheduler. Phase II and III are the phases shown in …
Figure 4. (a): Cost Calculator. TAH: Tree Adder to compute costH. TAL: Tree Adder to compute costL. N: # of jobs in each machine. In: {K.ID, sumH, sumL, T K i} × N. Out: {sumH, sumL} × N. (b): Individual Job Cost Calculator. (c): αJ check module. CAM: Content Addressable Memory. K.ID is used as the tag for content matching and data retrieval.
Figure 6. (a): Various quantization techniques applied to each job attribute. Green highlights the most suitable quantization. (b): Scheduled job distribution in each machine. (c): % Error in αJ. (d): % Error in WSPT.
Accompanying text: INT8 exhibits the second-highest WSPT error, while INT4 and Mixed-precision approaches show lower WSPT errors. However, INT8 demonstrates lower αJ error than INT4 and Mixed…
Figure 7. (a): Average machine utilization across emulations. The darker the color, the more jobs are assigned to the machine. (b): Scheduler throughput across emulations.
Accompanying text: Hardware for SOS scheduler: We have used an AMD Alveo U55C [15] as our target FPGA to implement the SOS scheduler. We used Allo/HeteroCL [16], [17] programming language to design the scheduler. The operating frequency of the scheduler is 371.47 …
Figure 10. Job distribution and average latency across M1 – M5 under varied workloads. SOS: Stochastic Online Scheduler; RR: Round Robin Scheduler; WSRR: Work Stealing Round Robin Scheduler; WSG: Work Stealing Greedy Scheduler.
Accompanying text: These experiments show that SOSA is an efficient, effective, and adaptable scheduler under varying realistic workloads targeting heterogeneous and homogeneous hardware.
Original abstract

Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts.
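To make "heterogeneity-aware" assignment concrete, here is a minimal sketch of a greedy minimum-increase rule in the style of the stochastic online scheduling literature the paper cites. It is a simplification under stated assumptions, not the algorithm as implemented in Hercules or Stannic: it ignores release times and the job currently running, and the function names are hypothetical. The names `sum_high` and `sum_low` are chosen to echo the sumH and sumL signals in the Figure 4 caption, but that correspondence is a guess.

```python
def expected_increase(job_w, job_p, queue):
    """Expected rise in total weighted completion time if the job joins `queue`.

    queue: list of (weight, expected_processing_time) pairs already waiting
    on the machine. Jobs whose weight/time ratio is at least the new job's
    run first and delay it; jobs with a lower ratio are delayed by it.
    """
    ratio = job_w / job_p
    sum_high = sum(p for w, p in queue if w / p >= ratio)  # work ahead of the job
    sum_low = sum(w for w, p in queue if w / p < ratio)    # weight behind the job
    return job_w * (sum_high + job_p) + job_p * sum_low


def assign(job_w, job_p_per_machine, queues):
    """Place the job on the machine with the smallest expected increase.

    job_p_per_machine[i] is the job's expected processing time on machine i,
    which is how machine heterogeneity enters the decision.
    """
    costs = [expected_increase(job_w, p, q)
             for p, q in zip(job_p_per_machine, queues)]
    best = min(range(len(costs)), key=costs.__getitem__)
    queues[best].append((job_w, job_p_per_machine[best]))
    return best, costs
```

For example, with `queues = [[(2.0, 1.0)], []]`, calling `assign(1.0, [1.0, 4.0], queues)` compares a short wait behind one queued job on machine 0 against a slow but idle machine 1. The per-machine cost evaluations are independent of one another, which is the kind of structure a parallel hardware design can exploit.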

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents two FPGA-based hardware accelerators, Hercules and Stannic, for non-preemptive stochastic online scheduling in heterogeneous HPC systems. Hercules uses a task-centric abstraction while Stannic adopts a schedule-centric abstraction with systolic parallelism and pre-calculation. The authors report a speedup of up to 1060x over a C/C++ baseline with Hercules, and with Stannic a 7.5x reduction in latency per iteration plus a 14x increase in supported system size, while claiming efficient machine utilization and low average job latency for stochastic workloads.

Significance. If the hardware faithfully reproduces the software scheduler's decisions, the work could enable near-real-time heterogeneity-aware scheduling for large stochastic workloads in shared HPC clusters, where software overhead currently limits adaptability. The shift to schedule-centric systolic design and the reported scaling improvements would be a notable engineering contribution for hardware-accelerated resource management.

major comments (2)
  1. [Abstract] Abstract: the performance claims (1060x speedup, 7.5x latency reduction, 14x system-size increase) are stated without any description of experimental methodology, workload generation model, number of trials, baseline C/C++ implementation details, or hardware resource counts, leaving the central speedup results unsupported.
  2. [Evaluation] Evaluation section: no quantitative equivalence data (machine utilization, average job latency, or schedule-quality metrics) is provided comparing hardware outputs to the software reference implementation. Without this, it is impossible to confirm that fixed-point arithmetic, pseudo-random generation, or spatial memory access in the systolic design preserve the stochastic decision distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and agree that greater detail on methodology and validation will strengthen the work. Revisions have been prepared to incorporate these suggestions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the performance claims (1060x speedup, 7.5x latency reduction, 14x system-size increase) are stated without any description of experimental methodology, workload generation model, number of trials, baseline C/C++ implementation details, or hardware resource counts, leaving the central speedup results unsupported.

    Authors: We agree that the abstract would benefit from additional context to support the performance claims. In the revised version we will add a concise description of the workload generation model, number of trials performed, key aspects of the C/C++ baseline implementation, and hardware resource counts. Full experimental details remain in the Evaluation section, but the abstract update will make the central results more self-contained.

    revision: yes

  2. Referee: [Evaluation] Evaluation section: no quantitative equivalence data (machine utilization, average job latency, or schedule-quality metrics) is provided comparing hardware outputs to the software reference implementation. Without this, it is impossible to confirm that fixed-point arithmetic, pseudo-random generation, or spatial memory access in the systolic design preserve the stochastic decision distribution.

    Authors: The Evaluation section reports machine utilization and average job latency results for the hardware accelerators on stochastic workloads. We acknowledge, however, that direct quantitative equivalence metrics comparing hardware schedule quality and decision distributions to the software reference are not explicitly tabulated. In the revision we will add these comparisons, including schedule-quality scores, statistical similarity measures between hardware and software decisions, and targeted checks on the fixed-point and pseudo-random components to confirm preservation of the stochastic behavior.

    revision: yes
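One targeted check of the kind discussed in this exchange is whether reduced-precision arithmetic changes the priority order of jobs. The sketch below counts job pairs whose weighted-shortest-processing-time (WSPT) order flips after quantization. The uniform scaling in `quantize` is an assumption made for illustration and is not the INT8, INT4, or mixed-precision scheme evaluated in the paper; all function names are hypothetical.

```python
from itertools import combinations


def quantize(values, bits):
    """Uniformly map positive values onto integers 1 .. 2**bits - 1."""
    levels = (1 << bits) - 1
    top = max(values)
    # Clamp to 1 so a quantized processing time can never be zero.
    return [max(1, round(v / top * levels)) for v in values]


def wspt_order(weights, times):
    """Job indices sorted by decreasing weight/time ratio, ties by index."""
    return sorted(range(len(weights)), key=lambda j: (-weights[j] / times[j], j))


def order_disagreements(weights, times, bits):
    """Count job pairs whose relative WSPT order changes after quantization."""
    ref = wspt_order(weights, times)
    qnt = wspt_order(quantize(weights, bits), quantize(times, bits))
    ref_pos = {j: i for i, j in enumerate(ref)}
    qnt_pos = {j: i for i, j in enumerate(qnt)}
    return sum(
        (ref_pos[a] < ref_pos[b]) != (qnt_pos[a] < qnt_pos[b])
        for a, b in combinations(range(len(weights)), 2)
    )
```

A count of zero on a representative trace would suggest that the chosen bit width preserves priority order on that trace; it says nothing about the pseudo-random components, which would need a separate distributional comparison.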

Circularity Check

0 steps flagged

No circularity: speedups derived from direct hardware-vs-software timing measurements

full rationale

The paper's core claims are empirical speedups (1060x for Hercules, 7.5x latency reduction and 14x scale increase for Stannic) obtained by comparing FPGA hardware execution time against a baseline C/C++ software implementation of the same non-preemptive stochastic online scheduler. No equations, fitted parameters, or self-citations are used to derive the reported performance numbers; the results follow from direct benchmarking of the implemented microarchitectures. The schedule-quality statements are presented as experimental observations rather than predictions forced by construction. The derivation chain is therefore self-contained against external timing measurements and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on standard domain assumptions about stochastic workloads and FPGA capabilities with no free parameters, invented entities, or ad-hoc axioms visible in the abstract.

axioms (1)
  • domain assumption: Job arrivals and processing times in shared HPC clusters are inherently stochastic.
    Invoked in the abstract as the core challenge that software schedulers struggle with.

pith-pipeline@v0.9.0 · 5787 in / 1308 out tokens · 90507 ms · 2026-05-19T06:13:21.425971+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    Exploration on Task Scheduling Strategy for CPU-GPU Heterogeneous Computing System

    Juan Fang, Jiaxing Zhang, Shuaibing Lu, and Hui Zhao. Exploration on Task Scheduling Strategy for CPU-GPU Heterogeneous Computing System. IEEE Computer Society Annual Symp. on VLSI (ISVLSI), 2020

  2. [2]

    Efficient Inter-Device Task Scheduling Schemes for Multi-Device Co-Processing of Data-Parallel Kernels on Heterogeneous Systems

    Lanjun Wan, Weihua Zheng, and Xinpan Yuan. Efficient Inter-Device Task Scheduling Schemes for Multi-Device Co-Processing of Data-Parallel Kernels on Heterogeneous Systems. IEEE Access, 2021

  3. [3]

    Improved Task Scheduling in Heterogeneous Distributed Systems using Intelligent Greedy Harris Hawk Optimization Algorithm

    Mohammad Navid Habibpour Roudsari. Improved Task Scheduling in Heterogeneous Distributed Systems using Intelligent Greedy Harris Hawk Optimization Algorithm. Evol. Intel. (EI), 2024

  4. [4]

    Uncertainty-Aware Online Deadline-Constrained Scheduling of Parallel Applications in Distributed Heterogeneous Systems

    Yifan Liu, Jinchao Chen, Jiangong Yang, Chenglie Du, and Xiaoyan Du. Uncertainty-Aware Online Deadline-Constrained Scheduling of Parallel Applications in Distributed Heterogeneous Systems. Computers & Industrial Engineering, 2024

  5. [5]

    Feature-Aware Task Scheduling on CPU-FPGA Heterogeneous Platforms

    Peilun Du, Zichang Sun, Haitao Zhang, and Huadong Ma. Feature-Aware Task Scheduling on CPU-FPGA Heterogeneous Platforms. Int’l Conf. on High Performance Computing and Communications (HPCC), 2019

  6. [6]

    Scheduling for Heterogeneous Systems in Accelerator-Rich Environments

    Serif Yesil and Ozcan Ozturk. Scheduling for Heterogeneous Systems in Accelerator-Rich Environments. The Journal of Supercomputing (JSC), 2022

  7. [7]

    Reliability-Aware Scheduling on Heterogeneous Multicore Processors

    Ajeya Naithani, Stijn Eyerman, and Lieven Eeckhout. Reliability-Aware Scheduling on Heterogeneous Multicore Processors. Int’l Symp. on High-Performance Computer Architecture (HPCA), 2017

  8. [8]

    Runtime and Energy Constrained Work Scheduling for Heterogeneous Systems

    Valon Raca, Seeun William Umboh, Eduard Mehofer, and Bernhard Scholz. Runtime and Energy Constrained Work Scheduling for Heterogeneous Systems. The Journal of Supercomputing (JSC), 2022

  9. [9]

    Optimal Task Scheduling for Partially Heterogeneous Systems

    Michael Orr and Oliver Sinnen. Optimal Task Scheduling for Partially Heterogeneous Systems. Parallel Computing, 2021

  10. [10]

    An Improved Greedy Algorithm for Stochastic Online Scheduling on Unrelated Machines

    Sven Jäger. An Improved Greedy Algorithm for Stochastic Online Scheduling on Unrelated Machines. Discrete Optimization (DO), 2023

  11. [11]

    popcount

    CPP Reference. popcount. https://en.cppreference.com/w/cpp/numeric/popcount, 2024. Accessed: September 7, 2025

  12. [12]

    Operating System Concepts

    Abraham Silberschatz, Peter B. Galvin, and Greg Gagne. Operating System Concepts. John Wiley & Sons, 9th edition, 2012

  13. [13]

    Greedy Scheduling of Tasks With Time Constraints for Energy-Efficient Cloud-Computing Data Centers

    Ziqian Dong, Ning Liu, and Roberto Rojas-Cessa. Greedy Scheduling of Tasks With Time Constraints for Energy-Efficient Cloud-Computing Data Centers. Journal of Cloud Computing, 2015

  14. [14]

    Taskflow: A General-Purpose Parallel and Heterogeneous Task Programming System

    Tsung-Wei Huang, Dian-Lun Lin, Yibo Lin, and Chun-Xun Lin. Taskflow: A General-Purpose Parallel and Heterogeneous Task Programming System. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2022

  15. [15]

    AMD Alveo U55C Product brief

    AMD. AMD Alveo U55C Product brief. https://www.amd.com/en/products/accelerators/alveo/u55c.html, 2024. Accessed: September 7, 2025

  16. [16]

    Allo: A Programming Model for Composable Accelerator Design

    Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, and Zhiru Zhang. Allo: A Programming Model for Composable Accelerator Design. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 2024

  17. [17]

    HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

    Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing. Int’l Symp. on Field-Programmable Gate Arrays (FPGA), 2019

  18. [18]

    A Configurable Hardware Scheduler for Real-Time Systems

    Pramote Kuacharoen, Mohamed Shalan, and Vincent John Mooney. A Configurable Hardware Scheduler for Real-Time Systems. Engineering of Reconfigurable Systems and Algorithms, 2003

  19. [19]

    A Hardware Scheduler Based on Task Queues for FPGA-Based Embedded Real-Time Systems

    Yi Tang and Neil W. Bergmann. A Hardware Scheduler Based on Task Queues for FPGA-Based Embedded Real-Time Systems. IEEE Trans. on Computers (TC), 2015

  20. [20]

    HRHS: A High-Performance Real-Time Hardware Scheduler

    Danesh Derafshi, Amin Norollah, Mohsen Khosroanjam, and Hakem Beitollahi. HRHS: A High-Performance Real-Time Hardware Scheduler. IEEE Trans. on Parallel and Distributed Systems (TPDS), 2020

  21. [21]

    Efficient Scheduling of Dependent Tasks in Many-Core Real-Time System Using a Hardware Scheduler

    Amin Norollah, Zahra Kazemi, Niloufar Sayadi, Hakem Beitollahi, Mahdi Fazeli, and David Hely. Efficient Scheduling of Dependent Tasks in Many-Core Real-Time System Using a Hardware Scheduler. Workshop on High-Performance Embedded Computing, 2021

  22. [22]

    HD-CPS: Hardware-Assisted Drift-Aware Concurrent Priority Scheduler for Shared Memory Multicores

    Mohsin Shan and Omer Khan. HD-CPS: Hardware-Assisted Drift-Aware Concurrent Priority Scheduler for Shared Memory Multicores. Int’l Symp. on High-Performance Computer Architecture (HPCA), 2022

  23. [23]

    SchedTask: A Hardware-Assisted Task Scheduler

    Prathmesh Kallurkar and Smruti R Sarangi. SchedTask: A Hardware-Assisted Task Scheduler. Int’l Symp. on Microarchitecture (MICRO), 2017

  24. [24]

    Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

    Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin. Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System. IEEE Trans. on Parallel and Distributed Systems (TPDS), 2022

  25. [25]

    Models and Algorithms for Stochastic Online Scheduling

    Nicole Megow, Marc Uetz, and Tjark Vredeveld. Models and Algorithms for Stochastic Online Scheduling. Mathematics of Operations Research (MOR), 2006

  26. [26]

    AMD Vitis User Guide

    AMD. AMD Vitis User Guide. https://docs.amd.com/r/en-US/Vitis Libraries/User-Guide, 2024. Accessed: September 7, 2025

  27. [27]

    Xilinx XRT Documentation

    Xilinx. Xilinx XRT Documentation. https://xilinx.github.io/XRT/2024.1/html/index.html, 2024. Accessed: September 7, 2025

  28. [28]

    Vitis HLS User Guide

    AMD. Vitis HLS User Guide. https://docs.amd.com/r/en-US/ug1399-vitis-hls, 2024. Accessed: September 7, 2025

  29. [29]

    Optimal Task Scheduling Benefits from A Duplicate-Free State-Space

    Michael Orr and Oliver Sinnen. Optimal Task Scheduling Benefits from A Duplicate-Free State-Space. Journal of Parallel and Distributed Computing, 2020

  30. [30]

    Task Scheduling Frameworks for Heterogeneous Computing Toward Exascale

    Suhelah Sandokji and Fathy Eassa. Task Scheduling Frameworks for Heterogeneous Computing Toward Exascale. Int’l Journal of Advanced Computer Science and Applications (IJACSA), 2018

  31. [31]

    Design and Analysis of Scheduling Strategies for Multi-CPU and Multi-GPU Architectures

    Joao VF Lima, Thierry Gautier, Vincent Danjean, Bruno Raffin, and Nicolas Maillard. Design and Analysis of Scheduling Strategies for Multi-CPU and Multi-GPU Architectures. Parallel Computing, 2015

  32. [32]

    Real-Time Scheduling of Parallel Tasks with Tight Deadlines

    Xu Jiang, Nan Guan, Xiang Long, Yue Tang, and Qingqiang He. Real-Time Scheduling of Parallel Tasks with Tight Deadlines. Journal of Systems Architecture, 2020

  33. [33]

    Energy-Efficient Stochastic Task Scheduling on Heterogeneous Computing Systems

    Kenli Li, Xiaoyong Tang, and Keqin Li. Energy-Efficient Stochastic Task Scheduling on Heterogeneous Computing Systems. IEEE Trans. on Parallel and Distributed Systems (TPDS), 2013

  34. [34]

    Efficient Program Scheduling for Heterogeneous Multi-Core Processors

    Jian Chen and Lizy K John. Efficient Program Scheduling for Heterogeneous Multi-Core Processors. Design Automation Conf. (DAC), 2009