Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic
Pith reviewed 2026-05-18 09:02 UTC · model grok-4.3
The pith
A flexible framework collects real makespan data from abstract task graphs on heterogeneous CPU-GPU-FPGA systems to evaluate rapid prediction algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a highly flexible evaluation framework capable of collecting real-world makespan results based on abstract task graph descriptions for heterogeneous systems with CPUs, GPUs, and FPGAs. It analyzes the predictive power of existing analytical approaches and presents challenges arising from high-level characteristics like data transfer overhead and device congestion.
What carries the argument
The evaluation framework that executes abstract task graphs on real heterogeneous hardware to measure makespans.
If this is right
- Developers can iterate on analytical prediction models more rapidly without full hardware implementations for each test.
- Task mapping algorithms can be refined using predictions that better reflect real behaviors in mixed systems.
- Challenges such as data transfer overhead and device congestion can be directly incorporated into improved prediction functions.
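To make the last point concrete, the following is a minimal sketch of a prediction function that accounts for both effects. It is not the paper's model: the function name `predict_makespan`, the constant per-edge `transfer_time`, and all numbers are assumptions chosen for illustration. Tasks are processed in a given topological order, each device runs one task at a time (a crude stand-in for congestion), and data crossing devices pays a fixed penalty.

```python
from collections import defaultdict


def predict_makespan(order, edges, mapping, exec_time, transfer_time):
    """Estimate makespan with per-device serialization and transfer costs.

    order: task names in a valid topological order (assumed, not checked).
    edges: list of (producer, consumer) pairs.
    mapping: task -> device name.
    exec_time: (task, device) -> execution time.
    transfer_time: penalty when producer and consumer sit on different devices.
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    device_free = defaultdict(float)  # time at which each device becomes idle
    finish = {}
    for task in order:
        dev = mapping[task]
        # Inputs are ready once every predecessor has finished, plus a
        # transfer penalty if the data has to cross devices.
        ready = 0.0
        for p in preds[task]:
            cost = transfer_time if mapping[p] != dev else 0.0
            ready = max(ready, finish[p] + cost)
        # Congestion: the task also waits until its device is idle.
        start = max(ready, device_free[dev])
        finish[task] = start + exec_time[(task, dev)]
        device_free[dev] = finish[task]
    return max(finish.values())


# Hypothetical diamond-shaped graph a -> {b, c} -> d; all numbers are made up.
order = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
exec_time = {
    ("a", "cpu"): 1.0, ("b", "cpu"): 1.0, ("c", "cpu"): 1.0, ("d", "cpu"): 1.0,
    ("b", "gpu"): 0.5, ("c", "gpu"): 0.5,
}
all_cpu = {t: "cpu" for t in order}
one_offload = dict(all_cpu, b="gpu")
two_offloads = dict(all_cpu, b="gpu", c="gpu")

for name, m in [("all_cpu", all_cpu), ("one_offload", one_offload),
                ("two_offloads", two_offloads)]:
    print(name, predict_makespan(order, edges, m, exec_time, transfer_time=0.25))
```

With these made-up numbers the three mappings yield 4.0, 3.0, and 3.5: offloading a second task to the GPU makes the prediction worse, because both offloaded tasks queue on the same device and both results must be transferred back.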
Where Pith is reading between the lines
- The framework could be adapted for additional accelerator types or cloud-based heterogeneous setups.
- Standardized abstract task graph formats might allow direct comparisons of prediction methods across different research groups.
- Combining the collected real-world data with machine learning could yield hybrid prediction models with higher accuracy.
Load-bearing premise
Abstract task graph descriptions can generate makespan measurements on real hardware that are representative of full implementations, particularly for programmable logic.
What would settle it
Running full task implementations on FPGAs and comparing the measured makespans to those obtained from the abstract descriptions in the framework.
Abstract
Heterogeneous computing systems, which combine general-purpose processors with specialized accelerators, are increasingly important for optimizing the performance of modern applications. A central challenge is to decide which parts of an application should be executed on which accelerator or, more generally, how to map the tasks of an application to available devices. Predicting the impact of a change in a task mapping on the overall makespan is non-trivial. While there are very capable simulators, these generally require a full implementation of the tasks in question, which is particularly time-intensive for programmable logic. A promising alternative is to use a purely analytical function, which allows for very fast predictions, but abstracts significantly from reality. Bridging the gap between theory and practice poses a significant challenge to algorithm developers. This paper aims to aid in the development of rapid makespan prediction algorithms by providing a highly flexible evaluation framework for heterogeneous systems consisting of CPUs, GPUs and FPGAs, which is capable of collecting real-world makespan results based on abstract task graph descriptions. We analyze to what extent actual makespans can be predicted by existing analytical approaches. Furthermore, we present common challenges that arise from high-level characteristics such as data transfer overhead and device congestion in heterogeneous systems.
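As one illustration of what a purely analytical function can look like, the sketch below computes two classic lower bounds on the makespan of a mapped task graph: the critical path length and the load of the busiest device. This is a generic textbook-style estimate, not one of the predictors the paper evaluates; the function name `makespan_lower_bound` and the example graph and timings are assumptions. It deliberately ignores data transfers and contention, which is exactly the kind of abstraction from reality the abstract refers to.

```python
def makespan_lower_bound(order, edges, mapping, exec_time):
    """Return max(critical path length, busiest device load).

    order: task names in a valid topological order (assumed, not checked).
    edges: list of (producer, consumer) pairs.
    mapping: task -> device name.
    exec_time: (task, device) -> execution time.
    """
    # Critical path: longest chain of dependent tasks under this mapping.
    longest = {}
    for task in order:
        longest_pred = max(
            (longest[src] for src, dst in edges if dst == task), default=0.0
        )
        longest[task] = longest_pred + exec_time[(task, mapping[task])]
    critical_path = max(longest.values())

    # Device load: total execution time assigned to each device.
    load = {}
    for task in order:
        dev = mapping[task]
        load[dev] = load.get(dev, 0.0) + exec_time[(task, dev)]

    return max(critical_path, max(load.values()))


# Hypothetical diamond-shaped graph a -> {b, c} -> d; all numbers are made up.
order = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
exec_time = {(t, "cpu"): 1.0 for t in order}
exec_time[("b", "fpga")] = 0.5

before = {t: "cpu" for t in order}
after = dict(before, b="fpga")  # change the mapping of a single task

print(makespan_lower_bound(order, edges, before, exec_time))
print(makespan_lower_bound(order, edges, after, exec_time))
```

Evaluating a mapping change costs only one pass over the graph, which is why such functions are attractive inside search-based mapping algorithms. Here the bound drops from 4.0 to 3.0 when task `b` moves to the FPGA; whether a real system would show the same gain is the question the framework is meant to answer.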
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a flexible evaluation framework for heterogeneous systems combining CPUs, GPUs, and FPGAs. The framework executes abstract task-graph descriptions on real hardware to collect makespan measurements without requiring complete task implementations (especially for programmable logic). It then compares these measurements against existing analytical prediction models and identifies recurring challenges such as data-transfer overhead and device congestion.
Significance. If the collected measurements prove representative, the framework would provide a practical bridge between high-level analytical predictors and hardware reality, lowering the barrier for developing and validating rapid makespan-prediction algorithms. The explicit discussion of congestion and transfer effects supplies concrete guidance for future model refinement in mixed-accelerator environments.
major comments (2)
- [§4] §4 (Evaluation): the manuscript reports measured makespans and analytical predictions but supplies no quantitative error metrics, confidence intervals, or statistical tests (e.g., mean absolute percentage error or correlation coefficients) that would allow readers to judge the practical utility of the analytical approaches.
- [§3.2] §3.2 (FPGA abstraction layer): the description of how abstract task graphs are realized on FPGAs does not specify any mechanism or calibration step that accounts for synthesis, placement, routing, or bitstream-loading delays; without such detail the collected makespans risk systematic divergence from deployed behavior, undermining the central claim that the framework yields representative data for predictor evaluation.
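The error metrics named in the first major comment are simple to compute once measured and predicted makespans are paired up. The sketch below uses only the Python standard library; the function names and the sample values are assumptions for illustration and are not taken from the manuscript.

```python
import math


def mape(measured, predicted):
    """Mean absolute percentage error, in percent, relative to the measurement."""
    return 100.0 * sum(abs(m - p) / m for m, p in zip(measured, predicted)) / len(measured)


def rmse(measured, predicted):
    """Root-mean-square error, in the same unit as the makespans."""
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured))


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Made-up makespans: the predictor consistently reports half the measured value.
measured = [2.0, 4.0, 5.0, 8.0]
predicted = [1.0, 2.0, 2.5, 4.0]

print("MAPE    :", mape(measured, predicted))
print("RMSE    :", rmse(measured, predicted))
print("Pearson :", pearson(measured, predicted))
```

The example is chosen to show why reporting more than one metric matters: the predictor is off by 50% on every sample, yet its correlation with the measurements is perfect. Such a predictor would still rank mappings correctly, which may be all a mapping algorithm needs.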
minor comments (2)
- [Figure 2] Figure 2 and Table 1: axis labels and legend entries are too small for comfortable reading; enlarging them would improve clarity.
- [§2] The notation for task-graph nodes and device mappings is introduced inconsistently between §2 and §3; a single consolidated definition table would help.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and have revised the manuscript to strengthen the presentation of quantitative results and the description of the FPGA abstraction.
- Referee: [§4] §4 (Evaluation): the manuscript reports measured makespans and analytical predictions but supplies no quantitative error metrics, confidence intervals, or statistical tests (e.g., mean absolute percentage error or correlation coefficients) that would allow readers to judge the practical utility of the analytical approaches.
  Authors: We agree that the original evaluation would benefit from explicit quantitative metrics. In the revised manuscript we now report mean absolute percentage error (MAPE), root-mean-square error, and Pearson correlation coefficients between measured and predicted makespans for each device combination. We also include 95% confidence intervals on the error statistics and a brief discussion of the correlation significance to help readers assess the practical utility of the analytical predictors.
  Revision: yes
- Referee: [§3.2] §3.2 (FPGA abstraction layer): the description of how abstract task graphs are realized on FPGAs does not specify any mechanism or calibration step that accounts for synthesis, placement, routing, or bitstream-loading delays; without such detail the collected makespans risk systematic divergence from deployed behavior, undermining the central claim that the framework yields representative data for predictor evaluation.
  Authors: The framework is deliberately abstract so that makespan data can be collected without full task implementations on programmable logic. We have nevertheless expanded §3.2 to clarify that bitstream-loading latency is measured and included in the reported makespans whenever a pre-synthesized bitstream is available, while synthesis, placement, and routing times are treated as one-time offline costs outside the runtime makespan. We also added a short paragraph describing a simple calibration procedure that uses a small set of representative kernels to estimate average routing overhead for the target FPGA; this calibration can be applied by users who wish to align the abstract measurements more closely with a fully deployed flow.
  Revision: partial
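The 95% confidence intervals on the error statistics mentioned in the first response could be obtained in several ways; the rebuttal does not say which one is used. A percentile bootstrap is one common choice, sketched below under that assumption. The function name `bootstrap_mape_ci`, its parameters, and the sample data are hypothetical.

```python
import random


def bootstrap_mape_ci(measured, predicted, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the MAPE, in percent."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    pairs = list(zip(measured, predicted))
    stats = []
    for _ in range(n_resamples):
        # Resample (measured, predicted) pairs with replacement.
        sample = [rng.choice(pairs) for _ in pairs]
        stats.append(100.0 * sum(abs(m - p) / m for m, p in sample) / len(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up makespans; per-sample errors are roughly 10%, 10%, 20%, 10%, 10%.
measured = [2.0, 4.0, 5.0, 8.0, 10.0]
predicted = [1.8, 4.4, 4.0, 8.8, 9.0]

lower, upper = bootstrap_mape_ci(measured, predicted)
print(f"95% CI for MAPE: [{lower:.1f}%, {upper:.1f}%]")
```

With only a handful of task graphs the interval will be wide, which is itself useful information when judging how far a reported error figure can be trusted.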
Circularity Check
No circularity: empirical evaluation framework relies on independent hardware measurements
The paper introduces a flexible evaluation framework to collect real-world makespan data from abstract task-graph descriptions on heterogeneous hardware (CPUs, GPUs, FPGAs) and then compares these measurements against existing analytical prediction methods. No derivation, prediction, or first-principles result is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The central claim is an empirical bridge between abstract descriptions and hardware observations, with analysis of prediction accuracy performed on independently collected data. This structure is self-contained and externally falsifiable via hardware runs, satisfying the criteria for a non-circular contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Abstract task graph descriptions can sufficiently capture execution characteristics to enable meaningful real-world makespan collection and prediction evaluation in heterogeneous CPU-GPU-FPGA systems.