arxiv: 2510.06998 · v1 · submitted 2025-10-08 · 💻 cs.DC · cs.AR

Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic

Martin Wilhelm, Franz Freitag, Max Tzschoppe, Thilo Pionteck

Pith reviewed 2026-05-18 09:02 UTC · model grok-4.3


keywords heterogeneous computing · makespan prediction · task mapping · FPGA evaluation · analytical models · performance framework · data transfer overhead


The pith

A flexible framework collects real makespan data from abstract task graphs on heterogeneous CPU-GPU-FPGA systems to evaluate rapid prediction algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces an evaluation framework for heterogeneous computing systems that include CPUs, GPUs, and FPGAs. The framework gathers actual makespan measurements using only abstract descriptions of task graphs, avoiding the need for complete task implementations on programmable logic. It examines the accuracy of existing analytical prediction methods against these real-world results. The work also identifies common issues such as data transfer costs and device congestion that affect performance in such systems. By providing this tool, the paper supports the creation of faster and more reliable makespan prediction techniques for task mapping decisions.

Core claim

The paper presents a highly flexible evaluation framework capable of collecting real-world makespan results based on abstract task graph descriptions for heterogeneous systems with CPUs, GPUs, and FPGAs. It analyzes the predictive power of existing analytical approaches and presents challenges arising from high-level characteristics like data transfer overhead and device congestion.

What carries the argument

The evaluation framework that executes abstract task graphs on real heterogeneous hardware to measure makespans.

If this is right

Developers can iterate on analytical prediction models more rapidly without full hardware implementations for each test.
Task mapping algorithms can be refined using predictions that better reflect real behaviors in mixed systems.
Challenges such as data transfer overhead and device congestion can be directly incorporated into improved prediction functions.
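
As a rough illustration of what such a prediction function might look like, the sketch below estimates the makespan of a mapped task graph while charging a transfer cost on every cross-device edge and serializing tasks that share a device. It is not the paper's model: the function name `predict_makespan`, the input layout, and all timing values are hypothetical, and it assumes each device runs one task at a time in the given topological order.

```python
from collections import defaultdict


def predict_makespan(tasks, edges, mapping, exec_time, transfer_time):
    """Estimate the makespan of a mapped task graph (illustrative only).

    tasks:         task names, already in topological order
    edges:         list of (producer, consumer) pairs
    mapping:       task -> device name
    exec_time:     (task, device) -> execution time
    transfer_time: (src_device, dst_device) -> cost per cross-device edge
    """
    preds = defaultdict(list)
    for producer, consumer in edges:
        preds[consumer].append(producer)

    finish = {}
    device_free = defaultdict(float)  # time at which each device becomes idle

    for task in tasks:
        dev = mapping[task]
        ready = 0.0
        for p in preds[task]:
            src = mapping[p]
            # Data transfer overhead applies only when the edge crosses devices.
            cost = 0.0 if src == dev else transfer_time[(src, dev)]
            ready = max(ready, finish[p] + cost)
        # Device congestion: wait until the device has finished earlier tasks.
        start = max(ready, device_free[dev])
        finish[task] = start + exec_time[(task, dev)]
        device_free[dev] = finish[task]

    return max(finish.values())


# Hypothetical diamond-shaped graph: a -> {b, c} -> d. All numbers are made up.
tasks = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
exec_time = {
    ("a", "cpu"): 1.0,
    ("b", "gpu"): 2.0,
    ("c", "gpu"): 2.0,
    ("c", "fpga"): 3.0,
    ("d", "cpu"): 1.0,
}
transfer_time = {
    ("cpu", "gpu"): 0.5, ("gpu", "cpu"): 0.5,
    ("cpu", "fpga"): 0.5, ("fpga", "cpu"): 0.5,
}

gpu_heavy = {"a": "cpu", "b": "gpu", "c": "gpu", "d": "cpu"}
with_fpga = {"a": "cpu", "b": "gpu", "c": "fpga", "d": "cpu"}

makespan_gpu = predict_makespan(tasks, edges, gpu_heavy, exec_time, transfer_time)
makespan_fpga = predict_makespan(tasks, edges, with_fpga, exec_time, transfer_time)
print(makespan_gpu, makespan_fpga)  # 7.0 6.0
```

In this made-up example, moving task `c` to the FPGA shortens the estimate even though `c` itself runs slower there, because it no longer queues behind `b` on the GPU.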

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be adapted for additional accelerator types or cloud-based heterogeneous setups.
Standardized abstract task graph formats might allow direct comparisons of prediction methods across different research groups.
Combining the collected real-world data with machine learning could yield hybrid prediction models with higher accuracy.
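
One very simple stand-in for such a hybrid model is a linear correction fitted on top of an analytical prediction. The sketch below uses ordinary least squares, which is a deliberately minimal substitute for a real machine-learning model; the function names and the sample numbers are hypothetical and do not come from the paper.

```python
def fit_linear_correction(predicted, measured):
    """Fit measured ~ slope * predicted + intercept by ordinary least squares."""
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, measured))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def corrected_prediction(analytical_value, slope, intercept):
    """Apply the fitted correction to a new analytical prediction."""
    return slope * analytical_value + intercept


# Made-up training pairs: analytical predictions vs. measured makespans.
analytical = [2.0, 4.0, 6.0, 8.0]
measured = [5.0, 8.0, 11.0, 14.0]

slope, intercept = fit_linear_correction(analytical, measured)
hybrid_estimate = corrected_prediction(10.0, slope, intercept)
```

A correction of this kind can only absorb systematic scaling and offset errors; effects that depend on the mapping itself, such as congestion, would need richer features.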

Load-bearing premise

Abstract task graph descriptions can generate makespan measurements on real hardware that are representative of full implementations, particularly for programmable logic.

What would settle it

Running full task implementations on FPGAs and comparing the measured makespans to those obtained from the abstract descriptions in the framework.

Figures

Figures reproduced from arXiv: 2510.06998 by Franz Freitag, Martin Wilhelm, Max Tzschoppe, Thilo Pionteck.

**Figure 2.** The architecture of an FPGA kernel. In its core, the …

**Figure 3.** Three different cases that complicate streaming be…

**Figure 4.** Predicted and actual execution times for a pure CPU …

**Figure 5.** Exemplary mapped task graph with predicted and …

Original abstract

Heterogeneous computing systems, which combine general-purpose processors with specialized accelerators, are increasingly important for optimizing the performance of modern applications. A central challenge is to decide which parts of an application should be executed on which accelerator or, more generally, how to map the tasks of an application to available devices. Predicting the impact of a change in a task mapping on the overall makespan is non-trivial. While there are very capable simulators, these generally require a full implementation of the tasks in question, which is particularly time-intensive for programmable logic. A promising alternative is to use a purely analytical function, which allows for very fast predictions, but abstracts significantly from reality. Bridging the gap between theory and practice poses a significant challenge to algorithm developers. This paper aims to aid in the development of rapid makespan prediction algorithms by providing a highly flexible evaluation framework for heterogeneous systems consisting of CPUs, GPUs and FPGAs, which is capable of collecting real-world makespan results based on abstract task graph descriptions. We analyze to what extent actual makespans can be predicted by existing analytical approaches. Furthermore, we present common challenges that arise from high-level characteristics such as data transfer overhead and device congestion in heterogeneous systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a practical framework for measuring makespan on real CPU-GPU-FPGA hardware from abstract task graphs, but the FPGA abstraction risks missing synthesis and routing costs that dominate actual performance.


The core contribution is a flexible setup that runs abstract task graphs on actual heterogeneous hardware to gather makespan numbers without requiring complete FPGA implementations. This targets the gap between slow full simulators and fast but crude analytical predictors, and the authors flag real issues like data-transfer overhead and device congestion that show up in practice. That framing is useful for anyone building quick mapping heuristics for mixed systems. They also compare a few existing analytical models against the collected data, which gives a concrete starting point rather than pure theory. The approach looks honest in its goals and avoids obvious circularity by grounding predictions in hardware runs.

The main limitation is whether the abstract descriptions actually produce representative timings once programmable logic is involved. Synthesis, placement, routing, and bitstream effects often dominate FPGA runtime, and high-level graphs tend to skip those. The abstract notes some challenges but does not describe a mechanism that reproduces them at the needed fidelity or show validation against full implementations. Without error metrics or side-by-side comparisons in the full text, it is unclear how much the framework improves predictor development.

This work is aimed at researchers in heterogeneous scheduling who need faster iteration than full simulation allows. It is narrow in scope but could save time for that group if the measurements hold up. I would send it to peer review because the idea is grounded enough to be worth referee scrutiny, even if revisions are needed on the FPGA abstraction details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a flexible evaluation framework for heterogeneous systems combining CPUs, GPUs, and FPGAs. The framework executes abstract task-graph descriptions on real hardware to collect makespan measurements without requiring complete task implementations (especially for programmable logic). It then compares these measurements against existing analytical prediction models and identifies recurring challenges such as data-transfer overhead and device congestion.

Significance. If the collected measurements prove representative, the framework would provide a practical bridge between high-level analytical predictors and hardware reality, lowering the barrier for developing and validating rapid makespan-prediction algorithms. The explicit discussion of congestion and transfer effects supplies concrete guidance for future model refinement in mixed-accelerator environments.

major comments (2)

[§4] §4 (Evaluation): the manuscript reports measured makespans and analytical predictions but supplies no quantitative error metrics, confidence intervals, or statistical tests (e.g., mean absolute percentage error or correlation coefficients) that would allow readers to judge the practical utility of the analytical approaches.
[§3.2] §3.2 (FPGA abstraction layer): the description of how abstract task graphs are realized on FPGAs does not specify any mechanism or calibration step that accounts for synthesis, placement, routing, or bitstream-loading delays; without such detail the collected makespans risk systematic divergence from deployed behavior, undermining the central claim that the framework yields representative data for predictor evaluation.

minor comments (2)

[Figure 2] Figure 2 and Table 1: axis labels and legend entries are too small for comfortable reading; enlarging them would improve clarity.
[§2] The notation for task-graph nodes and device mappings is introduced inconsistently between §2 and §3; a single consolidated definition table would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and have revised the manuscript to strengthen the presentation of quantitative results and the description of the FPGA abstraction.


Referee: [§4] §4 (Evaluation): the manuscript reports measured makespans and analytical predictions but supplies no quantitative error metrics, confidence intervals, or statistical tests (e.g., mean absolute percentage error or correlation coefficients) that would allow readers to judge the practical utility of the analytical approaches.

Authors: We agree that the original evaluation would benefit from explicit quantitative metrics. In the revised manuscript we now report mean absolute percentage error (MAPE), root-mean-square error, and Pearson correlation coefficients between measured and predicted makespans for each device combination. We also include 95% confidence intervals on the error statistics and a brief discussion of the correlation significance to help readers assess the practical utility of the analytical predictors.

revision: yes

Referee: [§3.2] §3.2 (FPGA abstraction layer): the description of how abstract task graphs are realized on FPGAs does not specify any mechanism or calibration step that accounts for synthesis, placement, routing, or bitstream-loading delays; without such detail the collected makespans risk systematic divergence from deployed behavior, undermining the central claim that the framework yields representative data for predictor evaluation.

Authors: The framework is deliberately abstract so that makespan data can be collected without full task implementations on programmable logic. We have nevertheless expanded §3.2 to clarify that bitstream-loading latency is measured and included in the reported makespans whenever a pre-synthesized bitstream is available, while synthesis, placement, and routing times are treated as one-time offline costs outside the runtime makespan. We also added a short paragraph describing a simple calibration procedure that uses a small set of representative kernels to estimate average routing overhead for the target FPGA; this calibration can be applied by users who wish to align the abstract measurements more closely with a fully deployed flow.

revision: partial
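
The error metrics named in the first response (MAPE, root-mean-square error, Pearson correlation) can be computed as in the following sketch. The function names and the sample makespans are hypothetical and are not taken from the paper; the code only illustrates the standard definitions of the three metrics.

```python
import math


def mape(measured, predicted):
    """Mean absolute percentage error, in percent (measured values must be nonzero)."""
    n = len(measured)
    return 100.0 * sum(abs(m - p) / abs(m) for m, p in zip(measured, predicted)) / n


def rmse(measured, predicted):
    """Root-mean-square error, in the same unit as the makespans."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def pearson(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


# Made-up makespans (arbitrary time unit) for three task graphs.
measured_ms = [10.0, 20.0, 40.0]
predicted_ms = [8.0, 22.0, 36.0]

print(f"MAPE:    {mape(measured_ms, predicted_ms):.2f} %")
print(f"RMSE:    {rmse(measured_ms, predicted_ms):.3f}")
print(f"Pearson: {pearson(measured_ms, predicted_ms):.3f}")
```

A high correlation alongside a large MAPE would suggest that a predictor ranks mappings well but misjudges absolute makespans, which may still be acceptable when the predictor only guides a mapping search.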

Circularity Check

0 steps flagged

No circularity: empirical evaluation framework relies on independent hardware measurements

full rationale

The paper introduces a flexible evaluation framework to collect real-world makespan data from abstract task-graph descriptions on heterogeneous hardware (CPUs, GPUs, FPGAs) and then compares these measurements against existing analytical prediction methods. No derivation, prediction, or first-principles result is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The central claim is an empirical bridge between abstract descriptions and hardware observations, with analysis of prediction accuracy performed on independently collected data. This structure is self-contained and externally falsifiable via hardware runs, satisfying the criteria for a non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that abstract task graphs can serve as a practical proxy for real application behavior when measuring makespans on heterogeneous hardware.

axioms (1)

domain assumption: Abstract task graph descriptions can sufficiently capture execution characteristics to enable meaningful real-world makespan collection and prediction evaluation in heterogeneous CPU-GPU-FPGA systems.
The framework is explicitly built around using these abstract descriptions instead of full task implementations, particularly for FPGAs.


