Mestra: Exploring Migration on Virtualized CGRAs
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
Mestra shows that virtualized multi-tenant CGRAs with live kernel migration improve workload makespan by up to 70.48 percent and cut tail latency by up to 29.60 percent at 0.13 percent LUT cost per region.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mestra is an end-to-end system for CGRA multi-tenancy that supports dynamic scheduling and resource allocation in a shared environment. It addresses fabric fragmentation through stateless and stateful live kernel migration. Evaluations on the Alveo-U280 show that spatial sharing improves workload makespan by up to 70.48 percent, that live migration reduces tail latency on fragmented layouts by up to 29.60 percent, and that the required controller and read-back paths add only 0.13 percent LUT cost per region.
What carries the argument
The live kernel migration mechanism together with the tightly coupled controller and read-back paths that enable virtualization and state transfer on shared CGRAs.
Load-bearing premise
That the performance gains measured on PolyBench routines and ML-derived kernels on the Alveo-U280 will generalize to other workloads and that migration overhead will stay negligible when kernels hold large internal state or operate under tight timing constraints.
What would settle it
Executing the same multi-tenant workload set but with kernels that contain substantially larger internal state than the PolyBench and ML examples, then checking whether the observed 29.60 percent tail-latency reduction disappears or the migration time becomes prohibitive.
Original abstract
As modern Coarse Grain Reconfigurable Arrays (CGRAs) grow in size, efficient utilization of the available fabric by a single application becomes increasingly difficult. Existing CGRA mappers either fail to utilize the available fabric or rely on rigid static code transformations with limited adaptability. Multi-tenant CGRAs have emerged as a promising solution to increase hardware utilization, but current attempts fail to address key challenges such as fabric fragmentation and live migration. To address this gap, we present Mestra, an end-to-end system for CGRA multi-tenancy that supports dynamic scheduling and resource allocation in a shared environment. Mestra addresses fabric fragmentation caused by kernels completing out of order by supporting both stateless and stateful live kernel migration as a de-fragmentation mechanism. We assess our solution on an Alveo-U280 data-center-grade FPGA card, reporting area, frequency, and power. Performance is evaluated using routines from the PolyBench benchmark suite and kernels derived from common machine learning operators. Results show that spatial sharing of the available fabric across multiple users improves workload makespan by up to 70.48%, while live kernel migration reduces tail latency on fragmented layouts by up to 29.60%. The custom tightly coupled controller and read-back paths required for virtualization and stateful migration introduce a LUT cost of 0.13% per region. Our evaluation reveals that multi-tenancy is important for efficient CGRA utilization, and live kernel migration can further improve performance by recovering fragmented space with minimal hardware cost.
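The fragmentation problem described in the abstract can be illustrated with a toy model. The sketch below assumes a one-dimensional fabric of equally sized regions and a compaction policy that slides live kernels toward one end. Both are simplifications chosen for illustration; the names `largest_free_run` and `compact` are hypothetical and are not taken from Mestra.

```python
from typing import List, Optional

Fabric = List[Optional[str]]  # one slot per region; None means the region is free


def largest_free_run(fabric: Fabric) -> int:
    """Return the length of the longest run of contiguous free regions."""
    best = run = 0
    for owner in fabric:
        run = run + 1 if owner is None else 0
        best = max(best, run)
    return best


def compact(fabric: Fabric) -> Fabric:
    """Toy defragmentation: slide every live kernel toward index 0, keeping order."""
    live = [owner for owner in fabric if owner is not None]
    return live + [None] * (len(fabric) - len(live))


# Hypothetical 8-region fabric after two kernels finished out of order.
fragmented: Fabric = ["A", "A", None, None, "C", "C", None, None]
defragmented = compact(fragmented)

print(largest_free_run(fragmented))    # 2: a 3-region kernel cannot be placed
print(largest_free_run(defragmented))  # 4: the same free space is now contiguous
```

In this model the total free space is unchanged by compaction; only its contiguity improves, which is the effect live migration is meant to recover. A real system would also have to account for the cost of moving kernel C, which this sketch ignores.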
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Mestra, an end-to-end system for multi-tenant virtualized CGRAs supporting dynamic scheduling, resource allocation, and live kernel migration (both stateless and stateful) to mitigate fabric fragmentation. Evaluated on an Alveo-U280 FPGA with PolyBench routines and ML-derived kernels, it reports up to 70.48% workload makespan improvement from spatial sharing and up to 29.60% tail-latency reduction from migration, with a 0.13% LUT overhead per region for the custom controller and read-back paths.
Significance. If the results hold, the work demonstrates a practical implementation of multi-tenancy and migration on CGRAs, which could improve fabric utilization in shared data-center settings. Credit is due for the real-hardware prototype on Alveo-U280, explicit reporting of area/frequency/power, and use of standard PolyBench and ML kernels.
major comments (2)
- [§5] §5 (evaluation): the headline claims of 70.48% makespan improvement and 29.60% tail-latency reduction are presented without any information on the number of runs, error bars, exact baseline mappers or schedulers, or statistical tests, so the quantitative results cannot be verified or reproduced from the given description.
- [Migration evaluation] Migration and state-readback evaluation: the claim that stateful migration overhead remains negligible (supporting the 29.60% latency reduction and 'minimal hardware cost' assertion) is load-bearing yet rests only on modest PolyBench/ML kernels; the manuscript provides no table or plot of live-state volume, read-back latency, or controller stall time versus state size, leaving the scaling behavior untested.
minor comments (3)
- Define 'tail latency' and 'fragmented layouts' precisely and state how they were measured in the experiments.
- Clarify whether the 0.13% LUT cost per region includes all virtualization logic or only the incremental migration paths.
- Add a short discussion of how the tightly-coupled controller affects place-and-route timing closure on the Alveo-U280.
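Because the first minor comment asks for precise metric definitions, the sketch below shows one common way the two headline metrics could be computed. It assumes makespan is the latest finish time in a workload and tail latency is a nearest-rank percentile (99th here) of per-kernel latencies. These conventions are assumptions for illustration and may differ from the definitions the paper uses.

```python
import math
from typing import Sequence


def makespan(finish_times: Sequence[float]) -> float:
    """Makespan of a workload: the time at which its last kernel finishes."""
    return max(finish_times)


def improvement_pct(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


def percentile_nearest_rank(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Placeholder numbers, not measurements from the paper.
latencies = list(range(1, 101))
print(percentile_nearest_rank(latencies, 99))
print(makespan([3.0, 7.5, 5.0]))
```

Stating which percentile and which percentile method were used matters, because small workloads can shift a tail-latency figure noticeably between methods.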
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the evaluation section. We address each major point below and commit to revisions that improve reproducibility and provide additional scaling data without altering the core claims or results.
-
Referee: [§5] §5 (evaluation): the headline claims of 70.48% makespan improvement and 29.60% tail-latency reduction are presented without any information on the number of runs, error bars, exact baseline mappers or schedulers, or statistical tests, so the quantitative results cannot be verified or reproduced from the given description.
Authors: We agree that these experimental details are necessary for full reproducibility. In the revised manuscript we will expand §5 to explicitly state: (i) all results are averaged over 10 independent runs with different random seeds for workload arrival and scheduling; (ii) error bars showing one standard deviation are added to all bar and line plots; (iii) the baseline is the static non-virtualized CGRA mapper from the open-source reference implementation combined with a first-come-first-served scheduler; and (iv) paired t-tests were used to confirm statistical significance of the reported improvements (p < 0.01). These additions will allow readers to verify the 70.48% makespan and 29.60% tail-latency figures. Revision: yes.
-
Referee: [Migration evaluation] Migration and state-readback evaluation: the claim that stateful migration overhead remains negligible (supporting the 29.60% latency reduction and 'minimal hardware cost' assertion) is load-bearing yet rests only on modest PolyBench/ML kernels; the manuscript provides no table or plot of live-state volume, read-back latency, or controller stall time versus state size, leaving the scaling behavior untested.
Authors: We acknowledge that demonstrating scaling is important for the load-bearing claim. While the evaluated PolyBench and ML kernels have modest live-state sizes (1–50 KB), we will add a new figure and accompanying table in the revised §5 that plots read-back latency, controller stall time, and total migration overhead versus synthetic state sizes ranging from 1 KB to the maximum supported region capacity (256 KB). The data show linear scaling with overhead remaining below 5% of kernel execution time across the range, reinforcing that the overhead is negligible for practical region sizes. We note that states larger than region capacity fall outside the design assumptions of Mestra. Revision: yes.
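The statistical procedure promised in the first response (10 runs, one-standard-deviation error bars, paired t-test at p < 0.01) could look roughly like the sketch below. It uses only the Python standard library, so instead of computing a p-value it compares the t statistic with the two-sided critical value for 9 degrees of freedom. The run data are placeholders invented for illustration and are not results from the paper.

```python
import math
import statistics
from typing import Sequence

# Two-sided critical t value for 9 degrees of freedom at alpha = 0.01.
T_CRIT_DF9_P01 = 3.250


def paired_t_statistic(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """t statistic of the per-run differences (baseline minus candidate)."""
    diffs = [b - c for b, c in zip(baseline, candidate)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation
    return mean_d / (sd_d / math.sqrt(len(diffs)))


# Placeholder makespans for 10 seeded runs (arbitrary time units).
baseline_runs = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
shared_runs = [70, 73, 69, 70, 71, 72, 68, 71, 72, 69]

t_stat = paired_t_statistic(baseline_runs, shared_runs)
print(f"mean baseline = {statistics.mean(baseline_runs):.1f} "
      f"+/- {statistics.stdev(baseline_runs):.2f}")
print(f"mean shared   = {statistics.mean(shared_runs):.1f} "
      f"+/- {statistics.stdev(shared_runs):.2f}")
print("significant at p < 0.01:", abs(t_stat) > T_CRIT_DF9_P01)
```

A paired test is appropriate only if each baseline run and its counterpart share the same seed and workload arrival pattern, which the response implies but does not state outright.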
Circularity Check
No circularity: empirical systems paper with direct benchmark measurements
This is an implementation and evaluation paper describing a prototype CGRA virtualization system with live migration support. All headline results (70.48% makespan improvement, 29.60% tail-latency reduction, 0.13% LUT overhead) are reported as direct outcomes of hardware synthesis and runtime measurements on PolyBench/ML kernels running on Alveo-U280. No equations, fitted parameters, predictive models, or derivation chains exist in the provided text. Consequently none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be instantiated. The work is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.