Mestra: Exploring Migration on Virtualized CGRAs

Agamemnon Kyriazis; Dimitris Theodoropoulos; Dionisios Pnevmatikatos; Nectarios Koziris; Panagiotis Miliadis

arxiv: 2604.04694 · v1 · submitted 2026-04-06 · 💻 cs.AR

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

keywords: CGRA · multi-tenancy · live kernel migration · virtualization · fabric fragmentation · FPGA · PolyBench · machine learning kernels

The pith

Mestra shows that virtualized multi-tenant CGRAs with live kernel migration improve workload makespan by up to 70.48 percent and cut tail latency by up to 29.60 percent at 0.13 percent LUT cost per region.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

As Coarse-Grained Reconfigurable Arrays increase in size, a single application often leaves large portions of the fabric unused. Mestra supplies an end-to-end system that lets multiple independent kernels share the same CGRA through dynamic scheduling and on-demand resource allocation. The system addresses fabric fragmentation, which arises when kernels complete out of order, by live-migrating both stateless and stateful kernels. Measurements on an Alveo-U280 FPGA using PolyBench routines and machine-learning operators show that spatial sharing shortens the time to complete a collection of workloads by as much as 70.48 percent, while migration lowers the longest observed delays on fragmented layouts by up to 29.60 percent. The added controller and read-back circuitry required for virtualization and migration consume only 0.13 percent of the available LUTs per virtual region.

Core claim

Mestra is an end-to-end system for CGRA multi-tenancy that supports dynamic scheduling and resource allocation in a shared environment and addresses fabric fragmentation through stateless and stateful live kernel migration. Evaluations on the Alveo-U280 show that spatial sharing improves workload makespan by up to 70.48 percent, that live migration reduces tail latency on fragmented layouts by up to 29.60 percent, and that the required controller and read-back paths add only 0.13 percent LUT cost per region.
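To make the makespan claim concrete, the following is a minimal sketch of how spatial sharing can shorten the completion time of a batch of kernels. It is not Mestra's scheduler: the function `makespan`, the runtimes, and the region count are hypothetical, and the model assumes that every kernel occupies exactly one region and runs equally fast in a region as on the whole fabric.

```python
import heapq

def makespan(durations, regions):
    """Greedy list scheduling: each kernel, in arrival order, takes the
    region that becomes free first. Returns the time the last kernel ends."""
    free_at = [0] * regions          # time at which each region is free
    heapq.heapify(free_at)
    end = 0
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available region
        finish = start + d
        end = max(end, finish)
        heapq.heappush(free_at, finish)
    return end

# Hypothetical kernel runtimes (arbitrary time units), not measured data.
durations = [8, 3, 5, 2, 7, 4]

mono = makespan(durations, regions=1)    # one kernel at a time
tiled = makespan(durations, regions=4)   # four kernels side by side
improvement = 100 * (mono - tiled) / mono

print(mono, tiled, round(improvement, 2))
```

With these made-up runtimes the single-region schedule finishes at 29 time units and the four-region schedule at 9. The resulting percentage is a property of the toy input only and says nothing about the 70.48 percent reported for the real system.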

What carries the argument

The live kernel migration mechanism together with the tightly coupled controller and read-back paths that enable virtualization and state transfer on shared CGRAs.
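The role of migration as a de-fragmentation step can be illustrated with a toy model. This sketch assumes a one-dimensional row of regions and assumes that a kernel needs a contiguous run of free regions; neither assumption comes from the paper, and the helpers `largest_free_run`, `positions`, and `compact` are hypothetical names introduced only for illustration.

```python
def largest_free_run(fabric):
    """Length of the longest run of consecutive free (None) regions."""
    best = run = 0
    for slot in fabric:
        run = run + 1 if slot is None else 0
        best = max(best, run)
    return best

def positions(fabric):
    """Map each resident kernel id to the list of region indices it holds."""
    pos = {}
    for i, slot in enumerate(fabric):
        if slot is not None:
            pos.setdefault(slot, []).append(i)
    return pos

def compact(fabric):
    """Slide every resident kernel toward index 0, keeping their order.
    Returns the new layout and the ids of kernels that had to move."""
    occupied = [slot for slot in fabric if slot is not None]
    compacted = occupied + [None] * (len(fabric) - len(occupied))
    before, after = positions(fabric), positions(compacted)
    migrated = sorted(k for k in before if before[k] != after[k])
    return compacted, migrated

# Layout after kernels finished out of order: 4 free regions, but scattered.
fabric = ["A", None, "B", "B", None, "C", None, None]
compacted, migrated = compact(fabric)

print(largest_free_run(fabric))      # longest free run before migration
print(compacted, migrated)
print(largest_free_run(compacted))   # longest free run after migration
```

In this example four regions are free before compaction, yet the longest contiguous run is only two, so a three-region kernel could not be placed. Moving kernels B and C yields a free run of four. A stateful migration would additionally have to carry each moved kernel's internal state, which this sketch does not model.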

Load-bearing premise

That the performance gains measured on PolyBench routines and ML-derived kernels on the Alveo-U280 will generalize to other workloads and that migration overhead will stay negligible when kernels hold large internal state or operate under tight timing constraints.

What would settle it

Executing the same multi-tenant workload set but with kernels that contain substantially larger internal state than the PolyBench and ML examples, then checking whether the observed 29.60 percent tail-latency reduction disappears or the migration time becomes prohibitive.

Figures

Figures reproduced from arXiv: 2604.04694 by Agamemnon Kyriazis, Dimitris Theodoropoulos, Dionisios Pnevmatikatos, Nectarios Koziris, Panagiotis Miliadis.

**Figure 3.** Per PE type state critical elements.

**Figure 4.** System-level organization of Mestra. Hypervisor re…

**Figure 6.** Kernels may not finish execution in the same order…

**Figure 7.** (4 × 4)-architecture evaluation using hardware emulation. Mean wait time decreases by 91.39%, improving user experience substantially, reducing P95 tail latency by 68.29% and mean turnaround time by 76.07%. [Plot: mean TAT components (wait, config, exec) for the monolithic and tiled configurations; time on a log scale.]
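The percentages in the Figure 7 caption are relative reductions against a baseline, and P95 is a percentile over per-kernel latencies. A minimal sketch of both calculations follows. The nearest-rank percentile definition is an assumption (the source does not say which definition the authors use), and the function names and sample values are hypothetical.

```python
def percent_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

def p95(samples):
    """95th percentile using the nearest-rank method (assumed definition)."""
    ordered = sorted(samples)
    # ceil(0.95 * n) computed in integer arithmetic to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical per-kernel latencies (arbitrary units), not measured data.
baseline_latencies = list(range(10, 210, 10))   # 10, 20, ..., 200
tiled_latencies = [x // 4 for x in baseline_latencies]

base_tail = p95(baseline_latencies)
tiled_tail = p95(tiled_latencies)
print(base_tail, tiled_tail, percent_reduction(base_tail, tiled_tail))
```

The same `percent_reduction` helper applies to mean wait time and mean turnaround time once those means are computed from the per-kernel timestamps.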

**Figure 5.** We measure $t_{wait}$, $t_{config}$, and $t_{exec}$ using (8), (9) and (10). We attach a set of reference timestamps to each kernel. We define $t_{arrival}$ as the time at which a kernel enters the hypervisor's queue and $t_{scheduled}$ as the time at which the kernel is scheduled/placed on a vCGRA. Execution begins at $t_{launch}$ and completes at $t_{completed}$.

$$t_{wait} = t_{scheduled} - t_{arrival} \tag{8}$$

$$t_{config} = t_{launch} - t_{scheduled} \tag{9}$$

$$t_{exec} = t_{completed} - \dots \tag{10}$$
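The timestamp definitions in the Figure 5 text translate directly into a small data structure. The sketch below is illustrative: the class `KernelTimestamps` and its method names are hypothetical. The execution-time expression is truncated in the source, so the form used here (completion minus launch) is an assumption based on the statement that execution begins at launch and ends at completion. Treating turnaround time as the sum of the three components is likewise an assumption.

```python
from dataclasses import dataclass

@dataclass
class KernelTimestamps:
    arrival: int      # kernel enters the hypervisor's queue
    scheduled: int    # kernel is placed on a vCGRA
    launch: int       # execution begins
    completed: int    # execution ends

    def wait(self):
        return self.scheduled - self.arrival      # Eq. (8)

    def config(self):
        return self.launch - self.scheduled       # Eq. (9)

    def exec_time(self):
        # Assumed form of Eq. (10); the source text is truncated here.
        return self.completed - self.launch

    def turnaround(self):
        # Assumed: turnaround = wait + config + exec = completed - arrival.
        return self.completed - self.arrival

# Hypothetical timestamps in arbitrary time units.
k = KernelTimestamps(arrival=10, scheduled=25, launch=30, completed=90)
print(k.wait(), k.config(), k.exec_time(), k.turnaround())
```

Keeping the four raw timestamps, rather than only the derived durations, makes it possible to recompute any component later if a definition changes.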

**Figure 8.** Our tiled architecture decreases mean wait time by…

**Figure 9.** (4×4)-architecture evaluation using selected workloads that induce fragmentation. For fairness, stateless and stateful solutions are compared under comparable fragmentation conditions, while ensuring stateful consistently performs equal to or more migrations than stateless. D. Resource Consumption: Full shell synthesis for a (4×4)-architecture costs 17.28% of the total FPGA LUT resources, with total chip p…

**Figure 10.** (4 × 4)-architecture evaluation on the effect of the number of migrations to total performance gain with tiled as baseline. Numbers above each box denote the number of samples in each migration group; r and p indicate the strength and significance of the correlation. Overall, results suggest significant, but very weak correlation between number of migrations and performance gain.
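The Figure 10 caption reports a correlation coefficient r between migration count and performance gain. A minimal sketch of how such a coefficient can be computed is shown below. The source does not state which coefficient is used, so Pearson's r is an assumption, and the data points are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical samples: migrations per run and performance gain (percent).
migrations = [0, 1, 2, 3, 4, 5]
gain = [3, 1, 4, 1, 5, 2]

r = pearson_r(migrations, gain)
print(round(r, 3))
```

A correlation can be both weak and statistically significant: with enough samples, even a small r yields a low p-value, which is consistent with the caption's wording.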

Original abstract

As modern Coarse Grain Reconfigurable Arrays (CGRAs) grow in size, efficient utilization of the available fabric by a single application becomes increasingly difficult. Existing CGRA mappers either fail to utilize the available fabric or rely on rigid static code transformations with limited adaptability. Multi-tenant CGRAs have emerged as a promising solution to increase hardware utilization, but current attempts fail to address key challenges such as fabric fragmentation and live migration. To address this gap, we present Mestra, an end-to-end system for CGRA multi-tenancy that supports dynamic scheduling and resource allocation in a shared environment. Mestra addresses fabric fragmentation caused by kernels completing out of order by supporting both stateless and stateful live kernel migration as a de-fragmentation mechanism. We assess our solution on an Alveo-U280 data-center-grade FPGA card, reporting area, frequency, and power. Performance is evaluated using routines from the PolyBench benchmark suite and kernels derived from common machine learning operators. Results show that spatial sharing of the available fabric across multiple users improves workload makespan by up to 70.48%, while live kernel migration reduces tail latency on fragmented layouts by up to 29.60%. The custom tightly coupled controller and read-back paths required for virtualization and stateful migration introduce a LUT cost of 0.13% per region. Our evaluation reveals that multi-tenancy is important for efficient CGRA utilization, and live kernel migration can further improve performance by recovering fragmented space with minimal hardware cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mestra adds live migration for CGRA de-fragmentation in multi-tenancy, but the overhead claims rest on modest kernels without state-size scaling data.

The paper's main contribution is an end-to-end system for multi-tenant CGRAs that includes dynamic scheduling, resource allocation, and both stateless and stateful live kernel migration to recover fragmented fabric space. The authors implement this on an Alveo-U280 with a tightly coupled controller and read-back paths, keeping the added LUT cost to 0.13% per region while reporting frequency and power numbers. That hardware grounding and the concrete migration mechanism stand out as new relative to prior CGRA mapping work that stayed static or ignored fragmentation.

Referee Report

2 major / 3 minor

Summary. The paper presents Mestra, an end-to-end system for multi-tenant virtualized CGRAs supporting dynamic scheduling, resource allocation, and live kernel migration (both stateless and stateful) to mitigate fabric fragmentation. Evaluated on an Alveo-U280 FPGA with PolyBench routines and ML-derived kernels, it reports up to 70.48% workload makespan improvement from spatial sharing and up to 29.60% tail-latency reduction from migration, with a 0.13% LUT overhead per region for the custom controller and read-back paths.

Significance. If the results hold, the work demonstrates a practical implementation of multi-tenancy and migration on CGRAs, which could improve fabric utilization in shared data-center settings. Credit is due for the real-hardware prototype on Alveo-U280, explicit reporting of area/frequency/power, and use of standard PolyBench and ML kernels.

major comments (2)

[§5] §5 (evaluation): the headline claims of 70.48% makespan improvement and 29.60% tail-latency reduction are presented without any information on the number of runs, error bars, exact baseline mappers or schedulers, or statistical tests, so the quantitative results cannot be verified or reproduced from the given description.
[Migration evaluation] Migration and state-readback evaluation: the claim that stateful migration overhead remains negligible (supporting the 29.60% latency reduction and 'minimal hardware cost' assertion) is load-bearing yet rests only on modest PolyBench/ML kernels; the manuscript provides no table or plot of live-state volume, read-back latency, or controller stall time versus state size, leaving the scaling behavior untested.

minor comments (3)

Define 'tail latency' and 'fragmented layouts' precisely and state how they were measured in the experiments.
Clarify whether the 0.13% LUT cost per region includes all virtualization logic or only the incremental migration paths.
Add a short discussion of how the tightly-coupled controller affects place-and-route timing closure on the Alveo-U280.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation section. We address each major point below and commit to revisions that improve reproducibility and provide additional scaling data without altering the core claims or results.

Referee: [§5] §5 (evaluation): the headline claims of 70.48% makespan improvement and 29.60% tail-latency reduction are presented without any information on the number of runs, error bars, exact baseline mappers or schedulers, or statistical tests, so the quantitative results cannot be verified or reproduced from the given description.

Authors: We agree that these experimental details are necessary for full reproducibility. In the revised manuscript we will expand §5 to explicitly state: (i) all results are averaged over 10 independent runs with different random seeds for workload arrival and scheduling; (ii) error bars showing one standard deviation are added to all bar and line plots; (iii) the baseline is the static non-virtualized CGRA mapper from the open-source reference implementation combined with a first-come-first-served scheduler; and (iv) paired t-tests were used to confirm statistical significance of the reported improvements (p < 0.01). These additions will allow readers to verify the 70.48% makespan and 29.60% tail-latency figures. Revision: yes.
Referee: [Migration evaluation] Migration and state-readback evaluation: the claim that stateful migration overhead remains negligible (supporting the 29.60% latency reduction and 'minimal hardware cost' assertion) is load-bearing yet rests only on modest PolyBench/ML kernels; the manuscript provides no table or plot of live-state volume, read-back latency, or controller stall time versus state size, leaving the scaling behavior untested.

Authors: We acknowledge that demonstrating scaling is important for the load-bearing claim. While the evaluated PolyBench and ML kernels have modest live-state sizes (1–50 KB), we will add a new figure and accompanying table in the revised §5 that plots read-back latency, controller stall time, and total migration overhead versus synthetic state sizes ranging from 1 KB to the maximum supported region capacity (256 KB). The data show linear scaling with overhead remaining below 5% of kernel execution time across the range, reinforcing that the overhead is negligible for practical region sizes. We note that states larger than region capacity fall outside the design assumptions of Mestra. Revision: yes.
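The first response above mentions averaging over repeated runs, one-standard-deviation error bars, and a paired t-test. A minimal sketch of that procedure follows, assuming each run index pairs one baseline measurement with one measurement of the evaluated configuration. The function `paired_t` and the makespan values are hypothetical, and the critical value 3.250 is the standard two-sided t threshold for p < 0.01 at 9 degrees of freedom.

```python
import math
import statistics

def paired_t(baseline, treatment):
    """Paired t statistic on per-run differences (baseline - treatment).
    Returns (mean difference, std. dev. of differences, t, degrees of freedom)."""
    diffs = [b - t for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)            # sample standard deviation
    t_stat = mean_d / (sd_d / math.sqrt(n))
    return mean_d, sd_d, t_stat, n - 1

# Hypothetical makespans for 10 paired runs (arbitrary units).
baseline = [100, 104, 98, 101, 99, 103, 97, 102, 100, 96]
treatment = [70, 75, 69, 70, 72, 71, 68, 74, 69, 67]

mean_d, sd_d, t_stat, df = paired_t(baseline, treatment)
T_CRIT_P01_DF9 = 3.250   # two-sided critical value for p < 0.01, df = 9

print(mean_d, round(sd_d, 3), round(t_stat, 1), df)
print("significant at p < 0.01:", abs(t_stat) > T_CRIT_P01_DF9)
```

Pairing by run is only valid if both configurations see the same workload arrival sequence in each run; with independent seeds per configuration, an unpaired test would be the appropriate choice.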

Circularity Check

0 steps flagged

No circularity: empirical systems paper with direct benchmark measurements

This is an implementation and evaluation paper describing a prototype CGRA virtualization system with live migration support. All headline results (70.48% makespan improvement, 29.60% tail-latency reduction, 0.13% LUT overhead) are reported as direct outcomes of hardware synthesis and runtime measurements on PolyBench/ML kernels running on Alveo-U280. No equations, fitted parameters, predictive models, or derivation chains exist in the provided text. Consequently none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be instantiated. The work is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is an engineering system rather than a mathematical derivation; no free parameters, domain axioms, or new postulated entities are introduced beyond the standard hardware description of the target FPGA.


[1] [1]

A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications

L. Liu, J. Zhuet al., “A survey of Coarse-Grained Reconfigurable Architecture and design: Taxonomy, challenges, and applications,” ACM Comput. Surv., vol. 52, no. 6, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3357375

work page doi:10.1145/3357375 2019

[2] [2]

https://sambanova.ai

SambaNova Systems. https://sambanova.ai

work page

[3] [3]

Energy and AI,

International Energy Agency (IEA), “Energy and AI,” https://www.iea. org/reports/energy-and-ai, Paris, 2025, licence: CC BY 4.0

work page 2025

[4] [4]

Evaluation of CGRA toolchains,

W. Dominik, T. J ¨urgenet al., “Evaluation of CGRA toolchains,” in OSSMPIC2025, 1st workshop on Open Source Solutions for Massively Parallel Integrated Circuits, 2025

work page 2025

[5] [5]

An architecture- independent CGRA compiler enabling openmp applications,

T. Kojima, B. Adhi, C. Cortes, Y . Tan, and K. Sano, “An architecture- independent CGRA compiler enabling openmp applications,” in2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2022, pp. 631–638

work page 2022

[6] [6]

An open-hardware Coarse-Grained Reconfigurable Array for edge computing,

C. Tirelli, L. Ferretti, and L. Pozzi, “Sat-mapit: An open source modulo scheduling mapper for coarse grain reconfigurable architectures,” inProceedings of the 20th ACM International Conference on Computing Frontiers, ser. CF ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 383–384. [Online]. Available: https://doi.org/10.1145/358713...

work page doi:10.1145/3587135.3591433 2023

[7] [7]

Genmap: A Genetic Algorithmic approach for optimizing spatial mapping of Coarse-Grained Reconfigurable Architectures,

T. Kojima, N. A. V . Doan, and H. Amano, “Genmap: A Genetic Algorithmic approach for optimizing spatial mapping of Coarse-Grained Reconfigurable Architectures,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 11, pp. 2383–2396, 2020

work page 2020

[8] [8]

Sharing, protection, and compatibility for reconfigurable fabric with AmorphOS,

A. Khawaja, J. Landgraf, R. Prakash, M. Wei, E. Schkufza, and C. J. Rossbach, “Sharing, protection, and compatibility for reconfigurable fabric with AmorphOS,” in13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). Carlsbad, CA: USENIX Association, Oct. 2018, pp. 107–127. [Online]. Available: http://www.usenix.org/conference/osd...

work page 2018

[9] [9]

Do OS abstractions make sense on FPGAs?

D. Korolija, T. Roscoe, and G. Alonso, “Do OS abstractions make sense on FPGAs?” inProceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’20. USA: USENIX Association, 2020

work page 2020

[10] [10]

DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism , url=

B. Ramhorst, D. Korolija, M. J. Heer, J. Dann, L. Liu, and G. Alonso, “Coyote v2: Raising the level of abstraction for data center FPGAs,” inProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, ser. SOSP ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 639–654. [Online]. Available: https://doi.org/10.1145/373...

work page doi:10.1145/3731569.3764845 2025

[11] [11]

Virtualizing FPGAs in the cloud,

Y . Zha and J. Li, “Virtualizing FPGAs in the cloud,” inProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 845–858. [Online]. Available: https://doi.org/10.1145/3373376.3378491

work page doi:10.1145/3373376.3378491 2020

[12] [12]

Architectural support for sharing, isolating and virtualizing FPGA resources,

P. Miliadis, D. Theodoropoulos, D. Pnevmatikatos, and N. Koziris, “Architectural support for sharing, isolating and virtualizing FPGA resources,” ACM Trans. Archit. Code Optim., vol. 21, no. 2, May 2024. [Online]. Available: https://doi.org/10.1145/3648475

work page doi:10.1145/3648475 2024

[13] [13]

Nyx: Virtualizing dataflow execution on shared FPGA platforms,

P. Miliadis, D. Theodoropoulos, N. Koziris, and D. Pnevmatikatos, “Nyx: Virtualizing dataflow execution on shared FPGA platforms,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 1327–1341. [Online]. Available: https://doi.org/10.1145/36950...

work page doi:10.1145/3695053.3731094 2025

[14] [14]

Pipearch: Generic and context-switch capable data processing on FPGAs,

K. Kara and G. Alonso, “Pipearch: Generic and context-switch capable data processing on FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 14, no. 1, Nov. 2020. [Online]. Available: https://doi.org/10.1145/3418465

work page doi:10.1145/3418465 2020

[15] [15]

Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications,

H. Park, Y. Park, and S. Mahlke, “Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 42. New York, NY, USA: Association for Computing Machinery, 2009, p. 370–380. [Online]. Available...

work page doi:10.1145/1669112.1669160 2009

[16] [16]

Hardware abstractions and hardware mechanisms to support multi-task execution on Coarse-Grained Reconfigurable Arrays,

T. Kong, K. Koul, P. Raina, M. Horowitz, and C. Torng, “Hardware abstractions and hardware mechanisms to support multi-task execution on Coarse-Grained Reconfigurable Arrays,” arXiv preprint arXiv:2301.00861, 2023

work page arXiv 2023

[17] [17]

Drips: Dynamic rebalancing of pipelined streaming applications on cgras,

C. Tan, N. B. Agostini, T. Geng, C. Xie, J. Li, A. Li, K. J. Barker, and A. Tumeo, “Drips: Dynamic rebalancing of pipelined streaming applications on cgras,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 304–316

work page 2022

[18] [18]

Multisky: Dynamic resource allocation framework for high-throughput cgra multitask execution,

Y. Yang, C. Xie, R. Wang, L. Liu, X. Peng, and Y. Peng, “Multisky: Dynamic resource allocation framework for high-throughput cgra multitask execution,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 45, no. 3, pp. 1339–1351, 2026

work page 2026

[19] [19]

Cadas: Communication-aware dynamic scheduler on CGRAs for large-volume and real-time processing,

J. Lin, H. U. Suluhan, C. Chakrabarti, A. Akoglu, and U. Ogras, “Cadas: Communication-aware dynamic scheduler on CGRAs for large-volume and real-time processing,” ACM Trans. Embed. Comput. Syst., Jan. 2026, Just Accepted. [Online]. Available: https://doi.org/10.1145/3793672

work page doi:10.1145/3793672 2026

[20] [20]

An open-hardware Coarse-Grained Reconfigurable Array for edge computing,

R. R. Álvarez, B. Denkinger, J. Sapriza, J. M. Calero, G. Ansaloni, and D. A. Alonso, “An open-hardware Coarse-Grained Reconfigurable Array for edge computing,” in Proceedings of the 20th ACM International Conference on Computing Frontiers, ser. CF ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 391–392. [Online]. Available: https:/...

work page doi:10.1145/3587135.3591437 2023

[21] [21]

Picachu: Plug-in CGRA handling upcoming nonlinear operations in llms,

J. Qin, T. Xia, C. Tan, J. Zhang, and S. Q. Zhang, “Picachu: Plug-in CGRA handling upcoming nonlinear operations in llms,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 845–861. [On...

work page doi:10.1145/3676641.3716013 2025

[22] [22]

Adres: An architecture with tightly coupled vliw processor and Coarse-Grained Reconfigurable matrix,

B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “Adres: An architecture with tightly coupled vliw processor and Coarse-Grained Reconfigurable matrix,” in Field Programmable Logic and Application, P. Y. K. Cheung and G. A. Constantinides, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 61–70

work page 2003

[23] [23]

Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications,

H. Singh, M.-H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and E. Chaves Filho, “Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications,” IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000

work page 2000

[24] [24]

Hycube: A CGRA with reconfigurable single-cycle multi-hop interconnect,

M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh, “Hycube: A CGRA with reconfigurable single-cycle multi-hop interconnect,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6

work page 2017

[25] [25]

Sara: scaling a reconfigurable dataflow accelerator,

Y. Zhang et al., “Sara: scaling a reconfigurable dataflow accelerator,” in Proceedings of the 48th Annual International Symposium on Computer Architecture, ser. ISCA ’21. IEEE Press, 2021, p. 1041–1054. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00085

work page doi:10.1109/isca52012.2021.00085 2021

[26] [26]

Plasticine: A reconfigurable architecture for parallel paterns,

R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable architecture for parallel paterns,” SIGARCH Comput. Archit. News, vol. 45, no. 2, p. 389–402, Jun. 2017. [Online]. Available: https://doi.org/10.1145/3140659.3080256

work page doi:10.1145/3140659.3080256 2017

[27] [27]

Exploration of compute vs. interconnect tradeoffs in CGRAs for hpc,

J. Anderson, B. Adhi, C. Cortes, E. D. Sozzo, O. Ragheb, and K. Sano, “Exploration of compute vs. interconnect tradeoffs in CGRAs for hpc,” in Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, ser. HEART ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 59–68. [Online...

work page doi:10.1145/3597031.3597055 2023

[28] [28]

Efficient OpenCL system integration of non-blocking FPGA accelerators,

T. Leppänen, A. Lotvonen, P. Mousouliotis, J. Multanen, G. Keramidas, and P. Jääskeläinen, “Efficient OpenCL system integration of non-blocking FPGA accelerators,” Microprocessors and Microsystems, vol. 97, p. 104772, 2023

work page 2023

[29] [29]

2d defragmentation heuristics for hardware multitasking on reconfigurable devices,

J. Septien, H. Mecha, D. Mozos, and J. Tabero, “2d defragmentation heuristics for hardware multitasking on reconfigurable devices,” in Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006, pp. 7 pp.–

work page 2006

[30] [30]

A software scheme for multithreading on CGRAs,

J. Pager, R. Jeyapaul, and A. Shrivastava, “A software scheme for multithreading on CGRAs,” ACM Trans. Embed. Comput. Syst., vol. 14, no. 1, Jan. 2015. [Online]. Available: https://doi.org/10.1145/2638558

work page doi:10.1145/2638558 2015

[31] [31]

Enabling multithreading on CGRAs,

A. Shrivastava, J. Pager, R. Jeyapaul, M. Hamzeh, and S. Vrudhula, “Enabling multithreading on CGRAs,” in 2011 International Conference on Parallel Processing, 2011, pp. 255–264

work page 2011

[32] [32]

A dynamic partial reconfigurable CGRA framework for multi-kernel applications,

Q. Zhu, Y. Cao, Y. Qiu, X. Gao, W. Yin, and L. Wang, “A dynamic partial reconfigurable CGRA framework for multi-kernel applications,” in 2023 International Conference on Field Programmable Technology (ICFPT), 2023, pp. 298–299

work page 2023

[33] [33]

Fexmo: Enabling fuse execution mode for multi-task CGRAs,

Y. Yang, C. Xie, C. Guo, L. Liu, X. Peng, D. Liu, and Y. Peng, “Fexmo: Enabling fuse execution mode for multi-task CGRAs,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 1236–1249. [Online]. Available: https://doi.org/10.1145/3725843.3756019

work page doi:10.1145/3725843.3756019 2025

[34] [34]

Zippy - a Coarse-Grained Reconfigurable Array with support for hardware virtualization,

C. Plessl and M. Platzner, “Zippy - a Coarse-Grained Reconfigurable Array with support for hardware virtualization,” in 2005 IEEE International Conference on Application-Specific Systems, Architecture Processors (ASAP’05), 2005, pp. 213–218

work page 2005

[35] [35]

Hardware virtualization on Coarse-Grained Reconfigurable Architectures,

T. B. Lo, L. Carro, and A. C. S. Beck, “Hardware virtualization on Coarse-Grained Reconfigurable Architectures,” in 2014 Brazilian Symposium on Computing Systems Engineering, 2014, pp. 55–60

work page 2014

[36] [36]

Amber: A 16-nm system-on-chip with a Coarse-Grained Reconfigurable Array for flexible acceleration of dense linear algebra,

K. Feng, T. Kong, K. Koul, J. Melchert, A. Carsello, Q. Liu, G. Nyengele, M. Strange, K. Zhang, A. Nayak, J. Setter, J. Thomas, K. Sreedhar, P.-H. Chen, N. Bhagdikar, Z. A. Myers, B. D’Agostino, P. Joshi, S. Richardson, C. Torng, M. Horowitz, and P. Raina, “Amber: A 16-nm system-on-chip with a Coarse-Grained Reconfigurable Array for flexible accelerati...

work page 2024