D\'ej\`a Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization
Pith reviewed 2026-05-07 17:26 UTC · model grok-4.3
The pith
A signature tree memoizes repeated cluster legality checks to accelerate FPGA packing several times over.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a packing signature tree data structure can identify recurring patterns among attempted clusters and memoize the outcomes of their multi-source multi-sink routing legality checks. Because a large fraction of attempted clusters match one of a modest number of prior patterns, the tree avoids redundant expensive checks while preserving the exact same clustering decisions.
What carries the argument
The packing signature tree, a trie-like structure that hashes cluster signatures to store and retrieve legality results for repeated packing patterns.
Load-bearing premise
That enough attempted clusters match previous patterns for the signature tree to skip most checks without using too much memory or missing any pattern variations that would change a legality decision.
What would settle it
Running the packer on a new set of large benchmarks where every candidate cluster has a unique pattern; if speedups fall below 1.2x while memory grows linearly with attempts, the core premise does not hold.
Figures
read the original abstract
Implementing a digital circuit on an FPGA fabric requires clustering technology-mapped netlist primitives into coarser-granularity blocks that can be directly mapped to the physical resources available on the FPGA. As the architecture of FPGA logic blocks (LBs) has grown in complexity, with sophisticated logic elements (LEs) and highly irregular local interconnect, this packing problem has become more challenging. To ensure the feasibility of intracluster routing, the computer-aided design (CAD) tools must solve a costly multi-source multi-sink routing problem for each candidate cluster. In this paper, we first show that such packing legality checks consume a significant portion of the CAD flow runtime for LB architectures with complex LEs and local routing structures resembling modern commercial FPGAs. We demonstrate that the packing stage constitutes 58% and 94% of the entire Versatile Place and Route (VPR) flow runtime on average when mapping a wide variety of benchmarks to the AMD 7-series-like and Altera Stratix-10-like VTR architecture captures, respectively. By analyzing the packing algorithm behavior, we observe that a significant fraction of the attempted packed clusters are repetitions of a much smaller number of packing patterns, and therefore many of the packing legality checks are redundant and could be skipped. To this end, we introduce our D\'ej\`a Vu packing approach, which leverages a novel packing signature tree data structure that enables efficient identification of recurring packing patterns and memoization of their legality check outcomes. Our approach speeds up the packing by up to 13.4x and 29.3x, with an average of 3.7x and 6.9x, across the evaluated benchmarks on the 7-series and Stratix 10 architectures. These packing runtime gains result in a significant 1.6x and 5.3x average reduction in end-to-end VPR runtime, while maintaining quality of results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Déjà Vu packing, an optimization for FPGA logic clustering that observes a high fraction of repeated packing patterns during legality checks (which consume 58% and 94% of VPR runtime on two architectures). It introduces a packing signature tree data structure to identify recurring patterns and memoize their legality outcomes, claiming average packing speedups of 3.7× (7-series) and 6.9× (Stratix 10), with corresponding end-to-end VPR reductions of 1.6× and 5.3× while preserving quality of results.
Significance. If the memoization is provably sound and the empirical speedups are robust across benchmarks, the work could meaningfully accelerate FPGA CAD flows for modern complex logic blocks by eliminating redundant routing checks inside clusters. The core observation of pattern repetition and the signature-tree mechanism represent a practical algorithmic contribution to a known bottleneck.
major comments (2)
- [Section 3] Section 3 (Signature Tree Construction): The description of signature generation does not specify whether signatures are exact canonical forms, structural hashes, or include a verification step. Without this, it is impossible to rule out false-positive matches between non-identical clusters, which would silently reuse incorrect legality results and potentially degrade QoR or produce illegal packings.
- [Section 5] Section 5 (Experimental Evaluation): The reported speedups (up to 13.4×/29.3×, averages 3.7×/6.9×) and end-to-end gains lack any description of benchmark selection criteria, number of independent runs, statistical significance testing, or error bars. This makes it difficult to assess whether the results are generalizable or sensitive to post-hoc benchmark choices.
minor comments (2)
- [Abstract] The abstract states that quality of results is maintained but does not quantify the specific metrics (critical-path delay, area, or routing success rate) used to verify this claim; these should be reported with tables or figures in the evaluation section.
- [Section 3] Notation for 'packing patterns' and 'signatures' is introduced informally; a short formal definition or pseudocode in Section 3 would improve clarity for readers unfamiliar with the VPR packing algorithm.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment in detail below, committing to revisions where appropriate to improve clarity and reproducibility.
read point-by-point responses
-
Referee: [Section 3] Section 3 (Signature Tree Construction): The description of signature generation does not specify whether signatures are exact canonical forms, structural hashes, or include a verification step. Without this, it is impossible to rule out false-positive matches between non-identical clusters, which would silently reuse incorrect legality results and potentially degrade QoR or produce illegal packings.
Authors: We appreciate this point on soundness. The signature in our packing signature tree is constructed as an exact canonical form: primitives within each candidate cluster are sorted by type and position, their local connectivity is encoded as a deterministic adjacency list string, and the resulting tuple serves as the key. This is not a probabilistic hash but a structural canonicalization, guaranteeing identical signatures only for identical patterns. The tree traversal further verifies structural equivalence before reusing any memoized legality outcome. We will revise Section 3 to explicitly describe this canonicalization process, add pseudocode for signature generation, and state that no false-positive reuse is possible by design. revision: yes
-
Referee: [Section 5] Section 5 (Experimental Evaluation): The reported speedups (up to 13.4×/29.3×, averages 3.7×/6.9×) and end-to-end gains lack any description of benchmark selection criteria, number of independent runs, statistical significance testing, or error bars. This makes it difficult to assess whether the results are generalizable or sensitive to post-hoc benchmark choices.
Authors: We agree that additional methodological details will strengthen the evaluation. The benchmarks comprise the full VTR 8.0 suite (all 20+ circuits), selected because they represent the standard, diverse set used in prior FPGA packing literature and cover a range of sizes and domains. All runs used deterministic algorithms on a fixed platform, so single executions per benchmark were performed. To improve statistical transparency, we will add a new subsection describing benchmark selection, include a table of all circuits with size and runtime statistics, and report mean speedups with standard deviations obtained from five independent runs (varying placement seeds where applicable). These changes will be incorporated in the revised manuscript. revision: yes
Circularity Check
No circularity: empirical runtime optimization with direct measurements
full rationale
The paper describes an algorithmic optimization for FPGA packing that observes repeated patterns in clustering attempts and uses a signature tree for memoization. All claims rest on runtime benchmarks across architectures and benchmarks, with no equations, fitted parameters, predictions derived from inputs, or self-citations that bear the central result. Speedups (e.g., 3.7x packing) are measured outcomes, not quantities defined by the method itself. The work is self-contained against external VPR flows and does not reduce any result to its own assumptions by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A New Paradigm for FPGA Placement Without Explicit Packing,
W. Li and D. Z. Pan, “A New Paradigm for FPGA Placement Without Explicit Packing,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 38, no. 11, pp. 2113– 2126, 2019
work page 2019
-
[2]
Optimizing FPGA Logic Block Architectures for Arithmetic,
K. E. Murray, J. Luu, M. J. Walker, C. McCullough, S. Wang, S. Huda, B. Yan, C. Chiasson, K. B. Kent, J. Andersonet al., “Optimizing FPGA Logic Block Architectures for Arithmetic,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 6, pp. 1378–1391, 2020
work page 2020
-
[3]
VTR 9: Open-source CAD for Fabric and Beyond FPGA Architecture Exploration,
M. A. Elgammal, A. Mohaghegh, S. G. Shahrouz, F. Mahmoudi, F. Ko¸ sar, K. Talaei, J. Fife, D. Khadivi, K. Murray, A. Boutroset al., “VTR 9: Open-source CAD for Fabric and Beyond FPGA Architecture Exploration,”ACM Transactions on Reconfigurable Technology and Systems, vol. 18, no. 3, pp. 1–53, 2025
work page 2025
-
[4]
Koios 2.0: Open-Source Deep Learning Benchmarks for FPGA Architecture and CAD Research,
A. Arora, A. Boutros, S. A. Damghani, K. Mathur, V . Mohanty, T. Anand, M. A. Elgammal, K. B. Kent, V . Betz, and L. K. John, “Koios 2.0: Open-Source Deep Learning Benchmarks for FPGA Architecture and CAD Research,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 42, no. 11, pp. 3895–3909, 2023
work page 2023
-
[5]
Titan: Enabling Large and Complex Benchmarks in Academic CAD,
K. E. Murray, S. Whitty, S. Liu, J. Luu, and V . Betz, “Titan: Enabling Large and Complex Benchmarks in Academic CAD,” inIEEE Interna- tional Conference on Field programmable Logic and Applications (FPL), 2013, pp. 1–8
work page 2013
-
[6]
Cluster-based Logic Blocks for FPGAs: Area- Efficiency vs. Input Sharing and Size,
V . Betz and J. Rose, “Cluster-based Logic Blocks for FPGAs: Area- Efficiency vs. Input Sharing and Size,” inIEEE Custom Integrated Circuits Conference (CICC), 1997, pp. 551–554
work page 1997
-
[7]
Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density,
A. Marquardt, V . Betz, and J. Rose, “Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density,” inACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 1999, pp. 37–46
work page 1999
-
[8]
RPack: Routability-Driven Packing for Cluster-Based FPGAs,
E. Bozorgzadeh, S. Ogrenci-Memik, and M. Sarrafzadeh, “RPack: Routability-Driven Packing for Cluster-Based FPGAs,” inACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2001, pp. 629–634
work page 2001
-
[9]
Efficient Circuit Clustering for Area and Power Reduction in FPGAs,
A. Singh, G. Parthasarathy, and M. Marek-Sadowska, “Efficient Circuit Clustering for Area and Power Reduction in FPGAs,”ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 7, no. 4, pp. 643–663, 2002
work page 2002
-
[10]
Improving Timing-Driven FPGA Packing with Physical Information,
D. T. Chen, K. V orwerk, and A. Kennings, “Improving Timing-Driven FPGA Packing with Physical Information,” inIEEE International Con- ference on Field Programmable Logic and Applications (FPL), 2007, pp. 117–123
work page 2007
-
[11]
MO-Pack: Many-Objective Clustering for FPGA CAD,
S. T. Rajavel and A. Akoglu, “MO-Pack: Many-Objective Clustering for FPGA CAD,” inACM/IEEE Design Automation Conference (DAC), 2011, pp. 818–823
work page 2011
-
[12]
M. Tom, D. Leong, and G. Lemieux, “Un/DoPack: Re-clustering of Large System-on-Chip Designs with Interconnect Variation for Low- Cost FPGAs,” inIEEE/ACM International Conference on Computer- aided design (ICCAD), 2006, pp. 680–687
work page 2006
-
[13]
T-NDPack: Timing-Driven Non-Uniform Depopulation Based Clustering,
H. Liu and A. Akoglu, “T-NDPack: Timing-Driven Non-Uniform Depopulation Based Clustering,” inIEEE Southern Conference on Programmable Logic (SPL), 2009, pp. 9–14
work page 2009
-
[14]
Multilevel K-Way Hypergraph Partitioning,
G. Karypis and V . Kumar, “Multilevel K-Way Hypergraph Partitioning,” inACM/IEEE Design Automation Conference (DAC), 1999, pp. 343– 348
work page 1999
-
[15]
K-Way Partitioning Based Packing for FPGA Logic Blocks Without Input Bandwidth Constraint,
W. Feng, “K-Way Partitioning Based Packing for FPGA Logic Blocks Without Input Bandwidth Constraint,” inIEEE International Conference on Field-Programmable Technology (FPT), 2012, pp. 8–15
work page 2012
-
[16]
Rent’s Rule Based FPGA Packing for Routability Optimization,
W. Feng, J. Greene, K. V orwerk, V . Pevzner, and A. Kundu, “Rent’s Rule Based FPGA Packing for Routability Optimization,” inACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2014, pp. 31–34
work page 2014
-
[17]
Runtime-Quality Tradeoff in Partitioning Based Multithreaded Packing,
D. Vercruyce, E. Vansteenkiste, and D. Stroobandt, “Runtime-Quality Tradeoff in Partitioning Based Multithreaded Packing,” inIEEE Inter- national Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1–9
work page 2016
-
[18]
How Preserving Circuit Design Hierarchy During FPGA Packing Leads to Better Performance,
——, “How Preserving Circuit Design Hierarchy During FPGA Packing Leads to Better Performance,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 3, pp. 629–642, 2017
work page 2017
-
[19]
LSC: A Large-Scale Consensus-Based Clustering Algorithm for High-Performance FPGAs,
L. Singhal, M. A. Iyer, and S. Adya, “LSC: A Large-Scale Consensus-Based Clustering Algorithm for High-Performance FPGAs,” inACM/IEEE Design Automation Conference (DAC), 2017, pp. 1–6
work page 2017
-
[20]
Simultaneous Timing Driven Clustering and Placement for FPGAs,
G. Chen and J. Cong, “Simultaneous Timing Driven Clustering and Placement for FPGAs,” inInternational Conference on Field Pro- grammable Logic and Applications (FPL), 2004, pp. 158–167
work page 2004
-
[21]
Breaking Boundaries: Optimizing FPGA CAD with Flexible and Multi-threaded Re-Clustering,
M. A. Elgammal and V . Betz, “Breaking Boundaries: Optimizing FPGA CAD with Flexible and Multi-threaded Re-Clustering,” inACM Interna- tional Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART), 2023, pp. 11–18
work page 2023
-
[22]
J. Luu, J. Anderson, and J. Rose, “Architecture description and packing for logic blocks with hierarchy, modes and complex interconnect,” Proceedings of 2011 Acm/Sigda International Symposium on Field Programmable Gate Arrays, pp. 227–236, 2011
work page 2011
-
[23]
Towards Interconnect-Adaptive Packing for FPGAs,
J. Luu, J. Rose, and J. Anderson, “Towards Interconnect-Adaptive Packing for FPGAs,” inACM/SIGDA International Symposium on Field- Programmable Gate Arrays (FPGA), 2014, pp. 21–30
work page 2014
-
[24]
Using Sparse Crossbars within LUT,
G. Lemieux and D. Lewis, “Using Sparse Crossbars within LUT,” inACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2001, pp. 59–68
work page 2001
-
[25]
Academic Packing for Commercial FPGA Architec- tures,
T. D. Haroldsen, “Academic Packing for Commercial FPGA Architec- tures,” Ph.D. dissertation, Brigham Young University, 2017
work page 2017
-
[26]
VTR 7.0: Next Generation Architecture and CAD System for FPGAs,
J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, K. B. Kent, J. Anderson, J. Rose, and V . Betz, “VTR 7.0: Next Generation Architecture and CAD System for FPGAs,”ACM Transactions on Reconfigurable Technology and Systems, vol. 7, no. 3, pp. 1–30, 2014
work page 2014
-
[27]
VTR 8: High-performance CAD and Customizable FPGA Architecture Modelling,
K. E. Murray, O. Petelin, S. Zhong, J. M. Wang, M. Eldafrawy, J.-P. Legault, E. Sha, A. G. Graham, J. Wu, M. J. P. Walker, H. Zeng, P. Pa- tros, J. Luu, K. B. Kent, and V . Betz, “VTR 8: High-performance CAD and Customizable FPGA Architecture Modelling,”ACM Transactions on Reconfigurable Technology and Systems, vol. 13, no. 2, pp. 1–60, 2020
work page 2020
-
[28]
Titan 2.0: Enabling Open-Source CAD Evaluation with a Modern Architecture Capture,
K. T. Khoozani, A. A. Dehkordi, and V . Betz, “Titan 2.0: Enabling Open-Source CAD Evaluation with a Modern Architecture Capture,” inIEEE International Conference on Field-Programmable Logic and Applications (FPL), 2023, pp. 57–64
work page 2023
-
[29]
PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs,
L. McMurchie and C. Ebeling, “PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs,” inACM International Sym- posium on Field-Programmable Gate Arrays (FPGA), 1995, pp. 111— -117
work page 1995
-
[30]
Guaranteed Yet Hard to Find: Uncovering FPGA Routing Convergence Paradox,
S. Shrivastava, S. Nikoli ´c, S. Tanaka, C. Ravishankar, D. Gaitonde, and M. Stojilovi ´c, “Guaranteed Yet Hard to Find: Uncovering FPGA Routing Convergence Paradox,” inIEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2025, pp. 143–151
work page 2025
- [31]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.