pith. sign in

arxiv: 2604.24649 · v1 · submitted 2026-04-27 · 💻 cs.AR

D\'ej\`a Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization

Pith reviewed 2026-05-07 17:26 UTC · model grok-4.3

classification 💻 cs.AR
keywords FPGApackingclusteringmemoizationsignature treeCAD runtimeVPRlogic blocks
0
0 comments X

The pith

A signature tree memoizes repeated cluster legality checks to accelerate FPGA packing several times over.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The packing stage in FPGA CAD flows clusters logic primitives into larger blocks and must solve expensive routing legality checks for each candidate. For modern complex logic blocks, this stage dominates total runtime, often exceeding half or nearly all of the place-and-route time. The authors observe that many candidate clusters repeat a small set of patterns, so repeated checks are wasteful. Their Deja Vu approach builds a packing signature tree to detect these repeats quickly and reuse prior legality results. On standard benchmarks this yields average speedups of 3.7x for 7-series-like architectures and 6.9x for Stratix-10-like ones, translating to 1.6x and 5.3x faster end-to-end flows with no loss in quality.

Core claim

The central discovery is that a packing signature tree data structure can identify recurring patterns among attempted clusters and memoize the outcomes of their multi-source multi-sink routing legality checks. Because a large fraction of attempted clusters match one of a modest number of prior patterns, the tree avoids redundant expensive checks while preserving the exact same clustering decisions.

What carries the argument

The packing signature tree, a trie-like structure that hashes cluster signatures to store and retrieve legality results for repeated packing patterns.

Load-bearing premise

That enough attempted clusters match previous patterns for the signature tree to skip most checks without using too much memory or missing any pattern variations that would change a legality decision.

What would settle it

Running the packer on a new set of large benchmarks where every candidate cluster has a unique pattern; if speedups fall below 1.2x while memory grows linearly with attempts, the core premise does not hold.

Figures

Figures reproduced from arXiv: 2604.24649 by Amin Mohaghegh, Andrew Boutros, Milo Liebster.

Figure 1
Figure 1. Figure 1: Complex commercial Stratix 10 LE with 8 distinct inputs (A-H), one view at source ↗
Figure 2
Figure 2. Figure 2: VPR runtime breakdown for the VTR and Koios benchmarks on the 7-series architecture (W=300) (left), as well as the Titan23 and Titanium25 view at source ↗
Figure 3
Figure 3. Figure 3: Breakdown of the percentage of clusters that succeeded with specu view at source ↗
Figure 5
Figure 5. Figure 5: Example PST representation with two packing signatures (right) view at source ↗
Figure 6
Figure 6. Figure 6: Breakdown of final packed clusters that are the first occurrence vs. have a repeated packing signature. On average, 59% and 65% of the clusters have view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of peak memory utilization of the VPR packing stage and full VPR flow with and without déjà vu packing. All values are normalized view at source ↗
read the original abstract

Implementing a digital circuit on an FPGA fabric requires clustering technology-mapped netlist primitives into coarser-granularity blocks that can be directly mapped to the physical resources available on the FPGA. As the architecture of FPGA logic blocks (LBs) has grown in complexity, with sophisticated logic elements (LEs) and highly irregular local interconnect, this packing problem has become more challenging. To ensure the feasibility of intracluster routing, the computer-aided design (CAD) tools must solve a costly multi-source multi-sink routing problem for each candidate cluster. In this paper, we first show that such packing legality checks consume a significant portion of the CAD flow runtime for LB architectures with complex LEs and local routing structures resembling modern commercial FPGAs. We demonstrate that the packing stage constitutes 58% and 94% of the entire Versatile Place and Route (VPR) flow runtime on average when mapping a wide variety of benchmarks to the AMD 7-series-like and Altera Stratix-10-like VTR architecture captures, respectively. By analyzing the packing algorithm behavior, we observe that a significant fraction of the attempted packed clusters are repetitions of a much smaller number of packing patterns, and therefore many of the packing legality checks are redundant and could be skipped. To this end, we introduce our D\'ej\`a Vu packing approach, which leverages a novel packing signature tree data structure that enables efficient identification of recurring packing patterns and memoization of their legality check outcomes. Our approach speeds up the packing by up to 13.4x and 29.3x, with an average of 3.7x and 6.9x, across the evaluated benchmarks on the 7-series and Stratix 10 architectures. These packing runtime gains result in a significant 1.6x and 5.3x average reduction in end-to-end VPR runtime, while maintaining quality of results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Déjà Vu packing, an optimization for FPGA logic clustering that observes a high fraction of repeated packing patterns during legality checks (which consume 58% and 94% of VPR runtime on two architectures). It introduces a packing signature tree data structure to identify recurring patterns and memoize their legality outcomes, claiming average packing speedups of 3.7× (7-series) and 6.9× (Stratix 10), with corresponding end-to-end VPR reductions of 1.6× and 5.3× while preserving quality of results.

Significance. If the memoization is provably sound and the empirical speedups are robust across benchmarks, the work could meaningfully accelerate FPGA CAD flows for modern complex logic blocks by eliminating redundant routing checks inside clusters. The core observation of pattern repetition and the signature-tree mechanism represent a practical algorithmic contribution to a known bottleneck.

major comments (2)
  1. [Section 3] Section 3 (Signature Tree Construction): The description of signature generation does not specify whether signatures are exact canonical forms, structural hashes, or include a verification step. Without this, it is impossible to rule out false-positive matches between non-identical clusters, which would silently reuse incorrect legality results and potentially degrade QoR or produce illegal packings.
  2. [Section 5] Section 5 (Experimental Evaluation): The reported speedups (up to 13.4×/29.3×, averages 3.7×/6.9×) and end-to-end gains lack any description of benchmark selection criteria, number of independent runs, statistical significance testing, or error bars. This makes it difficult to assess whether the results are generalizable or sensitive to post-hoc benchmark choices.
minor comments (2)
  1. [Abstract] The abstract states that quality of results is maintained but does not quantify the specific metrics (critical-path delay, area, or routing success rate) used to verify this claim; these should be reported with tables or figures in the evaluation section.
  2. [Section 3] Notation for 'packing patterns' and 'signatures' is introduced informally; a short formal definition or pseudocode in Section 3 would improve clarity for readers unfamiliar with the VPR packing algorithm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment in detail below, committing to revisions where appropriate to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Signature Tree Construction): The description of signature generation does not specify whether signatures are exact canonical forms, structural hashes, or include a verification step. Without this, it is impossible to rule out false-positive matches between non-identical clusters, which would silently reuse incorrect legality results and potentially degrade QoR or produce illegal packings.

    Authors: We appreciate this point on soundness. The signature in our packing signature tree is constructed as an exact canonical form: primitives within each candidate cluster are sorted by type and position, their local connectivity is encoded as a deterministic adjacency list string, and the resulting tuple serves as the key. This is not a probabilistic hash but a structural canonicalization, guaranteeing identical signatures only for identical patterns. The tree traversal further verifies structural equivalence before reusing any memoized legality outcome. We will revise Section 3 to explicitly describe this canonicalization process, add pseudocode for signature generation, and state that no false-positive reuse is possible by design. revision: yes

  2. Referee: [Section 5] Section 5 (Experimental Evaluation): The reported speedups (up to 13.4×/29.3×, averages 3.7×/6.9×) and end-to-end gains lack any description of benchmark selection criteria, number of independent runs, statistical significance testing, or error bars. This makes it difficult to assess whether the results are generalizable or sensitive to post-hoc benchmark choices.

    Authors: We agree that additional methodological details will strengthen the evaluation. The benchmarks comprise the full VTR 8.0 suite (all 20+ circuits), selected because they represent the standard, diverse set used in prior FPGA packing literature and cover a range of sizes and domains. All runs used deterministic algorithms on a fixed platform, so single executions per benchmark were performed. To improve statistical transparency, we will add a new subsection describing benchmark selection, include a table of all circuits with size and runtime statistics, and report mean speedups with standard deviations obtained from five independent runs (varying placement seeds where applicable). These changes will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical runtime optimization with direct measurements

full rationale

The paper describes an algorithmic optimization for FPGA packing that observes repeated patterns in clustering attempts and uses a signature tree for memoization. All claims rest on runtime benchmarks across architectures and benchmarks, with no equations, fitted parameters, predictions derived from inputs, or self-citations that bear the central result. Speedups (e.g., 3.7x packing) are measured outcomes, not quantities defined by the method itself. The work is self-contained against external VPR flows and does not reduce any result to its own assumptions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical optimization technique. No free parameters, domain axioms, or invented physical entities are described in the abstract; the signature tree is a standard algorithmic data structure rather than a postulated entity.

pith-pipeline@v0.9.0 · 5657 in / 1393 out tokens · 74803 ms · 2026-05-07T17:26:58.970191+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    A New Paradigm for FPGA Placement Without Explicit Packing,

    W. Li and D. Z. Pan, “A New Paradigm for FPGA Placement Without Explicit Packing,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 38, no. 11, pp. 2113– 2126, 2019

  2. [2]

    Optimizing FPGA Logic Block Architectures for Arithmetic,

    K. E. Murray, J. Luu, M. J. Walker, C. McCullough, S. Wang, S. Huda, B. Yan, C. Chiasson, K. B. Kent, J. Andersonet al., “Optimizing FPGA Logic Block Architectures for Arithmetic,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 6, pp. 1378–1391, 2020

  3. [3]

    VTR 9: Open-source CAD for Fabric and Beyond FPGA Architecture Exploration,

    M. A. Elgammal, A. Mohaghegh, S. G. Shahrouz, F. Mahmoudi, F. Ko¸ sar, K. Talaei, J. Fife, D. Khadivi, K. Murray, A. Boutroset al., “VTR 9: Open-source CAD for Fabric and Beyond FPGA Architecture Exploration,”ACM Transactions on Reconfigurable Technology and Systems, vol. 18, no. 3, pp. 1–53, 2025

  4. [4]

    Koios 2.0: Open-Source Deep Learning Benchmarks for FPGA Architecture and CAD Research,

    A. Arora, A. Boutros, S. A. Damghani, K. Mathur, V . Mohanty, T. Anand, M. A. Elgammal, K. B. Kent, V . Betz, and L. K. John, “Koios 2.0: Open-Source Deep Learning Benchmarks for FPGA Architecture and CAD Research,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 42, no. 11, pp. 3895–3909, 2023

  5. [5]

    Titan: Enabling Large and Complex Benchmarks in Academic CAD,

    K. E. Murray, S. Whitty, S. Liu, J. Luu, and V . Betz, “Titan: Enabling Large and Complex Benchmarks in Academic CAD,” inIEEE Interna- tional Conference on Field programmable Logic and Applications (FPL), 2013, pp. 1–8

  6. [6]

    Cluster-based Logic Blocks for FPGAs: Area- Efficiency vs. Input Sharing and Size,

    V . Betz and J. Rose, “Cluster-based Logic Blocks for FPGAs: Area- Efficiency vs. Input Sharing and Size,” inIEEE Custom Integrated Circuits Conference (CICC), 1997, pp. 551–554

  7. [7]

    Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density,

    A. Marquardt, V . Betz, and J. Rose, “Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density,” inACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 1999, pp. 37–46

  8. [8]

    RPack: Routability-Driven Packing for Cluster-Based FPGAs,

    E. Bozorgzadeh, S. Ogrenci-Memik, and M. Sarrafzadeh, “RPack: Routability-Driven Packing for Cluster-Based FPGAs,” inACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2001, pp. 629–634

  9. [9]

    Efficient Circuit Clustering for Area and Power Reduction in FPGAs,

    A. Singh, G. Parthasarathy, and M. Marek-Sadowska, “Efficient Circuit Clustering for Area and Power Reduction in FPGAs,”ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 7, no. 4, pp. 643–663, 2002

  10. [10]

    Improving Timing-Driven FPGA Packing with Physical Information,

    D. T. Chen, K. V orwerk, and A. Kennings, “Improving Timing-Driven FPGA Packing with Physical Information,” inIEEE International Con- ference on Field Programmable Logic and Applications (FPL), 2007, pp. 117–123

  11. [11]

    MO-Pack: Many-Objective Clustering for FPGA CAD,

    S. T. Rajavel and A. Akoglu, “MO-Pack: Many-Objective Clustering for FPGA CAD,” inACM/IEEE Design Automation Conference (DAC), 2011, pp. 818–823

  12. [12]

    Un/DoPack: Re-clustering of Large System-on-Chip Designs with Interconnect Variation for Low- Cost FPGAs,

    M. Tom, D. Leong, and G. Lemieux, “Un/DoPack: Re-clustering of Large System-on-Chip Designs with Interconnect Variation for Low- Cost FPGAs,” inIEEE/ACM International Conference on Computer- aided design (ICCAD), 2006, pp. 680–687

  13. [13]

    T-NDPack: Timing-Driven Non-Uniform Depopulation Based Clustering,

    H. Liu and A. Akoglu, “T-NDPack: Timing-Driven Non-Uniform Depopulation Based Clustering,” inIEEE Southern Conference on Programmable Logic (SPL), 2009, pp. 9–14

  14. [14]

    Multilevel K-Way Hypergraph Partitioning,

    G. Karypis and V . Kumar, “Multilevel K-Way Hypergraph Partitioning,” inACM/IEEE Design Automation Conference (DAC), 1999, pp. 343– 348

  15. [15]

    K-Way Partitioning Based Packing for FPGA Logic Blocks Without Input Bandwidth Constraint,

    W. Feng, “K-Way Partitioning Based Packing for FPGA Logic Blocks Without Input Bandwidth Constraint,” inIEEE International Conference on Field-Programmable Technology (FPT), 2012, pp. 8–15

  16. [16]

    Rent’s Rule Based FPGA Packing for Routability Optimization,

    W. Feng, J. Greene, K. V orwerk, V . Pevzner, and A. Kundu, “Rent’s Rule Based FPGA Packing for Routability Optimization,” inACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2014, pp. 31–34

  17. [17]

    Runtime-Quality Tradeoff in Partitioning Based Multithreaded Packing,

    D. Vercruyce, E. Vansteenkiste, and D. Stroobandt, “Runtime-Quality Tradeoff in Partitioning Based Multithreaded Packing,” inIEEE Inter- national Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1–9

  18. [18]

    How Preserving Circuit Design Hierarchy During FPGA Packing Leads to Better Performance,

    ——, “How Preserving Circuit Design Hierarchy During FPGA Packing Leads to Better Performance,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 3, pp. 629–642, 2017

  19. [19]

    LSC: A Large-Scale Consensus-Based Clustering Algorithm for High-Performance FPGAs,

    L. Singhal, M. A. Iyer, and S. Adya, “LSC: A Large-Scale Consensus-Based Clustering Algorithm for High-Performance FPGAs,” inACM/IEEE Design Automation Conference (DAC), 2017, pp. 1–6

  20. [20]

    Simultaneous Timing Driven Clustering and Placement for FPGAs,

    G. Chen and J. Cong, “Simultaneous Timing Driven Clustering and Placement for FPGAs,” inInternational Conference on Field Pro- grammable Logic and Applications (FPL), 2004, pp. 158–167

  21. [21]

    Breaking Boundaries: Optimizing FPGA CAD with Flexible and Multi-threaded Re-Clustering,

    M. A. Elgammal and V . Betz, “Breaking Boundaries: Optimizing FPGA CAD with Flexible and Multi-threaded Re-Clustering,” inACM Interna- tional Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART), 2023, pp. 11–18

  22. [22]

    Architecture description and packing for logic blocks with hierarchy, modes and complex interconnect,

    J. Luu, J. Anderson, and J. Rose, “Architecture description and packing for logic blocks with hierarchy, modes and complex interconnect,” Proceedings of 2011 Acm/Sigda International Symposium on Field Programmable Gate Arrays, pp. 227–236, 2011

  23. [23]

    Towards Interconnect-Adaptive Packing for FPGAs,

    J. Luu, J. Rose, and J. Anderson, “Towards Interconnect-Adaptive Packing for FPGAs,” inACM/SIGDA International Symposium on Field- Programmable Gate Arrays (FPGA), 2014, pp. 21–30

  24. [24]

    Using Sparse Crossbars within LUT,

    G. Lemieux and D. Lewis, “Using Sparse Crossbars within LUT,” inACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2001, pp. 59–68

  25. [25]

    Academic Packing for Commercial FPGA Architec- tures,

    T. D. Haroldsen, “Academic Packing for Commercial FPGA Architec- tures,” Ph.D. dissertation, Brigham Young University, 2017

  26. [26]

    VTR 7.0: Next Generation Architecture and CAD System for FPGAs,

    J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, K. B. Kent, J. Anderson, J. Rose, and V . Betz, “VTR 7.0: Next Generation Architecture and CAD System for FPGAs,”ACM Transactions on Reconfigurable Technology and Systems, vol. 7, no. 3, pp. 1–30, 2014

  27. [27]

    VTR 8: High-performance CAD and Customizable FPGA Architecture Modelling,

    K. E. Murray, O. Petelin, S. Zhong, J. M. Wang, M. Eldafrawy, J.-P. Legault, E. Sha, A. G. Graham, J. Wu, M. J. P. Walker, H. Zeng, P. Pa- tros, J. Luu, K. B. Kent, and V . Betz, “VTR 8: High-performance CAD and Customizable FPGA Architecture Modelling,”ACM Transactions on Reconfigurable Technology and Systems, vol. 13, no. 2, pp. 1–60, 2020

  28. [28]

    Titan 2.0: Enabling Open-Source CAD Evaluation with a Modern Architecture Capture,

    K. T. Khoozani, A. A. Dehkordi, and V . Betz, “Titan 2.0: Enabling Open-Source CAD Evaluation with a Modern Architecture Capture,” inIEEE International Conference on Field-Programmable Logic and Applications (FPL), 2023, pp. 57–64

  29. [29]

    PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs,

    L. McMurchie and C. Ebeling, “PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs,” inACM International Sym- posium on Field-Programmable Gate Arrays (FPGA), 1995, pp. 111— -117

  30. [30]

    Guaranteed Yet Hard to Find: Uncovering FPGA Routing Convergence Paradox,

    S. Shrivastava, S. Nikoli ´c, S. Tanaka, C. Ravishankar, D. Gaitonde, and M. Stojilovi ´c, “Guaranteed Yet Hard to Find: Uncovering FPGA Routing Convergence Paradox,” inIEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2025, pp. 143–151

  31. [31]

    Valgrind,

    “Valgrind,” https://valgrind.org, accessed: 2026-01-17