Duet: Creating Harmony between Processors and Embedded FPGAs
Pith reviewed 2026-05-24 09:44 UTC · model grok-4.3
The pith
Duet integrates embedded FPGAs as equal peers with processors through non-intrusive bi-directional cache-coherent links.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Duet is a scalable manycore-FPGA architecture that promotes embedded FPGAs (eFPGAs) to equal peers with processors through non-intrusive, bi-directionally cache-coherent integration. Unlike prior CPU-FPGA hybrids, in which processors play a supportive role, Duet enables two classes of post-fabrication enhancement. Fine-grained acceleration partitions an application into small tasks and offloads the frequently invoked, compute-intensive ones onto small eFPGA accelerators, while processors handle dynamic control flow and less accelerable tasks. Hardware augmentation uses eFPGA-emulated hardware widgets to improve processor efficiency or mitigate software overheads in certain execution models.
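Why communication latency matters for fine-grained acceleration can be illustrated with a minimal break-even sketch. This is not the paper's model: the function names, the 1000-cycle baseline overhead, and the 10x accelerator advantage are hypothetical values chosen for illustration, and only the 82% figure comes from the abstract.

```python
# Illustrative break-even model for fine-grained offload (not from the paper).
# All cycle counts and the accelerator ratio below are hypothetical.

def offload_speedup(cpu_time, accel_time, comm_overhead):
    """Speedup of one offloaded invocation versus running it on the processor."""
    return cpu_time / (accel_time + comm_overhead)


def min_profitable_task(accel_ratio, comm_overhead):
    """Smallest processor-side task time at which offloading breaks even.

    Assumes accel_time = cpu_time / accel_ratio with accel_ratio > 1, so
    cpu_time > cpu_time / accel_ratio + comm_overhead
    rearranges to cpu_time > comm_overhead / (1 - 1 / accel_ratio).
    """
    if accel_ratio <= 1.0:
        raise ValueError("accelerator must be faster than the processor")
    return comm_overhead / (1.0 - 1.0 / accel_ratio)


BASELINE_OVERHEAD = 1000.0                         # hypothetical cycles per round trip
REDUCED_OVERHEAD = BASELINE_OVERHEAD * (1 - 0.82)  # 82% is the abstract's best case
ACCEL_RATIO = 10.0                                 # hypothetical accelerator advantage

for label, overhead in [("baseline", BASELINE_OVERHEAD), ("reduced", REDUCED_OVERHEAD)]:
    print(label, round(min_profitable_task(ACCEL_RATIO, overhead), 1))
```

Under these assumed numbers, the smallest task worth offloading shrinks from roughly 1,111 cycles to roughly 200 cycles. That is the sense in which lower communication latency opens up finer-grained tasks.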
What carries the argument
Non-intrusive, bi-directionally cache-coherent integration that lets eFPGAs and processors access each other's caches without modifying the processor design.
If this is right
- In synthetic benchmarks, processor-accelerator communication latency drops by up to 82%.
- In the same benchmarks, bandwidth between processors and accelerators rises by up to 9.5x.
- Seven application benchmarks run on the RTL implementation achieve speedups between 1.5x and 24.9x.
- Post-fabrication hardware changes become possible without redesigning the processor core.
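The headline numbers mix two conventions: a percentage reduction for latency, and multiplicative factors for bandwidth and speedup. The short sketch below converts between the two so they can be compared on one scale. The helper names are hypothetical, and the inputs are the best-case or range-end values quoted in the abstract.

```python
def reduction_to_factor(reduction):
    """Convert a fractional reduction (0.82 for 82%) into an 'N times lower' factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_to_time_saved(speedup):
    """Convert an 'N times faster' speedup into the fraction of run time saved."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Figures quoted in the abstract (best case for latency, range ends for speedup).
print("latency factor:", round(reduction_to_factor(0.82), 2))
print("time saved at 1.5x:", round(speedup_to_time_saved(1.5), 3))
print("time saved at 24.9x:", round(speedup_to_time_saved(24.9), 3))
```

Read this way, an 82% latency reduction means latency about 5.6 times lower, which puts it on the same multiplicative scale as the 9.5x bandwidth figure.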
Where Pith is reading between the lines
- The same integration pattern could let future chips reconfigure hardware support for different software stacks after tape-out.
- Designers might reduce reliance on fixed-function ASICs by keeping general acceleration capacity on-chip in reconfigurable form.
- Similar cache-coherent eFPGA blocks could be added to other manycore designs to support dynamic hardware specialization.
Load-bearing premise
The cache-coherent integration between processors and eFPGAs can be built in real silicon with low enough overhead to deliver the modeled latency and bandwidth numbers.
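How sensitive the latency claim is to this premise can be sketched with simple arithmetic. The model below is an assumption, not something the paper provides: it normalizes baseline latency to 1.0 and supposes that physical effects add a fixed extra latency, expressed as a fraction of the baseline, to the integrated path only. The function name and the sweep values are hypothetical.

```python
RTL_REDUCTION = 0.82  # best-case latency reduction reported in the abstract


def reduction_with_extra_overhead(extra_fraction, rtl_reduction=RTL_REDUCTION):
    """Latency reduction that remains if the integrated path gains extra latency.

    extra_fraction is the added latency as a fraction of the baseline latency
    (an assumed, simplified model of wire delay or protocol traffic in silicon).
    """
    if extra_fraction < 0.0:
        raise ValueError("extra_fraction must be non-negative")
    integrated_latency = (1.0 - rtl_reduction) + extra_fraction
    return 1.0 - integrated_latency


# Hypothetical sweep: how the reported reduction erodes as overhead grows.
for extra in (0.0, 0.05, 0.10, 0.20):
    remaining = reduction_with_extra_overhead(extra)
    print(f"extra={extra:.2f} -> reduction={remaining:.2f}")
```

Under this simplified model the reduction erodes point for point: extra latency equal to 10% of the baseline would leave a 72% reduction rather than 82%.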
What would settle it
Fabricate a Duet chip and measure whether processor-accelerator communication latency and bandwidth match the RTL-reported figures of up to 82% lower latency and up to 9.5x higher bandwidth.
Original abstract
The demise of Moore's Law has led to the rise of hardware acceleration. However, the focus on accelerating stable algorithms in their entirety neglects the abundant fine-grained acceleration opportunities available in broader domains and squanders host processors' compute power. This paper presents Duet, a scalable, manycore-FPGA architecture that promotes embedded FPGAs (eFPGA) to be equal peers with processors through non-intrusive, bi-directionally cache-coherent integration. In contrast to existing CPU-FPGA hybrid systems in which the processors play a supportive role, Duet unleashes the full potential of both the processors and the eFPGAs with two classes of post-fabrication enhancements: fine-grained acceleration, which partitions an application into small tasks and offloads the frequently-invoked, compute-intensive ones onto various small accelerators, leveraging the processors to handle dynamic control flow and less accelerable tasks; hardware augmentation, which employs eFPGA-emulated hardware widgets to improve processor efficiency or mitigate software overheads in certain execution models. An RTL-level implementation of Duet is developed to evaluate the architecture with high fidelity. Experiments using synthetic benchmarks show that Duet can reduce the processor-accelerator communication latency by up to 82% and increase the bandwidth by up to 9.5x. The RTL implementation is further evaluated with seven application benchmarks, achieving 1.5-24.9x speedup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Duet, a scalable manycore-FPGA architecture that integrates embedded FPGAs (eFPGAs) as equal peers with processors via non-intrusive, bi-directionally cache-coherent links. It proposes two classes of post-fabrication enhancements: fine-grained acceleration (partitioning applications to offload compute-intensive tasks to small eFPGA accelerators while processors handle control flow) and hardware augmentation (using eFPGA-emulated widgets to improve processor efficiency). An RTL-level implementation is evaluated with synthetic benchmarks (showing up to 82% latency reduction and 9.5x bandwidth increase) and seven application benchmarks (achieving 1.5-24.9x speedup).
Significance. If the low-overhead bi-directional cache coherence can be realized without eroding the reported gains, Duet would represent a meaningful advance over existing CPU-FPGA hybrids by enabling more dynamic, fine-grained interactions and better utilization of both components. The RTL evaluation provides concrete, high-fidelity measurements that support the architecture's potential.
major comments (1)
- [Abstract / Evaluation] Abstract and evaluation sections: The central claims depend on the bi-directional cache-coherent integration being non-intrusive with sufficiently low overhead. However, the manuscript reports only RTL-level latency/bandwidth numbers and provides no post-synthesis area, timing, or power breakdown of the coherence logic (directory, snoop filters, or protocol state machines), nor any comparison against a baseline without the eFPGA interface. This leaves open whether wire delays or protocol traffic would reduce the claimed 82% latency reduction and 9.5x bandwidth gains in silicon.
minor comments (1)
- [Abstract] Abstract: Concrete performance numbers (82% latency reduction, 9.5x bandwidth, 1.5-24.9x speedup) are presented without error bars, explicit methodology details, or data exclusion rules.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting the importance of quantifying the overhead of the bi-directional cache-coherent interface. We address the major comment below.
- Referee: [Abstract / Evaluation] Abstract and evaluation sections: The central claims depend on the bi-directional cache-coherent integration being non-intrusive with sufficiently low overhead. However, the manuscript reports only RTL-level latency/bandwidth numbers and provides no post-synthesis area, timing, or power breakdown of the coherence logic (directory, snoop filters, or protocol state machines), nor any comparison against a baseline without the eFPGA interface. This leaves open whether wire delays or protocol traffic would reduce the claimed 82% latency reduction and 9.5x bandwidth gains in silicon.
  Authors: The RTL model implements the full coherence protocol (directory, snoop filters, and state machines) and the reported latency/bandwidth figures are measured end-to-end with this logic active; the synthetic benchmarks explicitly compare against a baseline that uses conventional off-chip communication rather than the integrated interface. We therefore believe the 82% latency reduction and 9.5x bandwidth improvement already reflect protocol overhead. However, the manuscript does not contain post-synthesis area, timing, or power breakdowns, nor place-and-route results that would capture wire delays. These metrics would require a full-chip physical design flow that lies outside the scope of the current RTL-focused evaluation. We can add an explicit limitations paragraph in the revised manuscript acknowledging this gap while preserving the architectural claims supported by the cycle-accurate RTL data.
  Revision: partial
Circularity Check
No circularity: the architecture claims rest on direct RTL measurements, not on fitted predictions or self-referential derivations.
The paper describes a hardware architecture and reports performance numbers obtained from an RTL model and application benchmarks. No equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations appear in the abstract or provided text. The central results (latency/bandwidth gains, speedups) are presented as direct simulation outputs rather than reductions to prior fitted values or author theorems. This is a standard empirical systems paper whose evaluation is self-contained against external benchmarks.