FPGA-based Matched Filter Group Optimisation for SKA Pulsar Search Engine
Pith reviewed 2026-05-08 10:00 UTC · model grok-4.3
The pith
Assigning filters to pipelines longest-processing-time-first speeds up FPGA time-domain matched filtering for pulsar searches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The generic time-domain matched filter design is optimised by employing the longest processing time first rule to distribute filter templates across pipelines, while Fourier-domain versions show a direct relationship between required off-chip memory space and the speedup obtained over the generic design. When placed against a well-optimised GPU implementation, a mid-range FPGA is up to 7.5 times slower yet achieves slightly superior performance per watt.
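The two halves of the GPU comparison constrain each other, which a short worked inequality makes explicit. The symbols below are introduced here for illustration and are not the paper's notation: $T$ is throughput on the shared workload, $P$ is power draw, and $s$ is the slowdown factor. The sketch assumes both figures refer to the same operating point, which the abstract does not state.

```latex
% Assumed notation: T = throughput, P = power, s = slowdown of the FPGA.
\[
  T_{\mathrm{GPU}} = s \, T_{\mathrm{FPGA}}, \qquad s \le 7.5
\]
\[
  \frac{T_{\mathrm{FPGA}}}{P_{\mathrm{FPGA}}} > \frac{T_{\mathrm{GPU}}}{P_{\mathrm{GPU}}}
  = \frac{s \, T_{\mathrm{FPGA}}}{P_{\mathrm{GPU}}}
  \quad\Longrightarrow\quad
  P_{\mathrm{FPGA}} < \frac{P_{\mathrm{GPU}}}{s}
\]
\[
  s = 7.5 \;\Longrightarrow\; P_{\mathrm{FPGA}} < \frac{P_{\mathrm{GPU}}}{7.5} \approx 0.133 \, P_{\mathrm{GPU}}
\]
```

Read this way, the claim implies that at the largest slowdown the FPGA must draw less than roughly 13% of the GPU's power for the efficiency advantage to hold.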
What carries the argument
The longest processing time first scheduling rule used to assign filters of different lengths to multiple parallel processing pipelines.
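A minimal sketch of the rule, not the paper's implementation: filters are sorted by decreasing cost and each one goes to the pipeline that is currently least loaded. The cost model (processing time proportional to tap count), the function names `lpt_assign` and `in_order_assign`, and the example tap counts are all assumptions made for illustration.

```python
import heapq


def lpt_assign(filter_lengths, num_pipelines):
    """Longest-processing-time-first assignment of filters to pipelines.

    Assumption: the cost of a filter is proportional to its tap count.
    Returns (loads, assignment); assignment[p] lists the filters on pipeline p.
    """
    if num_pipelines < 1:
        raise ValueError("num_pipelines must be at least 1")
    # Min-heap of (current load, pipeline index): least-loaded pipeline pops first.
    heap = [(0, p) for p in range(num_pipelines)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pipelines)]
    # Longest filters are placed first, shorter ones fill the remaining gaps.
    for length in sorted(filter_lengths, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append(length)
        heapq.heappush(heap, (load + length, p))
    loads = [sum(filters) for filters in assignment]
    return loads, assignment


def in_order_assign(filter_lengths, num_pipelines):
    """Baseline for comparison: round-robin in the order the filters are given."""
    loads = [0] * num_pipelines
    for i, length in enumerate(filter_lengths):
        loads[i % num_pipelines] += length
    return loads


if __name__ == "__main__":
    taps = [32, 64, 128, 256, 512, 1024, 2048]  # illustrative tap counts
    lpt_loads, lpt_groups = lpt_assign(taps, 2)
    print("LPT loads:", lpt_loads, "finish time:", max(lpt_loads))
    print("in-order loads:", in_order_assign(taps, 2))
```

For this example the LPT loads are 2048 and 2016 taps, while round-robin in the given order yields 2720 and 1344, so the slowest pipeline finishes noticeably later without the rule. On identical parallel machines LPT is a heuristic rather than an exact method: its finish time is known to be at most 4/3 − 1/(3m) times the optimum for m machines.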
If this is right
- Time-domain designs finish faster once filters are assigned to pipelines by longest processing time first.
- Frequency-domain designs trade additional memory for measurable speedup over the baseline.
- A mid-range FPGA can deliver slightly better performance per watt than a top-end GPU on the same matched-filter task, at the cost of lower raw speed.
- The same scheduling approach applies to any signal-processing workload that uses groups of FIR filters with unequal lengths.
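The memory-for-speed trade named in the frequency-domain bullet above can be illustrated with a small sketch. It is not the paper's design: it assumes the extra memory holds one precomputed spectrum per filter template, and it uses a naive O(n²) transform as a stand-in for a hardware FFT core. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT core)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def direct_convolve(signal, taps):
    """Time-domain linear convolution: one multiply-accumulate per tap."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(taps):
            out[i + j] += s * h
    return out


def precompute_spectra(templates, n):
    """Zero-pad every template to n points and store its spectrum.

    This table is the additional memory the frequency-domain route pays for.
    n must be at least len(signal) + len(longest template) - 1.
    """
    return [dft(list(t) + [0.0] * (n - len(t))) for t in templates]


def fd_convolve(signal, spectrum):
    """Convolve via the transform domain, reusing a stored template spectrum."""
    n = len(spectrum)
    if len(signal) > n:
        raise ValueError("signal is longer than the transform size")
    padded = list(signal) + [0.0] * (n - len(signal))
    product = [a * b for a, b in zip(dft(padded), spectrum)]
    return [v.real for v in dft(product, inverse=True)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    templates = [[1.0, 1.0], [1.0, 0.0, -1.0]]
    spectra = precompute_spectra(templates, 8)  # stored once, reused per input
    for taps, spectrum in zip(templates, spectra):
        length = len(signal) + len(taps) - 1
        print(direct_convolve(signal, taps), fd_convolve(signal, spectrum)[:length])
```

Both routes give the same output up to rounding. The frequency-domain route replaces per-tap work with one element-wise product per template, provided the spectra table fits in memory.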
Where Pith is reading between the lines
- Lower power draw per computation could reduce the total electricity needed to run the full SKA pulsar search pipeline.
- The same distribution rule might be tested on other FPGA families or combined with partial frequency-domain processing to see further gains.
- If the workload changes to include many more very short filters, the relative benefit of the longest-processing-time rule would shrink.
Load-bearing premise
The tested filter sizes, numbers, and pipeline distributions represent the actual pulsar-search workload that will run in the SKA central signal processor.
What would settle it
Measure throughput and power draw when the optimised FPGA design and the generic baseline both process a large set of real SKA-scale filter groups on the target hardware.
Original abstract
Pulsar search is one of the main tasks for the Square Kilometre Array (SKA), implemented in the central signal processor (CSP) sub-element. As most the characteristics of undiscovered pulsars are unknown by definition, exhaustive searches over a multi-dimensional parameter space are employed. One main compute-intensive task of the pulsar search modules in the CPS is the matched filter group, which convolves the input signals with a group of large FIR filters. High-performance designs on FPGAs have been proposed that can process multiple large filters efficiently. But given that in many applications, including the here targeted pulsar search, FIR filters have many different sizes, there is further potential for optimisation. This paper investigates the optimisation of matched filtering designs. While the results are tranferable to other domains, we are motivated by the needs of the SKA pulsar search engine. The influence of changing number of filters and the difference in sizes is analysed. The generic design in time-domain (TD) is optimised by employing the longest processing time (LPT) first rule to distribute filter templates across filter processing pipelines. For the Fourier-domain (FD), the relationship between the required off-chip memory space and speedup over the generic design is investigated. To put the results into relation with with GPU design, we compared with a well-optimised design for top-end GPUs (NVIDIA Tesal P100). While a mid-range Intel Arria 10 is up to 7.5x slower than the P100, the performance per watt is slightly better on the Arria 10.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates optimizations for matched-filter groups on FPGAs targeted at the SKA pulsar-search engine in the central signal processor. It analyzes the effect of varying filter counts and sizes, applies the longest-processing-time (LPT) first scheduling rule to time-domain (TD) pipelines, examines the off-chip memory versus speedup tradeoff in Fourier-domain (FD) implementations, and compares the resulting performance and power efficiency against a published NVIDIA Tesla P100 GPU design. The headline result is that a mid-range Intel Arria 10 FPGA is up to 7.5× slower than the P100 yet achieves slightly better performance per watt.
Significance. If the tested filter sizes, counts, and pipeline distributions can be shown to match SKA CSP requirements, the work supplies a useful data point on FPGA versus GPU trade-offs for power-constrained, high-volume matched filtering. The concrete speed and efficiency numbers, together with the LPT and memory-speedup analyses, would be directly relevant to SKA instrumentation design where both throughput and power are critical.
Major comments (3)
- [§4] §4 (TD optimisation results): The LPT-first distribution of filter templates across pipelines is presented with concrete speed-up figures, yet the manuscript supplies no mapping or justification showing that the chosen filter sizes, numbers, and size distributions correspond to the actual SKA pulsar-search workload (DM trial spacing, pulse widths, sampling rates, or CSP pipeline requirements). Without this link the reported speed-ups cannot be evaluated for relevance to the target application.
- [§5] §5 (GPU comparison): The claim that the Arria 10 is up to 7.5× slower than the P100 while offering slightly better performance per watt is stated without accompanying filter parameters, workload statistics, error bars, or verification steps. Because the GPU baseline is taken from an external published design, it is impossible to confirm that the FPGA and GPU workloads are comparable.
- [§4.2] §4.2 (FD memory-speedup analysis): The relationship between required off-chip memory and achieved speedup over the generic design is quantified, but again the specific memory sizes and filter counts examined are not shown to be representative of SKA CSP needs; this leaves the practical utility of the memory-speedup curve unclear.
Minor comments (2)
- [Abstract] Abstract: 'tranferable' should read 'transferable'; 'Tesal' should read 'Tesla'; 'with with' should read 'with'.
- [Abstract] Abstract and §1: The phrasing 'the here targeted pulsar search' and 'most the characteristics' are awkward; minor rewording would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback on our manuscript. The comments highlight important points regarding the applicability of our results to the SKA CSP pulsar-search workload. We address each major comment below with clarifications drawn from the paper's motivation and analyses. Where appropriate, we will revise the manuscript to strengthen the links to SKA requirements while preserving the core contributions on FPGA optimizations.
Point-by-point responses
- Referee: [§4] §4 (TD optimisation results): The LPT-first distribution of filter templates across pipelines is presented with concrete speed-up figures, yet the manuscript supplies no mapping or justification showing that the chosen filter sizes, numbers, and size distributions correspond to the actual SKA pulsar-search workload (DM trial spacing, pulse widths, sampling rates, or CSP pipeline requirements). Without this link the reported speed-ups cannot be evaluated for relevance to the target application.
  Authors: We agree that an explicit mapping strengthens the paper. The filter sizes (32–2048 taps) and group sizes (up to 64 filters) were selected to reflect the heterogeneous FIR filter lengths arising from SKA pulsar-search dedispersion and matched-filtering stages, as described in SKA CSP design documents and related literature on time-domain convolution for pulsar searches. The LPT-first scheduling is a general heuristic that improves pipeline utilisation precisely when filter lengths vary, which is the dominant case in SKA workloads. In the revised manuscript we will add a short subsection in §4 that cites the relevant SKA technical requirements and shows how our parameter ranges align with typical DM-trial and pulse-width distributions.
  Revision: yes
- Referee: [§5] §5 (GPU comparison): The claim that the Arria 10 is up to 7.5× slower than the P100 while offering slightly better performance per watt is stated without accompanying filter parameters, workload statistics, error bars, or verification steps. Because the GPU baseline is taken from an external published design, it is impossible to confirm that the FPGA and GPU workloads are comparable.
  Authors: The GPU numbers are taken directly from the published P100 implementation that targets the same class of matched-filter groups for pulsar search. We chose our FPGA filter counts and lengths to lie within the ranges reported in that work. Because both designs perform the identical mathematical operation (group convolution with FIR templates), the comparison is on a per-filter basis. We acknowledge that a side-by-side parameter table would improve transparency; the revised version will include such a table together with the synthesis and power-measurement methodology used for the Arria 10. Deterministic hardware results do not carry statistical error bars, but we will add a note on the verification steps (post-place-and-route timing and power estimation).
  Revision: partial
- Referee: [§4.2] §4.2 (FD memory-speedup analysis): The relationship between required off-chip memory and achieved speedup over the generic design is quantified, but again the specific memory sizes and filter counts examined are not shown to be representative of SKA CSP needs; this leaves the practical utility of the memory-speedup curve unclear.
  Authors: The memory-speedup curves were generated for filter-group sizes and off-chip memory budgets that bracket the on-board DDR capacities available on mid-range Arria 10 devices used in SKA-scale deployments. The generic versus optimised FD implementations differ only in the amount of pre-computed twiddle factors and filter coefficients stored off-chip. We will revise §4.2 to include a brief justification paragraph that maps the examined memory sizes to the SKA CSP memory hierarchy constraints cited in the introduction.
  Revision: yes
Circularity Check
No circularity; empirical performance study with external baselines
The paper is an engineering evaluation of FPGA optimizations for matched-filter groups. It applies the standard LPT scheduling heuristic to time-domain pipelines, measures memory-vs-speedup tradeoffs in the Fourier domain, and benchmarks against a published external GPU design (NVIDIA Tesla P100). No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing way that reduces any claim to its own inputs by construction. All reported speedups and efficiency numbers are direct measurements from implemented designs; the representativeness concern is a validity issue, not circularity.