arxiv: 2604.22124 · v1 · submitted 2026-04-24 · 🌌 astro-ph.IM

FPGA-based Matched Filter Group Optimisation for SKA Pulsar Search Engine

Pith reviewed 2026-05-08 10:00 UTC · model grok-4.3

classification 🌌 astro-ph.IM
keywords FPGA · matched filter · pulsar search · SKA · time domain · Fourier domain · power efficiency · filter scheduling

The pith

Scheduling filters by longest processing time optimizes FPGA matched filtering for pulsar searches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to improve the efficiency of matched filter groups on FPGAs when the filters have varying sizes, a situation common in pulsar searches for the Square Kilometre Array. It applies the longest processing time first rule to assign templates to separate processing pipelines in time-domain designs, reducing idle time compared with generic distributions. For frequency-domain designs the work maps the extra off-chip memory required to achieve different levels of speedup. The resulting FPGA implementation is slower overall than a high-end GPU version but delivers slightly better performance per watt.

Core claim

The generic time-domain matched filter design is optimised by employing the longest processing time first rule to distribute filter templates across pipelines, while Fourier-domain versions show a direct relationship between required off-chip memory space and the speedup obtained over the generic design. When placed against a well-optimised GPU implementation, a mid-range FPGA is up to 7.5 times slower yet achieves slightly superior performance per watt.
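The efficiency half of this claim reduces to simple arithmetic: a device that is s times slower still wins on performance per watt if it draws less than 1/s of the power. The sketch below works this through. The board-power figures are illustrative assumptions, not values taken from the paper, and the 7.5 factor is the paper's reported worst-case slowdown.

```python
def perf_per_watt_ratio(slowdown: float, fpga_watts: float, gpu_watts: float) -> float:
    """Return (FPGA performance per watt) / (GPU performance per watt).

    slowdown is how many times slower the FPGA is than the GPU on the same task.
    """
    if slowdown <= 0 or fpga_watts <= 0 or gpu_watts <= 0:
        raise ValueError("all inputs must be positive")
    # Normalise GPU throughput to 1.0, so FPGA throughput is 1 / slowdown.
    fpga_efficiency = (1.0 / slowdown) / fpga_watts
    gpu_efficiency = 1.0 / gpu_watts
    return fpga_efficiency / gpu_efficiency


def break_even_fpga_watts(slowdown: float, gpu_watts: float) -> float:
    """FPGA power draw below which the FPGA wins on performance per watt."""
    if slowdown <= 0 or gpu_watts <= 0:
        raise ValueError("all inputs must be positive")
    return gpu_watts / slowdown


# Assumed, illustrative power figures (NOT from the paper).
GPU_WATTS = 250.0
FPGA_WATTS = 30.0
SLOWDOWN = 7.5  # worst-case slowdown reported in the paper

print("break-even FPGA power (W):", break_even_fpga_watts(SLOWDOWN, GPU_WATTS))
print("FPGA/GPU perf-per-watt ratio:", perf_per_watt_ratio(SLOWDOWN, FPGA_WATTS, GPU_WATTS))
```

A ratio above 1 means the FPGA is more efficient under the assumed power figures. The real comparison depends on the measured power of both devices.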

What carries the argument

The longest processing time first scheduling rule used to assign filters of different lengths to multiple parallel processing pipelines.
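As a rough illustration of the rule, here is a minimal sketch of longest-processing-time-first assignment. It assumes that a filter's processing time is proportional to its tap count. The tap counts, the pipeline count, and the names `lpt_assign` and `round_robin_makespan` are hypothetical and are not taken from the paper's FPGA design.

```python
import heapq
from typing import List, Sequence, Tuple


def lpt_assign(costs: Sequence[int], n_pipelines: int) -> Tuple[int, List[List[int]]]:
    """Assign jobs to pipelines by the longest-processing-time-first rule.

    Returns (makespan, assignment), where assignment[p] lists the job
    indices given to pipeline p.
    """
    if n_pipelines < 1:
        raise ValueError("need at least one pipeline")
    # Min-heap of (current load, pipeline index): the least-loaded pipeline pops first.
    heap = [(0, p) for p in range(n_pipelines)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(n_pipelines)]
    # Visit jobs from longest to shortest.
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append(idx)
        heapq.heappush(heap, (load + costs[idx], p))
    makespan = max(load for load, _ in heap)
    return makespan, assignment


def round_robin_makespan(costs: Sequence[int], n_pipelines: int) -> int:
    """Baseline: deal jobs to pipelines in input order, ignoring their cost."""
    loads = [0] * n_pipelines
    for i, cost in enumerate(costs):
        loads[i % n_pipelines] += cost
    return max(loads)


# Hypothetical filter group: tap counts stand in for processing time.
TAPS = [32, 64, 128, 256, 512, 1024, 2048, 2048]
PIPELINES = 4

lpt_makespan, lpt_groups = lpt_assign(TAPS, PIPELINES)
print("round-robin makespan:", round_robin_makespan(TAPS, PIPELINES))
print("LPT makespan:", lpt_makespan)
print("LPT groups (filter indices per pipeline):", lpt_groups)
```

The makespan is the load of the busiest pipeline, so a lower value means less idle time on the other pipelines. LPT is a heuristic and does not guarantee an optimal schedule in general.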

If this is right

  • Time-domain designs finish faster once filters are assigned to pipelines by longest processing time first.
  • Frequency-domain designs trade additional memory for measurable speedup over the baseline.
  • FPGA implementations can deliver better energy efficiency than GPU alternatives for the same matched-filter task.
  • The same scheduling approach applies to any signal-processing workload that uses groups of FIR filters with unequal lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Lower power draw per computation could reduce the total electricity needed to run the full SKA pulsar search pipeline.
  • The same distribution rule might be tested on other FPGA families or combined with partial frequency-domain processing to see further gains.
  • If the workload changes to include many more very short filters, the relative benefit of the longest-processing-time rule would shrink.

Load-bearing premise

The tested filter sizes, numbers, and pipeline distributions represent the actual pulsar-search workload that will run in the SKA central signal processor.

What would settle it

Measure throughput and power draw when the optimised FPGA design and the generic baseline both process a large set of real SKA-scale filter groups on the target hardware.

Figures

Figures reproduced from arXiv: 2604.22124 by Ben Stappers, Haomiao Wang, Oliver Sinnen, Prabu Thiagaraj.

Figure 1: Process flow of the overlap-add algorithm (top) and the overlap-save algorithm (bottom)
Figure 2: Reduced taps plotted over tap incremental and number of filters applying Naive TD (left) and differently sized OLA algorithms for …
Figure 3: Architecture of multiple OLA-TD and multiple OLS-FD implementations on one FPGA node
Figure 4: Processing order of TD-OLA-based matched filter group by employing the LPT rule
Figure 5: Launch times for FPGAs that can parallelise 64 taps (left) and 256 taps (right)
Figure 6: Example of padded input group for FD-OLS method
Figure 7: Speedup and saved memory space by changing the number of shared padded inputs through …
Figure 8: Resource usage (left) and performance (right) over DSP block usage
Figure 9: NOLA−tap−opt based on evaluation of implementations for S5 (left) and A10 (right) [plot residue in the extracted caption: x-axis FFT length, y-axis Resource Utilization; series Logic cell, DSP block, and RAM block for S5, A10, and X+A10]
Figure 10: Resource usage (left) and performance (right) of 8-point FFT engine
Figure 11: Frequency (left) and performance (right) of 4-point and 16-point FFT engines
Figure 12: Performance of multiple FD-OLS implementations using 4-point (left) and 8-point (right) FFT engines
Figure 13: Performance of multiple FD-OLS implementations using 16-point FFT engine
Figure 14: Speedup of FD designs over TD designs on S5 (left) and A10 (right)
Figure 15: Required off-chip memory of FD designs relative to corresponding TD designs
Figure 16: Speedup of a single P100 card [5] over a single A10 card
Original abstract

Pulsar search is one of the main tasks for the Square Kilometre Array (SKA), implemented in the central signal processor (CSP) sub-element. As most the characteristics of undiscovered pulsars are unknown by definition, exhaustive searches over a multi-dimensional parameter space are employed. One main compute-intensive task of the pulsar search modules in the CPS is the matched filter group, which convolves the input signals with a group of large FIR filters. High-performance designs on FPGAs have been proposed that can process multiple large filters efficiently. But given that in many applications, including the here targeted pulsar search, FIR filters have many different sizes, there is further potential for optimisation. This paper investigates the optimisation of matched filtering designs. While the results are tranferable to other domains, we are motivated by the needs of the SKA pulsar search engine. The influence of changing number of filters and the difference in sizes is analysed. The generic design in time-domain (TD) is optimised by employing the longest processing time (LPT) first rule to distribute filter templates across filter processing pipelines. For the Fourier-domain (FD), the relationship between the required off-chip memory space and speedup over the generic design is investigated. To put the results into relation with with GPU design, we compared with a well-optimised design for top-end GPUs (NVIDIA Tesal P100). While a mid-range Intel Arria 10 is up to 7.5x slower than the P100, the performance per watt is slightly better on the Arria 10.
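To make the matched filter group concrete, the following is a minimal pure-Python sketch of time-domain overlap-add (OLA) filtering, checked against direct convolution. It only illustrates the textbook algorithm named in Figure 1. The block length, the templates, and the function names are assumptions for illustration and do not reflect the paper's FPGA architecture.

```python
from typing import List, Sequence


def direct_convolve(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Full linear convolution of signal x with FIR template h."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def overlap_add(x: Sequence[float], h: Sequence[float], block_len: int) -> List[float]:
    """Convolve x with h block by block, adding the overlapping tails."""
    if block_len < 1:
        raise ValueError("block_len must be positive")
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        # Each partial result is len(h) - 1 samples longer than its block;
        # that tail overlaps the next block's output and is summed into it.
        partial = direct_convolve(block, h)
        for i, value in enumerate(partial):
            y[start + i] += value
    return y


def matched_filter_group(x: Sequence[float],
                         templates: Sequence[Sequence[float]],
                         block_len: int) -> List[List[float]]:
    """Apply every template in the group to the same input signal."""
    return [overlap_add(x, h, block_len) for h in templates]


# Hypothetical input and a group of templates with unequal lengths.
signal = [float(v) for v in range(1, 11)]
group = [[1.0, -2.0, 3.0], [1.0, 1.0], [2.0, 0.0, 0.0, -1.0]]

outputs = matched_filter_group(signal, group, block_len=4)
for h, y in zip(group, outputs):
    print(len(h), "taps ->", len(y), "output samples")
```

Because the templates have different lengths, each one costs a different amount of work per input block, which is the imbalance that the scheduling rule in the paper targets.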

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates optimizations for matched-filter groups on FPGAs targeted at the SKA pulsar-search engine in the central signal processor. It analyzes the effect of varying filter counts and sizes, applies the longest-processing-time (LPT) first scheduling rule to time-domain (TD) pipelines, examines the off-chip memory versus speedup tradeoff in Fourier-domain (FD) implementations, and compares the resulting performance and power efficiency against a published NVIDIA Tesla P100 GPU design. The headline result is that a mid-range Intel Arria 10 FPGA is up to 7.5× slower than the P100 yet achieves slightly better performance per watt.

Significance. If the tested filter sizes, counts, and pipeline distributions can be shown to match SKA CSP requirements, the work supplies a useful data point on FPGA versus GPU trade-offs for power-constrained, high-volume matched filtering. The concrete speed and efficiency numbers, together with the LPT and memory-speedup analyses, would be directly relevant to SKA instrumentation design where both throughput and power are critical.

major comments (3)
  1. [§4] §4 (TD optimisation results): The LPT-first distribution of filter templates across pipelines is presented with concrete speed-up figures, yet the manuscript supplies no mapping or justification showing that the chosen filter sizes, numbers, and size distributions correspond to the actual SKA pulsar-search workload (DM trial spacing, pulse widths, sampling rates, or CSP pipeline requirements). Without this link the reported speed-ups cannot be evaluated for relevance to the target application.
  2. [§5] §5 (GPU comparison): The claim that the Arria 10 is up to 7.5× slower than the P100 while offering slightly better performance per watt is stated without accompanying filter parameters, workload statistics, error bars, or verification steps. Because the GPU baseline is taken from an external published design, it is impossible to confirm that the FPGA and GPU workloads are comparable.
  3. [§4.2] §4.2 (FD memory-speedup analysis): The relationship between required off-chip memory and achieved speedup over the generic design is quantified, but again the specific memory sizes and filter counts examined are not shown to be representative of SKA CSP needs; this leaves the practical utility of the memory-speedup curve unclear.
minor comments (2)
  1. [Abstract] Abstract: 'tranferable' should read 'transferable'; 'Tesal' should read 'Tesla'; 'with with' should read 'with'.
  2. [Abstract] Abstract and §1: The phrasings 'the here targeted pulsar search' and 'most the characteristics' are awkward; minor rewording would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on our manuscript. The comments highlight important points regarding the applicability of our results to the SKA CSP pulsar-search workload. We address each major comment below with clarifications drawn from the paper's motivation and analyses. Where appropriate, we will revise the manuscript to strengthen the links to SKA requirements while preserving the core contributions on FPGA optimizations.

Point-by-point responses
  1. Referee: [§4] §4 (TD optimisation results): The LPT-first distribution of filter templates across pipelines is presented with concrete speed-up figures, yet the manuscript supplies no mapping or justification showing that the chosen filter sizes, numbers, and size distributions correspond to the actual SKA pulsar-search workload (DM trial spacing, pulse widths, sampling rates, or CSP pipeline requirements). Without this link the reported speed-ups cannot be evaluated for relevance to the target application.

    Authors: We agree that an explicit mapping strengthens the paper. The filter sizes (32–2048 taps) and group sizes (up to 64 filters) were selected to reflect the heterogeneous FIR filter lengths arising from SKA pulsar-search dedispersion and matched-filtering stages, as described in SKA CSP design documents and related literature on time-domain convolution for pulsar searches. The LPT-first scheduling is a general heuristic that improves pipeline utilisation precisely when filter lengths vary, which is the dominant case in SKA workloads. In the revised manuscript we will add a short subsection in §4 that cites the relevant SKA technical requirements and shows how our parameter ranges align with typical DM-trial and pulse-width distributions. revision: yes

  2. Referee: [§5] §5 (GPU comparison): The claim that the Arria 10 is up to 7.5× slower than the P100 while offering slightly better performance per watt is stated without accompanying filter parameters, workload statistics, error bars, or verification steps. Because the GPU baseline is taken from an external published design, it is impossible to confirm that the FPGA and GPU workloads are comparable.

    Authors: The GPU numbers are taken directly from the published P100 implementation that targets the same class of matched-filter groups for pulsar search. We chose our FPGA filter counts and lengths to lie within the ranges reported in that work. Because both designs perform the identical mathematical operation (group convolution with FIR templates), the comparison is on a per-filter basis. We acknowledge that a side-by-side parameter table would improve transparency; the revised version will include such a table together with the synthesis and power-measurement methodology used for the Arria 10. Deterministic hardware results do not carry statistical error bars, but we will add a note on the verification steps (post-place-and-route timing and power estimation). revision: partial

  3. Referee: [§4.2] §4.2 (FD memory-speedup analysis): The relationship between required off-chip memory and achieved speedup over the generic design is quantified, but again the specific memory sizes and filter counts examined are not shown to be representative of SKA CSP needs; this leaves the practical utility of the memory-speedup curve unclear.

    Authors: The memory-speedup curves were generated for filter-group sizes and off-chip memory budgets that bracket the on-board DDR capacities available on mid-range Arria 10 devices used in SKA-scale deployments. The generic versus optimised FD implementations differ only in the amount of pre-computed twiddle factors and filter coefficients stored off-chip. We will revise §4.2 to include a brief justification paragraph that maps the examined memory sizes to the SKA CSP memory hierarchy constraints cited in the introduction. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance study with external baselines

Full rationale

The paper is an engineering evaluation of FPGA optimizations for matched-filter groups. It applies the standard LPT scheduling heuristic to time-domain pipelines, measures memory-vs-speedup tradeoffs in the Fourier domain, and benchmarks against a published external GPU design (NVIDIA Tesla P100). No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing way that reduces any claim to its own inputs by construction. All reported speedups and efficiency numbers are direct measurements from implemented designs; the representativeness concern is a validity issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Engineering optimization study; abstract introduces no new physical constants, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5595 in / 1088 out tokens · 49410 ms · 2026-05-08T10:00:52.810780+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  [1] Altera. Altera SDK for OpenCL Best Practices Guide, 2016.

  [2] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C Ling, and Gordon R Chiu. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 55–64. ACM, 2017.

  [3] Ludovico De Souza, John D Bunton, Duncan Campbell-Wilson, Roger J Cappallo, and Bart Kincaid. A radio astronomy correlator optimized for the Xilinx Virtex-4 SX FPGA. In 2007 International Conference on Field Programmable Logic and Applications, pages 62–67. IEEE, 2007.

  [4] Peter Dewdney, Peter Hall, R Schilizzi, and J Lazio. The Square Kilometre Array. Proceedings of the IEEE, 97(8):1482–1496, 2009.

  [5] Sofia Dimoudi, Karel Adamek, Prabu Thiagaraj, Scott M Ransom, Aris Karastergiou, and Wesley Armour. A GPU implementation of the correlation technique for real-time Fourier domain pulsar acceleration searches. The Astrophysical Journal Supplement Series, 239(2):28, 2018.

  [6] Jeremy Fowers, Greg Brown, John Wernsing, and Greg Stitt. A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):25, 2013.

  [7] Ronald L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, 1969.

  [8] Ronald L Graham, Eugene L Lawler, Jan Karel Lenstra, and AHG Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. In Annals of Discrete Mathematics, volume 5, pages 287–326. Elsevier, 1979.

  [9] Intel. Intel FPGA SDK OpenCL best practices guide, 2019.

  [10] Aaftab Munshi. The OpenCL specification. In 2009 IEEE Hot Chips 21 Symposium (HCS), pages 1–314. IEEE, 2009.

  [11] Hiroki Nakahara, Haruyoshi Yonekawa, and Shimpei Sato. An object detector based on multiscale sliding window search using a fully pipelined binarized CNN on an FPGA. In 2017 International Conference on Field Programmable Technology (ICFPT), pages 168–175. IEEE, 2017.

  [12] Aaron Parsons, Dan Werthimer, Donald Backer, Tim Bastian, Geoffrey Bower, Walter Brisken, Henry Chen, Adam Deller, Terry Filiba, Dale Gary, et al. Digital instrumentation for the radio astronomy community. In astro2010: The Astronomy and Astrophysics Decadal Survey, volume 2010, 2009.

  [13] Karas Pavel and Svoboda David. Algorithms for efficient computation of convolution. In Design and Architectures for Digital Signal Processing. IntechOpen, 2013.

  [14] Scott Ransom. PRESTO: Pulsar exploration and search toolkit. Astrophysics Source Code Library, 2011.

  [15] Scott M Ransom, Stephen S Eikenberry, and John Middleditch. Fourier techniques for very long astrophysical time-series analysis. The Astronomical Journal, 124(3):1788, 2002.

  [16] MA Sanchez, Mario Garrido, Marisa López-Vallejo, Jesús Grajal, and Carlos López-Barrio. Digital channelised receivers on FPGAs platforms. In IEEE International Radar Conference, 2005, pages 816–821. IEEE, 2005.

  [17] Steven W Smith et al. The scientist and engineer’s guide to digital signal processing. 1997.

  [18] Srikanth Sridharan, Paolo Durante, Christian Faerber, and Niko Neufeld. Accelerating particle identification for high-speed data-filtering using OpenCL on FPGAs and other architectures. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pages 1–7. IEEE, 2016.

  [19] George Turin. An introduction to matched filters. IRE Transactions on Information Theory, 6(3):311–329, 1960.

  [20] Haomiao Wang, Prabu Thiagaraj, and Oliver Sinnen. Combining multiple optimized FPGA-based pulsar search modules using OpenCL. Journal of Astronomical Instrumentation.

  [21] Haomiao Wang, Prabu Thiagaraj, and Oliver Sinnen. FPGA-based acceleration of FT convolution for pulsar search using OpenCL. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 11(4):24, 2019.

  [22] Haomiao Wang, Ming Zhang, Prabu Thiagaraj, and Oliver Sinnen. FPGA-based acceleration of FDAS module using OpenCL. In 2016 International Conference on Field-Programmable Technology (FPT), pages 53–60. IEEE, 2016.

  [23] Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna. A framework for generating high throughput CNN implementations on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 117–126. ACM, 2018.