arxiv: 2604.16563 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI

Classification of systolic murmurs in heart sounds using multiresolution complex Gabor dictionary and vision transformer

Pith reviewed 2026-05-10 08:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: systolic murmurs, heart sound classification, Gabor dictionary, orthogonal matching pursuit, vision transformer, time-frequency features, cardiac signal processing

The pith

Projecting heart sound segments onto a shared multiresolution complex Gabor dictionary and classifying the resulting features with a vision transformer identifies four systolic murmur types at 95.96 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an automatic classifier for systolic murmurs: extra heart sounds during the contraction phase that often indicate turbulent blood flow and valve problems. It first extracts time-frequency features by using complex orthogonal matching pursuit to project one or more segments of a recording onto a redundant dictionary of multiresolution complex Gabor basis functions. The key constraint is that all segments of a recording share the same dictionary atoms, so the derived feature matrices stay consistent despite natural murmur variation. Each variable-resolution matrix is tokenized by its own convolutional network; the tokens are concatenated and passed through a transformer encoder that uses multi-head attention and residual connections. On the CirCor DigiScope dataset the method reaches 95.96 percent accuracy across four murmur categories, a result that, if it generalizes, could support faster and more consistent identification of cardiac abnormalities during routine auscultation.
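The projection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Gaussian window, the half-window time hop, the frequency grid, and the names `gabor_dictionary` and `complex_omp` are all assumptions chosen for illustration. It builds a small redundant dictionary of unit-norm complex Gabor atoms at two resolutions and runs a plain complex orthogonal matching pursuit on one synthetic segment.

```python
import numpy as np

def gabor_dictionary(n, scales, n_freqs):
    """Build a redundant dictionary of unit-norm complex Gabor atoms.

    Returns (D, meta): D has shape (n, n_atoms); meta holds (scale, shift,
    freq_bin) per column. The grid below is an illustrative assumption.
    """
    t = np.arange(n)
    atoms, meta = [], []
    for s in scales:
        hop = max(1, s // 2)                     # assumed half-window time hop
        for u in range(0, n, hop):
            window = np.exp(-0.5 * ((t - u) / s) ** 2)
            for k in range(n_freqs):
                f = 0.5 * k / n_freqs            # normalized frequency in [0, 0.5)
                g = window * np.exp(2j * np.pi * f * t)
                atoms.append(g / np.linalg.norm(g))
                meta.append((s, u, k))
    return np.stack(atoms, axis=1), meta

def complex_omp(D, x, n_nonzero):
    """Greedy complex OMP: select the atom most correlated with the residual,
    then refit all selected weights by least squares."""
    x = np.asarray(x, dtype=complex)
    residual = x.copy()
    support = []
    weights = np.zeros(0, dtype=complex)
    for _ in range(n_nonzero):
        corr = D.conj().T @ residual
        if support:
            corr[support] = 0                    # never reselect an atom
        support.append(int(np.argmax(np.abs(corr))))
        weights, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ weights
    return support, weights, residual

# Demo on a synthetic "segment" that is exactly one scaled atom.
D, meta = gabor_dictionary(n=128, scales=(8, 32), n_freqs=8)
x = 3j * D[:, 7]
support, weights, residual = complex_omp(D, x, n_nonzero=1)
```

Because the atoms are unit-norm and distinct, a segment that is a scaled copy of one atom is recovered in a single step. Real murmur segments need many atoms, and exact recovery is not guaranteed in a coherent dictionary.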

Core claim

The central claim has two parts. First, enforcing a shared multiresolution complex Gabor dictionary when projecting multiple segments of a single recording produces consistent variable-resolution time-frequency feature matrices. Second, feeding those matrices into a vision transformer that tokenizes each resolution separately, concatenates the embeddings, and applies multi-head attention followed by a kernel-size-one convolutional network with residual connections yields reliable separation of four systolic murmur classes, as shown by 95.96 percent accuracy on the CirCor DigiScope dataset.
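One common way to force several segments onto the same atoms is simultaneous OMP, which scores each atom by its aggregate correlation with all segment residuals. The sketch below uses that rule as an assumption; the summary above does not say which selection rule the paper uses, and the function name `shared_support_omp` is hypothetical. For a deterministic demo the dictionary is a unitary DFT matrix, whereas a real Gabor dictionary is redundant rather than orthonormal.

```python
import numpy as np

def shared_support_omp(D, X, n_nonzero):
    """Simultaneous OMP (assumed selection rule, not confirmed by the paper).

    D: (n, n_atoms) unit-norm complex dictionary.
    X: (n, n_segments), one column per murmur segment of the same recording.
    Returns the shared support and W of shape (n_nonzero, n_segments).
    """
    X = np.asarray(X, dtype=complex)
    R = X.copy()
    support = []
    W = np.zeros((0, X.shape[1]), dtype=complex)
    for _ in range(n_nonzero):
        # Aggregate correlation magnitude across all segments.
        score = np.sum(np.abs(D.conj().T @ R), axis=1)
        if support:
            score[support] = -np.inf             # never reselect an atom
        support.append(int(np.argmax(score)))
        # Each segment gets its own weights on the common set of atoms.
        W, *_ = np.linalg.lstsq(D[:, support], X, rcond=None)
        R = X - D[:, support] @ W
    return support, W

# Demo: two segments built from the same three atoms with different weights.
n = 64
D = np.fft.fft(np.eye(n)) / np.sqrt(n)           # unitary stand-in dictionary
W_true = np.zeros((n, 2), dtype=complex)
W_true[[3, 10, 20], 0] = [2, 1j, -1]
W_true[[3, 10, 20], 1] = [0.5, 2, 1 + 1j]
X = D @ W_true
support, W = shared_support_omp(D, X, n_nonzero=3)
```

The design point is that the support is chosen jointly while the weights stay segment-specific, which is what keeps the per-segment feature matrices aligned.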

What carries the argument

A redundant dictionary of multiresolution complex Gabor basis functions whose projection weights, obtained under a shared-dictionary constraint via complex orthogonal matching pursuit, are reshaped into variable-resolution time-frequency matrices that serve as the multi-input to the vision transformer.
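The split-and-reshape step can be sketched as follows. The atom ordering (resolution first, then time shift, then frequency bin), the use of weight magnitudes, and the name `weights_to_matrices` are assumptions made for illustration. The paper may order atoms differently or keep real and imaginary parts separately.

```python
import numpy as np

def weights_to_matrices(w, grid_shapes):
    """Split a flat weight vector into one (n_freqs, n_shifts) matrix per resolution.

    grid_shapes: list of (n_shifts, n_freqs) per resolution, in dictionary order.
    Assumes atoms are ordered resolution-major, then shift, then frequency.
    """
    sizes = [t * f for t, f in grid_shapes]
    if len(w) != sum(sizes):
        raise ValueError("weight vector length does not match the grid")
    blocks = np.split(np.asarray(w), np.cumsum(sizes)[:-1])
    # Magnitudes are used here as features; this choice is an assumption.
    return [np.abs(b).reshape(t, f).T for b, (t, f) in zip(blocks, grid_shapes)]

# Demo: two resolutions with 32x8 and 8x8 (shift, frequency) grids.
grid_shapes = [(32, 8), (8, 8)]
w = np.zeros(32 * 8 + 8 * 8, dtype=complex)
w[5] = 3j                        # resolution 0, shift 0, frequency bin 5
w[256 + 2 * 8 + 1] = 1 - 1j      # resolution 1, shift 2, frequency bin 1
mats = weights_to_matrices(w, grid_shapes)
```

Each returned matrix has frequency along rows and time along columns, so finer resolutions yield wider matrices for the same segment length.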

If this is right

  • Enforcing a shared dictionary across multiple segments of one recording reduces the impact of natural murmur variability on the final feature matrices.
  • Splitting and reshaping the projection weights into several resolution-specific matrices lets the classifier capture both fine-grained and coarse time-frequency structure.
  • Concatenating patch tokens from separate convolutional front-ends before the transformer encoder allows a single attention layer to integrate information across resolutions.
  • The reported 95.96 percent accuracy on four murmur types indicates that the combined sparse-dictionary and transformer pipeline can distinguish clinically relevant systolic patterns without hand-crafted acoustic descriptors.
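The fusion described in the third point can be made concrete with a forward-pass-only NumPy sketch. It is heavily simplified and rests on stated assumptions: patch embedding is written as a linear map over non-overlapping patches (equivalent to a convolution whose kernel equals its stride), normalization and positional encodings are omitted, all weights are random, and the sizes are arbitrary. It shows only how tokens from two resolutions end up in one attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_tokens(M, patch, W):
    """Non-overlapping patch embedding of a 2-D feature matrix."""
    ph, pw = patch
    H, Wd = M.shape
    M = M[: H - H % ph, : Wd - Wd % pw]            # drop any ragged border
    gh, gw = M.shape[0] // ph, M.shape[1] // pw
    patches = M.reshape(gh, ph, gw, pw).transpose(0, 2, 1, 3).reshape(gh * gw, ph * pw)
    return patches @ W                              # (n_patches, d_model)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d = X.shape
    dh = d // n_heads

    def heads(A):                                   # (n, d) -> (n_heads, n, dh)
        return A.reshape(n, n_heads, dh).transpose(1, 0, 2)

    Q, K, V = heads(X @ Wq), heads(X @ Wk), heads(X @ Wv)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))
    return (A @ V).transpose(1, 0, 2).reshape(n, d) @ Wo

def encoder_layer(X, p, n_heads):
    X = X + multihead_attention(X, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)
    # A kernel-size-one convolution over tokens is a per-token linear map.
    return X + np.maximum(X @ p["W1"], 0.0) @ p["W2"]

# Demo with random stand-ins for two resolution-specific feature matrices.
d_model, n_heads, n_classes = 16, 4, 4
mats = [rng.random((8, 32)), rng.random((8, 8))]
patch_sizes = [(4, 4), (2, 2)]
embeds = [0.1 * rng.standard_normal((ph * pw, d_model)) for ph, pw in patch_sizes]
tokens = np.concatenate(
    [patch_tokens(M, ps, W) for M, ps, W in zip(mats, patch_sizes, embeds)]
)
p = {k: 0.1 * rng.standard_normal((d_model, d_model))
     for k in ("Wq", "Wk", "Wv", "Wo", "W1", "W2")}
out = encoder_layer(tokens, p, n_heads)
probs = softmax(out.mean(axis=0) @ (0.1 * rng.standard_normal((d_model, n_classes))))
```

Because tokens from both resolutions sit in the same sequence, every attention head can weigh a coarse-resolution patch against a fine-resolution one directly.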

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shared-dictionary strategy could be tested on diastolic murmurs or other transient cardiac sounds where segment-to-segment consistency is also an issue.
  • Replacing the vision transformer with a lighter sequence model might reveal how much of the performance gain comes from the multiresolution Gabor front-end versus the attention mechanism.
  • Cross-device validation on recordings from consumer-grade digital stethoscopes would directly test whether the selected basis functions and the trained classifier transfer beyond the original dataset hardware.
  • If the approach holds, it could be embedded in portable auscultation apps to flag murmur subtypes for remote review by cardiologists.

Load-bearing premise

The CirCor DigiScope recordings already contain enough real-world clinical variability that the shared-dictionary constraint will keep the extracted features discriminative when the same model encounters new patients, different recording devices, or unseen noise conditions.

What would settle it

Accuracy falling below 90 percent when the identical pipeline is evaluated on an independent collection of heart-sound recordings made with different stethoscopes or from patient populations absent from the original training set.

Figures

Figures reproduced from arXiv: 2604.16563 by Abeer FathAllah Brery, Mahmoud Fakhry.

Figure 1: Two cardiac cycles of heart sound, with systolic murmurs of diamond type
Figure 2: Shapes of different types of systolic heart murmurs in the time domain.
Figure 3: Block diagram of the developed system.
Figure 4: Block diagram of the developed feature extraction module, mainly built …
Figure 5: Feature extraction from a murmur segment of diamond type using COMP
Figure 6: Feature extraction from two murmur segments from the same heart sound
Figure 7: Examples of complex atoms with three different resolutions for …
Figure 8: Block diagram of the developed vision transformer encoder-based model with …
Figure 9: Block diagram of the multihead attention with …
Figure 10: The evolution of the training and validation loss and accuracy across epochs.
Original abstract

Systolic murmurs are extra heart sounds that occur during the contraction phase of the cardiac cycle, often indicating heart abnormalities caused by turbulent blood flow. Their intensity, pitch, and quality vary, requiring precise identification for the accurate diagnosis of cardiac disorders. This study presents an automatic classification system for systolic murmurs using a feature extraction module, followed by a classification model. The feature extraction module employs complex orthogonal matching pursuit to project single or multiple murmur segments onto a redundant dictionary composed of multiresolution complex Gabor basis functions (GBFs). The resulting projection weights are split and reshaped into variable-resolution time--frequency feature matrices. Processing multiple segments of a single recording using a shared dictionary mitigates murmur variability. This is achieved by learning the weights for each segment while enforcing that they correspond to the same set of basis functions in the dictionary, promoting consistent time--frequency feature matrices. The classification model is built based on a vision transformer to process multiple input matrices of different resolutions by passing each through a convolutional neural network for patch tokenization. All embedding tokens are then concatenated to form a matrix and forwarded to an encoder layer that includes multihead attention, residual connections, and a convolutional network with a kernel size of one. This integration of multiresolution feature extraction with transformer-based feature classification enhances the accuracy and reliability of heart murmur identification. An experimental analysis of four types of systolic murmurs from the CirCor DigiScope dataset demonstrates the effectiveness of the system, achieving a classification accuracy of $95.96\%$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an automatic classification system for four types of systolic murmurs from heart sounds. It employs complex orthogonal matching pursuit to project single or multiple murmur segments onto a redundant multiresolution complex Gabor dictionary, yielding sparse coefficients that are split and reshaped into variable-resolution time-frequency feature matrices. A shared dictionary is enforced across segments of the same recording to promote consistency in the extracted features. These matrices are fed to a vision transformer that tokenizes each resolution level via a CNN, concatenates the resulting embeddings, and processes them through an encoder layer incorporating multi-head attention, residual connections, and a 1x1 convolutional network. An experimental analysis on the CirCor DigiScope dataset reports a classification accuracy of 95.96%.

Significance. If the performance claims hold under rigorous validation, the work could contribute to automated cardiac auscultation by showing how multiresolution sparse time-frequency representations can be integrated with a transformer to manage intra-recording variability. The shared-dictionary constraint in the OMP stage is a reasonable design choice for enforcing consistency, and the multi-resolution ViT handling is a plausible way to fuse features at different scales. However, the absence of standard experimental controls substantially reduces the immediate significance of the reported accuracy.

major comments (2)
  1. [Abstract] The central claim of 95.96% classification accuracy on four systolic-murmur classes is stated without any description of the train-test split, cross-validation strategy, baseline comparisons, statistical significance testing, or class-imbalance handling on the CirCor DigiScope dataset. These details are load-bearing for evaluating whether the empirical result supports the effectiveness of the multiresolution Gabor + ViT pipeline.
  2. [Experimental Analysis] No information is given on whether recordings were partitioned at the patient level, or whether the shared set of dictionary atoms was selected in a way that could leak information across segments of the same patient. This bears directly on the weakest assumption: that the features remain discriminative for new patients, devices, or noise conditions.
minor comments (2)
  1. [Feature Extraction Module] The description of how the projection weights are split and reshaped into variable-resolution time-frequency matrices would benefit from an explicit equation or pseudocode to clarify the matrix dimensions and resolution levels.
  2. [Classification Model] The vision transformer encoder is described at a high level; a diagram showing the CNN tokenization per resolution, concatenation, and single encoder layer would improve clarity of the multi-resolution fusion.
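The patient-level partitioning raised in major comment 2 can be sketched in a few lines of standard-library Python. The record layout `(patient_id, segment_id)` and the function name are hypothetical and do not reflect the dataset's actual field names. The idea is to shuffle patients, not segments, so that no patient contributes to both sides of the split.

```python
import random
from collections import defaultdict

def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split (patient_id, segment_id) records so no patient is in both sets."""
    by_patient = defaultdict(list)
    for patient_id, segment_id in records:
        by_patient[patient_id].append((patient_id, segment_id))
    patients = sorted(by_patient)                 # fixed order before shuffling
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test = [r for p in patients[:n_test] for r in by_patient[p]]
    train = [r for p in patients[n_test:] for r in by_patient[p]]
    return train, test

# Demo: 10 hypothetical patients with 3 murmur segments each.
records = [(f"p{i}", j) for i in range(10) for j in range(3)]
train, test = patient_level_split(records)
train_patients = {p for p, _ in train}
test_patients = {p for p, _ in test}
```

Any shared-atom selection would then be run per recording inside each side of the split, so that segments of a held-out patient never influence training features.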

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a standard empirical ML pipeline: complex OMP projects heart-sound segments onto a fixed multiresolution complex Gabor dictionary to produce sparse coefficients that are reshaped into time-frequency matrices; a shared dictionary is used across segments of one recording; these matrices are tokenized by CNNs and fed to a single-layer vision transformer. The 95.96% accuracy is obtained by training and testing this architecture on the CirCor DigiScope dataset. No equation reduces the reported accuracy to a fitted constant, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in via prior work. The central claim is therefore an experimental result rather than an algebraic identity or self-referential definition.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on standard assumptions of sparse representability of heart sounds by Gabor atoms and on conventional deep-learning training procedures; no new physical entities or ungrounded axioms are introduced.

free parameters (2)
  • Number and choice of resolution levels in the Gabor dictionary
    Determines the variable-resolution feature matrices and is selected to balance detail and consistency.
  • Vision transformer hyperparameters (patch size, number of heads, layers)
    Chosen or tuned for the concatenated multi-resolution token matrix.
axioms (1)
  • domain assumption: Heart-sound segments admit a sparse representation in the multiresolution complex Gabor dictionary.
    Invoked by the use of complex orthogonal matching pursuit to obtain projection weights.

pith-pipeline@v0.9.0 · 5573 in / 1376 out tokens · 44602 ms · 2026-05-10T08:59:17.197872+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

  1. [1]

    World Health Organization, Cardiovascular Diseases (CVDs), 2021. Available from: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

  2. [2]

    H. Jung, L. S. Lilly, The cardiac cycle: Mechanisms of heart sounds and murmurs, in Pathophysiology of Heart Disease: A Collaborative Project of Medical Students and Faculty, 5th edition, Chapter 2, Philadelphia, (2011), 28–53

  3. [3]

    B. Boashash, Time–frequency Signal Analysis and Processing: A Comprehensive Reference, 2nd edition, Academic Press, Oxford, 2015

  4. [4]

    S. Mallat, Z. Zhang, Matching pursuit with time–frequency dictionaries, IEEE Trans. Signal Process., 41 (1993), 3397–3415. https://doi.org/10.1109/78.258082

  5. [5]

    X. Zhang, L. G. Durand, L. Senhadji, H. C. Lee, J. L. Coatrieux, Analysis-synthesis of the phonocardiogram based on the matching pursuit method, IEEE Trans. Biomed. Eng., 45 (1998), 962–971. https://doi.org/10.1109/10.704865

  6. [6]

    X. Zhang, L. Durand, L. Senhadji, H. Lee, J. L. Coatrieux, Time–frequency scaling transformation of the phonocardiogram based of the matching pursuit method, IEEE Trans. Biomed. Eng., 45 (1998), 972–979. https://doi.org/10.1109/10.704866

  7. [7]

    I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2017

  8. [8]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, (2017), 5998–6008

  9. [9]

    Y. Wang, Y. Deng, Y. Zheng, P. Chattopadhyay, L. Wang, Vision transformers for image classification: A comparative survey, Technologies, 13 (2025), 32. https://doi.org/10.3390/technologies13010032

  10. [10]

    J. Oliveira, F. Renna, P. D. Costa, M. Nogueira, C. Oliveira, C. Ferreira, et al., The CirCor DigiScope dataset: From murmur detection to murmur classification, IEEE J. Biomed. Health. Inf., 26 (2022), 2524–2535. https://doi.org/10.1109/JBHI.2021.3137048

  11. [11]

    R. O. Bonow, D. L. Mann, D. P. Zipes, P. Libby, Braunwald’s Heart Disease: A Textbook of Cardiovascular Medicine, 9th edition, Elsevier Health Sciences, Philadelphia, 2011

  12. [12]

    J. Singh, R. Anand, Computer aided analysis of phonocardiogram, J. Med. Eng. Technol., 31 (2007), 319–323. https://doi.org/10.1080/03091900500282772

  13. [13]

    Y. Zheng, X. Guo, X. Ding, A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification, Expert Syst. Appl., 42 (2015), 2710–2721. https://doi.org/10.1016/j.eswa.2014.10.051

  14. [14]

    P. D. Stein, H. N. Sabbah, J. B. Lakier, S. R. Kemp, D. J. Magilligan, Frequency content of heart sounds and systolic murmurs in patients with porcine bioprosthetic valves: Diagnostic value for the early detection of valvular degeneration, Henry Ford Hosp. Med. J., 30 (1982), 119–123

  15. [15]

    M. Akay, Time–frequency and Wavelets in Biomedical Signal Processing, Wiley-IEEE Press, New York, 1998

  16. [16]

    M. Fakhry, A. Gallardo-Antolín, Variational mode decomposition and a light CNN-LSTM model for classification of heart sound signals, in IEEE EUROCON 2023 - 20th International Conference on Smart Technologies, Torino, Italy, (2023), 295–300. https://doi.org/10.1109/EUROCON56442.2023.10199054

  17. [17]

    N. Atanasov, T. Ning, Isolation of systolic heart murmurs using wavelet transform and energy index, in 2008 Congress on Image and Signal Processing, Sanya, China, (2008), 216–. https://doi.org/10.1109/CISP.2008.758

  18. [18]

    A. Haghighi-Mood, N. Torry, Time–frequency analysis of systolic murmurs, in Computers in Cardiology 1997, IEEE, (1997), 113–116. https://doi.org/10.1109/CIC.1997.647843

  19. [19]

    J. G. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, J. Opt. Soc. Am. A, 2 (1985), 1160–1169. https://doi.org/10.1364/JOSAA.2.001160

  20. [20]

    M. Fakhry, A. F. Brery, A. Gallardo-Antolín, Analysis of heart sound signals using sparse modeling with Gabor dictionary, in 2022 IEEE International Symposium on Multimedia (ISM), Italy, (2022), 92–96. https://doi.org/10.1109/ISM55400.2022.00021

  21. [21]

    M. Fakhry, A. Gallardo-Antolín, Elastic net regularization and Gabor dictionary for classification of heart sound signals using deep learning, Eng. Appl. Artif. Intell., 127 (2024), 107406. https://doi.org/10.1016/j.engappai.2023.107406

  22. [22]

    M. Fakhry, A. Gallardo-Antolín, Analysis of systolic murmurs in heart sounds using multiresolution complex Gabor dictionary, in 2024 International Conference on Computer and Applications (ICCA), IEEE, Cairo, Egypt, (2024), 783–787. https://doi.org/10.1109/ICCA62237.2024.10927981

  23. [23]

    S. Jabbari, H. Ghassemian, Modeling of heart systolic murmurs based on multivariate matching pursuit for diagnosis of valvular disorders, Comput. Biol. Med., 41 (2011), 802–811. https://doi.org/10.1016/j.compbiomed.2011.06.016

  24. [24]

    M. Shabbir, X. Liu, M. Nasseri, S. Helgeson, Heart murmur classification in phonocardiogram representations using convolutional neural networks, in The International FLAIRS Conference Proceedings, 36 (2023). https://doi.org/10.32473/flairs.36.133189

  25. [25]

    C. Yin, Y. Zheng, X. Ding, Y. Shi, J. Qin, X. Guo, Detection of coronary artery disease based on clinical phonocardiogram and multiscale attention convolutional compression network, IEEE J. Biomed. Health. Inf., 28 (2024), 1353–1362. https://doi.org/10.1109/JBHI.2024.3354832

  26. [26]

    J. Kim, G. Park, B. Suh, Classification of phonocardiogram recordings using vision transformer architecture, in 2022 Computing in Cardiology (CinC), IEEE, Tampere, Finland, 498 (2022), 1-4. https://doi.org/10.22489/CinC.2022.084

  27. [27]

    Z. Liu, H. Jiang, F. W. Zhang, W. B. Ouyang, X. Li, X. Pan, Heart sound classification based on bispectrum features and vision transformer mode, Alexandria Eng. J., 85 (2023), 49–59. https://doi.org/10.1016/j.aej.2023.11.035

  28. [28]

    J. Han, A. Shaout, Enact-heart – ensemble-based assessment using CNN and transformer on heart sounds, preprint, arXiv:2502.16914. https://doi.org/10.48550/arXiv.2502.16914

  29. [29]

    W. Zhao, H. Ma, N. Jin, Y. Zheng, X. Guo, Detection of coronary heart disease based on heart sound and hybrid vision transformer, Appl. Acoust., 230 (2025), 110420. https://doi.org/10.1016/j.apacoust.2024.110420

  30. [30]

    R. Wang, Y. Duan, Y. Li, D. Zheng, X. Liu, C. T. Lam, et al., PCTMF-Net: heart sound classification with parallel CNNs-transformer and second-order spectral analysis, Vis. Comput., 39 (2023), 3811–3822. https://doi.org/10.1007/s00371-023-03031-5

  31. [31]

    S. Qiu, H. G. Feichtinger, Discrete Gabor structures and optimal representations, IEEE Trans. Signal Process., 43 (1995), 2258–2268. https://doi.org/10.1109/78.469862

  32. [32]

    I. Rish, G. Grabarnik, Sparse Modeling: Theory, Algorithms, and Applications, CRC Press, 2014

  33. [33]

    Z. Průša, N. Holighaus, P. Balázs, Fast matching pursuit with multi-Gabor dictionaries, ACM Trans. Math. Software, 47 (2021), 1–20. https://doi.org/10.1145/3447958

  34. [34]

    Z. Zhang, S. Wei, D. Wei, L. Li, F. Liu, C. Liu, Comparison of four recovery algorithms used in compressed sensing for ECG signal processing, in 2016 Computing in Cardiology Conference (CinC), IEEE, (2016), 401–404

  35. [35]

    W. Dai, O. Milenkovic, Subspace pursuit for compressive sensing signal reconstruction, IEEE Trans. Inf. Theory, 55 (2009), 2230–2249. https://doi.org/10.1109/TIT.2009.2016006

  36. [36]

    K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778.

Electronic Research Archive, Volume 34, Issue 3, xxx–xxx. © 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons....