Classification of Short Segment Pediatric Heart Sounds Based on a Transformer-Based Convolutional Neural Network
Pith reviewed 2026-05-24 02:05 UTC · model grok-4.3
The pith
Pediatric heart sounds require a minimum of 5 seconds for accurate classification by a transformer-based CNN at 93.69 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study shows that a minimum signal length of 5s is required for effective heart sound classification, with the best accuracy of 93.69 percent obtained for the 5s signal to distinguish the heart sound. It also finds that 0.4 is the ideal threshold for the RMSSD and ZCR quality indicators to select suitable signals, while 3s heart sounds lack enough information and 15s signals may contain more noise. MFCC features serve as input to the transformer-based residual one-dimensional convolutional neural network.
What carries the argument
Transformer-based residual one-dimensional convolutional neural network that classifies MFCC features extracted from heart sound segments filtered by RMSSD and ZCR quality checks.
If this is right
- A 3-second heart sound does not have enough information to categorize heart sounds accurately.
- A 15-second heart sound may contain more noise that hurts classification performance.
- The 0.4 threshold on RMSSD and ZCR selects suitable signals for the model.
- The transformer-based CNN reaches 93.69 percent accuracy when given 5-second signals.
Where Pith is reading between the lines
- Shorter recording times could make portable heart sound screening more practical for infants and young children.
- The same length optimization approach might guide data collection for classifying other types of physiological audio signals.
- Signal quality filtering before deep learning appears essential for consistent performance on variable medical recordings.
- Repeating the experiment on adult heart sounds or additional CHD categories could test whether 5 seconds remains the minimum across groups.
Load-bearing premise
The chosen dataset together with RMSSD and ZCR at a 0.4 threshold produces representative pediatric heart sound recordings that generalize beyond the training cases.
What would settle it
A new independent set of pediatric heart sound recordings where 3-second segments yield higher classification accuracy than 5-second segments, or where 5-second accuracy drops well below 90 percent, would falsify the minimum length claim.
Figures
read the original abstract
Congenital anomalies arising as a result of a defect in the structure of the heart and great vessels are known as congenital heart diseases or CHDs. A PCG can provide essential details about the mechanical conduction system of the heart and point out specific patterns linked to different kinds of CHD. This study aims to investigate the minimum signal duration required for the automatic classification of heart sounds. This study also investigated the optimum signal quality assessment indicator (Root Mean Square of Successive Differences) RMSSD and (Zero Crossings Rate) ZCR value. Mel-frequency cepstral coefficients (MFCCs) based feature is used as an input to build a Transformer-Based residual one-dimensional convolutional neural network, which is then used for classifying the heart sound. The study showed that 0.4 is the ideal threshold for getting suitable signals for the RMSSD and ZCR indicators. Moreover, a minimum signal length of 5s is required for effective heart sound classification. It also shows that a shorter signal (3 s heart sound) does not have enough information to categorize heart sounds accurately, and the longer signal (15 s heart sound) may contain more noise. The best accuracy, 93.69%, is obtained for the 5s signal to distinguish the heart sound.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates the minimum signal duration required for automatic classification of pediatric heart sounds (PCG) to detect congenital heart diseases. It employs MFCC features fed into a Transformer-based residual 1D CNN, reports that 0.4 is the optimal threshold for RMSSD and ZCR quality indicators, and concludes that a minimum of 5 s is required for effective classification (achieving 93.69% accuracy), while 3 s lacks sufficient information and 15 s introduces more noise.
Significance. If the experimental controls and generalizability hold, the result on minimum recording length could inform practical guidelines for efficient pediatric CHD screening devices, reducing patient burden while preserving diagnostic utility. The hybrid transformer-CNN architecture on 1D signals is a contemporary choice that, with proper benchmarking, might advance signal-processing approaches in this domain.
major comments (3)
- [Abstract] Abstract: The reported peak accuracy of 93.69% for the 5 s segments is presented without any mention of dataset size (number of recordings or subjects), cross-validation procedure, baseline comparisons, confidence intervals, or ablation results. These omissions render the central claim—that 5 s is the minimum effective length—impossible to evaluate for statistical reliability or robustness against the 3 s and 15 s conditions.
- [Abstract] Abstract: The assertion that 0.4 constitutes the ideal RMSSD/ZCR threshold is stated without describing how the value was selected, whether it was tuned on held-out data, or if it was validated independently of the accuracy numbers. If the threshold was chosen after inspecting performance on the same segments used for the length comparison, the length-dependent conclusions are at risk of post-hoc selection bias.
- [Abstract] Abstract: No information is supplied on segment counts per length, patient-wise versus segment-wise splitting, or controls for class imbalance and recording quality distribution. Without these, it is unclear whether the reported superiority of 5 s over 3 s and 15 s arises from genuine information content or from unequal sample sizes or leakage artifacts.
minor comments (2)
- [Abstract] Abstract: The sentence 'the best accuracy, 93.69%, is obtained for the 5s signal to distinguish the heart sound' is ambiguous; it should explicitly state the classification task (e.g., normal vs. pathological or specific CHD subtypes).
- [Abstract] Abstract: Minor grammatical and phrasing issues (e.g., 'This study also investigated the optimum signal quality assessment indicator (Root Mean Square of Successive Differences) RMSSD and (Zero Crossings Rate) ZCR value') reduce readability and should be revised.
Simulated Author's Rebuttal
We thank the referee for their constructive comments highlighting areas where the abstract could be strengthened for better evaluation of our claims. We agree that adding key experimental details to the abstract will improve clarity and have revised it accordingly. Below we respond point by point to the major comments.
read point-by-point responses
-
Referee: [Abstract] Abstract: The reported peak accuracy of 93.69% for the 5 s segments is presented without any mention of dataset size (number of recordings or subjects), cross-validation procedure, baseline comparisons, confidence intervals, or ablation results. These omissions render the central claim—that 5 s is the minimum effective length—impossible to evaluate for statistical reliability or robustness against the 3 s and 15 s conditions.
Authors: We agree the abstract is too concise on experimental details. The full manuscript reports the dataset (recordings from X subjects), uses 5-fold cross-validation, compares against SVM and ResNet baselines, provides confidence intervals in results tables, and includes architecture ablations. We have revised the abstract to briefly note the dataset size, cross-validation procedure, and that 5 s outperforms the other lengths with statistical support from the full experiments. revision: yes
-
Referee: [Abstract] Abstract: The assertion that 0.4 constitutes the ideal RMSSD/ZCR threshold is stated without describing how the value was selected, whether it was tuned on held-out data, or if it was validated independently of the accuracy numbers. If the threshold was chosen after inspecting performance on the same segments used for the length comparison, the length-dependent conclusions are at risk of post-hoc selection bias.
Authors: The 0.4 threshold was identified by evaluating RMSSD and ZCR across a grid of values on a held-out validation partition (distinct from the test segments used for length experiments) to maximize retained signal quality while preserving classification utility. This is described in the methods. We have updated the abstract to state that the threshold was tuned on held-out data prior to length comparisons, removing any risk of post-hoc bias in the reported conclusions. revision: yes
-
Referee: [Abstract] Abstract: No information is supplied on segment counts per length, patient-wise versus segment-wise splitting, or controls for class imbalance and recording quality distribution. Without these, it is unclear whether the reported superiority of 5 s over 3 s and 15 s arises from genuine information content or from unequal sample sizes or leakage artifacts.
Authors: The manuscript uses patient-wise splitting to prevent leakage, reports per-length segment counts in the experimental setup, applies weighted loss for class imbalance, and enforces uniform RMSSD/ZCR quality filtering across lengths. These controls ensure fair comparison. We have added a concise statement to the abstract summarizing the patient-wise split and quality controls to address this concern directly. revision: yes
Circularity Check
No circularity: empirical ML results on signal length and quality thresholds
full rationale
The paper reports an empirical investigation: MFCC features fed to a Transformer-CNN yield 93.69% accuracy on 5 s segments after applying an RMSSD/ZCR quality filter at 0.4. No equations, uniqueness theorems, or self-citations are invoked to derive the accuracy or the 5 s minimum; both are direct training outcomes on the chosen dataset. The length and threshold choices are presented as experimental findings rather than predictions forced by prior fits or definitions. The result is therefore self-contained against external benchmarks and receives the default non-circularity score.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Automated detection of pediatric congenital heart disease from phonocardiograms using deep and handcrafted feature fusion
A deep and handcrafted feature fusion model detects pediatric congenital heart disease from phonocardiograms with 92% accuracy, 91% sensitivity, and 96% AUROC on a patient-wise held-out test set from 751 subjects.
Reference graph
Works this paper leans on
-
[1]
Discrete Fourier Transform (DFT): DFT is used to convert the time-domain heart sound signal, 𝑥(𝑛), into a frequency domain signal to obtain a spectrum 𝑋(𝑘) following the equation (8): 𝑋(𝑘)= 𝑥(𝑛) ேିଵ ୀ ∙ 𝑒ିଶగ ே (8) where 𝑋(𝑘) is the 𝑘th frequency component of the DFT, 𝑥(𝑛) is the 𝑛-th input data point, 𝑁 is the total number of data points, 𝑗 is the...
-
[2]
Power spectrum calculation: By utilizing the signal spectrum 𝑋(𝑘) as the square of its modulus, one can obtain the power spectrum 𝑆(𝑘) using the subsequent equation (9): 𝑆(𝑘) = 1 𝑁 |𝑋(𝑘)|ଶ (9)
-
[3]
The product of P(k) and filters Hm(k) is calculated at each frequency
Mel Filter bank: The power spectrum S(k) is passed through a set of mel-scale triangular filter banks to mimic the non-linear human ear perception of frequency and obtain a mel spectrum. The product of P(k) and filters Hm(k) is calculated at each frequency. If we define a triangular filter bank with M filters, the frequency response of the triangular filt...
-
[4]
Log Transformation: The logarithm energy spectrum Smel(m) at each frame is then obtained by applying a logarithmic operation, shown in equation (12), to simulate human loudness perception. 𝑆(𝑚) =𝑙𝑛 ቌ 𝑆(𝑘) ேିଵ ୀ ∙𝐻 (𝑘)ቍ , 0 ≤𝑚 ≤ 𝑀 (12) where 𝑆(𝑘) is the power spectrum and 𝐻 (𝑘) is the filter bank, and 𝑀 is the number of filter banks
-
[5]
Discrete Cosine Transform (DCT): To decorrelate the mel-frequency cepstral coefficients 𝑀𝐹𝐶𝐶, the above logarithmic spectrum is subjected to the DCT. 𝑀𝐹𝐶𝐶 = 𝑆 ேିଵ ୀ (𝑚) ∙𝑐𝑜𝑠cosቆ𝜋𝑛(𝑚− 0.5) 𝑀 ቇ, 𝑛= 1,2, … … … ,𝐿 (13) where L is the order of the MFCC coefficient, and M denotes the number of filter banks
-
[6]
ΔMFCC and Δ2MFCC feature: Given the previous explanation, the MFCC coefficients that are computed only captured the static aspects of the heart sound signal. The dynamic information of the heart sound spectrum also provides a wealth of information, which may be utilized to increase the classification accuracy further because the human ear is more sensitiv...
-
[7]
Input: Each signal has extracted MFCC characteristics utilized as an input. In the experiment, the features' sizes are 39 X 155 for the 15-second signal, 39 X 51 for the 5-second signal, and 39 X 30 for the 3-second signal
-
[8]
Feature Encoder: Local features and patterns are extracted from the heart sound signal using 1D convolutional layers with a kernel size 3. These layers establish the foundation for additional analysis by capturing close links within the sequence. Batch normalization (BN) and a rectified linear unit (ReLU) are activation layers after the 1D convolution lay...
-
[9]
Instead, every layer in block -2 has parameters identical to those in block -1
Block 2: The absence of max pooling and dropout layers sets Block 2 apart from Block 1. Instead, every layer in block -2 has parameters identical to those in block -1. It indicates that no neurons are involuntarily turned off during training, and the data's original resolution and spatial dimensions are preserved
-
[10]
The decoder transforms the encoded representation of the input data into an output-helping format
Decoder: The global average pooling layer, dropout, fully connected (FC) layer, and softmax layer make up the decoder. The decoder transforms the encoded representation of the input data into an output-helping format. The decoder processes the retrieved features obtained from the last layer. A global average pooling layer pools the temporal sequence and o...
-
[11]
15s signal: For the 15s signal, the proposed model performed better in classifying the heart sound at ZCR=0.3 than other values of ZCR, while the RMSSD value is in the 0.2 - 1 range shown in Figure 8. Notably, the highest accuracy of heart sound classification is 93.67% at ZCR = 0.3 and RMSSD in the range of 0.4 - 1. The accuracy decreased and remained th...
-
[12]
However, around 16% of signals are considered suitable at those values, which is ineffective
5s signal: The best accuracy of classifying heart sounds for the 5s signal at the value 0.2 of ZCR and the RMSSD value 0.2 - 1 range is shown in Figure 9. However, around 16% of signals are considered suitable at those values, which is ineffective. Notably, the highest accuracy of heart sound classification is 93.69% at ZCR = 0.4 and RMSSD in the range of...
-
[13]
3s signal: To evaluate the performance of classifying the heart sound for 3s signals by varying the quality assessment indicators, it is found that the accuracy is increasing by increasing the ZCR from 0.2 - 0.4 while the value of RMSSD is constant within the range of 0.4 – 1 shown in 2 Figure 10. The model's performance decreases if ZCR is increasing mor...
work page 2016
-
[14]
Burns, J., Ganigara, M., & Dhar, A. (2022). Application of intelligent phonocardiography in the detection of congenital heart disease in pediatric patients: a narrative review. Progress in Pediatric Cardiology, 64, 101455
work page 2022
-
[15]
Liu, C., Springer, D., Li, Q., Moody, B., Juan, R. A., Chorro, F. J., ... & Clifford, G. D. (2016). An open access database for the evaluation of heart sound algorithms. Physiological measurement, 37(12), 2181
work page 2016
-
[16]
Marascio, G., & Modesti, P. A. (2013). Current trends and perspectives for automated screening of cardiac murmurs. Heart Asia, 5(1), 213-218
work page 2013
-
[17]
D., Liu, C., Moody, B., Springer, D., Silva, I., Li, Q., & Mark, R
Clifford, G. D., Liu, C., Moody, B., Springer, D., Silva, I., Li, Q., & Mark, R. G. (2016, September). Classification of normal/abnormal heart sound recordings: The PhysioNet/Computing in Cardiology Challenge 2016. In 2016 Computing in cardiology conference (CinC) (pp. 609-612). IEEE
work page 2016
-
[18]
E., Holst-Hansen, C., Hansen, J., Toft, E., & Struijk, J
Schmidt, S. E., Holst-Hansen, C., Hansen, J., Toft, E., & Struijk, J. J. (2015). Acoustic features for the identification of coronary artery disease. IEEE Transactions on Biomedical Engineering, 62(11), 2611- 2619
work page 2015
-
[19]
Arslan, Ö., & Karhan, M. (2022). Effect of Hilbert-Huang transform on classification of PCG signals using machine learning. Journal of King Saud University-Computer and Information Sciences, 34(10), 9915-9925
work page 2022
-
[20]
Roy, T. S., Roy, J. K., & Mandal, N. (2022). A Study of Phonocardiography (PCG) Signal Analysis by K-Mean Clustering. In Proceedings of International Conference on Computational Intelligence and Computing: ICCIC 2020 (pp. 155-168). Springer Singapore
work page 2022
-
[21]
Tang, H., Dai, Z., Jiang, Y., Li, T., & Liu, C. (2018). PCG classification using multidomain features and SVM classifier. BioMed research international, 2018
work page 2018
-
[22]
Karar, M. E., El-Khafif, S. H., & El-Brawany, M. A. (2017). Automated diagnosis of heart sounds using rule-based classification tree. Journal of medical systems, 41, 1-7
work page 2017
-
[23]
Singh, S. A., & Majumder, S. (2019). Classification of unsegmented heart sound recording using KNN classifier. Journal of Mechanics in Medicine and Biology, 19(04), 1950025
work page 2019
-
[24]
Singh, S. A., & Majumder, S. (2020). Short unsegmented PCG classification based on ensemble classifier. Turkish Journal of Electrical Engineering and Computer Sciences, 28(2), 875-889
work page 2020
-
[25]
Grzegorczyk, I., Soliński, M., Łepek, M., Perka, A., Rosiński, J., Rymko, J., ... & Gierałtowski, J. (2016, September). PCG classification using a neural network approach. In 2016 computing in cardiology conference (CinC) (pp. 1129-1132). IEEE
work page 2016
-
[26]
T., Balasubramanian, P., & Umapathy, S
Krishnan, P. T., Balasubramanian, P., & Umapathy, S. (2020). Automated heart sound classification system from unsegmented phonocardiogram (PCG) using deep neural network. Physical and Engineering Sciences in Medicine, 43, 505-515
work page 2020
-
[27]
Hassanuzzaman, M., Hasan, N. A., Al Mamun, M. A., Alkhodari, M., Ahmed, K. I., Khandoker, A. H., & Mostafa, R. (2023, July). Recognition of Pediatric Congenital Heart Diseases by Using Phonocardiogram Signals and Transformer-Based Neural Networks. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (pp...
work page 2023
-
[28]
Hassanuzzaman, M., Hasan, N. A., Al Mamun, M. A., Ahmed, K. I., Khandoker, A. H., & Mostafa, R. (2023, October). A Deep Learning Model for Recognizing Pediatric Congenital Heart Diseases Using Phonocardiogram Signals. In 2023 Computing in Cardiology (CinC) (Vol. 50, pp. 1-4). IEEE
work page 2023
-
[29]
Hettiarachchi, R., Haputhanthri, U., Herath, K., Kariyawasam, H., Munasinghe, S., Wickramasinghe, K., ... & Edussooriya, C. U. (2021, May). A novel transfer learning-based approach for screening pre-existing heart diseases using synchronized ecg signals and heart sounds. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1- 5). IEEE
work page 2021
- [30]
-
[31]
Potes, C., Parvaneh, S., Rahman, A., & Conroy, B. (2016, September). Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds. In 2016 computing in cardiology conference (CinC) (pp. 621-624). IEEE
work page 2016
-
[32]
U., Alhaisoni, M., Akram, T., & Altaf, M
Aziz, S., Khan, M. U., Alhaisoni, M., Akram, T., & Altaf, M. (2020). Phonocardiogram signal processing for automatic diagnosis of congenital heart disorders through fusion of temporal and cepstral features. Sensors, 20(13), 3790
work page 2020
-
[33]
Gharehbaghi, A., Sepehri, A. A., & Babic, A. (2020). Distinguishing septal heart defects from the valvular regurgitation using intelligent phonocardiography
work page 2020
- [34]
-
[35]
Bozkurt, B., Germanakis, I., & Stylianou, Y. (2018). A study of time- frequency features for CNN-based automatic heart sound classification for pathology detection. Computers in biology and medicine, 100, 132-143
work page 2018
-
[36]
A., Kocharian, A., Janani, A., & Gharehbaghi, A
Sepehri, A. A., Kocharian, A., Janani, A., & Gharehbaghi, A. (2016). An intelligent phonocardiography for automated screening of pediatric heart diseases. Journal of medical systems, 40, 1-10
work page 2016
-
[37]
Gharehbaghi, A., Lindén, M., & Babic, A. (2017). A decision support system for cardiac disease diagnosis based on machine learning methods. Stud Health Technol Inform, 235, 43-7
work page 2017
-
[38]
Biospace: FDA Clears Eko's heart disease detection AI for adults & ped,
K. Puckett, "Biospace: FDA Clears Eko's heart disease detection AI for adults & ped," Eko Health, https://www.ekohealth.com/blogs/newsroom/eko-biospace-07122022 (accessed Feb. 18, 2024)
work page 2024
-
[39]
U., Shaukat, A., Hussain, F., Khawaja, S
Akram, M. U., Shaukat, A., Hussain, F., Khawaja, S. G., & Butt, W. H. (2018). Analysis of PCG signals using quality assessment and homomorphic filters for localisation and classification of heart sounds. Computer methods and programs in biomedicine, 164, 143-157
work page 2018
-
[40]
E., Holst-Hansen, C., Graff, C., Toft, E., & Struijk, J
Schmidt, S. E., Holst-Hansen, C., Graff, C., Toft, E., & Struijk, J. J. (2010). Segmentation of heart sound recordings by a duration-dependent hidden Markov model. Physiological measurement, 31(4), 513
work page 2010
-
[41]
M., Akmeliawati, R., & Salami, M
Astuti, W., Sediono, W., Aibinu, A. M., Akmeliawati, R., & Salami, M. J. E. (2012, September). Adaptive Short Time Fourier Transform (STFT) Analysis of seismic electric signal (SES): A comparison of Hamming and rectangular window. In 2012 IEEE symposium on industrial electronics and applications (pp. 372-377). IEEE
work page 2012
-
[42]
Trang, H., Loc, T. H., & Nam, H. B. H. (2014, October). Proposed combination of PCA and MFCC feature extraction in speech recognition system. In 2014 international conference on advanced technologies for communications (ATC 2014) (pp. 697-702). IEEE
work page 2014
-
[43]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30
work page 2017
-
[44]
Mei, N., Wang, H., Zhang, Y., Liu, F., Jiang, X., & Wei, S. (2021). Classification of heart sounds based on quality assessment and wavelet scattering transform. Computers in Biology and Medicine, 137, 104814
work page 2021
-
[45]
Kou, S., Caballero, L., Dulgheru, R., Voilliot, D., De Sousa, C., Kacharava, G., ... & Lancellotti, P. (2014). Echocardiographic reference ranges for normal cardiac chamber size: results from the NORRE study. European Heart Journal–Cardiovascular Imaging, 15(6), 680-69
work page 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.