LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification

Hilmi R. Dajani; Marco Janeczek; Martin Bouchard; Niloofar Jazaeri

arxiv: 2603.02245 · v3 · pith:4CTFA65Qnew · submitted 2026-02-24 · 📡 eess.AS · cs.LG· cs.SD

LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification

Niloofar Jazaeri , Hilmi R. Dajani , Marco Janeczek , Martin Bouchard This is my paper

Pith reviewed 2026-05-15 19:29 UTC · model grok-4.3

classification 📡 eess.AS cs.LGcs.SD

keywords infant cry classificationcross-domain generalizationLegendre Memory Unitposterior fusionMFCC featuresSTFT featureson-device monitoringhealthcare audio

0 comments

The pith

A framework combining multi-branch CNNs, Legendre Memory Units, and entropy-gated fusion improves cross-domain classification of infant cry causes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a compact system to identify causes of infant cries despite challenges like short signals, few labels, and shifts between different babies or recording setups. It encodes MFCC, STFT, and F0 features through a multi-branch CNN, models sequences with an LMU for efficiency over LSTMs, and fuses outputs using entropy-gated posterior weighting to balance domain knowledge and reduce bias. This matters for healthcare because reliable cross-domain performance could support on-device monitoring without needing new data or retraining for each infant or environment.

Core claim

The central discovery is that an enhanced LMU provides stable temporal modeling with fewer parameters than LSTMs, and when paired with calibrated posterior ensemble fusion using entropy-gated weighting in a multi-branch CNN encoder for MFCC, STFT, and F0 features, the system achieves improved macro-F1 scores under cross-domain evaluation on the Baby2020 and Baby Crying datasets while supporting real-time on-device deployment and using leakage-aware splits.

What carries the argument

The enhanced Legendre Memory Unit (LMU) for efficient sequence modeling combined with entropy-gated posterior ensemble fusion to mitigate dataset bias.

If this is right

The model achieves higher macro-F1 in cross-dataset tests on Baby2020 and Baby Crying.
LMU enables stable modeling with substantially fewer recurrent parameters than LSTMs.
Entropy-gated weighting preserves domain-specific expertise while reducing bias.
Leakage-aware splits ensure no data leakage in evaluations.
The framework supports real-time feasibility for on-device monitoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such a system could enable continuous infant monitoring in homes or hospitals without frequent model updates for new babies.
The efficiency of LMU might allow integration into wearable devices for other audio-based health signals.
Entropy gating could be tested on other multi-domain audio tasks to see if it generalizes beyond cry classification.

Load-bearing premise

The combination of these specific features, LMU modeling, and entropy-gated fusion will reliably reduce dataset bias and generalize to new infants without needing extra tuning or allowing data leakage.

What would settle it

A new set of infant cry recordings from different infants and conditions where the proposed method shows no macro-F1 improvement or worse performance than simpler baselines.

read the original abstract

Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses mel-frequency cepstral coefficients (MFCCs), short-time Fourier transform (STFT) features, and fundamental-frequency (F0) contours within a multi-branch convolutional neural network (CNN) encoder, and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage aware splits and real-time feasibility for on-device monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies LMU sequence modeling plus entropy-gated fusion to infant cry audio for cross-domain robustness, but the abstract supplies no numbers or ablations to support the claimed macro-F1 gains.

read the letter

The main contribution here is a practical engineering combination: multi-branch CNN features from MFCC, STFT, and F0 fed into an LMU backbone, followed by calibrated posterior fusion that uses entropy to weight domain-specific models. The LMU choice is sensible because it keeps parameter count low while handling short non-stationary sequences better than a standard LSTM, which matters for on-device monitoring. The leakage-aware splits and real-time feasibility angle also show attention to deployment realities that many similar papers skip. Those elements are the parts that could actually be useful to someone building infant monitoring tools. The soft spot is the missing evidence. The abstract asserts improved macro-F1 under cross-domain conditions on Baby2020 and Baby Crying, yet it never shows the actual scores, baseline comparisons, or error bars. There is also no ablation that isolates the entropy gate from plain averaging, and no sensitivity check on the threshold. If the calibration step pulls any statistics from the target domain, even indirectly, the cross-domain claim becomes harder to trust. The stress-test concern about implicit leakage is worth verifying in the full text; without that check the gains could be artifacts of split construction rather than the architecture. This is the sort of paper that belongs in a reading group focused on applied audio for health, not in a theory session. Engineers working on compact medical classifiers might pull the LMU implementation or the fusion rule if the numbers and pseudocode are solid in the full manuscript. I would send it to peer review so referees can see the actual tables and confirm whether the fusion rule is reproducible without target-domain peeking.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a compact acoustic framework for infant cry classification that extracts MFCC, STFT, and F0 features via a multi-branch CNN encoder, models temporal dynamics with an enhanced Legendre Memory Unit (LMU), and applies calibrated posterior ensemble fusion using entropy-gated weighting to mitigate cross-domain shifts. Experiments on the Baby2020 and Baby Crying datasets report improved macro-F1 under cross-domain evaluation with leakage-aware splits and claim real-time on-device feasibility.

Significance. If the reported cross-domain gains hold after proper ablation and leakage controls, the work would offer a parameter-efficient alternative to LSTM-based models for on-device infant monitoring, with the LMU backbone and entropy-gated fusion providing a concrete route to domain-robust inference without target calibration. The emphasis on leakage-aware splits and real-time constraints addresses practical deployment needs in healthcare signal processing.

major comments (2)

[Methods (fusion subsection)] The description of the entropy-gated posterior fusion (likely in the methods section following the LMU architecture) supplies only a qualitative account of the weighting rule and calibration step. No explicit formula, pseudocode, or sensitivity analysis on the entropy threshold is provided, which is load-bearing for the central claim that the gate preserves domain expertise while eliminating dataset bias without target-domain information.
[Experiments and Results] Results tables (presumably Table 2 or 3) report macro-F1 improvements under cross-domain splits but omit error bars, statistical significance tests, and an ablation isolating the entropy gate from uniform averaging or simple concatenation. Without these, it is impossible to determine whether the gains exceed what would be expected from the LMU backbone alone or from implicit leakage in threshold selection.

minor comments (2)

[Abstract] The abstract states 'improved macro-F1' without quoting the numerical deltas or baseline values; moving the key quantitative results into the abstract would strengthen the summary.
[LMU Architecture] Notation for the LMU state update and the entropy calculation should be defined once in a single equation block rather than scattered across paragraphs to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and have revised the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Methods (fusion subsection)] The description of the entropy-gated posterior fusion (likely in the methods section following the LMU architecture) supplies only a qualitative account of the weighting rule and calibration step. No explicit formula, pseudocode, or sensitivity analysis on the entropy threshold is provided, which is load-bearing for the central claim that the gate preserves domain expertise while eliminating dataset bias without target-domain information.

Authors: We agree that the original description of the entropy-gated posterior fusion was primarily qualitative and that an explicit formulation is essential for reproducibility and to fully support the central claims. In the revised manuscript, we have added the precise mathematical definition of the weighting rule: the calibrated weight for each posterior p_k is given by w_k = (1 - H(p_k)/log C) / sum_j (1 - H(p_j)/log C), where H denotes Shannon entropy and C is the number of classes, followed by a temperature-scaled calibration step. We have also inserted pseudocode (new Algorithm 1) detailing the full fusion procedure, including the entropy computation and gating logic. Finally, we performed a sensitivity analysis by sweeping the entropy threshold over [0.3, 1.8] on source-domain validation data only and report the resulting macro-F1 curves in a new supplementary figure; the analysis confirms stable performance in the chosen operating range without requiring target-domain information. revision: yes
Referee: [Experiments and Results] Results tables (presumably Table 2 or 3) report macro-F1 improvements under cross-domain splits but omit error bars, statistical significance tests, and an ablation isolating the entropy gate from uniform averaging or simple concatenation. Without these, it is impossible to determine whether the gains exceed what would be expected from the LMU backbone alone or from implicit leakage in threshold selection.

Authors: We acknowledge that the reported results lacked error bars, statistical tests, and a dedicated ablation of the fusion component. In the revised manuscript we have updated all result tables to include mean macro-F1 plus standard deviation computed over five independent runs with distinct random seeds. We have added paired t-tests (with p-values reported in table captions) comparing our full model against the LMU backbone alone and against the listed baselines. We have further inserted a new ablation subsection that isolates the entropy-gated fusion by comparing it directly against (i) uniform posterior averaging and (ii) simple feature concatenation; the gated version yields statistically significant gains (p < 0.05) beyond the LMU backbone. Regarding threshold selection, we clarify in the revised experimental protocol that the entropy threshold was chosen exclusively via source-domain cross-validation and never tuned on target data; this detail, together with the leakage-aware split description already present, addresses the concern about implicit leakage. revision: yes

Circularity Check

0 steps flagged

No circularity: method and claims are empirical, not self-referential

full rationale

The paper proposes an LMU-based CNN encoder with MFCC/STFT/F0 features and calibrated posterior ensemble fusion, then reports macro-F1 gains on cross-domain splits of Baby2020 and Baby Crying. No equations, derivations, or parameter-fitting steps appear in the supplied text. The fusion rule is described at the level of a technique ('entropy-gated weighting') without any self-referential definition that would make a reported prediction identical to its own input by construction. Self-citations, if present in the full manuscript, are not load-bearing for any uniqueness theorem or ansatz that collapses the central result. The evaluation relies on leakage-aware splits and real-time feasibility checks, which are external benchmarks rather than tautological. This is the normal case of an applied ML paper whose claims rest on experiment, not on a derivation that reduces to its own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms or invented entities; full manuscript would be required to audit these elements.

pith-pipeline@v0.9.0 · 5469 in / 1120 out tokens · 30790 ms · 2026-05-15T19:29:14.725852+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Automated newborn cry diagnostic system using machine learning,

F. S. Matikolaie , S. A. Nezhad, and C. Tadj, "Automated newborn cry diagnostic system using machine learning," Biomed. Signal Process. Control, vol. 73, pp. 103466, Mar 2022

work page 2022
[2]

Pain cues override identity cues in baby cries,

S. Corvin, M. Lopez, and P. Gervain, "Pain cues override identity cues in baby cries," iScience, vol. 27, no. 2, pp. 120345, Jul 2024

work page 2024
[3]

Machine learning-based cry diagnostic system for identifying septic newborns,

F. S. Matikolaie and C. Tadj, "Machine learning-based cry diagnostic system for identifying septic newborns," Journal of Voice, vol. 38, no. 1, pp. 151-160, Jul 2024

work page 2024
[4]

Infant crying detection in real-world environments,

X. Yao, J. Wu, and T. Wang, "Infant crying detection in real-world environments," in Proc. IEEE ICASSP, pp. 291-295, May 2022

work page 2022
[5]

Design and implementation of infant crying monitoring and analysis system,

J. Huang, X. Zhang , and Y. Xu, "Design and implementation of infant crying monitoring and analysis system," in Proc. ICEITCE, pp. 101-106, Oct 2024

work page 2024
[6]

Infant cry analysis of cry signal segments towards identifying the cry-cause factors,

S. Sharma and V. K. Mittal , "Infant cry analysis of cry signal segments towards identifying the cry-cause factors," in Proc. IEEE TENCON, pp. 445-450, Nov 2017

work page 2017
[7]

Using transfer learning, SVM, and ensemble classification to classify baby cries,

L. Le, V. Tran, and D. Nguyen, "Using transfer learning, SVM, and ensemble classification to classify baby cries," in Proc. IEEE MASSW, pp. 1-5 , Nov 2019

work page 2019
[8]

An ensemble multi stream classifier for infant needs detection,

H. A. Fahmy, M . M . Elsayad , and A. El-Sawy, "An ensemble multi stream classifier for infant needs detection," Heliyon, vol. 9, no. 8, pp. e18654 , Mar 2023

work page 2023
[9]

Baby cry classification using ensemble learning and Whisper method comparison,

I. P. Y. P. Dharmawan , N. K. Wardani, and A. S. Wijaya, "Baby cry classification using ensemble learning and Whisper method comparison," J. Appl. Inform. Comput., vol. 9, no. I, pp. 13-23, Mar 2025

work page 2025
[10]

Machine learning-based infant crying interpretation,

M. Hammoud , S. Ali, and K. Elbassuoni, "Machine learning-based infant crying interpretation," Frontiers in Artificial Intelligence, vol. 7, pp. 135-144, Feb 2024

work page 2024
[11]

A comprehensive survey of infant cry classification research trends and methods,

N. G. Setyoningrum, A. P. Putra, and M.A. Nugroho, "A comprehensive survey of infant cry classification research trends and methods," in Proc. IEEE /CORIS, pp. 181-186 , Nov 2024

work page 2024
[12]

A study of deep learning models for audio classification : ResNet vs. EfficientNet for infant cry detection,

D. M. Herlea, F. Mihalache, and M. Cartwright, "A study of deep learning models for audio classification : ResNet vs. EfficientNet for infant cry detection," Information, vol. 12, no. 2, pp. 50, May 2025

work page 2025
[13]

Infant cry classification with graph convolutional networks,

C. Ji, L. Xu, and Y. Li, "Infant cry classification with graph convolutional networks," in Proc. IEEE /CCCS , pp. 1-5, Apr 2021

work page 2021
[14]

Infant cry classification using an efficient graph structure and attention-based model,

X. Qiao, Y. Liu, Z. Ma, and J. Zhang, "Infant cry classification using an efficient graph structure and attention-based model," Kuwait Journal of Science, vol. 51, no. I, pp. 1-10 , Jul 2024

work page 2024
[15]

Improved rep resentation learning using tree-structured ontology,

A. Zharmagambetov, T. Backstrom, and A. K. Singh, "Improved rep resentation learning using tree-structured ontology," in Proc. IEEE ICASSP, pp. 526-530, May 2022

work page 2022
[16]

Whisper encoder features for infant cry classification ,

M. Charola, L. Abou-Abbas, and C. Tadj, "Whisper encoder features for infant cry classification ," in Proc. Interspeech, pp. 3150-3154, Aug 2023

work page 2023
[17]

Improving self-supervised audio representation based on contrastive learning with Conformer encoder,

Q.T. Duong , H. Le, and T. Nguyen, "Improving self-supervised audio representation based on contrastive learning with Conformer encoder," in Proc. IS/CT , pp. 23-28 , Dec 2022

work page 2022
[18]

Deep contextualized acoustic represen tations for semi-supervised speech recognition,

S. Ling, Y. Wang, and D. Yu, "Deep contextualized acoustic represen tations for semi-supervised speech recognition," in Proc. IEEE ICASSP, pp. 6429-0433, May 2020 . [ 19] A. Gorin, N. Ellourni, and H. Benie, "Self-supervised learning for infant cry analysis ," in Proc. IEEE ICASSP Workshops, pp. 41-45, Jun 2023

work page 2020
[19]

Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction ,

A. Schwartz, M. Zohar, and Y. Be'ery , "Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction ," EURASIP J. Audio Speech Music Process., vol. 2024, no. 1, pp. 50, Oct 2024

work page 2024
[20]

Long short-term memory,

S. Hochreiter and J. Schrnidhuber , "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov 1997

work page 1997
[21]

Legendre memory units: Continuous-time representation in recurrent neural networks,

A. R. Voelker, I. Kajic, and C. Eliasrnith, "Legendre memory units: Continuous-time representation in recurrent neural networks," in Proc. Neur/PS, pp. 15544-15553 , Dec 2019

work page 2019
[22]

The Baby_crying Database,

E. Hngynjy, "The Baby_crying Database," Apr 2021. [Online] . Avail able: https ://aistudio .baidu.com/datasetdetail/84370

work page 2021
[23]

CREPE: A convo lutional representation for pitch estimation ,

J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convo lutional representation for pitch estimation ," in Proc. IEEE ICASSP, pp. 161-165, Apr 2018

work page 2018

[1] [1]

Automated newborn cry diagnostic system using machine learning,

F. S. Matikolaie , S. A. Nezhad, and C. Tadj, "Automated newborn cry diagnostic system using machine learning," Biomed. Signal Process. Control, vol. 73, pp. 103466, Mar 2022

work page 2022

[2] [2]

Pain cues override identity cues in baby cries,

S. Corvin, M. Lopez, and P. Gervain, "Pain cues override identity cues in baby cries," iScience, vol. 27, no. 2, pp. 120345, Jul 2024

work page 2024

[3] [3]

Machine learning-based cry diagnostic system for identifying septic newborns,

F. S. Matikolaie and C. Tadj, "Machine learning-based cry diagnostic system for identifying septic newborns," Journal of Voice, vol. 38, no. 1, pp. 151-160, Jul 2024

work page 2024

[4] [4]

Infant crying detection in real-world environments,

X. Yao, J. Wu, and T. Wang, "Infant crying detection in real-world environments," in Proc. IEEE ICASSP, pp. 291-295, May 2022

work page 2022

[5] [5]

Design and implementation of infant crying monitoring and analysis system,

J. Huang, X. Zhang , and Y. Xu, "Design and implementation of infant crying monitoring and analysis system," in Proc. ICEITCE, pp. 101-106, Oct 2024

work page 2024

[6] [6]

Infant cry analysis of cry signal segments towards identifying the cry-cause factors,

S. Sharma and V. K. Mittal , "Infant cry analysis of cry signal segments towards identifying the cry-cause factors," in Proc. IEEE TENCON, pp. 445-450, Nov 2017

work page 2017

[7] [7]

Using transfer learning, SVM, and ensemble classification to classify baby cries,

L. Le, V. Tran, and D. Nguyen, "Using transfer learning, SVM, and ensemble classification to classify baby cries," in Proc. IEEE MASSW, pp. 1-5 , Nov 2019

work page 2019

[8] [8]

An ensemble multi stream classifier for infant needs detection,

H. A. Fahmy, M . M . Elsayad , and A. El-Sawy, "An ensemble multi stream classifier for infant needs detection," Heliyon, vol. 9, no. 8, pp. e18654 , Mar 2023

work page 2023

[9] [9]

Baby cry classification using ensemble learning and Whisper method comparison,

I. P. Y. P. Dharmawan , N. K. Wardani, and A. S. Wijaya, "Baby cry classification using ensemble learning and Whisper method comparison," J. Appl. Inform. Comput., vol. 9, no. I, pp. 13-23, Mar 2025

work page 2025

[10] [10]

Machine learning-based infant crying interpretation,

M. Hammoud , S. Ali, and K. Elbassuoni, "Machine learning-based infant crying interpretation," Frontiers in Artificial Intelligence, vol. 7, pp. 135-144, Feb 2024

work page 2024

[11] [11]

A comprehensive survey of infant cry classification research trends and methods,

N. G. Setyoningrum, A. P. Putra, and M.A. Nugroho, "A comprehensive survey of infant cry classification research trends and methods," in Proc. IEEE /CORIS, pp. 181-186 , Nov 2024

work page 2024

[12] [12]

A study of deep learning models for audio classification : ResNet vs. EfficientNet for infant cry detection,

D. M. Herlea, F. Mihalache, and M. Cartwright, "A study of deep learning models for audio classification : ResNet vs. EfficientNet for infant cry detection," Information, vol. 12, no. 2, pp. 50, May 2025

work page 2025

[13] [13]

Infant cry classification with graph convolutional networks,

C. Ji, L. Xu, and Y. Li, "Infant cry classification with graph convolutional networks," in Proc. IEEE /CCCS , pp. 1-5, Apr 2021

work page 2021

[14] [14]

Infant cry classification using an efficient graph structure and attention-based model,

X. Qiao, Y. Liu, Z. Ma, and J. Zhang, "Infant cry classification using an efficient graph structure and attention-based model," Kuwait Journal of Science, vol. 51, no. I, pp. 1-10 , Jul 2024

work page 2024

[15] [15]

Improved rep resentation learning using tree-structured ontology,

A. Zharmagambetov, T. Backstrom, and A. K. Singh, "Improved rep resentation learning using tree-structured ontology," in Proc. IEEE ICASSP, pp. 526-530, May 2022

work page 2022

[16] [16]

Whisper encoder features for infant cry classification ,

M. Charola, L. Abou-Abbas, and C. Tadj, "Whisper encoder features for infant cry classification ," in Proc. Interspeech, pp. 3150-3154, Aug 2023

work page 2023

[17] [17]

Improving self-supervised audio representation based on contrastive learning with Conformer encoder,

Q.T. Duong , H. Le, and T. Nguyen, "Improving self-supervised audio representation based on contrastive learning with Conformer encoder," in Proc. IS/CT , pp. 23-28 , Dec 2022

work page 2022

[18] [18]

Deep contextualized acoustic represen tations for semi-supervised speech recognition,

S. Ling, Y. Wang, and D. Yu, "Deep contextualized acoustic represen tations for semi-supervised speech recognition," in Proc. IEEE ICASSP, pp. 6429-0433, May 2020 . [ 19] A. Gorin, N. Ellourni, and H. Benie, "Self-supervised learning for infant cry analysis ," in Proc. IEEE ICASSP Workshops, pp. 41-45, Jun 2023

work page 2020

[19] [19]

Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction ,

A. Schwartz, M. Zohar, and Y. Be'ery , "Multi-microphone simultaneous speakers detection and localization of multi-sources for separation and noise reduction ," EURASIP J. Audio Speech Music Process., vol. 2024, no. 1, pp. 50, Oct 2024

work page 2024

[20] [20]

Long short-term memory,

S. Hochreiter and J. Schrnidhuber , "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov 1997

work page 1997

[21] [21]

Legendre memory units: Continuous-time representation in recurrent neural networks,

A. R. Voelker, I. Kajic, and C. Eliasrnith, "Legendre memory units: Continuous-time representation in recurrent neural networks," in Proc. Neur/PS, pp. 15544-15553 , Dec 2019

work page 2019

[22] [22]

The Baby_crying Database,

E. Hngynjy, "The Baby_crying Database," Apr 2021. [Online] . Avail able: https ://aistudio .baidu.com/datasetdetail/84370

work page 2021

[23] [23]

CREPE: A convo lutional representation for pitch estimation ,

J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convo lutional representation for pitch estimation ," in Proc. IEEE ICASSP, pp. 161-165, Apr 2018

work page 2018