RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection
Pith reviewed 2026-06-28 13:03 UTC · model grok-4.3
The pith
A longitudinal voice dataset from 26 RRP patients shows models detect disease state rather than speaker identity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper creates the RRP-Voice dataset of voice recordings from 26 patients with up to ten years of longitudinal follow-up, each session annotated by otolaryngologists and confirmed by laryngoscopy at the time of recording. Under session-level cross-validation with a patient-level audit, benchmarks of multiple model families plus per-subject analyses are presented as evidence that discriminative performance arises from changes in disease state rather than fixed speaker characteristics.
What carries the argument
The patient-level audit inside session-level cross-validation, which isolates disease-related voice changes from stable speaker identity.
If this is right
- Models can distinguish RRP recurrence from post-surgical remission using longitudinal voice recordings.
- The benchmark supplies reference results for handcrafted features, end-to-end networks, self-supervised models, and audio large language models on this task.
- Per-subject analyses confirm the voice signal tracks oscillating disease state rather than speaker traits.
- The resource supports voice-based monitoring tools for rare laryngeal conditions in clinical settings.
Where Pith is reading between the lines
- Comparable longitudinal voice collections could be assembled for other fluctuating laryngeal disorders to enable similar tracking.
- Voice monitoring could eventually lower reliance on repeated invasive laryngoscopy for RRP follow-up if prediction reliability holds.
- Adding more patients or extra signals such as breathing patterns might improve forecasts of when recurrence will occur.
- The patient-level validation step points to a general safeguard needed in medical audio datasets to block identity leakage.
Load-bearing premise
Laryngoscopy performed at each recording session supplies an accurate and unbiased label for the true disease state, and the patient-level audit prevents models from learning speaker identity instead of disease effects.
What would settle it
Evidence that a model reaches comparable accuracy when all sessions from one patient are assigned the same disease label regardless of laryngoscopy findings, or repeated clinical mismatches between voice-based predictions and laryngoscopy over multiple years.
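A related, simpler check is an identity-only baseline: give every session of a patient that patient's majority label and compare the resulting accuracy with the model's. The sketch below is a hypothetical illustration, not the paper's protocol. The function names and toy data are assumptions, and because the baseline picks each majority label from the true labels, it is an optimistic bound on what any per-patient constant guess can achieve.

```python
from collections import Counter, defaultdict


def patient_constant_accuracy(patient_ids, labels):
    """Accuracy of a predictor that assigns each patient's majority label
    to all of that patient's sessions (an identity-only baseline)."""
    by_patient = defaultdict(list)
    for pid, y in zip(patient_ids, labels):
        by_patient[pid].append(y)
    correct = 0
    for ys in by_patient.values():
        # The count of the most common label is the number of sessions
        # a constant guess gets right for this patient.
        correct += Counter(ys).most_common(1)[0][1]
    return correct / len(labels)


def accuracy(predictions, labels):
    """Plain session-level accuracy."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy data: 1 = recurrence, 0 = remission.
patient_ids = ["p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2"]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
predictions = [1, 0, 1, 0, 1, 1, 0, 1]

baseline = patient_constant_accuracy(patient_ids, labels)
model = accuracy(predictions, labels)
print(f"identity-only baseline: {baseline:.3f}, model: {model:.3f}")
```

If a model's held-out accuracy sits close to this baseline, speaker identity alone could explain its performance; a clear margin above it is consistent with, though not proof of, a disease-state signal.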
original abstract
Deep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurrence and post-surgical remission over the years. RRP demands continuous voice monitoring that existing cross-sectional corpora cannot support. We introduce the first longitudinal voice dataset for RRP, comprising recordings from 26 patients with up to ten years of follow-up. Each session pairs sustained vowels with sentence-level utterances, which are annotated by otolaryngologists and confirmed synchronously with laryngoscopy. Building on this resource, we establish a systematic benchmark spanning handcrafted features, end-to-end deep networks, self-supervised pretrained models, and recent audio large language models, all evaluated under session-level cross-validation with patient-level audit. Per-subject longitudinal analyses further confirm that the cross-sectional discriminative signal reflects laryngoscopic disease state rather than stable speaker attributes. This work lays a foundation for rare longitudinal pathological voice tasks in low-resource clinical settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RRP-Voice, the first longitudinal voice dataset for Recurrent Respiratory Papillomatosis (RRP), with recordings from 26 patients spanning up to 10 years of follow-up. Each session includes sustained vowels and sentences annotated by otolaryngologists and confirmed synchronously with laryngoscopy. It establishes a benchmark across handcrafted features, end-to-end deep networks, self-supervised models, and audio LLMs, evaluated under session-level cross-validation with patient-level audit. Per-subject longitudinal analyses are presented to argue that the cross-sectional discriminative signal tracks laryngoscopic disease state rather than stable speaker attributes.
Significance. If the dataset curation and validation hold, this provides a valuable new resource for longitudinal pathological voice research on a rare disease where existing corpora are cross-sectional only. The multi-approach benchmark and explicit attention to patient-level leakage are strengths that could support future clinical monitoring tools in low-resource settings.
major comments (2)
- [Dataset Construction] Dataset section: The central claim that per-subject longitudinal analyses confirm the signal reflects disease state (rather than speaker attributes) rests on laryngoscopy supplying accurate, unbiased labels at each session. No details are given on how laryngoscopic findings are mapped to binary/graded disease-state labels, inter-rater agreement, or within-patient label consistency over time.
- [Experimental Setup] Experimental Setup and Results sections: The patient-level audit in the session-level CV is described as blocking speaker-identity leakage, yet the manuscript reports no quantitative checks (speaker-identification accuracy of the same models, ablation removing speaker cues, or within-patient variance analysis). Without these, the longitudinal confirmation does not follow from the described protocol.
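For the inter-rater agreement raised in the first major comment, Cohen's kappa is a common summary for two raters assigning categorical labels. The paper does not say which statistic, if any, was collected, so the sketch below is only an illustration on hypothetical ratings. It uses the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same sessions."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of sessions with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings from two otolaryngologists: 1 = recurrence, 0 = remission.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 1, 1, 0, 0, 0]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```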
minor comments (2)
- [Abstract] Abstract and Dataset section: Total number of sessions, average recordings per patient, and exact follow-up statistics are not stated, making it difficult to gauge the longitudinal depth.
- [Benchmark] Benchmark section: The precise fold construction for session-level CV (e.g., how many sessions per patient are held out) should be tabulated for reproducibility.
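The fold table requested in the last minor comment could be produced along these lines. This is a sketch under stated assumptions: the input format (one `(patient_id, fold_index)` pair per session) and both helper names are hypothetical, and the paper's actual fold construction is not described in the text reviewed here.

```python
from collections import defaultdict


def tabulate_folds(assignments, n_folds):
    """Count held-out sessions per patient and fold.

    assignments: iterable of (patient_id, fold_index) pairs, one per session.
    Returns {patient_id: [count in fold 0, count in fold 1, ...]}.
    """
    table = defaultdict(lambda: [0] * n_folds)
    for pid, fold in assignments:
        table[pid][fold] += 1
    return dict(table)


def shared_patients(table, fold):
    """Patients with sessions in both the held-out fold and the training folds."""
    return sorted(
        pid
        for pid, counts in table.items()
        if counts[fold] > 0 and sum(counts) > counts[fold]
    )


# Hypothetical assignment of seven sessions from three patients to three folds.
assignments = [
    ("p1", 0), ("p1", 1), ("p1", 2),
    ("p2", 0), ("p2", 0),
    ("p3", 1), ("p3", 2),
]
table = tabulate_folds(assignments, n_folds=3)
for pid in sorted(table):
    print(pid, table[pid])
print("patients shared with training when fold 0 is held out:", shared_patients(table, 0))
```

Under session-level cross-validation, shared patients are expected rather than an error; listing them per fold shows exactly where a patient-level audit has to look for identity leakage.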
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting areas where additional detail would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
point-by-point responses
- Referee: [Dataset Construction] Dataset section: The central claim that per-subject longitudinal analyses confirm the signal reflects disease state (rather than speaker attributes) rests on laryngoscopy supplying accurate, unbiased labels at each session. No details are given on how laryngoscopic findings are mapped to binary/graded disease-state labels, inter-rater agreement, or within-patient label consistency over time.
  Authors: We agree that explicit documentation of the labeling protocol is required to support the longitudinal claims. In the revised manuscript we will add a dedicated subsection under Dataset Construction that (i) specifies the exact mapping from laryngoscopic observations (e.g., visible papilloma burden, location, and size) to the binary disease-state label used in experiments, (ii) reports the number of otolaryngologists involved and any inter-rater agreement statistics that were collected, and (iii) presents within-patient label stability metrics across repeated sessions for the same individual. These additions will make the foundation of the per-subject analyses transparent.
  Revision: yes
- Referee: [Experimental Setup] Experimental Setup and Results sections: The patient-level audit in the session-level CV is described as blocking speaker-identity leakage, yet the manuscript reports no quantitative checks (speaker-identification accuracy of the same models, ablation removing speaker cues, or within-patient variance analysis). Without these, the longitudinal confirmation does not follow from the described protocol.
  Authors: We acknowledge that the current description of the patient-level audit would benefit from quantitative corroboration. We will therefore augment the Experimental Setup and Results sections with three targeted analyses: (1) speaker-identification accuracy obtained by training the same model families on speaker labels, (2) an ablation that removes or masks speaker-dependent acoustic cues before disease classification, and (3) a within-patient variance decomposition showing that classification performance tracks session-wise laryngoscopic state rather than stable speaker identity. These experiments will be reported under the existing patient-level audit protocol and will directly test the claim that the discriminative signal reflects disease state.
  Revision: yes
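One possible reading of the within-patient analysis promised in the second response is to subtract each patient's mean model score from that patient's session scores and then ask whether the centered scores still separate recurrence from remission. The sketch below is an assumption about how such a check might look, not the authors' method; the helper names and toy scores are hypothetical. The toy case is built so that the model output depends only on who is speaking.

```python
from collections import defaultdict


def center_within_patient(patient_ids, scores):
    """Subtract each patient's mean score, removing any stable per-speaker offset."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, s in zip(patient_ids, scores):
        sums[pid] += s
        counts[pid] += 1
    return [s - sums[pid] / counts[pid] for pid, s in zip(patient_ids, scores)]


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores that depend only on the speaker: p1 always high, p2 always low.
patient_ids = ["p1"] * 4 + ["p2"] * 4
scores = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

raw_auc = auc(scores, labels)
centered_auc = auc(center_within_patient(patient_ids, scores), labels)
print(f"raw AUC: {raw_auc:.2f}, within-patient AUC: {centered_auc:.2f}")
```

Here the raw AUC looks informative only because one patient is mostly in recurrence and the other mostly in remission; after centering it falls to chance. A model that tracks disease state should keep a within-patient AUC clearly above 0.5, with the caveat that centering can also remove some genuine signal when patients differ in how often they are in recurrence.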
Circularity Check
No circularity: empirical dataset and benchmark paper with no derivation chain
full rationale
The paper's central contribution is the creation of a longitudinal voice dataset for RRP and empirical benchmarking of models against laryngoscopy-confirmed labels. It describes no mathematical derivations, no fitted parameters presented as predictions, and no load-bearing uniqueness theorems that rest on self-citation. The longitudinal confirmation claim rests on per-subject analyses using external clinical ground truth, which is independent of the models' outputs. This is a standard non-circular empirical study; the reader's assessment of score 1.0 aligns with the absence of any reducible derivation steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Laryngoscopy performed synchronously with voice recording provides accurate ground-truth disease-state labels.