Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

John H. L. Hansen; Zahra Omidi

arxiv: 2606.27543 · v1 · pith:QSUYQAKSnew · submitted 2026-06-25 · 💻 cs.SD · cs.LG

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

Zahra Omidi , John H. L. Hansen This is my paper

Pith reviewed 2026-06-29 00:46 UTC · model grok-4.3

classification 💻 cs.SD cs.LG

keywords vocal effort classificationWavLMdata augmentationsoft labelsspeech processingAVID corpusself-supervised learning

0 comments

The pith

WavLM with gradual unfreezing, data augmentation, and Gaussian-neighbor soft labels reaches 78.2 percent mean accuracy on vocal effort classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests WavLM on the task of labeling vocal effort levels in everyday speech recordings that lack calibration and reflect real variability. Effort ranges continuously from whisper through soft, neutral, and loud to shout, so adjacent categories are easily mixed up and labeled data are scarce. The authors apply a range of augmentations and introduce soft labels that distribute probability mass to neighboring effort levels according to a Gaussian. These steps are shown to lift performance over prior self-supervised models, which matters for making downstream speech systems more reliable when speakers change volume or style.

Core claim

WavLM-BASE, trained with gradual unfreezing, combined with data augmentation and Gaussian-neighbor soft labels, attains 78.2 percent mean accuracy on the AVID corpus and establishes a new state-of-the-art, exceeding results obtained with wav2vec2 and HuBERT.

What carries the argument

WavLM self-supervised model together with Gaussian-neighbor soft labels that spread credit across adjacent vocal effort categories to reflect the underlying continuum.

If this is right

Multiple augmentation strategies each raise WavLM accuracy by between 0.6 and 1.8 percentage points.
Gaussian-neighbor soft labels specifically reduce confusions between neighboring effort categories.
WavLM produces stronger representations for this task than wav2vec2 or HuBERT under the same training regime.
The complete pipeline sets a new state-of-the-art result on the AVID corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Gaussian-neighbor soft labels could be tested on other speech attributes that form continua, such as speaking rate or emotional intensity.
Explicit modeling of label uncertainty may lower the amount of labeled data needed for effort-related tasks in general.
Success on uncalibrated recordings points to possible use in real-time applications where speaker volume varies without prior adjustment.

Load-bearing premise

The AVID corpus captures the full range of naturalistic non-calibrated vocal effort variability and the gains from augmentations and soft labels generalize beyond this specific dataset.

What would settle it

Running the final system on an independent collection of everyday speech recordings with vocal effort labels would show whether accuracy stays near 78 percent or drops under changed recording conditions or speaker groups.

read the original abstract

The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is challenging since effort lies on a continuum, adjacent categories are easily confused, and labeled data remain scarce. Prior SSL approaches with wav2vec2, HuBERT, and AST improve performance on the AVID corpus but still suffer from boundary errors. In this study, we introduce WavLM for the first time in vocal effort classification and benchmark it against wav2vec2 and HuBERT. To address data scarcity, we conduct a systematic study of augmentation strategies, covering RIR convolution, additive noise, time masking, speed perturbation, band-limiting, MixUp, and CutMix. Augmentation consistently improves WavLM, with gains ranging from +0.6% to +1.8% absolute. We further propose Gaussian-neighbor soft labels, which further reduce near-boundary confusions by modeling the vocal effort continuum. Our best system, WavLM-BASE with gradual unfreezing, augmentation, and Gaussian-neighbor soft labels, achieves 78.2% mean accuracy, establishing a new state-of-the-art on AVID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WavLM plus Gaussian soft labels and augmentations lift AVID accuracy to 78.2%, but the gains stay inside one corpus with no external check.

read the letter

The paper's main contribution is applying WavLM to vocal effort classification for the first time, pairing it with a systematic sweep of augmentations and a new Gaussian-neighbor soft labeling scheme that treats effort as a continuum. On the AVID corpus this combination reaches 78.2% mean accuracy and edges out prior wav2vec2 and HuBERT baselines. The augmentations give consistent but modest lifts of 0.6–1.8 points, and the soft labels are credited with cutting near-boundary mistakes.

The work is straightforward and the labeling trick is a reasonable way to handle the ordinal nature of effort categories. Reporting the full augmentation list and the gradual unfreezing schedule makes the recipe easy to re-run on the public data.

The soft spots are the usual ones for this style of paper. Everything is measured on AVID alone; there is no held-out corpus or cross-domain test to show the improvements survive changes in recording conditions or speaker populations. The abstract gives no standard deviations, no significance tests, and no ablation that isolates whether the final number depends on post-hoc choices of augmentation parameters or the Gaussian width. Those details matter when the headline claim is a new SOTA on a single narrow task.

This is useful reading for anyone already working on effort-aware speech systems or SSL fine-tuning for paralinguistic labels. It is not foundational, but the empirical steps are clear enough that a referee could check the controls and ask for the missing variance numbers. I would send it to review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The manuscript benchmarks WavLM-BASE (with gradual unfreezing) against wav2vec2 and HuBERT on the AVID corpus for five-class vocal effort classification. It reports that a systematic suite of augmentations (RIR, noise, time masking, speed perturbation, band-limiting, MixUp, CutMix) yields consistent absolute gains of 0.6–1.8 % and that Gaussian-neighbor soft labels further reduce boundary confusions, producing a new claimed SOTA of 78.2 % mean accuracy.

Significance. If the reported gains prove robust, the work supplies a practical recipe for improving SSL-based vocal-effort classifiers under data scarcity and provides the first WavLM results on this task. The systematic augmentation ablation is a clear strength that can be reused by the community.

major comments (3)

[§4, Table 3] §4 (Results) and Table 3: the headline 78.2 % figure is presented without per-run standard deviations, number of random seeds, or any statistical significance test against the prior SSL baselines; the +0.6–1.8 % augmentation gains therefore cannot be distinguished from run-to-run variance.
[§3.2, §4.2] §3.2 (Labeling) and §4.2: the Gaussian width parameter is a free hyper-parameter whose value is not justified by cross-validation or sensitivity analysis; because the soft-label distribution directly affects the reported accuracy, this choice is load-bearing for the central claim that the method models the vocal-effort continuum.
[§5] §5 (Discussion): no cross-corpus or held-out naturalistic recording set is evaluated, so the generalization claim that AVID “adequately captures naturalistic non-calibrated speech variability” rests on a single corpus whose label distribution may have been implicitly tuned during augmentation and label-width search.

minor comments (2)

[Abstract] The abstract states “mean accuracy” but does not specify whether this is macro- or micro-averaged; consistent terminology should be used throughout.
[Figure 2] Figure 2 (augmentation pipeline) lacks axis labels on the spectrogram examples, making it difficult to verify the claimed band-limiting and speed-perturbation effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, committing to revisions that strengthen the statistical rigor and transparency of the work while honestly noting limitations that cannot be fully resolved without new data collection.

read point-by-point responses

Referee: [§4, Table 3] §4 (Results) and Table 3: the headline 78.2 % figure is presented without per-run standard deviations, number of random seeds, or any statistical significance test against the prior SSL baselines; the +0.6–1.8 % augmentation gains therefore cannot be distinguished from run-to-run variance.

Authors: We agree that the absence of run-level statistics weakens the claims. In the revised manuscript we will report all key results (including the 78.2 % figure and augmentation ablations) as means over five random seeds with standard deviations, and we will add a paired statistical test (t-test on per-seed accuracies) against the wav2vec2 and HuBERT baselines to establish significance of the observed gains. revision: yes
Referee: [§3.2, §4.2] §3.2 (Labeling) and §4.2: the Gaussian width parameter is a free hyper-parameter whose value is not justified by cross-validation or sensitivity analysis; because the soft-label distribution directly affects the reported accuracy, this choice is load-bearing for the central claim that the method models the vocal-effort continuum.

Authors: The width σ = 0.5 was selected after limited validation sweeps; we accept that a fuller justification is required. The revision will include a sensitivity table (or appendix figure) showing mean accuracy and boundary-confusion rates for σ ∈ {0.3, 0.5, 0.7, 1.0} under the same augmentation regime, thereby documenting the robustness of the chosen value. revision: yes
Referee: [§5] §5 (Discussion): no cross-corpus or held-out naturalistic recording set is evaluated, so the generalization claim that AVID “adequately captures naturalistic non-calibrated speech variability” rests on a single corpus whose label distribution may have been implicitly tuned during augmentation and label-width search.

Authors: We concur that single-corpus results limit strong generalization statements. AVID remains the only publicly available corpus with the required five-class vocal-effort labels in naturalistic conditions; no comparable held-out dataset exists for immediate evaluation. The revised discussion will explicitly flag this as a limitation and list cross-corpus validation on future datasets as a priority for follow-up work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarks

full rationale

The paper reports direct empirical accuracy results (78.2% mean accuracy) from benchmarking WavLM variants, augmentations, and Gaussian-neighbor soft labels on the fixed public AVID corpus against prior SSL baselines. No equations, parameter fits, or derivations are present that reduce any claimed result to its own inputs by construction; all gains are measured outcomes on held-out data rather than self-definitional or fitted-input predictions. Self-citations, if present, are not load-bearing for the central claim, which remains externally falsifiable via the public corpus and standard ML evaluation protocols.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

The claim rests on standard assumptions that SSL speech models transfer to vocal effort and that augmentation improves generalization, plus one new invented labeling scheme whose effectiveness is shown only empirically on this corpus.

free parameters (2)

augmentation parameters
Specific levels for RIR convolution, additive noise, speed perturbation, and MixUp/CutMix are selected and likely tuned on validation data.
Gaussian width parameter
The spread parameter controlling how much probability mass is assigned to neighboring effort classes is a tunable choice.

axioms (2)

domain assumption Self-supervised models pretrained on large unlabeled speech corpora extract features useful for vocal effort discrimination.
Basis for benchmarking WavLM against wav2vec2 and HuBERT.
domain assumption Data augmentation strategies improve robustness without introducing harmful distribution shift for this task.
Invoked to justify the systematic augmentation study.

invented entities (1)

Gaussian-neighbor soft labels no independent evidence
purpose: To represent vocal effort as a continuum and reduce boundary confusions between adjacent classes.
New proposal introduced to address limitations of hard labels.

pith-pipeline@v0.9.1-grok · 5764 in / 1335 out tokens · 44776 ms · 2026-06-29T00:46:14.513050+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

INTRODUCTION V ocal effort, ranging from whisper, soft, neutral, loud, and shouted speech, alters speech production and acoustic struc- ture and impacts both intelligibility and performance of speech systems. Prior studies have examined this continuum, showing spectral and energy shifts that degrade robustness in speaker identification [1, 2, 3]. Related ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Dataset All experiments are performed using A VID corpus [8], orig- inally designed for vocal effort

METHODOLOGY 2.1. Dataset All experiments are performed using A VID corpus [8], orig- inally designed for vocal effort. The corpus contains record- ings from 50 English speakers (25 male, 25 female), each asked to read prompted sentences at four instructed vocal in- tensity levels:soft,normal,loud, andvery loud. Recordings were made in a laboratory setting...
[3]

This way, the mixed targets follow the mixing weights and the or- dinal structure of VE

and CutMix [21] where, instead of interpolating one- hot labels, we interpolate Gaussian-neighbor soft labels. This way, the mixed targets follow the mixing weights and the or- dinal structure of VE. Figure 3 illustrates the effect of different Soft-label strate- gies. Hard labels assign full probability to a single class, while label smoothing redistribu...
[4]

Experimental Setup All experiments are conducted on the A VID corpus [8] un- der the non-calibrated condition, using the four instructed effort categories

EXPERIMENTS 3.1. Experimental Setup All experiments are conducted on the A VID corpus [8] un- der the non-calibrated condition, using the four instructed effort categories. Evaluation follows 10-fold group cross- validation, and results are reported as mean accuracy with standard deviation across folds. Models are fine-tuned end- to-end with Adam (encoder...
[5]

This is the first study of WavLM for vocal effort classification and showed that it outperforms wav2vec2 and HuBERT by over 7% and 1% ab- solute, respectively

CONCLUSION This study has considered advancements in vocal effort clas- sification for naturalistic speech data. This is the first study of WavLM for vocal effort classification and showed that it outperforms wav2vec2 and HuBERT by over 7% and 1% ab- solute, respectively. A systematic evaluation of augmentation strategies demonstrated consistent gains of ...
[6]

Analysis and classifi- cation of speech mode: Whispered through shouted,

C. Zhang and J. H. L. Hansen, “Analysis and classifi- cation of speech mode: Whispered through shouted,” in Proc. Interspeech, 2007, pp. 2289–2292

2007
[7]

Foren- sic speaker recognition under vocal effort variation,

V . Hughes, A. Eriksson, and A. Kachkovskiy, “Foren- sic speaker recognition under vocal effort variation,” in Proc. Interspeech, 2023, pp. 2333–2337

2023
[8]

Speaker verifi- cation under vocal effort variation: Compensation ap- proaches,

A. Prieto, A. Miguel, and E. Lleida, “Speaker verifi- cation under vocal effort variation: Compensation ap- proaches,” inProc. Interspeech, 2022, pp. 2913–2917

2022
[9]

Deep neural network training for whisper speech recognition using small databases and generative model sampling,

S. Ghaffarzadegan, H. Boril, and J. H. L. Hansen, “Deep neural network training for whisper speech recognition using small databases and generative model sampling,” International Journal of Speech Technology, vol. 20, no. 4, pp. 1063–1075, Dec. 2017

2017
[10]

Analysis and calibra- tion of lombard effect and whisper for speaker recogni- tion,

F. Kelly and J. H. L. Hansen, “Analysis and calibra- tion of lombard effect and whisper for speaker recogni- tion,”IEEE Transactions on Audio, Speech, and Lan- guage Processing, vol. 29, pp. 927–942, Feb. 2021

2021
[11]

Ad- vancements in whispered speech detection for interac- tive/speech systems,

C. Zhang, J. H. L. Hansen, and H. A. Patil, “Ad- vancements in whispered speech detection for interac- tive/speech systems,” inSignal Acoust. Model. Speech Commun. Disord., vol. 5, pp. 9–32. De Gruyter, 2018

2018
[12]

An advanced entropy- based feature with frame-level vocal effort likelihood space modeling for distant whisper-island detection,

C. Zhang and J. H. L. Hansen, “An advanced entropy- based feature with frame-level vocal effort likelihood space modeling for distant whisper-island detection,” Speech Communication, vol. 66, pp. 107–117, 2015

2015
[13]

Avid: A speech database for machine learning studies on vocal intensity,

P. Alku, M. Kodali, M. Laaksonen, and S. R. Kadiri, “Avid: A speech database for machine learning studies on vocal intensity,”Speech Communication, vol. 157, pp. 103039, 2024

2024
[14]

Wavelet scattering network features for intensity cat- egory classification and prediction of spl from speech,

M. Kodali, S. R. Kadiri, S. Narayanan, and P. Alku, “Wavelet scattering network features for intensity cat- egory classification and prediction of spl from speech,” inProc. IEEE ICASSP, 2025

2025
[15]

wav2vec 2.0: A framework for self-supervised learn- ing of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learn- ing of speech representations,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020, pp. 12449–12460

2020
[16]

Hubert: Self- supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self- supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, no. 9, pp. 3451–3460, 2021

2021
[17]

Ast: Audio spec- trogram transformer,

Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spec- trogram transformer,” inProc. Interspeech, 2021, pp. 571–575

2021
[18]

Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals,

M. Kodali, S. R. Kadiri, and P. Alku, “Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals,” inProc. Interspeech, 2024, pp. 482–486

2024
[19]

Comparison of ssl embeddings for vocal intensity classification in non- calibrated conditions,

M. Kodali, S. R. Kadiri, and P. Alku, “Comparison of ssl embeddings for vocal intensity classification in non- calibrated conditions,” inProc. IEEE ICASSP, 2023, pp. 1–5

2023
[20]

On usage of an end-to- end deep neural architecture for handwritten digit string recognition,

Z. Omidi and B. Babaali, “On usage of an end-to- end deep neural architecture for handwritten digit string recognition,”Signal, Image and Video Processing, vol. 18, no. 4, pp. 3009–3020, 2024

2024
[21]

Specaugment: A simple data aug- mentation method for automatic speech recognition,

D. S. Park et al., “Specaugment: A simple data aug- mentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617

2019
[22]

Specmix: A mixed sample data augmentation method for training with time-frequency domain features,

G. Kim, D. K. Han, and H. Ko, “Specmix: A mixed sample data augmentation method for training with time-frequency domain features,” inProc. Interspeech, 2021, pp. 388–392

2021
[23]

When does label smoothing help?,

R. M ¨uller, S. Kornblith, and G. Hinton, “When does label smoothing help?,” inAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2019, vol. 32, pp. 4696–4705

2019
[24]

Soft label training for deep neural networks,

J. Xu and W. Zhou, “Soft label training for deep neural networks,” inProc. International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8

2020
[25]

mixup: Beyond empirical risk minimization,

H. Zhang, M. Cisse, Y . N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” inProc. International Conference on Learning Representations (ICLR), 2018

2018
[26]

Cutmix: Regularization strategy to train strong clas- sifiers with localizable features,

S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y . Yoo, “Cutmix: Regularization strategy to train strong clas- sifiers with localizable features,” inProc. IEEE Inter- national Conference on Computer Vision (ICCV), 2019, pp. 6023–6032

2019
[27]

Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

S. Chen et al., “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[28]

Fearless steps apollo: To- wards community resource development for science, technology, education, and historical preservation,

J. H. L. Hansen et al., “Fearless steps apollo: To- wards community resource development for science, technology, education, and historical preservation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Seoul, Korea, Apr. 2024, pp. 12816–12820

2024

[1] [1]

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

INTRODUCTION V ocal effort, ranging from whisper, soft, neutral, loud, and shouted speech, alters speech production and acoustic struc- ture and impacts both intelligibility and performance of speech systems. Prior studies have examined this continuum, showing spectral and energy shifts that degrade robustness in speaker identification [1, 2, 3]. Related ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Dataset All experiments are performed using A VID corpus [8], orig- inally designed for vocal effort

METHODOLOGY 2.1. Dataset All experiments are performed using A VID corpus [8], orig- inally designed for vocal effort. The corpus contains record- ings from 50 English speakers (25 male, 25 female), each asked to read prompted sentences at four instructed vocal in- tensity levels:soft,normal,loud, andvery loud. Recordings were made in a laboratory setting...

[3] [3]

This way, the mixed targets follow the mixing weights and the or- dinal structure of VE

and CutMix [21] where, instead of interpolating one- hot labels, we interpolate Gaussian-neighbor soft labels. This way, the mixed targets follow the mixing weights and the or- dinal structure of VE. Figure 3 illustrates the effect of different Soft-label strate- gies. Hard labels assign full probability to a single class, while label smoothing redistribu...

[4] [4]

Experimental Setup All experiments are conducted on the A VID corpus [8] un- der the non-calibrated condition, using the four instructed effort categories

EXPERIMENTS 3.1. Experimental Setup All experiments are conducted on the A VID corpus [8] un- der the non-calibrated condition, using the four instructed effort categories. Evaluation follows 10-fold group cross- validation, and results are reported as mean accuracy with standard deviation across folds. Models are fine-tuned end- to-end with Adam (encoder...

[5] [5]

This is the first study of WavLM for vocal effort classification and showed that it outperforms wav2vec2 and HuBERT by over 7% and 1% ab- solute, respectively

CONCLUSION This study has considered advancements in vocal effort clas- sification for naturalistic speech data. This is the first study of WavLM for vocal effort classification and showed that it outperforms wav2vec2 and HuBERT by over 7% and 1% ab- solute, respectively. A systematic evaluation of augmentation strategies demonstrated consistent gains of ...

[6] [6]

Analysis and classifi- cation of speech mode: Whispered through shouted,

C. Zhang and J. H. L. Hansen, “Analysis and classifi- cation of speech mode: Whispered through shouted,” in Proc. Interspeech, 2007, pp. 2289–2292

2007

[7] [7]

Foren- sic speaker recognition under vocal effort variation,

V . Hughes, A. Eriksson, and A. Kachkovskiy, “Foren- sic speaker recognition under vocal effort variation,” in Proc. Interspeech, 2023, pp. 2333–2337

2023

[8] [8]

Speaker verifi- cation under vocal effort variation: Compensation ap- proaches,

A. Prieto, A. Miguel, and E. Lleida, “Speaker verifi- cation under vocal effort variation: Compensation ap- proaches,” inProc. Interspeech, 2022, pp. 2913–2917

2022

[9] [9]

Deep neural network training for whisper speech recognition using small databases and generative model sampling,

S. Ghaffarzadegan, H. Boril, and J. H. L. Hansen, “Deep neural network training for whisper speech recognition using small databases and generative model sampling,” International Journal of Speech Technology, vol. 20, no. 4, pp. 1063–1075, Dec. 2017

2017

[10] [10]

Analysis and calibra- tion of lombard effect and whisper for speaker recogni- tion,

F. Kelly and J. H. L. Hansen, “Analysis and calibra- tion of lombard effect and whisper for speaker recogni- tion,”IEEE Transactions on Audio, Speech, and Lan- guage Processing, vol. 29, pp. 927–942, Feb. 2021

2021

[11] [11]

Ad- vancements in whispered speech detection for interac- tive/speech systems,

C. Zhang, J. H. L. Hansen, and H. A. Patil, “Ad- vancements in whispered speech detection for interac- tive/speech systems,” inSignal Acoust. Model. Speech Commun. Disord., vol. 5, pp. 9–32. De Gruyter, 2018

2018

[12] [12]

An advanced entropy- based feature with frame-level vocal effort likelihood space modeling for distant whisper-island detection,

C. Zhang and J. H. L. Hansen, “An advanced entropy- based feature with frame-level vocal effort likelihood space modeling for distant whisper-island detection,” Speech Communication, vol. 66, pp. 107–117, 2015

2015

[13] [13]

Avid: A speech database for machine learning studies on vocal intensity,

P. Alku, M. Kodali, M. Laaksonen, and S. R. Kadiri, “Avid: A speech database for machine learning studies on vocal intensity,”Speech Communication, vol. 157, pp. 103039, 2024

2024

[14] [14]

Wavelet scattering network features for intensity cat- egory classification and prediction of spl from speech,

M. Kodali, S. R. Kadiri, S. Narayanan, and P. Alku, “Wavelet scattering network features for intensity cat- egory classification and prediction of spl from speech,” inProc. IEEE ICASSP, 2025

2025

[15] [15]

wav2vec 2.0: A framework for self-supervised learn- ing of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learn- ing of speech representations,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020, pp. 12449–12460

2020

[16] [16]

Hubert: Self- supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self- supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, no. 9, pp. 3451–3460, 2021

2021

[17] [17]

Ast: Audio spec- trogram transformer,

Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spec- trogram transformer,” inProc. Interspeech, 2021, pp. 571–575

2021

[18] [18]

Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals,

M. Kodali, S. R. Kadiri, and P. Alku, “Fine-tuning of pre-trained models for classification of vocal intensity category from speech signals,” inProc. Interspeech, 2024, pp. 482–486

2024

[19] [19]

Comparison of ssl embeddings for vocal intensity classification in non- calibrated conditions,

M. Kodali, S. R. Kadiri, and P. Alku, “Comparison of ssl embeddings for vocal intensity classification in non- calibrated conditions,” inProc. IEEE ICASSP, 2023, pp. 1–5

2023

[20] [20]

On usage of an end-to- end deep neural architecture for handwritten digit string recognition,

Z. Omidi and B. Babaali, “On usage of an end-to- end deep neural architecture for handwritten digit string recognition,”Signal, Image and Video Processing, vol. 18, no. 4, pp. 3009–3020, 2024

2024

[21] [21]

Specaugment: A simple data aug- mentation method for automatic speech recognition,

D. S. Park et al., “Specaugment: A simple data aug- mentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617

2019

[22] [22]

Specmix: A mixed sample data augmentation method for training with time-frequency domain features,

G. Kim, D. K. Han, and H. Ko, “Specmix: A mixed sample data augmentation method for training with time-frequency domain features,” inProc. Interspeech, 2021, pp. 388–392

2021

[23] [23]

When does label smoothing help?,

R. M ¨uller, S. Kornblith, and G. Hinton, “When does label smoothing help?,” inAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2019, vol. 32, pp. 4696–4705

2019

[24] [24]

Soft label training for deep neural networks,

J. Xu and W. Zhou, “Soft label training for deep neural networks,” inProc. International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8

2020

[25] [25]

mixup: Beyond empirical risk minimization,

H. Zhang, M. Cisse, Y . N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” inProc. International Conference on Learning Representations (ICLR), 2018

2018

[26] [26]

Cutmix: Regularization strategy to train strong clas- sifiers with localizable features,

S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y . Yoo, “Cutmix: Regularization strategy to train strong clas- sifiers with localizable features,” inProc. IEEE Inter- national Conference on Computer Vision (ICCV), 2019, pp. 6023–6032

2019

[27] [27]

Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

S. Chen et al., “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022

[28] [28]

Fearless steps apollo: To- wards community resource development for science, technology, education, and historical preservation,

J. H. L. Hansen et al., “Fearless steps apollo: To- wards community resource development for science, technology, education, and historical preservation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Seoul, Korea, Apr. 2024, pp. 12816–12820

2024