Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss
Pith reviewed 2026-05-25 02:34 UTC · model grok-4.3
The pith
Word-level correctness modeling with alignment-aware acoustic fusion improves text-assisted intelligibility prediction for hearing-impaired listeners.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that reference-conditioned word-level correctness modeling, once augmented with acoustic fusion, yields more accurate sentence intelligibility estimates than the decoder baseline alone. The model is built around a teacher-forced decoder that conditions on the canonical transcript; the fusion adds word-aligned local acoustic features obtained from character-level cross-attention and an utterance-level global acoustic branch.
What carries the argument
Reference-conditioned word-level correctness modeling with character-level cross-attention alignment for acoustic fusion.
If this is right
- Sentence intelligibility follows directly from averaging word correctness probabilities obtained under reference conditioning.
- With the acoustic branches added, incorrect-word detection reaches F1 0.778 and MCC 0.626 on the official evaluation set.
- The same fusion pattern produces gains when the underlying model is switched to Whisper medium.
- Word-level prediction granularity and alignment-aware fusion together outperform the decoder baseline, which itself already conditions on the reference transcript.
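The incorrect-word F1 and MCC figures in the list above are standard binary-classification metrics. A minimal sketch of how they could be computed follows, assuming that "incorrect" is treated as the positive class (label 1) and that word probabilities have already been thresholded into hard decisions; the function name and the toy labels are illustrative and not taken from the paper.

```python
import math


def incorrect_word_metrics(y_true, y_pred):
    """Return (F1, MCC) with label 1 = word misrecognized (positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # F1 is the harmonic mean of precision and recall on the positive class.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four confusion-matrix cells, so it is less sensitive
    # to class imbalance than F1.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return f1, mcc


# Toy labels (hypothetical): 1 = listener got the word wrong.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
f1, mcc = incorrect_word_metrics(y_true, y_pred)
print(f"F1={f1:.3f}  MCC={mcc:.3f}")
```

Reporting MCC alongside F1 is useful here because the balance between correct and incorrect words can vary a lot across listening conditions.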
Where Pith is reading between the lines
- The alignment step could be reused in other tasks that combine transcripts with noisy audio for per-word analysis.
- Word-level scores might allow targeted feedback in hearing-aid fitting or communication training focused on difficult words.
- If the averaging step holds across conditions, the same pipeline could support real-time monitoring of expected intelligibility in changing acoustic environments.
Load-bearing premise
Averaging predicted word-level correctness probabilities over valid reference words produces an accurate sentence-level intelligibility percentage.
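The aggregation this premise describes can be sketched in a few lines. The version below assumes a boolean mask marks which reference words count as "valid"; how the paper defines validity is not stated in the abstract, so the mask, the function name, and the toy values are all hypothetical.

```python
def sentence_intelligibility(word_probs, valid_mask):
    """Mean predicted correctness over valid reference words, as a percentage."""
    kept = [p for p, is_valid in zip(word_probs, valid_mask) if is_valid]
    if not kept:
        raise ValueError("sentence has no valid reference words")
    return 100.0 * sum(kept) / len(kept)


# Toy example (hypothetical values): four reference words, the last one excluded.
probs = [0.9, 0.4, 0.8, 0.1]
mask = [True, True, True, False]
score = sentence_intelligibility(probs, mask)
print(f"predicted intelligibility: {score:.1f}%")
```

Because the sentence score is a plain mean, any systematic miscalibration of the per-word probabilities carries straight through to the sentence-level estimate.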
What would settle it
Collect new listener data on the same sentences and test whether the model's averaged word correctness probabilities match the actual percentage of words correctly identified by hearing-impaired participants.
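Such a check would come down to comparing predicted and measured sentence scores with the same two metrics the abstract reports, RMSE and correlation. A minimal sketch follows, assuming the correlation is Pearson's r (the abstract does not say which correlation is used) and using made-up scores purely for illustration.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length score lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson(pred, target):
    """Pearson correlation; undefined if either list is constant."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Hypothetical scores in percent: model output vs. listener-measured word accuracy.
predicted = [70.0, 20.0, 90.0, 50.0]
measured = [60.0, 30.0, 100.0, 50.0]
error = rmse(predicted, measured)
r = pearson(predicted, measured)
print(f"RMSE={error:.2f}  r={r:.3f}")
```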
Original abstract
We address text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3. Although the target is a sentence-level percentage, it is determined by reference-word recognition outcomes. We formulate prediction as reference-conditioned word-level correctness modeling: a frozen Whisper encoder analyzes degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is obtained by averaging predicted correctness probabilities over valid reference words. To complement transcript-conditioned decoder states, we add a word-aligned local acoustic branch based on character-level cross-attention alignment and an utterance-level global acoustic branch for calibration. On the official evaluation set, the decoder baseline obtains RMSE 24.92 and correlation 0.795, while joint fusion improves to incorrect-word F1 0.778, MCC 0.626, correlation 0.806, and RMSE 24.39. A similar trend with Whisper medium suggests that the gain comes from prediction granularity and alignment-aware fusion.
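The abstract does not spell out how the word-aligned local acoustic branch turns character-level cross-attention into per-word features. One plausible reading, offered here only as an assumption, is attention-weighted pooling: sum the attention rows of the characters belonging to a word, renormalise over encoder frames, and take the weighted average of the frame features. All names and the toy tensors below are hypothetical.

```python
def word_aligned_features(attn, char_to_word, frames):
    """Pool frame features per reference word using character-level attention.

    attn[c][t]      : attention weight from character c to encoder frame t
    char_to_word[c] : index of the word containing character c (None = separator)
    frames[t]       : acoustic feature vector of frame t
    """
    n_words = max(w for w in char_to_word if w is not None) + 1
    n_frames, dim = len(frames), len(frames[0])
    pooled = []
    for w in range(n_words):
        rows = [attn[c] for c, cw in enumerate(char_to_word) if cw == w]
        # Sum attention over the word's characters, frame by frame.
        weights = [sum(row[t] for row in rows) for t in range(n_frames)]
        total = sum(weights)
        if total > 0:
            weights = [x / total for x in weights]
        else:
            # No attention mass for this word: fall back to a uniform average.
            weights = [1.0 / n_frames] * n_frames
        pooled.append(
            [sum(weights[t] * frames[t][d] for t in range(n_frames)) for d in range(dim)]
        )
    return pooled


# Toy input: 3 characters forming 2 words, 4 encoder frames, 2-dim features.
attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
char_to_word = [0, 0, 1]
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
pooled = word_aligned_features(attn, char_to_word, frames)
print(pooled)
```

Soft pooling of this kind avoids committing to hard word boundaries, which may matter when the speech is heavily degraded and the attention is diffuse.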
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses text-assisted speech intelligibility prediction for hearing-impaired listeners on the CPC3 task. It formulates the problem as reference-conditioned word-level correctness modeling: a frozen Whisper encoder processes degraded speech while a teacher-forced decoder conditions on the canonical transcript; sentence-level intelligibility is obtained by averaging predicted word correctness probabilities. An alignment-aware local acoustic branch (via character-level cross-attention) and an utterance-level global acoustic branch are added and fused with the decoder states. On the official evaluation set the joint-fusion model improves over the decoder baseline (RMSE 24.92 / correlation 0.795) to RMSE 24.39 / correlation 0.806 together with incorrect-word F1 0.778 and MCC 0.626; a similar trend is noted with Whisper-medium.
Significance. If the reported gains prove robust, the work shows that transcript-conditioned word-level modeling plus alignment-aware acoustic fusion can yield modest but consistent improvements on an external held-out set. The use of a frozen pre-trained encoder and evaluation on the official CPC3 split are strengths that support reproducibility and direct comparability.
major comments (2)
- [Abstract] The numerical improvements (correlation 0.795→0.806, RMSE 24.92→24.39) are presented without error bars, confidence intervals, or statistical significance tests, and no ablation results isolate the contribution of the alignment-aware fusion; these omissions are load-bearing for the central empirical claim.
- [Abstract] The description of the character-level cross-attention alignment provides no information on how the alignment is trained, validated, or regularized, which directly affects the claimed benefit of the word-aligned local acoustic branch.
minor comments (1)
- [Abstract] The abstract states that sentence intelligibility is obtained by averaging over valid reference words; a brief clarification of how “valid” words are identified would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments point by point below.
- Referee: [Abstract] The numerical improvements (correlation 0.795→0.806, RMSE 24.92→24.39) are presented without error bars, confidence intervals, or statistical significance tests, and no ablation results isolate the contribution of the alignment-aware fusion; these omissions are load-bearing for the central empirical claim.
  Authors: We agree that error bars, confidence intervals, and significance testing would strengthen the central claim. In the revision we will add bootstrap-derived 95% confidence intervals for all reported metrics on the CPC3 evaluation set and include a paired bootstrap significance test for the observed improvements. We will also add an ablation table that isolates the alignment-aware local branch (character-level cross-attention) from the global acoustic branch and the decoder baseline.
  Revision: yes
- Referee: [Abstract] The description of the character-level cross-attention alignment provides no information on how the alignment is trained, validated, or regularized, which directly affects the claimed benefit of the word-aligned local acoustic branch.
  Authors: The abstract is space-constrained, but Section 3.2 of the manuscript specifies that the character-level cross-attention is trained end-to-end jointly with the word-correctness objective (binary cross-entropy) using the same optimizer and learning-rate schedule as the rest of the model; no task-specific regularization is applied beyond standard dropout (p=0.1) and the frozen Whisper encoder. We will revise the abstract to include a one-sentence summary of this joint training procedure.
  Revision: partial
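The paired bootstrap promised in the first response could look roughly like the sketch below. It resamples utterances with replacement, recomputes RMSE for both systems on each resample, and summarises the difference. The resample count, the percentile interval, and every name here are assumptions for illustration; the rebuttal does not specify the procedure in this detail.

```python
import math
import random


def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def paired_bootstrap_rmse(pred_base, pred_fused, target, n_resamples=2000, seed=0):
    """Bootstrap the RMSE difference (baseline minus fused) over utterances.

    Returns the 95% percentile interval of the difference and the fraction of
    resamples in which the fused system is NOT better (difference <= 0).
    """
    rng = random.Random(seed)
    n = len(target)
    diffs = []
    for _ in range(n_resamples):
        # The same resampled indices are used for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        tgt = [target[i] for i in idx]
        base = rmse([pred_base[i] for i in idx], tgt)
        fused = rmse([pred_fused[i] for i in idx], tgt)
        diffs.append(base - fused)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    frac_not_better = sum(1 for d in diffs if d <= 0) / n_resamples
    return low, high, frac_not_better


# Toy data (hypothetical): the fused system is exact, the baseline is off by 5 points.
target = [float(x) for x in range(0, 100, 10)]
pred_fused = list(target)
pred_base = [t + 5.0 for t in target]
ci_low, ci_high, p_not_better = paired_bootstrap_rmse(pred_base, pred_fused, target)
print(ci_low, ci_high, p_not_better)
```

An interval that excludes zero would support the claim that the RMSE gain from 24.92 to 24.39 is not a resampling artifact; the same loop can be repeated with correlation in place of RMSE.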
Circularity Check
No significant circularity identified
The paper describes an empirical ML pipeline: a frozen Whisper encoder plus teacher-forced decoder for reference-conditioned word-level correctness prediction, augmented by alignment-aware acoustic fusion branches. Sentence-level scores are obtained by explicit averaging of per-word probabilities, which the abstract states matches the target definition (reference-word recognition outcomes). All reported gains (RMSE 24.92→24.39, correlation 0.795→0.806) are measured on the official held-out CPC3 evaluation set. No equations, self-citations, or ansatzes reduce any claimed prediction to a fitted input by construction; the derivation chain consists of standard supervised training and aggregation steps whose validity is tested externally rather than assumed tautologically.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sentence intelligibility percentage is obtained by averaging predicted correctness probabilities over valid reference words.