Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

Anfeng Xu; Shrikanth Narayanan; Tiantian Feng

arxiv: 2604.05201 · v2 · pith:QN4MLT5Lnew · submitted 2026-04-06 · 📡 eess.AS

Pith reviewed 2026-05-21 09:12 UTC · model grok-4.3

keywords: speaker diarization · speech foundation models · age domain shift · lifespan evaluation · multi-age training · EEND-VC · Whisper encoder

The pith

Speech foundation models for speaker diarization lose accuracy on child and older-adult conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests speech foundation models inside one end-to-end diarization system on conversations that include children, adults, and older adults. Adult-only training produces clear drops in performance when the same models are applied to the other age groups. Training on mixed-age data improves results across all groups while leaving adult performance unchanged. Age-specific adaptation adds still more gains, especially when the Whisper encoder is used. Anyone building speech tools that must work for real families or mixed-age meetings needs to know whether current foundation models already handle these shifts or require explicit age-aware training.

Core claim

Within the EEND-VC end-to-end neural diarization framework, zero-shot transfer from adult-trained foundation models to child and older-adult conversational data produces substantial performance degradation. Joint multi-age training restores robustness without any reduction in performance on canonical adult conversations. Targeted adaptation to a single age group produces further gains, with the largest improvements observed when the Whisper encoder is used.

What carries the argument

The EEND-VC unified end-to-end neural diarization framework, used to run and compare zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation on the same backbone models.

If this is right

Joint training on data from multiple age groups offers a direct route to lifespan-robust diarization without sacrificing performance on adult conversations.
Targeted adaptation to one age group can deliver measurable extra gains once a base model has been trained on mixed ages.
Adult-only training leaves measurable gaps in real-world applications that include children or older adults.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Age functions as a domain-shift factor that likely affects other speech tasks such as recognition or emotion detection when foundation models are involved.
System builders should include age diversity during initial training rather than relying on later adaptation alone.
Combining joint multi-age pre-training with lightweight age-specific adaptation may give the strongest overall results.

Load-bearing premise

The chosen conversational datasets for each age group are representative of typical speech patterns and acoustic conditions, and the EEND-VC evaluation setup compares the models fairly across domains.

What would settle it

A new balanced lifespan dataset on which adult-only trained models achieve equal or better diarization error rates than either joint multi-age models or age-adapted models would falsify the need for age-aware training.
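Because the test above is phrased in terms of diarization error rate (DER), a minimal sketch of the standard DER definition may help. The function name and the durations are illustrative assumptions; they are not values or scoring settings (for example, collar or overlap handling) taken from the paper.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard DER: (missed + false alarm + confusion) / total reference speech."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations in seconds, purely illustrative (not from the paper).
der = diarization_error_rate(missed=30.0, false_alarm=12.0,
                             confusion=18.0, total_speech=600.0)
print(f"DER = {100 * der:.1f}%")  # prints: DER = 10.0%
```

Under this definition, the falsification test amounts to comparing this ratio for adult-only, joint multi-age, and age-adapted models on the same balanced test set.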

Figures

Figures reproduced from arXiv: 2604.05201 by Anfeng Xu, Shrikanth Narayanan, Tiantian Feng.

**Figure 1.** DER (%) for Playlogue and SeniorTalk under combined training and domain adaptation using Whisper-Medium and WavLM-DiariZen under 8 s and 16 s windows.

Original abstract

Speech foundation models have shown strong transferability across a wide range of speech applications. However, their robustness to age-related domain shift in speaker diarization remains underexplored. In this work, we present a cross-lifespan evaluation within a unified end-to-end neural diarization framework (EEND-VC), covering speech samples from conversations involving children, adults, and older adults. We compare models under zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation. Results show substantial performance degradation when models trained on adult-specific speech are applied to child and older-adult conversational data. Moreover, joint multi-age training across different age groups improves robustness without reducing diarization performance in canonical adult conversations, while targeted age group adaptation yields further gains in diarization performance, particularly when using the Whisper encoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The paper investigates the robustness of speech foundation models for speaker diarization across different age groups using the EEND-VC framework. It compares zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation on conversational data from children, adults, and older adults. The results indicate substantial performance degradation in zero-shot transfer from adult to child and older-adult data, improved robustness from joint training without loss on adult conversations, and additional gains from targeted adaptation, especially with the Whisper encoder.

Significance. If these findings are confirmed, the work makes a significant contribution by highlighting age-related domain shifts in speaker diarization and proposing effective mitigation strategies through multi-age training and adaptation. This has implications for developing more inclusive speech technologies that perform well across the lifespan. The controlled experimental setup within a single framework allows for direct comparisons that strengthen the validity of the conclusions. The consistent DER trends across conditions provide a solid empirical basis for the claims.

major comments (2)

[§3.1] Datasets: The description of the age-stratified conversational datasets lacks key details on speaker counts, total audio duration, age ranges within groups, and recording conditions. This information is load-bearing for interpreting the domain-shift results and confirming that observed DER differences reflect age-related factors rather than unaccounted confounders.
[§4.2] Results: The reported DER values for zero-shot degradation, joint-training gains, and adaptation improvements are presented without error bars, standard deviations across runs, or statistical significance tests. This weakens the ability to verify that the trends (particularly the robustness without adult-performance loss) are reliable rather than due to experimental variability.

minor comments (4)

[Abstract] Including one or two key quantitative DER values (e.g., the magnitude of degradation or improvement) would better convey the scale of the reported effects to readers.
[§2] Provide additional detail on whether and how the vector-clustering component of EEND-VC is modified or held constant across age groups to ensure the framework comparisons remain unbiased.
[Figure 2 or 3] Legends and axis labels should explicitly distinguish the training/inference conditions (zero-shot, joint, adapted) and age groups for immediate clarity.
[References] A few recent papers on age-related speech variability or multi-domain diarization could be added to strengthen the positioning of the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of our work and the recommendation for minor revision. We believe the findings highlight important age-related challenges in speaker diarization and are grateful for the suggestions to improve the manuscript's clarity and rigor.

Referee: [§3.1] Datasets: The description of the age-stratified conversational datasets lacks key details on speaker counts, total audio duration, age ranges within groups, and recording conditions. This information is load-bearing for interpreting the domain-shift results and confirming that observed DER differences reflect age-related factors rather than unaccounted confounders.

Authors: We agree that the dataset description in §3.1 is insufficiently detailed. In the revised manuscript, we will add the requested information, including speaker counts per age group, total audio durations, specific age ranges within each group, and details on recording conditions. This will allow readers to better assess potential confounders and confirm the age-related domain shifts.
Revision: yes

Referee: [§4.2] Results: The reported DER values for zero-shot degradation, joint-training gains, and adaptation improvements are presented without error bars, standard deviations across runs, or statistical significance tests. This weakens the ability to verify that the trends (particularly the robustness without adult-performance loss) are reliable rather than due to experimental variability.

Authors: We acknowledge that reporting measures of variability would enhance the credibility of the results. We will revise §4.2 to include error bars or standard deviations for the DER values, derived from multiple runs with different random seeds. We will also conduct and report appropriate statistical tests (such as paired t-tests) to verify the significance of the improvements and the absence of degradation on adult data under joint training.
Revision: yes
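The analysis promised in this response could look roughly like the sketch below, which reports mean ± standard deviation over seeds and a paired t statistic. The helper name `paired_t_statistic` and all DER values are hypothetical placeholders; they are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical adult-test DER (%) per random seed; NOT results from the paper.
adult_only = [15.2, 15.0, 15.4, 15.1, 15.3]
joint = [15.1, 15.1, 15.3, 15.2, 15.2]

for name, vals in (("adult-only", adult_only), ("joint", joint)):
    print(f"{name}: {statistics.mean(vals):.2f} ± {statistics.stdev(vals):.2f}")

t, df = paired_t_statistic(adult_only, joint)
print(f"paired t = {t:.3f} with {df} degrees of freedom")
```

A non-significant t statistic does not by itself demonstrate that adult performance is unchanged; supporting that claim would need an equivalence-style test with a pre-stated margin.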

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations or fitted predictions

The manuscript reports controlled empirical evaluations of speech foundation models within the EEND-VC diarization framework on age-stratified conversational corpora. Central claims concern observed DER degradation under adult-to-child/older-adult zero-shot transfer, robustness from joint multi-age training, and gains from targeted adaptation. These rest on direct metric comparisons across conditions rather than any derivation chain, equations, or parameter fitting that could reduce outputs to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text or abstract. The work is self-contained as a set of benchmark experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is limited to the framework and data assumptions stated there.

axioms (1)

Domain assumption: The EEND-VC framework is a suitable unified end-to-end neural diarization setup for comparing foundation models across age groups.
Invoked as the common evaluation platform in the abstract.

pith-pipeline@v0.9.0 · 5674 in / 1228 out tokens · 42592 ms · 2026-05-21T09:12:48.094121+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We compare models under zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation... Results show substantial performance degradation when models trained on adult-specific speech are applied to child and older-adult conversational data."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "The EEND module consists of an encoder, a Conformer, and a linear classification layer... powerset loss, supporting up to four speakers"
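The passage quoted above mentions a powerset loss supporting up to four speakers. As a rough illustration of what a powerset label space looks like, the sketch below enumerates speaker subsets as classes. The cap of two simultaneously active speakers is an assumption made for illustration (the excerpt does not state it), and `powerset_classes` is a hypothetical helper, not part of DiariZen or pyannote.

```python
from itertools import combinations


def powerset_classes(num_speakers, max_overlap):
    """List every speaker subset with at most `max_overlap` active speakers."""
    classes = []
    for k in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), k))
    return classes


# Four local speakers as in the quoted passage; max_overlap=2 is an assumption.
classes = powerset_classes(num_speakers=4, max_overlap=2)
print(len(classes))  # 1 + 4 + 6 = 11
```

Each frame is then classified into exactly one of these subsets, which turns multi-label speaker activity detection into a single multi-class problem.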

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

[1]

who spoke when and for how long?

Introduction. Speaker diarization aims to automatically determine “who spoke when and for how long?” in multi-speaker recordings and serves as a fundamental component for downstream speech technologies, including automatic speech recognition (ASR) [1]. The recent emergence of large-scale speech foundation models has reshaped the landscape of speech pro...

work page
[2]

Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

Method. 2.1. EEND-VC Pipeline. The proposed system is built upon DiariZen [12, 13], which follows the EEND-VC framework. It uses the Pyannote [14] backend to cluster speakers and combine local EEND speaker diarization results. We vary the encoder of the EEND module across different speech foundation models, while keeping all ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Datasets. We use three widely adopted public datasets for general adult speaker diarization, along with one public dataset involving children and one involving older adults

Experiments. 3.1. Datasets. We use three widely adopted public datasets for general adult speaker diarization, along with one public dataset involving children and one involving older adults. Unless otherwise specified, we use the official train, dev, and test split from each dataset. The dataset details are in Table 1. 3.1.1. Diarization datasets involvi...

work page
[4]

Results from Adult-only Training. Table 2 reports the DERs when the models are trained only on adult datasets

Results and Analysis. 4.1. Results from Adult-only Training. Table 2 reports the DERs when the models are trained only on adult datasets. Among the WavLM and Whisper models trained from scratch for speaker diarization, Whisper-Medium achieves the best overall performance across the adult in-domain datasets. Out-of-domain results show heterogeneous trends, ...

work page
[5]

Our results show that models trained only on adult speech perform poorly on child and older-adult data, confirming a significant age-related domain shift

Conclusion. In this work, we have presented a systematic cross-lifespan evaluation of speech foundation models for speaker diarization within a unified EEND-VC framework. Our results show that models trained only on adult speech perform poorly on child and older-adult data, confirming a significant age-related domain shift. Joint multi-age training impro...

work page
[6]

These tools were not used to generate or interpret experimental results

Generative AI Use Disclosure. We used Generative AI tools in this study to assist with language polishing, manuscript editing, and limited code development support for data analyses and visualizations. These tools were not used to generate or interpret experimental results. All authors are fully aware of the extent of generative AI use, take full respo...

work page 2026
[7]

A review of speaker diarization: Recent advances with deep learning,

T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan, “A review of speaker diarization: Recent advances with deep learning,” Computer Speech & Language, vol. 72, p. 101317, 2022

work page 2022
[8]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518

work page 2023
[9]

WavLM: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022
[10]

Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,

S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999

work page 1999
[11]

Robust recognition of children’s speech,

A. Potamianos and S. Narayanan, “Robust recognition of children’s speech,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 603–616, 2004

work page 2004
[12]

Temporal characteristics of the speech of normal elderly adults,

B. L. Smith, J. Wasowicz, and J. Preston, “Temporal characteristics of the speech of normal elderly adults,” Journal of Speech, Language, and Hearing Research, vol. 30, no. 4, pp. 522–529, 1987

work page 1987
[13]

Speech- and language-based classification of Alzheimer’s disease: a systematic review,

I. Vigo, L. Coelho, and S. Reis, “Speech- and language-based classification of Alzheimer’s disease: a systematic review,” Bioengineering, vol. 9, no. 1, p. 27, 2022

work page 2022
[14]

End-to-end neural speaker diarization with self-attention,

Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 296–303

work page 2019
[15]

End-to-end neural speaker diarization with permutation-free objectives,

Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” Interspeech, 2019

work page 2019
[16]

Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,

K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7198–7202

work page 2021
[17]

Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech,

K. Kinoshita, M. Delcroix, and N. Tawara, “Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech,” Interspeech, 2021

work page 2021
[18]

Leveraging self-supervised learning for speaker diarization,

J. Han, F. Landini, J. Rohdin, A. Silnova, M. Diez, and L. Burget, “Leveraging self-supervised learning for speaker diarization,” in Proc. ICASSP, 2025

work page 2025
[19]

Efficient and generalizable speaker diarization via structured pruning of self-supervised models,

J. Han, P. Pálka, M. Delcroix, F. Landini, J. Rohdin, J. Černocký, and L. Burget, “Efficient and generalizable speaker diarization via structured pruning of self-supervised models,” arXiv preprint arXiv:2506.18623, 2025

work page arXiv 2025
[20]

pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,

H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in 24th Interspeech Conference (INTERSPEECH 2023). ISCA, 2023, pp. 1983–1987

work page 2023
[21]

Conformer: Convolution-augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” Interspeech, 2020

work page 2020
[22]

Advancing speaker embedding learning: Wespeaker toolkit for research and production,

S. Wang, Z. Chen, B. Han, H. Wang, C. Liang, B. Zhang, X. Xiang, W. Ding, J. Rohdin, A. Silnova et al., “Advancing speaker embedding learning: Wespeaker toolkit for research and production,” Speech Communication, vol. 162, p. 103104, 2024

work page 2024
[23]

VoxCeleb2: Deep speaker recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” Interspeech, 2018

work page 2018
[24]

Vox-Profile: A speech foundation model benchmark for characterizing diverse speaker and speech traits,

T. Feng, J. Lee, A. Xu, Y. Lee, T. Lertpetchpun, X. Shi, H. Wang, T. Thebaud, L. Moro-Velazquez, D. Byrd et al., “Vox-Profile: A speech foundation model benchmark for characterizing diverse speaker and speech traits,” arXiv preprint arXiv:2505.14648, 2025

work page arXiv 2025
[25]

Voxlect: A speech foundation model benchmark for modeling dialects and regional languages around the globe,

T. Feng, K. Huang, A. Xu, X. Shi, T. Lertpetchpun, J. Lee, Y. Lee, D. Byrd, and S. Narayanan, “Voxlect: A speech foundation model benchmark for modeling dialects and regional languages around the globe,” KDD, 2025

work page 2025
[26]

Exploring speech foundation models for speaker diarization in child-adult dyadic interactions,

A. Xu, K. Huang, T. Feng, L. Shen, H. Tager-Flusberg, and S. Narayanan, “Exploring speech foundation models for speaker diarization in child-adult dyadic interactions,” Interspeech, 2024

work page 2024
[27]

Data efficient child-adult speaker diarization with simulated conversations,

A. Xu, T. Feng, H. Tager-Flusberg, C. Lord, and S. Narayanan, “Data efficient child-adult speaker diarization with simulated conversations,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

work page 2025
[28]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

work page 2021
[29]

SUPERB: Speech processing universal performance benchmark,

S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., “SUPERB: Speech processing universal performance benchmark,” Interspeech, 2021

work page 2021
[30]

The AMI meeting corpus,

W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The AMI meeting corpus,” in Proc. International Conference on Methods and Techniques in Behavioral Research, 2005, pp. 1–4

work page 2005
[31]

AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario,

Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, L. Xie, J. Wu, H. Bu et al., “AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario,” Interspeech, 2021

work page 2021
[32]

M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,

F. Yu, S. Zhang, Y. Fu, L. Xie, S. Zheng, Z. Du, W. Huang, P. Guo, Z. Yan, B. Ma et al., “M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6167–6171

work page 2022
[33]

Playlogue: Dataset and benchmarks for analyzing adult-child conversations during play,

M. Kalanadhabhatta, M. M. Rastikerdar, T. Rahman, A. S. Grabell, and D. Ganesan, “Playlogue: Dataset and benchmarks for analyzing adult-child conversations during play,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, pp. 1–34, 2024

work page 2024
[34]

The TalkBank project,

B. MacWhinney, “The TalkBank project,” in Creating and digitizing language corpora: Volume 1: Synchronic databases. Springer, 2007, pp. 163–180

work page 2007
[35]

NeMo: A toolkit for building AI applications using neural modules,

O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook et al., “NeMo: A toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019

work page arXiv 1909
[36]

End-to-end joint ASR and speaker role diarization with child-adult interactions,

A. Xu, T. Feng, S. Bishop, C. Lord, and S. Narayanan, “End-to-end joint ASR and speaker role diarization with child-adult interactions,” arXiv preprint arXiv:2601.17640, 2026

work page arXiv 2026
[37]

SeniorTalk: A Chinese conversation dataset with rich annotations for super-aged seniors,

Y. Chen, H. Wang, S. Wang, J. Chen, J. He, J. Zhou, X. Yang, Y. Wang, Y. Lin, and Y. Qin, “SeniorTalk: A Chinese conversation dataset with rich annotations for super-aged seniors,” NeurIPS, 2025

work page 2025
[38]

LoRA: Low-rank adaptation of large language models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022

work page 2022
[39]

The third DIHARD diarization challenge,

N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “The third DIHARD diarization challenge,” arXiv preprint arXiv:2012.01477, 2020

work page arXiv 2012

[1] [1]

who spoke when and for how long?

Introduction Speaker diarization aims to automatically determine “who spoke when and for how long?” in multi-speaker record- ings and serves as a fundamental component for downstream speech technologies, including automatic speech recognition (ASR) [1]. The recent emergence of large-scale speech foun- dation models has reshaped the landscape of speech pro...

work page

[2] [2]

Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

Method 2.1. EEND-VC Pipeline The proposed system is built upon DiariZen [12, 13], which follows the EEND-VC framework. It uses the Pyannote [14] backend to cluster speakers and combine local EEND speaker diarization results. We vary the encoder of the EEND module across different speech foundation models, while keeping all arXiv:2604.05201v1 [eess.AS] 6 A...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Datasets We use three widely adopted public datasets for general adult speaker diarization, along with one public dataset involving children and one involving older adults

Experiments 3.1. Datasets We use three widely adopted public datasets for general adult speaker diarization, along with one public dataset involving children and one involving older adults. Unless otherwise spec- ified, we use the official train, dev, and test split from each dataset. The dataset details are in Table 1. 3.1.1. Diarization datasets involvi...

work page

[4] [4]

Results from Adult-only Training Table 2 reports the DERs when the models are trained only on adult datasets

Results and Analysis 4.1. Results from Adult-only Training Table 2 reports the DERs when the models are trained only on adult datasets. Among the WavLM and Whisper models trained from scratch for speaker diarization, Whisper-Medium achieves the best overall performance across the adult in- domain datasets. Out-of-domain results show heterogeneous trends, ...

work page

[5] [5]

Our results show that models trained only on adult speech perform poorly on child and older-adult data, confirming a significant age-related do- main shift

Conclusion In this work, we have presented a systematic cross-lifespan evaluation of speech foundation models for speaker diarization within a unified EEND-VC framework. Our results show that models trained only on adult speech perform poorly on child and older-adult data, confirming a significant age-related do- main shift. Joint multi-age training impro...

work page

[6] [6]

These tools were not used to generate or interpret experimental results

Generative AI Use Disclosure We used Generative AI tools in this study to assist with lan- guage polishing, manuscript editing, and limited code develop- ment support for data analyses and visualizations. These tools were not used to generate or interpret experimental results. All authors are fully aware of the extent of generative AI use, take full respo...

work page 2026

[7] [7]

A review of speaker diarization: Recent advances with deep learning,

T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan, “A review of speaker diarization: Recent advances with deep learning,”Computer Speech & Language, vol. 72, p. 101317, 2022

work page 2022

[8] [8]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

work page 2023

[9] [9]

Wavlm: Large-scale self- supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “Wavlm: Large-scale self- supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022

[10] [10]

Acoustics of children’s speech: Developmental changes of temporal and spectral parame- ters,

S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children’s speech: Developmental changes of temporal and spectral parame- ters,”The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999

work page 1999

[11] [11]

Robust recognition of chil- dren’s speech,

A. Potamianos and S. Narayanan, “Robust recognition of chil- dren’s speech,”IEEE Transactions on speech and audio process- ing, vol. 11, no. 6, pp. 603–616, 2004

work page 2004

[12] [12]

Temporal character- istics of the speech of normal elderly adults,

B. L. Smith, J. Wasowicz, and J. Preston, “Temporal character- istics of the speech of normal elderly adults,”Journal of Speech, Language, and Hearing Research, vol. 30, no. 4, pp. 522–529, 1987

work page 1987

[13] [13]

Speech-and language-based clas- sification of alzheimer’s disease: a systematic review,

I. Vigo, L. Coelho, and S. Reis, “Speech-and language-based clas- sification of alzheimer’s disease: a systematic review,”Bioengi- neering, vol. 9, no. 1, p. 27, 2022

work page 2022

[14] [14]

End-to-end neural speaker diarization with self- attention,

Y . Fujita, N. Kanda, S. Horiguchi, Y . Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self- attention,” in2019 IEEE Automatic Speech Recognition and Un- derstanding Workshop (ASRU). IEEE, 2019, pp. 296–303

work page 2019

[15] [15]

End-to-end neural speaker diarization with permutation-free objectives,

Y . Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watan- abe, “End-to-end neural speaker diarization with permutation-free objectives,”Interspeech, 2019

work page 2019

[16] K. Kinoshita, M. Delcroix, and N. Tawara, “Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7198–7202

[17] K. Kinoshita, M. Delcroix, and N. Tawara, “Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech,” Interspeech, 2021

[18] J. Han, F. Landini, J. Rohdin, A. Silnova, M. Diez, and L. Burget, “Leveraging self-supervised learning for speaker diarization,” in Proc. ICASSP, 2025

[19] J. Han, P. Pálka, M. Delcroix, F. Landini, J. Rohdin, J. Cernocký, and L. Burget, “Efficient and generalizable speaker diarization via structured pruning of self-supervised models,” arXiv preprint arXiv:2506.18623, 2025

[20] H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” in 24th Interspeech Conference (INTERSPEECH 2023). ISCA, 2023, pp. 1983–1987

[21] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” Interspeech, 2020

[22] S. Wang, Z. Chen, B. Han, H. Wang, C. Liang, B. Zhang, X. Xiang, W. Ding, J. Rohdin, A. Silnova et al., “Advancing speaker embedding learning: WeSpeaker toolkit for research and production,” Speech Communication, vol. 162, p. 103104, 2024

[23] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” Interspeech, 2018

[24] T. Feng, J. Lee, A. Xu, Y. Lee, T. Lertpetchpun, X. Shi, H. Wang, T. Thebaud, L. Moro-Velazquez, D. Byrd et al., “Vox-Profile: A speech foundation model benchmark for characterizing diverse speaker and speech traits,” arXiv preprint arXiv:2505.14648, 2025

[25] T. Feng, K. Huang, A. Xu, X. Shi, T. Lertpetchpun, J. Lee, Y. Lee, D. Byrd, and S. Narayanan, “Voxlect: A speech foundation model benchmark for modeling dialects and regional languages around the globe,” KDD, 2025

[26] A. Xu, K. Huang, T. Feng, L. Shen, H. Tager-Flusberg, and S. Narayanan, “Exploring speech foundation models for speaker diarization in child-adult dyadic interactions,” Interspeech, 2024

[27] A. Xu, T. Feng, H. Tager-Flusberg, C. Lord, and S. Narayanan, “Data efficient child-adult speaker diarization with simulated conversations,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

[28] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

[29] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., “SUPERB: Speech processing universal performance benchmark,” Interspeech, 2021

[30] W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The AMI meeting corpus,” in Proc. International Conference on Methods and Techniques in Behavioral Research, 2005, pp. 1–4

[31] Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, L. Xie, J. Wu, H. Bu et al., “AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario,” Interspeech, 2021

[32] F. Yu, S. Zhang, Y. Fu, L. Xie, S. Zheng, Z. Du, W. Huang, P. Guo, Z. Yan, B. Ma et al., “M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6167–6171

[33] M. Kalanadhabhatta, M. M. Rastikerdar, T. Rahman, A. S. Grabell, and D. Ganesan, “Playlogue: Dataset and benchmarks for analyzing adult-child conversations during play,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, pp. 1–34, 2024

[34] B. MacWhinney, “The TalkBank project,” in Creating and Digitizing Language Corpora: Volume 1: Synchronic Databases. Springer, 2007, pp. 163–180

[35] O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook et al., “NeMo: a toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019

[36] A. Xu, T. Feng, S. Bishop, C. Lord, and S. Narayanan, “End-to-end joint ASR and speaker role diarization with child-adult interactions,” arXiv preprint arXiv:2601.17640, 2026

[37] Y. Chen, H. Wang, S. Wang, J. Chen, J. He, J. Zhou, X. Yang, Y. Wang, Y. Lin, and Y. Qin, “SeniorTalk: A Chinese conversation dataset with rich annotations for super-aged seniors,” NeurIPS, 2025

[38] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022

[39] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “The third DIHARD diarization challenge,” arXiv preprint arXiv:2012.01477, 2020