PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

Aditya Joshi; Dipankar Srirag; Sicheng Jin

arxiv: 2605.17860 · v1 · pith:CRZ3TTR5new · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 11:26 UTC · model grok-4.3

keywords: multi-accent speech · automatic speech recognition · dataset · spontaneous speech · accented English · NLP domain · word error rate

The pith

Fine-tuning on a new multi-accent dataset of NLP paper discussions significantly reduces word error rates for state-of-the-art speech recognizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PAREDA as a dataset of spontaneous discussions about NLP research papers, recorded from speakers with Australian, Indian-English, and Chinese-English accents. It includes both monologue summaries and question-and-answer exchanges rich in technical terms and natural conversation. State-of-the-art ASR models show higher error rates when tested on this material without prior exposure. Fine-tuning those same models on PAREDA lowers the word error rate, indicating that the recordings contain speech patterns absent from standard training sets. The work positions the dataset as a resource for improving recognition of accented, fast, and domain-specific speech in real applications.

Core claim

The authors establish that PAREDA captures linguistic characteristics often missing from existing corpora because fine-tuning state-of-the-art ASR models on the dataset produces a significant reduction in word error rate, while zero-shot evaluation confirms the material remains challenging due to accent mixing, spontaneous delivery, and technical content.
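For reference, WER is conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal Python sketch of that standard definition. It assumes lowercasing and whitespace tokenization; the paper's actual text normalization is not described on this page, and `word_error_rate` is an illustrative name rather than anything taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: lowercasing and whitespace tokenization stand in for whatever
    normalization a real evaluation would apply.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.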

What carries the argument

PAREDA, the multi-accent corpus of spontaneous monologues and question-and-answer sessions on NLP papers.

Load-bearing premise

The chosen accents, the spontaneous discussion format, and the technical NLP domain together represent the key real-world variability that degrades ASR, and the evaluated SOTA models are representative enough to support general conclusions about accent robustness.

What would settle it

A test in which fine-tuning on PAREDA fails to lower word error rate on a separate collection of similar accented spontaneous technical discussions would undermine the claim that the dataset supplies broadly missing linguistic characteristics.

Figures

Figures reproduced from arXiv: 2605.17860 by Aditya Joshi, Dipankar Srirag, Sicheng Jin.

**Figure 1.** Methodology for dataset collection.

Adjacent source text (2.1. Speakers and Prompts): We conduct elicitation with three participants, one for each locale. The three locales covered in this dataset are: Australian (en-AU), Indian (en-IN), Northern Chinese (en-ZH). We did not collect American (en-US) samples as there is already an excessive amount of en-US speech samples available in other datasets, and the models we use have alr…

**Figure 2.** Per-Accent Tuning Results.

**Figure 3.** Per-Accent Relative Results.

Adjacent source text: … samples, while cross-accent testing (e.g., tuning on Indian/Australian and testing on Chinese) resulted in significantly higher errors. Conversely, per-accent tuning benefited larger models, particularly Medium, where WER decreased across all accents. The Large model showed a contradictory pattern for Australian samples, where tuning on its own accent caused the highest WER in…

Original abstract

While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented, spontaneous, and domain-specific speech. In particular, we introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset consisting of discussions on academic Natural Language Processing (NLP) papers between speakers with Australian, Indian-English, and Chinese English accents. Each session elicits a spontaneous monologue (a summary of a paper's abstract) and a non-monologue (a question-and-answer session between participants), resulting in a corpus rich with technical jargon and conversational phenomena. We evaluate the performance of SOTA ASR models on PAREDA, analysing the impact of accent mixing and increased speech rate. Our results show that, in the zero-shot setting, models perform worse, confirming the dataset's challenging nature. However, fine-tuning on PAREDA significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora. PAREDA serves as a valuable new resource for building and evaluating more robust and inclusive ASR systems for specialised, real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PAREDA gives a narrow but real new dataset for accented spontaneous technical speech; the fine-tuning gains look plausible but need controls to show they are not just generic adaptation.

The main takeaway is that this paper releases PAREDA, a collection of spontaneous monologues and Q&A sessions on NLP papers recorded from speakers with Australian, Indian-English, and Chinese accents. That combination of accents, technical domain, and conversational format is not covered by the datasets they cite, so the resource itself is the primary contribution. They run SOTA ASR models, note worse zero-shot results, and report that fine-tuning on PAREDA lowers WER. For a dataset paper that is a reasonable way to show the data is usable and fills a gap in real-world variability.

Referee Report

1 major / 2 minor

Summary. The paper introduces PAREDA, a multi-accent speech dataset of spontaneous NLP paper discussions involving speakers with Australian, Indian-English, and Chinese-English accents. Sessions include monologues (paper abstract summaries) and non-monologues (Q&A), yielding technical jargon and conversational phenomena. SOTA ASR models are evaluated in zero-shot and fine-tuned settings; the central claim is that zero-shot performance degrades while fine-tuning on PAREDA significantly reduces WER, showing that the dataset supplies linguistic characteristics absent from prior corpora.

Significance. If supported by targeted controls, PAREDA would be a useful new resource for ASR robustness research on accented, spontaneous, and domain-specific technical speech. The dataset creation itself is a clear positive contribution. However, the significance is tempered because the reported WER gains have not been isolated from general in-domain adaptation effects.

major comments (1)

[Evaluation section] Fine-tuning experiments: the claim that fine-tuning 'significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora' is not supported by any ablation against control corpora of matched size and domain (e.g., accent subsets of Common Voice or other spontaneous technical speech). Without this contrast, the observed reduction could be explained by generic in-domain adaptation rather than by the multi-accent spontaneous NLP format highlighted as novel.

minor comments (2)

[Abstract and results] Quantitative WER values, error bars, exact model names, data-split details, and exclusion rules are not reported, making it impossible to assess the magnitude or reliability of the claimed improvements.
[Dataset description] Total hours, speaker counts per accent, speech-rate statistics, and the exact recording protocol should be stated explicitly to support reproducibility and the claims about accent mixing.
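One common way to attach error bars to a corpus-level WER is a percentile bootstrap over utterances. The sketch below shows how such intervals could be produced; it is an assumption about a reasonable procedure, not something the paper reports. The function name `bootstrap_wer_ci` and its input format (per-utterance pairs of edit errors and reference word counts) are hypothetical.

```python
import random


def bootstrap_wer_ci(utterances, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for corpus-level WER.

    utterances: list of (edit_errors, reference_word_count) pairs, one per
    utterance, with every reference_word_count > 0 (hypothetical format).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not utterances:
        raise ValueError("need at least one utterance")
    if any(n_words <= 0 for _, n_words in utterances):
        raise ValueError("reference word counts must be positive")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible

    def corpus_wer(sample):
        # Corpus WER pools errors and words instead of averaging per-utterance WER.
        total_errors = sum(errors for errors, _ in sample)
        total_words = sum(n_words for _, n_words in sample)
        return total_errors / total_words

    point = corpus_wer(utterances)
    # Resample utterances with replacement and recompute WER each time.
    stats = sorted(
        corpus_wer(rng.choices(utterances, k=len(utterances)))
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

With only a handful of speakers per accent, resampling utterances understates speaker-level variance, so resampling whole speakers or sessions may be the more honest unit where the data allows it.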

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and outline planned revisions.

Referee: [Evaluation section] Fine-tuning experiments: the claim that fine-tuning 'significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora' is not supported by any ablation against control corpora of matched size and domain (e.g., accent subsets of Common Voice or other spontaneous technical speech). Without this contrast, the observed reduction could be explained by generic in-domain adaptation rather than by the multi-accent spontaneous NLP format highlighted as novel.

Authors: We agree that the current experiments do not include direct ablations against control corpora of matched size and domain. The observed WER reductions after fine-tuning on PAREDA could indeed be partly attributable to general in-domain adaptation effects rather than the specific multi-accent spontaneous NLP discussion format. The zero-shot degradation on PAREDA does indicate challenges beyond standard benchmarks, but this alone does not fully isolate the contribution of our dataset's novel characteristics. To strengthen the claim, we will add control experiments in the revised manuscript. Specifically, we will fine-tune the same ASR models on accent-matched or domain-similar subsets from Common Voice and other spontaneous technical speech corpora of comparable size, then compare relative WER improvements. We will update the evaluation section and discussion to reflect these results and acknowledge the limitation of the original analysis.

revision: yes
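The comparison of relative WER improvements proposed in this response can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and the numbers in the usage note are made-up placeholders rather than results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, tuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning.

    Positive values mean fine-tuning helped; negative values mean it hurt.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - tuned_wer) / baseline_wer


def gain_over_control(zero_shot_wer: float,
                      tuned_on_target_wer: float,
                      tuned_on_control_wer: float) -> float:
    """Relative reduction from the target corpus minus that from a
    size-matched control corpus, both measured on the same test set.

    A value near zero would suggest the improvement is generic adaptation;
    a clearly positive value would suggest the target corpus adds something
    the control does not.
    """
    target = relative_wer_reduction(zero_shot_wer, tuned_on_target_wer)
    control = relative_wer_reduction(zero_shot_wer, tuned_on_control_wer)
    return target - control
```

As a purely hypothetical example, a zero-shot WER of 0.30 that drops to 0.18 after tuning on the target corpus and to 0.24 after tuning on the control gives relative reductions of 40% and 20%, so `gain_over_control(0.30, 0.18, 0.24)` is about 0.20.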

Circularity Check

0 steps flagged

No circularity: empirical dataset evaluation with external metrics

The paper introduces a new multi-accent speech dataset and reports standard zero-shot and fine-tuned WER numbers on external SOTA ASR models. No mathematical derivation, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central claim rests on measured performance differences against public benchmarks rather than any quantity defined in terms of itself or prior author work. This is a standard dataset-plus-evaluation contribution whose results are falsifiable by replication on the released corpus.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Dataset papers rest on standard assumptions about speech variability and evaluation rather than new mathematical derivations; no free parameters or invented entities are introduced.

axioms (2)

[standard math] Word Error Rate is an appropriate metric for measuring ASR performance on accented spontaneous speech. Common practice in ASR research, invoked implicitly when reporting WER reductions.
[domain assumption] The selected accents and discussion style capture meaningful real-world variability missing from existing corpora. Central motivation stated in the abstract without further justification.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset... fine-tuning on PAREDA significantly reduces the Word Error Rate (WER)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Table 3: WER Comparison when fine-tuning Whisper with/without PAREDA"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
