arxiv: 2601.22792 · v2 · submitted 2026-01-30 · 📡 eess.AS · cs.CL· cs.SD

Recognition: 1 theorem link

· Lean Theorem

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel , Yosuke Fukumoto , Chikara Maeda , Chyi-Jiunn Lin , Shinji Watanabe

Authors on Pith no claims yet

Pith reviewed 2026-05-16 09:47 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.SD

keywords multi-speaker ASRcontextual biasingspeaker embeddingpersonalized ASRend-to-end modelingoverlapping speechLibriSpeechMixCSJMix

0 comments

The pith

CALM integrates speaker embeddings for target extraction with dynamic vocabulary biasing to halve biased error rates in overlapping multi-speaker ASR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CALM as an end-to-end framework that fuses acoustic target-speaker extraction with linguistic contextual biasing for personalized multi-speaker speech recognition. It shows that this joint approach substantially lowers error rates on simulated two-speaker mixtures in both English and Japanese, where separate use of acoustic or linguistic cues alone proves insufficient. The work matters because overlapping speech in meetings or conversations often requires both who is speaking and what has been said before to resolve ambiguities accurately. Results on LibriSpeechMix and CSJMix demonstrate concrete gains, with further checks on the AMI corpus supporting broader applicability to standardized mixtures.

Core claim

CALM implements speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing within a single end-to-end ASR framework. On two-speaker mixtures this lowers biased word error rate from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate from 16.6 to 8.4 on CSJMix2 (eval3), with additional validation on the AMI IHM-mix condition.

What carries the argument

Speaker embedding-driven target-speaker extraction combined with dynamic vocabulary-based contextual biasing inside an end-to-end multi-speaker ASR model.

If this is right

Cuts B-WER from 12.7 to 4.7 on English LibriSpeech2Mix mixtures
Cuts B-CER from 16.6 to 8.4 on Japanese CSJMix2 mixtures
Demonstrates gains from joint acoustic-linguistic modeling across two languages
Maintains performance on the standardized AMI IHM-mix condition

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The extraction and biasing components may extend to three or more simultaneous speakers if interference remains manageable.
Performance on real recordings could degrade if acoustic variability such as reverberation or sudden noise exceeds the simulation statistics.
Additional linguistic signals like full dialogue history or personal user profiles could be incorporated to strengthen the biasing step further.

Load-bearing premise

Simulated two-speaker mixtures capture the acoustic and linguistic statistics of real overlapping conversations where turns, noise, and context vary more widely.

What would settle it

Measuring biased error rates on a corpus of naturally recorded multi-party conversations containing variable speaker turns and background noise, then checking whether the reported reductions from 12.7 to 4.7 and 16.6 to 8.4 still hold.

read the original abstract

We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker ASR that combines speaker embedding-driven target-speaker extraction with dynamic vocabulary-based contextual biasing. It reports concrete empirical gains on simulated two-speaker mixtures: B-WER reduced from 12.7 to 4.7 on LibriSpeech2Mix and B-CER from 16.6 to 8.4 on CSJMix2 (eval3), with additional validation on the AMI IHM-mix condition, claiming effectiveness of the joint modeling approach across English and Japanese.

Significance. If the central empirical claims hold after addressing evaluation details, the work provides evidence that joint acoustic-linguistic conditioning can yield substantial reductions in biased error rates for overlapping speech, which would be relevant for personalized multi-speaker ASR systems. The cross-lingual evaluation on LibriSpeechMix and CSJMix plus the inclusion of a real-meeting corpus (AMI) are positive aspects that strengthen the demonstration.

major comments (2)

[Evaluation] Evaluation section (results on LibriSpeech2Mix and CSJMix2): the load-bearing claim that joint modeling is effective for personalized overlapping conversations rests on the assumption that the simulated mixtures adequately represent real acoustic and linguistic variability (overlap density, turn-taking, prosody, topic drift); the reported B-WER/B-CER drops could shrink if the random mixing procedure produces easier conditioning signals than natural conversations, and the manuscript should include an explicit analysis or additional real-data experiments to test this.
[Methods] Methods section (speaker embedding integration and dynamic vocabulary construction): the abstract describes the end-to-end framework but lacks sufficient detail on how biasing vocabularies are built, how data splits avoid leakage, and the precise fusion of acoustic and linguistic cues; without these, it is impossible to confirm that the baseline comparisons are fair and that the gains are not influenced by post-hoc choices.

minor comments (2)

[Abstract] Abstract: explicitly define or cite the definitions of B-WER and B-CER, as these are central to the reported metrics.
[Results] Results on AMI: expand the discussion of the IHM-mix condition to directly compare overlap statistics and error rates with the simulated corpora for better context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments on evaluation assumptions and methods details below, and will revise the manuscript accordingly to improve clarity and strengthen the validation.

read point-by-point responses

Referee: [Evaluation] Evaluation section (results on LibriSpeech2Mix and CSJMix2): the load-bearing claim that joint modeling is effective for personalized overlapping conversations rests on the assumption that the simulated mixtures adequately represent real acoustic and linguistic variability (overlap density, turn-taking, prosody, topic drift); the reported B-WER/B-CER drops could shrink if the random mixing procedure produces easier conditioning signals than natural conversations, and the manuscript should include an explicit analysis or additional real-data experiments to test this.

Authors: We agree that simulated mixtures may not capture all nuances of natural conversations and that an explicit analysis would strengthen the claims. The current manuscript already reports results on the AMI IHM-mix condition using real meeting recordings with natural overlaps. In the revision we will add a dedicated analysis subsection that compares overlap density, turn-taking patterns, and error breakdowns between the simulated LibriSpeech2Mix/CSJMix sets and the AMI data, directly addressing whether the reported gains are sensitive to the mixing procedure. revision: yes
Referee: [Methods] Methods section (speaker embedding integration and dynamic vocabulary construction): the abstract describes the end-to-end framework but lacks sufficient detail on how biasing vocabularies are built, how data splits avoid leakage, and the precise fusion of acoustic and linguistic cues; without these, it is impossible to confirm that the baseline comparisons are fair and that the gains are not influenced by post-hoc choices.

Authors: We acknowledge that the current description is insufficient for full reproducibility. In the revised manuscript we will expand the Methods section with: (i) the exact procedure for constructing and updating the dynamic biasing vocabularies from linguistic context, (ii) the data splitting protocol used to prevent leakage, and (iii) the precise fusion architecture that combines speaker embeddings with linguistic context vectors. These additions will allow readers to verify the fairness of the reported baseline comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with direct dataset comparisons

full rationale

The paper introduces the CALM framework for joint acoustic-linguistic modeling in multi-speaker ASR and evaluates it via end-to-end training on simulated mixtures (LibriSpeechMix, CSJMix) plus AMI IHM-mix. Reported gains (B-WER drop from 12.7 to 4.7, B-CER from 16.6 to 8.4) are obtained by comparing the trained model against baselines on held-out test sets. No equations, derivations, uniqueness theorems, or first-principles predictions appear in the provided text; therefore no step reduces by construction to a fitted parameter, self-citation, or renamed input. The evaluation is externally falsifiable on the cited corpora and contains no load-bearing self-referential structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the framework implicitly relies on standard neural-network training assumptions and the existence of pre-trained speaker embeddings, none of which are enumerated.

pith-pipeline@v0.9.0 · 5497 in / 1208 out tokens · 28837 ms · 2026-05-16T09:47:23.362902+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing... FiLM-based modulation... weighted softmax... CTC self-conditioning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

[1]

INTRODUCTION Single-speaker automatic speech recognition (ASR) systems have achieved state-of-the-art (SOTA) performance across many speech- processing tasks [1–3]. However, in multi-speaker settings [4] with overlapping speech [5–7] and conversation-specific vocabu- lary [8–10], performance degrades substantially, limiting person- alization in real-world...

work page
[2]

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

CALM We consider a multi-speaker conversation scenario where the input mixture is represented asX= PC c=1 Y c ⊙S c +G, X∈R T , with clean sourcesS c ofCspeakers, activity masksY c ∈[0,1] T , and additive noiseG. In this work, we employWavLM-Large[1] as an upstream model to extract frame-level featuresXfe ∈R T fe×D. These are projected into the encoder spa...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Frame-level target- speaker activity posteriors are computed as: P vad =σ(W vad ˆH(L) +b vad),(11) withP vad ∈[0,1] T enc

to the final adapted encoder states ˆH(L). Frame-level target- speaker activity posteriors are computed as: P vad =σ(W vad ˆH(L) +b vad),(11) withP vad ∈[0,1] T enc . The V AD loss (Lvad) is then the binary cross- entropy between predictions and ground-truth activity labelsY vad: Lvad = BCE(P vad, Y vad).(12) The final training objective is a weighted mul...

work page
[4]

EXPERIMENTS The CALM framework is built on ESPnet [45], pairing a Conformer encoder with a Transformer decoder. The Conformer has 12 lay- ers with 4 heads and 1024 linear units (kernel size 31) and applies self-conditioned interCTC similar to [44] at layers 3, 6, and 9; the decoder is a 6-layer Transformer with 4 heads and 2048 units. For the input stack,...

work page 2048
[5]

However, unlike in simulated conditions, overall WER in- creases from 37.4 to 39.1 absolute points. Our error analysis indi- cates that this degradation is primarily driven by an increase in inser- tion errors, particularly for short utterances where speaker attribution is more challenging. This effect is most pronounced with smaller list (e.g.,N=100), in...

work page
[6]

CONCLUSION This paper introduced CALM, the first end-to-end framework that integrates target-speaker embeddings with dynamic vocabulary ex- pansion for personalization of multi-speaker ASR. By combining acoustic speaker conditioning with linguistic biasing in a unified architecture, CALM effectively addresses both overlap-induced acoustic errors and unsee...

work page
[7]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Pro- cessing,

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Pro- cessing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022
[8]

Robust Speech Recog- nition via Large-Scale Weak Supervision,

Alec Radford, Jong Wook Kim, Tao Xu, et al., “Robust Speech Recog- nition via Large-Scale Weak Supervision,” inProc. ICML, 2023

work page 2023
[9]

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning,

Yifan Peng, Muhammad Shakeel, et al., “OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning,” inProc. Interspeech, 2025, pp. 2225–2229

work page 2025
[10]

Unifying Diarization, Separation, and ASR with Multi- Speaker Encoder,

Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, and Shinji Watanabe, “Unifying Diarization, Separation, and ASR with Multi- Speaker Encoder,” inProc. ASRU, 2025

work page 2025
[11]

CHiME-6 Challenge: Tackling Multispeaker Speech Recog- nition for Unsegmented Recordings,

Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, et al., “CHiME-6 Challenge: Tackling Multispeaker Speech Recog- nition for Unsegmented Recordings,” inProc. CHiME, 2020, pp. 1–7

work page 2020
[12]

M2Met: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge,

Fan Yu, Shiliang Zhang, Yihui Fu, et al., “M2Met: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge,” inProc. ICASSP, 2022, pp. 6167–6171

work page 2022
[13]

NOTSOFAR-1 Chal- lenge: New Datasets, Baseline, and Tasks for Distant Meeting Tran- scription,

Alon Vinnikov, Amir Ivry, Aviv Hurvitz, et al., “NOTSOFAR-1 Chal- lenge: New Datasets, Baseline, and Tasks for Distant Meeting Tran- scription,” inProc. Interspeech, 2024, pp. 5003–5007

work page 2024
[14]

Contextualized Stream- ing End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion,

Duc Le, Mahaveer Jain, Gil Keren, et al., “Contextualized Stream- ing End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion,” inProc. Interspeech, 2021, pp. 1772–1776

work page 2021
[15]

Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search,

Yui Sudo et al., “Contextualized Automatic Speech Recognition With Attention-Based Bias Phrase Boosted Beam Search,” inProc. ICASSP, 2024, pp. 10896–10900

work page 2024
[16]

Contextualized End-to-end Automatic Speech Recognition with In- termediate Biasing Loss,

Muhammad Shakeel, Yui Sudo, Yifan Peng, and Shinji Watanabe, “Contextualized End-to-end Automatic Speech Recognition with In- termediate Biasing Loss,” inProc. Interspeech, 2024, pp. 3909–3913

work page 2024
[17]

Meeting Recogni- tion with Continuous Speech Separation and Transcription-Supported Diarization,

Thilo V on Neumann, Christoph Boeddeker, et al., “Meeting Recogni- tion with Continuous Speech Separation and Transcription-Supported Diarization,” inProc. ICASSP Workshops, 2024, pp. 775–779

work page 2024
[18]

NTT Multi- Speaker ASR System for the DASR Task of CHiME-8 Challenge,

Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, et al., “NTT Multi- Speaker ASR System for the DASR Task of CHiME-8 Challenge,” in Proc. CHiME, 2024, pp. 69–74

work page 2024
[19]

End-to-End SpeakerBeam for Single Channel Target Speech Recognition,

Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, et al., “End-to-End SpeakerBeam for Single Channel Target Speech Recognition,” inProc. Interspeech, 2019, pp. 451–455

work page 2019
[20]

Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Ob- jectives,

Aswin Shanmugam Subramanian et al., “Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Ob- jectives,” inProc. ICASSP, 2020, pp. 7299–7303

work page 2020
[21]

Streaming Target-Speaker ASR with Neural Transducer,

Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, et al., “Streaming Target-Speaker ASR with Neural Transducer,” inProc. Interspeech, 2022, pp. 2673–2677

work page 2022
[22]

Adapt- ing Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings,

Zili Huang, Desh Raj, Paola Garc ´ıa, and Sanjeev Khudanpur, “Adapt- ing Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings,” inProc. ICASSP, 2023, pp. 1–5

work page 2023
[23]

Conformer-Based Target-Speaker Automatic Speech Recogni- tion For Single-Channel Audio,

Yang Zhang, Krishna C. Puvvada, Vitaly Lavrukhin, and Boris Gins- burg, “Conformer-Based Target-Speaker Automatic Speech Recogni- tion For Single-Channel Audio,” inProc. ICASSP, 2023, pp. 1–5

work page 2023
[24]

End-to- End Joint Target and Non-Target Speakers ASR,

Ryo Masumura, Naoki Makishima, Taiga Yamane, et al., “End-to- End Joint Target and Non-Target Speakers ASR,” inProc. Interspeech, 2023, pp. 2903–2907

work page 2023
[25]

End-to-End Speaker-Attributed ASR with Transformer,

Naoyuki Kanda et al., “End-to-End Speaker-Attributed ASR with Transformer,” inProc. Interspeech, 2021, pp. 4413–4417

work page 2021
[26]

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings,

Naoyuki Kanda, Jian Wu, Yu Wu, et al., “Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings,” inProc. Interspeech, 2022, pp. 521–525

work page 2022
[27]

Target-Speaker V oice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario,

Ivan Medennikov, Maxim Korenevsky, et al., “Target-Speaker V oice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario,” inProc. Interspeech, 2020, pp. 274–278

work page 2020
[28]

Online Neural Speaker Diarization With Target Speaker Tracking,

Weiqing Wang and Ming Li, “Online Neural Speaker Diarization With Target Speaker Tracking,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 32, pp. 5078–5091, 2024

work page 2024
[29]

Joint Target-Speaker ASR and Activity Detec- tion,

Chikara Maeda et al., “Joint Target-Speaker ASR and Activity Detec- tion,” inProc. Interspeech, 2025, pp. 1683–1687

work page 2025
[30]

DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,

Alexander Polok, Dominik Klement, Martin Kocour, et al., “DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition,”Computer speech & language, vol. 95, pp. 101841, 2026

work page 2026
[31]

Target Speaker ASR with Whisper,

Alexander Polok, Dominik Klement, Matthew Wiesner, et al., “Target Speaker ASR with Whisper,” inProc. ICASSP, 2025, pp. 1–5

work page 2025
[32]

SQ-Whisper: Speaker-Querying Based Whisper Model for Target-Speaker ASR,

Pengcheng Guo and Lei Xie, “SQ-Whisper: Speaker-Querying Based Whisper Model for Target-Speaker ASR,”IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 175–185, 2025

work page 2025
[33]

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System,

Lingwei Meng, Jiawen Kang, Yuejiao Wang, et al., “Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System,” inProc. Interspeech, 2024, pp. 4653–4657

work page 2024
[34]

Shallow-Fusion End-to-End Con- textual Biasing,

Ding Zhao, Tara N. Sainath, et al., “Shallow-Fusion End-to-End Con- textual Biasing,” inProc. Interspeech, 2019, pp. 1418–1422

work page 2019
[35]

Robust Acoustic And Semantic Contextual Biasing In Neural Trans- ducers For Speech Recognition,

Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, et al., “Robust Acoustic And Semantic Contextual Biasing In Neural Trans- ducers For Speech Recognition,” inProc. ICASSP, 2023, pp. 1–5

work page 2023
[36]

Contextual Adapters for Personalized Speech Recog- nition in Neural Transducers,

Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, et al., “Contextual Adapters for Personalized Speech Recog- nition in Neural Transducers,” inProc. ICASSP, 2022, pp. 8537–8541

work page 2022
[37]

Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network,

Kaixun Huang, Ao Zhang, Zhanheng Yang, et al., “Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network,” inProc. Interspeech, 2023, pp. 4933–4937

work page 2023
[38]

Improving ASR Contextual Biasing with Guided Attention,

Jiyang Tang et al., “Improving ASR Contextual Biasing with Guided Attention,” inProc. ICASSP, 2024, pp. 12096–12100

work page 2024
[39]

CopyNE: Better Contextual ASR by Copying Named Entities,

Shilin Zhou, Zhenghua Li, et al., “CopyNE: Better Contextual ASR by Copying Named Entities,” inProc. ACL, Aug. 2024, pp. 2675–2686

work page 2024
[40]

Contextualized Automatic Speech Recognition With Dynamic V ocabulary,

Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, and Shinji Watanabe, “Contextualized Automatic Speech Recognition With Dynamic V ocabulary,” inProc. SLT, 2024, pp. 78–85

work page 2024
[41]

CMT- LLM: Contextual Multi-Talker ASR Utilizing Large Language Mod- els,

Jiajun He, Naoki Sawada, Koichi Miyazaki, and Tomoki Toda, “CMT- LLM: Contextual Multi-Talker ASR Utilizing Large Language Mod- els,” inProc. Interspeech, 2025, pp. 2575–2579

work page 2025
[42]

CTC-Assisted LLM- Based Contextual ASR,

Guanrou Yang, Ziyang Ma, Zhifu Gao, et al., “CTC-Assisted LLM- Based Contextual ASR,” inProc. SLT, 2024, pp. 126–131

work page 2024
[43]

Serialized out- put training for end-to-end overlapped speech recognition,

Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, et al., “Serialized out- put training for end-to-end overlapped speech recognition,” inProc. Interspeech, 2020, pp. 2797–2801

work page 2020
[44]

The AMI Meeting Corpus: A Pre-announcement,

Jean Carletta, Simone Ashby, Sebastien Bourban, et al., “The AMI Meeting Corpus: A Pre-announcement,” inMachine Learning for Mul- timodal Interaction, 2006, pp. 28–39

work page 2006
[45]

Corpus of spontaneous Japanese: Its design and evaluation,

Kikuo Maekawa, “Corpus of spontaneous Japanese: Its design and evaluation,” inISCA & IEEE Workshop on Spontaneous Speech Pro- cessing and Recognition, 2003

work page 2003
[46]

ECAPA-TDNN: Emphasized channel at- tention, propagation and aggregation in tdnn based speaker verifica- tion,

Brecht Desplanques et al., “ECAPA-TDNN: Emphasized channel at- tention, propagation and aggregation in tdnn based speaker verifica- tion,” inProc. Interspeech, 2020, pp. 3830–3834

work page 2020
[47]

Conformer: Convolution-augmented Transformer for Speech Recognition,

Anmol Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” inProc. Interspeech, 2020, pp. 5036–5040

work page 2020
[48]

FiLM: Visual Reasoning with a General Condition- ing Layer,

Ethan Perez et al., “FiLM: Visual Reasoning with a General Condition- ing Layer,”Proc. AAAI, vol. 32, no. 1, Apr. 2018

work page 2018
[49]

Attention is All you Need,

Ashish Vaswani, Noam Shazeer, Niki Parmar, et al., “Attention is All you Need,” inProc. NeurIPS, 2017, vol. 30

work page 2017
[50]

DYNAC: Dynamic V ocabulary- based Non-Autoregressive Contextualization for Speech Recognition,

Yui Sudo, Yosuke Fukumoto, et al., “DYNAC: Dynamic V ocabulary- based Non-Autoregressive Contextualization for Speech Recognition,” inInterspeech 2025, 2025, pp. 2215–2219

work page 2025
[51]

ESPnet: End-to-End Speech Processing Toolkit,

Shinji Watanabe et al., “ESPnet: End-to-End Speech Processing Toolkit,” inProc. Interspeech, 2018, pp. 2207–2211

work page 2018
[52]

Librispeech: An ASR corpus based on public domain audio books,

Vassil Panayotov et al., “Librispeech: An ASR corpus based on public domain audio books,” inProc. ICASSP, 2015, pp. 5206–5210

work page 2015
[53]

WHAM!: Extending speech separation to noisy environments,

Gordon Wichern et al., “WHAM!: Extending speech separation to noisy environments,” inProc. Interspeech, 2019, pp. 1368–1372. 5

work page 2019