{"paper":{"title":"Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Synthetic conversational data approaches real-data baselines and mixing both yields substantial gains for multi-talker ASR and speaker diarization.","cross_cats":[],"primary_cat":"eess.AS","authors_text":"Alexander Polok, Ivan Medennikov, Jan \\v{C}ernock\\'y, Luk\\'a\\v{s} Burget, Samuele Cornell, Shinji Watanabe","submitted_at":"2026-05-14T21:53:10Z","abstract_excerpt":"Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. 
Our findings reve"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The specific simulation choices and acoustic augmentations in FastMSS produce mixtures whose statistical properties are close enough to real conversational recordings that performance trends observed on synthetic data will transfer to real-world use.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Task-dependent simulation strategies for synthetic conversational data allow synthetic-only training to approach real-data baselines for multi-talker ASR and diarization, with mixing yielding further gains.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Synthetic conversational data approaches real-data baselines and mixing both yields substantial gains for multi-talker ASR and speaker diarization.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5b7076892ffb0619a2fc37811af3b79dc42ab7b26858601bb3131279d6033b78"},"source":{"id":"2605.15442","kind":"arxiv","version":1},"verdict":{"id":"f26912f9-77f4-4c74-ad35-547bcb733a6a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T14:34:17.220256Z","strongest_claim":"synthetic-only training approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training across both tasks.","one_line_summary":"Task-dependent simulation strategies for synthetic conversational data allow synthetic-only training to 
approach real-data baselines for multi-talker ASR and diarization, with mixing yielding further gains.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The specific simulation choices and acoustic augmentations in FastMSS produce mixtures whose statistical properties are close enough to real conversational recordings that performance trends observed on synthetic data will transfer to real-world use.","pith_extraction_headline":"Synthetic conversational data approaches real-data baselines and mixing both yields substantial gains for multi-talker ASR and speaker diarization."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.15442/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"cited_work_retraction","ran_at":"2026-05-19T15:52:50.872433Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T15:50:25.858383Z","status":"completed","version":"0.1.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T15:01:17.645703Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T14:50:22.662041Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T14:21:54.119096Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.685209Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"45d57c6bb9ca54c78a6e259ee939b54824e398079cc067e78da67aeac761352f"},"references":{"count":72,"sample":[{"doi":"","year":null,"title":"Introduction Multi-talker conversational speech processing is undergoing a rapid transformation, driven largely by the shift from highly specialized pipelines to less data-hungry methods built on 
pret","work_id":"95f54b31-2f03-4f49-81ba-f9defb72df61","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization","work_id":"a15aa124-f60d-43a3-ac42-ada85ef0605c","ref_index":2,"cited_arxiv_id":"2605.15442","is_internal_anchor":true},{"doi":"","year":null,"title":"Multi-Speaker Conversation Simulation To enable controlled and fast experimentation along the axes described above, we developed FastMSS, an open-source multi-speaker conversation simulator focused o","work_id":"3da26791-7009-4d01-973d-7312314a6187","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Experimental Setup 4.1. Datasets As source domains for synthetic generation, we use: LibriSpeech [49] (read speech, 960h), VoxPopuli [50] (semi-spontaneous parliamentary speech, 543h), otoSpeech [","work_id":"6c1b5292-0a78-488f-9e10-118512a1c23c","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"All datasets were re-aligned using the Montreal Forced Aligner [52] to ensure consistent word-level timestamps","work_id":"be27b4db-5d9d-4130-9c47-eb60766d505f","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":72,"snapshot_sha256":"8214eec0cc92a11831ed5c29e0773f0d572b36f94f0595235eb95540bbd94299","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"35725c42018b7474c82a41a6fcb6c5311a5c62fe8c66763af5b484fa7263ecb5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}