Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

Katalin Mády; Máté Gedeon; Péter Mihajlik; Piroska Zsófia Barta

arXiv: 2605.31469 · v1 · pith:F3YEBCTNnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

Pith reviewed 2026-06-28 22:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD · eess.AS

keywords Hungarian ASR · conversational speech recognition · dialogue corpus · speaker overlap · SOT adaptation · Whisper model · FastConformer

The pith

BEA-Dialogue+ scales Hungarian dialogue ASR data to 200 hours by relaxing speaker-disjoint rules for non-primary roles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BEA-Dialogue+, an expansion of the original BEA-Dialogue corpus that increases usable transcribed natural conversations from 85 hours to 200 hours. It achieves this by relaxing the speaker-disjoint split for experimenters and dialogue partners while keeping primary speakers fully separated across train, dev, and eval. Evaluations of Whisper and FastConformer models show that the larger set is harder for models without fine-tuning, but Serialized Output Training adaptation produces consistent gains in WER, CER, cpWER, and cpCER. The result supplies a bigger yet still demanding benchmark for Hungarian conversational ASR training and testing.

Core claim

BEA-Dialogue+ demonstrates that increasing training data volume through controlled speaker overlap raises difficulty for unadapted models yet supports consistent performance gains via SOT-based fine-tuning on Whisper- and FastConformer-based systems, measured across standard error rates for dialogue transcription.

What carries the argument

The relaxed speaker-disjoint split that preserves complete separation of primary speakers while allowing overlap for experimenters and dialogue partners.
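The split rule described above can be checked mechanically. The following is a minimal sketch, not the authors' tooling: the record layout, the speaker IDs, and the helper names `speakers_by_split` and `cross_split_overlap` are all hypothetical. It assumes the rule means that no primary speaker ID occurs in more than one split, while experimenter and dialogue-partner IDs may.

```python
from collections import defaultdict

# Hypothetical record layout: each recording lists its split, its primary
# speaker, and any non-primary participants (experimenter, dialogue partner).
recordings = [
    {"split": "train", "primary": "spk001", "others": ["exp01", "partner01"]},
    {"split": "train", "primary": "spk002", "others": ["exp01", "partner02"]},
    {"split": "dev",   "primary": "spk003", "others": ["exp01", "partner01"]},
    {"split": "eval",  "primary": "spk004", "others": ["exp02", "partner02"]},
]

def speakers_by_split(records, role):
    """Map each split name to the set of speaker IDs seen in the given role."""
    out = defaultdict(set)
    for rec in records:
        ids = [rec["primary"]] if role == "primary" else rec["others"]
        out[rec["split"]].update(ids)
    return dict(out)

def cross_split_overlap(split_to_ids):
    """Return the speaker IDs that occur in more than one split."""
    seen = defaultdict(set)
    for split, ids in split_to_ids.items():
        for spk in ids:
            seen[spk].add(split)
    return {spk: sorted(splits) for spk, splits in seen.items() if len(splits) > 1}

# Under the assumed rule, primary speakers must not leak across splits ...
primary_leaks = cross_split_overlap(speakers_by_split(recordings, "primary"))
# ... while overlap among non-primary speakers is allowed and merely reported.
other_overlap = cross_split_overlap(speakers_by_split(recordings, "others"))

print("primary leaks:", primary_leaks)
print("non-primary overlap:", other_overlap)
```

In this toy data the primary-speaker check comes back empty, while the experimenter and partner IDs are reported as shared between splits, which is the kind of overlap the relaxed criterion tolerates.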

If this is right

Without adaptation the larger corpus produces higher error rates than the smaller one.
SOT fine-tuning delivers measurable reductions in all four error metrics on both corpus versions.
The expanded resource supports training of dialogue transcription systems at greater scale.
The design isolates the effect of data volume from primary-speaker overlap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar relaxed-split strategies could scale dialogue corpora in other languages with limited speaker-disjoint data.
The same size-versus-overlap trade-off could be tested by varying the degree of allowed overlap while tracking evaluation validity.
Models trained on BEA-Dialogue+ might generalize better to real-world multi-speaker settings that include non-primary voices.

Load-bearing premise

Relaxing speaker separation only for experimenters and dialogue partners still yields evaluation splits whose difficulty matches that of the strictly disjoint original corpus.

What would settle it

A direct comparison in which unadapted models achieve equal or lower error rates on the 200-hour set than on the 85-hour set, or in which SOT adaptation fails to reduce WER, CER, cpWER, or cpCER.

Figures

Figures reproduced from arXiv: 2605.31469 by Katalin Mády, Máté Gedeon, Péter Mihajlik, Piroska Zsófia Barta.

**Figure 1.** Distribution of the number of speaker changes per segment in BEA-Dialogue. (Histogram: x-axis "Number of Speaker Turns per Entry", 0 to 16; y-axis "Relative Frequency", 0.0 to 0.5; series: train, dev, eval.)

**Figure 2.** Distribution of the number of speaker changes per segment in BEA-Dialogue+.

From Section 4 (Experiments) of the paper: "For the BEA-Dialogue+ dataset, we trained the same models as those presented in the original BEA-Dialogue paper [3], and complemented them with our own in-house models. During training, we applied Serialized Output Training (SOT) [4], in which speaker changes are indicated by a <sc> (speaker change) token. These token…"
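To make the role of the `<sc>` token concrete, here is a small sketch of how a segment's turns might be serialized into a single SOT-style target string. It is an illustration under assumptions, not the authors' preprocessing: the `(start_time, speaker, text)` tuple layout, the example utterances, and the function names are hypothetical, and turns are simply ordered by start time.

```python
SC_TOKEN = "<sc>"  # speaker-change token named in the excerpt above

def serialize_turns(turns):
    """Join (start_time, speaker, text) turns into one SOT-style target.

    Turns are ordered by start time; a speaker-change token is inserted
    whenever the speaker differs from that of the previous turn.
    """
    pieces = []
    prev_speaker = None
    for _, speaker, text in sorted(turns, key=lambda t: t[0]):
        if prev_speaker is not None and speaker != prev_speaker:
            pieces.append(SC_TOKEN)
        pieces.append(text)
        prev_speaker = speaker
    return " ".join(pieces)

def count_speaker_changes(target):
    """Count speaker-change tokens in a serialized target string."""
    return target.split().count(SC_TOKEN)

# Hypothetical segment with two speakers, A and B.
turns = [
    (0.0, "A", "jó napot"),
    (1.2, "B", "jó napot kívánok"),
    (2.5, "B", "hogy van"),
    (3.9, "A", "köszönöm jól"),
]

target = serialize_turns(turns)
print(target)
print(count_speaker_changes(target))
```

Counting the tokens per serialized segment gives the quantity whose distribution the two figures plot, assuming a "speaker change" there corresponds to one `<sc>` token.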

Original abstract

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BEA-Dialogue+ adds 115 hours of Hungarian dialogue data via a relaxed split and shows SOT helps, but the controlled data-size comparison may not hold if the new eval set differs in difficulty.

The paper's core move is releasing BEA-Dialogue+, which relaxes speaker-disjoint rules only for experimenters and dialogue partners while keeping primary speakers separate. This yields 200 hours instead of 85 and lets them compare models on the strict and relaxed versions. They test Whisper and FastConformer baselines plus SOT fine-tuning and report that the larger set is harder without adaptation but SOT improves WER, CER, cpWER, and cpCER.

What is actually new is the relaxed split definition and the direct head-to-head evaluation on both corpus versions. For Hungarian conversational ASR, where public dialogue data is scarce, the extra transcribed hours are a practical addition. The goal of isolating data volume from speaker overlap is reasonable for this task.

The work is straightforward and fills a narrow but real gap. Releasing the corpus and running the same models on both versions gives readers something concrete to use.

The soft spot is eval-set comparability: the claim of a controlled study assumes that the new eval split is essentially as difficult as the original. Relaxing overlap, even only for non-primary speakers, can still shift dialogue dynamics or acoustics, so a rise in error rates cannot be cleanly attributed to data size. The abstract reports consistent SOT gains but gives no training details, exclusion rules, or significance tests, which makes the main empirical result hard to judge. If those sections are also thin in the full text, the comparison loses force.

This is for researchers who need more Hungarian dialogue data or benchmarks for conversational ASR in low-resource settings. A reader working on dialogue transcription or similar languages gets direct value from the resource. It is worth sending to peer review so referees can check the split construction and the strength of the reported gains.

Referee Report

2 major / 2 minor

Summary. The paper introduces BEA-Dialogue+, an expanded 200-hour Hungarian conversational ASR corpus obtained by relaxing the speaker-disjoint split criterion for experimenters and dialogue partners (while preserving separation of primary speakers) from the original 85-hour BEA-Dialogue corpus. It evaluates Whisper- and FastConformer-based models on both versions, with and without Serialized Output Training (SOT) fine-tuning, and claims that the larger corpus is more challenging without fine-tuning while SOT adaptation produces consistent gains in WER, CER, cpWER, and cpCER.

Significance. If the evaluation splits remain comparable in difficulty, the work supplies a substantially larger public benchmark for Hungarian dialogue ASR and demonstrates the practical value of SOT-based adaptation for multi-speaker transcription. The corpus release itself constitutes a concrete resource contribution for low-resource conversational ASR.

major comments (2)

[Abstract and §3] Abstract and §3 (Corpus Construction): the central claim that the relaxed split 'enables a controlled study of the trade-off between additional training data and speaker overlap' rests on the unverified assumption that the new eval split has essentially the same inherent difficulty as the original 85-hour eval split. No quantitative comparison of turn-taking statistics, dialogue length distributions, or acoustic conditions between the two eval sets is reported, so the observed WER increase cannot be unambiguously attributed to data size versus overlap.
[Results] Results section (and abstract claim of 'consistent improvements'): the reported gains from SOT adaptation are presented without error bars, statistical significance tests, or details on training procedures, hyper-parameters, data exclusion rules, or number of runs. This prevents verification of whether the improvements are robust or merely within-run variation.

minor comments (2)

[Tables] Table captions and metric definitions (cpWER, cpCER) should explicitly state whether they are computed at the dialogue or utterance level and how speaker attribution is handled.
[§3] The paper would benefit from an explicit statement of the exact hours retained after each filtering step when constructing the 200-hour training set.
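For readers unfamiliar with the metric raised in the first minor comment, the sketch below shows the commonly used definition of cpWER (concatenated minimum-permutation WER): each speaker's utterances are concatenated, every assignment of hypothesis speakers to reference speakers is scored, and the lowest total error is kept. Whether the paper computes it exactly this way, and at which level, is precisely what the comment asks the authors to state; the function names and toy transcripts here are hypothetical.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER over per-speaker utterance lists.

    Assumes the reference contains at least one word.
    """
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    # Try every speaker assignment and keep the one with the fewest errors.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_words

# Toy example: hypothesis speaker labels are arbitrary and swapped.
ref = {"A": ["jó napot", "köszönöm jól"], "B": ["jó napot kívánok"]}
hyp = {"s1": ["jó napot kívánok"], "s2": ["jó napot", "köszönöm szépen"]}
score = cp_wer(ref, hyp)
print(round(score, 4))
```

With the best assignment (A to s2, B to s1) the only error is one substitution over seven reference words. A character-level variant (cpCER) would follow the same scheme with characters in place of words. The permutation search is factorial in the number of speakers, which is acceptable for two- or three-party dialogues.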

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify the presentation of our contributions. We respond to each major comment below and will incorporate revisions to address the concerns raised.

Referee: [Abstract and §3] Abstract and §3 (Corpus Construction): the central claim that the relaxed split 'enables a controlled study of the trade-off between additional training data and speaker overlap' rests on the unverified assumption that the new eval split has essentially the same inherent difficulty as the original 85-hour eval split. No quantitative comparison of turn-taking statistics, dialogue length distributions, or acoustic conditions between the two eval sets is reported, so the observed WER increase cannot be unambiguously attributed to data size versus overlap.

Authors: The controlled nature of the study derives from preserving complete separation of all primary speakers while relaxing the split only for experimenters and dialogue partners; this isolates the effect of increased training data volume under a specific, limited form of overlap. Nevertheless, we agree that explicit quantitative comparisons between the two evaluation sets would strengthen the interpretation. In the revised manuscript we will add a table (or subsection) in §3 reporting turn-taking statistics, dialogue length distributions, and any available acoustic metadata for both the original and expanded evaluation splits. This will allow readers to assess potential differences in inherent difficulty.

revision: yes

Referee: [Results] Results section (and abstract claim of 'consistent improvements'): the reported gains from SOT adaptation are presented without error bars, statistical significance tests, or details on training procedures, hyper-parameters, data exclusion rules, or number of runs. This prevents verification of whether the improvements are robust or merely within-run variation.

Authors: We accept that the current Results section does not provide sufficient information to evaluate robustness. In the revision we will (i) report means and standard deviations across multiple independent runs with different random seeds, (ii) include statistical significance tests comparing SOT versus baseline performance, and (iii) add explicit details on training hyperparameters, data exclusion rules, and the number of runs. These additions will appear in the main Results section with supporting material placed in an appendix if needed.

revision: yes
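One way the significance testing mentioned in response (ii) could be carried out is a paired bootstrap over evaluation utterances. The sketch below is an assumption about a suitable test, not something the paper or rebuttal specifies; the per-utterance error counts and reference lengths are made-up toy values, and `paired_bootstrap` is a hypothetical helper.

```python
import random

def paired_bootstrap(errors_a, errors_b, ref_lengths, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which system B has lower WER than A.

    errors_a / errors_b: per-utterance word error counts for the two systems
    on the same utterances; ref_lengths: reference word counts (all > 0).
    """
    rng = random.Random(seed)
    n = len(ref_lengths)
    wins = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        wins += wer_b < wer_a
    return wins / n_resamples

# Toy data: baseline (A) versus SOT-adapted (B) errors on five utterances.
baseline_errors = [3, 2, 4, 1, 5]
sot_errors = [2, 1, 3, 0, 4]
lengths = [10, 8, 12, 6, 15]

p_better = paired_bootstrap(baseline_errors, sot_errors, lengths)
print(p_better)
```

A value close to 1 would indicate that the adapted system's advantage holds across resamples of the evaluation set. This complements, rather than replaces, the multi-seed means and standard deviations promised in response (i), since it captures test-set sampling variation and not training variance.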

Circularity Check

0 steps flagged

No circularity: purely empirical corpus release and model evaluation

The paper introduces an expanded dialogue corpus by relaxing speaker-disjoint rules for non-primary speakers and reports empirical WER/CER results on Whisper and FastConformer models with and without SOT fine-tuning. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the provided text or abstract. The central claims are direct experimental observations (the larger corpus is more challenging without fine-tuning; SOT yields improvements) rather than derived results that reduce to their own inputs by construction. The skeptic's concern about eval-split comparability is a methodological question about experimental controls, not a circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical resource paper; the central claim rests on the existence and quality of the new corpus split and on standard ASR evaluation metrics. No free parameters, mathematical axioms, or invented entities are introduced.

Reference graph

Works this paper leans on

13 extracted references · 2 canonical work pages

[1] Ebru Arısoy, Haşim Sak, and Murat Saraçlar. Language modeling for automatic Turkish broadcast news transcription. In Interspeech 2007, pages 2381–2384, 2007. https://doi.org/10.21437/Interspeech.2007-273

[2] Gergely Dobsinszki, Péter Mihajlik, Máté Soma Kádár, Tibor Fegyó, and Katalin Mády. Best data is more supervised data – even for Hungarian ASR. In Alexey Karpov and Gábor Gosztolya, editors, Speech and Computer, pages 60–69, Cham, 2026. Springer Nature Switzerland. ISBN 978-3-032-07959-6

[3] Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Tekla Etelka Gráczi, Anna Kohári, and Katalin Mády. Toward conversational Hungarian speech recognition: Introducing the BEA-Large and BEA-Dialogue datasets, 2025. URL https://arxiv.org/abs/2511.13529

[4] Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, and Takuya Yoshioka. Serialized output training for end-to-end overlapped speech recognition. In Interspeech, 2020. URL https://api.semanticscholar.org/CorpusID:214714409

[5] Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9):100804, 2023. ISSN 2666-3899. https://doi.org/10.1016/j.patter.2023.100804. URL https://www.sciencedirect.com/science/article/pii/S2666389923001599

[6] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M. Cohen. NeMo: a toolkit for building AI applications using neural modules, 2019. URL https://arxiv.org/abs/1909.09577

[7] Katalin Mády, Gráczi Tekla Etelka, Anna Kohári, and Péter Mihajlik. Revised annotation conventions in Hungarian speech corpora. Beszédtudomány / Speech Science, 4(1):185–202, 2024. 18 p.

[8] Peter Mihajlik, Andras Balog, Tekla Etelka Graczi, Anna Kohari, Balázs Tarján, and Katalin Mady. BEA-base: A benchmark for ASR of spontaneous Hungarian. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and St...

[9] Peter Mihajlik, Katalin Mády, Anna Kohári, Fruzsina Sára Fruzsina, Gábor Kiss, Tekla Etelka Gráczi, and A. Seza Doğruöz. Is spoken Hungarian low-resource?: A quantitative survey of Hungarian speech data sets. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Procee...

[10] Tilda Neuberger, Dorottya Gyarmathy, Tekla Etelka Gráczi, Viktória Horváth, Mária Gósy, and András Beke. Development of a large spontaneous speech database of agglutinative Hungarian language. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, pages 424–431, Cham, 2014. Springer International Publishing. ISBN 9...

[11] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356

[12] Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, and Boris Ginsburg. Fast conformer with linearly scalable attention for efficient speech recognition, 2023. URL https://arxiv.org/abs/2305.05084

[13] Vincent Roger, Jérôme Farinas, and Julien Pinquier. Deep neural networks for automatic speech processing: A survey from large corpora to limited data, 2020. URL https://arxiv.org/abs/2003.04241
