Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

Grach Mkrtchian; Kirill Borodin; Maxim Maslov; Mikhail Gorodnichev; Nikita Vasiliev; Vasiliy Kudryavtsev

arXiv: 2507.13563 · v2 · submitted 2025-07-17 · 💻 cs.CL · cs.SD · eess.AS

Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Grach Mkrtchian

Pith reviewed 2026-05-19 03:39 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords Russian speech · prosody annotation · data-centric pipeline · speech denoising · text-to-speech · lexical stress · annotated corpus · automatic transcription

The pith

Balalaika pipeline enriches Russian audio with stress, punctuation and phonemes to improve denoising and TTS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Balalaika, an open-source pipeline that turns raw Russian audio into prosody-aware annotated data. It segments audio with semantic voice activity detection, transcribes via multi-ASR consensus, applies quality and speaker filters, and then restores punctuation, marks lexical stress, normalizes certain vowels, and adds IPA phonemes. The authors apply the pipeline across multiple sources to create a 5.1k-hour corpus. Models trained on this corpus for speech denoising and text-to-speech synthesis show consistent gains when training budgets are held equal. Ablation tests indicate that stress and punctuation supply complementary value and that stricter quality filtering further lifts synthesis performance.

Core claim

Balalaika combines semantic VAD segmentation, ROVER-style ASR ensembling, automatic quality and speaker-purity filtering, and text enrichment with punctuation restoration, lexical stress marking, vowel normalization, and IPA phonemes; the resulting 5.1k-hour multi-source Russian corpus produces measurable improvements in both denoising and TTS when used under matched training conditions, with ablations confirming the added value of stress and punctuation annotations.
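The consensus-transcription step can be illustrated with a simplified word-level vote. This is a minimal sketch, not the authors' implementation and not full ROVER: it uses the first hypothesis as a fixed backbone, aligns the others to it with Python's `difflib`, and ignores words that have no backbone slot. The function name `consensus_transcript` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def consensus_transcript(hypotheses):
    """Word-level majority vote across ASR hypotheses (simplified, not full ROVER)."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    token_lists = [h.split() for h in hypotheses]
    backbone = token_lists[0]
    # One vote counter per backbone word; the backbone votes for itself first,
    # so ties resolve in favour of the backbone word.
    votes = [Counter({word: 1}) for word in backbone]

    for other in token_lists[1:]:
        matcher = SequenceMatcher(a=backbone, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert":
                continue  # words with no backbone slot are ignored in this sketch
            for offset, slot in enumerate(range(i1, i2)):
                j = j1 + offset
                # "" is a deletion vote: this system heard nothing in the slot.
                word = other[j] if tag != "delete" and j < j2 else ""
                votes[slot][word] += 1

    winners = [slot.most_common(1)[0][0] for slot in votes]
    return " ".join(w for w in winners if w)


print(consensus_transcript(["the cat sat", "the cat sad", "the cat sat"]))
```

Full ROVER builds a word transition network by aligning hypotheses incrementally and can also weigh confidence scores; the fixed-backbone vote above only conveys the idea of per-slot agreement between systems.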

What carries the argument

The Balalaika pipeline, a sequence of semantic segmentation, consensus transcription, filtering, and automatic prosody enrichment steps that turns raw audio into a richly labeled corpus.

If this is right

Training denoising and TTS models on the 5.1k-hour Balalaika corpus yields consistent gains under equalized training budgets.
Stress and punctuation annotations provide complementary benefits beyond basic transcripts.
Stricter MOS-based quality filtering produces better synthesis quality than looser filtering.
The multi-source corpus supports improved Russian speech applications when used for model training.
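The effect of stricter quality filtering can be sketched as a threshold over per-segment scores. This is an illustrative sketch only: the `Segment` fields, the function name, and the threshold values are assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    path: str
    predicted_mos: float  # score from an automatic MOS predictor (hypothetical field)
    speaker_count: int    # speakers found by diarization (hypothetical field)


def filter_segments(segments, min_mos, max_speakers=1):
    """Keep segments that meet the quality floor and the speaker-purity limit."""
    return [
        s for s in segments
        if s.predicted_mos >= min_mos and s.speaker_count <= max_speakers
    ]


corpus = [
    Segment("a.wav", 4.3, 1),
    Segment("b.wav", 3.4, 1),
    Segment("c.wav", 4.6, 2),  # two speakers: rejected at any MOS threshold
]
loose = filter_segments(corpus, min_mos=3.0)   # illustrative thresholds
strict = filter_segments(corpus, min_mos=4.0)
print(len(loose), len(strict))
```

A stricter threshold trades corpus size for quality, which is why the comparison only makes sense under the equalized training budgets the paper describes.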

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

A similar automated pipeline could speed creation of prosody-rich datasets for other languages that currently rely on manual labeling.
Integrating limited human review on top of the automatic pipeline might raise annotation reliability for high-stakes uses without losing scale.
The same enrichment steps could be tested on additional downstream tasks such as automatic speech recognition or spoken language understanding.

Load-bearing premise

The automatic stress, punctuation, and phoneme labels are accurate enough that models trained on them learn real prosodic structure rather than pipeline artifacts.
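One way to probe this premise is to score automatic stress labels against a small human-verified sample. The sketch below assumes stress is written as a `+` placed before the stressed vowel, which is an assumed notation rather than one confirmed by the source; the function names are hypothetical.

```python
def stress_position(word, mark="+"):
    """Index of the stress mark in the word, or None if the word is unmarked."""
    idx = word.find(mark)
    return idx if idx >= 0 else None


def stress_agreement(auto_words, gold_words, mark="+"):
    """Share of human-marked words where the automatic stress position matches."""
    if len(auto_words) != len(gold_words):
        raise ValueError("word sequences must have the same length")
    compared = agreed = 0
    for auto, gold in zip(auto_words, gold_words):
        if auto.replace(mark, "") != gold.replace(mark, ""):
            raise ValueError(f"word mismatch: {auto!r} vs {gold!r}")
        if stress_position(gold, mark) is None:
            continue  # skip words the human annotator left unmarked
        compared += 1
        agreed += stress_position(auto, mark) == stress_position(gold, mark)
    return agreed / compared if compared else float("nan")


# The homograph "замок" (castle vs. lock) is stressed differently in each reading.
print(stress_agreement(["з+амок", "откр+ыт"], ["зам+ок", "откр+ыт"]))
```

An agreement figure of this kind, reported on a held-out sample, is the sort of evidence the referee report below asks for.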

What would settle it

Replace the automatic stress and punctuation labels in the 5.1k-hour corpus with human-verified versions and retrain the denoising and TTS models; if the performance gains disappear or reverse, the claim that the automatic annotations drive genuine improvement is falsified.

Figures

Figures reproduced from arXiv: 2507.13563 by Grach Mkrtchian, Kirill Borodin, Maxim Maslov, Mikhail Gorodnichev, Nikita Vasiliev, Vasiliy Kudryavtsev.

Original abstract

We introduce Balalaika, an open-source, data-centric pipeline for processing audio and producing prosody-aware annotations. It combines semantic VAD for context-preserving segmentation, multi-ASR ensembling with ROVER consensus decoding, while retaining optional word-level timestamps, followed by automatic quality and speaker-purity filtering. The text is further enriched with punctuation restoration, lexical stress and "е/ё" normalization, and IPA phonemes. Using Balalaika, we build a 5.1k-hour multi-source Russian corpus with rich annotations, and show consistent gains under equalized training budgets for both speech denoising and TTS; ablations confirm complementary benefits of stress and punctuation and improved synthesis with stricter MOS filtering. The datasets are publicly available on HuggingFace: https://huggingface.co/collections/lab260/balalaika-dataset

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Balalaika is a practical pipeline that assembles existing tools into a 5.1k-hour annotated Russian speech corpus and reports downstream gains, but the automatic label accuracy is not independently checked.

The main thing to know is that this paper puts together a data processing pipeline for Russian audio and releases a sizable public corpus with added stress, punctuation, and phoneme labels. The authors combine semantic VAD for segmentation, ROVER-based ASR consensus, and automatic enrichment steps, then show that models trained on the enriched data do better on denoising and TTS when training budgets are held equal. Ablations suggest that the stress and punctuation additions each help, and that stricter filtering improves synthesis quality. Releasing the datasets on HuggingFace is a straightforward positive for anyone who needs Russian speech data right now.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Balalaika, an open-source data-centric pipeline for Russian speech that combines semantic VAD for context-preserving segmentation, multi-ASR ensembling with ROVER, optional word-level timestamps, quality and speaker-purity filtering, plus text enrichment via punctuation restoration, lexical stress marking, е/ё normalization, and IPA phonemes. The authors apply the pipeline to construct a publicly released 5.1k-hour multi-source corpus and report consistent gains in speech denoising and TTS under equalized training budgets, with ablations attributing complementary benefits to stress and punctuation annotations and further gains from stricter MOS filtering.

Significance. If the automatic prosody annotations prove reliable, the pipeline and released corpus would constitute a practical contribution to Russian speech resources by enabling prosody-aware modeling at scale. The open-source release, multi-source construction, and explicit ablation of annotation components are positive elements that could support reproducibility in the field.

major comments (2)

[Abstract] The headline claim of 'consistent gains' in denoising and TTS (and the attribution of those gains to stress/punctuation in the ablations) is presented without any numerical results, baseline comparisons, statistical tests, or details on training/validation splits, making it impossible to evaluate the magnitude or robustness of the reported improvements.
[Ablations / Experiments section] The central attribution of downstream improvements to the prosody-aware annotations (stress, punctuation, phonemes) requires evidence that these automatic labels are accurate on the final 5.1k-hour corpus. No human-annotated error rates, inter-annotator agreement figures, or held-out validation of the stress/punctuation/phoneme outputs are reported, leaving open the possibility that observed gains arise from segmentation, speaker filtering, or data volume rather than the enriched labels.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one concrete quantitative result (e.g., WER or MOS delta) alongside the qualitative claim of 'consistent gains'.
[Dataset construction] Clarify the exact sources, speaker counts, and any overlap handling in the 5.1k-hour multi-source corpus to allow readers to assess diversity and potential leakage.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and completeness of the paper.

Referee: [Abstract] The headline claim of 'consistent gains' in denoising and TTS (and the attribution of those gains to stress/punctuation in the ablations) is presented without any numerical results, baseline comparisons, statistical tests, or details on training/validation splits, making it impossible to evaluate the magnitude or robustness of the reported improvements.

Authors: We agree that the abstract would be strengthened by including quantitative evidence. In the revised manuscript, we have incorporated key numerical results from our experiments, such as the specific gains observed in denoising and TTS tasks, and referenced the evaluation details provided in the main text. The full baselines, statistical tests, and train/validation splits are detailed in the Experiments section.
revision: yes

Referee: [Ablations / Experiments section] The central attribution of downstream improvements to the prosody-aware annotations (stress, punctuation, phonemes) requires evidence that these automatic labels are accurate on the final 5.1k-hour corpus. No human-annotated error rates, inter-annotator agreement figures, or held-out validation of the stress/punctuation/phoneme outputs are reported, leaving open the possibility that observed gains arise from segmentation, speaker filtering, or data volume rather than the enriched labels.

Authors: The ablations control for data volume, segmentation, and filtering by using identical base datasets and varying only the presence of the prosody annotations. This design helps isolate their contribution. We have expanded the discussion in the revised version to better explain this control and to cite validation results from the underlying annotation tools. However, a comprehensive human evaluation of all labels across the entire corpus was beyond the scope of the current study.
revision: partial
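The control described in this response, identical audio with only the annotations varied, can be sketched by deriving every ablation condition from one annotated transcript. This is a hypothetical illustration: the `+` stress marker, the punctuation set, and the condition names are assumptions, not details from the paper.

```python
STRESS_MARK = "+"              # assumed stress notation
PUNCTUATION = set(",.!?;:")    # assumed punctuation inventory for the sketch


def strip_stress(text):
    """Remove stress marks while keeping everything else."""
    return text.replace(STRESS_MARK, "")


def strip_punctuation(text):
    """Remove punctuation and collapse any leftover double spaces."""
    cleaned = "".join(ch for ch in text if ch not in PUNCTUATION)
    return " ".join(cleaned.split())


def ablation_variants(annotated):
    """Four text conditions for the same utterance; the audio stays unchanged."""
    return {
        "stress+punct": annotated,
        "punct_only": strip_stress(annotated),
        "stress_only": strip_punctuation(annotated),
        "plain": strip_punctuation(strip_stress(annotated)),
    }


for name, text in ablation_variants("прив+ет, м+ир!").items():
    print(f"{name}: {text}")
```

Because every condition is generated from the same segments, differences between trained models can be attributed to the text annotations rather than to data volume or filtering. This does not, however, show that the annotations themselves are accurate.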

standing simulated objections not resolved

Comprehensive human-annotated accuracy metrics for the prosody annotations on the full 5.1k-hour corpus

Circularity Check

0 steps flagged

No circularity: engineering pipeline with external benchmarks

The paper presents a data-processing pipeline (semantic VAD, ASR ensembling, punctuation/stress/phoneme enrichment, quality filtering) applied to external multi-source audio, followed by downstream training of denoising and TTS models under equalized budgets with ablations. No equations, fitted parameters, or self-citations appear that would reduce the reported gains to internal definitions or prior author results by construction. The claimed improvements are measured against external task performance and are therefore independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The pipeline rests on standard speech-processing assumptions (accuracy of off-the-shelf ASR and VAD, validity of ROVER consensus, usefulness of lexical stress for TTS) rather than new axioms or invented entities. No free parameters or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5880 in / 1155 out tokens · 67850 ms · 2026-05-19T03:39:00.383623+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We introduce Balalaika, an open-source, data-centric pipeline for processing audio and producing prosody-aware annotations. It combines semantic VAD ... punctuation restoration, lexical stress and e-normalisation, and IPA phonemes.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Ablations confirm complementary benefits of stress and punctuation and improved synthesis with stricter MOS filtering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
cs.SD 2026-03 accept novelty 6.0

RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.

[1] [1]

https://openslr.org/96/, 2021

Russian librispeech (ruls) dataset. https://openslr.org/96/, 2021

work page 2021

[2] [2]

Ardila, M

R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isa- hara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedi...

work page

[3] [3]

ISBN 979-10- 95546-34-4

European Language Resources Association. ISBN 979-10- 95546-34-4. URL https://aclanthology.org/2020.lrec-1.520/

work page 2020

[4] [4]

K. Baba, W. Nakata, Y . Saito, and H. Saruwatari. The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classi- fier to naturalness MOS prediction of high-quality synthetic speech. In IEEE Spoken Language Technology Workshop (SLT), 2024

work page 2024

[5] [5]

H. Bredin. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In Proc. INTERSPEECH 2023, 2023

work page 2023

[6] [6]

I. Celeste. The m-ailabs speech dataset, 2019. URL https://github.com/ imdatceleste/m-ailabs-dataset. A large free dataset containing nearly 1000 hours of audio across 8 languages for speech recognition and syn- thesis

work page 2019

[7] [7]

Chao, W.-H

R. Chao, W.-H. Cheng, M. L. Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y . Tsao. An investigation of incorporating mamba for speech enhancement. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 302–308, 2024. doi: 10.1109/SLT61566.2024. 10832332

work page doi:10.1109/slt61566.2024 2024

[8] [8]

Chinen, F

M. Chinen, F. S. C. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines. Visqol v3: An open source production ready objective speech and audio metric. In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX) , pages 1–6, 2020. doi: 10.1109/ QoMEX48832.2020.9123150

work page arXiv 2020

[9] [9]

J. S. Chung, A. Nagrani, and A. Zisserman. V oxceleb2: Deep speaker recognition. In Interspeech 2018 , pages 1086–1090, 2018. doi: 10. 21437/Interspeech.2018-1929

work page 2018

[10] [10]

Fedoseev

G. Fedoseev. Russian speech recognition system based on mozilla’s deepspeech tensorflow implementation. https://github. com/GeorgeFedoseev/DeepSpeech, 2017. URL https://github.com/ GeorgeFedoseev/DeepSpeech. Forked from mozilla/DeepSpeech

work page 2017

[11] [11]

Gabdrakhmanov, R

L. Gabdrakhmanov, R. Garaev, and E. Razinkov. Ruslan: Russian spo- ken language corpus for speech synthesis. In Speech and Computer , pages 113–121, Cham, 2019. Springer International Publishing. ISBN 978-3-030-26061-3

work page 2019

[12] [12]

Hu and P

Y . Hu and P. C. Loizou. Evaluation of objective measures for speech en- hancement. In Interspeech 2006, pages paper 2007–Tue3FoP.10, 2006. doi: 10.21437/Interspeech.2006-84

work page doi:10.21437/interspeech.2006-84 2006

[13] [13]

B. Ivan. nisqa-s. https://github.com/deepvk/nisqa-s, 2024

work page 2024

[14] [14]

Karpov, A

N. Karpov, A. Denisenko, and F. Minkin. Golos: Russian Dataset for Speech Research. In Proc. Interspeech 2021, pages 1419–1423, 2021. doi: 10.21437/Interspeech.2021-462

work page doi:10.21437/interspeech.2021-462 2021

[15] [15]

J. Kim, J. Kong, and J. Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Confer- ence on Machine Learning , volume 139 of Proceedings of Machine Learning Research, pages 5530–5540. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.pres...

work page 2021

[16] [16]

S. Kirdey. V oicerestore: Flow-matching transformers for speech record- ing quality restoration, 2025. URL https://arxiv.org/abs/2501.00794

work page arXiv 2025

[17] [17]

T. Ko, V . Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur. A study on data augmentation of reverberant speech for robust speech recogni- tion. In ICASSP 2017, pages 5220–5224, 2017. doi: 10.1109/ICASSP. 2017.7953152

work page doi:10.1109/icassp 2017

[18] [18]

J. Kong, J. Park, B. Kim, J. Kim, D. Kong, and S. Kim. Vits2: Im- proving quality and efficiency of single-stage text-to-speech with ad- versarial learning and architecture design. In Interspeech 2023, pages 4374–4378, 2023. doi: 10.21437/Interspeech.2023-534

work page doi:10.21437/interspeech.2023-534 2023

[19] [19]

Y . Lin, M. Cheng, F. Zhang, Y . Gao, S. Zhang, and M. Li. V oxblink2: A 100k+ speaker recognition corpus and the open-set speaker- identification benchmark. In Interspeech 2024, pages 4263–4267, 2024. doi: 10.21437/Interspeech.2024-1490

work page doi:10.21437/interspeech.2024-1490 2024

[20] [20]

Y .-X. Lu, Y . Ai, and Z.-H. Ling. MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra. In Proc. Interspeech, pages 3834–3838, 2023

work page 2023

[21] [21]

McAuliffe, M

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sondereg- ger. Montreal Forced Aligner: Trainable Text-Speech Alignment Us- ing Kaldi. In Proc. Interspeech 2017 , pages 498–502, 2017. doi: 10.21437/Interspeech.2017-1386

work page doi:10.21437/interspeech.2017-1386 2017

[22] [22]

Mittag and S

G. Mittag and S. Möller. Deep learning based assessment of synthetic speech naturalness. In Interspeech 2020, pages 1748–1752, 2020. doi: 10.21437/Interspeech.2020-2382

work page doi:10.21437/interspeech.2020-2382 2020

[23] [23]

Mittag, B

G. Mittag, B. Naderi, A. Chehadi, and S. Möller. Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. Interspeech 2021, 2021. doi: 10.21437/ interspeech.2021-299

work page 2021

[24] [24]

D. Petrov. Rupunct models. https://huggingface.co/RUPunct, 2024

work page 2024

[25] [25]

D. A. Petrov. RUAccent: Advanced system for stress placement in Russian with homograph resolution. In Proceedings of the 31st Inter- national Conference on Computational Linguistics , pages 6642–6648, Abu Dhabi, UAE, Jan. 2025. Association for Computational Linguis- tics. URL https://aclanthology.org/2025.coling-main.444/

work page 2025

[26] [26]

Plaquet and H

A. Plaquet and H. Bredin. Powerset multi-class cross entropy loss for neural speaker diarization. In Proc. INTERSPEECH 2023, 2023

work page 2023

[27]

X. Qin, N. Li, C. Weng, D. Su, and M. Li. Simple attention module based speaker verification with iterative noisy label detection. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6722–6726, 2022. doi: 10.1109/ICASSP43922.2022.9746294

[28]

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356

[29]

A. Rix, J. Beerends, M. Hollier, and A. Hekstra. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In ICASSP 2001, pages 749–752, 2001. doi: 10.1109/ICASSP.2001.941023

[31]

E. V. Rodionova. Word order and information structure in Russian syntax. Master’s thesis, University of North Dakota, Grand Forks, ND, USA, 2001. URL https://commons.und.edu/theses/4482

[32]

A. Rozovskaya and D. Roth. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17, 2019. doi: 10.1162/tacl_a_00251

[33]

Salute Developers. Gigaam: the family of open-source acoustic models for speech processing. https://github.com/salute-developers/GigaAM. Released under the MIT License. Accessed: April 10, 2025

[35]

H. Schröter, T. Rosenkranz, A. N. Escalante-B., and A. Maier. DeepFilterNet: Perceptually motivated real-time speech enhancement. In INTERSPEECH, 2023

[36]

K. Shen, Z. Ju, X. Tan, E. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Rc7dAwVL3v

[37]

A. Slizhikova, A. Veysov, D. Nurtdinova, and D. Voronin. Russian open speech to text (stt/asr) dataset, 2019. URL https://github.com/snakers4/open_stt

[38]

D. Snyder, G. Chen, and D. Povey. MUSAN: A Music, Speech, and Noise Corpus, 2015. arXiv:1510.08484v1

[39]

SOVA AI. Sova dataset: Multilingual stt/asr corpus. https://github.com/sovaai/sova-dataset, 2022

[40]

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125–2136, 2011. doi: 10.1109/TASL.2011.2114881

[41]

O. K. Trubach, D. I. Gorshkova, and L. N. Sklyar. Comparative analysis of phonetic systems of the Russian, French and Chinese languages. RUDN Journal of Language Studies, Semiotics and Semantics, 14(1):171–188, 2023. ISSN 2313-2299. URL https://journals.rudn.ru/semiotics-semantics/article/view/34176

[42]

T. Ylonen. Wiktextract: Wiktionary as machine-readable structured data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1317–1325, 2022. URL https://aclanthology.org/2022.lrec-1.140/

[43]

S. Yolchuyeva, G. Németh, and B. Gyires-Tóth. Transformer based grapheme-to-phoneme conversion. In Interspeech 2019, pages 2095–2099. ISCA, Sept. 2019. doi: 10.21437/interspeech.2019-1954. URL http://dx.doi.org/10.21437/Interspeech.2019-1954

[44]

W. Zhang, C.-C. Yeh, W. Beckman, T. Raitio, R. Rasipuram, L. Golipour, and D. Winarsky. Audiobook synthesis with long-form neural text-to-speech. In 12th ISCA Speech Synthesis Workshop (SSW2023), pages 139–143, 2023. doi: 10.21437/SSW.2023-22

[45]

S. Zhao, Y. Ma, C. Ni, C. Zhang, H. Wang, T. H. Nguyen, K. Zhou, J. Q. Yip, D. Ng, and B. Ma. Mossformer2: Combining transformer and rnn-free recurrent network for enhanced time-domain monaural speech separation. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10356–10360, 2024. doi: 10.1109/ICASSP48485.2024.10445985