pith. machine review for the scientific record.

arXiv: 2507.13563 · v2 · submitted 2025-07-17 · 💻 cs.CL · cs.SD · eess.AS

Recognition: unknown

Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

keywords: balalaika, annotations, data-centric, filtering, huggingface, pipeline, prosody-aware, punctuation

We introduce Balalaika, an open-source, data-centric pipeline for processing audio and producing prosody-aware annotations. It combines semantic VAD for context-preserving segmentation with multi-ASR ensembling via ROVER consensus decoding, which retains optional word-level timestamps, followed by automatic quality and speaker-purity filtering. The text is further enriched with punctuation restoration, lexical stress and "е/ё" normalization, and IPA phonemes. Using Balalaika, we build a 5.1k-hour multi-source Russian corpus with rich annotations and show consistent gains under equalized training budgets for both speech denoising and TTS; ablations confirm complementary benefits of stress and punctuation, and improved synthesis with stricter MOS filtering. The datasets are publicly available on HuggingFace: https://huggingface.co/collections/lab260/balalaika-dataset
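
To make the multi-ASR consensus step concrete, here is a minimal sketch of ROVER-style word voting in Python. It is an illustration under stated assumptions, not the Balalaika implementation: the function names `align` and `rover` are hypothetical, the alignment is a plain edit-distance merge into a growing list of word slots, and voting is by frequency only. Confidence scores and word-level timestamps, which the abstract says the pipeline can retain, are not modeled here.

```python
from collections import Counter

NULL = ""  # marks "no word" in a slot; str.split() never yields it


def align(slots, words, k):
    """Merge `words` into `slots` using an edit-distance alignment.

    `slots` is a list of vote lists, one vote per hypothesis merged so far.
    `k` is the number of hypotheses already merged (hypothetical helper).
    """
    n, m = len(slots), len(words)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if words[j - 1] in slots[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitute
                             cost[i - 1][j] + 1,        # slot has no word here
                             cost[i][j - 1] + 1)        # word opens a new slot
    merged = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and words[j - 1] in slots[i - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            merged.append(slots[i - 1] + [words[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            merged.append(slots[i - 1] + [NULL])
            i -= 1
        else:
            # New slot: earlier hypotheses all vote NULL for it.
            merged.append([NULL] * k + [words[j - 1]])
            j -= 1
    merged.reverse()
    return merged


def rover(hypotheses):
    """Return the per-slot majority word sequence across ASR hypotheses."""
    slots = []
    for k, hyp in enumerate(hypotheses):
        slots = align(slots, hyp.split(), k)
    out = []
    for votes in slots:
        # Ties go to the value seen first, i.e. the earlier hypothesis.
        word, _ = Counter(votes).most_common(1)[0]
        if word != NULL:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    print(rover(["the cat sat", "the bat sat", "the cat sat down"]))
```

In this sketch a word survives only if it wins the vote in its slot, so an insertion made by a single recognizer ("down" above) is dropped, and a substitution made by a single recognizer ("bat") is outvoted. Because ties favor earlier hypotheses, the order in which the ASR outputs are passed matters.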

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative

    cs.SD · 2026-03 · accept · novelty 6.0

    RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.