arxiv: 2604.12383 · v1 · submitted 2026-04-14 · 💻 cs.SD

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

Changhao Cheng, Wei Wang, Wangyou Zhang, Dongya Jia, Jian Wu, Zhuo Chen, Yanmin Qian

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification 💻 cs.SD

keywords speech VAE · distillation loss · joint-marginal alignment · adaptive weighting · reconstruction · understanding · generation · SSL features


The pith

Joint-marginal alignment with adaptive weighting delivers the best overall performance in speech VAEs for reconstruction, understanding, and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests multiple distillation loss designs for aligning the latent space of speech variational autoencoders with self-supervised learning features. It measures how each design affects three core capabilities: faithful audio reconstruction, semantic understanding of the speech, and quality of generated speech. The experiments identify joint-marginal alignment paired with adaptive weighting as the strongest choice, because it leads on combined metrics and lets the user shift emphasis among the three goals. A reader would care since this points toward more flexible single models that can serve multiple speech applications at once.

Core claim

Systematic comparison of distillation losses shows that the joint-marginal alignment approach with adaptive weighting achieves the best overall performance across the axes of reconstruction, understanding, and generation while allowing controllable balance between them.

What carries the argument

The joint-marginal alignment with adaptive weighting inside the distillation loss that aligns VAE latents to SSL features.

If this is right

A single speech VAE can be trained to handle reconstruction, understanding, and generation more effectively than with time-axis distillation.
The adaptive weighting term gives explicit control over trade-offs, such as favoring generation quality over reconstruction fidelity.
Time-axis distillation alone is not optimal when all three task axes must be considered together.
Loss-function design choices that incorporate both joint and marginal statistics improve multi-objective performance in speech representation learning.
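One way to picture the adaptive weighting term is the gradient-norm balancing used in some image tokenizers: the distillation loss is rescaled so that its gradient is comparable in size to the reconstruction gradient, and a user-set multiplier then shifts the emphasis. The sketch below assumes that scheme. The paper's actual rule is not reproduced on this page, and the names `adaptive_weight`, `total_loss`, `emphasis`, and all gradient values are hypothetical.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grad))


def adaptive_weight(grad_rec, grad_distill, eps=1e-4, max_weight=1e4):
    """Ratio of reconstruction to distillation gradient norms (assumed scheme).

    The ratio is clipped at max_weight so a vanishing distillation
    gradient cannot blow up the weight.
    """
    weight = l2_norm(grad_rec) / (l2_norm(grad_distill) + eps)
    return min(weight, max_weight)


def total_loss(loss_rec, loss_distill, weight, emphasis=1.0):
    """Reconstruction loss plus the rescaled distillation loss.

    `emphasis` is a hypothetical user knob: values above 1 favor
    alignment with the SSL features, values below 1 favor reconstruction.
    """
    return loss_rec + emphasis * weight * loss_distill


# Toy gradients: norms are 0.5 and 0.05, so the weight is close to 10.
grad_rec = [0.3, -0.4]
grad_distill = [0.03, 0.04]
w = adaptive_weight(grad_rec, grad_distill)
print(round(w, 2), round(total_loss(1.0, 0.1, w, emphasis=0.5), 3))
```

In a real training loop the two gradients would be taken with respect to a shared layer; here they are plain lists so the arithmetic stays visible.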

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same joint-marginal adaptive scheme could be tested on VAEs for music or environmental audio to check whether the advantage generalizes beyond speech.
Adaptive weighting may help when training objectives conflict in other multi-task audio models.
End-to-end evaluation on downstream applications such as voice conversion or spoken dialogue systems would show whether the reported gains translate to usable systems.

Load-bearing premise

The chosen SSL features and evaluation metrics fully represent reconstruction, understanding, and generation needs without hidden task-specific biases.

What would settle it

A controlled experiment in which time-axis distillation or another alignment method scores higher than joint-marginal adaptive weighting on the same combined metrics for all three tasks would disprove the central claim.
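How the three axes are combined matters for this test. An excerpt in the reference graph notes that the authors also tried arithmetic and harmonic means for the overall score, which suggests (but does not confirm) a geometric mean as the headline aggregate. The sketch below compares the three means on hypothetical per-axis scores normalized to (0, 1]; the function name and the numbers are illustrative, not values from the paper.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def overall_scores(recon, understand, generate):
    """Aggregate three per-axis scores with three different means."""
    axes = [recon, understand, generate]
    return {
        "arithmetic": fmean(axes),
        "geometric": geometric_mean(axes),
        "harmonic": harmonic_mean(axes),
    }


# Hypothetical models: one balanced, one strong on two axes but weak on understanding.
balanced = overall_scores(0.8, 0.8, 0.8)
lopsided = overall_scores(1.0, 0.4, 1.0)

for name, scores in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Both toy models tie on the arithmetic mean, while the geometric and harmonic means penalize the weak axis. A ranking that survives all three aggregates is therefore harder to attribute to the choice of mean.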

Figures

Figures reproduced from arXiv: 2604.12383 by Changhao Cheng, Dongya Jia, Jian Wu, Wangyou Zhang, Wei Wang, Yanmin Qian, Zhuo Chen.

**Figure 1.** T-axis Aligned Semantic VAE (TAS-VAE) distills semantic knowledge from speech foundation models via the Eq. 2 alignment loss, achieving TTS performance comparable to mel spectrograms with minor reconstruction degradation. Its latent representations still underperform on downstream speech understanding. Baselines: Mel+Vocos [16] (reconstruction), Fbank (understanding), Mel+F5-TTS [15] (generation).

**Figure 2.** The design space of distillation loss functions for speech VAEs. For mathematical formulations, see Eqs. 2 to 6. Accompanying text (Section 2.1.1, T-axis Aligned Semantic VAE): The mathematical form of L_distill exerts a notable influence on the downstream performances of speech VAEs. A commonly used scheme is to align the features with T-axis cosine distance loss, which is shown to outperform …
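For orientation, a time-axis (T-axis) cosine distance loss of the kind named in the figure text can be sketched as the average, over frames, of one minus the cosine similarity between each projected latent frame and the SSL feature at the same time step. This is a generic form and an assumption on our part: the paper's Eq. 2 is not reproduced on this page, and the function names and toy vectors below are hypothetical. The vectors are assumed to be already projected to a common dimension and time-aligned.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def t_axis_cosine_loss(latents, targets):
    """Mean over frames of (1 - cosine similarity) between paired frames.

    latents: list of T projected VAE latent frames
    targets: list of T SSL feature frames with the same dimension
    """
    if len(latents) != len(targets):
        raise ValueError("latents and targets must have the same number of frames")
    per_frame = [1.0 - cosine_similarity(z, f) for z, f in zip(latents, targets)]
    return sum(per_frame) / len(per_frame)


# Two toy frames: the first pair is parallel, the second is orthogonal.
latents = [[1.0, 0.0], [0.0, 1.0]]
targets = [[2.0, 0.0], [1.0, 0.0]]
loss = t_axis_cosine_loss(latents, targets)
print(round(loss, 4))
```

Because cosine similarity ignores vector length, the parallel pair contributes no loss even though the magnitudes differ; only the direction of each frame is matched.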

**Figure 4.** Reconstruction, understanding, generation, and overall scores of adaptive-weighted JMAS-VAE under various margin combinations (m1, m2), as well as the corresponding distances between VAE latents and SSL features calculated by modified Eqs. 4 and 5 (with m1 = m2 = 0). Comparing subplots (a)–(d), we can see that smaller margins generally improve understanding but impair reconstruction and generat…
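The role of the margins can be illustrated with a hinge: distances below the margin cost nothing, so a larger margin leaves the latents freer and a smaller one pulls them closer to the SSL features. The sketch below assumes that hinge form, which is common in margin-based alignment losses; the exact Eqs. 4 and 5 are not shown on this page. The distances are made up, and the margins 0.5 and 0.25 simply reuse the values of m1 and m2 quoted in the results excerpt.

```python
def margin_hinge(distance, margin):
    """Penalty that is zero until the distance exceeds the margin."""
    return max(0.0, distance - margin)


def margin_loss(distances, margin):
    """Average hinge penalty over a list of per-frame distances."""
    return sum(margin_hinge(d, margin) for d in distances) / len(distances)


# Hypothetical per-frame distances between latents and SSL features.
distances = [0.1, 0.3, 0.6]

loose = margin_loss(distances, 0.5)    # only the 0.6 frame is penalized
tight = margin_loss(distances, 0.25)   # the 0.3 and 0.6 frames are penalized
raw = margin_loss(distances, 0.0)      # margin 0 recovers the plain mean distance

print(round(loose, 4), round(tight, 4), round(raw, 4))
```

Setting the margin to zero turns the hinge into the raw distance, which matches the way the figure text reports distances with m1 = m2 = 0.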

Original abstract

Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely-used alignment approach based on time-axis distillation is optimal when considering more tasks. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on the performances over three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance while allowing for a controllable balance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper systematically compares distillation losses for speech VAEs and identifies joint-marginal alignment with adaptive weighting as the strongest overall performer across reconstruction, understanding, and generation.


The main takeaway is that the authors test several alignment strategies for injecting SSL features into speech VAEs and conclude that joint-marginal distillation plus adaptive weighting gives the best balance on the three task axes. They move past the usual time-axis approach by checking multiple loss designs and showing that the adaptive weights let users shift emphasis without full retraining. That practical control is a clear plus for anyone already using VAEs in generation pipelines. The work is new in its breadth of comparison and in naming this specific combination as the top result not previously highlighted in the cited papers. The experiments are presented as extensive, which supports treating the finding as a useful incremental improvement rather than a minor tweak.

The central claim holds up on its own terms as an empirical study of loss variants, with no obvious circularity or self-referential fitting. That said, the abstract gives no tables, effect sizes, or ablation breakdowns, so the size of the gains and their robustness are hard to gauge from the summary alone. The stress-test concern about possible overlap between the SSL features used for alignment and those in the evaluation metrics is reasonable to raise; if the full paper does not include checks with held-out feature extractors or alternate metrics, the reported superiority could partly reflect that shared structure rather than a general property of the loss. Minor details like exact hyperparameter ranges for the adaptive weights would also help reproducibility.

This paper is aimed at speech researchers who already work with continuous VAE latents and want concrete guidance on alignment losses. Readers focused on practical model tuning in that subfield will find the comparisons worth their time. It deserves peer review because the question is well-defined, the method is straightforward to test, and the multi-axis evaluation is a step in the right direction, even if revisions should add clearer controls and numbers.

Referee Report

2 major / 0 minor

Summary. The paper systematically compares distillation loss functions for aligning speech VAE latent representations with SSL features, focusing on their effects across reconstruction, understanding, and generation tasks. It concludes that joint-marginal alignment combined with adaptive weighting yields the best overall performance while enabling a controllable balance between the three axes.

Significance. If the empirical results hold under rigorous verification, this provides actionable guidance on loss design for multi-task speech VAEs and helps unify reconstruction, understanding, and generation in a single model. The explicit comparison of alignment strategies (time-axis vs. joint-marginal) is a constructive contribution to the literature on continuous speech representations.

major comments (2)

Abstract and §4 (Experiments): The central claim that 'extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance' is load-bearing, yet the manuscript provides no quantitative tables, metric values, error bars, or ablation details to substantiate superiority or the controllable balance. This absence prevents assessment of whether the reported gains are robust or task-specific.
§4.2 (Evaluation metrics): The optimality claim for joint-marginal alignment risks circularity if the SSL features used as distillation targets are also employed (directly or indirectly) in the understanding-task metrics or feature-based reconstruction/generation scores. Without an ablation using held-out feature sets independent of the alignment targets, the superiority could be an artifact of representational overlap rather than a general property of the loss design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to improve the clarity and rigor of the manuscript.


Referee: Abstract and §4 (Experiments): The central claim that 'extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance' is load-bearing, yet the manuscript provides no quantitative tables, metric values, error bars, or ablation details to substantiate superiority or the controllable balance. This absence prevents assessment of whether the reported gains are robust or task-specific.

Authors: We agree that consolidated numerical tables with exact values, error bars, and explicit ablations would strengthen the presentation of the results. While §4 contains comparative figures and qualitative descriptions of performance across the three axes, we acknowledge that these do not include the requested tabular summaries or standard deviations. In the revised manuscript we will add a new table in §4 that reports all key metrics (reconstruction, understanding, and generation) for every alignment method, together with standard deviations computed over multiple random seeds. We will also expand the ablation section on adaptive weighting to explicitly demonstrate the controllable trade-off between the three task axes.
Revision: yes
Referee: §4.2 (Evaluation metrics): The optimality claim for joint-marginal alignment risks circularity if the SSL features used as distillation targets are also employed (directly or indirectly) in the understanding-task metrics or feature-based reconstruction/generation scores. Without an ablation using held-out feature sets independent of the alignment targets, the superiority could be an artifact of representational overlap rather than a general property of the loss design.

Authors: We appreciate this observation on possible circularity. The understanding-task metrics are taken from standard downstream benchmarks (ASR word error rate and speaker identification accuracy) whose evaluation protocols are independent of the particular SSL model used for distillation. Reconstruction and generation metrics are likewise waveform- or perceptually-based rather than direct feature-matching scores. Nevertheless, to remove any residual concern, we will add a new ablation experiment in the revision that evaluates all models using a completely disjoint SSL feature extractor (different architecture and training data) that was never used as a distillation target. This will confirm that the observed advantages of joint-marginal alignment are not an artifact of feature overlap.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of distillation losses with no derivation reducing to inputs by construction.


The paper conducts an empirical investigation of multiple distillation loss designs for speech VAEs, evaluating their effects on reconstruction, understanding, and generation via experiments. No first-principles derivation, uniqueness theorem, or predictive claim is advanced that collapses to a self-referential fit or self-citation chain. The reported superiority of joint-marginal alignment with adaptive weighting rests on external performance metrics rather than any quantity defined in terms of itself or fitted parameters renamed as predictions. Any self-citations present are non-load-bearing background and do not substitute for the experimental evidence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard VAE and SSL assumptions plus empirical loss design choices; no new physical entities or ungrounded postulates are introduced.

free parameters (1)

adaptive weighting coefficients
Learned or tuned scalars that balance the joint-marginal terms; their values are fitted to achieve the reported balance.

axioms (1)

Domain assumption: SSL features provide useful structural supervision for VAE latents
Invoked when choosing the alignment target; treated as given rather than derived.

pith-pipeline@v0.9.0 · 5447 in / 1080 out tokens · 27348 ms · 2026-05-10T15:27:09.623362+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

[1]

Introduction. Discrete speech representations, including both semantic tokens (e.g., HuBERT [1]) and acoustic tokens (e.g., neural audio codecs [2, 3, 4]), have proven effective for boosting the performance of speech large language models (Speech LLMs) [5, 6]. However, quantizing continuous audio signals into discrete tokens incurs inevitable informa...
[2]

Proposed Methods. The widely adopted loss combination for VAE training is illustrated in Fig. 2, which consists of a reconstruction loss for autoencoding, a Kullback-Leibler (KL) divergence loss for posterior regularization, and Generative Adversarial Network (GAN) based losses for distribution matching [19]. Following Semantic-VAE [14], we also includ...
[3]

Experimental Setup. We use stable-audio-tools as the speech VAE backbone, with a DAC-based encoder [21] and a BigVGAN decoder [22]. The input speech signal is downsampled by factors of {4, 4, 5, 5} to a 64-dim 40 Hz latent representation z, which is linearly projected to 1024-dim (z′) and aligned with the 23rd-layer features of WavLM Large [23]. All VAE...
[4]

Results. 4.1. Overall Performance Comparison. Tab. 1 presents the evaluation results of speech VAEs with different distillation schemes on reconstruction, understanding, and generation. For the joint-marginal-aligned semantic VAE (JMAS-VAE) in Section 2.1.3, we set m1 = 0.5 and m2 = 0.25. Besides Vanilla VAE, we compare with Semantic-VAE [14] (whose...
[5]

Although the Vanilla VAE and Semantic-VAE excel in reconstruction and generation, their performances across eight speech understanding tasks are very poor, some of which even lagging behind the conventional baseline and Encodec
[6]

Compared to TAS-VAE, DAS-VAE demonstrates much better understanding performance with small performance drop in generation, resulting in substantial overall improvement
[7]

The JMAS-VAE model with adaptive weighting significantly outperforms other approaches in terms of the overall score. This highlights the efficacy of the joint-marginal alignment in balancing reconstruction, understanding, and generation within compact continuous representations
[8]

The adaptive weighting strategy (denoted with ⋆) can significantly improve speech understanding performance for all three types of semantic-aligned VAEs. Among them, the J... [Footnote 3: We also explored arithmetic and harmonic means for overall score calculation (cf. https://github.com/changhao-cheng/JMAS-VAE/blob/main/VAE_scores.pdf), showing a similar conclusion.]
[9]

Conclusion. In this work, we explore the design space of distillation loss functions for speech VAEs aligned with speech foundation models. Through extensive experiments, we demonstrate that speech VAEs equipped with joint-marginal loss and adaptive weighting can achieve balanced and superior overall performance across reconstruction, understanding, an...
[10]

Acknowledgments. We thank Zhikang Niu for valuable discussions during this work
[11]

Generative AI Use Disclosure. Generative AI is only used for text polishing
[12]

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[13]

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
[14]

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
[15]

D. Yang, S. Liu, R. Huang, J. Tian, C. Weng, and Y. Zou, “Hifi-codec: Group-residual vector quantization for high fidelity audio codec,” arXiv preprint arXiv:2305.02765, 2023.
[16]

E. Casanova, R. Langman, P. Neekhara, S. Hussain, J. Li, S. Ghosh, A. Jukić, and S.-g. Lee, “Low frame-rate speech codec: a codec designed for fast high-quality speech LLM training and inference,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[17]

Z. Ye, P. Sun, J. Lei, H. Lin, X. Tan, Z. Dai, Q. Kong, J. Chen, J. Pan, Q. Liu, Y. Guo, and W. Xue, “Codec does matter: Exploring the semantic shortcoming of codec for audio language model,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 24, 2025, pp. 25697–25705.
[18]

K. Xia, X. Zhu, J. Yao, W. Tian, W. Li, and L. Xie, “Kall-e: Autoregressive speech synthesis with next-distribution prediction,” arXiv preprint arXiv:2412.16846, 2024.
[19]

D. Jia, Z. Chen, J. Chen, C. Du, J. Wu, J. Cong, X. Zhuang, C. Li, Z. Wei, Y. Wang, and Y. Wang, “DiTAR: Diffusion transformer autoregressive modeling for speech generation,” in International Conference on Machine Learning (ICML), 2025.
[20]

C. Y. Wu, J. Deng, G. Li, Q. Kong, and S. Lui, “Clear: Continuous latent autoregressive modeling for high-quality and low-latency speech synthesis,” arXiv preprint arXiv:2508.19098, 2025.
[21]

Z. Peng, J. Yu, W. Wang, Y. Chang, Y. Sun, L. Dong, Y. Zhu, W. Xu, H. Bao, Z. Wang et al., “Vibevoice technical report,” arXiv preprint arXiv:2508.19205, 2025.
[22]

Y. Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Li et al., “VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning,” arXiv preprint arXiv:2509.24650, 2025.
[23]

A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, F.-F. Li, I. Essa, L. Jiang, and J. Lezama, “Photorealistic video generation with diffusion models,” in European Conference on Computer Vision. Springer, 2024, pp. 393–411.
[24]

J. Yao, B. Yang, and X. Wang, “Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 15703–15712.
[25]

Z. Niu, S. Hu, J. Choi, Y. Chen, P. Chen, P. Zhu, Y. Yang, B. Zhang, J. Zhao, C. Wang, and X. Chen, “Semantic-VAE: Semantic-alignment latent representation for better speech synthesis,” arXiv preprint arXiv:2509.22167, 2025.
[26]

Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, K. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271.
[27]

H. Siuzdak, “Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” in Proceedings of the International Conference on Learning Representations (ICLR), 2024.
[28]

A. Huang, B. Wu, B. Wang, C. Yan, C. Hu, C. Feng, F. Tian, F. Shen, J. Li, M. Chen et al., “Step-Audio: Unified understanding and generation in intelligent speech interaction,” arXiv preprint arXiv:2502.11946, 2025.
[29]

Y. Wang, D. Yang, Y. Shao, H. Chen, J. Zhao, Z. Wu, H. Meng, and X. Wu, “DualSpeechLM: Towards unified speech understanding and generation via dual speech token modeling with large language models,” arXiv preprint arXiv:2508.08961, 2025.
[30]

Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable Audio Open,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[31]

X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “SpeechTokenizer: Unified speech tokenizer for speech language models,” in Proceedings of the International Conference on Learning Representations (ICLR), 2024.
[32]

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 27980–27993.
[33]

S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023.
[34]

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[35]

W. Kang, X. Yang, Z. Yao, F. Kuang, Y. Yang, L. Guo, L. Lin, and D. Povey, “Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10991–10995.
[36]

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[37]

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, 2001, pp. 749–752.
[38]

S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech processing universal performance benchmark,” in Proc. ISCA Interspeech, 2021, pp. 1194–1198.
[39]

H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from librispeech for text-to-speech,” in Proc. ISCA Interspeech, 2019, pp. 1526–1530.
[40]

A. Meister, M. Novikov, N. Karpov, E. Bakhturina, V. Lavrukhin, and B. Ginsburg, “LibriSpeech-PC: Benchmark for evaluation of punctuation and capitalization capabilities of end-to-end asr models,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
[41]

P. J. Fleming and J. J. Wallace, “How not to lie with statistics: The correct way to summarize benchmark results,” Communications of the ACM, vol. 29, no. 3, pp. 218–221, 1986.