Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

arxiv: 2604.09472 · v1 · submitted 2026-04-10 · 📡 eess.AS

Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

Valentin Pelloin , Lina Bekkali , Reda Dehak , David Doukhan This is my paper

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

classification 📡 eess.AS

keywords self-supervised learningaudio representationsdata selectionFrench audiovisual broadcastsdownstream evaluationmembership inference attackpretraining datasets

0 comments p. Extension

The pith

Pretraining SSL audio models on diverse French broadcast content improves downstream task performance over speech-only training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the impact of pretraining dataset composition on self-supervised audio encoders by constructing a large corpus from French TV and radio broadcasts. Automatic annotation tools label the audio for speech, music, and other categories, allowing the creation of subsets with varying levels of diversity. Models trained on these subsets are evaluated on automatic speech recognition, voice activity detection, music detection, and speaker recognition tasks. Results indicate that diverse pretraining data enhances performance across these tasks without limiting to clean speech. A membership inference attack further reveals the need for deduplication to prevent memorization of training data.

Core claim

The authors establish that self-supervised learning models for audio representations benefit from pretraining on highly diverse audiovisual broadcast content rather than being restricted to clean segmented speech. Using a corpus of French TV and radio audio annotated automatically, they demonstrate improved results on multiple downstream tasks including ASR, speaker recognition, and music detection. They also highlight that without deduplication, models may memorize training examples, as shown by membership inference attacks. This suggests that unified pretraining can connect speech and music processing communities.

What carries the argument

Automatic annotation-based subset creation from broadcast audio to control pretraining data diversity for SSL models.

Load-bearing premise

Automatic tools for annotating audio content must generate labels accurate enough that performance differences between subsets stem from their diversity rather than from labeling inaccuracies or unrelated variables.

What would settle it

Re-evaluating the downstream tasks after manually correcting the automatic labels or after matching the subsets exactly on factors like total duration and acoustic conditions would show if the reported gains persist or vanish.

Figures

Figures reproduced from arXiv: 2604.09472 by David Doukhan, Lina Bekkali, Reda Dehak, Valentin Pelloin.

**Figure 1.** Figure 1: Number of hours of audio content in the source INA dataset per year. The Institut National de l’Audiovisuel (INA) is in charge since 1975 of collecting and archiving TV and Radio content in France. In partnership with them, we obtained a randomly sampled dataset of 473k hours of content, broadcast on 113 French TV and Radio channels, from 1940 to 2022. Thus, this dataset covers various kinds of audiovisual… view at source ↗

**Figure 2.** Figure 2: Overview of the data preprocessing pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The per-gender speech ratio per year. 4. Self-Supervised Learning We train audio SSL models following data2vec2 architecture presented by Baevski et al. (2023). The architecture follows a teacher-student encoder setting, where the teacher corresponds to the exponentially moving average from the student weights. The student has to predict the masked audio sequence representation of the teacher. The enco… view at source ↗

**Figure 4.** Figure 4: Overview of architecture used of the Voice [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Overview of the proposed Membership Inference Attack (MIA) downstream model. During training, we notice the downstream MIA model struggles to converge on the development set, exhibiting the difficulty of the task. Next, for evaluating the MIA, we construct three different test sets of 10 hours (1,200 segments each): unseen where neither of the model has seen the examples during their pretraining ; once whe… view at source ↗

**Figure 6.** Figure 6: ROC curves for the membership inference attack. each other nor the 22h downstream training set. We present in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Audio and speech self-supervised encoder models are now widely used for a lot of different tasks. Many of these models are often trained on clean segmented speech content such as LibriSpeech. In this paper, we look into how the pretraining datasets of such SSL (Self-Supervised Learning) models impact their downstream results. We build a large pretraining corpus of highly diverse TV and Radio broadcast audio content, which we describe with automatic tools. We use these annotations to build smaller subsets, which we use to train audio SSL models. Then, we evaluate the models on multiple downstream tasks such as automatic speech recognition, voice activity and music detection, or speaker recognition. The results show the potential of pretraining SSL models on diverse audio content without restricting it to speech. We also perform a membership inference attack to evaluate the encoder ability to memorize their training datasets, which highlight the importance of data deduplication. This unified training could bridge speech and music machine learning communities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a clear empirical case that diverse French broadcast audio helps SSL pretraining on mixed tasks more than speech-only data, with a useful privacy check added.

read the letter

The main thing to know is that pretraining audio SSL models on varied TV and radio content, rather than clean speech alone, produced better results across automatic speech recognition, voice activity detection, music detection, and speaker identification. The authors also ran a membership inference attack to show memorization risks and the value of deduplication. This is a focused, practical study on data selection for media audio in French.

Referee Report

2 major / 3 minor

Summary. The paper examines the impact of pretraining data selection on self-supervised audio encoders by assembling a large corpus of French TV and radio broadcasts, using automatic annotation tools to characterize and subset the data by content type, training SSL models on these subsets, and evaluating them on downstream tasks including ASR, VAD, music detection, and speaker recognition. A membership inference attack is also performed to assess memorization of training data.

Significance. If the results hold under controlled conditions, the work provides empirical support for the benefits of diverse broadcast audio pretraining over speech-only corpora for SSL models, with potential to improve cross-task robustness and connect speech and music ML communities. The membership inference experiment usefully highlights deduplication needs in such models.

major comments (2)

[§3.2] §3.2: The central attribution of downstream gains to content diversity relies on automatic annotations for subset construction, but no quantitative validation (e.g., precision/recall on a manually labeled hold-out set) of these tools is reported, risking that observed differences arise from annotation artifacts rather than diversity.
[Table 3] Table 3: Downstream task results are presented as single-point metrics without standard deviations, multiple random seeds, or statistical significance tests, which is insufficient to support claims of consistent gains attributable to the diversity subsets.

minor comments (3)

[Abstract] Abstract: The phrasing 'a lot of different tasks' is informal and should be revised to 'various tasks'.
[§4.1] §4.1: Training hyperparameters for the SSL models (e.g., batch size, learning rate schedule) are summarized but lack explicit values or references to the exact configuration files used.
[Figure 2] Figure 2: The caption does not clarify the meaning of error bars or whether they represent standard error across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We address the two major comments point by point below.

read point-by-point responses

Referee: [§3.2] §3.2: The central attribution of downstream gains to content diversity relies on automatic annotations for subset construction, but no quantitative validation (e.g., precision/recall on a manually labeled hold-out set) of these tools is reported, risking that observed differences arise from annotation artifacts rather than diversity.

Authors: We agree that explicit validation of the automatic annotation pipeline would strengthen the attribution of gains to content diversity. The tools (speech/music classifiers, VAD, and speaker diarization) follow established implementations from prior literature, yet we did not report precision/recall on a held-out manually labeled set in the original manuscript. In the revised version we will add a dedicated validation subsection that quantifies tool accuracy on a manually annotated subset of the broadcast corpus, thereby reducing the risk that subset differences reflect annotation artifacts. revision: yes
Referee: [Table 3] Table 3: Downstream task results are presented as single-point metrics without standard deviations, multiple random seeds, or statistical significance tests, which is insufficient to support claims of consistent gains attributable to the diversity subsets.

Authors: We acknowledge that single-run point estimates limit the strength of claims about consistent gains. Pretraining each SSL model requires substantial GPU resources, which constrained us to one run per data configuration. Nevertheless, the same ordering of results (diverse broadcast subsets outperforming speech-only baselines) appears across four distinct downstream tasks. In the revision we will (i) report standard deviations over multiple evaluation folds where applicable, (ii) add a limitations paragraph discussing the single-seed constraint, and (iii) include a statistical significance test (paired t-test or Wilcoxon) on the per-utterance metrics that are already available, thereby providing a more rigorous presentation without requiring additional full pretraining runs. revision: partial

Circularity Check

0 steps flagged

No significant circularity; purely empirical study with independent evaluations

full rationale

The manuscript is an empirical study that constructs pretraining corpora from broadcast audio, applies automatic annotations to create subsets, trains SSL encoders, and measures performance on held-out downstream tasks (ASR, VAD, music detection, speaker ID) plus a membership-inference experiment. No derivation chain, first-principles equations, or fitted parameters are presented that reduce to the inputs by construction. All reported gains are external observations on separate evaluation sets rather than internal consistency checks or self-referential definitions. The paper contains no load-bearing self-citations, uniqueness theorems, or ansatzes that close a loop back to the authors' prior claims. The central result—that diverse broadcast pretraining yields downstream benefits—is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard assumption that contrastive or predictive SSL objectives learn transferable audio features and that automatic tools can reliably partition broadcast audio by content type. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Self-supervised objectives on raw audio produce representations useful for downstream classification and recognition tasks
Invoked implicitly when claiming that pretraining on different subsets affects downstream performance.
domain assumption Automatic annotation tools provide sufficiently accurate labels to construct diversity-controlled training subsets
Required to interpret subset differences as causal for the reported gains.

pith-pipeline@v0.9.0 · 5477 in / 1362 out tokens · 62524 ms · 2026-05-10T16:17:37.806403+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

[1]

Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

Introduction Self-Supervised Learning (SSL) consists in pre- training models on unsupervised data, without us- inglabeleddata. Inthecontextofaudioandspeech SSL models, an encoder model is pretrained on a large corpus of audio content. This model then generates embeddings that can be finetuned and used as input of downstream models to perform various tasks...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

(2023), data2vec2 is an efficient and multimodal architecture to train SSL encoders

Related works Introduced by Baevski et al. (2023), data2vec2 is an efficient and multimodal architecture to train SSL encoders. This architecture is composed of a teacher-student encoder, and can be trained with similar objectives for text, image or speech. Using anequivalentarchitecture,Lietal.(2022)published music2vec, a model pretrained on music (1,000...

work page 2023
[3]

music”and“noise

Audio datasets for SSL Our objective is to build audio SSL models as gen- eral as possible. These models could work for both speech analysis tasks such as speech recog- nition, speech understanding, speaker diarization or verification; and also for Music Information Re- trieval (MIR) tasks: music and singing voice detec- tion. We aim at applying these mod...

work page 1940
[4]

Self-Supervised Learning We train audio SSL models following data2vec2 ar- chitecture presented by Baevski et al. (2023). The architecture follows a teacher-student encoder set- ting, where the teacher corresponds to the expo- nentially moving average from the student weights. The student has to predict the masked audio se- quence representation of the te...

work page 2023
[5]

We also assess the ability of our models to recall their pretraining dataset with a membership inference attack

Evaluation on downstream tasks In this section we benchmark our audio encoders with multiple downstream tasks: automatic speech recognition, voice activity detection, music detec- tion and speaker recognition. We also assess the ability of our models to recall their pretraining dataset with a membership inference attack. We compare our audio encoders with...

work page 2021
[6]

We build 6 pretrained audio SSL models that we benchmark on various downstream evaluations

Conclusion In this paper, we construct a 100,000 hours pre- trainingcorpusofaudiovisualTVandRadiocontent. We build 6 pretrained audio SSL models that we benchmark on various downstream evaluations. Our observations shows that for speech recog- nition, pretraining on content without music im- proves the results compared to more diverse con- tent. Gender-wi...

work page
[7]

Pantagruel

Acknowledgments The authors would like to thank Jean-Hugues Chenot, Nicolas Hervé and Sandrine Depoix for theirhelpduringtheconstructionoftheINAdataset and with the audio deduplication tool. They also thank Aude Formagne regarding the legal chal- lenges of publishing the pretrained models. ThisresearchhasbeenfundedbytheFrenchNa- tionalResearchAgency(ANR),...

work page 2022
[8]

Bibliographical References Martine Adda-Decker and Lori Lamel. 2005. Do speech recognizers prefer female speakers? In Interspeech 2005, pages 2205–2208. Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, and Dirk Hovy. 2024. Twists, humps, and pebbles: Multilingual speech recognition mod- els exhibit gender performance gaps. InPro- ceedings of the 2024 C...

work page 2005
[9]

In30th USENIX Security Sympo- sium (USENIX Security 21), pages 2633–2650

Extracting training data from large lan- guage models. In30th USENIX Security Sympo- sium (USENIX Security 21), pages 2633–2650. USENIX Association. Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xian...

work page 2022
[10]

InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Ma- chine Learning Research, pages 10697–10707

Deduplicating training data mitigates pri- vacy risks in language models. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Ma- chine Learning Research, pages 10697–10707. PMLR. Biswajit Karan, Joshua Jansen van Vüren, Febe de Wet, and Thomas Niesler. 2024. A Transformer- Based Voice Activity Detector. InI...

work page 2024
[11]

InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 28492–28518

Robust speech recognition via large-scale weak supervision. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 28492–28518. PMLR. Wei-Cheng Tseng, Wei-Tsung Kao, and Hung yi Lee. 2022. Membership Inference Attacks Against Self-supervised Speech Models. InIn- terspeech 2022, ...

work page 2022
[12]

Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko,andYannickEstève.2022

On protecting the data privacy of large language models (llms): A survey. Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko,andYannickEstève.2022. AStudy of Gender Impact in Self-supervised Models for Speech-to-Text Systems. InInterspeech 2022, pages 1278–1282. Marcely Zanon Boito, Vivek Iyer, Nikolaos La- gos, Laurent Besacier, and Ioan Calapodescu

work page 2022
[13]

InInterspeech 2024, pages 3939–3943

mHuBERT-147: A Compact Multilingual HuBERT Model. InInterspeech 2024, pages 3939–3943

work page 2024
[14]

Language Resource References Christophe Benzitoun, Jeanne-Marie Debaisieux, and Henri-José Deulofeu. 2016. Le projet orféo: un corpus d’étude pour le français contemporain. Corpus, (15). Karim Boudahmane, Bianka Buschbeck, Eu- nah Cho, Josep Maria Crego, Markus Freitag, Thomas Lavergne, Hermann Ney, Jan Niehues, Stephan Peitz, Jean Senellart, Artem Sokolo...

work page 2016
[15]

InInterspeech 2018, pages 1239–1243

Ava-speech: A densely labeled dataset of speech activity in movies. InInterspeech 2018, pages 1239–1243. David Doukhan, Christine Maertens, William Le Personnic, Ludovic Speroni, and Reda De- hak. 2024. InaGVAD : A challenging French TV and radio corpus annotated for speech ac- tivity detection and speaker gender segmenta- tion. InProceedings of the 2024 ...

work page 2018
[16]

MLS: A Large-Scale Multilingual Dataset for Speech Research

The EPAC corpus: Manual and auto- matic annotations of conversational speech in French broadcast news. InProceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Associa- tion (ELRA). Sylvain Galliano, Guillaume Gravier, and Laura Chaubard. 2009. The ester 2 evaluation...

work page internal anchor Pith review arXiv 2009

[1] [1]

Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts

Introduction Self-Supervised Learning (SSL) consists in pre- training models on unsupervised data, without us- inglabeleddata. Inthecontextofaudioandspeech SSL models, an encoder model is pretrained on a large corpus of audio content. This model then generates embeddings that can be finetuned and used as input of downstream models to perform various tasks...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

(2023), data2vec2 is an efficient and multimodal architecture to train SSL encoders

Related works Introduced by Baevski et al. (2023), data2vec2 is an efficient and multimodal architecture to train SSL encoders. This architecture is composed of a teacher-student encoder, and can be trained with similar objectives for text, image or speech. Using anequivalentarchitecture,Lietal.(2022)published music2vec, a model pretrained on music (1,000...

work page 2023

[3] [3]

music”and“noise

Audio datasets for SSL Our objective is to build audio SSL models as gen- eral as possible. These models could work for both speech analysis tasks such as speech recog- nition, speech understanding, speaker diarization or verification; and also for Music Information Re- trieval (MIR) tasks: music and singing voice detec- tion. We aim at applying these mod...

work page 1940

[4] [4]

Self-Supervised Learning We train audio SSL models following data2vec2 ar- chitecture presented by Baevski et al. (2023). The architecture follows a teacher-student encoder set- ting, where the teacher corresponds to the expo- nentially moving average from the student weights. The student has to predict the masked audio se- quence representation of the te...

work page 2023

[5] [5]

We also assess the ability of our models to recall their pretraining dataset with a membership inference attack

Evaluation on downstream tasks In this section we benchmark our audio encoders with multiple downstream tasks: automatic speech recognition, voice activity detection, music detec- tion and speaker recognition. We also assess the ability of our models to recall their pretraining dataset with a membership inference attack. We compare our audio encoders with...

work page 2021

[6] [6]

We build 6 pretrained audio SSL models that we benchmark on various downstream evaluations

Conclusion In this paper, we construct a 100,000 hours pre- trainingcorpusofaudiovisualTVandRadiocontent. We build 6 pretrained audio SSL models that we benchmark on various downstream evaluations. Our observations shows that for speech recog- nition, pretraining on content without music im- proves the results compared to more diverse con- tent. Gender-wi...

work page

[7] [7]

Pantagruel

Acknowledgments The authors would like to thank Jean-Hugues Chenot, Nicolas Hervé and Sandrine Depoix for theirhelpduringtheconstructionoftheINAdataset and with the audio deduplication tool. They also thank Aude Formagne regarding the legal chal- lenges of publishing the pretrained models. ThisresearchhasbeenfundedbytheFrenchNa- tionalResearchAgency(ANR),...

work page 2022

[8] [8]

Bibliographical References Martine Adda-Decker and Lori Lamel. 2005. Do speech recognizers prefer female speakers? In Interspeech 2005, pages 2205–2208. Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, and Dirk Hovy. 2024. Twists, humps, and pebbles: Multilingual speech recognition mod- els exhibit gender performance gaps. InPro- ceedings of the 2024 C...

work page 2005

[9] [9]

In30th USENIX Security Sympo- sium (USENIX Security 21), pages 2633–2650

Extracting training data from large lan- guage models. In30th USENIX Security Sympo- sium (USENIX Security 21), pages 2633–2650. USENIX Association. Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xian...

work page 2022

[10] [10]

InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Ma- chine Learning Research, pages 10697–10707

Deduplicating training data mitigates pri- vacy risks in language models. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Ma- chine Learning Research, pages 10697–10707. PMLR. Biswajit Karan, Joshua Jansen van Vüren, Febe de Wet, and Thomas Niesler. 2024. A Transformer- Based Voice Activity Detector. InI...

work page 2024

[11] [11]

InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 28492–28518

Robust speech recognition via large-scale weak supervision. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 28492–28518. PMLR. Wei-Cheng Tseng, Wei-Tsung Kao, and Hung yi Lee. 2022. Membership Inference Attacks Against Self-supervised Speech Models. InIn- terspeech 2022, ...

work page 2022

[12] [12]

Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko,andYannickEstève.2022

On protecting the data privacy of large language models (llms): A survey. Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko,andYannickEstève.2022. AStudy of Gender Impact in Self-supervised Models for Speech-to-Text Systems. InInterspeech 2022, pages 1278–1282. Marcely Zanon Boito, Vivek Iyer, Nikolaos La- gos, Laurent Besacier, and Ioan Calapodescu

work page 2022

[13] [13]

InInterspeech 2024, pages 3939–3943

mHuBERT-147: A Compact Multilingual HuBERT Model. InInterspeech 2024, pages 3939–3943

work page 2024

[14] [14]

Language Resource References Christophe Benzitoun, Jeanne-Marie Debaisieux, and Henri-José Deulofeu. 2016. Le projet orféo: un corpus d’étude pour le français contemporain. Corpus, (15). Karim Boudahmane, Bianka Buschbeck, Eu- nah Cho, Josep Maria Crego, Markus Freitag, Thomas Lavergne, Hermann Ney, Jan Niehues, Stephan Peitz, Jean Senellart, Artem Sokolo...

work page 2016

[15] [15]

InInterspeech 2018, pages 1239–1243

Ava-speech: A densely labeled dataset of speech activity in movies. InInterspeech 2018, pages 1239–1243. David Doukhan, Christine Maertens, William Le Personnic, Ludovic Speroni, and Reda De- hak. 2024. InaGVAD : A challenging French TV and radio corpus annotated for speech ac- tivity detection and speaker gender segmenta- tion. InProceedings of the 2024 ...

work page 2018

[16] [16]

MLS: A Large-Scale Multilingual Dataset for Speech Research

The EPAC corpus: Manual and auto- matic annotations of conversational speech in French broadcast news. InProceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Associa- tion (ELRA). Sylvain Galliano, Guillaume Gravier, and Laura Chaubard. 2009. The ester 2 evaluation...

work page internal anchor Pith review arXiv 2009