MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

Adrien Bardes; Matthew J. Muckley; Michael Rabbat; Nicolas Ballas; Revant Teotia; Sumit Chopra

arxiv: 2606.25225 · v1 · pith:TKC4FNG4new · submitted 2026-06-23 · 💻 cs.CV · cs.LG

MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

Revant Teotia , Adrien Bardes , Michael Rabbat , Sumit Chopra , Matthew J. Muckley , Nicolas Ballas This is my paper

Pith reviewed 2026-06-25 23:55 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords audio-visual learningself-supervised learningjoint embedding predictive architectureunified encodercross-modal predictionAudioSetfrozen representations

0 comments

The pith

MJEPA uses one unified encoder and a single predictive objective applied within and across audio and visual modalities to learn joint representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MJEPA as a modality-agnostic joint-embedding predictive architecture that replaces separate encoders and mixed contrastive-reconstruction losses with one shared ViT encoder and one prediction task. Cross-modal prediction is shown to be essential: a shared encoder without it falls below unimodal baselines, while including it lets each modality improve from the other. The resulting frozen model exceeds the best prior frozen audio-visual baseline by more than 6.8 mAP on AudioSet-20K, beats fully fine-tuned models on ESC-50 and FSD50K, and remains competitive on video tasks despite training on roughly one-tenth the video data.

Core claim

MJEPA demonstrates that a single shared encoder trained with one predictive objective applied both intra-modally and cross-modally produces audio-visual representations whose quality depends on the cross-modal component; without cross-modal prediction the shared encoder underperforms separate unimodal encoders, while with it each modality benefits from the other, yielding the reported gains on frozen audio benchmarks and competitive video results with far less video data.

What carries the argument

MJEPA: a joint-embedding predictive architecture that applies a single predictive objective both within and across modalities using one unified encoder for audio and video.

If this is right

Removing cross-modal prediction from the shared encoder causes performance to drop below separate unimodal models.
Adding cross-modal prediction allows each modality's representation to improve by using information from the other.
The frozen ViT-g model improves AudioSet-20K mAP by more than 6.8 points over the best prior frozen baseline.
The same model surpasses fully fine-tuned baselines on ESC-50 and FSD50K.
Video benchmark results remain competitive while using approximately 10 times less video data than prior work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-objective design may extend to other paired modalities such as text and image without new loss terms.
Scalability could increase further if the same architecture is applied to larger unlabeled video corpora.
The necessity of cross-modal prediction suggests that predictive rather than contrastive objectives may be the more natural route for modality synergy.

Load-bearing premise

The claim that cross-modal prediction is required for a shared encoder to exceed unimodal baselines assumes the ablation comparisons were run under otherwise identical conditions and with the same unimodal baselines.

What would settle it

A controlled experiment in which a shared encoder trained without the cross-modal prediction term matches or exceeds the unimodal baselines under identical data, optimizer, and schedule settings.

Figures

Figures reproduced from arXiv: 2606.25225 by Adrien Bardes, Matthew J. Muckley, Michael Rabbat, Nicolas Ballas, Revant Teotia, Sumit Chopra.

**Figure 1.** Figure 1: Overview of MJEPA. (a) Prior AV-SSL methods typically combine modalityspecific encoders with a mixture of contrastive losses, reconstruction objectives, and heavy augmentations. (b) MJEPA adopts a deliberately simple design: a single shared encoder processes all modalities (audio, video, and audio-video) and is trained with only joint-embedding predictive (JEPA) objective. An intra-modal predictor learns … view at source ↗

**Figure 2.** Figure 2: MJEPA architecture. A single shared encoder Eθ is trained across multiple prediction tasks. (a) Intra-modal prediction: the encoder and a shared predictor Pϕ are applied independently to three input modes—audio (a→a), video (v→v), and audio-video (av→av)—each predicting masked token representations against targets from an EMA encoder Eθ¯. (b) Cross-modal prediction: shown for a→v: pooled last-layer encoder… view at source ↗

**Figure 3.** Figure 3: Progressive ablation of architecture and scaling on AS20K. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Detailed architecture of MJEPA, illustrated for the joint audio-video intramodal prediction (Lav→av). The audio-only (La→a) and video-only (Lv→v) cases follow the same pipeline with a single modality branch. Top-left: Masked audio and video inputs are tokenized by modality-specific projection layers (2D Conv for audio, 3D Conv for video), and augmented with modality-specific positional and modality embedd… view at source ↗

read the original abstract

Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives, limiting cross-modal synergy and scalability. Joint Embedding Predictive Architectures (JEPAs) offer a simple, modality-agnostic alternative, but have to date been applied primarily to individual modalities. We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities. Our approach uses only a single predictive objective, applied both within and across modalities. We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other. Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MJEPA shows a shared encoder plus single JEPA objective can deliver strong audio results with less video data, but the claim that cross-modal prediction is essential rests on an ablation whose controls are not described in the abstract.

read the letter

The new piece is applying JEPA to audio-visual data with one encoder and one predictive objective that runs both inside each stream and between them. The abstract reports that this setup lets a frozen ViT-g beat prior frozen baselines by 6.8 mAP on AudioSet-20K, exceed some fully finetuned models on ESC-50 and FSD50K, and stay competitive on video tasks despite 10x less video data.

Those numbers are the main empirical hook. If they hold with matched training, the approach is simpler than the modality-specific encoders and mixed losses in earlier audio-visual SSL work.

The soft spot is the supporting claim that cross-modal prediction is critical. The abstract states that without it a shared encoder falls below unimodal baselines, yet supplies no table or section showing which unimodal baselines were used, whether the intra-modal JEPA loss was kept in the ablation, or whether optimizer, masking, and augmentations stayed identical. The stress-test concern lands here: without those controls the observed drop could be an artifact of mismatched training rather than proof that cross-modal signals are required.

No equations or derivations appear circular. The method description is high-level but consistent with prior JEPA papers.

This paper is for people building scalable audio-visual representations who want to test whether a single predictive objective can replace contrastive or reconstruction mixtures. A reader focused on practical SSL gains would find the numbers worth checking, once the ablation details are verified.

It deserves peer review because the performance deltas are large enough to justify referee time even if revisions are needed on the controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces MJEPA, a joint-embedding predictive architecture for audio-visual self-supervised learning that employs a single unified ViT encoder and applies one predictive objective both within and across audio and visual modalities. It asserts that cross-modal prediction is essential, as a shared encoder without it falls below unimodal baselines, and reports that a frozen ViT-g model achieves over 6.8 mAP improvement on AudioSet-20K versus prior frozen baselines, exceeds fully finetuned models on ESC-50 and FSD50K, and remains competitive on video tasks while using 10x less video data.

Significance. If the ablation controls and performance numbers are substantiated with matched training conditions, the work would offer a notably simpler and more scalable alternative to existing audio-visual SSL approaches that rely on modality-specific encoders and mixed contrastive/reconstruction losses. The emphasis on a single predictive objective and the reported data efficiency could influence unified multimodal representation learning.

major comments (2)

[results section / ablation experiments] The central claim that cross-modal prediction is critical (abstract; results section) rests on an ablation showing that a shared encoder without cross-modal prediction degrades below unimodal baselines. No details are provided on (a) the specific unimodal baselines (e.g., Audio-MAE, Video-MAE, or the identical ViT-g trained unimodally), (b) whether the ablated shared encoder still received the full intra-modal JEPA loss on both streams, or (c) whether optimizer, augmentations, masking schedules, and data were held identical. This control information is load-bearing for interpreting the degradation as evidence for cross-modal necessity rather than a training mismatch.
[evaluation tables / experimental setup] The headline performance numbers (abstract; evaluation tables) are presented without accompanying error bars, dataset split details, or training hyperparameter parity statements for the reported comparisons. Given that the abstract already notes quantitative gains, the full experimental section must supply these to allow assessment of whether the 6.8 mAP gain and outperformance of finetuned models are robust.

minor comments (1)

[abstract] The abstract states the method uses 'only a single predictive objective' but does not preview the precise form of the predictor or masking strategy; a one-sentence clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying our ablation controls and experimental reporting. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [results section / ablation experiments] The central claim that cross-modal prediction is critical (abstract; results section) rests on an ablation showing that a shared encoder without cross-modal prediction degrades below unimodal baselines. No details are provided on (a) the specific unimodal baselines (e.g., Audio-MAE, Video-MAE, or the identical ViT-g trained unimodally), (b) whether the ablated shared encoder still received the full intra-modal JEPA loss on both streams, or (c) whether optimizer, augmentations, masking schedules, and data were held identical. This control information is load-bearing for interpreting the degradation as evidence for cross-modal necessity rather than a training mismatch.

Authors: We agree the ablation description requires more explicit controls to support the claim. The unimodal baselines are the identical ViT-g trained separately on audio-only and video-only streams using the same JEPA objective. The ablated shared-encoder condition (no cross-modal term) still applied the full intra-modal JEPA loss to both modalities. Optimizer, augmentations, masking schedules, and training data were identical across all conditions. In revision we will expand the ablation subsection with a dedicated paragraph and table listing these matched settings. revision: yes
Referee: [evaluation tables / experimental setup] The headline performance numbers (abstract; evaluation tables) are presented without accompanying error bars, dataset split details, or training hyperparameter parity statements for the reported comparisons. Given that the abstract already notes quantitative gains, the full experimental section must supply these to allow assessment of whether the 6.8 mAP gain and outperformance of finetuned models are robust.

Authors: We acknowledge that error bars, explicit split references, and hyperparameter parity statements improve interpretability. Standard dataset splits (AudioSet-20K balanced, ESC-50 5-fold, FSD50K official) are used and will be stated. Hyperparameters for comparisons follow the original papers where possible; we will add a short paragraph confirming parity. Because additional random seeds were not run for every baseline due to compute limits, we will report error bars only for the runs we performed and note this limitation; the 6.8 mAP figure will be presented with its observed variance. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents MJEPA as an empirical extension of prior JEPA work to audio-visual modalities using a unified encoder and single predictive objective. The abstract and provided text contain no equations, parameter fits, or derivations that reduce by construction to author-defined inputs (e.g., no self-definitional ratios, fitted inputs renamed as predictions, or ansatz smuggled via self-citation). The central claim that cross-modal prediction is critical is framed as an empirical ablation result rather than a mathematical necessity. No load-bearing self-citation chains, uniqueness theorems, or renamings of known results appear in the given content. The method is self-contained as an architectural proposal whose performance claims rest on external benchmarks rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5771 in / 1069 out tokens · 24547 ms · 2026-06-25T23:55:42.405720+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 3 canonical work pages

[1]

In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W

Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 24206– 24221. Curran Associates, Inc. (2021),...

2021
[2]

In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=odU59TxdiB

Alex, T., Atito, S., Mustafa, A., Awais, M., Jackson, P.J.B.: SSLAM: Enhancing self-supervised models with audio mixtures for polyphonic soundscapes. In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=odU59TxdiB

2025
[3]

In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

2017
[4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Araujo, E., Rouditchenko, A., Gong, Y., Bhati, S., Thomas, S., Kingsbury, B., Kar- linsky, L., Feris, R., Glass, J.R.: Cav-mae sync: Improving contrastive audio-visual mask autoencoders via fine-grained alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025
[5]

arXiv preprint arXiv:2506.09985 (2025)

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Robert Hogan, F., Dugas, D., Bojanowski, P., Khalidov, V., Labatut, P., Massa, F., Szafraniec, M., Krishnakumar, K., Li, Y., Ma, X., Chandar, S., Meier, F., LeCun, Y., Rabbat, M., Ballas, N.: ...

Pith/arXiv arXiv 2025
[6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predic- tive architecture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15619–15629 (2023)

2023
[7]

Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: data2vec: A gen- eral framework for self-supervised learning in speech, vision and language (2022), https://arxiv.org/abs/2202.03555

arXiv 2022
[8]

arXiv:2404.08471 (2024)

Bardes, A., Garrido, Q., Ponce, J., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471 (2024)

Pith/arXiv arXiv 2024
[9]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

Chen,S.,Wang,C.,Chen,Z.,Wu,Y.,Liu,S.,Chen,Z.,Li,J.,Kanda,N.,Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., Yu, X., Wei, F.: Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing16(6), 1505–1518 (Oct 2022).https://doi.org/10.1109/jstsp...

work page doi:10.1109/jstsp.2022.3188113 2022
[10]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Chen, S., Wu, Y., Wang, C., Liu, S., Tompkins, D., Chen, Z., Che, W., Yu, X., Wei, F.: BEATs: Audio pre-training with acoustic tokenizers. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceed- ings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 51...

2023
[11]

In: Interspeech 2024 (2024)

Dinkel, H., Yan, Z., Wang, Y., Zhang, J., Wang, Y., Wang, B.: Scaling up masked audio encoder learning for general audio classification. In: Interspeech 2024 (2024)

2024
[12]

Teotia et al

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: 16 R. Teotia et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021)

2021
[13]

Fei, Z., Fan, M., Huang, J.: A-jepa: Joint-embedding predictive architecture can listen (2024),https://arxiv.org/abs/2311.15830

arXiv 2024
[14]

IEEE/ACM Transactions on Audio, Speech, and Language Processing30, 829–852 (2022)

Fonseca, E., Favory, X., Pons, J., Font, F., Serra, X.: FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing30, 829–852 (2022)

2022
[15]

In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 776–780 (2017).https://doi.org/10.1109/ ICASSP.2017.7952261

arXiv 2017
[16]

In: ICCV (2023)

Georgescu, M.I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audio-Visual Masked Autoencoders. In: ICCV (2023)

2023
[17]

In: CVPR (2023)

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: CVPR (2023)

2023
[18]

In: Proc

Gong, Y., Chung, Y.A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021. pp. 571–575 (2021).https://doi.org/10.21437/Interspeech. 2021-698

work page doi:10.21437/interspeech 2021
[19]

In: The Eleventh Inter- national Conference on Learning Representations (2023),https://openreview

Gong, Y., Rouditchenko, A., Liu, A.H., Harwath, D., Karlinsky, L., Kuehne, H., Glass, J.R.: Contrastive audio-visual masked autoencoder. In: The Eleventh Inter- national Conference on Learning Representations (2023),https://openreview. net/forum?id=QPtMRyk5rb

2023
[20]

something something

Goyal, R., Kahou, S.E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The "something something" video database for learning and evaluating visual common sense (2017),https://arxiv.org/abs/1706.04261

Pith/arXiv arXiv 2017
[21]

In: Advances in Neural Information Processing Systems

Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., et al.: Bootstrap your own latent: A new approach to self-supervised learning. In: Advances in Neural Information Processing Systems. vol. 33, pp. 21271–21284 (2020)

2020
[22]

arXiv preprint arXiv:2111.06377 (2021)

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)

Pith/arXiv arXiv 2021
[23]

In: Thirty- seventh Conference on Neural Information Processing Systems (2023),https:// openreview.net/forum?id=OmTMaTbjac

Huang, P.Y., Sharma, V., Xu, H., Ryali, C., Fan, H., Li, Y., Li, S.W., Ghosh, G., Malik, J., Feichtenhofer, C.: MAVil: Masked audio-video learners. In: Thirty- seventh Conference on Neural Information Processing Systems (2023),https:// openreview.net/forum?id=OmTMaTbjac

2023
[24]

In: NeurIPS (2022)

Huang, P.Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feicht- enhofer, C.: Masked autoencoders that listen. In: NeurIPS (2022)

2022
[25]

In: Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett,J.,Berkenkamp,F.(eds.)Proceedingsofthe41stInternationalConference on Machine Learning

Huh, M., Cheung, B., Wang, T., Isola, P.: Position: The platonic representation hypothesis. In: Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett,J.,Berkenkamp,F.(eds.)Proceedingsofthe41stInternationalConference on Machine Learning. Proceedings of Machine Learning Research, vol. 235, pp. 20617–20642. PMLR (21–27 Jul 2024),https:/...

2024
[26]

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset (2017),https://arxiv.org/abs/1705.06950

Pith/arXiv arXiv 2017
[27]

In: Salakhutdinov, R., Kolter, Z., MJEPA 17 Heller, K., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F

Kim, J., Lee, H., Rho, K., Kim, J., Chung, J.S.: EquiAV: Leveraging equiv- ariance for audio-visual contrastive learning. In: Salakhutdinov, R., Kolter, Z., MJEPA 17 Heller, K., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) Proceed- ings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. ...

2024
[28]

2, 2022-06-27

LeCun, Y., et al.: A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review62(1), 1–62 (2022)

2022
[29]

Lee, J.J., Yoon, S.W.: Can one modality model synergize training of other modality models? In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=5BXWhVbHAK

2025
[30]

arXiv preprint arXiv:2403.19638 (2024)

Lin, Y.B., Bertasius, G.: Siamese vision transformers are scalable audio-visual learners. arXiv preprint arXiv:2403.19638 (2024)

arXiv 2024
[31]

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019),https: //arxiv.org/abs/1711.05101

Pith/arXiv arXiv 2019
[32]

In: ICCV (2019)

Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV (2019)

2019
[33]

Transactions on Ma- chineLearningResearch(2024),https://openreview.net/forum?id=a68SUt6zFt, featured Certification

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual feat...

2024
[34]

ESC: Dataset for Environmental Sound Classification

Piczak, K.J.: ESC: Dataset for Environmental Sound Classification. In: Proceed- ings of the 23rd Annual ACM Conference on Multimedia. pp. 1015–1018. ACM Press (2015).https://doi.org/10.1145/2733373.2806390,http://dl.acm.org/ citation.cfm?doid=2733373.2806390

work page doi:10.1145/2733373.2806390 2015
[35]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=FbY5Co2NWk

Rauch, L., Heinrich, R., Ghaffari, H., Miklautz, L., Moummad, I., Sick, B., Scholz, C.: Unmute the patch tokens: Rethinking probing in multi-label audio classifica- tion. In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=FbY5Co2NWk

2026
[36]

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge (2015),https://arxiv.org/abs/1409.0575

Pith/arXiv arXiv 2015
[37]

Sarkar, P., Etemad, A.: Xkd: Cross-modal knowledge distillation with domain alignment for video representation learning (2022)

2022
[38]

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: Dinov3 (2025),https://ar...

Pith/arXiv arXiv 2025
[39]

In: NeurIPS (2022)

Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In: NeurIPS (2022)

2022
[40]

Yang,X.,Yang,Y.,Jin,Z.,Cui,Z.,Wu,W.,Li,B.,Zhang,C.,Woodland,P.:Spear: A unified ssl framework for learning speech and audio representations (2026), https://arxiv.org/abs/2510.25955

Pith/arXiv arXiv 2026
[41]

In: International Conference on Learning Representations (2025) 18 R

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025) 18 R. Teotia et al

2025
[42]

In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y

Zhu, B., Lin, B., Ning, M., YAN, Y., Cui, J., HongFa, W., Pang, Y., Jiang, W., Zhang, J., Li, Z., Zhang, C., Li, Z., Liu, W., Li, Y.: Languagebind: Extend- ing video-language pretraining to n-modality by language-based semantic align- ment. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on Learning ...

2024

[1] [1]

In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W

Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 24206– 24221. Curran Associates, Inc. (2021),...

2021

[2] [2]

In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=odU59TxdiB

Alex, T., Atito, S., Mustafa, A., Awais, M., Jackson, P.J.B.: SSLAM: Enhancing self-supervised models with audio mixtures for polyphonic soundscapes. In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=odU59TxdiB

2025

[3] [3]

In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

2017

[4] [4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

Araujo, E., Rouditchenko, A., Gong, Y., Bhati, S., Thomas, S., Kingsbury, B., Kar- linsky, L., Feris, R., Glass, J.R.: Cav-mae sync: Improving contrastive audio-visual mask autoencoders via fine-grained alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

2025

[5] [5]

arXiv preprint arXiv:2506.09985 (2025)

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Robert Hogan, F., Dugas, D., Bojanowski, P., Khalidov, V., Labatut, P., Massa, F., Szafraniec, M., Krishnakumar, K., Li, Y., Ma, X., Chandar, S., Meier, F., LeCun, Y., Rabbat, M., Ballas, N.: ...

Pith/arXiv arXiv 2025

[6] [6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predic- tive architecture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15619–15629 (2023)

2023

[7] [7]

Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: data2vec: A gen- eral framework for self-supervised learning in speech, vision and language (2022), https://arxiv.org/abs/2202.03555

arXiv 2022

[8] [8]

arXiv:2404.08471 (2024)

Bardes, A., Garrido, Q., Ponce, J., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471 (2024)

Pith/arXiv arXiv 2024

[9] [9]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

Chen,S.,Wang,C.,Chen,Z.,Wu,Y.,Liu,S.,Chen,Z.,Li,J.,Kanda,N.,Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., Yu, X., Wei, F.: Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing16(6), 1505–1518 (Oct 2022).https://doi.org/10.1109/jstsp...

work page doi:10.1109/jstsp.2022.3188113 2022

[10] [10]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Chen, S., Wu, Y., Wang, C., Liu, S., Tompkins, D., Chen, Z., Che, W., Yu, X., Wei, F.: BEATs: Audio pre-training with acoustic tokenizers. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceed- ings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 51...

2023

[11] [11]

In: Interspeech 2024 (2024)

Dinkel, H., Yan, Z., Wang, Y., Zhang, J., Wang, Y., Wang, B.: Scaling up masked audio encoder learning for general audio classification. In: Interspeech 2024 (2024)

2024

[12] [12]

Teotia et al

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: 16 R. Teotia et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021)

2021

[13] [13]

Fei, Z., Fan, M., Huang, J.: A-jepa: Joint-embedding predictive architecture can listen (2024),https://arxiv.org/abs/2311.15830

arXiv 2024

[14] [14]

IEEE/ACM Transactions on Audio, Speech, and Language Processing30, 829–852 (2022)

Fonseca, E., Favory, X., Pons, J., Font, F., Serra, X.: FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing30, 829–852 (2022)

2022

[15] [15]

In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 776–780 (2017).https://doi.org/10.1109/ ICASSP.2017.7952261

arXiv 2017

[16] [16]

In: ICCV (2023)

Georgescu, M.I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audio-Visual Masked Autoencoders. In: ICCV (2023)

2023

[17] [17]

In: CVPR (2023)

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: CVPR (2023)

2023

[18] [18]

In: Proc

Gong, Y., Chung, Y.A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021. pp. 571–575 (2021).https://doi.org/10.21437/Interspeech. 2021-698

work page doi:10.21437/interspeech 2021

[19] [19]

In: The Eleventh Inter- national Conference on Learning Representations (2023),https://openreview

Gong, Y., Rouditchenko, A., Liu, A.H., Harwath, D., Karlinsky, L., Kuehne, H., Glass, J.R.: Contrastive audio-visual masked autoencoder. In: The Eleventh Inter- national Conference on Learning Representations (2023),https://openreview. net/forum?id=QPtMRyk5rb

2023

[20] [20]

something something

Goyal, R., Kahou, S.E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The "something something" video database for learning and evaluating visual common sense (2017),https://arxiv.org/abs/1706.04261

Pith/arXiv arXiv 2017

[21] [21]

In: Advances in Neural Information Processing Systems

Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., et al.: Bootstrap your own latent: A new approach to self-supervised learning. In: Advances in Neural Information Processing Systems. vol. 33, pp. 21271–21284 (2020)

2020

[22] [22]

arXiv preprint arXiv:2111.06377 (2021)

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)

Pith/arXiv arXiv 2021

[23] [23]

In: Thirty- seventh Conference on Neural Information Processing Systems (2023),https:// openreview.net/forum?id=OmTMaTbjac

Huang, P.Y., Sharma, V., Xu, H., Ryali, C., Fan, H., Li, Y., Li, S.W., Ghosh, G., Malik, J., Feichtenhofer, C.: MAVil: Masked audio-video learners. In: Thirty- seventh Conference on Neural Information Processing Systems (2023),https:// openreview.net/forum?id=OmTMaTbjac

2023

[24] [24]

In: NeurIPS (2022)

Huang, P.Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feicht- enhofer, C.: Masked autoencoders that listen. In: NeurIPS (2022)

2022

[25] [25]

In: Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett,J.,Berkenkamp,F.(eds.)Proceedingsofthe41stInternationalConference on Machine Learning

Huh, M., Cheung, B., Wang, T., Isola, P.: Position: The platonic representation hypothesis. In: Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett,J.,Berkenkamp,F.(eds.)Proceedingsofthe41stInternationalConference on Machine Learning. Proceedings of Machine Learning Research, vol. 235, pp. 20617–20642. PMLR (21–27 Jul 2024),https:/...

2024

[26] [26]

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset (2017),https://arxiv.org/abs/1705.06950

Pith/arXiv arXiv 2017

[27] [27]

In: Salakhutdinov, R., Kolter, Z., MJEPA 17 Heller, K., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F

Kim, J., Lee, H., Rho, K., Kim, J., Chung, J.S.: EquiAV: Leveraging equiv- ariance for audio-visual contrastive learning. In: Salakhutdinov, R., Kolter, Z., MJEPA 17 Heller, K., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) Proceed- ings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. ...

2024

[28] [28]

2, 2022-06-27

LeCun, Y., et al.: A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review62(1), 1–62 (2022)

2022

[29] [29]

Lee, J.J., Yoon, S.W.: Can one modality model synergize training of other modality models? In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=5BXWhVbHAK

2025

[30] [30]

arXiv preprint arXiv:2403.19638 (2024)

Lin, Y.B., Bertasius, G.: Siamese vision transformers are scalable audio-visual learners. arXiv preprint arXiv:2403.19638 (2024)

arXiv 2024

[31] [31]

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019),https: //arxiv.org/abs/1711.05101

Pith/arXiv arXiv 2019

[32] [32]

In: ICCV (2019)

Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV (2019)

2019

[33] [33]

Transactions on Ma- chineLearningResearch(2024),https://openreview.net/forum?id=a68SUt6zFt, featured Certification

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual feat...

2024

[34] [34]

ESC: Dataset for Environmental Sound Classification

Piczak, K.J.: ESC: Dataset for Environmental Sound Classification. In: Proceed- ings of the 23rd Annual ACM Conference on Multimedia. pp. 1015–1018. ACM Press (2015).https://doi.org/10.1145/2733373.2806390,http://dl.acm.org/ citation.cfm?doid=2733373.2806390

work page doi:10.1145/2733373.2806390 2015

[35] [35]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=FbY5Co2NWk

Rauch, L., Heinrich, R., Ghaffari, H., Miklautz, L., Moummad, I., Sick, B., Scholz, C.: Unmute the patch tokens: Rethinking probing in multi-label audio classifica- tion. In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=FbY5Co2NWk

2026

[36] [36]

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge (2015),https://arxiv.org/abs/1409.0575

Pith/arXiv arXiv 2015

[37] [37]

Sarkar, P., Etemad, A.: Xkd: Cross-modal knowledge distillation with domain alignment for video representation learning (2022)

2022

[38] [38]

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: Dinov3 (2025),https://ar...

Pith/arXiv arXiv 2025

[39] [39]

In: NeurIPS (2022)

Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In: NeurIPS (2022)

2022

[40] [40]

Yang,X.,Yang,Y.,Jin,Z.,Cui,Z.,Wu,W.,Li,B.,Zhang,C.,Woodland,P.:Spear: A unified ssl framework for learning speech and audio representations (2026), https://arxiv.org/abs/2510.25955

Pith/arXiv arXiv 2026

[41] [41]

In: International Conference on Learning Representations (2025) 18 R

Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025) 18 R. Teotia et al

2025

[42] [42]

In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y

Zhu, B., Lin, B., Ning, M., YAN, Y., Cui, J., HongFa, W., Pang, Y., Jiang, W., Zhang, J., Li, Z., Zhang, C., Li, Z., Liu, W., Li, Y.: Languagebind: Extend- ing video-language pretraining to n-modality by language-based semantic align- ment. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on Learning ...

2024